Last week I had the chance to attend and speak at this year's Berlin Buzzwords, a conference dedicated to search, scale and storage.
The conference featured many international speakers and several prominent figures of the Big Data world, such as Shay Banon of Elasticsearch, Uwe Schindler of Apache Lucene and Ted Dunning of MapR Technologies.
Oh yeah, it also featured yours truly, talking about what Dataiku's been developing to simplify data innovation:
- Flow, a next-gen data pipeline orchestrator (beta coming soon!)
- dctc, our Swiss Army knife for remote file manipulation (beta available now)
But more on that in my next post. First, here are some highlights and evolving trends since last year's Buzzwords conference.
Search is growing
First of all, search keeps growing in importance, and accounted for almost half of the conference. The focus of these search talks also changed: last year, they were very technology-oriented, with some great presentations about Lucene's internals. This year's focus was on usage, with several talks about real-life implementations of search technologies (at SoundCloud, for example).
Implementing your own search engine is no longer reserved for Fortune 100 companies. Even small businesses, online ones especially, see the huge advantages that a good search engine can bring.
And on the usage topic, the conference's verdict is clear: 100% of the speakers used Elasticsearch. Many of them actually did not use it for the "Elastic" part, but for all the other goodness that Elasticsearch packs. Never underestimate the competitive advantage of a product that lets you get started and see the results you actually expect within a few minutes.
We saw the same thing a few years ago with MongoDB: despite some very real technical shortcomings compared to some other systems (like concurrency, replication and robustness), MongoDB has taken a huge lead in the NoSQL space thanks in no small part to how great the initial experience is.
It's also interesting to note that several people I met mentioned Algolia, whom we profiled in this blog post, when discussing search options. Nice job, guys!
Near real-time exploration over real-time analysis
Another noticeable evolution from last year's conference was the almost complete disappearance of "real-time" talks. Last year, there had been several talks about trying to mix Hadoop and analysis of real-time data, with the general conclusion that it was not trivial, quite hacky, and maybe not really needed all that much. IMHO, few business cases truly require analysing data at the very moment it's produced, when hourly processing is easily attainable with standard Hadoop-based solutions.
What IS needed, however, is real-time exploration, drill-down and reporting, and we did see some nice stuff on that. SQL on Hadoop has taken a step forward in the last year, with a noticeable focus on low latency compared to what is currently achievable in Hive.
I was pleasantly surprised by the Apache Drill talk by Michael Hausenblas and Ted Dunning. I had been closely following the initial development of Drill, which had been a bit rocky due to the extremely open nature of the project. But they've made some fantastic progress in the last few months. Drill's flexible architecture has huge promise: you can basically plug in custom languages, operators and storage backends. I can't wait to try it out on some of the problems we're working on, like dealing with prices: how do you replace missing values with the average of the category, not of the entire column? Go with SQL, build something custom, or integrate something like Drill directly into our platform?
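To make the price problem concrete, here is a minimal sketch of the "build something custom" option in plain Python. It is only an illustration, not how we do it in our platform or how Drill would do it: the function name, the `category` and `price` field names and the sample rows are all made up for the example.

```python
from collections import defaultdict


def fill_missing_with_category_mean(rows, category_key="category", value_key="price"):
    """Return copies of rows where a missing value (None) is replaced
    by the mean of the known values in the same category.

    Hypothetical helper: the field names are assumptions for this example.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)

    # First pass: accumulate sum and count of known values per category.
    for row in rows:
        value = row[value_key]
        if value is not None:
            totals[row[category_key]] += value
            counts[row[category_key]] += 1

    # Second pass: fill the gaps, leaving a row untouched when its
    # category has no known value to average.
    filled = []
    for row in rows:
        new_row = dict(row)
        category = new_row[category_key]
        if new_row[value_key] is None and counts[category] > 0:
            new_row[value_key] = totals[category] / counts[category]
        filled.append(new_row)
    return filled


# Made-up sample data.
products = [
    {"category": "books", "price": 10.0},
    {"category": "books", "price": None},
    {"category": "books", "price": 20.0},
    {"category": "games", "price": 50.0},
    {"category": "games", "price": None},
    {"category": "toys", "price": None},
]

filled = fill_missing_with_category_mean(products)
for row in filled:
    print(row["category"], row["price"])
```

Two passes over the data are needed because the category averages must be known before any gap can be filled. That is exactly the kind of thing that is easy in a few lines of code on a small file and more awkward to express efficiently at Hadoop scale.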
Last but not least, Hive itself intends to stay in the running for performance, and has presented some plans (Stinger) for improved latency by moving away from MapReduce in some cases in favor of Tez, a more flexible Hadoop-based computation framework that could be roughly described as Map-Reduce-Reduce-Reduce-…
So it really was a buzz being in Berlin, meeting so many cool people working on cool stuff. And speaking of cool stuff: the more we talk to people here, the more convinced we are that our approach of making Data Innovation easier and more productive fills a very real gap in the current offerings.
Next up: a bit about my talk on our own technology, Dataiku Flow and dctc.