The previous post on my trip to Strata describes my first day there. You may want to read it here.
The next two days were focused on keynotes and presentations, as well as exhibitors' product demos.
I spent most of the day at vendors' booths for product demos. It was interesting to see the approaches of different types of companies, ranging from very large vendors to new start-ups:
The big players are competing hard on Hadoop distributions (Cloudera, MapR, Hortonworks, and now Intel...) or specialized analytical databases (Greenplum, Teradata - Aster Data, HP Vertica, SAP Hana...)
Newcomers (or relatively new ones...) focus on more specific problems. For instance, Revelytix offers dataset management, where one can model data inside a Hadoop cluster, store metadata, audit and track changes that happen to a dataset, or easily connect to end tools like R; Qubole operates mainly on AWS for now and aims at modeling data within Hadoop, with ways to move data to HDFS / Hive, write queries, see results on samples, and schedule jobs; Platfora developed a very nice HTML5 interface to provide visualization capabilities on top of Hadoop, using a middle-tier component called Fractal Cache to speed up access to precomputed data; Continuuity offers an SDK with which developers can build and deploy their Big Data applications; Skytree and 0xdata focus on machine learning (in two very different ways: a high-performance single-node library for the former and a distributed open source approach for the latter)...
As a takeaway, I would say that:
Except in certain specific cases, Hadoop is really mainstream
For a lot of software aimed at the end user, "relying on Hadoop" in fact means "relying on Hive"
There is a clear push to make working with Big Data easier. Hive is one example, but more generally there are a lot of moves to make Hadoop more user-friendly by bringing SQL-like functionality on top of it.
I got back to keynotes and presentations. The morning keynotes were impressive, with the room packed with people (hundreds, thousands?). These sessions were more high-level since they were short, but they provided very good insights into the challenges of working with data.
The rest of the day was dedicated to more hands-on sessions:
Real-time stream processing and visualization using Kafka, Storm, d3.js: LivePerson teamed up with ZoomData to show the software stack (see the title of the presentation...) they used to process and visualize a large flow of structured and unstructured data
High-volume data collection and real-time analytics using Redis: two software engineers from Carnegie Mellon University detailed how they built sensors for collecting environmental data, and the challenges they faced in collecting and processing large amounts of data
Real-world machine learning on Big Data: by Alexander Gray from Skytree, who explained the different machine learning tasks (classification, regression, clustering...) and the pros and cons of several ML methods (and the need to trade off accuracy against speed, simplicity, and interpretability)
LinkedIn endorsements by Sam Shah and Peter Skomoroch: one of the best presentations I've seen (see it on Slideshare), data science for real (engineering, machine learning, product development...)
Building recommendation platforms with Hadoop from Cloudera: a very nice presentation too, showing how one can build a recommender using the entire Hadoop stack, from data collection to machine learning using Mahout (with real code samples)
I really enjoyed the Strata Conference, especially the fact that there were plenty of concrete experiences and demonstrations of great technologies. On that note, I am excited to work on the Dataiku products, where we are pushing further the ability to work easily with large datasets. The Dataiku Innovation Platform will remove a lot of the technical and methodological hassle related to data projects, and allow our clients to focus on getting value out of their data and building new services on top of it.
If you want to hear and see what we do with big data and data science at Dataiku, and how we are helping our customers innovate with their data, come visit us at our booth at the Big Data Paris event on April 3rd and 4th!