I've been lucky enough to travel to Santa Clara, California, and attend the Strata Conf event.
For those of you who don't know, this is one of the main events of the year in the Data world, bringing together many kinds of folks, from vendors to practitioners, from Big Data engineers to Data Science hackers.
What struck me most at first was the size of the event and the overall quality of the organization (even for "simple" things like the registration process). This is something I have rarely seen here in France, and I guess it is a good indication of how mature the US market is on these topics. The tagline of the event was "Making Data Work", which reflects the content of the different sessions quite well. They are split into several tracks (Data Science, Design, Hadoop...) and are a good blend of vendor talks, high-level thinking, and hands-on tutorials and demonstrations.
The first day, Tuesday, was dedicated to hands-on tutorials, with half-day or full-day sessions. You bring your own laptop and develop your skills on real-world data projects.
I attended the Kaggle session, "Just the basics, core data science skills", presented by Kagglers William Cukierski and Ben Hamner.
Their introduction to what Data Science (or part of it, more on this later...) is in practice was really great and full of pragmatic observations:
the lack of actual time spent working with data - not just talking about it
the relationship with "Big Data Barry" - the typical (?) IT guy with 22 years of experience who just wants more processes, processes on processes, and standards
the data scientist as "being worse at statistics than any statistician and worse at software engineering than any software engineer"...
Before experimenting on a real-world problem (a spam detector), they also used concrete examples to explain most of the steps involved in a modeling project, from data preparation to simple visualization for the analysis ("sell the message, not the graphic").
My only concern, in my very own opinion, is the overuse of the term "Data Science". To me, Kaggle contests are more prediction contests than pure data science projects, which may involve several other steps (from building the data collection pipeline to ultimately creating a data product - for instance, serving recommendations on a website). Even if it is very important to build highly accurate models, it can also be important to generate insights (i.e. to avoid "black box" predictions where it is hard to say which features influence the target variable, or how) and to take into account portability to production systems (see, for instance, why the Netflix Prize-winning algorithm was not used in production).
My second session was "Python for Data Analysis", by Wes McKinney, the author of the pandas Python library.
Pandas is a very well-designed API for data collection, munging, aggregation, and even some statistical analysis. Even though he initially wrote it for the financial sector, it is very well suited to any kind of business dealing with structured data. It has a wide collection of built-in functions and methods for dealing with data (including plotting utilities). Its base data structures, the Series and the DataFrame, are powerful and efficient.
Wes showed how to do most of the data munging tasks: data import, dealing with data types, adding/dropping columns, aggregations and transformations (the very cool GroupBy method, which can be used for really fancy processing).
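To give a feel for these tasks, here is a minimal sketch of my own (not code from the tutorial). The tiny sales dataset and its column names are invented for illustration; only the pandas calls themselves are real.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
raw = io.StringIO(
    "date,store,units,price\n"
    "2013-02-26,A,3,2.50\n"
    "2013-02-26,B,5,2.00\n"
    "2013-02-27,A,4,2.50\n"
    "2013-02-27,B,1,2.00\n"
)

# Data import, parsing the date column on the way in.
df = pd.read_csv(raw, parse_dates=["date"])

# Dealing with data types: store the unit counts as floats.
df["units"] = df["units"].astype(float)

# Adding a derived column, then dropping one that is no longer needed.
df["revenue"] = df["units"] * df["price"]
df = df.drop(columns=["price"])

# Aggregation: total revenue per store.
revenue_by_store = df.groupby("store")["revenue"].sum()

# Transformation: each row's share of its own store's total revenue.
df["share"] = df["revenue"] / df.groupby("store")["revenue"].transform("sum")

print(revenue_by_store)
print(df)
```

The difference between the last two steps is the interesting part: the aggregation returns one value per group, while transform returns a result aligned with the original rows, which is what makes per-group normalization a one-liner.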
Coupling Pandas with scikit-learn, for instance, provides a very nice stack for data analysis projects.
As a final note, in Wes's own words, the purpose of Pandas is not specifically to deal with very large datasets, so features like parallel processing are not supported (yet?). Pandas remains very powerful, and can be a tool of choice for a lot of projects (if you use Python).
Stay tuned for the second part of this post, which will cover the two other days of Strata!
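As a rough illustration of that stack (again my own sketch, not something shown in the session), a DataFrame can be fed directly to a scikit-learn estimator. The toy spam-like features, their names, and the choice of logistic regression are all assumptions made for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, invented for illustration:
# two numeric features per message and a binary label.
df = pd.DataFrame({
    "n_links": [0, 1, 7, 9, 0, 8, 1, 10],
    "n_exclamations": [0, 1, 5, 6, 1, 7, 0, 9],
    "is_spam": [0, 0, 1, 1, 0, 1, 0, 1],
})

# pandas handles the data preparation: select features and target.
X = df[["n_links", "n_exclamations"]]
y = df["is_spam"]

# scikit-learn handles the modeling: fit a simple classifier.
model = LogisticRegression()
model.fit(X, y)

# Predictions go straight back into the DataFrame for inspection.
df["predicted"] = model.predict(X)
print(df)
```

Scoring on the training data, as done here, says nothing about real accuracy; a real project would hold out a test set.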