I wrote a blogpost about sitting in on an interview of Olivier Grisel (OG) by our own Florian Douetteau (FD) a couple weeks ago. This is the second part of that interview where they discuss tips and tricks for beginner data scientists.
Olivier Grisel (OG) works at Inria Parietal on developing scikit-learn, one of the most popular machine learning libraries for Python. He’s an expert on Machine Learning, Text Mining, and Natural Language Processing. A couple of weeks ago I got the chance to sit in on an interview of him by our very own Florian Douetteau (FD).
In the first blogpost I wrote on that conversation, they discussed scikit-learn, MLlib and the direction he thinks Big Data technologies are going. Olivier talked about how scikit-learn is evolving to compete with new libraries such as MLlib that were originally designed to work on distributed dataframes. He got into the comparative advantages of the two libraries and how scikit-learn is evolving to perform computations on data that can't be processed on one server.
Today Olivier takes it down a notch on the technical aspects and answers the questions we all have about getting started in Data Science.
Choose wisely to scale widely
FD: From the point of view of someone who is getting started in Machine Learning, wondering which framework to use and which algorithm will be able to scale up, what would you advise?
OG: A good rule is to look at the volume of data that you’ll eventually have to deal with, taking into account future growth of course. A serious machine these days has about a hundred gigabytes of RAM. Of course the original dataset can be much larger. Once you extract features and transform them into a numerical table, you can get a much smaller dataset; one that can be processed in-memory and that you can run predictive models on with scikit-learn. Wanting your system to be scalable doesn’t automatically mean you have to use MLlib.
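To make that rule of thumb concrete, here is a rough back-of-the-envelope sketch of ours (not something Olivier showed). It assumes a dense table of 64-bit floats and a made-up dataset of 50 million rows and 100 features; the function names and the 50% headroom are our own illustrative choices.

```python
GB = 10**9  # decimal gigabyte, good enough for a rough estimate


def dense_table_bytes(n_rows, n_features, bytes_per_value=8):
    """Approximate size of a dense numerical table (8 bytes = one float64)."""
    return n_rows * n_features * bytes_per_value


def fits_in_memory(n_rows, n_features, ram_bytes, headroom=0.5):
    """True if the table takes at most `headroom` of the available RAM.

    The headroom leaves space for the copies and temporaries that
    model fitting usually needs (an assumption, tune it to your case).
    """
    return dense_table_bytes(n_rows, n_features) <= headroom * ram_bytes


ram = 100 * GB  # "about a hundred gigabytes", as in the interview

# Hypothetical feature table: 50 million rows, 100 numerical features
size = dense_table_bytes(50_000_000, 100)
print(size / GB)                                # 40.0
print(fits_in_memory(50_000_000, 100, ram))     # True
```

If the estimate comes out well under the RAM of a single machine, an in-memory library is at least worth trying before reaching for a cluster.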
The challenges of big data infrastructures
FD: For people starting to look into how to manage larger volumes of data, what would be your advice on how to get a good measure of the challenges?
OG: It is important to start by building baseline models even before you start doing Machine Learning. You can calculate averages for example. When you do those basic calculations in Spark, you can look at the pipeline and the processing times and make sure, even before you add the complexity of machine learning, that you’re not setting up something that will not be useful.
I also advise people to select a sample of their data that fits in memory and do comparative Machine Learning analyses with all available algorithms, even ones that cannot always be deployed across a cluster. You have to keep in mind that MLlib limits its range of algorithms to the ones that are scalable. That means their library today is not as rich as, say, R or Python packages. It is always good to subsample and come back to the world of small data to be sure that you are following the right approach.
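As an illustration of what such a baseline can look like, here is a minimal sketch of ours that predicts the training average for every row and times the computation. It runs locally in plain Python rather than in Spark, and the numbers are toy values made up for the example.

```python
import time
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy, made-up target values purely for illustration
train_y = [10.0, 12.0, 14.0, 16.0]
test_y = [11.0, 15.0]

start = time.perf_counter()
average = mean(train_y)                 # the whole "model" is one number
predictions = [average] * len(test_y)   # predict the average everywhere
baseline_mae = mean_absolute_error(test_y, predictions)
elapsed = time.perf_counter() - start

print(f"baseline MAE: {baseline_mae} (computed in {elapsed:.6f}s)")
```

Any real model should beat this error, and the timing gives a first idea of what the pipeline costs before any learning happens. In scikit-learn, `sklearn.dummy.DummyRegressor` with `strategy="mean"` plays the same role.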
You should also do this so you can make sure that when you add more data to your sample you are actually improving the performance of your model. That is a good thing to check.
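One way to run that check is to train on growing slices of the data and watch the error on a fixed held-out set. The sketch below is ours and makes several assumptions: the data is synthetic (a noisy straight line), the model is a closed-form least-squares fit written by hand, and the sample sizes are arbitrary.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Synthetic data: y = 2x + 1 plus Gaussian noise (an assumption)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


train_x, train_y = make_data(2000)
test_x, test_y = make_data(500)   # held out, never used for fitting

sizes = [10, 100, 1000, 2000]
curve = []
for n in sizes:
    model = fit_line(train_x[:n], train_y[:n])
    curve.append((n, mse(model, test_x, test_y)))

for n, err in curve:
    print(n, round(err, 3))
```

If the held-out error stops improving as the training slice grows, more data alone is unlikely to help. scikit-learn provides a `learning_curve` helper in `sklearn.model_selection` for doing this with real estimators.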
It is also a good idea to try to replicate a working analysis pipeline from one language or framework (e.g. Python) to another (e.g. R or Scala). A tool like Data Science Studio makes it really easy to design two pipelines with different programming languages on the same data, run them in parallel and compare the results. Some operations might be more natural or more efficient to perform in some frameworks, and doing such pipeline translation exercises is a quick way to build up some practical intuition. Once your two pipelines yield consistent outputs, ask a more experienced colleague or expert friend for a quick code review. He or she might know ways to make your code more efficient or more concise and idiomatic.
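Checking that two pipelines "yield consistent outputs" usually means comparing numbers within a tolerance rather than for exact equality, since different languages rarely agree to the last decimal. A minimal sketch of ours, with made-up scores and a hypothetical helper name:

```python
import math


def outputs_match(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """True if both sequences have the same length and close values."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Hypothetical predictions exported from two implementations
python_scores = [0.12, 0.87, 0.4500001]
r_scores = [0.12, 0.87, 0.45]

print(outputs_match(python_scores, r_scores))         # True
print(outputs_match(python_scores, [0.12, 0.87, 0.5]))  # False
```

The tolerances are assumptions; pick them according to how much numerical drift is acceptable for your analysis.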
FD: Can you give us an example of a large dataset that you or your team have worked on with scikit-learn?
OG: It really depends on the models that we’re working on and testing. Some models will break down after tens of thousands of samples. People in my team work with datasets that are several terabytes in size. They do a lot of preprocessing and re-dimensioning on them though.
That is really something that has to be taken into account by less experienced users. If they have really large volumes of data to work on, they can choose to use Spark a little bit naively. In the end this approach can end up being really inefficient. They can use several hours of CPU to do things they could have done in 5 minutes on a laptop if they had given it more thought upstream. All these new technologies can be double-edged swords.
Python vs R
FD: Do you have an answer for people beginning in Data Science and wondering which language and framework to learn?
OG: I believe you have to choose according to your own affinities as well as whether you have an expert around. You learn a lot in Data Science by communicating. If you are going to meetups or signing up for Kaggle competitions with other people, it is really important to seize that opportunity to interact with experts. Exchanging tips and tricks with them is a great way to get started.
After that it is actually fairly easy to transfer concepts from one framework to another. If you have mastered one language, the methodology is the same. That is what is most important. This is also where a tool like Data Science Studio can be useful. You can use it to try out different languages and technologies in one environment. It's easier to translate an analysis from one language to another, like Python and R, when you can easily compare results.