A couple of weeks ago, I got the chance to sit in on an interview of Olivier Grisel (OG) by our own Florian Douetteau (FD). Olivier is one of the main contributors to the scikit-learn machine learning library, so the two of them discussed what Olivier is working on and how other technologies are evolving. This is part one of that interview.
Olivier Grisel and Scikit-Learn
FD: Olivier, you have been a major contributor to scikit-learn for some time now. Can you tell us a little about how you contribute?
OG: I have been working on the scikit-learn project since around 2010. I started doing it in my free time. In October 2013, I joined Inria, the French Institute for Research in Computer Science and Automation. I am part of a team called Parietal that works on brain modeling with MRI data. Within that project, I am more specifically in charge of getting scikit-learn to evolve in the long term, in terms of performance as well as scalability.
FD: Scikit-learn has evolved a lot over the past few years and has gained a lot of traction. Can you tell us a little bit about what is to come?
OG: Absolutely. We have seen an important evolution over the past few years, with a lot of new users and contributors. According to our site’s stats, we have between 150,000 and 160,000 unique visitors per month, one-third of which are returning visitors. We also have a growing number of contributors. These days, for example, we have almost 300 pull requests that we have to process.
The user community itself contributes most of the new developments in scikit-learn. Based on what they are working on, users constantly develop modifications and additions to the library and submit them for future versions. We then test these modifications and add the changes to each new version. For example, for the release of the last beta, one of our contributors developed an LDA estimator. It's an algorithm that is a bit of an alternative to the NMF we already have in scikit-learn, but more scalable.
What I am working on is a more long-term project that revolves around issues of volume (so it is not a part of the next release). We are trying to make sure that more and more scikit-learn algorithms can manage data in streaming mode, or out of core, rather than by loading the whole dataset in memory. We want them to load the dataset incrementally, as they train the model.
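The incremental loading described here can be illustrated with the partial_fit method that some scikit-learn estimators expose. This is only a minimal sketch: the iter_minibatches generator is a hypothetical stand-in for reading chunks from disk or a stream, and the synthetic data and batch sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_minibatches(n_batches, batch_size, seed=0):
    """Hypothetical data stream: yields one (X, y) chunk at a time.

    In a real out-of-core setting each chunk would be read from disk
    or a network source instead of being generated.
    """
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.randn(batch_size, 2)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every class must be declared up front

# Only one chunk is held in memory at any time while the model is trained.
for X_batch, y_batch in iter_minibatches(n_batches=50, batch_size=200):
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a fresh chunk drawn with a different seed.
X_test, y_test = next(iter_minibatches(n_batches=1, batch_size=1000, seed=1))
accuracy = clf.score(X_test, y_test)
```

Only estimators that implement partial_fit can be trained this way, which is why batch-only algorithms need refactoring before they can work out of core.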
Scikit-Learn vs. MLlib
FD: We are hearing a lot about Spark at the moment in the machine learning world. Have you had the opportunity to try it out? How does it compare to scikit-learn?
OG: I have had the opportunity to have a look at Spark by doing a couple of tutorials (here and here). The main difference between Spark and Python or scikit-learn is that Spark is, by default, a distributed system built to manage data that cannot be processed in memory, and it also deals with persistence. This has oriented the design of MLlib (ed: the distributed machine learning framework on top of Spark) from the start. They chose to only implement algorithms that scale both in the amount of data they can process and in the number of workers in the cluster. They have dealt with this double scalability by only choosing algorithms that have this property.
Scikit-learn was originally designed to process data in-memory so we aren’t biased in that way. We have some very useful algorithms that only work on small datasets. But it is true that we have lots of algorithms that were implemented in batch mode. I am working on refactoring them for more scalability.
Scikit-learn was not built to function across clusters. We do not want to change everything to process resources stored across a cluster. We do want to keep it as a possibility though, and make sure that some scikit-learn models can be embedded in a framework like Spark so they can be distributed across a cluster. For example, when you look at training a random forest, you can very easily train each tree individually, if you assume that your data is small enough to be replicated across the cluster. For medium-sized datasets, we also want to speed up the search for hyperparameters and cross-validation, which are naturally parallelizable.
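To make the random forest remark concrete, here is a rough sketch of training several small forests independently on the same replicated data and then pooling their trees. The train_forest and combine_forests helpers are hypothetical names, not part of scikit-learn, and the pooling relies on the fitted estimators_ attribute rather than on any official merging API. In a cluster setting each train_forest call would run on a different worker; here they simply run in a loop.

```python
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, assumed to fit on every worker.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def train_forest(seed, n_trees=10):
    """Hypothetical unit of work for one worker holding a copy of the data."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    return forest.fit(X, y)


def combine_forests(forests):
    """Pool the trees of independently trained forests into one model."""
    combined = deepcopy(forests[0])
    for forest in forests[1:]:
        combined.estimators_.extend(forest.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined


# Different seeds give different bootstrap samples and feature draws.
forests = [train_forest(seed) for seed in range(4)]
combined = combine_forests(forests)
train_accuracy = combined.score(X, y)
```

This works because the trees of a random forest do not depend on one another, so the only coordination needed is collecting them at the end.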
I am also interested in approaches that focus on efficient out-of-core processing first (like Dato is doing) before tackling cluster distributed computing (as Spark is focusing on). I haven't really looked into the details yet, but it seems that if you can better process out-of-core and focus on algorithm efficiency, you should be able to waste fewer resources. This could become a driver in the future development of scikit-learn.
FD: Can large volumes of data that are stored in a distributed manner lead to biases in performance and results? I am thinking for example of calculating random forests with Spark.
OG: The MLlib random forest algorithm is parallelized directly at the level of the training of each tree, when it is choosing a feature to split on. It does not take into account all of the possible splits. It builds a histogram and calculates in parallel on partitions of the dataset. Then it uses this summarized information to build its split. That makes it an approximate algorithm. In practice, this is rarely an issue when you are working on a sample to build a model. The result is close enough. The implementation itself may not be as efficient though.
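The histogram idea can be sketched in a few lines of NumPy. This is a simplified illustration and not MLlib's actual implementation: the fixed equal-width bin edges, the Gini criterion, the single feature, and the function names are all assumptions. Each partition summarizes its data as per-bin class counts, the counts are added together, and the split is chosen only among the bin edges.

```python
import numpy as np


def partition_histogram(x, y, bin_edges, n_classes):
    """Count the samples of one partition per (bin, class)."""
    # Bin i holds values with bin_edges[i - 1] <= x < bin_edges[i].
    bins = np.searchsorted(bin_edges, x, side="right")
    hist = np.zeros((len(bin_edges) + 1, n_classes), dtype=np.int64)
    np.add.at(hist, (bins, y), 1)
    return hist


def gini(counts):
    """Gini impurity of a vector of class counts."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    return 1.0 - float(np.sum(p ** 2))


def best_split(hist, bin_edges):
    """Pick the bin edge with the lowest weighted Gini impurity."""
    total = hist.sum()
    all_counts = hist.sum(axis=0)
    left = np.zeros(hist.shape[1], dtype=np.int64)
    best_edge, best_score = None, np.inf
    for i, edge in enumerate(bin_edges):
        left = left + hist[i]          # samples with x < edge
        right = all_counts - left      # samples with x >= edge
        score = (left.sum() * gini(left) + right.sum() * gini(right)) / total
        if score < best_score:
            best_edge, best_score = edge, score
    return best_edge, best_score


rng = np.random.RandomState(0)
bin_edges = np.linspace(0.1, 0.9, 9)
x = rng.uniform(0.0, 1.0, size=1000)
y = (x >= bin_edges[4]).astype(int)  # true threshold sits on a bin edge

# Each partition builds its own histogram; only the small summaries are merged.
partitions = np.array_split(np.arange(len(x)), 4)
merged = sum(partition_histogram(x[idx], y[idx], bin_edges, 2) for idx in partitions)
split_edge, split_score = best_split(merged, bin_edges)
```

The approximation comes from the binning: a threshold that falls between two bin edges can only be matched by the nearest edge, which is usually close enough in practice.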
Is Feature Generation the Future?
FD: When you look at a data project, a lot of the time, if not most of the time, is spent on preparing data and generating features. Scikit-learn has moved in the direction of feature engineering over the past few months. Is that a direction that you'll be maintaining? Will you be working towards an integrated pipeline? It seems like a little bit of an endless road. Are some parallel projects going to specialize in specific data types and formats, while still respecting the scikit-learn conventions and philosophy?
OG: Features are always a critical point when creating a scikit-learn predictive model. Since the last release with pandas dataframes, we are getting better at integrating the toolbox to manipulate data from any format and changing it to any other format or any other representation.
I agree with you that feature engineering will always be specific to a certain application. We want to remain a generic library. If we were going to specialize and develop features in specific domains, it would be as a part of separate specific libraries. For example, in astrophysics there is a dedicated library called AstroML. My team at Inria works on neuroimaging data. We have developed a specific library called nilearn as a side project of scikit-learn. It is better to separate the scope of the different projects. It allows better communication around a community's specific practices.
FD: On that subject of feature engineering, do you believe Spark and MLlib change the way data scientists work?
OG: The recent data frame API is one of the advantages of Spark. It gives data scientists a very intuitive, flexible, and expressive tool for testing different representations of their data.
On a higher level, the latest spark.ml package allows the creation of pipelines and predictive models in a "chain" that treats the assembly of the data into features as part of the model. It is possible to cross-validate the interaction of parameters in different steps of the chain. It is a strength of this type of API that it makes this easy to test. This is also possible in scikit-learn by plugging in transformers that take data frames as input and adding custom scikit-learn transformation scripts. Making that process easier is the kind of practice we should work on.
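On the scikit-learn side, cross-validating parameters that belong to different steps of a chain can be sketched with Pipeline and GridSearchCV. The step names, the synthetic dataset, and the parameter values below are assumptions made for illustration, and the imports assume a recent scikit-learn release where GridSearchCV lives in sklearn.model_selection.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Feature preparation and the model are chained into a single estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(random_state=0)),
    ("classify", LogisticRegression(max_iter=1000)),
])

# "<step>__<parameter>" addresses a parameter of a given step, so the
# search covers combinations of settings from two different steps.
param_grid = {
    "reduce__n_components": [2, 5, 10],
    "classify__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

Because the whole chain is refit inside each cross-validation fold, the preprocessing steps never see the held-out data, which is what makes the comparison between parameter combinations fair.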
Look Out for These Projects
FD: Thank you very much for this great talk! Is there anything you would like to add?
OG: I think the Python ecosystem is more and more aware of the current state of technologies, especially when it comes to dealing with large volumes. Java and Scala are ahead of us, most notably with Hadoop and Spark. Developers are very aware of that and they are working on answers. There are a lot of interesting projects today, such as Blaze, Dask, or XRay. They are developing APIs that are just as expressive as pandas and that can do out-of-core calculations and eventually distributed ones. Wes McKinney’s Ibis project for Cloudera is also very interesting. It is in Python but uses Impala as a back-end, making it an alternative to PySpark. I do not believe you can use it in production today but there will be interesting developments on that subject.
You can check out part two of Olivier's interview, where he gives advice to people who are starting out in data science and deciding which technologies to set up.