Scikit-learn is one of the most popular data science tools in the world, and we wanted to learn more about how it's made. We spoke with Olivier Grisel, a full-time open source developer and one of the core contributors behind scikit-learn. We've talked broadly about open source already, but now we're honing into the specific development and collaboration behind scikit-learn.
Claire Carroll: Can you tell me a little about what scikit-learn does and who the users are?
Olivier Grisel: The goal of scikit-learn is to be a library that provides easy to use implementations of standard machine learning algorithms used for data science and predictive modeling. We’re not trying to advance the state of the art of machine learning research. Instead, we try to make it easy to transfer the result of that research into a usable and deployable tool. The main users of scikit-learn are data scientists. They could be researchers who need to analyze data either for scientific applications or for business applications. The value of scikit-learn comes from its very homogeneous API; it makes it easy and fast to try very different kinds of models on the same data, and then see which kinds of models work best and subsequently invest more time on data preparation and parameter tuning for those models.
I believe that scikit-learn is very efficient for the first stage of this workflow. The only competitor that I see is the R programming language: it’s also very easy to try many methods within the R ecosystem.
Scikit-learn is not necessarily always the best option because sometimes you might want to implement your own customized version of a standard algorithm to be able to finely tailor the model toward a very specific problem. But at the same time, many people still use scikit-learn because it works and that's enough for their use-case. They don't need to re-implement everything.
From the traffic that we measure on the online documentation, we estimate that there are approximately 600,000 monthly scikit-learn users. I believe Jupyter developers estimate that their own user base is in the order of a couple of millions. So we estimate that there are approximately three million data science people on Earth and approximately one-quarter of them use scikit-learn on a regular basis.
CC: I want to talk a little bit more about the development process of scikit-learn. Maybe this is misleading, but looking at the GitHub contributions, it seems like development sort of dropped off after 2016. It sounds like you are still involved, but is it still a very active contributor network, or what are the goals with development now?
OG: I think the statistics of the contribution is a bit weird because we used to check in more automatically generated code in the git repository. So the volume of modified files has dropped a bit because of that, but they are not the primary source files, so I think this is an artificial change. Personally, I have been recently working more with upstream projects; for instance, I have been working a bit on the Python standard library to tackle problems that I identified when working in scikit-learn. It's often better to fix them upstream.
Several members of my team at Inria had the opportunity to work more with upstream projects. For instance, we have been working to optimize the memory usage when serializing large NumPy arrays. We invested time on this issue because it's important for distributed computing and therefore scikit-learn.
I also do more mentoring than I used to do in the past. I can often spend more time pair programming with the other developers either at Inria or Columbia University rather than coding myself. I still try to code because it's fun, but this is what happens over time. I should do more code reviews also because there are many more contributors working.
CC: It looks like right now, scikit-learn is at version 0.20. Do you see it getting to a 1.0, and if so, what would that look like? What's the next step?
OG: We've had this discussion several times in the past. It's a bit of a psychological barrier :). But keep in mind that it's not very important for most users. What we wanted to do before issuing the 1.0 is finalizing the public API to be able to set it in stone. There are a couple of use cases that our current API cannot address very well. We want to address those specific points before deciding to release 1.0.
But, for the majority of our users, things won’t change much. We regularly add new preprocessing tools, new estimators or improve the usability and computational or statistical performance of the existing ones. But the scikit-learn API is already very stable at this point. So I would not attach too much importance to the symbolic value of the 1.0 version number. Maybe from a marketing point of view, it would be a good idea to release it.
CC: Right. So how many core developers are there total? Because according to GitHub anyway, you are the second top developer. How many would you say are the core team?
OG: I think there are between 10 and 20 people who are actually reviewing on regular basis. I think there is only one core developer, Joel Nothman from the University of Sydney, who is reviewing close to 99% of all the incoming pull-requests. To merge a pull-request, it needs to receive at least two positive reviews by core developers. We need more core developers to review pull-requests on a regular basis.
At this time, there are approximately 40 developers with merge rights, but I think many of them are semi-inactive because they have moved on, they work on something else and have no time to work on scikit-learn anymore.
One of the goals of the Inria Foundation is to get financial support to hire more people to do this work in the long term. Right now, the people who are involved in the long-term maintenance of the project are mostly of two kinds: the majority are working for academic institutions. And there are also a couple of independent developers/data scientists working as contractors and they decide to allocate some significant amount of their own time to work on Open Source.
We noticed that we get comparatively much less long-term involvement from employees working at big tech companies. One goal of the Inria Foundation is to make it possible for big companies that use scikit-learn internally to financially support the project if they cannot directly allocate engineer time to help with the maintenance and code review efforts.
CC: So I guess a related question would be, you have code review, but how do you see people being able to contribute to scikit-learn without actually, if they're not writing code themselves? It sounds like you're doing mentoring and code review and things like that.
OG: There are other kinds of contributions. For instance, there are people teaching students in universities or in a professional setting and therefore helping grow the user base. Organizing events, from small meetups to larger conferences, is also a very good way to support the community. And also if you use scikit-learn and you find a bug, please open an issue that explains how to reproduce the issue from a simple code snippet. That’s very helpful. Sometimes you might spend time understanding what is the cause of the problem even if you don't know how to fix to the issue. In this case, writing a good bug report with a minimal reproduction case takes a significant work but actually helps us write a fix much faster. Minimal reproduction cases are a very appreciated form of contribution.
scikit-learn meetups are held all around the world.
We've got one more feature coming up soon with Olivier, but if you want to learn more about the open source environment in the meantime, check out the 451 Research White Paper on the landscape and variety of tools.