The Value of Open Source to the Data Community

Data Basics, Dataiku Company Lynn Heidmann

There’s no question that open source technologies are state-of-the-art and a critical piece of an effective AI strategy. But how does open source get built, and more importantly, with a community of contributors, how do they ensure integrity in their software? Get the answers from The Banana Data Podcast featuring special guest Andreas Mueller, a core contributor of Scikit-learn.


Andreas Mueller is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python”, describing a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. He is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as Machine Learning Scientist at Amazon. You can find his full CV here. His mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science and democratize the access to high-quality machine learning algorithms.

Not a Podcast Person?

We've got you. Read the entire transcript of episode 14 here.

Triveni Gandhi: So today's episode is the first of a two-part series on open source development. I guess the question is, what really is open source, right? I think I know what it means, but it doesn't seem like there's a real clear definition out there.

Will Nowak: Yeah, so I'm excited for the conversation that we're going to have today, all about open source, open source software in particular. And so when I think about open source software, key distinction that I make in my mind is that it's software that anyone can view the underlying source code for, and even inspect it and modify it. So anybody can go in, can tweak code to their heart's desire. The one distinction I would make, is that open source does not necessarily imply free. So just because someone can manipulate the underlying source code, it doesn't actually mean that that software ultimately will be free to use. And also there's some tools that are free, but not open source. So it gets a little bit confusing.

Triveni: So maybe an example then of this is that Python and R as coding languages are open source, but a product like Dataiku is free but not open source.

Will: That's exactly right. Yeah. So languages like Python and R, which we talk a lot about, are indeed open source, and as we'll discuss very much today, there are libraries or packages that are included or can be written for these languages, like Scikit-learn. So Scikit-learn is a great example of an open source software tool.

Triveni: Okay. Yeah, I think I'm excited to talk to Andy today just because I want to understand what it's like to actually develop open source tools when there isn't some sort of company or business line driving it, right? Like, these are people contributing out of the good will of their heart.

Will: Yeah, it's super interesting to think about how open source projects are motivated, and then ultimately decisions do have to be made too, right? So if I login to Scikit-learn and I submit what's called a pull request, which is basically me making a change and hoping that that change is accepted, somehow, someone determines what is kind of production level Scikit-learn.

Will: And so just like any other organization where groups of individuals make a common decision, open source software consortiums have to do the same thing too, right? They have governance structures internally. They have people that somehow determine what gets included, what gets excluded. But again, open source is something where, at face value, anyone can see the code and modify the code.

Triveni: Well, welcome to the podcast, Andy. It's so exciting to have you here.

Andreas Mueller [Andy]: Thanks for having me.

Triveni: So tell us a little bit about yourself, you know, how did you come to be involved with Scikit-learn, your path to where you are now.

Andy: The project was started by a team in Paris. Most of them working at Inria and a lot of them are still active. So there they were a team of about three, four people, now they are about seven. So back then I started just contributing in my free time during my Ph.D. So that was about seven years ago when I started contributing. They asked me at some point to become release manager and so I started doing the work for curating the releases, making sure things get done. I was like one of the most active people on the project for a couple of years.

Andy: In the last year or so actually we've gotten a lot more full-time contributors and so structure changed again. And so I am now not doing that much coding anymore, but I'm leading a team of two that are working on Scikit-learn full-time. And there's several people now in Paris who can actually work full-time on Scikit-learn as well.

Triveni: So it's its own organization, almost in its own right now.

Andy: Well, there's several different organizations. So there's an open source project, for which we have a governance document, and so we are an informal organization in a sense, and we're a GitHub organization, whatever that means. But there's several entities where people group and work, and so one of them is mainly in Inria and the Scikit-learn Foundation. Then I'm at Columbia with two people working for me. Then there's more independent contributors. And then we even have someone else, Adrian, working on Scikit-learn full time as part of Anaconda.

Will: So you mentioned GitHub, so maybe taking it back, you just talk a little bit about contributing, like what does that actually mean. Maybe to start, how does one even literally contribute to an open source project like Scikit-learn?

Andy: GitHub is really the main platform for the project and it's the main way we communicate. And that's intentional because GitHub is a very open platform. All the repositories are public and everybody can see what's going on. And so we try to keep as much of the communication on that platform as possible so that our decisions are transparent and our development process is transparent.

Andy: If you want to contribute, the easiest way is usually either opening issues about things that don't work for you or where you feel like the documentation is unclear or wrong, or attacking an issue that someone else pointed out. Sometimes we have people coming in saying, "Oh, there's this algorithm I really like but it's not included. Can I help include this algorithm?" But that's really actually a lot of work. And even for people that are deeply involved in the project, including a new algorithm usually takes several months.

Andy: So I myself, I started fixing typos in the documentation-

Will: [crosstalk 00:05:41] viable stuff.

Andy: ... and doing some formatting and so on. So I always recommend for anyone that's new, just start simple, get into the process like the GitHub workflow and like engaging with the contributors, then go from there.

Triveni: Well, so on that note then, can you tell us about the governance structure that you use at SK Learn, you know, to decide what algorithms are going to make it in, what issues are going to get prioritized and when a release is going to get scheduled, those kinds of things.

Andy: So interestingly for the longest time Scikit-learn didn't have a governance structure. So we're now coming up on the 10-year anniversary of the first release I think. And we've had a governance document for about a year now. So it was one of the things I really pushed for.

Andy: Maybe let me describe a little bit how it was before and how it is now. And so before, basically, we were governed by consensus of everybody that wanted to be involved. It was mostly the people with commit rights. So there's one concept that we had the whole time, which is what we called core developer, which is basically the people with commit rights, and they were nominated by the community. And then the community agreed. I don't think there ever was a "no" vote, as in, "Don't make this person a core contributor."

Andy: So someone started contributing and then at some point there was an acknowledgement, "Oh, we should give this person commit rights because they showed zeal and we want to encourage them to commit more and take on more responsibility." And this role has been the same the whole time. But now we have formalized more how we actually make decisions; before, it was just consensus-based.

Andy: But the one issue that I saw with this in particular was that there's no way to resolve conflict. There's a long writeup of how Node.js ran into this problem where they didn't have a conflict resolution mechanism. And so basically our current governance model is: have consensus among the core developers, and if there can't be consensus, then we revert to voting, and if there's a tie in the vote, then there's a technical committee, which has seven people that are also voted in and serve until they resign basically, but they're mostly there for tie-breaking.

Andy: So we have sort of an inner group in the new governance structure, but their role is mostly tie-breaking, and they commit to steering the project, but they don't really have more power than the core developers.
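The escalation path Andy describes (consensus first, then a vote, then the technical committee on a tie) can be sketched in a few lines of Python. This is only an illustration of the order of the steps as stated in the conversation, not the project's actual governance rules: the function name `decide`, the "yes"/"no" vote encoding, the treatment of unanimity as consensus, and the use of a simple majority are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def decide(core_votes: Sequence[str],
           tie_breaker: Callable[[], str]) -> Tuple[str, str]:
    """Return (outcome, mechanism) for one proposal.

    core_votes  -- one "yes" or "no" per core developer (hypothetical encoding)
    tie_breaker -- stands in for the technical committee; returns
                   "accepted" or "rejected" and is only called on a tie
    """
    if not core_votes:
        raise ValueError("at least one vote is required")
    if any(v not in ("yes", "no") for v in core_votes):
        raise ValueError('votes must be "yes" or "no"')

    tally = Counter(core_votes)
    yes, no = tally["yes"], tally["no"]

    # Step 1: consensus among the core developers (modelled here as unanimity).
    if no == 0:
        return "accepted", "consensus"
    if yes == 0:
        return "rejected", "consensus"

    # Step 2: no consensus, so fall back to a vote
    # (simple majority is an assumption of this sketch).
    if yes > no:
        return "accepted", "vote"
    if no > yes:
        return "rejected", "vote"

    # Step 3: a tied vote goes to the technical committee.
    return tie_breaker(), "technical committee"


if __name__ == "__main__":
    print(decide(["yes", "yes", "yes"], lambda: "rejected"))
    print(decide(["yes", "yes", "no"], lambda: "rejected"))
    print(decide(["yes", "no"], lambda: "rejected"))
```

The point of the sketch is that the committee callback is reached only in the last branch, which matches the description of its role as mostly tie-breaking.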

Triveni: What do you mean commit to steering the project? So is that like they are deciding the next phase or the roadmap for Scikit-learn?

Andy: They say they will help build the roadmap, like there is no really formal decision power. So we are doing the roadmap in like pull requests and we are discussing ... all of the committers and contributors can contribute to this. Even people that are not contributors can just go on the issue tracker and say "Hey why don't you add this to the roadmap," or, "I don't think this is useful," and we will take their comments into consideration. It's more like similar to my role as the release manager before, it's more like you need to have dedicated people to do certain things otherwise they don't get done.

Andy: And so in particular if it's volunteer-driven like it is or mostly volunteer driven, then you need to make sure the responsibility is somewhere. And basically the technical committee says we are responsible for driving this project. And it's not like their voices count more, but they say, "We are actually going to do it."

Will: So in terms of that idea, volunteers and how you encourage contributions, I'd love to dig a bit more into that. So how do you support people who are contributing to Scikit-learn, right? And I guess explicitly we can maybe talk a bit about the finances of it all, right? So if I'm spending a lot of time, which does take a lot of time to build such a robust framework, how does that happen? Is it just great people spending their nights and weekends doing it or some people paid by the foundation or some companies supporting this?

Andy: So for most of Scikit-learn's lifetime it was mostly volunteer contribution, mostly by PhD students and postdocs. So it was very academic. Now, as I briefly mentioned earlier, in the last maybe two years or so, this transitioned more to a full-time model. And so right now, a lot of the core developers are actually being paid to work on the project; that wasn't really the case before.

Andy: And so there's three different kinds of funding models basically that we are using right now. So I'm working part of my time on Scikit-learn, though it is mostly governance these days and not so much coding. And I have two people, Nicolai and Thomas Fan, who work on Scikit-learn full-time, mostly coding and also doing maintenance and code reviews and all the other things they could do as a contributor.

Andy: And we are funded by basically government grants and also other philanthropic organizations. So that's like the academic model of writing a grant and then someone gives you money. So the Sloan Foundation was very generous and gave me quite a bit of money to get me started. We now have money from the NSF, from DARPA, and we just very recently got money from Chan Zuckerberg. Chan Zuckerberg in the summer started this really great program on core open source software with the focus on, I think, biomedical research. These are the sources where I draw my funding from.

Andy: In Paris they used to have similar but slightly different model where basically they do neuroscience research and so they had part of the time that they contribute to Scikit-learn as like sub-items on the neuroscience research grants. That's a very common way to fund open source. The new model is with this foundation, so it's Scikit-learn Foundation at Inria. This is paid for by companies who can have sponsorships, so we have different levels of like gold, silver, bronze.

Andy: These are company sponsorships where a company basically buys into this and gets in return a seat at the table to decide priorities. So there's the Scikit-learn Consortium, which is the collection of the companies giving money. They can basically meet with the Foundation and give their priorities. It's not a direct influence, but it's more that they can make recommendations or ask for things and give feedback on our work.

Andy: There is a third model where someone is directly employed by a company. So we have Adrian who is directly employed by Anaconda and works on Scikit-learn. So, this is a different model again for funding open source in that a company directly pays someone, that's actually somewhat rare, but Anaconda is one of the companies that does this quite regularly for a bunch of projects like Dask.

Triveni: What does the company get out of that?

Andy: So they are a big player or maybe the biggest player in the scientific Python ecosystem, right? And so they build platform services and they do consulting around the scikit ecosystem. And so strengthening the scikit ecosystem means their product will have more demand. And on the other hand, having people directly embedded in the projects also makes them more knowledgeable about the project. So if someone needs help with Scikit-learn, if you have a Scikit-learn core developer on staff, that's clearly very helpful for them. And finally they also have some influence, or at least they have a pulse on what is happening in the project.

Triveni: Okay. So, when you talk about the sponsorship level for enterprises, do you then see a lot of folks coming in because they're using it in production and they need something fixed right away, you know, it's hot bug fix or I need this special request right away. Does that happen a lot or is it more about, "Okay we've taken your input, please wait, we'll decide what's right."

Andy: So the model right now is really much more a long-term engagement. In particular, the Scikit-learn Foundation does not actually have a way to put something into Scikit-learn. So it's still the community that decides in the end what goes into Scikit-learn and what doesn't. And the release cycle usually is about half a year. So even if you do a hot fix, most companies want to use release versions and so they would have to wait half a year for the hot fix.

Andy: We sometimes do bug fix releases, but that's only for very severe bugs or if there's a couple of bugs. But I don't think we ever really had something happening where someone came to us with, "Here's this bug, fix it." There's some cases, but it's never been a company. There have been individual people who were like, "Hey, can you fix this bug quickly?" To which our response usually is, "When we get to it. You can fix the bug and we'll review it."

Triveni: Yeah, so not a hot fix, but maybe like a tepid fix, lukewarm.

Will: More broadly, how would you articulate the core value of open source software, right? So I could just start a company that says, "I'm going to make software that builds machine learning algorithms." In fact, companies do this, so why are you in it and why should more broadly people care and feel passionately about supporting open source software?

Andy: My personal motivation is really that I feel creating the software has a really, really big impact in that it helps a lot of people deploy better models, both in science and in industry. I could potentially have a similar impact with commercial software, but in the outcome, there's two things that are different. One is that this is much more an equalizer in that everybody has access to the software.

Andy: If you look at things like MATLAB or Mathematica, which are still used in some places, if people go to a university in maybe a country that doesn't have as much funding as the U.S. has, they might not have access to these tools. Or if you have someone sitting at home trying to do their analysis on their own or trying to do some citizen science or building their company, they might not have access to these tools.

Andy: On the other hand, I think people in general are now a bit more wary about including commercial software anywhere. And I feel like generally commercial software products are probably sort of dying/dead and most software that you can buy is services, right? I think even Windows is now under the Azure division in Microsoft. So Windows is now a tool to sell cloud computing, right? Windows is not a product on its own anymore, it's a tool to sell a service. And so I think there's not actually that much space for commercial development of software, and that's just so. So, that was more on the impact part.

Andy: There's obviously also a lot of things on how does it actually work. And so having an open source project gives you a lot of freedom in what you actually want to work on because no one is boss. And what people found is that this actually leads to higher quality software.

Andy: In some cases, it means software develops very quickly, in other cases, depending on the ... or maybe let's say the pace of the software changes naturally with the maturity of the software and not with sort of the deadlines. So Scikit-learn now moves quite slowly because it is quite mature and is quite widely used. And so we're not going to make any abrupt changes and we know we shouldn't make any abrupt changes.

Andy: If you look at like TensorFlow, which is maybe similarly widely used now, they actually made an executive decision to change everything and to make everything incompatible. I'm not saying it was a bad move, but it's definitely like a very different move and it's more of a top-down move than a bottom-up move.

Andy: And so having those more developer-driven and community-driven development I think is something that I find really engaging in open source.

Triveni: I think it's great. It's actually democratizing machine learning and the people who are developing it are the ones who are also using it. I think that has a real value there. So, how do you guys then ensure sort of the integrity of the product, right? And not to say that you as developers or even open source contributors are not good, but when you think about the differences in coding styles and practices and even methodology for like a given machine learning algorithm, how do you guys make a decision on what is considered good practice and correct for this huge project?

Andy: So in terms of procedure, everything requires code review by two core developers; that is sort of our minimum quality standard. In terms of what we include, we have pretty strict standards on how widely used an algorithm is and how mature the algorithm is. In terms of actually ensuring the correctness of an implementation, machine learning is quite hard. And so I don't think we have a solution to that. There is this saying from open source that many eyes make every bug shallow. This is what often leads open source projects to have actually better code quality, because so many people can look at the code, but it doesn't mean that there's no bugs.

Andy: And if you look at the issue tracker and you look at the bug flag, you'll see there's many of them, but there's probably still less than in any commercial product or at least they're known and you can see that they are there.

Triveni: Do you see a tension between the two kinds of users, the corporate enterprise people who are sponsoring SK Learn and the individual contributors or maybe you know the researchers, university folks. Are you building SK Learn to try and please everyone? Are you just building it for yourself? How do you, how do you manage all of these competing sort of people in the room?

Andy: I think it's quite interesting because there's been a transition between these. So, I think all good open source projects start as being built for yourself. Like if you don't have a need that you're filling with the software, you're probably not going to write good software. But now myself, I mostly don't use it that much unless I use it for teaching. And I'm mostly now developing. And I think the same is true for many people in the project, that now we're more developing it also for other people.

Andy: And initially it was clearly with a scientific community focus, but now we broadened more to industry use as well. But I think to answer more your question, now we try to please everybody in that we definitely want to make sure that the original community of scientific users are still served by also better serving industry. Though there's like a couple of things basically in how we scope, where we want to go, that limit both of these.

Andy: So there's a very interesting point, which I think someone from the Matplotlib project brought up recently, that a project can either be an application or a library. An application tries to provide things that are easy to directly use, whereas a library tries to give you a collection of functionality, and Scikit-learn decided to be more of a library, which makes it a little bit tricky to use in a couple of cases. If you compare it to R, R is much more of an application in that it tries to directly interface with the users.

Andy: And so there's a thing that's slightly missing maybe: something that builds on Scikit-learn but that is more directly for data scientists to use, more of an application that does easy things in a nice way, simply, whereas Scikit-learn tries to be really strict, so it's really clear what is happening and you have to be really explicit about every single step, which is good for a production environment but not so good for interactively hacking around.
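To make the "explicit about every single step" point concrete, here is a minimal sketch of a typical scikit-learn workflow. It is not taken from the episode: the synthetic dataset, the choice of a scaler plus logistic regression, and the step names "scale" and "clf" are assumptions picked for illustration. Every stage (splitting the data, preprocessing, the model itself) is spelled out by the user rather than inferred by the library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: the train/test split is requested explicitly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 2: preprocessing and model are named and ordered by the user.
model = Pipeline(steps=[
    ("scale", StandardScaler()),                 # standardize each feature
    ("clf", LogisticRegression(max_iter=1000)),  # then fit a linear classifier
])

# Step 3: fitting and evaluation are separate, explicit calls.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Nothing here is chosen automatically, which is the trade-off described above: the behavior is easy to audit in production, but it takes more typing than an application-style tool would when exploring interactively.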

Will: So sticking on the thread of RStudio, I'd love to hear your thoughts more broadly. You mentioned this cool distinction between a library versus an application. So, you're developing a library, an open source toolkit. What are your thoughts on the ideal way for for-profit companies to come in and basically provide, I'll call it managed open source services on top of what you're doing, which is, again, fully open source?

Andy: So I'm totally happy for anyone to do that, right? That is the point of open source. It would be great if they would give something back to us, but they're not, they don't have to. That is the point of open is they can take it and do their own thing and if they can build a great product then everybody's richer. On the other hand, I'm not sure I believe in that happening right now.

Andy: It's just maybe a little bit tricky given that we're here at Dataiku, but what I heard from many companies is that they find it hard to find real value add and figure out where exactly they can place their product. Because there's a bunch of companies that over time tried to serve Scikit-learn in the cloud or something, and most of the bigger cloud providers have some form of integration. You can ship a Scikit-learn model on Amazon SageMaker, so it is at a point where it's sort of slightly too easy for it to make a good product to bridge the gap, but it's also still too hard for the user. So it would be nice for it to be easier, but I don't think a lot of people are willing to lock into a platform to help them ease that, because it's not actually that hard.

Will: To this point about making ML more accessible, whether it be understanding the coefficients on a model or something more broad, can you maybe describe your work with DABL a little bit?

Andy: DABL is a relatively new toy project of mine, stands for Data Analysis Baseline Library and as the name suggests, it helps you to dabble with data science problems. I've been interested in automatic machine learning for quite a while and so automatic machine learning is about finding the right model for a problem and tuning that model, tuning hyper-parameters and so on.

Andy: However, if you look at data science workflow, this is only a very small part of the data science problem. What's interesting about it is that it's very easy to automate in a sense compared to the other problems.
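As a rough illustration of why model selection and hyper-parameter tuning are comparatively easy to automate, the sketch below uses scikit-learn's `GridSearchCV` to search over two candidate models and a few settings for each. This is a generic scikit-learn example, not DABL's implementation; the synthetic data, the two candidate models, and the parameter values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real supervised learning task (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The "clf" step is a placeholder that the search swaps out per candidate.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each dict is one model family with its own hyper-parameter grid.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [DecisionTreeClassifier(random_state=0)],
     "clf__max_depth": [2, 4, 8]},
]

# 3-fold cross-validation scores all six candidates and keeps the best one.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

The search loop is mechanical once the candidates are listed, which is the sense in which this step automates well; cleaning the data and deciding what to collect are not covered by anything in this snippet.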

Andy: With DABL I'm trying to look at what are the different pieces that work that you need to do to get a minimum data science workflow going for a supervised learning task. And so it starts with like data cleaning, data visualization, building initial models, searching more complex models and then trying to evaluate and analyze your model. And these are sort of the basic steps that want to make sure [inaudible 00:23:53] is as easy as possible. In many applications, just getting an initial solution is something that's very valuable and you don't really necessarily need to tweak out the last percent.

Andy: If you look at things like Kaggle competitions, it's really about having a fixed data set and wanting to get the best possible model. That's not really how data science works in the real world, where usually you get new data and you can change what data you collect. And so I want to make sure that each step is as lightweight as possible so you can easily iterate between the steps. And so there's a bunch of tools in DABL for doing preprocessing, for doing data visualization, for doing some automated machine learning that is quite simple, but probably good enough. Then some automated ways to do model debugging plots and so on.

Triveni: So it sounds like it's like you said, the baseline and the idea is not to have it be a push the magic button and get some output, but rather a starting point for someone looking to build out the flow. That's awesome. Is that also open source? Can people contribute to it?

Andy: Oh yeah, of course. It's completely open source, and right now it's mostly developed by me, with a little bit from some people on my team. It definitely needs a lot more development, but I think it's already quite useful. Like I added a couple of new plots. If you get a new data set, the first thing you should really do is to look at it, and plotting for supervised learning is not that great in Python. And so it really gives you all of the plots that you want to see very immediately, and it figures out which are the interesting plots and all these things, so you can very quickly look at the data set to maybe see data quality issues and so on.

Triveni: Well, thank you for joining us today for this conversation. It's been great. Before we go, one question for you: if there was anything in the world, not software related, that you could have as open source, but isn't currently open source, what would that be?

Andy: I think I would probably pick the U.S. Constitution.

Triveni: Okay.

Will: Mic drop.

Triveni: Oh yeah, mic drop.

Andy: It's really the operating system on which the country runs, but it hasn't been updated in a really, really long time. Maybe it could use some patches.

Triveni: I do think you'll have trouble with consensus on that one though.

Andy: Probably yes.

Will: And by the way, Dataiku is a silver sponsor of the Scikit-learn Consortium discussed in today's episode. We'll be hosting the Scikit-learn Sprint in our Paris office in January.

Triveni: So for today's Banana Fact, keeping in the spirit of the month of December, I have some information about holiday lights, right? So we all put up these lights on our house, on our trees, on our whatever during the holiday season. And actually those lights, if you're not keeping them year to year, get recycled. And so every year over 20 million pounds of discarded lights go to China, which is considered the Christmas light recycling capital of the world.

Triveni: And so these Christmas lights are basically pulverized. They get separated into brass, copper, plastic, and then get turned into a bunch of new stuff, you know, like slippers and different kinds of gadgets.

Will: That's surprising and disheartening. People keep your Christmas lights, recycle them, use them year to year. Don't throw them away.

Triveni: But if they're broken, Will.

Will: Oh right. That's all we've got for today in the world of Banana Data. We'll be back with another podcast in two weeks. But in the meantime, subscribe to the Banana Data Newsletter to read these articles and more like them. We've got links for all the articles we discussed today in the show notes. All right, well, it's been a pleasure Triveni.

Triveni: It's been great Will. See you next time.
