While many open source offerings fall outside the domain of data science, everyone knows that there is an open source tool (or twenty) for every data project. That's why we commissioned 451 Research to explore the top routes to open source data science adoption.
To supplement 451 Research's analytical approach, we sat down with Dataiku data scientist Patrick Masi-Phelps to get a first-hand account of the impact of open source on the big data landscape:
Claire Carroll: So let's start from the beginning: What's open source?
Patrick Masi-Phelps: Open source means your technology, your software, or your programming language has been made open to the public.
CC: What value does open source provide?
PMP: I think open source is great for the general public because it means that an individual's or company's contributions to a certain field get released to the public, who can then use them for an analogous use case or a completely separate one. You hear in the news all the time about how NASA developed some cool radar technology to track asteroids' orbits, which medical researchers then used to improve X-ray scanners. That's the benefit to the community.
I think there's also a benefit to the company or individual who released the open source technology, because oftentimes when you release software or a new programming language, the public will contribute their own improvements back to it. It's a two-way street: that's why Microsoft has made a lot of their technology open source, because they feel it's advantageous to their own business to receive feedback from the community.
In the data science world, all of the modern tools are open source, for the most part: Python, R, SQL, and all the different Spark flavors, like SparkR, PySpark, and Spark SQL. All those languages are open source, and I think that's made doing data science accessible to a lot of different companies.
CC: What open source software do you use on a regular basis as a data scientist?
PMP: I use all of the standard data science programming languages, which are all open source, and a lot of Python in particular. Different companies have contributed packages in Python that I use regularly. Scikit-learn is open source, and that's the big one for machine learning. I also use a lot of the open source data preparation packages, especially in R: ggplot2 is pretty great, dplyr is great.
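For readers who haven't used scikit-learn, here is a minimal sketch of the kind of machine learning workflow Patrick is describing. The data set and model choice are purely illustrative, not from the interview:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in data set and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a classifier and evaluate its accuracy on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

The same fit/score pattern applies across scikit-learn's estimators, which is a big part of why it became the go-to library he mentions.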
A specific open source thing that I’m working on right now is in the astronomy world. Google and UT Austin did this partnership a little while ago where they discovered some exoplanets using some open source libraries like TensorFlow and scikit-learn. And they released all their code to the public. So I was able to just clone their GitHub repository, fire it up on my own machine, and start tweaking some things to try and improve their models. It’s a very specific open source use case that is near and dear to me right now.
CC: What does Dataiku contribute to open source? How are we involved with the terrain?
PMP: One of the core principles of Dataiku is incorporating all of these open source languages into one central place, one central workspace, where you and your team can analyze your data or build projects and models. We felt from the get-go that it was important to play nicely with R programmers and Python programmers and folks who know all the different Spark languages. So, if you're familiar with the Dataiku flow, I'm sure you know by now that you can include R code as well as Python code, all these different open source languages, within the same pipeline. In terms of contributing to the open source space, we have a fairly active community of users who contribute their custom plugins and other feature add-ons to something we call the plugin store.
CC: I wanted to ask you a little about the businesses and clients that you work with on a regular basis, no names of course, but what do you see them using most? Are there any trends that you’re seeing?
PMP: It depends on the client. Some of the clients with very large data sets, stuff like transaction information or point stock information, have to use the different Spark libraries all the time. And for them, I think doing machine learning on big data is really difficult, and I haven't personally seen a ton of companies do it successfully at this point. So those folks dealing with giant data sets right now are mostly just using the open source libraries for data ingestion, data cleaning, and pipelining. For other clients, I think it varies.
CC: Last question! Do you have any insights with regard to what’s on the horizon with open source?
PMP: Different open source libraries for building neural networks. My understanding is that the first popular one was TensorFlow, which is currently used by a lot of different businesses. I use it right now in some projects, but I've been hearing more buzz about PyTorch, which is another open source library for neural networks. So, keep an eye out for PyTorch.
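To make the PyTorch mention concrete, here is a minimal, illustrative sketch (not from the interview) of defining a tiny neural network and running a forward pass. The layer sizes are arbitrary:

```python
import torch
from torch import nn

# A tiny feed-forward network: 4 inputs -> 8 hidden units -> 1 output.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(3, 4)  # a batch of 3 example rows with 4 features each
y = model(x)           # forward pass
print(y.shape)         # torch.Size([3, 1])
```

TensorFlow's Keras API expresses the same idea with a similar layer-stacking style; much of the "buzz" around PyTorch has centered on its define-by-run model, which makes networks easy to debug with ordinary Python tools.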