The open source software movement didn’t start with data science solutions, but it sure has carried forward many of the biggest advancements in data science in the past decade or so. Starting with Hadoop’s release in 2006 and continuing through the development of Pig, Hive, and Spark, open source tools have paved the way for data scientists to achieve more and more impressive feats with bigger and more complex data.
Many of these solutions were born in large technology companies whose businesses were pushing up against the technological limits of their time -- Hadoop grew out of research published by Google, while Pig came from Yahoo and Hive from Facebook -- and were then open sourced so that a much broader community could not only benefit from them but also build better and more robust versions of them. Open sourcing made sense for these companies, which had more pressing challenges for their engineering teams to work on, and it also proved to be extremely valuable to the data science community at large.
Pig was made to work with Hadoop.
There’s no question that open source technologies in data science are state-of-the-art and that organizations that adopt them signal that they’re dynamic and future-minded. But open source often lacks a layer of user friendliness, which can limit its adoption to only the most technical members of an organization. In this post, we’ll describe the benefits and drawbacks of open source data science software, and we’ll finish by explaining how democratizing these solutions is at the heart of Dataiku.
The Technical Advantage
We often say that the bleeding edge of data science algorithms and architecture is only about six months ahead of what is available as open source, whether it is released directly by the companies that built it, developed independently, or reverse engineered. If you look at a tool like TensorFlow, which is a library for building and training neural networks, there is simply nothing like it on the market right now. It was open sourced by Google (specifically by their Google Brain project) in late 2015, which means that anyone using it is working with the same tooling that Google uses for its own neural network development.
If you contrast this with proprietary solutions, the quick access to advanced technologies is a clear advantage.
Attracting and Keeping Talent with Technology Honey
Finding, hiring, and keeping data scientists (i.e., people with a background in machine learning) is hard. The fact that they are very much in demand has two impactful results for employers: they cost a lot to hire, and they have a lot of options. One sort of “technological honey” you can use to attract the best talent is open source solutions, which allow users to hone their skills on tools that will become more widespread in coming years.
It’s important to remember that this talent is not uniform. Some machine learning experts code in Python, others in R. If your organization uses open source tools, that difference might not matter to you, because you can hire from either group. If you use a proprietary solution, on the other hand, you can only recruit people who know (or are willing to learn) the solution you use.
Hive can be a good source of honey.
The open source community is currently so dynamic that team members will always feel that they're growing and learning, and that's a huge element for retention. When you're hiring, adopting open source tells prospective employees that you are committed to being part of not just the present but the future. You're offering them the chance to grow their skills with the technologies that are likely to be the most widespread in the years to come.
Building an Open and Dynamic Culture
Recruiting and retaining talent is only the first part of the equation -- then, you need to build an open and dynamic culture where people work together with passion and purpose. When you look at how productive and dynamic open source communities like the Apache Software Foundation are, you can see that their culture is contagious: it combines collaboration, creativity, and an incentive to contribute real, meaningful work to the broader project. That spirit is probably the most important thing that can be transferred from the open source world to a single organization.
It may go without saying, but having a team that keeps up with open source developments also tends to mean that they'll be more likely to find innovative solutions to the problems you're solving. Everyone’s trying to break down silos within organizations, so it makes sense that being plugged into the exciting open source data science world would provide encouragement and inspiration for your team.
Now, the Downside… and a Solution!
It's important to remember, though, that keeping up with that rapid pace of change is difficult for enterprise-sized organizations. These latest innovations are usually highly technical, so without some sort of packaging or abstraction layers that make the innovations more accessible, it's very difficult to keep everybody in the organization on board and working together. You might technically adopt the open source tool, but only a small number of people will be able to work with it.
Proprietary solutions have the advantage of being usable right out of the box. And if you’re looking for a suite of data science tools, a proprietary solution lets you start analyzing data pretty much from day one. With open source tools, you need to assemble a lot of the parts by hand, so to speak, and as anyone who’s ever done a DIY project can attest, that’s often much easier in theory than in practice.
One of the main reasons we built Dataiku is to allow organizations to benefit from the innovations and dynamism of open source tools while also providing an out-of-the-box platform where everything -- the people, the technologies, the languages, and the data -- is already connected and given a user-friendly interface. We like to think of Dataiku as a "control room" for the more than 30 open source tools that we integrate with -- whether you use Hive or Pig, or code in Python, R, or Scala, Dataiku will let you use the solutions you are familiar with and seamlessly connect them to the next step in the process. And a visual interface lets you use many of these solutions even if you don't know how to code in a particular language -- or at all.
Our goal is to bring you the best of both open source tools and proprietary solutions. We continue to develop the product so that it integrates emerging open source technologies -- like our integration with Spark 2 in our latest release -- so that our users can stay more up-to-date than ever before. So, we encourage you to explore the open source world -- and we’re happy to be your control room for this cornucopia of technologies.