The DevOps movement tries to reconcile development and operations departments to encourage them to develop IT projects together. Today, we need to make that same transition in data.
What Exactly Is DevOps?
Traditionally, companies distinguished two different roles in IT:
- Developers, who work on developing new projects (the “Dev part”).
- Engineering operations (also known as system administrators) who are in charge of setting up those projects in the production environment and maintaining the servers (that’s the “Ops part,” for operations).
Obviously, with the pressure on companies to constantly innovate, it became much tougher for IT departments to keep up, as developers were shipping new code much faster. With Agile and Lean methods adopted by managers everywhere, established processes became too rigid. So instead of maintaining siloed departments, companies adopted the DevOps philosophy to get these teams to work together. DevOps: bringing Devs and Ops together at last!
DevOps today is more of a mindset than an actual person or title, and it involves everyone working on the software. The basis is to never consider your IT system and software applications as a finished product that you occasionally add new applications to. Rather, it’s a constantly improving structure on which developers and system admins work to innovate, adapt to end users’ needs, and increase its general efficiency. It’s also a reorganization of your technology stack to make this possible.
Why We Need DevOps for Data Science Projects
The reality today is that many companies suffer from the same problems with their data teams that they used to have with their development teams. More and more data projects carry great promise when they are first discussed, but are delivered late, and when they do arrive, are misunderstood by business teams and/or underperform.
The same way that software developed outside the live environment often doesn’t behave as expected, data projects often have to be completely recoded just to work in production environments. And once deployed, they have to be monitored extremely closely, because live data often drifts from the fixed historical data the models were built on. Moreover, techniques such as real-time scoring and online machine learning are increasingly popular, and they require heavy involvement from both data scientists and infrastructure engineers.
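To make that monitoring need concrete, here is a minimal sketch of a drift check that flags when a live feature’s distribution strays from its training-time baseline. The feature values, threshold, and alert wording are all assumptions for illustration, not part of any particular platform:

```python
import statistics

def drift_score(baseline, live):
    """Shift in the mean of the live data, in units of the baseline's std deviation."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(live) - base_mean) / base_std

# Hypothetical feature values: model trained on `historical`, now scoring `incoming`.
historical = [12.0, 14.5, 13.2, 12.8, 14.1, 13.6, 12.9, 13.3]
incoming = [18.2, 19.1, 17.8, 18.9, 18.4]

score = drift_score(historical, incoming)
if score > 3.0:  # assumed alert threshold: a 3-sigma shift in the mean
    print(f"ALERT: live feature drifted {score:.1f} sigmas from training data")
```

In practice a check like this would run on a schedule against each model input, which is exactly the kind of shared concern that pulls data scientists (who know what “normal” looks like) and infrastructure engineers (who run the scheduler and alerting) into the same loop.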
There is also a vicious side effect of the traditional split between data scientists, who develop projects, and infrastructure engineers, who put them into production and maintain the data management systems. It leads to a “them-vs-us” culture: when something goes wrong, during design or in production, problems bounce between teams. The same goes for business teams, analysts working on the project, and end users, who are somewhat excluded when they should be deeply involved.
What DevOps for Data Science Might Look Like
We need DevOps for data science soon. Why now? Because increasingly, everyone in the company has to be data driven, not just the data team. Managers are increasingly aware that they have to use data to make decisions and monitor their activity much more closely. They have the tools, skills, and analysts to make BI accessible without going through the data team. However, they still need an infrastructure that supports that.
On the other side of the pipe, many database engines and data management platforms are making data more accessible and more performant. There is more data available than ever, in more diverse formats than ever, and new ways of processing it make it exploitable.
So everyone working on data has to be more collaboratively involved. And the infrastructure behind it has to allow for that collaboration. In this sense, DevOps for data science would be even broader, involving:
- Data scientists and other members of data science teams
- Data engineers and infrastructure managers
- Operational business teams
How Could Data Science Projects Be Organized?
In order to understand how DevOps could work for data science, it's helpful to look at how DevOps is actually being done. The DevOps mindset has been adopted by pretty much every innovative company, each in its own way, so there are plenty of use cases available. For some, it’s done through tools that automate as much of the system administrators’ work as possible so they can focus on improving their infrastructure. For others, it’s recruiting a dedicated DevOps engineer whose role is to encourage collaboration between operations and development teams.
Google recently shared its own approach to DevOps: recruiting coders as IT admins. As Wired puts it, the approach can be summed up as: “Don’t get IT people who specialize in running Internet services to run your Internet services. Have software coders run them instead.” The idea is that coders will tire of repetitive tasks and write software capable of replacing their manual work.
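In that spirit, here is a toy sketch of a repetitive manual check turned into code: a script that reports partitions above a usage threshold instead of someone eyeballing them every morning. The paths and threshold are assumptions for the example:

```python
import shutil

def check_disk(paths, threshold=0.90):
    """Replace a manual morning check: list partitions above the usage threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        if used_fraction > threshold:
            alerts.append((path, used_fraction))
    return alerts

# Run from a scheduler instead of checking by hand; the mount points are assumptions.
for path, frac in check_disk(["/"]):
    print(f"{path} is {frac:.0%} full -- time to clean up")
```

The point is not the script itself but the shift it represents: once the check is code, it can be versioned, reviewed, scheduled, and improved like any other part of the system.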
In general, companies set up processes to involve IT operations in the development phases and developers in the production process, as well as in the monitoring of projects.
That’s how DevOps should work for data science as well. The goal is to encourage co-creation of projects among everyone involved: the business users who come to the data team with a problem; the data scientists, analysts, and algorithm developers who build the project; and the infrastructure engineers and system administrators responsible for integrating it into the current infrastructure. This can be done by creating groups that work together.
However you do it, get it done. DevOps principles grow more fundamental to data projects every day. They are vital so that your operational teams can monitor their data, understand how data projects are built, and know how to get value from the output.