The DevOps movement has been adopted by many companies in the past ten years as a way to reorganize their IT processes and foster innovation. At its core, it tries to reconcile development and operations departments and encourage them to build IT projects together. Today, we need to make that same transition in data by introducing DataOps.
To understand DataOps, we must first understand DevOps
Traditionally, companies distinguished two different roles in IT:
- developers, who build new projects (the “Dev” part)
- operations engineers (AKA system administrators), who receive those projects from the developers, set them up in the production environment, and maintain the servers (the “Ops” part, for operations).
With the pressure on companies to innovate constantly, it became much tougher for IT operations to keep up, as developers were pushing out new code much faster! With Agile and Lean methods adopted by managers everywhere, established processes became too rigid.
So instead of maintaining siloed departments, companies adopted the DevOps philosophy to get these teams to work together. DevOps, as in a movement to bring Devs and Ops together at last!
What exactly is DevOps?
After extensive browsing of the Internet and talking to smart people around me, here is what I can tell you. DevOps is a movement to encourage collaboration, automation, and constant innovation. It emerged when companies started applying Agile methods to engineering and felt the need for increased collaboration. They started bringing in new people, the DevOps. These people were part of both the development team and the IT team, and helped make the connection between the two. When this wasn’t enough to support the innovation rate of big tech companies, those companies redefined the movement.
DevOps today is more of a mindset than an actual person / title. It’s a mindset that involves everyone working on the software. The basis is to never consider your IT system and software applications as a finite product that you occasionally add new applications to. Rather, it’s a constantly improving structure on which developers and system admins work to innovate and adapt to the end users’ needs, as well as increase its general efficiency. It’s also a reorganization of your technological stack to make this possible.
Why we need DataOps for data science projects
The reality today is that many companies suffer from the same problems with their data teams that they used to have with their development teams. More and more data projects hold great promise when they are first discussed but are delivered late and, once delivered, are misunderstood by business teams, under-perform, or both.
In the same way that software developed outside of the live environment often doesn’t behave as expected, data projects often have to be completely recoded just to work in production environments. And once deployed, they have to be monitored extremely closely, because live data often drifts away from the fixed historical data the project was built on. Moreover, techniques such as real-time scoring or online machine learning are increasingly popular, and they require heavy involvement from both data scientists and infrastructure engineers.
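To make the monitoring point concrete, here is a minimal sketch of one possible drift check. It is only an illustration under stated assumptions: the function names, the threshold, and the sample values are all hypothetical, and real projects usually rely on richer statistical tests. It simply measures how far the mean of a live feature has moved from its historical mean, in units of the historical standard deviation.

```python
from statistics import mean, stdev


def drift_score(reference, live):
    """Return how many reference standard deviations the live mean has moved.

    `reference` is the historical sample the project was built on,
    `live` is a recent sample of the same feature from production.
    """
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has no variance")
    return abs(mean(live) - mean(reference)) / ref_std


def has_drifted(reference, live, threshold=0.5):
    """Flag drift when the score exceeds a threshold (0.5 is an arbitrary choice)."""
    return drift_score(reference, live) > threshold


# Hypothetical values for a single numeric feature, e.g. average basket size.
historical_basket = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 22.0, 21.0]
live_basket_ok = [21.0, 20.0, 22.0, 21.0]
live_basket_shifted = [30.0, 32.0, 29.0, 31.0]

if __name__ == "__main__":
    for name, sample in [("ok", live_basket_ok), ("shifted", live_basket_shifted)]:
        status = "drift" if has_drifted(historical_basket, sample) else "stable"
        print(f"{name}: {status}")
```

A check like this is something data scientists and infrastructure engineers can own together: the former choose the features and the threshold, the latter schedule it and route the alerts.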
The traditional split between data scientists developing projects on one side, and infrastructure engineers putting them into production and maintaining the data management systems on the other, also has a pernicious effect. It leads to a “them-vs-us” culture. When something goes wrong, during design or production, problems bounce between teams. This is also true for business teams, analysts working on the project, and end users, who feel excluded. (If you’re interested, find out more on how to prevent other frequent problems of data science production.)
This is why we need a movement similar to DevOps in data: DataOps.
From DevOps to DataOps for data science
And we need DataOps soon. Why now? Because data science teams are facing new challenges, along with opportunities they can take advantage of.
Everyone in the company today has to be data driven, not just the data team. Managers are increasingly aware that they have to use data to make decisions and monitor their activity much more closely. They have tools, skills, and analysts available to make BI accessible without going through the data team. However, they still need an infrastructure that supports that.
On the other side of the pipe, many database engines and data management platforms are improving the accessibility of data and increasing performance. There is more data available than ever, in formats much more diverse than ever, and new ways to process it make it exploitable.
So everyone working on data has to be more collaboratively involved, and the infrastructure behind it has to allow for that collaboration. In this respect, DataOps is even broader than DevOps. It involves:
- Data scientists, and other members of data science teams
- Data engineers, and infrastructure managers
- And operational business teams!
How do you organize your Data Science Projects with DataOps?
To see how to organize DataOps, the simplest approach is to look at how DevOps is done. The DevOps mindset has been adopted by pretty much all innovative companies, each in its own way, so there are lots of examples available. For some, it’s done through tools that automate as much of the system administrators’ jobs as possible so they can work on improving their infrastructure. For others, it means recruiting a DevOps person whose role is to encourage collaboration between operations and development teams.
Google recently communicated on its DevOps processes, with an approach of its own: recruiting coders as IT admins. As Wired puts it, the approach can be summed up as: “Don’t get IT people who specialize in running Internet services to run your Internet services. Have software coders run them instead.” The reasoning is that coders will tire of repetitive tasks and write software capable of replacing their manual work.
In general, companies set up processes to involve IT operations in the development phases, and developers in the production process and the monitoring of projects.
That’s how DataOps should work as well. The goal is to encourage co-creation of projects between the business users who come to the data team with a problem; the data scientists, analysts, and algorithm developers who develop the project; and the infrastructure engineers or system administrators who are responsible for integrating the project into the current infrastructure. This can be done by creating groups that work together. It can be done by having a DataOps person who is aware of all the parties’ jobs and constraints. Or it can be done with a tool that allows them all to work together on their projects.
However you do it, get it done. DataOps is growing more fundamental to data projects every day. It’s vital so that your operational teams can monitor their data, understand how data projects are built, and see how they can get value from the output. And so that your data scientists never go to your infrastructure engineers with a project that just “can’t be recoded,” or that was trained on data from six months ago that has since become unavailable.
Keep your ears open: you’ll be hearing a lot more about DataOps soon...
If you want more tips on how to organize your data projects and put them into production efficiently, check out our Production guidebook.