While a majority of AI’s business value comes from deploying models operationally, a significant percentage of data science projects never actually make it out of the lab to even start making a real-world impact. Why is that so? In this recap from a recent webinar with GigaOm Research featuring Dataiku’s Lead Data Scientist Katie Gross, we break down some of the key barriers to machine learning deployment and what data teams (notably data scientists) have the power to do to help prevent this from happening.
Here are a few key reasons behind data science projects never being operationalized:
1. Hype-driven initiatives (e.g., “I want us doing AI by the end of the year!”)
Many times, teams want to do AI because it’s flashy and great to tell investors, but they don’t actually know what they want to do with it. If it’s not actually valuable and the organization doesn’t have business metrics that AI would improve, the initiative will probably fall flat.
Teams should avoid building out data projects for the sake of it. Instead, they should put a strategic plan in place covering how the initiative will affect existing roles, what investment and resources it will require, what they hope to get out of data science, and more. Helping staff understand how data science and AI fit into the company’s larger strategy can be just as important as educating people on the technology itself. When the value is clearly communicated, existing employees can more easily see how AI relates to their role.
2. Not starting with a business objective or use case from the beginning
Many times, organizations jump into using AI and machine learning (such as in a new part of the organization that isn’t currently using the technology for any of its operations) without being sure exactly how it can provide impact or value. For example, a team may have heard that building out a churn prediction model could prove fruitful, but they haven’t defined the goals, the business problem the model will help solve, or how the model will be used in production. Data scientists and anyone working directly with machine learning models should always make sure there’s no disconnect between building the model and actually doing something with it. Asking how the model will be used is a seemingly simple step that many companies skip at the beginning, and shouldn’t.
3. Data scientists and IT not pushing for discipline
This really comes down to having certain standards in place to ensure that things don’t break in production. Once a tool (such as Git) has been established to manage code versioning, the next step is to package the code and data. It’s important to create packaging scripts that generate snapshots of both code and data as a zip file, consistent with the model (or model parameters) being shipped. The zip file can then be deployed to production. Keep an eye on data file sizes: if the data is too large to bundle, you will need to step in to snapshot and version the necessary data files separately.
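As a rough illustration of the packaging step described above, here is a minimal Python sketch that uses only the standard library. The function name `build_snapshot`, the `code/` and `data/` folder layout inside the archive, the `manifest.json` file, and the `max_data_bytes` threshold are all assumptions made for this example, not part of any particular tool. It zips code and data together, records a checksum per file, and sets aside data files that are too large so they can be versioned another way.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_snapshot(code_dir, data_dir, out_zip, version, max_data_bytes=50_000_000):
    """Zip code and data into one snapshot and return its manifest.

    `version` is a free-form label (for example a Git commit hash) that ties
    the snapshot to the model being shipped. Data files larger than
    `max_data_bytes` (an illustrative threshold) are left out of the zip and
    listed in the manifest so they can be snapshotted separately.
    """
    manifest = {"version": version, "files": {}, "skipped_large_data": []}
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for label, root in (("code", Path(code_dir)), ("data", Path(data_dir))):
            for path in sorted(p for p in root.rglob("*") if p.is_file()):
                arcname = f"{label}/{path.relative_to(root).as_posix()}"
                if label == "data" and path.stat().st_size > max_data_bytes:
                    manifest["skipped_large_data"].append(arcname)
                    continue
                archive.write(path, arcname)
                manifest["files"][arcname] = sha256_of(path)
        # Store the manifest inside the archive so the snapshot describes itself.
        archive.writestr("manifest.json", json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

A team might call this from a release script with the current Git commit hash as `version`, then review `skipped_large_data` to decide which data files need their own snapshotting process.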
Change Management Challenges in Machine Learning Deployment
1. Data scientists like to do their own things and enjoy full control: When working on their own laptop, data scientists can install whatever packages they want, set up their virtual environments however they like, and see and control everything themselves. Often, though, they need to collaborate with other people, and that’s where collaborative data science platforms come into play. By making use of version control technology like Git or working on a machine that exists in the cloud, teams can collaborate with colleagues and avoid silos and duplicate work.
2. People don’t always trust the model or the underlying technology: Model development is an iterative process. There should be feedback loops in which the business side monitors the model’s effectiveness and communicates that back to the data science team, which can then react by changing features, using different training data, and so on. If the stakeholders on the business side aren’t involved in assessing the real-world utility of the model, this is likely to lead to problems down the road. In general, it’s important to use technology to augment (not replace) the work done by humans, who are kept in the loop for reality checks and human judgement on elements of accuracy and fairness.
3. IT translating models into traditional languages/platforms like Java and .NET: Given that the smooth flow of the modeling process relies heavily on a consistent IT environment, it’s important to deploy with the same newer technologies the models are built in, such as Python and R, rather than translating them. Code environments make it easy to work with technologies (such as scikit-learn) that are improved and updated frequently.
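One way to keep the environment consistent between the lab and production is to check installed package versions against a pinned list before a model is served. The sketch below is a minimal Python example under stated assumptions: the function name `find_version_mismatches` and the pinned version numbers in the usage note are hypothetical, and the only library call it relies on is the standard library’s `importlib.metadata.version`.

```python
from importlib import metadata


def find_version_mismatches(pinned, lookup=metadata.version):
    """Compare pinned package versions with what is installed.

    `pinned` maps package names to the exact versions used in training.
    `lookup` defaults to importlib.metadata.version and can be swapped out
    for testing. Returns {package: (expected, installed)} for every package
    that differs; `installed` is None when the package is missing.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches
```

For example, calling `find_version_mismatches({"scikit-learn": "1.4.2"})` (an illustrative version number) at service startup returns an empty dictionary when production matches the training environment, and a non-empty one that can be logged or used to block the deployment when it does not.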
Tooling Issues With Machine Learning Deployment
If data scientists use a disparate combination of data science tools, or use them inconsistently, chances are the data workflow (including machine learning deployment) will be anything but seamless. By allowing everything data-related to happen in one single tool (such as Dataiku), teams gain more transparency into how the data was cleaned and modeled, version control and rollback capabilities, and easier collaboration.
After experimenting with a range of models built on historic training data, the natural next stage of the AI lifecycle is machine learning deployment to score new, unseen records.