In a recent webinar with Jim Kobielus of TDWI (which you can watch below), we talked about mature MLOps. You can watch our presentations, but I thought it would be helpful to summarize some of the concepts here, based on more than ten years of working with machine learning and, for the last three and a half years, a specific focus on MLOps.
First, the definition of MLOps maturity changes over time. This blog highlights the capabilities that I feel represent mature practices today; this could easily change in a year. Second, I am not proposing a maturity model with winners, losers, and laggards. Where you are in your MLOps journey is up to you and depends greatly on how many models you have built and how critical those projects are to your business. So, no judgment here, just a perspective.
Based on my experience, MLOps can be broken down into three or four key areas. I prefer four: deployment, monitoring, lifecycle management, and governance. To be clear, I view AI governance as its own area, but the intersection is significant, so it warrants discussion here. In each of these areas, roles, processes, and technology all come into play.
1. Deployment

A mature viewpoint on deployment starts with design. While machine learning operations may seem to be all about models, the reality is that models don’t exist without data to train them and data to run them. So, MLOps is really a combination of DataOps and ModelOps, together with DevOps practices. Mature MLOps deployment must meet a few key criteria, which I outline below.
First, deployment must include both data pipelines and models. Simply deploying scoring models into production without the data pre-processing steps results in a lot of extra work. So, mature players recognize the need to deploy the data pipelines and models together on production data. Second, deployment of these pipelines and models from the design and experimentation phase to production should be easy and seamless. In less mature operations, there is significant work to recode pipelines as they move onto production systems.
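To make this concrete, here is a minimal sketch of deploying the pre-processing steps and the model as a single artifact, using scikit-learn’s Pipeline. The column names, sample data, and model choice are hypothetical illustrations, not a prescription.

```python
# Sketch: bundling data pre-processing and the model into one deployable
# artifact, so production scoring runs the same transformations as training.
# Column names, data, and model choice are hypothetical.
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age": [25, 40, None, 58],
    "balance": [1200.0, 300.0, 870.0, 50.0],
    "region": ["east", "west", "east", "south"],
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
     ["age", "balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
pipeline.fit(X, y)

# Serializing the fitted pipeline ships data prep and model together,
# so production scores raw data with no re-coded pre-processing.
with open("churn_pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)

with open("churn_pipeline.pkl", "rb") as f:
    deployed = pickle.load(f)

scores = deployed.predict(X)
```

The point of the bundle is that imputation, scaling, and encoding travel with the model, rather than being rebuilt by hand on the production system.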
Finally, mature MLOps takes advantage of DevOps best practices and uses a Dev-Test-Prod approach with multiple environments, each with its own function. Without this best-practice approach, you risk putting projects into production that are not tested and ready, and the results can be disastrous.
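The Dev-Test-Prod idea can be sketched as a simple promotion gate: an artifact only moves to the next environment when its checks pass. The environment names and checks below are hypothetical placeholders; in practice this logic lives in your CI/CD tooling.

```python
# Sketch: a minimal promotion gate across Dev-Test-Prod environments.
# Environment names, artifact fields, and checks are hypothetical.

ENVIRONMENTS = ["dev", "test", "prod"]

def promote(artifact, current_env, checks):
    """Move an artifact to the next environment only if all checks pass."""
    nxt = ENVIRONMENTS[ENVIRONMENTS.index(current_env) + 1]
    failures = [name for name, check in checks.items() if not check(artifact)]
    if failures:
        raise RuntimeError(f"Promotion to {nxt} blocked: {failures}")
    return nxt

artifact = {"name": "churn_pipeline", "version": "1.2.0", "tests_passed": True}
checks = {
    "unit_tests": lambda a: a["tests_passed"],
    "has_version": lambda a: "version" in a,
}

env = promote(artifact, "dev", checks)  # passes both checks, so moves to test
```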
2. Monitoring

Monitoring is part of management, but it represents such a significant function in MLOps that it warrants its own discussion. Predictive models are the product of data, algorithms, and tuning. When you deploy data pipelines and models into production, you don’t expect your artifacts to change, but everything around them can and will: the data in production can differ from the data the model trained on, the infrastructure can break down, and more. The point is that you need systems to monitor your pipelines, models, infrastructure, and services to make sure they are doing what they are supposed to do.
For mature players, this means having pipeline monitoring, data monitoring, infrastructure monitoring, and service monitoring. You should be able to answer these questions: Did my data pipeline execute as expected, without errors? Is the production data significantly different from the training data? Is my storage and compute infrastructure up and running? Is my service responding in line with my SLA, whether in terms of response time or throughput? If you can monitor these elements, you have a good chance of catching issues and fixing them before they cause serious problems for your business, and that is a fairly mature approach. This leads me to the next area.
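The training-versus-production data question above can be sketched with a basic statistical drift check. This example uses a two-sample Kolmogorov-Smirnov test on a single hypothetical feature; real monitoring would track many features and metrics, and the alert threshold here is an assumption.

```python
# Sketch: a simple data-drift check comparing production data to training data
# with a two-sample Kolmogorov-Smirnov test. The feature, distributions, and
# threshold are hypothetical illustrations.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_ages = rng.normal(40, 10, size=5000)    # distribution seen at training
production_ages = rng.normal(48, 10, size=5000)  # shifted distribution in prod

stat, p_value = ks_2samp(training_ages, production_ages)

# A very small p-value means the two samples are unlikely to come from the
# same distribution -- a signal to alert and investigate.
ALERT_THRESHOLD = 0.01
drift_detected = p_value < ALERT_THRESHOLD
```

A check like this answers the drift question automatically on every scoring batch, rather than waiting for model quality to visibly degrade.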
3. Lifecycle Management
This is a place where your maturity really shows. Monitoring will alert you to issues (hopefully with enough time to do something about them before you have a service outage), but what you do with that information, and how you update your projects in production, is lifecycle management. Mature MLOps teams have a plan for when things go wrong, because they know they will. They have tools for troubleshooting, like access and event logs. They are also ready with failover and fallback options, such as a failover model or rules. This plan gives them the time they need to figure out what is going on and fix it, or to bring a newer version of the model through testing and into production.
Again, the dev-test-prod process from DevOps comes into play. Version management also helps, because it means you have an older version of the model to fall back on if needed. However, I don’t recommend that as a best practice: in many cases, the model will need to be retrained on newer data. Mature MLOps teams can do this on their own, without the help of the data science team, because the training pipeline is already connected to production data. They simply retrain and then test the new version to see if the issues, like data drift, are fixed. If so, they run some additional load testing, and then they have a new version they can use in production.
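The failover idea can be sketched as a scoring function that falls back to a simple rule when the primary model fails. The model, rule, and threshold below are hypothetical placeholders.

```python
# Sketch: scoring with a failover path. If the primary model errors out, fall
# back to a rules-based scorer so the service stays up while the team retrains
# and tests a new version. Model, rule, and threshold are hypothetical.

def primary_model(record):
    # Stand-in for the deployed model; here it simulates a backend outage.
    raise RuntimeError("model backend unavailable")

def fallback_rules(record):
    # Conservative business rule used only during failover.
    return 1 if record.get("balance", 0) < 100 else 0

def score(record):
    try:
        return primary_model(record)
    except Exception:
        # In production you would also log the failure and raise an alert here.
        return fallback_rules(record)

result = score({"balance": 50})  # primary fails, so the rule scores it
```

The fallback buys the time described above: the service keeps answering while the team diagnoses the primary model.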
Lastly, when they roll this new version into production, it happens seamlessly: there is no stopping and starting of services and no interruption of service. At this point, you can see that how you handle lifecycle management really does show your level of MLOps maturity.
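One way to picture a seamless rollout is an atomic reference swap: the service keeps scoring through a registry while a new, health-checked version replaces the old one in a single assignment. The class and checks below are an illustrative sketch under those assumptions, not a production serving system.

```python
# Sketch: zero-downtime model rollout via an atomic reference swap.
# Class, models, and health check are hypothetical illustrations.
import threading

class ModelRegistry:
    def __init__(self, model):
        self._lock = threading.Lock()
        self._current = model

    def score(self, record):
        with self._lock:
            model = self._current  # grab a stable reference for this request
        return model(record)

    def roll_out(self, new_model, health_check):
        # Health check runs before the swap; a failure leaves the old
        # version serving with no interruption.
        if not health_check(new_model):
            raise RuntimeError("health check failed; keeping current version")
        with self._lock:
            self._current = new_model  # single assignment, no restart

registry = ModelRegistry(lambda record: "v1")
registry.roll_out(lambda record: "v2",
                  health_check=lambda m: m({}) == "v2")
version = registry.score({})
```

Requests in flight finish against whichever version they grabbed, and new requests pick up the new version, so the service never stops.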
4. AI Governance
Governance practices are often lumped into MLOps, or vice versa. I can see the case for both, but I think the mature organization understands the role that each plays in creating a safe, high-performance operations environment. First, governance is not a dirty word. Governance just means having the operational best practices, visibility, and control in place to ensure that your organization is managing its operational, legal, and regulatory risks and meeting its compliance obligations. Having governance for AI projects just makes sense, and mature organizations and MLOps teams know it.
For example, governance asks operations to limit access to production systems to a small number of trained people. The operations teams want this too, because they know that trained people are less likely to make mistakes that result in downtime. The governance team may ask for audit logs, both to ensure that access protocols and change processes are being followed and to provide to regulators if needed. The MLOps teams like logs too, especially for troubleshooting. The governance team may ask for approvals and sign-offs on production projects. The mature MLOps team wants these as well, because they know that having business stakeholders, data scientists, and other experts in the loop will maximize the success of their production projects.
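An audit trail like the one governance asks for can be sketched as append-only, structured event records capturing who did what and which approvals were in place. The field names and in-memory storage below are hypothetical; a real system would write to durable, tamper-evident storage.

```python
# Sketch: an append-only audit trail for production changes.
# Field names, actors, and storage are hypothetical illustrations.
import json
from datetime import datetime, timezone

audit_log = []

def record_event(actor, action, artifact, approvals):
    """Append one structured audit record and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "artifact": artifact,
        "approvals": approvals,
    }
    audit_log.append(json.dumps(entry))  # serialized for export if needed
    return entry

event = record_event(
    actor="ops-engineer-1",
    action="deploy_to_prod",
    artifact="churn_pipeline:1.2.0",
    approvals=["business-owner", "data-science-lead"],
)
```

The same records serve both sides: governance gets evidence that change processes were followed, and operations gets a timeline for troubleshooting.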
I guess this is just a long-winded way of saying that mature MLOps today is largely common sense: the best practices we already know and use for software development and IT operations, applied to AI projects across deployment, monitoring, management, and governance.
If you want to hear more on this, check out my webinar with Jim from TDWI below.