Looking back to last year, the hottest topics for analytics leaders around AI and machine learning (ML) were MLOps and AI Governance. Of course, last November, ChatGPT changed everything. The emergence of Generative AI has energized the conversation around new AI use cases. However, once everyone wakes up from their Generative AI haze, the realization will hit that we still need to operationalize and govern AI projects, now more than ever.
One of the key challenges is understanding how these critical concepts fit into the AI development lifecycle. I often hear, "Is MLOps part of AI Governance, or is AI Governance part of MLOps?" The answer is neither. While both topics have recently entered our collective consciousness and often use some of the same objects in our process, they are different functions, with different users and different ways of thinking.
My Definitions
- MLOps is the practice (people, process, and technology) used to manage and automate the production lifecycle of ML and AI models. It ensures their robust deployment on production systems and continuous improvement to provide reliable outputs for production applications.
- AI Governance is the set of processes, policies, and internal enforcement needed to ensure the ethical and responsible use of AI technology, manage operational risk, and maintain legal and regulatory compliance for AI and advanced analytics projects.
People of Interest
By my definitions of MLOps and AI Governance, different individuals across the organization are responsible for the success of each function. MLOps is an operations function focused on reliability and is best suited to someone from an operational IT role. AI Governance focuses on risk and compliance and is best suited to someone with a risk or compliance background. People with different mindsets, probably from different departments, should be working on these topics.
ML Engineer:
Putting projects into production at scale requires specialized skills. The first job is to take a project from the experimentation phase into production. This is where engineers take over, and where you should see data engineers and ML engineers get involved. The engineers prepare the project for production and deploy it on production systems.
ML Operator:
Once the project is up and running on production systems, a new role is needed to manage the project over time. This ML operator is responsible for keeping the project up and running. They monitor the project's metrics and alerts, including data drift, and then troubleshoot and coordinate repairs so that the project continues to meet its SLAs for downstream apps.
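To make the data drift monitoring mentioned above a little more concrete, here is a minimal sketch of one common way to flag drift on a single numeric feature: the Population Stability Index (PSI). This is my illustration, not a prescribed method. The function names `psi` and `drift_alert`, the 10-bin layout, and the 0.2 alert threshold (a widely used rule of thumb) are all assumptions you would tune for your own project.

```python
import math
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample.

    Bins are equal-width and derived from the reference sample only.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_alert(reference: Sequence[float], current: Sequence[float],
                threshold: float = 0.2) -> bool:
    """Return True when the PSI exceeds the (assumed) alert threshold."""
    return psi(reference, current) > threshold
```

An operator might run a check like this on a schedule, comparing recent production inputs against the training data, and open a ticket when `drift_alert` returns `True`.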
AI Governance / Risk Manager:
Across the entire AI lifecycle, someone must decide when and how to apply governance practices and ensure teams follow those practices. This risk manager role also needs to ensure that the organization follows best practices for legal and regulatory compliance.
Objects of Interest
MLOps and AI Governance share many of the same objects in the AI lifecycle — access control, audit trails, failover plans, and more. The people performing each function care about some of the same things, but they care about them for different reasons.
Access Control:
Both MLOps and AI Governance teams care about who can access production systems. From an AI Governance perspective, it is a best practice, and often mandated by model risk management (MRM) regulations, that access to these systems be limited to trained operators. From an operations perspective, this makes sense as well. The operations team wants to limit downtime, and letting untrained people into the system is a sure way to break things. So, limiting who has access makes sense for everyone, but for slightly different reasons.
Audit Trails:
From an AI Governance perspective, auditing allows the business to show that it is following the correct process. For an operator, the same logs are useful for tracing the root cause of issues. If, for example, the operator sees a change right before a pipeline fails, they can fix the problem by rolling back that change and, hopefully, learn how not to cause the problem again.
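As a rough sketch of the operator's side of this, the snippet below looks up the most recent recorded change to a pipeline before a failure, which is the natural first candidate for a rollback. Everything here is hypothetical and for illustration only: the `AuditEvent` fields, the `last_change_before` helper, and the sample names do not come from any particular MLOps product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    """One immutable entry in an audit trail (hypothetical schema)."""
    timestamp: datetime
    actor: str    # who made the change
    action: str   # what they did, e.g. "deploy_model"
    target: str   # which pipeline or system was changed


def last_change_before(events: List[AuditEvent], target: str,
                       failure_time: datetime) -> Optional[AuditEvent]:
    """Return the latest event on `target` at or before `failure_time`, or None."""
    candidates = [
        e for e in events
        if e.target == target and e.timestamp <= failure_time
    ]
    return max(candidates, key=lambda e: e.timestamp, default=None)


events = [
    AuditEvent(datetime(2024, 1, 1, 9, 0), "alice", "update_config", "scoring-pipeline"),
    AuditEvent(datetime(2024, 1, 1, 9, 30), "bob", "deploy_model", "scoring-pipeline"),
    AuditEvent(datetime(2024, 1, 1, 9, 45), "carol", "deploy_model", "other-pipeline"),
]

# The pipeline failed at 10:00; find the change to investigate first.
suspect = last_change_before(events, "scoring-pipeline", datetime(2024, 1, 1, 10, 0))
```

The same records serve the governance team unchanged: the list of who did what, and when, is the evidence that the approved process was followed.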
Failover Plans:
AI Governance may mandate that the business have plans to keep operating when systems fail, and that it be able to show those plans to regulators. From an operations perspective, this is common sense. You need a way to ensure your services stay up and running, and having a plan ensures that the operations team knows what to do when things go wrong.
One Goal: Safety and Value
As the world of AI evolves, especially with the emergence of Generative AI, how we scale our AI function and grow our AI teams will be critical to our success. Thinking about the motivations, skills, and objects that people work with can help us better understand where we have unique functions. MLOps and AI Governance are related, and the teams care about many of the same things, but they are different functions with different roles and goals. Ultimately, we need both to generate real value and safely scale AI adoption.