When scaling AI, IT leaders face many challenges around governance, democratization, security, and speed (as we outlined in our previous article). Here, we outline some of the most impactful steps they can take to help the company succeed in its AI initiatives: scaling push to production, keeping architecture simple, and building sound MLOps strategies.
Scale Push to Production
One of the biggest steps IT teams must take to enable organizational change around AI is scaling the ability to put entire AI projects into production. This includes not only the processes and tools for doing so, but also educating more people across the business about what pushing to production actually means, so that they understand the benefits, the work involved, and the risks.
Business leaders view the rapid deployment of new systems into production as key to maximizing business value, but this is only true if deployment can be done smoothly and at low risk. Continuous integration and continuous delivery (CI/CD) concepts apply to traditional software engineering, but they apply just as well to data science, machine learning, and AI systems.
After successfully developing a model, a data scientist should push the code, metadata, and documentation to a central repository and trigger a CI/CD pipeline. An example of such a pipeline could be:
- Build the model
  - Build the model artifacts
  - Send the artifacts to long-term storage
  - Run basic checks (smoke tests/sanity checks)
  - Generate fairness and explainability reports
- Deploy to a test environment
  - Run tests to validate ML performance and computational performance
- Deploy to production environment
  - Deploy the model as canary
  - Fully deploy the model
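The stages above can be sketched as an ordered list of steps that halts at the first failure, so nothing reaches production unless every earlier check has passed. This is only a minimal illustration: `run_pipeline`, `PipelineResult`, and `placeholder` are hypothetical names, and each placeholder would be replaced by a call into whatever build, test, and deployment tooling the team already uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class PipelineResult:
    """Records which steps completed and where the run stopped, if it did."""
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.failed_step is None


def run_pipeline(steps: List[Step]) -> PipelineResult:
    """Run steps in order and stop at the first one that reports failure."""
    result = PipelineResult()
    for name, action in steps:
        if not action():
            result.failed_step = name
            break
        result.completed.append(name)
    return result


def placeholder() -> bool:
    """Hypothetical stand-in for real work; always reports success."""
    return True


# Step names mirror the example pipeline in the text.
PIPELINE: List[Step] = [
    ("build model artifacts", placeholder),
    ("send artifacts to long-term storage", placeholder),
    ("run smoke tests and sanity checks", placeholder),
    ("generate fairness and explainability reports", placeholder),
    ("deploy to test environment", placeholder),
    ("validate ML and computational performance", placeholder),
    ("deploy model as canary", placeholder),
    ("fully deploy model", placeholder),
]

if __name__ == "__main__":
    outcome = run_pipeline(PIPELINE)
    print("succeeded:", outcome.succeeded, "| steps run:", len(outcome.completed))
```

Keeping the pipeline as plain data makes it easy to start with two or three steps and add more as quality or scaling problems appear.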
Many scenarios are possible and depend on the application, the risks from which the system should be protected, and the way the organization chooses to operate. Generally speaking, an incremental approach to building a CI/CD pipeline should be preferred: a simple or even naïve workflow on which a team can iterate is often much better than starting with complex infrastructure from scratch.
A starting project does not have the infrastructure requirements of a tech giant, and it can be hard to know upfront which challenges deployments will face. There are common tools and best practices, but there is no one-size-fits-all CI/CD methodology. That means the best path forward is starting from a simple (but fully functional) CI/CD workflow and introducing additional or more sophisticated steps along the way as quality or scaling challenges appear.
Keep Architecture Simple
When it comes to architecture for supporting AI systems, things can get complicated relatively quickly. Sometimes, it’s simply the result of legacy systems: in enterprises with complex organizational structure and lots of history, it’s impossible to start from scratch, and things might already be messy before getting started. Of course, developing a best-in-class AI platform wouldn’t be so difficult if one were starting with a blank slate!
But oftentimes teams add complexity because they want to work with certain technologies or try certain things, even if the business needs don’t require something so intricate. Overly complicated systems can seriously hinder AI efforts because it becomes increasingly difficult to implement additional tools and much harder to maintain the system over time.
Today, many companies in the United States aren’t building the systems themselves but rather working with consultants for support. Regardless, the message is the same: keep it simple in terms of architecture so that even as technologies ebb and flow, switching between them is seamless both for the IT team and for the businesses they are supporting.
Build Sound MLOps Strategies
It’s one thing to smoothly deploy the first versions of models, but what about the next ones? How do people in the organization make decisions to upgrade models, and who is responsible for it?
MLOps often gets lumped in with data or AI governance, but the two are not the same. Governance (practices and processes ensuring the management of data assets within an organization) is largely owned by IT managers and their teams, whereas almost everyone in the organization, including IT teams, has a role to play in MLOps (the standardization and streamlining of machine learning lifecycle management).
MLOps is important not only because it helps mitigate the risk of machine learning models in production (though that is one good reason to develop MLOps systems), but also because it is an essential component of deploying machine learning efforts at scale and, in turn, benefiting from the corresponding economies of scale. Going from one or a handful of models in production to tens, hundreds, or thousands that have a positive business impact will require MLOps discipline.
Good MLOps practices will help teams at a minimum:
- Keep track of versioning, especially with experiments in the design phase.
- Understand whether retrained models are better than the previous versions (and promote to production the models that perform better).
- Ensure (at defined periods — daily, monthly, etc.) that model performance is not degrading in production.
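The last two practices can be reduced to two small checks, sketched here under stated assumptions: a single higher-is-better metric such as accuracy, a hypothetical `min_gain` margin a retrained model must clear, and a hypothetical `tolerance` for how far production performance may drop below its baseline. The function names are illustrative and not taken from any particular MLOps tool.

```python
from statistics import mean
from typing import Sequence


def should_promote(candidate_score: float,
                   production_score: float,
                   min_gain: float = 0.0) -> bool:
    """Promote a retrained model only if it beats production by more than min_gain."""
    return candidate_score > production_score + min_gain


def is_degrading(recent_scores: Sequence[float],
                 baseline: float,
                 tolerance: float = 0.05) -> bool:
    """Flag the model when its recent average falls below baseline minus tolerance."""
    if not recent_scores:
        raise ValueError("recent_scores must contain at least one value")
    return mean(recent_scores) < baseline - tolerance


if __name__ == "__main__":
    # Retrained model scores 0.91 against 0.88 in production: promote.
    print(should_promote(0.91, 0.88))
    # Recent production scores average well below the 0.90 baseline: flag it.
    print(is_degrading([0.80, 0.78], baseline=0.90))
```

A scheduled job (daily, monthly, or whatever period the team defines) could call `is_degrading` on fresh evaluation results and raise an alert or trigger retraining when it returns `True`.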
For IT managers and their teams, MLOps needs to be integrated into the larger DevOps strategy of the enterprise, bridging the gap between traditional CI/CD and modern machine learning. That means systems that are fundamentally complementary and that allow DevOps teams to automate tests for machine learning just as they can automate tests for traditional software.
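As a rough illustration of automating tests for machine learning in the same way as tests for traditional software, a model quality check can be written as ordinary test functions that a runner such as pytest discovers by their `test_` prefix. Everything in this sketch is a stand-in: `predict` is a hypothetical threshold rule in place of a trained model, `HOLDOUT` is a made-up labeled sample, and the 0.75 accuracy floor is an arbitrary example value.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled holdout examples: (feature, expected label).
HOLDOUT: List[Tuple[float, int]] = [
    (0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.55, 0),
]

MIN_ACCURACY = 0.75  # example quality gate; the value is an assumption


def predict(x: float) -> int:
    """Hypothetical stand-in for a trained model: a simple threshold rule."""
    return 1 if x >= 0.5 else 0


def accuracy(model: Callable[[float], int],
             examples: List[Tuple[float, int]]) -> float:
    """Fraction of examples for which the model returns the expected label."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)


def test_accuracy_meets_threshold() -> None:
    # Fails the build if model quality drops below the agreed floor.
    assert accuracy(predict, HOLDOUT) >= MIN_ACCURACY


def test_predictions_are_valid_labels() -> None:
    # A software-style sanity check: outputs must be one of the known classes.
    assert all(predict(x) in (0, 1) for x, _ in HOLDOUT)
```

Because these are plain test functions, they can run in the same CI/CD job as the rest of the test suite, which is one way to make the ML checks and the software checks complementary.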