A colleague recently told me a story about a large U.S. tech company that built its own data science and machine learning (ML) platform. It employed some of the best data scientists and software developers in the world, and yet after three years, the platform was only 50% functional — ML model operationalization was still a struggle.
If they couldn’t do it, is there hope for the average organization to successfully scale AI efforts? Yes! However, the reality is that hope probably doesn’t lie in building a data science and machine learning platform from scratch. While there are other components of a company’s AI strategy where the build or buy debate is worth considering (e.g., models/applications or data processing tools — see Figure 1), it’s becoming increasingly difficult for organizations to scale AI long-term by building an entire platform to coordinate all AI efforts.
Figure 1: Build and Buy Considerations for AI Efforts Across the Organization
This blog post unpacks five reasons why, for most companies looking to establish sustainable AI for decades to come, building an Enterprise AI platform is likely an insurmountable challenge. So, instead of asking yourself how to build an AI platform, ask yourself why you should consider the more scalable option: finding an end-to-end solution that covers the entire lifecycle.
1. Rapid Evolution of Data Science & ML Technologies
The data science and machine learning technology revolution is exciting, with new technologies emerging rapidly and disrupting at massive scale. Companies that are agile and able to adapt to the pace of the revolution succeed and have a leg up on their competition, while large organizations that are slow to pivot often struggle to keep up.
To understand how complex and fast-moving the landscape is, look no further than Technoslavia, a rapidly evolving nation of shifting tectonic plates and islands changing every year since 2015 (Figure 2).
Figure 2: Technoslavia: The (Fragmented) World of Data Infrastructure in 2020
This furious pace of innovation means AI platforms need to be up-to-date — a very costly exercise that requires constant focus and foresight into emerging technologies. These additional opportunity costs must be considered when building a homegrown AI platform.
For example, see Figure 3: if you had not prioritized Kubernetes as a key technology integration in your AI platform build back in 2017, it is safe to say your solution would not meet today's expectations and would struggle to be competitive while you bear an unnecessary burden of infrastructure costs.
Figure 3: A look based on Google Trends at the evolution of the world of data infrastructure
2. Development and Maintenance Costs
We have worked on a number of total cost of ownership (TCO) exercises with our clients, including organizations that have embarked on their own internal AI platform builds (asking themselves how to build an AI platform that's scalable for hundreds or even thousands of models in production). Aligning with Gartner's 2020 Critical Capabilities for Data Science and Machine Learning Platforms, the estimates below show a back-of-the-envelope calculation to consider when deciding whether to build an AI platform for today's modern enterprise.
For the sake of this exercise, we will not consider the build of each capability but more so the efforts required to “stitch together” the capabilities to deliver an end-to-end solution:
Capability #1 integrated with Capability #2
= 8-person team ($100K p.a. salary each) for 6 months
= 8 x $100K / 2 (6 months is half a year)
= $400K for 1 link connecting 2 capabilities
Therefore, for 11 capabilities (n),
= n*(n-1)/2
= 11*(10)/2
= 55 links
= 55 x $400K
= $22 million
Admittedly, this is an oversimplification of the process, but regardless of the metrics used, the total investment likely far exceeds initial cost estimates and project delivery timeline expectations.
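For readers who want to plug in their own numbers, the back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumptions: every pair of capabilities needs one integration link, and every link costs the same. The function names and default values are hypothetical and simply mirror the figures used above.

```python
def pairwise_links(capabilities: int) -> int:
    """Number of distinct pairs among n capabilities: n * (n - 1) / 2."""
    return capabilities * (capabilities - 1) // 2


def cost_per_link(team_size: int = 8,
                  annual_salary: float = 100_000,
                  months: float = 6) -> float:
    """Salary cost of one team working on one integration link.

    Defaults mirror the illustrative figures above (assumptions, not benchmarks).
    """
    return team_size * annual_salary * months / 12


def integration_cost(capabilities: int = 11, **link_assumptions) -> float:
    """Total cost of linking every capability to every other capability."""
    return pairwise_links(capabilities) * cost_per_link(**link_assumptions)


if __name__ == "__main__":
    links = pairwise_links(11)
    per_link = cost_per_link()
    total = integration_cost(11)
    print(f"{links} links x ${per_link:,.0f} per link = ${total:,.0f}")
```

Because the number of links grows with the square of the number of capabilities, adding even one or two capabilities to the platform scope moves the total far more than trimming the per-link estimate does.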
Once the platform is developed, leaders also need to consider maintenance and ongoing development. This slightly older (but still relevant) paper from Google research highlights that machine learning code itself contributes only 5% to the high-level technical debt, while the remaining 95% is tied up in “glue code” (i.e., the dependencies holding it all together).
Figure 4 [From Google research]: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle; the required surrounding infrastructure is vast and complex
Organizations today, especially considering the current worldwide economic climate, simply cannot scale with this level of debt. This is true from both a resourcing and a return on investment (ROI) perspective.
3. Resourcing & the Data Talent Shortage
Let’s assume for a moment that your organization does have the requisite funding to undertake building an AI platform from scratch — should you? Unless you are Google, Amazon, or Facebook with an army of data scientists and development talent, the answer is still likely no. Why?
The lack of AI talent will plague organizations for some time to come — there is simply not enough data talent to meet demands. In order to succeed in today's challenging economy, companies will need to leverage their skilled data science resources to tackle big business challenges rather than use this type of resource to build AI platforms.
Figure 5 [Source]: The lack of top data talent will plague organizations for years to come. Does it make sense to use these scarce resources to build an AI platform, or to actually get started solving business problems?
4. Development Complexity
The aforementioned Google research paper not only illustrates the high technical debt brought on by building an AI platform, but also its complexity. There is so much “glue” — so many features that are outside the core functionality of simply building a machine learning model — that building all of them from scratch to have an AI platform that truly allows for the scaling of Enterprise AI efforts is incredibly challenging.
For example, beyond the data science and engineering teams, today's enterprise has many other people who could benefit from the use of data in their day-to-day roles, advancing the organization’s overall ability to leverage AI in all areas. In order to scale with limited resources, it is essential that organizations consider leveraging and upskilling this talent rather than depending solely on hiring new talent.
That means that, beyond ease of project documentation, the ability for data experts and novices to collaborate via an intuitive UI is a critical component of an AI platform. However, these kinds of features are generally lower on the list of priorities (not to mention non-trivial to incorporate into the platform in a robust way) for the software development teams building homegrown solutions.
Figure 5: Scaling AI is about giving everyone the power to work with data via a robust AI platform usable by all profiles, not just a small group of siloed experts
Other features that are not only challenging to build but potentially hard to get teams to prioritize are elements of an AI platform that steer the enterprise toward responsible, human-centric AI. When development teams are under pressure to deliver a platform, features like subpopulation analysis and individual prediction explanations may fall by the wayside.
Figure 6: AI platforms are very complex, so when building them from scratch, they may not meet the real needs of the business.
This can expose the company to very real risk, as machine learning failures due to lack of Responsible AI practices are well documented (from unsafe cancer treatment recommendations to recruitment gender bias, to name just two of the most recent and disastrous examples).
5. Security Costs and Concerns
The most valuable data an organization owns is often its most sensitive data. In addition to the development costs of an AI platform, the investment in security is also paramount. Data needs to be secured at rest and in transit, access controlled, and compliant.
A recent AWS TCO paper indicates that the cost of adding compliance and security on top of programming notebooks is $1.3 million and spans a total of 91 development months. As above, this is just to meet today's standards. Compliance is constantly evolving with stricter protocols and guidelines, adding further costs and development time for some time to come.
The Bottom Line
Delivering sustained success with AI across the enterprise involves hundreds, if not thousands, of data projects that benefit all divisions and departments. To achieve this, organizations need an AI platform that allows not only for simple model operationalization, but also for collaboration between different profiles, Responsible AI, and of course robust security.
It’s worth noting that open source provides fantastic resources to tackle each of these categories, and while one can source and build solutions for each individual requirement, an often-overlooked challenge lies in having all of these categories work together in harmony.
Often the short-term view may be to build a highly customized solution that might solve today's most pressing use cases. However, for even a small number of use cases, the cost can quickly creep into the multi-millions with development spanning many years and technical debt accumulating exponentially.