An Introduction to FinOps for Controlling the Cost of AI Projects

Dataiku Product, Scaling AI | François Sergot

In most companies, two departments have a special impact on software projects: finance and IT. Historically, they have inherited the responsibility of enforcing internal rules on cost and software consistency, hopefully for the good of the whole company. In practice, enforcing finance- and IT-centered rules on other teams often results in conflict, voiding any potential benefit.

To overcome this, more and more companies are adopting a philosophy where finance, IT, and business actually work together with a value-oriented mindset and a common goal: successfully delivering key projects. This approach has coalesced around the term FinOps.

Machine learning (ML) projects are no exception and they should embrace the FinOps approach for several reasons:

  • As ML matures, it needs to be cost-accountable. The sooner ML teams work on cost management, the smoother this transition will be.
  • A majority of ML projects run on public cloud infrastructure (e.g., AWS, Azure, Google Cloud), where cost control is critical.
  • ML is inherently resource-intensive, be it to store and process the vast amount of data it requires or to train models, which inevitably translates to high costs.


FinOps is a holistic approach that aims to bridge the cultural gap between teams. From a practical standpoint, you will need accurate cost numbers and allocations to fuel those discussions. This is where the ML solution comes into play: it needs to provide the right data for those teams to make sound strategic decisions. Let’s look at the types of costs the FinOps team may want to examine.

Money Must Be Funny in a Rich Man's World

If you have a financial background or are a hardened budget owner, you will be in familiar territory here. A typical cost analysis is based on CAPEX and OPEX:

  • CAPEX is short for capital expenditure and comprises all spending that results in your company acquiring an asset (e.g., a physical server).
  • OPEX is short for operational expenditure and covers all spending you incur regularly to operate your business (e.g., a cloud service subscription).

From a financial point of view, OPEX has always been more closely monitored, as it impacts key financial indicators such as EBITDA, and your CFO and shareholders are surely watching those numbers closely.

Over the last decade, however, cost has actually shifted from CAPEX to OPEX. How can this be? It essentially comes down to one trend: the cloud.

With infrastructure as a service, cloud providers offer a good compromise between pursuing numerous, potentially high-value business initiatives and keeping financial control: your cost starts low and increases with business traction, meaning no big spending at project kickoff when success or failure is still uncertain.

Once live, another key aspect is that this cost evolves in sync with activity: if the project is cancelled or downsized, it will consume fewer or no resources, and its OPEX impact will decrease accordingly. If the project is expanded, it will scale up linearly (no big hardware investment when current infrastructure reaches its limits) and can handle peak activity. This capacity to match cost with activity is one of the main reasons companies are engaging in cloud migration projects.

And this is where IT comes into play: they usually are the “owner” of the cloud platform and are tasked with ensuring its consistency, security, and usability. And, most importantly, they own the cost and will need to reallocate it properly. Hence the FinOps triangle:

[Diagram: the FinOps triangle of finance, IT, and business]

However, to make this collaboration work, you need reliable data from the field — decisions cannot be taken on uncertain ground, and you can’t let the data platform OPEX constantly rise.

So, a healthy ML solution in a FinOps world must be able to: 

  • Measure cost globally across all services used, to help build reliable budgets and cost analyses.
  • Relate this cost to meaningful and actionable business axes, such as per project, infrastructure, or solution.

This will allow the FinOps team to make decisions based on updated, accurate costs and help highlight where the money is spent as well as where optimization could be conducted (e.g., if two similar projects have widely different costs, maybe one needs to be cost-optimized).

We will see next what numbers actually make sense in a data platform.

Cost Typology of a Data Platform

You obviously need people to drive your ML project, but we will not cover those human costs here. However important they are, we will leave them to a more HROps-style approach, as they differ widely from one company or project to another and have few ML specificities.

When dealing with the other costs, you have two big buckets: data and compute.

  • The cost of data is what you pay for storage. Most cloud providers invoice this per gigabyte (GB) per month, and the resulting total data storage cost (even with huge amounts of data) is usually rather low. For example, on AWS, storing 10 terabytes (TB), or 10,000 GB, of data costs about $250 per month on S3 or Redshift. When looking at a global budget, this cost is usually not the issue and can even be recorded approximately without impacting final budget decisions.
  • The cost of compute is what you pay for the processing power needed to transform data, train a model, or make predictions. This cost is more uncertain and potentially much higher, especially in an MLOps approach that encourages frequent retraining of models. Compute costs can also be very diverse — from SQL engines to Kubernetes clusters — making them harder to aggregate. (A quick back-of-the-envelope comparison follows this list.)
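To make those orders of magnitude concrete, here is a back-of-the-envelope comparison. All prices below are illustrative assumptions, not quotes: check your provider’s current pricing for real numbers.

```python
# Back-of-the-envelope comparison of storage vs. compute costs.
# All prices are illustrative assumptions, not actual quotes.

S3_PRICE_PER_GB_MONTH = 0.023    # assumed S3 standard-tier price (USD)
INSTANCE_PRICE_PER_HOUR = 3.00   # assumed price of a GPU training instance (USD)
WEEKS_PER_MONTH = 4.33

# Storing 10 TB (10,000 GB) for one month:
storage_cost = 10_000 * S3_PRICE_PER_GB_MONTH
print(f"Monthly storage for 10 TB: ${storage_cost:,.0f}")        # ~$230

# Retraining a model weekly, 12 hours per run, on 4 instances:
compute_cost = 4 * 12 * INSTANCE_PRICE_PER_HOUR * WEEKS_PER_MONTH
print(f"Monthly retraining compute: ${compute_cost:,.0f}")       # ~$620
```

Even with modest assumptions, a routine retraining schedule can outweigh the bill for all of your stored data, which is why compute deserves the closer monitoring.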

One other aspect is the cost of integration for your entire solution. This is hard to estimate, and there are unfortunately no standards here. It highly depends on your ML platform architectural decisions. Typically, the more fragmented your platform is between solutions, the more integration costs you will have.

How Does Dataiku Help Manage AI Costs?

As Dataiku is conceptually an end-to-end platform and, as such, integrates with many services, measuring usage and costs is a natural fit.

We will see below how you can leverage Dataiku itself to have a holistic report of your AI project costs with a business perspective. This is not the only option you have — others (like leveraging specific cost-aggregation tools or cloud vendors’ native tools) have their advantages and drawbacks as well.

How to Monitor the Cost of Data with Dataiku

As we saw above, the cost of data is usually small compared to the cost of compute. With Dataiku, you can calculate it using standard data processing, in one of two ways: either connect directly to the storage system, fetch the size of all databases and schemas, and compute the cost from the pricing you have; or connect to the cloud provider’s billing data to fetch the cost and then process it.
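As a minimal sketch of the first (direct connection) approach, assuming your storage is AWS S3, boto3 is installed, and credentials are already configured:

```python
import boto3

# Assumed S3 standard-tier price (USD per GB per month); adjust to your contract.
PRICE_PER_GB_MONTH = 0.023

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Sum object sizes per bucket and convert to an estimated monthly cost.
for bucket in s3.list_buckets()["Buckets"]:
    total_bytes = 0
    for page in paginator.paginate(Bucket=bucket["Name"]):
        total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))
    gb = total_bytes / 1024**3
    print(f"{bucket['Name']}: {gb:,.1f} GB, about ${gb * PRICE_PER_GB_MONTH:,.2f}/month")
```

In Dataiku, such a script could run as a Python recipe feeding a dataset behind your cost dashboard.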

Globally speaking, the biggest challenge here will be deciding how you want to allocate the cost, as data storage is shared among several projects; a simple proportional rule is sketched below.
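One common rule, shown here purely as an illustration with made-up project names and numbers, is to split the shared storage bill in proportion to each project’s footprint:

```python
# Illustrative allocation of a shared storage bill across projects,
# proportional to the gigabytes each project stores (numbers are made up).
shared_monthly_bill = 230.0                                    # USD
gb_per_project = {"CHURN": 6_000, "FRAUD": 3_500, "POC_X": 500}

total_gb = sum(gb_per_project.values())
for project, gb in gb_per_project.items():
    share = shared_monthly_bill * gb / total_gb
    print(f"{project}: ${share:,.2f} ({gb / total_gb:.0%} of storage)")
```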

How to Monitor the Cost of Compute with Dataiku

As we saw above, measuring and allocating compute is more strategic but, unfortunately, also more complex, as it requires aggregating data from several sources. Again, you can retrieve the cost from the cloud provider, but allocating it to meaningful business axes can be challenging and requires advanced data processing.

In this spirit, Dataiku has a solution called Dataiku Resource Usage Monitoring, conceived specifically for this purpose: it aggregates all possible computing costs and, critically, gives you very precise lineage on this cost.

Once you start recording this usage, you can actually see the level of detail provided:

[Screenshot: detailed compute resource usage records]

But, even if such data is a gold mine, it needs some work to be actionable. To do this, you need to transform and enrich it, typically with a dedicated project and flow that:

  • Filters the compute metrics you are interested in
  • Joins them with Dataiku structural object lists (projects, recipes, users); see the sketch below
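As a sketch of what such a flow does, here is a pandas version of those two steps. File names, column names, and the price per CPU-hour are all hypothetical and would need to be adapted to your own exports:

```python
import pandas as pd

# Hypothetical exports of the usage records and the project list.
usage = pd.read_csv("compute_resource_usage.csv")
projects = pd.read_csv("project_list.csv")

CPU_HOUR_PRICE = 0.05  # assumed blended price per CPU-hour (USD)

# 1. Filter the compute metrics of interest (here, successful jobs only).
usage = usage[usage["outcome"] == "SUCCESS"]

# 2. Join with the project list to attach business context to each record.
enriched = usage.merge(projects, on="project_key", how="left")

# Aggregate into a per-business-unit cost view, ready for a dashboard.
report = (
    enriched.assign(cost=enriched["cpu_hours"] * CPU_HOUR_PRICE)
            .groupby("business_unit")["cost"]
            .sum()
            .sort_values(ascending=False)
)
print(report)
```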

The resulting flow can be simple or complex — it depends on many factors, such as the variety of infrastructure used, or the logic of matching Dataiku projects with business projects. This is the flow of our sample cost analysis project that you can find on our knowledge base:

[Screenshot: the sample cost analysis project flow in Dataiku]

The solution has built-in dashboards that will be updated automatically and can be easily consumed by the project teams, but also by IT, finance, or business sponsors — the infamous FinOps triangle:

[Screenshot: built-in usage investigation dashboard]

How to Make It Count

Now that we have the data, let’s look at how to use it to optimize costs. A lot will of course depend on how your company manages cost control, but here are a few hints.

Don’t Wait Too Long to Look at Cost

Building a cost monitoring dashboard is something that should be done globally for all Dataiku projects. There is nothing worse than having to redesign a project to reduce its cost once it is working and ready to go to production (actually, there is worse: doing that once it is in production…).

When a new project is starting, it should automatically be monitored. You can then consider the cost aspect right from the design phase. That is in perfect alignment with the promise of the cloud: start small and match cost with success.

Think Actionability

Once you have a dashboard, think about actions! With the right data, you can identify the proper cost and review projects’ return on investment (ROI) to make decisions.

Should you decide to decrease the cost, the dashboards you have built will allow you to engage in targeted plans, with actions such as removing an entire project, redesigning a project to use less costly resources, decreasing computing jobs’ frequency, engaging with vendors to renegotiate contracts, and probably many more depending on your context.

You may also need more deep-dive analyses, especially with data engineers or architects, to help understand the cost breakdown. The capability to split the cost along many different business axes is key here to highlight what those experts need to focus on.

Use and Reuse

FinOps should not be seen as focusing only on optimizing spending. When you build cross-project dashboards on costs with detailed breakdowns, they should also help spot the projects where cost is actually well optimized, so those practices can be transposed to other projects. In other words, FinOps needs to be part of the global best practices for your ML projects.

→ Get the Full Ebook on Advantages of Use and Reuse Across AI Projects

Additionally, reducing cost can also be achieved by factoring operations across your projects, and FinOps can help identify where such factoring can happen. Can models be shared among projects (and thus be trained once)? Can portions of your data landscape be pre-processed in a global, curated project and reused, instead of redoing this processing in each project? None of these are direct FinOps actions, but they are closely related and are great money-savers on top of their other advantages, and FinOps dashboards are there to quantify those savings.

Next Steps

With all we have seen above, I hope you now have more concrete ideas about what you can do around FinOps for ML projects. Should you want to start experimenting, here are a few useful links to help you move forward.
