In the first installment of this three-part series, I quoted a widely accepted statistic on the environmental impact of AI: The carbon (CO2) footprint of each standard, trusted AI model roughly equates to the lifetime footprint of five American cars — 345 kg CO2 per model. When compared to the footprint of 5.4 tons of CO2 per capita per year in the U.K., this is not a small figure — it actually equals 19 days of heating with gas. For large organizations that aspire to find balance in achieving the triple bottom line (people, planet, profit), reconciling enterprise-wide AI initiatives with their impact on the planet can be a daunting task.
Can Frugal AI play a role in reducing the environmental impact and fostering equilibrium in the triple bottom line for AI? From a design perspective, and at a high level, there are three levers that can be flexed to achieve frugality: data, compute, and network architecture. In this second installment, I will focus on the source of AI — data itself — and aim to answer a critical question: Is frugality-consciousness in data collection at odds with AI efficiency?
"Data is a precious thing and will last longer than the systems themselves." - Computer scientist Tim Berners-Lee
Let’s start by talking about one of the fundamental building blocks needed for AI/ML: data. Almost by default, many organizations set out on a path to collect everything because data is the new oil, after all, and we are all being told the same thing: Companies simply don’t do enough with their data.
The types of data that exist within an organization usually fall into five categories: transaction, master, reference, reporting, and metadata (a minimal sketch after this list illustrates each).
- Transaction data describes business events, is core to the business, and is often the largest volume of data in an enterprise. Such data would include customer purchases, accounting books, and more. In short, it is everything that allows the company to keep a pulse on its activities.
- Master and reference data should be seen as the data backbone of an organization, with stability in structure and much tighter data controls and management processes. This referential data is designed to unify all other data sources in order to build consistent views of activities. It usually covers critical components such as customers, suppliers, products, entities, and employees, as well as their key characteristics. Reference data is referenced and shared by a number of systems, is universal and standard, and is much lower in volume than transaction data. A key challenge for organizations lies in properly managing and enforcing such referentials, which significantly accelerates all resulting analytics efforts.
- Reporting data is an aggregation and compilation of past events and is used for analytics and business intelligence. Such data can consist of transactional data, master data, and reference data. Production processes should guarantee auditability and explainability, notably for any fiduciary or customer reporting.
- Finally, metadata is data about data: it answers the questions you cannot answer by looking at the data itself, such as where it came from, when it was last refreshed, and who owns it.
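To make these five categories concrete, here is a minimal, hypothetical sketch in Python; the record types, fields, and names are illustrative assumptions, not a prescribed model.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical records illustrating the five categories described above.

@dataclass
class CustomerMaster:        # master/reference data: low volume, tightly governed
    customer_id: str         # the shared key every other system refers back to
    name: str
    country_code: str        # drawn from a standard reference list (e.g., ISO 3166)

@dataclass
class Purchase:              # transaction data: high volume, describes business events
    transaction_id: str
    customer_id: str         # foreign key into the master-data backbone
    amount: float
    occurred_at: datetime

@dataclass
class MonthlyRevenueReport:  # reporting data: aggregation of past events
    month: str
    total_revenue: float
    source_system: str       # supports auditability and explainability

@dataclass
class DatasetMetadata:       # metadata: data about the data itself
    table_name: str
    owner: str
    last_refreshed: datetime
```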
Here we have described data generated by organizations themselves, which can come in different shapes and forms — structured, unstructured, and images. They all have their own challenges. Beyond these internal sources, companies are also exposed to a jungle of external data designed to enrich perspective or give specific insights on precise themes. One could argue that sustainability as a field is among the highest generators of ancillary data sources.
Making the 'Right' Choice for You and Your Organization
With the increasing proliferation of data across an organization, it will more and more be a question of making a choice, and of adhering to that choice over time, to make a real impact. The first step is baselining where you are so you can monitor progress, and for this you need a stable (pro forma) benchmark of your historical performance to build on. Secondly, it is important to keep in mind that too much reporting and analytics can lead to no decision-making at all. Frugality is as much a question of data strategy as of analytics strategy: The more data you have, the more you can speculate on ‘what-ifs’ without ever delivering a business decision.
Let’s take a simple example: All organizations today aim to limit their energy consumption and CO2 emissions, which, returning to our people/planet/profit triple-bottom-line objective, is as much a question of reducing impact on the planet as of actual savings.
Exploring energy savings is ultimately a question of choices, as many factors drive energy consumption: For a factory, these can include the level of production, the type of production, the energy efficiency of its machines, its exposure to the energy mix, and the specific settings of an individual machine. This then has to be balanced against expectations from an efficiency and quality standpoint. Additionally, the data used could be internal or external, easily accessible to a number of employees on site or highly complicated to collect, and very large in volume, as it is collected from machines that can all use different protocols.
As you can imagine, any organization engaged in acting upon this topic would be tempted to first collect all data and explore as many paths as possible. This exploration stage, which will require data access, is absolutely critical to making the right choices, and forging conviction and consensus to be able to take action will be equally important. In this situation, as in any other, having a way of quickly prototyping before expanding to a real production (and more data-intensive) set-up will be essential.
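A minimal sketch of what such frugal prototyping could look like is shown below. The feature names, data sources, and threshold are hypothetical assumptions; the idea is simply to start from the cheapest, most accessible drivers and only ingest a costlier data source when it measurably improves the baseline model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def score(df: pd.DataFrame, features: list, target: str = "energy_kwh") -> float:
    """Cross-validated R^2 of a simple baseline model on the chosen features."""
    return cross_val_score(LinearRegression(), df[features], df[target], cv=5).mean()

# Start with the cheapest, most accessible drivers (assumed already numeric).
core_features = ["units_produced", "avg_outdoor_temp_c"]

# Candidate enrichments, roughly ordered by how costly they are to collect.
candidate_sources = {
    "machine_efficiency_rating": "maintenance system export",
    "grid_carbon_intensity": "external energy-mix feed",
    "machine_setting_reading": "high-volume sensor protocol",
}

def frugal_feature_selection(df, baseline, candidates, min_gain=0.02):
    """Keep a candidate data source only if it lifts the score by at least min_gain."""
    selected, best = list(baseline), score(df, list(baseline))
    for feature, source in candidates.items():
        gain = score(df, selected + [feature]) - best
        if gain >= min_gain:
            selected.append(feature)
            best += gain
            print(f"keep {feature} ({source}): +{gain:.3f}")
        else:
            print(f"skip {feature} ({source}): +{gain:.3f}, below threshold")
    return selected
```

Each "skip" in a loop like this is a data source that never has to be collected, stored, or refreshed at production scale.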
Is Data Frugality Compatible With Customer Personalization?
Even before computers, companies collected data on their customers. Today, over-collection has become the norm and a "shortcut" that avoids the increased rigor required to slim the list down to the KPIs that are actually useful.
All organizations, and all the more so those with B2C business models, increasingly over-collect data about their end consumers under the guise of user experience, personalization, and customization. On the one hand, over-collection may be simplistically viewed as a gateway to pitch more ‘stuff’ to end consumers, but a deeper look reveals that new products and services are born out of massive datasets.
To alleviate the friction between the massive datasets and computing resources needed to innovate and create new products and services, and the needs of operational applications of AI, one approach could be to reserve non-frugal data collection and computing for R&D activities while applying frugal practices to operational applications of AI. For the training and testing of common operational AI applications, this could mean collecting less master data (e.g., location, web searches, web history, IP addresses, communication history) as a means to achieve a net reduction in compute carbon emissions.
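As a rough illustration of the kind of saving this split could yield, here is a back-of-the-envelope sketch. The power draw, PUE, grid carbon intensity, GPU hours, and the assumption that a slimmed-down training set cuts compute to roughly 40% are all illustrative figures, not measurements.

```python
def training_emissions_kg(gpu_hours: float,
                          avg_power_kw: float = 0.3,    # assumed average draw per GPU
                          pue: float = 1.5,             # assumed data-centre overhead
                          grid_kg_per_kwh: float = 0.4  # assumed grid carbon intensity
                          ) -> float:
    """kg CO2 ~= energy drawn (kWh) x facility overhead (PUE) x grid carbon intensity."""
    return gpu_hours * avg_power_kw * pue * grid_kg_per_kwh

full_hours = 1200            # hypothetical: train on every signal collected
frugal_hours = 1200 * 0.4    # hypothetical: less master data -> ~40% of the compute

saving = training_emissions_kg(full_hours) - training_emissions_kg(frugal_hours)
print(f"Estimated saving: {saving:.0f} kg CO2 per training run")  # ~130 kg under these assumptions
```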
For organizations with B2B business models, the temptation of over-engineering can be hard to resist. In a global world marked by an abundance of ancillary services, having a full customer-360 perspective with proven next-best-action products is highly complex, with the constant need to balance human intuition against our collective capacity to build constantly growing ML-powered analytics and indicators. All investment banks today aim to have a full and close-to-real-time understanding of each of their global customers’ exposure and market actions in order to make THE right recommendation. However, as human decision-making and macro-economic trends will continue to drive unpredictability, the search for perfection, the temptation of over-engineering, and the resulting data overconsumption should always be weighed against tangible and achievable goals.
Looking Forward
It has become more affordable to train and test machine learning models, and it is estimated that, over the last 10 years, there has been a 30% year-on-year decline in compute costs. What if the decline in costs, the desire to collect everything, and the lowered barriers to entry for AI lead to habits that are unsustainable and adversely impact the environment?
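It is worth pausing on what that rate compounds to. The figures below simply work through the cited 30% year-on-year decline; they are an arithmetic illustration, not an independent estimate.

```python
decline, years = 0.30, 10
remaining = (1 - decline) ** years   # fraction of the original cost left after 10 years
print(f"{remaining:.1%} of the original cost, i.e. roughly {1 / remaining:.0f}x cheaper")
# -> about 2.8% of the original cost, roughly a 35x reduction
```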
Data strategy for AI does not exist in a vacuum and depends critically on the overall data strategy of an organization (which is more likely than not to be ‘take it all’). Most gains from frugality will come from thinking carefully about what data to train on, how, and how often, and less from simply reducing data collection.
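On the "how often" point, one frugal pattern is to retrain only when monitored performance has actually drifted, rather than on a fixed calendar schedule. The sketch below is a minimal illustration under assumed error metrics and an assumed tolerance.

```python
def should_retrain(live_error: float, baseline_error: float, tolerance: float = 0.10) -> bool:
    """Trigger retraining only when live error drifts more than `tolerance` above the baseline."""
    return live_error > baseline_error * (1 + tolerance)

# Hypothetical example: the model was benchmarked at 4% error; today it sits at 4.2%.
if should_retrain(live_error=0.042, baseline_error=0.040):
    print("Retrain: drift beyond tolerance")
else:
    print("Skip this cycle and the compute (and emissions) it would have consumed")
```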
If I assume that top-line strategies are refreshed every three to five years and aptly executed, will flexing the corresponding AI strategies to distinguish operational from R&D areas of focus aid frugality, reduce the impact on the environment, and improve the ability to balance the triple bottom line?
In this article, I talked about data, one of the levers that can be flexed to design AI frugally. In the next article, I will focus on combining ESG and AI to build a better world. I hope this article triggers reflection and strategic thinking for all practitioners who aspire to move the dial on the impact of their enterprise-wide AI initiatives.