Classifying the risks behind AI means examining the reference information and data the service was built on, the implementation of the service, and the impact of its adoption. We propose a three-dimensional model to evaluate these risks; to carry out this evaluation, we will detail the risk vectors and explain their contextual occurrence and severity. The first dimension, the source data, is the focus of this article; we will discuss the other two dimensions (AI models and service implementation, and the impact of adoption) in future articles.
The Source Data
The heart of AI is the algorithm that learns and actively provides the service, but it primarily relies on the source data:
- When it is being designed (training and testing)
- When it is on active duty (inference / scoring / active learning)
The natural limits of AI used to be computation power and data availability. Data is no longer a challenge for early adopters and tech leaders, but it remains one for many other organizations and for some specific problems still to be solved. The source data presents challenges along many dimensions. Some are obvious to data professionals, but there are deeper threats that we'll try to cover in the next sections of this document.
Features and Variables Used in the Model
A well-known metaphor describes data as the new oil. If we extend this hydrocarbon comparison to risks, it comes down to the fact that producing the final product (the AI service or model) requires combining and transforming multiple components (variables, in this instance). The number of variables involved in the design of an ML model can range from ten to a thousand features.
The problem here is that the more variables you have, the more complex it is to put the right level of control in place (such as a balanced distribution of sample data for each parameter) and the harder it is to identify outlier values or combinations. These safety mechanisms are often the only practical way to ensure the efficiency of a model given finite computation resources and a limited training duration.
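As an illustration, a per-feature audit can flag skewed category balance and numeric outliers before training. The sketch below is a minimal Python example; the thresholds, the `audit_features` helper, and the toy data are illustrative assumptions, not part of any specific toolchain.

```python
# A minimal sketch of per-feature sanity checks; thresholds are assumptions.
import pandas as pd

def audit_features(df: pd.DataFrame, imbalance_ratio: float = 5.0):
    """Flag skewed categories and numeric outliers, feature by feature."""
    findings = []
    for col in df.columns:
        series = df[col].dropna()
        if pd.api.types.is_numeric_dtype(series):
            # Numeric feature: flag values outside the 1.5 * IQR fences.
            q1, q3 = series.quantile([0.25, 0.75])
            iqr = q3 - q1
            outliers = series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]
            if len(outliers):
                findings.append((col, f"{len(outliers)} outlier value(s), e.g. {outliers.iloc[0]}"))
        else:
            # Categorical feature: flag a heavily skewed class balance.
            counts = series.value_counts()
            if len(counts) > 1 and counts.iloc[0] / counts.iloc[-1] > imbalance_ratio:
                findings.append((col, f"imbalanced distribution ({counts.iloc[0]} vs {counts.iloc[-1]})"))
    return findings

# Toy example: one likely numeric entry error and one skewed category.
df = pd.DataFrame({
    "age": [25, 31, 29, 45, 38, 27, 33, 41, 30, 250],  # 250 is a likely entry error
    "segment": ["A"] * 9 + ["B"],
})
for feature, issue in audit_features(df):
    print(f"{feature}: {issue}")
```

Checks of this kind scale linearly with the number of features, which is precisely why a model built on a thousand features is harder to control than one built on ten.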
Model Contemporality and Data Locality
Since data is the reference point for model training, it also determines the conditions under which the model is most efficient. Conversely, if the data you trained your algorithm on was not collected under satisfactory conditions, your model is likely to be inefficient or even dangerous, as it would rely on biased information.
There are many ways to frame this problem, but the major dimensions for assessing the related risks are time and space:
- We must mind the time frame used to collect the data and be attentive to any recurring, cyclic, or exceptional event that could impact the model.
- It’s important to ensure that the population represented in the train set (the dataset used for training our model) is representative of the population the model will be exposed to (see the sketch after this list).
- The right scale should be chosen for each dimension and, if necessary, the data partitioned to give the model the right level of explanatory detail.
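One way to make the representativeness check concrete is the Population Stability Index (PSI), which compares a feature's distribution in the train set with the distribution observed in production. The sketch below is a hedged Python illustration; the 0.25 alert threshold is a common rule of thumb rather than a fixed standard, and the synthetic data is purely for demonstration.

```python
# Illustrative PSI check between train-time and production distributions.
import numpy as np

def psi(train: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of one feature; higher means more drift."""
    edges = np.histogram_bin_edges(train, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf        # capture values outside the train range
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    train_pct = np.clip(train_pct, 1e-6, None)   # avoid log(0) on empty bins
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - train_pct) * np.log(prod_pct / train_pct)))

rng = np.random.default_rng(0)
train_ages = rng.normal(35, 8, 5_000)   # population sampled for training
prod_ages = rng.normal(42, 8, 5_000)    # shifted population seen in production
print(f"PSI = {psi(train_ages, prod_ages):.2f}")  # > 0.25 often read as significant shift
```

The same comparison can be run per time window to surface the recurring or exceptional events mentioned above.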
The more complex the temporality and the population segmentation are, the greater the risk and the greater the need for strong governance and damage control in case an issue arises with a later version of the model. Conversely, a number of problems are solved with data from very stable environments and conditions, and are therefore rarely exposed to imbalance or bias.
For example, if we compare industrial use cases with marketing use cases, we can agree that data from sensors used for research or manufacturing presents a lesser risk than data from social networks. The population and activity of a specific group of people in a social network discussion don't necessarily represent the actual distribution of the overall population: a subset of the population may be far more active than the rest, depending on news trends. Moreover, this activity would probably last only a short period, while sensors would behave consistently in their predefined environment.
However, if a model influencing the production line in real time suffered a sudden drop in performance, the consequences could be immediately disastrous, ranging from a significant economic impact to damage to infrastructure or an even bigger accident. The situation would probably be different in marketing, as the impact of an isolated defect would generally be drowned out in the mass, and a drop in performance would probably be observed progressively.
Trust and Lineage
After a certain amount of time observing ML projects, we see most of them reach a phase where they need to increase the quantity or improve the quality of the data at their disposal, as this is the easiest way to improve the quality of the service delivered. The analysts and engineers in charge of implementation may either collect additional data from other departments or expand their data sources with open data or third-party data that is purchased or collected on request.
The challenge, while doing so, is to ensure that the quality of the data matches the project's expectations and to trace back the origin of the data and the method used to collect it. Otherwise, there's a reasonable chance that future versions of the model can't be trained or used in production for the following reasons (a lineage-tracking sketch follows this list):
- Future iterations of the train set could include bias.
- The collection method could violate a regulation.
- Accessing up-to-date iterations of this data could become impossible in the future.
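To illustrate what tracing the origin back might look like in practice, the sketch below records a lineage entry next to each external dataset, including a file fingerprint so every train-set iteration can be tied to the exact inputs used. The fields, the vendor name, and the contract reference are hypothetical placeholders, not a prescribed standard.

```python
# A minimal, illustrative lineage record for an external dataset.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class DatasetLineage:
    name: str
    provider: str               # department, vendor, or open data portal
    collection_method: str      # e.g., "opt-in web form", "public API"
    collected_from: date
    collected_to: date
    license_or_agreement: str   # traceable legal basis for use
    sha256: str                 # fingerprint of the exact file version used

def fingerprint(path: str) -> str:
    """Hash the raw file so each train-set iteration is tied to exact inputs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

record = DatasetLineage(
    name="customer_profiles_q3",
    provider="Acme Data Co. (hypothetical vendor)",
    collection_method="purchased, opt-in panel survey",
    collected_from=date(2020, 7, 1),
    collected_to=date(2020, 9, 30),
    license_or_agreement="contract REF-1234 (placeholder)",
    sha256="<output of fingerprint('customer_profiles_q3.csv')>",
)
print(json.dumps(asdict(record), default=str, indent=2))
```

Even a record this simple answers the three questions above: where the data came from, how it was collected, and exactly which version was trained on.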
If data lineage is not possible, or is only partially achievable, the risks and potential countermeasures should be known and shared with all the project stakeholders. The organization must put safety measures in place, such as:
- The progressive release of any service that relies on the model, with continuous monitoring of its efficiency (see the sketch after this list)
- The search for new data providers, or for processes to better trace data origin and integrity
- A complete regulatory review and explicit communication about all gray areas
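As an illustration of the first measure, a progressive (canary-style) release can route a growing share of traffic to the new model while continuously comparing its observed quality against the incumbent's baseline. Everything in this sketch, including the traffic steps, the 2% tolerance, and the `evaluate` stand-in, is an assumption made for demonstration purposes.

```python
# Illustrative progressive rollout with continuous quality monitoring.
import random

random.seed(42)                            # reproducible demo

ROLLOUT_STEPS = [0.01, 0.05, 0.25, 1.00]   # share of traffic on the new model
MIN_RELATIVE_QUALITY = 0.98                # challenger must stay within 2% of baseline

def evaluate(model_name: str) -> float:
    """Stand-in for a live quality metric (accuracy, conversion, defect rate...)."""
    base = 0.90 if model_name == "champion" else 0.91
    return base + random.uniform(-0.01, 0.01)

baseline = evaluate("champion")
for share in ROLLOUT_STEPS:
    quality = evaluate("challenger")
    print(f"traffic={share:.0%}  challenger quality={quality:.3f}")
    if quality < baseline * MIN_RELATIVE_QUALITY:
        print("Degradation detected: halting rollout and reverting traffic.")
        break
else:
    print("Challenger promoted to full traffic.")
```

The value of this pattern for lineage risk is that a model trained on questionable data only ever reaches a small share of users before degradation is caught.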
As AI adoption is still recent in many sectors, every gray area is likely to become subject to regulation; using data of unknown origin is therefore always a risk, no matter the type of agreement you have with your data provider. In the next article of the series, we'll unpack how to identify risks and impacts in the AI modeling and service implementation stage.