According to the National Institute of Standards and Technology (NIST) “Trust and Artificial Intelligence” report, “AI has the ability to alter its own programming in ways that even those who build AI systems can’t always predict.”
Unlike spreadsheets, dashboards, and ERP systems, AI software can automatically and autonomously change its own behavior through retraining, adaptive learning, or reinforcement learning. AI also operates thousands of times faster than human workers, so when AI software does harm, it can do so at a scale most managers have never seen before.
For these reasons, AI practitioners and developers are creating new processes to make AI trustworthy. But trustworthiness is relative, which is what makes developing this type of technology so challenging. Trustworthy AI is important for driving frontline user adoption and managing risk, both of which are essential to avoiding analytics and AI project failure.
This two-part series will dig into both of those reasons, as well as provide an overview on:
- Types of risk, harm, and bias
- Key attributes of trustworthiness (e.g., accuracy, reliability, explainability)
- Ways to put trustworthy AI into practice
Trustworthy AI for Frontline User Adoption
Even the most accurate AI model can’t generate business value if it’s not used.
What AI developers consider trustworthy differs from what management considers trustworthy, and both often differ from what end users consider trustworthy. Two of these three perspectives are non-technical, subjective judgments made by people.
“Trustworthy” isn’t an attribute of data, an AI model, or an AI app. It’s a relationship between a person and the data, model, or app, and that relationship is created solely by the person.
The history of AI is full of statistically good models that were business failures because frontline users didn’t trust them.
A Real-World Example: Duke University
A Duke University hospital admissions team spent a year developing an AI application to help emergency room staff decide if a patient with chest pain should be admitted to the intensive care unit (ICU). Increasing the accuracy of those decisions would decrease ICU workload, improve patient care, and reduce costs for both the hospital and patients.
However, doctors objected to being told how to do their job, and busy emergency room staff rejected the extra steps in the admission process required to use the application. It failed in just three weeks.
Trustworthy AI for Risk Management
While user adoption promotes the potential upside of AI, risk management focuses on the potential downside. A first step here is defining “downside,” based on your organization's principles and goals, in a way that can be measured.
For some organizations, the downside might be financial, such as lost revenue or added cost. For others, it may include brand reputation or social fairness. The definition differs by organization and context, and it can change over time, such as emphasizing costs during economic contractions and revenue during growth periods.
A Real-World Example: Zillow
Zillow’s residential home price model is a cautionary tale of how risk changes with time and context. Initially, its predictions were displayed on Zillow.com so visitors could see what their home (and others) might be worth.
Over time, the model improved and a large team of internal property experts began using it to buy and flip homes for a quick profit. That worked so well that management removed the human experts from the decision-making process to speed things up, bought 3,000 homes without sufficient human oversight, lost $570 million, and laid off 2,000 people.
Types of AI Risk to Consider
Risk = potential harm * likelihood to occur
Most industries, organizations, and departments have different thresholds for each component. The Hippocratic Oath’s “first, do no harm” in medicine emphasizes avoiding harm regardless of its magnitude. The U.S. Federal Trade Commission (FTC) defines risk the same way for some types of harm, such as harm to civil liberties, where no amount of harm is allowed. In most practical situations, though, risk is weighed against benefits:
Unfair = potential harm * likelihood to occur – benefit
Three hundred people in the U.S. drown in bathtubs each year, yet we still have bathtubs. Over 80 people die in car accidents per day, yet 17 million new cars are sold each year. This is because the likelihood of harm is small and the benefits are large. The FTC applies this approach to most AI risks, weighting benefit and harm equally: “To put it in the simplest terms… [an AI] practice is unfair if it causes more harm than good.”
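To make the difference between the two formulas concrete, here is a minimal Python sketch. The function names and the numbers in the example are purely illustrative assumptions, not values from any real assessment.

```python
# Minimal sketch of the two formulas above; all numbers are illustrative.

def risk(potential_harm: float, likelihood: float) -> float:
    """Risk = potential harm * likelihood to occur."""
    return potential_harm * likelihood

def unfairness(potential_harm: float, likelihood: float, benefit: float) -> float:
    """Unfair = potential harm * likelihood to occur - benefit."""
    return risk(potential_harm, likelihood) - benefit

# A practice with large potential harm but a tiny likelihood and a large
# benefit can still net out as "more good than harm" (a negative score).
print(risk(1_000_000, 0.0001))                # 100.0
print(unfairness(1_000_000, 0.0001, 500.0))   # -400.0
```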
Types of Harm to Consider
When evaluating the trustworthiness of your AI systems, determining which types of harm are most critical depends on your organization’s principles and goals. Types of harm range from financial and compliance harm to damage to brand reputation and social fairness.
Today, many AI practitioners focus on only a few potential harms, such as an organization’s financial and compliance exposure. Developers, managers, and users will need to consider more types as AI becomes more widespread, and make explicit design-time decisions about which risks are important and how to measure them in production. If risks aren’t defined in a measurable way, then they’re just documentation, and documentation alone is insufficient for generating trust.
Harm from AI systems is typically either never considered by developers or an unintended consequence. A common way it occurs is through bad data. Three ways that biased datasets can lead AI models to generate harm are:
- Feature and target data that do not accurately represent the truth
- Inference (also known as prediction or scoring) datasets that differ significantly from the training data (a minimal drift check is sketched after this list)
- Bias introduced into datasets by malicious internal or external actors
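For the second point, one way to quantify how far inference data has drifted from training data is the population stability index (PSI). The sketch below is a generic illustration, not a reference to any specific product feature; the synthetic data and the rule-of-thumb threshold in the comments are assumptions.

```python
import numpy as np

def population_stability_index(train_values, inference_values, bins=10):
    """Population Stability Index (PSI) between a training feature and the same
    feature at inference time. Larger values indicate more drift; a common rule
    of thumb treats values above roughly 0.25 as significant drift."""
    # Bin edges come from the training distribution.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    train_pct = np.histogram(train_values, bins=edges)[0] / len(train_values)
    infer_pct = np.histogram(inference_values, bins=edges)[0] / len(inference_values)
    # Floor the percentages to avoid division by zero and log(0).
    train_pct = np.clip(train_pct, 1e-6, None)
    infer_pct = np.clip(infer_pct, 1e-6, None)
    return float(np.sum((infer_pct - train_pct) * np.log(infer_pct / train_pct)))

# Hypothetical example: the feature's distribution has shifted at inference time.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
inference = rng.normal(loc=0.5, scale=1.0, size=5_000)
print(population_stability_index(train, inference))
```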
The next section reviews types of bias to guard against.
Types of Bias to Consider
The question isn’t whether your training data, model, or inference data is biased (it is); it’s how it’s biased and whether that bias matters to you. Three high-level sources of bias in AI systems are statistical, human, and systemic bias.
A Real-World Example: Statistical, Human, and Systemic Bias in Healthcare
A widely used commercial model estimates how sick a healthcare patient is. Scarce resources, such as teams of dedicated nurses and extra primary care appointment slots, are then allocated based on that estimate. The model, however, used the next year’s healthcare costs as a proxy for how sick a patient is. The developers introduced streetlight bias by using readily available billing data. The problem is that Black patients have less access to medical insurance than white patients and are thus billed less for the same degree of sickness.
Thus, for the same level of sickness, the model inferred that white patients were sicker, so they received additional care. Billing is a bad proxy for the target variable due to systemic inequalities in the healthcare marketplace. Researchers estimate that if this bias were removed, the number of Black patients receiving additional care would increase by 160%, more than double.
Note that not all biases are harmful, even for protected classes. Many retail apparel product recommendation models, for example, are more accurate for women than for men, but that’s not considered unfair. Each organization needs to decide which biases, harms, risks, and sensitive subgroups are important to them.
Metrics for Detecting Bias in Sensitive Subgroups
Dataiku — The Universal AI Platform™ — provides explainability features that help analytics and AI project builders (and their stakeholders) stay aligned to their organizational values, increase trust, and mitigate bias.
There are many metrics for detecting bias against sensitive subgroups in training data, inference data, and predictions, including:
Demographic representation: Does a dataset have the same distribution of sensitive subgroups as the target population? The Kolmogorov-Smirnov test could be used.
Demographic parity: Are model prediction averages about the same overall and for sensitive subgroups? For example, if we’re predicting the likelihood to pay a phone bill on time, does it predict about the same pay rate for men and women? A t-test, Wilcoxon test, or bootstrap test could be used.
Equalized odds: For boolean classifiers that predict true or false, are the true positive and false positive rates about the same across sensitive subgroups? For example, does the model make errors at the same rates for young adults and for the elderly?
Equality of opportunity: Like equalized odds, but only checks the true positive rate.
Average odds difference: The average of the difference in false positive rates and the difference in true positive rates between a sensitive subgroup and the rest of the population.
Odds ratio: The odds of a positive outcome (the positive outcome rate divided by the negative outcome rate) for one subgroup compared with the odds for another. For example, (likelihood that men pay their bill on time) / (likelihood that men don’t pay their bill on time) compared with the same ratio for women.
Disparate impact: Ratio of the favorable prediction rate for a sensitive subgroup to that of the overall population.
Predictive rate parity: Is model accuracy about the same across sensitive subgroups? Accuracy can be measured with metrics such as precision, F-score, AUC, or mean squared error.
Gini, Theil, and Atkinson indices have also been used to measure disparity. Which metric or test to use depends on the point in the AI product lifecycle and on how understandable it is to those consuming the information. Data scientists, for example, may be comfortable with a Theil index, but it may be too complex for business stakeholders and thus lower trust.
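To make a few of these definitions concrete, here is a minimal NumPy sketch of demographic parity, equalized odds, average odds difference, and disparate impact. The toy arrays and the convention that 1 marks membership in the sensitive subgroup are assumptions for illustration, not a substitute for a dedicated fairness library.

```python
import numpy as np

# Hypothetical binary labels, predictions, and a sensitive attribute
# (1 = member of the sensitive subgroup). Real values would come from your
# training, inference, or prediction datasets.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])
group = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def selection_rate(pred, mask):
    """Share of positive predictions within a subgroup."""
    return pred[mask].mean()

def tpr(true, pred, mask):
    """True positive rate within a subgroup."""
    return pred[mask & (true == 1)].mean()

def fpr(true, pred, mask):
    """False positive rate within a subgroup."""
    return pred[mask & (true == 0)].mean()

in_group, out_group = group == 1, group == 0

# Demographic parity: compare average predictions across subgroups.
parity_gap = selection_rate(y_pred, in_group) - selection_rate(y_pred, out_group)

# Equalized odds: compare both TPR and FPR across subgroups.
tpr_gap = tpr(y_true, y_pred, in_group) - tpr(y_true, y_pred, out_group)
fpr_gap = fpr(y_true, y_pred, in_group) - fpr(y_true, y_pred, out_group)

# Average odds difference: mean of the two gaps above.
avg_odds_diff = (tpr_gap + fpr_gap) / 2

# Disparate impact: subgroup favorable-prediction rate vs. the overall rate.
disparate_impact = selection_rate(y_pred, in_group) / y_pred.mean()

print(parity_gap, tpr_gap, fpr_gap, avg_odds_diff, disparate_impact)
```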
Key Attributes of Trustworthiness
A wide variety of trustworthy AI attributes have been proposed but there is a consensus — from central banks to global system integrators to Microsoft and Google — that they include accuracy, security, explainability, and accountability.
Accuracy and Reliability
Accuracy applies to both data and models and, within data, to both features and target variables. For features, it might be monitored by the null value rate or changes in distributions. Target variable accuracy is always a concern, especially when it’s generated manually by people. People, like models, are biased. For example, when labeling images (is it a cat, dog, car, etc.) people have a minimum error rate of 5%, and experienced radiologists contradict themselves about 20% of the time.
Reliability refers to a model’s accuracy being consistent under expected conditions for an expected amount of time. For example, is it expected to be reliable for a couple of days, a month, a year? Being transparent with stakeholders on the conditions and timeframe for reliability builds trust.
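As an illustration of monitoring both ideas, the sketch below tracks a feature's null-value rate and the model's accuracy per time window. The column names, thresholds, and toy data are assumptions; in practice these checks would run against your actual scored datasets.

```python
import pandas as pd

# Hypothetical scored dataset with a timestamp, a feature, the label,
# and the model's prediction; column names are assumptions.
df = pd.DataFrame({
    "scored_at": pd.date_range("2024-01-01", periods=8, freq="D"),
    "income": [52_000, None, 61_000, None, None, 48_000, 75_000, None],
    "label":      [1, 0, 1, 1, 0, 1, 0, 1],
    "prediction": [1, 0, 1, 0, 0, 1, 1, 1],
})

# Data accuracy: null-value rate for a feature, to be compared to a threshold.
null_rate = df["income"].isna().mean()
print(f"income null rate: {null_rate:.0%}")  # alert if this exceeds, say, 10%

# Reliability: accuracy tracked per time window rather than one global number.
df["correct"] = (df["label"] == df["prediction"]).astype(int)
weekly_accuracy = df.set_index("scored_at")["correct"].resample("W").mean()
print(weekly_accuracy)
```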
Safety, Security, Resiliency
High-risk AI systems should be secured against malicious attacks from internal or external bad actors, such as changing target variable data with the intent to harm. They should also continuously monitor for such attacks and detect them as soon as possible. The level of security and monitoring that’s appropriate depends, of course, on the risk. An AI system cross-selling toothpaste is certainly lower risk than one detecting credit card fraud and thus may have different security requirements.
A Real-World Example: A Model to Identify Good New Customers
Resilience applies to sudden changes in the environment or usage. A company looking to identify good new customers used the number of cars a person owned as one of its model variables: the more cars a person had, the more likely they were to be a good customer. That works in New York City, where most people with multiple cars are wealthy. The company expanded into the Southeast and applied the same model to people there. Model accuracy dropped quickly because in the rural Southeast low-income people tend to own many cars.
The company lost a lot of money and postponed its expansion plans. The meaning of a key variable, owning multiple cars, had changed, and the company did not detect it in time. If harm is done, bias is detected, or accuracy drops below acceptable thresholds, then trustworthy AI systems quickly detect the problem, alert stakeholders, and shut down, repair, or retrain.
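A minimal sketch of that kind of response logic might look like the following; the thresholds and the set of actions are illustrative assumptions rather than recommended values.

```python
# Sketch of a monitoring rule in the spirit described above: detect an
# accuracy drop, alert stakeholders, and decide whether to pause or retrain.
# The thresholds, metric source, and actions are assumptions.

ACCURACY_FLOOR = 0.80   # below this, alert stakeholders
SHUTDOWN_FLOOR = 0.65   # below this, stop serving predictions

def review_model(current_accuracy: float) -> str:
    if current_accuracy < SHUTDOWN_FLOOR:
        return "shut down scoring and trigger retraining"
    if current_accuracy < ACCURACY_FLOOR:
        return "alert stakeholders and schedule a review"
    return "keep serving"

print(review_model(0.72))  # "alert stakeholders and schedule a review"
```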
Explainability
Explainability applies to both data and models and, like trust, is in the eye of the beholder. There are four AI system components worth explaining and four kinds of stakeholders to explain them to.
Of course, data practitioners do not need to explain all four AI system components for every AI app, since risk varies greatly. How explanations are presented can also make a big difference in driving trust, even for expert end users.
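As one example of a model explanation that can be presented to different stakeholders, the sketch below uses permutation importance, a model-agnostic technique, on a synthetic model. The dataset and model are stand-ins; the source does not prescribe a specific explanation method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data; in practice these would be your model and features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance: how much accuracy drops when each feature is shuffled,
# a model-agnostic way to show which inputs drive the model's predictions.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature_{i}: {score:.3f}")
```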
Accountability
Who is held responsible when harm occurs? Users and other stakeholders expect that someone will be, and that the person is at an appropriate managerial level, not an individual developer. Clearly stating who is responsible in model and app documentation helps build trust.
Conclusion
AI systems can retrain and adapt in unpredictable ways, at speeds and scales that amplify risk. Building trust means aligning accuracy, explainability, and accountability with human values to drive adoption and manage harm. Stay on the lookout for part two of this blog series, where we dive into the practical applications — the people, processes, and technology used to develop trustworthy AI.