Practical Applications of Trustworthy AI

Scaling AI | Catie Grasso

In part two of this blog series (check out part one here), we get more concrete about the people, processes, and technology used to develop trustworthy AI.

Since trust drives adoption and adoption drives ROI, it shouldn’t be a surprise that a key driver of AI ROI is also a key driver of AI trust: a multi-stakeholder approach with interdisciplinary development teams that get frontline users involved early.

Identify your frontline users, and not just a persona but specific people. Get a few of them involved in co-developing and reviewing your data, AI model, and AI app plans. In particular, co-design A/B tests with them to prove the value and measure the risk of your models. Guard against funder bias: don’t rely on the opinions of frontline users’ managers as a substitute for the frontline users themselves.
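As a minimal sketch of what analyzing such a co-designed A/B test might look like, the snippet below compares a binary success outcome (e.g., "task completed") between a control arm and an AI-assisted arm with a two-proportion z-test. The counts and outcome definition are hypothetical, not from this article.

```python
# Minimal sketch of analyzing a frontline A/B test; counts are illustrative.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical successes and totals for control (no AI assist)
# and treatment (AI-assisted) groups of frontline users.
successes = np.array([132, 171])   # [control, treatment]
totals = np.array([400, 405])

# Two-proportion z-test: is the observed lift real or noise?
stat, p_value = proportions_ztest(successes, totals)
lift = successes[1] / totals[1] - successes[0] / totals[0]
print(f"absolute lift = {lift:.3f}, p-value = {p_value:.4f}")
```

The same logged outcomes can also be sliced by subgroup to measure risk, for instance checking whether the lift holds for the frontline users most exposed to model errors.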

Verification, Validation, and Valuation 

Consider three levels of quality for data and AI products:

  • Data: Accuracy and transparency
  • Users: Usability and adoption
  • Business: Value and ROI

Traditional software engineering maps the first two quality levels to verification and validation. We can add a third “V,” valuation. A few of the techniques in each are:

[Image: techniques for verification, validation, and valuation]

Audits

Just as code reviews benefit even the best developers, audits benefit even the best AI teams. Three types of audits are:

  • Internal: Performed by a separate team within the same company or organization. Facebook, which is at the cutting edge of AI technology, uses this approach extensively.
  • External: Performed by an outside team hired to review data and AI products. Some conglomerates and holding companies use teams from another company under the same parent rather than external consultants, to minimize legal issues and promote knowledge sharing.
  • Regulatory: Performed by regulatory or law enforcement agencies.

For audits to build trust effectively, there needs to be a degree of parity between AI developers and auditors. Auditors should have the same skill level and incentives as developers (perhaps even quotas) and should not be influenced by funder or incumbent bias (i.e., regulatory capture).

Audits, like other risk management, should be ongoing and proactive. Waiting for users to detect problems erodes trust. There’s a saying in software engineering that it’s okay to have bugs, but it’s not okay for the users to be the first to know about them. Audits may be triggered by changes in model accuracy, detection of bias, or the time since the last audit.
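To make "ongoing and proactive" concrete, here is a minimal sketch of audit-trigger logic based on the three triggers mentioned above. The thresholds, field names, and function are illustrative assumptions, not Dataiku settings.

```python
# Minimal sketch of proactive audit triggers; all thresholds are illustrative.
from datetime import date, timedelta

def audit_needed(current_accuracy, baseline_accuracy,
                 subgroup_accuracy_gap, last_audit_date,
                 max_accuracy_drop=0.05, max_gap=0.10,
                 max_age=timedelta(days=90)):
    """Return the reasons (if any) that an audit should be triggered."""
    reasons = []
    if baseline_accuracy - current_accuracy > max_accuracy_drop:
        reasons.append("accuracy dropped below tolerance")
    if subgroup_accuracy_gap > max_gap:
        reasons.append("possible bias: large accuracy gap between subgroups")
    if date.today() - last_audit_date > max_age:
        reasons.append("too long since the last audit")
    return reasons

print(audit_needed(0.81, 0.88, 0.04, date(2021, 1, 15)))
```

Running such a check on a schedule, rather than after a user complaint, is what keeps the audit proactive.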

Reporting 

Audits, accuracy monitoring, bias detection, service-level objective compliance, and checking for discriminatory outcomes build trust when they are proactive and continuous. Ideally, users should be able to review a history of such checks, the issues raised, harm detected, and their resolutions. Updating that documentation continuously and making it open to all potential stakeholders and users builds trust.
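One lightweight way to keep that history reviewable is an append-only log of every check, issue, and resolution. The sketch below assumes a simple CSV file and illustrative field names; a real deployment would typically feed a governance tool or dashboard instead.

```python
# Minimal sketch of a continuously updated check history; fields are illustrative.
import csv
import pathlib
from datetime import datetime, timezone

LOG = pathlib.Path("model_check_history.csv")

def log_check(check_name, passed, issue="", resolution=""):
    """Append one check result so stakeholders can review the full history."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "check", "passed", "issue", "resolution"])
        writer.writerow([datetime.now(timezone.utc).isoformat(),
                         check_name, passed, issue, resolution])

log_check("weekly bias scan", passed=False,
          issue="higher false-negative rate in rural hospitals",
          resolution="retraining with reweighted samples scheduled")
```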

Most stakeholders don’t need to know the internal details of an AI system. Increasingly, though, some users are other developers who do need details in order to trust data and AI products. Those details include:

[Image: details to share about data and models]

Descriptive statistics are most useful when they’re up to date, which is easy to provide in today’s platforms via dashboards, statistical workbooks, and computational notebooks. An extensive list of data and AI product characteristics to consider can be found in ISO/IEC 25012, the W3C Data Quality Vocabulary (DQV), and the W3C Data Catalog Vocabulary (DCAT).
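As a minimal sketch of the kind of always-fresh descriptive statistics a scheduled job can publish, the snippet below summarizes a hypothetical dataset with pandas; the file name and columns are assumptions for illustration.

```python
# Minimal sketch of refreshing descriptive statistics for a shared dataset.
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical dataset

summary = df.describe(include="all")           # counts, means, quantiles, uniques
missing = df.isna().mean().rename("missing_rate")

# In a scheduled job these outputs would feed a dashboard or notebook
# rather than standard output.
print(summary)
print(missing.sort_values(ascending=False).head())
```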

Technology 

Technology plays a key role in developing data and AI product trust. When done well, it serves as an objective, always-on watchdog that simplifies AI developers’ daily work by eliminating tedious, manual tasks. Automated tests are more likely to be executed and reduce the likelihood that developers will find workarounds. In order to build AI trust, Dataiku provides a centralized place for the checks and balances outlined below:

  • Variable importance in 
    • Models
    • Individual predictions
    • Nearest neighbor predictions
    • Sensitive subgroups
  • Model stress tests
  • Rapid deployment of new data and models, shutdown of bad ones, and rollback to previous versions. Risk, for example, is higher for a model that takes three days to shut down than for one that takes three minutes.
  • A/B testing of models in production
  • Detecting data and model accuracy drift
  • Detecting data and model accuracy drift in sensitive subgroups (i.e., bias); a minimal drift-check sketch follows this list
  • Ongoing backtesting
  • Triggers to remove a model from production, roll back to a previous version, or automatically retrain
  • Alerts sent via email, SMS, Slack, Microsoft Teams, JIRA, Datadog, etc.
  • Residual analysis: Are inaccurate predictions random or is there bias?
  • User adoption analysis: Are the potential users who decide not to use the product distributed randomly, or is there bias? For example, is the adoption rate in rural hospitals much lower than in suburban ones?
  • Prediction override analysis: An “override” is when a user decides not to use a prediction and instead goes with their own judgment. Are overrides random or is there bias? For example, are overrides more common for expensive products, high-end customers, or life-threatening diseases?
  • Automatic dataset and model documentation generation
  • Automatic update of dashboards, statistical workbooks, and computational notebooks
  • Enterprise-wide data and model catalogs for assessing systemic risk   
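To illustrate the drift checks referenced above, here is a minimal sketch that tests a single feature for input drift overall and within each sensitive subgroup using a two-sample Kolmogorov-Smirnov test. The function, column names, and threshold are illustrative assumptions, not a Dataiku API.

```python
# Minimal sketch of per-subgroup input drift detection; names are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df, live_df, feature, subgroup_col, alpha=0.01):
    """Compare training vs. live distributions of one feature, overall and by subgroup."""
    rows = []
    # Overall drift on the feature
    stat, p = ks_2samp(train_df[feature], live_df[feature])
    rows.append(("all", stat, p, p < alpha))
    # Drift within each subgroup (e.g., rural vs. suburban hospitals)
    for group in live_df[subgroup_col].unique():
        stat, p = ks_2samp(train_df.loc[train_df[subgroup_col] == group, feature],
                           live_df.loc[live_df[subgroup_col] == group, feature])
        rows.append((group, stat, p, p < alpha))
    return pd.DataFrame(rows, columns=["group", "ks_stat", "p_value", "drifted"])
```

Running such a report per feature on a schedule, and alerting when `drifted` is true, is one simple way to turn the list above into an always-on watchdog.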

There are a variety of variable importance methods, including Shapley values, Individual Conditional Expectation (ICE), Local Interpretable Model-Agnostic Explanations (LIME), and BayesLIME, which adds credible/confidence intervals to LIME. There is no one “right” method; it depends on what your stakeholders trust.
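As a minimal sketch of one of these methods, the snippet below computes Shapley-value explanations with the open-source shap package on a hypothetical tree model trained on a public dataset; Dataiku surfaces comparable explanations through its own interface, so this is only an illustration of the underlying technique.

```python
# Minimal sketch of Shapley-value explanations; the model and data are illustrative.
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one contribution per feature per prediction

# Global view: average absolute contribution of each feature
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

The per-prediction rows of `shap_values` support the individual-prediction explanations listed above, while the averaged view supports model-level importance.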

Conclusion 

Trust is a relationship between an AI product and a potential user, not a static attribute of the product. The user decides. An AI product doesn't need to be trusted by everyone to be successful. Grocery store recommendations, for example, are used by about 2% of customers but still generate big value.  

Empathy, transparency, explainability, and diligent monitoring build trust. As AI becomes more industrialized, practitioners can spend less time on data wrangling and bias-variance trade-offs, and more time on understanding, quantifying, and measuring the harm and biases that users care about. A more human-centered approach will drive adoption, trust, and ROI.
