Who’s Responsible for Responsible AI?

By Triveni Gandhi

In this fireside chat, Roy Wilsker, Senior Director, Technical Fellow, and AI Compass Group Member at Medtronic, and Neil Menghani, Master's Student at The Courant Institute of Mathematical Sciences of NYU, offer two different perspectives on Responsible AI: the lens of academia and an industry point of view. Bridging the gap between academia and industry, this chat draws on both the front lines of Responsible AI work and the in-depth research taking place today to tackle the tricky question of who is actually responsible for Responsible AI.

→ Watch the Fireside Chat Here

Here is a full transcript of the chat:

Triveni Gandhi: Hi, I'm Triveni Gandhi, Senior Industry Data Scientist and Responsible AI Lead at Dataiku. Welcome to today's fireside chat on Responsible AI. I'm very excited to be with our two guests today, both are experienced in the space of Responsible AI and are helping drive innovation in a lot of different ways. 

So, I'd love to introduce our guests. I'll start with Roy Wilsker, a Senior Director leading the advanced technologies and data science group at Medtronic. Roy has worked in a wide variety of IT roles ranging from programming to strategic planning. He currently serves on Medtronic's AI Compass Group, which develops internal guidance on the responsible use of AI within the organization. Welcome, Roy.

Roy Wilsker: Thank you, it’s very nice to be here.

Triveni Gandhi: Also with us is Neil Menghani, a Master's student in mathematics at NYU's Courant Institute, where he is working on fairness in algorithmic recommendations for the public sector. He has prior experience as a data scientist in the financial services, technology, and transportation industries. Great to have you here, Neil.

Neil Menghani: Thanks, Triveni. I’m really excited to be here.

Triveni Gandhi: I'm really excited about this conversation because we have two different sides of the equation. We have Roy, who's coming from years of work in the industry, on what you might call the front lines of responsibility, and Neil, who has industry experience but is also deep in interesting research work. What I want to get out of today for our viewers is how we can bridge this gap between academia and industry. Is there even a gap? How do we actually work together to push ourselves to more responsible practices? I think that the combination of experiences here is great to start unpacking some of that.

Before we get into that, I'd like to ask both of you, what brought you to responsible AI? It's a relatively new concept that has come up more now in the past four to five years because of all the emerging biases and problematic AI that people are calling out. It's becoming much more real for every company that is using AI. So, just curious to know, what drew you in? What brought you to this stage in your career? 

Roy Wilsker: First of all, I will say for future conversations that we have a great deal of respect for what's being done in the academic world. We're very happy to see the kinds of things coming out of it, and we always try to take those things and figure out where they fit in, so we can discuss that more later. In terms of where I come into this, one of the reasons I really love working for Medtronic is that Medtronic is a very mission-driven company. It's a very ethics-driven company; ethical work is really one of the tenets of the company. Secondly, as I've been working more and more with machine learning, it's been very clear from many of the stories, especially in the medical industry, that there are real issues.

There was a horrendous issue that came out about three or four months ago: a system for pain medication dispensing had been trained on data that said black patients felt less pain than white patients, and as a result it was under-prescribing for black patients. That's the kind of thing that should be a wake-up call for everybody when they look at this stuff.

So, some of the things that we look at are:

  • How do we make sure that we're doing things for the right reason and that it's for beneficial use?
  • How do we make sure that what we put on the table is actually going to be used in a reasonable way? We try to think of other ways that it might be used, and sometimes there are issues with that.
  • How do we try to think of what the issues are, especially in a medical area? 

This is not new in the medical area. The FDA and other regulatory agencies have been looking at what's called Software as a Medical Device (SaMD) for a while. The question is always: “Why should we trust this? Why should we trust a medical device that's powered by software?” Before, it was hand-coded software; now it's machine learning-based software. The questions are still the same — “Why should I trust this?” or “Why should I feel that this is going to handle different populations of patients fairly?”

Many of the questions that we ask are in terms of fairness. Do we have a robust population that we tested it on? Do we understand what implications there are for subpopulations? We make sure that the things it is doing are medically based and not indicative of some historical bias. So, that's where I've come into this. These are real issues. They are issues that not only affect the rest of the world but certainly affect our move into more innovative strategies and products. We want to make sure that we're doing the right thing when we choose products and therapies.

Triveni Gandhi: Thank you, Roy. It's heartening to hear that, in this day and age, there are companies with ethics-driven purposes, especially given how much can go wrong so quickly. All of your examples are spot on. Thank you.

Neil, I'd love to hear what brought you to this table and why you are doing what you do at NYU?

Neil Menghani: That's a great question. What drew me to study math and data science at Courant is that I've always been interested in both the theory and the application of the algorithms that we use as data scientists. I really wanted to build my theoretical foundation. Then, what drew me to work with the Machine Learning for Good Lab at NYU was that I've always been interested in the role of local and state governments in areas like housing, criminal justice, and recently, COVID testing, which is a really interesting use case.

If you examine where testing sites are located, some of the decision-making processes can include algorithms, like clustering, to see how far people have to drive to get to the locations. Sometimes this may result in unfairness, because those who are closer to cities or places with more concentrated populations may be left behind and have to travel very far to get tested, at least early on in the pandemic.

That was something I saw the lab I'm currently working with was studying, and I really wanted to get involved. Right now, I'm more involved with criminal justice use cases, like the COMPAS software, specifically looking at recidivism and how we can make fair recommendations in that context.

Triveni Gandhi: Fantastic. Well, there's a longstanding joke that it's not a panel on responsible AI until someone brings up COMPAS. So thank you, Neil. 

Neil Menghani: Honestly, it's a nice clean data set that I think can be used for building some of these algorithms and testing out theory early on, but a lot of what we work on has applications in other contexts as well.

Triveni Gandhi: That's fantastic.

Roy Wilsker: Can I say one thing about that? There's a very interesting piece on that, where they were getting high accuracy levels, but it was really in the false positives and the false negatives that the bias issues showed up. That's a very interesting thing for us. It's not enough to just look at the AUC or the F1 score. You really have to ask yourself: what do the false positives look like? What do the false negatives look like? How is this dealing with different subpopulations, and so on? So, I think that's a great example.
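
To make Roy's point concrete, here is a minimal sketch of a per-subpopulation error-rate check. The column names (y_true, y_pred, group) and the pandas-based approach are illustrative assumptions rather than a prescribed tool; the idea is simply that an aggregate metric can look healthy while false-positive and false-negative rates diverge across groups.

```python
import pandas as pd

def subgroup_error_rates(df, label="y_true", pred="y_pred", group="group"):
    """False-positive and false-negative rates per subpopulation.

    Assumes binary labels and decisions coded as 0/1; column names are placeholders.
    """
    rows = []
    for g, sub in df.groupby(group):
        negatives = sub[sub[label] == 0]
        positives = sub[sub[label] == 1]
        rows.append({
            "group": g,
            "n": len(sub),
            # Share of actual negatives that were flagged positive
            "fpr": (negatives[pred] == 1).mean() if len(negatives) else float("nan"),
            # Share of actual positives that were missed
            "fnr": (positives[pred] == 0).mean() if len(positives) else float("nan"),
        })
    return pd.DataFrame(rows)

# Usage: compare this table against your overall AUC or F1 score;
# a single aggregate number can hide large per-group disparities.
# print(subgroup_error_rates(scored_df))
```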

Triveni Gandhi: Agreed. I think that the subpopulation piece is one of those things that, once you realize we should be looking at it, is like a light bulb. I'm curious if you have any insight here, Neil? From the academic side, when did that come to light? When did that actually become relevant and how long did it take for industry to pick that up? Someone like Roy, who's attuned to this, probably picked it up much faster, but what is the average? 

Neil Menghani: I do think it's been fairly recent. I would say that over the past decade it's been an active area of research. In a broader context, I think it's important to look at the entire pipeline. I know we're going to be diving a little bit more into this, but when you perform this subpopulation analysis and look at false positives or false negatives between different subpopulations in your actual model, this is to optimize how you're making predictions. I think where new areas of research are headed is looking not just at the model itself, and Roy actually alluded to this. You can have a fantastic model that's predicting super well, but you may have other sources of bias arise.

What I look at is the recommendation stage, which comes after the modeling or prediction stage. You can actually have a very accurate prediction, and going into this stage you can assume that you have a true probability or true prediction. Yet, you may still come out with unfair outcomes due to the way that you're making these decisions or how you're thresholding your predictions. There's been really interesting research here. There's a paper I highly recommend by Chouldechova, from Carnegie Mellon. It's called Fair Prediction With Disparate Impact. That disparate impact piece is super key. Even if you have a great model, you still may not be recommending the correct result.

Triveni Gandhi: You brought up the idea of the pipeline, and this is something that we talk a lot about at Dataiku. There are stages to our data pipelines: we have data ingestion, then we have modeling, and then we have export and reporting. How is it actually being used out there by the consumer? By the business user? I'm curious. Let's maybe start at the data side. We know that there's a lot of focus on the model bits, but knowing that all data is biased, what are the recommendations that the AI Compass Group is thinking about putting out there, or the checks that you think might be most relevant in this area, on your data first, before even getting into the modeling issues?

Roy Wilsker: You're going to get me on my favorite hobby horse here. I'm going to suggest the way that we're looking at this kind of thing. First of all, we consider MLOps very important. If you're in a regulated industry, there are generally two rules. One rule is that you have to have a process, and number two is that you have to have evidence that you actually follow the process. So, pipelines and MLOps are actually very important for what we're doing. We can actually see what we did right, what we did wrong, and how we can improve the process. How can we show that we actually did the things that we said we wanted to do? There's a wonderful phrase that says, "The only time a data scientist is comfortable with his or her data is before they look at it."

One good practice I suggest is to create a baseline model before you've done all of your hyperparameter tuning. At that point, I suggest running that model through some of the fairness algorithms. Use it with LIME, use it with Shapley values, so you can actually see what the algorithm seems to be caring about. Then you ask yourself, “Does that make sense?” and “Is there a medically relevant reason why it's looking at that, or is it using some factor that just seems a little bogus?”

I think that every data scientist should be friends with statisticians if they don't have a good statistics background themselves. This is a really good time to turn to a clinical researcher and say, “I'm seeing something here and it doesn't seem right to me. Can you tell me whether this is actually something good, a new correlation that we didn't know existed before, or whether it's really pointing to some kind of bias in the data that we should look at more carefully and try to eliminate?”

So, I really encourage people to use those kinds of tools much earlier in the process, so that they can find out as early as possible whether there are actually problems with the data. Then they may be able to do something about it, instead of getting all the way to the inference stage and suddenly finding that there are problems.
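
As a concrete illustration of Roy's suggestion, here is a minimal sketch of running Shapley-value explanations on an untuned baseline model, before any serious tuning. The dataset path, column names, and the choice of a random forest with the shap library are assumptions made for illustration; the point is to inspect what the model relies on early, and take any surprising feature to a statistician or clinical researcher.

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical dataset: tabular features plus a binary "outcome" column.
df = pd.read_csv("training_data.csv")  # placeholder path
X, y = df.drop(columns=["outcome"]), df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deliberately untuned baseline: the goal here is inspection, not accuracy.
baseline = RandomForestClassifier(n_estimators=100, random_state=0)
baseline.fit(X_train, y_train)

# Shapley values show which features the baseline actually leans on.
explainer = shap.TreeExplainer(baseline)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)  # ask: is each top feature plausible, or a proxy for bias?
```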

Triveni Gandhi: That's a great suggestion. If we did nothing to this data, would we still see bias? Probably 95% of the time the answer is yes because all data is biased. 

Roy Wilsker: The tools can tell you where they see it.

Triveni Gandhi: That's exactly right. An iterative process is really more similar to how you're probably doing it in academia, Neil. So, I'm curious to know if you have any similar experience to what Roy is talking about here — where you've come to the recommendation stage and you realize that there's an issue that's just not resolvable there and you need to go back to the beginning of the pipeline; or, you need to talk to another person in your research group to actually get this resolved upstream. How does that work and how often does that happen? 

Neil Menghani: Propagation of different sources of bias from other areas of the pipeline is a key part of each of these stages. If you have bias in your data up top and bias in your models, this will flow down to the way in which you're making your recommendations. Even if you have a perfect methodology or you develop some algorithm that can help you tune the way you're recommending a certain action, if you have bias at other stages it will totally flow through.

Triveni Gandhi: It's interesting. We're talking about the pipeline, so we know that there's the data pipeline, the data processing and build, then you've got your model fairness and subpopulation that we've discussed. Then, there's that last component of getting it out in front of somebody. How do you make sure they're not actually misusing it, because even a fair algorithm can be misused? What are the kinds of things that you're seeing come up in academia? Roy, knowing that then, do you see these things becoming more relevant in the industry? So, Neil, kick us off with some of the explaining and reporting methods that are coming forward.

Neil Menghani: So, actually, we drill it down even further. After the model build, we break it down into recommendations and then decision making. The recommendation is basically the transition from the algorithm to the human, where you take some probability prediction and translate it into a recommended decision. The human, of course, should always be in the loop for actually making that decision.

I can talk a little bit about what we're thinking about on the recommendation side, and Roy can address the decision-making part of it. Essentially, when we think about recommendations, we have these two competing ideas. We have a disparate impact and then we have utility. So for disparate impact, we have a motivating example where we have group A and group B.

For group A, let's say we have a probability prediction of 0.51. For group B, let's say we have a probability prediction of 0.49. These two groups are perfectly separated in this way. Of course, this would never be true in reality, but it's just a motivating case. If we're using a threshold of 0.5, then we'll be recommending a certain action for everybody in group A and recommending against it for everybody in group B.

In that case, we're going to see a 100% false-positive rate for group A and a 0% false-positive rate for group B. So, the false-positive rate is very key. Even if we have a perfect model, we may end up seeing disparities in that way. Now that's a disparate impact. For utility, we do want to consider our predictions and some potential true differences between groups. Let's say that, between two groups, we see 0.9 for group A and 0.1 for group B.

We, of course, would not want to consider only the fact that there will be a false-positive rate disparity; we also want to incorporate those predictions when deciding to take a certain action. So, those are two extreme examples, and reality will be somewhere in between. The work that we're doing right now is to find the balance between rejecting a certain methodology for making recommendations based on false-positive-rate disparity and incorporating some concept of utility, where in this case we're using the probabilities as a proxy for that utility. That's how we're thinking about recommendations.
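
To make Neil's motivating example concrete, here is a minimal simulation of the two-group thresholding case, assuming (as he does) that the scores 0.51 and 0.49 are the true outcome probabilities for each group. The numbers come from his example; the code itself is an illustrative sketch, not the lab's actual methodology.

```python
import numpy as np

rng = np.random.default_rng(0)

def false_positive_rate(y_true, flagged):
    """Share of actual negatives that were flagged for the action."""
    negatives = (y_true == 0)
    return (flagged & negatives).sum() / negatives.sum()

# Everyone in group A scores 0.51, everyone in group B scores 0.49,
# and those scores are taken to be the true probabilities of the outcome.
n, threshold = 100_000, 0.5
y_a = rng.random(n) < 0.51               # true outcomes, group A
y_b = rng.random(n) < 0.49               # true outcomes, group B
flag_a = np.full(n, 0.51) >= threshold   # every member of A is flagged
flag_b = np.full(n, 0.49) >= threshold   # no member of B is flagged

print(f"Group A FPR: {false_positive_rate(y_a, flag_a):.2f}")  # ~1.00
print(f"Group B FPR: {false_positive_rate(y_b, flag_b):.2f}")  # 0.00
```

A perfectly calibrated probability model can therefore still produce a 100-point gap in false-positive rates between groups, which is exactly the disparate-impact concern Neil describes.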

Triveni Gandhi: So, you're testing things out at the extremes, knowing that reality is probably somewhere in the middle. Roy, how do you approach this? How do you come to the table and say, “Okay, what's the threshold here? What are we actually going to try and address, knowing that we're not dealing with extremes?”

Roy Wilsker: I want to point out that one of the problems you just mentioned is actually a straight statistical problem. That is, if you have the same distribution across two populations of different sizes, you may have two individuals who have exactly the same value, but because the size of the confidence interval depends on the population size, one may get approved for something and one might get rejected, even though they have exactly the same value. So, it's interesting that we're starting to see these things show up in machine learning that are really classical statistical problems, and we can be educated on those kinds of things.
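
To illustrate Roy's statistical point, here is a minimal sketch of a normal-approximation confidence interval for a proportion. The group sizes and the decision cutoff are made-up numbers for illustration; the mechanism is simply that the interval width shrinks with sample size, so identical observed values can lead to different decisions.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Same observed rate of 0.50, very different population sizes (illustrative only).
for n in (100, 10_000):
    low, high = proportion_ci(0.50, n)
    print(f"n={n:>6}: 95% CI = ({low:.3f}, {high:.3f})")

# If a decision requires the CI lower bound to clear a cutoff (say 0.45),
# the small group is rejected and the large group approved,
# even though both have exactly the same observed value.
```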

When we look at software as a medical device, we're looking at both externally facing artificial intelligence and internally facing artificial intelligence. Externally facing is when we're affecting products and patients. Internally is when we're doing things like using it for human resources, manufacturing, or logistics. We apply these things to both, but a lot of the things I'm saying about the software as a medical device are for externally facing things. So, when we look at that kind of thing, we look at a couple of factors. One thing is: what's the risk factor? If we make a wrong decision, what impact does that wrong decision have?

The second axis is the autonomy level: when this decision gets made by the system, is there a human in the loop? How fast does a human have to react to it? Let me give you a quick example. If you're dealing with a feeding tube and the feeding tube gets jammed, and it takes you five minutes to respond to that, there's nothing life-threatening about it. It's okay. On the other hand, if you have a COVID patient who's on a ventilator and you get an alert on that, you have to respond instantly. You have to be there within the next 30 seconds. Those are the kinds of thoughts we apply when we're looking at software as a medical device.

Is this the final arbiter, or is this a recommender system that's going to say to an expert, “Maybe you should do this”? Will they make the final decision on that? Or is this some kind of device that's going to deliver a treatment, where we need to make sure that treatment is being done correctly? What's the impact of the decision? If the impact of the decision is light, you have more leeway. If the impact of the decision is very serious, then you've got to be much more careful with it.

For example, with one of our products, we actually analyze images to decide what a radiologist, or in this case a gastroenterologist, should look at. We don't just give the gastroenterologist the images we think they need to look at; we give them a margin on each side of those images so that they have enough information to make an intelligent judgment call on what they're looking at. I think you have to look at the issue you're trying to solve. You have to ask yourself how autonomously the machine learning system is working. What's the impact of that machine learning system? Then, based on that, you really have to think about what you have to do to protect against the false positives, the false negatives, and so on. How do I make sure that the patient is protected as well as possible?

Triveni Gandhi: I think that's a great point — giving the full context of how a prediction or a recommendation is made. Like you're saying with the imaging, you're giving a little bit of margin on the side, instead of saying, "Just look at this little tiny bit," because there's only so much someone can take from that. It's interesting to think about this in the recommendation systems that Neil is working on. When you think about what COMPAS is, COMPAS is actually about recidivism, for folks who are in the criminal justice system. 

The idea is to help corrections officers support those folks' reentry into the workforce and/or into civilian life, but what happens is that this COMPAS algorithm is being used by judges to determine sentencing and prison terms, right? That's not what the algorithm was built for at all. So, this misuse, which can stem from either complete distrust or complete overtrust of AI, is also quite interesting. This is why I really do appreciate the human-in-the-loop aspect of what you're saying.

Roy Wilsker: So one example of that is where people created an algorithm to look at images and decide whether someone was gay or not. Their intention when they built this algorithm was to look for bias in hiring and to look for where people had not been brought into the hiring process because they "looked gay." People pointed out to them very quickly that it was very likely to be used by employers to avoid candidates who they thought “looked gay”. So, you have something where you can have a very different intended impact than the impact you really had. Sometimes, it's much more subtle.

One that I'm fascinated by is machine learning imaging tools that can look to see whether somebody has diabetic retinopathy, a condition where the blood vessels in the eyes start to get damaged because of high sugar levels. It used to be that you had to go into a doctor's office to be put in front of a machine, and they would peer into your eyes to do this.

Well, we've started to come up with devices where anybody, any technician, can bring the device over to you, have it look at your eyes, and use a machine learning algorithm to decide whether there's an issue there or not. The plus is that many more people are getting access to that kind of treatment than otherwise would. The minus that someone pointed out is that when somebody previously went in for that treatment, the doctor who looked at them, in addition to looking into their eyes, also looked at other things, other health concerns.

There was much more of an examination, much more of a conversation. Now the patient comes in and has their eyes checked, and the rest of that conversation never happens. You have to ask yourself, what might happen and how do I compensate for that? How do I mitigate that issue so that I can still get the benefit of giving this treatment to many more people, without the problem that they're not getting some of the other treatment that they would've gotten had they gone to a doctor's office like before?

Triveni Gandhi: Those are fantastic examples. The question of intentions is really critical for responsible AI because, as one of my friends likes to say, “the road to hell is paved with good intentions.” Are we actually sticking with our intentions throughout the entire pipeline, including the deployment and how it's being used? I'm curious, Neil, maybe you can give us some thoughts here on how these questions are being addressed in academia. You see a lot of research come out on diabetic retinopathy. Are academics really thinking about this impact stuff or are they still really up in the ivory tower lording over us all? I say that as someone who loves the ivory tower, by the way.

Neil Menghani: I can speak to what our group is thinking about. When we're developing these sorts of algorithms, I mentioned before how we have a dichotomy between disparate impact and utility. In order to strike a balance, there are some parameters that can be configured when using these algorithms to evaluate bias in recommendations. When putting these algorithms into practice, it's very important that we give specific guidelines for how they're meant to be used, so that people applying them across a variety of domains configure those parameters in a way that makes sense for their domain and don't misuse them.

Triveni Gandhi: Awesome. Well, thank you so much. I know that this conversation could go on for a really long time, and I personally still have so many questions, but we are at our time for today. I want to thank you both, Roy and Neil, for joining us to discuss the different issues that we're seeing in the industry versus academia and that bridge. Looking forward to reading more of Neil's research and seeing where Roy takes Medtronic into the responsible future! Thank you both for being here and see you again soon hopefully. 

Roy Wilsker: Great conversation.

Neil Menghani: Looking forward to keeping the conversation going as well.
