On Ethical Data Science & Dropping the Best Model Approach

Lynn Heidmann

If you're a business, do you really want to bend over backwards to get 99% accuracy with a machine learning model when a simple linear regression that gets you 94% accuracy gets the job done? This episode of the Banana Data Podcast examines this very question, along with a discussion on becoming more ethical data scientists (plus, as a bonus, a section on federated learning in healthcare).

 

Not a Podcast Person?

No problem, we get it — read the entire transcript of the episode below.


Triveni Gandhi: We'll be taking you through the latest and greatest in data science, without taking ourselves too seriously. Today we'll be discussing the problems with our traditional approaches to machine learning, the implications of privacy when it comes to healthcare and advertising, as well as the basics of federated learning. So, Will, tell me more about the problems with our traditional approaches to machine learning.

Will Nowak: Gladly. So we're going to be referencing a blog post on the blog bentoml.com, written by Christoph Molnar. So thanks Christoph. And so it's called One Model to Rule Them All. And so to your point about problems with our traditional approach to machine learning, this really resonated with me. Christoph makes the point that we focus too much on error minimization. So we're always thinking about how can we optimize our model performance. It's all about model performance. But his great insight is that so many other things matter in machine learning. And to me that was really powerful.

Triveni: Yeah. So what are the other things that matter to him?

Will: Yeah, so in addition to the model itself, he makes the point that we need to think about problem formulation, the data-generating process, model interpretation, application context and model deployment. So kind of the whole life cycle from inception of what model you're even going to think about making, creating, using, and then how it gets used, how it gets deployed.

Triveni: That's really interesting. It strikes me that we haven't ever really... Data scientists who are self-taught, right, and so many of us are, never really get an education on those things. The things we get an education on are how to manipulate a data frame, how to reduce your error rate. And it's interesting to me that he's actually calling out those parts of data science that are so important but that we're never really taught.

Will: Yeah, exactly. I mean, he has a section of the article that really resonates with me as well: it's so easy to focus on numeric metrics, because they're so tangible, so measurable. It's pretty easy to say, "Hey, look, my error score is lower than yours." Whereas it's quite challenging to say, "My problem formulation is much better than yours." And I think that's really true, but it nevertheless doesn't mean that we should ignore things like problem formulation, which I think we do a little bit.

Triveni: Yeah, and I think it's interesting because another place where I think we see this problem is the Kaggle world. And however you pronounce Kaggle, you can see these ideas getting replicated over and over again in the competitions. So Kaggle, for those of you who may not know, is a competition website where you can go and get data and a question, and the goal is to create the model that best predicts the outcome the competition is looking for. Predict the housing market in 2019 using this Zillow data, whatever it might be. So Kaggle is really just giving you a clean dataset and telling you what to look for. And the only thing it wants back is the top model, the model that has the best prediction. None of the other stuff seems to matter in the competition.

Will: Yeah, and it's something we actually see a lot with our clients: in addition to model performance, what really matters is how the model performs in the world. So if you're a business, do you really want to bend over backwards to get 99% accuracy when instead a simple linear regression that gets you 94% accuracy kind of gets the job done, whatever that job may be? And again, that's something Kaggle just kind of ignores: the cost-benefit of model training versus model performance.
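To make that cost-benefit point concrete, here's a minimal sketch in Python, assuming a synthetic classification dataset and scikit-learn models chosen purely for illustration: it times a plain logistic regression against a heavier gradient-boosted model so you can weigh training cost against the accuracy gain on your own problem.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a business classification problem.
X, y = make_classification(n_samples=20_000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("simple logistic regression", LogisticRegression(max_iter=1000)),
    ("heavier gradient boosting", GradientBoostingClassifier(n_estimators=500)),
]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))
    # Compare how much extra training time each point of accuracy costs.
    print(f"{name}: accuracy={acc:.3f}, training time={elapsed:.1f}s")
```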

Triveni: Oh yeah. And so often when I look at the top leaderboards on Kaggle, it's the guy who ran a model for 10 days over three GPUs who won. And that's just not practical in the real world.

Will: Yeah, for sure. And also, I think, another point that Christoph brings up is this idea of the data-generating process. So like first of all, you need to make sure that the data is kind of accurate in representing the problem you're trying to model. But then also, I think, another thing that we talk about a lot at Dataiku is: is this data ethical in some way? Where did it come from? Are you okay with using the data that you're using to build this model?

Triveni: Definitely. Yeah. This idea of bias and where the data is coming from also kind of reminds me about privacy, and how our data is being used. It's been a wild couple of years, where we've seen our data being used in all sorts of crazy ways. And it actually reminds me of an article I wanted to tell you about, Will, which is about targeted advertising.

Will: Ah yes. Targeted advertising. Say more.

Triveni: Sure. Yeah. So The New York Times actually has this whole series on data privacy that they have going on right now. And it's really cool. One of these data privacy articles is called These Ads Think They Know You. And it's really interesting. So we all know that, to an extent, the ads that we look at on our computer screens or phones get targeted to us. I go and I look at a pair of shoes on Zappos, and then two days later, when I'm looking at something else, the advertisement that's showing up in my window is for those same shoes that I saw on Zappos.

Triveni: So we know that there's some of this data sharing going back and forth. But what this article reveals is that there's something much deeper and much more complex happening behind the scenes that we often don't even realize. So The New York Times decided to sort of lift the veil on a part of this thing, and they bought some ad space. They then picked 16 categories of people that they wanted to target ads to. And then instead of selling ads to these target audiences, they actually used the ads to showcase the information that was being used to target them.

Will: Very cool.

Triveni: So they list out a couple of example ads here. The one that resonated most with me is this: "This ad thinks you're trying to lose weight and still love bakeries." And that is very true for me. So that was great. But it's interesting. Their argument here is that, well, we can tell that you're trying to lose weight based on what sites you're visiting, but we know that you still love bakeries because we have your credit card history and we can see where you're spending your money. So we think, okay, yeah, there are always targeted ads and blah, blah, blah. But it actually goes a lot deeper than I think we realize.

Will: Yeah. I mean, it's interesting, and I can't claim to be really holier than thou, and I want to be clear about that, because I'm working in this space, working with data. And there have been times that I've been working on projects for organizations or for clients or what have you. And going back to the previous article, talking about the data-generating process, you don't always necessarily know where that data came from. And so again, that's something where we can talk on this podcast today, or kind of in the future, about what are some policies, what are some practices we can put in place, so that data scientists take some responsibility for the data that they're working with. Because again, if it's just handed to you as a black box with, "Oh, we know this about these people," how do you know it? Did you get it in a way that we consider ethical? Is its use ethical? Valid questions.

Triveni: Yeah, I mean, it's great you raised that, because The New York Times, when they did this experiment, actually used the data that these data-bundling companies hand over to folks who are buying ads or creating ads. And these profiles of people, how they've created these profiles, is proprietary and hidden. So not even The New York Times knows how they came up with this stuff. And it's a bit scary, right? Because there are then long-term implications of targeted ads affecting not only your purchasing, what you're buying, or whatever, but what you're thinking, how you're feeling, how you might vote. There are some real concerns here.

And it's interesting because, according to the article, some of these companies are now trying to say, "Well, we should be better about our data collection policy. We should be more clear. We should be regulated." But they might try and get themselves regulated in a way that isn't very strict. So it's one of these ongoing questions, I think.

Will: We've seen some strides in that recently, though, right? All this Facebook in the news stuff. There have been many situations in which the broader public is starting to get upset about this, so that could potentially lead to some shifts in regulation and policy. But we'll see.

Triveni: Yeah. I mean, California does have their own data protection law now, and it's quite similar. I wouldn't say it's exactly the same, but it's similar to the GDPR that we have in Europe. The only problem is that it's restricted to California. And so what about the rest of us in the other 49 states? We don't have that same protection yet. And so it's interesting to see how this is now coming to light, and whether or not the government or private companies themselves will take the initiative to focus on it.

Will: Yeah. Related: Glen Weyl, an economist at Princeton, wrote about this phenomenon of data privacy in a recent book. And his theory is that we should have markets for data. So your data should be owned by you, and you could think about all that jazz related to blockchain and decentralized networks. But in all seriousness, his idea is owning your data and then being able to spend it or use it as a resource (as people have said, data is the new oil), and that's somewhat compelling.

But then, related, and now I'm going to mix together several prominent podcasters in my life, Ben Thompson of Exponent and Stratechery, I think this is his idea: that that's not actually going to work, because data has network effects. Your data, individually, doesn't really matter at all. So it's hard for you to say, "Oh, I'm going to take my data, I'm going to set a price with my data, I'm going to have power with my data." It doesn't actually work that way, because no one cares about one person's data. We just care about the data in aggregate. So all interesting ideas, smart people thinking about this. But it seems like no one's really cracked the case yet.

Triveni: Well yeah, because whatever you're thinking, these guys are thinking one step ahead. So if we're going to make any movement on this issue, it has to be a real commitment from the folks who work with this data to give sort of that guarantee that we're not using it in a bad way, or we're being regulated in some way.

Will: Yeah. This idea of the Hippocratic Oath for data scientists.

Triveni: I love that. I love that. We'll call it the Nowak Oath.

Will: Yeah, but really, I mean, I think that's something we'll see change. As a society, we're like, oh, medicine is so important that you really need to take this vow. Whereas I'm a practicing data scientist, but I've never made a sworn statement to anyone for anything. So that's something, again, for us to make progress on.

Triveni: All right. Cool. And now it's time for In English Please. This is the part of the show where we take a moment to break down some complex data science topics in English. So, Will, can you describe to me, in English, please, federated learning?

Will: How familiar are you with SGD, stochastic gradient descent?

Triveni: Oh boy. I know that it means producing something.

Will: Producing something. That's all you need to know, kind of. So in general, the idea, for our listeners at home: so much of the AI/ML space is about prediction. And when we try to predict things, we're trying to do supervised learning. So we already have some historical training data and we're trying to learn from it. And when we try to learn from it, we're trying to fit some sort of mathematical function. And one way you can fit that function is through this process called stochastic gradient descent. And so this gradient descent idea is really important. The idea is that we have some parameters that make up our model. So like some numbers. And we're trying to find the right numbers, because when we have the right numbers, that's going to lead to some minimization of our error.

Will: So really you could sum up much of supervised machine learning as just that: trying to find the right numbers, the right parameters of a mathematical function, that are going to minimize error. And lots of smart people spend lots of time thinking about how you do this. But the idea of gradient descent is that you can take a partial derivative. So if you haven't brushed up on your calculus recently, don't worry too much. In general, the classic example is that you're standing on the side of a mountain and you want to get to the valley. Why the valley? Because it represents the minimum elevation. In this case we're looking for the minimum of the error. So if you want to get to the valley, you just look around, see which is the steepest direction down, and then take a step in that direction. If you keep doing that, in theory, you'll end up in the valley.

Will: And so this is the idea with gradient descent. You keep taking small steps in the direction of steepest descent, in an effort to minimize your function's error. So far, so good?
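To put the valley analogy in code, here's a minimal stochastic gradient descent sketch in Python; the one-feature linear model, the made-up data, and the learning rate are all illustrative assumptions. Each step evaluates the squared error on a single random example, takes the partial derivatives with respect to the parameters, and moves them a small step downhill.

```python
import numpy as np

# Toy data: one noisy linear relationship, y ≈ 3x + 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

# The parameters of our model: the "right numbers" we're searching for.
w, b = 0.0, 0.0
learning_rate = 0.1  # how big a step we take down the mountain

for step in range(500):
    i = rng.integers(len(X))          # pick one random example: the "stochastic" part
    error = (w * X[i] + b) - y[i]     # how wrong the current parameters are here
    grad_w = 2 * error * X[i]         # partial derivative of squared error w.r.t. w
    grad_b = 2 * error                # partial derivative of squared error w.r.t. b
    w -= learning_rate * grad_w       # step in the direction of steepest descent
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should end up near w=3.0, b=0.5
```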

Triveni: So far, so good.

Will: Okay, so oftentimes the way this works in machine learning is that you have all of your data stored on one central server, one central computer, and you do all of your training there. The idea with federated learning is that the data does not have to be aggregated. So in the paper that I read, Google was using Android phones spread out throughout the world that all contain individuals' data. And so the promise of this, there are several, but one is that if I care about my data's privacy, going back to our previous conversation, I don't have to agree to share that data with Google or with anyone else. I can keep the data about me on my phone. But, nevertheless, Google, or someone, can still learn from that data.

Will: So the way it works is your data stays on the phone, but the model exists on some sort of central server. And remember, a machine learning model is just a bunch of parameters, just a bunch of numbers. So we have these parameters that live on this central server. Those parameters get pushed out to phones, and then the phone says, "Okay, this is the current state of the model. Now I'm going to look at Triveni's data," which represents supervised learning instances. "I'm going to look at Triveni's data and see which cases I predict correctly according to the model's current state and which cases I predict incorrectly. And I'm going to learn from that, using gradient descent."

After a round of gradient descent occurs, Triveni's phone says, "Hey, I've got some updates to these global parameters. Like, hey, Mr. Model at the central server, I've got some updates for you." So once that phone gets back on wifi, those parameter updates are pushed to the central server and this process repeats. So I think the promise here, in theory, one of the promises, is that the data never needs to leave your phone. So the data can stay local, the data can stay protected, but nevertheless we can build this global model. So it's a pretty cool idea.
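Here's a simplified sketch of that loop in Python, assuming a toy linear model, a handful of simulated "phones," and a plain parameter-averaging step; it's an illustration of the idea, not the actual system described in the Google paper. The key property shows up in the code: the server only ever sees parameters, never the raw local data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each simulated "phone" holds its own local data that never leaves the device.
def make_client(n=50):
    X = rng.uniform(-1, 1, size=n)
    y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=n)
    return X, y

clients = [make_client() for _ in range(5)]

def local_update(params, X, y, lr=0.1, epochs=5):
    """Run a few gradient descent steps on one client's local data."""
    w, b = params
    for _ in range(epochs):
        error = (w * X + b) - y
        w -= lr * np.mean(2 * error * X)
        b -= lr * np.mean(2 * error)
    return w, b

# The central server holds only the model parameters, never the raw data.
global_params = (0.0, 0.0)

for _ in range(20):
    # Push the current parameters to every client; each sends back its updated copy.
    updates = [local_update(global_params, X, y) for X, y in clients]
    # Average the client updates into the new global model.
    global_params = tuple(float(np.mean(vals)) for vals in zip(*updates))

print(f"global model: w={global_params[0]:.2f}, b={global_params[1]:.2f}")
```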

Triveni: Yeah. So it sounds like even though my phone is sending something back up to the server, it's not sending anything about me. It's just saying, "Hey, I've put some more data into this model and I think I tweaked it. I think I got us closer to the valley. So here's the map of how I got down to the valley."

Will: That's exactly right.

Triveni: I love it. I love this analogy, Will.

Will: So it's a pretty cool, pretty smart idea.

Triveni: It's great. And it makes sense now, in the context of the article I read, which was actually by the MIT Technology Review, and they post all their stuff on Medium, which is great.

Will: They probably know something about data science.

Triveni: I would think MIT Tech, I mean, I don't know. Anyway. They were talking about using federated learning to train on health data so that folks that are doing different kinds of machine learning in the health space, predicting cancer rates, predicting remission rates, whatever it might be, they can actually use federated learning to improve their models and get more information, without having to worry about all of the data privacy issues that come with health data, because health data's super private.

Will: That seems like a good application.

Triveni: Yeah. And so it's really interesting. It sounds like what they're suggesting is, okay, you run this federated learning model on your server and send it out to the different hospitals where the data is collected, and the model runs its own thing on the back end in each hospital. And then each hospital sends the parameters back to the main server. And in theory, that improves things.

Will: Yeah. I wonder, the one thing that seems challenging about this: if you're thinking about an organization like Google, which has an operating system shared across all of these phones, the schema of the data is going to be consistent. The data on your phone and its structure, that's what I mean by the schema of the data. The structure of your data looks the same in your Android messaging application as it does in mine. And this is just ignorance on my part, I don't know enough about hospital data. So listeners out there, contact us and tell me more. But I don't know about the schema of data at various hospitals. Because you need it to be the same, or similar enough, that this model training can go off without a hitch. So if anyone knows about that, let us know.

Triveni: Yeah, I mean, I know there are a lot of companies working on this now, but healthcare is notoriously varied in how different healthcare providers store their data. And that's why every time you go to the doctor and you change insurance, you have to go to a new doctor, you got to give them all that information again, because it's not like they can just quickly grab it from your last provider. There's no unified central system, there's no unified sort of, this is how we record our data. And that's going to be a big challenge, obviously.

Will: So one thing that this makes me wonder is, what if you don't want your data included at all? So I think, if you read the fine print at times, you probably have the option to opt out of having your data contribute to model training. And right now, my impression is that people don't care that much. They think, I don't want all my data about Will to be pushed out to the world, but if my data about Will helps contribute to your map of the territory, or the map of the mountain and the valley, then that's okay. I wonder, though, as time goes on and people kind of start to savvy up here, whether there's going to be a shift in that too. If people realize, hey, your personal data is not being transferred, but it is contributing to these models that are ruling your life, maybe they'll say, "Hey, not only do I not want my personal data shared, this whole federated learning thing, I'm not for that either." Could happen.

Triveni: You know, it's interesting because now you can go onto your Google browsers, your Chrome, and you can say like, opt me out of tracking my ads. Do not track me. But that's an opt out.

Will: Yeah, it's the default.

Triveni: I think we need to be moving to a world where it's opt in. Yeah, I do want to give you my data. And in some ways, I know, like when you open up a brand new Apple computer, it says, "Hey, can we send your data to developers?" Or you get a new app: "Can we send data to developers?" Sure, I'm opting in. But right now so much of our world defaults to you having to opt out, and figuring out how to opt out is not always simple.

Will: Yeah. Are you familiar with Dan Ariely?

Triveni: No. Tell me more.

Will: I might be mispronouncing his name as well.

Triveni: Dan.

Will: Dan Ariely is a behavioral economist, but he has this fascinating, I believe it's a TED Talk, where... and sorry, spoilers, if anyone hasn't yet seen it and wants to watch it, cover your ears... where he presents, I think it's kidney donation statistics for various countries, and some of them are really high. So everyone wants to be a donor. Sure, we'll all donate after we pass. No worries. Other countries, super low. We're talking like 95% of the population versus 3% of the population. So definitely statistically significant. Huge differences. And everyone sees it and they're like, oh, what's wrong with these countries that aren't donating? Why are they so mean? Why are they not generous? It must be something wrong with their culture. And then the hitch here is that in the countries that have lots of donors, it's opt out. In the countries that have very few donors, organ donation is opt in. So this opt in, opt out. Very important. And I think that's great. We should definitely start applying this more to the data science space. It's a good idea.

Triveni: Yeah. And especially in healthcare, especially in cases like, I don't want my credit card history shared with these guys so they can send me targeted ads. That's crazy. So things to think about. But yeah, I really liked this article, and they did actually mention that IBM, and some startups that are backed by Google, are using federated learning to help predict things like patient resistance to drugs and survival rates for certain diseases. So there are places that are using this more and more. And hopefully as they go forward, they're thinking about doing some opt in, opt out strategy there.

Will: Cool.

Triveni: Cool. So I think that's about all the time we've got for today, in terms of Banana Data. But we realize that people might have come to this podcast expecting to learn about bananas.

Will: Data about bananas. That would make sense.

Triveni: Right. So in order to not disappoint, I do have a piece of banana datum for you, datum being the singular of data, Will. Not sure if you knew that one.

Will: Correct, correct.

Triveni: Anyway, so some banana datum. Did you know that approximately 50 billion tons of Cavendish bananas are produced globally every year?

Will: No.

Triveni: Well, the UN does. They're the ones who put out that statistic.

Will: All right.

Triveni: Cool.

Will: Learn something new every day.

Triveni: Now you know some stuff about bananas.

Will: Awesome. That's all we've got for today, in the world of Banana Data. We'll be back with another podcast in two weeks. But in the meantime, subscribe to the Banana Data newsletter to read these articles and more like them. We've got links for all the articles we discussed today in the show notes. All right, well, it's been a pleasure, Triveni.

Triveni: It's been great, Will. See you next time.
