The Future of Data

There has been some speculation recently that predictions in the future will only get worse; that is, having more data, better quality data, better algorithms, and more computational power doesn't necessarily lead to better predictions. This episode of The Banana Data Podcast breaks down what the future of data might hold.

Not a Podcast Person?

No problem, we get it - read the entire transcript of the episode below.

Will Nowak: Okay, so the first article I want to discuss comes from Quartz. It's called Here's a Prediction: In the future, predictions will only get worse. So in this case, Allison Schrager talks about how, despite us having better data, better quality data, better algorithms, more computational power, all that jazz, that doesn't necessarily lead to better predictions. Which is I think probably counter-intuitive, why it's a catchy article, but she makes some interesting points that I think are totally valid.

The big idea is just that all those things are true. We should be getting better at predicting the future, no doubt, but the current state of the world is constantly responding and evolving. Her argument is that obviously we live in a more interconnected society, a more technologically-driven society. The world is changing more rapidly too.

So even though, in theory, we have better decision making processes, and in practice we probably have better decision making and predictive processes, the world's changing so fast that even though our prediction methods are really good, the world's going to outsmart us. It's going to change too much. On net, the predictions are actually going to end up being worse.

Triveni Gandhi: Wait, even if we're collecting new data and getting all sorts of new data points and up-to-date, latest fresh data?

Will Nowak: So I think, yeah. I mean, the example that really came to mind for me was using an app like Waze. Do you use Waze at all?

Triveni Gandhi: Well, I don't drive, but yeah. I live in New York City. Come on.

Will Nowak: Yeah, fair enough. So for listeners out there, Waze is a traffic navigation app. Waze does its best, using AI and machine learning, to direct you to the shortest route. But say you're driving with the Waze app turned on and it says, "Go left," and it says the same thing to everyone else on that road, because left is the non-crowded, fast option. Well, suddenly all of those cars are turning left with you, and the left option, option A, is now more crowded.

So, that's a simple but illustrative example of what a powerful technology Waze is, no doubt, but it still can't necessarily take into account all of the ways in which the world is adapting so rapidly.

Triveni Gandhi: Well, and this is the same problem in the financial system too, right? Where you have these hedge funds using complex modeling to say, sell stock X. All of a sudden everyone's selling stock X. So then its value goes down and then no one wants to start selling it. I guess what you're seeing is like this feedback loop, where the model says, "Do this," and then everyone does that, so then the model becomes obsolete.

Will Nowak: Yeah, no. I mean, this is definitely an area where we could learn a lot from finance, I'm sure. One way to overcome this has to do with observation weighting in your model building. So for our data science practitioners listening out there, this is something that hopefully they're doing or should be doing in their model building.

So if you're thinking about building a time series model, by weighting I mean the importance that you give to any single observation in your predictive model. It's pretty intuitive, and it's something that you can easily accomplish mathematically: weight the more recent observations more highly. That makes a lot of theoretical and intuitive sense and also leads to better performing models, so it's something that we should definitely be putting into place.
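A minimal sketch of the recency weighting Will describes, assuming an exponential decay scheme (the episode does not name a specific one). The function names, the `decay` value, and the toy series are all hypothetical and chosen only for illustration.

```python
def recency_weights(n, decay=0.5):
    """Weights for n observations ordered oldest to newest.

    The newest observation gets weight 1; each step back in time
    multiplies the weight by `decay` (an assumed, illustrative value).
    """
    return [decay ** (n - 1 - i) for i in range(n)]


def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Made-up series whose level shifts from about 10 to about 20 partway through.
series = [10, 10, 11, 10, 20, 21, 20, 20]

equal = weighted_mean(series, [1.0] * len(series))
recent = weighted_mean(series, recency_weights(len(series), decay=0.5))

print(f"equal weights:   {equal:.2f}")
print(f"recency weights: {recent:.2f}")
```

With equal weights the estimate sits between the old and new levels, while the recency-weighted estimate lands much closer to the recent level. The same idea carries over to model fitting: many scikit-learn estimators, for example, accept a `sample_weight` argument in `fit`, though which weighting scheme suits a given problem is a judgment call.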

Triveni Gandhi: I think it's more than just math that's going to address these problems, right? We need to also be looking at the story and the context of what we're trying to understand. So for me, I have a background in political science. I did a PhD in political science, but in 2016 we had this person come out of left field really who had not been seen before in our data or whose methods or whose just rise was so unprecedented. As a result, we really couldn't pick up on that nuance with our data, with our models that already existed.

In fact, those who were willing to sort of rewrite the script or rewrite what it meant in the data to have this sort of brand new factor were the ones who were better at actually catching or picking up the right prediction of that or the right outcome of that election. So I hope that moving into the future, we're not just saying, okay, well we're going to weight our observations differently or we're going to do this mathematical modeling. But we're also going to make sure we as data scientists and we as practitioners are keeping the context and the nuance of what's going on in the real world in mind to better understand and make use of our models.

Will Nowak: Yeah, no, I think that makes a lot of sense. People have spoken before about big data and the death of theory and we just can look at the data for answers and we don't need this theoretical framework. I think what you say is totally correct. We still need theoretical underpinnings for mathematical models. So for sure. But you know what's something else that could make our model predictions bad?

Triveni Gandhi: What?

Will Nowak: If our code breaks.

Triveni Gandhi: Oh boy, I had that happen.

Will Nowak: Yeah, me too. Definitely written plenty of buggy code in my time, but I wanted to talk about an article written by Vicki Boykis called Python's Caduceus Syndrome.

Triveni Gandhi: Oh, what is caduceus?

Will Nowak: Caduceus was the staff carried by Hermes way back when. So the staff is special because it's entwined with two different snakes. So these snakes, they crawl up the staff and at the top, one snake has a head that looks forward and one snake has a head that looks back.

Triveni Gandhi: Okay. So this is the medical symbol, right?

Will Nowak: Yep. But so in this case, really the head looking forward and the head looking back, that's the key metaphor that we need for Python.

Triveni Gandhi: Okay. Well and it's snakes. So it all works.

Will Nowak: It all works. Yeah. But so I mean I'm sure the vast majority of our listeners are familiar, but Python is an open source programming language that's really popular in the data science space. Just to be brief about it, on January 1st, 2020 there will no longer be updates made to Python version two. Instead the Python community will exclusively shift efforts towards Python 3.

Triveni Gandhi: Okay. Wow. So that's soon. That's like in six months.

Will Nowak: It is soon, though there have been warnings and preparations for this for a long time. So if you're an organization that has important Python code running in production and you're planning to shift to Python 3, you've known for a long time that you need to be aware of this transition. Why you need to be aware of this transition is that Python 3 is not backwards compatible. There are lines of code that you could have written in Python 2, using Python 2 syntax, that won't run under a Python 3 interpreter.
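To make the incompatibility concrete, here is a small sketch meant to be run under Python 3. The Python 2 lines are held in strings and handed to the built-in `compile`, so the file itself stays valid; the snippet labels and the helper name `compiles_under_this_python` are hypothetical.

```python
# Python 2 source kept as strings so this file itself stays valid Python 3.
PY2_ONLY_SNIPPETS = {
    "print statement": 'print "hello"',
    "old raise syntax": 'raise ValueError, "bad input"',
    "old octal literal": "x = 0777",
}


def compiles_under_this_python(source):
    """Return True if `source` parses with the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


for label, source in PY2_ONLY_SNIPPETS.items():
    status = "ok" if compiles_under_this_python(source) else "SyntaxError"
    print(f"{label}: {status}")

# Some Python 2 code still parses but changes meaning: in Python 2,
# 7 / 2 was integer division; in Python 3 it is true division.
print(7 / 2, 7 // 2)
```

The last line shows the quieter kind of breakage: code that still runs but gives a different answer, which is why a migration generally needs tests and not just a syntax check.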

Triveni Gandhi: Wait, so why are they doing this? Why are they stopping Python 2.7, or I guess Python 2?

Will Nowak: Yeah, so great question. Some of the details here are lost on me because my knowledge of Python only goes but so deep. But in general, the argument that Vicki makes here is that Python 2 was a great language. It was easy to learn. It got on or brought on a lot of practitioners, both individuals and large organizations.

So because of that, because Python 2 has become so popular, the Python development community said, "Hey, let's keep this success rolling. Let's make Python 3 the best language in the world so we can have new users use it, but we can also have the biggest, baddest organizations in the world running all their production code in Python." To achieve that, there are some details that they've decided to include in Python 3 that were absent from Python 2.

Triveni Gandhi: Wow. So it's going to redefine the language altogether?

Will Nowak: I wouldn't say altogether, but they're going to make some shifts for sure. But the point of the caduceus syndrome is this: should the Python community look back and say, "Hey, there's a lot of code that already exists that's been written in Python 2. Should we really make sure that we're not making breaking changes to the language? Should we make sure that we're prioritizing what already exists?" Or should we be the snake that looks forward and say, "Look at all that Python 3 still has to do, that Python 3 still can do in the future. Should we really optimize this language to be the language of the future?"

Triveni Gandhi: So I imagine there's debates, right? Because nothing's ever simple. So I'm guessing that there's people who are pure Python 2 and those who are, no, we got to go forward. What's the real debate here now?

Will Nowak: You're exactly right, Triveni. Guido van Rossum who's the Python BDFL.

Triveni Gandhi: What is the BDFL?

Will Nowak: I guess I misspoke. He's no longer the Python BDFL. So BDFL is the benevolent dictator for life. So Guido is the individual who is often credited with creating the Python language. So a pretty nice feather in his cap for sure. But there was a debate about some changes in syntax between Python 2 and Python 3. Again, my understanding and don't want to upset anyone here, but my understanding is that this debate about syntax changes, it was just a heated debate that caused Guido to, I think become a little bit dispirited and decide that the politics of this had become too much and to step down from that benevolent dictator position.

So this is something that you see a lot in open source communities where success is great. The product or the technology grows and grows and gets bigger and bigger and more users and more developers and that's all awesome. But then when you have so many people, everyone has a different viewpoint. Everyone wants something different. Some people want the snake to look back, some people want the snake to look forward. So you're totally right that I think Python's community is also experiencing some of this turmoil.

Triveni Gandhi: Yeah, I mean definitely. It's similar to how startups also grow and people's visions change and what the investors want or whatever it is will dictate the future. I'm curious to know what the future is going to look like for Python. So if Guido has stepped down, is it ruled by committee now? Is there a new BDFL? What's the vision, really?

Will Nowak: Yeah. I mean I think that open source technologies kind of have their own ecosystem of political management and how they have a hierarchy and they make changes. Obviously people who inhabit the world of GitHub can go and kind of see on the message boards and the branches and the pull requests what this all looks like. So I'm not exactly familiar with Python, but I think you're right that in general, this idea of open source governance, is something that's still not necessarily a solved problem.

Triveni Gandhi: Yeah. Especially since Python, and especially certain packages for machine learning, underpins so many existing practices and models. I know that I work with scikit-learn packages all the time to do predictions and so I want to make sure that what I'm using is being governed and updated in a way that's reusable. That I'm not stuck with, okay, now this thing's going to break because we couldn't decide how to keep it up to speed.

Will Nowak: Yeah. Yeah. So we'll see. But definitely I think you and I both have a big stake in Python's success, so I'll just wish it well for its future.

Triveni Gandhi: Yeah, I'm going to go and fix my code now. I'll be ready for Python 3.

Will Nowak: Sounds good. Now it's time for In English Please. This is the segment of the show where we break down complex data science topics. Last week we were talking a little bit about GANs and I think that's something where definitely, for me and probably many of our listeners, we could benefit from a little bit more clarification. So, Triveni, I was wondering if you'd explain GANs to us all in English please.

Triveni Gandhi: Sure. So everything you need to know about GANs is in the name. GANs is an acronym that stands for generative adversarial networks, right? So the three words there actually tell you what you need to know. So the first thing is that these are neural networks. What they're doing is creating a distribution of data, or they're mimicking data that can be text, image, or even plain numbers, with the idea of generating new points that are, in theory, indistinguishable from real data.

So for example, there's this website called thispersondoesnotexist.com. If you go there, you'll see photos of people. None of those people are real. All of those images have been made by a model using GANs. Okay. So what did GANs do to make these people? Well, in a generative adversarial network, there are two neural networks and they're adversaries of each other. So we have one network that's the discriminator, and that's very similar to what we already are familiar with in terms of classification.

Here's some data; predict for me whether or not this picture of a person is a real person or not a real person, right? That's the model that's going to say, okay, yep, I'm seeing that. Yep. I'm seeing a real person. No, I'm not seeing a real person. Then you have the second network, which is the generator, and that's the network that's actually creating these fake images. So the generator's creating a fake image of a person and feeding it into the same data that the discriminator's looking at. So we're adding in fake images into a dataset of real images.

The discriminator's judging, yep, this is human. This is not human or real or not real. The generator in turn is seeing what the discriminator is saying and updating itself to better fool the discriminator. So they're adversaries in that the discriminator wants to suss out the fake data and the generator wants to fool the discriminator. So the idea here is that we're using these GANs, one, to create different distributions of data that we can then use to maybe do some better forecasting.

This is very common in the financial services as we were talking about earlier, or in the case of these images, which also then kind of raises questions around, well what's real, what's not real? If a model can't distinguish it, can we distinguish it? In this sort of era of false information, what does that really mean? But essentially the GAN model is working against itself to improve the data that it is producing for you. So it's not really predicting something for you. In fact, it's actually creating a distribution of data for you then to work off of.
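The generator-versus-discriminator loop Triveni describes can be sketched in a few lines. This is a deliberately tiny, hypothetical stand-in: each "network" is a single linear unit on one-dimensional numbers with hand-written gradients, not the deep image models behind the face website. The parameter names, learning rate, and the choice of generator loss (minimizing -log D(fake)) are assumptions, and nothing here guarantees the toy converges.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, seed=0):
    """Train a one-dimensional toy GAN with hand-written gradients.

    Generator:      x_fake = a * z + b, with noise z ~ N(0, 1)
    Discriminator:  D(x) = sigmoid(w * x + c), the estimated P(x is real)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters
    w, c = 0.0, 0.0   # discriminator parameters

    for _ in range(steps):
        x_real = rng.gauss(real_mean, 1.0)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b

        # Discriminator step: lower -log D(x_real) - log(1 - D(x_fake)),
        # i.e. score real points high and generated points low.
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: lower -log D(x_fake), i.e. nudge the fake point
        # in whichever direction the discriminator scores as more real.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w
        a -= lr * grad_x * z
        b -= lr * grad_x

    return {"a": a, "b": b, "w": w, "c": c}


params = train_toy_gan()
print(params)
```

The structure is the point: the discriminator is updated to tell real from fake, then the generator is updated using the discriminator's verdict, and the two alternate. Real GANs replace the single linear units with neural networks and are known to be sensitive to training settings.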

Will Nowak: Cool. All right, well thanks for explaining that in English.

Triveni Gandhi: You got it. So, Will, before we head out, I want to talk about one last article today and it's called How a Feel-Good AI Story Went Wrong in Flint. This was written by Alexis Madrigal in The Atlantic, actually. So I don't know if you're familiar with the Flint, Michigan crisis?

Will Nowak: I would say to some extent. Yeah, aware of it.

Triveni Gandhi: Sure. So what's going on is that in 2014 the city redirected where it was getting its water from. So the water source actually started corroding the lead pipes and lead was leaching in from people's water pipes into their water and lead is not good to drink. So the city needed to go through and replace all of those lead pipes with copper. Right? But it's not that simple. It never is. The records on which homes have lead pipes were really messy and incomplete, and it was going to be expensive to have to dig into every single home to look for lead pipes.

Will Nowak: Makes sense.

Triveni Gandhi: So in 2016 there were these volunteer computer scientists who worked with funding from Google actually to create a model to predict which homes have lead or copper pipes. They did so using a cheaper system called hydrovacing. So they were able to use hydrovacing to test out a bunch of different homes for lead and be able to feed that data back into their model to create a prediction for, okay, these homes do have lead. These homes likely don't have lead.

So they found actually that the age of the house, the location, and the value were most predictive in finding lead pipes. So once that model was built, in 2017 the city was able to replace 6,000 pipes in homes across the city. So they excavated around I think 8,300 pipes and of those, over 6,000 were actually lead that needed to be replaced. So they had about a 70% accuracy rate. So it was all really good.
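As a quick check on the figures quoted above, the "about 70%" can be worked out directly. This sketch uses the rounded numbers from the episode, so the result is approximate; the function name `hit_rate` is hypothetical.

```python
def hit_rate(lead_found, excavated):
    """Fraction of excavations that actually turned up a lead pipe."""
    return lead_found / excavated


excavated = 8300     # approximate figure quoted in the episode
lead_found = 6000    # "over 6,000", so treat this as a lower bound

rate = hit_rate(lead_found, excavated)
print(f"hit rate: {rate:.1%}")
```

That comes out a little above 72%, consistent with the "about 70%" in the conversation. Strictly speaking this is the share of digs that found lead (closer to what is usually called precision) rather than accuracy over all homes, since it says nothing about lead pipes in homes that were never excavated.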

Will Nowak: That sounds pretty good. So, so far so good. No problem here.

Triveni Gandhi: Yeah. So yeah, where did it all go wrong? Well. So in 2018 a new firm was brought in by the city to really amp up the replacements. They were going to go in and dig in and just get everything sorted out. Except the company chose not to use that model anymore. So what happened was the group that was working with the city in 2017 left around October and the new guys, the new company didn't come until the end of 2017 around December.

So they came in, there was no overlap, there wasn't any sharing. The new company came in and basically said, "All right, we just have to start at ground zero." So they didn't use the model and instead they changed their methods because not only did the company not really know how to use that model or even have the database, but the mayor was also facing political pressure. This is where I think it gets relevant to the data science world because people were saying, "I don't care what a model says. I don't care that a model thinks my house is fine. I want you to dig it and tell me that it's fine."

So obviously the mayor is going to face political pressure. She doesn't want people to not believe her or to say, "Oh no, you missed my house." That was happening a little bit in 2017 where people were saying, "Why are you skipping me?" So as a result, they just sort of ignored the model and started excavating based on what they thought made sense or sort of to appease the different parts of the city. As a result, they started excavating a lot of places and just finding copper pipes and missing the lead. I mean, obviously the crisis is still ongoing because there are plenty of houses that still are suffering from the lead pipe issue.

Will Nowak: Yeah. I mean, to me this brings to mind another theme of this podcast, which is the importance of education for all. As we've discussed previously, not just the practitioners but the public, so about these machine learning models and more simply I would guess, probability and this idea that you might be safe 99% of the time. But someone says, "Well, I'm not comfortable maybe because I don't understand probability or maybe just because I do and that's the way I feel." I suppose people are entitled to their feelings.

But, "I'm not comfortable with 99% certainty, I want that extra 1%," and I don't envy the politicians in the world who need to help fight that battle. But I think we as people in the data science space and people who in theory have some understanding of model building and probability should be doing a better job to say this is what the output means. So maybe when they say this was a feel good story that went poorly, that's maybe one way that they would say they could've done better to make it go less poorly.

They could have, in addition to creating this really wonderful model that does make a lot of sense, they could have said we needed to do better to assure people that, no, this model won't be perfect. But for reasons X, Y, and Z, it's kind of the most sensible thing for our city to do.

Triveni Gandhi: Like, okay, this is the baseline where we're starting with. The model says these are the homes to dig. We're going to do those. Once we know that we've gotten those ones out, we'll go back and check yours even if the model says it's fine. Right? That can be sort of a compromise. But I wonder too, given that so much of our lives are already using machine learning models, I mean it's obvious to me what the difference is here, but how do we figure out how to convince people that, well, you're already trusting models to do a lot of stuff for you that you don't even realize.

So now when it comes to this issue, which is very prominent and at the top of your mind, what's changed here? Think about medical probability statistics. Sort of like, okay, we found this. We have these two treatments for your illness. This one has an 85% chance of success. This one has a 90% chance of success. People are going to say, "Okay, give me the 90 one," right? But they're not arguing, "No, I don't want any treatment unless you give me 100% accuracy on this, or 100% chance of success."

Will Nowak: Yeah. But I think it's more, to my mind, it's more the lack of knowing. So to explain what I mean by that, we don't yet have these driverless cars where there's a 99.99% chance it's going to be fine. But I know if I'm driving that I'm in control and even though there's realistically a much higher chance for me to make a mistake, I just know I have the control.

So this is similar to what we see I think in the medical world where a lot of people choose to opt into high cost testing, where we're pretty sure based upon historical probabilities that you don't have this issue. But we can perform this costly test to with certainty tell you whether you do or do not have this illness. To people there too, I think they're not comfortable living in that uncertainty even though all signs point to them being fine.

Triveni Gandhi: It's a high risk domain.

Will Nowak: Yeah, and so I think in this case it's different than our Facebook algorithms being optimized here, but we're still not that comfortable with people trusting models and probabilities. That's my impression.

Triveni Gandhi: Well, yeah, I get that. Right. I mean sort of the machine learning world still has a ways to go to prove that, yes, what we're telling you is going to fit the story. Right? Then someone could make the argument based on what we talked about earlier that no, your model is actually terrible because you're not taking in any context. You're using historical data, that data's all wrong, whatever. In fact, actually the article here discusses that later on, the city found that the hydrovacing procedure that they were using to build some of the data out was actually missing certain lead pipes.

So it's never a clean, easy thing, but I hope that as we move forward, politicians themselves become educated on what these models mean and what they can and cannot do. We as data scientists and people in the data world start providing more, in English please, down to earth explanations of what's happening so that the average person has a basic literacy or understanding of what's happening in the model, what's happening in the data science so they can make a better judgment. Not think like, "Well, some magical machine thing told me this, I don't want to believe it at all," versus, "Okay, I kind of understand what happened here or why they've made this decision."

Will Nowak: Yeah. Also saving room I think for collaboration in the model design process. So to your point, if models are performing poorly, like they're not accurate in some sense, then that's justified and people should have the ability to complain accordingly or at least to be educated and to say, "Hey, this model is going to be correct 70% of the time, so 30% of the time we're going to get it wrong."

But I think another way in which people need to be educated and we need to collaborate and both parties need to be educated in this case is the people who are building the model, they probably have a lot to learn from the people who live in Flint and say, "Oh, actually there's this thing that you computer scientists aren't aware of, but I bet this really is something that's highly correlated with lead pipes. So you should take this into account in your model too." How often are we getting that on the ground information? I feel like not enough.

Triveni Gandhi: Yeah. Then one last point for me on this is that it's not also just that the model is going to be the final say. So the model says, okay, these 70 houses have lead. These 30 we can ignore. Well the city is going to go ahead and hit the 70 first, and then once we've gotten out the ones that the model thinks are for sure, we'll still go back and check the ones that the model said are a no.

I don't think the idea is to use the model or use that data as the final source of truth, but rather as a signpost in the wilderness, if you will, to find out where to go first. Definitely get some of these high risk homes safe and then go back and check everyone. I mean that's, if I were mayor, that's what I would do. Again, it's not an easy task to balance all these different political forces and people and city council and all of that.
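The "signpost, not final say" idea can be sketched as a simple ordering: every home stays on the list, and the model's score only decides who gets checked first. The addresses and probabilities below are made up, and `excavation_order` is a hypothetical helper, not anything described in the article.

```python
# Hypothetical homes with made-up model scores (estimated P(lead pipe)).
homes = [
    {"address": "A", "p_lead": 0.15},
    {"address": "B", "p_lead": 0.92},
    {"address": "C", "p_lead": 0.48},
    {"address": "D", "p_lead": 0.81},
]


def excavation_order(homes):
    """Order every home from highest to lowest predicted risk.

    No home is dropped: the score only decides who gets checked first.
    """
    return sorted(homes, key=lambda h: h["p_lead"], reverse=True)


for rank, home in enumerate(excavation_order(homes), start=1):
    print(rank, home["address"], home["p_lead"])
```

Contrast this with applying a cutoff and skipping everything below it, which is the reading residents objected to when they asked why their houses were being passed over.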

Will Nowak: It brings us back to the centaur decision-making that we discussed last time. So again, for listeners, if you missed episode three, go back and check out centaur decision-making.

Triveni Gandhi: Great. Thanks, Will.

Will Nowak: All right, thank you.

Triveni Gandhi: Okay, before we head out, banana fact of the podcast. So according to the Guinness Book of World Records, in 2001 a bunch of bananas took the title of quote "largest bunch of bananas" and it actually held 473 individual bananas and weighed 287 pounds or 130 kg. It was actually grown in the Canary Islands.

Will Nowak: Wow. That is quite an outlier.

Triveni Gandhi: It is indeed.
