Deepfakes and Data Upskilling

Scaling AI Catie Grasso

In an ever-evolving technology landscape inundated with competitive players, it’s important for data scientists to question and critically analyze how to focus their learnings. Who grants authority to those in charge of validating content? How do we remain cognizant of big tech and corporations that shape our content and decisions? How do we upskill our efforts in a way that is trusted and not overly narrow in focus? This episode of the Banana Data podcast includes some initial thought starters on these big questions centered around trust.

Not a podcast person?

No problem, we get it — read the entire transcript of the episode below.

Triveni Gandhi: You're listening to the Banana Data podcast, a podcast hosted by Dataiku. I'm Triveni.

Will Nowak: And I'm Will.

Triveni Gandhi: We'll be taking you behind the curtain of the AI hype, exploring what it is.

Will Nowak: And what it isn't.

Triveni Gandhi: Capable of. For our launch of season three we're talking about trust. How do you dodge deepfakes? Who classifies data as safe? Who do we listen to or study when we're trying to upskill?

Triveni Gandhi: Hey Will! Welcome back.

Will Nowak: Triveni, what is up?

Triveni Gandhi: It is season three and we're here to talk about trust.

Will Nowak: Trust. Yeah. Specifically I think we're going to kick off the show by talking a little bit about deepfakes, correct?

Triveni Gandhi: Oh man. You know how I love to hate deepfakes, right?

Will Nowak: So, unlike you, I'm not on the Twitter. What's a deepfake?

Triveni Gandhi: Deepfake is basically when people take images and videos, do some crazy mathematical transformations, and make a new video or image that isn't true. It can be pretty problematic. It can cause a lot of harm.

Will Nowak: Yeah, yeah. I can see the negative implications already for sure. What do we do about it? I think it's your job and my job to figure that out.

Triveni Gandhi: So, this is obviously something a lot of people have been thinking about and working on. I wanted to talk to you about this new startup called Attestiv that is committed to helping fight deepfakes by providing validation to images and video and other files that get created.

Will Nowak: So in this case, if I'm The New York Times and I want to publish an image and I want my image to be verified as non-deepfake, do I have to pass my images to this organization first?

Triveni Gandhi: It's like this digital fingerprint to say that “this is the image that it says it is”. I think it's a different kind of approach to the question of deepfakes. We talk a lot about finding algorithms to unmask them or combat them, but this is more about data quality almost.
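The "digital fingerprint" idea can be illustrated with a cryptographic hash. This is a minimal sketch that assumes the fingerprint is a plain SHA-256 digest of the file's bytes; the episode does not say how the startup actually computes or stores its fingerprints, and the names `fingerprint`, `is_unaltered`, and the sample bytes are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that acts as the file's fingerprint."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, registered_fingerprint: str) -> bool:
    """Check whether the bytes still match the fingerprint recorded earlier."""
    return fingerprint(data) == registered_fingerprint


# Hypothetical stand-in for the bytes of an image at the moment of capture.
original = b"pretend these are the raw bytes of a photo"
registered = fingerprint(original)  # recorded with a trusted party at capture time

# Changing even a single byte produces a different fingerprint.
tampered = original + b"\x00"

print(is_unaltered(original, registered))
print(is_unaltered(tampered, registered))
```

A hash only shows that a file has not changed since it was registered. It says nothing about whether the file was genuine when it was registered, which is why the question of who does the registering still matters.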

Will Nowak: Broader theme that maybe we'll talk about today on the podcast. Do you go with a centralized authority? This is what all the blockchain aficionados are all about, is that previously we had to have a central bank that mandated what a dollar was. Right now, Bitcoin says you don't need to have a centralized authority that mandates that this is indeed a certified dollar. Instead, you can trust this other entity which is the blockchain. It is cryptography. In general here, it's like, do we want to have one central authority that's going to take all the images in the world and say this is a trusted image and this is not a trusted image? That's one approach — the centralized authority. People have to go to them to get their images stamped for approval. Then I think what you’re contrasting that with, correct me if I'm wrong, is a more “tech-first” approach where you're saying we could train a more sophisticated algorithm to look at deepfakes, look at the image created by a deepfake, and understand, hey this image, there's something about the way that this image looks that tells me that this image is not a real image, but it's a fake image. That's the “tech-first” approach as opposed to the central “authority-first” approach. Is that correct?

Triveni Gandhi: Yeah, I think that's it. It's a “data-first” versus “tech-first.” Or data privacy sort of, or data validation.

Will Nowak: So which one's better?

Triveni Gandhi: I don't think either is. I think it depends on the case, the use case.

Will Nowak: I disagree. I feel like if we have the tech aspect of this work sufficiently, then that's just better, because you don't need to rely on people to change their behavior. You can just have every TV that ships has this algorithm built into it, which says we can just read the images that are coming across the screen and throw up a warning whenever those images are deepfake images according to the algorithm we're using behind the scenes. It would be so much easier to implement that. They can produce TV as always, and then you would just have a model watching these images and this model would flag and tell you when it was a deepfake or not. Whereas if I have to submit every time I want to publish something to the world, I have to go run it by a trusted authority who says, Will, you haven't doctored this image, you can then publish it. That to me seems much more bureaucracy. It might be impossible, so we might have to rely on good old fashioned stamps of approval.

Triveni Gandhi: Well it's a combination. So, one, the idea that your television could have an algorithm to unmask a deepfake or alert you, that's going to be out of date really fast. Then, part of the issue with the “tech-first” approach is that we have to stay up to date on bad actors and how they're constantly evolving their methodology, which in turn makes it harder to build out actual protections.

Will Nowak: Just a fool's errand. It's going to be impossible to constantly stay ahead of the deepfakers through technological means and, therefore, we have to rely on governing bodies and social processes.

Triveni Gandhi: Okay, we need to improve our technology around understanding deepfakes and unmasking them and stop giving people capability to make deepfakes. We also need to know that there's some source of truth, ground truth, and this digital fingerprint thing is one way to do it. But, then, who is this company to say that we're the trusted authority because we blockchain fingerprinted it? It brings us back to that central social problem of who do we trust regarding AI, regarding how AI should be governed and managed? Unfortunately I think that's just the crux of the issue, is this: Where do we place our trust to know that what we're doing with AI or with the outputs of AI is true or correct or not harming people?

Will Nowak: Yeah. I think on this podcast and people listening to this podcast, they're presumably fans of tech and think that technology can solve problems in the world. I think the little that I've learned about this, I've continued to be surprised in terms of how much our past and current world does rely on good old fashioned trust. Obviously central banks and dollar printing being one example of this, but also I use Google Chrome as a web browser and you know if you type in a URL, there's a little lock that appears by the URL. Have you ever seen that lock?

Triveni Gandhi: Yeah, the HTTPS thing?

Will Nowak: Yeah. Again, exactly. If someone goes to HTTP versus HTTPS, the S is indicating that this is a secure connection. So, in part, this idea of a secure connection is there are web-certifying authorities that say, hey, if you want your website to be certified as trusted so you can have a secure connection between your website, which we understand, and your clients, you have to come to us. So we've said that certain people, Google is one of these organizations that we've socially trusted to hand out web certificates. So it's actually not that high tech in some ways.

Will Nowak: At the end of the day if I, Will, want to make a website and I want my website to be a trusted entity, I've got to go to some other trusted entity like Google and say, hey Google, can you sign off that I'm really Will? This is really Will's website and then when people come to my website, you can back me up and say this guy's legit. That layer of bureaucracy, even though tech is making it better, does still exist.

Triveni Gandhi: Right, but then, why do we trust Google? It's because they've been around for a while, because they're huge. We like to default our trust into large institutions that have implanted themselves in certain ways. That's why Google and Twitter and even Facebook carry weight in terms of decisions they make or how they share information or whatever news filtering they are doing.

Will Nowak: I agree with you that we tend to trust big entities, but if I'm going to trust someone, I think in large part, one thing that both this startup and Google can share is that trust comes from doing what you say you will do and providing a way to validate that. So maybe a small startup that's small and young can still gain trust quickly.

Triveni Gandhi: Yeah, I mean hopefully. I think that when it comes to trust in terms of images and what we're seeing on the internet now, we're stuck in a place where everybody wants to have their own version of the truth. Yes, 95% of us can agree and say startup X is the one who's going to fingerprint all digital assets and we trust it, but there is going to be 5% that's like, no they're not trustworthy. Our version over here is. So, even though truth should be objective, we're always finding a way to make it subjective. So, I think going back to your point Will, for me it is a “tech-first” and a data quality, data validation approach to combat deepfakes. I don't think we can rely on one or the other and as AI practitioners, we should be pushing ourselves to be ahead of the curve, ahead of the fakers because they'll always find a way to improve.

Triveni Gandhi: We have a responsibility to keep working towards heading them off or being one step ahead of them.

Will Nowak: The advice here to people who are building AI and who are working in the math and the tech space, it isn't any different. It's your responsibility obviously to be a good citizen and in terms of working hard to stay one step ahead, that's just part of the game. I guess I would also say to this point how and what we prioritize in learning curricula is also relevant here as well. So, if people who are learning AI are thinking like, okay, one threat to the universe is fake truth and using AI to perpetuate fake truth, then maybe one course in any AI curriculum is best practices in AI to combat perpetuation of fake news through deepfakes. So that's also a related topic that we should get into a little bit.

Triveni Gandhi: Okay. Now it's time for “In English Please.” So, keeping in the spirit of our discussion today, Will, can you explain autoencoder neural networks in English please?

Will Nowak: Yeah, I can definitely try Triveni. We've talked a bit on the podcast previously about neural networks. They are what they sound like. They're networks, networks of data, and oftentimes making the Shrek reference, we like to say that neural nets have layers.

Triveni Gandhi: Like an ogre.

Will Nowak: Exactly. The concept of an autoencoder is that we're passing in some data as an input to our neural network layer. The whole goal of the autoencoder is to take the input and the output is to produce that same input again. So, if I pass in a picture of my face as input to the model, I want the model to produce as output a picture of my face.

Triveni Gandhi: Okay.

Will Nowak: The key innovation in autoencoders is that the dimensionality of the input is probably pretty high. You think about an image, it's going to be all the pixels in the image of my face. All those pixels are going to be passed as an input to the autoencoder model. Then, what happens, if you envision a funnel, is that that data gets funneled down, it gets shrunk down, to a layer of many fewer dimensions in the middle. So, it gets squeezed. This squeezing is taking the image in my face and compressing all the data and all those pixels into a small amount of nodes. Then, from that small amount of nodes, we then use that information to recreate the image of my entire face. So, it's taking input, it's producing the same input as output, but it's shrinking the information in the intermediate step.

Will Nowak: So, the reason why this is important, particularly with things like deepfakes, is you can imagine training an autoencoder on many images of my face. I'm getting a neural network model to become really good at taking images of Will's face, shrinking them down to the key components that we need to know, and then expanding them back out to images of my face. The trick here is maybe I get some video recording of Triveni speaking, I take the video of Triveni speaking, and I pass it as an input to the Will autoencoder model. When it takes Triveni's face as input, but as output, it's trained to produce my face, now suddenly we have an image that looks like me speaking as Triveni was previously speaking. This is the core concept of how autoencoders work and how they're used, particularly in deepfake models.
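The squeeze-and-reconstruct idea Will describes can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not a deepfake model: it uses a linear autoencoder with no activation functions, synthetic vectors stand in for images, and the sizes, learning rate, and variable names are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for images: 200 samples of 64 "pixels" that are
# really driven by only 4 underlying factors, so they can be compressed.
n_samples, n_pixels, n_hidden = 200, 64, 4
factors = rng.normal(size=(n_samples, n_hidden))
mixing = rng.normal(size=(n_hidden, n_pixels)) / np.sqrt(n_hidden)
X = factors @ mixing

# The encoder squeezes 64 values down to 4; the decoder expands 4 back to 64.
W_enc = rng.normal(scale=0.1, size=(n_pixels, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_pixels))


def reconstruct(x):
    """Pass the input through the narrow middle layer and back out."""
    code = x @ W_enc      # compressed representation (the "funnel")
    return code @ W_dec   # attempt to recreate the original input


def mse(x):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((reconstruct(x) - x) ** 2))


initial_error = mse(X)

# Plain gradient descent on the reconstruction error.
learning_rate = 0.005
for step in range(3000):
    code = X @ W_enc
    error = code @ W_dec - X                      # reconstruction minus target
    grad_dec = code.T @ error / n_samples
    grad_enc = X.T @ (error @ W_dec.T) / n_samples
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = mse(X)
print("error before training:", initial_error)
print("error after training:", final_error)
```

The target of training is the input itself, which is what makes this an autoencoder: the only thing the model is rewarded for is getting the same data back out after it has been squeezed through the 4-value middle layer.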

Triveni Gandhi: That sounds like a nightmare, but thanks for explaining that in English. One thing we've talked a little bit about previously but not super in depth, is the idea of upskilling. For people who are interested in getting into data science or maybe they're data adjacent and they want to be more hands on, there are a lot of ways to go about it. There's actually a lot of information and there's so many ways you could do it that I think it gets overwhelming to make that choice. So, Will, I wanted to get your thoughts on where you think people should start.

Will Nowak: Sure. I know you have lots of thoughts on this as well. I guess I would start if someone told me that they wanted to get more technical, which is an ill-defined term, but let's stick with it. I would again pose back to them a question and say, "What's your ultimate goal?" The ecosystem, to use a phrase that's almost empty, but I think it's still definitely true, is growing rapidly. So, now do you want to be a data scientist, a data engineer? Do you want to work with distributed data? Do you want to work building beautiful visualizations? What do you actually want to do? Why do you want to get into the data science space? So, let's think more about your goals and then focus your energies on learning the fundamentals. That would be my starting question to them.

Triveni Gandhi: Yeah, I think that's fair. Especially because there's so many technologies. There's so many different approaches. People still fight over Python versus R. How do you know where to start. I think it depends on the answer to, well what do you want to do?

Will Nowak: It used to be if you wanted to do data science, make sure you understand your basic statistical tests and models and you can predict things in a supervised or unsupervised way and boom, you're a data scientist, good for you. But now we talk about data adjacents, but even the data science or the data ecosystem... I think we talked previously on an episode about neural networks. It's neural networks versus non-neural networks as classes of algorithms, if you want to do predictive stuff, I think you need to make more decisions, which is scary because it gets into this decision of, are you cutting yourself off and saying I'm only going to focus on neural nets and never learn anything about this other class of algorithm?

Will Nowak: But I think that more and more you probably need to make those decisions earlier now because the ecosystem is so fast. You can't learn everything.

Triveni Gandhi: No, you can't. Even within certain things like neural nets, there are different libraries and different backends and whatever that you have to choose from. And choose to learn from. I think it can get pretty overwhelming for the average person just walking in and saying, "I want to learn how to do a neural net, but do I use Keras? Do I use TensorFlow? Do I use PyTorch?" I guess that's a bigger question along the lines of “How do we make choices around what we should learn or upskill in when there are so many options out there?”

Will Nowak: Yeah. I think there's many different heuristics you could use. What I want to learn, what I think will be valuable for me. What I'm interested in. What other people have had success with in the past. What people think is going to be successful in the future. There are so many different decision rules you could use to help you figure out how to learn. A lot of that's just personal philosophy, but I guess the informed aspiring data scientists should keep that in mind.

Triveni Gandhi: I think this raises an important point around upskilling. There is a lot of material out there on how to learn how to do machine learning, how to do AI. Actually a lot of different companies are now putting out their own documentation and coursework and everything. I guess the question here really is, how do you upskill in a way that is both trusted and not narrow in focus?

Will Nowak: Definitely relevant. Definitely ties into our previous discussion. Just recently, Google Cloud I saw expanded their offerings to give anyone who wants it 30 free days of training. In the Google Cloud training, obviously you're learning things about data science and data engineering, but you're doing it on the Google Cloud platform. So that to me, is making this particularly salient and relevant at the moment, is how do you learn in a responsible way? Is there a problem there? Is there a conflict of interest?

Triveni Gandhi: Well, I mean not even conflict of interest, but how do I... I'm a brand new entrant to the data science field and I want to learn how to do whatever it is. Do I go and I learn Google Cloud? Should I learn AWS? Should I learn Python? Should I learn R? Should I go back and read a stats book end to end?

Will Nowak: It used to be if you want to get better at data science, just go learn math. Math is very free. Anybody can learn math, that's great. But now if you go to a data science meetup, everyone says, "Oh, how should you prepare for your new job? Got to know about the cloud." The thing that sucks about that is a cloud is not free. Also you have to make choices and say, I think to your previous point, do I want to get stuck in with Google or do I want to use Microsoft or Amazon or some other provider? There's a lot of risks and also expenses that I think data science learners are dealing with in 2020 that they weren't dealing with as little as five years ago, maybe even more recently.

Triveni Gandhi: Oh yeah, definitely. I think you and I are both self-taught data scientists and I never really worried about what cloud provider I was using when I was learning. I was just more about what are the functions I need, what are the libraries I need, how do I make this graph pretty and centered on the page? Those were my big concerns, not these other things. Now even when I started at Dataiku a year and a half ago, I really felt this need to understand Hadoop and MapReduce and Spark. I can see already that the field and we have shifted more towards Kubernetes and different ways of managing your cloud compute environments. Even within a year and a half or whatever it is, we switched to some new technology that is the up and up. I think that for people who are looking to upskill into data science, you will never go wrong by learning the foundations, the true basics. Having a general overview of the other parts of the GCP environment versus the AWS versus Azure Microsoft is good to know, but I don't think unless you know that I want to work for company X that uses this product, you're probably not going to want to try and specialize upfront.

Will Nowak: I don't think you want to specialize, but I do think it's an issue. At the moment, a lot of progress in the data science space is just coming through still big data and big compute. Do you want to think about things like distributed model training or just really big containerized servers, basically thinking about instead of training something on your laptop, training on a big machine in the cloud. So again, it used to be I could just do it on my own. Now I need to say, okay well which cloud vendor am I going to run this model on? Because if I don't pick a cloud vendor, I can't run this cool model, and I'm not going to be a relevant data scientist. So that's a bummer.

Triveni Gandhi: I don't know.

Will Nowak: Losing a sense of equity in the data science world I think, because it's getting so complex.

Triveni Gandhi: I disagree in a way. What you're talking about is enterprise production-level data science at scale. I just threw like seven buzzwords at you I know. But the idea is that you're talking about big data, big servers, big compute, and that's the kind of problem that I would say 90% of data scientists are not worried about. Because they're working in teams that are dedicated to bigger MLOps or they're working at a company that isn't in the big huge data space or they're someone who wants to start being in the data science space. They want to become an analyst or an engineer or whatever, but they're not going to immediately start by going into big production level data science. So, sure, we need to keep up to date with all of these things, but I don't think that there's any replacement for the foundational knowledge that you'd need to be able to say, I understand this particular thing in depth very well. The actual library and package, it's just a matter of relearning some code. If I go to Google and I learn neural networks with TensorFlow, but I don't really learn neural networks and I just learned how do I manipulate this API to get what I want out of it, when I go somewhere else and try to use PyTorch or, I'm not going to understand what I'm doing differently and why.

Will Nowak: Yeah, no, surely you need to understand the fundamentals, but I don't know. I just think that if you're someone who wants to do something innovative at this point, instead of just relying on a more simple set of base tools. Like all I need is Python and NumPy and I can start writing my own crazy cool libraries and algorithms and can do something really amazing on my own. I don't think that's really the way interesting progress happens at this point in time. I think it is, and you can continue to push back on this, but to my mind it's reliant on other foundational tech. Then again to this point of equity, that foundational tech either costs money or it's tied to a corporation. So TensorFlow is free, but to some extent, by aligning yourself with TensorFlow, you're aligning yourself with future movements made by Google. The school will continue to support that library.

Triveni Gandhi: This actually reminds me of a bit of a Twitter debate I saw earlier today. In fact, there's an argument that these big huge funded AI labs, the Googles, the Facebooks, even academic labs that have a ton of funding behind them, can take that time, take that effort, and build out all these great innovations, but also spend money on good PR to promote their innovation and to promote what they've done. So, people like you and I then are like, well look at what Google did. Look at what big lab X did. So now that must mean they're innovators. But in fact, there are probably a lot of people who are innovating at certainly a smaller scale but are still innovating. They're not getting traction. They're not getting any sort of weight because they don't have the big money behind them. So that's one thing.

Triveni Gandhi: But then in terms of upskilling as someone entering the field, I don't expect someone entering the field to be innovating right away. So I don't think they need to have big data compute.

Will Nowak: Sure.

Triveni Gandhi: Then the other thing is that if we really want our data science and our AI to be equitable, then it's less about whatever technology and more about the choices we're making with that technology.

Will Nowak: True. That's a deep cut that I can't argue with. I do think, going back to our previous conversation about deepfakes and who you trust and fake news and all that jazz, one thing that was a tool I definitely use probably earlier in my career I guess, was I believe it's the Stack Overflow developer survey. I'm sure many of our listeners are familiar with Stack Overflow and really popular community forums for all things tech, but their developer survey is what it sounds like. Asking software engineers a bunch of questions. To your point about not having the budgets to push this particular model or this particular technology, instead of relying on the big pocket organizations, I guess I would encourage people out there to check out things like that where it's a more democratic representation of what's good and what people like, what they find value in, and then you as an aspirational data scientist or an aspirational data science team can say, I'm not going to believe the marketing hype. I'm just going to look at this survey and see what people are actually using and have positive things to say about.

Triveni Gandhi: Yeah, it's definitely part of that trust issue to say that we have group-verified or group-sourced this. That makes it more believable. I hear what you're saying. I do think that there is something to be said about learning those big fundamental technologies, whatever it might be. But I don't think a brand new team or a brand new data scientist needs to pick a battle here. I think it's more about getting your base and then figuring it out based on my need, based on my use case, based on what I want to do, and what's best for me.

Triveni Gandhi: All right, Will, so it's the end of the episode. Usually this is where I provide some sort of crazy fact for you.

Will Nowak: My favorite part of every episode.

Triveni Gandhi: Favorite part, but I'm actually changing it up this season.

Will Nowak: Oh.

Triveni Gandhi: Yeah. So now we're going to do math brain teasers.

Will Nowak: Oh, this is going to require more of me. Alright.

Triveni Gandhi: Yeah. So I'm going to give the brain teaser and then in our next episode, we'll give the answer. So you have two weeks to work on it. We'll see how well you did.

Will Nowak: Okay.

Triveni Gandhi: Okay. So our first brainteaser of season three, you are going to write an equation, a mathematical equation, where the answer comes out to five but you could only use the number two and you can only use the number two twice. You can use any kind of mathematical symbol. So you could do parentheses, you could do exponents, factorials, decimals, square roots, anything, but you could only use the number two twice. We'll see if you get it next week.

Will Nowak: That's all we've got for today in the world of Banana Data. We'll be back with another podcast in two weeks. But in the meantime, subscribe to the Banana Data newsletter to read these articles and more like them. We've got links for all the articles we discussed today in the show notes. All right, well, it's been a pleasure Triveni.

Triveni Gandhi: It's been great Will. See you next time.