What Makes Good Data Science?

Scaling AI Lynn Heidmann

Different people across an organization might have different definitions for what makes good data science. For data leaders, it might be impact on the business, while for people like data scientists or engineers, it might be more detailed and nuanced - like the quality or accuracy of the model. How can everyone across the organization find common ground for what makes good data science? This episode of the Banana Data Podcast has answers.

 

Not a Podcast Person?

No problem, we get it - read the entire transcript of the episode below.


Will Nowak: Welcome to the season finale of the Banana Data Podcast. So after an exciting nine episodes, we're mixing it up a little bit for this one with a new format.

Triveni Gandhi: We've talked a lot this season about the different practices that make for good and bad AI. So today Will and I are going to recap what we think are the most critical practices for anyone looking to build a scalable AI solution.

Will Nowak: Let's dig in.

Triveni Gandhi: Yeah, I'm going to start off with what I think is really important, and conveniently it starts at the beginning of your pipeline. I'm going to argue that people should be intentional with their data, and what does that mean? Well, build the process of collecting the right data into your practice from the get go and give it just as much importance as the modeling and the final prediction and all the tweaking that you do. I think so much of our focus gets put on, "This is the greatest new method. This is the greatest new learner," whatever it is, but your data's bad, it doesn't matter.

Will Nowak: Yeah, so as I think we'll find many times during today's episode, kind of intentionally, we're hearkening back to some of our previous conversations. I remember we had a disagreement before, and I think we're going to have that same disagreement again with some new light shed on it, and to me, I agree that we need to have good data, but I also think it's important when working in an organization not to lose sight that the work you're doing, it's not a silo, but it's part of a broader organizational goal, and so I think what I hear you saying is the data scientists and the data engineers in an organization need to be championing data quality and data acquisition, but if you're in the C-suite of an organization, your big organizational goals are not just the data, the data is serving a broader purpose. You can't say ... if we're selling widgets, if we're selling some good, we're not going to shift our sales practices so that it somehow improves our data quality.

Triveni Gandhi: Of course, but I do think that folks in the C-suite at least can indicate to their tech leads or to their IT teams that are doing the data collection, "This is a priority. You need to go out and find the best ways for us to collect high fidelity data," and so one thing I saw recently was an article about location-based advertising practices, and in fact a lot of money gets poured into these apps and technologies to track where you are at any point to then give you localized ads, but it turns out in some of these cases, the GPS data or the location data for a person on their phone isn't very accurate or it's not very specific, and so the targeting itself is really, really poor, and I think actually the article was arguing that people lose up to 30Ks worth of money just on these poor targeting practices.

Will Nowak: Campaigns that are kind of wastefully done.

Triveni Gandhi: Basically wasteful campaign spending.

Will Nowak: Yeah. I mean this will get into something that I want to talk to you about in a little bit, but just this idea of retrospection, and again going back to the bottom line, whatever that bottom line may be, in general for an organization, one bottom line is probably always efficiency I would imagine.

Triveni Gandhi: Sure. Yeah.

Will Nowak: And so I agree with you in this point that if you think you're pursuing goal X, which in this case is location-based advertising, and you have data, but that data quality is poor, then for sure it's a waste of everyone's time, and in this case probably lots of money as well.

Triveni Gandhi: Yeah, and so I think that's why yes, okay, data scientists, data engineers, all those folks need to be the champions for good data, but if they don't have the right infrastructure in place because someone at a higher level hasn't indicated that as a priority, then there's only so much they can do. So part of it is not just like, "Okay, we need to get good data into our systems," and all of that, but also interrogating your data upfront so that you know what your limitations are and what you will be able to confront or not confront.

Will Nowak: Yeah. A question I have for you, kind of following on that, though, is you say be intentional with your data, and to me that implies that we have a hypothesis, A, we collect data, B, we use that data in some process, C, and then we derive business value, D.

Triveni Gandhi: Yeah.

Will Nowak: A linear pipeline, linear process. And that makes sense, but how then do you square the circle, adding in space for kind of data exploration in your ideal data practice, Triveni?

Triveni Gandhi: Well, I mean -

Will Nowak: Because sometimes you get data and you're like, "Well, this is an interesting bit of data exhaust that's coming from the organization." It wasn't gathered with an explicit purpose, but in theory, the creativity of a data scientist is meant to shed insight, find patterns, find some sort of business value where no one expected it to be. So how do you think about that?

Triveni Gandhi: Well, I think the idea that it's A, B, C, D is false. It never really is. Even in an ideal situation, we're always using our data to help drive our hypothesis that we're trying to build. We don't start with a hypothesis and then go look for the data to support it. We start with a hypothesis, find the data, find out it doesn't really support it or it doesn't even make sense, but it answers other questions, and so then you start driving forward with those.

Will Nowak: I agree with you, but I think some might disagree. I think we've talked about previously how the promise that some people think ... and I think to some extent it's true, thinking about unsupervised learning techniques, and you and I think kind of agree that data science should still be a human driven affair.

Triveni Gandhi: Agreed.

Will Nowak: But some people say, "Hey, let's just throw big data into the machine and then novel insights will come out," and sometimes that does happen. If you think about, again, going back to the advertising use case and you want to come up with buyer personas to initiate a marketing campaign, if you have a huge data set of all of your users and you throw it into an unsupervised learning algorithm and you get these clusters, these are not necessarily clusters that fall along simple demographic lines, but somehow the algorithm understood this group was like this group was like this group, and now you're going to market differently based on kind of magical AI.

Triveni Gandhi: Yeah, but see, I think that's where the problem is. When you rely on this magical AI, you actually create a lot of places for irresponsible AI, and so what I mean by be intentional with your data is understand your data so you understand what you're working with so that you're not trying to draw conclusions on things that don't even exist. So I could take, for instance, a bunch of marketing data or demographic data, throw it into an unsupervised cluster model, get my clusters, use that to then predict who's going to spend the most money or who should I target certain ads to, but if I'm not paying attention, it's possible that my data is skewed really heavily towards one demographic over another, or it's completely missing a subset of the population because it's not a universal sample. If I don't recognize it up front and build that into my process, I'm practicing irresponsible AI, and I think that's something we've talked about on the podcast is how to be better data scientists, not in terms of the best models, but in terms of ethical, clean, above board things.

Will Nowak: Yeah.

Triveni Gandhi: I am arguing that it starts at the very beginning with your data.
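One way to picture the up-front check Triveni describes is a small Python sketch that compares who is in the sample against who should be. It is not from the episode: the function name `representation_gaps`, the age bands, the reference shares, and the 10-point tolerance are all assumptions made up for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.10):
    """Return groups whose share of the sample is off from the reference.

    sample_groups: one group label per record in the data set.
    reference_shares: assumed share of each group in the target population.
    tolerance: largest absolute difference in share that is left unflagged.
    """
    if not sample_groups:
        raise ValueError("sample_groups is empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical marketing data set: heavily skewed towards one age band,
# and missing another band completely.
sample = ["18-34"] * 70 + ["35-54"] * 30
reference = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

gaps = representation_gaps(sample, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of sample, {expected:.0%} expected")
```

Here the check flags the over-represented "18-34" band and the entirely absent "55+" band before any clustering or prediction is run, which is the point at which the limitation is cheapest to act on.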

Will Nowak: Yeah. So this makes me think about data leakage, which we've talked about a little bit in the past, but not a ton. Just again for our listeners, so data leakage is when you include a feature or multiple features in your set of predictors that are directly associated in some way with your target variable. So if I'm trying to do a binary classifier to say who are my high spend customers, who are my low spend customers? A very naive feature to include in that would be the amount that they have spent with us historically.

That's a stupid example, but in general this idea that if you include something in the feature set, it's going to directly predict your outcome variable and you're going to get a great result as a modeler that you didn't necessarily expect, and so what I wanted to say to you is again, I agree with you that this balanced approach of knowing your data, and so then in this example, if you have leakage and you have a 99% accurate model, maybe you should say, "Wait, does this pass the smell test? Or maybe there's something wrong. There are some assumptions about my data that I wasn't aware of. I need to go back and revisit those." So I think that's true, but you still want to, I mean we do want to make sure we're leveraging algorithms and getting the good from them, but I think your point of being a responsible practitioner is you know the data too, so if the algorithm says something that's too good to be true, you can call its bluff.
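Will's high-spend example and his smell test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything shown in the episode: the customers are synthetic, the feature names `visits` and `total_spend` are invented, the "model" is just the best single-threshold rule, accuracy is measured on the same rows it was tuned on, and the 0.99 limit simply echoes the 99% figure Will mentions.

```python
import random


def make_customers(n, seed=0):
    """Build synthetic customers; every number here is made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        visits = rng.randint(1, 20)
        # Spend loosely tracks visits, with a lot of noise on top.
        total_spend = visits * 10 + rng.gauss(0, 60)
        rows.append({"visits": visits, "total_spend": total_spend})
    cutoff = sorted(r["total_spend"] for r in rows)[n // 2]
    for r in rows:
        # The target is derived directly from total_spend: that is the leak.
        r["high_spend"] = r["total_spend"] >= cutoff
    return rows


def best_threshold_accuracy(rows, feature):
    """In-sample accuracy of the best rule of the form 'feature >= t'."""
    best = 0.0
    for t in sorted({r[feature] for r in rows}):
        correct = sum((r[feature] >= t) == r["high_spend"] for r in rows)
        best = max(best, correct / len(rows))
    return best


def smells_leaky(accuracy, limit=0.99):
    """Flag results that look too good to be true."""
    return accuracy >= limit


rows = make_customers(200)
leaky_acc = best_threshold_accuracy(rows, "total_spend")  # perfect by construction
honest_acc = best_threshold_accuracy(rows, "visits")

for name, acc in [("total_spend", leaky_acc), ("visits", honest_acc)]:
    note = "check for leakage" if smells_leaky(acc) else "plausible"
    print(f"{name}: accuracy {acc:.2f} ({note})")
```

The rule built on `total_spend` scores perfectly only because the label was computed from that same column, while the rule built on `visits` lands well below the limit. A flag like `smells_leaky` does not prove leakage; it is only a prompt to go back and revisit how each feature relates to the target.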

Triveni Gandhi: Yeah, I mean we talk about data leakage all the time. That's one of the things that you get taught as a data scientist is don't let there be data leakage. So maybe let's add a couple of other checks and views up front so that we know what we're doing with our data. That's all.

Will Nowak: Yeah. Cool. So you've talked about data quality, being intentional with one's data. So now I'm going to bring it up a notch, thinking about good data science practices in an organization, and to me, the title for my best practice is to invest in a shared environment, but by shared environment I am talking about technologies. So tools that you, your practitioners, the people who are doing this work in your organization, what are they using to do the work? So we've talked about Jupyter notebooks and we've talked about the Cloud. The Cloud means many things, and we can touch on that a little bit right now. We've talked about databases and data storage technologies and API nodes and API queries. So all of these things I would say are tools that are used in an efficient data science practice at an organization, and so by a shared environment, I've seen in my experience, there are a lot of benefits for organizations that bring and consolidate all of these tools into one central place.

So to me ... and we can talk about why I feel passionately about this, but I think the first thing that comes to mind is ... and I think that it's something that we've talked about in terms of the ethics of it, bringing more people to the table, so to speak. You get more viewpoints involved in things like machine learning model building, machine learning model assessment, and there we're going to, I think, hopefully do better to improve ethics and make less of these huge botches when it comes to models that were deployed and oops, that model was terribly inappropriate in this way or that way.

Triveni Gandhi: Yeah. Well, I wonder too, how does this sort of encourage collaboration between these different groups? So if everybody's on the same page, everyone knows that I can access data inside of this database or I can build a model or build some sort of thing and share it with a totally different team, how is that going to help improve collaboration and in turn sort of improve the overall value that the teams are providing?

Will Nowak: Yeah, I mean it's definitely true. I think you would agree that collectively we're all much more intelligent than any one of us individually. So I think by bringing people into the same space ... and so part of this is just knowledge management. So in your organization saying, "Okay, we have a good data catalog," so people know where to go to get that table, to get that information on our customers, but also just logging into literally the same web portal to do all those tasks. I think there are efficiencies that you get from that, and something that's a little bit more specific on this point, just this idea of technicality in users I think is something that a shared environment can help alleviate.

So what I mean by that is it used to be if you were a data scientist, you were a super proficient coder, you knew five different languages, you knew the math, you knew all of these super specific skills, but then someone else who is a "business user" or a business practitioner in your organization, it was like you had to translate between two different languages to work towards one shared goal, and so now again, there are some tools in the marketplace and also just in general, if people are inhabiting the same "space" and you and I ... I can log in and see your work and then I can ask you in that platform, I can comment on your code or comment on whatever the work that you've done is and say, "Triveni, what is going on here? Can you explain?" I think that's going to make those handoffs and make that productionization much more efficient.

Triveni Gandhi: Well what would you say to maybe the data scientists and the engineers who kind of like that they're siloed? It sort of elevates them, it gives them their own space to prototype fast, do all of these things, and they're not being slowed down by, "Oh, the resources on this are limited," or, "I have to go and explain it to so-and-so." How would you address those concerns?

Will Nowak: That's a great question. I think two things, the easier one first. To the point about, which I liked that you brought this up, fast prototyping. Again, I think in the past it used to be ... and we talked about this previously too when we talked about Tableau and their recent acquisition, this idea of standalone data vis. I believe that was our last episode. Again, standalone data work I do think is kind of dead, or at least it should be dead in a data proficient organization. You have people who are sourcing data, massaging data, and then using it for reporting or APIs on the output side. That's all happening, and again, I think to rapidly prototype that you need to have this sort of shared environment, whereas before it was like, "Oh, well I just have some cool PDF report and I'm presenting to you, boss." I can do that in a standalone fashion, whereas now there's so many pieces that are so tightly woven together that actually to prototype quickly, again, I think you need this shared environment.

So that would be point one. I think for that quick prototyping, which I think is important, you want to have a shared technological environment for all of your data practitioners to use, but then to your second point about the reticence to leave your safe space, I think that's very true. Having worked pretty broadly with different organizations and different personas in the data science space, we're all creatures of habit and we also like safe spaces, so I get that. I think that's something that's incumbent on leadership just to say, "Hey, from an organizational perspective, if you data scientists are never talking to the people who are consuming your reports or you're never talking to people who are sourcing and kind of acquiring the data that you're using, there are going to be large penalties for us to pay, broadly speaking. So though it might be difficult for you to have to go talk to Triveni, or if you have to log into the same shared workspace as Triveni and share work tickets with Triveni, because you're working on ultimately the same data project, it might seem not fun, but I think it's very necessary for the integrity of your organization."

Triveni Gandhi: Yeah, I mean that sounds like, again, another thing that can be implemented or done very practically at the technical level but still requires commitment and devotion from C-suite.

Will Nowak: Yeah, and I would say something I've also seen done quite well in some companies that I've worked with and not as well with others, is the power of marketing. So not just marketing to your external clients or customers, but marketing things internally. So if you have some major initiative, and in general, I think if organizations are trying to be data-driven as opposed to just standalone insights, but they're really trying to make themselves a smart organization, which I think can be done, they can't just say that they're going to do it and have it happen overnight. Changed behavior is hard, but I think marketing makes a difference. So if you say, "Hey, this is the tool that we use and this is why it's good and this is why we think it adds value to your life and the organization's life." You could game-ify the experience. People always love little internal competitions. There are things that you can do more than just telling your employees, "Hey, you have to log into this shared environment and we're now suddenly data-driven." That's not going to work, but if you put in the time, I think the benefits will come.

Triveni Gandhi: Okay. I think that's fair. One thing you said kind of reminded me of my next point, which is the idea about sort of diverse viewpoints in a shared environment.

Will Nowak: Yeah.

Triveni Gandhi: And I, in fact, would argue that a good data practice is diverse, not only from the sort of responsible ethical AI standpoint, but to scale better as the practice grows. So what am I thinking of? Well, specifically I'm thinking about the sort of recent developments in autonomous driving and cars, and there are folks that are now arguing, well Uber and Google and everyone who's making an AI car is actually doing so with data that is very localized. I believe Uber is testing out their cars in a suburb of Arizona, and so the way people drive there, and the way pedestrians work there, or even what's on the street is so different than New York even, or another country.

Triveni Gandhi: I don't know if you've ever been in a car in India, but it is life changing. You think about India, China, places that could theoretically benefit from automated cars, but are dealing with a whole different subset of cultural norms and ways that people interact with the road, what you can expect on the road. If a pedestrian were to cross in front of an AI car in India, the car would probably stop, be like, "Oh there's a pedestrian," but when it's a real driver in India, they can accurately judge, "Okay, I'm not going to run that person over. I can keep going," or, "I only need to slow down a little bit," and so you need to be able to build that in, because if you suddenly stop in the middle of traffic in India, you're going to get a lot of horns.

Will Nowak: Yeah, cause a big traffic jam.

Triveni Gandhi: Big traffic jam.

Will Nowak: I mean this idea of interpolation versus extrapolation I believe is one we haven't touched on in this podcast, but kind of Stats 101. So if you have data that falls between this goalpost and this goalpost, making predictions in here is fairly reliable, but extrapolating, making predictions outside of the range for which you have data, something that typically we caution new students not to do in this realm or to do with caution, and so I think that's a very good point, whether it's autonomous cars or I'm sure other use cases, that's something we're doing too much, but a question I have for you: so diversity is a big loaded term that means many things to many people, and so can you talk more about, when you say being diverse to scale, are you talking about the people who are building your products internally? Before we were just talking about essentially data, training data. What and how are you thinking about diversity, or what's your hierarchy of the most important sort of diversity that an organization should use and pursue first?
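Will's goalpost picture of interpolation versus extrapolation can be made concrete with a minimal Python sketch. It is an illustration only, not something from the episode: the curved relationship y = x², the training range of 0 to 10, and the helper names `fit_line`, `predict`, and `in_training_range` are all assumptions chosen to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def in_training_range(xs, x):
    """True when x falls between the goalposts of the training data."""
    return min(xs) <= x <= max(xs)


# Training data only covers x from 0 to 10, and the true relationship is curved.
xs = list(range(0, 11))
ys = [x ** 2 for x in xs]
model = fit_line(xs, ys)

err_inside = abs(predict(model, 5) - 5 ** 2)     # interpolation
err_outside = abs(predict(model, 30) - 30 ** 2)  # extrapolation

for x in (5, 30):
    label = "inside" if in_training_range(xs, x) else "OUTSIDE"
    print(f"x={x}: predicted {predict(model, x):.0f}, actual {x ** 2} ({label} training range)")
```

With these assumed numbers the straight line misses by 10 at x = 5 and by 615 at x = 30. A guard like `in_training_range` cannot make an out-of-range prediction accurate; it only makes visible when a model, whether a regression or a self-driving stack, is being asked about conditions it never saw.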

Triveni Gandhi: Yeah, definitely. I mean people, data, and just sort of vision. If you're really looking to scale your practice and you want to use AI to grow your business globally, or even to other states, or whatever it might be, or other markets, then you need to know that upfront or be willing to address that up front and not expect to be able to create a one size fits all sort of practice. You will end up dealing with different things, and then practically of course, yes. Getting your data from diverse sources, getting information and getting people in the room who don't all look the same. When you get in diverse viewpoints on a shared platform, you're going to get a better result. It's just common sense. Do you want to be surrounded by a ton of yes men, or do you want people who are going to challenge you? Do you want people who are going to bring in a perspective that you have literally never thought about before because it's just not in your purview? Those are the things that are going to make a more scalable and really self-sufficient and well structured practice versus, "Okay, let's just quickly prototype something. Let's get it ready. Oh it works. Okay, let's just keep building on this and let's get it out the door." Be diverse upfront, and that sort of wide range of viewpoints will reflect itself in the final product.

Will Nowak: Yeah. No, and we haven't really talked much about hiring, but if you're an organization and you're bringing in people, building out these teams, I mean that's something ... it's not a nut that I've cracked. It seems really challenging to me because, again, I totally agree that you want different viewpoints, but just because I have five different people who maybe their skin color is different, maybe they come from different countries, different states, in some ways they qualify as diverse, but they're all Stanford CS grads, well then they have a very common important shared background. So how diverse actually are they? How do you quantify that? How do you think about it, and when you're a leader in this organization, what are the measures of diversity that you're prioritizing? Is it the way they look or is it the way they think or is it the way they code? Again, it's not easy, but I do agree that we can't just have too much homogenized thinking.

Triveni Gandhi: Yeah, I mean definitely diversity in terms of demographics is a huge one, and it's something that people have been talking about for a while, both in tech and AI, but definitely the diversity of background is really important, and strengths. I think so often we go out to hire someone and we say, "Well, this person is really good at doing these things that I can do really well too, or that people on my team can do really well, and so this just makes sense because they could probably pick up work from X, Y, and Z," but in fact, if I went out and looked for someone who has skills or an experience or a way of thinking that filled a void on the team or was something that we had never seen before, that actually might end up being better for the long-term team, instead of just getting more and more clones of the same people.

Will Nowak: Yeah. True. So this is a nice segue into I think our last point, which is my second point on what I would do to build an efficient and successful data practice, but this idea of bringing in different perspectives. To start, maybe I'll quote Donald Rumsfeld, which you maybe wouldn't expect me to on this podcast, would you?

Triveni Gandhi: No. If I had known that in advance, Will -

Will Nowak: You would have come more prepared, right?

Triveni Gandhi: Probably.

Will Nowak: No, but just briefly, I believe he's often credited with being wary of the unknown unknowns. So what don't you know you don't know? And so again, this idea, I think we can agree with Donald Rumsfeld that that is a threat. So if you have all this homogenous thinking in one room, you're not aware of your own biases, you're not aware of some exciting possibility that exists for you to tap into because you're so stuck in group think. So you talked about diversity and people and data, and I was thinking that what I would encourage data leaders to do is to invest in external consulting, and by this I mean consultants, people that you bring in to advise you, and I can talk about who exactly I think would be good for that, but in general, prospective consulting, and then we've also talked about kind of retrospective auditing, and so I think in general, bring in these outside service providers to help you, particularly as you're an organization, as pretty much all are, looking to kind of scale up and become more data-driven.

I think that just in general when you're trying to do something new in life, it's good to have a coach, someone to help. I think it's smart for people to look externally for help, A, because there's probably some experts out there that know the data space. They might not ... of course they probably won't know the individual business, but they'll know the data space, and then also even if they don't know the data space, which they should, but even if they don't, they have that external viewpoint. So I think those two assets in combination are really valuable, and something that ... again, I see some of the clients that I work with investing in outside opinions and investing in outside resources to upskill them internally, and some people, honestly, I think it comes down to a bit of hubris in thinking, "Okay, we've got this. We don't need anyone to tell us what to do. We know our business and we're smart. We can figure out AI on our own," but I think that's a little bit naive.

Triveni Gandhi: Yeah, I mean, I think it's naive. My only concern is that the consultants that are coming in in these data spaces are part of the same system that needs a lot of change.

Will Nowak: True.

Triveni Gandhi: And so I would worry that a company might hire consultants who will just put them back onto the same track that everyone else is already on, and that kind of goes back to even how we train data scientists in boot camps or in machine learning courses, or whatever it might be. So yes, I think that consulting and auditing, especially, are very important, but I would encourage folks to look for consultants or audits that are going to push you, and not just help you hear more of the same, and knowing that you're going in to a consultation expecting to be told, "You're doing this wrong."

Will Nowak: Yeah. I think that's, again, a point I wanted to make too, whether it's that or just more broadly, we've talked about journalistic coverage of AI and how some business leaders will say, "I want to transform my business with AI," but it's been so blown out of proportion, the capabilities of AI, they need someone to come in and again, to this point, temper their expectations, essentially say no. Which is also hard too, because you think about the incentives. If I'm a consultant who's hired to help you revolutionize your data practices, I want to leave having you believe that I've revolutionized your data practices, not leaving and being like, "Will told me no," but I think that's responsible and necessary. So do responsible work and also, more practically just to think about what applications are feasible and not.

Triveni Gandhi: Yeah. Again, I agree. I just worry about consultants who need to be ... like you said, need to be deriving value or showing their value in order to then convince the company that they've actually done something for them, and in doing so, they ignore all of our previous steps. They don't really look at the data, they don't invest in diverse viewpoints, they don't have a shared platform. So it's just about knowing what you want to get out of the consultant first is helpful, and yes, there are things that we don't know, and we don't even know that we don't know them, but hopefully through your own research, through listening to this podcast, you're able to start picking up on some of those things that might be pretty critical.

Will Nowak: Yeah, I mean, to that point, I agree with you that you should have a consultant in a consulting arrangement where you're willing to hear, "No," or you're willing to hear that this is just not going to work out. At the same time, the optimist in me ... and again, what I've seen in practice in my working, kind of in this consulting role, both in my current job and in previous jobs, is that in addition to kind of an overblown conception of what AI can do for you, also a really poor understanding of what are easy use cases and what are hard use cases. So for example, sometimes a lot of what I've heard lately is people have ... think about a data table, different rows that correspond to the same entity, so maybe a person. Say you have one row that says Triveni and then it's got in the next column your social security number, and in the next row it's got Gandhi, and then it's got your date of birth, and that's it, and somehow row one and row two, they're both you, and we want to merge all that information together. That's actually a very challenging data problem that kind of seems like, "Oh, can't we just figure out that these are the same thing? That's simple, right?" It's actually really hard.

Triveni Gandhi: Yeah.

Will Nowak: But then going to the point about autonomous vehicles, the work that's been done in image recognition as you know, in image classification, it's tremendous, and the ability now for any mom and pop shop to get some cameras ... and I realize that cameras have their ethical risks, but get some cameras and start leveraging video and image data and using that intelligently for their business, that's low hanging fruit that I think people just aren't aware of, because it seems hard. You're like, "Man, how would I program a computer to see something?" It turns out actually we can do that in many ways rather well. So that disconnect between what's known, what's unknown, what's possible, what's impossible, I think there's a lot of room for growth there.

Triveni Gandhi: Yeah, and it reminds me of the AI winter situation, because if you're not investing in consulting and auditors to come in and be like, "Hey, what you're trying to do here is a little bit out of bounds," you're going to run the risk of over promising, under delivering, and your organization's AI practice or data science practice is suddenly at risk. So bring in the folks who can give you sort of reasonable expectations for your work.

Will Nowak: Yeah.

Triveni Gandhi: Great.

Will Nowak: So with that though, I think we've kind of recapped.

Triveni Gandhi: Yeah.

Will Nowak: And now it's time for your favorite part of the show.

Triveni Gandhi: My favorite part of the show, the banana fact. Yes. So you mentioned something about image recognition, and that actually ties into what I want to talk about. I don't know if you know this, Will, but bananas are going extinct.

Will Nowak: Didn't know that.

Triveni Gandhi: Yeah, I wanted to save that fact for the last episode -

Will Nowak: It's a little foreboding.

Triveni Gandhi: So we could just kind of end on a really low note. No, I'm kidding. So the current species of banana that are most commonly eaten, the Cavendish species, are actually being affected by this fungal disease that began in Malaysia in the 1990s and it's spread everywhere now. So this is not the first time a fungus has wiped out bananas. So the original species of bananas that people used to eat was wiped out in like 1965, and so when you eat banana flavored candy, that's the flavor of the old bananas -

Will Nowak: That's the old bananas.

Triveni Gandhi: Yeah, which I kind of prefer, because I think current bananas are a little bland. Sorry. Sorry, bananas, but AI is here to rescue us so that we can keep eating our bland bananas.

Will Nowak: Oh wow.

Triveni Gandhi: In fact, a company has built a smartphone tool that allows farmers to take pictures of their crops and look for certain pests that indicate the presence of the fungus so that if they see it early on, they can treat it, address it, and prevent large-scale crop failures. So it's a really cool way to use image recognition to actually help out a lot of folks in low income countries that depend on bananas as part of their livelihood and to sort of ward off crop disease early on.

Will Nowak: There we go, using AI to capture that low hanging fruit.

Triveni Gandhi: Literally.

Will Nowak: Literally.

Triveni Gandhi: Yes.

Will Nowak: There you go. God bless, and I think they'll hopefully keep the bland bananas around for you, but I think that's our last banana fact. So with that, we should thank our listeners for tuning in and being with us. We're going to be back in October for season two, but we'll also have a sneak peek before then, kind of give you a look behind the AI curtain.

Triveni Gandhi: In case you miss us too much, be sure to subscribe to the Banana Data newsletter and share our podcast with your network. We'll catch up on all things data in no time.