Want to learn about what's hot (and what's not) in AI for 2020? Look no further than Season 2, Episode 3 of The Banana Data Podcast, where hosts Triveni Gandhi and Will Nowak unpack topics like AutoML, explainable AI, cloud computing, federated learning, and more.
Read the Episode
Not a podcast person? Here's the full transcript of the 2020 AI trends episode.
Triveni Gandhi: You're listening to the Banana Data Podcast, a podcast hosted by Dataiku. I'm Triveni.
Will Nowak: And I'm Will.
Triveni Gandhi: And we'll be taking you behind the curtain of the AI hype, exploring what it is ...
Will Nowak: and what it isn't ...
Triveni Gandhi: ... capable of. This episode, we're placing bets on the biggest data science trends for 2020 and reflecting on whether or not the trends of 2019 lived up to the hype.
Will Nowak: First trend I want to talk to you about today, Triveni, is AutoML.
Triveni Gandhi: That's all the rage.
Will Nowak: It is kind of all the rage and that's what I want to talk about. So, first of all, just to reiterate what this is, what it means for our listeners. So AutoML, standing for automated machine learning. So just this concept of building a predictive model, be it supervised or unsupervised, in a more automated way. So what does that mean? It means doing things like hyperparameter tuning, feature selection, algorithm selection, or data pre-processing, tasks which traditionally have been done manually through scripted code, in an automated way. So that's what it is, and you say it's all the rage and that's what I want to talk about. I think that it's kind of all the rage.
Triveni Gandhi: Kind of? Explain.
Will Nowak: Kind of all the rage. So in my experience with AutoML, it's been rapidly accelerating in adoption and organizations that adopt it, seem to enjoy it.
Triveni Gandhi: Okay.
Will Nowak: Right? So if they're sophisticated enough to have framed a data science or machine learning problem succinctly and successfully, then when they throw AutoML tools at it, particularly if they don't honestly have the best machine learning engineers in the world, oftentimes they might find that AutoML tools will outperform, or at least equal, the work that they've done to this point. So they've been doing a lot of work, they've been bending over backwards to solve a machine learning problem that they have framed, and now they use AutoML and they get good performance out of it and life is great.
Will Nowak: So I have seen rapid adoption with groups that have adopted it, if that makes sense. Once they start, they like it. However, when I talk to people more broadly, I still think for as powerful as AutoML technologies seem, and then we'll talk about the caveats and the risks inherent to AutoML, but for as powerful as AutoML seems, I think there's still a lot of organizations that either are unaware of it, individuals are unaware of it or dismissive of its potential.
Will Nowak: So this is why I wanted to talk to you about this. I think that it's got a lot of promise, it's got some risk, people who try it tend to like it, but still, many individuals and organizations in the data science space are wary of it. So that's why I bring today to you, my trend or my thought for 2020 is what will the future of AutoML be? So what do you think?
Triveni Gandhi: Yeah. Well, I'm on the not-so-crazy-about-it side of things, right? Because obviously hyperparameter tuning and those things are important, but when it comes to feature selection, when it comes to thinking through what's going to be the best model for you or what are the right inputs here, I don't necessarily want to hand it over to a computer and say, "All right, do it all and then just give me the output." I want to at least have some kind of human in the loop, right? I think human in the loop is very much a big part of my approach to data science and AI. So I'm hopeful that in 2020 we see more usage of AutoML as a baseline or a starting point upon which someone who's actually a person can iterate. I don't like the idea of "here's an AutoML model that was trained by whatever, and I have no visibility into it, no understanding of why or how it made that choice, but I'm going to go with it." That's problematic.
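To make concrete what a tool like this automates, here's a toy, from-scratch sketch of the AutoML idea: search over candidate algorithms and hyperparameters, and keep whichever scores best on held-out data. Everything here (the data, the two "algorithms," the search space) is invented for illustration; real AutoML tools do the same thing at far larger scale.

```python
import random

random.seed(0)

# Toy data: y = 3x plus noise, split into train and validation sets.
train = [(x, 3 * x + random.uniform(-1, 1)) for x in range(20)]
valid = [(x, 3 * x + random.uniform(-1, 1)) for x in range(20, 30)]

def fit_mean(data):
    """Candidate 'algorithm' 1: always predict the training mean."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_linear(data, lr=0.01, steps=500):
    """Candidate 'algorithm' 2: one-weight linear model via gradient descent."""
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            w -= lr * (w * x - y) * x / len(data)
    return lambda x: w * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# The "AutoML" loop: try each algorithm/hyperparameter combination and
# keep whichever performs best on the held-out validation set.
search_space = [(fit_mean, {}), (fit_linear, {"lr": 0.001}), (fit_linear, {"lr": 0.01})]
best_model, best_score = None, float("inf")
for algo, params in search_space:
    model = algo(train, **params)
    score = mse(model, valid)
    if score < best_score:
        best_model, best_score = model, score

print(f"best validation MSE: {best_score:.3f}")
```

Note that a human still framed the problem, chose the data split, and defined the metric; that framing is exactly the part Triveni argues can't be handed to the machine.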
Will Nowak: Yeah. No, I mean, I think I see development in 2020, and more broadly in the future, going two ways. So to this point about what you're mentioning, domain specificity and broader reasoning about the problem you're solving, I read something recently ... Do you know who Francois Chollet is? Does that name ring a bell to you?
Triveni Gandhi: It does not.
Will Nowak: So Francois is the inventor of Keras. So Keras, the wrapper on top of TensorFlow, and I believe he's still working with Google, or at least he once was. So he recently wrote a paper where he introduced what he calls the Abstraction and Reasoning Corpus (ARC). So he's saying, "Hey, artificial intelligence has focused too much ..." and this is something that's been a theme for us as well, "... focused too much on individual model performance and instead, we need to be thinking about abstraction and reasoning."
Will Nowak: So when humans learn, we're not just solving single, minute problems, but rather this concept of abstraction and then logic and reasoning. They're big problems that honestly are really hard and so I think that's why the field has been shying away from them, but it's exciting to see a leader like Francois bringing attention to that. So I think to your point, that's something that we need to focus on that has been underrated but to my point, I still think in parallel paths we'll see AutoML adoption rapidly accelerate in 2020.
Triveni Gandhi: Yeah, and I wouldn't be surprised, right? Especially as enterprises are scaling up their data science sort of practices and centers, they need to have some kind of tooling that can help them get off the ground. But do I think it's the only answer for a data science practice at an enterprise? No.
Will Nowak: But that's an interesting question you pose, to help them get off the ground. Do you think if you're a nascent organization in the data science AI space and you're looking to put your first model into production, which we learned about a lot last time in our conversation with Dan, should you be using AutoML tools or should you go out and hire just one solid, machine learning engineer who can build through code a solution to this problem for you? What do you think, hot take?
Triveni Gandhi: Hot take. Well, my hot take here is that people put too much importance on AutoML as an end-all-be-all, plug-and-chug solution. In fact, it works really well when a group or a company knows what their problem is, knows what they're looking for and trying to solve. But in so many cases of new AI adoption, you don't even know what you don't know, and that's what data science truly is: the exploration, digging into your data, understanding it at the deepest level, trying lots of different models and hyperparameters, and then being able to say, "Okay, now I understand that this is what I need to be addressing to achieve whatever goal has been set."
Triveni Gandhi: So I don't think you can just jump into a brand new data science practice and say, "Okay, we're just going to let AutoML do it." It's not a magical box. You need to have some folks who are experts in understanding data, but who also have either industry knowledge or knowledge of your specific issue, to then make the best use of that data.
Will Nowak: Yeah. I think this, again, reminds me of last week's conversation in thinking about I think a point you wanted to mention but, more broadly, a point we brought up with Dan, which is the concept of data science literate management.
Triveni Gandhi: Right.
Will Nowak: You don't necessarily need to hire someone to do the data science, but you need to hire someone to manage or oversee it. So I think that's kind of what you're getting at and that, correct me if I'm wrong, but that is something that's really important to make sure that your end data product is explainable or interpretable or performing as you want it to be performing.
Triveni Gandhi: Right, well and that actually brings me to my trend, my first trend of 2019 going to 2020, which is explainable, interpretable, responsible, ethical AI. So there's a lot of different terms out there.
Will Nowak: And then we should be clear on what you think about each of those going into 2020, because I think we've talked about on this podcast before how like interpretable doesn't imply ethical.
Triveni Gandhi: Yeah.
Will Nowak: And something that can be explained isn't necessarily performant, but which one do you want to focus on most, interpretable AI or ethical AI?
Triveni Gandhi: Well, so I think what we've seen in 2019 is a lot of people saying different kinds of words, ethical, interpretable, all these things. And I think the trend we're going to see in 2020 is scaling back and looking at it as this umbrella of responsible AI. So responsible AI means a lot more.
Will Nowak: I like that.
Triveni Gandhi: Right, so it has an ethical component, it has interpretable when necessary, it has explainable when necessary.
Will Nowak: So to you, responsible is just bundling all these things together, which I'm not saying is bad.
Triveni Gandhi: Well, no, but it's even more than that, right? It's also governance over your models in production, stuff, again, that Dan was talking about. It's about knowing who is literally responsible for parts of the flow. If there's an error upstream in your data processing that's causing a loss downstream, who is held responsible for that error, and who's overseeing the correct execution of the entire flow, the productionization of the models, all of that?
Triveni Gandhi: So I think responsible AI, it does have ethical components, it does have interpretable components, but it's also bigger than that, and it's about creating AI or data science practices in a company as its own entity, right? It's not just this thing that some teams do, right? Oh, they do data science, they do AI, but in fact, it's its own entire branch of the organization.
Will Nowak: I could see that going both ways. Because it's important, you want to dedicate an entire singular branch to it, the way an organization's marketing is very important so there's a marketing sub-organization. But at the same time, I think we talk about the promise of AI as being so pervasive that every individual department will benefit from it and implement it. So do you think there should be one overarching responsible AI department within an organization that then has members embedded in other teams?
Triveni Gandhi: I think the approach does need to be a top-level commitment to responsible AI, plus an oversight organization or whatever it might be, but one that is highly involved with the different aspects of data science at an organization. And this might not look the same for every org, because if you're a smaller company and your data science team is like three people, they themselves will also be highly involved in the responsible AI execution. But even at a big company, even if you have a chief responsible AI officer, maybe they also are working with the actual data scientists to say, "Look, what are you doing here? Does this actually fit our values and our mission around responsible AI?" I think that's going to be really important.
Triveni Gandhi: One thing, and I actually want to hear what you have to say on this. Recently, I read a piece that argued instead of trying to make AI explainable, we should optimize for the right outcomes. So what does that mean? Well, we talk a lot about explainable, interpretable AI, and as a result, you might not get the most performant model, right? Because you can't use a neural net, you can't use some black box model. But this article's arguing that instead of trying to govern the AI and dumb it down, govern the optimizations: ask what you are actually trying to do here and whether you're achieving it. Who cares what's happening inside?
Will Nowak: Yeah, I mean it brings me to the aphorism "what gets measured matters," or "what matters gets measured." You can think about it one way or the other, but in general here, I do think that statement carries some importance, in that it's easy for us to quantify accuracy or precision, but it's hard for us to quantify broader metrics of AI quality or AI responsibility. So I like it, I'm into it, and I would optimistically say let's see if this is a trend for 2020. Though you bring this up to me now, it's not something I've heard a lot about or thought about, so maybe we're thinking more of a 2021 timeline if I'm being pessimistic. But I like it. I think it's a cool idea to think about better ways that we can operationalize responsible constructs.
Triveni Gandhi: Yeah.
Will Nowak: The third trend for 2020 I want to discuss with you, Triveni, is, broadly put, the concept of the cloud. In particular, I just want to reflect on, first of all, the rapid adoption toward and in the cloud that I've seen with the clients I work with, and then some trends in terms of how I see it evolving in 2020. So just to set the stage here: by the cloud, I mean using distributed resources hosted across potentially the public internet for data storage and data computation. I've been impressed in 2019 with how quickly people who previously said, "We don't see ourselves moving there," or even, "We don't see ourselves trusting the cloud," have adopted it. Because if you're using something like Amazon S3 for simple object storage in the cloud, you're taking your data and storing it with Amazon, and that's implying a lot of trust.
Will Nowak: You're taking some valuable assets that's yours, AKA your data, when you're letting Amazon hold onto it for you. And so I think the narrative, if we were sitting here a year ago having this conversation, the narrative was a lack of trust and just in general, a lack of readiness to port a large part of an organization's data science practices from on prem. So from servers and kind of work that's happening on premises, into one of the cloud vendors. So in this case, to call out the elephant in the room, I'm talking about Google, Microsoft and Amazon. So that adoption's been really rapid and so I'm interested to see in 2020 how it continues to progress, and I have some hypotheses. But first of all, you've seen what I've seen?
Triveni Gandhi: Oh yeah, definitely. I think you see it more with places that are ready to actually move their data practices into the next phase. There's a tendency to sort of resist the cloud and cloud computing when you're not quite ready to actually put machine learning into practice at the organization. But when you have that go ahead from the higher ups and from sort of that investment into actually using machine learning, you see this rapid adoption of cloud storage and even cloud compute.
Triveni Gandhi: Cloud compute is becoming even bigger, where everyone wants to be using Kubernetes, Docker, Spark over Kubernetes. However many ways I can put as many cloud computing pieces together, I would like to do that.
Will Nowak: Yeah, I think that for nascent organizations, and by nascent I don't mean an organization that's just starting its data science practice, I mean organizations that are just starting, like startups. It's interesting, and it's not typically an audience that we talk to a lot on this podcast, but for them, I'm particularly interested to see how they think about cloud adoption. Because to echo your point, I think one of the many promises of cloud was always this concept of elasticity: instead of having to make some big upfront investment in servers, you could say, "I turn it on when I want it, I turn it off when I don't want it." And that's kind of been true. But thinking again about going forward in 2020, as quickly as I've seen the adoption of cloud in 2019, I feel like there's been a slight, or maybe even more than slight, backlash against the cloud with regards to cloud costs.
Will Nowak: Again, there's no free launch so these cloud vendors are not stupid and they know that if they're offering you elastic services that you can spin up, spin down, turn on, turn off, when you turn them on, you're going to pay for it. And so now ironically, I feel like I've seen some pricing schemes that's saying like, "Commit to fixed bundle of storage or compute and you get lower costs," and that actually reminds me very much of on premise where you would say, "I'm committing by provisioning a server. I'm going out and buying this machine, so I'm making an upfront cost and therefore it's less"
Will Nowak: so like what's kind of old is new again or what's new is old again. Seeing it cycle through is fascinating, but I think in general, my prognostication for 2020 with regards to cloud is just the consumers are becoming more and more savvy. So what I mean by that is they're not just blindly accepting cloud and cloud costs, but they're really starting to think in a more educated way about pros and the cons, AKA the costs, the expensive costs of cloud computing.
Will Nowak: And in one way in which, again, I've seen this happen and I definitely foresee it continuing in 2020 is the move to multi cloud vendors. So just using the power of competition for good in this case, because if you say I'm fully committed to cloud vendor A, then cloud vendor A, just like any other commitment, can just jack up the prices on you and it's hard for you to switch, whereas most organizations ...or not most organizations I should say, some organizations I've seen, I think intelligently so, decide to maybe split 50/50 or 80/20 their adoption into cloud vendor A and B. In that case, if they're seeing some sort of performance or maybe just a lack of feature updates or upgrades from vendor B or A, they can always kind of tweak their adoption.
Triveni Gandhi: Yeah. Well, and I think in 2020, IT monitoring is going to become even bigger because of this, because people need to be monitoring how they're using their cloud resources to understand if they're buying too much, or if this machine has been on for some unnecessary amount of time, whatever it might be. So more IT monitoring goes hand in hand with the move to the cloud.
Will Nowak: Yeah, and one thing I think about is the economics of all this as well, which I love discussing, thinking about the incentives. So the cloud organizations charge you for storage and they charge you for compute. If you're using some sort of services on top of that, of which there are increasingly many, how does that pricing look? Is that organization incentivizing you to store more or compute more, or is it trying to help you be more efficient in terms of your storage or compute? Are these other vendors that lie on top of the cloud helping or hurting you with regards to your costs? There's not one single answer, but if you are a manager or someone in procurement in this space, I would encourage you to think about, when you're buying a new technology, how that technology is going to augment or reduce your cloud spend.
Triveni Gandhi: Well, so this actually brings me to the last trend for 2020 that we see, and that's actually federated learning. So for true fans of the podcast, you might remember we discussed federated learning in our very first episode many months ago. And at that time, I thought it was the greatest thing and I still think it's really great. I think based on the conversation we've had here today, Will, I could see federated learning really taking off in 2020.
Triveni Gandhi: So reminder, federated learning is the idea that instead of collecting all data into a single computer and running a machine learning model, you in fact build out a baseline model that gets sent out to, let's say, devices, mobile phones. The data on Will's phone stays on Will's phone, the model gets sent to his phone, it retrains with Will's data and then the model parameters or the model outputs and inputs get sent back to the main sort of computing center.
Triveni Gandhi: And so in this way, we can use Will's data to create a better model, but we don't actually ever hold on to his data. We never see it, we never collect it, it stays on his device and it's a great value add for privacy. So as we think about ethical and responsible AI and data privacy practices, federated learning has a lot of potential power there. But in line with sort of the idea of cloud computing, I think it also adds a lot of power for organizations that want to do sort of heavy compute, but don't necessarily have the resources to do so. And so using the pretty efficient and modern processors in most mobile phones today, you could actually make use of a lot of compute power that doesn't exist in a cloud storage service.
Will Nowak: Yeah. No, that's super interesting, super important, and again, I think something that only the most advanced and rapid adopters are getting towards, but I agree with you that it's going to accelerate. And just to be clear, I've often heard this described as computation or AI on the edge. So when people talk about edge devices or the edge, I think they typically mean things are not happening in centralized servers; an edge device would be like an iPhone.
Will Nowak: And again, just this concept of instead of taking data, like talking to Siri on your phone and then sending it to the central server for scoring or training, you can use the edge device for just model execution or model scoring or the edge device for training. And so I think you're right, that the iPhone processor is super powerful and in the short term, I think organizations, again, I agree with you, are going to decide, "Hey wait, we can do some on edge device training. Why don't we do that?"
Will Nowak: What's really interesting to me, maybe we'll be talking about this one year from now, is when my battery ... I think this already happened a little bit, but when my battery starts to drain really, really rapidly, and I'm like, "Why is my battery draining so rapidly? Oh, it's because organization X." I'm just using this app and I think naively that all I'm doing is benefiting from my use of the app. Actually, when I have the app turned on, they're doing sorts of crazy computations behind the scene on my phone.
Triveni Gandhi: Oh, yeah.
Will Nowak: So again, I think the average consumer is probably not thinking a lot about on-device, machine learning model training, but maybe in December 2020, things will be a little bit different and we'll start to maybe see public discourse around that. And also, I wouldn't be surprised if there was at some point kind of legislation about how organizations could use external devices that they don't own to manage compute.
Triveni Gandhi: Yeah. And Will, when you think about the bottlenecks of federated learning, there's a professor at Michigan State University, Mi Zhang, who argues that the main bottleneck is actually our communication bandwidth. So the phones themselves are really good, and we know the iPhone or whatever has this fancy AI chipset, but the actual sending of information to and from a phone, the wireless communication we have, is not that good.
Will Nowak: Yeah.
Triveni Gandhi: This could be really interesting and useful, but it requires a different kind of an investment in infrastructure that doesn't relate to AI necessarily.
Will Nowak: No, that's a good point. And tying back, it's only somewhat related, but a little bit true to the previous conversation. It's somewhat of a wonky point, but again, I talked about multi-cloud, and people should be aware, right? If you're staying, let's say, in Microsoft's arena, and by arena I mean Microsoft's network, everything is super snappy. But if you want to pass data between Microsoft and Amazon, well, not even because they're doing anything nefarious, just because of the way the world works, it's going to be harder.
Will Nowak: So I think that's maybe even a broad trend in and of itself, just kind of this increased awareness of network latency as a bottleneck for AI and machine learning.
Will Nowak: If you're looking for more predictions, Dataiku has released both a great white paper and webinar for 2020, exploring AI trends and what's next for the data-driven enterprise. Links to both in the show notes.
Triveni Gandhi: All right, before we head out, it is time for the Banana fact of the episode. And this batch actually comes to us via EGG SF, which is The Human-Centered AI Conference by Dataiku. So Will and I were at EGG a few weeks ago and picked up this interesting fact. So Will, you've heard of a Canadian tuxedo, right?
Will Nowak: I have.
Triveni Gandhi: Yeah, the Canadian tuxedo is basically a denim shirt or denim jacket worn with jeans, right? Very Canadian. So I just thought this was a way of saying, "Oh, Canada, crazy about denim." In fact, I learned that the Canadian tuxedo was invented in 1951 by Levi's, because Bing Crosby, the famous singer, had gone to a hotel in Vancouver wearing a jean jacket and jeans and had tried to get in, and they said, "Oh, I'm sorry, sir. You're only allowed to come in if you're wearing a three-piece suit." And then he was like, "Okay, but I'm Bing Crosby." And so they were like, "Oh, no, no, you can come in, it's fine." But Levi's heard about this and actually tailor-made a denim tuxedo for Bing Crosby. So that's the origin story of the Canadian tux.
Will Nowak: That's all we've got for today in the world of Banana Data. We'll be back with another podcast in two weeks. But in the meantime, subscribe to the Banana Data newsletter to read these articles and more like them. We've got links for all the articles we discussed today in the show notes. All right, well, it's been a pleasure, Triveni.
Triveni Gandhi: It's been great, Will. See you next time.