Best Practices for Data Science Pipelines

An organization's data changes over time, but part of scaling data efforts is having the ability to glean the benefits of analysis and models over and over and over, despite changes in data. That's where the concept of a data science pipeline comes in: data might change, but the transformations, the analysis, the machine learning model training sessions, and any other processes that are a part of the pipeline remain the same.

Not a Podcast Person?

No problem, we get it — read the entire transcript of the episode below.

Will Nowak: Today's episode is all about tooling and best practices in data science pipelines. So before we get into all that nitty gritty, I think we should talk about what even is a data science pipeline. It's this concept of a linear workflow in your data science practice. So what do I mean by that? I think lots of times individuals who think about data science or AI or analytics are viewing it as a single author, developer or data scientist, working on a single dataset, doing a single analysis a single time. So all very one-off. And so often that's not the case, right? An organization's data changes, but we want, to some extent, to glean the benefits from these analyses again and again over time.

So what do we do? How do we operationalize that? And we do it with this concept of a data pipeline where data comes in, that data might change, but the transformations, the analysis, the machine learning model training sessions, these sorts of processes that are a part of the pipeline, they remain the same. And so again, you could think about water flowing through a pipe, we have data flowing through this pipeline. So we'll talk about some of the tools that people use for that today. But I was wondering, first of all, am I even right on my definition of a data science pipeline?

Triveni Gandhi: There are multiple pipelines in a data science practice, right? So you're talking about, we've got this data that was loaded into a warehouse somehow and then somehow an analysis gets created and deployed into a production system, and that's our pipeline, right? But there's also a data pipeline that comes before that, right? Where you have data engineers and sort of ETL experts, ETL being extract, transform, load, who are taking data from the very raw, collection part and making sure it gets into a place where data scientists and analysts can pick it up and actually work with it.

Will Nowak: Yeah, that's a good point. That's also a flow of data, but maybe not data science perhaps.

Triveni Gandhi: Right? And so I would argue that that flow is more linear, like a pipeline, like a water pipeline or whatever. But what we're doing in data science with data science pipelines is more circular, right? Because no one pulls out a piece of data or a dataset and magically in one shot creates perfect analytics, right? It takes time.

Will Nowak: I would agree.

Triveni Gandhi: Right? There's iteration, you take it back, you find new questions, all of that. And so the pipeline is circular, or it's reiterating upon itself. But every so often you strike a part of the pipeline where you say, "Okay, actually this is good. We should probably put this out into production." And maybe that's the part that's sort of linear.

Will Nowak: I would disagree with the circular analogy. I'd stick with the idea of linear pipes. Maybe pipes in parallel would be an analogy I would use. I agree with you that you do need to iterate. Data science is never done and it's definitely never perfect the first time through. But this idea of picking up data at rest, building an analysis, essentially building one pipe that you feel good about and then shipping that pipe to a factory where it's put into use. That's the concept of taking a pipe that you think is good enough and then putting it into production. So putting it into your organization's development applications, that would be like productionalizing a single pipeline. And then in parallel you have someone else who's building, over here on the side, an even better pipe. This pipe is stronger, it's more performant. And then once they think that pipe is good enough, they swap it in. So is parallel okay or do you want to stick with circular?

Triveni Gandhi: I mean it's parallel and circular, right? Because I think the analogy falls apart at the idea of like, "I shipped out the pipeline to the factory and now the pipe's working." But in sort of the hardware science of it, right? When the pipe breaks you're like, "Oh my God, we've got to fix this." But you don't know that it breaks until it springs a leak. And in data science you don't know that your pipeline's broken unless you're actually monitoring it. And so I actually think that part of the pipeline is monitoring it to say, "Hey, is this still doing what we expect it to do? Is the model still working correctly? Are we getting model drift? Is it breaking on certain use cases that we forgot about?"

So that testing and monitoring, has to be a part of, it has to be a part of the pipeline and that's why I don't like the idea of, "Oh it's done." And now it's like off into production and we don't have to worry about it. It's you only know how much better to make your next pipe or your next pipeline, because you have been paying attention to what the one in production is doing.
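A minimal sketch of the kind of monitoring described above, assuming the check compares a model's scores at training time against its scores in production. The population stability index (PSI) used here is one common drift statistic, not a method named in the episode, and the function names and the 0.2 alert threshold are illustrative assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compare two numeric samples; a larger value means more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def has_drifted(expected, actual, threshold=0.2):
    """Assumed rule of thumb: flag the pipeline when PSI exceeds the threshold."""
    return population_stability_index(expected, actual) > threshold


# Hypothetical data: scores seen at training time vs. scores seen in production.
training_scores = [i / 100 for i in range(100)]
production_scores = [0.5 + i / 200 for i in range(100)]
print(has_drifted(training_scores, production_scores))
```

Run on a schedule against the production pipeline, a check like this is what tells you the pipe has sprung a leak before anyone downstream notices.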

Will Nowak: Yeah, that's fair. And that's a very good point. I try to talk on this podcast as much as possible about concepts that I think are underrated in the data science space, and I definitely think that's one of them. I mean people talk about testing of code. I'm not a software engineer, but I have some friends who are, and they write tests. If you're thinking about getting a job or doing real software engineering work in the wild, it's very much a given that you write a function and you write a class or you write a snippet of code and you simultaneously, if you're doing test driven development, you write tests right then and there to understand, "Okay, if this function does what I think it does, then it will pass this test and it will perform in this way."

So software developers are always very cognizant and aware of testing. But data scientists, I think because they're so often doing single analyses, kind of in silos, aren't thinking about, "Wait, this needs to be robust to different inputs. This needs to be robust over time, and therefore how do I make it robust? I write tests and I write tests on both my code and my data." And even, like you referenced, my objects, like my machine learning models. So that's a very good point, Triveni.
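As a small sketch of what "tests on both my code and my data" could look like: one check exercises a cleaning function with deliberately bad input, and another rejects an incoming batch when too few rows are usable. The `age` column, the function names, and the 90% threshold are hypothetical choices for illustration.

```python
import math


def clean_ages(rows):
    """Keep rows whose 'age' is a plausible number; drop everything else."""
    cleaned = []
    for row in rows:
        age = row.get("age")
        # Reject missing values, strings, and booleans (bool is a subclass of int).
        if isinstance(age, bool) or not isinstance(age, (int, float)):
            continue
        if math.isnan(age) or not 0 <= age <= 120:
            continue
        cleaned.append(row)
    return cleaned


def validate_batch(rows, min_keep_rate=0.9):
    """Data test: fail loudly when too many incoming rows are unusable."""
    kept = clean_ages(rows)
    keep_rate = len(kept) / len(rows) if rows else 0.0
    if keep_rate < min_keep_rate:
        raise ValueError(f"only {keep_rate:.0%} of rows passed validation")
    return kept


def test_clean_ages_survives_crazy_data():
    """Code test: the function should not crash on unexpected input."""
    crazy = [{"age": 34}, {"age": -5}, {"age": None}, {"age": "forty"},
             {"age": float("nan")}, {"age": 999}, {}]
    assert clean_ages(crazy) == [{"age": 34}]


test_clean_ages_survives_crazy_data()
```

The first kind of test runs once, when the code changes. The second runs every time new data flows through the pipeline.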

Triveni Gandhi: Yeah. I know. And I think the testing isn't necessarily different, right? My husband is a software engineer, so he'll be like, "Oh, did you write a unit test for whatever?" And it's like, "I can't write a unit test for a machine learning model. What does that even mean?" Right. But what I can do is throw sort of unseen data at it. I can throw crazy data at it. I can see how that breaks the pipeline. I can monitor again for model drift or whatever it might be. And again, issues aren't just going to be from changes in the data. It's also going to be that as you get more data in and you start analyzing it, you're going to uncover new things. Right? And then does that change your pipeline or do you spin off a new pipeline? Do you have different questions to answer? And that's sort of what I mean by this chicken or the egg question, right?

Do you first build out a pipeline? But you can't really build out a pipeline until you know what you're looking for. But once you start looking, you realize, "I actually need something else." And then that's where you get this entirely different kind of development cycle. And especially then having to engage the data pipeline people. Maybe changing the conversation from just, "Oh, who has the best ROC AUC?" to, "Okay, is this pipeline not only good right now, but can it hold up against the test of time or new data or whatever it might be?" And being able to update as you go along. I think it's important.

Will Nowak: Yeah. So related to that, we wanted to dig in today a little bit to some of the tools that practitioners in the wild are using, kind of to do some of these things. So maybe with that we can dig into an article I think you want to talk about.

Triveni Gandhi: Yeah, so I wanted to talk about this article. It's called, We are Living In "The Era of Python." Which is kind of dramatic sounding, but that's okay.

Will Nowak: One of the biggest, baddest, best tools around, right?

Triveni Gandhi: The article argues that Python is the best language for AI and data science, right? And so when we think about having an effective pipeline, we also want to think about, "Okay, what are the best tools to have the right pipeline?" And so this author is arguing that it's Python. So when you look back at the history of Python, right? It's really taken off over the past few years. Python used to be a not very common language, but recently the data is showing that it's the third most used language, right? After JavaScript and Java. Both of which are very much like backend kinds of languages. Especially for AI and Machine Learning, now you have all these different libraries, packages, the like. And people are using Python code in production, right? I have clients who are using it in production, but is it the best tool? Is it the only data science tool that you ever need? I disagree.

Will Nowak: Just to be clear too, we're talking about data science pipelines. Going back to what I said previously, we're talking about picking up data that's living at rest. So you have a SQL database, or you're using a cloud object store. So basically just a fancy database in the cloud. And then the way this is working, as you're seeing it, is that oftentimes I'm a developer, a data science developer, who's using the Python programming language to write some scripts, to access data, manipulate data, build models. That's kind of the gist. Am I in the right space?

Triveni Gandhi: Yeah, definitely. Yeah.

Will Nowak: What's wrong with that? It seems to me for the data science pipeline, you're having one single language to access data, manipulate data, model data and you're saying, kind of deploy data or deploy data science work. That seems good.

Triveni Gandhi: I'm sure it's good to have a single sort of point of entry, but I think what happens is that you get this obsession with, "This is the only language that you'll ever need. Right? Learn Python."

Will Nowak: Well you're a big R fan.

Triveni Gandhi: I am an R fan right? I was like, I was raised in the house of R.

Will Nowak: You and that army.

Triveni Gandhi: I mean, what army. I became an analyst and a data scientist because I first learned R.

Will Nowak: It's true. I learned R first too.

Triveni Gandhi: Right, right. It's a more accessible language to start off with. Yeah. Okay. You can make the argument that it has lots of issues or whatever. Sorry, Hadley Wickham. But it is also the original sort of statistical programming language. Right? And where did machine learning come from? It came from stats. So it's sort of a disservice to, a really excellent tool and frankly a decent language to just say like, "Python is the only thing you're ever going to need." Because frankly, if you're going to do time series, you're going to do it in R. I'm not going to do it in Python.

Will Nowak: See. Again, disagree. So, I mean, you may be familiar and I think you are, with the XKCD comic, which is, "There are 10 competing standards, and we must develop one single glorified standard to unite them all. And then soon there are 11 competing standards." So I think that's a similar example here, except for not. And what I mean by that is, the spoken language or rather the used language amongst data scientists for this data science pipelining process, it's really trending toward and homing in on Python. And so I think R is dying a little bit. I know Julia, some Julia fans out there might claim that Julia is rising, and I know Scala's getting a lot of love because Scala is kind of the default language for Spark use.

So yeah, there are alternatives, but to me in general, I think you can have a great open source development community that's trying to build all these diverse features, and it's all housed within one single language. That's the dream, right? You have one, you only need to learn Python if you're trying to become a data scientist. It used to be that, "Oh, make sure before you go get that data science job, you also know R." That's a huge burden to bear. And so now we're making everyone's life easier.

Triveni Gandhi: Oh well, I think it depends on your use case and your industry, because I see a lot more R being used in places where time series, and healthcare and more advanced statistical needs are, rather than just pure prediction. Right? I don't want to just predict if someone's going to get cancer, I need to predict it within certain parameters of statistical measures. Right? And so that's where you see... and I know Airbnb is huge on R. They have a whole R shop.

Will Nowak: Yeah. Fair enough. But to again wear my hater hat, I mean I see a lot of Excel being used still for various means and ends. And I wouldn't recommend that many organizations rely on Excel, and development in Excel, for data science work.

Triveni Gandhi: Okay. How about this, as like a middle ground? Python is good at doing Machine Learning and maybe data science that's focused on predictions and classifications, but R is best used in cases where you need to be able to understand the statistical underpinnings. Because R is basically a statistical programming language. The Python stats package is not the best.

Will Nowak: But it's rapidly being developed to get better.

Triveni Gandhi: But it's rapidly being developed.

Will Nowak: I think we have to agree to disagree on this one, Triveni.

Triveni Gandhi: Okay.

Will Nowak: Now it's time for, in English please. Where we explain complex data science topics in plain English. So Triveni can you explain Kafka in English please? And it's not the author, right?

Triveni Gandhi: Kafka is actually an open source technology that was made at LinkedIn originally. Cool fact. And it is a real-time, distributed, fault tolerant messaging service, right? So, that's a lot of words. Essentially Kafka is taking real-time data and writing, tracking and storing it all at once, right? So when we think about how we store and manage data, a lot of it's happening all at the same time. So I get a big CSV file from so-and-so, and it gets uploaded and then we're off to the races. With Kafka, you're able to use things that are happening as they're actually being produced. So think about the finance world. People are buying and selling stocks, and it's happening in fractions of seconds. And so you need to be able to record those transactions equally as fast. That's where Kafka comes in.

Another thing that's great about Kafka is that it scales horizontally. What that means is that you have lots of computers running the service, so that even if one server goes down or something happens, you don't lose everything else. It's very fault tolerant in that way. And so it's an easy way to manage the flow of data in a world where the movement of data is really fast, and sometimes getting even faster. So it's sort of the new version of ETL that's based on streaming.

Will Nowak: Thanks for explaining that in English. The reason I wanted you to explain Kafka to me, Triveni, is I actually read a brief article on Dev.to, a developer forum, recently about whether Apache Kafka is overrated. So the discussion really centered a lot around the scalability of Kafka, which you just touched upon. Kind of this horizontal scalability, or that it's distributed in nature. And so I want to talk about that, but maybe even stepping up a bit, a little bit more out of the weeds and less about the nitty gritty of how Kafka really works, but just why it works or why we need it. I wanted to talk with you because I too maybe think that Kafka is somewhat overrated. And so not as a tool, I think it's good for what it does, but more broadly, as you noted, I think this streaming use case, and this idea that everything's moving to streaming and that streaming will cure all, I think is somewhat overrated.

Triveni Gandhi: Yeah, sure. I mean there's a difference right? Between streaming versus batch.

Will Nowak: Yeah. So do you want to explain streaming versus batch? Go for it.

Triveni Gandhi: Sure. So yeah, I mean when we think about batch ETL or batch data production, you're really thinking about doing everything all at once. And I guess a really nice example is if, let's say you're making cookies, right? So you would stir all your dough together, you'd add in your chocolate chips and then you'd bake all the cookies at once. But with streaming, what you're doing is, instead of stirring all the dough for the entire batch together, you're literally using one-twelfth of an egg and one-twelfth of the amount of flour and putting it together to make one cookie, and then repeating that process for all times. And maybe you have 12 cooks all making exactly one cookie. So that's streaming, right? Where you're doing it all individually. But batch is where it's all happening at once. Maybe at the end of the day you make a giant batch of cookies.
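The cookie analogy can be sketched in a few lines. The example below assumes a hypothetical list of orders and computes the same total two ways: once over the whole day's data at the end, and once incrementally as each record arrives.

```python
def batch_total(orders):
    """Batch: wait for the full day's orders, then process them in one pass."""
    return sum(order["amount"] for order in orders)


def streaming_totals(order_stream):
    """Streaming: update the running result as each order arrives."""
    total = 0
    for order in order_stream:
        total += order["amount"]
        yield total  # an up-to-date answer is available after every record


orders = [{"amount": 5}, {"amount": 7}, {"amount": 3}]

print(batch_total(orders))                     # 15
print(list(streaming_totals(iter(orders))))    # [5, 12, 15]
```

Both approaches end at the same number. The difference is when an answer becomes available, and how much machinery is needed to keep producing it.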

Will Nowak: Yeah. So that's a great example. I think just to clarify why I think maybe Kafka is overrated or streaming use cases are overrated, here if you wanted to consume one cookie at a time, there are benefits to having a stream of cookies as opposed to all the cookies done at once. Maybe you're full after six and you don't want any more. So just like sometimes I like streaming cookies, sometimes I like streaming data. But I think for me, I'm really focused, and in this podcast we talk a lot about data science. And at the core of data science, one of the tenets is AI and Machine Learning. And so when we're thinking about AI and Machine Learning, I do think streaming use cases or streaming cookies are overrated.

So we haven't actually talked that much about reinforcement learning techniques. And so reinforcement learning, which maybe we'll save for another In English Please soon, is, I would say, kind of a novel technique in Machine Learning where we're updating a Machine Learning model in real-time. But crucially, reinforcement learning techniques, and again, I think this is an underrated point, require some reward function to train a model in real-time. So by reward function, it's simply that when a model makes a prediction, very much in real-time, we know whether it was right or whether it was wrong. And if you think about the way we procure data for Machine Learning model training, so often those labels, that source of ground truth, come in much later.

Will Nowak: So if you think about loan defaults, I could tell you right now all the characteristics of your loan application. I know you're Triveni, I know this is where you're trying to get a loan, this is your credit history. That I know, but whether or not you default on the loan, I don't have that data at the same time I have the inputs to the model. So therefore I can't train a reinforcement learning model, and in general I think I need to resort to batch training and batch scoring. So the concept is, get Triveni's information, wait six months, wait a year, see if Triveni defaulted on her loan, repeat this process for a hundred thousand, a million people. And then once I have all the input for a million people, and I have all the ground truth output for a million people, I can do a batch process. I can bake all the cookies and I can score or train all the records.

I think everyone's talking about streaming like it's going to save the world, but I think it's missing a key point that data science and AI to this point, it's very much batch oriented still.

Triveni Gandhi: Well, yeah, and I think the critical difference here is that streaming with things like Kafka or other tools is, again like you're saying, about real-time updates towards a process, which is different from real-time scoring of a model, right?

Will Nowak: Yes. Yes. Exactly. Good clarification.

Triveni Gandhi: And so like, okay I go to a website and I throw something into my Amazon cart and then Amazon pops up like, "Hey you might like these things too." Now that's something that's happening real-time but Amazon I think, is not training new data on me, at the same time as giving me that recommendation.

Will Nowak: That example is real-time scoring.

Triveni Gandhi: Right. It's real-time scoring and that's what I think a lot of people want. But then they get confused with, "Well I need to stream data in and so then I have to have the system." But all you really need is a model that you've made in batch before or trained in batch, and then a sort of API endpoint or something to be able to real-time score new entries as they come in. Then maybe you're collecting back the ground truth and then updating your model. Right? So then Amazon sees that I added in these three items and so that gets added in to batch data to then rerun over that repeatable pipeline like we talked about.
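A rough sketch of that pattern, using a toy co-occurrence recommender rather than anything resembling a real production system: the model is built in batch from past carts, scoring is a cheap lookup that could sit behind an API endpoint, and new carts are only logged until the next batch run. All names and data here are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_in_batch(carts):
    """Batch step: count how often each pair of items shares a cart."""
    co_counts = defaultdict(Counter)
    for cart in carts:
        for a, b in permutations(sorted(set(cart)), 2):
            co_counts[a][b] += 1
    return co_counts


def score_in_real_time(model, item, top_n=1):
    """Real-time step: a lookup against the already-trained model."""
    return [other for other, _ in model.get(item, Counter()).most_common(top_n)]


# Last night's batch run over historical carts.
history = [["tent", "stove"], ["tent", "stove"], ["tent", "lantern"]]
model = train_in_batch(history)

# During the day: score instantly, and only log what shoppers actually do.
print(score_in_real_time(model, "tent"))
feedback_log = [["tent", "lantern"], ["tent", "lantern"]]

# Tonight's batch run folds the collected ground truth back in.
model = train_in_batch(history + feedback_log)
print(score_in_real_time(model, "tent"))
```

Nothing in the daytime path retrains the model. The recommendations feel live to the shopper, while training stays a scheduled batch job.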

Will Nowak: Yeah.

Triveni Gandhi: And so I think streaming is overrated because in some ways it's misunderstood, like its actual purpose is misunderstood.

Will Nowak: Yeah, I think that's a great clarification to make. Just this distinction between batch versus streaming, and then when it comes to scoring, real-time scoring versus real-time training. I definitely don't think we're at the point where we're ready to think really rigorously about real-time training. And honestly I don't even know. Maybe someone much smarter than I can come up with all the benefits that are to be had with real-time training. But to me they're not immediately evident. But one point, and this was not in the article that I'm linking or referencing today, but I've also seen this noted when people are talking about the importance of streaming: it's for decision making. So the idea here being that if you make a purchase on Amazon, and I'm an analyst at Amazon, why should I wait until tomorrow to know that Triveni Gandhi just purchased this item?

Yeah, because I'm an analyst who wants that business analytics, wants that business data, to then make a decision for Amazon. So I'm a human who's using data to power my decisions. That's fine. I get that. You want to have real-time updated data to power your human based decisions. So it's another interesting distinction that I think is being a little bit muddied in this conversation of streaming. That's fine. But if you're trying to use automated decision making, through Machine Learning models and deployed APIs, then in this case again, the streaming is less relevant because that model is going to be trained again on a batch basis, not so often.

Triveni Gandhi: Right? Yeah. Unless you're doing reinforcement learning where you're going to add in a single record and retrain the model or update the parameters, whatever it is. I could see this... Last season we talked about something called federated learning. And I could see that having some value here, right? Where you're saying, "Okay, go out and train the model on the servers of the other places where the data's stored and then send back to me the updated parameters in real-time." Again, the use cases there are not going to be the most common things that you're doing in an average or very standard data science, AI world, right? Banks don't need to be real-time streaming and updating their loan prediction analysis. And so I think again, similar to that sort of AI winter thing too, if you overhype something, you then oversell it and it becomes less relevant. And so I think Kafka, again, nothing against Kafka, but sort of the concept of streaming, right? It needs to be very deeply clarified, and people shouldn't be trying to just do something because everyone else is doing it.

Will Nowak: Yeah. Yeah. I agree. And I think we should talk a little bit less about streaming. Again, not that it's not appropriate in its time and place, and just more about... I just hear so few people talk about the importance of labeled training data. And so people are talking about AI all the time and I think oftentimes when people are talking about Machine Learning and Artificial Intelligence, they are assuming supervised learning or thinking about instances where we have labels on our training data. People assume that we're doing supervised learning, but so often I don't think people understand where and how that labeled training data is being acquired. What is the business process that we have in place, that at the end of the day is saying, "Yes, this was a default. That was not a default. This person was high risk. This person was low risk."

You need to develop those labels and at this moment in time, I think for the foreseeable future, it's a very human process. It's a somewhat laborious process, it's a really important process. And I think people just kind of assume that the training labels will oftentimes appear magically and so often they won't. So I guess, in conclusion for me about Kafka being overrated, not as a technology, but I think we need to change our discourse a little bit away from streaming, and think about more things like training labels.

Triveni Gandhi: Last season, at the end of each episode, I gave you a fact about bananas. Now in the spirit of a new season, I'm going to be changing it up a little bit and be giving you facts that are bananas. You ready, Will?

Will Nowak: Let's give it a shot.

Triveni Gandhi: All right. Today I want to share with you all that a single Lego can support up to 375,000 other Legos before buckling. So in other words, you could build a Lego tower 2.17 miles high before the bottom Lego breaks.

Will Nowak: Let me go try that out.

Triveni Gandhi: Go try it out.

Will Nowak: That's all we've got for today in the world of Banana Data. We'll be back with another podcast in two weeks, but in the meantime, subscribe to the Banana Data newsletter, to read these articles and more like them. We've got links for all the articles we discussed today in the show notes. All right, well, it's been a pleasure Triveni.

Triveni Gandhi: It's been great, Will. See you next time.
