Data scientists are among the most coveted resources at leading companies today, and it’s no wonder why. Their unique combination of expertise across the business, technology, and mathematics domains enables them to analyze and interpret data to create models, data products, and insights that improve business outcomes. A data scientist is typically a hybrid creature: part statistician, part machine learning (ML) expert, part business analyst, and part computer programmer. As the common saying goes, “A data scientist is someone who’s better at statistics than any software engineer, and better at software engineering than any statistician.”
Most data scientists do know how to code; Python, R, and SQL are among the languages most widely used by this group. The flexibility of pure code allows advanced data practitioners to perform an endless variety of tasks: accessing or preparing data, engineering features and training ML models, or even developing custom visualizations and applications through which business users can consume results.
However, in order for data scientists to optimize the value they deliver to their organizations, it’s important that they don’t get bogged down or distracted by writing the wrong type of code. By wrong type, I don’t mean poorly constructed or malicious code. Rather, I mean code whose primary purpose is to connect disparate components that otherwise would be incompatible — more commonly referred to as “glue code.”
What Is Glue Code?
Glue code is code that doesn’t contribute to the core objectives of an analysis or a model system; it exists solely to chain together dependencies and ensure the overall system runs smoothly. To use a theater analogy, think of glue code as the stagehands and backstage team that make a production run seamlessly during performances. And just as any thespian could tell you, being part of the production crew rather than a lead actor doesn’t mean less work. In fact, it usually means more! Glue code is the same; it’s doing a lot of heavy lifting behind the scenes.
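To make the distinction concrete, here is a minimal, hypothetical sketch of a small scoring job written in Python. The file names, environment variable, and column names are invented for illustration; the point is that only a couple of lines do actual data science, while the rest exists to fetch configuration, patch up formats, and hand results off to the next system.

```python
# Hypothetical scoring job: the paths, env var, and columns are made up.
# Only the two lines in the "core" section do data science work.
import os
import json

import pandas as pd
from sklearn.linear_model import LogisticRegression

# --- glue: locate configuration and credentials ---------------------------
config_path = os.environ.get("PIPELINE_CONFIG", "config.json")  # assumed env var
with open(config_path) as f:
    config = json.load(f)

# --- glue: read a CSV export and coerce it into the expected shape --------
df = pd.read_csv(config["input_file"], sep=config.get("separator", ","))
df = df.rename(columns=str.lower)       # normalize column names
df = df.drop_duplicates()               # patch over upstream quality issues
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# --- core: the actual data science ----------------------------------------
features, target = df[["age", "tenure_months"]].fillna(0), df["churned"]
model = LogisticRegression().fit(features, target)

# --- glue: persist results in the format the downstream system expects ----
df["churn_score"] = model.predict_proba(features)[:, 1]
df.to_parquet(config["output_file"], index=False)
```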
In “Glue: the Dark Matter of Software,” Marcel Weiher calls glue code “invisible and massive” and “quadratic.” That is, the amount of glue code grows with the square of the number of things you need to glue together: connecting ten components pairwise, for instance, can mean up to 45 distinct integrations. With the complexity of modern infrastructure and the diverse technologies that need to be cobbled together today, glue code can quickly become the dominant component in a project’s overall code base.
“Because a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, it may be less costly to create a clean native solution rather than reuse a generic package.”
-Hidden Technical Debt in Machine Learning Systems, Google
In this article, we’ll further explore how writing glue code detracts from data scientists’ overall mission and discover ways they can shed this burden when working on AI projects.
Data Scientists Aren't Full Stack Developers
When aspiring data scientists first learn the skills of the trade, such as programmatically manipulating data or training models with ML libraries, the datasets provided for training exercises are likely isolated, curated sources: perhaps standalone CSV files or a zipped directory of files downloaded from Kaggle or GitHub. Novices likely run jobs in-memory, either locally on a laptop or in a cloud-based lab instance spun up for trial purposes. In these artificial, educational setups, writing glue code is a relatively trivial task since there aren’t many complexities or dependencies, and the resulting code doesn’t need to be tested and hardened for production use.
But at large enterprises, this is far from the case. Data is stored in dozens of different systems and applications, each with its own special format, security requirements, and idiosyncrasies. Corporate data is dirty, rife with missing values, duplicates, and quality issues that need to be detected and corrected each time the pipeline is run. Computation might still happen in-memory, but on big data with millions of rows, it more likely needs to be pushed down to the database’s native engine or shipped out to clusters of distributed resources in the cloud to keep runtimes reasonable.
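As a small illustration of what “pushing computation down” means, the hedged sketch below contrasts pulling an entire table into pandas with letting the database aggregate first; the connection string, table, and column names are placeholders.

```python
# Hypothetical comparison of in-memory aggregation vs. database pushdown.
# The connection string, table, and columns are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse:5432/sales")  # placeholder DSN

# In-memory approach: every row crosses the network before being aggregated.
orders = pd.read_sql("SELECT customer_id, amount FROM orders", con=engine)
totals = orders.groupby("customer_id")["amount"].sum()

# Pushdown approach: the database's engine does the heavy lifting, and only
# one aggregated row per customer crosses the network.
totals = pd.read_sql(
    "SELECT customer_id, SUM(amount) AS total_amount FROM orders GROUP BY customer_id",
    con=engine,
)
```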
Most data scientists are not formally trained as data engineers or cloud architects, and the intricacies of network security, cloud resource allocation, and pipeline architecture lie outside their true area of expertise.
Fortunately, the availability of data science platforms like Dataiku presents an opportunity for data teams of all sizes to automate and simplify cumbersome backend processes like connecting to data storage and compute infrastructure, or recursively checking pipelines for schema inconsistencies. As a result, data scientists can stop writing glue code and free up time to focus on the core function of their jobs — producing models and delivering data solutions that drive better business decisions.
It's a Distraction From Core Priorities
By nature, data scientists tend to be curious and interested in trying things out for themselves to see how they work. Because of this, it can be easy for them to get distracted by going down each new rabbit hole in the technical ecosystem surrounding their project: trying to improve runtime performance with Spark, learning to convert existing Python code into PySpark for Spark compatibility, or creating Docker images and spinning up Kubernetes clusters (but maybe forgetting to shut them down!). These are activities that many data scientists have likely engaged in during the course of their project work.
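As a hedged illustration of the kind of rework this involves, the sketch below shows the same simple aggregation written first in pandas and then re-expressed in PySpark. The file paths and column names are invented; the point is that the rewrite adds effort without changing the analysis itself.

```python
# Hypothetical before/after of moving an aggregation from pandas to Spark.
# File paths and column names are invented for illustration.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Original pandas version, fine for a sample that fits in memory.
sample = pd.read_csv("transactions_sample.csv")
daily_avg = sample.groupby("transaction_date")["amount"].mean()

# PySpark rewrite of the same logic, needed once the data outgrows one machine.
spark = SparkSession.builder.appName("daily-averages").getOrCreate()
transactions = spark.read.csv("s3://bucket/transactions/", header=True, inferSchema=True)
daily_avg = (
    transactions
    .groupBy("transaction_date")
    .agg(F.avg("amount").alias("avg_amount"))
)
```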
However, the constant context switching and the hours spent researching and reading online forums and documentation add up to a significant time sink, one that carries a real opportunity cost for companies. Building and troubleshooting glue code might keep data scientists busy, but the result is that your most highly paid resources are overworked, yet underutilized when it comes to their specialized strengths.
Modern data science platforms like Dataiku have commoditized much of this work, automating supporting tasks and hiding the complexity from end users. Getting sidetracked by hand-coding and tuning these processes means less time focused on the high-value tasks that data scientists are uniquely qualified to perform.
How Dataiku Can Help
Performance expectations of data teams are high: After all, data and AI are now considered core business functions and key value drivers. To succeed in this fast-paced environment, it’s critical for teams to automate and systematize the repetitive aspects of data projects to save time for tasks that require human insight and judgment.
Dataiku provides a single central platform where teams can design, deploy, monitor, and maintain all of their data and AI projects. Its powerful tooling helps data practitioners avoid repetitive or time-consuming “glue code” tasks via capabilities such as:
- Pre-built connectors to dozens of leading data sources, to simplify and standardize data access and formatting processes (see the sketch after this list).
- Auto code generation specific to the selected runtime engine.
- Comprehensive pipeline consistency checks, to ensure schemas and dependencies are stable both across a workflow and over time as data evolves.
- Configurable data quality checks and failsafes to ensure scheduled jobs run as expected or else revert to a contingency plan.
- Managed Kubernetes clusters that are automatically started, stopped, and orchestrated for high-performance, scalable cloud computing.
- Model and Flow auto-documentation that captures all relevant metadata about project datasets, model settings, and transformations.
- And much more.
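As one concrete illustration of the first point above, here is a minimal sketch of a Python recipe in Dataiku that reads and writes datasets through managed connections. The dataset names are hypothetical; the key idea is that credentials, hosts, and file formats live in the platform’s connection configuration rather than in the code.

```python
# Minimal sketch of a Dataiku Python recipe. Dataset names are hypothetical;
# connection details (credentials, host, storage format) are configured on
# the platform side, so none of that glue appears here.
import dataiku

# Read an input dataset through its managed connection.
customers = dataiku.Dataset("customers_raw").get_dataframe()

# Core logic only: the transformation the data scientist actually cares about.
customers["tenure_years"] = customers["tenure_months"] / 12

# Write the result back; the platform handles schema propagation and storage.
output = dataiku.Dataset("customers_prepared")
output.write_with_schema(customers)
```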