Maximize Results With Code: Tips From a Data Engineer

Dataiku Product, Scaling AI | Marie Merveilleux du Vignaux

Guests at the Toronto session of the 2023 Dataiku Everyday AI Conferences heard from Dataiku solutions engineer Kateryna Savchyn, who spoke in depth about the specialized coding options available to augment existing Dataiku products and solutions.

→ Watch the Full Session

Building Complex, Code-First Projects

The overall goal of Savchyn’s talk was, to use her words, “to accelerate you through the thought process of building analytical projects and walk you through the ways in which you can build a completely code-first pipeline, how you can build your own custom estimators, and all the ways you can augment beyond visual tools.” As a solutions engineer, her background is in data science, so her talk drew from real-life examples from questions she has been asked and problems she has solved in her career.

In general, she focused her talk on how to write clean code, how to formalize it into a code library, and how to manage that code in Dataiku. She explained that data scientists are always looking for ways to build their own estimators and share them for reuse with the rest of their organization, while front-end designers are interested in ways to build visual front ends as a wrapper around those projects.

I want to show you common ways we optimize compute and scalability, and how you can automate your projects to make sure that what you built once becomes a self-sustaining project that continues to deliver value after you’ve created it. 

-Kateryna Savchyn, Solutions Engineer, Dataiku

Performing Exploratory Analysis and Data Transformations

Savchyn started her deep dive by recognizing how much flexibility comes built-in with Dataiku products. “The good news is that Dataiku is a platform that comes with its own set of APIs that are accessible from either in the interface or externally,” she said. “This will give you a lot of flexibility in the ways that you can build your pipelines from scratch or automate them with any approach you’re interested in.”

The first step for any project after accessing data, she explained, takes place within a notebook environment or a preferred IDE (integrated development environment). Developers can choose between embedded notebooks and embedded IDEs based on preference, and integrations are available with local IDEs that a developer or data scientist might already be using.

Hopping into a quick demo, she showed how data flows through a given project. “If I want to explore my data or transform it, I can start by building a code recipe. I can select the language I’d like to code in, or use code from a dynamic notebook environment,” she began. “I’d like to draw your attention to the import of the Dataiku package. This is a wrapper around the API that makes it easier for you to interface with the artifacts in your project.”

What’s cool about this is that the data can live anywhere. It can be in cloud storage, it can be in a database, and I don’t have to worry about defining that connection from scratch in my code. 

-Kateryna Savchyn, Solutions Engineer, Dataiku
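
As a minimal sketch of what such a code recipe can look like, the dataiku package resolves the underlying connection for you (the dataset names below are hypothetical):

```python
# A Python code recipe: the dataiku package wraps the API and resolves connections
import dataiku

# Read the input dataset into a pandas DataFrame, wherever it physically lives
input_ds = dataiku.Dataset("customer_transactions")  # hypothetical dataset name
df = input_ds.get_dataframe()

# Example transformation
df = df[df["amount"] > 0]

# Write the result to the recipe's output dataset
output_ds = dataiku.Dataset("customer_transactions_prepared")  # hypothetical output name
output_ds.write_with_schema(df)
```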

She again brought up the different ways a coder could interact with Dataiku. “I can edit in a code notebook — my dynamic environment that you’re familiar with,” she said. “You can run your different blocks of code, formalize it, troubleshoot it, and save it back into the recipe component.” She also used her demo to showcase the embedded IDE and local IDE integrations. “Now I have the flexibility of coding how I’m used to and saving and synchronizing with the project, and still collaborating with the rest of the team.”

Making Code Reusable

Savchyn continued, focusing this part of her talk on how to expose code that has been created so that other developers and teams can use it and adapt it for their own use cases. She first spoke about a feature called “code samples.”

“Code samples are a library of different templates that I can copy and paste directly into my code and build on top of, so I don’t have to redesign it every time,” she said. “What’s cool about these samples is that you can create as many as you need. If you build a really cool transformation, you can save it here and it’s accessible to other users.”

She then spoke about plugin recipes. She encouraged listeners to think of these as visual wrappers around code that has been created to perform a certain task. “We have a whole library of plugins that have already been created by the Dataiku community,” she said. “We also have an open source GitHub, and as the developer, not only can you leverage these, you can use them as a baseline to extend and add functionality to, or you can build a plugin from scratch.” Plugins are versatile tools, allowing users to wrap different artifacts and components into them, or to wrap a macro or model into an automation scenario.
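
For illustration, the Python half of a plugin recipe might look like the sketch below, assuming the plugin’s recipe.json declares an input role named "input_dataset", an output role named "output_dataset", and a "threshold" parameter (all hypothetical names):

```python
# recipe.py inside a plugin: a visual wrapper around custom transformation code
import dataiku
from dataiku.customrecipe import (
    get_input_names_for_role,
    get_output_names_for_role,
    get_recipe_config,
)

# Resolve the datasets the user mapped to the plugin's roles in the visual UI
input_name = get_input_names_for_role("input_dataset")[0]
output_name = get_output_names_for_role("output_dataset")[0]

# Read the parameter the user set in the plugin's form
threshold = float(get_recipe_config().get("threshold", 0.5))

df = dataiku.Dataset(input_name).get_dataframe()
filtered = df[df["score"] >= threshold]  # hypothetical "score" column
dataiku.Dataset(output_name).write_with_schema(filtered)
```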

She made sure to reinforce the library concept. “This is the one-stop shop where you can all contribute and build custom functions or custom classes for your project,” she said. Having a library avoids the scenario of every data scientist building their own code and each having their own definitions scattered throughout a discrete pipeline, per Savchyn.
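
As a hedged sketch of that pattern, a shared helper could live in the project library (here a hypothetical module named churn_helpers.py in the library’s Python folder) and then be imported from any recipe or notebook in the project with `from churn_helpers import add_tenure_bucket`:

```python
# churn_helpers.py -- a hypothetical module in the project's shared code library
import pandas as pd


def add_tenure_bucket(df: pd.DataFrame, tenure_col: str = "tenure_months") -> pd.DataFrame:
    """Bucket customer tenure into coarse segments so every pipeline uses the same definition."""
    bins = [0, 6, 12, 24, float("inf")]
    labels = ["0-6m", "6-12m", "1-2y", "2y+"]
    df["tenure_bucket"] = pd.cut(df[tenure_col], bins=bins, labels=labels)
    return df
```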

Working With Code Administration

Based on questions she had received from coders, Savchyn then moved towards the administrative side of the house. She wanted to share best practices around installing packages for exploratory analysis, version control in Dataiku, and how to handle projects if a coder already has his or her own version control methodology and external repositories for managing code.

In response to the first question, she said, “Dataiku allows you to create as many R or Python code environments as you need, with whatever base interpreter is required for various tasks. From a usability standpoint, that’s really helpful because whenever you’re working in a recipe, a notebook, or an IDE, you can just point to the code environment that you want to use for that particular script.” In essence, this helps individual coders stay out of each other’s way because they each have their own environments and packages.

With regard to version control, she noted that every Dataiku project automatically comes with its own Git repository. “Any time changes are made throughout the project,” she began, “it’s logged as a Git commit, and you can explore all of your project-level Git commits, compare them, and roll back changes as needed.” Dataiku also handles script-level version control, meaning that every recipe has its own history tab showing all of its commits, which can be compared or reverted as needed. For the deepest level of control, Dataiku also includes ways to synchronize with custom remote repositories, because some customers already have their own code bases and version control methodologies.

For a given project, coders are able to define a remote repo, synchronize to it, push to it, and manage it however they’d like. After setting up a global project library, they can also set a remote repo to it and manage it that way. Even on a notebook level, they are able to track individual development notebooks, adding a level of tracking beyond the Dataiku defaults.

Our intention is to reduce the level of administrative overhead for you while also being flexible to accommodate your working style.

-Kateryna Savchyn, Solutions Engineer, Dataiku

Building Code-First Machine Learning

“One of the big draws of Dataiku is obviously our visual learning labs,” she said. “The labs are specifically designed to accelerate you through the process of building an ML pipeline, tweaking it, making changes, and getting to the champion model faster, but there are multiple ways that you can augment your experience.”

Savchyn explained that the first way is to take models created in the visual labs and offload them into a Jupyter notebook, either to explore the syntax underneath or to use them as a foundational building block. The second way, she continued, is that “some of our labs are actually code-first. If you’re working on a deep learning case and you want to build your own neural network, you can go in and change the underlying architecture of that model with Python code.”
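
To give a flavor of what that looks like, here is an illustrative Keras snippet of the kind of architecture code a user might edit (this is a generic sketch, not Dataiku’s exact lab interface):

```python
# Illustrative only: a small Keras architecture a user might tweak in a code-first lab
from tensorflow import keras


def build_model(input_dim: int, n_classes: int) -> keras.Model:
    """Build a small fully connected network; edit the layers here to change the architecture."""
    inputs = keras.Input(shape=(input_dim,))
    x = keras.layers.Dense(64, activation="relu")(inputs)
    x = keras.layers.Dropout(0.3)(x)
    x = keras.layers.Dense(32, activation="relu")(x)
    outputs = keras.layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```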

Thirdly, she spoke about a feature called experiment tracking. “The integration of that feature with MLflow lets you start your own recipe and build a completely code-first ML pipeline, but then benefit from Dataiku keeping tabs on all the experiments you run,” she said. From an admin perspective, it becomes easier for the coder to find the highest-performing experiments. From there, there are ways to wrap them up and expose them to the rest of the organization. One of those ways is the aforementioned plugins, which let the coder wrap the model in a visual interface, making it simpler for others to use it and compare it against models that come out of the visual lab.
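
A minimal sketch of that pattern, assuming a managed folder (with the hypothetical ID "experiments") is available to store the runs, uses the MLflow handle that Dataiku sets up:

```python
# Code-first experiment tracking: Dataiku keeps tabs on MLflow runs
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
folder = project.get_managed_folder("experiments")  # hypothetical managed folder ID

with project.setup_mlflow(managed_folder=folder) as mlflow:
    mlflow.set_experiment("churn-baselines")      # hypothetical experiment name
    with mlflow.start_run(run_name="logreg_v1"):  # hypothetical run name
        mlflow.log_param("C", 0.1)                # placeholder hyperparameter
        mlflow.log_metric("auc", 0.87)            # placeholder metric value
```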

Designers Can Use Dataiku, Too

In addition to coders, Savchyn wanted to call attention to front-end designers. “They’re also part of the team,” she said. “We let them work from the same context of the project that the rest of the developers work with. For the designer, we have different web application frameworks to give them a range of options for their comfort level and experience.” Dataiku makes these web applications easier to build, test, and deploy. She also mentioned that “Dataiku is scalable with Kubernetes, so designers don’t have to worry about the admin burden of how to surface it to the consumer.”

In a live demo, she showed the Dataiku web applications panel and its split-screen UI. “Once you’re developing a web application, you’ll be in this split-screen paradigm, where on the left you can import any packages you need and update your code, while on the right you can see the way that your visual interface is shaping up,” she said.
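
As an example of what goes in that left pane, here is a minimal sketch of a Dash web application, assuming Dataiku’s Dash webapp template (which provides the `app` object) and a hypothetical dataset and column name:

```python
# A minimal Dash webapp: the `app` object is provided by Dataiku's Dash template
import dataiku
from dash import dcc, html
import plotly.express as px

# Read a project dataset straight into the webapp (hypothetical dataset name)
df = dataiku.Dataset("customer_transactions_prepared").get_dataframe()

# Build a simple chart from a hypothetical "amount" column
fig = px.histogram(df, x="amount", title="Transaction amounts")

# Define the layout; Dataiku serves the app for you
app.layout = html.Div([
    html.H3("Customer transactions"),
    dcc.Graph(figure=fig),
])
```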

Making the Most of Compute and Scalability

For any ETL (extract, transform, and load) project, Savchyn explained that, “You want to make sure that you can accommodate the size of the data that you’re working with, or the complexity of the transformations in your project. There are three ways that Dataiku makes this very simple.”

First, she shared that if coders are working in any SQL framework where the data lives in SQL databases, they can use Dataiku to offload all of the compute directly into the database and make the most of that infrastructure. Second, teams can create more distributed compute by managing multiple Spark clusters and exposing them to developers, who can then write in their preferred Spark flavor and resize the clusters depending on what a given transformation needs. Lastly, she explained that if coders are writing R or Python for the flow, they can containerize that task if they know ahead of time that it will require additional compute resources.
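
For the first option, a hedged sketch of in-database execution from a Python recipe might look like this (the dataset and table names are hypothetical, and the actual table name may differ from the dataset name):

```python
# Push an aggregation down into the SQL database instead of pulling raw rows into Python
import dataiku
from dataiku import SQLExecutor2

ds = dataiku.Dataset("customer_transactions")  # hypothetical SQL-backed dataset
executor = SQLExecutor2(dataset=ds)

query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM customer_transactions  -- hypothetical table name
    GROUP BY customer_id
"""

# Only the aggregated result comes back as a pandas DataFrame
agg_df = executor.query_to_df(query)
```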

In a demo, she showed that when users build visual recipes, Dataiku generates the SQL logic automatically. Non-coders can also simply choose the engine they’d like to use, even if they are less familiar with Spark and its configuration.

Achieving Full Automation

“Automation is very important,” Savchyn said. “We have our automation scenarios feature, but I want to illustrate how parameterizing and using project variables in combination with automation scenarios can be a powerful way to make your project self-sufficient.”

That last mile of the project makes the difference between it just being a prototype and being an actual, functional living project after you’ve created it.

-Kateryna Savchyn, Solutions Engineer, Dataiku

In a demo, she pointed out, “The most important thing to notice is that there’s a reference, using this dollar sign syntax, to a project variable called ‘latest date labeled.’ You can create as many variables as you need, set a value to them, and once they exist, you can update them using code from the context of whatever recipe you’re using, but you can also dynamically update them from the context of the scenario.”
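
A minimal sketch of both sides of that pattern (the variable name below is a hypothetical snake_case version of the one in the demo):

```python
import dataiku

# Inside a recipe: read the resolved value of a project variable
latest_labeled = dataiku.get_custom_variables().get("latest_date_labeled")

# From a scenario step or other script: update the variable so the next run picks it up
client = dataiku.api_client()
project = client.get_default_project()
variables = project.get_variables()
variables["standard"]["latest_date_labeled"] = "2024-01-31"  # placeholder value
project.set_variables(variables)
```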

She encouraged the audience to think of automation scenarios as the “CI/CD component of Dataiku, where you can automate tasks, run jobs, and develop your own checks and balances for the processes that you’re running on an ongoing basis.” Coders have the option of chaining together a sequence of different steps from a library, or they can start from scratch using Python scripts.
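
As a sketch of the from-scratch route, a custom Python scenario script can chain steps programmatically (the dataset and saved model names below are hypothetical):

```python
# A custom Python scenario script: rebuild the data, then retrain the model
from dataiku.scenario import Scenario

scenario = Scenario()
scenario.build_dataset("customer_transactions_prepared")  # hypothetical dataset name
scenario.train_model("churn_model")                       # hypothetical saved model ID
```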

In addition, coders are able to set up “triggers” and “reporters.” As the names suggest, triggers initiate a process, and custom trigger conditions can be created by writing code. Reporters are the feedback mechanisms that set off alerts or notifications, and they can send artifacts if needed. The ability to apply custom parameters to a project, combined with the ability to code them, means that data scientists can set up whatever feedback loops or actions they want to reproduce without having to maintain anything by hand.
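
A custom trigger, for instance, is just a short script that decides whether to fire. Here is a hedged sketch, assuming a hypothetical labeled_data dataset with an ISO-formatted label_date column and the project variable from earlier:

```python
# A custom Python trigger: fire the scenario only when newer labeled data has arrived
import dataiku
from dataiku.scenario import Trigger

t = Trigger()

# Last date the pipeline already processed, stored as a project variable
latest_seen = dataiku.get_custom_variables().get("latest_date_labeled", "")

# Newest label date currently in the data (hypothetical dataset and column names)
df = dataiku.Dataset("labeled_data").get_dataframe(columns=["label_date"])

if not df.empty and str(df["label_date"].max()) > latest_seen:
    t.fire()
```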

Next Steps and Rich Resources

Savchyn made sure to leave guests with plenty of information so they would feel comfortable setting off to build solutions in Dataiku. Everything she covered, from writing your code, packaging it into reusable components, and building your own custom estimators to finding examples for front-end designers and using scalability and automation to make sure all of it is repeatable, “is available,” she said.

She pointed to the Dataiku Academy, which contains free, self-paced exercises to build projects and see them in action; a rich set of documentation pages with tutorials and example scripts; and a very active Community for users to ask questions, talk about use cases, and work through challenges and approaches.
