How Dataiku's AI Assistant Upgrades Data Science Workflows

By Kirsten Hoogenakker (Use Cases & Projects, Dataiku Product)

Full disclosure: As a citizen data scientist, I’ve never formally trained as a software developer or professional data engineer. However, I do have plenty of experience iterating through code until I meet my goals — whether that’s doing exploratory data analysis (EDA) in a notebook to understand a dataset, engineering features in preparation for machine learning, or reusing code someone else has written for a specific data task. Luckily for me, Dataiku provides AI Code Assistants in Jupyter Notebooks and VS Code to help address the time-consuming tasks I run into when building out data pipelines in code. 

These AI Code Assistants take advantage of code assistance services from popular LLM providers such as OpenAI, Cohere, or Hugging Face, with extra prompt engineering built into each request by Dataiku. Depending on which LLMs your company has made available via the LLM Mesh, your Dataiku admins choose which service to use as the default model for this feature. Read on to learn about my favorite shortcuts that save both my time and my sanity!

AI Code Assistant in Jupyter Notebooks

%aiexplain

Let’s say I’m starting off with a copy of someone else’s notebook, perhaps something I grabbed from Git or a snippet of code pulled from Stack Overflow. In some cases, there may be a library I’m unfamiliar with or a function I hadn’t thought of using. I may have some idea of what the code is doing, but I might need a little more help. Enter %aiexplain. This handy tool provides a quick overview that shows me where to start syntactically and gives me extra context.

[Image: %aiexplain demo in a Jupyter notebook]

Normally, if there were parts of the code I didn’t understand, I’d be relegated to sifting through documentation and Stack Overflow threads, running multiple web searches to fully understand what was going on in a block of code. %aiexplain acts like a personal translator and tutor: it provides a step-by-step explanation of the code block and, in doing so, encourages upskilling within an organization. If I’m an analyst who wants to know how the engineer transformed the raw data for my dashboard, this tool can help me follow the workflow in its entirety without extensive knowledge of code. This democratized understanding of technical topics helps break down barriers and communication struggles between teams without placing undue burden on data scientists to always do the explaining.

%aiask

Great, so now that I have an idea of what’s going on in the notebook, I can start making my own additions. Let’s say I want to do some simple EDA to understand if there are missing values in the datasets that are acting as inputs to this notebook, but I’m uncertain of the syntax for that step. 

One option to get help would be heading over to ChatGPT to ask about this syntax. However, I know that my company’s AI policy requires me to keep internal data within the organization. Instead of going outside the platform, I can use %aiask within the notebook in Dataiku. In the same way I would ask questions in ChatGPT, I can interact conversationally with this assistant to get answers about my specific code within the Dataiku notebook, with the added benefit of security and governance.
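For the missing-values check described above, the answer usually comes down to a couple of pandas calls. The sketch below is illustrative only: it is not actual assistant output, and the `orders` DataFrame, its columns, and the `missing_value_report` helper are made up for the example.

```python
import pandas as pd

# Hypothetical input data; in a real notebook this would come from a project dataset.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "amount": [25.0, None, 40.5, None],
        "region": ["EU", "US", None, "US"],
    }
)


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column."""
    report = pd.DataFrame(
        {
            "missing_count": df.isna().sum(),
            "missing_share": df.isna().mean(),
        }
    )
    # Columns with the most missing values come first.
    return report.sort_values("missing_count", ascending=False)


report = missing_value_report(orders)
print(report)
```

Reporting the share alongside the raw count makes it easier to compare columns across datasets of different sizes.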

Since I already have the outliers from the code block I ran through %aiexplain above, I might use %aiask to visualize that data in the notebook, as shown below.

[Image: %aiask demo in a Jupyter notebook]

As with the previous assistant, I no longer need to run multiple Google searches or skim documentation and Stack Overflow threads. All my research and iterations can happen in one place, which simplifies my workflow and gives me confidence that I’m learning and growing in my coding abilities.

%aiwrite

At this point, we’ve been able to explain blocks of code, ask questions about the data we’re ingesting, and visualize it, but what about actually writing and running the transformations themselves? As a subject matter expert, I know the steps needed to clean and transform the data, but I’m not quite sure how to translate those into Python. For a beginner coder or someone who is code-curious, %aiwrite can write the code to accomplish tasks described in natural language.

[Image: %aiwrite demo in a Jupyter notebook]
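To make the idea concrete, imagine the plain-language request "drop duplicate rows, fill missing amounts with the median, parse the order date, and add a month column." The sketch below shows the kind of pandas code such a request could translate into. It is a hand-written illustration, not actual assistant output, and the `raw` DataFrame, its columns, and the `clean_orders` function are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data standing in for a real input dataset.
raw = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3],
        "order_date": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-02-20"],
        "amount": [10.0, 10.0, None, 30.0],
    }
)


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the plain-language request."""
    cleaned = df.drop_duplicates().copy()
    # Fill missing amounts with the median of the known amounts.
    cleaned["amount"] = cleaned["amount"].fillna(cleaned["amount"].median())
    # Parse the date strings, then derive the month number from them.
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])
    cleaned["order_month"] = cleaned["order_date"].dt.month
    return cleaned.reset_index(drop=True)


cleaned = clean_orders(raw)
print(cleaned)
```

Each sentence fragment of the request maps to one or two lines of code, which makes generated code like this easy to review step by step before running it on real data.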

This process might feel a little like cheating, but it comes with a number of benefits: saving me countless hours researching libraries and their syntax, creating a higher-quality end product, and improving my overall coding experience, with less frustration from trial and error. Plus, %aiwrite allows me to harness the power of Python to go beyond the visual recipes I normally use in Dataiku. This extensibility means I can own more parts of the data pipeline and go further in my data projects without an additional handoff between teams.

AI Code Assistant in VS Code

If I’m only a citizen data scientist, I’m even less of a full-fledged application developer! I can read and adjust someone else’s web app backend code, but I will definitely accept all the help I can get. Luckily, the AI Code Assistant can also be used in VS Code from inside Dataiku Code Studios. With this helper, along with the Code Studios template specially designed to work with Streamlit, I can set aside the fear of developing my own web apps from scratch.

In the same way I leveraged the assistant in a Jupyter notebook to help me write code, in VS Code, I can leverage a shortcut to explain parts of the code to me or even write code from a comment. 

[Image: AI Code Assistant demo in VS Code]

The panel on the left-hand side records my conversation with the AI Code Assistant. Just as with ChatGPT, the conversation history lets me go back to previous questions or find ways to re-engineer my coding prompts. Unlike ChatGPT, this assistant passes along the context of my code, including code blocks that have been run before, to generate answers specific to the code I’m writing. It does so via APIs, so my code is not publicly shared or used for third-party model retraining.

The assistant even takes on some of the most dreaded development tasks (unit testing, refactoring, and documentation) to help ensure my code is standardized and ready for production. When I first started programming, I had no idea how to properly document my code, and I certainly didn’t know the purpose of unit testing. Having assistants for both of these tasks means projects can get into production faster and with additional guardrails to ensure the code is built and commented properly.
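For anyone as unfamiliar with these tasks as I once was, here is a minimal sketch of what "documented and unit-tested" looks like in practice. It uses Python’s standard `unittest` module; the `percent_missing` function and its tests are hypothetical examples written for illustration, not output from the assistant.

```python
import unittest


def percent_missing(values):
    """Return the percentage of entries in `values` that are None.

    Args:
        values: A list of values, where None marks a missing entry.

    Returns:
        A float between 0.0 and 100.0. An empty list yields 0.0.
    """
    if not values:
        return 0.0
    missing = sum(1 for value in values if value is None)
    return 100.0 * missing / len(values)


class PercentMissingTest(unittest.TestCase):
    """Each test pins down one expected behavior of percent_missing."""

    def test_half_missing(self):
        self.assertEqual(percent_missing([1, None, 3, None]), 50.0)

    def test_no_missing(self):
        self.assertEqual(percent_missing([1, 2, 3]), 0.0)

    def test_empty_list(self):
        self.assertEqual(percent_missing([]), 0.0)


# Run the tests programmatically so this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PercentMissingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The docstring tells the next reader what the function promises, and the tests check that the promise holds, including the easy-to-forget empty-list case.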

In the realm of coding, where every line is a potential breakthrough or a puzzle waiting to be solved, having a reliable guide can be a game-changer. As a citizen data scientist navigating the intricate world of code, I've encountered countless scenarios where I needed advice but didn’t want to bother my more technical colleagues for help. The AI Code Assistant in Dataiku emerged as my trusted companion. It's more than just a tool; it's a connecting bridge between seasoned experts and those with curious minds eager to learn. Let this assistant be your ally, too, and watch your code journey unfold with newfound confidence. Happy coding!
