This is a guest article by Sooter Saalu. Sooter is a data professional with over four years of experience working across data-centric roles. He has a working knowledge of Python, SQL, and several BI and data engineering tools. With an educational background in psychology and a five-year career in making complex topics seem simple to technical and non-technical audiences, Sooter brings a unique perspective to the numbers.
Dataiku and Snowflake are a dynamic duo — and they like it that way. In fact, Dataiku was named Snowflake Machine Learning/AI Partner of the Year in 2021 and again in 2022. And now, Dataiku, the platform for Everyday AI, has partnered with Snowflake, the Data Cloud Company, to deliver an elastically scalable platform that liberates all your data across clouds to scale AI and machine learning (ML) projects.
Snowflake is one of the most scalable cloud data warehouse solutions out there. It owes its popularity over the last few years to its decoupling of compute and storage, which delivers enhanced scalability, fast performance, and the ability to manage datasets of all types and sizes across clouds (Amazon Web Services, Microsoft Azure, and Google Cloud Platform).
And now, Dataiku features native integrations with Snowflake for instant data access, pushdown execution of data science workloads, running data pipelines, and batch model scoring.
In this blog, new and existing Dataiku users will learn how to quickly connect to an existing Snowflake data warehouse from inside the Dataiku platform and then prepare and process Snowflake data using Dataiku’s unique pushdown execution via Java User-Defined Functions (UDFs).
Prerequisites and Setup
There are no skill prerequisites for this walkthrough. All you need is a Snowflake account. If you don’t already have one, you can sign up for a free 30-day trial. If you already have access to Dataiku Online, you can connect to your existing account. Or, we’ll walk you through how to activate a 14-day free trial right from Snowflake.
Load Data to Your Snowflake Data Warehouse
Your Snowflake account comes loaded with sample data, but for this walkthrough, you will use the Snowflake Marketplace to access a freely available curated dataset: Labor Data Atlas.
From the Snowflake Marketplace, add a dataset to your Snowflake workspace by selecting the Labor Data Atlas database from Knoema. This dataset showcases labor market statistics curated from certified public sources such as Eurostat.
Click Get. A window will pop up where you can rename the database and assign user roles.
Click the Options dropdown and add the PUBLIC role to this database. Then select Get again to create the database. Granting the PUBLIC role lets Dataiku connect to the database without requiring additional permissions. Learn more about user roles in Snowflake.
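If you prefer to script this step, the same grant can be run with the Snowflake Python connector. This is a minimal sketch, assuming the snowflake-connector-python package, placeholder credentials, and that you kept the default database name LABOR_DATA_ATLAS:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account_identifier",  # placeholder, e.g. your account locator
    user="your_username",               # placeholder
    password="your_password",           # placeholder
    role="ACCOUNTADMIN",                # any role allowed to grant privileges
)

with conn.cursor() as cur:
    # Marketplace databases are shared, so the grant uses IMPORTED PRIVILEGES
    cur.execute(
        "GRANT IMPORTED PRIVILEGES ON DATABASE LABOR_DATA_ATLAS TO ROLE PUBLIC"
    )
    # Mirrors the check you will do in the Data tab below
    cur.execute("SHOW DATABASES LIKE 'LABOR_DATA_ATLAS'")
    print(cur.fetchall())

conn.close()
```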
Go to the Data tab to confirm that Snowflake has created the database there. You should see the labor database listed:
Launch Dataiku From Snowflake Partner Connect
Now, it’s time to launch your Dataiku instance from Snowflake Partner Connect. Start by selecting Partner Connect on the Admin tab. You will see the Dataiku option under the Data Science and Machine Learning category. Select Dataiku and click Connect. Use the default settings that appear.
You can either connect your existing Dataiku Online account or create a Dataiku trial account with the details you added to your Snowflake workspace. Your 14-day free trial takes less than two minutes to set up and allows up to five users to explore Dataiku’s capabilities and collaborate on AI projects.
Select Activate to complete the signup process for your Dataiku trial account. Add an email and password, accept the legal terms, and fill out some know-your-customer (KYC) information.
Next, you will see the Dataiku Online launchpad with a running Snowflake connection; Dataiku will use this connection throughout the rest of the walkthrough.
Alternate Connection Method: Connect From Your Dataiku Account
Alternatively, you can connect to Snowflake from your Dataiku account by going to your Dataiku Online launchpad and adding a new feature:
From the launchpad, you can add data sources, storage, or extensions to other tools. Select Snowflake as a data source. You’ll see a prompt asking you to input your Snowflake workspace credentials. Complete the configuration, and you’ll have a Snowflake connection added to your Dataiku instance.
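Before pasting credentials into the configuration screen, it can help to confirm they actually work. A minimal sketch, again assuming the snowflake-connector-python package and placeholder values:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account_identifier",  # placeholder
    user="your_username",               # placeholder
    password="your_password",           # placeholder
)

with conn.cursor() as cur:
    # If this query succeeds, the same credentials should work in Dataiku
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_USER(), CURRENT_ROLE()")
    print(cur.fetchone())

conn.close()
```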
Using Your Snowflake Data in Dataiku
You now have a full elastic AI stack: Dataiku as your AI and ML platform running on the Snowflake Data Cloud, with an active connection to your Snowflake data. You can use this connection to clean, query, analyze, and visualize your data and to build ML models from it.
Open Dataiku DSS, select New Project, and click Blank project.
Your sample project homepage should look like the image below. Click on + IMPORT YOUR FIRST DATASET.
This action brings up a popup with dataset options. Select Snowflake.
Select your Snowflake connection from the dropdown, then input the following in the Table, Database, and Schema fields to import a table on labor account statistics in Australia from your Labor Snowflake database.
Table: ABS_LABOUR_ACCT
Database: LABOR_DATA_ATLAS
Schema: LABOR
Test the connection to catch any errors. Your inputs should look like this:
Select the Create button. Dataiku will create your table in moments.
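Once the dataset has been created, you can also open it from a Dataiku notebook or Python recipe. Here is a minimal sketch using Dataiku’s Python API; the dataset name assumes you kept ABS_LABOUR_ACCT:

```python
import dataiku

# Load the imported Snowflake table as a pandas DataFrame
dataset = dataiku.Dataset("ABS_LABOUR_ACCT")
df = dataset.get_dataframe(limit=1000)  # sample the first 1,000 rows

print(df.shape)
print(df.head())
```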
Data Preparation With Dataiku
Dataiku offers an intuitive user interface that enables technical and non-technical users to curate and analyze data. You can set up repeatable data processes called Recipes either visually, using point-and-click user interfaces, or by coding your own custom transformations in Python, SQL, R, and other popular programming languages.
Select the Actions tab to view the options available to you.
Select the Prepare option under Visual Recipes and keep the default output dataset name.
Click Add a New Step. This action brings up the library of over ninety data preparation processors. Hovering over any processor name gives you a helpful tooltip on its usage.
Click on the Date column header. You can filter or create a date range for your dataset and extract elements of the dates in a new column, among other actions.
Select Compute time since. Here, you will compute the time elapsed between each date entry and the current date as a numerical value. This value can be as granular as needed, with options ranging from seconds to years. For our project, we’ll keep the default settings.
Click on the Units column header. This column has no values. If it were only sparsely populated, you could fill the missing rows with the mode or a specific value, or remove the rows with missing values entirely. Since the column is completely empty, it is best to remove it.
Select Delete.
Select the Value column header. You can also carry out several actions here, such as filtering the column with a set numerical range, rounding the decimal values, or creating a formula to handle the data.
Click Round to integer and leave the default settings intact.
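If you would rather express these steps in code, here is a rough pandas sketch of the same three transformations. It is illustrative only, not the exact processor implementation; the input file name is a placeholder, and the column names follow the dataset as imported above:

```python
import pandas as pd

df = pd.read_csv("abs_labour_acct_sample.csv")  # placeholder export of the dataset

# 1. "Compute time since": time elapsed between each date and today,
#    expressed here in days (use whatever granularity you chose in the UI)
df["Date"] = pd.to_datetime(df["Date"])
df["Date_time_since"] = (pd.Timestamp.today() - df["Date"]).dt.days

# 2. Delete the empty Units column
df = df.drop(columns=["Units"])

# 3. Round the Value column to integers
df["Value"] = df["Value"].round().astype("Int64")

print(df.head())
```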
Click the Run button. Although we have been viewing only a sample of the data, this action executes the steps you have set up against the entire dataset. The procedure might take a few seconds.
Now, you can explore the processed dataset. You have cleaned and prepared your data for your data science needs, and you can create charts and reports and update them as the dataset changes.
You can also run in-depth analyses and build predictive models using Dataiku’s AutoML capabilities, which include algorithms from the scikit-learn, Keras, and TensorFlow frameworks. If you prefer to code rather than use visual interfaces, you can use built-in Jupyter Notebooks and code recipes, shared libraries and code snippets, and predefined code templates for common tasks.
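For example, a Python code recipe follows a simple read-transform-write pattern. This is a minimal sketch, assuming an input dataset named ABS_LABOUR_ACCT and an output dataset named ABS_LABOUR_ACCT_prepared declared in the recipe settings; the filter is just an illustrative transformation:

```python
import dataiku

# Read the recipe's input dataset into a pandas DataFrame
input_ds = dataiku.Dataset("ABS_LABOUR_ACCT")
df = input_ds.get_dataframe()

# Illustrative transformation: keep only rows that have a Value
df = df[df["Value"].notna()]

# Write the result (and its schema) to the recipe's output dataset
output_ds = dataiku.Dataset("ABS_LABOUR_ACCT_prepared")
output_ds.write_with_schema(df)
```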
All this and more is available in Dataiku behind an approachable visual interface. We also offer documentation and the Dataiku Academy to help you get familiar with Dataiku’s tools and offerings.
Pushdown Execution for Data Preparation
Dataiku employs a sophisticated pushdown mechanism that allows you to leverage Snowflake’s Java User-Defined Functions (UDFs) to speed up execution time for many queries.
Dataiku automatically chooses the best available execution engine for a data preparation task. The Extract numbers processor can utilize the Snowflake UDF that allows you to extract numbers into a JSON array. The pushdown nature of the action means that it runs exceptionally fast on the Snowflake side, reducing latency in your project.
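To make that concrete, here is a tiny illustration of what extracting numbers into a JSON array means for a single text value. This is plain Python for illustration only, not the Java UDF that runs inside Snowflake:

```python
import json
import re

def extract_numbers(text: str) -> str:
    """Return every number found in `text` as a JSON array string."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    return json.dumps(numbers)

print(extract_numbers("Employment grew 2.5% in 2021, up from 3 points"))
# -> "[2.5, 2021.0, 3.0]"
```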
If you are using Dataiku Online, the extended pushdown functionality is included automatically. However, if you are using the self-managed version, you need to enable UDFs via the administrative tab. Navigate to Administration > Connections > [Your Snowflake Connection] > Advanced Params to enable Use Java UDF, as shown below:
Please note that if you are using a self-managed version of Dataiku, you will need to provide a writable stage when adding or editing your connection. Add your stage in the connection settings displayed when creating a new Snowflake connection, as shown below:
Next Steps
Over the course of this article, you have created a Snowflake database and connected it to Dataiku using the Partner Connect integration on the Snowflake platform for an easy and stress-free connection. Your data is readily available for use in Dataiku, where you can create processing pipelines to prepare, analyze, and visualize the data — and even build and deploy ML models.
Additionally, the Dataiku and Snowflake partnership offers pushdown for scoring, meaning you can efficiently run batch scoring against Snowflake data. Once you have trained a model and added it to your Flow, it can be selected and scored.
When the input dataset is connected to Snowflake, its scoring can be pushed down to Snowflake. This lets you score your batch models more quickly while maintaining control of and visibility into your models. The process is no different from traditional SQL scoring with Dataiku, but with reduced latency thanks to the extended pushdown functionality.
Together, Snowflake and Dataiku enable you to build powerful, flexible, and scalable AI and ML projects. The simple integration of these complementary solutions allows you to prepare, process, and visualize your data efficiently and intuitively.