This article is based on the 2022 Dataiku Frontrunner Awards submission from Snow Fox Data, a division of Excelion Partners. The original authors are Ryan Moore, Head of Solutions and Delivery, and Tony Olson, Head of Strategy and Partnerships.
Through the Dataiku Partner Ecosystem, organizations across industries and locations deliver value and innovation to customers alongside Dataiku. As one of our top consulting partners, Snow Fox Data works with numerous customers who utilize Dataiku in their data science and analytics practice. Their work with various clients and industries allowed them to quickly identify a common pain point related to data cataloging and lineage.
To help their customers resolve this challenge, Snow Fox Data created a lightweight plugin that directly integrates with Dataiku and its datasets. Meet one of the plugin’s creators, Head of Solutions and Delivery Ryan Moore, in the video below. Continue reading to discover the story behind their award-winning data science solution, which was recognized as a finalist in the 2022 Dataiku Frontrunner Awards.
Identifying Customer Pain Points
Snow Fox Data observed that many of the organizations and analytics groups it worked with had yet to invest in an enterprise data cataloging or data lineage tool, an investment that is often cost prohibitive. Instead, these teams built “homegrown” data cataloging solutions from a combination of spreadsheets, Dataiku, and their preferred visualization tool. These solutions were labor-intensive to maintain and did not fit efficiently into the workflow of the developers who were hands-on with Dataiku projects.
Data lineage was another common challenge. Although clients were creating numerous downstream datasets in Dataiku, they had no visibility into upstream data lineage. This lack of visibility can lead clients to lose trust in the data and, ultimately, in the solution’s business outcomes.
Developing a Business Solution With Dataiku
To meet this challenge, the team at Snow Fox Data created a free Dataiku plugin called THREAD™. THREAD™ is a lightweight cataloging tool that ties in data lineage and provides a single place to document data connected to Dataiku and to consume the catalog’s contents quickly and efficiently.
Implemented as a Dataiku web app plugin, THREAD™ can securely scan all or part of a Dataiku node to enable lineage views and documentation. The resulting indexes and metadata are then saved as Dataiku datasets in a project flow, making them easy to export for use in third-party visualization tools.
Because THREAD™ is built on top of Dataiku, Snow Fox Data can take advantage of custom plugins and the Python API. The result is native security integration that removes governance concerns and speeds up the innovation process.
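To make the idea of scanning a node concrete, here is a minimal sketch of how column-level metadata could be collected through Dataiku's Python API client. It is not THREAD™'s actual implementation: the function name `scan_node`, the row layout, and the assumption that column descriptions live under a `comment` key in each schema column are all illustrative choices.

```python
def scan_node(client, project_keys=None):
    """Collect one catalog row per dataset column.

    `client` is expected to behave like a Dataiku API client
    (list_project_keys, get_project). Passing `project_keys`
    restricts the scan to part of the node.
    """
    if project_keys is None:
        project_keys = client.list_project_keys()

    rows = []
    for key in project_keys:
        project = client.get_project(key)
        for item in project.list_datasets():
            name = item["name"]
            schema = project.get_dataset(name).get_schema()
            for column in schema.get("columns", []):
                rows.append({
                    "project": key,
                    "dataset": name,
                    "column": column["name"],
                    "type": column.get("type", ""),
                    # Assumption: the description is stored as "comment".
                    "description": column.get("comment", ""),
                })
    return rows
```

Inside Dataiku, a client of this kind would typically come from `dataiku.api_client()`, and the scan would only see projects the calling user is allowed to read. The resulting rows could then be written to a dataset in a project flow.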
Value Generated by the THREAD™ Plugin
Since its development, the THREAD™ plugin has been deployed on hundreds of projects across multiple joint clients of Snow Fox Data and Dataiku. Areas of business value include:
Having the data definition available at the time and place it is needed means fewer clicks and less time spent searching. Better-documented columns also give users more and improved insights during exploratory analysis.
THREAD™ also allows for faster solution building and data enrichment through documentation and improved data understanding.
The plugin provides clear measurements of governance through KPIs that show the percentage of columns documented in any dataset.
Further ways that THREAD™ improves governance include:
- Creation of a repository for data documentation.
- Ease of keeping definitions up-to-date.
- Native integration with Dataiku permissions that limit the editing of data definitions to those with access.
- Definitions that are easily auditable and exportable.
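The documentation KPI mentioned above, the percentage of columns documented in a dataset, reduces to simple arithmetic. The sketch below is one possible way to compute it, assuming each column is a dictionary with a `name` and an optional `comment` holding its description; the function name and the sample columns are hypothetical.

```python
def documentation_coverage(columns):
    """Return the percentage (0-100) of columns with a non-empty description."""
    if not columns:
        return 0.0
    documented = sum(
        1 for column in columns
        if (column.get("comment") or "").strip()
    )
    return 100.0 * documented / len(columns)


# Hypothetical schema: two of the four columns carry a description.
sample_columns = [
    {"name": "order_id", "comment": "Unique order identifier"},
    {"name": "cust_id", "comment": ""},
    {"name": "amt", "comment": "Order total in USD"},
    {"name": "ts"},
]
print(documentation_coverage(sample_columns))  # 50.0
```

Treating blank or whitespace-only descriptions as undocumented keeps the KPI from being inflated by placeholder entries.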
A common challenge that Snow Fox Data encounters across clients is that analysts start working on a project, begin investigating the data, and then realize they do not understand how the data is defined or what it represents. The THREAD™ plugin helps eliminate this ambiguity by building a common language between the source systems and the places where that data is consumed.
Furthermore, it gives data analysts, data engineers, data scientists, and business leaders a transparent view of:
- What data was used in a project (data catalog).
- Where it was used (upstream/downstream data lineage).
- How that data is defined throughout the project (data dictionary).
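Upstream and downstream lineage can be pictured as a walk over a directed graph of datasets. The following sketch illustrates that idea only and is not how THREAD™ stores or computes lineage: the edge list, dataset names, and helper functions are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Index (source, target) dataset pairs in both directions."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start, neighbors):
    """Return every dataset reachable from `start`, excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical flow: each pair means "source feeds target".
edges = [
    ("crm_raw", "customers_clean"),
    ("orders_raw", "orders_clean"),
    ("customers_clean", "customer_orders"),
    ("orders_clean", "customer_orders"),
    ("customer_orders", "churn_features"),
]
upstream, downstream = build_graph(edges)

print(sorted(reachable("churn_features", upstream)))
# ['crm_raw', 'customer_orders', 'customers_clean', 'orders_clean', 'orders_raw']
print(sorted(reachable("crm_raw", downstream)))
# ['churn_features', 'customer_orders', 'customers_clean']
```

Keeping both directions indexed means the same traversal answers "where did this come from?" and "what depends on this?" without rebuilding the graph.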
Training and Onboarding Efficiencies
In addition to helping new team members learn company-specific terms and abbreviations faster, the plugin streamlines onboarding and training by keeping everyone in Dataiku rather than scattered across a myriad of spreadsheets and code documentation.