Only about five years ago, data preparation took up as much as 80% of the time dedicated to a data project. In 2020, an Anaconda survey found that data scientists now spend about 45% of their time on data preparation tasks, including loading and cleaning data.
While this is a significant improvement, data preparation remains a time-consuming step that needs to be optimized in order to scale AI across the organization and complete more projects faster. This blog post explores two ways analysts can make their data preparation techniques more efficient in Dataiku and save time to provide value elsewhere.
Take Advantage of Dataiku’s Visual Recipes
Dataiku offers many visual recipes that make data transformations easy, letting you cleanse, blend, and prepare data for dashboards and business reporting.
For data preparation and cleansing, look for the broom icon, which represents the Prepare recipe.
This recipe includes a library of over 90 common data transformations. To get started, give the recipe an output dataset, click Create, and then scroll through the list of available processors: currency conversions, data conversions, filtering, splitting, and more.
As a bonus, Dataiku saves time during your data analysis by suggesting functions based on the meaning of your data. To find these, you can hover over a column header to show the suggested options.
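To make the kinds of transformations these processors cover more concrete, here is a minimal sketch in pandas, not the Dataiku API; the column names, the sample data, and the fixed exchange rate are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only
df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-02-17", "bad-date"],
    "amount_usd": ["100.5", "250", "75.25"],
    "customer": ["Acme Inc|US", "Globex|FR", "Initech|US"],
})

# Data conversion: parse date strings, coercing bad values to NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Type conversion: parse amount strings to numbers
df["amount_usd"] = pd.to_numeric(df["amount_usd"], errors="coerce")

# Currency conversion using an assumed fixed rate (illustrative)
USD_TO_EUR = 0.92
df["amount_eur"] = df["amount_usd"] * USD_TO_EUR

# Splitting: break one column into two on a delimiter
df[["customer_name", "country"]] = df["customer"].str.split("|", expand=True)

# Filtering: keep only rows with a valid date
df = df[df["order_date"].notna()]
```

In the Prepare recipe these steps would each be a visual processor rather than a line of code, but the underlying operations are the same.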
Adopt a Habit of Reuse
Governance is possible in Dataiku because the platform offers one centralized place from which all data work happens. Teams can keep documentation of data sources with sensitive or proprietary information, see what data a given project uses and who owns it, review what has been done on a specific project, and trace the history of the data, such as where it came from and where it is being used. These features enable analysts to trace and reproduce work.
When large amounts of data are involved, as is the case in most data science projects today, organizations should automate the task of data preparation to speed up the process and ensure consistency, explainability, and repeatability.
Automating data preparation consists of defining a series of steps or actions that run each time a defined trigger fires. Triggers can be time-based, such as a daily run, or depend on other factors, such as new data arriving in the system or an upstream job finishing.
Scheduled jobs provide a user-friendly way to set up repeated data cleaning processes so that incoming data is cleaned automatically when it arrives in the database. By automating data preparation, analysts can reuse existing automation processes instead of spending the majority of their time repeating the same tasks manually.
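The trigger-then-run pattern described above can be sketched in plain Python. This is an illustration only, assuming a hypothetical landing directory of CSV files and a placeholder clean() step; in Dataiku itself this would be a scenario with a time-based or dataset-change trigger:

```python
import pathlib

# Illustrative directories; real pipelines would point at actual storage
LANDING = pathlib.Path("landing")
PROCESSED = pathlib.Path("processed")

def clean(text: str) -> str:
    # Placeholder cleaning step: strip whitespace and drop blank lines
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def run_once() -> int:
    """Process any new files and return how many were handled.
    A scheduler (e.g., a daily cron entry) would call this on each trigger."""
    PROCESSED.mkdir(exist_ok=True)
    handled = 0
    for path in LANDING.glob("*.csv"):
        out = PROCESSED / path.name
        if out.exists():  # skip files that were already cleaned
            continue
        out.write_text(clean(path.read_text()))
        handled += 1
    return handled
```

Because run_once() skips files it has already processed, re-running it on a schedule is safe: each new arrival is cleaned exactly once, which is the consistency and repeatability benefit automation is meant to deliver.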
Once an analyst has cleaned and prepped a dataset through automation, not only should that dataset be made available for everyone to reuse, but analysts should also save and share the automation processes themselves, saving time and effort and keeping analyses and results consistent.