In many organizations, highly trained (and quite expensive) expert data scientists don't do that much advanced data science. Instead, they spend time on tedious tasks and projects that don't move the needle, such as setting up environments for their experiments or cleaning and preparing data. Once models move to production, expert data scientists are often turned into babysitters, ensuring that models behave as they are supposed to in production. And that is only half the issue.
Many of these projects that the best minds are working on are not the innovative, moonshot projects that would change the course of company history. Instead, experts are working on the mundane, which leads to issues with their job satisfaction and the value they generate. So, how do we get our experts to do expert work?
Step 1: Teamwork
First, we need to get the mundane work out of the way. We have two options: either someone better suited takes on the tasks, or we automate them. You may already know that Dataiku can streamline data preparation work for experts with collaborative projects. In Dataiku, data analysts, who are often closer to the business issues and data, can prepare the data, and data scientists can build models.
Dataiku also helps streamline the path to production with machine learning operations (MLOps) to ensure projects are safely deployed and managed in production. Again, teamwork comes into play. With a Dataiku project ready for production, data scientists can easily bundle up their project and hand it off to an ML engineer for testing and deployment and an ML operator for day-to-day monitoring and management. A straightforward production process with clearly defined roles allows expert data scientists to get out of the model babysitting business.
Step 2: Environment Management
Even with data preparation and operations under control, other issues can still hold your experts back. Many expert teams manage their own environments and infrastructure so they can work in their preferred notebooks or IDEs. These environments are often disconnected from other systems, creating collaboration and production deployment challenges.
In Dataiku 11, we have taken a big step to make your expert data scientists more comfortable working with Dataiku so they can take on your moonshot projects. New capabilities include code studios, experiment tracking, managed labeling, a new feature store, and more.
New code studios in Dataiku streamline and automate environment setup: pre-built templates create isolated development environments for writing and debugging code and models in JupyterLab, VS Code, RStudio, and more.
With code studios, instead of setting up and managing resources, security, and storage, experts now have everything they need to get experimentation environments up and running quickly in Dataiku. They will be more productive using the tools they know without the overhead of IDE setup and management.
In addition, with environments running on Dataiku, experts are more integrated with the rest of Dataiku users and projects. They can directly edit and debug recipes used in pipelines, notebooks, libraries, or web apps from their preferred environment. This closer integration allows experts to collaborate on larger projects and contribute their expertise where it is most needed.
Step 3: Experiment Tracking
But there is even more for your experts in Dataiku 11. In Dataiku 10, we introduced integration with the popular open-source MLflow framework, allowing expert data scientists to work in their environments with MLflow and then import their projects into Dataiku. That was just the first step. In Dataiku 11, we have extended experiment tracking in Dataiku to work seamlessly with these imported MLflow projects. With visual experiment tracking, expert data scientists don't have to build and manage experiment tracking independently and can save time finding the best-performing models.
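To make the idea concrete, here is a minimal sketch of what experiment tracking boils down to: recording each run's parameters and metrics, then asking which run did best. It uses only the Python standard library. The `ExperimentLog` class, its method names, and the numbers are hypothetical and exist purely for illustration; this is not the MLflow API or the Dataiku API.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: hyperparameters in, metrics out."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory run log (a stand-in, not a real tracking API)."""

    def __init__(self):
        self.runs = []

    def log_run(self, name, params, metrics):
        # Copy the dicts so later changes by the caller don't alter the log.
        run = Run(name=name, params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        # Only compare runs that actually reported the requested metric.
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run logged the metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Made-up runs and scores, for illustration only.
log = ExperimentLog()
log.log_run("baseline", {"max_depth": 3}, {"auc": 0.81})
log.log_run("deeper", {"max_depth": 8}, {"auc": 0.86})
log.log_run("regularized", {"max_depth": 8, "min_samples_leaf": 20}, {"auc": 0.84})

best = log.best_run("auc")
print(best.name, best.params, best.metrics["auc"])
```

A tracking tool adds persistence, artifacts, and a visual comparison on top of this, which is exactly the bookkeeping experts should not have to build and maintain themselves.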
Step 4: Managed Labeling
Another problem experts have is getting the high-quality training data they need for predictive modeling, especially for computer vision use cases. Advanced models based on transfer learning require a set of images unique to a particular company or situation to tune the model to individual needs. Data scientists may end up labeling these images, but that is a waste of their time, and they may not have the expertise required. Ideally, the data scientists should be able to push the images to subject matter experts in the business for annotation.
To solve this challenge, Dataiku 11 includes a new managed labeling system. With managed labeling, a data scientist or project manager can create a labeling task with the appropriate labels and a set of images. They can then assign the task to a group of annotators, and a visual interface makes it easy for business users to complete the labeling. Managers can review the assigned labels, resolve any conflicts, and monitor team progress against goals. The result is higher-quality image annotations, which means better data for models and more accurate results.
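As a rough illustration of the review step, the sketch below shows one common way to consolidate labels from several annotators: accept a label when a clear majority agrees, and flag everything else as a conflict for a manager to resolve. The majority-vote rule, the `consolidate_labels` function, the file names, and the labels are all assumptions made for this example; the source does not describe how Dataiku's managed labeling resolves conflicts internally.

```python
from collections import Counter


def consolidate_labels(annotations, min_agreement=0.5):
    """Split images into agreed labels and conflicts needing review.

    annotations maps an image id to the list of labels given by annotators.
    A label is accepted when its share of votes is strictly above
    min_agreement; otherwise the image is flagged as a conflict.
    """
    agreed, conflicts = {}, {}
    for image_id, labels in annotations.items():
        if not labels:
            # Nobody has labeled this image yet, so it still needs attention.
            conflicts[image_id] = []
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) > min_agreement:
            agreed[image_id] = label
        else:
            conflicts[image_id] = sorted(set(labels))
    return agreed, conflicts


# Hypothetical annotations from a small team, for illustration only.
annotations = {
    "img_001.jpg": ["scratch", "scratch", "dent"],
    "img_002.jpg": ["dent", "scratch"],
    "img_003.jpg": ["ok", "ok", "ok"],
}
agreed, conflicts = consolidate_labels(annotations)
print(agreed)     # images with a clear majority label
print(conflicts)  # images a reviewer should look at
```

Raising `min_agreement` trades annotation throughput for stricter quality, since more images end up in the review queue.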
Step 5: Reusability and Sharing
There is even more for experts in Dataiku 11, including a new feature store that uses new seamless sharing to make datasets available for experimentation and production projects. With the feature store in Dataiku, teams can reuse more data assets, save time and money, and have higher confidence in the data they use for projects.
Dataiku 11 has excellent capabilities that can help expert data scientists do more expert data science. Code studios, experiment tracking, managed labeling, and the feature store can help, but it will take even more. We also need to empower more tech-savvy people across the organization to take on projects related to their functions. Keep an eye out for my next blog on this important topic. To learn more about Dataiku 11, check out the Product Updates page on our website or watch the recording of the Dataiku 11 session from Dataiku’s Everyday AI Conference in New York City.