Custom Labeling and Quality Control With Free-Text Annotation

By Chad Kwiwon Covin

“Garbage in, garbage out.” The saying is well known in the world of data, and it still rings true as models grow more complex: the quality of your data dictates the quality of your AI outputs. No matter how advanced your models are, if they’re built on messy or poorly labeled data, the results will disappoint. This is why data labeling is crucial to building a robust, production-ready AI pipeline.

To ensure businesses can maximize the potential of their AI, Dataiku has been at the forefront of data labeling innovation. Since we introduced managed labeling, users have had a suite of tools to efficiently and accurately label large datasets, regardless of type (tabular, textual, image, etc.). With our newest release, we are adding the ability to label data with free-text annotation. This will provide a new level of flexibility and precision, ensuring that the quality of data fed into your models is top-notch. Before diving into the new free-text annotation, let’s first revisit what makes managed labeling so effective.

A Quick Look at Managed Labeling

Since its introduction in Dataiku 11, managed labeling has continuously expanded to handle a variety of labeling needs. From image classification to object detection and text labeling, it has enabled data teams to label their data efficiently while keeping humans involved. Managed labeling is designed to make complex workflows easier, providing:

  • Multiple labeling options for images, text, and records.
  • Collaboration features that allow multiple annotators and/or reviewers to work on a dataset simultaneously.
  • A review system that ensures labels are accurate and aligned across the team.
  • Status and performance dashboards for tracking progress and managing annotation quality.

This approach not only increases the quality of labeled data but also ensures that human expertise is involved in the modeling process. As models become more complex, having the right labeling tools to handle different data types is critical.

The Game-Changer: Free-Text Labeling

Free-text labeling in Dataiku unlocks the ability to label tabular data with unconstrained, free-form text. Instead of using predefined labels, annotators can type anything they want — adding a layer of flexibility to the annotation process.

This may sound small on the surface, but it’s a huge deal for teams working with text data or data that doesn’t fit into exact categories.

For example, imagine you’re working with a dataset of customer reviews or survey responses. Traditional labels like “positive” or “negative” might not fully capture the sentiment or detail needed for fine-tuning a language model. With free-text labeling, annotators can write custom descriptions and contextual notes, enabling more detailed and accurate labels.

Real-World Use Case: Fine-Tuning an LLM With Free-Text Labels

Let’s explore a more specific example to understand how free-text labeling can enhance model performance in medical settings, particularly with Subjective, Objective, Assessment, Plan (SOAP) notes, a common format used by healthcare providers to document patient interactions.

Consider a hospital that uses a large language model (LLM) to automatically generate SOAP notes from conversations between doctors and patients. While the LLM is capable of extracting key information, it might sometimes misinterpret medical jargon, miss important nuances, or fail to properly categorize subjective or objective data points. By using free-text labeling in Dataiku, human annotators — such as medical professionals — can go through the generated SOAP notes and add custom annotations, correcting any misinterpretations or adding context that the model missed.

Labeling tasks in Dataiku can be done on several different data types.

In this instance, a healthcare practitioner uses free-text labeling to annotate SOAP notes from the LLM. 

Free-text annotations can be added on the right-hand side of the labeling interface. Users can skip and save notes to prepare for review.

In this case, the reviewer adds an assessment note with a potential secondary diagnosis so that the model captures crucial medical details. This aligns the dataset more closely with clinical expectations, improving the model’s accuracy in future iterations.

In this next phase of the workflow, we see how reviewers play a crucial role in validating annotations. This streamlined review process ensures that data quality is maintained throughout the project. Once annotations are validated, they become part of the final dataset, helping to refine the model’s output.

Reviewers can validate all annotations in one place. Instance admins can assign review controls to users from the admin panel.

For example, the reviewer here has the option to validate the free-text annotation related to the secondary diagnosis, ensuring that the SOAP notes meet clinical expectations.

Once all annotations have been reviewed and validated, a final dataset is generated that includes the annotations and labels, ready to be used for model fine-tuning or further analysis. Each annotation is captured with details such as the reviewer, the labeling task ID, and the specific label applied. This ensures traceability and provides context-rich data that can significantly improve model outputs.

After annotations are validated, a dataset containing the labels is created, which can be used to give the model more context.

For example, the dataset includes corrections and additional diagnoses. This finalized dataset provides more comprehensive, context-aware data that enhances the model’s ability to generate accurate and reliable outputs in future iterations.
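As a rough illustration of how such a dataset might be prepared for fine-tuning, the sketch below pairs each model-generated note with its validated free-text correction and writes the pairs as JSON Lines. The rows are made-up sample data, and the column names (`generated_note`, `annotation`, `reviewer`, `task_id`, `status`) and the `VALIDATED` status value are assumptions for illustration, not Dataiku's actual output schema.

```python
import json

# Made-up sample rows standing in for an exported, reviewed labeling dataset.
# Column names and status values are assumptions, not Dataiku's real schema.
ROWS = [
    {
        "generated_note": "A: Tension headache.",
        "annotation": "A: Tension headache; consider migraine as a secondary diagnosis.",
        "reviewer": "reviewer_1",
        "task_id": "task_a",
        "status": "VALIDATED",
    },
    {
        "generated_note": "P: Rest and fluids.",
        "annotation": "",  # nothing to correct, so no training pair
        "reviewer": "reviewer_1",
        "task_id": "task_a",
        "status": "VALIDATED",
    },
    {
        "generated_note": "S: Patient reports mild cough.",
        "annotation": "S: Patient reports a dry cough for two weeks.",
        "reviewer": "reviewer_2",
        "task_id": "task_a",
        "status": "REJECTED",  # not validated, so excluded
    },
]


def to_finetuning_records(rows):
    """Keep validated rows with a non-empty annotation as prompt/completion pairs."""
    records = []
    for row in rows:
        annotation = row.get("annotation", "").strip()
        if row.get("status") != "VALIDATED" or not annotation:
            continue
        records.append(
            {
                "prompt": row["generated_note"],
                "completion": annotation,
                # Carry reviewer and task ID along for traceability.
                "metadata": {"reviewer": row["reviewer"], "task_id": row["task_id"]},
            }
        )
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    print(to_jsonl(to_finetuning_records(ROWS)))
```

Keeping the reviewer and task ID in a separate `metadata` field is one possible design choice: it preserves traceability without mixing that information into the text the model learns from.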

Keeping Humans in the Loop for Generative AI

Why does this matter for Generative AI (GenAI)? GenAI is extremely powerful but, like any model, it is prone to errors such as hallucinations. By keeping humans in the loop, teams can reduce the risk of model errors and ensure that AI-generated outputs meet their specific needs.

For example, when using LLMs to generate content, human reviewers can use free-text labeling to annotate where the model’s output went wrong or to suggest corrections. These labeled datasets can then be used to feed more context to the model, ensuring it produces more accurate and relevant outputs.
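One simple way to feed that context back, sketched below, is to place a few reviewer corrections in the prompt as worked examples. The function name `build_prompt`, the dictionary keys `output` and `note`, and the prompt wording are all hypothetical choices for illustration; the sketch does not call any model or Dataiku API.

```python
def build_prompt(task, corrections, new_input, max_examples=3):
    """Assemble a few-shot prompt that shows the model past reviewer corrections.

    `corrections` is a list of dicts with hypothetical keys:
    "output" (what the model wrote) and "note" (the reviewer's free-text annotation).
    """
    parts = [task]
    # Include only the first few corrections to keep the prompt short.
    for item in corrections[:max_examples]:
        parts.append(
            "Earlier output: " + item["output"] + "\n"
            "Reviewer note: " + item["note"]
        )
    parts.append("New input: " + new_input)
    return "\n\n".join(parts)


if __name__ == "__main__":
    corrections = [
        {
            "output": "A: Tension headache.",
            "note": "Also mention migraine as a possible secondary diagnosis.",
        }
    ]
    print(
        build_prompt(
            "Write a SOAP note and apply the lessons from the reviewer notes.",
            corrections,
            "Transcript of the next doctor-patient conversation goes here.",
        )
    )
```

Capping the number of examples with `max_examples` keeps the prompt within a predictable size; which corrections to include is a judgment call this sketch leaves open.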

This is especially important for industries like healthcare or finance, where AI systems need to meet high standards of accuracy and reliability. In settings like healthcare, an LLM might suggest a treatment plan based on historical data, but free-text labeling allows doctors to review and fine-tune those suggestions, ensuring the AI’s outputs align with current clinical practices.

Flexibility Meets Precision and Governance

Free-text labeling provides the freedom and flexibility to label data in ways that fit your unique needs — whether you’re fine-tuning an LLM, building a GenAI workflow, or tackling a complex classification task. On top of flexibility, keeping humans in the loop is the key to ensuring high-quality AI outputs.

So, the next time you think about your data pipeline, remember: garbage in, garbage out. With free-text annotation and managed labeling, Dataiku makes sure that you’re putting in the right data, so you can get the right results.
