Moving Beyond Guesswork: How to Evaluate LLM Quality

By Christina Hsiao

Ninety percent of leaders are already investing in Generative AI in some way, but there's a common challenge: How can you objectively measure whether an LLM's output is actually "good enough"? For instance, imagine you’re using an LLM to power a conversational Q&A chatbot. After a few successful exchanges, it’s tempting to assume the model is delivering high-quality answers, but how can you quantify and benchmark that? And, more importantly, how can teams systematically keep an eye on ongoing response quality when they can’t possibly manually monitor every interaction? 

Qualitative evaluation, while intuitive, may be subject to bias and doesn't scale well when dealing with high-stakes use cases like customer service automation, document generation, or research assistance. You need a way to pinpoint what's working and where there's room for improvement across thousands or even millions of model interactions. That’s where automated LLM evaluation comes in. This critical capability, available with the Dataiku LLM Mesh, allows you to measure LLM response quality with precise metrics and compare different models or approaches side-by-side. Read on to learn more about how these tools can not only bring clarity and direction to your design experiments, but also help teams monitor the ongoing quality of AI apps in production.

→ Watch a Quick Rundown of LLM Evaluation With Dataiku

Understanding LLM Evaluation Metrics

With generated text that seems plausible but can contain errors or irrelevant information, it’s not always obvious how to quantitatively gauge response quality for LLMs. Businesses (and individuals, too! I’m definitely guilty as charged. 😅) often fall into the trap of relying on subjective judgments, essentially using a “finger in the air” approach to evaluate LLM performance.

The Evaluate LLM recipe in Dataiku provides a powerful, visual way to both measure and monitor LLM performance at scale. Whether you're building a conversational Q&A app, summarizing documents, generating translations, or tackling another task, this feature recommends relevant GenAI-specific metrics to match your use case. Some metrics, such as faithfulness, answer correctness, answer relevancy, and context precision, take advantage of the popular “LLM-as-a-judge” technique, in which a specially crafted prompt uses a secondary LLM as a proxy for human evaluation. Other metrics, such as BERT score, ROUGE, and BLEU, rely on traditional, statistics-based NLP techniques.
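To make the LLM-as-a-judge idea concrete, here’s a deliberately minimal sketch of what a correctness judge can look like under the hood. It’s illustrative only, not Dataiku’s internal implementation, and call_llm is a hypothetical stand-in for whichever LLM client you use:

```python
# Minimal LLM-as-a-judge sketch for answer correctness (illustrative only).
import re

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

On a scale of 1 (completely wrong) to 5 (fully correct and complete),
rate the factual correctness of the candidate answer.
Reply with a single integer only."""

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: route this to your LLM provider of choice."""
    raise NotImplementedError

def judge_correctness(question: str, reference: str, candidate: str):
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    match = re.search(r"[1-5]", raw)
    # Normalize the 1-5 rating to a 0-1 score; None if the reply is unparseable.
    return (int(match.group()) - 1) / 4 if match else None
```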

As with any ML model evaluation, to properly assess metrics like accuracy, precision, and recall, you’ll want to start with an evaluation dataset that contains examples of model inputs, outputs and, if available, the corresponding reference answers that the model should consider as ground truth. If your use case leverages in-context learning techniques like retrieval-augmented generation (RAG) or few-shot learning, you’ll also want to include a column in the evaluation dataset specifying the context that was given to the model as part of the prompt.

[Image: evaluation dataset]
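As a concrete (and entirely made-up) illustration, an evaluation dataset for a RAG-based Q&A app might look something like the snippet below; the column names are arbitrary, since you map your own columns in the recipe settings:

```python
import pandas as pd

# Illustrative evaluation dataset for a RAG-based Q&A app.
# Rows and column names are invented purely for demonstration.
eval_df = pd.DataFrame([
    {
        "question": "What is the refund window for online orders?",
        "context": "Policy, section 4: Online orders may be returned within 30 days.",
        "llm_answer": "You can return online purchases within 30 days of delivery.",
        "ground_truth": "Online orders can be returned within 30 days.",
    },
    {
        "question": "Do you ship internationally?",
        "context": "FAQ: We currently ship to the US and Canada only.",
        "llm_answer": "Yes, we ship worldwide.",  # unfaithful to the retrieved context
        "ground_truth": "Shipping is limited to the US and Canada.",
    },
])
print(eval_df)
```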

Along with assigning these columns in the Evaluate LLM recipe, you’ll also specify the task the LLM is performing: question answering, summarization, translation, or other. Based on your selections, Dataiku dynamically recommends a suitable set of metrics to compute for your LLM evaluation. 

[Image: input dataset and metrics]

Fact vs. Feeling: A Data-Driven Approach to Model Evaluation 

To better understand some of the metrics suitable for Q&A applications, let’s use the analogy of a debate. In a debate, when a participant is asked a question, the expectation is that they will deliver a clear, relevant, and factually supported response. Similarly, when an LLM is asked a question, its output is evaluated based on its accuracy, relevance, and the clarity of its reasoning.

The answer correctness metric evaluates whether the returned response is factually correct, while answer relevancy measures whether the answer is relevant and on-topic to the question asked. Note that these two metrics are not always aligned! Just like debate participants who may deflect or avoid a certain topic by changing the subject, it’s possible for an LLM to provide a well-supported and factually correct response that does not actually answer the question asked. 

If your question-answering app uses RAG and retrieves information from a knowledge base, you’ll also want to measure how accurate and comprehensive the retrieval step is. Going back to the debate analogy, think of this as: Does the speaker support their argument with credible and relevant source information? And do they omit critical information or miss crucial details that could affect the overall completeness of the response? 

Context precision checks how well the semantic search portion of your pipeline is extracting the right information from your knowledge base, while context recall measures how well the system is retrieving all of the relevant and necessary information to answer the question. In practice, a high context precision score means your app is pulling highly relevant information that directly addresses the question, while low context precision indicates that the information retrieved may be partially or entirely irrelevant, which leads to less accurate or helpful responses.
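If it helps to see the intuition in code, here’s a simplified sketch that treats each retrieved chunk as simply relevant or not. Real implementations, including judge-based ones, are more nuanced, but the underlying ratios capture the same idea:

```python
# Simplified intuition behind context precision and recall.

def context_precision(retrieved_chunks, relevant_chunks):
    """Share of retrieved chunks that are actually relevant to the question."""
    if not retrieved_chunks:
        return 0.0
    hits = sum(1 for chunk in retrieved_chunks if chunk in relevant_chunks)
    return hits / len(retrieved_chunks)

def context_recall(retrieved_chunks, relevant_chunks):
    """Share of the relevant chunks that the retriever actually surfaced."""
    if not relevant_chunks:
        return 1.0  # nothing was needed, so nothing is missing
    hits = sum(1 for chunk in relevant_chunks if chunk in retrieved_chunks)
    return hits / len(relevant_chunks)

retrieved = ["chunk_a", "chunk_b", "chunk_c"]   # what the retriever returned
relevant = ["chunk_a", "chunk_d"]               # what was actually needed
print(context_precision(retrieved, relevant))   # 0.33 -> lots of noise retrieved
print(context_recall(retrieved, relevant))      # 0.50 -> a key piece is missing
```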

Faithfulness, as an LLM evaluation metric, measures how well the generated response aligns with the source material or facts in the knowledge base. For example, if the LLM suddenly introduces a fact that wasn’t found in the knowledge base, it would score low on faithfulness, even if the information is correct; this metric can help identify if your app is producing outputs that incorporate unverified information or even hallucinations. In the debate analogy, this would be like evaluating whether the participant’s response is logical and factually consistent with the evidence they've presented, rather than straying into unsupported claims or incorrect details.
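One common way to approximate faithfulness (again, a sketch rather than Dataiku’s exact method) is to split the generated answer into individual claims and ask a judge LLM whether each claim is supported by the retrieved context; call_llm is the same kind of hypothetical stand-in as before:

```python
# Claim-level faithfulness sketch: every claim in the answer must be supported
# by the retrieved context, regardless of whether it is true in the real world.

FAITHFULNESS_PROMPT = """Context:
{context}

Claim: {claim}

Answer strictly YES if the claim is supported by the context above,
otherwise answer NO."""

def faithfulness(answer: str, context: str, call_llm) -> float:
    # Naive claim splitting on sentence boundaries; good enough for a sketch.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims
        if call_llm(FAITHFULNESS_PROMPT.format(context=context, claim=claim))
        .strip().upper().startswith("YES")
    )
    return supported / len(claims)  # 1.0 = fully grounded in the context
```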

Metrics like these not only help you determine if the LLM is doing its job, but they also provide a consistent, objective framework for comparing different models and setups. The beauty here is that you can test different configurations and objectively compare the results based on real data, rather than gut feelings.

Extend LLM Evaluation With Custom Python Metrics 

To extend the standard set of metrics with tailored insights, experts can also define custom metrics with Python code. Dataiku includes templates and code samples to speed up the process of building custom metrics. As a bonus, Dataiku even preserves and displays the exact code used to compute any custom metrics in each individual LLM evaluation, in case the definition changes at some point in the app’s lifecycle!

[Image: custom metrics]
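Refer to the in-product templates and code samples for the exact function signature Dataiku expects; conceptually, though, a custom metric boils down to a function that maps a row of inputs and outputs to a number, as in these toy (and admittedly simplistic) examples:

```python
# Toy custom metrics, purely for illustration. The exact signature expected by
# the Evaluate LLM recipe is shown in the in-product templates.

POLITE_MARKERS = ("please", "thank", "happy to help", "you're welcome")

def politeness_score(llm_answer: str) -> float:
    """Fraction of politeness markers that appear in the answer."""
    text = llm_answer.lower()
    return sum(marker in text for marker in POLITE_MARKERS) / len(POLITE_MARKERS)

def length_ratio(llm_answer: str, ground_truth: str) -> float:
    """How close the answer length is to the reference answer length (0-1)."""
    if not llm_answer or not ground_truth:
        return 0.0
    return min(len(llm_answer), len(ground_truth)) / max(len(llm_answer), len(ground_truth))
```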

Why Automated Evaluation Beats Guesswork 

Each time the recipe is run, all metrics — both standard and custom — are captured, stored, and visualized in the Evaluation Store, making it simpler for teams to automate the evaluation process, track performance over time, and benchmark different approaches to optimize their GenAI applications. Teams can also incorporate automated metric checks and actions into Dataiku Scenarios to alert AI engineers or stakeholders when GenAI app quality is degrading. This is a good way to seamlessly integrate LLMOps into your conventional MLOps practices.

[Image: evaluation]
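The logic behind such a check can be as simple as comparing the latest metric values against minimum acceptable thresholds, as in this generic sketch (the metric names and thresholds here are illustrative, not Dataiku defaults):

```python
# Generic quality-gate logic that could sit behind a scenario step or alert.
# Thresholds and metric names are illustrative only.

THRESHOLDS = {"faithfulness": 0.80, "answer_correctness": 0.75, "context_recall": 0.70}

def failing_metrics(latest_metrics: dict) -> dict:
    """Return the metrics whose latest value dropped below its threshold."""
    return {
        name: value
        for name, value in latest_metrics.items()
        if name in THRESHOLDS and value < THRESHOLDS[name]
    }

latest = {"faithfulness": 0.72, "answer_correctness": 0.81, "context_recall": 0.69}
failures = failing_metrics(latest)
if failures:
    # This is where a scenario check would fail or an alert would go out.
    print(f"Quality degradation detected: {failures}")
```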

For human-in-the-loop evaluation and qualitative analysis, you can also inspect and analyze results row-by-row, to manually review how the LLM is performing for individual observations.

[Image: evaluation dataset]

Use Model Comparisons to Optimize GenAI Pipelines 

AI application developers face a wide array of design choices, since solutions can range from straightforward outputs based on a single LLM prompt to highly complex agentic workflows that chain together multiple prompts and combine them with more standard data or NLP techniques. Even for the earlier example of a simple Q&A chatbot, builders must select the underlying LLM service or model, decide between long-context-window and RAG approaches for handling input, and, if using RAG, determine how to chunk source documents for the best retrieval outcomes. Each decision can significantly impact performance, scalability, and cost, requiring careful consideration to align with the specific use case and business needs.

LLM evaluation in Dataiku can adapt to each of these distinct pipelines, producing a relevant set of metrics for each one, both during the original design and experimentation phase and in post-production, as AI engineers or data scientists make incremental changes to the system over time to maintain or improve it. To easily compare and contrast outcomes from these different approaches, model comparisons are another invaluable tool for builders.

Model comparisons in Dataiku allow you to compare LLM evaluation metrics side-by-side and benchmark performance across different runs, or across different versions of your pipeline or app configuration.

[Image: output analysis]

[Image: output comparison]

With countless design choices shaping the final AI system, these summary and row-by-row comparison views allow you to spot where different versions produce varying results, making it easier to pinpoint potential issues in your design and select the optimal configuration to deploy. For instance, one model might generate more correct answers but struggle with relevancy, while another might excel in both areas but fall short on context recall. With automated comparisons, you can easily identify the strengths and weaknesses of each approach, which allows for data-driven decision-making. This systematic approach is far more reliable than relying on ad-hoc evaluations or single-point tests, which can miss broader trends in model performance.
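Outside the product UI, the essence of that comparison is simply lining up metrics from two evaluation runs and looking at the deltas; the configurations and numbers below are invented purely to show the shape of the analysis:

```python
import pandas as pd

# Invented metrics for two hypothetical pipeline configurations.
runs = pd.DataFrame(
    {
        "config_a": {"answer_correctness": 0.88, "answer_relevancy": 0.70, "context_recall": 0.81},
        "config_b": {"answer_correctness": 0.85, "answer_relevancy": 0.89, "context_recall": 0.68},
    }
)
runs["delta_b_minus_a"] = runs["config_b"] - runs["config_a"]
print(runs)  # one row per metric: config_a is more correct, config_b more relevant
```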

Final Thoughts 

For enterprise-grade AI applications, the era of manually checking a handful of outputs and making subjective calls on LLM quality is behind us. Whether you’re deploying a chatbot, scaling document summarization, or handling custom queries, establishing a robust, data-driven evaluation framework is essential for sustained success. The fine line between "good enough" and exceptional model performance often lies in how rigorously you measure and compare results. With tools like Dataiku’s Evaluate LLM recipe, businesses can automate this process, continuously track performance, and benchmark various approaches, ensuring their GenAI solutions are always optimized for impact.
