Practical LLM Selection: A Recipe for Success

Vivien Tran-Thien

In our previous exploration of selecting the right LLM, part one of this series, we discussed the essential metrics and decision factors for evaluating these models. Understanding performance benchmarks and tools like Chatbot Arena provided a solid foundation for comparing LLMs. Now, it’s time to build on that knowledge with a practical guide for selecting the ideal LLM for your specific use case.

Let’s explore a concrete process for implementing LLMs, from scoping and feasibility to proving value and optimizing performance, adapted from Andrej Karpathy's talk on the "State of GPT":
[Flowchart: the step-by-step LLM selection process (Steps 1-13), from scoping to conclusion]

The first step consists of scoping the use case (Step 1). As with any other AI use case, this involves defining the problem, assessing the potential data sources, outlining the target solution, listing requirements, defining success criteria, estimating resource needs, and planning the project.

In the context of an LLM use case, some aspects need to be considered more carefully given the specificities of LLMs:

  • Are input sources available and machine-readable? Implementing a retrieval-augmented generation (RAG) approach or augmenting an LLM with tools is often an effective way to improve its accuracy. We then need to understand what relevant information sources are available and whether their format is suitable.
  • To what extent are errors acceptable? LLMs are known to occasionally provide inaccurate answers in a confident tone — these are commonly referred to as “AI hallucinations.” It is important to understand how costly or risky errors can be and how the overall system (including the users) consuming the outputs of the LLM can detect and correct these errors.
  • Is there a need for real-time interactivity? The designer of the solution should determine the acceptable latency, knowing that a single LLM call can take anywhere from a few seconds to several dozen seconds and that answering a user’s request may involve several LLM calls.
  • What is the budget for the use case? Even though powerful LLMs are becoming more and more affordable, the running costs of an LLM use case (calling the LLM, calling an embedding model to index information sources in the context of a RAG pipeline, using external APIs as tools, etc.) can be quite significant; a back-of-the-envelope estimate like the one sketched below helps size these costs early.
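To make the budget question concrete, here is a minimal cost sketch. All prices, token counts, and request volumes below are hypothetical placeholders; substitute your provider's actual pricing and your own traffic estimates.

```python
# Rough monthly running-cost estimate for a RAG-style LLM use case.
# Every figure here is a hypothetical placeholder.

PRICE_PER_1M_INPUT_TOKENS = 2.50      # USD, hypothetical LLM input price
PRICE_PER_1M_OUTPUT_TOKENS = 10.00    # USD, hypothetical LLM output price
PRICE_PER_1M_EMBEDDING_TOKENS = 0.10  # USD, hypothetical embedding price

requests_per_month = 50_000
input_tokens_per_request = 3_000      # prompt + retrieved context
output_tokens_per_request = 500
embedding_tokens_per_request = 1_000  # query embedding for the RAG step

def monthly_cost() -> float:
    """Return an estimated monthly running cost in USD."""
    llm_cost = requests_per_month * (
        input_tokens_per_request * PRICE_PER_1M_INPUT_TOKENS
        + output_tokens_per_request * PRICE_PER_1M_OUTPUT_TOKENS
    ) / 1_000_000
    embedding_cost = (
        requests_per_month
        * embedding_tokens_per_request
        * PRICE_PER_1M_EMBEDDING_TOKENS
        / 1_000_000
    )
    return llm_cost + embedding_cost

print(f"Estimated monthly cost: ${monthly_cost():,.2f}")
```

Even a rough estimate like this one is often enough to decide whether the use case can absorb several LLM calls per request or needs a leaner design.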

First, Prove the Feasibility of the Use Case

With the requirements of the use case in mind, it is then possible to select the LLM for an initial proof-of-concept (Step 2) and implement a simple baseline pipeline (Step 3). The main question at this stage is the feasibility of using an LLM to achieve the desired task. In this context, the best practice is to use the most powerful model available whose features are compatible with the use case, regardless of cost or latency. The flagship models of the most prominent providers, such as GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro, are generally safe bets because they are versatile, feature-rich (in particular with features like multimodality or function calling), and directly available.
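As an illustration, the baseline pipeline of Step 3 can be as simple as a single call to a flagship hosted model. This is only a sketch: the OpenAI Python SDK is used as one possible client, and the model name, prompt, and example inputs are placeholders.

```python
# Minimal baseline pipeline: one call to a powerful hosted model.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def answer(question: str, context: str) -> str:
    """Answer a question using only the supplied context."""
    response = client.chat.completions.create(
        model="gpt-4o",  # start with the most powerful model available
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What is the refund policy?", "Refunds are accepted within 30 days of purchase."))
```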

The next steps are to create a test set as representative as possible of the context of the use case (Step 4) and define an evaluation method (Step 5). If this evaluation method involves an LLM used as a judge, it should be thoroughly tested to make sure that the resulting evaluations are sufficiently aligned with human judgment. Results can then be generated for the test set (Step 6) and compared to the expected level of quality (Step 7).
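Before trusting an LLM judge, it is worth measuring how often its verdicts match human labels on a sample of the test set. The sketch below assumes a small set of hand-labeled examples (the labels shown are made up) and uses raw agreement plus Cohen's kappa as simple alignment measures.

```python
# Compare LLM-judge verdicts with human labels before relying on the judge.
# The labels below are made-up placeholders for illustration.
from sklearn.metrics import cohen_kappa_score

human_labels = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]
judge_labels = ["good", "bad", "good", "bad", "bad", "good", "good", "good"]

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, judge_labels)  # chance-corrected agreement

print(f"Raw agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

If the agreement is too low for the use case, the judge's prompt (or the judge model itself) should be revised before it is used to score results in Steps 6 and 12.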

If the quality of the outputs is insufficient, the pipeline should be iteratively improved (Step 8) until it meets the quality bar, starting with the simplest changes first, for example:

  • Switching to a specialist model (if one exists)
  • Prompt engineering (e.g., few-shot examples or chain-of-thought; a few-shot prompting sketch follows this list)
  • Constrained decoding (e.g., function calling)
  • Implementation of a RAG approach or augmentation of the LLM with tools
  • Fine-tuning
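As an example of one of the simplest levers, here is a small few-shot prompting sketch. The task (sentiment classification) and the examples are hypothetical; the point is simply that a handful of worked examples prepended to the prompt can noticeably improve output quality and formatting.

```python
# Few-shot prompting: prepend worked examples so the model imitates the
# expected output format. Task and examples are hypothetical.
FEW_SHOT_EXAMPLES = [
    ("The delivery was late and the box was damaged.", "negative"),
    ("Setup took two minutes and it works perfectly.", "positive"),
    ("It does the job, nothing more, nothing less.", "neutral"),
]

def build_prompt(text: str) -> str:
    """Assemble a few-shot sentiment-classification prompt."""
    lines = ["Classify the sentiment of each review as positive, neutral, or negative.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {text}\nSentiment:")
    return "\n".join(lines)

print(build_prompt("The battery barely lasts an hour."))
```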

Then, Prove the Value of the Use Case

As soon as the quality criteria are met, we can move from demonstrating the feasibility of the use case to proving its value. For this, we need to assess whether the current pipeline is efficient enough, given the cost, latency, and throughput constraints of the use case (Step 9).
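In practice, Step 9 boils down to measuring latency and cost on the same test set used for quality. The sketch below assumes a hypothetical run_pipeline function standing in for your own pipeline, and placeholder prices.

```python
# Measure per-request latency and rough token-based cost over a test set.
# run_pipeline is a hypothetical stand-in; prices are placeholders.
import statistics
import time

PRICE_PER_1M_INPUT_TOKENS = 2.50    # USD, hypothetical
PRICE_PER_1M_OUTPUT_TOKENS = 10.00  # USD, hypothetical

def run_pipeline(example: str) -> dict:
    """Placeholder: call the real pipeline and return its token usage."""
    time.sleep(0.1)  # simulate an LLM call
    return {"input_tokens": 3_000, "output_tokens": 400}

test_set = ["example 1", "example 2", "example 3"]
latencies, costs = [], []
for example in test_set:
    start = time.perf_counter()
    usage = run_pipeline(example)
    latencies.append(time.perf_counter() - start)
    costs.append(
        (usage["input_tokens"] * PRICE_PER_1M_INPUT_TOKENS
         + usage["output_tokens"] * PRICE_PER_1M_OUTPUT_TOKENS) / 1_000_000
    )

print(f"Median latency: {statistics.median(latencies):.2f}s")
print(f"Average cost per request: ${statistics.mean(costs):.4f}")
```

Comparing these numbers against the constraints defined in Step 1 tells us whether to stop (Step 13) or keep optimizing.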

If the pipeline already meets the efficiency criteria of the use case, it's advisable to conclude the process (Step 13) rather than risk over-engineering, which could make the pipeline less maintainable and may not be a good investment given how quickly new, more powerful models become available. Otherwise, we can either switch to a smaller, less expensive model (Step 10), or even to a non-generative model like BERT for basic tasks in part or all of the pipeline, or improve the cost-effectiveness of the pipeline (Step 11) through:

  • Prompt engineering (e.g., few-shot examples or chain-of-thought)
  • Constrained decoding (e.g., function calling)
  • Implementation of a RAG approach or augmentation of the LLM with tools
  • Optimized inference techniques (e.g., quantization, distillation, speculative decoding; a quantized local inference sketch follows this list)
  • Fine-tuning
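To illustrate Steps 10 and 11, the sketch below loads a smaller open-weights model locally with 4-bit quantization using the transformers and bitsandbytes libraries. The model name is only an example, not a recommendation, and the snippet assumes a CUDA GPU with those packages installed.

```python
# Run a smaller open-weights model locally with 4-bit quantization to cut
# memory footprint and serving cost. Model choice is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example smaller model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

inputs = tokenizer("Summarize: the pipeline meets its quality targets.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```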

Note that prompt engineering tricks may not transfer well from one model to another, so the prompt may need to be adjusted even if significant effort has already gone into optimizing it for a previous model.

How Does Dataiku Help? 

As you can see, the process of selecting the right LLM, crafting the best prompt for that model, and incorporating advanced AI techniques is part art, part science. Although some decisions in the step-by-step process above may rely on subjective measures, following a systematic evaluation approach is nevertheless essential. Without a plan for knowing when “good enough” is good enough, you risk endlessly iterating and investing resources into marginal performance improvements, only to have those gains erased by the next wave of cutting-edge models that deliver that level of performance off the shelf.

Dataiku is the right tool to create Generative AI applications in a robust and future-proof way. In particular, Dataiku’s LLM Mesh enables practitioners to easily switch from one LLM to another without modifying the Flow or the code recipes of a Dataiku project. Furthermore, the features in the table below support the execution of the process described in the previous section:

| Step | Relevant Dataiku features |
| --- | --- |
| 2. Select the most powerful model available / 10. Select a smaller model | Easily switch between different LLM connections, for both hosted LLM APIs and locally running Hugging Face models |
| 5. Define and test evaluation methods | LLM Evaluation (coming soon) |
| 6. / 12. Evaluate results for the test set | Experiment Tracking; Model Evaluation Store |
| 8. / 11. Implement improvement | Prompt Studio; function calling (coming soon); knowledge banks and augmented LLMs; fine-tuning (coming soon); efficient local inference |
| 9. Is the pipeline efficient enough? | Track costs at a fine-grained level and calculate cost savings due to caching with LLM Cost Guard |
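To make the model-switching point concrete, here is a minimal sketch of querying an LLM through the LLM Mesh from a Dataiku code recipe or notebook, based on the LLM Mesh Python API. The LLM ID is a placeholder: moving to another model means changing that identifier (or the underlying connection), not the surrounding code.

```python
# Query an LLM through the LLM Mesh from a Dataiku code recipe or notebook.
# The LLM ID below is a placeholder; switching models only changes this ID.
import dataiku

LLM_ID = "openai:my-connection:gpt-4o"  # placeholder LLM Mesh identifier

project = dataiku.api_client().get_default_project()
llm = project.get_llm(LLM_ID)

completion = llm.new_completion()
completion.with_message("Summarize the main risks of this contract.")
response = completion.execute()

if response.success:
    print(response.text)
```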

 
