Since the release of ChatGPT, the landscape of large language models (LLMs) has exploded with innovation. New proprietary models from dynamic start-ups and major cloud providers are emerging alongside an abundance of open-weight models, which now dominate the HuggingFace hub.
Selecting the right LLM is daunting. But fear not! In this blog post for enterprise practitioners, we're here to help you make informed decisions for your Generative AI use case by:
- Presenting some key LLM performance metrics
- Explaining other important decision factors
- Describing how Dataiku’s LLM Mesh provides model optionality and ensures you can easily incorporate Generative AI technologies into your data workflows
Read on as we demystify the complexities of LLM performance evaluation and equip you with some initial knowledge of how to choose the best one for your use case. Stay tuned for a companion blog detailing a repeatable, practical LLM selection process.
Overall Performance Assessment of LLMs
Let’s focus on the most helpful public sources of information for quantitatively assessing the quality of the text generated by an LLM. There are various ways to perform such an evaluation because LLMs power a myriad of use cases, and there are several dimensions to take into consideration beyond the accuracy of the answers, such as their clarity, their tone, or the degree to which the user’s instructions are followed. Two major approaches stand out: automated testing with public thematic benchmarks and crowdsourced blind evaluation.
A thematic benchmark consists of a set of questions assessing a specific capability and a way to automatically check the answers to these questions, typically with ground truth answers (or unit tests for code benchmarks). The table below lists some of the most common thematic benchmarks and illustrates the range of capabilities covered.
Example of Common Benchmarks Used to Evaluate LLMs
Thematic benchmarks offer a convenient and objective way to compare LLMs across multiple dimensions, but they suffer from some disadvantages. The most concerning one is that LLMs can overfit to these benchmarks, particularly if some test examples have inadvertently leaked into the LLMs’ training corpora.
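To make the automated checking concrete, here is a minimal, hypothetical sketch of how a benchmark harness can score an LLM’s answers against ground truth with a simple exact-match metric. The `generate` function and the benchmark items below are placeholders, not part of any real benchmark.

```python
from typing import Dict, List

def exact_match_score(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth answer
    (after light normalization), as many QA-style benchmarks do."""
    normalize = lambda s: s.strip().lower()
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical benchmark items: each question ships with a ground-truth answer.
benchmark: List[Dict[str, str]] = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]

def generate(prompt: str) -> str:
    # Placeholder: plug in the call to the LLM under evaluation here.
    raise NotImplementedError

# predictions = [generate(item["question"]) for item in benchmark]
# print(exact_match_score(predictions, [item["answer"] for item in benchmark]))
```

Code benchmarks replace the exact-match check with unit tests run against the generated code.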
Chatbot Arena is a powerful complement to these thematic benchmarks. The underlying platform lets any internet user provide a prompt, receive two completions from two undisclosed and randomly selected LLMs, and express a preference for one of the two. With over 1,300,000 such human pairwise comparisons collected so far, Chatbot Arena can rank LLMs on an Elo scale similar to the one used for competitive chess players.
User Interface of Chatbot Arena to Compare Two LLM Completions Side-by-Side (We Prefer Side B!)
Although more time-consuming and harder to reproduce, LLM evaluations based on Chatbot Arena circumvent the risks of overfitting and data contamination, and they arguably better reflect real-life LLM usage and human preferences. Chatbot Arena also includes leaderboards for more specific types of queries (e.g., “Coding,” “Longer query,” “French”).
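To give a sense of how such pairwise preferences translate into a ranking, here is a simplified, illustrative sketch of online Elo updates. It is not Chatbot Arena’s exact methodology, and the battle data below is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def elo_ratings(battles: List[Tuple[str, str, float]],
                k: float = 32.0, base: float = 1000.0) -> Dict[str, float]:
    """Online Elo updates from pairwise comparisons.

    Each battle is (model_a, model_b, score): score is 1.0 if model_a was
    preferred, 0.0 if model_b was preferred, and 0.5 for a tie.
    """
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for model_a, model_b, score in battles:
        # Expected score of model_a given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
        ratings[model_a] += k * (score - expected_a)
        ratings[model_b] += k * ((1.0 - score) - (1.0 - expected_a))
    return dict(ratings)

# Toy, made-up battles between three hypothetical models.
print(elo_ratings([("model-x", "model-y", 1.0),
                   ("model-y", "model-z", 0.5),
                   ("model-x", "model-z", 1.0)]))
```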
Other Key Characteristics of LLMs
The quality of the text generated by an LLM is essential, but it is not enough to assess its adequacy for a given use case. This section summarizes other important criteria and when they should be taken into account. For a proprietary model, information on these criteria can generally be found in the API documentation or in the technical reports on the model. For an open-weight model, the best source of information is often the model’s page on the HuggingFace hub.
Features of the LLM
First, the architecture of the LLM and the way it was trained and fine-tuned can make it more or less suitable for a given use case. We present below the key features to look at.
Features of the Serving Infrastructure
Unlike the features above, which are intrinsic to the LLM, the following features can be offered for any LLM, provided that the serving infrastructure allows it. They also impact the suitability and ease of use of the LLM.
Serving Efficiency
If the use case entails real-time interactions with the user or involves a large volume of data, the efficiency of the serving platform — especially the generation costs, latency and throughput — can become an important decision factor.
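As a rough illustration, here is a hypothetical sketch of how latency and throughput could be estimated for a given endpoint. The `generate` and `tokens_of` functions stand in for whichever client and tokenizer you actually use, and per-token pricing can then be applied to the token counts to estimate generation costs.

```python
import time
from statistics import mean, quantiles
from typing import Callable, Dict, List

def benchmark_endpoint(generate: Callable[[str], str],
                       tokens_of: Callable[[str], int],
                       prompts: List[str]) -> Dict[str, float]:
    """Rough latency/throughput estimate for an LLM endpoint.

    `generate(prompt)` is any function calling the model and returning the
    completion text; `tokens_of(text)` is any tokenizer-based token counter.
    Use at least a handful of prompts for the percentile to be meaningful.
    """
    latencies, token_counts = [], []
    for prompt in prompts:
        start = time.perf_counter()
        completion = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(tokens_of(completion))
    return {
        "mean_latency_s": mean(latencies),
        "p95_latency_s": quantiles(latencies, n=20)[18],  # 95th percentile
        "throughput_tokens_per_s": sum(token_counts) / sum(latencies),
    }
```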
Availability
Finally, even if a given LLM is theoretically the right option, we still need to make sure that it is actually available and usable in the context of this specific use case.
How Does Dataiku Help?
The Generative AI field is moving so fast that it is easy to feel both dizzy and enthusiastic about all the opportunities to seize. In this context, it is important to track the latest developments, keep one’s options open, and avoid overinvesting in incremental improvements that could be made obsolete by a newer, more powerful model.
As the Universal AI Platform, Dataiku lets you build and deliver Generative AI applications in a robust and future-proof way, while making informed trade-offs between LLM performance, cost, and risk. In particular, Dataiku’s LLM Mesh enables practitioners to easily switch from one LLM to another without modifying the Flow or the code recipes of a Dataiku project. This way, teams can integrate new or more suitable technologies quickly and securely, connect with the data and (Gen)AI services stack of their choice, and stay fully compliant with the governance policies they define in Dataiku.
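As a minimal sketch based on Dataiku’s public Python API for the LLM Mesh (exact identifiers depend on the connections configured on your instance, and the LLM ID below is purely illustrative), switching models amounts to changing a single ID rather than rewriting the recipe:

```python
import dataiku

# The LLM Mesh ID is the only thing to change to switch models; the actual
# value depends on the LLM connections configured on your instance.
LLM_ID = "your-llm-mesh-id"

client = dataiku.api_client()
project = client.get_default_project()
llm = project.get_llm(LLM_ID)

completion = llm.new_completion()
completion.with_message("Summarize the attached quarterly report in three bullet points.")
response = completion.execute()

if response.success:
    print(response.text)
```

Because the model is referenced only through its LLM Mesh ID, trying out a newly released model becomes a configuration change rather than a code change.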