Though large language models (LLMs) and Generative AI are all the rage, let’s face it: In most industries, tabular data is still king. Organizations have — and produce — loads of tables, and the common machine learning (ML) method that data scientists apply to this data is (still) gradient boosting.
This blog post covers a recent episode of ML Research, In Practice from the Dataiku AI Lab (a team within Dataiku that aims to build a bridge between ML research and practical business applications). In this session, Jun Kim from Inria discusses his research (conducted with Léo Grinsztajn and Gaël Varoquaux from the SODA team) on applying transformer recipes to tabular data and tabular representations.
With the success of convolutional neural networks (CNNs) and transformers, people are increasingly applying deep neural networks to tabular data. However, in benchmarks comparing their performance, gradient-boosted trees continue to outperform deep learning techniques.
Existing Work on Pre-Trained Models
Pre-trained models have proven successful in several fields of ML, for example in language modeling (e.g., BERT) and images. So to make full use of all the tables out in the real world, Kim says, “We might be better by having a pre-trained model to improve some performance for the downstream tasks.”
It’s worth noting that there has already been some work in this area. For example, Sherlock is a deep learning approach to semantic data type detection. Similarly, TURL is a framework for understanding tables through representation learning. However, these works evaluate performance on finding characteristics of a table, such as column annotations or entity link predictions, and not on new tables where we want to make predictions with ML methods.
Challenges to Making a Pre-Trained Model for Tables
So there has been work that describes tables with a pre-trained model, and there has been work on knowledge embeddings, which can be used for downstream tasks. But some challenges arise when thinking about making a pre-trained model for tables.
First is a challenge Kim calls “out of vocabularies,” meaning that an entity in a table sometimes does not appear under the same name in the knowledge graph, making it difficult to extract embeddings from the knowledge graph and use them directly downstream. For example, a downstream table might contain NY Times, while the knowledge graph lists The New York Times.
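To make the mismatch concrete, here is a minimal sketch of why an exact lookup fails for NY Times. It is not the method from the talk. The `kg_embeddings` dictionary, its vectors, and the helper names are all hypothetical, and the character-trigram fallback is only a generic stand-in for whatever matching strategy one might try.

```python
# Toy knowledge-graph embeddings keyed by canonical entity name.
# The vectors are made up for illustration only.
kg_embeddings = {
    "The New York Times": [0.12, -0.40, 0.33],
    "The Washington Post": [0.08, -0.35, 0.41],
    "The Guardian": [0.15, -0.22, 0.29],
}


def exact_lookup(name):
    """Return the embedding for an exact name match, or None."""
    return kg_embeddings.get(name)


def trigrams(text):
    """Return the set of lowercase character trigrams in text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard similarity between the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def closest_entity(name):
    """Return the knowledge-graph entity whose name overlaps most with name."""
    return max(kg_embeddings, key=lambda entity: jaccard(name, entity))


query = "NY Times"
print(exact_lookup(query))    # None: the table's spelling is out of vocabulary
print(closest_entity(query))  # The New York Times
```

Surface matching like this is brittle. On the same trigram measure, an unrelated name such as Financial Times would score higher against NY Times than The New York Times does, which is part of why out-of-vocabulary entities are hard to handle.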
Another challenge is simply the sheer volume and variety of tables. Columns often differ from table to table, making it challenging to combine tables and map them to the same space.
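The column mismatch can be illustrated with a small sketch. The two toy tables below, their column names, and their values are assumptions made for illustration. Stacking them naively over the union of their columns leaves half of the cells empty, which gives a model little shared structure to learn from.

```python
# Two toy tables with entirely different columns (illustrative data only).
table_a = [
    {"company": "Acme", "city": "Paris", "employees": 120},
    {"company": "Globex", "city": "Lyon", "employees": 45},
]
table_b = [
    {"title": "Film A", "director": "Director X", "year": 2021},
    {"title": "Film B", "director": "Director Y", "year": 2019},
]


def columns(table):
    """Return the set of column names used anywhere in a table."""
    return set().union(*(row.keys() for row in table))


shared_columns = columns(table_a) & columns(table_b)
all_columns = sorted(columns(table_a) | columns(table_b))

# Naive stacking: one row per record, one slot per column seen anywhere.
stacked = [
    {col: row.get(col) for col in all_columns}
    for row in table_a + table_b
]

missing_cells = sum(value is None for row in stacked for value in row.values())
missing_fraction = missing_cells / (len(stacked) * len(all_columns))

print(shared_columns)    # set()
print(missing_fraction)  # 0.5
```

With more tables, the union of columns keeps growing while each row still fills only its own few slots, so the stacked representation becomes increasingly sparse.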
Solving These Challenges
A graph is just a representation of a data set where each node represents a data point. Relations are defined by the similarity of the data points, a notion that is not new. To learn more about the approach Kim and his team took and how they’ve solved some of the challenges above, we recommend watching the full talk: