Applying Transformer Recipes to Tabular Data

By Lynn Heidmann

Though large language models (LLMs) and Generative AI are all the rage, let’s face it: In most industries, tabular data is still king. Organizations have — and produce — loads of tables, and the most common machine learning (ML) method that data scientists apply to this data is (still) gradient boosting.

This blog post covers a recent episode of ML Research, In Practice from the Dataiku AI Lab (a team within Dataiku that aims to build a bridge between ML research and practical business applications). In this session, Jun Kim from Inria discusses his research (conducted with Léo Grinsztajn and Gaël Varoquaux from the SODA team) on applying transformer recipes to tabular representations and tabular data.

→ Watch the Full Talk: CARTE — Context-Aware Representation of Table Entities

With the success of convolutional neural networks (CNNs) and transformers, people are increasingly applying deep neural networks to tabular data. However, in benchmarks comparing their performance, gradient-boosted trees continue to outperform deep learning techniques.

Existing Work on Pre-Trained Models

Pre-trained models have proven successful in several fields of ML — for example, in language models (e.g., BERT) or image models. So to make full use of all the tables out in the real world, Kim says, “We might be better by having a pre-trained model to improve some performance for the downstream tasks.”

It’s worth noting that there has already been some work in this area. For example, Sherlock is a deep learning approach to semantic data type detection, and TURL is a framework for understanding tables through representation learning. However, performance evaluation in this line of work focuses on finding characteristics of the table — like column annotation or entity link prediction — and not on new sets of tables where we want to make predictions with ML methods.

Challenges to Making a Pre-Trained Model for Tables

So there has been work that describes tables with pre-trained models, and there has been work on knowledge embeddings, which can be used for downstream tasks. But several challenges arise when thinking about building a pre-trained model for tables.

First is a challenge Kim calls “out of vocabularies”: a table may not contain an entity name exactly as it appears in the knowledge graph, making it difficult to extract the corresponding embeddings from the knowledge graph and use them directly downstream. For example, a downstream task might refer to NY Times, while the knowledge graph records it as The New York Times.
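To make the mismatch concrete, here is a minimal Python sketch of the problem (the entity names and embedding values are purely illustrative): an exact-match lookup into pre-trained knowledge-graph embeddings fails as soon as the table’s surface form differs from the graph’s canonical name.

```python
import numpy as np

# Hypothetical pre-trained knowledge-graph embeddings,
# keyed by each entity's canonical name
kg_embeddings = {
    "The New York Times": np.array([0.12, -0.40, 0.88]),
    "The Washington Post": np.array([0.10, -0.35, 0.91]),
}

# Entity names as they appear in a downstream table
table_entities = ["NY Times", "The Washington Post"]

for name in table_entities:
    embedding = kg_embeddings.get(name)
    if embedding is None:
        # "NY Times" has no exact match, so no embedding can be
        # retrieved and reused for the downstream task
        print(f"{name!r}: not found in the knowledge graph")
    else:
        print(f"{name!r}: embedding retrieved")
```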

Another challenge relates simply to the sheer volume and variety of tables. Nearly every table has a different set of columns, making it challenging to combine tables and map them to the same space.
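As a simple illustration of this schema heterogeneity (the two tables below are invented for the example), naively stacking tables with disjoint columns produces no shared feature space, just missing values:

```python
import pandas as pd

# Two hypothetical tables describing entirely different entities
restaurants = pd.DataFrame({
    "name": ["Chez Nous", "Luigi's"],
    "cuisine": ["French", "Italian"],
    "avg_price": [45.0, 22.0],
})
movies = pd.DataFrame({
    "title": ["Heat", "Amélie"],
    "genre": ["Crime", "Comedy"],
    "runtime_min": [170, 122],
})

# A naive union of the columns leaves every cell the other table
# doesn't define as NaN; there is no common representation
combined = pd.concat([restaurants, movies], ignore_index=True)
print(combined)
```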

Solving These Challenges

To learn more about Kim and team’s approach and how they’ve solved some of the challenges above, we recommend watching the full talk (linked above). As a starting intuition: a graph is just a representation of a data set in which each node represents a data point, and relations are defined by the similarity of the data points — this notion is not new.
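As a generic illustration of that notion (and not the specific graph construction used in CARTE), the sketch below builds a k-nearest-neighbor similarity graph over random data points with scikit-learn: each point becomes a node, and edges connect it to its most similar neighbors.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 data points, 3 features each

# Sparse adjacency matrix: an edge from each point to its
# 2 nearest (most similar) neighbors in feature space
adjacency = kneighbors_graph(X, n_neighbors=2, mode="connectivity")

# Each nonzero entry (i, j) is an edge of the similarity graph
for i, j in zip(*adjacency.nonzero()):
    print(f"node {i} -- node {j}")
```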

