Enhancing Data Analytics Through Intelligent Join Suggestions

Tech Blog | Du Phan

Lately, the concept of data experience has been gaining attention in discussions around the enterprise data stack. As the name suggests, it refers to how people interact with data in enterprise settings. Due to fragmented data setups in these companies, their data lakes have the following characteristics:

  • Large amounts of data
  • Information redundancy
  • Incomplete metadata and documentation
Figure: The three groups of features of the data experience (Source: Castor).

The data experience can be broken down into three groups of related features that try to address the issues above:

  • Discovery: How users find, understand, and use data assets.
  • Community: How users share data knowledge.
  • Health: How users monitor and trust the data assets.

In this series, we focus on the search aspect of data discovery. Given the above issues with enterprise data lakes, traditional search engines no longer meet users’ needs. In the next section, we discuss the limitations of current search and introduce a more advanced search feature: join suggestion.

The Current State of Search

The initial go-to solution for companies addressing data discovery challenges has been keyword-based search. However, due to the complex and noisy nature of enterprise data lakes, this approach is proving less effective.

A practical illustration highlights this issue: A query for 2023 ARR information within our internal database yields 825 potential candidate datasets, whose names are quite similar and therefore not very helpful.

The first results when querying for “ARR 2023” in our internal data catalog.

A broader limitation of keyword search is the lack of contextual awareness: The process does not consider the user’s starting point. In practice, users rarely start from scratch, as they have some basic reference tables they want to enrich. For example:

  • To understand the churn patterns of past clients, the data scientist will look at historical purchase data, along with the client’s browsing history, as well as their demographic information.
  • To compute the annual recurring revenue of the company, the data analyst will need to assemble the various consistent revenue streams per client (subscription fees, monthly service charges, etc.).
  • To detect potential clients for a promotion campaign, the marketer will analyze their reactions to past campaigns, coupled with their recent purchasing trends.

A common pattern across these use cases is the need to link different business entities together, typically realized through join operations. The join suggestion feature addresses this need directly. The pipeline consists of three main steps:

  • Column matching
  • Candidate clustering
  • Ranking

Join Suggestion Pipeline

Column Matching

Column matching consists of determining whether two columns can be joined based on their representation of the same entity, such as a country, client, or product.

To achieve this, two conditions must be met:

  • Syntactic similarity: The columns should share a similar set of values.
  • Semantic similarity: The columns should convey the same meaning.

Our primary focus lies on the first criterion, for three reasons. First, syntactic matching is the fundamental step of any join search approach: if there is no syntactic similarity between columns, an (exact) join cannot occur. Second, semantic matching is considerably more complex and requires further exploration. Third, in enterprise settings, the data lake generally adheres to a well-defined nomenclature, which makes semantic ambiguity less common. Consequently, by focusing on syntactic matching, we can already identify a substantial portion of candidates that are also semantically similar.

Finding similar sets is a widely researched topic in computer science. The scalability bottleneck of the all-pairs comparison approach has driven the development of locality-sensitive hashing (LSH) techniques. Relying on hashing mechanisms allows one to uncover potentially similar pairs at a much lower cost, without exhaustively analyzing every possible combination.

LSH can be combined with numerous metrics that quantify the similarity between two sets. In join suggestion, the goal is the most complete join possible: every row in the query table should find matching rows in the candidate table. Consequently, the metric of interest is Jaccard Containment: the ratio of the size of the intersection of the two columns to the size of the query column. A Jaccard Containment score of one denotes full coverage of the query set by the candidate set, implying a complete join.
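As a minimal illustration with toy values, the metric can be computed exactly as follows; in practice, LSH lets us estimate it without scanning every candidate:

def jaccard_containment(query_col, candidate_col):
    # JC(Q, X) = |Q intersect X| / |Q|: the fraction of query values covered by the candidate.
    q, x = set(query_col), set(candidate_col)
    return len(q & x) / len(q) if q else 0.0

print(jaccard_containment(["FR", "DE", "US"], ["FR", "DE", "US", "JP"]))  # 1.0 -> complete join
print(jaccard_containment(["FR", "DE", "US"], ["FR", "JP"]))              # ~0.33 -> partial join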

The two most well-known LSH-based techniques for Jaccard Containment are Lazo and LSHEnsemble. In our operational pipeline, Lazo is chosen for two principal reasons:

  • Superior scalability: Lazo’s scalability makes it better suited for real-world applications.
  • Estimation capability: Both Lazo and LSHEnsemble return candidates above a chosen Jaccard Containment threshold. However, only Lazo estimates the actual Jaccard Containment value between the query column and candidates, facilitating candidate ranking.
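To give a feel for how such an index is used, here is a minimal sketch built on datasketch’s MinHashLSHEnsemble (Lazo, which we actually use and cover in Part 2, exposes a similar index-then-query workflow; the column contents and threshold are toy values):

from datasketch import MinHash, MinHashLSHEnsemble

def sketch(values, num_perm=128):
    # Build a MinHash sketch of a column's distinct values.
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(str(v).encode("utf8"))
    return m

query = ["FR", "DE", "US"]
candidates = {"dim_country": ["FR", "DE", "US", "JP"], "fact_sales": ["FR", "IT"]}

# Index each candidate column; LSH Ensemble needs the sketch and the set cardinality.
lsh = MinHashLSHEnsemble(threshold=0.6, num_perm=128)
lsh.index([(name, sketch(vals), len(set(vals))) for name, vals in candidates.items()])

# Retrieve candidate columns whose containment of the query likely exceeds the threshold.
print(list(lsh.query(sketch(query), len(set(query)))))  # e.g. ['dim_country']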

In our next article, we will delve deeper into Lazo.

Candidate Clustering

Enterprise data lakes usually contain a huge number of datasets with much redundancy between them, due to organizational silos and the lack of an efficient data discovery system.

Therefore, the set matching phase will potentially yield a long and repetitive list of candidates. This issue becomes particularly pronounced when the query column is a common entity, such as “client_id.” Additionally, similar candidates tend to have similar estimated Jaccard Containment scores, making them close in the ranking list. This behavior confuses users and hampers their ability to explore potentially more interesting candidate datasets further down the list.

An impactful way to enhance user interaction is therefore to cluster similar datasets.

While dataset similarity attracts substantial attention in the research world, the focus leans heavily toward images, and tabular datasets remain comparatively underexplored. Here, we propose an intuitive two-step process grounded exclusively in dataset schema comparison:

  • Initially, candidates are clustered based on the syntactic similarity of their schemas. Each cluster is represented by a representative candidate.
  • Then, the representative candidates sharing common themes are grouped together.

From a product perspective, this allows users to navigate candidates across multiple layers of granularity:

  • First, users are presented with a ranked list of topics inferred from the candidate datasets. This helps users with specific search intentions focus on topics that match their use case. The ranking follows successively: the average estimated Jaccard Containment across schema clusters in a topic, the number of clusters in each topic, and finally, the total candidate count (expressed as a simple sort in the sketch after this list).
  • Then, within a given topic, users can browse the list of clusters. Since the datasets within a cluster have very similar schemas, only the representative dataset is displayed.
  • In cases where the representative dataset doesn’t precisely fit the user’s requirements, they can explore all datasets within that cluster, facilitating the selection of a better candidate for their intended use case.
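The topic-level ordering mentioned in the first bullet boils down to a lexicographic sort; a minimal sketch with hypothetical field names:

topics = [
    {"name": "Customers", "avg_containment": 0.92, "n_clusters": 5, "n_candidates": 120},
    {"name": "Revenue", "avg_containment": 0.92, "n_clusters": 3, "n_candidates": 40},
]
# Sort by average estimated Jaccard Containment, breaking ties by the number of
# schema clusters, then by the total candidate count (all descending).
topics.sort(key=lambda t: (t["avg_containment"], t["n_clusters"], t["n_candidates"]), reverse=True)
print([t["name"] for t in topics])  # ['Customers', 'Revenue']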

Let’s dive in a bit more.

Schema Clustering

First, we create a concatenated string of column names for each candidate dataset, ensuring the removal of any special characters. We then apply tf-idf on this preprocessed data, generating a document-term matrix. Subsequently, various clustering algorithms can be applied to this matrix.

The tuning of clustering hyperparameters depends on the desired outcome. Smaller clusters guarantee high redundancy among datasets but may yield numerous overly similar clusters. On the other hand, larger clusters risk grouping unrelated candidates.
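A minimal sketch of this step with scikit-learn, using toy schemas (the choice of DBSCAN and its parameters are illustrative, not necessarily the exact ones we use):

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# One "document" per candidate dataset: its concatenated column names, special characters removed.
schemas = {
    "arr_2023": ["client_id", "arr_usd", "fiscal_year"],
    "arr_2023_backup": ["client_id", "arr_usd", "fiscal_year", "region"],
    "web_sessions": ["session_id", "client_id", "page_views"],
}
docs = [re.sub(r"[^a-z0-9 ]", " ", " ".join(cols).lower()) for cols in schemas.values()]

# Document-term matrix over the column-name tokens.
matrix = TfidfVectorizer().fit_transform(docs)

# Cluster schemas by cosine distance; eps controls how similar two schemas must be
# to land in the same cluster (smaller eps -> smaller, more redundant clusters).
labels = DBSCAN(eps=0.4, min_samples=1, metric="cosine").fit_predict(matrix)
print(dict(zip(schemas, labels)))  # e.g. {'arr_2023': 0, 'arr_2023_backup': 0, 'web_sessions': 1}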

Selecting a representative dataset for each cluster is an interesting question that involves two dimensions:

  • Data-based: Opting for the most comprehensive dataset, e.g., the one that subsumes the other datasets, or the one with the most rows/columns.
  • Metadata-based: Favoring datasets with greater usage or popularity, which indicates higher quality and reliability.
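As an illustration, a simple rule combining both dimensions could score each cluster member on usage first and size second (the metadata fields here are hypothetical):

def pick_representative(cluster):
    # Favor the most-queried dataset, breaking ties with a rows-times-columns size proxy.
    return max(cluster, key=lambda d: (d["n_queries_90d"], d["n_rows"] * d["n_cols"]))

cluster = [
    {"name": "arr_2023", "n_queries_90d": 54, "n_rows": 10_000, "n_cols": 12},
    {"name": "arr_2023_backup", "n_queries_90d": 2, "n_rows": 12_000, "n_cols": 12},
]
print(pick_representative(cluster)["name"])  # arr_2023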

Topic Clustering

After retrieving the schema clusters and their representatives, we add a semantic layer on top of these results. Specifically, our aim is to aggregate clusters that belong to similar topics.

In this pursuit, Large Language Models (LLMs) have emerged as a potent solution. Our approach involves constructing prompts akin to the following:

You are given a list of dataset names along with their columns’ names.

Your goal is to group these datasets together based on the topic that they are about.

For example: dataset ARR_2013 and dataset 2012_NARR_detailed must be grouped together as they are about the annual recurring revenue.

Your answer must be a JSON list of groups with a good description of the inferred topic.

[
  {"group_member": ["customer_dataset", "client_transaction_2010"], "topic": "Customers and their activities"},
  {"group_member": ["workday_data", "emmployee_active", "employee_paris"], "topic": "Information about current and past employees"},
  {"group_member": ["ARR_2013", "2012_NARR_detailed"], "topic": "Annual recurring revenue"}
]

You must avoid having one dataset per group or all datasets in one group.

The list of datasets along with their columns:

It’s worth noting that we exclusively use representative datasets for two reasons:

  • Datasets within the same cluster exhibit substantial redundancy, implying a common underlying topic.
  • Given the current token limitations, incorporating all candidate datasets into the prompt is not feasible.
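To make this concrete, here is a minimal sketch of sending an abridged version of the prompt above to an LLM and parsing its JSON answer, assuming an OpenAI-compatible client (the model name and helper are illustrative):

import json
from openai import OpenAI

PROMPT = """You are given a list of dataset names along with their columns' names.
Your goal is to group these datasets together based on the topic that they are about.
Your answer must be a JSON list of groups with a good description of the inferred topic.
The list of datasets along with their columns:
{datasets}"""

def group_by_topic(representatives):
    # representatives: {dataset_name: [column_name, ...]} for each cluster representative.
    listing = "\n".join(f"{name}: {', '.join(cols)}" for name, cols in representatives.items())
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(datasets=listing)}],
    )
    # Expected shape: a list of {"group_member": [...], "topic": "..."} objects.
    return json.loads(response.choices[0].message.content)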

(Better) Ranking

Up until this point, our approach has primarily relied upon the estimated Jaccard Containment as the ranking metric. This choice is informed by the fact that without a clear understanding of the user’s intent and, consequently, a measurable objective, our metric options are significantly limited.

To improve the relevance of the ranking, we need to incorporate more user-driven signals. There are several logical steps we can take to achieve this:

  • Utilizing User Inputs With LLMs: One effective strategy involves engaging users directly to understand their specific requirements. We can prompt users for details about what they seek and then employ LLMs to process these requests intelligently.
  • Keyword-Driven Approach: Alternatively, if we wish to avoid the complexity of natural language processing, we can ask users to provide relevant keywords and rank the outcomes accordingly (a minimal sketch follows this list).
  • Focused Scope: To streamline our efforts, we can narrow our focus to a specific use case, such as machine learning. Here, the objective would revolve around identifying joins that have the potential to enhance model performance.
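For instance, the keyword-driven option could be as simple as boosting candidates whose column names contain the user’s keywords, keeping the estimated Jaccard Containment as a tie-breaker (a sketch with hypothetical fields):

def rerank(candidates, keywords):
    # Primary key: how many user keywords appear in the candidate's column names;
    # secondary key: the estimated Jaccard Containment from the matching phase.
    kws = [k.lower() for k in keywords]
    def score(c):
        cols = [col.lower() for col in c["columns"]]
        overlap = sum(any(kw in col for col in cols) for kw in kws)
        return (overlap, c["containment"])
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "web_sessions", "columns": ["client_id", "page_views"], "containment": 0.95},
    {"name": "arr_2023", "columns": ["client_id", "arr_usd"], "containment": 0.90},
]
print([c["name"] for c in rerank(candidates, ["arr"])])  # ['arr_2023', 'web_sessions']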

In summary, by harnessing user insights and leveraging advanced techniques like LLMs, we can significantly refine our approach and ultimately yield more meaningful results.

Conclusion

In this article, we have discussed the issues with the current search experience for data discovery and the potential benefits of join suggestion. The proposed pipeline involves efficient set matching with Lazo’s hash-based technique, candidate clustering based on schema and semantic similarities, and possible improvements to the ranking metrics. Together, these steps enhance the user’s data experience.

Stay tuned for the upcoming Part 2, where we will explore Lazo in more detail and its connection to the traditional MinHash LSH technique.
