Tap Into All Your Data's Senses: The Art of Multimodal ML

Christina Hsiao

What do cooking, walking on a dark or busy street, and building a machine learning (ML) model have in common?

At first glance, these may seem like completely unrelated activities, but they share a common thread: the benefit of multifaceted inputs for better decision making. Whether you're cooking or navigating a risky situation, relying on a single sense or input can be limiting. Just as a chef uses multiple senses (sight, smell, sound, taste, touch) to craft the perfect dish and a pedestrian relies on both sight and sound to stay safe, embracing multimodality in ML models can lead to better, more accurate outcomes. 

While human brains automatically translate diverse sensory inputs into binary nerve signals, the multimodal ML approach attempts a similar feat: extracting and converting information from diverse input modalities such as text, images, and tabular data into numeric features that an algorithm then uses to make a prediction. 

Multimodal ML combines information from mixed data types, or “modalities,” as inputs to a prediction.

Read on to explore different real-world use cases where a multimodal ML approach is valuable, and examine how Dataiku's framework can help your team use this technique to achieve better predictions.

Real-World Applications of Multimodal ML

There are so many practical applications for multimodal ML that it's tough to choose just a few! Let's explore three examples:

  1. Medicine: Enhancing Diagnosis and Treatment Decisions
  2. Marketing Analytics for E-Commerce, Retail, and Real Estate
  3. Content Moderation on Social Media Platforms

Medicine: Enhancing Diagnosis and Treatment Decisions

Patient data is by nature multimodal, including numerical test results and structured clinical data, radiology images, and freeform physician and nursing reports. Blending information from these varied data types can lead to a more holistic assessment of a patient’s health and conditions, enhance diagnostic precision, and aid in treatment decisions. For example, in Alzheimer's disease diagnosis, relying solely on structural MRI scans or speech analysis yields approximately 80% detection accuracy. However, by also incorporating complementary modalities such as audio features, speech transcripts, genomic data, and clinical assessments, multimodal models have achieved over 90% diagnostic accuracy.

Multimodal ML for this melanoma detection model uses a patient’s structured data as well as images of their lesions.

E-Commerce, Retail, and Real Estate: Improving Pricing and Buyer Propensity Models

When researching and shopping for a product online, be it regular consumer packaged goods, a vehicle, or even a real estate property, you probably consider a range of factors before making a purchase decision: price, specifications, written reviews, star ratings, product photos, and so on. Am I right? This makes sense — all of these inputs help you construct a comprehensive view of an item’s quality, style, and suitability for your needs. This type of use case is perfect for multimodal ML, where the accuracy of predicting a product’s sale price, customer conversion rate, or other outcome can be improved by considering diverse input types, rather than structured data alone. 

Content Moderation: Detecting Hate Speech or Inappropriate Content

Multimedia content is the lifeblood of social networking and online gaming, but how can these sites possibly comb through millions of text posts, images, videos, memes, etc. to detect content that violates their policies? Enter multimodal ML. 

Social media content, especially memes, is by nature multimodal. For example, an image of a skunk and the sentence “you smell good” are benign or neutral separately, but can be hateful when interpreted together in a given context. Research shows that multimodal approaches outperform unimodal approaches in detecting hate speech, highlighting the importance of combining visual and textual features. In fact, in 2020, Facebook AI (now Meta) launched the Hateful Memes Challenge and Dataset, an online competition in which AI researchers were given an open source dataset of 10,000+ examples to advance progress in multimodal hate speech detection. With recent advancements in Generative AI, the need for accurate, automated content moderation techniques is even more critical.

Multimodal ML in Dataiku: How Does It Work?

Infrastructure Readiness

To enable this feature, administrators must first make approved embedding models accessible through the Dataiku LLM Mesh, which facilitates their integration into pipelines and Dataiku AutoML via secure, access-controlled connections. Choose from local image or text embedding models downloaded from the Hugging Face Hub, or connect to API-based text embedding services from LLM providers like OpenAI, Azure OpenAI, Amazon Bedrock, or Databricks Mosaic AI. 

Through the LLM Mesh, Dataiku offers connections to various LLM providers.

Data Prep

If your image data is stored as files in a folder in your Flow, use the List Folder Content recipe to create a dataset with a column containing the path to each image in the managed folder, then join the results to your main training dataset. The good news is that for unstructured text fields, no specific data prep (tokenization, stemming, etc.) is required; modern models can accommodate raw text inputs, including datasets with observations in mixed languages.
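For intuition, the join described above is simply a left join between the folder listing and the training data. Here is a minimal pandas sketch, assuming placeholder column names and paths that are not from any Dataiku dataset:

```python
import pandas as pd

# Placeholder result of listing a managed folder's contents:
# one row per image, with its path inside the folder.
image_paths = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "image_path": ["/lesions/101.png", "/lesions/102.png", "/lesions/103.png"],
})

# Main training dataset with structured features and the target.
train = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "age": [54, 61, 47],
    "lesion_diameter_mm": [6.2, 11.8, 4.1],
    "is_melanoma": [0, 1, 0],
})

# Join so each training row carries the path to its corresponding image.
multimodal_train = train.merge(image_paths, on="patient_id", how="left")
```

In Dataiku, the visual Join recipe accomplishes the same thing without code.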

Using a visual recipe to generate the metadata and format expected for labeling, modeling, and other image-based tasks in Dataiku.

Model Development

In Dataiku AutoML, incorporating diverse data modalities into a model is straightforward; you can expect a familiar workflow in the features handling tab of the Visual ML Lab. Text and image features are rejected by default, so simply toggle on each column you wish to include in the model design. Then choose the variable type and select the pre-trained embedding model used to vectorize the information. Note that text and image features are handled independently: embeddings from different modalities don't share a latent space, as they would with true multimodal embeddings, which serve more specialized applications and are a separate topic.
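Conceptually, this step vectorizes each enabled text or image column with a pre-trained model and feeds the resulting vectors to the learning algorithm alongside the tabular features. The sketch below illustrates that idea outside of Dataiku using open source libraries; the embedding model, column values, and classifier are illustrative assumptions, not the Dataiku API:

```python
# Conceptual sketch: embed a free-text column, then concatenate the
# embeddings with tabular features before training a standard classifier.
import numpy as np
from sentence_transformers import SentenceTransformer  # example text embedder
from sklearn.linear_model import LogisticRegression

notes = [
    "mild itching near the lesion, stable size",
    "rapid growth, irregular borders, occasional bleeding",
]
tabular = np.array([[54, 6.2], [61, 11.8]])  # e.g., age, lesion diameter (mm)
labels = np.array([0, 1])                    # e.g., benign vs. melanoma

# Each modality is vectorized independently (no shared latent space).
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
text_vectors = embedder.encode(notes)               # shape: (n_rows, embedding_dim)

# Combine modalities into one feature matrix and train as usual.
X = np.hstack([tabular, text_vectors])
model = LogisticRegression(max_iter=1000).fit(X, labels)
```

In Dataiku, the features handling screen performs this per-column embedding and concatenation for you, using the embedding model you select from the LLM Mesh.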

Enable text or image features, then choose an embedding model from the LLM Mesh.

Once trained and deployed back to the Flow, a multimodal model can be operationalized via Dataiku Scenarios, and evaluated and scored for performance assessment over time. 
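Conceptually, the recurring evaluation step that a scenario automates boils down to scoring a fresh batch of labeled records with the saved model and logging a metric over time. A minimal scikit-learn sketch, using made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Placeholder: ground-truth labels and model scores for the latest batch.
y_true = [0, 1, 0, 1, 1]
y_score = [0.12, 0.84, 0.33, 0.91, 0.40]

# The kind of metric a scheduled run might log to watch for drift.
print(f"AUC on latest batch: {roc_auc_score(y_true, y_score):.3f}")
```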

A typical Flow containing a multimodal dataset with the managed folder of images. The resulting Saved Model is then evaluated and scored.

To wrap up, let's circle back to our cooking and pedestrian metaphors. Just as we rely on multiple senses in the kitchen or on the street, multimodal ML allows you to activate multiple “senses” for smarter decision-making. With Dataiku, you can tap into the latest embedding models to boost accuracy while still staying compliant with data regulations and AI Governance policies, thanks to the LLM Mesh. Plus, whether you're a code whiz or prefer a point-and-click approach, Dataiku's got you covered with both programmatic and code-free tooling that makes building and deploying these models a breeze. 
