How I Found the Best Ramen Restaurant for My French Friend Using NLP

Use Cases & Projects, Dataiku Product, Featured Makoto Miyazaki, Louis Takumi

This is my friend, Louis Takumi. He is French-Japanese and one of the newest Dataikers, having joined the company this year. He will be the very first data scientist in our soon-to-open Tokyo office, but what really makes him stand out is his passion for ramen. When he signed the contract to officially come on board, he asked me, “Makoto, which ramen restaurant do you recommend in Tokyo?”

Louis Takumi

I replied, “I like Ramen Wakura in Shinjuku city, they have nice lightly-flavored tonkotsu (pork bone) soup.”

“Oh but I prefer ramen with heavy and rich soup,” said Louis, “like Naritake Ramen in Paris, you know.”

Naritake Ramen is famous for its broth full of flavors oscillating between soybeans, condiments, and the taste of pork. Of course I have tried it, but it was just not my type.

Naritake Ramen

When visiting Japan, a lot of tourists are surprised by the variety of ramen flavors. For the soup alone, there is tonkotsu, miso, soy sauce, sesame, dried sardines, curry, milk, and the list goes on and on. This means that the preferred flavor differs a lot among individuals. If you are planning to visit Japan, you should never trust what your guide book says. On top of that, you don’t want to post photos of your ramen on Instagram, only to find tons of similar photos already posted by other tourists, right?

So, what is the best way to recommend the best ramen restaurant in Tokyo for Louis? Well, let’s use Natural Language Processing (NLP) — follow along in this blog post to see how we did it!

Where Do the Japanese Talk About Ramen?

In Japan, there is a website called Tabelog. It is something similar to Tripadvisor, and it has around 34.8 million reviews of about 900,000 restaurants all over Japan. It is a monstrous review site with 100 million monthly users. Based on these word-of-mouth reviews, I used Dataiku to create a recommendation model of ramen shops for non-Japanese people.

When you make a project in Dataiku, you can package it as a visual application. Below you can see my final application for Louis. This is how it works: He types the URL of Naritake Ramen on Tripadvisor and clicks the green button. The app returns the name of the best ramen restaurant in Tokyo. 

project in Dataiku

What restaurant did my app recommend to him? Well, read through this blog post and you will see at the end!

The Menu

The overall pipeline behind my app goes like this: The app scrapes the reviews from the Tripadvisor page that the user enters. Then, it compares the Tripadvisor reviews with the 15,000 Tabelog reviews from 730 ramen restaurants in Tokyo. Lastly, the Tabelog ramen restaurant with the highest similarity is recommended to you.

But the real pipeline is a bit more complicated. Let's take a closer look. 

  1. The scraped Tabelog review texts are split into words. 
  2. Using the word2vec model, words are embedded into a vector space in order to derive the relationships between each word. 
  3. Words are clustered and only those that are related to ramen are kept; all the other words are dropped from the dataset. 
  4. The Tripadvisor reviews are also processed in the same way after being translated into Japanese.
  5. The similarity values between the Tripadvisor reviews and the Tabelog reviews are calculated.

order of events for data pipeline

Cleaning Reviews

scraped raw review texts

Above are the scraped raw review texts, but you cannot directly use them for analysis. NLP is all about removing extraneous words from the text, leaving only the necessary and meaningful words. It is no exaggeration to say that this preprocessing determines the quality of the recommendation engine. 

In particular, Japanese reviews make extensive use of kaomoji. Kaomoji are a popular Japanese emoticon style made up of brackets, carets, underscores, and many other symbols, and they are used to express emotion in texting and online communication. Unlike pictorial emoticons, they come in an infinite number of patterns, which makes them very difficult to remove.

Below is an example from one of the Tabelog reviews, with my lovely English translation.

Tabelog review with english translation

If you are an engineer implementing NLP, you may want to hang up your hat and hope to be reincarnated in a world without kaomoji. To clean up this infinite number of kaomoji patterns, I used regular expressions in a Python recipe.

regular expressions in a Python recipe
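The original regular expressions are not reproduced in the text, so here is a minimal sketch of the idea, not the recipe actually used. It assumes that a kaomoji is a short bracketed group made up only of "face" characters; the character sets `SYMBOLS` and `LETTERS` and the function name `strip_kaomoji` are illustrative assumptions and far from exhaustive.

```python
import re

# Symbols that often form the "face" of a kaomoji (assumed, non-exhaustive).
SYMBOLS = r"\^_;:\*`´・ω∀▽Д≧≦＾￣<>=~\-\.'"
# Letters that often stand in for eyes or mouths, e.g. (T_T) or (^o^).
LETTERS = "TtoOvV"

KAOMOJI = re.compile(
    r"[＼ヽヾ\\]?"                           # optional leading "arm"
    r"[\(（]"                                # opening bracket, half- or full-width
    r"(?=[^\)）]*[" + SYMBOLS + r"])"        # must contain at least one face symbol
    r"[" + SYMBOLS + LETTERS + r"\s]{2,10}"  # the face itself
    r"[\)）]"                                # closing bracket
    r"[／ﾉ/]?"                               # optional trailing "arm"
)


def strip_kaomoji(text: str) -> str:
    """Replace kaomoji-like spans with a space, then tidy up whitespace."""
    without_faces = KAOMOJI.sub(" ", text)
    return re.sub(r"\s+", " ", without_faces).strip()


if __name__ == "__main__":
    print(strip_kaomoji("美味しかった(^_^)"))
    print(strip_kaomoji("スープ(豚骨)が濃厚"))  # ordinary parentheses are left alone
```

Requiring at least one symbol inside the brackets is a design choice that keeps ordinary parenthetical remarks intact. Real reviews would need a much longer list of patterns.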

Wakachi-Gaki and Word2vec Modeling

The main difference between Japanese and English NLP is that there are no breaks between words in Japanese. Take a look at the sentence below.

I had a bowl of ramen today.


Both the English sentence and its Japanese equivalent describe the same thing. Thanks to the spaces, English sentences are easy to separate into words. But what about the Japanese? Japanese is written using a mixture of three scripts (hiragana, katakana, and kanji), so it is possible to guess where the words are split by looking at where the script changes.

guess where Japanese words split
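The guess based on script changes can be sketched in a few lines. This is only an illustration: the example sentence is my own assumption rather than the one from the original figure, and the Unicode ranges cover just the main hiragana, katakana, and kanji blocks.

```python
import itertools


def script_of(char: str) -> str:
    """Classify a character by the Unicode block it belongs to."""
    code = ord(char)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_on_script_change(text: str) -> list:
    """Cut the text wherever the script changes from one character to the next."""
    return ["".join(group) for _, group in itertools.groupby(text, key=script_of)]


if __name__ == "__main__":
    # Hypothetical sentence meaning roughly "I ate ramen today."
    print(split_on_script_change("今日はラーメンを食べた"))
    # ['今日', 'は', 'ラーメン', 'を', '食', 'べた']
```

Note how the verb 食べた is cut between its kanji and its hiragana ending, which is not a word boundary.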

However, this division is not perfect. If we break this sentence down to its smallest elements, we get the following:

division of sentence in Japanese

It is only when we get to this point that we are at the same starting point as English NLP. This process, which is unique to Japanese NLP, is called "wakachi-gaki (dividing into words)." There are a number of Python packages that can do the wakachi-gaki for you. I used one of the most commonly used, MeCab. Warning: this requires a system-level installation.

After implementing wakachi-gaki with MeCab, I trained the word2vec model. I put the trained model in a folder on the Flow for later use.

trained model in a folder of the Flow
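For readers who want to try these two steps outside Dataiku, here is a rough sketch. It assumes the `mecab-python3` binding (with a dictionary installed) and gensim 4.x; the function names and the hyperparameter values are illustrative assumptions, not the settings used for this project.

```python
def make_tagger():
    """Create a MeCab tagger that prints tokens separated by spaces."""
    import MeCab  # assumption: mecab-python3 plus a system-level MeCab dictionary

    return MeCab.Tagger("-Owakati")


def wakachi_gaki(text, tagger):
    """Split one Japanese sentence into a list of words."""
    # In wakati mode, parse() returns the tokens joined by single spaces.
    return tagger.parse(text).split()


def train_word2vec(tokenized_reviews, model_path):
    """Train a word2vec model on tokenized reviews and save it to disk."""
    from gensim.models import Word2Vec  # assumption: gensim >= 4.0

    model = Word2Vec(
        sentences=tokenized_reviews,  # list of token lists, one per review
        vector_size=100,              # illustrative values, not tuned
        window=5,
        min_count=5,
        workers=4,
    )
    model.save(model_path)
    return model
```

A typical call would be `train_word2vec([wakachi_gaki(r, tagger) for r in reviews], "ramen.model")`, where `reviews` is a list of cleaned review strings and the file name is hypothetical.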

Training the word2vec model makes it possible to map each word into a vector space. Words in the review are no longer mere words, but can now be treated as objects with mathematical values. You can even measure the distance between words. Words separated by a small distance are similar to each other, and vice versa.
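To make "distance between words" concrete, here is a small sketch using cosine similarity, a common choice for word vectors. The three-dimensional vectors are made-up toy values for illustration only; a trained model would supply much longer vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "tonkotsu": [0.9, 0.1, 0.2],
    "pork": [0.8, 0.2, 0.1],
    "counter": [0.1, 0.9, 0.3],
}

if __name__ == "__main__":
    print(cosine_similarity(toy_vectors["tonkotsu"], toy_vectors["pork"]))
    print(cosine_similarity(toy_vectors["tonkotsu"], toy_vectors["counter"]))
```

With these toy values, "tonkotsu" sits much closer to "pork" than to "counter".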

Furthermore, words can be added and subtracted. A well-known example: when a word2vec model is trained on general text data such as Wikipedia, you can express the word “queen” with a formula like the one below:

king - man + woman = queen
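Word arithmetic of this kind boils down to adding and subtracting vectors and then looking for the nearest remaining word. The sketch below uses hand-made toy vectors whose three axes loosely stand for royalty, male, and female; these values and the helper names are assumptions for illustration, not output from a trained model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def combine(vectors, positive, negative):
    """Add the vectors of `positive` words and subtract those of `negative` words."""
    size = len(next(iter(vectors.values())))
    total = [0.0] * size
    for word in positive:
        total = [t + x for t, x in zip(total, vectors[word])]
    for word in negative:
        total = [t - x for t, x in zip(total, vectors[word])]
    return total


def nearest_word(vectors, positive, negative):
    """Return the word closest to the combined vector, ignoring the input words."""
    target = combine(vectors, positive, negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Hypothetical toy vectors: (royalty, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    print(nearest_word(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

Excluding the input words from the candidates is a common convention, since the combined vector usually stays close to its own inputs.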

Jumping Into the Universe of Ramen Words

However, if the model is trained on text data that consists of ramen reviews, something funny happens. For example, let's compute an addition below and see what the top results are:

“ramen” + “Sapporo” = ?

ramen and Sapporo

As a result, the words “Miso Ramen” came to the top. Indeed, Sapporo, a city in northern Japan, is famous for its miso ramen.

What about this? Ramen is a popular Japanese food, but it originally came from China. So what happens when you subtract ramen from Chinese food?

Chinese - ramen = ?

Chinese minus ramen

The top result, soba (buckwheat noodles), is a traditional Japanese noodle dish that has been eaten across the country since before ramen was introduced from China. So what about this?

ramen shop + midnight - ramen = ?

ramen shop midnight

It is interesting that words such as “last train,” “bar,” and “refrain from going out” appear at the top of the list. It seems that, after slurping the last drop of soup at a ramen stand, people spend the rest of the night in a variety of ways.

Finally, what happens if vegetarians try to eat ramen in Japan, a country notorious for its unfriendliness to vegetarians?

vegetarian + ramen = ?

vegetarian ramen

Do Clustering to Focus on Ramen-Related Words

If you look carefully at reviews on Tripadvisor, Tabelog, or anything similar, you'll find that reviews are not always about food and restaurants. There are people who write about conversations with loved ones, those who quote the famous French poet Stéphane Mallarmé's "Salut" in their reviews of drinks, ones who write about how to impress the person you love with a meal, or tell the story of how they tried to pick up a waiter and failed spectacularly. If you simply want to find the ramen that fits your preference, you have to omit these irrelevant details from your data. 

Clustering the words will help you identify groups of words that are irrelevant to your needs. Dataiku’s AutoML can do K-Means clustering with different numbers of clusters that you set. The best result I obtained was with four clusters.

4 clusters on ramen-related words
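Dataiku's AutoML handled the clustering here, so no code was needed. For intuition, the K-Means procedure itself can be sketched in plain Python as below. The two-dimensional word vectors, the choice of two clusters, and the fixed starting centroids are toy assumptions made to keep the example small and repeatable.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, initial_centroids, max_iterations=100):
    """Assign each point to its nearest centroid, then move the centroids; repeat."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = centroid_of(members)
    return labels, centroids


# Hypothetical toy word vectors.
words = ["broth", "noodle", "counter", "seat"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

if __name__ == "__main__":
    labels, _ = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
    print(dict(zip(words, labels)))
```

In practice the starting centroids are usually chosen at random and the run is repeated for several cluster counts, which is the part AutoML automates.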

Clusters themselves do not have names. It is your task to look at the words in each of the four clusters and name them. As shown below, I named the clusters “interior,” “website & reviews,” “taste & ingredients,” and “location & appearance.”

words within each cluster

Now I can finally focus on words belonging to two clusters: “interior” and “taste & ingredients.” All other words are excluded from the reviews. In this way, my application will find the restaurant that fits your favorite taste as well as your preferred atmosphere. The clustering model trained on the Tabelog reviews is also applied to the Tripadvisor reviews, again keeping only the words for “interior” and “taste & ingredients.”
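Once every word has a cluster name, the filtering step is a simple lookup. The sketch below assumes a dictionary from word to cluster name; the sample words and their assignments are hypothetical.

```python
KEPT_CLUSTERS = {"interior", "taste & ingredients"}


def keep_relevant_words(tokens, word_to_cluster, kept_clusters=KEPT_CLUSTERS):
    """Keep only tokens whose cluster is in `kept_clusters`; unknown words are dropped."""
    return [token for token in tokens if word_to_cluster.get(token) in kept_clusters]


# Hypothetical cluster assignments.
word_to_cluster = {
    "rich": "taste & ingredients",
    "garlic": "taste & ingredients",
    "counter": "interior",
    "station": "location & appearance",
    "review": "website & reviews",
}

if __name__ == "__main__":
    review = ["rich", "garlic", "station", "counter", "review", "yesterday"]
    print(keep_relevant_words(review, word_to_cluster))
```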

Measuring Similarity

I then used TF-IDF to score the words and narrowed the list down to the 40 words with the highest TF-IDF scores. Then, using the trained word2vec model once more, I converted the remaining words into a vector representation and measured the cosine similarity between the Tripadvisor and Tabelog reviews. The final dataset, with its similarity scores, looks like the one below:

40 words with top TF-IDF scores
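The scoring step can be sketched end to end on toy data. Two things here are my assumptions rather than details given above: the selected word vectors are combined by simple averaging, and the TF-IDF weights are computed over a tiny three-document corpus. The words, vectors, and the top-2 cutoff (instead of 40) are made up for illustration.

```python
import math
from collections import Counter


def top_tf_idf_words(documents, top_n):
    """For each non-empty token list, return its top_n words ranked by TF-IDF."""
    doc_freq = Counter(word for doc in documents for word in set(doc))
    selected = []
    for doc in documents:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(len(documents) / doc_freq[word])
            for word, count in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties: alphabetical
        selected.append(ranked[:top_n])
    return selected


def average_vector(words, vectors):
    known = [vectors[w] for w in words if w in vectors]
    return [sum(coords) / len(known) for coords in zip(*known)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical data: one query review and two candidate shops.
query = ["rich", "pork", "soup", "noodle"]
shop_a = ["rich", "pork", "noodle", "garlic"]
shop_b = ["light", "fish", "soup", "noodle"]
vectors = {
    "rich": [1.0, 0.0], "pork": [0.9, 0.1], "garlic": [0.8, 0.2],
    "light": [0.0, 1.0], "fish": [0.1, 0.9],
    "soup": [0.5, 0.5], "noodle": [0.5, 0.5],
}

query_words, a_words, b_words = top_tf_idf_words([query, shop_a, shop_b], top_n=2)
query_vec = average_vector(query_words, vectors)
similarity_a = cosine_similarity(query_vec, average_vector(a_words, vectors))
similarity_b = cosine_similarity(query_vec, average_vector(b_words, vectors))

if __name__ == "__main__":
    print("shop_a:", similarity_a, "shop_b:", similarity_b)
```

The shop with the highest similarity to the query review would be the one recommended.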

What Is the Best Ramen in Tokyo for Louis?

Now that my app has finished running, Louis can finally find out what the best ramen restaurant is for him!

best ramen for Louis

My app recommended a place called “Kibi,” a small restaurant situated in the western part of central Tokyo.

So let’s ask Louis, who actually went to try the ramen at this place to check the quality of my app! Louis, how was the ramen at Kibi? He said, “Makoto, the ramen was pretty good.”

ramen at Kibi in Tokyo

"But...the soup was not as thick as the one at Naritake." 

Louis at the ramen restaurant

Oh gosh, looks like I need to remove more Kaomojis from the dataset… 

