
Taxi Cab Fare Prediction Machine Learning in Real Time

Use Cases & Projects, Dataiku Product | Greta Nasai & Alex Combessie

Do you remember the days before Uber, Lyft, or Gett? Standing in the street trying to hail a taxi, waiting for the moment a free cab might drive by and spot you? These days, that world seems far away. And you might often wonder: how do these apps work? After all, that set price is not a random guess. This blog post dives into how to build a system for taxi cab fare prediction.


“I get out of the taxi and it’s probably the only city which in reality looks better than on the postcards: New York.” ― Milos Forman

In the last few years, the number of for-hire vehicles operating in NY has grown from 63,000 to more than 100,000. However, while the number of trips in app-based vehicles has increased from 6 million to 17 million a year, taxi trips have fallen from 11 million to 8.5 million. Hence, the NY Yellow Cab organization decided to become more data-centric.

Providing data from past rides, they asked the data scientist community on Kaggle to design the best taxi fare prediction machine learning model. This post walks through how we developed our ML model, deployed it in real time, and built a web application so anyone can use it.

Data Discovery

When we first discovered the raw data, we were quite disappointed. It appeared to contain only a handful of features: the location and time of the pickup, the location of the drop-off point, and the number of passengers. But when we started digging deeper, we actually unveiled some interesting findings:

  • Some people take round trips that can last up to five hours and cost more than $800 (for that amount, one could easily have rented a private limo with a driver).
  • A trip inside Manhattan can hurt your pocket far more than an airport ride, reaching up to $400, while a trip to an airport has a flat rate.

Initially, we expected distance to be the main factor impacting the fare. But as shown in the plot below, the relationship between the two is not so straightforward (read: not linear).

[Plot: fare by distance for NYC taxi rides] What are these rides above 300 kilometers?!

As you might expect with real-life data, we discovered multiple cases of bad data. For instance, over five years of history, there were more than 100,000 rides above 300 kilometers. We found funny outliers like rides costing $94,000 and rides ending at the bottom of the Hudson River, which is not something you want your model to learn from. You want to give it only good, reasonable examples. So we started by identifying and filtering out this dirty data.
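The filtering step can be sketched with pandas. The column names and thresholds below are hypothetical (adapt them to your actual schema); the idea is simply to keep rides whose fare, passenger count, and GPS coordinates fall in plausible ranges:

```python
import pandas as pd

# Toy rides standing in for the raw data -- column names are illustrative.
df = pd.DataFrame({
    "fare_amount": [8.5, 94000.0, 12.0],
    "passenger_count": [1, 1, 0],
    "pickup_latitude": [40.75, 40.76, 40.77],
    "pickup_longitude": [-73.99, -73.98, -73.97],
})

# NYC roughly spans these coordinates; points far outside are GPS noise.
mask = (
    df["fare_amount"].between(2.5, 500)        # $2.50 is the legal flag-drop
    & df["passenger_count"].between(1, 6)
    & df["pickup_latitude"].between(40.4, 41.0)
    & df["pickup_longitude"].between(-74.3, -73.6)
)
df_clean = df[mask]
print(len(df_clean))  # the $94,000 ride and the 0-passenger ride are dropped
```

The same boolean-mask pattern extends naturally to drop-off coordinates and trip-distance bounds.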

Feature Engineering

From there, we thought about which features influence the taxi fare. All in all, the journey to spot the best features was not easy. If you don’t limit your imagination, you’ll end up with hundreds of (usually useless) features.

Indeed, after many iterations, we went from eight features in the raw data to more than 500! At that point, our models started to perform worse as they learned “spurious correlations.” Fortunately, we had the right tools to help: using feature importance for manual selection and the automatic feature reduction built into Dataiku, we were able to cut the number of features down to around 100.
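Importance-based reduction can be reproduced outside Dataiku with scikit-learn. This is a minimal sketch on synthetic data, not the exact mechanism Dataiku uses: fit a tree ensemble, then keep only the features whose importance beats the average.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a wide feature matrix with few truly useful columns.
X, y = make_regression(n_samples=200, n_features=50, n_informative=8, random_state=0)

# Keep features whose importance exceeds the mean importance (the default
# threshold of SelectFromModel) -- a simple automatic reduction.
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape[1], "features kept out of", X.shape[1])
```

The reduced matrix then feeds the downstream model, which usually trains faster and generalizes better.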


Along with our iterations of data cleaning, wrangling, and enriching, we did many iterations of machine learning model training. The main challenge we faced was the sheer size of the dataset: it wouldn’t fit in our server memory!

We also had to tune the hyperparameters of this model quite a lot. Why? To defeat the archenemy of all data scientists: overfitting!

[GIF: Thanos from The Avengers] Overfitting is a powerful, vicious foe

In a nutshell, data scientists need to fight overfitting to ensure their models can generalize to unseen, future data. The challenge is that it is a stealthy foe: you can easily get good results when training the model, then get a bad surprise after deploying it in production on live data. You can find out more about this foe in the article “How to Figure Out if your Model Sucks.”
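The standard defense is to tune hyperparameters against held-out data rather than the training set. As an illustrative sketch (the parameter grid below is made up), cross-validated search scores each candidate on folds the model never saw during fitting, so configurations that merely memorize the training data lose:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# 3-fold cross-validation: each candidate is scored on held-out folds,
# which penalizes hyperparameters that overfit the training data.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

Shallower trees and lower learning rates often win exactly because the deeper, faster-learning variants overfit.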

Exposing the Model to Users

Making features and training a good predictive model is one thing, but how do you make its benefits tangible to a non-data scientist? This was our last mission: exposing the model as an interactive web application.

The first step was to turn the predictive model into a real-time API.
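Conceptually, a real-time prediction API is a small web service that loads the trained model once and answers scoring requests over HTTP. Here is a minimal Flask sketch; the endpoint name, payload fields, and the stand-in pricing function are all hypothetical (in practice you would load the trained model, e.g. with joblib, instead of `predict_fare`):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_fare(features):
    # Trivial stand-in for the trained model: base fare plus a
    # per-kilometer rate. Purely illustrative numbers.
    return 2.5 + 1.56 * features["distance_km"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()          # e.g. {"distance_km": 10.0}
    fare = predict_fare(payload)
    return jsonify({"fare_usd": round(fare, 2)})

# To serve locally: app.run(port=5000)
```

A client then gets a fare back in a single round trip, which is what makes sub-100 ms responses feasible.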

But to complete our mission, we could not stop at the technical milestone of the real-time API: we wanted to make our work useful and accessible to non-technical users. So we built an interface!

You can see what it looks like here, and if you need to take a regular old taxi cab in New York anytime soon, we invite you to try our web app. Heck, even if you don't need a cab ride soon, check it out. It gives you a good idea of the kind of user interface one can build relatively quickly on top of a machine learning model, putting that model in the hands of real users.



Lessons Learned

All in all, building this end-to-end project was the opportunity to learn four important lessons. We hope you can apply them to build and deploy your own AI projects.

1. Understand the problem before building your models: This is a common pitfall for data scientists that we also fell into. Since training models is fun and easy, we went straight to it without researching the fare formula that taxis have to apply. If we had spent more time researching this at the beginning, we would have saved the hours we lost training models without the right features.

2. Do not add features for the sake of adding features: Think in terms of causality: what are the driving factors behind what you want to predict? Then make sure to feed the features that describe these factors to your models. You can also automate the creation of derivative features based on existing ones. But be careful not to create too many or your models will yield worse results. To counteract that effect, invest time in feature selection techniques.
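A classic example of a causally motivated derivative feature for this problem is the straight-line distance between pickup and drop-off, computed from the raw GPS coordinates. A common way to do this is the haversine formula (a standard great-circle distance, not something specific to our pipeline):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = Earth's mean radius

# Midtown Manhattan to JFK airport -- roughly 21 km as the crow flies.
d = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
print(round(d, 1))
```

One derived column like this can carry far more signal than dozens of mechanical transformations of the raw coordinates.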

3. Try as many algorithms as possible: As proven by the “No free lunch” theorem, there is no algorithm that is always superior to all others. For instance, neural networks have proved amazing at image recognition and translation, but some problems are best solved by other algorithms. In our case, our results were greatly improved by trying the new LightGBM algorithm from Microsoft.

4. Simplify your pipeline before deploying to real-time: Model training is a batch process with no limits on data availability. But the world of real-time is different: all features used by the model on top of the raw ones need to be available at request time. That was a problem for our window aggregation features, as we did not have a live data feed to compute them. We had to retrain our model without them. At a very small cost in performance, we were able to deploy this model as an API, responding well under 100 ms per request.
