Do you remember the days before Uber, Lyft, or Gett? Standing in the street trying to hail a taxi, waiting for the moment a free cab might drive by and spot you? These days that world seems so far away. And you might often wonder: how do these apps work? After all, that set price is not a random guess.
"I get out of the taxi and it’s probably the only city which in reality looks better than on the postcards, New York." ― Milos Forman
In the last few years, the number of for-hire vehicles operating in NY has grown from 63,000 to more than 100,000. However, while the number of trips in app-based vehicles has increased from 6 million to 17 million a year, taxi trips have fallen from 11 million to 8.5 million. Hence, the NY Yellow Cab organization decided to become more data-centric.
Providing data from past rides, they asked the data scientist community on Kaggle to design the best taxi fare prediction machine learning model. This post walks through how we developed our ML model, deployed it in real time, and built a web application for anyone to use it.
When we first discovered the raw data, we were quite disappointed. It appeared to contain only a handful of features: the location and time of the pickup, the location of the drop-off point, and the number of passengers. But when we started digging deeper, we actually unveiled some interesting findings:
- Some people take round trips that can last for up to five hours and cost more than $800 (one could easily have rented a private limo with a driver for the same amount).
- A trip inside Manhattan might hurt your pocket a lot more than an airport ride (reaching up to $400, while a trip to an airport has a flat rate).
Initially, we expected distance to be the main factor impacting the fare. But as shown in the plot below, the relationship between the two is not so straightforward (read: linear).
What are these rides above 300 kilometers?!
As you can expect in real life, we discovered multiple cases of bad data. For instance, over five years of history, there were more than 100,000 cases of rides above 300 kilometers. We found funny outliers like rides costing $94,000 and rides to the bottom of the Hudson River, which is something that you don’t want your model to learn from, do you? You want to give it only good, reasonable examples to learn from. So we started by identifying and filtering out this dirty data.
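A minimal sketch of this kind of filter might look like the following. The column names follow the public Kaggle dataset as we remember it, and the fare cap is a hypothetical threshold chosen for illustration; only the 300-kilometer limit comes from the observation above. Catching rides that end in the Hudson River would need a geographic boundary check, which this sketch does not attempt.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Illustrative thresholds: only the 300 km limit is taken from the text,
# the fare cap is an assumption made for this sketch.
MAX_DISTANCE_KM = 300.0
MAX_FARE_USD = 1000.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_plausible_ride(ride):
    """Return True when the fare and the trip distance both look reasonable."""
    if not 0.0 < ride["fare_amount"] <= MAX_FARE_USD:
        return False
    distance = haversine_km(
        ride["pickup_latitude"], ride["pickup_longitude"],
        ride["dropoff_latitude"], ride["dropoff_longitude"],
    )
    return distance <= MAX_DISTANCE_KM


# Made-up rides: one ordinary trip, one absurd fare, one absurd distance.
rides = [
    {"fare_amount": 12.5, "pickup_latitude": 40.75, "pickup_longitude": -73.99,
     "dropoff_latitude": 40.78, "dropoff_longitude": -73.96},
    {"fare_amount": 94000.0, "pickup_latitude": 40.75, "pickup_longitude": -73.99,
     "dropoff_latitude": 40.78, "dropoff_longitude": -73.96},
    {"fare_amount": 30.0, "pickup_latitude": 40.75, "pickup_longitude": -73.99,
     "dropoff_latitude": 45.00, "dropoff_longitude": -73.96},
]
clean_rides = [ride for ride in rides if is_plausible_ride(ride)]
```

Keeping the thresholds as named constants makes it easy to revisit them once you have looked at the fare and distance distributions.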
From then on, we thought about what features influence the taxi fare. All in all, the journey to spot the best features was not easy. If you don’t limit your imagination you’ll end up with hundreds of (usually) useless features.
Indeed, after many iterations, we went from eight features in the raw data to more than 500! At that point, our models started to perform worse as models learned “spurious correlations.” Fortunately, we had the right tools to help us with that. Using the feature importance for manual selection and the automatic feature reduction tool built into Dataiku, we were able to reduce the number of features to around 100.
We started by building location-related features based on the raw data: geometrical distances between pickup and dropoff points, the direction of the trip (north/south/east/west), etc. We added features based on the date and the time in order to capture seasonality effects.
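As an illustration of these first features, here is a small sketch that derives a trip direction from the coordinates and a few calendar fields from the pickup time. The function names and the exact set of fields are assumptions made for this example, not the project's actual feature list.

```python
import math
from datetime import datetime


def bearing_degrees(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    x = math.sin(dlmb) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(x, y)) % 360.0


def compass_direction(bearing):
    """Bucket a bearing into north, east, south or west (90-degree sectors)."""
    names = ["north", "east", "south", "west"]
    return names[int(((bearing + 45.0) % 360.0) // 90.0)]


def time_features(pickup_time):
    """Calendar fields that let a model pick up daily and weekly patterns."""
    return {
        "hour": pickup_time.hour,
        "weekday": pickup_time.weekday(),  # Monday is 0
        "month": pickup_time.month,
        "is_weekend": pickup_time.weekday() >= 5,
    }


# A trip heading straight uptown, requested on a Monday afternoon.
direction = compass_direction(bearing_degrees(40.70, -74.00, 40.80, -74.00))
calendar = time_features(datetime(2015, 2, 2, 17, 30))
```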
After some positive iterations, we decided to combine location-related and time-related features to describe the traffic conditions for a specific route at a given time. To do so, we needed to define a notion of “neighborhood,” which did not exist in the raw data.
We decided to let the data decide through clustering. Interestingly, a simple K-means clustering algorithm was able to identify New York neighborhoods quite accurately:
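To make the idea concrete, here is a bare-bones K-means written in plain Python on a handful of made-up coordinates. It is a sketch of the algorithm only: it treats latitude and longitude as flat coordinates, which is a rough approximation, and in practice you would more likely reach for a library implementation. The number of clusters and the sample points are assumptions for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest_centroid(point, centroids)].append(point)
        new_centroids = []
        for i, group in enumerate(groups):
            if group:
                lat = sum(p[0] for p in group) / len(group)
                lon = sum(p[1] for p in group) / len(group)
                new_centroids.append((lat, lon))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


# Made-up pickup points: three near JFK, three in Midtown.
points = [
    (40.64, -73.78), (40.65, -73.79), (40.63, -73.77),
    (40.76, -73.98), (40.77, -73.99), (40.75, -73.97),
]
centroids = kmeans(points, k=2)
labels = [nearest_centroid(p, centroids) for p in points]
```

The label of each point can then serve as its "neighborhood" identifier.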
Then, we used the Window recipe to compute useful features such as the average fare for a ride from one cluster to another over the last 10 rides, the last 100 rides, etc. More importantly, using the neighborhoods of pickup and dropoff as a unique route identifier, we added data from the HERE API to estimate driving conditions, including traffic. We called this API to retrieve historical data on all possible routes for all hours on a given day of the week, and enriched our original dataset.
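The Window recipe is a visual tool, but the window aggregation itself can be sketched in a few lines of Python. The version below is a plain-Python analogue, not what the recipe does internally; the field names are assumptions, and rides are expected in chronological order. Each ride only sees the fares of earlier rides on the same route, so its own fare does not leak into its feature.

```python
from collections import defaultdict, deque


def rolling_route_average(rides, window=10):
    """Mean fare of the previous `window` rides on the same route, per ride.

    Returns None for a ride whose route has no history yet.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for ride in rides:
        route = (ride["pickup_cluster"], ride["dropoff_cluster"])
        past_fares = history[route]
        features.append(sum(past_fares) / len(past_fares) if past_fares else None)
        past_fares.append(ride["fare_amount"])  # only visible to later rides
    return features


# Made-up rides, already sorted by pickup time.
rides = [
    {"pickup_cluster": 0, "dropoff_cluster": 1, "fare_amount": 10.0},
    {"pickup_cluster": 1, "dropoff_cluster": 0, "fare_amount": 8.0},
    {"pickup_cluster": 0, "dropoff_cluster": 1, "fare_amount": 20.0},
    {"pickup_cluster": 0, "dropoff_cluster": 1, "fare_amount": 30.0},
    {"pickup_cluster": 0, "dropoff_cluster": 1, "fare_amount": 40.0},
]
averages = rolling_route_average(rides, window=2)
```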
Finally, the “business rules” that taxi meters follow to make price calculations (check out this open data) were an excellent reference to create more features. Airport flat rates, peak time price raises, Manhattan traffic charges, tunnel charges, etc., were a few of the most significant features we fed into our model.
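Rules like these translate naturally into simple flag features. The sketch below is illustrative only: the airport coordinates are approximate, the 2-kilometer radius is a hypothetical choice, and the peak and overnight time windows are assumptions that should be checked against the official rate card before use.

```python
import math
from datetime import datetime

# Approximate airport coordinates and a hypothetical matching radius.
AIRPORTS = {
    "JFK": (40.6413, -73.7781),
    "LGA": (40.7769, -73.8740),
    "EWR": (40.6895, -74.1745),
}
AIRPORT_RADIUS_KM = 2.0


def approx_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation, adequate at city scale."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371.0 * math.hypot(x, y)


def near_airport(lat, lon):
    """Name of the airport within the radius of this point, or None."""
    for name, (airport_lat, airport_lon) in AIRPORTS.items():
        if approx_distance_km(lat, lon, airport_lat, airport_lon) <= AIRPORT_RADIUS_KM:
            return name
    return None


def business_rule_flags(pickup_time, pickup, dropoff):
    """Flags mirroring meter rules; the time windows are assumptions."""
    hour, weekday = pickup_time.hour, pickup_time.weekday()
    return {
        "airport": near_airport(*pickup) or near_airport(*dropoff),
        "weekday_peak": weekday < 5 and 16 <= hour < 20,
        "overnight": hour >= 20 or hour < 6,
    }


# A Monday 5:30 pm ride from JFK to Times Square.
flags = business_rule_flags(
    datetime(2015, 2, 2, 17, 30), (40.6413, -73.7781), (40.7580, -73.9855)
)
```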
Along with our iterations of data cleaning, wrangling, and enriching, we did many iterations of machine learning model training. The main challenge we faced was the sheer size of the dataset - it wouldn’t fit in our server memory!
We also had to tune the hyperparameters of this model quite a lot. Why? To defeat the archenemy of all data scientists: Overfitting!
Overfitting is a powerful, vicious foe
In a nutshell, data scientists need to fight overfitting to ensure their models can generalize to unseen, future data. The challenge is that it is a stealthy foe: you can easily get good results when training the model but get a bad surprise after deploying it in production on live data. You can find out more about this overfitting foe in the article “How to Figure Out if your Model Sucks.”
To overcome the dataset size issue, we used a machine learning library from Microsoft called LightGBM. Conveniently for our case, this library focuses on increased training speed and lower memory usage. We were then able to train our model on a large (50%) sample of the training set within a dozen hours, on our standard shared server with 128GB of RAM and 12 CPU cores (Intel Xeon E5-1650 v4). This was not possible with scikit-learn and xgboost gradient boosting models, which could only take a 30% sample and ran out of memory above that.
What’s more, we were able to get performance gains, enough to raise us to the top 10% of the public leaderboard on Kaggle, with a Root Mean Squared Error (RMSE) of 2.91963. To put it simply using another metric, our model is, in theory, able to predict the fare at +/- 1.4 USD on average.
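The two numbers differ because they measure error differently. The second metric is not named above; the sketch below assumes it is the mean absolute error (MAE), which is one plausible reading. The fares are made-up toy values, chosen only to show how a single large miss pulls the RMSE up more than the MAE.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized quadratically."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average miss, in the unit of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy fares in USD: three close predictions and one 6-dollar miss.
actual = [10.0, 12.0, 8.0, 50.0]
predicted = [11.0, 11.0, 9.0, 44.0]

toy_rmse = rmse(actual, predicted)  # about 3.12
toy_mae = mae(actual, predicted)    # 2.25
```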
So how did it work in practice? We used Dataiku’s Custom Python model feature to integrate with the scikit-learn API of LightGBM.
In our project, we applied a few specific tricks to fight overfitting at the model level, which fall into two buckets:
- Feature overfitting: Here the goal is to make sure that the boosted ensemble of trees has a balanced view of all features in the dataset, instead of always learning from the most predictive ones. Primarily, we reduced the column sample given to each tree (colsample_bytree parameter) to a rather low 60% instead of 100%. We also used LightGBM's native binning capabilities to avoid learning small ranges for numerical variables (which also speeds up the computation).
- Observation overfitting: You want your model to generalize to the entire training set, not be specific to a small group of observations. For instance, a deep tree-based model may learn a specific rule for rides on the first week of February 2015 from the Empire State Building to Hell’s Kitchen. But that rule would probably not apply today on a similar but slightly different route. Roads, traffic patterns, etc., are subject to change. Hence, we forced each tree of the gradient boosting model to be simpler than the default, by raising parameters such as the minimum performance improvement required to justify a new split (min_split_gain, set to 0.25 instead of 0.0).
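Put together, the two settings quoted above could be written as a parameter dictionary like the one below. This is a partial, illustrative configuration: only the colsample_bytree and min_split_gain values come from the text. The max_bin entry is shown at what we believe is LightGBM's default, because the bin count actually used is not stated.

```python
# Library defaults for the parameters discussed above (max_bin as we recall it).
lightgbm_defaults = {
    "colsample_bytree": 1.0,
    "min_split_gain": 0.0,
    "max_bin": 255,
}

# Illustrative overrides: the first two values are quoted in the text,
# max_bin is left at its default because the text gives no value.
lgbm_params = {
    "colsample_bytree": 0.6,  # each tree sees a random 60% of the columns
    "min_split_gain": 0.25,   # a split must improve the loss by at least this much
    "max_bin": 255,           # numerical features are bucketed into at most 255 bins
}

# Which settings differ from the defaults, as (default, chosen) pairs.
changed = {
    name: (lightgbm_defaults[name], value)
    for name, value in lgbm_params.items()
    if value != lightgbm_defaults[name]
}

# With LightGBM installed, these would typically be passed to the
# scikit-learn style estimator, e.g. lightgbm.LGBMRegressor(**lgbm_params).
```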
In general, finding these best parameters of a given model is an iterative process which we automated through grid-searching. The tricks above were used to guide our grid to the “sweet spot” of the optimization space.
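The mechanics of a grid search are simple enough to sketch with the standard library. In the example below, the grid values are hypothetical and the scoring function is a toy stand-in; in a real project it would train a model with the given parameters and return its validation RMSE.

```python
from itertools import product


def expand_grid(grid):
    """Yield one parameter dictionary per combination of the grid values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score), where a lower score is better."""
    best_params, best_score = None, float("inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid centered on the region our tricks pointed to.
grid = {
    "colsample_bytree": [0.4, 0.6, 0.8, 1.0],
    "min_split_gain": [0.0, 0.25, 0.5],
}


def toy_score(params):
    """Placeholder for 'train, then measure validation RMSE'. Not a real result."""
    return (params["colsample_bytree"] - 0.6) ** 2 + (params["min_split_gain"] - 0.25) ** 2


best_params, best_score = grid_search(grid, toy_score)
```

The grid grows multiplicatively with every parameter you add (12 combinations here), which is why narrowing it down with domain knowledge first pays off.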
Exposing the Model to Users
Making features and training a good predictive model is one thing, but how do you make its benefits tangible to a non-data scientist? This was our last mission: exposing the model as an interactive web application.
The first step was to turn the predictive model into a real-time API.
Fortunately, you can convert models trained in the visual ML environment of Dataiku into a RESTful API in a few clicks.
However, these few clicks were not enough. We had enriched our dataset with geographical neighborhoods and traffic information, so our model was expecting these features to be present.
As we had already written Python scripts to compute them on historical data, we rewrote them from a batch-oriented framework to a functional framework. These functions were then exposed as Python API endpoints available in real time.
In the end, the structure of our API service looked like this:
- predict_fare: Python function endpoint taking the raw features as input and outputting the fare prediction. It acts as a “wrapper” that calls the following internal endpoints in sequence:
  - _cluster: Python function endpoint taking the latitude and longitude information as input and running the clustering algorithm to assign the pickup and dropoff points to a pair of neighborhoods
  - _traffic: Python function endpoint taking the latitude and longitude information along with the time of the ride request as input, and sending a request to the HERE API for the latest traffic information
  - _fare: our initial predictive model endpoint, taking the raw features and the enrichment above as input to predict the fare
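The wrapper pattern can be sketched with plain Python functions standing in for the endpoints. Everything below is hypothetical: the field names, the stub logic, and the way the internal steps are passed in as arguments. A real service would call the deployed endpoints and the HERE API instead of these stubs.

```python
def predict_fare(raw, cluster, traffic, fare):
    """Enrich the raw request step by step, then hand it to the fare model."""
    pickup = (raw["pickup_latitude"], raw["pickup_longitude"])
    dropoff = (raw["dropoff_latitude"], raw["dropoff_longitude"])

    features = dict(raw)
    features["pickup_cluster"] = cluster(*pickup)
    features["dropoff_cluster"] = cluster(*dropoff)
    features.update(traffic(pickup, dropoff, raw["pickup_datetime"]))
    return fare(features)


# Stubs standing in for the three internal endpoints.
def stub_cluster(lat, lon):
    """Pretend neighborhood lookup: 0 west of a made-up boundary, 1 east of it."""
    return 0 if lon < -73.9 else 1


def stub_traffic(pickup, dropoff, request_time):
    """Pretend traffic lookup returning a fixed, made-up delay."""
    return {"traffic_delay_min": 5.0}


def stub_fare(features):
    """Echo the enriched features instead of predicting, to show what the model receives."""
    return features


request = {
    "pickup_latitude": 40.7580, "pickup_longitude": -73.9855,
    "dropoff_latitude": 40.6413, "dropoff_longitude": -73.7781,
    "pickup_datetime": "2015-02-02T17:30:00", "passenger_count": 1,
}
enriched = predict_fare(request, stub_cluster, stub_traffic, stub_fare)
```

Passing the internal steps in as arguments keeps the wrapper easy to test without network access.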
But to complete our mission, we could not stop at the technical milestone of the real-time API: we wanted to make our work useful and accessible to non-technical users. So we built an interface!
You can see what it looks like here, and if you need to take a regular old taxi cab in New York anytime soon, we invite you to try our web app. Heck, even if you don't need to take a cab ride soon, check it out. It gives you a good idea of what kind of user interface one can build, relatively quickly, on top of a machine learning model to put that model in the hands of real users.
We decided to use the Flask web app development framework, in order to have full control over the interface design, make it work across desktop and mobile, and enjoy the power and simplicity of Python in the backend. As a matter of fact, this framework is natively available in Dataiku to facilitate development.
Finally, we decided to make our web app public, so we ported it from the Dataiku development environment to a dedicated web server.
All in all, building this end-to-end project was the opportunity to learn four important lessons. We hope you can apply them to build and deploy your own AI projects.
1. Understand the problem before building your models: This is a common pitfall for data scientists that we also fell into. Since training models is fun and easy, we went straight at it without researching the fare formula that taxis have to apply. If we had spent more time researching this in the beginning, we would have saved the hours we lost training models without the right features.
2. Do not add features for the sake of adding features: Think in terms of causality: what are the driving factors behind what you want to predict? Then make sure to feed the features that describe these factors to your models. You can also automate the creation of derivative features based on existing ones. But be careful not to create too many or your models will yield worse results. To counteract that effect, invest time in feature selection techniques.
3. Try as many algorithms as possible: As proven by the “No free lunch” theorem, there is no algorithm that is always superior to all others. For instance, neural networks have proved amazing at image recognition and translation, but some problems are best solved by other algorithms. In our case, our results were greatly improved by trying the new LightGBM algorithm from Microsoft.
4. Simplify your pipeline before deploying to real-time: Model training is a batch process with no limits on data availability. But the world of real-time is different: all features used by the model on top of the raw ones need to be available. That was a problem for our window aggregation features, as we did not have a live data feed to compute them. We had to retrain our model without them. At a very small expense of performance, we were able to deploy this model as an API, responding well under 100 ms per request.