Even as we continue to reach new technological milestones and solve the world's most demanding problems, many companies are still confronted with the oldest of administrative nightmares: piles and piles of mail.
Last month, Dataiku hosted its EGG Conference for the first time in London. In one of many incredible talks from thought leaders in data science and machine learning from across industries and sectors, Data Scientist Léo Dreyfus-Schmidt offered a scalable solution to the eternal problem of mail processing by using AI and deep learning techniques. Watch the full video of his talk here.
Léo Dreyfus-Schmidt is a mathematician and holds a PhD in pure mathematics from the University of Oxford and the University of Paris VII. After five years focusing on homological algebra and representation theory in Paris, Oxford, and the University of California, Los Angeles, he joined Dataiku, where he has been developing solutions for predictive maintenance, personalized ranking systems, price elasticity, and natural language applications. Léo is a bicycle and food aficionado (separately), so in his spare time, you’ll find him either zipping around Paris rain or shine or enjoying a great meal somewhere.
Who Writes Letters Anymore?
The data science team behind the project was tasked with solving the four major problems of mail processing, driven by a real use case for an insurance company:
Distinguishing whether a letter is handwritten or typed
Parsing the text from a typed letter
Detecting words in handwritten letters
Extracting meaning from the images of words
Their goal was ultimately to deliver a production-ready tool that could be used to automatically sort any letters received and send them to their appropriate departments. Traditionally, this would have to be done by hand — an expensive and time-consuming task.
The first challenge the data team had to overcome was a highly heterogeneous dataset. They had expected a roughly even mix of handwritten and typed letters, but the actual training set contained letters, envelopes, forms, leaflets, and other kinds of written documents.
Starting from the 200,000 unlabeled images they received, they went through the long process of labeling every document by type using a webapp they built. This gave them a large labeled training set on which to begin building their deep learning model.
The team created a webapp that was used to classify the images that were to become the training set.
To distinguish handwritten documents from typed ones, the team used an autoencoder to construct a vector representation of each document image (a process explained more thoroughly in the talk) and then ran a random forest ML model on the resulting dataset.
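To make the two-stage idea concrete, here is a minimal sketch of an autoencoder feeding a random forest. It is not the team's implementation: the synthetic 16x16 "pages", the use of scikit-learn's MLPRegressor as a stand-in autoencoder, and every function name and parameter value are assumptions chosen so the example stays small and self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor

SIDE = 16  # hypothetical thumbnail size, far smaller than a real scan


def make_synthetic_pages(n_per_class, rng):
    """Toy stand-ins for scans: label 0 = typed, label 1 = handwritten."""
    pages, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            page = np.zeros((SIDE, SIDE))
            if label == 0:
                page[::4, :] = 1.0  # evenly spaced text lines
            else:
                rows = rng.choice(SIDE, size=4, replace=False)
                page[rows, :] = 1.0  # unevenly spaced text lines
            page += rng.normal(0.0, 0.1, size=page.shape)  # scanner noise
            pages.append(page.ravel())
            labels.append(label)
    return np.array(pages), np.array(labels)


def fit_autoencoder(X, code_size=16, seed=0):
    """Train a one-hidden-layer network to reconstruct its own input."""
    # max_iter is kept small so the demo runs quickly (assumption).
    autoencoder = MLPRegressor(hidden_layer_sizes=(code_size,),
                               activation="relu", max_iter=300,
                               random_state=seed)
    autoencoder.fit(X, X)  # input and target are the same images
    return autoencoder


def encode(autoencoder, X):
    """Return the hidden-layer activations, i.e. the vector representation."""
    hidden = X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0]
    return np.maximum(0.0, hidden)  # ReLU, matching the hidden activation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = make_synthetic_pages(100, rng)
    codes = encode(fit_autoencoder(X), X)
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(codes, y)
    print(codes.shape, forest.predict(codes[:5]))
```

The appeal of this split is that the autoencoder needs no labels, so it can learn from every image received, while the random forest only needs the labeled subset. In practice a convolutional autoencoder on real scans would be a more natural choice than the small dense network assumed here.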
Extracting text from the images identified as typed was the first (and perhaps most straightforward) step. For this, they used Tesseract, an open source OCR (Optical Character Recognition) engine.
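As a rough illustration of the OCR step, the sketch below calls Tesseract through the pytesseract wrapper and then tidies the raw output. The post does not say how the team invoked Tesseract, so the wrapper, the function names, and the cleanup rules are all assumptions. Running ocr_page requires the Tesseract binary plus the pytesseract and Pillow packages.

```python
import re


def ocr_page(image_path, lang="eng"):
    """Run Tesseract on one scanned page and return the raw text."""
    # Imported lazily so tidy_ocr_text works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def tidy_ocr_text(raw):
    """Light cleanup of OCR output (hypothetical rules for illustration)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin hyphenated words
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse blank-line runs
    return text.strip()


if __name__ == "__main__":
    sample = "Dear Sir,\nplease can-\ncel my   policy.\n\n\n\nRegards\n"
    print(tidy_ocr_text(sample))
```

A typical call would be tidy_ocr_text(ocr_page("letter.png")), where the file name is a placeholder.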
Then came the hard part: the handwritten letters. This process involved using computer vision techniques to detect paragraphs on the page and then to detect words within those paragraphs. The team then stacked two common types of deep learning layers to learn and read the visual characteristics of those words.
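The post does not name the computer vision techniques used for detection. One classic option, shown here purely as an assumption, is the projection profile: sum the ink along each row to find lines of text, then sum along each column within a line and split wherever the gap is wide enough to be a space between words. The sketch works on a binarized image given as a list of rows, with 1 for ink and 0 for background.

```python
def ink_runs(profile, min_gap=1):
    """Return (start, end) spans where profile > 0.

    Spans separated by fewer than min_gap empty positions are merged,
    so small gaps between letters do not split a word.
    """
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for span in spans:
        if merged and span[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span[1])
        else:
            merged.append(span)
    return merged


def find_words(image, min_word_gap=3):
    """Return word boxes as (top, bottom, left, right), end-exclusive."""
    boxes = []
    row_profile = [sum(row) for row in image]
    for top, bottom in ink_runs(row_profile):
        line = image[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]
        for left, right in ink_runs(col_profile, min_gap=min_word_gap):
            boxes.append((top, bottom, left, right))
    return boxes


if __name__ == "__main__":
    page = [[0] * 12 for _ in range(7)]
    for r in (1, 2):
        for c in (0, 1, 3, 8, 9, 10):
            page[r][c] = 1
    for c in range(2, 6):
        page[5][c] = 1
    print(find_words(page))
```

Real handwriting is slanted and lines often touch, so a production system would likely need deskewing or a learned detector. The value of min_word_gap is a hypothetical threshold that would have to be tuned per scan resolution.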
Using open source datasets (and augmented versions of them) as the training set, they built a deep learning model in Dataiku that could identify the meaning of those handwritten words with fairly high confidence. Once this step was complete, the team had an operational method of extracting meaning from all of the incoming documents.
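The post does not describe which augmentations were applied. As a minimal sketch of the general idea, the example below generates extra training samples by randomly shifting a word image by a few pixels. The function names, the shift range, and the choice of translation as the only transform are assumptions. Real pipelines commonly add rotation, shear, or elastic distortion as well.

```python
import random


def shift(image, dx, dy):
    """Translate a 2D list image by (dx, dy), filling exposed pixels with 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = image[src_y][src_x]
    return out


def augment(image, n_copies, max_shift=2, seed=0):
    """Return n_copies randomly shifted versions of image."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    copies = []
    for _ in range(n_copies):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        copies.append(shift(image, dx, dy))
    return copies


if __name__ == "__main__":
    word = [[0, 0, 0, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
    for copy in augment(word, n_copies=3):
        print(copy)
```

Each shifted copy keeps the label of the original word, which is what lets a small labeled dataset stretch further during training.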
More than anything, this talk offers an incredibly clear business use case for deep learning that provides a solution to a really common issue. Mail is an obnoxious problem that every company has to deal with. As Léo explains in his talk, the beauty of AI is that it lets us automate the boring, time-consuming things (like sorting through mail) so that companies can focus on the things that actually matter.