The New Search: Fuzzy, Instantaneous, and Local

Data Science | Technology | Florian Douetteau

At Dataiku, we make extensive use of search logs and the associated navigation information for user behaviour analytics and relevance optimization. Most of our customers today use SOLR or ElasticSearch. But new use cases are being driven by social/local/mobile apps: Fuzzy, As-you-type, Geo.

  • "As-you-type" is the ability to provide results as the user types the query. Like an autocomplete, but with the results included.
  • "Fuzzy" is the ability to correct typos.

  • "Geo" is the ability to restrict and rank results by proximity.
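
To make the "fuzzy" and "as-you-type" ideas concrete, here is a minimal Python sketch that scores a partially typed query against a candidate by taking the smallest edit distance between the query and any prefix of the candidate. The function name `prefix_edit_distance` is ours, and this is only one plausible way to combine the two features; it is not a description of how any particular engine works internally.

```python
def prefix_edit_distance(query: str, candidate: str) -> int:
    """Smallest Levenshtein distance between `query` and any prefix of `candidate`."""
    # prev[j] holds the distance between query[:i] and candidate[:j]; start with i = 0.
    prev = list(range(len(candidate) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]  # distance between query[:i] and the empty prefix
        for j, cc in enumerate(candidate, start=1):
            cost = 0 if qc == cc else 1
            cur.append(min(prev[j] + 1,          # drop a query character
                           cur[j - 1] + 1,       # skip a candidate character
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    # The user may still be typing, so any prefix of the candidate is acceptable.
    return min(prev)


exact = prefix_edit_distance("los ang", "los angeles")
one_typo = prefix_edit_distance("sqn", "san francisco")
```

Here `exact` is 0 because the query is a literal prefix, while `one_typo` is 1 because a single substitution turns "sqn" into "san".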

Google released these features on their web search about 2 years ago, so they were bound to eventually become mainstream.

This year, two new products were released in this area, so we took some spare time to test them.

What do Algolia and Srch2 look like?


Whereas SOLR and ElasticSearch are built for large-scale, highly available deployments, Algolia and Srch2 focus on local/embedded deployment.

Srch2 is quite new (first released in March). Today, it ships as a Linux binary for Ubuntu 10.04 that you can install and start playing with by providing a simple JSON file for data.

Algolia is more mature, and is released as a full-blown SDK for mobile and desktop, as well as being provided as a service.

Algolia's killer feature is offline search: you can build an app with your index prepackaged in it, and give your users search with no GSM latency.

In terms of pricing, Srch2 is still in beta, and Algolia is free to use provided you add "Search Powered By Algolia" in your app description on the app store.

The unavoidable performance benchmark!

We chose to compare Srch2 and Algolia on a typical scenario: fuzzy queries searching for small objects. We used a dump of cities downloaded from the Geonames database. It comprised 3 million cities with the following information: name, country, admin codes, and population. The uncompressed dump was about 250MB.

We ran the tests for both products on the same machine: a Xeon E3-1230 with 16GB of RAM and a 240GB SSD.

We used nearly out-of-the-box settings for both products. For Srch2, we only listed the fields to index in its conf file. In Algolia, we reduced the number of hits per page and changed the fuzzy settings to be closer to Srch2 (1 typo between 4 and 7 characters, 2 typos for 8 characters and more).
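
The typo thresholds above can be written down as a small function. This is a sketch in Python, not either product's configuration syntax; the name `allowed_typos` is ours, and we assume the length is counted per query word, which the settings description does not state explicitly.

```python
def allowed_typos(word: str) -> int:
    """Typo budget for one query word (assumption: length is measured per word)."""
    if len(word) >= 8:
        return 2  # 8 characters and more: up to 2 typos
    if len(word) >= 4:
        return 1  # between 4 and 7 characters: 1 typo
    return 0      # shorter words must match exactly


# Budget for each word of one of the misspelled benchmark queries.
budgets = {word: allowed_typos(word) for word in "sqn frqncisco".split()}
```

Under this per-word assumption, `budgets` gives "sqn" no typo budget and "frqncisco" a budget of 2, which would be consistent with a three-letter misspelling returning no result.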


In both cases, indexing is straightforward.

                Srch2    Algolia
Indexing time   187s     55s
Index size      640MB    630MB

Algolia is about 3 times faster, but as both products are relatively fast, who cares about indexing speed?

For search, here are the response times we observed:

Query            Srch2             Algolia
s                237ms             1ms
sa               46ms              1ms
s f              41ms              7ms
san              10ms              1ms
san frqncisco    4ms               6ms
sqn frqncisco    no result (3ms)   no result (1ms)
los ang          2ms               1ms
los angeles      1ms               1ms
loz angeles      no result (2ms)   no result (1ms)

Response times are very impressive for both products. For very short queries (one or two characters), where Algolia is much faster, real-life timings will depend largely on network latency.

A different approach to relevance

From our understanding, Srch2 follows a classic Lucene-like approach. While this is suitable for document search, it can have many pitfalls for small-object search.

In short, the score of a hit is a number resulting from a "tf.idf" calculation, which can be influenced by boosting specific attributes. That is how most search engines manage relevance today. One problem with this approach is that relevance is reduced to a single floating-point number, which limits the amount of business logic you can express. Also, when you choose to sort by a specific field (e.g. city population), you cannot combine that order with other criteria such as the number of typos. For example, if you search for the city “Pablo” in Montana, you would get “Sao Paulo” as the first result because it has a greater population.
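
The “Pablo” example can be reproduced with a few lines of Python. This is only an illustration of the sorting problem, not Srch2 code; the hit records, typo counts, and population figures are rough placeholders that we made up for the sketch.

```python
# Hypothetical hits for the query "pablo". Populations are illustrative
# placeholders, not values taken from the Geonames dump.
hits = [
    {"name": "Pablo", "typos": 0, "population": 2_000},
    {"name": "Sao Paulo", "typos": 1, "population": 10_000_000},
]

# Sorting on a single field ignores every other signal, including the typo count.
by_population = sorted(hits, key=lambda hit: hit["population"], reverse=True)
first = by_population[0]["name"]
```

With these placeholder values, `first` is "Sao Paulo", even though "Pablo" matches the query exactly.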

Algolia takes a completely opposite and pragmatic approach to relevance. By default, the criteria are applied in this decreasing order of importance:

  1. The number of typos corrected to match the entries. The fewer typos, the better the result.

  2. The geographical distance when retrieving objects around a specific position.

  3. The position of the first matched word. A query that matches the first word of the first attribute ranks higher than one that matches later words.

  4. A field to break ties between ex aequo results (e.g. city population).
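
One way to picture this ordering is as a lexicographic sort on a tuple of criteria, where a later criterion only matters when all earlier ones are tied. The Python sketch below is our own illustration, not Algolia's implementation; the `Hit` type, its field names, and the sample values (including the namesake city) are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    typos: int                 # criterion 1: fewer is better
    distance_km: float         # criterion 2: closer is better (0.0 when no position is given)
    first_match_position: int  # criterion 3: earlier matched word is better
    population: int            # criterion 4: tie-breaker, larger is better


def ranking_key(hit: Hit) -> tuple:
    # Tuples compare element by element, so each criterion is only consulted
    # when every criterion before it is equal.
    return (hit.typos, hit.distance_km, hit.first_match_position, -hit.population)


# Hypothetical hits for the query "pablo"; all numbers are placeholders.
hits = [
    Hit("Sao Paulo", typos=1, distance_km=0.0, first_match_position=1, population=10_000_000),
    Hit("Pablo", typos=0, distance_km=0.0, first_match_position=0, population=2_000),
    Hit("Pablo (hypothetical namesake)", typos=0, distance_km=0.0,
        first_match_position=0, population=50_000),
]

ranked = [hit.name for hit in sorted(hits, key=ranking_key)]
```

Both exact matches come before "Sao Paulo" regardless of population, and population only decides the order between the two exact matches.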


The mobile world is transforming the search engine market; it's time to test new libs!


