Behind the Scenes: Autocomplete Architecture



This is the start of an on-going technology series letting you know what goes on behind the scenes at Agorafy.

One of our core features is a search box on our home page which you can use to easily search for listings and property information. 

How do we make it possible for you to search both properties AND listings? How do we decide what you entered is a street, a neighborhood, a borough or a zip code? And how do we make the suggestions return in a reasonable amount of time?

We asked our programmers to explain what goes on behind the scenes to make that possible. Here's what Jeffrey Li, our head of Software Development told us:

We take a lot of inspiration from Quora. They do an awesome search if you haven't checked them out already. If there's one thing we learned from them, it's to use C++ and Prefix Matching. And so that's what we did. In the beginning, the entire implementation was written in PHP since we're all web developers around here and know next to nothing about compiled languages. But after a while, we realized it's just too slow, so we decided to learn some C++ along the way.

The architecture of our listing search and property search is a bit different but use many of the same components. Here's what our listing search looks like:

Your query first passes through a tokenizer class which breaks up the query into individual words (tokens) and passes them through various filters such as listing type filter, neighborhood filter, borough filter, street filter, and zip code filter. Each filter tries to find a match using the tokens. 

Then there's a controller class that handles communication between the filters. For example, if one of the tokens is "Retail", the listing type filter lays claim to it and the other filters know not to waste any time on that token. Once all the filters are done searching, the results are merged together and ranked.

The filters store data in a Trie structure:

That means, as soon as you type "ch", we immediately begin to traverse the Trie and know that you either mean "Chelsea" or "Chinatown". We designed our Tries so that each node contains an array of all words below it. So in the above illustration, the "H" node contains both "Chelsea" and "Chinatown" as well as any other neighborhoods that begin with "ch". With Tries, looking up a word with length k takes worse case O(k) time.

Our ranking algorithm is pretty simple at the moment. It counts the number of search tokens that exist in the result and measures the percentage of the string that was matched. For example, "ch" matches 40% (2/5) of the word "Chelsea" versus 22% (2/9) of the word "Chinatown". And so we display Chelsea above Chinatown simply because the word Chelsea is shorter. In the future, we can write a better ranker that takes into account personal search history as well as overall search trends and even your current location.

Our architecture for property search is even simpler:

The tokenizer parses out the house number if it exists and runs the rest through a street filter. With the house number and street, it's a straight lookup from there. The Property Address Lookup table is a hash table keyed on street. Originally, we implemented the lookup table as a two-dimensional array keyed on both street and house number. However, during testing, we noticed requests were taking over 120ms. We were expecting lookups to be constant time but it wasn't so. We tried various map implementations: STL MapTR1 unordered_map and Boost unordered_map but the results were the same. But if we just keyed on street and iterate through the house numbers, it took less than half as long. Not sure why that is the case (maybe a C++ guru can shed some light on this?). Anyway, we logically went with the approach that yielded the fastest time. To keep the tables smaller, we divided up the properties by boroughs.

And there you have it. Stay tuned for more articles giving you a glimpse of the technology behind 

10/26/2012 - 09:43