Friday, April 18, 2014

Stack Exchange Technical Debt

Marc Gravell:

So, we’ve got inline tag data that is simple to display, but is virtually impossible to query. Regular indexing doesn’t really work well at finding matches in the middle of character data. Enter (trumpets) SQL Server Full-Text Search. This is inbuilt to SQL Server (which we were already using), and allows all kinds of complex matching to be done using CONTAINS, FREETEXT, CONTAINSTABLE and FREETEXTTABLE. But there were some problems: stop words and non-word characters (think “c#”, “c++”, etc). For the tags that weren’t stop words and didn’t involve symbols, it worked great. So how to convince Full Text Search to work with these others? Answer: cheat, lie and fake it. At the time, we only allowed ASCII alpha-numerics and a few reserved special characters (+, –, ., #), so it was possible to hijack some non-English characters to replace these 4, and the problem of stop words could be solved by wrapping each tag in a pair of other characters. It looked like gibberish, but we were asking Full Text Search for exact matches only, so frankly it didn’t matter. A set of tags like “.net c#” thus became “éûnetà écñà”.

[…]

We finally had a reason to remove this legacy from the past. […] After some thought and consideration, we settled on a pipe (bar) delimited natural representation, with leading/trailing pipes, so “.net c#” becomes simply “|.net|c#|”.

Comments RSS · Twitter

Leave a Comment