While working at django_db_indexer i wanted to implement an indexing system that could take full advantage of Redis db types.
In this post i'll explain how django_db_indexer offers support for startswith queries using redis zsets, feel free to write me if you think of a better way.
Suppose you have some objects you want to store in a redis database, for example :
class Musician(object):
def __init__(self,name,group):
self.name = name
self.group = group
We want to be able to retrieve all musician whose names starts with a certain string.
To do so, i took inspiration from this post in Salvatore Sanfilippo's blog:
http://antirez.com/post/autocomplete-with-redis.html
We'll be using redis' ordered sets. Zsets in redis are sets of values (strings) ordered by a certain score.
When two or more values share the same score, they are ordered alphabetically.
We'll store some musicians:
> hmset musicians:1 name "Julian Casablancas" group Strokes
> hmset musicians:2 name "Thom Yorke" group Radiohead
> hmset musicians:2 name "James Murphy" group "LCD Soundsystem"
...
For every record, we'll save the musician's name in a redis zset like this:
> zadd musician_name_startswith 0 Julian Casablancas_1
> zadd musician_name_startswith 0 Thom Yorke_2
> zadd musician_name_startswith 0 James Murphy_3
> zrange zset 0 -1
1. "James Murphy_3"
2. "Julian Casablancas_1"
3. "Thom Yorke_2"
Now, if we want to find all musicians whose name starts with "J" we simply follow these steps:
1) add the string "J" to our zset with score 0
2) find it's rank with "zrank" command
3) retrieve all elements in zset which rank is bigger than "J"'s rank using zrange
4) remove "J" from the set
We want to use a transaction with "WATCH" used over our zset in order to make this method work, because if another record is inserted in the zset the lookup string rank may vary.
We can now simply iterate over the list, breaking when we find a string that doesn't start with "J", and find musician's id by splitting the string by "_".
While zadd and zrank have logarithmic complexity, zrange has O(log(N)+M), where M is the number of elements returned (N in worst case scenario). To reduce this M you could add another string with only the last char changed in the following one (e.g. for our "J" i whould be "K") and retrieve the records with zrank between the two strings.
Here's the code:
SEPARATOR = '_'
def startswith(lookup,connection):
key = "musician_name_startswith"
conn.zadd(key,lookup,0)
conn.watch(key)
while True:
try:
pipeline = conn.pipeline()
up = conn.zrank(key,lookup)
pipeline.zrange(key,up+1,-1)
pipeline.zrem(key,lookup)
res = pipeline.execute()
return result_generator(lookup,res[0])
except WatchError:
pass
def result_generator(lookup,l):
for i in l:
if not i.startswit(lookup):
break
yield i