Wednesday, December 23, 2009

Google App Engine Datastore "IN" operator

This is the first Desert Island-related technical post. In this article, I am outlining a couple of gotchas regarding the use of the "IN" operator in GQL.

In, I created a class DesertIslandList, which represents a user's list of his/her 8 desert island albums. It, amongst other properties, contains user_id and id_source. For a user who's logged in via Facebook Connect, the user_id of his/her created list would simply be his Facebook profile id and id_source would be "f" (for Facebook).

In order to display lists created by his/her friends, I, first of all, get his/her friend's id list using friends.get method in the Facebook RESTful API with a version of minifb hacked up for Google App Engine. I then created a GQL query for all DesertIslandList WHERE user_id IN :1 AND id_source = :2 with :1 being the Python list containing all the friend ids as strings, and :2 being "f". Straightforward enough? Well it all seemed so until I stumbled across of a couple of gotchas.

Only 30 items allowed

First of all, I soon got a "too many subqueries (max: 30..." exception. As it turns out, only 30 items are allowed in the list:

(from Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.

So, my bad, not reading the documentation properly. To get around this, you could either somehow limit the list to 30 items or chop your list up into 30-item chunks and union the results (sorting offline if necessary).

What if the "IN" list is empty?

What if he or she has no friends?

Convention SQL wisdom would suggest that the query should return no DesertIslandList objects. However, in the world of Google App Engine datastore, "IN" means cross-product of all the separate individual queries, so in my GQL query above, an empty "IN" list effectively means get all DesertIslandList WHERE id_source = "f" i.e. not filtering on the user_id field at all!

Not only is this slight counter-intuitive, this can also throw a surprise "no matching index found" exception. The fact that an empty "IN" list does not filter on the user_id field means my GQL instance with an empty "friend ids" list has to access an additional DesertIslandList index with only the id_source field, and not the usual index with both the user_id and the id_source fields. Imagine this, if you have not run a GQL instance with an empty "IN" list in your local dev environment for this particular query, no index (with only the id_source field) will be added to your index.yaml. Up in the cloud, if a person happens to have no Facebook friends, he or she is going to see an ugly stacktrace complaining the the index does not exist!

Of course, the simple solution to this problem is to always check the list to see if it is empty before running the query.

Just another thing to bear in mind!

No comments: