Saturday, September 03, 2011

Redis Pipelining

At Mendeley, we use a mixture of (my|no)sql technologies for handling different types of data. Redis is a key-value store that we use for our user feed and user notification data.

Being an in-memory store with disk persistence, Redis has delivered exceptional performance in many of our applications. As those applications grow in complexity, however, it becomes apparent that network delay is the main bottleneck. Redis is a TCP server based on a simple request-response model, which means that if you are using a client library such as Rediska and executing hundreds of SET commands in one go over a single open socket (kept open for the script's execution), each of those SETs results in a request being sent to Redis and a response being awaited. Each time, the socket blocks until the response can be read. The network delay incurred by these request-response round trips simply multiplies with the number of commands executed per operation.
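To make the cost concrete, here is a sketch of the naive approach (the Rediska connection options, key names, and data are illustrative, not from our actual code):

```php
<?php
// Each set() below is a full request-response round trip:
// send the command, then block on the socket until Redis replies.
// With even 1ms of network latency, 5,000 iterations spend
// roughly 5 seconds just waiting on the wire.
$rediska = new Rediska(array(
    'servers' => array(array('host' => '127.0.0.1', 'port' => 6379)),
));

foreach ($feedItems as $userId => $item) {
    $rediska->set('feed:' . $userId, serialize($item));
}
```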

One obvious answer to this performance issue is to use MSET to set multiple key-value pairs in a single socket request. (Incidentally, MSET is an atomic command, which may or may not be what you want, depending on your particular use case.)
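In Rediska this can look something like the following (assuming the version at hand supports passing an array of key-value pairs to set(), which maps onto MSET — check the Rediska docs for your version; the keys and data are illustrative):

```php
<?php
// One socket round trip sets all the pairs; because MSET is atomic,
// the keys become visible to other clients at the same instant.
$rediska->set(array(
    'feed:1' => serialize($itemsForUser1),
    'feed:2' => serialize($itemsForUser2),
));
```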

But there is a better alternative - Redis pipelining. The idea is dead simple: instead of waiting for a response per socket request, the client simply sends multiple requests before reading the aggregated responses. This hugely cuts down the effect of network delay on operation time. Client libraries like Rediska make pipelining a breeze. In Rediska's case, one simply starts a pipeline, invokes the commands on the pipeline object (instead of the usual Rediska object), and finally executes the pipeline to send the batch of commands through. Minimal code changes required!
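The pipelined version of the earlier loop might look like this (again, key names and data are illustrative):

```php
<?php
// Commands are queued client-side; nothing blocks until execute(),
// which sends the whole batch in one go and returns the
// responses in command order.
$pipeline = $rediska->pipeline();

foreach ($feedItems as $userId => $item) {
    $pipeline->set('feed:' . $userId, serialize($item));
}

$responses = $pipeline->execute();
```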

Not surprisingly, the performance gain from issuing 5k commands in a pipeline is remarkable. While I cannot provide precise benchmarking data, I can say that, connecting from my local machine to a Redis server installed on an EC2 instance, the operation completed in 2 seconds, as opposed to 20 seconds without pipelining.

There are a number of points worth noting (some more subtle than others):

  • pipelining is NOT the same as Redis transactions
  • pipelining does not provide atomicity
  • pipelining really is nothing more complicated than the Redis server being able to queue up responses in memory
  • it is entirely up to the client when to send the requests and when to wait on the responses
  • commands in a pipeline must be independent of each other (as individual responses are not read until the end of the batch)
  • sending too many commands through a single pipeline may cause memory issues and potentially socket timeouts, so one may want to flush a pipeline after a certain number of commands (best to do some performance testing to pick a threshold)
BufferedRedisPipeline

Or "auto-flushing pipeline"....

I created a wrapper class around Rediska_Pipeline to introduce auto-flushing.

Code snippet of an example implementation of the class:

 class BufferedRedisPipeline {

      /** @var Rediska_Pipeline */
      private $pipeline;

      /** @var Rediska */
      private $rediska;

      /** @var int number of commands queued since the last flush */
      private $counter = 0;

      /** @var int */
      private $flushThreshold;

      /** @var array responses collected from flushed batches */
      private $result = array();

      /**
       * @param Rediska $rediska
       * @param int $flushThreshold set as -1 to stop the pipeline from auto-flushing
       */
      public function __construct(Rediska $rediska, $flushThreshold = 5000) {
           $this->rediska = $rediska;
           $this->flushThreshold = $flushThreshold;
           $this->pipeline = $rediska->pipeline();
      }

      /**
       * Flushes any remaining commands and returns all collected responses.
       * The object cannot be reused after execute() has been called.
       *
       * @return array
       */
      public function execute() {
           if ($this->counter) {
                $this->flushPipeline(false);
           }
           $this->pipeline = null;
           return $this->result;
      }

      /**
       * Magic method to invoke command calls on the underlying pipeline object.
       *
       * @param string $name
       * @param array $arguments
       * @return void
       */
      public function __call($name, $arguments) {
           $this->counter++;
           call_user_func_array(array($this->pipeline, $name), $arguments);
           // OR use another method to invoke command calls if one is concerned
           // about the performance of call_user_func_array
           if ($this->flushThreshold != -1 && $this->counter >= $this->flushThreshold) {
                $this->flushPipeline(true);
           }
      }

      /**
       * @param bool $createNewPipeline
       * @return void
       */
      private function flushPipeline($createNewPipeline) {
           $this->counter = 0;
           $this->result = array_merge($this->result, $this->pipeline->execute());
           $this->pipeline = $createNewPipeline ? $this->rediska->pipeline() : null;
      }
 }
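Usage is then nearly identical to a plain pipeline (the threshold and key names below are illustrative):

```php
<?php
$rediska  = new Rediska();
$pipeline = new BufferedRedisPipeline($rediska, 1000);

for ($i = 0; $i < 5000; $i++) {
    // proxied to the underlying Rediska_Pipeline via __call();
    // the wrapper auto-flushes after every 1000 commands
    $pipeline->set('key:' . $i, $i);
}

// flushes the final partial batch and returns all responses
$responses = $pipeline->execute();
```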
   

Finally...

Of course, it is always worth considering handling heavy IO operations asynchronously to guarantee a responsive user experience, e.g. by leveraging a queue system or by using AJAX requests.