Failures happen in the application stack. It’s part of life. Whether you’re running on-premises or connecting to an API in the cloud, you need to make sure your application design is resilient and fault-tolerant. However, solving for a myriad of potential culprits like transient network failures or overloaded resources is certainly easier said than done. In addition, with microservices, solutions have an increasing number of component interdependencies that must degrade gracefully to keep applications operational and users happy.

I recommend starting with a retry pattern to keep your system healthy and to minimize downtime instead of immediately trying to design many solutions for specific culprits, such as a transient networking problem. This article shows an example of using retry logic in a Redis client library to illustrate the steps you can take to design a self-healing connection to a persistent data store or a cache.

Prerequisites

Cloud native designs for handling Redis retry logic

Redis is a very fast, in-memory database that allows you to build caching layers, session stores, or custom indexes with its low-level commands. Your application code will typically use an off-the-shelf Redis library that speaks the Redis wire protocol. Reading and writing a key is as simple as:

  // create a key (z) and store a value against it (somedata)
  await redisClient.set('z', 'somedata')

  // retrieve a value from a known key (z)
  const data = await redisClient.get('z')
  // 'somedata'

Note that every Redis operation is an asynchronous call, even if the Redis server is on your local machine. The network interaction between your code and the Redis service may take as little as half a millisecond or hang indefinitely and never complete.
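
Because a call can hang rather than fail, one option is to put an upper bound on how long your code waits for it. The following is a minimal sketch, not a feature of any Redis library: withTimeout is a hypothetical helper that races a promise against a timer, and the 100 ms limit in the usage comment is an arbitrary value chosen for illustration.

```javascript
// Hypothetical helper (not part of any Redis client library):
// reject if `promise` has not settled within `ms` milliseconds.
const withTimeout = (promise, ms) => {
  let timer
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  // whichever settles first wins; the timer is cleared either way
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Usage, assuming `redisClient` is a connected client:
//   const data = await withTimeout(redisClient.get('z'), 100)
```

Note that the underlying command is not cancelled when the timer fires. The helper only stops your code from waiting on it.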

This brings us to error detection and handling with Redis. How can you detect that the Redis server you were connected to has gone away? How should you cope with an unresponsive Redis service?

Luckily, the client library will perform most of the heavy lifting. Let’s look at some examples with the ioredis client library for Node.js. (Note that the redis library for Node.js is more popular, but it is rarely updated and offers fewer resiliency options than ioredis.)

Connect to a Redis service

Connecting to your Redis service is simple. For example, if you use IBM Cloud Databases for Redis like I do, here’s how to take the credentials from the service and extract the host, port, password, and TLS certificate:


// load your credentials from a JSON file
const credentials = require('./credentials.json')

// format the credentials in the correct form for the ioredis library
const redisconn = credentials.connection.rediss
const opts = {
  host: redisconn.hosts[0].hostname,
  port: redisconn.hosts[0].port,
  password: redisconn.authentication.password,
  tls: {
    ca: Buffer.from(redisconn.certificate.certificate_base64, 'base64').toString(),
  }
}

// load ioredis
const Redis = require('ioredis')

// configure Promises
Redis.Promise = global.Promise

// make Redis connection
const redisClient = new Redis(opts)

When I initialize the ioredis library, a network connection is made, authentication credentials are exchanged, certificates are checked, and the network connection is kept open indefinitely (so that this relatively expensive step can be avoided next time). The client object (redisClient) is returned and can then be reused again and again to execute Redis commands.

This is fine until reality takes over: networks are unreliable, passwords get rotated, computers need to be rebooted for upgrades or maintenance, and unexpected events happen. So how can you deal with this uncertainty with your connection to a Redis service?

Auto-reconnect to a Redis service

By default, the ioredis library will attempt to reconnect to a Redis service when the connection is lost. When reconnected, it can optionally re-subscribe to any publish/subscribe channels that became disconnected and retry any commands that failed when disconnected.

This behavior is configured when initializing the library by providing additional options in the opts object:

  • autoResubscribe can be set to false to prevent automatic re-subscription to publish/subscribe channels after a reconnection. (default: true)
  • maxRetriesPerRequest is the number of times to retry a Redis command before failing with an error. (default: 20)
  • enableOfflineQueue indicates whether to queue Redis commands issued when the client is not yet in its “ready” state. (default: true)
  • retryStrategy is a function that receives the number of reconnection attempts so far and returns the delay in milliseconds before the next attempt. You determine whether to reconnect at a fixed interval or perhaps with exponential back-off. (default: a delay that grows by 50 ms per attempt, capped at two seconds)
  • reconnectOnError is a function that determines whether a reconnection should be undertaken if a Redis error occurs. This is useful if the Redis node you are connected to becomes read-only because of a reconfiguration of a multi-node Redis cluster.

Here’s an example opts object:

const opts = {
  host: redisconn.hosts[0].hostname,
  port: redisconn.hosts[0].port,
  password: redisconn.authentication.password,
  tls: {
    ca: Buffer.from(redisconn.certificate.certificate_base64, 'base64').toString(),
  },
  autoResubscribe: true,
  maxRetriesPerRequest: 1,  // only retry failed requests once
  enableOfflineQueue: true, // queue offline requests
  retryStrategy: function(times) {
    return 2000 // reconnect after 2 seconds
  },
  reconnectOnError: function(err) {
    // only reconnect on error if the node you are connected to
    // has switched to READONLY mode
    return err.message.startsWith('READONLY')
  }
}

The options supplied here will depend on your application. You may be happy to be connected to a readonly node because your code is only ever reading data. You may want the library to perform more retries for failed commands, but bear in mind that lots of unfulfilled commands will eat up your application memory. It’s about striking a balance between detecting an error so that your code can fail gracefully, and seamlessly continuing to work to survive short connection blips.
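
As an alternative to the fixed two-second interval used above, the following is a rough sketch of an exponential back-off for the retryStrategy option. The constants BASE_DELAY_MS, MAX_DELAY_MS, and MAX_ATTEMPTS are illustrative assumptions, not recommended values. The sketch relies on ioredis passing the attempt count starting at 1 and treating a non-numeric return value as a signal to stop reconnecting.

```javascript
// Illustrative values only; tune them for your own application
const BASE_DELAY_MS = 100
const MAX_DELAY_MS = 5000
const MAX_ATTEMPTS = 20

// Sketch of an exponential back-off for the ioredis retryStrategy option.
// `times` is the number of reconnection attempts made so far (1, 2, 3, ...).
const retryStrategy = (times) => {
  // a non-numeric return value tells ioredis to stop reconnecting
  if (times > MAX_ATTEMPTS) {
    return null
  }
  // 100ms, 200ms, 400ms, ... capped at MAX_DELAY_MS
  return Math.min(BASE_DELAY_MS * 2 ** (times - 1), MAX_DELAY_MS)
}
```

To use it, pass the function as the retryStrategy property of the opts object. Giving up after a fixed number of attempts is a design choice: it lets your application surface a hard failure instead of retrying forever.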

Listen for connection events

It’s worth logging the connection behavior of your client so you can diagnose patterns of connection errors. The ioredis library emits connection events (connect, ready, error, close, reconnecting, and end), which you can listen to and log appropriately:

const debug = require('debug')('myapp')
redisClient
.on('connect', () => {
  debug('Redis connect')
})
.on('ready', () => {
  debug('Redis ready')
})
.on('error', (e) => {
  debug('Redis error', e)
})
.on('close', () => {
  debug('Redis close')
})
.on('reconnecting', () => {
  debug('Redis reconnecting')
})
.on('end', () => {
  debug('Redis end')
})

If you run your application with a DEBUG environment variable set, you will see the debug messages:

$ DEBUG=myapp node myapp.js
  myapp Redis connect +0ms
  myapp Redis ready +138ms

Furthermore, if you set the DEBUG environment variable to * you will see debugging data from the ioredis library itself:

$ DEBUG=* node myapp.js
  ioredis:redis status[myredisinstance.databases.appdomain.cloud:32006]: [empty] -> connecting +0ms
  ioredis:redis status[169.60.159.182:32006]: connecting -> connect +446ms
  ioredis:redis write command[169.60.159.182:32006]: 0 -> auth([ 'mypassword' ]) +1ms
  ioredis:redis write command[169.60.159.182:32006]: 0 -> info([]) +2ms
  myapp Redis connect +0ms
  ioredis:redis status[169.60.159.182:32006]: connect -> ready +137ms
  myapp Redis ready +137ms
  ioredis:redis write command[169.60.159.182:32006]: 0 -> set([ 'z', 'somedata' ]) +4s
  ioredis:redis write command[169.60.159.182:32006]: 0 -> get([ 'z' ]) +134ms
  ioredis:redis write command[169.60.159.182:32006]: 0 -> set([ 'z', 'somedata' ]) +5s
  ioredis:redis write command[169.60.159.182:32006]: 0 -> get([ 'z' ]) +133ms
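
Beyond logging, the same connection events can feed a simple in-process health record, for example to count reconnection attempts. The following is a sketch under stated assumptions: trackConnection is a hypothetical helper, and a plain Node.js EventEmitter stands in for the Redis client so that the example runs without a Redis server.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper: record connection state from connection events.
// Works with any emitter that fires 'ready', 'close', 'reconnecting' and 'end'.
const trackConnection = (client) => {
  const state = { ready: false, reconnects: 0 }
  client.on('ready', () => { state.ready = true })
  client.on('close', () => { state.ready = false })
  client.on('end', () => { state.ready = false })
  client.on('reconnecting', () => { state.reconnects++ })
  return state
}

// Stand-in emitter used here in place of a real Redis client
const fakeClient = new EventEmitter()
const state = trackConnection(fakeClient)
fakeClient.emit('ready')
fakeClient.emit('close')
fakeClient.emit('reconnecting')
console.log(state) // { ready: false, reconnects: 1 }
```

In a real application you would pass redisClient instead of the stand-in emitter and expose the state object to your monitoring or health-check code.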

Detect errors while performing Redis commands

A typical use case is to use Redis as a cache. Your application will attempt to fetch a cached key from Redis. If it exists, it is used. Otherwise, a request is made to fetch the data from the underlying primary source database and then the data is written to the Redis cache.

const go = async (query) => {

  // calculate cache key
  const h = hash(query)
  const cachedData = await redisClient.get(h)

  // if we have cached data
  if (cachedData) {

    // return it
    return JSON.parse(cachedData)
  }

  // otherwise get the data from the source database
  const nonCachedData = await fetchFromSourceDB(query)

  // if we got data
  if (nonCachedData) {

    // write it to the cache for next time
    await redisClient.set(h, JSON.stringify(nonCachedData))
  }
  return nonCachedData
}

This simple algorithm is written naively. If the Redis service is faulty, then you may not be able to read a cached value, and/or you may not be able to write data back to the cache. In this circumstance, it should be possible for your application to keep functioning, albeit without the benefit of the Redis cache. All you need is some defensive code that allows each Redis operation to fail gracefully. Don’t try the write operation if the read operation failed.

const go = async (query) => {

  // calculate cache key
  const h = hash(query)
  debug('cache key', h)
  let cachedData = null
  let readRedisWorked = false

  // try to read from Redis
  try {
    cachedData = await redisClient.get(h)
    readRedisWorked = true
  } catch (e) {
    debug('Redis read error', e)
  }

  // if we have cached data
  if (cachedData) {

    // return it
    debug('cache hit', h, cachedData)
    return JSON.parse(cachedData)
  }

  // otherwise get the data from the source database
  debug('cache miss - fetching from source database')
  const nonCachedData = await fetchFromSourceDB(query)

  // if we got data
  if (nonCachedData && readRedisWorked) {

    // write it to the cache for next time
    debug('writing to cache', h, nonCachedData)
    try {
      await redisClient.set(h, JSON.stringify(nonCachedData))
    } catch (e) {
      debug('Redis write error', e)
    }
  }
  return nonCachedData
}

The above code puts the Redis operations in try/catch blocks. If ioredis throws an error (which it will do after one retry attempt, as per your opts configuration), your code swallows the error, performs the query using the underlying database, and then doesn’t attempt to write the cache key to Redis. In other words, your app continues to operate without Redis, but will attempt reconnections so that normal service is resumed as soon as possible. Bear in mind that your back-end database will see an uptick in requests because every cache read will be a “miss”!
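
If you guard many Redis calls this way, the try/catch blocks can be factored into a small wrapper. The following is a minimal sketch, not part of ioredis: tryRedis is a hypothetical helper that reports whether the operation worked, so that a cache miss (a null value) can still be told apart from a failed read. The brokenClient object is a stub used only to make the example self-contained.

```javascript
// Hypothetical helper: run a Redis operation and report success or failure
// instead of throwing, so callers can carry on without the cache.
const tryRedis = async (operation) => {
  try {
    return { ok: true, value: await operation() }
  } catch (e) {
    return { ok: false, value: null, error: e }
  }
}

// Stub standing in for a Redis client whose connection has failed
const brokenClient = {
  get: async () => { throw new Error('stub: connection lost') }
}

const demo = async () => {
  const read = await tryRedis(() => brokenClient.get('z'))
  // read.ok is false here, so the caller should skip the cache write too
  return read
}
```

The ok flag plays the same role as the readRedisWorked variable in the code above: only attempt the write when the read succeeded.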

Conclusion

The ioredis library has a range of options to configure reconnections and automatic retry logic to suit most needs. Once configured, some defensive coding allows you to work around temporary Redis outages so that your application continues to function.