Failures happen in the application stack. It’s part of life. Whether you’re running on-premises or connecting to an API in the cloud, you need to make sure your application design is resilient and fault-tolerant. However, solving for a myriad of potential culprits like transient network failures or overloaded resources is certainly easier said than done. In addition, with microservices, solutions have an increasing number of component interdependencies that must degrade gracefully to keep applications operational and users happy.
I recommend starting with a retry pattern to keep your system healthy and to minimize downtime instead of immediately trying to design many solutions for specific culprits, such as a transient networking problem. This article shows an example of using retry logic in a Redis client library to illustrate the steps you can take to design a self-healing connection to a persistent data store or a cache.
Prerequisites
Cloud native designs for handling Redis retry logic
Redis is a very fast, in-memory database that allows you to build caching layers, session stores, or custom indexes with its low-level commands. Your application code will typically use an off-the-shelf Redis library that can speak the Redis wire protocol (RESP). Reading and writing to a key is as simple as:
// create a key (z) and store a value against it (somedata)
await redisClient.set('z', 'somedata')
// retrieve a value from a known key (z)
const data = await redisClient.get('z')
// 'somedata'
Note that every Redis operation is an asynchronous call, even if the Redis server is on your local machine. The network interaction between your code and the Redis service may take as little as half a millisecond or hang indefinitely and never complete.
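Because a call may never complete, it can help to put an upper bound on how long your code waits. The following is a minimal sketch, assuming a hypothetical withTimeout helper (it is not part of any Redis library) that races a promise against a timer. Note that it only stops your code from waiting; it does not cancel the underlying command.

```javascript
// Hypothetical helper (not part of ioredis): reject if `promise` has not
// settled within `ms` milliseconds.
const withTimeout = (promise, ms) => {
  let timer
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  // whichever promise settles first wins; the timer is always cleared
  // afterwards so it cannot keep the process alive
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// usage, assuming a connected redisClient:
// const data = await withTimeout(redisClient.get('z'), 500)
```

The 500 ms budget in the usage comment is an arbitrary illustration; a suitable value depends on how long your application can afford to wait for a cache lookup.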
This brings us to error detection and handling with Redis. How can you detect that the Redis server you were connected to has gone away? How should you cope with an unresponsive Redis service?
Luckily, the client library will perform most of the heavy lifting. Let’s look at some examples with the ioredis client library for Node.js. (Note that the redis library is more popular, but it is rarely updated and offers fewer resiliency options compared with ioredis.)
Connect to a Redis service
Connecting to your Redis service is simple. For example, if you use IBM Cloud Databases for Redis like I do, here’s how to take the credentials from the service and extract the host, port, password, and TLS certificate:
// load your credentials from a JSON file
const credentials = require('./credentials.json')
// format the credentials in the correct form for the ioredis library
const redisconn = credentials.connection.rediss
const opts = {
  host: redisconn.hosts[0].hostname,
  port: redisconn.hosts[0].port,
  password: redisconn.authentication.password,
  tls: {
    ca: Buffer.from(redisconn.certificate.certificate_base64, 'base64').toString(),
  }
}
// load ioredis
const Redis = require('ioredis')
// configure Promises
Redis.Promise = global.Promise
// make Redis connection
const redisClient = new Redis(opts)
When I initialize the ioredis library, a network connection is made, authentication credentials are exchanged, certificates are checked, and the network connection is kept indefinitely (so that this relatively expensive step can be avoided next time). The client object (redisClient) is returned and then can be used (and reused again and again) to execute Redis commands.
This is fine until reality takes over: networks are unreliable, passwords get rotated, computers need to be rebooted for upgrades or maintenance, and unexpected events happen. So how can you deal with this uncertainty with your connection to a Redis service?
Auto-reconnect to a Redis service
By default, the ioredis library will attempt to reconnect to a Redis service when the connection is lost. When reconnected, it can optionally re-subscribe to any publish/subscribe channels that became disconnected and retry any commands that failed when disconnected.
This behavior is configured when initializing the library by providing additional options in the opts object:
autoResubscribe can be set to false to prevent automatic re-subscription to publish/subscribe channels. (default: true)
maxRetriesPerRequest is the number of times to retry a Redis command before failing with an error. (default: 20)
enableOfflineQueue indicates whether to queue Redis commands issued when the client is not yet in its “ready” state. (default: true)
retryStrategy is a function that determines the delay between reconnection attempts. You determine whether to reconnect at a fixed interval or perhaps with exponential back-off. (default: a delay that grows by 50 ms with each attempt, capped at two seconds)
reconnectOnError is a function that determines whether a reconnection should be undertaken if a Redis error occurs. This is useful if the Redis node you are connected to becomes readonly because of a reconfiguration of a multi-node Redis cluster.
Here’s an example opts object:
const opts = {
  host: redisconn.hosts[0].hostname,
  port: redisconn.hosts[0].port,
  password: redisconn.authentication.password,
  tls: {
    ca: Buffer.from(redisconn.certificate.certificate_base64, 'base64').toString(),
  },
  autoResubscribe: true,
  maxRetriesPerRequest: 1, // only retry failed requests once
  enableOfflineQueue: true, // queue offline requests
  retryStrategy: function(times) {
    return 2000 // reconnect after 2 seconds
  },
  reconnectOnError: function(err) {
    // only reconnect on error if the node you are connected to
    // has switched to READONLY mode
    return err.message.startsWith('READONLY')
  }
}
The options supplied here will depend on your application. You may be happy to be connected to a readonly node because your code is only ever reading data. You may want the library to perform more retries for failed commands, but bear in mind that lots of unfulfilled commands will eat up your application memory. It’s about striking a balance between detecting an error so that your code can fail gracefully, and seamlessly continuing to work to survive short connection blips.
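If a fixed interval is too aggressive during a long outage, the retryStrategy function can implement exponential back-off instead. The following is a minimal sketch; the base delay, the cap, and the attempt limit are illustrative assumptions rather than ioredis defaults. It relies on two documented behaviors: ioredis passes the attempt number (starting at 1) to the function, and returning a non-number stops further reconnection attempts.

```javascript
// Illustrative values only; tune them for your own application
const BASE_DELAY_MS = 100
const MAX_DELAY_MS = 5000
const MAX_ATTEMPTS = 20

const retryStrategy = (times) => {
  // returning a non-number tells ioredis to stop reconnecting
  if (times > MAX_ATTEMPTS) {
    return null
  }
  // double the delay on each attempt: 100ms, 200ms, 400ms...
  // but never wait longer than MAX_DELAY_MS
  return Math.min(BASE_DELAY_MS * 2 ** (times - 1), MAX_DELAY_MS)
}

// it would take the place of the fixed two-second strategy in opts:
// const opts = { host, port, password, retryStrategy }
```

Capping the delay keeps the client from waiting minutes between attempts after a long outage, while the attempt limit lets the application give up and report a hard failure rather than retrying forever.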
Listen for connection events
It’s worth logging the connection behavior of your client so you can diagnose patterns of connection errors. The ioredis library emits connection events (connect, ready, error, close, reconnecting, and end), which you can listen to and log appropriately:
const debug = require('debug')('myapp')
redisClient
  .on('connect', () => {
    debug('Redis connect')
  })
  .on('ready', () => {
    debug('Redis ready')
  })
  .on('error', (e) => {
    debug('Redis error', e)
  })
  .on('close', () => {
    debug('Redis close')
  })
  .on('reconnecting', () => {
    debug('Redis reconnecting')
  })
  .on('end', () => {
    debug('Redis end')
  })
If you run your application with a DEBUG environment variable set, you will see the debug messages:
$ DEBUG=myapp node myapp.js
myapp Redis connect +0ms
myapp Redis ready +138ms
Furthermore, if you set the DEBUG environment variable to * you will see debugging data from the ioredis library itself:
$ DEBUG=* node myapp.js
ioredis:redis status[myredisinstance.databases.appdomain.cloud:32006]: [empty] -> connecting +0ms
ioredis:redis status[169.60.159.182:32006]: connecting -> connect +446ms
ioredis:redis write command[169.60.159.182:32006]: 0 -> auth([ 'mypassword' ]) +1ms
ioredis:redis write command[169.60.159.182:32006]: 0 -> info([]) +2ms
myapp Redis connect +0ms
ioredis:redis status[169.60.159.182:32006]: connect -> ready +137ms
myapp Redis ready +137ms
ioredis:redis write command[169.60.159.182:32006]: 0 -> set([ 'z', 'somedata' ]) +4s
ioredis:redis write command[169.60.159.182:32006]: 0 -> get([ 'z' ]) +134ms
ioredis:redis write command[169.60.159.182:32006]: 0 -> set([ 'z', 'somedata' ]) +5s
ioredis:redis write command[169.60.159.182:32006]: 0 -> get([ 'z' ]) +133ms
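Beyond logging, the same connection events can feed a small in-memory record of connection health, which makes patterns such as repeated reconnects easier to spot. The following is a rough sketch: trackConnection and the shape of its state object are hypothetical, and a plain Node.js EventEmitter stands in for redisClient so the example runs without a Redis server.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper: derive a simple health record from the
// connection events emitted by the client
const trackConnection = (client) => {
  const state = { ready: false, reconnects: 0, lastError: null }
  client
    .on('ready', () => { state.ready = true })
    .on('close', () => { state.ready = false })
    .on('reconnecting', () => { state.reconnects++ })
    .on('error', (e) => { state.lastError = e })
  return state
}

// demonstration with a stand-in emitter instead of a real redisClient
const fakeClient = new EventEmitter()
const state = trackConnection(fakeClient)
fakeClient.emit('ready') // connection established
fakeClient.emit('close') // connection dropped
fakeClient.emit('reconnecting') // client is trying again
console.log(state)
```

With a real client you would pass redisClient to trackConnection and expose the state through whatever health check or metrics mechanism your application already uses.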
Detect errors while performing Redis commands
A typical use case is to use Redis as a cache. Your application will attempt to fetch a cached key from Redis. If it exists, it is used. Otherwise, a request is made to fetch the data from the underlying primary source database and then the data is written to the Redis cache.
const go = async (query) => {
  // calculate cache key
  const h = hash(query)
  const cachedData = await redisClient.get(h)
  // if we have cached data
  if (cachedData) {
    // return it
    return JSON.parse(cachedData)
  }
  // otherwise get the data from the source database
  const nonCachedData = await fetchFromSourceDB(query)
  // if we got data
  if (nonCachedData) {
    // write it to the cache for next time
    await redisClient.set(h, JSON.stringify(nonCachedData))
  }
  return nonCachedData
}
This simple algorithm is written naively. If the Redis service is faulty, then you may not be able to read a cached value, and/or you may not be able to write data back to the cache. In this circumstance, it should be possible for your application to keep functioning, albeit without the benefit of the Redis cache. All you need is some defensive code that allows each Redis operation to fail gracefully. Don’t try the write operation if the read operation failed.
const go = async (query) => {
  // calculate cache key
  const h = hash(query)
  debug('cache key', h)
  let cachedData = null
  let readRedisWorked = false
  // try to read from Redis
  try {
    cachedData = await redisClient.get(h)
    readRedisWorked = true
  } catch (e) {
    debug('Redis read error', e)
  }
  // if we have cached data
  if (cachedData) {
    // return it
    debug('cache hit', h, cachedData)
    return JSON.parse(cachedData)
  }
  // otherwise get the data from the source database
  debug('cache miss - fetching from source database')
  const nonCachedData = await fetchFromSourceDB(query)
  // if we got data
  if (nonCachedData && readRedisWorked) {
    // write it to the cache for next time
    debug('writing to cache', h, nonCachedData)
    try {
      await redisClient.set(h, JSON.stringify(nonCachedData))
    } catch (e) {
      debug('Redis write error', e)
    }
  }
  return nonCachedData
}
The above code puts the Redis operations in try/catch blocks. If ioredis throws an error (which it will do after one retry attempt, as per your opts configuration), your code swallows the error, performs the query using the underlying database, and then doesn’t attempt to write the cache key to Redis. In other words, your app continues to operate without Redis, but will attempt reconnections so that normal service is resumed as soon as possible. Bear in mind that your back-end database will see an uptick in requests because every cache read will be a “miss”!
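If several queries need the same protection, the defensive pattern described above can be pulled out into a reusable function. The following is a minimal sketch, assuming a hypothetical cacheAside helper: cache is any object with promise-returning get and set methods (such as an ioredis client), and loader is a function you supply that fetches from the primary data source.

```javascript
// Hypothetical helper: cache-aside read that keeps working when the
// cache is unavailable
const cacheAside = async (cache, key, loader) => {
  let cached = null
  let cacheHealthy = true
  // try to read from the cache, remembering whether the read worked
  try {
    cached = await cache.get(key)
  } catch (e) {
    cacheHealthy = false
  }
  // cache hit: return the stored value
  if (cached) {
    return JSON.parse(cached)
  }
  // cache miss or cache failure: go to the primary source
  const fresh = await loader()
  // only write back if there is data and the earlier read succeeded
  if (fresh && cacheHealthy) {
    try {
      await cache.set(key, JSON.stringify(fresh))
    } catch (e) {
      // a failed write is not fatal; the next call will simply miss
    }
  }
  return fresh
}

// usage, assuming the names from the example above:
// const data = await cacheAside(redisClient, hash(query), () => fetchFromSourceDB(query))
```

Errors thrown by loader are deliberately not caught here: a failure of the primary data source is a real failure that the caller needs to see, unlike a cache outage.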
Conclusion
The ioredis library has a range of options to configure reconnections and automatic retry logic to suit most needs. Once configured, some defensive coding allows you to work around temporary Redis outages so that your application continues to function.