Our clients depend on us to deliver exceptional service, and solid engineering is a large part of that promise. With Quantcast’s large request volume (nearly two million requests per second) we cannot afford any downtime. Downtime on any of the services in the request path (e.g. Keebler) compromises the quality of our final result, negatively impacting the business performance of our products and the quality of service we provide to our clients. We take this very seriously and can’t allow for it.
We also pump out new features and fixes to our software almost daily. Some of these releases require downtime as an instance of a service is upgrading, restarting, reloading offline data, rebuilding caches, etc. However, this needn’t cause user-visible service interruption. In this post, we will discuss some of the strategies engineers use to achieve zero downtime restarts, including some we use at Quantcast, and the pros and cons of each.

Hot Standby
A common strategy to avoid service disruption is to have another set of servers with all services running, data loaded, caches built, etc, sitting idle. When software needs to be restarted, traffic is redirected to the hot standby, which begins serving new requests. The previously active instances are now idle, and can be upgraded without affecting the user experience.
This strategy has the benefit of simplicity: it is relatively easy to implement and operate, and is thus less bug-prone than some more complex restart strategies. However, it’s wasteful: your infrastructure must be about twice as large as required to handle your peak traffic if you are to allow for a failover during peak times.

Rolling Restart with Load Balancers
Another common strategy is to keep services behind load balancers. When software needs to be restarted, the load balancer redirects traffic away from the instance which will be restarted. Once the instance has been drained of traffic, operators can safely restart the software (e.g. software deployment). Once the instance has successfully come back online and is ready to serve traffic again, the load balancer returns traffic to the instance. An entire fleet of instances can be restarted this way, in a rolling manner, without an external user noticing any service disruption.Load-EngBlog-01
Unlike the hot standby strategy, using a load balancer allows gradual upgrades, making deployments less risky. Should a new release require some extra validation, a few instances being upgraded and released into production creates less risk than the hot standby method.
Second, a load balancer may already be present in the architecture to actually balance load so this strategy will bake right into already present functionality. Unfortunately, load balancers do add significant latency to the request path. For latency sensitive applications, the load balancer strategy may not be an option. Furthermore, draining instances to perform a software restart places higher load on the remaining servers. Thus, some excess capacity is required to handle this load (though not the twice-peak-load over provisioning of a hot standby).

Service Discovery
A corollary to the load balancer strategy (and a method we use within Quantcast) that is worth mentioning is using ZooKeeper for service discovery (though you can use other similar systems). When a service starts up and is ready to serve traffic, it registers itself in a shared path under ZooKeeper with an ephemeral (short-lived) node (as mentioned in our earlier post on Keebler).

Other services which are interested subscribe to that shared path using a ZooKeeper directory watch; they are notified about service availability changes and cache the results. When a service is to be restarted, the service is notified (e.g. through SIGTERM) and closes its connection to ZooKeeper, thereby removing its ephemeral node. Any final requests are handled and the service exits cleanly. Since the ephemeral node has gone, internal clients will no longer send requests to the service going down.

Similar to this is to use Zookeeper as a distributed lock manager. Multiple instances of a service are started and all try to lock a distributed lock within Zookeeper. The instance which is successful is now the master service, responsible for handling the traffic. The other services standby waiting to acquire the lock. When the master service exits, a standby service will acquire the lock and become the new master and take over handling traffic.
Note that this method, while similar to the load balancer strategy, will really only work internally within an architecture. There must still be some process which accepts external connections and begins the request path within the internal architecture. That could be a load balancer or some other process that uses another strategy to achieve zero downtime restarts.

Bind Sharin
Starting with Linux kernel version 3.9, a new socket option called SO_REUSEPORT (similar to the SO_REUSEADDR socket option) allows multiple sockets to call bind(2) on the same port, so long as the first server which binds to the port sets this option. The kernel is then responsible for fairly spreading incoming connections on the port to all processes bound to that port. Using this strategy makes software restarts very simple. The new process simply binds to the same common port that is receiving incoming connections. For a period of time, both processes are bound to the port and serving incoming traffic. The old process is then told to exit, and the service restart is complete. For more information on SO_REUSEPORT and some sample code on how to enable it on a socket, see the excellent LWN article.

However, TCP, and many other stream based protocols, use a three-way handshake to ensure a connection is live. Should a new instance of a service come up or an old instance go down while a new connection is still being set up, the kernel may get confused regarding which instance a particular connection belongs to. As of this writing, this limitation of SO_REUSEPORT is being worked on within the linux kernel. Note that SO_REUSEPORT’s main purpose is to get balanced load on a machine and was not specifically built for zero downtime restarts.

File Descriptor Sharing
A simple, yet effective, method resembling bind sharing is to take advantage of file descriptors preserved after a call to exec(3). By default, file descriptors remain open across an exec(3). Operators can restart software without any downtime by replacing the software binary on disk and then instructing the currently running binary (e.g. through SIGUSR1) to fork(2) and exec(3) the new binary. In the exec(3) call, the service somehow passes the file descriptor number of an accept socket to the new binary (e.g. as a command line argument) which the new instance can begin accepting requests on. The old service completes all pending requests and exits cleanly.

The main drawback of using this method is that it may be difficult to integrate into production systems. Most production systems have some sort of process watchdog (e.g. daemontools) which is responsible for ensuring services are running properly. In file descriptor sharing, the previously running instance is the parent process instead of the process manager. Some process managers may not play nicely with this architecture.

File descriptor passing
Bind sharing and file descriptor sharing are great solutions, but they may not be right in all cases. For example, you may have to work on a legacy system with a kernel that doesn’t support the SO_REUSEPORT socket option. There exists a less common, but more powerful, method to restart software without downtime through file descriptor passing.

The strategy, implementation details, and some sample code is discussed in detail in Chapter 17 of Advanced Programming in the UNIX Environment by W. Richard Stevens and Stephen A. Rago. In summary, engineers can send a socket to another process over another socket. When software has to be restarted, the new instance is simply started. It then notifies the previously running instance to send over its file descriptor through a shared mechanism (e.g. through a named pipe). The previously running instance sends the socket over to the new instance and stops accepting new requests. Instead, it focus on handling the requests which were already in flight. Once all requests have been successfully completed, the old instance can exit cleanly. The new instance will begin accepting connections and serving traffic on the socket it received from the previously running instance.

File descriptor passing, like bind sharing and file descriptor sharing, makes restarts simple: just start a new instance and let the old the instance exit. Furthermore, file descriptor passing doesn’t suffer from one of the limitations of bind sharing: the kernel getting confused which instance a packet belongs to when instances are coming online and going offline. However, one of the biggest negatives regarding file descriptor passing is that it is complicated to implement correctly compared to the other techniques listed in this post. There are a few libraries available which provide a base implementation.

In this post, we have discussed various methods you can use within your own infrastructure to restart your software without causing downtime for your internal and external users.

Hot standby
  • Simple
  • Easy to operate
  • Wasteful of resources
Rolling restart with load balancers
  • Controls risk during rollout
  • Bakes into already present load balancer
  • Introduces significant latency
  • May not be appropriate for hot datacenters
Bind sharing
  • Simple
  • Easy to operate
  • Negligible added latency
  • Requires a modern kernel
  • Requires fixes to ensure connections aren’t dropped
File descriptor sharing
  • Simple
  • Easy to operate
  • Negligible added latency
  • May not play nice with process managers
File descriptor passing
  • Simple
  • Easy to operate
  • Negligible added latency
  • Complex code to implement
  • Difficult to implement correctly

Some of these (e.g. file descriptor passing) are currently in production use at Quantcast. There are many more and the correct one for your particular situation may not be listed here. Providing a solid user experience to your users, customers, and clients is paramount and restarting software without any downtime is a part of that solid user experience.

Interested in engineering at Quantcast? Check out our current openings.