Hurricane Sandy was a powerful and destructive storm that affected over 20 states and left significant destruction in its wake. With an office in New York and many friends and loved ones on the East Coast, our thoughts and prayers are with everyone whose lives were affected.
Quantcast, like many companies, monitored Hurricane Sandy closely as it approached New York. In support of our business, we operate multiple datacenters spread across the United States, including a pair in Manhattan. As it became clear a storm surge was approaching, our datacenter team began monitoring things even more closely. We have a lot of redundancy in place to immediately take over whenever we need to move traffic from one datacenter to another; however, you can never be 100 percent sure things will work correctly.
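The kind of failover described here is typically driven by health checks that pull traffic away from an unhealthy site and redistribute it among the rest. The sketch below is purely illustrative (the datacenter names, capacities, and routing function are hypothetical, not Quantcast's actual system); it shows the basic idea of re-weighting traffic across the sites that remain healthy.

```python
def route_traffic(datacenters):
    """Return per-site traffic weights, sending load only to healthy
    sites, in proportion to each site's capacity."""
    healthy = [dc for dc in datacenters if dc["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy datacenters available")
    total = sum(dc["capacity"] for dc in healthy)
    return {dc["name"]: dc["capacity"] / total for dc in healthy}

# Example: one New York site goes dark; its share of traffic
# automatically shifts to the remaining sites.
dcs = [
    {"name": "nyc-75broad", "capacity": 10, "healthy": False},
    {"name": "east-2",      "capacity": 10, "healthy": True},
    {"name": "west-1",      "capacity": 20, "healthy": True},
]
weights = route_traffic(dcs)
```

In practice the health signal would come from external probes and the weights would feed a DNS or load-balancer layer, but the re-weighting logic is the core of it.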
The first indication that we might need to put our disaster plan into action came late on Monday (Pacific time), when our provider informed us that the 75 Broad Street datacenter had flooded, disabling the fuel pumps. Only a few hours' worth of fuel remained. Our plan was to move traffic away and turn the servers off, which we did quickly, doing our small part to help extend the fuel reserves. The facility eventually lost power late on Tuesday morning, and the outage lasted approximately 12 hours.
Things were running well given the circumstances. We still had the 75 Broad Street location offline, and we were closely monitoring the availability of all our platforms. On Tuesday night everything seemed to be improving, and our operations team went to bed content. We had received notification that the 75 Broad Street facility was coming back online, and all our other datacenters were unaffected. We intended to bring the facility back into service on Wednesday.
Unfortunately, Sandy hadn’t had enough — shortly after 2 a.m. the fuel pump at our other New York location failed, causing that datacenter to lose power. Our failover infrastructure instantly kicked in, shifting traffic to other East Coast facilities, and things continued to perform well. Our on-call engineers immediately worked on bringing the 75 Broad Street location back online (to restore a layer of redundancy), only for it to become unreachable at 2:30 a.m. At that point we realized it was connected through the second location, and that its battery backup had presumably failed — never an ideal situation.
Despite the loss of multiple major datacenters, careful planning meant traffic simply flowed to our other production datacenters (we operate more than a dozen fully redundant locations in the United States) with no loss of service.
The biggest impact on the company has been the temporary closure of the New York office, which sits in the area that suffered a complete blackout. Fortunately, all employees in New York are safe and accounted for, and are working remotely until the office's power and connectivity are restored.
Situations like this change rapidly. Currently, all our datacenters are back online (some on generator power) and our fully redundant infrastructure is back to full strength. We also have lessons to learn:
1. Always double-check exactly where the failure points are — we had a previously unknown failure point where one datacenter could take out the other.
2. Always make sure you are receiving facility notifications.
3. Plan for the worst case, hope for the best. Although we had discussed the impact of hurricanes on our infrastructure beforehand, we will incorporate more disaster planning into our infrastructure strategy going forward.
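The first lesson, hidden failure points, is the kind of thing that can be audited mechanically. The sketch below is a hypothetical illustration (the site names and the dependency map are invented for the example): given a map of which site's connectivity runs through which other site, it computes each site's "blast radius", i.e. every other site that also goes dark if it fails. A link like 75 Broad Street's routing through the second Manhattan location would show up immediately.

```python
from collections import defaultdict

def blast_radius(depends_on):
    """depends_on maps each site to the site its connectivity runs
    through (None = independent uplink). Returns, for each site, the
    set of other sites that also fail if it goes down, following the
    dependency edges transitively."""
    dependents = defaultdict(set)
    for site, upstream in depends_on.items():
        if upstream is not None:
            dependents[upstream].add(site)

    def affected(site):
        seen, stack = set(), [site]
        while stack:
            for child in dependents[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    return {site: affected(site) for site in depends_on}

# Hypothetical topology: one site's uplink runs through another.
deps = {
    "nyc-second-site": None,
    "nyc-75broad": "nyc-second-site",
    "west-1": None,
}
radius = blast_radius(deps)
```

Any site with a non-empty blast radius is a single point of failure for something else and deserves a second, independent path.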
As an aside, many thanks and credit to everyone working at our Peer1 facility at 75 Broad Street — they have kept everything (mostly) running by carrying fuel up 17 flights of stairs to feed the generator. While they have had a small amount of downtime related to cooling, we are very impressed with the lengths they have gone to in order to keep downtime to a minimum.
Posted by Crispin Flowerday, 11/2/12