After the zombies took over Google's data center, the heroic action of a few selfless individuals saved the day. Never underestimate what a site reliability engineer can do with an axe.
"If you look at zombies in the data center, they're after the people," explained Kripa Krishnan, technical program manager at Google. "So it becomes less of a machine's problem and becomes more of a people problem..."
The zombie invasion occurred back in 2007. It was one of the first Disaster Recovery Testing (DiRT) events created to evaluate Google's operational resilience in a crisis. This was before the Centers for Disease Control and Prevention began warning about zombies because storms, pandemics and earthquakes don't get people's attention anymore.
Heroism has played a central role in saving Google more than once -- another scenario involved an executive wielding the teleportation gun from Valve's Portal -- but it can't be relied on when disaster strikes, any more than an IT system or business process can at a time of crisis. Google as a company promotes the perception that its employees are exceptionally talented. But when it comes to preparing for the worst, the company can't simply assume that exceptional skills will save the day.
"We find that people are people, and they burn out if they work insane hours and long shifts," said Krishnan. "Heroic tactics are not a sustainable model if you're in a disaster."
The DiRT program was created seven years ago, and Krishnan began managing DiRT events a year after that. Genial and sharp, with a penchant for using the word "goodness" to emphasize a point, she has a background that recalls the famously overachieving Buckaroo Banzai, depicted in the 1984 movie that bears his name as a neurosurgeon, physicist, rock musician and test pilot.
Hyperbole perhaps, but it's a necessary element in a story about heroism. Krishnan was studying medicine over a decade ago when her interests took her to music and theater. Three years in, she decided to study performance arts, and eventually came to the U.S. to focus on theater. Then a professor convinced her to take a computer science course. Having left science for the arts, Krishnan finally emerged from graduate school with a degree in Management Information Systems. Thereafter, she became involved with telemedicine networking in Kosovo and later landed at Google.
Now her job is to break things, as Krishnan explained in an interview at Google's Mountain View, Calif., headquarters.
"Sometimes we will bring in someone to write something that will cause a failure in some underneath layer and it will manifest itself as cascading failures in some front-end facing product," Krishnan said. Other times, she said, her team might direct someone to introduce corrupt data into a system, to see how long it takes to find the problem.
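To make the idea of a cascading failure concrete, here is a minimal Python sketch. It is not Google's tooling: the service names, the dependency map and the `cascade` function are all hypothetical, and the model assumes the simplest possible case, in which a service fails whenever anything it depends on fails, with no redundancy or fallback.

```python
from collections import deque

# Hypothetical dependency map (illustrative only): each service lists
# the services it depends on directly.
DEPENDS_ON = {
    "storage": [],
    "auth": ["storage"],
    "index": ["storage"],
    "search_frontend": ["index", "auth"],
    "billing": [],
}


def cascade(depends_on, injected_failure):
    """Return the set of services that fail once one service is taken down.

    Simplifying assumption: a service fails as soon as any one of its
    dependencies has failed.
    """
    # Invert the map so each service knows which services rely on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the injected failure.
    failed = {injected_failure}
    queue = deque([injected_failure])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in failed:
                failed.add(service)
                queue.append(service)
    return failed


if __name__ == "__main__":
    # Break the lowest layer and list everything that goes down with it.
    print(sorted(cascade(DEPENDS_ON, "storage")))
```

In this toy map, breaking `storage` takes down every service that relies on it, directly or indirectly, including the user-facing `search_frontend`, while the unrelated `billing` service stays up. A real exercise would also measure how long it takes people to notice and trace the fault, which a static model like this cannot capture.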
DiRT is an annual exercise. Although various Google product groups conduct their own internal stress tests, DiRT's scope is companywide. DiRT scenarios challenge both technical infrastructure and organizational dynamics. Initially, the tests were restricted to user-facing systems, but they have been expanded to cover the full range of Google operations. Beyond data centers, DiRT testing might include systems used by facilities, finance, human resources and security, among other business groups. More recently, as the company's enterprise business has become more successful, customer support systems were added to the tests.
DiRT exercises require the work of hundreds of engineering and operations employees for several days, which means they're not inexpensive to run. They can affect live systems and have even resulted in revenue loss. But the price is deemed to be worth it.
Sanjay Jain, associate industry professor in the department of decision sciences at George Washington University, said in an email that the apparent increase in manmade and natural disasters around the globe demands more active continuity planning.
"Recently, companies have had to face major issues due to disasters including the loss of operations in New York and New Jersey area following Hurricane Sandy a few months ago, and the major impact on supply chains following the tsunami in Japan in 2011," he said. "Companies need to be more thorough in planning for safety of their personnel and maintaining business continuity in face of such eventualities. Such efforts have to go beyond duplicating data servers (that is of course needed) to employing live and computer simulations of potential disaster scenarios and their impact on companies' personnel, operations, and assets, and testing of measures to eliminate or substantially reduce the negative impacts."
In case of emergency, Google has a war room. DiRT tests are run from a simulated war room, which can be one of the company's many conference rooms.