In 2016, Google published a book called "Site Reliability Engineering: How Google Runs Production Systems" that extolled a new approach to managing IT infrastructure. In Google’s words, site reliability engineering, or SRE for short, is “what you get when you treat operations as if it’s a software problem.”
That definition seems to align very closely with the DevOps movement, which aims, in part, to bring agile software development approaches to infrastructure management. People involved in DevOps teams have become increasingly interested in SRE and how it might help them become more collaborative and agile.
To find out more about site reliability engineering, Network Computing spoke with Rob Hirschfeld, who has been in the cloud and infrastructure space for nearly 15 years, including work with early ESX betas and serving on the OpenStack Foundation Board. Hirschfeld, cofounder and CEO of RackN, will present "DevOps vs SRE vs Cloud Native" at Interop ITX 2018.
We asked him to explain some of the basics of SRE, and what infrastructure pros need to know about this new concept. He highlighted four key facts:
1. SRE is a job function that started at Google
“Site reliability engineering is a term that was coined by Google to describe their engineering operations group,” Hirschfeld said. “It’s basically a job function that spans multiple disciplines on the operations side of Google. They are responsible not only for data center operations, but going up all the way to interacting with application developers and some of their key internet properties to analyze them, do performance management -- basically take a sustained application into an ongoing full lifecycle deployment.”
2. SRE complements DevOps approaches and cloud-native architecture
Hirschfeld explained that DevOps, SRE, and cloud-native apply similar philosophies to different aspects of IT. DevOps is “about people and culture and process,” SRE is “a job function,” and cloud-native is “an architectural pattern that describes how applications are built in a way that makes them more sustainable and runnable in the cloud,” he said.
He added, “It fits very cleanly together where we have an architectural pattern, a job function, a process management description -- all three tie together to really create the way modern application development works.”
3. SRE offers greater reliability and performance
In Hirschfeld’s words, SRE “supercharges a company’s operational experiences.”
He said that by embracing SRE, companies are “placing a high priority on sustaining engineering and making sure their site is up and running and performing well, and that they are not so focused on adding a feature that might hurt the customer experience in the end by being unreliable or slow.”
He also noted that while many organizations have very high regard for their developers, that hasn’t always been true for IT operations personnel. SRE can equalize the influence and respect afforded to development and operations staff.
4. SRE requires commitment
The one big downside of SRE is that it “takes a bit of commitment,” Hirschfeld said. “If the company is used to letting the operations team fight fires all the time and move from crisis to crisis, the SRE team is going to slow down those processes while it cleans house, while it fixes the backlog of problems and builds a more repeatable process.” That cleanup period can be discouraging, but he encourages organizations not to give up.
He also noted, “If you just throw SRE onto a team that’s not empowered as an SRE team, you will not be that successful at all. It’s not something you should do halfway.”
In conclusion, he re-emphasized the connections among DevOps, SRE, and cloud-native. “You can’t succeed at SRE without thinking about DevOps, without thinking about cloud-native architecture as well,” he said. “They all go hand-in-hand.”
Get live advice on networking, storage, and data center technologies to build the foundation to support software-driven IT and the cloud. Attend the Infrastructure Track at Interop ITX, April 30-May 4, 2018.