In this archived keynote session, StrategITcom's Carrie Goetz opens our 'Network Resilience Boot Camp' with a discussion of chaos engineering and failure simulations. The event was presented by Data Center Knowledge and Network Computing and sponsored by Panduit. This excerpt is from our live 'Network Resilience Boot Camp' virtual event moderated by Bonnie D. Graham back on June 29, 2023.
The devil is certainly in the details, but more often, the devil is a detail you never would have predicted. The weakest link in a complex architecture might be one you know about, but more likely, it's one hiding in plain sight.
View the entire Network Resilience Boot Camp event on-demand here.
In this virtual keynote presentation, Carrie highlighted the adaptive systems needed in modern hybrid architectures and pointed to Netflix's resiliency tools, such as its Chaos Monkey, which intentionally unleashes failures to test for and track the many unanticipated consequences of those failures.
A transcript of the video follows below. Minor edits have been made for clarity.
Decentralized architectures drive the need for chaos engineering
Bonnie D. Graham: Let me tell you about Carrie. She's the principal and CTO of StrategITcom (she'll correct me when she gets here), as well as a fractional CTO to several companies. Carrie has nearly 40 years of global experience in designing, running, and auditing data centers, IT departments, and intelligent buildings. I want to hear about that. She's an international keynote speaker, and she's been published in more than 250 publications. After Carrie's keynote, I hope you'll stay tuned for my Fireside Chat with Panduit's Jeffrey J. Paliga and Bob Wagner. Now, it's my pleasure and privilege to welcome Carrie Goetz to enlighten us on failure simulations and finding that chaos demon lurking in your system. Carrie, welcome. Take it over.
Carrie Goetz: Thank you so much. Hi, everybody. I am Carrie Goetz, as she said, and she read a little bit of my bio. Most recently, I wrote the book 'Jumpstart Your Career in Data Centers,' and the educator's reference for it is coming out as well. So, if you're involved anywhere in tech or IT, consider that for your education. But today, we're really going to talk about chaos daemons and the different things you can use for failure simulation to help out whatever you're doing in your data center and across your communications. So, you know, things have changed a lot.
This is a very cyclical industry. We went from very centralized architectures with mainframes to decentralized ones because we didn't have enough bandwidth to do backups and the like, so we started moving PCs out to remote offices. Then we had a big re-centralization effort to take advantage of things like virtualization, to have better sustainability policies, and to work through our systems better with a smaller architecture. And now, it's kind of redistributed again with cloud, obviously on a slightly different platform. But think about all the different areas where communication could go wrong in a network or where you can have an issue. We obviously have network communications, and today, that means a whole lot more than it used to - because we have remote workers, we have people in transit, we have people in the air, we have people on the ground. We have people on planes, trains, automobiles, all those things.
5G has certainly taken over some of that, and we still have a significant portion of the world that is not connected at all. So those are certainly things we would want to work on, but we also have to be able to test for those scenarios. And the tricky bit, when you think about testing for any of these - but specifically networks - is that there are about a million things that can go wrong. Now, if your frame of reference is seven things that go wrong, then obviously, you're only going to test for seven. So, this could be difficult. Same thing with servers. We have multiple servers sharing platforms now. We have abstracted the software from the hardware layer, so it's not a one-to-one ratio anymore, and we have different ways to test for those. And this is an area where AI has really helped us out quite a bit.
So, we have machine learning that helps with some of those things - server reboots, for instance. Instead of having to go in and reboot manually, we have machines now that can do that. We also have to think about databases and where things can go wrong. Databases only work when you write stuff into them. They're not intuitive, and they're not mind readers; they can't pre-write data just because you're thinking the data needs to be there, right? We have to think about that, and we have to think about what's going to be in RAM versus what gets written to disk, and where things are when chaos happens. We also have to think about all these various locations.
Poor performance is the new system down
I touched on that a little bit in the networking part, but think about it. Think of all the different places you connect from, for instance, to get to your email, right? Or think of all the different locations, especially if you're a transitory worker, from which you connect just to your corporate network. Latency is a big issue that really drives a lot of what we do, from being able to distribute our systems to deciding where they go. Because obviously, slow is down for most people. If you click on a link and it's really slow, nobody waits on the link; you hit the back button, and you go to the next link. And so all of this really becomes a big part of sorting out how we're going to handle it, and we have to figure out failover, right?
It's great to say 'System A' is automatically going to take over for 'System B.' Now, I can definitely tell you from a lot of the outage triage that I've done - that's great when you've tested it and it works, but it doesn't always work that way. What we have with failover sometimes is that we know what it's supposed to do, but then it doesn't. If we don't test these things, we don't know until the actual situation happens, and then that 'uh-oh' moment hits when everything's gone crazy. We also have mirroring back and forth - how is that going to work if we do simultaneous writes, and all those different things? And when we think about distributed computing, putting all these things either in the cloud, in our data center, or at the edge (which is another component we haven't really touched on), we have to think about how we're going to get to all those systems.
We can get to them if the network's working. We cannot get to them if it's not, unless we're doing out-of-band management. But then, you think about the potential for anything to go wrong. Realistically, the hard part with testing and triage is that when we come to conclusions about what happened, it's because we've done a root cause analysis or looked at something in hindsight. And really, what we want to be able to do is take a proactive approach.
It's almost like pre-raising your kids, right? If you knew that you could raise your child, pick all of their friends, figure out every place they're going to go, and put them in that plastic bubble, that'd be great. Clearly, it's never going to happen, but it's sort of the same kind of thing, right? We want to pre-raise our systems, and we want to know where those problems are before they come back to bite us. So, if we look at the definition of chaos, it's a condition or place of great disorder or confusion. Well, that certainly describes downtime to a tee, right? A disorderly mass or a jumble, the disordered state of unformed matter in infinite space that was proposed in some cosmogonic views to have existed before the ordered universe.
If we take the first two, we're not talking about the universe, although I guess some people would argue that the internet is kind of our universe. But think about the condition of great disorder or confusion that absolutely happens when we have downtime, especially if we don't know the cause of that downtime – and we see this a lot today. You can log on to a cloud application, for instance, and it's not there, and you find out the next day that there was a service outage or somebody had an issue. These are things that we deal with on a pretty regular basis. Now, the trick is, if we want to test for that chaos, we have to create that chaos. Well, how do you create chaos if you don't know what could technically happen? And the even trickier part is the combination of things that can happen, right? So, we want to make sure that we can address this a little bit.
Enter chaos engineering
So, Netflix years ago started this Chaos Monkey, and the idea was they wrote something that would go out and just randomly terminate instances to figure out what would happen. So, think about Netflix as a company and an application. They stream everywhere that there's a Netflix presence. Some of that uses edge compute, and some of that is going to be in more core data centers but think of their endpoints. Their endpoints are literally everything; they're desktops, TVs, phones, or tablets. Anything that can kind of communicate that you can get the Netflix app on. So how do you test for all those? What they figured out was we can't really rely on humans to do that testing. We have to come up with something that is going to work outside of the human mindset to make that happen, and that's where Chaos Monkey was developed.
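To make that concrete, here is a minimal sketch, in Python, of the kind of random-termination loop Chaos Monkey popularized. It is illustrative only, not Netflix's code: the instance list, the opt-in flag, and the terminate() call are hypothetical stand-ins for whatever your platform's API actually provides.

```python
import logging
import random
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("chaos-sketch")

@dataclass
class Instance:
    instance_id: str
    group: str        # e.g., an auto-scaling group or service name
    opted_in: bool    # only terminate instances whose owners opted in

def terminate(instance: Instance) -> None:
    """Stand-in for a real platform call (e.g., a cloud SDK terminate request)."""
    log.info("Terminating %s in group %s", instance.instance_id, instance.group)

def unleash(fleet: list[Instance], probability: float = 0.3) -> None:
    """Randomly kill at most one opted-in instance per group, per run."""
    by_group: dict[str, list[Instance]] = {}
    for inst in fleet:
        if inst.opted_in:
            by_group.setdefault(inst.group, []).append(inst)

    for group, members in by_group.items():
        if random.random() < probability:
            terminate(random.choice(members))
        else:
            log.info("Group %s spared this run", group)

if __name__ == "__main__":
    unleash([
        Instance("i-001", "api", True),
        Instance("i-002", "api", True),
        Instance("i-003", "billing", False),  # not opted in, never touched
    ])
```

The point of the randomness is exactly what Carrie describes: the tool, not a human's short list of known failures, decides what gets broken and when.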
When we think about adaptive systems, we have to think about how we train them to adapt when there's a problem. How do we sort this out if you can't test for everything? Going back a few years, there was a very widespread, highly televised outage. It turned out that this particular company flipped over to its backup data center. They thought that was going to solve their problem, but somebody had hard-coded an IP address, and there was a lag time. They didn't really test for that, and they didn't realize the address wasn't dynamic. So when they failed over, it still didn't work, because the IP address was hard-coded to the wrong machine. All these things can happen, and again, you don't really realize that they could happen until they bite you.
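That hard-coded address failure is easy to reproduce in miniature. A hedged sketch, with made-up hostnames and addresses: if the primary site's IP is baked into the client, repointing the service name during failover never reaches that client.

```python
import socket

# Anti-pattern: the primary site's address is baked into the client, so
# repointing the service name during a failover never reaches this code path.
HARD_CODED_DB = "10.1.20.15"          # hypothetical primary-site address

# Better: resolve a stable service name at connect time, so a DNS or
# load-balancer change made during failover is actually picked up.
DB_HOSTNAME = "db.example.internal"   # hypothetical name that failover repoints

def resolve_database() -> str:
    """Look the database up by name instead of trusting a frozen address."""
    try:
        return socket.gethostbyname(DB_HOSTNAME)
    except socket.gaierror as err:
        # In a real system you would alert here rather than silently falling
        # back to an address that may now belong to the dead site.
        raise RuntimeError(f"could not resolve {DB_HOSTNAME}: {err}")

if __name__ == "__main__":
    print("Hard-coded target:", HARD_CODED_DB)
    try:
        print("Resolved target: ", resolve_database())
    except RuntimeError as err:
        print("Lookup failed (expected with a made-up name):", err)
```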
We want to make sure that our resilience really has that adaptive capacity. When we test for resilience, and we expect resilience in our systems, we want to make sure we're not just basing this on the assumption that it's a real rosy world where all things are going to go right. Because anybody that's been on the troubleshooting side knows it doesn't work that way. I mean, it really does not work that way. So, what we want to do is give ourselves the tools to be able to sort that out. If we think about the problem with traditional testing, you're only going to test for the things you know about, right? If you're driving a car and you're looking for potential causes of an accident, the ones you're going to think about are the ones in your frame of mind.
You're not going to think of a brick coming off an overpass. But do you look for that every time you go? Maybe not, but if a brick came off an overpass and it hit the front of your car or even hit the trunk of your car, clearly, you would have a very different day. But you wouldn't think about that because it's not your normal frame of mind, and that's a horrible example, but you get the point.
The need for expanded testing and failure simulations
The point is we don't test for stuff we don't know. As you think about diversity in tech, this is really one area where that diversity helps because you have people that sort problems out in a different way.
You have people that do it the scrappy way because they're self-taught. You have people that do it the traditional way because they learned in a very particular pattern, and then maybe you have some people from the outside who work with systems that closely relate to it, and maybe they can shed a different light. If you think about root cause analysis, the reason it's so successful is that you involve so many people in the process and bring everybody into the fold, so you get all those perspectives. Honestly, when you're testing systems, that is one of the best things you can do. You can ask anybody you know who writes code, hey, how would you like your code tested?
They'll tell you they would either like to run it through AI or they'd rather give it to somebody that knows nothing about the program. Those are the ones that are going to click in the weird spots. They're going to try to back out of a dialog box without entering critical information, and they're going to do different things. That helps pinpoint those errors before they become an issue for the company that wrote the code, right? We want to make sure that we test often, and we want to make sure that we're not testing for the same five things. Not that those five things aren't critical, but there are other things that can happen. So, what happens if network A goes down and we're running on network B? Say it was a power failure - do we know that those addresses fail over? Do we know if that stuff is going to work?
What if we have to move an instance right in the middle of that, and we've got an instance that's mid-move? How do we test for that? How do we know that we can get that server back up so that we can hit a resiliency point? Those are the kinds of things testing is going to give you. But again, that comes more from a development standpoint, and DevOps, and even DevSecOps - when you add security to this, think about those testing scenarios. But again, they're going to be limited to what you know, because you don't know what you don't know. We're all ignorant until we know. What we're trying to do is figure out random patterns of different things that can happen. If we think about adaptive computing, what happens with the storage? What happens with the server instances? What happens with the networking - all those things, and even the backups? How do we get back to a non-fractured state when there is an incident? That's really where this kind of testing is going to help.
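One way to turn those questions into something repeatable is a small failover drill that states its steady-state check up front. Here is a sketch, assuming a hypothetical health endpoint and fail_path()/restore_path() hooks into whatever actually controls networks A and B:

```python
import time
import urllib.request

HEALTH_URL = "https://service.example.internal/health"  # hypothetical endpoint

def steady_state_ok(timeout: float = 3.0) -> bool:
    """The hypothesis we expect to hold before, during, and after failover."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def fail_path(name: str) -> None:
    """Stand-in for pulling network A (or its power) in a controlled way."""
    print(f"[drill] failing {name}")

def restore_path(name: str) -> None:
    print(f"[drill] restoring {name}")

def run_failover_drill() -> None:
    assert steady_state_ok(), "system unhealthy before the drill - abort"
    fail_path("network A")
    try:
        time.sleep(30)  # give addresses, routes, and instances time to fail over
        assert steady_state_ok(), "failover to network B did not carry traffic"
    finally:
        restore_path("network A")  # always put things back, pass or fail

if __name__ == "__main__":
    try:
        run_failover_drill()
        print("[drill] failover hypothesis held")
    except AssertionError as err:
        print("[drill] failed:", err)
```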
So, this kind of testing allows for random cycling, and it'll happen at different times of the day. Even if you bring sustainability and smart buildings into the equation - what happens with all these devices when nobody is at work? What happens to them in the middle of the night? Can we do some of this testing at those random hours when we're not staffed? And the other thing is, once you start applying machines to this and make them record things, they're going to pick up on things that humans would normally ignore. We all do this, right? As you're going through your systems, you click on something to dismiss an error. Well, you might be ignoring something that's really critical to a different application. So this allows for accountability, and most of these systems will help test for that accountability. Then, as I said, post-mortems are absolutely effective, and root cause analysis is great.
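A minimal sketch of that idea, assuming a hypothetical inject() hook: faults fire only during an unstaffed window, and every action, including skipped runs, lands in an audit log so nothing gets clicked past and forgotten.

```python
import json
import logging
import random
from datetime import datetime

# Every injected fault (and every skipped run) is written down, so the audit
# trail - not someone's memory - is what provides the accountability.
logging.basicConfig(level=logging.INFO, format="%(message)s")
audit = logging.getLogger("chaos-audit")

UNSTAFFED_HOURS = range(1, 5)   # assumed 01:00-04:59 window with nobody on site
FAULTS = ["reboot_server", "drop_link", "kill_instance", "detach_volume"]

def inject(fault: str) -> None:
    """Stand-in for the real fault-injection call."""
    audit.info(json.dumps({"time": datetime.now().isoformat(),
                           "fault": fault, "status": "injected"}))

def run_once() -> None:
    now = datetime.now()
    if now.hour in UNSTAFFED_HOURS:
        inject(random.choice(FAULTS))
    else:
        audit.info(json.dumps({"time": now.isoformat(),
                               "status": "skipped (staffed hours)"}))

if __name__ == "__main__":
    # In practice this would be fired by cron; one call here is just a demo.
    run_once()
```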
Root cause analysis vs. chaos engineering
How we look at systems after things go down is spectacular, but a lot of times, once we hit a single problem, we stop looking, right? So, root cause analysis brings in a lot more, and then, of course, you test to make sure your theory is correct. But we don't always have time to do that when we're operating at the speed of business and just trying to hurry things along. What we want to do is give ourselves the tools and make our lives a little easier, too. Chaos engineering is sort of the crash test dummy of software engineering: it's the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions. We build, we fail, we learn from that failure, and then we repeat it and see if we can find another failure.
One of the best management books I've ever read was something along the lines of 'If It Ain't Broke, Break It.' The idea was that this company paid people to tear things up and create problems, as long as they had a solution. If you broke a company system and came up with a solution for how to fix it, there was a $5,000 bonus. Now keep in mind this was 20 years ago; five grand went a lot farther. But still, that's the mentality you want to have, especially where your systems impact what you can do as a company and your relations with your customers. So, Netflix launched the Simian Army; the first member was Chaos Monkey. They came up with the name because, ideally, this monkey would go around and create a little havoc here and there, and they would test to see what happened. Then, they decided they really needed to do something bigger.
Chaos Gorilla came along, and it can wipe out an entire availability zone just to see how failover of all the systems works. Then, there's Janitor Monkey, which goes in and cleans up after all the other monkeys - probably the worst job of all the monkeys. There's Latency Monkey, which deals with latency: what happens if Site A goes down? Is the latency going to be acceptable? Are there going to be too many little spinning circles on people's screens, for instance? Then there's Conformity Monkey, to make sure that everything conforms, obviously for compliance reasons. Then Doctor Monkey, Security Monkey, and the 10-18 Monkey, which works across instances serving different regions and languages. All these different monkeys test different parts of the system.
Now, think about it this way: if you turn one kid loose on a playground, they're going to play with their favorite toys. If you turn ten kids loose on that playground, they're going to play with probably all the toys, in various orders, and you have no control over that. What a great way to test your systems: put all ten kids on the playground, let them have fun, and see what they come up with. See how they interact with each other, and see what happens if the Janitor Monkey doesn't clean something up - is that going to be a problem for the Security Monkey? Does it create a security hole? All these things can be tested and simulated before they become problems that cost actual money.
Cost of downtime continues to grow
I think the last estimate I saw put downtime somewhere in the neighborhood of $5,800 per minute for the average company. It's definitely worth testing, right? And those are the tangible costs, which don't really cover the intangible costs that come with downtime, like loss of customer confidence and people going to your competitors and staying there. All those things are hard to track, but clearly, you want to make sure that doesn't happen. Now, as far as the Netflix chaos monkeys go, they don't have the longest lifecycle, nor do they have the most support. If you go out and look at these monkeys from an open-source perspective, they've really had very few additions and very few comments.
There are some other open-source tools that are definitely a little more used these days, because even these platforms are evolving as we go, but there are some dependencies. Chaos Monkey relies on Spinnaker, which works with most cloud platforms, but not everybody uses it. You have to enable Chaos Monkey, obviously, for it to work. It runs through cron jobs, and it assumes you're operating on a continuous integration and continuous delivery model. The idea is that we do this in a forward-moving motion, right? As we're still moving at the speed of business, we still test for all the things that could stop our business, and then we still have to react at the speed of business. Then, there's MySQL, which keeps up with the scheduling and related state. So, if your environment doesn't have all of that, you would need to look at one of the other chaos tools, and there are certainly others out there.
There are eight fallacies of distributed computing - assumptions people make that simply aren't true. It's just like assuming that if you put something in the cloud, the cloud itself is secure; you're responsible for securing your stuff in the cloud. These are the same kinds of assumptions - things people assume distributed computing will solve for them, but it won't. We can't say the network is reliable. We do have very reliable networks, and most of them have a resilient component, but some parts of the world have very unreliable networks, so you can't always say the network is reliable. Equipment is not reliable either; it all fails at some point. You also can't assume that there's zero latency.
There are parts of the world where things will fail simply on latency. For instance, say the application is in one availability zone and the database is in a different availability zone. If your traffic gets kicked over to a different connection, it's going to be very, very slow. So what's going to happen is it's going to time out - it's going to appear to be down when really it was just a latency issue. You also can't assume that bandwidth is infinite. I was speaking not long ago, and someone said, 'But 5G solves that, right? With 5G, everybody gets a gig all the time.' And you don't - you get a gig to the tower, but you still have to share the uplink, and you still have to share it with other people. Same thing with cable modems and a lot of that; you can't rely on that bandwidth being infinite. Wireless is the same thing; you hit saturation at some point, and nobody else can attach. So, you can't make those assumptions.
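The 'slow looks like down' effect is easy to demonstrate: if the client's timeout is shorter than the cross-zone round trip, a perfectly healthy but distant database reads as an outage. A small sketch with made-up numbers:

```python
import time

CLIENT_TIMEOUT_S = 2.0        # what the application is willing to wait
CROSS_ZONE_LATENCY_S = 3.5    # assumed round trip after traffic is rerouted

def query_database(latency_s: float, timeout_s: float) -> str:
    """Simulate a query whose only problem is cross-zone network latency."""
    if latency_s > timeout_s:
        time.sleep(timeout_s)  # the client waits its full timeout, then gives up
        raise TimeoutError(
            f"no reply within {timeout_s:.1f}s - indistinguishable from the DB being down"
        )
    time.sleep(latency_s)
    return "rows returned"

if __name__ == "__main__":
    try:
        print(query_database(CROSS_ZONE_LATENCY_S, CLIENT_TIMEOUT_S))
    except TimeoutError as err:
        print("Reported as an outage, but it was only latency:", err)
```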
You also can't assume that the network is secure. This lesson was probably driven home very, very hard during COVID, because people found out that company laptops were being used for kids' school assignments, and kids didn't always stay where they needed to stay. So, you have to think about how people get there. We found quite a few problems with VPNs, and areas where VPNs simply didn't do enough, and so zero trust came along. Now, we have different things that we're trying from a security standpoint. And think about it: just in these first four or five fallacies, all of this is networking. When you have the internet, it is this gargantuan mesh of stuff, and we would like to think it all works perfectly, but think about how much of it is not under your control.
Keeping pace with changing network topologies
We also have to understand that topology is not constant, right? People move: they work at a Starbucks one day, at home the next day, and maybe at the beach the day after that, unless, of course, you make them come to the office. But even there, they might work on the fifth floor one day and have meetings on the second floor the next. We understand that our people are portable, our devices are portable, and all of that moves around. We also can't assume that there's only one admin; there are a lot of admins. As a matter of fact, a lot of cloud misconfigurations happen because there are too many admins. We also can't assume that the transport cost is zero; there are different costs to connecting in different places. And we can't assume that the network is homogenous, because it is not - there are all kinds of different devices. Some are banned in some countries and some are not, but we still communicate with those devices in other countries.
All these things really form the basis for chaos engineering. These are things that we would love to say that we can count on, but we can't. What happens when these fail? What happens to our systems? So, like I said, with the old school, you test for a problem, you fix it, you test it again, you document it. But you're only testing for the problems you know about. You're only testing for the scenarios that really have come up in your past, right? You learned about them in school, they've happened in your career, a peer has told you that they happen – that’s your frame of reference for that testing. But if you try to hit that over 20 different systems, it really becomes very complex for people to manage. But hey, you know what? We have computers, and we can code those, we can program those, and we can make them do that testing for us.
When you look at some of these systems, especially the open-source and cloud-based ones, you can do this on an on-demand basis as opposed to having it run inside your systems all the time. For small businesses and smaller entities, this is a great equalizer. You can have a company come in and do this as a service: run these tests over your systems, pinpoint what your problems are, let you fix them, and come back six months or a year from now and do it again. Or, with an open-source tool, you can do it yourself if you have the time. The problem is, obviously, most of us already work more than 40 hours a week, so there's that.
What chaos engineering brings to the table
But chaos engineering, like I said, is different. It's like pre-raising your kids, if you could figure out all the different scenarios. What we do with chaos engineering is decide our blast radius - what we're going to experiment on. Then, we load in those configurations, and we tell it to go. Maybe we say we're going to kill network one at 1:00 pm, 9:30 pm, and 1:30 am, and we load those into the database, and the cron jobs start launching them. Maybe we shut off a couple of instances, and maybe we make a database disappear and look at how the backups work - all these things. Then, we observe the scenario and what happens. Then we say, hey, all of this went without a failure; let's try it at different times, or let's try it on a different server, and try different scenarios and see how that works - make sure that goes well.
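Here is a rough sketch of what declaring such an experiment might look like; the names and notes are hypothetical, and the scheduling itself would be handled by cron or by whichever chaos tool you use.

```python
from dataclasses import dataclass, field
from datetime import time as clock

@dataclass
class Experiment:
    """A declared chaos experiment: what we may break, when, and what we saw."""
    name: str
    blast_radius: list[str]     # the only systems this experiment may touch
    faults: list[str]           # e.g., "kill network one", "hide the reporting DB"
    schedule: list[clock]       # when the cron-style runner fires
    observations: list[str] = field(default_factory=list)

    def observe(self, note: str) -> None:
        self.observations.append(note)

if __name__ == "__main__":
    run = Experiment(
        name="network-one-failover",
        blast_radius=["network one", "app tier A"],
        faults=["kill network one", "stop two app instances", "hide the reporting DB"],
        schedule=[clock(13, 0), clock(21, 30), clock(1, 30)],  # 1:00 pm, 9:30 pm, 1:30 am
    )
    # A scheduler would inject each fault at each time in run.schedule, then we
    # record what actually happened, vary the times or targets, and repeat.
    run.observe("placeholder note: what failed over, how long it took, what never came back")
    print(run.name, "->", run.observations)
```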
Then, of course, repeat and change, and repeat and change. Do you need every single one of these monkeys in your scenario? The answer is maybe, right? It depends on the size of your entity. It depends on the complexity of your system. It also depends on whether you want to test or not. I mean, believe it or not, there are still a lot of companies that don't test like they should. Larger companies do because they have the resources, but it is a strain on smaller companies that have maybe one or two IT directors who are stuck with this. These open-source tools are definitely a help there. We also want to test for evil and coincidence.
The one thing we haven't really touched on - we did talk a little bit about the security of networks - is that not everybody that hits your network has your best interests at heart. I don't want you to be shocked by that news, but I'm just going to tell you. As we're testing for these scenarios, what happens if somebody tries to hack your system at the same opportune moment that you're running one of these scenarios? So, you have to be able to test not only for the evil but also for the random things that happen, and sometimes those things happen at the same time. That's a wealth of information. Whatever you're using for cybersecurity, whatever that posture is and whatever your plans are, you obviously want to make sure that you include security in these tests, folks. Include compliance and anything else that could have financial reporting consequences. They need to be part of these scenarios.
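A hedged sketch of folding 'evil plus coincidence' into the same drill: inject a fault and, in the same window, run a benign stand-in for an attack probe, then check whether the security tooling still raised an alert. Every function here is a hypothetical placeholder for your own fault injector and alert feed.

```python
import random

def inject_fault() -> str:
    """Stand-in for a scheduled chaos fault (instance kill, link drop, etc.)."""
    return random.choice(["instance killed", "network link dropped"])

def simulated_attack() -> str:
    """Benign stand-in for a red-team style probe running in the same window."""
    return "repeated failed logins from one address"

def security_still_watching(alert_feed: list[str]) -> bool:
    """Did security tooling raise an alert while operations was busy with the fault?"""
    return any("failed logins" in alert for alert in alert_feed)

if __name__ == "__main__":
    fault = inject_fault()
    probe = simulated_attack()
    # Hypothetical alert feed; in a real run this would come from your SIEM.
    alert_feed = [f"chaos: {fault}", f"auth: {probe}"]
    detected = security_still_watching(alert_feed)
    print(f"Fault injected: {fault}; attack probe detected during the fault: {detected}")
```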
Now, there are a few other chaos tools, so if you're not looking at Chaos Monkey, there are other options out there. Cyborg and Monkey-Ops are open-source options, Gremlin is another one, and AWS has a Fault Injection Simulator that you can get as part of the AWS platform. I'm sure that Microsoft and Google have similar tools. Whether you use open source or something beyond that, you can mix and match however you want to. So, that's it for me. This is my information should you have any questions. If you need any follow-up answers after this webcast, please feel free to reach out. If you are teaching IT, give me a shout; we have a new teacher's book that we would love to put in front of you to help with some of this. It's all about us learning together and working together. So, thank you, everybody.