People tune in to events like the World Series and the US Open to see athletes at the top of their game, but these streams also represent the finest hour for a less-recognized group pushing the limits: the backend engineering, cloud/DevOps, and site reliability engineering (SRE) teams who prepare and maintain the infrastructure necessary for these video streams to function in the face of tremendous traffic. Ensuring that massive online events are successful and deliver great online experiences takes as much planning, effort, and quick thinking as winning a championship game.
It’s a shame that this work takes place behind the scenes because there’s a lot to learn from watching it happen. Spikes in demand require careful planning to mitigate risk and maintain optimal performance.
With the right mix of preparation, technology, and flexibility, any networking team can ensure resilience and provide every user with a stable, uninterrupted experience. In order to lay the groundwork for success, three essential best practices stand out:
1) Begin the planning process as soon as possible to ensure great online experiences
Networking teams can anticipate some spikes in traffic well in advance (such as for the Super Bowl), while other traffic surges come as a surprise (such as a jump in demand resulting from a small company’s TikTok advertisements going viral). However, in either case, it’s best to have a plan for handling high demand as far in advance as possible.
A good first step is to assemble the team that will handle the event. This should be a diverse group that includes technical managers, SREs, and backend engineers. Everyone involved needs to have a precise sense of what their duties are and what is best left to a teammate. They should instantly know whom to ask for help if they encounter something they can’t handle themselves.
2) Prepare the infrastructure you need for resilience
High-profile events can be targets for cyberattacks and raise the risk of infrastructure failures, but companies of all sizes face these same risks every day. It’s essential to create a robust infrastructure that ensures redundancy and resilience to sustain continued operations in the event of an attack or outage.
Two of the most critical technologies for delivering high-quality experiences are content delivery networks (CDNs) and DNS networks. Different CDNs have distinct strengths and weaknesses, so building a multi-CDN network can ensure reliable access to many kinds of content from any location. A given CDN may excel at providing coverage in major metropolitan regions, while another may excel at catering to rural audiences. Additionally, one CDN may be better at hosting videos and streams, while another may be a better fit for static assets.
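To make the multi-CDN idea concrete, here is a minimal sketch of a selection rule that picks a CDN by audience region and content type. The CDN names, the score table, and the `pick_cdn` function are all hypothetical; in practice the scores would come from your own performance measurements rather than fixed constants.

```python
# Hypothetical per-CDN scores (0 to 1, higher is better). These numbers are
# illustrative assumptions, not measurements of any real provider.
CDN_SCORES = {
    "cdn_a": {
        "region": {"metro": 0.95, "rural": 0.60},
        "content": {"video": 0.90, "static": 0.70},
    },
    "cdn_b": {
        "region": {"metro": 0.75, "rural": 0.90},
        "content": {"video": 0.65, "static": 0.95},
    },
}


def pick_cdn(region, content_type, healthy=None):
    """Return the best-scoring CDN for a request.

    `healthy` optionally restricts the choice to CDNs that are currently
    passing health checks; by default every known CDN is a candidate.
    """
    candidates = set(CDN_SCORES) if healthy is None else set(healthy) & set(CDN_SCORES)
    if not candidates:
        raise RuntimeError("no healthy CDN available")

    def score(name):
        entry = CDN_SCORES[name]
        return entry["region"][region] * entry["content"][content_type]

    # Sort first so that ties are broken deterministically by name.
    return max(sorted(candidates), key=score)
```

With these assumed scores, video for a metropolitan audience goes to one CDN and static assets for a rural audience go to the other, and removing a CDN from the `healthy` set sends all traffic to whichever one remains.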
Given the critical role DNS plays in connectivity, having a contingency plan to implement in the event of a DNS outage is crucial for resilience. This requires two separate DNS networks that are fully autonomous, such that the complete failure of the primary DNS (for instance, in the event of a DDoS attack) allows the team to dynamically shift traffic to the secondary DNS instead of going offline entirely. It’s also helpful to implement anycast technology for diverting traffic around partial outages.
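The failover decision itself can be sketched as a small state machine. The example below is an assumption-laden illustration, not a description of any particular DNS product: the `DnsFailover` class and its thresholds are hypothetical, and how a health check is performed or how traffic is actually shifted between providers is left outside the sketch.

```python
class DnsFailover:
    """Track primary DNS health checks and decide which network is active.

    Fails over after `fail_threshold` consecutive failed checks and fails
    back after `recover_threshold` consecutive successful checks, so that
    a single flapping check does not move traffic back and forth.
    """

    def __init__(self, fail_threshold=3, recover_threshold=5):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.active = "primary"
        self._consecutive_failures = 0
        self._consecutive_successes = 0

    def record(self, primary_healthy):
        """Record one health-check result and return the active network."""
        if primary_healthy:
            self._consecutive_failures = 0
            self._consecutive_successes += 1
            if (self.active == "secondary"
                    and self._consecutive_successes >= self.recover_threshold):
                self.active = "primary"
        else:
            self._consecutive_successes = 0
            self._consecutive_failures += 1
            if (self.active == "primary"
                    and self._consecutive_failures >= self.fail_threshold):
                self.active = "secondary"
        return self.active
```

Requiring several consecutive results before switching in either direction is a deliberate design choice here: it trades a slightly slower reaction for protection against moving traffic on a single bad probe.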
If you anticipate using infrastructure from multiple providers, you will need to map out resource usage well in advance. Establishing minimum usage commits should happen as soon as possible since they play a key role in determining how best to distribute traffic amongst your infrastructure.
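As a rough illustration of how minimum commits can shape traffic distribution, the sketch below first fills each provider's commit and then spreads any remaining expected traffic by a preference weight. The `allocate_traffic` function, the provider names, and the `min_commit` and `weight` fields are hypothetical, and real contracts often include pricing tiers and burst terms that this ignores.

```python
def allocate_traffic(total_tb, providers):
    """Split expected traffic (in TB) across providers.

    `providers` maps a provider name to a dict with:
      - "min_commit": traffic already paid for under the contract
      - "weight": relative preference for traffic beyond the commits
    """
    committed = sum(p["min_commit"] for p in providers.values())

    if committed > 0 and total_tb <= committed:
        # Expected traffic does not even cover the commits, so share it
        # in proportion to what has already been paid for.
        return {
            name: total_tb * p["min_commit"] / committed
            for name, p in providers.items()
        }

    total_weight = sum(p["weight"] for p in providers.values())
    if total_weight <= 0:
        raise ValueError("at least one provider needs a positive weight")

    # Fill every commit first, then distribute the overflow by weight.
    overflow = total_tb - committed
    return {
        name: p["min_commit"] + overflow * p["weight"] / total_weight
        for name, p in providers.items()
    }
```

For example, with 100 TB expected, a 30 TB commit on one provider and a 10 TB commit on a second provider that is weighted three times as heavily for overflow, the first provider would carry 45 TB and the second 55 TB.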
3) Rely on deep analytics and open communication for day-of challenges
No matter what preparations a networking team makes, complications are inevitable across every industry and type of event, whether that is a championship stream or the final day before a college's applications are due. A seldom-used peripheral domain might suddenly see traffic, which could indicate a DDoS attack. Even so, unexpected challenges should not be read as a failure on any team's part; they are simply an unavoidable reality.
A critical component of handling disruptions is being able to derive deep analytics from your DNS, cloud, and on-premises server data in order to identify misconfigurations and potential attacks. The more you can derive from your data, and the faster you can do it, the more quickly you can resolve complications. Establishing open lines of communication also lays the groundwork for effective troubleshooting. By asking for frequent (perhaps hourly) updates and having ways to quickly escalate major issues to the people who need to resolve them, companies gain a record of everything that has taken place and can pinpoint problems quickly.
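One simple form this kind of analysis can take is comparing each domain's current query volume against its own recent baseline. The sketch below is a minimal example under stated assumptions: the `find_traffic_anomalies` function, the shape of the input data, and the threshold of three standard deviations are all illustrative choices, and a production system would likely use richer signals than raw counts.

```python
from statistics import mean, pstdev


def find_traffic_anomalies(history, current, threshold=3.0, min_stddev=1.0):
    """Return domains whose current query count is far above their baseline.

    `history` maps a domain to a list of past per-interval query counts.
    `current` maps a domain to the count for the latest interval.
    A domain is flagged when its current count exceeds
    mean + threshold * stddev of its history. `min_stddev` keeps very
    quiet domains, whose deviation is near zero, from being flagged by
    tiny fluctuations.
    """
    anomalies = []
    for domain, count in sorted(current.items()):
        samples = history.get(domain, [])
        if len(samples) < 2:
            continue  # Not enough baseline data to judge this domain.
        limit = mean(samples) + threshold * max(pstdev(samples), min_stddev)
        if count > limit:
            anomalies.append(domain)
    return anomalies
```

Because every domain is judged against its own history, a seldom-used domain that jumps from a handful of queries to several hundred stands out, while a busy domain with ordinary variation does not.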
No matter how overwhelming running an event or application may seem, networking teams that implement these best practices can face tremendous spikes and fluctuations in traffic with confidence. A coordinated team knows exactly to whom to turn for any given issue, while resilient architecture and deep analytics make it possible to respond to any disruption quickly while preserving a great user experience.
Jodell Morency is a senior solutions engineer at NS1, an IBM Company.