Network Performance Monitoring using flow data (NetFlow) is an approach to isolating the root cause of performance issues related to network traffic by measuring a set of characteristics across layers L2-L7.
There are three basic metrics for diagnosing performance issues: round trip time, server response time, and jitter. A poor value in any of them can contribute to low performance and downtime. Let’s examine each one.
1. Round trip time
Also called network delay, round trip time (RTT) is the time a packet takes to travel from client to server and back. It is a single value that models the performance of the network itself, calculated by observing the time needed to establish a TCP session. A typical value within a single enterprise location is less than 1 ms (even tens of microseconds), as the traffic stays on the local network. An application has no impact on the TCP handshake, as the handshake is handled by the TCP/IP stack implemented in the operating system itself. Only an operating system malfunction could influence this metric, which is very unlikely in practice. Here are some typical root causes of network delays.
Overload of network devices: High packet rates fill the buffers in network devices, where packets have to wait to be dispatched. QoS can help to prioritise critical services to a certain extent, but a DDoS attack may still lead to network congestion and increased RTT values.
Clients working from remote locations: Complaints about slow application responses do not always point to the application. With an RTT of 500 ms when connecting from home through a VPN to a company data centre, the round trip alone takes half a second, so any application will look slow from the user’s perspective.
Cloud applications: To lower the delay, SaaS providers use CDNs and proxy servers to host the application as close to customers as possible. For the same reason large companies purchase dedicated lines to connect their infrastructure directly to cloud providers.
Ethernet vs. Wi-Fi: In my practical experience, the usual performance difference between a wired Ethernet connection and Wi-Fi is around 10 ms. So 10 ms is the average penalty you pay for going through Wi-Fi instead of a wired Ethernet connection, and that is still under ideal conditions.
Performance bottleneck caused by heterogeneous port speeds: Imagine a 10G backbone while servers are connected through 1G, especially when multiple servers share such a 1G uplink. Numerous clients can easily generate traffic that spikes above the 1G port capacity, saturating switch buffers, which leads to packet drops. Such packets need to be retransmitted, and consequently users experience a network delay.
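The handshake-based RTT measurement described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration, not how any particular flow probe implements it. It assumes a probe sitting between client and server that timestamps the three handshake packets; the names `HandshakeTimestamps` and `round_trip_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HandshakeTimestamps:
    """Times (in seconds) at which the probe observed each handshake packet."""
    syn: float      # client -> server SYN
    syn_ack: float  # server -> client SYN-ACK
    ack: float      # client -> server ACK


def round_trip_time(ts: HandshakeTimestamps) -> dict:
    """Split the handshake into server-side and client-side delay.

    Assumption: the probe sees all three packets of the same TCP session.
    """
    server_side = ts.syn_ack - ts.syn   # probe -> server -> probe
    client_side = ts.ack - ts.syn_ack   # probe -> client -> probe
    return {
        "server_side": server_side,
        "client_side": client_side,
        "rtt": server_side + client_side,
    }


# Example with made-up timestamps.
result = round_trip_time(HandshakeTimestamps(syn=10.0, syn_ack=10.25, ack=10.75))
print(result["rtt"])
```

Splitting the value into a client-side and a server-side part is a design choice of this sketch; it helps to tell whether the delay sits between the probe and the client (for example a VPN or Wi-Fi link) or between the probe and the server.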
2. Server response time
This metric represents the request processing time on the server side, and therefore the delay caused by the application itself. The measured server response time (SRT) is the time difference between the predicted observation time of the server's ACK packet (a prediction based on the observation time of the client request and the previously measured RTT value) and the actual observation time of the server's response. The measurement can't rely on observing a standalone ACK packet from the server, since the ACK might be merged with the server’s response.
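As a rough sketch of that calculation, assuming the probe already holds an RTT value for the connection, the SRT is the gap between the predicted ACK time and the observed response time. The function name and parameters below are hypothetical, and clamping negative results to zero is an assumption of this sketch.

```python
def server_response_time(request_seen: float, response_seen: float, rtt: float) -> float:
    """Estimate SRT in seconds from probe observation times.

    request_seen  -- when the probe observed the client request
    response_seen -- when the probe observed the server's response
    rtt           -- previously measured round trip time for this connection
    """
    # Predicted time at which the server's ACK would be observed
    # if the server needed no processing time at all.
    predicted_ack = request_seen + rtt
    # Anything beyond the predicted ACK time is attributed to the server.
    # Clamp at zero in case the stored RTT value is stale and too large.
    return max(0.0, response_seen - predicted_ack)


# Example with made-up timestamps.
srt = server_response_time(request_seen=2.0, response_seen=2.75, rtt=0.25)
print(srt)
```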
SRT enables performance measurement of the whole application, per application server, per client network range, or even per individual client. This makes it possible to find correlations between application performance and the number of clients or a specific time of day. Using this metric together with RTT answers the ultimate question: is it a network issue or an application issue?
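One simple way to turn the two metrics into an answer is to compare each against its own baseline. The sketch below is only an illustration: the function `classify_issue`, the baseline values, and the factor of 2 are all assumptions, not thresholds taken from any tool.

```python
def classify_issue(rtt: float, srt: float,
                   rtt_baseline: float, srt_baseline: float,
                   factor: float = 2.0) -> str:
    """Return 'network', 'application', 'both', or 'ok'.

    A metric counts as degraded when it exceeds its baseline by the
    given factor (an arbitrary threshold chosen for this sketch).
    """
    network_slow = rtt > factor * rtt_baseline
    application_slow = srt > factor * srt_baseline
    if network_slow and application_slow:
        return "both"
    if network_slow:
        return "network"
    if application_slow:
        return "application"
    return "ok"


# Remote user over VPN: RTT is far above baseline, SRT is normal.
print(classify_issue(rtt=0.5, srt=0.05, rtt_baseline=0.01, srt_baseline=0.1))
```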
3. Jitter - variance of delay between packets
Jitter reveals irregularities in packet flow by calculating the variance of the individual delays between packets. In an ideal case, the delay between individual packets is constant, which means that jitter is 0. In reality, a jitter value of 0 doesn’t occur, as a variety of factors influence the data stream. Why should we measure jitter anyway? Jitter is most valuable for assessing the quality of real-time applications, such as conference calls and video streaming. But it is useful elsewhere too: when downloading, e.g., a Linux distribution ISO file from a mirror, jitter may indicate an unstable network connection.
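Following the definition used here (variance of the delays between consecutive packets), jitter can be sketched as below. The function name `jitter` and the sample arrival times are made up for illustration, and the use of population variance is an assumption of this sketch.

```python
from statistics import pvariance


def jitter(arrival_times: list[float]) -> float:
    """Variance of the gaps between consecutive packet arrival times."""
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        # Fewer than two packets: no inter-packet delay to measure.
        return 0.0
    return pvariance(gaps)


# Perfectly regular stream: every gap is 1.0, so jitter is 0.
print(jitter([0.0, 1.0, 2.0, 3.0]))
# Irregular stream: gaps are 1.0, 2.0, 1.0, so jitter is greater than 0.
print(jitter([0.0, 1.0, 3.0, 4.0]))
```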
Summary
Continuous monitoring and baselining of network performance metrics using flow data helps network administrators to identify issues in the network itself, in specific connections, or in applications. It is valuable to reveal problems before users notice them and to prevent complaints about performance degradation. Long-term monitoring of network performance metrics (RTT, SRT, jitter) can help to predict future needs (capacity planning) and incidents.
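Baselining can be as simple as comparing a new measurement with the history of the same metric. The sketch below flags a value that lies more than `k` standard deviations above the historical mean. Both the function `exceeds_baseline` and the default `k = 3` are assumptions chosen for illustration; real monitoring tools may baseline differently.

```python
from statistics import mean, stdev


def exceeds_baseline(history: list[float], current: float, k: float = 3.0) -> bool:
    """Return True if `current` is more than k standard deviations above the mean."""
    if len(history) < 2:
        # Not enough data to estimate a spread, so nothing is flagged.
        return False
    return current > mean(history) + k * stdev(history)


# Made-up RTT history in milliseconds.
rtt_history_ms = [10.0, 11.0, 9.0, 10.0, 10.0]
print(exceeds_baseline(rtt_history_ms, 25.0))  # far above the usual range
print(exceeds_baseline(rtt_history_ms, 11.0))  # within the usual range
```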
Network performance monitoring metrics can considerably help to improve the performance of the network and can also contribute to improvements on the application side.