In a recent post we discussed how we have been adding resilience to our network.
The strength of the Internet is its ability to interconnect all sorts of networks — big data centers, e-commerce websites at small hosting companies, Internet Service Providers (ISP), and Content Delivery Networks (CDN) — just to name a few. These networks are either interconnected with each other directly using a dedicated physical fiber cable, through a common interconnection platform called an Internet Exchange (IXP), or they can even talk to each other by simply being on the Internet connected through intermediaries called transit providers.
The Internet is like the network of roads across a country and navigating roads means answering questions like “How do I get from Atlanta to Boise?” The Internet equivalent of that question is asking how to reach one network from another. For example, as you are reading this on the Cloudflare blog, your web browser is connected to your ISP and packets from your computer found their way across the Internet to Cloudflare’s blog server.
BGP allows interconnections between networks to change dynamically. It provides an administrative protocol to exchange routes between networks, and allows for withdrawals in the case that a path is no longer viable (when some route no longer works).
The Internet has become such a complex set of tangled fibers, neighboring routers, and millions of servers that you can be certain there is a server failing or a optical fibre being damaged at any moment. Whether it’s in a datacenter, a trench next to a railroad, or at the bottom of the ocean. The reality is that the Internet is in a constant state of flux as connections break and are fixed; it’s incredible strength is that it operates in the face of the real world where conditions constantly change.
While BGP is the cornerstone of Internet routing, it does not provide first class mechanisms to automatically deal with these events, nor does it provide tools to manage quality of service in general.
Although BGP is able to handle the coming and going of networks with grace, it wasn’t designed to deal with Internet brownouts. One common problem is that a connection enters a state where it hasn’t failed, but isn’t working correctly either. This usually presents itself as packet loss: packets enter a connection and never arrive at their destination. The only solution to these brownouts is active, continuous monitoring of the health of the Internet.
Again, the metaphor of a system of roads is useful. A printed map may tell you the route from one city to another, but it won’t tell you where there’s a traffic jam. However, modern GPS applications such as Waze can tell you which roads are congested and which are clear. Similarly, Internet monitoring shows which parts of the Internet are blocked or losing packets and which are working well.
At Cloudflare we decided to deploy our own mechanisms to react to unpredictable events causing these brownouts. While most events do not fall under our jurisdiction — they are “external” to the Cloudflare network — we have to operate a reliable service by minimizing the impact of external events.
This is a journey of continual improvement, and it can be deconstructed into a few simple components:
- Building an exhaustive and consistent view of the quality of the Internet
- Building a detection and alerting mechanism on top of this view
- Building the automatic mitigation mechanisms to ensure the best reaction time
Monitoring the Internet
Having deployed our network in a hundred locations worldwide, we are in a unique position to monitor the quality of the Internet from a wide variety of locations. To do this, we are leveraging the probing capabilities of our network hardware and have added some extra tools that we’ve built.
By collecting data from thousands of automatically deployed probes, we have a real-time view of the Internet’s infrastructure: packet loss in any of our transit provider’s backbones, packet loss on Internet Exchanges, or packet loss between continents. It is salutary to watch this real-time view over time and realize how often parts of the Internet fail and how resilient the overall network is.
Our monitoring data is stored in real-time in our metrics pipeline powered by a mix of open-source software: ZeroMQ, Prometheus and OpenTSDB. The data can then be queried and filtered on a single dashboard to give us a clear view of the state of a specific transit provider, or one specific PoP.
Above we can see a time-lapse of a transit provider having some packet loss issues.
Here we see a transit provider having some trouble on the US West Coast on October 28, 2016.
Building a Detection Mechanism
We didn’t want to stop here. Having a real-time map of Internet quality puts us in a great position to detect problems and create alerts as they unfold. We have defined a set of triggers that we know are indicative of a network issue, which allow us to quickly analyze and repair problems.
For example, 3% packet loss from Latin America to Asia is expected under normal Internet conditions and not something that would trigger an alert. However, 3% packet loss between two countries in Europe usually indicates a bigger and potentially more impactful problem, and thus will immediately trigger alerts for our Systems Reliability Engineering and Network Engineering teams to look into the issue.
Sitting between eyeball networks and content networks, it is easy for us to correlate this packet loss with various other metrics in our system, such as difficulty connecting to customer origin servers (which manifest as Cloudflare error 522) or a sudden decrease of traffic from a local ISP.
Receiving valuable and actionable alerts is great, however we were still facing the hard to compress time-to-reaction factor. Thankfully in our early years, we’ve learned a lot from DDoS attacks. We’ve learned how to detect and auto-mitigate most attacks with our efficient automated DDoS mitigation pipeline. So naturally we wondered if we could apply what we’ve learned from DDoS mitigation to these generic internet events? After all, they do exhibit the same characteristics: they’re unpredictable, they’re external to our network, and they can impact our service.
The next step was to correlate these alerts with automated actions. The actions should reflect what an on-call network engineer would have done given the same information. This includes running some important checks: is the packet loss really external to our network? Is the packet loss correlated to an actual impact? Do we currently have enough capacity to reroute the traffic? When all the stars align, we know we have a case to perform some action.
All that said, automating actions on network devices turns out to be more complicated than one would imagine. Without going into too much detail, we struggled to find a common language to talk to our equipment with because we’re a multi-vendor network. We decided to contribute to the brilliant open-source project Napalm, coupled it with the automation framework Salt, and and improved it to bring us the features we needed.
We wanted to be able to perform actions such as configuring probes, retrieving their data, and managing complex BGP neighbor configuration regardless of the network device a given PoP was using. With all these features put together into an automated system, we can see the impact of actions it has taken:
Here you can see one of our transit provider having a sudden problem in Hong-Kong. Our system automatically detects the fault and takes the necessary action, which is to disable this link for our routing.
Our system keeps improving every day, but it is already running at a high pace and making immediate adjustments across our network to optimize performance every single day.
Here we can see actions taken during 90 days of our mitigation bot.
The impact of this is that we’ve managed to make the Internet perform better for our customers and reduce the number of errors that they’d see if they weren’t using Cloudflare. One way to measure this is how often we’re unable to reach a customer’s origin. Sometimes origins are completely offline. However, we are increasingly at a point where if an origin is reachable we’ll find a path to reach it. You can see the effects of our improvements over the last year in the graph below.
While we keep improving this resiliency pipeline every day, we are looking forward to deploying some new technologies to streamline it further: Telemetry will permit a more real-time collection of our data by moving from a pull model to a push model, and new automation languages like OpenConfig will unify and simplify our communication with network devices. We look forward to deploying these improvements as soon as they are mature enough for us to release.
At Cloudflare our mission is to help build a better internet. The internet though, by its nature and size, is in constant flux — breaking down, being added to, and being repaired at almost any given moment — meaning services are often interrupted and traffic is slowed without warning. By enhancing the reliability and resiliency of this complex network of networks we think we are one step closer to fulfilling our mission and building a better internet.