On July 2, 2019, it became evident that the internet is extremely vulnerable to even small software glitches.
It was the second incident in less than ten days to cause a widespread outage. The first was a Verizon Border Gateway Protocol (BGP) routing issue that took down an enormous number of web pages. What the two incidents had in common was Cloudflare, one of the biggest companies in the content delivery network space, which was affected both times.
A week earlier, Cloudflare customers suffered a major outage when Verizon inadvertently rerouted IP packets after wrongly accepting a network misconfiguration from an internet service provider in Pennsylvania, USA. This time, the outage resulted from a single misconfigured rule in the Cloudflare Web Application Firewall (WAF), which drove up CPU usage across Cloudflare's network worldwide.
The incident happened at 13:42 UTC and lasted for 30 minutes. Visitors to Cloudflare-proxied domains received 502 errors due to the global outage across Cloudflare's network. This affected thousands of prominent webpages, including some big tech brands.
For instance, Facebook and its assets including WhatsApp and Instagram suffered outages relating to image display for the majority of the day. The issue was due to varying timestamp data fed to the social media giant's CDN in some image tags. Facebook also displayed varying timestamp arguments embedded in the same image URLs.
so that cloudflare outage was a caused by a single regex rule deployed globally in one go? pic.twitter.com/xws5kQZ59K
— mjos\dwez (@mjos_crypto) July 2, 2019
The outage began while Cloudflare was deploying new WAF Managed Rules. The rules were being rolled out in a simulated mode, in which issues are identified and logged according to the new rules but no customer traffic is blocked. Cloudflare does this so it can measure false positive rates and ensure that new rules do not cause problems when they are deployed into full production. But things didn't go according to plan, because one of the new rules contained a regular expression that caused all the havoc.
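The simulated mode described above can be sketched roughly as follows. This is an illustration only, not Cloudflare's implementation: the `Rule` and `Waf` classes, the `mode` field and the example pattern are all hypothetical names chosen for this sketch. The point it makes is that a rule in simulated mode still has to be evaluated against every request in order to be logged, so an expensive regular expression costs CPU even when nothing is blocked.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    pattern: re.Pattern
    # Hypothetical modes: "simulate" only logs a match, "block" rejects the request.
    mode: str = "simulate"


@dataclass
class Waf:
    rules: list
    log: list = field(default_factory=list)

    def inspect(self, request_body: str) -> bool:
        """Return True if the request is allowed through."""
        allowed = True
        for rule in self.rules:
            # The regex runs in BOTH modes; only the consequence differs.
            if rule.pattern.search(request_body):
                self.log.append((rule.rule_id, rule.mode))
                if rule.mode == "block":
                    allowed = False
        return allowed


if __name__ == "__main__":
    waf = Waf([Rule("new-rule-1", re.compile(r"<script", re.IGNORECASE))])
    print(waf.inspect("<script>alert(1)</script>"), waf.log)
```

Under these assumptions, a match by a simulated rule is recorded in the log and the request is still served, which is what allows false positive rates to be measured before a rule is switched to blocking.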
"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%," wrote John Graham-Cumming, CTO of Cloudflare in the company's official blog.
According to Cloudflare, the CPU exhaustion event was unprecedented: the company had never before experienced CPU exhaustion on a global scale. After identifying the real cause of the issue, Cloudflare pulled the plug on the new WAF Managed Rules, which immediately brought CPU usage back to normal and restored regular web traffic.
There was also speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, etc., which CTO John Graham-Cumming denied in a tweet.
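How a single regular expression can exhaust a CPU is easiest to see with a small experiment. The sketch below uses a textbook pattern with nested quantifiers, `(a+)+$`, which is an assumption made for illustration and is not the rule Cloudflare deployed. On input that almost matches, a backtracking regex engine such as Python's `re` tries an exponentially growing number of ways to split the input before giving up.

```python
import re
import time

# Hypothetical pattern, NOT Cloudflare's rule: the nested quantifiers
# force the engine to backtrack when the overall match fails.
EVIL = re.compile(r"(a+)+$")


def time_match(n: int) -> float:
    """Time a failing match against n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing 'b' means the pattern can never match
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work; keep n small.
    for n in (10, 14, 18, 20):
        print(f"n={n:2d}  {time_match(n):.4f}s")
```

The exact timings depend on the machine, but the growth with input length should be clearly visible. A rule with this kind of behavior, evaluated on every request, is one plausible way for CPU usage to reach 100%.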
I've seen a bunch of speculation that today's @Cloudflare outage was caused by a DDoS from China, Iran, North Korea, etc. etc.
It was not an attack by anyone from anywhere.
— John Graham-Cumming (@jgrahamc) July 2, 2019
“This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels,” according to Graham-Cumming.
In its SLA, Cloudflare guarantees 100% uptime and 100% delivery of content, so the incident put the company in breach of that SLA, even though the failure was unintentional. The outage could have been prevented had the company planned the rollout of its new WAF Managed Rules better. "We built Cloudflare with a mission of helping build a better Internet and, this morning, we didn't live up to that," Cloudflare CEO Matthew Prince told DCD.