What we know so far about Facebook's massive outage

Facebook's massive outage began at around 12:00 am Philippine time on Tuesday, October 5. For several hours, Facebook, Instagram, and WhatsApp were unavailable, with the company scrambling to restore access for users around the world.

Reports indicate that the cause of the outage was internal. Facebook's own staff were reportedly unable to enter company buildings because their badges stopped working, which delayed a permanent fix: the people best placed to solve the problem could not get into their buildings.

Here's what's known so far.

Border Gateway Protocol: How to get to Facebook

Web infrastructure and security company Cloudflare released a blog post by Tom Strickx and Celso Martinho explaining the likely scenario: an attempt to update Facebook's Border Gateway Protocol (BGP) routing may have gone wrong.

BGP works like a shared listing of routes someone can take to get to a given destination. Internet service providers exchange BGP information, allowing everyone to know which providers can route traffic to a given location on the internet.

If I wanted to go to "Facebook Street," for example, there'd be quick routes, and alternate routes that would take slightly more time, but would still get me to where I'm going. BGP is the mechanism by which those routes are made known.

Cloudflare said BGP updates inform routers of any changes a network makes to the routes it advertises, in effect revising these maps.

Facebook's BGP updates don't happen too often. Cloudflare says its view of that activity "is fairly quiet: Facebook doesn’t make a lot of changes to its network minute to minute."

Now, imagine if the map of those routes disappeared and you didn't have any way of knowing where to go. A computer or router would be stuck wondering what happened to the location it thought was previously visible on its maps.
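
The map analogy above can be sketched in a few lines of Python. This is a toy model, not real BGP: the class name, the "facebook-street" label, the provider names, and the hop counts are all made up for illustration, and real routers weigh far more than hop count when picking a path.

```python
# Toy model of route advertisement and withdrawal. This is NOT real BGP:
# the prefix label, provider names, and hop counts are hypothetical.

class RouteTable:
    def __init__(self):
        # destination prefix -> {neighbor name: path length in hops}
        self.routes = {}

    def announce(self, prefix, neighbor, hops):
        """Record that `neighbor` says it can reach `prefix` in `hops` hops."""
        self.routes.setdefault(prefix, {})[neighbor] = hops

    def withdraw(self, prefix, neighbor):
        """Forget a previously announced route, as a withdrawal would."""
        paths = self.routes.get(prefix, {})
        paths.pop(neighbor, None)
        if not paths:
            # Nobody advertises this destination any more.
            self.routes.pop(prefix, None)

    def best_route(self, prefix):
        """Return the neighbor with the shortest path, or None if no route is known."""
        paths = self.routes.get(prefix)
        if not paths:
            return None
        return min(paths, key=paths.get)


table = RouteTable()
prefix = "facebook-street"               # stand-in for an IP prefix
table.announce(prefix, "provider-a", 2)  # the quick route
table.announce(prefix, "provider-b", 5)  # a slower alternate route

print(table.best_route(prefix))  # provider-a

table.withdraw(prefix, "provider-a")
print(table.best_route(prefix))  # provider-b: the alternate still works

table.withdraw(prefix, "provider-b")
print(table.best_route(prefix))  # None: the destination is off the map
```

Once every route has been withdrawn, the lookup has nothing to return. That is the position routers were in when the routes to Facebook stopped being announced.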

Cloudflare went on to say that, "at around 15:40 UTC (11:40 pm, October 4, Philippine time) we saw a peak of routing changes from Facebook. That’s when the trouble began."

Cloudflare added, "At 1658 UTC (12:58 am, October 5, Philippine time) we noticed that Facebook had stopped announcing the routes to their DNS prefixes. That meant that, at least, Facebook’s DNS (Domain Name System) servers were unavailable. Because of this Cloudflare’s 1.1.1.1 DNS resolver could no longer respond to queries asking for the IP address of facebook.com or instagram.com."
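
The UTC-to-Philippine-time conversions in this article can be checked with a short Python sketch. It assumes the outage date of October 4, 2021 (UTC) and relies on Philippine time being a fixed eight hours ahead of UTC; the function name is just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Philippine time is UTC+8 with no daylight saving time.
PHT = timezone(timedelta(hours=8), name="PHT")

def utc_to_pht(year, month, day, hour, minute):
    """Convert a UTC wall-clock time to Philippine time."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(PHT)

# Times quoted in this article, assumed to fall on October 4, 2021 (UTC).
for hour, minute in [(15, 40), (16, 58), (21, 20)]:
    local = utc_to_pht(2021, 10, 4, hour, minute)
    print(f"{hour:02d}:{minute:02d} UTC -> {local:%Y-%m-%d %H:%M} PHT")
```

Any UTC time after 16:00 lands on the next calendar day in the Philippines, which is why the later timestamps are dated October 5.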

Stubbornness making the problem worse

The Domain Name System (DNS) was affected as well.

A DNS resolver – which translates domain names like Facebook.com into the numerical IP addresses that computers connect to – will try to get the necessary information from domain nameservers.

Should the nameservers be unreachable or unresponsive, the resolver returns a SERVFAIL response, and the browser shows the user an error message.
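
The failure mode described above can be sketched as a toy resolver in Python. This is not a real DNS implementation: the nameserver names and the IP address are placeholders, and the `resolve` function is hypothetical. The response codes 0 (no error) and 2 (server failure) are the ones DNS defines.

```python
# Toy model of a recursive resolver. Not real DNS: the nameserver names
# and the address below are placeholders chosen for illustration.

NOERROR = 0   # DNS response code for success
SERVFAIL = 2  # DNS response code for "server failure"

def resolve(domain, nameservers, reachable):
    """Ask each authoritative nameserver in turn; SERVFAIL if none answers.

    nameservers: dict mapping nameserver name -> {domain: IP address}
    reachable:   set of nameserver names the resolver can currently route to

    Simplification: a reachable server that lacks the record would really
    answer differently (NXDOMAIN); this sketch does not model that case.
    """
    for name, records in nameservers.items():
        if name not in reachable:
            continue  # no route to this nameserver, try the next one
        if domain in records:
            return NOERROR, records[domain]
    return SERVFAIL, None


nameservers = {
    "ns-a.example": {"facebook.com": "192.0.2.10"},  # documentation-range IP
    "ns-b.example": {"facebook.com": "192.0.2.10"},
}

# Normal day: the nameservers are reachable and the lookup succeeds.
print(resolve("facebook.com", nameservers, {"ns-a.example", "ns-b.example"}))

# Routes withdrawn: no nameserver can be reached, so the lookup fails.
print(resolve("facebook.com", nameservers, set()))
```

The records themselves never change between the two calls. Only reachability does, which mirrors how the loss of routes, not a change to DNS data, made the lookups fail.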

Attempts to get to Facebook.com – a domain name tied to a specific IP address – kept failing.

Humans, stubborn as we are, kept retrying, which increased traffic and latency. Cloudflare called it "a tsunami of additional DNS traffic."

"Because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms," Cloudflare wrote.
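
How retries multiply traffic can be shown with simple arithmetic. The Python sketch below uses made-up numbers and a hypothetical `total_queries` function; it is not Cloudflare's data and does not attempt to reproduce the 30x figure.

```python
def total_queries(baseline_queries, failing_share, retries_per_failure):
    """Estimate query volume when failed lookups are retried.

    baseline_queries:    lookups per second in normal conditions
    failing_share:       fraction of those lookups that fail (0.0 to 1.0)
    retries_per_failure: extra attempts made for each failed lookup
    """
    failing = baseline_queries * failing_share
    succeeding = baseline_queries - failing
    # Each failed lookup is sent once, then retried.
    return succeeding + failing * (1 + retries_per_failure)


baseline = 1000  # illustrative number, not a measurement

print(total_queries(baseline, 0.0, 5))  # 1000.0: nothing fails, no extra load
print(total_queries(baseline, 0.5, 5))  # 3500.0: half the lookups fail and are retried
```

In this model the extra load grows with both the share of failing lookups and the number of retries, so a very popular domain that fails for everyone at once produces a large spike.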

Facebook started coming back online at around 21:20 UTC (5:20 am, October 5, Philippine time).

"As of 21:28 UTC Facebook appears to be reconnected to the global Internet and DNS working again," Cloudflare wrote.

Facebook, after services were restored, explained in a blog post: "Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.

"This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."

Beyond this basic explanation, Facebook provided no further details about the configuration changes that caused the problem, such as which team or person made them and how the mistakes occurred.

The situation took longer to resolve partly because Facebook appears to rely on a centralized network or access system with no contingency for emergencies like this. – Rappler.com

Victor Barreiro Jr.

Victor Barreiro Jr is part of Rappler's Central Desk. An avid patron of role-playing games and science fiction and fantasy shows, he also yearns to do good in the world, and hopes his work with Rappler helps to increase the good that's out there.
