
SRE Case Study: Mysterious Traffic Imbalance

[Category: System (Windows) | Publisher: 店小二04 | Date: 2018 | Author: 红领巾]

Once upon a time there was a website. I am going to call it foo.com, but the name doesn't really matter. Feel free to replace it with any name that sounds better to you.

Foo.com had two data centers, Miami and Denver, running in active-active mode for business continuity and disaster recovery. Web traffic was evenly distributed between the two data centers by round robin DNS.

If you haven't heard of round robin DNS, the way it works is quite simple. Because foo.com runs in two data centers, the foo.com name is registered with two IP addresses in the Domain Name System (DNS). The IP address in Miami is 100.100.100.100, and the IP address in Denver is 200.200.200.200. When clients browse foo.com, the first thing they do is resolve the name into IP addresses. For each name resolution request, the DNS server returns the two IP addresses in alternating order. For instance:

The first client asks: what are the IP addresses of foo.com?
DNS answers: The IP addresses of foo.com are 100.100.100.100 and 200.200.200.200.
The second client asks: what are the IP addresses of foo.com?
DNS answers: The IP addresses of foo.com are 200.200.200.200 and 100.100.100.100.

Each client selects the first IP address in the response, so the first client talks to 100.100.100.100 in Miami, the second client talks to 200.200.200.200 in Denver, and so on. With millions of clients, the end result is that Miami and Denver receive approximately the same amount of traffic.
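The rotation described above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular DNS server is implemented: the addresses come from the article, while `make_round_robin_dns` and `simulate` are hypothetical names invented for the example.

```python
from collections import Counter
from itertools import count

# Addresses taken from the article.
MIAMI = "100.100.100.100"
DENVER = "200.200.200.200"
RECORDS = [MIAMI, DENVER]


def make_round_robin_dns(records):
    """Return a resolver that rotates the record list by one on every query."""
    queries = count()

    def resolve(name):
        shift = next(queries) % len(records)
        return records[shift:] + records[:shift]

    return resolve


def simulate(num_clients):
    """Count how many clients land on each address when all pick the first one."""
    resolve = make_round_robin_dns(RECORDS)
    hits = Counter()
    for _ in range(num_clients):
        answer = resolve("foo.com")
        hits[answer[0]] += 1  # the client connects to the first address returned
    return hits


if __name__ == "__main__":
    print(simulate(1_000_000))
```

As long as every client takes the first address in the answer, the split stays even no matter how many clients there are.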

It had been working like this for many years until mid-2007, when the Site Reliability Engineering (SRE) team noticed that Denver had started getting slightly more traffic than Miami. The discrepancy was under 1%, which wasn't significant enough to cause any impact. It just seemed strange because it had never happened before, so the SRE team opened a case and started to monitor the traffic distribution more closely.

After several weeks of monitoring, the team observed a clear trend: user traffic was shifting to Denver slowly and steadily, from 1% to 2% to 3%. At this point, the severity level of the case was raised and more engineers were brought in to figure out the root cause.

The team identified the related components in the data flow and checked all of them.

They verified that the DNS systems did return the IP addresses in round robin fashion.
They verified that none of the major Internet service providers was having a significant outage.
They analyzed the traffic in Denver and Miami to see if the extra traffic came from a specific Internet service provider, a specific country, or a specific URL, but nothing stood out.
They checked whether the report generating system was working properly, and confirmed that the report was accurate and the system wasn't missing any data.

While the troubleshooting activities were taking place, the discrepancy kept growing slowly and steadily, from 3% to 5% to 10% in several weeks. A 10% traffic imbalance wasn't a problem by itself. The website was designed to absorb a much larger discrepancy. The problem was that the reason for the discrepancy remained a mystery. Such a clear and growing pattern without a clear cause was very strange. The severity level was raised again, the team was still in the dark, and everyone started to feel the pressure.

The first ray of light arrived two months later, when one of the engineers noticed that most of the extra traffic in Denver came from IE7 (identified by the User-Agent header of the HTTP requests, in case you are curious). This version of IE7 was only available in Windows Vista at that time, and Windows Vista was released right before the initial report of the traffic imbalance.
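A breakdown like the one that exposed the pattern could be sketched as follows. The article does not say how the team's logs were structured, so the `(data_center, user_agent)` pair format, the function names, and the sample records are all assumptions made for illustration. The only facts relied on are that IE7 sends "MSIE 7.0" in its User-Agent string and that Windows Vista reports itself as "Windows NT 6.0".

```python
from collections import Counter


def is_ie7_on_vista(user_agent):
    """True if the User-Agent string looks like IE7 running on Windows Vista."""
    return "MSIE 7.0" in user_agent and "Windows NT 6.0" in user_agent


def tally(requests):
    """Count requests per (data center, browser label).

    `requests` is an iterable of (data_center, user_agent) pairs,
    a hypothetical log format chosen for this sketch.
    """
    counts = Counter()
    for data_center, user_agent in requests:
        label = "IE7/Vista" if is_ie7_on_vista(user_agent) else "other"
        counts[(data_center, label)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, only to show the shape of the input.
    sample = [
        ("denver", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"),
        ("denver", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
        ("miami", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ]
    for key, n in sorted(tally(sample).items()):
        print(key, n)
```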

So the question became: why does Windows Vista prefer Denver?

The reason was still unknown, but the team felt relieved, because they knew the rest of the troubleshooting would be easy and straightforward. Why? SRE veterans know that the most challenging phase of troubleshooting is when there is no clue. Having no clue means they haven't yet collected enough data, so they must keep digging wider and wider, which is time consuming and difficult, especially under time pressure. As soon as they find a clue pointing in a certain direction, digging 100 feet deep in that direction is much easier than turning an acre of land upside down.

As in a Sherlock Holmes story, the second half is the explanation.

In 2003, Microsoft proposed RFC 3484 and later adopted it in Windows Vista. Among its rules, RFC 3484 defined a "longest matching prefix" method for a client machine to select the server IP address from a round robin DNS response. Taking foo.com as an example, let's say a client whose IP address is 150.150.150.150 talks to foo.com. It asks the DNS server to resolve foo.com into IP addresses. The DNS server returns two IP addresses, 100.100.100.100 and 200.200.200.200. Instead of selecting the first IP address, the client uses the following procedure to decide which foo.com IP it should connect to:

a) Convert the IP addresses from decimal to binary (e.g. 100 = 01100100, 150 = 10010110, 200 = 11001000)

Client IP = 150.150.150.150 = 10010110 . 10010110 . 10010110 . 10010110
foo IP 1  = 100.100.100.100 = 01100100 . 01100100 . 01100100 . 01100100
foo IP 2  = 200.200.200.200 = 11001000 . 11001000 . 11001000 . 11001000

b) From left to right, compare the binary string of the client IP with foo IP 1 and count the matching bits until the first non-matching bit is reached. The first bit of the client IP is "1" and the first bit of foo IP 1 is "0", so the length of the matching prefix is 0 (no matching bits at all).

c) In the same way, compare the binary string of the client IP with foo IP 2. The first bit matches (it is 1 in both the client IP and foo IP 2), but the second bit does not (it is 0 in the client IP and 1 in foo IP 2), so the length of the matching prefix is 1 (only the first bit matches).

d) foo IP 2 is selected because it has a longer matching prefix than foo IP 1 (1 vs 0).
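Steps a) through d) can be sketched in Python. This is a minimal illustration of the prefix comparison only. RFC 3484 contains other selection rules that are not modeled here, and the function names `common_prefix_len` and `pick_server` are hypothetical.

```python
import ipaddress


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses (0 to 32)."""
    # XOR leaves a 1 wherever the two addresses differ; the position of the
    # highest 1 bit tells us where the first mismatch is.
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def pick_server(client, servers):
    """Pick the server sharing the longest prefix with the client.

    On a tie, max() returns the earliest candidate, so DNS order is kept.
    """
    return max(servers, key=lambda s: common_prefix_len(client, s))


if __name__ == "__main__":
    client = "150.150.150.150"
    servers = ["100.100.100.100", "200.200.200.200"]
    for s in servers:
        print(s, common_prefix_len(client, s))
    print("selected:", pick_server(client, servers))
```

Note that the order of `servers` no longer matters once one candidate has a strictly longer prefix, which is exactly why round robin DNS stops balancing traffic for such clients.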

Around the same time that RFC 3484 was proposed, two other technologies were becoming popular: broadband Internet and 802.11 Wi-Fi. More and more households switched to cable or DSL and set up a wireless router for their home Internet access. Most wireless routers (such as Linksys or D-Link) were designed to assign addresses from the 192.168.0.0 to 192.168.255.255 private IP range to home computers.

Those events were unrelated, until January 2007 when Windows Vista was released.

Let's see what happened when Windows Vista users connected to foo.com via their wireless routers at home:

Client IP = 192.168.100.100 = 11000000 . 10101000 . 01100100 . 01100100
foo IP 1  = 100.100.100.100 = 01100100 . 01100100 . 01100100 . 01100100
foo IP 2  = 200.200.200.200 = 11001000 . 11001000 . 11001000 . 11001000

Comparing the client IP with foo IP 1, the length of the matching prefix is 0. Comparing the client IP with foo IP 2, the length of the matching prefix is 4 (the first four bits, 1100, match; the fifth bit is 0 in the client IP and 1 in foo IP 2). So Windows Vista selected foo IP 2, which was in Denver. As time went on, more and more home Wi-Fi users upgraded to Windows Vista, so engineers at foo.com observed a growing traffic imbalance between their Denver and Miami data centers.
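The effect of growing Vista adoption can be illustrated with a small deterministic model. It is a sketch under stated assumptions, not a reconstruction of foo.com's real traffic: every `vista_every`-th client is assumed to be a Vista machine at 192.168.100.100 that applies only the prefix comparison, and all other clients take the first address in the DNS answer. The function and parameter names are hypothetical.

```python
import ipaddress
from collections import Counter

MIAMI, DENVER = "100.100.100.100", "200.200.200.200"
HOME_CLIENT = "192.168.100.100"  # a typical home-router address, as in the article


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def simulate(num_clients, vista_every):
    """Count hits per data center.

    Every `vista_every`-th client uses longest-prefix selection from
    HOME_CLIENT; pass 0 to model a world without such clients.
    """
    hits = Counter()
    for i in range(num_clients):
        # Round robin DNS: the answer order alternates between queries.
        answer = [MIAMI, DENVER] if i % 2 == 0 else [DENVER, MIAMI]
        if vista_every and i % vista_every == 0:
            choice = max(answer, key=lambda s: common_prefix_len(HOME_CLIENT, s))
        else:
            choice = answer[0]
        hits[choice] += 1
    return hits


if __name__ == "__main__":
    for every in (0, 20, 10, 5):
        print("vista_every =", every, dict(simulate(1000, every)))
```

In this model, each prefix-selecting client that would otherwise have been sent to Miami moves to Denver, so the gap widens in step with the share of such clients.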

Technically speaking, the "longest matching prefix" method may be helpful only if both the client and the server are on public IP addresses. It doesn't make any sense when the client is on a private IP address, because private IP addresses are not routable on the Internet and say nothing about the distance to any public IP address.

After the root cause was identified, the next step
