APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

  1. Trafficshifting: Avoiding Disasters & Improving Performance at Scale Michael Kehoe Staff Site Reliability Engineer LinkedIn
  2. 2 Overview • Problem Statement • Solution – How LinkedIn trafficshifts • Datacenter shifting • PoP steering • Challenges of APAC region • IPv4 vs IPv6 • Questions
  3. $ whoami 3 Michael Kehoe • Staff Site Reliability Engineer (SRE) @ LinkedIn • Production-SRE team • Funny accent = Australian + 3 years American
  4. $ whatis SRE 4 Michael Kehoe • Site Reliability Engineering • Operations for the production application environment • Responsibilities include • Architecture design • Capacity planning • Operations • Tooling • At LinkedIn, responsibilities also include DNS/ CDN management & traffic infrastructure
  5. 5 Terminology • PoP - Where LinkedIn terminates incoming requests. • Fabric – Datacenter with full LinkedIn production stack deployed • Loadtest – Stress test of a Fabric – to simulate a disaster scenario
  6. Disaster Recovery 6 Problem Statement • Fail between Fabrics • Performance of applications is degraded • Validate disaster recovery (DR) scenario • Expose bugs and suboptimal configurations via loadtest • Planned maintenance • Fail between PoP’s • Mitigate impact of a 3rd party provider maintenance/ failure (e.g. transport links) • Software/ Configuration Bugs
  7. Performance 7 Problem Statement • Fabric Assignment • Assign preferred and secondary fabric to all members based on: • Member location • Capacity • PoP/ CDN steering • Use GeoDNS to steer user to ‘best’ PoP • Use RUM DNS to steer users to ’best’ CDN
  8. United States Performance (Global) 8 Problem Statement
  9. APAC Performance (APAC cities) 9 Problem Statement
  10. Delta US & APAC 10 Problem Statement
  11. Site Speed 11 Problem Statement • Site Speed affects User Engagement • User Engagement affects page-views & transactions • Bottom Line: Site Speed has an impact on revenue
  12. LinkedIn’s Traffic Architecture 12 Solution
  13. LinkedIn’s Traffic Architecture 13 Solution
  14. Fabric shifting 14 Solution • Stickyrouting • Using a Hadoop job, we calculate a primary and secondary datacenter for the user based on location • This data is stored in a Key-Value store (Espresso) • Stickyrouting serves this information over a RESTful interface to our Edge PoP’s
  15. Fabric shifting 15 Solution • Different traffic types are partitioned and controlled separately • Logged-In vs Logged-out • CDN’s • Monitoring • Microsites • Logged-in users are placed into ‘buckets’ • Buckets are marked online/ offline to move site traffic
  16. Fabric shifting 16 Solution • Stickyrouting – Benefits • Ensure we serve the request as close to the user as possible • Capacity management for datacenters • We can assign a percentage of users to a datacenter • Enables personal data routing (PDR) • Only store data where we need it
  17. Fabric shifting Automation 17 Solution
  18. Fabric shifting Automation 18 Solution
  19. Fabric Shifting 19 Solution
  20. Fabric Shifting Load tests 20 Solution
  21. Fabric Shifting Loadtests 21 Solution
  22. LinkedIn’s Traffic Architecture 22 Solution
  23. LinkedIn’s PoP Distribution 23 Solution
  24. LinkedIn’s PoP Architecture 24 Solution • Using IPVS - Each PoP announces a unicast address and a regional anycast address • APAC, EU and NAMER anycast regions • Use GeoDNS to steer users to the ‘best’ PoP • DNS provides users with either an anycast or a unicast address • Traffic for US and EU members is nearly all anycast • APAC is all unicast
  25. LinkedIn’s PoP DR 25 Solution • Sometimes need to fail out of PoP’s • 3rd party provider issues (e.g. transit links going down) • Infrastructure maintenance • Withdraw anycast route announcements • Fail healthchecks on proxy to drain unicast traffic
  26. LinkedIn’s PoP Performance 26 Solution • PoP DNS Steering • LinkedIn currently uses GeoDNS for routing • Piloting RumDNS • Pick the best PoP based on network, not country • CDN Steering • Mix CDN’s to get best performance • Constantly evaluate performance/ availability • Automatically adjust CDN weighting
  27. LinkedIn’s PoP Performance 27 Solution US CDN request time 50th percentile 24 hours
  28. Working around fiber cuts 28 APAC Challenges • Case Study: Fail out of India PoP due to fiber cuts Connection Time for Indian members (90th percentile)
  29. GeoDNS Suboptimal PoP’s 29 APAC Challenges [Diagram: ASN 15802 and ASN 5384 with paths to the Mumbai and Singapore PoP’s; link latencies shown: 45 ms, 220 ms, 70 ms] ASN 15802 RTT to Singapore is 220 + 70 = 290 ms (all at 50th percentile)
  30. GeoDNS Suboptimal PoP’s 30 APAC Challenges [Diagram: ASN 15802 and ASN 5384 with paths to the London, Dublin, Hong Kong, Singapore and Mumbai PoP’s; link latencies shown: 160 ms, 45 ms, 70 ms, 35 ms, 350 ms, 160 ms]
  31. GeoDNS Suboptimal PoP’s 31 APAC Challenges [Chart: axis ticks from 600 to 1200]
  32. Performance & Adoption 32 IPv4 vs IPv6 • IPv6 performs better for our members • Fewer request time-outs on IPv6 for mobile users • Mobile carriers are adopting IPv6 faster • Win for LinkedIn and our members! • In July 2014 (IPv6 launch): 3% of traffic was IPv6 • Today: ~12% of traffic is IPv6
  33. Key Takeaways 33 Conclusion • Application level traffic engineering is extremely important for content providers • RUM data is extremely useful for finding anomalies • Route traffic based on performance, not just location • IPv6 performs better for LinkedIn users
  34. 34 Questions?
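
The bucket mechanism described on slides 14 to 16 can be sketched roughly as follows. This is only an illustration under stated assumptions, not LinkedIn's implementation: the hashing scheme, the bucket count, the function names and the fabric name `fabric_b` are all hypothetical (`ltx1` is the Texas datacenter mentioned in the notes). In the talk, the primary and secondary fabric come from the offline Hadoop job; here they are simply passed in.

```python
import hashlib

NUM_BUCKETS = 100  # assumed value; the talk does not give the real bucket count


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket (hypothetical hashing scheme)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def route(member_id: str, primary: str, secondary: str, offline: dict) -> str:
    """Pick the fabric for a request.

    `offline` maps a fabric name to the set of bucket numbers that are
    currently marked offline in that fabric.
    """
    bucket = bucket_for(member_id)
    if bucket not in offline.get(primary, set()):
        return primary
    if bucket not in offline.get(secondary, set()):
        return secondary
    raise RuntimeError(f"no online fabric for bucket {bucket}")


# Failing out of a fabric means marking every bucket offline there.
offline = {"ltx1": set(range(NUM_BUCKETS))}
print(route("member-1", "ltx1", "fabric_b", offline))
```

Marking only a subset of buckets offline moves a corresponding share of logged-in traffic, which is how a gradual shift or a loadtest ramp could be expressed in this model.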

Editor's Notes

  1. Good morning, my name is Michael Kehoe, and in this presentation I’m going to talk about how LinkedIn shifts traffic between its PoP’s and datacenters to avoid disaster and improve site performance at scale.
  2. So this morning I want to talk about the problem that we’re trying to solve, particularly in the context of APAC which is extremely challenging for internet companies Then we’ll deep-dive into how LinkedIn solves these problems to improve our availability and site performance. Specifically we’ll look at: Datacenter shifting PoP steering We’ll look at some of the challenges of operating in the APAC region, briefly talk about IPv6 adoption and then I’ll take questions
  3. So who am I? I’m a Staff Site Reliability Engineer (commonly referred to as SRE) at LinkedIn. I am on a team called Production-SRE. Our team charter includes: developing applications to improve MTTD and MTTR; building tools for efficient site issue troubleshooting, issue detection & correlation; and assisting in restoring stability to services during site-critical issues. Yes, I have a slightly strange accent; it’s Australian with three years of American.
  4. Site Reliability Engineering: a term coined by Ben Treynor from Google. You may also find it being called DevOps/ AppOps or Production Engineering. Skillset based on: sysadmin, network engineer, architect, troubleshooter, software engineer. Role consists of: architecture design; capacity planning; application operations (keeping the site healthy); writing automation and tooling. The SRE role/ philosophy differs between companies. At LinkedIn, SRE’s are responsible for DNS/ CDN management and traffic infrastructure.
  5. So before we deep-dive, let’s go over some terminology. PoP – Where LinkedIn terminates incoming requests to its datacenters. Spread geographically across the world. Fabric – Datacenter where the full LinkedIn application stack is deployed. LinkedIn has 3 datacenters in the US and one in Singapore. Loadtest – Where we stress test a Fabric to simulate a disaster.
  6. What are the use-cases for shifting traffic for Disaster Recovery purposes? Fabric: Performance of applications is degraded (site may be slow or users get errors). Validate disaster recovery (plan for disasters: natural/ infrastructure/ code). Expose code bugs and suboptimal configurations via loadtest (when the application infrastructure is under stress, it is easier to expose suboptimal configuration/ code). Planned maintenance (intrusive infrastructure maintenance that may cause impact). PoP: Transport provider maintenance (more common in Asia given the large number of submarine cables we utilize). Software bugs.
  7. So let’s look at the performance side of the equation. How can shifting traffic improve performance? Fabric: Members use the closest datacenter to them. Manage capacity of a datacenter. PoP: Steering users to the best possible PoP gives us a significant performance advantage. By measuring CDN availability/ performance using RUM (talk about RUM and how it works), we can speed up page-load time by 50%.
  8. **** NOTE: Move to excel and remove values *** Average page load time for countries using US Data-centers (measured by Catchpoint – All Major Metro Nodes around the world)
  9. Average page load time for countries using APAC Data-centres (measured by Catchpoint – Top 10 APAC metro nodes).
  10. Delta between US and APAC performance. Average is 2.5s
  11. LinkedIn has done extensive research on the impact site speed has on user engagement. From this research we know that slow page load times affect engagement and transactions. This in turn affects our revenue. This is important!
  12. So what does LinkedIn’s traffic architecture look like? DNS routes users to the ‘best’ PoP (more on that later). IPVS (IP Virtual Server, a Linux kernel module) announces Unicast and Anycast addresses and terminates TCP connections. ATS (Apache Traffic Server) terminates SSL sessions and proxies requests to datacenters. The Stickyrouting service (talk about in a minute) tells the PoP (specifically ATS) which datacenter/ fabric to send the request to. ATS in the datacenter proxies requests to frontend services.
  13. Let’s talk about stickyrouting and Fabric-Shifting
  14. We run an offline Hadoop job to calculate primary and secondary datacenters for users. Hadoop is a distributed computing framework that processes large datasets. We store this data in an in-house key-value store named Espresso. Stickyrouting serves this information over a RESTful interface to our Edge PoP’s.
  15. At LinkedIn, we partition our traffic into various classes so we can control them independently: Logged-in vs Logged-out; CDN traffic; Monitoring traffic; Microsites. Logged-in users get assigned to a bucket (an arbitrary partition). We then online/ offline buckets in a fabric to manipulate the distribution of traffic between fabrics.
  16. Benefits: Serve the request as close to the user as possible. Capacity management – ensure that datacenters aren’t overloaded. Personal data routing – lowers cost to serve.
  17. My team built the ‘TrafficShift’ app to help automate datacenter routing. We’ve automated fail-outs of datacenters. It also allows us to do automated load-testing of our datacenters.
  18. You can see, LTX1 (Texas datacenter) is failed out
  19. Example of failing out of East Coast Datacenter Top graph – Online buckets Bottom graph – Distribution of traffic
  20. Automation to validate DR. Tell the engine which datacenter to stress, how much traffic, and what time periods, and it will execute for us. The traffic engine watches our alerting system to ensure we do not negatively impact the member experience.
  21. Let’s talk about how users connect to LinkedIn’s PoP’s
  22. LinkedIn’s PoP locations Note that PoP in India is red – means it’s offline – talk about that further later
  23. Sometimes we need to fail out for 3rd party issues – remember the red dot on the PoP map. Steer users to the next-best PoP; in this case, India to Singapore. Note the slow traffic tail-off in TMU1 – DNS TTL’s not being honored. For Anycast traffic, we withdraw the prefix announcement. For Unicast, we fail the healthchecks that DNS providers use to check if we are serving from that site.
  24. Remember that red dot before. Sometimes by pure necessity, we need to fail out of PoP’s to mitigate impact or potential impact. In this case, move India traffic from India PoP to Singapore This does have an impact on client connect times and also page-load times.
  25. UAE has 2 ASNs and GeoDNS routes both to India. 5384 – That’s ok. 15802 – Not ok.
  26. RUM DNS recognizes optimal PoPs for ASN 15802. Two better paths: Hong Kong and London/ Dublin.
  27. Drop in connect time after the change
  28. IPv6 – performs up to 40% better. We’ve grown from 3% IPv6 traffic in July 2014 to over 12% today.
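
The difference between GeoDNS and RUM DNS in notes 25 to 27 comes down to choosing a PoP by measured latency per network rather than by country. The sketch below shows one way that selection could look, assuming RUM samples arrive as (ASN, PoP, RTT in ms) tuples. The function name and the sample values are made up for illustration (only the two ASNs and the roughly 290 ms Singapore figure echo the slides); it is not a description of LinkedIn's RUM DNS pilot.

```python
from collections import defaultdict
from statistics import median


def best_pop_per_asn(samples):
    """For each ASN, pick the PoP with the lowest median (50th percentile) RTT."""
    rtts = defaultdict(list)
    for asn, pop, rtt_ms in samples:
        rtts[(asn, pop)].append(rtt_ms)

    best = {}  # asn -> (pop, median RTT)
    for (asn, pop), values in rtts.items():
        p50 = median(values)
        if asn not in best or p50 < best[asn][1]:
            best[asn] = (pop, p50)
    return {asn: pop for asn, (pop, _) in best.items()}


# Illustrative samples only, not measurements from the talk.
samples = [
    (15802, "singapore", 290), (15802, "singapore", 300), (15802, "singapore", 280),
    (15802, "london", 160), (15802, "london", 170), (15802, "london", 150),
    (5384, "mumbai", 45), (5384, "mumbai", 50), (5384, "mumbai", 40),
]
print(best_pop_per_asn(samples))
```

Keying the decision on the ASN instead of the country is what lets two networks in the same country be steered to different PoP’s.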