A comprehensive overview of Google's architecture, from the search page all the way down to its internal networks.
By Ed Austin, talk given at Edinburgh Techmeetup in December 2009
http://techmeetup.co.uk
The Anatomy Of The Google Architecture Final v1.1
3. The Anatomy of the Google Architecture “The unofficial Version“ V1.0 November 2009 Ed Austin {ed, edik} @i-dot.com
4. Section I – The Basic Glue 1. Exterior Network (Perimeter Architecture) 2. Data Centre 3. Rack Characteristics 4. Core Server Hardware 5. Operating System Implementation 6. Interior Network Architecture
9. Worldwide Data Centres Where is Google Located? Latest estimates: 36 Data Centres, 300+ GFS II Clusters and upwards of 800K machines. US (#1) – Europe (#2) – Asia (#3) – South America/Russia (#4) Australia – on Hold Future: Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina.
10. The Modular Data Centre Standard Google Modular DC (Cell) holds 1160 Servers / 250KW Power Consumption in 30 racks (40U). This is the „Atomic“ Data Centre Building Block of Google. A Data Centre would consist of 100s of Modular Cells. DC architecture is then the aggregation of smaller Cell-level infrastructures, each in its own container – some being pure GFS, others Bigtable, others MapReduce, some mixed etc. MDCs can also be deployed autonomously at the Perimeter (stand-alone).
11. THE RACK How is a server stored in the Data Centre?
18. INTERIOR NETWORK ROUTING PROTOCOL Internal network is IPv6 (exterior machines can be reached over IPv6) Heavily Modified Version of OSPF as the IRP Intra-rack network is 100baseT Inter-rack network is 1000baseT Inter-DC network pipes unknown but very fast Technology: Juniper, Cisco, Foundry and HP routers and switches Software: ipvs (IP Virtual Server)
19. THE MAJOR GLUE The three foundation blocks of Google's Secret Sauce
20. Section II – Google's Major Glue 1. Google File System Architecture – GFS II 2. Google Database - Bigtable 3. Google Computation - Mapreduce 4. Google Scheduling - GWQ
21. GOOGLE FILE SYSTEM Manages the underlying Data on behalf of the upper layers and ultimately the applications
22. FILE SYSTEM I – GFS v1 The GFS II cell is Google's fundamental building block – everything can be layered on top of this. Consists of (highly distributed, Linux-based) Master Servers and Chunk Servers. Chunk Servers serve the Data in 64MB Chunks to the client directly, via Master arbitration. DATA REDUNDANCY/FAULT TOLERANCE? Triplicate Copies of Chunks are kept, often in other clusters / DCs. Chunks can be pulled from outside the DC! Expensive... and best avoided! However apps built on top of GFS/BT do this on an ad-hoc basis (e.g. Gmail). On Chunk loss the Master handles the Recovery by sourcing a chunk copy. Data is compressed using BMDiff/Zippy. Chunk Server Fault-Tolerance is achieved by a Heartbeat to the Master (I am alive..). Master Failure was problematic for Google (finally down from 2 minutes to 10 seconds)
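To make the Master / Chunk Server split above concrete, here is a minimal Python sketch of a GFS-style read path under the assumptions just described (metadata from the Master, 64MB chunk data straight from a Chunk Server). All names – Master, ChunkServer, client_read, the file path – are invented for illustration and are not Google's actual API.

# Sketch of a GFS-style read path (hypothetical names, not Google's API).
# The Master only arbitrates metadata; chunk bytes flow directly from Chunk Servers.
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks in GFS I (1MB in GFS II "Colossus")

class Master:
    """Maps (file, chunk index) to the locations of its (triplicated) replicas."""
    def __init__(self):
        self.chunk_table = {}   # (path, chunk_index) -> [chunk server names]

    def lookup(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_table[(path, chunk_index)]

class ChunkServer:
    """Serves raw chunk bytes; in real GFS it also heart-beats (I am alive..) to the Master."""
    def __init__(self, name, store):
        self.name, self.store = name, store

    def read(self, path, chunk_index, offset_in_chunk, length):
        return self.store[(path, chunk_index)][offset_in_chunk:offset_in_chunk + length]

def client_read(master, chunk_servers, path, offset, length):
    chunk_index, replicas = master.lookup(path, offset)      # 1. metadata from the Master
    server = chunk_servers[replicas[0]]                      # 2. pick a replica (ideally nearby)
    return server.read(path, chunk_index, offset % CHUNK_SIZE, length)  # 3. read directly

# Wire up a toy "cell" and read 16 bytes (illustration only).
master = Master()
master.chunk_table[("/crawl/part-00", 0)] = ["cs-a", "cs-b", "cs-c"]  # three copies
chunk_servers = {"cs-a": ChunkServer("cs-a", {("/crawl/part-00", 0): b"x" * 1024})}
print(client_read(master, chunk_servers, "/crawl/part-00", 0, 16))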
23. FILE SYSTEM II – GFS II GFS II „Colossus“ Version 2 improves in many ways (it is a complete rewrite) Elegant Master Failover (no more 2s delays...) Chunk Size is now 1MB – likely to improve latency for serving data other than Indexing – for example GMail – this was the rationale behind the change Master can store more Chunk Metadata (therefore more chunks are addressable, up to 100 million) = also more Chunk Servers However, according to a Google Engineer they have only ever lost one 64MB chunk (in GFS I) during its entire production deployment (2004 – 2008?), so it is assumed extremely reliable
24. GOOGLE DATABASE Accesses the underlying Data on behalf of the upper layers and ultimately the applications
25. Bigtable I - Introduction What is it? Google's Database Implementation since 2004 Used internally for all large-scale services (Search, Indexing, GMail etc) Similar to a sharded Database implementation GOALS Huge Scalability to many PBs (the Web Database is currently around 40 Billion URLs) Tight Latency Highly efficient scans over Textual Data Fault Tolerant Load Balanceable Eliminate Google's dependency on an external provider
26. Bigtable II How is Data Referenced? Distributed Multi-Dimensional Sparse Map Simple addressing model using a triple: (row, column, { timestamp } ) -> cell contents ROWS - Rows are of arbitrary length (usually 10-100 Bytes, max <=64KB) - Rows are stored lexicographically - example row: a URL COLUMNS - example columns: contents:, PR, anchor1: .. TIMESTAMP (OPTIONAL) - timestamp (various API function args, i.e. „ALL“, „LATEST“)
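The addressing model is easiest to picture as a sparse dictionary keyed by (row, column, timestamp). The Python below is only a sketch of that data model – webtable, put and get_latest are invented names, not Bigtable's real API.

# Sparse, multi-dimensional map: (row, column, timestamp) -> cell contents.
# Illustrative data-model sketch only, not the Bigtable API.
import time

webtable = {}  # a "table" is just a sparse map of populated cells

def put(table, row, column, value, timestamp=None):
    ts = timestamp if timestamp is not None else time.time()
    table[(row, column, ts)] = value

def get_latest(table, row, column):
    """Timestamp is optional on reads: return the newest version of the cell."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, column)]
    return max(versions)[1] if versions else None

# Rows are short keys stored lexicographically (e.g. a reversed URL);
# columns are families such as contents: or anchor references.
put(webtable, "com.example.www/index.html", "contents:", "<html>...</html>")
put(webtable, "com.example.www/index.html", "anchor1:", "Example link text")
print(get_latest(webtable, "com.example.www/index.html", "contents:"))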
30. Mapreduce I Map Reduction can be seen as a way to exploit massive parallelism by breaking a task down into constituent parts and executing them on multiple processors The Major Functions are MAP & REDUCE (with a number of intermediary steps) MAP Break the task down into parallel steps REDUCE Combine the results into the final output Shown is a 2-pipeline Map Reduction (There are 24 Map Reductions in the indexing pipeline) Mappers & Reducers usually run on separate processors (with 90% of reducers lost, the job is still completed!)
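As a toy illustration of the two phases just described, the Python below runs a word count through map (emit intermediate key/value pairs), a shuffle that groups values by key (one of the intermediary steps the framework handles), and reduce (combine each group into the final output). It is a sketch of the programming model only, not Google's C++ implementation.

# Toy single-process MapReduce (word count), illustrating the programming model.
from collections import defaultdict

def map_phase(document):
    # MAP: break the task into per-record work, emitting (key, value) pairs.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # REDUCE: combine all values seen for one key into the final output.
    return word, sum(counts)

def mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                       # each doc would go to its own Mapper
        for key, value in map_phase(doc):
            intermediate[key].append(value)     # the "shuffle": group values by key
    return [reduce_phase(k, v) for k, v in intermediate.items()]  # separate Reducers

print(mapreduce(["the quick brown fox", "the lazy dog and the fox"]))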
35. Section III – Some more Glue 1. Languages employed 2. Development Environment 3. Google App Engine 4. Network Security 5. Future Google Architecture Advances 6. Odds n Sods 7. DIY Google
36. DEVELOPMENT LANGUAGES - Initially Python, Java, C++ The Usual Suspects - Sawzall (since 2006) - equivalent to Hadoop's Pig Latin - written in C++ - interpreted bytecode output, JIT'd An internal procedural language employed to solve map reduction problems. The few published Google papers employ Sawzall in the algorithm examples. Runs in the Map phase; Aggregators run in the Reduce phase (fed from each Sawzall Map instance) to get the final output. - Transparent Parallelization – no specialist Distributed Systems Knowledge Required (Good for the developer) - Simple Datatypes: 64-bit signed int, float, string, byte and a few unique ones such as time - Extensive string/regexp support - Compound Types: arrays, tuples - typesafe (with declarations) similar to Pascal (Probably an LL(1) lang?) - similar to Algol, C Syntax (no pointers though!) - No processing of exceptions (no exception handlers) - Shorter than the corresponding C++ code by a factor of 10 Early versions could not write into Bigtable. Now implemented? Output is sometimes pipelined into MySQL for further analysis
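Since no Sawzall source is reproduced here, the sketch below is a Python analogue of the pattern the slide describes: per-record code runs in the Map phase and emits values into aggregator tables (sums, counts), which are combined in the Reduce phase. SumTable and the record format are invented for illustration; this is not Sawzall syntax.

# Python analogue of the Sawzall emit/aggregator pattern (illustration only).

class SumTable:
    """Stands in for a Sawzall 'table sum of ...' aggregator (the Reduce side)."""
    def __init__(self):
        self.total = 0
    def emit(self, value):
        self.total += value

record_count, total_bytes = SumTable(), SumTable()

def process_record(record):
    # The part a Sawzall program expresses: it sees one record at a time and
    # needs no knowledge of the distributed system underneath (the
    # "transparent parallelization" point above).
    record_count.emit(1)
    total_bytes.emit(record["bytes"])

for record in [{"bytes": 512}, {"bytes": 2048}, {"bytes": 128}]:
    process_record(record)

print(record_count.total, total_bytes.total)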
38. Security Rack Board Level (possible scenario) gPXE on the board goes through a DHCP/tftp sequence to pull over an encrypted image (this is not expensive as it is done once per boot and boots are not frequent) The image is pulled from a Secure Image Distribution Server (and held encrypted on these) Once at the board end the image is decrypted on the fly and booted as normal RHEL (02/09) A Google Engineer didn't dispute this and seemed to concur, adding that in-core encryption might be a possibility (R/T decryption might not be that expensive) – this possibly means cryptography is used throughout the lifetime of the image – including components outside the working-set but sensitive parts of the in-core OS (OTF decrypted) Enterprise Kerberos is used throughout the enterprise They have an automated issuance system for SSL certificates, used by internal (secure) infrastructure to validate https/TLS and generic SSL connections. Complete internal network encryption is unlikely due to the latency introduced? Likely one of the reasons failover between DCs is problematic is the latency introduced by the expense of Wide Area Encryption (essential)
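Purely to illustrate the boot scenario above (none of this is known to be Google's actual boot chain), the fetch-verify-decrypt step might look roughly like the sketch below. The URL, the out-of-band checksum and the use of the third-party cryptography package are all assumptions.

# Hypothetical sketch of the "pull encrypted image, decrypt on the fly" step.
# Not Google's implementation; URL, checksum handling and key source are made up.
import hashlib
import urllib.request
from cryptography.fernet import Fernet  # third-party 'cryptography' package

IMAGE_URL = "https://image-distribution.internal.example/boot/rhel.img.enc"
EXPECTED_SHA256 = "..."  # published out-of-band by the Secure Image Distribution Server

def fetch_and_decrypt(key: bytes) -> bytes:
    encrypted = urllib.request.urlopen(IMAGE_URL).read()
    # Check integrity before handing anything to the boot loader.
    if hashlib.sha256(encrypted).hexdigest() != EXPECTED_SHA256:
        raise RuntimeError("boot image failed integrity check")
    return Fernet(key).decrypt(encrypted)   # on-the-fly decryption at the board end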
40. Odds n Sods borg – Google technology/architecture (is a cluster..) Borg: a hybrid protocol for scalable application-level multicast in peer-to-peer networks (WAN multimedia streaming) data cube – Google technology Have a „global loadbalancer“ – assume it load balances across a unified namespace – probably worldwide The Gmail designers implemented application-level failover to move your session to an alternate DC in a fashion seamless to the end user. Probably all Google Apps will be able to migrate to an alternate DC cell (the application, and its GFS data if need be) MySQL is used for back-end sys admin stuff (high-availability master-slave implementations) and post-Bigtable processing Remote employee access is via VPN Sys Admins maintain 5 and 30 minute SLAs – so on the ball Has its own internal archive.org equivalent.
44. DIY GOOGLE What you require: Preferably 2 Machines + 100baseT CentOS/RHEL (squid) Apache Hadoop (HDFS, Mapreduce, Pig, HBase) HDFS bmdiff/zippy compression library Google glibc/tcmalloc – perftools Supporting stuff – JRE etc Browser with a Search Box A Pig MR call to scan a few files and print the results (see the sketch after the next slide)
45. DIY GOOGLE Install Hadoop and Pig on the Cluster Install Eclipse and dependencies Install PigPen for Eclipse and configure it for the cluster (NFS)
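The natural tool for the last DIY item would be Pig itself, but to stay consistent with the Python sketches above, here is a minimal Hadoop Streaming equivalent that scans a few files and prints word counts. The file names (mapper.py, reducer.py) and paths are placeholders.

# mapper.py - Hadoop Streaming mapper: scan input lines, emit "word<TAB>1".
import sys

for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")

# reducer.py - Hadoop Streaming reducer: input arrives sorted by key, so sum runs per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Run with something like: hadoop jar hadoop-streaming.jar -input /a/few/files -output /results -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path depends on your Hadoop install), then print the results with hadoop fs -cat /results/part-*.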
46. TEMPLATE - IPv6 enablement started 2008 (finished 2009?) - IRP is OSPF: a Google-authored RFC points towards OSPF
47. DEVELOPMENT ENVIRONMENT bits & bobs A rare shot of some concrete Google internal stuff (this of a GFS Master Server code execution, found as a perftools profiling example) Agile Methodologies Used (development iterations, teamwork, collaboration, and process adaptability throughout the life-cycle of the project) "Libraries are the predominant way of building programs" An infrastructure handles versioning of applications so they can be released without fear of breaking things = roll out with minimal QA - Internal Code uses replacement libraries - Google, as you'd expect, rewrites everything! - Hungarian Notation? - Work in small teams of 3-5 people – likely few scutters know "the big picture"