SlideShare a Scribd company logo
1 of 42
Download to read offline
Distributed Systems

 scalability and high availability




Renato Lucindo - lucindo.github.com - @rlucindo
Renato Lucindo

 Call me Lucindo (or Linus)
 2002 - Bachelor Computer Science
 2007 - M.Sc. Computer Science (Combinatorial
 Optimization)
 7+ year developing Distributed Systems




 My default answer: "I don't know."
Agenda


 Scalability

 High Availability

 Problems

 Tips and Tricks

 Learning More
Distributed Systems

  Multiple computers that interact with each other over a
  network to achieve a common goal
  Purpose
     Scalability
     High availability




                                     source: http://www.cnds.jhu.edu/
Scalability

  System ability to handle gracefully a growing amount of
  work

  Scale up (vertical)
     Add resources to a single node
     Improve existing code to handle more work

  Scale out (horizontal)
     Add more nodes to a system
     Linear (or better) scalability
Scalability - Vertical

  Add: CPU, Memory, Disks (bigger box)
  Handling more simultaneous:
     Connections
     Operations
     Users
  Choose a good I/O and concurrency model
     Non-blocking I/O
     Asynchronous I/O
     Threads (single, pool, per-connection)
     Event handling patterns (Reactor, Proactor, ...)
  Memory model?
     STM
Scalability - Vertical

  Careful with numbers
      Requests per second
      # of Connections
      Simultaneous operations
  Event handling
      Think front-end
      Slow connections/clients
      It's slower than other options
  In doubt, go async
  Back-end
      Thread pool (thread per-connection)
      No events
      Process per-core
Scalability - Horizontal

  Add nodes to handle more work
  Front-end
     Straightforward
     Stateless
  Back-end
     Master/Slave(s)
     Partitioning
         DHT
         Volatile Index
Scalability - Horizontal

  Master/Slave
  Write on single Master
  Read on Slaves (one or more)
  Scales reads
Scalability - Horizontal

  Partitioning (Sharding)
     Distribute dada across nodes
  Generally involves data de-normalization
  Where is some specific data?
     Master Index
     Hash (DTH, Consistent Hashing)
     Volatile Index
  Joins done in application level
  NoSQL friendly
Scalability - Horizontal

  Volatile Index: build and maintain data index as cached
  information (all clients)
High Availability

            "Processes, as well as people, die"


  Handle hardware and software failures
      Eliminate single point of failure
  Redundancy
  Failover
  Replicas
High Availability - Failover/Redundancy
High Availability - Replicas

  Two or more copies of same data
  Replica granularity
     From node replica to "row" replica
  Load balancing
  Write concurrency
  Replica updates
  Key for high availability and root of several problems
Problems
Problems - CAP Theorem
Problems - CAP Theorem

 Consistency: all operations (reads/writes) yield a global
 consistent state

 Availability: all requests (on non-failed servers) must have
 a response

 Partition Tolerance: nodes may not be able
 to communicate with each other.



                     Pick Two
Problems - CAP Theorem

 C + A: network problems might stop the system

 Examples:
    Oracle RAC, IBM DB2 Parallel
    RDBMS (Master/Slave)
    Google File System
    HDFS (Hadoop)
Problems - CAP Theorem

 C + P: clients can't always perform operations

 Examples:
    Distributed lock-systems: Chubby, ZooKeeper
    Paxos protocol (consensus)
    BigTable, Hbase
    Hypertable
    MongoDB
Problems - CAP Theorem

 A + P: clients may read inconsistent (old or undone) data

 Examples:
    Amazon Dynamo
    Cassandra
    Voldemort
    CouchDB
    Riak
    Caches
Problem with CAP Theorem

 In practice, C + A and C + P systems are the same.
     C + A: not tolerant of network partitions
     C + P: not available when a network partition occurs
 Big problem: network partition
     Not so big (how often does it happens?)
 Pick two
     Availability
     Consistency
 The forgotten: Latency
     Or, how long the system waits before considering a
     partitioned network?
Problems - Real World

Every component may fail:
   Network failure
   Hardware failure
   Electricity
   Natural disasters
   Code failure
Tips & Tricks
Tips & Tricks - Pyramid

  Capacity (connections, operations, ...) Pyramid
Tips & Tricks - Reply Fast

  FAIL Fast
  Break complex requests into smaller ones
  Use timeouts
  No transactions
  Be aware that a single slow operation or component can
  generate contention
  Self-denial attack
Tips & Tricks - Cache

  Cache: component location, data, dns lookups, previous
  requests, etc
  Use negative cache for failed requests (low expiration)
  Don't rely on cache
  Your system must work with no cache
Tips & Tricks - Queues

  Easy way to add asynchronous processing an decouple
  your system.
Tips & Tricks - DNS
Tips & Tricks - Logs

  Log everything
  Use several log levels
  On every log message
       User
       Request host
       Component involved
       Version
       Filename and line
  If log level not enabled do not process log message
       Avoid lookup calls (gettimeofday)
Tips & Tricks - Domino Effect

  Make sure your load balancer won't overload components
  User smart algorithms
     Load Balance
     Resource Allocation
Tips & Tricks - (Zero) Configuration

  No configuration files
  Use good defaults
  Auto-discovery (multicast, gossip, ...)
  Make everything configurable
     Administrative command
     No need to stop for changes
  Automatic self adjusts when possible
Tips & Tricks - STOP Test

  With your system under load: kill -STOP <component>
Tips & Tricks - Know your tools

  load average (uptime)
  stats tools
      vmstat
      iostat
      mpstat
      tcpstat, tcprstat, etc
  tcpdump, nc, netstat
  tunning
      /proc/net/*
      ulimit
      sysctl
  oprofile
  debuging tools (gdb, valgrind)
  ...
Tips & Tricks - Count

  Count everything
     Connections
     Operations
     Failures
     Successes
     Request times (granularity)
  Total, average, standard deviation
  Monitor counters
Tips & Tricks - Stability Patterns

  Use Timeouts
  Circuit Breaker
  Bulkheads
  Steady State
  Fail Fast
  Handshaking
  Test Harness
  Decoupling Middleware
Tips & Tricks - Don't Panic!
Learning More - Books

TCP/IP Illustrated, Vol. 1: The Protocols
Learning More - Books

Unix Network Programming, Vol. 1: The Sockets Networking
Learning More - Books

Pattern Oriented Software Architecture, Vol. 2
Learning More - Books

Release It!
Learning More - Papers

 The Google File System
 Bigtable: A Distributed Storage System for Structured Data
 Dynamo: Amazon's Highly Available Key-Value Store
 PNUTS: Yahoo!’s Hosted Data Serving Platform
 MapReduce: Simplified Data Processing on Large Clusters

 Towards robust distributed systems
 Brewer's conjecture and the feasibility of consistent,
 available, partition-tolerant web services
 BASE: An Acid Alternative
 Looking up data in P2P systems
Thanks!!! Questions?

lucindo.github.com - @rlucindo

More Related Content

What's hot

Operating system support in distributed system
Operating system support in distributed systemOperating system support in distributed system
Operating system support in distributed systemishapadhy
 
Distributed Computing
Distributed Computing Distributed Computing
Distributed Computing Megha yadav
 
Structure of operating system
Structure of operating systemStructure of operating system
Structure of operating systemRafi Dar
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bankpkaviya
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memoryAshish Kumar
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Object Storage Overview
Object Storage OverviewObject Storage Overview
Object Storage OverviewCloudian
 
CS9222 Advanced Operating System
CS9222 Advanced Operating SystemCS9222 Advanced Operating System
CS9222 Advanced Operating SystemKathirvel Ayyaswamy
 
Naming in Distributed Systems
Naming in Distributed SystemsNaming in Distributed Systems
Naming in Distributed SystemsNandakumar P
 
Directory and discovery services
Directory and discovery servicesDirectory and discovery services
Directory and discovery servicesRamchandraRegmi
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed SystemSunita Sahu
 
Distributed file system
Distributed file systemDistributed file system
Distributed file systemAnamika Singh
 

What's hot (20)

Distributed System ppt
Distributed System pptDistributed System ppt
Distributed System ppt
 
Storage overview
Storage overviewStorage overview
Storage overview
 
Operating system support in distributed system
Operating system support in distributed systemOperating system support in distributed system
Operating system support in distributed system
 
Distributed Computing
Distributed Computing Distributed Computing
Distributed Computing
 
Structure of operating system
Structure of operating systemStructure of operating system
Structure of operating system
 
Cloud Computing Tools
Cloud Computing ToolsCloud Computing Tools
Cloud Computing Tools
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bank
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memory
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Object Storage Overview
Object Storage OverviewObject Storage Overview
Object Storage Overview
 
CS9222 Advanced Operating System
CS9222 Advanced Operating SystemCS9222 Advanced Operating System
CS9222 Advanced Operating System
 
Naming in Distributed Systems
Naming in Distributed SystemsNaming in Distributed Systems
Naming in Distributed Systems
 
Directory and discovery services
Directory and discovery servicesDirectory and discovery services
Directory and discovery services
 
11. dfs
11. dfs11. dfs
11. dfs
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
 
cluster computing
cluster computingcluster computing
cluster computing
 

Similar to Distributed Systems: scalability and high availability

Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Scalable Apache for Beginners
Scalable Apache for BeginnersScalable Apache for Beginners
Scalable Apache for Beginnerswebhostingguy
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Designing for the Cloud Tutorial - QCon SF 2009
Designing for the Cloud Tutorial - QCon SF 2009Designing for the Cloud Tutorial - QCon SF 2009
Designing for the Cloud Tutorial - QCon SF 2009Stuart Charlton
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReducecoolmirza143
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Distributed systems and scalability rules
Distributed systems and scalability rulesDistributed systems and scalability rules
Distributed systems and scalability rulesOleg Tsal-Tsalko
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operationniallmilton
 
Planning for-high-performance-web-application
Planning for-high-performance-web-applicationPlanning for-high-performance-web-application
Planning for-high-performance-web-applicationNguyễn Duy Nhân
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsFirat Atagun
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Coursejimliddle
 
Basics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed StorageBasics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed StorageNilesh Salpe
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 

Similar to Distributed Systems: scalability and high availability (20)

Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Scalable Apache for Beginners
Scalable Apache for BeginnersScalable Apache for Beginners
Scalable Apache for Beginners
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Designing for the Cloud Tutorial - QCon SF 2009
Designing for the Cloud Tutorial - QCon SF 2009Designing for the Cloud Tutorial - QCon SF 2009
Designing for the Cloud Tutorial - QCon SF 2009
 
test
testtest
test
 
HeartBeat
HeartBeatHeartBeat
HeartBeat
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReduce
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Distributed systems and scalability rules
Distributed systems and scalability rulesDistributed systems and scalability rules
Distributed systems and scalability rules
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Planning for-high-performance-web-application
Planning for-high-performance-web-applicationPlanning for-high-performance-web-application
Planning for-high-performance-web-application
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, Implementations
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
Basics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed StorageBasics of Distributed Systems - Distributed Storage
Basics of Distributed Systems - Distributed Storage
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Distributed Systems: scalability and high availability

  • 1. Distributed Systems scalability and high availability Renato Lucindo - lucindo.github.com - @rlucindo
  • 2. Renato Lucindo Call me Lucindo (or Linus) 2002 - Bachelor Computer Science 2007 - M.Sc. Computer Science (Combinatorial Optimization) 7+ year developing Distributed Systems My default answer: "I don't know."
  • 3. Agenda Scalability High Availability Problems Tips and Tricks Learning More
  • 4. Distributed Systems Multiple computers that interact with each other over a network to achieve a common goal Purpose Scalability High availability source: http://www.cnds.jhu.edu/
  • 5. Scalability System ability to handle gracefully a growing amount of work Scale up (vertical) Add resources to a single node Improve existing code to handle more work Scale out (horizontal) Add more nodes to a system Linear (or better) scalability
  • 6. Scalability - Vertical Add: CPU, Memory, Disks (bigger box) Handling more simultaneous: Connections Operations Users Choose a good I/O and concurrency model Non-blocking I/O Asynchronous I/O Threads (single, pool, per-connection) Event handling patterns (Reactor, Proactor, ...) Memory model? STM
  • 7. Scalability - Vertical Careful with numbers Requests per second # of Connections Simultaneous operations Event handling Think front-end Slow connections/clients It's slower than other options In doubt, go async Back-end Thread pool (thread per-connection) No events Process per-core
  • 8. Scalability - Horizontal Add nodes to handle more work Front-end Straightforward Stateless Back-end Master/Slave(s) Partitioning DHT Volatile Index
  • 9. Scalability - Horizontal Master/Slave Write on single Master Read on Slaves (one or more) Scales reads
  • 10. Scalability - Horizontal Partitioning (Sharding) Distribute dada across nodes Generally involves data de-normalization Where is some specific data? Master Index Hash (DTH, Consistent Hashing) Volatile Index Joins done in application level NoSQL friendly
  • 11. Scalability - Horizontal Volatile Index: build and maintain data index as cached information (all clients)
  • 12. High Availability "Processes, as well as people, die" Handle hardware and software failures Eliminate single point of failure Redundancy Failover Replicas
  • 13. High Availability - Failover/Redundancy
  • 14. High Availability - Replicas Two or more copies of same data Replica granularity From node replica to "row" replica Load balancing Write concurrency Replica updates Key for high availability and root of several problems
  • 16. Problems - CAP Theorem
  • 17. Problems - CAP Theorem Consistency: all operations (reads/writes) yield a global consistent state Availability: all requests (on non-failed servers) must have a response Partition Tolerance: nodes may not be able to communicate with each other. Pick Two
  • 18. Problems - CAP Theorem C + A: network problems might stop the system Examples: Oracle RAC, IBM DB2 Parallel RDBMS (Master/Slave) Google File System HDFS (Hadoop)
  • 19. Problems - CAP Theorem C + P: clients can't always perform operations Examples: Distributed lock-systems: Chubby, ZooKeeper Paxos protocol (consensus) BigTable, Hbase Hypertable MongoDB
  • 20. Problems - CAP Theorem A + P: clients may read inconsistent (old or undone) data Examples: Amazon Dynamo Cassandra Voldemort CouchDB Riak Caches
  • 21. Problem with CAP Theorem In practice, C + A and C + P systems are the same. C + A: not tolerant of network partitions C + P: not available when a network partition occurs Big problem: network partition Not so big (how often does it happens?) Pick two Availability Consistency The forgotten: Latency Or, how long the system waits before considering a partitioned network?
  • 22. Problems - Real World Every component may fail: Network failure Hardware failure Electricity Natural disasters Code failure
  • 24. Tips & Tricks - Pyramid Capacity (connections, operations, ...) Pyramid
  • 25. Tips & Tricks - Reply Fast FAIL Fast Break complex requests into smaller ones Use timeouts No transactions Be aware that a single slow operation or component can generate contention Self-denial attack
  • 26. Tips & Tricks - Cache Cache: component location, data, dns lookups, previous requests, etc Use negative cache for failed requests (low expiration) Don't rely on cache Your system must work with no cache
  • 27. Tips & Tricks - Queues Easy way to add asynchronous processing an decouple your system.
  • 28. Tips & Tricks - DNS
  • 29. Tips & Tricks - Logs Log everything Use several log levels On every log message User Request host Component involved Version Filename and line If log level not enabled do not process log message Avoid lookup calls (gettimeofday)
  • 30. Tips & Tricks - Domino Effect Make sure your load balancer won't overload components User smart algorithms Load Balance Resource Allocation
  • 31. Tips & Tricks - (Zero) Configuration No configuration files Use good defaults Auto-discovery (multicast, gossip, ...) Make everything configurable Administrative command No need to stop for changes Automatic self adjusts when possible
  • 32. Tips & Tricks - STOP Test With your system under load: kill -STOP <component>
  • 33. Tips & Tricks - Know your tools load average (uptime) stats tools vmstat iostat mpstat tcpstat, tcprstat, etc tcpdump, nc, netstat tunning /proc/net/* ulimit sysctl oprofile debuging tools (gdb, valgrind) ...
  • 34. Tips & Tricks - Count Count everything Connections Operations Failures Successes Request times (granularity) Total, average, standard deviation Monitor counters
  • 35. Tips & Tricks - Stability Patterns Use Timeouts Circuit Breaker Bulkheads Steady State Fail Fast Handshaking Test Harness Decoupling Middleware
  • 36. Tips & Tricks - Don't Panic!
  • 37. Learning More - Books TCP/IP Illustrated, Vol. 1: The Protocols
  • 38. Learning More - Books Unix Network Programming, Vol. 1: The Sockets Networking
  • 39. Learning More - Books Pattern Oriented Software Architecture, Vol. 2
  • 40. Learning More - Books Release It!
  • 41. Learning More - Papers The Google File System Bigtable: A Distributed Storage System for Structured Data Dynamo: Amazon's Highly Available Key-Value Store PNUTS: Yahoo!’s Hosted Data Serving Platform MapReduce: Simplified Data Processing on Large Clusters Towards robust distributed systems Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services BASE: An Acid Alternative Looking up data in P2P systems