Distributed Systems

Diego Souza talks about distributed systems, presenting an introduction to the basic concepts and some practical considerations that can affect our day-to-day work.
Watch this talk at https://www.eventials.com/locaweb/sistemas-distribuidos/

Published in: Technology

  1. 1. distributed systems diego souza @ infra-dev
  2. 2. agenda ● the basics ● models ● practical aspects
  3. 3. the basics
  4. 4. the basics what is a distributed system? ● a distributed system is a piece of software that ensures that a collection of independent computers appears to its users as a single coherent system;
  5. 5. the basics what is a distributed system? (cont.) ● a distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages;
  6. 6. the basics what is a distributed system? (cont.) ● a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable [Lamport];
  7. 7. the basics fallacies of a distributed system 1. the network is reliable; 2. latency is zero; 3. bandwidth is infinite; 4. the network is secure; 5. topology doesn't change; 6. there is one administrator; 7. transport cost is zero; 8. the network is homogeneous;
  8. 8. the basics examples: ● cassandra ● hadoop ● www ● internet ● etc.
  9. 9. the basics why? ● things no longer fit in a single machine; ● scalability [size, geographic, organizational]; ● availability; ● fault tolerance; ● performance;
  10. 10. the basics scalability ● is the ability of a system, network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth;
  11. 11. the basics performance ● depends on the context and what we want to achieve: ○ response time/low latency; ○ throughput; ○ utilization of computer resources;
  12. 12. the basics latency ● the state of being latent; delay, a period between the initiation of something and the occurrence; ● a wise man once said: ○ Bandwidth is easy. Engineers build bandwidth. But latency is hard. Only God gives us latency;
  13. 13. the basics availability ● the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable;
  14. 14. the basics fault tolerance ● ability of a system to behave in a well-defined manner once faults occur;
  15. 15. models
  16. 16. models availability metrics ● availability = uptime / (uptime + downtime); ● availability = mtbf / (mtbf + mttr); ○ mtbf: mean time between failures; ○ mttr: mean time to repair; ● q: is every second the same?
  17. 17. models availability metrics yield = successes / requests ● a: very unlikely!
  18. 18. models availability metrics harvest = data_available / total_data ● how incomplete is this [think of websearch]?
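
The three metrics from slides 16 to 18 can be computed directly from their definitions. The sketch below is only illustrative: the function names and all the numbers are made up for the example and do not come from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: mtbf / (mtbf + mttr)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yield_ratio(successes: int, requests: int) -> float:
    """Fraction of requests that were answered successfully."""
    return successes / requests


def harvest(data_available: int, total_data: int) -> float:
    """Fraction of the full dataset reflected in an answer."""
    return data_available / total_data


# Hypothetical numbers: one hour of repair for every 999 hours of operation,
# 10 failed requests out of 10,000, and one of ten shards unreachable.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
y = yield_ratio(successes=9_990, requests=10_000)
h = harvest(data_available=9, total_data=10)

print(f"availability={a:.3f} yield={y:.3f} harvest={h:.2f}")
```

Uptime-based availability treats every second alike, while yield weights the outcome by actual traffic, so an outage at peak time hurts yield more than one at 4 a.m. Harvest captures the case where a request succeeds but the answer is built from incomplete data.
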
  19. 19. models distributing the dataset ● partition ● replication
  20. 20. models partition ● improves performance [reduces dataset]; ● improves availability [partial failures]; ● usually application specific [random, time, user];
  21. 21. models replication ● improves performance [full copy]; ● improves availability [full copy, reed-solomon codes]; ○ synchronous, asynchronous; ○ single copy, multi-master; ○ crdts;
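
The slide only names crdts. As one small illustration, here is a sketch of a grow-only counter (G-Counter), a well-known CRDT. Choosing this particular type, and the class and method names, are assumptions made for the example.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())
```

Because the merge needs no coordination, this style of replication fits asynchronous, multi-master setups where replicas may be temporarily out of touch.
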
  22. 22. models replication [strong consistency] ● primary/copy [e.g. mysql master]; ● 2pc [e.g. mysql cluster]; ● paxos, zab, raft;
  23. 23. models replication [weak consistency] ● amazon dynamo ○ consistent hashing [partitioning] ○ partial quorums ○ failure detection and read repair ○ gossip protocol ● note: r + w > n does not imply strong consistency
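
Consistent hashing, listed above as Dynamo's partitioning scheme, can be sketched in a few lines. This is a minimal illustration and not Dynamo's actual implementation: the `HashRing` class, the use of SHA-256, and the number of virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        # Each node owns several points ("virtual nodes") to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key: str, n: int):
        """First n distinct nodes found walking clockwise from the key."""
        start = bisect.bisect(self._points, _hash(key))
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result


ring = HashRing(["a", "b", "c", "d"])
replicas = ring.preference_list("user:42", 3)  # three distinct replica nodes
print(replicas)
```

The property that matters is that removing a node only moves the keys that node owned; every other key keeps its owner, which a plain `hash(key) % number_of_nodes` scheme does not guarantee.
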
  24. 24. models time ● global clock [ntp, total order] ● local clock [partial order] ● logical clock [partial order; lamport clock, vector clocks]
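
A Lamport clock, one of the logical clocks named above, fits in a few lines. The class and method names below are illustrative choices, not taken from the talk.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before relation."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Jump past the sender's timestamp, then count the receive event."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                     # local event on a, a.time == 1
stamp = a.send()             # a.time == 2
received = b.receive(stamp)  # b.time == max(0, 2) + 1 == 3
print(stamp, received)
```

If one event happened before another, its timestamp is smaller. The converse does not hold: two timestamps alone cannot tell causally related events from concurrent ones, which is the gap vector clocks close.
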
  25. 25. models consensus & atomic broadcast ● consensus: vote & agreement; ● atomic broadcast: reliable message transmission and order guarantees; ● they are equivalent
  26. 26. models flp impossibility ● there is no deterministic algorithm that solves the consensus problem in an asynchronous system subject to failures, even if messages are never lost, at most one process may fail, and it can only fail by crashing ● note: it's not that bad! :)
  27. 27. models
  28. 28. models cap: [note: "pick only two" is misleading] ● consistency: all nodes see the same data at the same time; ● availability; ● partition tolerance: continues to operate despite message loss [network or node failure];
  29. 29. practical aspects
  30. 30. I find latency one of the most important aspects of performance
  31. 31. hard to develop, even harder to operate: they are not unbreakable
  32. 32. consensus is a hard problem
  33. 33. failures are the norm
  34. 34. metrics, metrics, metrics
  35. 35. what to do in the presence of failures
  36. 36. think about backpressure mechanisms
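
One simple form of backpressure is a bounded queue that rejects new work when it is full instead of buffering without limit. The sketch below uses Python's standard `queue` module; the `try_submit` helper and the tiny queue size are assumptions made for the example.

```python
import queue


def try_submit(work_queue: queue.Queue, item) -> bool:
    """Accept the item if there is room; otherwise tell the caller to back off."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


work = queue.Queue(maxsize=2)  # deliberately tiny to show the effect

# The producer is faster than the consumer: only the first two items fit.
results = [try_submit(work, n) for n in range(4)]
print(results)

# Once the consumer drains an item, capacity is available again.
work.get_nowait()
accepted_after_drain = try_submit(work, 99)
print(accepted_after_drain)
```

Rejecting early pushes the overload signal back to the caller, who can retry later, shed load, or slow down. An unbounded queue would hide the problem until memory runs out or latency becomes unacceptable.
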
  37. 37. think about timeouts
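
A minimal sketch of bounding how long a caller waits on a slow dependency, using the standard `concurrent.futures` module. The `call_with_timeout` helper, the fallback value, and the simulated slow call are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s: float, fallback):
    """Run fn in a worker thread and give up waiting after timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread is not interrupted; the caller just stops waiting.
        return fallback
    finally:
        pool.shutdown(wait=False)


def slow_dependency():
    time.sleep(0.5)  # stands in for a remote call that is too slow
    return "late answer"


slow_result = call_with_timeout(slow_dependency, timeout_s=0.05, fallback="fallback")
fast_result = call_with_timeout(lambda: "ok", timeout_s=0.05, fallback="fallback")
print(slow_result, fast_result)
```

Note that the timeout only protects the caller. The abandoned work keeps running, so in a real system the remote side usually needs its own deadline as well.
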
  38. 38. feature flags as a deploy mechanism
  39. 39. think hard about scalability
  40. 40. thanks :) questions or comments?
  41. 41. appendix
  42. 42. appendix: what we have here ● cassandra ● zookeeper ● ceph ● etcd ● consul ● leela
  43. 43. links ● http://book.mixu.net/distsys/
