Distributed Programming
     and Data Consistency
     by Paulo Gaspar
     @paulogaspar7




                                                           1

Twitter: @paulogaspar7 - http://twitter.com/paulogaspar7
Blog: http://paulogaspar7.blogspot.com/
Consistency Perception




                         2
What is Consistency?
                                                                                                                  3

Our perception of consistency is related to what we know about the system and its state. That is how we figure out
what might fit...
What isn’t?
                                                                                                              4

...and what does not fit. Obviously a person will have a different degree of precision and tolerance than an
automated system.
Consistency across time
                                                                                     5

Consistency also has a time axis, with state sequences that make sense...
1 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Consistency across time
                                                                                     6

2 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Consistency across time
                                                                                     7

3 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time
                                                                                       8

...and state sequences that do NOT make sense.
1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time
                                                                                       9

2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time
                                                                                       10

3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Consistency is perception
              ...and time matters...
                                                                                                                    11

Again, each (type of) observer will have a different degree of evaluation precision and tolerance to inconsistencies.
Caching Consistency
(The Lower Latency - read performance)




                                         12
Data Caching Consistency

                     Multi-layer caching

                     The 3 second cache for a “LIVE” site
                     (e.g.: BBC News live soccer reports)

                     User changing cached data

                     Schrödinger’s Cache?


                                                                                                                         13
Even on a “live” site you can use a short-lived cache. If the user cannot observe the exact time of each server state
change, are any server-to-client delays (due to caching) really there?

Moreover, it is often a choice between small update-until-view delays due to caching and really big ones (or the site
going down) due to overload.
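As an illustration only (the names here are made up, not from any particular framework), the short-lived “3 second cache” boils down to a tiny TTL wrapper:

```python
import time

class TTLCache:
    """Serve a possibly stale value for up to `ttl` seconds,
    trading a small update-until-view delay for read throughput."""

    def __init__(self, ttl=3.0):
        self.ttl = ttl
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key, loader):
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # fresh enough: serve the cached value
        value = loader(key)          # expired or missing: hit the backend
        self._store[key] = (value, now + self.ttl)
        return value

# Every client within a 3 s window sees the same report, so one backend
# read can serve thousands of requests during a traffic spike.
cache = TTLCache(ttl=3.0)
report = cache.get("match-42", loader=lambda k: f"score for {k}")
```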
Memcached at FB:
     You HAVE TO Replicate to Scale-Out
                                                                                                                  14

An example of how you still might have to replicate in order to scale, even with a very high performance store.

The reason for FB’s issue (might lack some detail):
 http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html
So, now it “Loadbalances”...
                                                                                                            15

...and with LB, inconsistencies along the time axis can happen (e.g. by reading from alternate out-of-sync
backends)
...but then you can have...
                                                                                       16

With the possibility of state sequences that do NOT make sense.
1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time
                                                                                       17

2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time
                                                                                       18

3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
...now it can pick >1 versions!
                                                        19

Why you can have inconsistencies along the time axis.
Slow and Big Consistency
(The Higher Latency - BigData)




                                 20
MapReduce is for embarrassingly
   parallel problems with some time...
                                                                                                                  21

Consistency scenarios, starting from the most “sexy” (Web, petabytes of data):
* MapReduce works like vote counting - votes mapped to voting tables, counted, “reduced” to stats;
* MR is appropriate for "embarrassingly parallel" tasks, like indexing the Internet and other huge processing tasks;
* We should use it whenever possible;
* There is a lot to be learned about MapReduce:
 - Evaluation and expression of candidate problems;
 - Building and managing its infrastructure;
 - etc.
* Even MR has coordination needs;
* Even MR should have SLAs (Service Level Agreements).
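The vote-counting analogy can be made concrete with a toy, single-process MapReduce in plain Python (no Hadoop involved; just the map / shuffle / reduce shape):

```python
from collections import defaultdict
from itertools import chain

def map_phase(ballots):
    # map: emit a (candidate, 1) pair per vote, like marking a tally sheet
    return [(vote, 1) for vote in ballots]

def shuffle(pairs):
    # shuffle: group tallies by candidate (Hadoop does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum each candidate's tallies into a final count
    return {key: sum(values) for key, values in groups.items()}

# Each ballot box can be mapped in parallel; only the reduce needs the groups.
boxes = [["ana", "rui"], ["rui", "rui", "ana"]]
pairs = chain.from_iterable(map_phase(box) for box in boxes)
totals = reduce_phase(shuffle(pairs))  # {'ana': 2, 'rui': 3}
```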
MapReduce Implementations
                                                 (& Co.)
          Google, coordination by Chubby using Paxos.
          Used only at Google;
          Google BigTable is a Wide Column Store which works
          on top of GoogleFS. Used only at Google;
          Hadoop, used at Amazon, Facebook, Rackspace,
          Twitter, Yahoo!, etc.;
          Hadoop ZooKeeper implements a Paxos variation and
          is used at Rackspace, Yahoo!, etc.;
          Hadoop HBase is a Wide Column Store, on top of
          HDFS and now uses ZooKeeper. Used at Yahoo! etc.
                                                                                         22
Parallel between Google’s internally developed systems and their Hadoop counterparts.
 http://hadoop.apache.org/
 http://labs.google.com/papers/

The very interesting “coordinators”:
 http://labs.google.com/papers/chubby.html
 http://hadoop.apache.org/zookeeper/

Zookeeper sure looks like a very interesting and reusable piece of software.

Curiosity: HBase got faster after adopting ZooKeeper... is that also because of ZooKeeper?
 http://hadoop.apache.org/hbase/
Consistency w/ Interaction
(Low Latency - read/write - harder stuff)




                                            23
Two “High”/Sexy reasons for
 Distributing Data Storage
            (not just cache)


High Performance Data Access
(Read / Write)
High Availability (HA)


                               24
Why care about HA?

          1.7% of HDDs fail in the 1st year, 8.6% in the 3rd (Google)
          Unrecoverable RAM errors/year: 1.3% of machines,
          0.22% of DIMMs (Google)
          Router, Rack, PDU, misc. network failures
          Over 4 nines only through redundancy; the best hardware is
          never good enough (James Hamilton - MS and Amazon)



                                                                                     25
Sources:
For Google’s numbers check the slideware at:
 http://videolectures.net/wsdm09_dean_cblirs/

For the James Hamilton quote:
 http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf

Another very quoted paper with Google’s DRAM failure stats and patterns:
 http://research.google.com/pubs/pub35162.html

You can find other HA and Systems related papers from Google and James Hamilton at:
 http://mvdirona.com/jrh/work/
 http://research.google.com/pubs/DistributedSystemsandParallelComputing.html
Why care about Latency?

         Google: Half a second delay caused a 20% drop in
         traffic (30 results instead of 10, via Marissa Mayer);
         Amazon found every 100ms of latency costs 1% sales
         (via Greg Linden);
         A broker could lose $4 million in revenues per
         millisecond if their electronic trading platform is 5 ms
         behind the competition (via NYT).



                                                                                     26
You can find all these references through this page (if you follow the links):
 http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it

Including these:
  http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html
  http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx
  http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
Other Distributed Data Contexts
                             (the less sexy daily stuff)


         EAI / B2B / Systems Integration


         Geographic Distribution (e.g.:Health System+Hospitals)


         Systems with n-tier / SOA Architectures




                                                                                                               27

The daily jobs of many IT professionals relate much more to this type of common distributed system than to the
sexier kind we talked about before. But these fields too would benefit from learning the lessons and using the
technologies we are talking about.
Fallacies of Distributed Computing
                           1. The network is reliable;
                           2. Latency is zero;
                           3. Bandwidth is infinite;
                           4. The network is secure;
                           5. Topology doesn't change;
                           6. There is one administrator;
                           7. Transport cost is zero;
                           8. The network is homogeneous.

                                                                             28
Just to recall this classic on the HA challenges. A few more details at:
  http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
CAP Theorem History
         1999: 1st mention in the “Harvest, Yield and Scalable Tolerant Systems”
         paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)

         2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC
         Conference

         2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and
         Nancy Lynch (MIT)

         2007-10-02: “Amazon's Dynamo” post by Werner Vogels
         (Amazon’s CTO) quoting the paper:
         Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash
         Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels,
         “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st
         ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

         2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)


                                                                                                                     29

The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual
Consistency” chapter:
 http://books.couchdb.org/relax/intro/eventual-consistency

Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in
truly industrial sites, even with stats describing real life behavior:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and the now famous Eventually Consistent post by Werner Vogels:
  http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

If you dislike the introductory (justifiable) drama, just jump to the next part, because this article, by Julian Browne,
is the best I found about Brewer’s CAP Theorem and its history:
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

You should still take a look at:
* The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is
already mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf

* The 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already
mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

* The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”:
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

* You might also see with your own eyes how CAP became a proved Theorem:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

Definition of ACID:
 http://en.wikipedia.org/wiki/ACID
The CAP Theorem
         strong Consistency, high Availability, Partition-resilience:
                             pick at most 2
                                                                        30

I simply had to put The Diagram, of course.
Eventual Consistency for
       Availability
       BASE                                                             ACID
       (Basically Available Soft-state Eventual consistency)            (Atomicity, Consistency, Isolation, Durability)


           Weak Consistency                                                   Strong consistency
           (stale data ok)                                                    (NO stale data)

           Availability first                                                  Isolation

           Best effort                                                        Focus on “commit”

           Approximate answers OK                                             Availability?

           Aggressive (optimistic)                                            Conservative (pessimistic)

           Faster                                                             Safer

                                                                                                                                   31
You can find a variation of this slide at Brewer’s 2000’s PODC keynote at:
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

I skipped these rather controversial bits:
  ACID: * Nested transactions; * Difficult evolution (e.g. schema)
  BASE: * Simpler! * Easier evolution

I have already tried both ways (data stores with and without a schema) and I’d rather have some schema mechanism for the
most complex stuff.


ACID:
A)tomicity
Either all of the tasks of a transaction are performed or none of them are.

C)onsistency
A database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful
or not).

I)solation
Other operations cannot access or see the data in an intermediate state during a transaction.

D)urability
Once the user has been notified of success, the transaction will persist. This means it will survive system failure, and that the
database system has checked the integrity constraints and won't need to abort the transaction.
CAP Trade-offs
          CA without P: Databases providing distributed transactions can
          only do it while their network is ok;

          CP without A: While there is a partition, transactions to an ACID
          database may be blocked until the partition heals
          (to avoid merge conflicts -> inconsistency);

          AP without C: Caching provides client-server partition resilience
          by replicating data, even if the partition prevents verifying if a
          replica is fresh. In general, any distributed DB problem can be
          solved with either:
                expiration-based caching to get AP;
                or replicas and majority voting to get PC
                (minority is unavailable).

                                                                                                      32

Concept introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.):
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

I should probably skip this slide during a live presentation. This is stuff you have to read about.
Living with CAP
        All systems are probabilistic, whether they realize it or not
        And so are Distributed Transactions (2 Generals Problem)
        Weak CAP Principle: The stronger the guarantees made
        about any two of C, A and P, the weaker the guarantees
        that can be made about the third
        Systems should degrade gracefully, instead of all or
        nothing (e.g.: displaying data from available partitions)
        Life is Eventually Consistent
        Aim for Eventual Consistency
                                                                                                                        33

Steve Yen clearly illustrates the “Life is Eventually Consistent” idea on the slideware (slides 40 to 45) he used for
his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009:
 http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf

The Weak CAP Principle was introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.):
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

To understand how hard (ACID) Distributed Transactions are, you have an excellent history of the concepts related
to this problem here:
 http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html

The difficulties of (ACID) Distributed Transactions are well illustrated by the classic Two Generals’ Problem:
 http://en.wikipedia.org/wiki/Two_Generals'_Problem

Leslie Lamport et al further explore the problem (and its solutions) on the classic “The Byzantine Generals Problem”
paper:
 http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf

And if you think that Two Phase Commit is a 100% reliable mechanism... think again:
 http://www.cs.cornell.edu/courses/cs614/2004sp/papers/Ske81.pdf

This is just to illustrate the difficulty of the problem. There are more reliable mechanisms, like Three Phase
Commit:
 http://en.wikipedia.org/wiki/Three-phase_commit_protocol
 http://ei.cs.vt.edu/~cs5204/fall99/distributedDBMS/sreenu/3pc.html

...or the so called Paxos Commit:
  http://research.microsoft.com/pubs/64636/tr-2003-96.pdf
CAP Theorem History
         1999: 1st mention in the “Harvest, Yield, and Scalable Tolerant Systems”
         paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)

         2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC
         Conference

         2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and
         Nancy Lynch (MIT)

         2007-10-02: “Amazon's Dynamo” post by Werner Vogels
         (Amazon’s CTO) quoting the paper:
         Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash
         Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels,
         “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st
         ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

         2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)


                                                                                                                     34

Repeated slide (to pass focus from CAP to Dynamo and Eventual Consistency). The notes are the same as on the first
occurrence of this slide; see there for the full list of references.
Amazon’s Dynamo DB
                Also a “Wide Column Store”


      Problem                                    Technique
      Partitioning                               Consistent Hashing

      High Availability for writes               Vector clocks with reconciliation during reads

      Handling temporary failures                Sloppy Quorum and hinted handoff (NRW)

      Recovering from permanent failures         Anti-entropy using Merkle trees

      Membership and failure detection           Gossip-based membership protocol and failure detection.




                                                                                                                       35
The source here is the already mentioned Dynamo paper:
 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Strict distributed DBs, rather than dealing with uncertainty about the correctness of an answer, make the data
unavailable until it is absolutely certain that it is correct.

At Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution - avg or median not good
enough to provide a good experience for all. The choice for 99.9% over an even higher percentile has been made based
on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much.
Experiences with Amazon’s production systems have shown that this approach provides a better overall experience
compared to those systems that meet SLAs defined based on the mean or median.
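The “Consistent Hashing” row of the table can be illustrated with a toy hash ring using virtual nodes (a sketch only, not Dynamo’s actual code; the class and method names are made up):

```python
import bisect
import hashlib

def _hash(key):
    # stable 64-bit position on the ring
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class Ring:
    """Consistent-hash ring: each key is served by the first nodes found
    walking clockwise from its position, so adding or removing a node
    only remaps the keys on neighbouring arcs."""

    def __init__(self, nodes, vnodes=100):
        # each physical node owns many virtual points to even out the load
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def preference_list(self, key, n=3):
        # walk clockwise from the key's position, collecting N distinct nodes
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._points)
        result = []
        while len(result) < n:
            node = self._points[idx % len(self._points)][1]
            if node not in result:
                result.append(node)
            idx += 1
        return result

ring = Ring(["a", "b", "c", "d"])
replicas = ring.preference_list("cart:123", n=3)  # 3 distinct nodes, always the same ones
```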
N: number of nodes to replicate each item to;
                      W: number of required nodes for write success;
                      R: number of required nodes for read success.

                      W < N = the remaining nodes will receive the write later.
                      R < N = the remaining nodes are not consulted on the read.
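These N/W/R rules boil down to a quorum-overlap check, sketched here as an illustration only:

```python
def write_is_durable(w, acks):
    # a write succeeds once W of the N replicas have acknowledged it
    return acks >= w

def reads_see_latest_write(n, w, r):
    """With R + W > N every read quorum overlaps every write quorum,
    so at least one contacted replica holds the newest version.
    With R + W <= N a read may hit only stale replicas."""
    return r + w > n

# A common Dynamo-style configuration: N=3, W=2, R=2.
assert reads_see_latest_write(n=3, w=2, r=2)      # overlapping quorums
assert not reads_see_latest_write(n=3, w=1, r=1)  # may read a stale replica
assert write_is_durable(w=2, acks=2)
```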



                                                                                                             36
Also based on the already mentioned Dynamo paper:
 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...but you can find similar diagrams and similar mechanisms described for several (NoSQL) databases that
partially clone Dynamo.
Wikipedia image




                            Merkle Tree / Hash Tree
                    Used to verify / compare a set of data blocks
                    and efficiently find where the mismatches are.

                                                                                     37
Also based on the already mentioned Dynamo paper:
 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and on the Wikipedia article about this algorithm:
  http://en.wikipedia.org/wiki/Hash_tree
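A minimal sketch of the idea (illustrative only; a real anti-entropy protocol compares subtrees lazily instead of recomputing full roots):

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(blocks):
    """Hash each block, then repeatedly hash pairs of hashes upward.
    Two replicas whose roots match hold identical data; on a mismatch
    they recurse down only the differing branch, so finding the stale
    blocks costs O(log n) comparisons instead of a full scan."""
    level = [_h(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [_h((a + b).encode())
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
assert merkle_root(replica_a) != merkle_root(replica_b)  # mismatch detected at the root
```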
Wikipedia image



                                             Vector Clocks
                   On each internal event a process increments its logical clock;
                  Before sending a message, it increments its own clock in the
                  vector and sends it with the message;
                  On receiving a message, it increments its clock and updates
                  each element of its own vector to max(own, msg).

                                                                                                               38
Also based on the already mentioned Dynamo paper:
 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and on the Wikipedia article about this algorithm:
  http://en.wikipedia.org/wiki/Vector_clock

Vector Clocks (and other similar algorithms) have a predecessor in Lamport timestamps:
 http://en.wikipedia.org/wiki/Lamport_timestamps

Introduced in the classic paper “Time, Clocks, and the Ordering of Events in a Distributed System” by Leslie
Lamport:
  http://en.wikipedia.org/wiki/Lamport_timestamps
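The three rules above can be sketched in a few lines of Python (a minimal illustration, not Dynamo’s actual implementation):

```python
from collections import Counter

class Process:
    """Minimal vector-clock process following the three rules above."""

    def __init__(self, name):
        self.name = name
        self.clock = Counter()

    def internal_event(self):
        self.clock[self.name] += 1      # rule 1: tick on local events

    def send(self):
        self.clock[self.name] += 1      # rule 2: tick, then attach the clock
        return dict(self.clock)

    def receive(self, msg_clock):
        self.clock[self.name] += 1      # rule 3: tick and merge element-wise
        for proc, t in msg_clock.items():
            self.clock[proc] = max(self.clock[proc], t)

def happened_before(a, b):
    # a -> b iff a <= b component-wise and a != b; otherwise concurrent
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

p, q = Process("p"), Process("q")
p.internal_event()
msg = p.send()
snapshot = dict(p.clock)
q.receive(msg)
assert happened_before(snapshot, dict(q.clock))  # the send precedes the receive
```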
Amazon Dynamo Lessons
                                        (according to the paper)

          Data returned to Shopping Cart 24h profiling:
          0.00057% of requests saw 2 versions; 0.00047% of
          requests saw 3 versions and 0.00009% of requests
          saw 4 versions.
          Over two years, applications have received successful
          responses (without timing out) for 99.9995% of their
          requests, and no data loss event has occurred to date;
          With coordination via Gossip protocol it is harder to
          scale further than a few hundred nodes.
          (Could be better w/ Chubby / ZK like coordinators?)

                                                                                                                      39
Also based on the already mentioned Dynamo paper:
 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Wikipedia has an article on Gossip Protocols (although, at the date I write this, not as precise as the other Wikipedia
articles I just quoted):
 http://en.wikipedia.org/wiki/Gossip_protocol
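As an illustration only (not from the Dynamo paper), a push-style gossip round can be simulated in a few lines; each round, every node forwards its current state to a couple of random peers:

```python
import random

def gossip_round(states, fanout=2):
    """One push-gossip round: every node pushes its current value to
    `fanout` random peers; each peer keeps the highest version seen."""
    nodes = list(states)
    pushes = []
    for node in nodes:
        peers = random.sample([n for n in nodes if n != node], fanout)
        pushes.extend((peer, states[node]) for peer in peers)
    for peer, value in pushes:
        states[peer] = max(states[peer], value)

# A single update injected at node 0 reaches every node in O(log n)
# rounds on average - but only probabilistically, which is one reason
# pure gossip gets harder to reason about at larger scales.
states = {i: 0 for i in range(16)}
states[0] = 1
for _ in range(20):
    gossip_round(states)
# states is now all 1s with overwhelming probability
```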

The solution I mention as a possibly more scalable alternative to Gossip Protocols for consensus is the use of
Paxos (or derivative) coordinators, like Google’s proprietary Chubby or the open source Apache Hadoop
ZooKeeper.

When I first wrote and used these slides (at my SAPO Codebits 2009 talk), the only support I had for my (then
intuitive) belief that these more directed approaches should be more efficient than Gossip Protocols was section 6.6
of the Dynamo paper - the paper even mentions the possibility of “introducing hierarchical extensions to
Dynamo”.

Thanks to my SAPO Codebits talk I met Henrique Moniz, then a Ph.D. student at the University of Lisbon. After I
discussed this issue (consensus scalability) with him he pointed me to a couple of interesting papers, one of which
immediately captured my attention:
* Gossip-based broadcast protocols by João Leitão
 http://www.gsd.inesc-id.pt/~jleitao/pdf/masterthesis-leitao.pdf

This paper offers a more complete description of gossip protocol overhead and, to my surprise, also points out a
few reliability weak spots in known Gossip Protocols. The paper goes on to present a more robust and efficient
Gossip Protocol called “HyParView”, using a more “directed” approach.

HyParView sure looks like an interesting solution in terms of robustness for environments with a high incidence
of system/network failures, but I still believe that using coordinators will be more efficient in a well controlled data
center.

Not that using coordinators and making them scale out BIG is exactly trivial, as you can read here:
-On the “Vertical Paxos and Primary-Backup Replication” paper, by Leslie Lamport et al, that Henrique Moniz
pointed me to:
 http://research.microsoft.com/pubs/80907/podc09v6.pdf

-Or on this interesting article from Cloudera’s blog about the (now upcoming) Observers feature of Apache
ZooKeeper.
Eventually Consistent Systems

                   Banks
                   EAI Integrations
                   Many messaging based (SOA) systems
                   Google
                   Amazon
                   Etc.



                                                                                                              40

Unlike what many examples say, banks often use Eventual Consistency for many (limited value/risk) transactions -
or use “large” periodic transaction/compensation fixed windows to process large numbers of higher value
movements. So much for those ACID transaction examples...
ACID and FAST
(Lowest Latency - read/write - hardest stuff)




                                                41
Immediately Consistent Systems

                                                               Data-grids:
                                                                Coherence
           Trading                                              Gigaspaces
                                                               All Data in RAM
           Online Gambling                                     Can do ACID
                                                               Very High Speed
                                                               Max. Scale-out


                                                                                                               42

Trading and Online Gambling really need to do large volumes of fast ACID transactions and are the big customers
of Data Grids.

Why Online Gambling needs ACID transactions has all to do with the type of game and the type of rules/assets
(some virtual) it involves.

Why Trading really needs ACID is a bit more obvious: you might be able to compensate an overdraft at a bank
(more so for limited values) but you really cannot sell shares you do not have for sale.

The performance needs are obvious for both too. For Trading there are even some new reasons, like (again):
 http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
Tools
(Most with source code to pick from)




                                       43
NoSQL Taxonomy
      by Steve Yen [PG]
          key‐value‐cache: memcached, repcached, coherence [?], infinispan, eXtreme scale, jboss
          cache, velocity, terracota [???]
          key‐value‐store: keyspace [w/Paxos], flare, schema‐free, RAMCloud [, Mnesia (Erlang),
          Chordless]
          eventually‐consistent key‐value‐store: dynamo, Voldemort, Dynomite, SubRecord,
          MotionDb, Dovetaildb
          ordered‐key‐value‐store: tokyo tyrant[, BerkeleyDB], lightcloud, NMDB, luxio, memcachedb,
          actord
          data‐structures server: redis
          tuple‐store: gigaspaces [?], coord, apache river
          object database: ZopeDB, db4o, Shoal
          document store: CouchDB [evC, MVCC], MongoDB [evC], Jackrabbit, XML Databases,
          ThruDB, CloudKit, Perservere, Riak Basho [evC], Scalaris [Erlang, w/Paxos]
          wide columnar store: BigTable, Hadoop HBase [w/ Zookeeper], [Amazon Dynamo-evC, ]
          Cassandra [evC], Hypertable, KAI, OpenNeptune, Qbase, KDI
          [graph database: Neo4J, Sones, etc.]

                                                                                                                       44

From Steve Yen’s slideware (slide 54) he used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009:
 http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf

I do not completely understand or agree with Steve’s criteria, but it sure is a possible starting point for building a
database/storage taxonomy.

The stuff in square brackets is mine. “evC” means Eventually Consistent and “?” just means I have doubts / don’t
understand some specific classification.
Opportunities
(...to use these toys)




                         45
Cases to talk about
                    Analytics
                    Live soccer game site (like BBC News did)
                    Log like / timeline systems
                    (forums, healthcare, Twitter, etc.)
                    EAI Integrations
                    (Should use Vector Clocks?)
                    Zookeeper at the “Farm” (Config./Coord.)
                    Logistic Planning across the EU
                    Trading

                                                                                                                         46

This is the placeholder slide to exercise the ideas and discuss possible applications of some of the mechanisms
presented in this talk (there was no time at Codebits... still tuning this not-so-easy presentation).

Except for the last two scenarios (and the Twitter alternative on the “Log like” one), all the others represent quite
common types of problems which you can run into without having to work for a Fortune Top 50 company or for a
mega web portal / service. Even an “Analytics” case with enough data to justify using MapReduce is common enough.
Many large (but not necessarily huge) companies give up on doing more with the data they have just because of
the trouble of finding a way to do that “more”.

* “Analytics” (lots of data + easy on consistency as it is) currently seems to be the playground of MapReduce, with
Hadoop stuff being used “everywhere”. Look at how many times you can find the words “analytics” or
“analysis” (and “MapReduce”) on these “Powered by” Hadoop web pages:
 http://wiki.apache.org/hadoop/PoweredBy
 http://wiki.apache.org/hadoop/Hbase/PoweredBy

* “Live soccer game...” is a nice problem for discussing short-lived caching and its consistency issues;
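The idea behind a “live but cacheable” page can be sketched as a tiny TTL cache in front of the backend. This is an illustrative sketch (the `TTLCache` / `get_or_load` names are mine, not from any specific framework): every reader inside the short time window gets the same, possibly slightly stale answer, and the backend sees one read per window instead of one per visitor.

```python
import time

class TTLCache:
    """Minimal short-lived ("3 second") cache sketch, as discussed for a
    live soccer report page. The clock is injectable to make it testable."""

    def __init__(self, ttl_seconds=3.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            # Still inside the window: serve the (possibly stale) cached value.
            return entry[1]
        # Expired or missing: one backend read refreshes it for everyone.
        value = loader()
        self._store[key] = (now + self.ttl, value)
        return value
```

The point being made on the slide holds here: as long as the client cannot observe the exact moment of each server-side state change, the up-to-3-second staleness is effectively invisible, while the backend load drops from “one read per page view” to “one read per window”.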

* “Log like / timeline systems...” are systems where information is mostly “insert only” and most of the effort to
keep consistency goes into keeping proper ordering information (with timestamps usually being enough),
properly merging the data from different sources and respecting the explicit or implicit SLAs on data
synchronization. Obviously, there are different difficulties across the several cases mentioned here, depending on
data flow, necessary performance, etc.;
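For the “timestamps are usually enough” case, the merge step is close to trivial when each source already delivers its events in order: the standard library heap merge does it in one line. A minimal sketch (the tuple layout and function name are mine, chosen for illustration):

```python
import heapq

def merge_timelines(*sources):
    """Merge several per-source event streams into one global timeline.

    Each source is an iterable of (timestamp, source_id, payload) tuples,
    already sorted by timestamp within that source. heapq.merge compares
    tuples lexicographically, so source_id acts as a deterministic
    tie-breaker for equal timestamps.
    """
    return list(heapq.merge(*sources))

web_events = [(1, "web", "goal"), (4, "web", "yellow card")]
mobile_events = [(2, "mobile", "comment"), (3, "mobile", "comment")]
feed = merge_timelines(web_events, mobile_events)
```

Note that this only works because each source is internally ordered; the harder part mentioned above, meeting the synchronization SLAs, is about *when* each source’s events become available to merge at all.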

* “EAI Integrations” often need better knowledge about ordering and are not as simple as the previous scenario.
Due to factors like the use of asynchronous and event-driven mechanisms and the possibility of having updates
for a given document across multiple steps of (multiple) processes, a timestamp is often too limited as
ordering information... but it is often the most you get. IMO this is a good scenario for using Vector Clocks and
company;
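A minimal vector clock sketch (the function names are mine) of the kind of ordering information that could help here: each system increments its own counter on a local update and merges clocks when it receives a document, so genuinely concurrent updates are detected as conflicts instead of being silently overwritten by whichever timestamp happens to be larger.

```python
def vc_increment(clock, node):
    """Return a new clock with this node's counter bumped (a local update)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(a, b):
    """Element-wise max: the clock of a node that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

For example, if an ERP and a CRM (hypothetical node names) both update a document derived from the same earlier version, `vc_compare` reports `"concurrent"`, which is precisely the situation a single timestamp cannot express.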

* “Zookeeper” is a great system even if used “just” to configure the simplest web (or web service) farm, to coordinate the
simplest cross-farm operations (e.g.: cache related) or just so each server knows which are its peers;

* “Logistic Planning” is a complex scenario which demands a mix of solutions. It revolves around a logistics
company which transports goods across Europe, with planning offices in different countries. I will probably have
to remove it from this slide for any future talk I might give on this topic, even though it is the most interesting of them
all. So, it does not make much sense to develop it here (maybe a blog post since, to me, this is a >10 year old
Q&A




      47

Mais conteúdo relacionado

Semelhante a Distributed Programming and Data Consistency w/ Notes

HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyRohit Dubey
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
Google Cloud Spanner y NewSQL
Google Cloud Spanner y NewSQLGoogle Cloud Spanner y NewSQL
Google Cloud Spanner y NewSQLGlobant
 
Real-time Search at Yammer - By Aleksandrovsky Boris
Real-time Search at Yammer - By Aleksandrovsky BorisReal-time Search at Yammer - By Aleksandrovsky Boris
Real-time Search at Yammer - By Aleksandrovsky Borislucenerevolution
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic WebIrina Hutanu
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareBeyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareQuantum Leaps, LLC
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Coursejimliddle
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patternsLars Albertsson
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Pavlo Baron
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 
Nt1320 Unit 51
Nt1320 Unit 51Nt1320 Unit 51
Nt1320 Unit 51Tara Smith
 
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...João Parreira
 
How to Build a Pure Evil Magento Module
How to Build a Pure Evil Magento ModuleHow to Build a Pure Evil Magento Module
How to Build a Pure Evil Magento ModuleAOE
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Ververica
 

Semelhante a Distributed Programming and Data Consistency w/ Notes (20)

HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Google Cloud Spanner y NewSQL
Google Cloud Spanner y NewSQLGoogle Cloud Spanner y NewSQL
Google Cloud Spanner y NewSQL
 
Realtime search at Yammer
Realtime search at YammerRealtime search at Yammer
Realtime search at Yammer
 
Real Time Search at Yammer
Real Time Search at YammerReal Time Search at Yammer
Real Time Search at Yammer
 
Real-time Search at Yammer - By Aleksandrovsky Boris
Real-time Search at Yammer - By Aleksandrovsky BorisReal-time Search at Yammer - By Aleksandrovsky Boris
Real-time Search at Yammer - By Aleksandrovsky Boris
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
NoSQL Basics - A Quick Tour
NoSQL Basics - A Quick TourNoSQL Basics - A Quick Tour
NoSQL Basics - A Quick Tour
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded SoftwareBeyond the RTOS: A Better Way to Design Real-Time Embedded Software
Beyond the RTOS: A Better Way to Design Real-Time Embedded Software
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 
Nt1320 Unit 51
Nt1320 Unit 51Nt1320 Unit 51
Nt1320 Unit 51
 
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...
Building collaborative HTML5 apps using a backend-as-a-service (HTML5DevConf ...
 
How to Build a Pure Evil Magento Module
How to Build a Pure Evil Magento ModuleHow to Build a Pure Evil Magento Module
How to Build a Pure Evil Magento Module
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
 

Último

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 

Último (20)

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 

Distributed Programming and Data Consistency w/ Notes

  • 1. Distributed Programming and Data Consistency by Paulo Gaspar @paulogaspar7 1 Twitter: @paulogaspar7 - http://twitter.com/paulogaspar7 Blog: http://paulogaspar7.blogspot.com/
  • 3. What is Consistency? 3 Our perception of consistency is related with what we know about the system and its state. That is how we figure what might fit...
  • 4. What isn’t? 4 ...and what does not fit. Obviously a person will have a different degree of precision and tolerance than an automated system.
  • 5. Consistency across time 5 Consistency also has a time axis, with state sequences that make sense... 1 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
  • 6. Consistency across time 6 2 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
  • 7. Consistency across time 7 3 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
  • 8. Inconsistency across time 8 ...and state sequences that do NOT make sense. 1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 9. Inconsistency across time 9 2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 10. Inconsistency across time 10 3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 11. Consistency is perception ...and time matters... 11 Again, each (type of) observer will have a different degree of evaluation precision and tolerance to inconsistencies.
  • 12. Caching Consistency (The Lower Latency - read performance) 12
  • 13. Data Caching Consistency Multi-layer caching The 3 second cache for a “LIVE” site (e.g.: BBC News live soccer reports) User changing cached data Schrodinger’s Cache? 13 Even on a “live” site you can use a short lived cache. If the user can NOT observe the exact time of each server state changes, are any server to client delays (due to caching) really there? Moreover, it is often a matter of having small update-until-view delays due to caching or really big ones (or the site down) due to overload.
  • 14. Memcached at FB: You HAVE TO Replicate to Scale-Out 14 An example of how you still might have to replicate in order to scale, even with a very high performance store. The reason for FB’s issue (might lack some detail): http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more- capacit.html
  • 15. So, now it “Loadbalances”... 15 ...and with LB inconsistencies along the time axis can happen (eg. by reading from alternate out-of-synch backends)
  • 16. ...but then you can have... 16 With the possibility of state sequences that do NOT make sense. 1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 17. Inconsistency across time 17 2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 18. Inconsistency across time 18 3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
  • 19. ...now it can pick >1 versions! 19 Why you can have inconsistencies along the time axis.
  • 20. Slow and Big Consistency (The Higher Latency - BigData) 20
  • 21. MapReduce is for embarrassingly parallel problems with some time... 21 Consistency scenarios, starting from the most “sexy” (Web, Peta Bytes of Data): * MapReduce works like vote counting - vote mapped to voting tables, counted, “reduced” to stats; * MR is appropriate for "embarrassingly parallel" tasks, like indexing the Internet and other huge processing tasks; * We should use it whenever possible; * There is a lot to be learned about Map Reduce: - Evaluation and expression of candidate problems; - Build and manage an its infrastructure; - etc. * Even MR has coordination needs; * Even MR should have SLAs (Service Level Agreements).
  • 22. MapReduce Implementations (& Cia.) Google, coordination by Chubby using Paxos. Used only at Google; Google BigTable is a Wide Column Store which works on top of GoogleFS. Used only at Google; Hadoop, used at Amazon, Facebook, Rackspace, Twitter, Yahoo!, etc.; Hadoop ZooKeeper implements a Paxos variation and is used at Rackspace, Yahoo!, etc.; Hadoop HBase is a Wide Column Store, on top of HDFS and now uses ZooKeeper. Used at Yahoo! etc. 22 Parallel between Google’s internally developed systems and their Hadoop counterparts. http://hadoop.apache.org/ http://labs.google.com/papers/ The very interesting “coordinators”: http://labs.google.com/papers/chubby.html http://hadoop.apache.org/zookeeper/ Zookeeper sure looks like a very interesting and reusable piece of software. Curiosity: HBase is faster since using ZooKeeper... is it also because of Zookeeper??? http://hadoop.apache.org/hbase/
  • 23. Consistency w/ Interaction (Low Latency - read/write - harder stuff) 23
  • 24. Two “High”/Sexy reasons for Distributing Data Storage (not just cache) High Performance Data Access (Read / Write) High Availability (HA) 24
  • 25. Why care about HA? 1.7% HDDs fail in the 1st year, 8.6% in the 3rd (Google) Unrecoverable RAM errors/year: 1.3% machines, 0.22% DIMM (Google) Router, Rack, PDU, misc. network failures Over 4 nines only through redundancy, best hardware never good enough (James Hamilton-MS and Amazon) 25 Sources: For Google’s numbers check the slideware at: http://videolectures.net/wsdm09_dean_cblirs/ For the James Hamilton quote: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf Another very quoted paper with Google’s DRAM failure stats and patterns: http://research.google.com/pubs/pub35162.html You can find other HA and Systems related papers from Google and James Hamilton at: http://mvdirona.com/jrh/work/ http://research.google.com/pubs/DistributedSystemsandParallelComputing.html
  • 26. Why care about Latency? Google: Half a second delay caused a 20% drop in traffic (30 results instead of 10, via Marissa Mayer); Amazon found every 100ms of latency costs 1% sales (via Greg Linden); A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 ms behind the competition (via NYT). 26 You can find all this references trough this page (if you follow the links): http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it Including these: http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  • 27. Other Distributed Data Contexts (the less sexy daily stuff) EAI / B2B / Systems Integration Geographic Distribution (e.g.:Health System+Hospitals) Systems with n-tier / SOA Architectures 27 The daily jobs of so many IT professionals have much more relation with this type of common distributed systems than with the sexier kind we talked about before. But these fields too would benefit from the learning the lessons and using the technologies we are talking about.
  • 28. Fallacies of Distributed Computing 1. The network is reliable; 2. Latency is zero; 3. Bandwidth is infinite; 4. The network is secure; 5. Topology doesn't change; 6. There is one administrator; 7. Transport cost is zero; 8. The network is homogeneous. 28 Just to remember this classic on the HA challenges. A few more details at: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
  • 29. CAP Theorem History 1999: 1st mention on the “Harvest, Yield and Scalable Tolerant Systems” paper by Eric A. Brewer (Berkley/Inktomi) and Armando Fox (Stanford/Berkley) 2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC Conference 2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and Nancy Lynch (MIT) 2007-10-02: “Amazon's Dynamo” post by Werner Vogels (Amazon’s CTO) quoting the paper: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007. 2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO) 29 The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual Consistency” chapter: http://books.couchdb.org/relax/intro/eventual-consistency Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in truly industrial sites, even with stats describing real life behavior: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and the now famous Eventually Consistent post by Werner Vogels: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html If you dislike the introductory (justifiable) drama, just jump to the next part because this article, by Julian Browne, is the best I found about the Brewer’s CAP Theorem and its history: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem You should still take a look at: * The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf * The 1999 “Harvest, Yeld and Scalable Tolerant Systems” paper (Brewer et al.) 
where the CAP conjecture is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf * The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf * You might also see with your own eyes how CAP became a proved Theorem: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf Definition of ACID: http://en.wikipedia.org/wiki/ACID
  • 30. The CAP Theorem strong Consistency, high Availability, Partition-resilience: pick at most 2 30 I simply had to put The Diagram, of course.
  • 31. Eventual Consistency for Availability BASE ACID (Basically Available Soft-state Eventual consistency) (Atomicity, Consistency, Isolation, Durability) Weak Consistency Strong consistency (stale data ok) (NO stale data) Availability first Isolation Best effort Focus on “commit” Approximate answers OK Availability? Aggressive (optimistic) Conservative (pessimistic) Faster Safer 31 You can find a variation of this slide at Brewer’s 2000’s PODC keynote at: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf I skipped these rather controversial bits: ACID: * Nested transactions; * Difficult evolution (e.g. schema) BASE: * Simpler! * Easier evolution I already tried both ways (data stores with and without schema) and I rather have some schema mechanism for the most complex stuff. ACID: A)tomicity Either all of the tasks of a transaction are performed or none of them are. C)onsistency A database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful or not). I)solation Other operations cannot access or see the data in an intermediate state during a transaction. D)urability Once the user has been notified of success, the transaction will persist. This means it will survive system failure, and that the database system has checked the integrity constraints and won't need to abort the transaction.
  • 32. CAP Trade-offs CA without P: Databases providing distributed transactions can only do it while their network is ok; CP without A: While there is a partition, transactions to an ACID database may be blocked until the partition heals (to avoid merge conflicts -> inconsistency); AP without C: Caching provides client-server partition resilience by replicating data, even if the partition prevents verifying if a replica is fresh. In general, any distributed DB problem can be solved with either: expiration-based caching to get AP; or replicas and majority voting to get PC (minority is unavailable). 32 Concept introduced at the 1999 “Harvest, Yeld and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf I should probably skip this slide during a life presentation. This is stuff you have to read about.
• 33. Living with CAP All systems are probabilistic, whether they realize it or not And so are Distributed Transactions (2 Generals Problem) Weak CAP Principle: The stronger the guarantees made about any two of C, A and P, the weaker the guarantees that can be made about the third Systems should degrade gracefully, instead of all or nothing (e.g.: displaying data from available partitions) Life is Eventually Consistent Aim for Eventual Consistency 33 Steve Yen clearly illustrates the “Life is Eventually Consistent” idea on the slideware (slides 40 to 45) he used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009: http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf The Weak CAP Principle was introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf To understand how hard (ACID) Distributed Transactions are, you have an excellent history of the concepts related to this problem here: http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html The difficulties of (ACID) Distributed Transactions are well illustrated by the classic Two Generals’ Problem: http://en.wikipedia.org/wiki/Two_Generals'_Problem Leslie Lamport et al further explore the problem (and its solutions) on the classic “The Byzantine Generals Problem” paper: http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf And if you think that Two Phase Commit is a 100% reliable mechanism... think again: http://www.cs.cornell.edu/courses/cs614/2004sp/papers/Ske81.pdf This is just to illustrate the difficulty of the problem. There are more reliable mechanisms, like Three Phase Commit: http://en.wikipedia.org/wiki/Three-phase_commit_protocol http://ei.cs.vt.edu/~cs5204/fall99/distributedDBMS/sreenu/3pc.html ...or the so called Paxos Commit: http://research.microsoft.com/pubs/64636/tr-2003-96.pdf
• 34. CAP Theorem History 1999: 1st mention on the “Harvest, Yield, and Scalable Tolerant Systems” paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley) 2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC Conference 2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and Nancy Lynch (MIT) 2007-10-02: “Amazon's Dynamo” post by Werner Vogels (Amazon’s CTO) quoting the paper: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007. 2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO) 34 Repeated slide, repeated notes (to pass focus from CAP to Dynamo and Eventual Consistency): The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual Consistency” chapter: http://books.couchdb.org/relax/intro/eventual-consistency Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in truly industrial sites, even with stats describing real life behavior: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and the now famous Eventually Consistent post by Werner Vogels: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html If you dislike the introductory (justifiable) drama, just jump to the next part because this article, by Julian Browne, is the best I found about Brewer’s CAP Theorem and its history: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem You should still take a look at: * The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) 
where the BASE vs ACID dilemma is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf * The 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf * The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf * You might also see with your own eyes how CAP became a proved Theorem: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf Definition of ACID: http://en.wikipedia.org/wiki/ACID
• 35. Amazon’s Dynamo DB Also a “Wide Column Store” Problem Technique Partitioning Consistent Hashing High Availability for writes Vector clocks with reconciliation during reads Handling temporary failures Sloppy Quorum and hinted handoff (NRW) Recovering from permanent failures Anti-entropy using Merkle trees Membership and failure detection Gossip-based membership protocol and failure detection. 35 The source here is the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Strict distributed DBs, rather than dealing with the uncertainty of the correctness of an answer, make the data unavailable until it is absolutely certain that it is correct. At Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution - avg or median not good enough to provide a good experience for all. The choice for 99.9% over an even higher percentile has been made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much. Experiences with Amazon’s production systems have shown that this approach provides a better overall experience compared to those systems that meet SLAs defined based on the mean or median.
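The first technique in the table, consistent hashing, fits in a few lines. Here is a toy sketch (the class name `ConsistentHashRing`, the MD5 choice and the virtual-node count are my own, not from the Dynamo paper): nodes and keys hash onto the same ring, a key belongs to the first node clockwise from its hash, so adding a node only remaps the keys that fall between it and its predecessor.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node gets `vnodes` points, smoothing the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash("%s#%d" % (node, i)), node))

    def node_for(self, key):
        # First ring point clockwise of the key's hash (wrapping at the end).
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]
```

A quick check of the property that matters for partitioning: after adding node "d", every key either keeps its old owner or moves to "d" - no shuffling between the surviving nodes.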
• 36. N: number of nodes to replicate each item to; W: number of required nodes for write success; R: number of required nodes for read success. W < N = remaining nodes will receive the write later. R < N = remaining nodes ignored. 36 Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...but you can find a similar diagram and similar mechanisms described about several (NoSQL) databases that partially clone Dynamo.
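The quorum rule behind these knobs is R + W > N: by pigeonhole, any set of W written replicas and any set of R read replicas must then share at least one node, so a read always touches at least one replica holding the latest write. A brute-force check of that claim (illustrative only, for small N):

```python
from itertools import combinations

def overlap_guaranteed(n, r, w):
    """True iff every size-w write set and size-r read set share a replica."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

# The pigeonhole rule predicts overlap exactly when R + W > N:
for n in range(1, 6):
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert overlap_guaranteed(n, r, w) == (r + w > n)
```

For example N=3, W=2, R=2 (a common Dynamo-style setting) guarantees overlap, while N=3, W=2, R=1 does not: the one replica you read may be one of those that has not yet received the write.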
• 37. Wikipedia image Merkle Tree / Hash Tree Used to verify / compare a set of data blocks and efficiently find where the mismatches are. 37 Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and on the Wikipedia article about this algorithm: http://en.wikipedia.org/wiki/Hash_tree
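A minimal sketch of the idea (function names are mine; a real anti-entropy implementation, as in Dynamo, would exchange precomputed subtree hashes over the network rather than recomputing roots per slice): if two subtree roots match, the whole range can be skipped, so replicas only drill down into the halves that disagree.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Root hash of a list of data blocks; odd levels duplicate the last hash."""
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diff(a, b, lo=0, hi=None):
    """Indices of mismatching blocks, descending only into differing subtrees."""
    if hi is None:
        hi = len(a)
    if merkle_root(a[lo:hi]) == merkle_root(b[lo:hi]):
        return []          # subtree roots match: skip this whole range
    if hi - lo == 1:
        return [lo]        # a single differing leaf found
    mid = (lo + hi) // 2
    return diff(a, b, lo, mid) + diff(a, b, mid, hi)
```

With 8 blocks where only block 5 was tampered with, `diff` compares O(log n) subtree roots instead of all 8 blocks and reports `[5]`.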
• 38. Wikipedia image Vector Clocks On each internal event a process increments its logical clock; Before sending a message, it increments its own clock in the vector and sends it with the message; On receiving a message, it increments its clock and updates each element on its own vector to max(own, msg). 38 Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and on the Wikipedia article about this algorithm: http://en.wikipedia.org/wiki/Vector_clock Vector Clocks (and other similar algorithms) have a predecessor in Lamport timestamps: http://en.wikipedia.org/wiki/Lamport_timestamps Introduced in the classic paper “Time, Clocks, and the Ordering of Events in a Distributed System” by Leslie Lamport.
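The three rules above translate almost line-for-line into code. A toy sketch (class and function names are my own), plus the comparison that tells causal order from concurrency - the case where Dynamo must keep both versions and reconcile on read:

```python
from collections import defaultdict

class VectorClock:
    """One process's vector clock, following the three rules on the slide."""

    def __init__(self, pid):
        self.pid = pid
        self.clock = defaultdict(int)   # missing entries count as 0

    def tick(self):
        # Rule 1: internal event increments our own component.
        self.clock[self.pid] += 1

    def send(self):
        # Rule 2: increment, then attach a copy of the vector to the message.
        self.tick()
        return dict(self.clock)

    def receive(self, msg_clock):
        # Rule 3: increment our own component, then element-wise max merge.
        self.tick()
        for pid, t in msg_clock.items():
            self.clock[pid] = max(self.clock[pid], t)

def happened_before(a, b):
    """True iff clock a causally precedes b: a <= b everywhere, < somewhere."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))
```

Two clocks where `happened_before` is false in both directions are concurrent: neither version supersedes the other, which is exactly the situation handed to the application for reconciliation during reads.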
• 39. Amazon Dynamo Lessons (according to the paper) Data returned to Shopping Cart 24h profiling: 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions and 0.00009% of requests saw 4 versions. In two years applications have received successful responses (without timing out) for 99.9995% of its requests and no data loss event has occurred to date; With coordination via Gossip protocol it is harder to scale further than a few hundred nodes. (Could be better w/ Chubby / ZK like coordinators?) 39 Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Wikipedia has an article on Gossip Protocols (although, at the date I write this, not as precise as other Wikipedia articles I just quoted): http://en.wikipedia.org/wiki/Gossip_protocol The solution I mention as a possibly more scalable alternative to Gossip Protocols for consensus is the use of Paxos (or derivatives) Coordinators, like the proprietary Google’s Chubby or the open source Apache Hadoop ZooKeeper. When I first wrote and used (at my SAPO Codebits 2009 talk) these slides, the only support I still had for my (then intuitive) belief that these more directed approaches should be more efficient than Gossip Protocols was section 6.6 of the Dynamo paper - the paper even mentions the possibility of “introducing hierarchical extensions to Dynamo”. Thanks to my SAPO Codebits talk I met Henrique Moniz, then a Ph.D. student at the University of Lisbon. After I discussed this issue (consensus scalability) with him, he pointed me to a couple of interesting papers, one of which immediately captured my attention: * Gossip-based broadcast protocols by João Leitão http://www.gsd.inesc-id.pt/~jleitao/pdf/masterthesis-leitao.pdf This paper offers a more complete description of gossip protocol overhead and, to my surprise, also pointed out a few reliability weak spots in known Gossip Protocols. 
The paper goes on to present a more robust and efficient Gossip Protocol called “HyParView” using a more “directed” approach. HyParView sure looks like an interesting solution in terms of robustness for environments with a high incidence of system/network failures, but I still believe that using coordinators will be more efficient in a well controlled data center. Not that using coordinators and making them scale out BIG is exactly trivial, as you can read here: -On the “Vertical Paxos and Primary-Backup Replication” paper, by Leslie Lamport et al, that Henrique Moniz pointed me to: http://research.microsoft.com/pubs/80907/podc09v6.pdf -Or on this interesting article from Cloudera’s blog about the (now upcoming) Observers feature of Apache
• 40. Eventually Consistent Systems Banks EAI Integrations Many messaging based (SOA) systems Google Amazon Etc. 40 Contrary to what many examples suggest, banks often use Eventual Consistency for many (limited value/risk) transactions - or use large, fixed periodic transaction/compensation windows to process large numbers of higher-value movements. So much for those ACID transaction examples...
  • 41. ACID and FAST (Lowest Latency - read/write - hardest stuff) 41
• 42. Immediately Consistent Systems Data-grids: Coherence Trading Gigaspaces All Data in RAM Online Gambling Can do ACID Very High Speed Max. Scale-out 42 Trading and Online Gambling really need to do large volumes of fast ACID transactions and are the big customers of Data Grids. Why Online Gambling needs ACID transactions has all to do with the type of game and the type of rules/assets (some virtual) it involves. Why Trading really needs ACID is a bit more obvious: you might be able to compensate an overdraft at a bank (more so for limited values) but you really cannot sell shares you do not have for sale. The performance needs are obvious for both too. For Trading there are even some new reasons, like (again): http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  • 43. Tools (Most with source code to pick from) 43
• 44. NoSQL Taxonomy by Steve Yen [PG] key‐value‐cache: memcached, repcached, coherence [?], infinispan, eXtreme scale, jboss cache, velocity, terracotta [???] key‐value‐store: keyspace [w/Paxos], flare, schema‐free, RAMCloud [, Mnesia (Erlang), Chordless] eventually‐consistent key‐value‐store: dynamo, Voldemort, Dynomite, SubRecord, MotionDb, Dovetaildb ordered‐key‐value‐store: tokyo tyrant[, BerkeleyDB], lightcloud, NMDB, luxio, memcachedb, actord data‐structures server: redis tuple‐store: gigaspaces [?], coord, apache river object database: ZopeDB, db4o, Shoal document store: CouchDB [evC, MVCC], MongoDB [evC], Jackrabbit, XML Databases, ThruDB, CloudKit, Persevere, Riak (Basho) [evC], Scalaris [Erlang, w/Paxos] wide columnar store: BigTable, Hadoop HBase [w/ Zookeeper], [Amazon Dynamo-evC, ] Cassandra [evC], Hypertable, KAI, OpenNeptune, Qbase, KDI [graph database: Neo4J, Sones, etc.] 44 From Steve Yen’s slideware (slide 54) he used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009: http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf I do not completely understand or agree with Steve’s criteria but it sure is a possible starting point on building a database/storage taxonomy. The stuff in square brackets is mine. “evC” means Eventually Consistent and “?” just means I have doubts / don’t understand some specific classification.
• 46. Cases to talk about Analytics Live soccer game site (like BBC News did) Log like / timeline systems (forums, healthcare, Twitter, etc.) EAI Integrations (Should use Vector Clocks?) ZooKeeper at the “Farm” (Config./Coord.) Logistics Planning across the EU Trading 46 This is the placeholder slide to exercise the ideas and discuss possible applications of some of the mechanisms which were presented on this talk (had no time at Codebits... still tuning this not-so-easy presentation). Except for the last two scenarios (and the Twitter alternative on the “Log like” one) all others represent quite common types of problems which you can meet without having to work for a Fortune Top 50 company or for a mega web portal / service. Even an “Analytics” with enough data to justify using MapReduce is common enough. Many large (but not necessarily huge) companies often quit doing more with the data they have just because of the trouble of finding a way to do it (“more”). * “Analytics” (high data + easy on consistency as it is) currently seems to be the playground of MapReduce, with Hadoop stuff being used “everywhere”. Look at how many times you can find the words “analytics” or “analysis” (and “MapReduce”) on these “Powered by” Hadoop web pages: http://wiki.apache.org/hadoop/PoweredBy http://wiki.apache.org/hadoop/Hbase/PoweredBy * “Live soccer game...” is a nice problem to discuss short-lived caching and its consistency issues; * “Log like / timeline systems...” are systems where information is mostly “insert only” and most of the effort to keep consistency is related to keeping proper ordering information (with timestamps being usually enough), properly merging the data from different sources and respecting the explicit or implicit SLAs on data synchronizations. 
Obviously, there are different difficulties across the several cases here mentioned, depending on data flow, necessary performance, etc.; * “EAI Integrations” often need better knowledge about ordering and are not as simple as the previous scenario. Due to factors like the use of asynchronous and event driven mechanisms and the possibility of having updates for a given document across multiple steps of a (multiple) process(es), a timestamp is often too limited as ordering information... but is often the most you get. IMO this is a good scenario for using Vector Clocks and company; * “ZooKeeper” is a great system even if “just” to configure the simplest web (or webservice) farm, to coordinate the simplest cross farm operations (e.g.: cache related) or just for each server to know which are its peers; * “Logistics Planning” is a complex scenario which demands a mix of solutions. It revolves around a logistics company which transports goods across Europe, with planning offices in different countries. I will probably have to remove it from this slide for any future talk I might give on this topic even if it is the most interesting of them all. So, it does not make much sense to develop it here (maybe a blog post since, to me, this is a >10 year old
  • 47. Q&A 47