SlideShare a Scribd company logo
1 of 132
Distributed Databases



             Dr. Julian Bunn
Center for Advanced Computing Research
                 Caltech

            Based on material provided by:
  Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu
               Ramakrishnan (Wisconsin)
Outline
                   ?             Introduction to Database
                                 Systems
                   ?             Distributed Databases
                   ?             Distributed Systems
                   ?             Distributed Databases for
                                 Physics


J.J.Bunn, Distributed Databases, 2001                        2
Part I

Introduction to Database
        Systems     .




                   Julian Bunn
        California Institute of Technology
What is a Database?
   ?             A large, integrated collection of data
   ?             Entities (things) and Relationships
                 (connections)
   ?             Objects and Associations/References
   ?             A Database Management System
                 (DBMS) is a software package designed
                 to store and manage Databases
   ?             “Traditional” (ER) Databases and
                 “Object” Databases


J.J.Bunn, Distributed Databases, 2001                     4
Why Use a DBMS?
          ?             Data Independence
          ?             Efficient Access
          ?             Reduced Application Development Time
          ?             Data Integrity
          ?             Data Security
          ?             Data Analysis Tools
          ?             Uniform Data Administration
          ?             Concurrent Access
          ?             Automatic Parallelism
          ?             Recovery from crashes
J.J.Bunn, Distributed Databases, 2001                          5
Cutting Edge Databases
       ?             Scientific Applications
       ?             Digital Libraries, Interactive Video,
                     Human Genome project, Particle
                     Physics Experiments, National Digital
                     Observatories, Earth Images
       ?             Commercial Web Systems
       ?             Data Mining / Data Warehouse
       ?             Simple data but very high transaction
                     rate and enormous volume (e.g. click
                     through)
J.J.Bunn, Distributed Databases, 2001                        6
Data Models
  ?             Data Model: A Collection of Concepts
                for Describing Data
  ?             Schema: A Set of Descriptions of a
                Particular Collection of Data, in the
                context of the Data Model
  ?             Relational Model:
                  ?        E.g. A Lecture is attended by zero or more
                           Students
  ?             Object Model:
                  ?        E.g. A Database Lecture inherits attributes
                           from a general Lecture
J.J.Bunn, Distributed Databases, 2001                                    7
Data Independence
  ?             Applications insulated from how data
                in the Database is structured and stored
                   ? Logical Data Independence: Protection
                     from changes in the logical structure of
                     the data
                   ? Physical Data Independence: Protection
                     from changes in the physical structure of
                     the data




J.J.Bunn, Distributed Databases, 2001                            8
Concurrency Control
     ?             Good DBMS performance relies on
                   allowing concurrent access to the data
                   by more than one client
     ?             DBMS ensures that interleaved actions
                   coming from different clients do not
                   cause inconsistency in the data
                     ?        E.g. two simultaneous bookings for the
                              same airplane seat
     ?             Each client is unaware of how many
                   other clients are using the DBMS
J.J.Bunn, Distributed Databases, 2001                                  9
Transactions
    ?            A Transaction is an atomic sequence of
                 actions in the Database (reads and
                 writes)
    ?            Each Transaction has to be executed
                 completely, and must leave the
                 Database in a consistent state
                    ?       The definition of “consistent” is ultimately the client’s responsibility!
                                                                                      responsibility!

    ?            If the Transaction fails or aborts
                 midway, then the Database is “rolled
                 back” to its initial consistent state
                 (when the Transaction began).
J.J.Bunn, Distributed Databases, 2001                                                                   10
What Is A Transaction?
                         ?              Programmer’s view:
                                        ?   Bracket a collection of actions
                         ?              A simple failure model
                                        ?   Only two outcomes:

                             Begin()                   Begin()        Begin()
                                action                 action         action
                                action                 action         action
                                action                 action         action
                                action                 Rollback()                  Fail !!
                                                                                   Fail
                             Commit()                                 Rollback()

               Success!                                          Failure!
J.J.Bunn, Distributed Databases, 2001                                                        11
ACID
          ?             Atomic: all or nothing
          ?             Consistent: state transformation
          ?             Isolated: no concurrency
                        anomalies
          ?             Durable: committed transaction
                        effects persist



J.J.Bunn, Distributed Databases, 2001                      12
Why Bother: Atomicity?
  ?            RPC semantics:
                  ?       At most once: try one time      ?
                  ?       At least once: keep trying
                                                          ?
                          ’till acknowledged              ?
                  ?       Exactly once: keep trying
                          ’till acknowledged and server
                          discards duplicate requests



J.J.Bunn, Distributed Databases, 2001                     13
Why Bother: Atomicity?
?       Example: insert record in file
           ? At most once: time-out means “maybe”
           ? At least once: retry may get “duplicate” error
             or retry may do second insert
           ? Exactly once: you do not have to worry

?       What if operation involves
           ? Insertseveral records?
           ? Send several messages?

?       Want ALL or NOTHING for group of actions



J.J.Bunn, Distributed Databases, 2001                         14
Why Bother: Consistency
      ?           Begin-Commit brackets a set of operations
      ?           You can violate consistency inside brackets
                    ? Debit but not credit (destroys money)
                    ? Delete old file before create new file in a copy
                    ? Print document before delete from spool queue

      ?           Begin and commit are points of consistency




                                                                       Commit
                           Begin




                                           State transformations
                                        new state under construction


J.J.Bunn, Distributed Databases, 2001                                           15
Why Bother: Isolation
      ?             Running programs concurrently
                    on same data can create
                    concurrency anomalies
                       ?       The shared checking account example
                                  Begin()
                                        read BAL                             Begin()
                                        add 10       Bal = 100
                                                                 Bal = 100    read BAL
                                        write BAL                             Subtract 30
                                  Commit()          Bal = 110                 write BAL
                                                                  Bal = 70   Commit()



      ?             Programming is hard enough without
                    having to worry about concurrency
J.J.Bunn, Distributed Databases, 2001                                                       16
Isolation
?   It is as though programs run one at a time
       ? No               concurrency anomalies
?   System automatically protects applications
       ? Locking (DB2, Informix, Microsoft® SQL Server™, Sybase…)
       ? Versioned databases (Oracle, Interbase…)


                                            Begin()
                                             read BAL
                                             add 10       Bal = 100
                                             write BAL                            Begin()
                                            Commit()     Bal = 110    Bal = 110    read BAL
                                                                                   Subtract 30
                                                                                   write BAL
                                                                       Bal = 80   Commit()



    J.J.Bunn, Distributed Databases, 2001                                                        17
Why Bother: Durability
         ?             Once a transaction commits,
                       want effects to survive failures
         ?             Fault tolerance:
                       old master-new master won’t work:
                          ? Can’t do daily dumps:
                                 would lose recent work
                          ? Want “continuous” dumps

         ?             Redo “lost” transactions
                             in case of failure
         ?             Resend unacknowledged messages
J.J.Bunn, Distributed Databases, 2001                      18
Why ACID For
Client/Server And Distributed
?         ACID is important for centralized systems
?         Failures in centralized systems are simpler
?         In distributed systems:
             ? More and more -independent failures
             ? ACID is harder to implement

?         That makes it even MORE IMPORTANT
             ? Simple failure model
             ? Simple repair model

J.J.Bunn, Distributed Databases, 2001                   19
ACID Generalizations
?       Taxonomy of actions
          ? Unprotected:                not undone or redone
                    ? Temp
                       files
          ? Transactional: can be undone before commit
             ? Database and message operations
          ? Real: cannot be undone
             ? Drill a hole in a piece of metal,
               print a check
?       Nested transactions: subtransactions
?       Work flow: long-lived transactions
J.J.Bunn, Distributed Databases, 2001                          20
Scheduling Transactions
    ?            The DBMS has to take care of a set of
                 Transactions that arrive concurrently
    ?            It converts the concurrent Transaction
                 set into a new set that can be executed
                 sequentially
    ?            It ensures that, before reading or
                 writing an Object, each Transaction
                 waits for a Lock on the Object
    ?            Each Transaction releases all its Locks
                 when finished
                    ?       (Strict Two -Phase-Locking Protocol)
                                    Two-Phase-
J.J.Bunn, Distributed Databases, 2001                              21
Concurrency Control
                                          Locking
  ?             How to automatically prevent
                concurrency bugs?
  ?             Serialization theorem:
                  ?        If you lock all you touch and hold to commit:
                           no bugs
                  ?        If you do not follow these rules, you may see bugs
  ?             Automatic Locking:
                  ?        Set automatically (well-formed)
                  ?        Released at commit/rollback (two-phase locking)
  ?             Greater concurrency for locks:
                  ?        Granularity: objects or containers or server
                  ?        Mode: shared or exclusive or…
J.J.Bunn, Distributed Databases, 2001                                           22
Reduced Isolation Levels
?         It is possible to lock less and risk fuzzy data
?         Example: want statistical summary of DB
             ?      But do not want to lock whole database
?         Reduced levels:
             ?      Repeatable Read: may see fuzzy inserts/delete
                     ? But will serialize all updates
             ?      Read Committed: see only committed data
             ?      Read Uncommitted: may see uncommitted updates




J.J.Bunn, Distributed Databases, 2001                               23
Ensuring Atomicity
   ?             The DBMS ensures the atomicity of a
                 Transaction, even if the system crashes in the
                 middle of it
   ?             In other words all of the Transaction is
                 applied to the Database, or none of it is
   ?             How?
                   ?        Keep a log/history of all actions carried out on
                            the Database
                   ?        Before making a change, put the log for the
                            change somewhere “safe”
                   ?        After a crash, effects of partially executed
                            transactions are undone using the log
J.J.Bunn, Distributed Databases, 2001                                          24
DO/UNDO/REDO
            ?             Each action generates a log record
                                                       Old state                 New state

                                                                        DO



            ?             Has an UNDO action                                   Log
                                                    Log
                                        New state                  Old state

                                                     UNDO




            ?             Has a REDO action
                                                            Log
                                               Old state                 New state

                                                            REDO



J.J.Bunn, Distributed Databases, 2001                                                        25
What Does A Log Record
                  Look Like?
                     ?           Log record has
                                   ? Header   (transaction ID, timestamp… )
                                   ? Item ID
                                   ? Old value          ? Log ?
                                   ? New value

                     ?           For messages: just message text
                                 and sequence #
                     ?           For records: old and new value
                                 on update
                     ?           Keep records small
J.J.Bunn, Distributed Databases, 2001                                         26
Transaction Is A Sequence
                  Of Actions
                 ?             Each action changes state
                                        ?   Changes database
                                        ?   Sends messages
                                        ?   Operates a display/printer/drill press
                 ?             Leaves a log trail                Old state                   New state
                                                             Old state           DO New state
                                                         Old state             DO New state
                                                     Old state            DO    New state
                                                                                        Log
                                                                     DO                Log
                                                                                 Log

                                                                           Log
J.J.Bunn, Distributed Databases, 2001                                                                    27
Transaction UNDO Is Easy
                               ?        Read log backwards
                               ?        UNDO one step at a time
                               ?        Can go half-way back to
                                        get nested transactions
                                                        Old state           New state
                                                     Old state           New state
                                                                   UNDO
                                                  Old state           New state
                                                                 UNDO
                                               Old state      UNDO New state
                                                                         Log
                                                            UNDO      Log
                                                                      Log
                                                                    Log
J.J.Bunn, Distributed Databases, 2001                                                   28
Durability: Protecting The Log
            ?          When transaction commits
                          ? Put  its log in a durable place (duplexed disk)
                          ? Need log to redo transaction
                            in case of failure
                             ? System failure: lost




                                                                            e
                                                                         rit
                                in-memory updates




                                                                        W
                                                                           Log
                                                                          Log
                                                                         Log
                                                                        Log
                                                                       Log
                                                                      Log
                             ? Media failure (lost disk)
                                                                     Log
                                                                    Log

            ?          This makes transaction durable
            ?          Log is sequential file
                          ? Convertsrandom IO to single sequential IO
                          ? See NTFS or newer UNIX file systems

J.J.Bunn, Distributed Databases, 2001                                            29
Recovery After System Failure
   ?           During normal processing,
                 write checkpoints on non -volatile storage
   ?           When recovering from a system failure…
                 ? returnto the checkpoint state
                 ? Reapply log of all committed transactions
                 ? Force-at-commit insures log will survive restart

   ?           Then UNDO all uncommitted transactions
                                                Old state           New state
                                            Old state            New state
                                                             REDO
                                           Old state      REDO New state
                                        Old state             New state
                                                     LogREDO
                                                  LogREDO
                                                 Log
                                           Log
J.J.Bunn, Distributed Databases, 2001                                           30
Idempotence
                                              Dealing with failure

 ?             What if fail during restart?
                  ?       REDO many times
 ?             What if new state not around at restart?
                  ?       UNDO something not done



  Old state                     New state             New state   New state          Old state          Old state
                    REDO                       REDO                           UNDO               UNDO

      Log                               Log                         Log                 Log



J.J.Bunn, Distributed Databases, 2001                                                                               31
Idempotence
                                              Dealing with failure

?            Solution: make F(F(x))=F(x) (idempotence)
               ?        Discard duplicates
                         ? Message sequence numbers
                            to discard duplicates
                         ? Use sequence numbers on pages to detect state
               ?        (Or) make operations idempotent
                         ? Move to position x, write value V to byte B…


  Old state                     New state             New state   New state          Old state          Old state
                    REDO                       REDO                           UNDO               UNDO

      Log                               Log                         Log                 Log



J.J.Bunn, Distributed Databases, 2001                                                                               32
The Log: More Detail
  ?             Actions recorded in the Log
                   ? Transaction writes an Object
                      ? Store in the Log: Transaction Identifier,
                        Object Identifier, new value and old
                        value
                      ? This must happen before actually
                        writing the Object!
                   ? Transaction commits or aborts

  ?             Duplicate Log on “stable” storage
  ?             Log records chained by Transaction
                Identifier: easy to undo a Transaction
J.J.Bunn, Distributed Databases, 2001                               33
Structure of a Database
            ?            Typical DBMS has a layered architecture


                                        Query Optimisation & Execution

                                             Relational Operators

                                          Files and Access Methods

                                             Buffer Management

                                           Disk Space Management


                                                    Disk
J.J.Bunn, Distributed Databases, 2001                                    34
Database Administration
    ?             Design Logical/Physical Schema
    ?             Handle Security and Authentication
    ?             Ensure Data Availability, Crash
                  Recovery
    ?             Tune Database as needs and workload
                  evolves




J.J.Bunn, Distributed Databases, 2001                   35
Summary
    ?             Databases are used to maintain and
                  query large datasets
    ?             DBMS benefits include recovery from
                  crashes, concurrent access, data
                  integrity and security, quick application
                  development
    ?             Abstraction ensures independence
    ?             ACID
    ?             Increasingly Important (and Big) in
                  Scientific and Commercial Enterprises
J.J.Bunn, Distributed Databases, 2001                         36
Part 2

Distributed Databases
                   .




                  Julian Bunn
       California Institute of Technology
Distributed Databases
   ?             Data are stored at several locations
                    ?       Each managed by a DBMS that can run
                            autonomously
   ?             Ideally, location of data is unknown to
                 client
                    ?       Distributed Data Independence
   ?             Distributed Transactions are supported
                    ? Clients can write Transactions regardless
                      of where the affected data are located
                    ? Distributed Transaction Atomicity
                    ? Hard, and in some cases undesirable
                               ?        E.g. need to avoid overhead of ensuring location transparency
J.J.Bunn, Distributed Databases, 2001                                                                   38
Types of Distributed
                                 Database
       ?            Homogeneous: Every site runs the
                    same type of DBMS
       ?            Heterogeneous: Different sites run
                    different DBMS (maybe even RDBMS
                    and ODBMS)




J.J.Bunn, Distributed Databases, 2001                    39
Distributed DBMS
                                      Architectures
    ?             Client-Servers
                    ? Client sends query to each database server
                      in the distributed system
                    ? Client caches and accumulates responses

    ?             Collaborating Server
                    ? Client sends query to “nearest” Server
                    ? Server executes query locally
                    ? Server sends query to other Servers, as
                      required
                    ? Server sends response to Client
J.J.Bunn, Distributed Databases, 2001                              40
Storing the Distributed Data
     ?             In fragments at each site
                      ? Split the data up
                      ? Each site stores one or more fragments

     ?             In complete replicas at each site
                      ?       Each site stores a replica of the complete
                              data
     ?             A mixture of fragments and replicas
                      ?       Each site stores some replicas and/or
                              fragments or the data

J.J.Bunn, Distributed Databases, 2001                                      41
Partitioned Data
Break file into disjoint groups
                                                          Orders
?            Exploit data access locality            N.A. S.A. Europe Asia
                ?       Put data near consumer
                ?       Less network traffic
                ?       Better response time
                ?       Better availability
                ?       Owner controls data
                               autonomy
?            Spread Load
                ?       data or traffic may exceed
                        single store
J.J.Bunn, Distributed Databases, 2001                                        42
How to Partition Data?
?         How to Partition
             ?       by attribute or
             ?       random or                   N.A. S.A. Europe Asia
             ?       by source or
             ?       by use
?         Problem: to find it must have
             ?       Directory (replicated) or
             ?       Algorithm
?         Encourages
          attribute -based partitioning


    J.J.Bunn, Distributed Databases, 2001                                43
Replicated Data
        Place fragment at many sites
?       Pros:
         + Improves availability
         + Disconnected (mobile) operation
                                             Catalog
         + Distributes load
         + Reads are cheaper
?       Cons:
         ? N times more updates
         ? N times more storage

?       Placement strategies:
         ? Dynamic: cache on demand
         ? Static: place specific
    J.J.Bunn, Distributed Databases, 2001              44
Fragmentation
  ?             Horizontal – “Row-wise”
                  ?        E.g. rows of the table make up one fragment
  ?             Vertical – “Column-Wise”
                  ?        E.g. columns of the table make up one fragment


                 ID #Particles           Energy   Event#    Run#     Date            Time
                 …          …                …        …       …        …                …
              10001          3            121.5      111   13120   3/1406   13:30:55.0001
              10002          3            202.2      112   13120   3/1406   13:30:55.0001
              10003          4             99.3      113   13120   3/1406   13:30:55.0001
              10004          5            231.9      120   13120   3/1406   13:30:55.0001
              10005          6            287.1      125   13120   3/1406   13:30:55.0001
              10006          6            107.7      126   13120   3/1406   13:30:55.0001
              10007          6             98.9      127   13120   3/1406   13:30:55.0001
              10008          9            100.1      128   13120   3/1406   13:30:55.0001
                 …          …                …        …       …        …                …
J.J.Bunn, Distributed Databases, 2001                                                       45
Replication
    ?            Make synchronised or unsynchronised
                 copies of data at servers
                    ? Synchronised : data are always current,
                      updates are constantly shipped between
                      replicas
                    ? Unsynchronised: good for read-only data

    ?            Increases availability of data
    ?            Makes query execution faster


J.J.Bunn, Distributed Databases, 2001                           46
Distributed Catalogue
                             Management
   ?             Need to know where data are distributed in
                 the system
   ?             At each site, need to name each replica of
                 each data fragment
                    ?       “Local name”, “Birth Place”
   ?             Site Catalogue:
                    ?       Describes all fragments and replicas at the site
                    ?       Keeps track of replicas of relations at the site
                    ?       To find a relation, look up Birth site’s catalogue:
                            “Birth Place” site never changes, even if relation
                            is moved
J.J.Bunn, Distributed Databases, 2001                                             47
Replication Catalogue
     ?             Which objects are being replicated
     ?             Where objects are being replicated to
     ?             How updates are propagated

     ?             Catalogue is a set of tables that can be
                   backed up, and recovered (as any
                   other table)
     ?             These tables are themselves replicated
                   to each replication site
                     ?        No single point of failure in the
                              Distributed Database
J.J.Bunn, Distributed Databases, 2001                             48
Configurations
      ?            Single Master with multiple read-only snapshot sites
      ?            Multiple Masters
      ?            Single Master with multiple updatable snapshot sites
      ?            Master at record-level granularity
      ?            Hybrids of the above




J.J.Bunn, Distributed Databases, 2001                                     49
Distributed Queries
                        Islamabad                                                                    Geneva
     ID #Particles      Energy     Event#    Run#     Date            Time      ID #Particles   Energy   Event#    Run#     Date            Time
     …          …           …          …       …        …                …      …          …        …        …       …        …                …
  10001          3       121.5        111   13120   3/1406   13:30:55.0001   10001          3    121.5      111   13120   3/1406   13:30:55.0001
  10002          3       202.2        112   13120   3/1406   13:30:55.0001   10002          3    202.2      112   13120   3/1406   13:30:55.0001
  10003          4        99.3        113   13120   3/1406   13:30:55.0001   10003          4     99.3      113   13120   3/1406   13:30:55.0001
  10004          5       231.9        120   13120   3/1406   13:30:55.0001   10004          5    231.9      120   13120   3/1406   13:30:55.0001
  10005          6       287.1        125   13120   3/1406   13:30:55.0001   10005          6    287.1      125   13120   3/1406   13:30:55.0001
  10006          6       107.7        126   13120   3/1406   13:30:55.0001   10006          6    107.7      126   13120   3/1406   13:30:55.0001
  10007          6        98.9        127   13120   3/1406   13:30:55.0001   10007          6     98.9      127   13120   3/1406   13:30:55.0001
  10008          9       100.1        128   13120   3/1406   13:30:55.0001   10008          9    100.1      128   13120   3/1406   13:30:55.0001
     …          …           …          …       …        …                …      …          …        …        …       …        …                …




     ?             SELECT AVG(E.Energy) FROM Events E
                   WHERE E.particles > 3 AND E.particles < 7
     ?             Replicated: Copies of the complete Event
                   table at Geneva and at Islamabad
     ?             Choice of where to execute query
                     ?        Based on local costs, network costs, remote
                              capacity, etc.
J.J.Bunn, Distributed Databases, 2001                                                                                                              50
Distributed Queries (contd.)
?         SELECT AVG(E.Energy) FROM Events E
          WHERE E.particles > 3 AND
            E.particles < 7                          ID #Particles
                                                     …          …
                                                                     Energy
                                                                         …
                                                                              Event#
                                                                                  …
                                                                                        Run#
                                                                                          …
                                                                                                 Date
                                                                                                   …
                                                                                                                 Time
                                                                                                                    …
                                                  10001          3    121.5      111   13120   3/1406   13:30:55.0001
                                                  10002          3    202.2      112   13120   3/1406   13:30:55.0001
                                                  10003          4     99.3      113   13120   3/1406   13:30:55.0001
                                                  10004          5    231.9      120   13120   3/1406   13:30:55.0001
                                                  10005          6    287.1      125   13120   3/1406   13:30:55.0001


          Row-wise fragmented:
                                                  10006          6    107.7      126   13120   3/1406   13:30:55.0001

?                                                 10007
                                                  10008
                                                     …          …
                                                                 6
                                                                 9
                                                                       98.9
                                                                      100.1
                                                                         …
                                                                                 127
                                                                                 128
                                                                                  …
                                                                                       13120
                                                                                       13120
                                                                                          …
                                                                                               3/1406
                                                                                               3/1406
                                                                                                   …
                                                                                                        13:30:55.0001
                                                                                                        13:30:55.0001
                                                                                                                    …




          Particles < 5 at Geneva, Particles > 4 at
          Islamabad
             ?       Need to compute SUM(E.Energy) and
                     COUNT(E.Energy) at both sites
             ?       If WHERE clause had E.particles > 4 then only
                     need to compute at Islamabad
    J.J.Bunn, Distributed Databases, 2001                                                                               51
Distributed Queries (contd.)
  ?             SELECT AVG(E.Energy) FROM Events E WHERE
                E.particles > 3 AND E.particles < 7
                                                          ID #Particles   Energy   Event#    Run#     Date           Time
                                                          …          …        …        …       …        …               …
                                                       10001          3    121.5      111   13120   3/1406   13:30:55.0001
                                                       10002          3    202.2      112   13120   3/1406   13:30:55.0001
                                                       10003          4     99.3      113   13120   3/1406   13:30:55.0001
                                                       10004          5    231.9      120   13120   3/1406   13:30:55.0001

  ?             Column-wise Fragmented:                10005
                                                       10006
                                                                      6
                                                                      6
                                                                           287.1
                                                                           107.7
                                                                                      125
                                                                                      126
                                                                                            13120
                                                                                            13120
                                                                                                    3/1406
                                                                                                    3/1406
                                                                                                             13:30:55.0001
                                                                                                             13:30:55.0001
                                                       10007          6     98.9      127   13120   3/1406   13:30:55.0001
                                                       10008          9    100.1      128   13120   3/1406   13:30:55.0001
                                                          …          …        …        …       …        …               …



                ID, Energy and Event# Columns at Geneva, ID and
                remaining Columns at Islamabad:
                   ?       Need to join on ID
                   ?       Select IDs satisfying Particles constraint at Islamabad
                   ?       SUM(Energy) and Count(Energy) for those IDs at Geneva



J.J.Bunn, Distributed Databases, 2001                                                                                 52
Joins
  ?             Joins are used to compare or combine
                relations (rows) from two or more
                tables, when the relations share a
                common attribute value
  ?             Simple approach: for every relation in
                the first table “S”, loop over all
                relations in the other table “R”, and
                see if the attributes match
  ?             N-way joins are evaluated as a series of
                2-way joins
  ?             Join Algorithms are a continuing topic
                of intense research in Computer
                Science
J.J.Bunn, Distributed Databases, 2001                      53
Join Algorithms
  ?             Need to run in memory for best
                performance
  ?             Nested-Loops: efficient only if “R” very small
                (can be stored in memory)
  ?             Hash-Join: Build an in-memory hash table of
                “R”, then loop over “S” hashing to check for
                match
  ?             Hybrid Hash-Join: When “R” hash is too big
                to fit in memory, split join into partitions
  ?             Merge-Join: Used when “R” and “S” are
                already sorted on the join attribute, simply
                merging them in parallel
  ?             Special versions of Join Algorithms needed
                for Distributed Database query execution!
J.J.Bunn, Distributed Databases, 2001                            54
Distributed Query
                                      Optimisation
        ?             Cost-based:
                         ? Consider all “plans”
                         ? Pick cheapest: include communication
                           costs
        ?             Need to use distributed join methods
        ?             Site that receives query constructs
                      Global Plan, hints for local plans
                         ?       Local plans may be changed at each site
J.J.Bunn, Distributed Databases, 2001                                      55
Replication
       ?            Synchronous: All data that have been
                    changed must be propagated before
                    the Transaction commits
       ?            Asynchronous: Changed data are
                    periodically sent
                       ? Replicas may go out of sync.
                       ? Clients must be aware of this




J.J.Bunn, Distributed Databases, 2001                      56
Synchronous Replication
                       Costs
    ?             Before an update Transaction can
                  commit, it obtains locks on all
                  modified copies
                     ? Sends lock requests to remote sites, holds
                       locks
                     ? If links or remote sites fail, Transaction
                       cannot commit until links/sites restored
                     ? Even without failure, commit protocol is
                       complex, and involves many messages

J.J.Bunn, Distributed Databases, 2001                               57
Asynchronous Replication
       ?             Allows Transaction to commit before
                     all copies have been modified
       ?             Two methods:
                       ?        Primary Site

                       ?        Peer-to-Peer




J.J.Bunn, Distributed Databases, 2001                      58
Primary Site Replication
                     ? One copy designated as “Master”
                     ? Published to other sites who subscribe to
                       “Secondary” copies
                     ? Changes propagated to “Secondary”
                       copies
                     ? Done in two steps:
                        ? Capture changes made by committed
                          Transactions
                        ? Apply these changes



J.J.Bunn, Distributed Databases, 2001                              59
The Capture Step
  ?             Procedural: A procedure, automatically
                invoked, does the capture (takes a
                snapshot)

  ?             Log-based: the log is used to generate a
                Change Data Table
                   ?       Better (cheaper and faster) but relies on
                           proprietary log details



J.J.Bunn, Distributed Databases, 2001                                  60
The Apply Step
   ?             The Secondary site periodically obtains
                 from the Primary site a snapshot or
                 changes to the Change Data Table
                    ? Updates its copy
                    ? Period can be timer-based or defined by
                      the user/application
   ?             Log-based capture with continuous
                 Apply minimises delays in propagating
                 changes

J.J.Bunn, Distributed Databases, 2001                           61
Peer to P Replication
                   - - eer
      ?             More than one copy can be “Master”
      ?             Changes are somehow propagated to
                    other copies
      ?             Conflicting changes must be resolved
      ?             So best when conflicts do not or
                    cannot arise:
                       ? Each “Master” owns a disjoint fragment
                         or copy
                       ? Update permission only granted to one
                         “Master” at a time
J.J.Bunn, Distributed Databases, 2001                             62
Replication Examples
   ?            Master copy, many slave copies (SQL Server)
                   ?       always know the correct value (master)
                   ?       change propagation can be
                            ? transactional
                            ? as soon as possible
                            ? periodic
                            ? on demand

   ?            Symmetric, and anytime (Access)
                   ?       allows mobile (disconnected) updates
                   ?       updates propagated ASAP, periodic, on demand
                   ?       non-serializable
                   ?       colliding updates must be reconciled.
                   ?       hard to know “real” value
J.J.Bunn, Distributed Databases, 2001                                     63
Data Warehousing and
                         Replication
    ?             Build giant “warehouses” of data from many
                  sites
                     ?       Enable complex decision support queries over
                             data from across an organisation
    ?             Warehouses can be seen as an instance of
                  asynchronous replication
                     ?       Source data is typically controlled by different
                             DBMS: emphasis on “cleaning” data by
                             removing mismatches while creating replicas
    ?             Procedural Capture and application Apply
                  work best for this environment
J.J.Bunn, Distributed Databases, 2001                                           64
Distributed Locking
   ?             How to manage Locks across many
                 sites?
                   ? Centrally: one site does all locking
                      ? Vulnerable to single site failure
                   ? Primary Copy: all locking for an object
                     done at the primary copy site for the
                     object
                      ? Reading requires access to locking site
                        as well as site which stores object
                   ? Fully Distributed: locking for a copy done
                     at site where the copy is stored
                      ? Locks at all sites while writing an
                        object
J.J.Bunn, Distributed Databases, 2001                             65
Distributed Deadlock
                                Detection
      ?             Each site maintains a local “waits -for” graph
      ?             Global deadlock might occur even if local
                    graphs contain no cycles
                      ?        E.g. Site A holds lock on X, waits for lock on Y
                      ?        Site B holds lock on Y, waits for lock on X
      ?             Three solutions:
                      ?        Centralised (send all local graphs to one site)
                      ?        Hierarchical (organise sites into hierarchy and
                               send local graphs to parent)
                      ?        Timeout (abort Transaction if it waits too long)
J.J.Bunn, Distributed Databases, 2001                                             66
Distributed Recovery
   ?             Links and Remote Sites may crash/fail
   ?             If sub-transactions of a Transaction
                 execute at different sites, all or none
                 must commit
   ?             Need a commit protocol to achieve
                 this
   ?             Solution: Maintain a Log at each site of
                 commit protocol actions
                   ?        Two-Phase Commit

J.J.Bunn, Distributed Databases, 2001                       67
Two P
                                - hase Commit
  ?             Site which originates Transaction is coordinator,
                other sites involved in Transaction are subordinates
  ?             When the Transaction needs to Commit:
                  ?        Coordinator sends “prepare” message to subordinates
                  ?        Subordinates each force-writes an abort or prepare Log
                           record, and sends “yes” or “no” message to Coordinator
                  ?        If Coordinator gets unanimous “yes” messages, force-writes
                           a commit Log record, and sends “commit” message to all
                           subordinates
                  ?        Otherwise, force-writes an abort Log record, and sends
                           “abort” message to all subordinates
                  ?        Subordinates force-write abort/commit Log record
                           accordingly, then send an “ack” message to Coordinator
                  ?        Coordinator writes end Log record after receiving all acks



J.J.Bunn, Distributed Databases, 2001                                                   68
Notes on Two P
                                     - hase
                            Commit (2PC)
  ?             First: voting, Second: termination – both
                initiated by Coordinator
  ?             Any site can decide to abort the Transaction
  ?             Every message is recorded in the local Log by
                the sender to ensure it survives failures
  ?             All Commit Protocol log records for a
                Transaction contain the Transaction ID and
                Coordinator ID. The Coordinator’s
                abort/commit record also includes the Site
                IDs of all subordinates
J.J.Bunn, Distributed Databases, 2001                           69
Restart after Site Failure
?             If there is a commit or abort Log record for
              Transaction T, but no end record, then must
              undo/redo T
                 ?       If the site is Coordinator for T, then keep sending
                         commit/abort messages to Subordinates until
                         acks received
?             If there is a prepare Log record, but no
              commit or abort:
                 ?       This site is a Subordinate for T
                 ?       Contact Coordinator to find status of T, then
                          ? write commit/abort Log record
                          ? Redo/undo T
                          ? Write end Log record
J.J.Bunn, Distributed Databases, 2001                                          70
Blocking
  ?             If Coordinator for Transaction T fails,
                then Subordinates who have voted
                “yes” cannot decide whether to
                commit or abort until Coordinator
                recovers!
  ?             T is blocked
  ?             Even if all Subordinates are aware of
                one another (e.g. via extra information
                in “prepare” message) they are blocked
                   ?       Unless one of them voted “no”
J.J.Bunn, Distributed Databases, 2001                      71
Link and Remote Site
                                 Failures
     ?            If a Remote Site does not respond
                  during the Commit Protocol for T
                     ?       E.g. it crashed or the link is down
     ?            Then
                     ? If current Site is Coordinator for T: abort
                     ? If Subordinate and not yet voted “yes”:
                       abort
                     ? If Subordinate and has voted “yes”, it is
                       blocked until Coordinator back online
J.J.Bunn, Distributed Databases, 2001                                72
Observations on 2PC
   ?             Ack messages used to let Coordinator
                 know when it can “forget” a
                 Transaction
                    ?       Until it receives all acks, it must keep T in
                            the Transaction Table
   ?             If Coordinator fails after sending
                 “prepare” messages, but before writing
                 commit/abort Log record, when it
                 comes back up, it aborts T
   ?             If a subtransaction does no updates, its
                 commit or abort status is irrelevant
J.J.Bunn, Distributed Databases, 2001                                       73
2PC with Presumed Abort
  ?             When Coordinator aborts T, it undoes T and
                removes it from the Transaction Table
                immediately
                   ?       Doesn’t wait for “acks”
                   ?       “Presumes Abort” if T not in Transaction Table
                   ?       Names of Subordinates not recorded in abort
                           Log record
  ?             Subordinates do not send “ack” on abort
  ?             If subtransaction does no updates, it
                responds to “prepare” message with
                “reader” (instead of “yes”/”no”)
  ?             Coordinator subsequently ignores “ reader”s
  ?             If all Subordinates are “reader”s, then 2nd.
                Phase not required
J.J.Bunn, Distributed Databases, 2001                                       74
Replication and Partitioning
             Compared                                     Scaleup
                                                                                             Central
           Base case
        a 1 TPS system                           to a 2 TPS centralized system           ?
                                                                                             Scaleup
                                                                                             2x
                                                                                             more work
                                 1 TPS server
   100 Users                                       200 Users           2 TPS server



     Partitioning                                        Replication
                                                                                         ?   Partition
Two 1 TPS systems                                     Two 2 TPS systems                      Scaleup
                                                                                             2x
                                                                                             more work
                             1 TPS server
100 Users                                           100 Users           2 TPS server
                                                                                         ?   Replication
                              O tps




                                         O tps




                                                                                 1 tps
                                                                         1 tps



                                                                                             Scaleup
                                                                                             4x
                                                                                             more work
                             1 TPS server
100 Users                                           100 Users           2 TPS server
J.J.Bunn, Distributed Databases, 2001                                                                      75
“Porter” Agent b
                                      - ased
                         Distributed Database
                                        ?   Charles Univ, Prague
                                        ?   Based on “Aglets” SDK from IBM




J.J.Bunn, Distributed Databases, 2001                                        76
Part 3

Distributed Systems                        .




                 Julian Bunn
      California Institute of Technology
What’s a Distributed
                                 System?
?      Centralized:
          ? everything in one place
          ? stand-alone PC or Mainframe

?      Distributed:
          ?       some parts remote
                   ? distributed users
                   ? distributed execution
                   ? distributed data

J.J.Bunn, Distributed Databases, 2001              78
Why Distribute?
?            No best organization
?            Organisations constantly swing between
                ?       Centralized: focus, control, economy
                ?       Decentralized: adaptive, responsive, competitive
?            Why distribute?
                ? reflect organisation or application structure
                ? empower users / producers
                ? improve service (response / availability)
                ? distribute load
                ? use PC technology (economics)
J.J.Bunn, Distributed Databases, 2001                                      79
What
                     Should Be Distributed?
?            Users and User Interface
                ?       Thin client         Presentation

?            Processing                       workflow
                ?       Trim client
                                              Business
?            Data                              Objects
                ?       Fat client
                                               Database

?            Will discuss tradeoffs later
J.J.Bunn, Distributed Databases, 2001                      80
Transparency
                           in Distributed Systems
?        Make distributed system as easy to use and
         manage as a centralized system
?        Give a Single-System Image

?        Location transparency:
           ? hide fact that object is remote
           ? hide fact that object has moved
           ? hide fact that object is partitioned or replicated

?        Name doesn’t change if object is replicated,
         partitioned or moved.
    J.J.Bunn, Distributed Databases, 2001                         81
Naming The basics
                                    -
?          Objects have
           ? Globally Unique Identifier (GUIDs)
                                                                    Address
           ? location(s) = address(es)
           ? name(s)                                                 guid
           ? addresses can change
           ? objects can have many names
                                                                     Jim
?          Names are context dependent:
             ?        (Jim @ KGB ?????? ??? ? ? ??? ? ?Jim @ CIA)
                                                                    James
?          Many naming systems
             ?        UNC: nodedevicedir dir dirobject
             ?        Internet: http://node.domain.root/dir/dir/dir/object
             ?        LDAP: ldap://ldap.domain.root/o=org,c=US,cn=dir
J.J.Bunn, Distributed Databases, 2001                                         82
Name Servers
                       in Distributed Systems
                                 ?      Name servers translate
                                           names + context
                                                  to address (+ GUID)
                                 ?      Name servers are partitioned
                                           (subtrees of name space)
                                 ?      Name servers replicate root
                                        of name tree
                                 ?      Name servers form a hierarchy
                                 ?      Distributed data from hell:
                                        ?   high read traffic
                                        ?   high reliability & availability
                                        ?   autonomy
J.J.Bunn, Distributed Databases, 2001                                         83
Autonomy
                       in Distributed Systems
?          Owner of site (or node, or application, or database)
           Wants to control it
?          If my part is working,
           must be able to access & manage it
                                        (reorganize, upgrade, add user,…)
?          Autonomy is
            ? Essential
            ? Difficult to implement.
            ? Conflicts with global consistency
?          examples: naming, authentication, admin…
J.J.Bunn, Distributed Databases, 2001                                       84
Security
                                            The Basics
?      Authentication server
       subject + Authenticator =>                         Object
                       (Yes + token) | No
?      Security matrix:             subject
        ? who can do what to whom
        ? Access control list is
          column of matrix
        ? “who” is authenticated ID                      Permissions
?      In a distributed system,
       “who” and “what” and “whom” are
       distributed objects
    J.J.Bunn, Distributed Databases, 2001                          85
Security
                 in Distributed Systems
?            Security domain:
             Security domain:
             nodes with a shared security server.
?            Security domains can have trust relationships:
                ?       A trusts B: A “believes” B when it says this is Jim@B
?            Security domains form a hierarchy.
?            Delegation: passing authority to a server
             when A asks B to do something (e.g. print a file, read a database)
             B may need A’s authority
?            Autonomy requires:
              ? each node is an authenticator
              ? each node does own security checks
?            Internet Today:
              ? no trust among domains (fire walls, many passwords)
              ? trust based on digital signatures
J.J.Bunn, Distributed Databases, 2001                                             86
Clusters
                                   The Ideal Distributed System.
?         Cluster is distributed                    ?   Clusters use
          system BUT single                               distributed system
             ?location                                    techniques for
           ? manager                                    ?   load distribution
           ? security policy                                 ? storage

?         relatively homogeneous                             ? execution
                                                        ?   growth
?         communications is
                                                        ?   fault tolerance
           ? high bandwidth
           ? low latency
           ? low error rate



    J.J.Bunn, Distributed Databases, 2001                                       87
Cluster: Shared What?
?         Shared Memory Multiprocessor
             ?       Multiple processors, one memory
             ?       all devices are local
             ?       HP V-class
?         Shared Disk Cluster
             ?       an array of nodes
             ?       all shared common disks
             ?       VAXcluster + Oracle
?         Shared Nothing Cluster
             ?       each device local to a node
             ?       ownership may change
             ?       Beowulf,Tandem, SP2, Wolfpack
J.J.Bunn, Distributed Databases, 2001                  88
Distributed Execution
                                        Threads and Messages
                                                                  threads
?         Thread is Execution unit
          (software analog of cpu+memory)

?         Threads execute at a node
?         Threads communicate via                              shared memory

             ? Shared memory (local)
             ? Messages (local and remote)
                                                           messages



J.J.Bunn, Distributed Databases, 2001                                          89
Peer to P or Client Server
     - - eer        -
?            Peer-to-Peer is symmetric:
               ?        Either side can send




?            Client-server
               ? client sends requests            requ
                                                       est
               ? server sends responses
                                                  resp
                                                      onse

               ? simple subset of peer -to-peer


J.J.Bunn, Distributed Databases, 2001                        90
Connection less or Connected
             -
?    Connection-less                             ? Connected (sessions)
        ?       request contains                    ?open  - request/reply - close
                   ?       client id                ?client authenticated once

                   ?       client context           ?Messages arrive in order
                                                    ?Can send many replies (e.g. FTP)
                   ?       work request
                                                    ? Server has client context
        ?       client authenticated on each
                message                               (context sensitive)
                                                    ? e.g. Winsock and ODBC
        ?       only a single response message
                                                    ? HTTP adding connections
        ?       e.g. HTTP, NFS v1




    J.J.Bunn, Distributed Databases, 2001                                           91
Remote Procedure Call: The
               key to transparency
y = pObj->f(x);
                                                            ?   Object may be
                                     x                          local or remote
                                                            ?   Methods on
                                                                object work
                                                                wherever it is.
                                         f()
                                                            ?   Local invocation
                                              return val;


  y = J.J.Bunn, Distributed Databases, 2001
       val;                      val                                               92
Remote Procedure Call: The
                                key to transparency
             ?             Remote invocation
y = pObj->f(x);                                            proxy
              x                         x    Obj Local?
                                        ?
                   Gee!! Nice pictures! marshal
                                        marshal                                 stub
                                                                           x
                                                                                  un
                                                                                marshal
                                                                                            x Obj Local?
                                                                                              Obj Local?
                                                                               pObj->f(x)
                                        f()                                                  f()


                                             return val;                                          return val;
                                                                     val        marshal     val
                                                             un
                                                              un
 y = val;                                                  marshal
                                                           marshal
                                      val val
     J.J.Bunn, Distributed Databases, 2001                                                                  93
Object Request Broker (ORB)
                                        Orchestrates RPC
?        Registers Servers
?        Manages pools of servers
?        Connects clients to servers
?        Does Naming, request-level authorization,
?        Provides transaction coordination (new feature)
?        Old names:
            ?       Transaction Processing Monitor,
            ?       Web server,              Transaction
            ?       NetWare


J.J.Bunn, Distributed Databases, 2001      Object-Request Broker   94
Using RPC for Transparency
                               Partition Transparency
             ?            Send updates to correct partition
y = pfile->write(x);
               x     part Local?                   x

                                                                      x
                                                                               un
                                                                            marshal
                                                                                         x
                                                     send                 pObj->write(x)
                                                      send
                                                       to
                                                        to                                   write()
                                                   correct
                                                    correct
                                                   partition
                                                    partition
                                                                                               return val;
                                                                val          marshal     val

     J.J.Bunn, Distributed Databases, 2001
                                             val                                                         95
Using RPC for Transparency
                    Replication Transparency
   ?         Send updates to EACH node
y = pfile->write(x);
               x                                   x
                                                    Send
                                                     Send
                                                      to
                                                       to
                                                     each
                                                      each
                                                   replica
                                                    replica




     J.J.Bunn, Distributed Databases, 2001
                                             val              96
Client/Server Interactions
                                            All can be done with RPC
?      Request-Response                                          C         S
                response may be many messages

?      Conversational                                            C         S
                server keeps client context

       Dispatcher
                                                                           S
?                                                            C         S
       three-tier: complex operation at server                              S
?      Queued
       de-couples client from server
       allows disconnected operation                   C         S         S
    J.J.Bunn, Distributed Databases, 2001                                      97
Queued Request/Response
       ?             Time-decouples client and server
                      ? Three Transactions


       ?             Almost real time, ASAP processing
       ?             Communicate at each other’s convenience
                     Allows mobile (disconnected) operation

       ?             Disk queues survive client & server failures

                                        Submit
                                                         Perform
                                        Response

            Client                                                 Server
J.J.Bunn, Distributed Databases, 2001                                       98
Why Queued Processing?
            ?             Prioritize requests
                          ambulance dispatcher favors high-priority calls
            ?             Manage Workflows
Order                                    Build   Ship   Invoice         Pay


            ?             Deferred processing in mobile apps


            ?             Interface heterogeneous systems
                          EDI,
                          MOM: Message-Oriented-Middleware
                          DAD: Direct Access to Data
 J.J.Bunn, Distributed Databases, 2001                                        99
Work Distribution Spectrum
                                            Thin                      Fat
?      Presentation                                Presentation
       and plug-ins
?      Workflow                                    workflow
       manages session
       & invokes objects
?      Business objects
                                                   Business Objects
?      Database
                                                   Database


    J.J.Bunn, Distributed Databases, 2001
                                            Fat                       Thin   100
Transaction Processing Evolution
          to Three Tier
                             Intelligence migrated to clients      Mainframe
 ?        Mainframe Batch processing                       cards

          (centralized)
 ?        Dumb terminals &                               green
                                                         screen
                                                                      Server
          Remote Job Entry                               3270




                                                                   TP Monitor
 ?        Intelligent terminals
          database backends
                                                                    ORB
 ?        Workflow Systems                             Active
          Object Request Brokers
          Application Generators
 J.J.Bunn, Distributed Databases, 2001                                      101
Web Evolution to Three Tier
             Intelligence migrated to clients (like TP)
                                                      Web
                                                  WAIS           Server
?        Character-mode clients,                  archie
                                                  ghopher

                       smart servers
                                                  green screen




                                                Mosaic
?        GUI Browsers - Web file servers

                                                NS & IE
?        GUI Plugins - Web dispatchers - CGI



?        Smart clients - Web dispatcher (ORB)   Active
         pools of app servers (ISAPI, Viper)
         workflow scripts at client & server
    J.J.Bunn, Distributed Databases, 2001                                 102
PC Evolution to Three Tier
           Intelligence migrated to server
?     Stand-alone PC
                                    (centralized)

?     PC + File & print server                      IO request
                                                                  disk I/O
                                                       reply
               message per I/O

?     PC + Database server                             SQL
                                                     Statement
               message per SQL statement

?     PC + App server                               Transaction
               message per transaction

?     ActiveX Client, ORB
      ActiveX server, Xscript
J.J.Bunn, Distributed Databases, 2001                                        103
The Pattern:
                          Three Tier Computing
?      Clients do presentation, gather input        Presentation

?      Clients do some workflow (Xscript)
?      Clients send high-level requests to ORB      workflow
       (Object Request Broker)
?      ORB dispatches workflows and business
       objects -- proxies for client, orchestrate    Business
       flows & queues                                Objects

?      Server-side workflow scripts call on
       distributed business objects to execute       Database
       task
    J.J.Bunn, Distributed Databases, 2001                          104
Web Client
                                                                       The Three
                                                                         Tiers
                         HTML

                                             VB Java
VBscritpt
                                             plug-ins
JavaScrpt
                                                                                  Middleware
                                                                       Object        ORB
  VB or Java                        VB or Java                                     TP Monitor
 Script Engine                      Virt Machine                       server     Web Server...
                                                                       Pool
                           HTTP+
                                                                ORB
                  Internet DCOM                                                         Object & Data
                                                                                           server .
                                                                           DCOM (oleDB, ODBC,...)
                                                          6.2
                                                        LU
                                                                Legacy
                                      IBM                       Gateways
     J.J.Bunn, Distributed Databases, 2001                                                        105
Why Did Everyone Go To
                   Three T
                       - ier?
?     Manageability                                            Presentation
         ?       Business rules must be with data
         ?       Middleware operations tools
?     Performance (scaleability)                               workflow
         ?       Server resources are precious
         ?       ORB dispatches requests to server pools
?     Technology & Physics                                      Business
         ?       Put UI processing near user                    Objects
         ?       Put shared data processing near shared data

                                                                Database
    J.J.Bunn, Distributed Databases, 2001                                     106
Why Put Business Objects
                    at Server?
                                          MOM’s Business Objects

DAD’sRaw Data



Customer comes to store                   Customer comes to store with list
Takes what he wants                        Gives list to clerk
Fills out invoice                          Clerk gets goods, makes invoice
Leaves money for goods                    Customer pays clerk, gets goods

Easy to build                             Easy to manage
No clerks                                 Clerks controls access
  J.J.Bunn, Distributed Databases, 2001
                                          Encapsulation                  107
Why Server Pools?
?            Server resources are precious.
             Clients have 100x more power than server.
?            Pre-allocate everything on server
                ?       preallocate memory
                ?       pre-open files
                ?       pre-allocate threads                N clients x N Servers x F files =
                ?       pre-open and authenticate clients            N x N x F file opens!!!

?            Keep high duty-cycle on objects
             (re-use them)
                ?       Pool threads, not one per client
?            Classic example:                                                    Pool of
             TPC-C benchmark                                    HTTP            DBC links
                                               IE
                ?       2 processes                   7,000              IIS          SQL
                                                      clients
                ?       everything pre-allocated
J.J.Bunn, Distributed Databases, 2001                                                       108
Classic Mistakes
       ?             Thread per terminal
                     fix: DB server thread pools
                     fix: server pools
       ?             Process per request (CGI)
                     fix: ISAPI & NSAPI DLLs
                     fix: connection pools
       ?             Many messages per operation
                     fix: stored procedures
                     fix: server -side objects
       ?             File open per request
                     fix: cache hot files
J.J.Bunn, Distributed Databases, 2001                      109
Distributed Applications
                    need Transactions!
?          Transactions are key to
           structuring distributed applications
?          ACID properties ease
           exception handling
              ? Atomic: all or nothing
              ? Consistent : state transformation
              ? Isolated: no concurrency anomalies

              ? Durable : committed transaction effects persist



J.J.Bunn, Distributed Databases, 2001                             110
Programming & Transactions
                                            The Application View

?         You Start             (e.g. in TransactSQL):
                                                                   Begin    Begin
             ?       Begin [Distributed] Transaction <name>
             ?       Perform actions
             ?       Optional Save Transaction <name>                       RollBack
             ?       Commit or Rollback                            Commit

?         You Inherit a XID
             ?       Caller passes you a transaction    XID
             ?       You return or Rollback.
             ?       You can Begin / Commit sub -trans.
                                                                            RollBack
             ?       You can use save points                       Return   Return

    J.J.Bunn, Distributed Databases, 2001                                           111
Transaction Save Points
                       Backtracking within a transaction
               BEGIN WORK:1
                      action                        ?   Allows app to
                      action
                 SAVE WORK:2
                                                        cancel parts of a
                         action            action       transaction prior
               SAVE WORK:3                 action       to commit
                         action         SAVE WORK:5
                         action            action   ?   This is in most
                         action         SAVE WORK:6
                                           action
                                                        SQL products
               SAVE WORK:4
                         action            action
                   ROLLBACK             SAVE WORK:7
                   WORK(2)
                                           action              action
                                           action              action
                                        ROLLBACK            SAVE WORK:8
                                         WORK(7)               action
J.J.Bunn, Distributed Databases, 2001                      COMMIT WORK    112
Chained Transactions
?       Commit of T1 implicitly begins T2.
?       Carries context forward to next transaction
           ?       cursors
           ?       locks
           ?       other state

               Transaction #1               Transaction #2
                                        C
                                            B
             Processing                 o
                                            e    Processing
                                        m
               context                  m
                                            g    context
            established                     i
                                        i
                                            n
                                                 used
                                        t
J.J.Bunn, Distributed Databases, 2001                         113
Nested Transactions
                    Going Beyond Flat Transactions
     ?          Need transactions within transactions
     ?          Sub-transactions commit only if root does
     ?          Only root commit is durable.
     ?          Subtransactions may rollback
                if so, all its subtransactions rollback
     ?          Parallel version of nested transactions
                                                      T12
                                                            T121   T122 T123
T1

                     T11                     T112            T13
                                                    T114           T131   T132   T133
                               T111
                                             T113
     J.J.Bunn, Distributed Databases, 2001                                              114
Workflow:
                                       A Sequence of Transactions
?          Application transactions are multi -step                 Presentation
              ?       order, build, ship & invoice, reconcile
?          Each step is an ACID unit
?          Workflow is a script describing steps
                                                                    workflow
?          Workflow systems
              Instantiate the scripts
              ?
            ? Drive the scripts
                                                                     Business
            ? Allow query against scripts
                                                                     Objects
?          Examples
             Manufacturing Work In Process (WIP)
             Queued processing
             Loan application & approval,                            Database
             Hospital admissions…
    J.J.Bunn, Distributed Databases, 2001                                          115
Workflow Scripts
  ?            Workflow scripts are programs
                       (could use VBScript or JavaScript)

  ?            If step fails, compensation action handles error
  ?            Events, messages, time, other steps cause step.
  ?            Workflow controller drives flows


                                                                     fork
 Source                                                    join
                                           branch
                                                            case
                                                              loop
Compensation
Action
  J.J.Bunn, Distributed Databases, 2001             Step                    116
Workflow and ACID
?          Workflow is not Atomic or Isolated
?          Results of a step visible to all
?          Workflow is Consistent and Durable
?          Each flow may take hours, weeks, months
?          Workflow controller
              ? keeps flows moving
              ? maintains context (state) for each flow
              ? provides a query and operator interface
                e.g.: “what is the status of Job # 72149?”

J.J.Bunn, Distributed Databases, 2001                        117
ACID Objects Using ACID DBs
     The easy way to build transactional objects
   ?    Application uses transactional objects
        (objects have ACID properties)
                                                                SQL
   ?    If object built on top of ACID objects,
        then object is ACID.
         ?   Example: New, EnQueue, DeQueue
             on top of SQL
   ?    SQL provides ACID                         dim c as Customer
                                                  dim CM as CustomerMgr
Business Object: Customer                         ...
                                                  set C = CM.get(CustID)
                                                  ...
Business Object Mgr: CustomerMgr                  C.credit_limit = 1000
                                                  ...
                           SQL                    CM.update(C, CustID)
Persistent Programming languages automate this.
    J.J.Bunn, Distributed Databases, 2001
                                                  ..                     118
ACID Objects From Bare Metal
The Hard Way to Build Transactional Objects
?        Object Class is a Resource Manager (RM)
         ? Provides ACID objects from persistent storage
         ? Provides Undo (on rollback)
         ? Provides Redo (on restart or media failure)
         ? Provides Isolation for concurrent ops

?        Microsoft SQL Server, IBM DB2, Oracle,…
         are Resource managers.
?        Many more coming.
?        RM implementation techniques described later
J.J.Bunn, Distributed Databases, 2001                      119
Transaction Manager
?    Transaction Manager (TM): manages
     transaction objects.                                       TM
       ? XID factory                               egi
                                                      n
                                                  b
       ? tracks them                                   XID       enlist
                                            App
       ? coordinates them                         call(..XID)
                                                                RM
?    App gets XID from TM
?    Transactional RPC
       ? passes XID on all calls
       ? manages XID inheritance

?    TM manages commit & rollback
    J.J.Bunn, Distributed Databases, 2001                             120
TM Two P
                          - hase Commit
                                 Dealing with multiple RMs
?     If all use one RM, then all or none commit
?     If multiple RMs, then need coordination
?     Standard technique:
         ? Marriage:  Do you? I do. I pronounce…Kiss
         ? Theater: Ready on the set? Ready! Action! Act
         ? Sailing: Ready about? Ready! Helm’s a-lee!
           Tack
         ? Contract law: Escrow agent
?     Two-phase commit:
         ? 1. Voting phase: can you do it?
         ? 2. If all vote yes, then commit phase: do it!
    J.J.Bunn, Distributed Databases, 2001                    121
Two P
  - hase Commit In Pictures
?          Transactions managed by TM
?          App gets unique ID (XID) from TM at
           Begin()
?          XID passed on Transactional RPC
?          RMs Enlist when first do work on XID


                                          gin       TM    En
                                        Be                  lis
                                                               t
                                           XID
                               App                       RM1   En
                                        Call(..XID..)            lis
                                        Call(..XI                   t
                                                 D..)        RM2
J.J.Bunn, Distributed Databases, 2001                                   122
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán
Cơ Sở Dữ Liệu Phân Tán

More Related Content

Similar to Cơ Sở Dữ Liệu Phân Tán

Rdbms session 01_a
Rdbms session 01_aRdbms session 01_a
Rdbms session 01_aShivam Rai
 
Recommendation systems in the scope of opinion formation: a model
Recommendation systems in the scope of opinion formation: a modelRecommendation systems in the scope of opinion formation: a model
Recommendation systems in the scope of opinion formation: a modelMarcel Blattner, PhD
 
EPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkEPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkAdina Chuang Howe
 
Soeren okfn greece meetup
Soeren okfn greece meetupSoeren okfn greece meetup
Soeren okfn greece meetupOKFN-GR
 
Consistent query answering in inconsistent databases
Consistent query answering in inconsistent databasesConsistent query answering in inconsistent databases
Consistent query answering in inconsistent databasesNgonidzashe Zanamwe
 

Similar to Cơ Sở Dữ Liệu Phân Tán (11)

Rdbms session 01_a
Rdbms session 01_aRdbms session 01_a
Rdbms session 01_a
 
Recommendation systems in the scope of opinion formation: a model
Recommendation systems in the scope of opinion formation: a modelRecommendation systems in the scope of opinion formation: a model
Recommendation systems in the scope of opinion formation: a model
 
27 fcs157al1 (1)
27 fcs157al1 (1)27 fcs157al1 (1)
27 fcs157al1 (1)
 
Database Management System 1
Database Management System 1Database Management System 1
Database Management System 1
 
EPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkEPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data Talk
 
Soeren okfn greece meetup
Soeren okfn greece meetupSoeren okfn greece meetup
Soeren okfn greece meetup
 
Consistent query answering in inconsistent databases
Consistent query answering in inconsistent databasesConsistent query answering in inconsistent databases
Consistent query answering in inconsistent databases
 
Unit 1 basic concepts of DBMS
Unit 1 basic concepts of DBMSUnit 1 basic concepts of DBMS
Unit 1 basic concepts of DBMS
 
080827 abramson inmon vs kimball
080827 abramson   inmon vs kimball080827 abramson   inmon vs kimball
080827 abramson inmon vs kimball
 
Dbms viva questions
Dbms viva questionsDbms viva questions
Dbms viva questions
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 

Recently uploaded (20)

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

Cơ Sở Dữ Liệu Phân Tán

  • 1. Distributed Databases Dr. Julian Bunn Center for Advanced Computing Research Caltech Based on material provided by: Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu Ramakrishnan (Wisconsin)
  • 2. Outline ? Introduction to Database Systems ? Distributed Databases ? Distributed Systems ? Distributed Databases for Physics J.J.Bunn, Distributed Databases, 2001 2
  • 3. Part I Introduction to Database Systems . Julian Bunn California Institute of Technology
  • 4. What is a Database? ? A large, integrated collection of data ? Entities (things) and Relationships (connections) ? Objects and Associations/References ? A Database Management System (DBMS) is a software package designed to store and manage Databases ? “Traditional” (ER) Databases and “Object” Databases J.J.Bunn, Distributed Databases, 2001 4
  • 5. Why Use a DBMS? ? Data Independence ? Efficient Access ? Reduced Application Development Time ? Data Integrity ? Data Security ? Data Analysis Tools ? Uniform Data Administration ? Concurrent Access ? Automatic Parallelism ? Recovery from crashes J.J.Bunn, Distributed Databases, 2001 5
  • 6. Cutting Edge Databases ? Scientific Applications ? Digital Libraries, Interactive Video, Human Genome project, Particle Physics Experiments, National Digital Observatories, Earth Images ? Commercial Web Systems ? Data Mining / Data Warehouse ? Simple data but very high transaction rate and enormous volume (e.g. click through) J.J.Bunn, Distributed Databases, 2001 6
  • 7. Data Models ? Data Model: A Collection of Concepts for Describing Data ? Schema: A Set of Descriptions of a Particular Collection of Data, in the context of the Data Model ? Relational Model: ? E.g. A Lecture is attended by zero or more Students ? Object Model: ? E.g. A Database Lecture inherits attributes from a general Lecture J.J.Bunn, Distributed Databases, 2001 7
  • 8. Data Independence ? Applications insulated from how data in the Database is structured and stored ? Logical Data Independence: Protection from changes in the logical structure of the data ? Physical Data Independence: Protection from changes in the physical structure of the data J.J.Bunn, Distributed Databases, 2001 8
  • 9. Concurrency Control ? Good DBMS performance relies on allowing concurrent access to the data by more than one client ? DBMS ensures that interleaved actions coming from different clients do not cause inconsistency in the data ? E.g. two simultaneous bookings for the same airplane seat ? Each client is unaware of how many other clients are using the DBMS J.J.Bunn, Distributed Databases, 2001 9
  • 10. Transactions ? A Transaction is an atomic sequence of actions in the Database (reads and writes) ? Each Transaction has to be executed completely, and must leave the Database in a consistent state ? The definition of “consistent” is ultimately the client’s responsibility! responsibility! ? If the Transaction fails or aborts midway, then the Database is “rolled back” to its initial consistent state (when the Transaction began). J.J.Bunn, Distributed Databases, 2001 10
  • 11. What Is A Transaction? ? Programmer’s view: ? Bracket a collection of actions ? A simple failure model ? Only two outcomes: Begin() Begin() Begin() action action action action action action action action action action Rollback() Fail !! Fail Commit() Rollback() Success! Failure! J.J.Bunn, Distributed Databases, 2001 11
  • 12. ACID ? Atomic: all or nothing ? Consistent: state transformation ? Isolated: no concurrency anomalies ? Durable: committed transaction effects persist J.J.Bunn, Distributed Databases, 2001 12
  • 13. Why Bother: Atomicity? ? RPC semantics: ? At most once: try one time ? ? At least once: keep trying ? ’till acknowledged ? ? Exactly once: keep trying ’till acknowledged and server discards duplicate requests J.J.Bunn, Distributed Databases, 2001 13
  • 14. Why Bother: Atomicity? ? Example: insert record in file ? At most once: time-out means “maybe” ? At least once: retry may get “duplicate” error or retry may do second insert ? Exactly once: you do not have to worry ? What if operation involves ? Insertseveral records? ? Send several messages? ? Want ALL or NOTHING for group of actions J.J.Bunn, Distributed Databases, 2001 14
  • 15. Why Bother: Consistency ? Begin-Commit brackets a set of operations ? You can violate consistency inside brackets ? Debit but not credit (destroys money) ? Delete old file before create new file in a copy ? Print document before delete from spool queue ? Begin and commit are points of consistency Commit Begin State transformations new state under construction J.J.Bunn, Distributed Databases, 2001 15
  • 16. Why Bother: Isolation ? Running programs concurrently on same data can create concurrency anomalies ? The shared checking account example Begin() read BAL Begin() add 10 Bal = 100 Bal = 100 read BAL write BAL Subtract 30 Commit() Bal = 110 write BAL Bal = 70 Commit() ? Programming is hard enough without having to worry about concurrency J.J.Bunn, Distributed Databases, 2001 16
  • 17. Isolation ? It is as though programs run one at a time ? No concurrency anomalies ? System automatically protects applications ? Locking (DB2, Informix, Microsoft® SQL Server™, Sybase…) ? Versioned databases (Oracle, Interbase…) Begin() read BAL add 10 Bal = 100 write BAL Begin() Commit() Bal = 110 Bal = 110 read BAL Subtract 30 write BAL Bal = 80 Commit() J.J.Bunn, Distributed Databases, 2001 17
  • 18. Why Bother: Durability ? Once a transaction commits, want effects to survive failures ? Fault tolerance: old master-new master won’t work: ? Can’t do daily dumps: would lose recent work ? Want “continuous” dumps ? Redo “lost” transactions in case of failure ? Resend unacknowledged messages J.J.Bunn, Distributed Databases, 2001 18
  • 19. Why ACID For Client/Server And Distributed ? ACID is important for centralized systems ? Failures in centralized systems are simpler ? In distributed systems: ? More and more -independent failures ? ACID is harder to implement ? That makes it even MORE IMPORTANT ? Simple failure model ? Simple repair model J.J.Bunn, Distributed Databases, 2001 19
  • 20. ACID Generalizations ? Taxonomy of actions ? Unprotected: not undone or redone ? Temp files ? Transactional: can be undone before commit ? Database and message operations ? Real: cannot be undone ? Drill a hole in a piece of metal, print a check ? Nested transactions: subtransactions ? Work flow: long-lived transactions J.J.Bunn, Distributed Databases, 2001 20
  • 21. Scheduling Transactions ? The DBMS has to take care of a set of Transactions that arrive concurrently ? It converts the concurrent Transaction set into a new set that can be executed sequentially ? It ensures that, before reading or writing an Object, each Transaction waits for a Lock on the Object ? Each Transaction releases all its Locks when finished ? (Strict Two -Phase-Locking Protocol) Two-Phase- J.J.Bunn, Distributed Databases, 2001 21
  • 22. Concurrency Control Locking ? How to automatically prevent concurrency bugs? ? Serialization theorem: ? If you lock all you touch and hold to commit: no bugs ? If you do not follow these rules, you may see bugs ? Automatic Locking: ? Set automatically (well-formed) ? Released at commit/rollback (two-phase locking) ? Greater concurrency for locks: ? Granularity: objects or containers or server ? Mode: shared or exclusive or… J.J.Bunn, Distributed Databases, 2001 22
  • 23. Reduced Isolation Levels ? It is possible to lock less and risk fuzzy data ? Example: want statistical summary of DB ? But do not want to lock whole database ? Reduced levels: ? Repeatable Read: may see fuzzy inserts/delete ? But will serialize all updates ? Read Committed: see only committed data ? Read Uncommitted: may see uncommitted updates J.J.Bunn, Distributed Databases, 2001 23
  • 24. Ensuring Atomicity ? The DBMS ensures the atomicity of a Transaction, even if the system crashes in the middle of it ? In other words all of the Transaction is applied to the Database, or none of it is ? How? ? Keep a log/history of all actions carried out on the Database ? Before making a change, put the log for the change somewhere “safe” ? After a crash, effects of partially executed transactions are undone using the log J.J.Bunn, Distributed Databases, 2001 24
  • 25. DO/UNDO/REDO ? Each action generates a log record Old state New state DO ? Has an UNDO action Log Log New state Old state UNDO ? Has a REDO action Log Old state New state REDO J.J.Bunn, Distributed Databases, 2001 25
  • 26. What Does A Log Record Look Like? ? Log record has ? Header (transaction ID, timestamp… ) ? Item ID ? Old value ? Log ? ? New value ? For messages: just message text and sequence # ? For records: old and new value on update ? Keep records small J.J.Bunn, Distributed Databases, 2001 26
  • 27. Transaction Is A Sequence Of Actions ? Each action changes state ? Changes database ? Sends messages ? Operates a display/printer/drill press ? Leaves a log trail Old state New state Old state DO New state Old state DO New state Old state DO New state Log DO Log Log Log J.J.Bunn, Distributed Databases, 2001 27
  • 28. Transaction UNDO Is Easy ? Read log backwards ? UNDO one step at a time ? Can go half-way back to get nested transactions Old state New state Old state New state UNDO Old state New state UNDO Old state UNDO New state Log UNDO Log Log Log J.J.Bunn, Distributed Databases, 2001 28
  • 29. Durability: Protecting The Log ? When transaction commits ? Put its log in a durable place (duplexed disk) ? Need log to redo transaction in case of failure ? System failure: lost e rit in-memory updates W Log Log Log Log Log Log ? Media failure (lost disk) Log Log ? This makes transaction durable ? Log is sequential file ? Convertsrandom IO to single sequential IO ? See NTFS or newer UNIX file systems J.J.Bunn, Distributed Databases, 2001 29
  • 30. Recovery After System Failure ? During normal processing, write checkpoints on non -volatile storage ? When recovering from a system failure… ? returnto the checkpoint state ? Reapply log of all committed transactions ? Force-at-commit insures log will survive restart ? Then UNDO all uncommitted transactions Old state New state Old state New state REDO Old state REDO New state Old state New state LogREDO LogREDO Log Log J.J.Bunn, Distributed Databases, 2001 30
  • 31. Idempotence Dealing with failure ? What if fail during restart? ? REDO many times ? What if new state not around at restart? ? UNDO something not done Old state New state New state New state Old state Old state REDO REDO UNDO UNDO Log Log Log Log J.J.Bunn, Distributed Databases, 2001 31
  • 32. Idempotence Dealing with failure ? Solution: make F(F(x))=F(x) (idempotence) ? Discard duplicates ? Message sequence numbers to discard duplicates ? Use sequence numbers on pages to detect state ? (Or) make operations idempotent ? Move to position x, write value V to byte B… Old state New state New state New state Old state Old state REDO REDO UNDO UNDO Log Log Log Log J.J.Bunn, Distributed Databases, 2001 32
  • 33. The Log: More Detail ? Actions recorded in the Log ? Transaction writes an Object ? Store in the Log: Transaction Identifier, Object Identifier, new value and old value ? This must happen before actually writing the Object! ? Transaction commits or aborts ? Duplicate Log on “stable” storage ? Log records chained by Transaction Identifier: easy to undo a Transaction J.J.Bunn, Distributed Databases, 2001 33
  • 34. Structure of a Database ? Typical DBMS has a layered architecture Query Optimisation & Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management Disk J.J.Bunn, Distributed Databases, 2001 34
  • 35. Database Administration ? Design Logical/Physical Schema ? Handle Security and Authentication ? Ensure Data Availability, Crash Recovery ? Tune Database as needs and workload evolves J.J.Bunn, Distributed Databases, 2001 35
  • 36. Summary ? Databases are used to maintain and query large datasets ? DBMS benefits include recovery from crashes, concurrent access, data integrity and security, quick application development ? Abstraction ensures independence ? ACID ? Increasingly Important (and Big) in Scientific and Commercial Enterprises J.J.Bunn, Distributed Databases, 2001 36
  • 37. Part 2 Distributed Databases . Julian Bunn California Institute of Technology
  • 38. Distributed Databases ? Data are stored at several locations ? Each managed by a DBMS that can run autonomously ? Ideally, location of data is unknown to client ? Distributed Data Independence ? Distributed Transactions are supported ? Clients can write Transactions regardless of where the affected data are located ? Distributed Transaction Atomicity ? Hard, and in some cases undesirable ? E.g. need to avoid overhead of ensuring location transparency J.J.Bunn, Distributed Databases, 2001 38
  • 39. Types of Distributed Database ? Homogeneous: Every site runs the same type of DBMS ? Heterogeneous: Different sites run different DBMS (maybe even RDBMS and ODBMS) J.J.Bunn, Distributed Databases, 2001 39
  • 40. Distributed DBMS Architectures ? Client-Servers ? Client sends query to each database server in the distributed system ? Client caches and accumulates responses ? Collaborating Server ? Client sends query to “nearest” Server ? Server executes query locally ? Server sends query to other Servers, as required ? Server sends response to Client J.J.Bunn, Distributed Databases, 2001 40
  • 41. Storing the Distributed Data ? In fragments at each site ? Split the data up ? Each site stores one or more fragments ? In complete replicas at each site ? Each site stores a replica of the complete data ? A mixture of fragments and replicas ? Each site stores some replicas and/or fragments or the data J.J.Bunn, Distributed Databases, 2001 41
  • 42. Partitioned Data Break file into disjoint groups Orders ? Exploit data access locality N.A. S.A. Europe Asia ? Put data near consumer ? Less network traffic ? Better response time ? Better availability ? Owner controls data autonomy ? Spread Load ? data or traffic may exceed single store J.J.Bunn, Distributed Databases, 2001 42
  • 43. How to Partition Data? ? How to Partition ? by attribute or ? random or N.A. S.A. Europe Asia ? by source or ? by use ? Problem: to find it must have ? Directory (replicated) or ? Algorithm ? Encourages attribute -based partitioning J.J.Bunn, Distributed Databases, 2001 43
  • 44. Replicated Data Place fragment at many sites ? Pros: + Improves availability + Disconnected (mobile) operation Catalog + Distributes load + Reads are cheaper ? Cons: ? N times more updates ? N times more storage ? Placement strategies: ? Dynamic: cache on demand ? Static: place specific J.J.Bunn, Distributed Databases, 2001 44
  • 45. Fragmentation ? Horizontal – “Row-wise” ? E.g. rows of the table make up one fragment ? Vertical – “Column-Wise” ? E.g. columns of the table make up one fragment ID #Particles Energy Event# Run# Date Time … … … … … … … 10001 3 121.5 111 13120 3/1406 13:30:55.0001 10002 3 202.2 112 13120 3/1406 13:30:55.0001 10003 4 99.3 113 13120 3/1406 13:30:55.0001 10004 5 231.9 120 13120 3/1406 13:30:55.0001 10005 6 287.1 125 13120 3/1406 13:30:55.0001 10006 6 107.7 126 13120 3/1406 13:30:55.0001 10007 6 98.9 127 13120 3/1406 13:30:55.0001 10008 9 100.1 128 13120 3/1406 13:30:55.0001 … … … … … … … J.J.Bunn, Distributed Databases, 2001 45
  • 46. Replication ? Make synchronised or unsynchronised copies of data at servers ? Synchronised : data are always current, updates are constantly shipped between replicas ? Unsynchronised: good for read-only data ? Increases availability of data ? Makes query execution faster J.J.Bunn, Distributed Databases, 2001 46
  • 47. Distributed Catalogue Management ? Need to know where data are distributed in the system ? At each site, need to name each replica of each data fragment ? “Local name”, “Birth Place” ? Site Catalogue: ? Describes all fragments and replicas at the site ? Keeps track of replicas of relations at the site ? To find a relation, look up Birth site’s catalogue: “Birth Place” site never changes, even if relation is moved J.J.Bunn, Distributed Databases, 2001 47
  • 48. Replication Catalogue ? Which objects are being replicated ? Where objects are being replicated to ? How updates are propagated ? Catalogue is a set of tables that can be backed up, and recovered (as any other table) ? These tables are themselves replicated to each replication site ? No single point of failure in the Distributed Database J.J.Bunn, Distributed Databases, 2001 48
  • 49. Configurations ? Single Master with multiple read-only snapshot sites ? Multiple Masters ? Single Master with multiple updatable snapshot sites ? Master at record-level granularity ? Hybrids of the above J.J.Bunn, Distributed Databases, 2001 49
  • 50. Distributed Queries Islamabad Geneva ID #Particles Energy Event# Run# Date Time ID #Particles Energy Event# Run# Date Time … … … … … … … … … … … … … … 10001 3 121.5 111 13120 3/1406 13:30:55.0001 10001 3 121.5 111 13120 3/1406 13:30:55.0001 10002 3 202.2 112 13120 3/1406 13:30:55.0001 10002 3 202.2 112 13120 3/1406 13:30:55.0001 10003 4 99.3 113 13120 3/1406 13:30:55.0001 10003 4 99.3 113 13120 3/1406 13:30:55.0001 10004 5 231.9 120 13120 3/1406 13:30:55.0001 10004 5 231.9 120 13120 3/1406 13:30:55.0001 10005 6 287.1 125 13120 3/1406 13:30:55.0001 10005 6 287.1 125 13120 3/1406 13:30:55.0001 10006 6 107.7 126 13120 3/1406 13:30:55.0001 10006 6 107.7 126 13120 3/1406 13:30:55.0001 10007 6 98.9 127 13120 3/1406 13:30:55.0001 10007 6 98.9 127 13120 3/1406 13:30:55.0001 10008 9 100.1 128 13120 3/1406 13:30:55.0001 10008 9 100.1 128 13120 3/1406 13:30:55.0001 … … … … … … … … … … … … … … ? SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 AND E.particles < 7 ? Replicated: Copies of the complete Event table at Geneva and at Islamabad ? Choice of where to execute query ? Based on local costs, network costs, remote capacity, etc. J.J.Bunn, Distributed Databases, 2001 50
  • 51. Distributed Queries (contd.) ? SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 AND E.particles < 7 ID #Particles … … Energy … Event# … Run# … Date … Time … 10001 3 121.5 111 13120 3/1406 13:30:55.0001 10002 3 202.2 112 13120 3/1406 13:30:55.0001 10003 4 99.3 113 13120 3/1406 13:30:55.0001 10004 5 231.9 120 13120 3/1406 13:30:55.0001 10005 6 287.1 125 13120 3/1406 13:30:55.0001 Row-wise fragmented: 10006 6 107.7 126 13120 3/1406 13:30:55.0001 ? 10007 10008 … … 6 9 98.9 100.1 … 127 128 … 13120 13120 … 3/1406 3/1406 … 13:30:55.0001 13:30:55.0001 … Particles < 5 at Geneva, Particles > 4 at Islamabad ? Need to compute SUM(E.Energy) and COUNT(E.Energy) at both sites ? If WHERE clause had E.particles > 4 then only need to compute at Islamabad J.J.Bunn, Distributed Databases, 2001 51
  • 52. Distributed Queries (contd.) ? SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 AND E.particles < 7 ID #Particles Energy Event# Run# Date Time … … … … … … … 10001 3 121.5 111 13120 3/1406 13:30:55.0001 10002 3 202.2 112 13120 3/1406 13:30:55.0001 10003 4 99.3 113 13120 3/1406 13:30:55.0001 10004 5 231.9 120 13120 3/1406 13:30:55.0001 ? Column-wise Fragmented: 10005 10006 6 6 287.1 107.7 125 126 13120 13120 3/1406 3/1406 13:30:55.0001 13:30:55.0001 10007 6 98.9 127 13120 3/1406 13:30:55.0001 10008 9 100.1 128 13120 3/1406 13:30:55.0001 … … … … … … … ID, Energy and Event# Columns at Geneva, ID and remaining Columns at Islamabad: ? Need to join on ID ? Select IDs satisfying Particles constraint at Islamabad ? SUM(Energy) and Count(Energy) for those IDs at Geneva J.J.Bunn, Distributed Databases, 2001 52
  • 53. Joins ? Joins are used to compare or combine relations (rows) from two or more tables, when the relations share a common attribute value ? Simple approach: for every relation in the first table “S”, loop over all relations in the other table “R”, and see if the attributes match ? N-way joins are evaluated as a series of 2-way joins ? Join Algorithms are a continuing topic of intense research in Computer Science J.J.Bunn, Distributed Databases, 2001 53
  • 54. Join Algorithms ? Need to run in memory for best performance ? Nested-Loops: efficient only if “R” very small (can be stored in memory) ? Hash-Join: Build an in-memory hash table of “R”, then loop over “S” hashing to check for match ? Hybrid Hash-Join: When “R” hash is too big to fit in memory, split join into partitions ? Merge-Join: Used when “R” and “S” are already sorted on the join attribute, simply merging them in parallel ? Special versions of Join Algorithms needed for Distributed Database query execution! J.J.Bunn, Distributed Databases, 2001 54
  • 55. Distributed Query Optimisation ? Cost-based: ? Consider all “plans” ? Pick cheapest: include communication costs ? Need to use distributed join methods ? Site that receives query constructs Global Plan, hints for local plans ? Local plans may be changed at each site J.J.Bunn, Distributed Databases, 2001 55
  • 56. Replication ? Synchronous: All data that have been changed must be propagated before the Transaction commits ? Asynchronous: Changed data are periodically sent ? Replicas may go out of sync. ? Clients must be aware of this J.J.Bunn, Distributed Databases, 2001 56
  • 57. Synchronous Replication Costs ? Before an update Transaction can commit, it obtains locks on all modified copies ? Sends lock requests to remote sites, holds locks ? If links or remote sites fail, Transaction cannot commit until links/sites restored ? Even without failure, commit protocol is complex, and involves many messages J.J.Bunn, Distributed Databases, 2001 57
  • 58. Asynchronous Replication ? Allows Transaction to commit before all copies have been modified ? Two methods: ? Primary Site ? Peer-to-Peer J.J.Bunn, Distributed Databases, 2001 58
  • 59. Primary Site Replication ? One copy designated as “Master” ? Published to other sites who subscribe to “Secondary” copies ? Changes propagated to “Secondary” copies ? Done in two steps: ? Capture changes made by committed Transactions ? Apply these changes J.J.Bunn, Distributed Databases, 2001 59
  • 60. The Capture Step ? Procedural: A procedure, automatically invoked, does the capture (takes a snapshot) ? Log-based: the log is used to generate a Change Data Table ? Better (cheaper and faster) but relies on proprietary log details J.J.Bunn, Distributed Databases, 2001 60
  • 61. The Apply Step ? The Secondary site periodically obtains from the Primary site a snapshot or changes to the Change Data Table ? Updates its copy ? Period can be timer-based or defined by the user/application ? Log-based capture with continuous Apply minimises delays in propagating changes J.J.Bunn, Distributed Databases, 2001 61
  • 62. Peer to P Replication - - eer ? More than one copy can be “Master” ? Changes are somehow propagated to other copies ? Conflicting changes must be resolved ? So best when conflicts do not or cannot arise: ? Each “Master” owns a disjoint fragment or copy ? Update permission only granted to one “Master” at a time J.J.Bunn, Distributed Databases, 2001 62
  • 63. Replication Examples ? Master copy, many slave copies (SQL Server) ? always know the correct value (master) ? change propagation can be ? transactional ? as soon as possible ? periodic ? on demand ? Symmetric, and anytime (Access) ? allows mobile (disconnected) updates ? updates propagated ASAP, periodic, on demand ? non-serializable ? colliding updates must be reconciled. ? hard to know “real” value J.J.Bunn, Distributed Databases, 2001 63
  • 64. Data Warehousing and Replication ? Build giant “warehouses” of data from many sites ? Enable complex decision support queries over data from across an organisation ? Warehouses can be seen as an instance of asynchronous replication ? Source data is typically controlled by different DBMS: emphasis on “cleaning” data by removing mismatches while creating replicas ? Procedural Capture and application Apply work best for this environment J.J.Bunn, Distributed Databases, 2001 64
  • 65. Distributed Locking ? How to manage Locks across many sites? ? Centrally: one site does all locking ? Vulnerable to single site failure ? Primary Copy: all locking for an object done at the primary copy site for the object ? Reading requires access to locking site as well as site which stores object ? Fully Distributed: locking for a copy done at site where the copy is stored ? Locks at all sites while writing an object J.J.Bunn, Distributed Databases, 2001 65
  • 66. Distributed Deadlock Detection ? Each site maintains a local “waits -for” graph ? Global deadlock might occur even if local graphs contain no cycles ? E.g. Site A holds lock on X, waits for lock on Y ? Site B holds lock on Y, waits for lock on X ? Three solutions: ? Centralised (send all local graphs to one site) ? Hierarchical (organise sites into hierarchy and send local graphs to parent) ? Timeout (abort Transaction if it waits too long) J.J.Bunn, Distributed Databases, 2001 66
  • 67. Distributed Recovery ? Links and Remote Sites may crash/fail ? If sub-transactions of a Transaction execute at different sites, all or none must commit ? Need a commit protocol to achieve this ? Solution: Maintain a Log at each site of commit protocol actions ? Two-Phase Commit J.J.Bunn, Distributed Databases, 2001 67
  • 68. Two P - hase Commit ? Site which originates Transaction is coordinator, other sites involved in Transaction are subordinates ? When the Transaction needs to Commit: ? Coordinator sends “prepare” message to subordinates ? Subordinates each force-writes an abort or prepare Log record, and sends “yes” or “no” message to Coordinator ? If Coordinator gets unanimous “yes” messages, force-writes a commit Log record, and sends “commit” message to all subordinates ? Otherwise, force-writes an abort Log record, and sends “abort” message to all subordinates ? Subordinates force-write abort/commit Log record accordingly, then send an “ack” message to Coordinator ? Coordinator writes end Log record after receiving all acks J.J.Bunn, Distributed Databases, 2001 68
  • 69. Notes on Two P - hase Commit (2PC) ? First: voting, Second: termination – both initiated by Coordinator ? Any site can decide to abort the Transaction ? Every message is recorded in the local Log by the sender to ensure it survives failures ? All Commit Protocol log records for a Transaction contain the Transaction ID and Coordinator ID. The Coordinator’s abort/commit record also includes the Site IDs of all subordinates J.J.Bunn, Distributed Databases, 2001 69
  • 70. Restart after Site Failure ? If there is a commit or abort Log record for Transaction T, but no end record, then must undo/redo T ? If the site is Coordinator for T, then keep sending commit/abort messages to Subordinates until acks received ? If there is a prepare Log record, but no commit or abort: ? This site is a Subordinate for T ? Contact Coordinator to find status of T, then ? write commit/abort Log record ? Redo/undo T ? Write end Log record J.J.Bunn, Distributed Databases, 2001 70
  • 71. Blocking ? If Coordinator for Transaction T fails, then Subordinates who have voted “yes” cannot decide whether to commit or abort until Coordinator recovers! ? T is blocked ? Even if all Subordinates are aware of one another (e.g. via extra information in “prepare” message) they are blocked ? Unless one of them voted “no” J.J.Bunn, Distributed Databases, 2001 71
  • 72. Link and Remote Site Failures ? If a Remote Site does not respond during the Commit Protocol for T ? E.g. it crashed or the link is down ? Then ? If current Site is Coordinator for T: abort ? If Subordinate and not yet voted “yes”: abort ? If Subordinate and has voted “yes”, it is blocked until Coordinator back online J.J.Bunn, Distributed Databases, 2001 72
  • 73. Observations on 2PC ? Ack messages used to let Coordinator know when it can “forget” a Transaction ? Until it receives all acks, it must keep T in the Transaction Table ? If Coordinator fails after sending “prepare” messages, but before writing commit/abort Log record, when it comes back up, it aborts T ? If a subtransaction does no updates, its commit or abort status is irrelevant J.J.Bunn, Distributed Databases, 2001 73
  • 74. 2PC with Presumed Abort ? When Coordinator aborts T, it undoes T and removes it from the Transaction Table immediately ? Doesn’t wait for “acks” ? “Presumes Abort” if T not in Transaction Table ? Names of Subordinates not recorded in abort Log record ? Subordinates do not send “ack” on abort ? If subtransaction does no updates, it responds to “prepare” message with “reader” (instead of “yes”/”no”) ? Coordinator subsequently ignores “ reader”s ? If all Subordinates are “reader”s, then 2nd. Phase not required J.J.Bunn, Distributed Databases, 2001 74
  • 75. Replication and Partitioning Compared Scaleup Central Base case a 1 TPS system to a 2 TPS centralized system ? Scaleup 2x more work 1 TPS server 100 Users 200 Users 2 TPS server Partitioning Replication ? Partition Two 1 TPS systems Two 2 TPS systems Scaleup 2x more work 1 TPS server 100 Users 100 Users 2 TPS server ? Replication O tps O tps 1 tps 1 tps Scaleup 4x more work 1 TPS server 100 Users 100 Users 2 TPS server J.J.Bunn, Distributed Databases, 2001 75
  • 76. “Porter” Agent b - ased Distributed Database ? Charles Univ, Prague ? Based on “Aglets” SDK from IBM J.J.Bunn, Distributed Databases, 2001 76
  • 77. Part 3 Distributed Systems . Julian Bunn California Institute of Technology
  • 78. What’s a Distributed System? ? Centralized: ? everything in one place ? stand-alone PC or Mainframe ? Distributed: ? some parts remote ? distributed users ? distributed execution ? distributed data J.J.Bunn, Distributed Databases, 2001 78
  • 79. Why Distribute? ? No best organization ? Organisations constantly swing between ? Centralized: focus, control, economy ? Decentralized: adaptive, responsive, competitive ? Why distribute? ? reflect organisation or application structure ? empower users / producers ? improve service (response / availability) ? distribute load ? use PC technology (economics) J.J.Bunn, Distributed Databases, 2001 79
  • 80. What Should Be Distributed? ? Users and User Interface ? Thin client Presentation ? Processing workflow ? Trim client Business ? Data Objects ? Fat client Database ? Will discuss tradeoffs later J.J.Bunn, Distributed Databases, 2001 80
  • 81. Transparency in Distributed Systems ? Make distributed system as easy to use and manage as a centralized system ? Give a Single-System Image ? Location transparency: ? hide fact that object is remote ? hide fact that object has moved ? hide fact that object is partitioned or replicated ? Name doesn’t change if object is replicated, partitioned or moved. J.J.Bunn, Distributed Databases, 2001 81
  • 82. Naming The basics - ? Objects have ? Globally Unique Identifier (GUIDs) Address ? location(s) = address(es) ? name(s) guid ? addresses can change ? objects can have many names Jim ? Names are context dependent: ? (Jim @ KGB ?????? ??? ? ? ??? ? ?Jim @ CIA) James ? Many naming systems ? UNC: nodedevicedir dir dirobject ? Internet: http://node.domain.root/dir/dir/dir/object ? LDAP: ldap://ldap.domain.root/o=org,c=US,cn=dir J.J.Bunn, Distributed Databases, 2001 82
  • 83. Name Servers in Distributed Systems ? Name servers translate names + context to address (+ GUID) ? Name servers are partitioned (subtrees of name space) ? Name servers replicate root of name tree ? Name servers form a hierarchy ? Distributed data from hell: ? high read traffic ? high reliability & availability ? autonomy J.J.Bunn, Distributed Databases, 2001 83
  • 84. Autonomy in Distributed Systems ? Owner of site (or node, or application, or database) Wants to control it ? If my part is working, must be able to access & manage it (reorganize, upgrade, add user,…) ? Autonomy is ? Essential ? Difficult to implement. ? Conflicts with global consistency ? examples: naming, authentication, admin… J.J.Bunn, Distributed Databases, 2001 84
  • 85. Security The Basics ? Authentication server subject + Authenticator => Object (Yes + token) | No ? Security matrix: subject ? who can do what to whom ? Access control list is column of matrix ? “who” is authenticated ID Permissions ? In a distributed system, “who” and “what” and “whom” are distributed objects J.J.Bunn, Distributed Databases, 2001 85
  • 86. Security in Distributed Systems ? Security domain: Security domain: nodes with a shared security server. ? Security domains can have trust relationships: ? A trusts B: A “believes” B when it says this is Jim@B ? Security domains form a hierarchy. ? Delegation: passing authority to a server when A asks B to do something (e.g. print a file, read a database) B may need A’s authority ? Autonomy requires: ? each node is an authenticator ? each node does own security checks ? Internet Today: ? no trust among domains (fire walls, many passwords) ? trust based on digital signatures J.J.Bunn, Distributed Databases, 2001 86
  • 87. Clusters The Ideal Distributed System. ? Cluster is distributed ? Clusters use system BUT single distributed system ?location techniques for ? manager ? load distribution ? security policy ? storage ? relatively homogeneous ? execution ? growth ? communications is ? fault tolerance ? high bandwidth ? low latency ? low error rate J.J.Bunn, Distributed Databases, 2001 87
  • 88. Cluster: Shared What? ? Shared Memory Multiprocessor ? Multiple processors, one memory ? all devices are local ? HP V-class ? Shared Disk Cluster ? an array of nodes ? all shared common disks ? VAXcluster + Oracle ? Shared Nothing Cluster ? each device local to a node ? ownership may change ? Beowulf,Tandem, SP2, Wolfpack J.J.Bunn, Distributed Databases, 2001 88
  • 89. Distributed Execution Threads and Messages threads ? Thread is Execution unit (software analog of cpu+memory) ? Threads execute at a node ? Threads communicate via shared memory ? Shared memory (local) ? Messages (local and remote) messages J.J.Bunn, Distributed Databases, 2001 89
  • 90. Peer to P or Client Server - - eer - ? Peer-to-Peer is symmetric: ? Either side can send ? Client-server ? client sends requests requ est ? server sends responses resp onse ? simple subset of peer -to-peer J.J.Bunn, Distributed Databases, 2001 90
  • 91. Connection less or Connected - ? Connection-less ? Connected (sessions) ? request contains ?open - request/reply - close ? client id ?client authenticated once ? client context ?Messages arrive in order ?Can send many replies (e.g. FTP) ? work request ? Server has client context ? client authenticated on each message (context sensitive) ? e.g. Winsock and ODBC ? only a single response message ? HTTP adding connections ? e.g. HTTP, NFS v1 J.J.Bunn, Distributed Databases, 2001 91
  • 92. Remote Procedure Call: The key to transparency y = pObj->f(x); ? Object may be x local or remote ? Methods on object work wherever it is. f() ? Local invocation return val; y = J.J.Bunn, Distributed Databases, 2001 val; val 92
  • 93. Remote Procedure Call: The key to transparency ? Remote invocation y = pObj->f(x); proxy x x Obj Local? ? Gee!! Nice pictures! marshal marshal stub x un marshal x Obj Local? Obj Local? pObj->f(x) f() f() return val; return val; val marshal val un un y = val; marshal marshal val val J.J.Bunn, Distributed Databases, 2001 93
  • 94. Object Request Broker (ORB) Orchestrates RPC ? Registers Servers ? Manages pools of servers ? Connects clients to servers ? Does Naming, request-level authorization, ? Provides transaction coordination (new feature) ? Old names: ? Transaction Processing Monitor, ? Web server, Transaction ? NetWare J.J.Bunn, Distributed Databases, 2001 Object-Request Broker 94
  • 95. Using RPC for Transparency Partition Transparency ? Send updates to correct partition y = pfile->write(x); x part Local? x x un marshal x send pObj->write(x) send to to write() correct correct partition partition return val; val marshal val J.J.Bunn, Distributed Databases, 2001 val 95
  • 96. Using RPC for Transparency Replication Transparency ? Send updates to EACH node y = pfile->write(x); x x Send Send to to each each replica replica J.J.Bunn, Distributed Databases, 2001 val 96
  • 97. Client/Server Interactions All can be done with RPC ? Request-Response C S response may be many messages ? Conversational C S server keeps client context Dispatcher S ? C S three-tier: complex operation at server S ? Queued de-couples client from server allows disconnected operation C S S J.J.Bunn, Distributed Databases, 2001 97
  • 98. Queued Request/Response ? Time-decouples client and server ? Three Transactions ? Almost real time, ASAP processing ? Communicate at each other’s convenience Allows mobile (disconnected) operation ? Disk queues survive client & server failures Submit Perform Response Client Server J.J.Bunn, Distributed Databases, 2001 98
  • 99. Why Queued Processing? ? Prioritize requests ambulance dispatcher favors high-priority calls ? Manage Workflows Order Build Ship Invoice Pay ? Deferred processing in mobile apps ? Interface heterogeneous systems EDI, MOM: Message-Oriented-Middleware DAD: Direct Access to Data J.J.Bunn, Distributed Databases, 2001 99
  • 100. Work Distribution Spectrum Thin Fat ? Presentation Presentation and plug-ins ? Workflow workflow manages session & invokes objects ? Business objects Business Objects ? Database Database J.J.Bunn, Distributed Databases, 2001 Fat Thin 100
  • 101. Transaction Processing Evolution to Three Tier Intelligence migrated to clients Mainframe ? Mainframe Batch processing cards (centralized) ? Dumb terminals & green screen Server Remote Job Entry 3270 TP Monitor ? Intelligent terminals database backends ORB ? Workflow Systems Active Object Request Brokers Application Generators J.J.Bunn, Distributed Databases, 2001 101
  • 102. Web Evolution to Three Tier Intelligence migrated to clients (like TP) Web WAIS Server ? Character-mode clients, archie ghopher smart servers green screen Mosaic ? GUI Browsers - Web file servers NS & IE ? GUI Plugins - Web dispatchers - CGI ? Smart clients - Web dispatcher (ORB) Active pools of app servers (ISAPI, Viper) workflow scripts at client & server J.J.Bunn, Distributed Databases, 2001 102
  • 103. PC Evolution to Three Tier Intelligence migrated to server ? Stand-alone PC (centralized) ? PC + File & print server IO request disk I/O reply message per I/O ? PC + Database server SQL Statement message per SQL statement ? PC + App server Transaction message per transaction ? ActiveX Client, ORB ActiveX server, Xscript J.J.Bunn, Distributed Databases, 2001 103
  • 104. The Pattern: Three Tier Computing ? Clients do presentation, gather input Presentation ? Clients do some workflow (Xscript) ? Clients send high-level requests to ORB workflow (Object Request Broker) ? ORB dispatches workflows and business objects -- proxies for client, orchestrate Business flows & queues Objects ? Server-side workflow scripts call on distributed business objects to execute Database task J.J.Bunn, Distributed Databases, 2001 104
  • 105. Web Client The Three Tiers HTML VB Java VBscritpt plug-ins JavaScrpt Middleware Object ORB VB or Java VB or Java TP Monitor Script Engine Virt Machine server Web Server... Pool HTTP+ ORB Internet DCOM Object & Data server . DCOM (oleDB, ODBC,...) 6.2 LU Legacy IBM Gateways J.J.Bunn, Distributed Databases, 2001 105
  • 106. Why Did Everyone Go To Three T - ier? ? Manageability Presentation ? Business rules must be with data ? Middleware operations tools ? Performance (scaleability) workflow ? Server resources are precious ? ORB dispatches requests to server pools ? Technology & Physics Business ? Put UI processing near user Objects ? Put shared data processing near shared data Database J.J.Bunn, Distributed Databases, 2001 106
  • 107. Why Put Business Objects at Server? MOM’s Business Objects DAD’sRaw Data Customer comes to store Customer comes to store with list Takes what he wants Gives list to clerk Fills out invoice Clerk gets goods, makes invoice Leaves money for goods Customer pays clerk, gets goods Easy to build Easy to manage No clerks Clerks controls access J.J.Bunn, Distributed Databases, 2001 Encapsulation 107
  • 108. Why Server Pools? ? Server resources are precious. Clients have 100x more power than server. ? Pre-allocate everything on server ? preallocate memory ? pre-open files ? pre-allocate threads N clients x N Servers x F files = ? pre-open and authenticate clients N x N x F file opens!!! ? Keep high duty-cycle on objects (re-use them) ? Pool threads, not one per client ? Classic example: Pool of TPC-C benchmark HTTP DBC links IE ? 2 processes 7,000 IIS SQL clients ? everything pre-allocated J.J.Bunn, Distributed Databases, 2001 108
  • 109. Classic Mistakes ? Thread per terminal fix: DB server thread pools fix: server pools ? Process per request (CGI) fix: ISAPI & NSAPI DLLs fix: connection pools ? Many messages per operation fix: stored procedures fix: server -side objects ? File open per request fix: cache hot files J.J.Bunn, Distributed Databases, 2001 109
  • 110. Distributed Applications need Transactions! ? Transactions are key to structuring distributed applications ? ACID properties ease exception handling ? Atomic: all or nothing ? Consistent : state transformation ? Isolated: no concurrency anomalies ? Durable : committed transaction effects persist J.J.Bunn, Distributed Databases, 2001 110
  • 111. Programming & Transactions The Application View ? You Start (e.g. in TransactSQL): Begin Begin ? Begin [Distributed] Transaction <name> ? Perform actions ? Optional Save Transaction <name> RollBack ? Commit or Rollback Commit ? You Inherit a XID ? Caller passes you a transaction XID ? You return or Rollback. ? You can Begin / Commit sub -trans. RollBack ? You can use save points Return Return J.J.Bunn, Distributed Databases, 2001 111
  • 112. Transaction Save Points Backtracking within a transaction BEGIN WORK:1 action ? Allows app to action SAVE WORK:2 cancel parts of a action action transaction prior SAVE WORK:3 action to commit action SAVE WORK:5 action action ? This is in most action SAVE WORK:6 action SQL products SAVE WORK:4 action action ROLLBACK SAVE WORK:7 WORK(2) action action action action ROLLBACK SAVE WORK:8 WORK(7) action J.J.Bunn, Distributed Databases, 2001 COMMIT WORK 112
  • 113. Chained Transactions ? Commit of T1 implicitly begins T2. ? Carries context forward to next transaction ? cursors ? locks ? other state Transaction #1 Transaction #2 C B Processing o e Processing m context m g context established i i n used t J.J.Bunn, Distributed Databases, 2001 113
  • 114. Nested Transactions Going Beyond Flat Transactions ? Need transactions within transactions ? Sub-transactions commit only if root does ? Only root commit is durable. ? Subtransactions may rollback if so, all its subtransactions rollback ? Parallel version of nested transactions T12 T121 T122 T123 T1 T11 T112 T13 T114 T131 T132 T133 T111 T113 J.J.Bunn, Distributed Databases, 2001 114
  • 115. Workflow: A Sequence of Transactions ? Application transactions are multi -step Presentation ? order, build, ship & invoice, reconcile ? Each step is an ACID unit ? Workflow is a script describing steps workflow ? Workflow systems Instantiate the scripts ? ? Drive the scripts Business ? Allow query against scripts Objects ? Examples Manufacturing Work In Process (WIP) Queued processing Loan application & approval, Database Hospital admissions… J.J.Bunn, Distributed Databases, 2001 115
  • 116. Workflow Scripts ? Workflow scripts are programs (could use VBScript or JavaScript) ? If step fails, compensation action handles error ? Events, messages, time, other steps cause step. ? Workflow controller drives flows fork Source join branch case loop Compensation Action J.J.Bunn, Distributed Databases, 2001 Step 116
  • 117. Workflow and ACID ? Workflow is not Atomic or Isolated ? Results of a step visible to all ? Workflow is Consistent and Durable ? Each flow may take hours, weeks, months ? Workflow controller ? keeps flows moving ? maintains context (state) for each flow ? provides a query and operator interface e.g.: “what is the status of Job # 72149?” J.J.Bunn, Distributed Databases, 2001 117
  • 118. ACID Objects Using ACID DBs The easy way to build transactional objects ? Application uses transactional objects (objects have ACID properties) SQL ? If object built on top of ACID objects, then object is ACID. ? Example: New, EnQueue, DeQueue on top of SQL ? SQL provides ACID dim c as Customer dim CM as CustomerMgr Business Object: Customer ... set C = CM.get(CustID) ... Business Object Mgr: CustomerMgr C.credit_limit = 1000 ... SQL CM.update(C, CustID) Persistent Programming languages automate this. J.J.Bunn, Distributed Databases, 2001 .. 118
  • 119. ACID Objects From Bare Metal The Hard Way to Build Transactional Objects ? Object Class is a Resource Manager (RM) ? Provides ACID objects from persistent storage ? Provides Undo (on rollback) ? Provides Redo (on restart or media failure) ? Provides Isolation for concurrent ops ? Microsoft SQL Server, IBM DB2, Oracle,… are Resource managers. ? Many more coming. ? RM implementation techniques described later J.J.Bunn, Distributed Databases, 2001 119
  • 120. Transaction Manager ? Transaction Manager (TM): manages transaction objects. TM ? XID factory egi n b ? tracks them XID enlist App ? coordinates them call(..XID) RM ? App gets XID from TM ? Transactional RPC ? passes XID on all calls ? manages XID inheritance ? TM manages commit & rollback J.J.Bunn, Distributed Databases, 2001 120
  • 121. TM Two P - hase Commit Dealing with multiple RMs ? If all use one RM, then all or none commit ? If multiple RMs, then need coordination ? Standard technique: ? Marriage: Do you? I do. I pronounce…Kiss ? Theater: Ready on the set? Ready! Action! Act ? Sailing: Ready about? Ready! Helm’s a-lee! Tack ? Contract law: Escrow agent ? Two-phase commit: ? 1. Voting phase: can you do it? ? 2. If all vote yes, then commit phase: do it! J.J.Bunn, Distributed Databases, 2001 121
  • 122. Two P - hase Commit In Pictures ? Transactions managed by TM ? App gets unique ID (XID) from TM at Begin() ? XID passed on Transactional RPC ? RMs Enlist when first do work on XID gin TM En Be lis t XID App RM1 En Call(..XID..) lis Call(..XI t D..) RM2 J.J.Bunn, Distributed Databases, 2001 122