Big Data and NoSQL
    Landscape
              Sanjeev Mishra
    Silicon Valley Code Camp 2012



                        Sanjeev Mishra SVCC 2012
Timeline
 • 1970s – Genesis of modern db
  • Modeling the world based on relational
    calculus: best for managing uniform data
 • 1980s
  • RDBMS takes over the world
 • 1990s – 2000+
  • Invention of HTML
  • Spread of Web based technologies



Need for Modern Data Storage
 • Amazon
  • Managing: Shopping carts, Seller Lists, Customer Preferences,
    Sales Rank, Recommendations

 • Google
  • Storing and managing web scale data

 • Facebook
  • Managing social graphs

 • LinkedIn, Twitter and others

Data Explosion Current
 • Every two days now we create as much information as we did from
   the dawn of civilization up until 2003: about 5 exabytes
   (1 EB = 1K PB) of data. (Eric Schmidt) *
Data Explosion Future

 • A telescope planned to be finished
   in 2024 will generate more data
   in a single day than the entire
   Internet.*




What is Big Data?

 • Terabytes (TB) are not big data; petabytes
   (PB, 1 PB = 1000 TB) may be.

 • Current definition of big data: zettabytes
   (1M PB or 1G TB)




Nature of Big Data
Web 2.0 kind of data

   • Different from traditional RDBMS/Warehouse
     data, which has more reads and fewer updates
   • User Generated Content – Tweets, Reviews,
     Comments etc…
   • Lots of updates and lots of reads
   • Scale to millions of users
   • Not necessarily Transactional
   • Compromised consistency



Data Explosion, So What?
 • Structural issues
   • The dynamic nature of data
 • Performance issues
   • Insertion
   • Search
 • Scaling Horizontally
   • Dozens or hundreds of machines operating as a single
     server
What is NoSQL?
Not Only SQL or Not Relational

  •   Carlo Strozzi used the term in 1998 and then Eric Evans in 2009

  •   Simple call level interface (SQL not supported)

  •   Flexible schema

  •   Efficient use of distributed indexes

  •   Horizontal scaling of operations over many servers

  •   No ACID but BASE (Basically Available, Soft state*,
      Eventually consistent**)
CAP Theorem (Brewer’s Theorem)*

  A distributed system can satisfy at most two of the
  following three guarantees at the same time

   o Consistency (all nodes see the same data at the same
     time)

   o Availability (a guarantee that every request receives a
     response about whether it was successful or failed)

   o Partition tolerance (the system continues to operate
     despite arbitrary message loss or failure of part of the
     system)
Eventual Consistency Flavors
 • Causal consistency
   o changes are notified through events, the receiving
     session will always see the updated value.
 • Read your own writes
   o a session that updates the db will immediately see the
     changes.
 • Monotonic consistency*
   o once a session reads a value, it will never see an earlier
     value.
Consistency Tradeoffs




Where,
  o N is # of copies of each data that db maintains
  o R is # of copies that is read for each read
  o W is # of copies that must be written for each write


• Most NoSQL systems use N>W>1: more than one write must
  complete, but not all nodes need to update immediately.
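
A common reading of N, R and W in Dynamo-style systems is that a read is guaranteed to overlap the latest write when R + W > N. The sketch below checks that rule for a few settings; the function names are hypothetical and the rule is stated here as an assumption, since the slide's own formula did not survive extraction.

```python
def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """Return True when every read quorum must intersect every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def describe(n: int, r: int, w: int) -> str:
    """Summarize the consistency a given N/R/W setting gives a reader."""
    if read_overlaps_write(n, r, w):
        return "strong: each read touches at least one up-to-date copy"
    return "eventual: a read may miss the latest write"


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 2), (3, 1, 3), (3, 1, 1)]:
        print(f"N={n} R={r} W={w} -> {describe(n, r, w)}")
```

Lowering R or W trades consistency for latency: fewer replicas have to answer before the request returns.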
Column Vs Row Storage




Row vs. Column Oriented DB
Id   First name   Last name   SSN           DOB
1    John         Doe         111-22-3333   8/12/1968
2    Jane         Doe         111-32-3408   4/3/1972

Row oriented (values stored record by record):
  1, John, Doe, 111-22-3333, 8/12/1968
  2, Jane, Doe, 111-32-3408, 4/3/1972

Column oriented (values stored attribute by attribute):
  1, 2
  John, Jane
  Doe, Doe
  111-22-3333, 111-32-3408
  8/12/1968, 4/3/1972
Contrasting Operations on Row vs Col DB
Insert a new tuple

Row oriented (the new record is appended at the end):
  1, John, Doe, 111-22-3333, 8/12/1968
  2, Jane, Doe, 111-32-3408, 4/3/1972
  3, Foo, Bar, 237-23-3924, 2/3/1978

Column oriented (one value is added to every column):
  1, 2, 3
  John, Jane, Foo
  Doe, Doe, Bar
  111-22-3333, 111-32-3408, 237-23-3924
  8/12/1968, 4/3/1972, 2/3/1978
Row vs. Column Oriented DB
Create a new attribute

Row oriented (one value is added to every record):
  1, John, Doe, 111-22-3333, 8/12/1968, 408-555-1212
  2, Jane, Doe, 111-32-3408, 4/3/1972, 650-555-2323

Column oriented (the new column is appended at the end):
  1, 2
  John, Jane
  Doe, Doe
  111-22-3333, 111-32-3408
  8/12/1968, 4/3/1972
  408-555-1212, 650-555-2323
Row vs. Column Oriented DB
Get all who were born in a given year

Row oriented:
  Easy, just pick all rows where the year of DOB matches the given year
Column oriented:
  Not so simple: scan the years, remember the indexes of all
  occurrences that match the given year, and extract based on these
  indexes

Get sum of all years

Row oriented:
  A little difficult: the data does not live consecutively, so
  scanning through the entire dataset is needed
Column oriented:
  Easy, the data is found consecutively
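
The two queries above can be sketched over both layouts. This is a minimal illustration, not how any particular database stores data; it assumes the DOB is reduced to a birth year, and the function names are made up for the example.

```python
# Row layout: one tuple per record (id, first, last, ssn, birth_year).
rows = [
    (1, "John", "Doe", "111-22-3333", 1968),
    (2, "Jane", "Doe", "111-32-3408", 1972),
    (3, "Foo", "Bar", "237-23-3924", 1978),
]

# Column layout: one list per attribute; position i belongs to record i.
columns = {
    "id": [r[0] for r in rows],
    "first": [r[1] for r in rows],
    "last": [r[2] for r in rows],
    "ssn": [r[3] for r in rows],
    "birth_year": [r[4] for r in rows],
}


def born_in_rows(rows, year):
    """Row store: filter whole records directly."""
    return [r for r in rows if r[4] == year]


def born_in_columns(columns, year):
    """Column store: find matching positions, then rebuild each record."""
    hits = [i for i, y in enumerate(columns["birth_year"]) if y == year]
    return [tuple(columns[name][i] for name in columns) for i in hits]


def sum_years_rows(rows):
    """Row store: every record is visited to reach one field."""
    return sum(r[4] for r in rows)


def sum_years_columns(columns):
    """Column store: the values already sit next to each other."""
    return sum(columns["birth_year"])
```

Both layouts return the same answers; what differs is how much unrelated data each query has to walk past.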




Glossary
•   Consistent Hashing (Cassandra, Dynamo)
     o   the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the
         largest hash value wraps around to the smallest hash value)
•   Vector Clock (Riak, Dynamo, Voldemort)
     o   an algorithm for generating a partial ordering of events in a distributed system and
         detecting causality violations
•   Quorum (Cassandra, Dynamo (sloppy))
•   Merkle Tree (Cassandra, Riak, Dynamo)
     o   a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in
         the tree are hashes of their respective children. The principal advantage of a Merkle tree is
         that each branch of the tree can be checked independently without requiring nodes to
         download the entire data set
•   Anti-Entropy Gossip Protocol (Cassandra, Dynamo)
     o   comparing all the replicas of each piece of data that exist and updating each replica to the
         newest version
•   Order preserving partitioning (Cassandra, MongoDB)
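
As an illustration of the Merkle tree entry, the sketch below builds a root hash over a replica's values and compares two replicas branch by branch. It assumes SHA-256 and a fixed key order on both replicas; neither detail comes from the slides, and real systems differ in how they build and exchange the tree.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(values):
    """Root hash of a list of byte strings (one per key, in a fixed order)."""
    if not values:
        return _h(b"")
    level = [_h(v) for v in values]          # leaves: hashes of the values
    while len(level) > 1:
        if len(level) % 2:                   # odd count: repeat the last hash
            level.append(level[-1])
        # parents: hash of the two children concatenated
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = [b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v1", b"v2", b"v3", b"stale"]

# Equal roots mean the replicas agree; unequal roots mean a repair is needed.
in_sync = merkle_root(replica_a) == merkle_root(replica_b)

# Each branch can be checked on its own, so only the differing half
# has to be examined further or transferred.
left_same = merkle_root(replica_a[:2]) == merkle_root(replica_b[:2])
right_same = merkle_root(replica_a[2:]) == merkle_root(replica_b[2:])
```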




Glossary
•   MVCC
    o   multi version concurrency control
•   Atomicity
    o   all or nothing
•   Consistency
    o   each transaction leaves the db in a valid state
•   Isolation
    o   concurrent execution of txns results in the same state as if the txns were executed serially
•   Durability
    o   committed txns remain so even in the event of power loss, crashes or errors

•   WAL
    o   Write ahead logging – changes are written to a log before they are applied (Durability)


•   Eventually consistent
    o   given a sufficiently long quiet period, all updates can be expected to propagate through
        the system and all replicas will be consistent
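
The WAL entry can be sketched as a tiny store that logs every change before applying it and replays the log on startup. `TinyStore` and its JSON-lines log format are hypothetical, chosen only to keep the example short; production logs also handle torn writes, checkpoints and truncation.

```python
import json
import os


class TinyStore:
    """In-memory key/value map made durable by a write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Recovery: re-apply every logged change in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the live state.
        self.data[key] = value
```

If the process dies after step 1, the change is still recovered by the replay on the next start.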




Glossary
 •   Sharding
     o horizontal partitioning of data, storing records on different servers according to some key
 •   Tuple
     o row in RDBMS, predefined schema.
 •   Document
     o contains nested documents or lists as well as scalar values. No predefined schema.
 •   Extensible Record
     o hybrid between Tuple and Document, families of attributes defined in a schema but attributes
          can be added on a per record basis.
 •   Key-value Stores
     o stores values indexed by a user defined key.
 •   Document Stores
     o indexed document store
 •   Extensible Record Stores aka Wide Column Stores
     o Stores extensible records partitioned vertically and horizontally across nodes.




NoSQL Categories
 • Key-value Stores
   o Stores values indexed by a user defined key.


 • Document Stores
   o Indexed document store


 • Extensible Record Stores (Column Stores)
   o Stores extensible records partitioned vertically and
     horizontally across nodes.


 • Graph Databases
Key-Value Stores




Key-Value Stores
 • A distributed cache/Hashtable
   o Inspired by Amazon Dynamo
   o like memcached with
       o persistence, replication, versioning, locking, transactions,
          sorting etc.
   o   get/put and lookups
   o   No secondary indices or keys
   o   Values are BLOBs or in some cases JSON document
   o   Scalability through key distribution over nodes




Key-Value Stores
 •   Riak (Erlang/Basho/Apache)
 •   Membase (C+Erlang/Couchbase/Apache)
 •   Project Voldemort (Java/LinkedIn/Apache)
 •   Redis (C/VMWare/BSD)
 •   Scalaris (Erlang/Zuse+onScale/Apache)
 •   Tokyo Cabinet (C/Fal Labs/LGPL)
 •   Dynamo (Java/For Amazon internal use)


 There are others
     Key Value / Tuple Store at http://nosql-database.org/

Amazon Dynamo
•   KV Store Developed by Amazon to support
    o   Best Seller Lists
    o   Shopping carts
    o   Customer Preferences
    o   Session Management
    o   Sales Rank
    o   Product Catalog etc...
•   Variation of Consistent Hashing based Data
    Partitioning and Replication
•   Dynamic add/delete of Storage Nodes
•   Each service uses a distinct instance of
    Dynamo
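
Consistent-hashing partitioning with dynamic node add/delete can be sketched as a sorted ring of hash positions. This is a simplified illustration, not Dynamo's implementation: the `HashRing` class, the virtual-node count and the node names are assumptions, and replication to successor nodes is left out.

```python
import bisect
import hashlib


class HashRing:
    """Keys and nodes share one circular hash space; a key belongs to
    the first node position at or after its own position."""

    def __init__(self, nodes=(), vnodes=8):
        self.vnodes = vnodes          # positions per node, to even out load
        self._ring = []               # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(key: str) -> int:
        # 128-bit MD5 of the key, read as an integer position on the ring.
        return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # (pos,) sorts just before any (pos, node) entry with the same pos.
        idx = bisect.bisect_left(self._ring, (self._position(key),))
        return self._ring[idx % len(self._ring)][1]   # wrap around the ring
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is what makes adding and deleting storage nodes cheap.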
Amazon Dynamo Cont...
•   Key/Value are opaque byte[]. ID = 128-bit
    MD5 hash of the Key
•   “always writeable” where no updates are
    rejected due to failures or concurrent writes
•   Simple Read/Write - get/put - operation on
    data uniquely identified by a key, value is
    binary object (BLOB)
    o get(key): single or a list (conflicts with context)
    o put(key,context,object)

•   Eventual consistency with no isolation
    guarantees
RIAK
•   Developed in Erlang by Basho
•   Clients: Python, JavaScript, Java, PHP, Erlang
•   Dynamo inspired Open-Source
    o Advanced K/V and
    o Document Store (not a full featured document store)
•   Replication and sharding by primary key hash
    o   Consistent Hashing
    o   De-Centralized (No-Master node)
•   Eventually consistent
    o Tunable number of replicas for read and write
    o Tunable per-read and per-write
    o Different parts of an application can choose different
      trade-offs
Project Voldemort
•   Java based advanced Key/Value store
•   Developed at LinkedIn
•   Open source, Apache license
•   Supports MVCC for updates
•   Replicas are updated asynchronously - up-to-
    date view guaranteed if majority of replicas read
•   Uses optimistic locking for consistent multi-
    record updates
•   Versions are ordered based on Vector clocks
•   More info: http://www.project-voldemort.com/voldemort/
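
Ordering versions by vector clocks can be sketched as follows. This is a generic illustration of the technique, not Voldemort's actual API; clocks are plain dicts from node name to counter, and the function names are made up for the example.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b; b supersedes a
    if b_le_a:
        return "after"
    return "concurrent"      # neither saw the other: a conflict to resolve


v1 = increment({}, "n1")     # first write, coordinated by n1
v2 = increment(v1, "n1")     # n1 updates its own version
v3 = increment(v1, "n2")     # n2 updates v1 without having seen v2
```

Here `v1` is ordered before both `v2` and `v3`, while `v2` and `v3` are concurrent and would both be kept until a client reconciles them.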


Document Stores




Document Stores
 • Data more complex than that in K/V stores
 • Data encapsulated and encoded in
   o JSON, XML, YAML, BSON or some other standard format
 • Multiple types of documents per database
   o Documents of similar type grouped together
   o Optional metadata/schema for the document
   o Less rigid schema than that of RDBMS
 • Nested documents or collections
 • Secondary indexes
 • Complex query/update support
   o Multiple attributes, collections etc
Document Example
 {
     "when": "2011-09-19T02:10:11.3Z",
     "author": "alex",
     "title": "No Free Lunch",
     "text": "This is the text of the post. It could be very long.",
     "tags": [ "business", "ramblings" ],
     "votes": 5,
     "voters": ["jane", "joe", "spencer", "phyllis", "li"],
     "comments": [
         {
              "who": "jane",
              "when": "2011-09-19T04:00:10.112Z",
              "comment": "I agree."
         },
         {
              "who": "meghan",
              "when": "2011-09-20T14:36:06.958Z",
              "comment": "You must be joking. etc etc ..."
         }
     ]
 }

Document Stores
 •   MongoDB (C++/10gen/AGPL)
 •   Apache CouchDB (Erlang/Apache)
 •   Amazon SimpleDB (Erlang/Amazon)
 •   Terrastore (Java/Terracotta/Apache)
 •   RavenDB (C#/Hibernating Rhinos/AGPL)


 There are others
     Document Store at http://nosql-database.org/


MongoDB




MongoDB
huMongous

 • Document format: BSON (Binary JSON)
 • Supports nested documents
 • Documents are grouped in Collections
 • Supports secondary indexes
 • Scalability – auto sharding
 • Consistency – Tunable based on request
   (WriteConcerns)
 • Replication – replica set – master – slave
 • Atomicity – document level
MongoDB
Data types
  String, Integer, Boolean, Double, Null, Array, Object, ObjectId,
  Binary, Regex, Code

SQL           MongoDB
Database      Database
Table         Collection
Index         Index
Row           Document
Column        Field
Join          Embedding or Linking
Primary Key   _id

SQL                                         MongoDB
create table users                          db.createCollection("users")
(name varchar(128), age number)
insert into users values ('bob', 32)        db.users.insert({name:"bob", age:32})
select * from users                         db.users.find()
select name, age from users                 db.users.find({}, {name:1, age:1, _id:0})
select name, age from users where age=32    db.users.find({age:32}, {name:1, age:1, _id:0})
select * from users order by name asc       db.users.find().sort({name:1})
select * from users limit 10 offset 20      db.users.find().skip(20).limit(10)
select distinct name from users             db.users.distinct("name")
select count(*) from users                  db.users.count()
update users set age=33 where name='bob'    db.users.update({name:"bob"}, {$set:{age:33}}, false, true)
delete from users where name='bob'          db.users.remove({name:"bob"})
Extensible Record Stores aka Column Stores
Extensible Record Stores
Column Stores

 • Motivated by Google BigTable
 • Basic Data Model – Rows and Columns
 • Scale by splitting rows and columns over
   multiple nodes
   o Rows split by sharding on primary key – split
     by range rather than hash function
   o Columns split by column groups
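
Splitting rows by key range rather than by hash can be sketched with a sorted list of split points. The `RangePartitioner` class, the split points and the node names below are hypothetical; real systems also split and move ranges as they grow.

```python
import bisect


class RangePartitioner:
    """Assign each row key to the node that owns its key range."""

    def __init__(self, split_points, nodes):
        if list(split_points) != sorted(split_points):
            raise ValueError("split points must be sorted")
        if len(nodes) != len(split_points) + 1:
            raise ValueError("need exactly one more node than split points")
        self.split_points = list(split_points)
        self.nodes = list(nodes)

    def node_for(self, row_key):
        # Keys below the first split point go to nodes[0], and so on.
        return self.nodes[bisect.bisect_right(self.split_points, row_key)]

    def nodes_for_range(self, start, end):
        """Nodes that must be contacted to scan keys from start to end."""
        lo = bisect.bisect_right(self.split_points, start)
        hi = bisect.bisect_right(self.split_points, end)
        return self.nodes[lo:hi + 1]


# Three ranges: keys < "h", "h" <= keys < "p", keys >= "p".
partitioner = RangePartitioner(["h", "p"], ["node1", "node2", "node3"])
```

Because neighbouring keys stay on the same node, a range scan touches only the nodes whose ranges it crosses; with a hash function the same scan would be scattered across the cluster.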



Extensible Record Stores
 • Cassandra (Java/Facebook/Apache)
   •   Marriage of Dynamo and BigTable

 • HBase (Java/Yahoo/Apache)
   •   Inspired by BigTable, uses HDFS for storage

 • HyperTable (C++/Zvents/GPL)
   •   Similar to HBase/BigTable

 • Accumulo (Java/NSA/Apache)
   •   Uses Hadoop, ZooKeeper, and Thrift, cell level access control

 • Google BigTable (Internal to Google)

 There are others
   Wide Column Store at http://nosql-database.org/
Cassandra




Cassandra Features
 • Decentralized
    o Data is distributed across cluster of nodes
    o No master, any node can address any request
    o No single point of failure
 • Fault-tolerant (Configurable replication strategies)
    o Simple Strategy (first replica placed on the node determined by the
      partitioner, the rest on the next nodes clockwise around the ring)
    o Network Topology Strategy: multi datacenter strategy
Cassandra Features Cont…
 • Failure detection and recovery
   o Based on Gossip protocol
   o Node state updated based on gossip message version
   o Per-node heartbeat threshold
 • Tunable consistency
   o Can be configured per read/write




Cassandra
Data types
  ascii, int, float, decimal, boolean, bigint, double, varchar,
  counter, timestamp, uuid, text, blob, varint

SQL           Cassandra
Database      Keyspace
Table         Column Family
Index         Index
Row           Row
Column        Column
Join          (no equivalent)
Primary Key   Primary Key

SQL                                      Cassandra QL
create database codecamp                 CREATE KEYSPACE codecamp WITH strategy_class =
                                         'NetworkTopologyStrategy' AND strategy_options:DC1=3
create table users                       CREATE COLUMNFAMILY users (key varchar PRIMARY KEY,
(key varchar(128), name                  name varchar, age int)
varchar(128), age number)
create index idx_name ON users(name)     CREATE INDEX idx_name ON users(name)
insert into users values                 INSERT INTO users (KEY, name, age)
('jdoe', 'Jane Doe', 39)                 VALUES ('jdoe', 'Jane Doe', 39)
select name, age from users              SELECT name, age FROM users
where age>30                             WHERE age>30
update users set age = 35                UPDATE users SET age=35
where key = 'bob'                        WHERE KEY = 'bob'
delete from users where key='bob'        DELETE FROM users WHERE KEY = 'bob'
                                         DELETE age FROM users WHERE KEY = 'alice'
drop table users                         DROP COLUMNFAMILY users
drop database codecamp                   DROP KEYSPACE codecamp
Cassandra
Column and Column Family
Column
  name: byte[]
  value: byte[]
  timestamp

Super Column
  name: byte[]
  value: collection of Columns

Example Column
  name: "userid"
  value: "jdoe"
  timestamp: …

Example Super Column
  name: homeaddress
  value:
    name: "street"   value: "555 Homestead Rd"   timestamp: …
    name: "city"     value: "Sunnyvale"          timestamp: …
    name: "zip"      value: "95051"              timestamp: …

Column Family (each Row is a Row Key plus its Columns)
  Row Key: jdoe
    name: "userid"   value: "jdoe"         timestamp: …
    name: "name"     value: "Jane Doe"     timestamp: …
    name: "age"      value: 33             timestamp: …
  Row Key: ladams
    name: "userid"   value: "ladams"       timestamp: …
    name: "name"     value: "Larry Adam"   timestamp: …
    name: "age"      value: 47             timestamp: …
  Row Key: bdole
    name: "userid"   value: "bdole"        timestamp: …
    name: "name"     value: "Bob Dole"     timestamp: …
    name: "age"      value: 67             timestamp: …
Cassandra Keyspace
Analogous to database in RDBMS

 •   Contains one or more Column Families
     analogous to tables in RDBMS
 •   Column Family contains columns
 •   A Row Key identifies a set of related columns
 •   A Row is not required to have same set of
     columns
 •   No join between two column families:
     o   Each column family is self contained to serve a query
     o   A rule of thumb - one column family per query for
         better performance
 •   Replication is controlled on per-keyspace basis
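
The keyspace, column family, row and column nesting can be modelled with plain dictionaries. This is only a mental model with hypothetical helper names, not Cassandra's storage engine; it assumes that on conflicting writes to a column the newest timestamp wins.

```python
import time


def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Write one column; on conflict the newest timestamp wins."""
    ts = time.time() if timestamp is None else timestamp
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    existing = row.get(name)
    if existing is None or ts >= existing["timestamp"]:
        row[name] = {"value": value, "timestamp": ts}


def get_row(keyspace, column_family, row_key):
    """Return {column name: value} for one row key (empty if unknown)."""
    row = keyspace.get(column_family, {}).get(row_key, {})
    return {name: col["value"] for name, col in row.items()}


# keyspace -> column family -> row key -> column name -> column
codecamp = {}
insert(codecamp, "users", "jdoe", "name", "Jane Doe", timestamp=1)
insert(codecamp, "users", "jdoe", "age", 33, timestamp=1)
insert(codecamp, "users", "bdole", "name", "Bob Dole", timestamp=1)  # no "age"
insert(codecamp, "users", "jdoe", "age", 39, timestamp=3)
insert(codecamp, "users", "jdoe", "age", 35, timestamp=2)  # older, ignored
```

Rows `jdoe` and `bdole` live in the same column family but carry different sets of columns, which a fixed relational schema would not allow.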
Cassandra in Enterprise
 • Netflix, Twitter, Urban Airship, Constant
   Contact, Reddit, Cisco, OpenX, Rackspace,
   Ooyala, and many more
 • The largest Cassandra cluster has over 300
   TB of data in over 400 machines




HBase
•   Design influenced by Google BigTable
•   A type of NoSQL – more a data store than a database, lacks many
    RDBMS features such as
     •   Typed columns, secondary indexes, triggers, advanced query language etc.
•   Built on top of HDFS: data is stored in HDFS as indexed
    “StoreFiles”
•   Strongly consistent R/W not “eventually consistent” – suitable for
    counter aggregation
•   Auto Sharding
•   Auto Region Server Failover
•   Out of the box support for Hadoop/HDFS
•   Can be used as Source and/or Sink for MapReduce
•   Java, Thrift/REST client
•   Supports Block Cache and Bloom Filters for high volume query
    optimization
•   Web management tool and JMX support
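
A Bloom filter lets a store skip files that certainly do not contain a key. The sketch below shows the general technique only; the class, its sizes and its SHA-256-based hashing are assumptions for illustration and are not HBase's implementation.

```python
import hashlib


class BloomFilter:
    """Set membership test with no false negatives and rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A reader would check `might_contain(row_key)` first and open the file only when the answer is True.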
NoSQL Growth Trends




08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Code camp2012

  • 1. Big Data and NoSQL Landscape Sanjeev Mishra Silicon Valley Code Camp 2012 Sanjeev Mishra SVCC 2012
  • 2. Timeline
      • 1970s: genesis of the modern database
          o Modeling the world based on relational calculus: best for managing uniform data
      • 1980s
          o RDBMS takes over the world
      • 1990s to 2000+
          o Invention of HTML
          o Spread of Web-based technologies
  • 3. Need for Modern Data Storage
      • Amazon
          o Managing shopping carts, seller lists, customer preferences, sales rank, recommendations
      • Google
          o Storing and managing web-scale data
      • Facebook
          o Managing social graphs
      • LinkedIn, Twitter and others
  • 4. Data Explosion Current
      • Every two days now we create as much information as we did from the dawn of civilization up until 2003: about 5 exabytes of data, where 1 exabyte is 1K PB (Eric Schmidt) *
  • 5. Data Explosion Future
      • A telescope planned to be finished in 2024 will generate more data in a single day than the entire Internet.*
  • 6. What is Big Data?
      • Terabytes (TB) are not big data; petabytes (PB, 1000 TB each) may be.
      • Current definition of big data: zettabytes (1 ZB = 1M PB or 1G TB)
  • 7. Nature of Big Data
      Web 2.0 kind of data
      • Different from traditional RDBMS/warehouse data, which sees more reads and fewer updates
      • User-generated content: tweets, reviews, comments etc.
      • Lots of updates and lots of reads
      • Scales to millions of users
      • Not necessarily transactional
      • Compromised consistency
  • 8. Data Explosion, So What?
      • Structural issues
          o The dynamic nature of data
      • Performance issues
          o Insertion
          o Search
      • Scaling horizontally
          o Dozens or hundreds of machines operate as a single server
  • 9. What is NoSQL?
      Not Only SQL or Not Relational
      • Carlo Strozzi used the term in 1998, and then Eric Evans in 2009
      • Simple call-level interface (SQL not supported)
      • Flexible schema
      • Efficient use of distributed indexes
      • Horizontal scaling of operations over many servers
      • No ACID but BASE (Basically Available, Soft state*, Eventually consistent**)
  • 10. CAP Theorem (Brewer's Theorem)*
      A distributed system can satisfy at most two of the following three guarantees at the same time:
          o Consistency (all nodes see the same data at the same time)
          o Availability (a guarantee that every request receives a response about whether it was successful or failed)
          o Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  • 11. Eventual Consistency Flavors
      • Causal consistency
          o Changes are notified through events; the receiving session will always see the updated value.
      • Read your own writes
          o A session that updates the db will immediately see its changes.
      • Monotonic consistency*
          o Once a session reads a value, it will never see an earlier value.
  • 12. Consistency Tradeoffs
      Where:
          o N is the number of copies of each piece of data that the db maintains
          o R is the number of copies that are read for each read
          o W is the number of copies that must be written for each write
      • Most NoSQL systems use N>W>1: more than one write must complete, but not all nodes need to update immediately.
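The interplay of N, R and W can be made concrete with a short sketch. It assumes the commonly cited quorum rule of Dynamo-style systems, which is not stated on the slide: a read is guaranteed to touch at least one replica holding the latest write when R + W > N. The function names are illustrative only.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when any r-replica read set must share a replica with any w-replica write set."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("need 1 <= R <= N and 1 <= W <= N")
    # Assumed rule (not from the slides): overlap is guaranteed when R + W > N.
    return r + w > n


def overlap_by_enumeration(n: int, r: int, w: int) -> bool:
    """Check the same property the slow way, by trying every read/write set pair."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n, r, w in [(3, 1, 1), (3, 2, 2), (3, 1, 3)]:
    label = "overlap guaranteed" if quorums_overlap(n, r, w) else "stale reads possible"
    print(f"N={n} R={r} W={w}: {label}")
```

With N=3, the N>W>1 setting leaves W=2, so under this rule R=2 would be the smallest read quorum that still always meets the latest write.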
  • 13. Column Vs Row Storage
  • 14. Row vs. Column Oriented DB
      Id  First name  Last name  SSN           DOB
      1   John        Doe        111-222-3333  8/12/1968
      2   Jane        Doe        111-332-3408  4/3/1972
      • Row oriented
          o 1, John, Doe, 111-222-3333, 8/12/1968
          o 2, Jane, Doe, 111-332-3408, 4/3/1972
      • Column oriented
          o 1, 2
          o John, Jane
          o Doe, Doe
          o 111-222-3333, 111-332-3408
          o 8/12/1968, 4/3/1972
  • 15. Contrasting Operations on Row vs Col DB
      Insert a new tuple
      • Row oriented (the new tuple is appended at the end)
          o 1, John, Doe, 111-22-3333, 8/12/1968
          o 2, Jane, Doe, 111-32-3408, 4/3/1972
          o 3, Foo, Bar, 237-23-3924, 2/3/1978
      • Column oriented (every column receives one new value)
          o 1, 2, 3
          o John, Jane, Foo
          o Doe, Doe, Bar
          o 111-22-3333, 111-32-3408, 237-23-3924
          o 8/12/1968, 4/3/1972, 2/3/1978
  • 16. Row vs. Column Oriented DB
      Create a new attribute
      • Row oriented (every tuple receives one new value)
          o 1, John, Doe, 111-22-3333, 8/12/1968, 408-555-1212
          o 2, Jane, Doe, 111-32-3408, 4/3/1972, 650-555-2323
      • Column oriented (one new column is appended at the end)
          o 1, 2
          o John, Jane
          o Doe, Doe
          o 111-22-3333, 111-32-3408
          o 8/12/1968, 4/3/1972
          o 408-555-1212, 650-555-2323
  • 17. Row vs. Column Oriented DB
      • Get all who were born in a given year
          o Row oriented: easy, just pick all rows where the year of DOB matches the given year
          o Column oriented: not so simple; scan the years, remember the indexes of all occurrences that match the given year, and extract based on these indexes
      • Get sum of all years
          o Row oriented: a little difficult; the data does not live consecutively, so scanning through the entire dataset is needed
          o Column oriented: easy, the data is found consecutively
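The two query patterns above can be sketched in a few lines of Python. This is only an in-memory illustration using the sample table from the slides: tuples stand in for row storage and a dict of lists for column storage, and all function names are assumptions made for the example.

```python
FIELDS = ("id", "first_name", "last_name", "ssn", "dob")

# Row-oriented layout: one tuple per record.
ROWS = [
    (1, "John", "Doe", "111-222-3333", "8/12/1968"),
    (2, "Jane", "Doe", "111-332-3408", "4/3/1972"),
]

# Column-oriented layout: one list per attribute, aligned by position.
COLUMNS = {name: [row[i] for row in ROWS] for i, name in enumerate(FIELDS)}


def year_of(dob: str) -> int:
    """Extract the year from a date string such as '8/12/1968'."""
    return int(dob.rsplit("/", 1)[1])


def born_in_rows(rows, year):
    # Row store: filter whole tuples directly.
    return [row for row in rows if year_of(row[4]) == year]


def born_in_columns(columns, year):
    # Column store: scan one column, remember matching positions,
    # then rebuild each record by visiting every column at those positions.
    hits = [i for i, dob in enumerate(columns["dob"]) if year_of(dob) == year]
    return [tuple(columns[name][i] for name in FIELDS) for i in hits]


def sum_years_rows(rows):
    # Row store: every full tuple is visited to reach one field.
    return sum(year_of(row[4]) for row in rows)


def sum_years_columns(columns):
    # Column store: only the single contiguous DOB column is scanned.
    return sum(year_of(dob) for dob in columns["dob"])


print(born_in_columns(COLUMNS, 1968))
print(sum_years_columns(COLUMNS))
```

Both layouts return the same answers; what differs is how much unrelated data each query has to walk past, which is the point the slide makes.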
  • 18. Glossary
      • Consistent Hashing (Cassandra, Dynamo)
          o The output range of a hash function is treated as a fixed circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value)
      • Vector Clock (Cassandra, Riak, Dynamo)
          o An algorithm for generating a partial ordering of events in a distributed system and detecting causality violations
      • Quorum (Cassandra, Dynamo (sloppy))
      • Merkle Tree (Cassandra, Riak, Dynamo)
          o A hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire data set
      • Anti-Entropy Gossip Protocol (Cassandra, Dynamo)
          o Comparing all the replicas of each piece of data that exist and updating each replica to the newest version
      • Order-preserving partitioning (Cassandra, MongoDB)
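As a rough illustration of the consistent hashing entry, the sketch below places nodes on a ring and assigns each key to the first node found clockwise from the key's hash. It is a minimal version under stated assumptions: MD5 is used as the ring hash, there are no virtual nodes or replicas, and the class and method names are invented for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring: one point per node, no virtual nodes."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("ring needs at least one node")
        # Each node sits on the ring at the position given by its hash.
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # Assumption for this sketch: MD5 read as a 128-bit integer.
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")

    def node_for(self, key: str) -> str:
        # First node clockwise from the key; the modulo wraps the largest
        # hash values around to the smallest point on the ring.
        index = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("cart:42"))
```

The property that makes the ring useful is that removing a node only reassigns the keys that node owned; keys on the other nodes stay where they were.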
  • 19. Glossary
      • MVCC
          o Multi-version concurrency control
      • Atomicity
          o All or nothing
      • Consistency
          o Each transaction leaves the db in a valid state
      • Isolation
          o Concurrent execution of transactions results in the same state that would be obtained if the transactions were executed serially
      • Durability
          o Committed transactions remain so even in the event of power loss, crashes or errors
      • WAL
          o Write-ahead logging: changes are written to a log before they are applied (Durability)
      • Eventually consistent
          o Given a sufficiently long quiet period, all updates can be expected to propagate through the system and all replicas will be consistent
  • 20. Glossary
      • Sharding
          o Horizontal partitioning of data, storing records on different servers according to some key
      • Tuple
          o Row in an RDBMS, with a predefined schema.
      • Document
          o Contains nested documents or lists as well as scalar values. No predefined schema.
      • Extensible Record
          o Hybrid between Tuple and Document: families of attributes are defined in a schema, but attributes can be added on a per-record basis.
      • Key-value Stores
          o Store values indexed by a user-defined key.
      • Document Stores
          o Indexed document stores
      • Extensible Record Stores aka Wide Column Stores
          o Store extensible records partitioned vertically and horizontally across nodes.
  • 21. NoSQL Categories
      • Key-value Stores
          o Store values indexed by a user-defined key.
      • Document Stores
          o Indexed document stores
      • Extensible Record Stores (Column Stores)
          o Store extensible records partitioned vertically and horizontally across nodes.
      • Graph Databases
  • 22. Key-Value Stores
  • 23. Key-Value Stores
      • A distributed cache/hashtable
          o Inspired by Amazon Dynamo
          o Like memcached, with persistence, replication, versioning, locking, transactions, sorting etc.
          o get/put and lookups
          o No secondary indices or keys
          o Values are BLOBs or, in some cases, JSON documents
          o Scalability through key distribution over nodes
  • 24. Key-Value Stores
      • Riak (Erlang/Basho/Apache)
      • Membase (C+Erlang/Couchbase/Apache)
      • Project Voldemort (Java/LinkedIn/Apache)
      • Redis (C/VMWare/BSD)
      • Scalaris (Erlang/Zuse+onScale/Apache)
      • Tokyo Cabinet (C/Fal Labs/LGPL)
      • Dynamo (Java/For Amazon internal use)
      Other key-value / tuple stores are listed at http://nosql-database.org/
  • 25. Amazon Dynamo
      • KV store developed by Amazon to support
          o Best seller lists
          o Shopping carts
          o Customer preferences
          o Session management
          o Sales rank
          o Product catalog etc.
      • Variation of consistent-hashing-based data partitioning and replication
      • Dynamic add/delete of storage nodes
      • Each service uses a distinct instance of Dynamo
  • 26. Amazon Dynamo Cont...
      • Keys and values are opaque byte[]. ID = 128-bit MD5 hash of the key
      • "Always writeable": no updates are rejected due to failures or concurrent writes
      • Simple read/write (get/put) operations on data uniquely identified by a key; the value is a binary object (BLOB)
          o get(key): returns a single object, or a list of conflicting versions, along with a context
          o put(key, context, object)
      • Eventual consistency with no isolation guarantees
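The get/put interface described above can be approximated with a toy, single-process store. This is a sketch only and not Dynamo's implementation: the context here is a plain version counter (an assumption made for brevity, where the real system carries version metadata), and the names `key_id` and `TinyStore` are invented for the example. It does show the two ideas from the slide: the 128-bit MD5 key ID and writes that are never rejected.

```python
import hashlib


def key_id(key: bytes) -> int:
    """128-bit identifier: the MD5 digest of the key read as an integer."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


class TinyStore:
    """Toy 'always writeable' store; the context is a simple version counter (assumption)."""

    def __init__(self):
        self._data = {}  # key id -> (version, list of sibling values)

    def get(self, key: bytes):
        """Return (values, context); more than one value means unresolved conflicts."""
        version, values = self._data.get(key_id(key), (0, []))
        return list(values), version

    def put(self, key: bytes, context: int, value: bytes) -> None:
        kid = key_id(key)
        version, values = self._data.get(kid, (0, []))
        if context == version:
            # The writer saw the latest version, so its value replaces it.
            self._data[kid] = (version + 1, [value])
        else:
            # Stale context: the write is still accepted and kept as a sibling.
            self._data[kid] = (version + 1, values + [value])


store = TinyStore()
store.put(b"cart:42", 0, b"book")
values, ctx = store.get(b"cart:42")
store.put(b"cart:42", ctx, b"book,pen")   # up-to-date writer
store.put(b"cart:42", ctx, b"book,mug")   # concurrent writer with the same, now stale, context
print(store.get(b"cart:42"))
```

A client that receives several values from `get` is expected to merge them and write the result back with the returned context, which is how conflicts get resolved in this style of store.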
  • 27. RIAK
      • Developed in Erlang by Basho
      • Clients: Python, JavaScript, Java, PHP, Erlang
      • Dynamo-inspired open source
          o Advanced K/V and
          o Document store (not a full-featured document store)
      • Replication and sharding by primary key hash
          o Consistent hashing
          o Decentralized (no master node)
      • Eventually consistent
          o Tunable number of replicas for read and write
          o Tunable per read and per write
          o Different parts of an application can choose different trade-offs
  • 28. Project Voldemort
      • Java-based advanced key/value store
      • Developed at LinkedIn
      • Open source, Apache license
      • Supports MVCC for updates
      • Replicas are updated asynchronously; an up-to-date view is guaranteed if a majority of replicas is read
      • Uses optimistic locking for consistent multi-record updates
      • Versions are ordered based on vector clocks
      • More info: http://www.project-voldemort.com/voldemort/
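Ordering versions with vector clocks can be sketched as follows. This is a generic illustration of the technique rather than Voldemort's code: a clock is modeled as a dict from node name to counter, and `increment` and `compare` are names chosen for the example.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has, and more
    if b_le_a:
        return "after"
    return "concurrent"      # neither descends from the other: a conflict


v1 = increment({}, "A")          # first write, handled by node A
v2 = increment(v1, "B")          # update of v1 handled by node B
v3 = increment(v1, "C")          # another update of v1, handled by node C
print(compare(v1, v2), compare(v2, v3))
```

Clocks give only a partial order: `v2` and `v3` both descend from `v1` but not from each other, so a store has to keep both or ask the client to reconcile them.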
  • 29. Document Stores
  • 30. Document Stores
      • Data more complex than that in K/V stores
      • Data encapsulated and encoded in
          o JSON, XML, YAML, BSON or some other standard format
      • Multiple types of documents per database
          o Documents of similar type grouped together
          o Optional metadata/schema for the document
          o Less rigid schema than that of an RDBMS
      • Nested documents or collections
      • Secondary indexes
      • Complex query/update support
          o Multiple attributes, collections etc.
  • 31. Document Example
      {
        "when": "2011-09-19T02:10:11.3Z",
        "author": "alex",
        "title": "No Free Lunch",
        "text": "This is the text of the post. It could be very long.",
        "tags": [ "business", "ramblings" ],
        "votes": 5,
        "voters": ["jane", "joe", "spencer", "phyllis", "li"],
        "comments": [
          { "who": "jane", "when": "2011-09-19T04:00:10.112Z", "comment": "I agree." },
          { "who": "meghan", "when": "2011-09-20T14:36:06.958Z", "comment": "You must be joking. etc etc ..." }
        ]
      }
  • 32. Document Stores
      • MongoDB (C/10Gen/AGPL)
      • Apache CouchDB (Erlang/Apache)
      • Amazon SimpleDB (Erlang/Amazon)
      • Terrastore (Java/Terracotta/Apache)
      • RavenDB (C#/HibernatingRhino/AGPL)
      Other document stores are listed at http://nosql-database.org/
  • 33. MongoDB
  • 34. MongoDB (name derived from "humongous")
      • Document format: BSON (Binary JSON)
      • Supports nested documents
      • Documents are grouped in collections
      • Supports secondary indexes
      • Scalability: auto sharding
      • Consistency: tunable per request (WriteConcerns)
      • Replication: replica sets, master/slave
      • Atomicity: document level
  • 35. MongoDB
      • Data types
          o String, Integer, Boolean, Double, Null, Array, Object, ObjectId, Binary, Regex, Code
      • Terminology (SQL : MongoDB)
          o Database : Database
          o Table : Collection
          o Index : Index
          o Row : Document
          o Column : Field
          o Join : Embedding or Linking
          o Primary Key : _id
      • Statements (SQL, then the MongoDB equivalent)
          o create table users (name varchar(128), age number)
            db.createCollection("users")
          o insert into users values ('bob', 32)
            db.users.insert({name:"bob", age:32})
          o select * from users
            db.users.find()
          o select name, age from users
            db.users.find({}, {name:1, age:1, _id:0})
          o select name, age from users where age = 32
            db.users.find({age:32}, {name:1, age:1, _id:0})
          o select * from users order by name asc
            db.users.find().sort({name:1})
          o select * from users limit 10 offset 20
            db.users.find().skip(20).limit(10)
          o select distinct name from users
            db.users.distinct("name")
          o select count(*) from users
            db.users.count()
          o update users set age = 39 where name = 'bob'
            db.users.update({name:"bob"}, {$set:{age:39}}, false, true)
          o delete from users where name = 'bob'
            db.users.remove({name:"bob"})
  • 36. Extensible Record Stores aka Column Stores
  • 37. Extensible Record Stores (Column Stores)
      • Motivated by Google BigTable
      • Basic data model: rows and columns
      • Scale by splitting rows and columns over multiple nodes
          o Rows are split by sharding on the primary key, by range rather than by hash function
          o Columns are split by column groups
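Splitting rows by key range rather than by hash can be sketched with a sorted list of split points. This is an illustrative example, not the routing logic of any particular store; the split points, node names and the `RangePartitioner` class are all assumptions.

```python
import bisect


class RangePartitioner:
    """Route row keys to nodes by key range; split points are sorted boundaries."""

    def __init__(self, split_points, nodes):
        if list(split_points) != sorted(split_points):
            raise ValueError("split points must be sorted")
        if len(nodes) != len(split_points) + 1:
            raise ValueError("need exactly one more node than split points")
        self._splits = list(split_points)
        self._nodes = list(nodes)

    def node_for(self, row_key: str) -> str:
        # Keys below the first split go to the first node, keys at or above
        # the last split go to the last node, and so on in between.
        return self._nodes[bisect.bisect_right(self._splits, row_key)]


# Hypothetical layout: keys before "h", from "h" up to "p", and "p" onward.
partitioner = RangePartitioner(["h", "p"], ["node-1", "node-2", "node-3"])
for key in ["alice", "bob", "mike", "zoe"]:
    print(key, partitioner.node_for(key))
```

Because neighbouring keys land on the same node, range scans stay local, which a hash function would scatter; the price is that skewed key distributions can overload one range.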
  • 38. Extensible Record Stores
      • Cassandra (Java/Facebook/Apache)
          o Marriage of Dynamo and BigTable
      • HBase (Java/Yahoo/Apache)
          o Inspired by BigTable, uses HDFS for storage
      • HyperTable (C/Zvent/GPL)
          o Similar to HBase/BigTable
      • Accumulo (Java/NSA/Apache)
          o Uses Hadoop, ZooKeeper, and Thrift; cell-level access control
      • Google BigTable (internal to Google)
      Other wide column stores are listed at http://nosql-database.org/
  • 39. Cassandra
  • 40. Cassandra Features
      • Decentralized
          o Data is distributed across a cluster of nodes
          o No master, any node can address any request
          o No single point of failure
      • Fault-tolerant (configurable replication strategies)
          o Simple Strategy (the first replica is determined by the partitioner, the rest are placed on other nodes clockwise)
          o Network Topology Strategy: multi-datacenter strategy
  • 41. Cassandra Features Cont…
      • Failure detection and recovery
          o Based on the Gossip protocol
          o Node state updated based on gossip message version
          o Per-node heartbeat threshold
      • Tunable consistency
          o Can be configured per read/write
  • 42. Cassandra
      • Data types
          o ascii, int, float, decimal, boolean, bigint, double, varchar, counter, timestamp, uuid, text, blob, varint
      • Terminology (SQL : Cassandra)
          o Database : Keyspace
          o Table : Column Family
          o Index : Index
          o Row : Row
          o Column : Column
          o Join : (no equivalent)
          o Primary Key : Primary Key
      • Statements (SQL, then the Cassandra QL equivalent)
          o create database codecamp
            CREATE KEYSPACE codecamp WITH strategy_class = 'NetworkTopologyStrategy' AND strategy_options:DC1=3
          o create table users (key varchar(128), name varchar(128), age number)
            CREATE COLUMNFAMILY users (key varchar PRIMARY KEY, name varchar, age int)
          o create index idx_name ON users(name)
            CREATE INDEX idx_name ON users(name)
          o insert into users values ('bob', 'Bob', 32)
            INSERT INTO users (KEY, name, age) VALUES ('jdoe', 'Jane Doe', 39)
          o select name, age from users where age>30
            SELECT name, age FROM users WHERE age>30
          o update users set age = 35 where name = 'bob'
            UPDATE users SET age=35 WHERE name='bob'
          o delete from users where key='bob'
            DELETE FROM users where KEY = 'bob'
            DELETE age FROM users where KEY='alice'
          o drop table users
            DROP COLUMNFAMILY users
          o drop database codecamp
            DROP KEYSPACE codecamp
  • 43. Cassandra Column and Column Family
      • Column
          o name: byte[]
          o value: byte[]
          o timestamp
      • Super Column
          o name: byte[]
          o value: collection of Columns
      • Example Column
          o name: "userid", value: "jdoe", timestamp: …
      • Example Super Column
          o name: homeaddress
          o value:
              - name: "street", value: "555 Homestead Rd", timestamp: …
              - name: "city", value: "Sunnyvale", timestamp: …
              - name: "zip", value: "95051", timestamp: …
      • Example Column Family (one Row per Row Key; every Column also carries a timestamp)
          Row Key  "userid"   "name"        "age"
          jdoe     "jdoe"     "Jane Doe"    33
          ladams   "ladams"   "Larry Adam"  47
          bdole    "bdole"    "Bob Dole"    67
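The column and column family structures can be mirrored with plain Python containers. The sketch below is a simplified model for illustration, not Cassandra's storage engine: names and values are ordinary Python objects rather than byte[], timestamps are integers supplied by the caller or taken from the system clock, and conflicting writes to the same column are resolved by keeping the one with the newer timestamp.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: int


class ColumnFamily:
    """Rows keyed by row key; each row is its own dict of columns (simplified model)."""

    def __init__(self, name: str):
        self.name = name
        self.rows = {}  # row key -> {column name -> Column}

    def insert(self, row_key, name, value, timestamp=None):
        ts = time.time_ns() // 1000 if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        current = row.get(name)
        # Keep the write with the newer timestamp; older writes are ignored.
        if current is None or ts >= current.timestamp:
            row[name] = Column(name, value, ts)

    def get(self, row_key, name):
        return self.rows[row_key][name].value


users = ColumnFamily("users")
users.insert("jdoe", "name", "Jane Doe", timestamp=1)
users.insert("jdoe", "age", 33, timestamp=2)
users.insert("jdoe", "age", 32, timestamp=1)   # older write, does not win
users.insert("ladams", "name", "Larry Adam", timestamp=1)
print(users.get("jdoe", "age"), sorted(users.rows["ladams"]))
```

The `ladams` row holds only a `name` column, which shows that rows in one column family need not share the same set of columns.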
  • 44. Cassandra Keyspace
      Analogous to a database in an RDBMS
      • Contains one or more Column Families, analogous to tables in an RDBMS
      • A Column Family contains columns
      • A Row Key identifies a set of related columns
      • A Row is not required to have the same set of columns
      • No join between two column families:
          o Each column family is self-contained to serve a query
          o A rule of thumb: one column family per query for better performance
      • Replication is controlled on a per-keyspace basis
  • 45. Cassandra In Enterprise
      • Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and many more
      • The largest Cassandra cluster has over 300 TB of data on over 400 machines
  • 46. HBase
      • Design influenced by Google BigTable
      • A type of NoSQL: more a data store than a database; lacks many RDBMS features such as
          o Typed columns, secondary indexes, triggers, advanced query language etc.
      • Built on top of HDFS: data is stored in HDFS as indexed "StoreFiles"
      • Strongly consistent reads/writes, not "eventually consistent"; suitable for counter aggregation
      • Auto sharding
      • Auto Region Server failover
      • Out-of-the-box support for Hadoop/HDFS
      • Can be used as source and/or sink for MapReduce
      • Java, Thrift/REST clients
      • Supports Block Cache and Bloom Filters for high-volume query optimization
      • Web management tool and JMX support
  • 48. NoSQL Growth Trends
  • 49. Big Data and NoSQL Landscape