Big Data: NoSQL & the DBA




– Aswani Vonteddu
The evolution of data stores

•   Data modeling
•   Data from the Developer’s standpoint
•   Data from the DBA’s standpoint
•   Impedance mismatch and the rise of ORM




Hierarchical object graph model




Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin
Bierman


Normalized for tables in RDBMS




Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin
Bierman


Data – Summary
• In order to use an RDBMS,

  – The designer has to model the data into tables

  – The developer must normalize/de-normalize it

  – The DBA has to speed up the queries
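
As a rough illustration of the modeling step above, the sketch below flattens one nested product object into rows for three flat tables (Products, Keywords, Ratings). The sample product, its field names, and the `normalize` helper are hypothetical and chosen only for illustration.

```python
def normalize(product):
    """Split one nested product object into rows for three flat tables."""
    # Scalar fields stay on the parent row.
    product_row = {k: v for k, v in product.items()
                   if k not in ("keywords", "ratings")}
    # Each nested item becomes a child row that points back to its parent
    # through a foreign key (product_id).
    keyword_rows = [
        {"id": i, "keyword": kw, "product_id": product["id"]}
        for i, kw in enumerate(product["keywords"], start=1)
    ]
    rating_rows = [
        {"id": i, "rating": r, "product_id": product["id"]}
        for i, r in enumerate(product["ratings"], start=1)
    ]
    return {"Products": [product_row],
            "Keywords": keyword_rows,
            "Ratings": rating_rows}


# Hypothetical sample data.
product = {
    "id": 1,
    "title": "Example Title",
    "author": "Example Author",
    "year": 2012,
    "pages": 100,
    "keywords": ["nosql", "dba"],
    "ratings": ["****", "***"],
}
tables = normalize(product)
```

Reading the product back means joining the three tables on `product_id`, which is the kind of query the DBA is then asked to speed up.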



Impedance mismatch and the rise of ORMs (like Hibernate)
[Table(name="Products")]
class Product
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Title;
    [Column]string Author;
    [Column]int Year;
    [Column]int Pages;
    private EntitySet<Rating> _Ratings;
    [
           Association( Storage="_Ratings",
                      ThisKey="ID",
                      OtherKey="ProductID",
                      DeleteRule="ONDELETECASCADE"
                      )
    ]
    ICollection<Rating> Ratings{ ... }

    private EntitySet<Keyword> _Keywords;
    […]
    ICollection<Keyword> Keywords{ ... }
}

[Table(name="Keywords")]
class Keyword
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Keyword;
    [Column(IsForeignKey=true)]int ProductID;
}

[Table(name="Ratings")]
class Rating
{
    [Column(PrimaryKey=true)]int ID;
    [Column]string Rating;
    [Column(IsForeignKey=true)]int ProductID;
}


               Source: A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin
               Bierman


o So what is Big Data?

o Sources

o Applications

o Technologies


What is Big Data?
• It is not a technology in itself.

• It is information about everything that is
  happening around us, everywhere and every
  minute

• Almost all of us have already contributed to Big Data,
  with or without our knowledge, and we will
  continue to do so.

• Unstructured
The four characteristics
• Volume

• Velocity

• Variety

• Veracity

Sources

• Clickstream

• Tweets

• Facebook: pictures and comments

• Sensors
  A Boeing 737 generates 240 TB of data
  during a single cross country flight.

Applications
• Classification/Ontologies

• Crowdsourcing - CAPTCHA

• Natural language processing (NLP) –
  Google Translate

• Visualization – Facebook map
Setting up a Big Data platform

• A Big Data platform must be equipped
  with technologies for the following stages
  of data processing:

• Acquisition
• Organization
• Analysis


Technologies

• Acquisition
  – NoSQL databases (DynamoDB, Cassandra)
    • Very high speed writes


• Organization & Analysis
  – MapReduce (Apache Hadoop)
    • Code moves to the data, not the other way around
    • The Map function and the Reduce function together
      perform the desired analysis
NoSQL and why now?
• RDBMSs must ensure ACID properties

• The CAP theorem says that no distributed system
  can guarantee all three of Consistency,
  Availability and Partition tolerance at the same time

• NoSQL databases are distributed, and are
  better options than an RDBMS for applications
  that can tolerate the lack of one of those
  properties.
Relational Databases
• Random disk access

• Data model is totally structured, and
  predefined

• Shared Everything architecture – Single
  point of failure


NoSQL categories
• Graph DB

• Column families

• Document




Simple Key-Value stores
• Distributed Hash Tables

• Eventual consistency

• Replication and Data partitioning

• Example
  Amazon Dynamo
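
To make the partitioning and replication bullets more concrete, here is a minimal sketch of a consistent-hash ring. It assumes MD5 as the hash function, one position per node (no virtual nodes), and replicas placed on the next nodes clockwise. The `HashRing` class and the node names are hypothetical, not the API of any real store.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a node name or a key to a position on the ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Nodes sorted by their position on the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def nodes_for(self, key):
        """Return the responsible node followed by its replica nodes."""
        # First node clockwise from the key's position, wrapping around.
        start = bisect.bisect_right(self.positions, ring_position(key)) % len(self.ring)
        wanted = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(wanted)]


ring = HashRing(["N1", "N2", "N3", "N4"], replicas=2)
owners = ring.nodes_for("user:42")  # two distinct nodes, stable across calls
```

Because only the keys between two neighbouring positions move when a node joins or leaves, the data does not have to be reshuffled across the whole cluster.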

Column families
• Distributed Key-Value stores

• Supports nested columns

• Example
  Cassandra



Apache Cassandra

• Indexed by a Key
• Supports columns and super-columns
• Allows structured/unstructured data
Cassandra
[Figure: a Cassandra ring of four nodes, N1 through N4]
Cassandra
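[Figure: a write at ConsistencyLevel.ONE on the four-node ring. N1 is the coordinator, N2 the responsible node, N4 a replica node.
 1. The client sends the write to the coordinator with ConsistencyLevel.ONE
 2. The coordinator sends write requests to N2 and N4
 3. One node (N2 in the figure) reports success to the coordinator]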
Cassandra
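[Figure: the same write at ConsistencyLevel.ONE, completed. N1 is the coordinator, N2 the responsible node, N4 a replica node.
 1. The client sends the write to the coordinator with ConsistencyLevel.ONE
 2. The coordinator sends write requests to N2 and N4
 3. One node (N2 in the figure) reports success to the coordinator
 4. The coordinator reports success to the client]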
Cassandra
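[Figure: a write at ConsistencyLevel.TWO on the four-node ring. N1 is the coordinator, N2 the responsible node, N4 a replica node.
 1. The client sends the write to the coordinator with ConsistencyLevel.TWO
 2. The coordinator sends write requests to N2 and N4
 3 or 4. N2 and N4 each report success to the coordinator, in either order]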
Cassandra
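[Figure: the same write at ConsistencyLevel.TWO, completed. N1 is the coordinator, N2 the responsible node, N4 a replica node.
 1. The client sends the write to the coordinator with ConsistencyLevel.TWO
 2. The coordinator sends write requests to N2 and N4
 3 or 4. N2 and N4 each report success to the coordinator, in either order
 5. The coordinator reports success to the client]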
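
The write flow in the preceding diagrams can be sketched roughly as follows. This is a simplified model, not Cassandra's client API: the `Node` class and `coordinate_write` function are hypothetical. The sketch counts acknowledgements after contacting every replica, whereas a real coordinator answers the client as soon as enough acknowledgements have arrived.

```python
class Node:
    """A stand-in for a storage node that may be up or down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False  # no acknowledgement from a node that is down
        self.data[key] = value
        return True


# Number of acknowledgements each consistency level waits for.
REQUIRED_ACKS = {"ONE": 1, "TWO": 2}


def coordinate_write(replicas, key, value, level):
    """Send the write to every replica and report success to the client
    only if enough of them acknowledged it."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= REQUIRED_ACKS[level]


n2 = Node("N2")            # responsible node
n4 = Node("N4", up=False)  # replica node, currently unreachable
ok_one = coordinate_write([n2, n4], "k", "v", "ONE")  # True: one ack is enough
ok_two = coordinate_write([n2, n4], "k", "v", "TWO")  # False: only one ack arrived
```

The trade-off is visible in the two results: a lower consistency level keeps writes available when a replica is down, at the cost of fewer guaranteed copies at the moment of success.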
Cassandra
• Write operation:
  – Commit log
  – Memtable – in-memory storage structure
    (kept sorted by key, so not a plain hash table)
  – SSTable on disk
  – Compaction
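
The four write steps above can be sketched as a toy in-memory store. This is only an illustration under simplifying assumptions: the `TinyStore` class is hypothetical, Python lists stand in for files on disk, and a dict that is sorted at flush time stands in for the memtable.

```python
class TinyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # stand-in for the append-only log on disk
        self.memtable = {}    # key -> (timestamp, value)
        self.sstables = []    # immutable, sorted runs of (key, (timestamp, value))
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, value, timestamp))
        # 2. Update the memtable, keeping the newest value per key.
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        # 3. Flush to an SSTable once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self):
        # 4. Merge all SSTables, keeping the newest value per key.
        merged = {}
        for table in self.sstables:
            for key, (timestamp, value) in table:
                if key not in merged or timestamp > merged[key][0]:
                    merged[key] = (timestamp, value)
        self.sstables = [tuple(sorted(merged.items()))]


store = TinyStore(memtable_limit=2)
store.write("a", "v1", timestamp=1)
store.write("b", "v2", timestamp=2)  # memtable full: first SSTable is flushed
store.write("a", "v3", timestamp=3)
store.write("c", "v4", timestamp=4)  # second SSTable is flushed
sstables_before = len(store.sstables)
store.compact()
```

Writes stay fast because they only append to the log and update memory; the cost of merging is deferred to compaction.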




Cassandra
• Read operation:
  – The coordinator node forwards the request
    • to the node responsible for the key
    • and to replica nodes, based on the consistency level
      requested
  – Each node
    • Looks up the key in the Memtable and the existing SSTables
    • Takes the value with the latest timestamp
  – Bloom filters speed this up by skipping SSTables
    that cannot contain the key
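
The per-node part of the read can be sketched as below. The `BloomFilter` and `SSTable` classes and the `read` function are hypothetical simplifications: the filter uses SHA-256 with a small fixed bit array, and each SSTable is just a dict. The point is that the filter may give false positives but never false negatives, so a "no" lets the node skip that SSTable safely.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from salted hashes of the key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)  # key -> (timestamp, value)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Collect every stored version of the key and return the newest value."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:
        # Only touch the SSTable when the filter says the key may be there.
        if table.bloom.might_contain(key) and key in table.rows:
            candidates.append(table.rows[key])
    if not candidates:
        return None
    return max(candidates, key=lambda version: version[0])[1]


memtable = {"a": (5, "newest")}
sstables = [
    SSTable({"a": (1, "old"), "b": (2, "bee")}),
    SSTable({"a": (3, "newer")}),
]
value_a = read("a", memtable, sstables)        # newest of three versions
value_b = read("b", memtable, sstables)        # found only in the first SSTable
value_missing = read("zzz", memtable, sstables)
```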

Cassandra




Indexes:

• The primary index (on the key) is
  supported by default by the
  Cassandra engine
• Secondary indexes have to be
  built as a new column family
  with the column of interest
  as the key
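
A hand-built secondary index of the kind described above can be sketched with plain dicts standing in for column families. The `users` data, the `build_index` helper, and the column names are hypothetical. In a real deployment the application would also have to update the index on every write to the primary column family.

```python
from collections import defaultdict

# Primary column family: row key -> columns.
users = {
    "u1": {"name": "Ann", "city": "Seattle"},
    "u2": {"name": "Bob", "city": "Austin"},
    "u3": {"name": "Cy", "city": "Seattle"},
}


def build_index(column_family, column):
    """Build a second column family keyed by the value of one column.

    Each index row holds the matching primary keys as column names,
    with empty values."""
    index = defaultdict(dict)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]][row_key] = ""
    return dict(index)


users_by_city = build_index(users, "city")
seattle_users = sorted(users_by_city["Seattle"])
```

A lookup by city is then a single key read on `users_by_city`, followed by key reads on `users` for each matching row.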
Document DBs
• Similar to Key-Value stores, but Values
  are often documents (JSON, ION, …)

• Documents are versioned

• Example
  DynamoDB


MapReduce
• Introduced by Google
• A list processing system
• Scales to clusters with thousands of nodes
• and to petabytes or exabytes of data
• Code is taken to the data, not the other way around
• Input data must be split into disjoint partitions
• Maps the functions to the nodes where the data
  resides
• and Reduces the results from all nodes to build
  the final result
• Example: Hadoop
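
The map and reduce steps can be illustrated with a word count that runs in a single process. This is a sketch of the programming model only, not of Hadoop's API: the function names are hypothetical, and each string in `chunks` stands in for a disjoint block of data that would live on a different node.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def map_reduce(chunks):
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(mapped).items())


chunks = ["big data big", "data stores"]
word_counts = map_reduce(chunks)
```

Because `map_words` only looks at its own chunk, the map step can run wherever that chunk is stored, and only the small (word, count) pairs travel over the network.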

Techniques & algorithms
•   Vector Clocks
•   Hinted handoff
•   Read repair
•   Anti-entropy repair
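
Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below assumes a clock is a dict from node name to counter. The helper names are hypothetical and do not come from any particular store.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"  # neither version supersedes the other


v1 = increment({}, "N1")   # first write, handled by N1
v2 = increment(v1, "N2")   # update of v1 handled by N2
v3 = increment(v1, "N3")   # independent update of v1 handled by N3
order_v2_v1 = compare(v2, v1)
order_v2_v3 = compare(v2, v3)
```

Here `v2` supersedes `v1`, while `v2` and `v3` are concurrent: that is the case where the store has to keep both versions and leave the merge to the reader.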




Big Data talent
• Deep analytical
  – Mathematicians, operations research
    analysts, statisticians, ...
• Big data savvy
  – Business and functional
    managers, budget, credit and financial
    analysts
• Supporting Technology
  – DBAs, System & Network administrators, and
    Programmers
The DBA’s role here?
• Tremendous opportunity for DBAs

• Like in the early ’90s, when businesses
  migrated from mainframes to Oracle/SQL
  Server/DB2

• Where?
  – Data modeling:
    With vast amounts of data, re-designing DHTs is
    many times harder than re-designing an RDBMS
    schema, since data migration is painful
References
[1] McKinsey, Big data: The next frontier for
innovation, competition and productivity
[2] IDC, The rise of Big Data: Managing, Storing and gaining
value from endless information
• Others
   – http://slidesha.re/LF8umk
   – http://slidesha.re/LF8vGY





Editor's Notes

  1. Industries: Healthcare, Telecommunications, Retail, Manufacturing, Public sector