Finding the Right Data Solution
for Your Application in the Data
       Storage Haystack
               Srinath Perera Ph.D.
      Senior Software Architect, WSO2 Inc.
     Visiting Faculty, University of Moratuwa
  Research Scientist, Lanka Software Foundation
Data Models
 §  There have been many data models
     proposed (read Stonebraker’s
     “What Goes Around Comes
     Around” for more details)
      o  Hierarchical (IMS): late 1960’s and
         1970’s
      o  Directed graph (CODASYL): 1970’s
      o  Relational: 1970’s and early 1980’s
      o  Entity-Relationship: 1970’s
      o  Extended Relational: 1980’s
      o  Semantic: late 1970’s and 1980’s
 §  For the last 20-30 years, relational
     database systems (SQL) together
     with transactions have been the
     de facto data solution.
Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700
For many years, the choice of data storage was
            an easy one (use an RDBMS)
Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880
Scale of Systems
  §  However, the scale of systems
      is changing due to
         o  Increasing user bases of
            systems.
         o  Mobile devices, online presence
         o  Cloud computing and multicore
            systems
   §  Scaling up RDBMS
          o  Put it on a bigger machine
          o  Replicate (cluster) the database to 2-3 more nodes. But this
             approach does not scale further.
          o  Partition the data across many nodes (distribute, a.k.a.
             sharding). However, JOIN queries across many nodes are hard,
             and sometimes too slow. This often needs custom code and
             configurations. Transactions also do not scale well.
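
A minimal Python sketch of the partitioning option above, assuming a hypothetical in-memory `ShardedStore` that keeps one dict per node (the class, the function `shard_for`, and the row layout are illustrative and not taken from any real database). It suggests why retrieval by key stays cheap under partitioning while any non-key query has to visit every node.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a key to a node index with a stable hash (Python's hash() is salted per run)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ShardedStore:
    """Hypothetical store: each 'node' is just a dict held in this process."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def put(self, key, row):
        self.nodes[shard_for(key, len(self.nodes))][key] = row

    def get(self, key):
        # Retrieval by key touches exactly one node.
        return self.nodes[shard_for(key, len(self.nodes))].get(key)

    def where(self, predicate):
        # A non-key query cannot be routed, so every node must be scanned.
        return [row for node in self.nodes for row in node.values() if predicate(row)]
```

A JOIN is harder still, because matching rows may live on different nodes and have to be moved before they can be combined.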

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
CAP Theorem, Transactions, and Storage
  §  The RDBMS model provides two things
         o  Relational model with SQL
         o  ACID transactions – (Atomicity,
            Consistency, Isolation, Durability)
  §  It was a classic one-size-fits-all
      solution, and it worked for quite
      some time.
  §  However, the CAP theorem says that
      you cannot have it all.
         o  Consistency, Availability and Partition
            Tolerance, pick two!

 §  But there are many use cases that do not need all RDBMS
     features; when those are dropped, systems can scale (e.g.
     Google Bigtable).
 §  However, to use them, one has to understand and utilize the
     application-specific behavior.
Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462
NoSQL and other Storage Systems
§  Large internet companies hit the problem first. They built
    systems that are specific to their problems, and those
    systems did scale.
      o  Google Bigtable
      o  Amazon Dynamo
§  Soon many others followed, and most of them are free and
    open source.
§  Now there are a couple of dozen
§  Among the advantages of
    NoSQL are
      o  Scalability
      o  Flexible schema
      o  Designed to scale and support
         fault tolerance out of the box


Copyright ind{yeah} and licensed for reuse under CC License ,
http://www.flickr.com/photos/flickcoolpix/3566848458/
However, with NoSQL solutions, choosing a
       data storage is no longer simple.
Copyright Philipp Salzgeber and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html
Selecting the Right Data Solution




§  What are the right Questions to ask?
§  Categorize Answers for each question
§  Take different cases based on different answers and make
    recommendations!

 Copyright by Krzysztof Poltorak, and licensed for reuse under CC License.
          http://www.fotocommunity.com/pc/pc/display/22077920
What are the right Questions?
                                                                      o  Types of data
                                                                             -  Structured, Semi-Structured,
                                                                                Unstructured
                                                                      o  Need for Scalability
                                                                             -    Number of users
                                                                             -    Number of data items
                                                                             -    Size of files
                                                                             -    Read/Write ratio
                                                                      o  Types of Queries
                                                                             -    Retrieve by Key
                                                                             -    WHERE clauses
                                                                             -    JOIN queries
                                                                             -    Offline Queries
                                                                      o  Consistency
                                                                             -  Loose Consistency
                                                                             -  Single Operation Consistency
                                                                             -  Transactions



  Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084
Unstructured Data
§  Data does not have a particular
    structure, and is often retrieved
    through a key (name).
       o  E.g. File systems.
§  Humans are good at processing
    unstructured data, but
    computers are not.

§  This data is often kept in storage but consumed by humans
    at the end of the pipeline (e.g. a document repository).
§  One common use case is building structured data from
    unstructured data
§  Metadata is often associated with the data to help searching

Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
Structured Data
 §  Has a structure, often described through a schema
 §  Often a table-like 2D structure is used, but other structures
     are also possible.
 §  The main advantage of the structure is search

§  Schema can be provided at
    the deployment time or at the
    runtime (dynamic schema)
§  Schema can be used to
    o  Validate data
    o  Support user friendly search
    o  Optimize storage and queries




 Copyright Marion Doss and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/
Semi-structured Data
  §  Structure is not fully defined.
      But there is some inherent
      structure.
  §  For example
       o  XML documents, data are
          stored in a tree like structure
       o  Graph data
       o  Data structures like lists and
          arrays
  §  Support queries based on
      structure
  §  But processing data often
      needs custom code.


Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339
Search
§  Unstructured Data – no structure to support search.
     o  Search based on an inverted (reverse) index
     o  Search through properties
§  Semi-Structured Data
     o  To search XML, use XPath or XQuery (works for any tree-like structure).
     o  Tuple spaces can be queried through tuple space templates
     o  Data registries can be searched for entries that match given
        metadata descriptions (search by properties)
     o  Graphs can be queried based on connectivity
§  Structured Data
     o    Retrieve by Key
     o    WHERE clauses
     o    Queries with JOINs
     o    Offline Queries
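
The index-based search listed for unstructured data can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions (whitespace tokenization, all query words must match); the function names `build_inverted_index` and `search` are hypothetical and do not refer to any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The stored files stay opaque; only the index gives the store something to search on.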



Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
Consistency and Scalability
§  Scalability – the ability to
    handle more users, more data, or
    larger files by adding more
    nodes. We use 3 categories.
   o  Small systems (can be handled with 1-3
      nodes)
   o  Scalable systems (can be handled with
      about 10 nodes)
   o  Highly scalable systems (anything
      larger, can be 100s or 1000s of
      nodes)
 §  Consistency – how the replicas of the same data
     on many nodes are kept in sync, and how they can be
     updated without data corruption. We use 3 categories.
    o  Transactional – a series of operations is applied in an ACID manner
    o  Atomic operation – a single operation is applied to all replicas
    o  Eventual consistency – the data will eventually become consistent
Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/
Data Storage
 Alternatives
Data Storage Implementations
§  Expectations from data
    storages
   o  Reliably store the data
   o  Efficient search and retrieval
      of data whenever needed
   o  Data management – delete,
      update data
                                       Copyright John Atherton by and licensed for reuse under CC
                                        License , http://www.flickr.com/photos/gbaku/2231332836/
Challenges of Data Storage
§  Reliability
   o  Replicating data
   o  Creating backup or recovering using backups
§  Security
§  Scaling and Parallel access
   o  Distribution or replications
   o  ACID transactions
§  Availability
   o  Data replications
§  Vendor lock-in
   o  Interoperability, standard query languages
§  Simple use experience
   o  Hide the physical location of data,
   o  Provide simple API and security models
   o  Expressive query languages.
Data Storage Choices
Storage | Type | Advantages | Disadvantages | Query: Key | Query: Where | Query: Joins | Transactions | Scale | Flexible schema
--------|------|------------|---------------|------------|--------------|--------------|--------------|-------|----------------
Local memory | | Very fast | Not durable | Yes | No | No | No, unless STMs | No | Yes
Relational/SQL | | Standardized | Rigid schema, good for read oriented usecases | Yes | Yes | Yes | Yes | Moderate | No
Column families (NoSQL) | | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes
Document DBs | | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes
Object Databases | Structured | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No
Storage | Type | Advantages | Disadvantages | Query: Key | Query: Search | Transactions | Scale | Flexible schema
--------|------|------------|---------------|------------|---------------|--------------|-------|----------------
Files | | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes
Data Registries/Metadata Catalogs | Unstructured | Metadata search | | Yes | Property based search (Where) | No | Moderate | Yes
Queues | | Representation of flow of messages over time / tasks | | Yes | N/A | No | Yes | Yes
Triple Stores | | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes
XML database | | XML native | | | XPath/XQuery | | |
Distributed Cache | | Fast, replicated | No search | Yes | No | No | Yes | Yes
Key-value pairs | | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes
Graph DBs | Semi-structured | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph Search | Yes | Low | N/A
Choosing the Right
  Data Solution
How do We do this?


Copyright 8664 and licensed for reuse under CC License , http://www.flickr.com/photos/80464769@N00/186598462/




§  Consider structured, semi-structured, and unstructured
    separately.
   o  Then drill down based on other 3 properties: scale, consistency,
      and search.
§  The structured case is more complicated; the other two are a bit
    simpler.
§  Start by giving a de facto choice for each case
Handling Structured Data
  §  There are three main considerations: scale, consistency
      and queries

             Small (1-3 nodes)                      Scalable (10 nodes)                        Highly Scalable (1000s of nodes)
             Loose        Operation    ACID         Loose        Operation    ACID             Loose        Operation    ACID
             Consistency  Consistency  Transactions Consistency  Consistency  Transactions     Consistency  Consistency  Transactions
Primary Key  DB/KV/CF     DB/KV/CF     DB           KV/CF        KV/CF        Partitioned DB?  KV/CF        KV/CF        No
Where        DB/CF/Doc    DB/CF/Doc    DB           CF/Doc(?)    CF/Doc(?)    Partitioned DB?  CF/Doc       CF/Doc       No
JOIN         DB           DB           DB           ??           ??           ??               No           No           No
Offline      DB/CF/Doc    DB/CF/Doc    DB/CF/Doc    CF/Doc       CF/Doc       No               CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
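
The table above is effectively a lookup keyed by scale, query type, and consistency. A minimal Python sketch of that lookup follows; the cell values are transcribed from the table as given (including its open "?" entries), while the key names (`"small"`, `"primary_key"`, and so on) and the function `recommend` are hypothetical labels introduced only for this illustration.

```python
# Column order inside each tuple: loose consistency, operation consistency, ACID transactions.
CONSISTENCY = ("loose", "operation", "acid")

RECOMMENDATIONS = {
    ("small", "primary_key"): ("DB/KV/CF", "DB/KV/CF", "DB"),
    ("small", "where"): ("DB/CF/Doc", "DB/CF/Doc", "DB"),
    ("small", "join"): ("DB", "DB", "DB"),
    ("small", "offline"): ("DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc"),
    ("scalable", "primary_key"): ("KV/CF", "KV/CF", "Partitioned DB?"),
    ("scalable", "where"): ("CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?"),
    ("scalable", "join"): ("??", "??", "??"),
    ("scalable", "offline"): ("CF/Doc", "CF/Doc", "No"),
    ("highly_scalable", "primary_key"): ("KV/CF", "KV/CF", "No"),
    ("highly_scalable", "where"): ("CF/Doc", "CF/Doc", "No"),
    ("highly_scalable", "join"): ("No", "No", "No"),
    ("highly_scalable", "offline"): ("CF/Doc", "CF/Doc", "No"),
}


def recommend(scale, query, consistency):
    """Return the table cell for the given scale, query type, and consistency level."""
    row = RECOMMENDATIONS[(scale, query)]
    return row[CONSISTENCY.index(consistency)]
```

For example, `recommend("highly_scalable", "primary_key", "loose")` returns the `KV/CF` cell, and any highly scalable case that needs ACID transactions returns `No`.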
Handling Small Scale Systems (1-3 nodes)
  §  In general, using a DB here for
      every case might work.
  §  Reasons for using options
      other than a DB
     o  When there is a potential need
        to scale later.
     o  High write throughput
  §  KV is 1-D whereas the other two
      are 2-D

             Small (1-3 nodes)
             Loose        Operation    ACID
             Consistency  Consistency  Transactions
Primary Key  DB/KV/CF     DB/KV/CF     DB
Where        DB/CF/Doc    DB/CF/Doc    DB
JOIN         DB           DB           DB
Offline      DB/CF/Doc    DB/CF/Doc    DB/CF/Doc

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
Handling Scalable Systems
  §  KV, CF, and Doc can easily
      handle this case.
  §  If DBs are used with data sharded
      across many nodes
     o  Transactions might work, given that the
        participants in one transaction are
        not too many.
     o  JOINs might need to transfer too
        much data between nodes.
     o  In-memory DBs like VoltDB
        should also be considered.
  §  Offline mode will work.
  §  Most systems let users choose the
      consistency level, and loose
      consistency can scale more
      (e.g. Cassandra)

             Scalable (10 nodes)
             Loose        Operation    ACID
             Consistency  Consistency  Transactions
Primary Key  KV/CF        KV/CF        Partitioned DB?
Where        CF/Doc       CF/Doc       Partitioned DB?
JOIN         ??           ??           Partitioned DB??
Offline      CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
Highly Scalable Systems

  §  Transactions do not work at
      this scale (CAP theorem).
  §  The same holds for JOINs. The problem
      is that sometimes too much data
      needs to be transferred
      between nodes to perform the
      JOIN.
  §  The offline case is handled through
      Map-Reduce. Even the JOIN
      case is OK since there is
      time.

             Highly Scalable (1000s of nodes)
             Loose        Operation    ACID
             Consistency  Consistency  Transactions
Primary Key  KV/CF        KV/CF        No
Where        CF/Doc       CF/Doc       No
JOIN         No           No           No
Offline      CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
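
To make the offline JOIN through Map-Reduce concrete, here is a minimal single-process Python sketch of a reduce-side join. It only imitates the map, shuffle, and reduce phases in memory and is not code for Hadoop or any other framework; the record fields (`user_id`, `name`, `item`) and all function names are assumptions made up for the illustration.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Tag every record with its source and emit it under the join key."""
    for user in users:
        yield user["user_id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, combine every user record with every order record (inner join)."""
    for key, values in groups.items():
        users = [rec for tag, rec in values if tag == "user"]
        orders = [rec for tag, rec in values if tag == "order"]
        for user in users:
            for order in orders:
                yield {"user_id": key, "name": user["name"], "item": order["item"]}


def mapreduce_join(users, orders):
    return list(reduce_phase(shuffle(map_phase(users, orders))))
```

In a real cluster the shuffle step is where the data moves between nodes, which is acceptable offline but usually too slow for an online query.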
Highly Scalable Systems + Primary Key Retrieval

  §  This is (comparatively) the
      easy one.
  §  Can be solved through
      DHT (Distributed Hash
      Table) based solutions or
      architectures like
      OceanStore.
  §  Both Key-Value storage
      (KV) and Column Families
      (CF) can be used. But the
      Key-Value model is
      preferred as it is more
      scalable.

             Highly Scalable (1000s of nodes)
             Loose        Operation    ACID
             Consistency  Consistency  Transactions
Primary Key  KV/CF        KV/CF        No
Where        CF/Doc (?)   CF/Doc (?)   No
JOIN         No           No           No
Offline      CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
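
One common building block behind DHT-style key placement is consistent hashing. The sketch below is a simplified Python illustration of that general technique, not the design of OceanStore or of any specific key-value store; the class name `HashRing`, the `vnodes` parameter, and the node names are hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Each node owns several points on a ring; a key belongs to the next point clockwise."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # Find the first ring point after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that adding a node only moves keys onto the new node, so the rest of the cluster keeps its data in place.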
Highly Scalable systems + WHERE

  §  This is generally OK, but tricky.
  §  CF works through a secondary
      index that does scatter-gather
      (e.g. Cassandra).
  §  Doc works through Map-
      Reduce views (e.g.
      CouchDB)
  §  There is Bissa, which builds an
      index for all possible queries
      (no range queries)
  §  If you are doing this, you
      should do pilot runs and
      make sure things work.

             Highly Scalable (1000s of nodes)
             Loose        Operation
             Consistency  Consistency  Transactions
Primary Key  KV/CF        KV/CF        No
Where        CF/Doc (?)   CF/Doc (?)   No
JOIN         No           No           No
Offline      CF/Doc       CF/Doc       No

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
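
The scatter-gather pattern mentioned for secondary indexes can be sketched as follows. This is an in-process Python illustration of the general idea only and does not reproduce how Cassandra implements it: each hypothetical `Node` keeps a local index on an assumed `city` column, and the hypothetical `Cluster` sends the query to every node and merges the partial answers.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One partition with a local secondary index on the (assumed) 'city' column."""

    def __init__(self):
        self.rows = {}
        self.by_city = defaultdict(set)

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            self.by_city[old["city"]].discard(key)  # keep the index in step with updates
        self.rows[key] = row
        self.by_city[row["city"]].add(key)

    def find_by_city(self, city):
        return [self.rows[key] for key in self.by_city.get(city, ())]


class Cluster:
    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def put(self, key, row):
        # Rows are placed by primary key, so the cluster cannot know where a city lives.
        self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)].put(key, row)

    def find_by_city(self, city):
        # Scatter: ask every node in parallel. Gather: merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(pool.map(lambda node: node.find_by_city(city), self.nodes))
        return [row for part in partials for row in part]
```

Every WHERE query costs one request per node, which is one reason pilot runs are worth doing before relying on this at large scale.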
Handling Unstructured Data




§  Storage Options
   o  Distributed file systems - generally scalable (e.g. NFS), but HDFS
      (Hadoop) and Lustre are highly scalable versions.
   o  Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)
Handling Semi-Structured Data
                             Small Scale (1-3 nodes)                   Scalable (10 nodes)                       Highly Scalable
XML (queried through XPath)  XML DB or convert to a structured model   XML DB or convert to a structured model   ??
Graphs                       Graph DBs                                 Graph DBs if graph can be partitioned     ??
Data Structures              Data Structure Servers, Object Databases
Queues                       Distributed Queues                        Distributed Queues                        Distributed Queues
§  Storage Options
   o  Answer depends on the type of structure. If there is a server
      optimized for a given type, it is often much more efficient than
      using a DB. (e.g. Graph databases can support fast relationship
      search)
§  Search
   o  Very much custom. E.g. XML or any tree = XPath; graphs can
      support very fast relationship search
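
As a small illustration of querying tree-structured data by its structure, the Python sketch below runs an XPath-style query with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample catalog document and the function `titles_for_year` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; books may appear at different depths in the tree.
CATALOG = """<catalog>
  <book year="2010"><title>Alpha</title></book>
  <book year="2011"><title>Beta</title></book>
  <shelf><book year="2011"><title>Gamma</title></book></shelf>
</catalog>"""


def titles_for_year(xml_text, year):
    """Return the titles of all <book> elements, at any depth, with the given year attribute."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall(f".//book[@year='{year}']/title")]
```

The query names a path and an attribute condition rather than a key, which is the kind of structure-based search that semi-structured stores offer.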
Hybrid Approaches
§  Some solutions have many types
    of data and hence need more than
    one data solution (hybrid
    architectures).
§  For example
   o  Using a DB for transactional data and
      CF for other data.
   o  Keeping metadata and actual data
      separate for large data archives.
   o  Using a graph DB to store relationship
      data while other data is in column
      family storage.
§  However, if transactions are
    needed, they have to be
    handled outside the storage (e.g.
    using Atomikos or ZooKeeper).
Copyright Matthew Oliphant and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
Other parameters
§  The above list is not exhaustive, and there are other
    parameters
   o  Read/write ratio – when it is high, it is easy to scale
   o  High write throughput
   o  Very large data products – you will need a file system. Maybe
      keep metadata in a data registry and store data in a file system.
   o  Flexible schema
   o  Archival use cases
   o  Analytical use cases
   o  Others …
§  So there is no silver bullet …
Conclusion
§  For the last 20 years or so, DBMSs were the de facto storage
    solution
§  However, DBMSs could not scale well, and many NoSQL
    solutions have been proposed instead
§  As a result, it is no longer easy to find the best data
    solution for your problem.
§  We discussed many dimensions (types of data, scalability,
    queries, and consistency) and provided guidelines on when
    to use which data solution.
§  Your feedback and thoughts are most welcome. Contact
    me through srinath@wso2.com

Data Management Best Practices
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive Advantage
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Finding the Right Data Solution for your Application in the Data Storage Haystack

  • 1. Finding the Right Data Solution for Your Application in the Data Storage Haystack Srinath Perera Ph.D. Senior Software Architect, WSO2 Inc. Visiting Faculty, University of Moratuwa Research Scientist, Lanka Software Foundation
  • 2. Data Models
      §  Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details)
          o  Hierarchical (IMS): late 1960s and 1970s
          o  Directed graph (CODASYL): 1970s
          o  Relational: 1970s and early 1980s
          o  Entity-Relationship: 1970s
          o  Extended Relational: 1980s
          o  Semantic: late 1970s and 1980s
      §  For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.
      Copyright Greg Morss and licensed for reuse under CC License, http://www.geograph.org.uk/photo/990700
  • 3. For many years, the choice of data storage was an easy one (use an RDBMS)
      Copyright Alan Murray Walsh and licensed for reuse under CC License, http://www.geograph.org.uk/photo/1652880
  • 4. Scale of Systems
      §  However, the scale of systems is changing due to
          o  Increasing user bases of systems
          o  Mobile devices, online presence
          o  Cloud computing and multicore systems
      §  Scaling up an RDBMS
          o  Put it on a bigger machine.
          o  Replicate (cluster) the database to 2-3 more nodes. But this approach does not scale much further.
          o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard and sometimes too slow. This often needs custom code and configuration. Transactions also do not scale well.
      Copyright digitalART2 and licensed for reuse under CC License, http://www.flickr.com/photos/digitalart/2101765353/
  • 5. CAP Theorem, Transactions, and Storage
      §  The RDBMS model provides two things
          o  Relational model with SQL
          o  ACID transactions (Atomicity, Consistency, Isolation, Durability)
      §  It was a classic one-size-fits-all solution, and it worked for quite some time.
      §  However, the CAP theorem says that you cannot have it all.
          o  Consistency, Availability, and Partition Tolerance: pick two!
      §  Many use cases do not need all RDBMS features, and when those features are dropped, systems can scale (e.g. Google Bigtable).
      §  However, to use such systems, one has to understand and exploit application-specific behavior.
      Copyright stephcarter and licensed for reuse under CC License, http://www.flickr.com/photos/stephcarter/541464462
  • 6. NoSQL and other Storage Systems
      §  Large internet companies hit the problem first. They built systems that were specific to their problems, and those systems did scale.
          o  Google Bigtable
          o  Amazon Dynamo
      §  Soon many others followed, and most of them are free and open source.
      §  Now there are a couple of dozen.
      §  Among the advantages of NoSQL are
          o  Scalability
          o  Flexible schema
          o  Designed to scale and support fault tolerance out of the box
      Copyright ind{yeah} and licensed for reuse under CC License, http://www.flickr.com/photos/flickcoolpix/3566848458/
  • 7. However, with NoSQL solutions, choosing a data storage is no longer simple.
      Copyright Philipp Salzgeber and licensed for reuse under CC License, http://www.salzgeber.at/astro/pics/20081126_heart/index.html
  • 8. Selecting the Right Data Solution
      §  What are the right questions to ask?
      §  Categorize the answers for each question
      §  Take different cases based on different answers and make recommendations!
      Copyright Krzysztof Poltorak and licensed for reuse under CC License, http://www.fotocommunity.com/pc/pc/display/22077920
  • 9. What are the right Questions?
      o  Types of data
          -  Structured, Semi-Structured, Unstructured
      o  Need for Scalability
          -  Number of users
          -  Number of data items
          -  Size of files
          -  Read/Write ratio
      o  Types of Queries
          -  Retrieve by Key
          -  WHERE clauses
          -  JOIN queries
          -  Offline Queries
      o  Consistency
          -  Loose Consistency
          -  Single Operation Consistency
          -  Transactions
      Copyright romainguy and licensed for reuse under CC License, http://www.flickr.com/photos/romainguy/249370084
  • 10. Unstructured Data
      §  The data does not have a particular structure and is often retrieved through a key (name).
          o  E.g. file systems
      §  Humans are good at processing unstructured data, but computers are not.
      §  This data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
      §  One common use case is building structured data from unstructured data.
      §  Metadata is often associated with the data to help searching.
      Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
  • 11. Structured Data
      §  Has a structure, often described through a schema
      §  Often a table-like 2D structure is used, but other structures are also possible.
      §  The main advantage of the structure is search.
      §  The schema can be provided at deployment time or at runtime (dynamic schema).
      §  The schema can be used to
          o  Validate data
          o  Support user-friendly search
          o  Optimize storage and queries
      Copyright Marion Doss and licensed for reuse under CC License, http://www.flickr.com/photos/ooocha/2611398859/
  • 12. Semi-structured Data
      §  The structure is not fully defined, but there is some inherent structure.
      §  For example
          o  XML documents, where data is stored in a tree-like structure
          o  Graph data
          o  Data structures like lists and arrays
      §  Supports queries based on structure
      §  But processing the data often needs custom code.
      Copyright Walter Baxter, http://www.geograph.org.uk/photo/1069339
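A structure-based query over a tree-like XML document can be sketched as follows. This is a minimal illustration, assuming Python's standard `xml.etree.ElementTree` module (which supports only a limited XPath subset); the `library`/`book` document and its attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured document: a tree of books with attributes.
xml_text = """
<library>
  <book year="2005"><title>A</title></book>
  <book year="2011"><title>B</title></book>
</library>
"""

root = ET.fromstring(xml_text)

# XPath-style query: titles of all books whose "year" attribute is 2011.
titles = [t.text for t in root.findall("./book[@year='2011']/title")]

# A structural query without any predicate: every title in the tree.
all_titles = [t.text for t in root.iter("title")]
```

The query is expressed against the tree shape rather than a fixed schema, which is the point of the slide; anything beyond such path queries usually falls back to custom code.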
  • 13. Search
      §  Unstructured Data: no structure to support search.
          o  Search based on an inverted (reverse) index
          o  Search through properties
      §  Semi-Structured Data
          o  XML (or any tree-like structure) can be searched with XPath or XQuery.
          o  Tuple spaces can be queried through tuple space templates.
          o  Data registries can be searched for entries that match given metadata descriptions (search by properties).
          o  Graphs can be queried based on connectivity.
      §  Structured Data
          o  Retrieve by Key
          o  WHERE clauses
          o  Queries with JOINs
          o  Offline Queries
      Copyright digitalART2 and licensed for reuse under CC License, http://www.flickr.com/photos/digitalart/2101765353/
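Index-based search over unstructured text can be illustrated with a small inverted index. This is only a sketch under simple assumptions (whitespace tokenization, lowercase matching, AND semantics for multi-word queries); the file names and contents are invented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical unstructured documents, addressed by name (key).
docs = {
    "report.txt": "quarterly sales report",
    "notes.txt": "meeting notes about sales",
}
index = build_inverted_index(docs)
```

For example, `search(index, "sales report")` returns only `report.txt`, because that is the only document containing both words.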
  • 14. Consistency and Scalability
      §  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
          o  Small systems (can be handled with 1-3 nodes)
          o  Scalable systems (can be handled with about 10 nodes)
          o  Highly scalable systems (anything larger, can be 100s or 1000s of nodes)
      §  Consistency: how replicas of the same data on many nodes are kept in sync, and how they can be updated without data corruption. We will use 3 categories.
          o  Transactional: a series of operations updated in an ACID manner
          o  Atomic operation: a single operation, updated in all replicas
          o  Eventual consistency: data will eventually be consistent
      Copyright NNSANews and licensed for reuse under CC License, http://www.flickr.com/photos/nnsanews/5347287260/
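The eventual-consistency category can be made concrete with a toy model. The sketch below is not taken from the slides: it assumes last-write-wins versioning and a manual `sync` step standing in for background replication, and the `Replica` class is hypothetical.

```python
class Replica:
    """Toy replica holding key -> (version, value); the highest version wins."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so both replicas hold the newest version."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("x", "old", 1)
sync(r1, r2)

r2.write("x", "new", 2)   # the update reaches only one replica at first
stale = r1.read("x")      # the other replica still serves the old value

sync(r1, r2)              # replication catches up later
fresh = r1.read("x")      # now both replicas agree
```

Between the write and the second `sync`, readers of `r1` see stale data; that window is what a system gives up in exchange for not coordinating every write across all replicas.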
  • 16. Data Storage Implementations
      §  Expectations of data storage
          o  Reliably store the data
          o  Efficient search and retrieval of data whenever needed
          o  Data management: delete and update data
      Copyright John Atherton and licensed for reuse under CC License, http://www.flickr.com/photos/gbaku/2231332836/
  • 17. Challenges of Data Storage
      §  Reliability
          o  Replicating data
          o  Creating backups and recovering from backups
      §  Security
      §  Scaling and parallel access
          o  Distribution or replication
          o  ACID transactions
      §  Availability
          o  Data replication
      §  Vendor lock-in
          o  Interoperability, standard query languages
      §  Simple user experience
          o  Hide the physical location of data
          o  Provide simple APIs and security models
          o  Expressive query languages
  • 18. Data Storage Choices (structured data)
      Storage Type | Advantages | Disadvantages | Query by Key | Query by Where | Joins | Transactions | Scale | Flexible schema
      Local memory | Very fast | Not durable | Yes | No | No | No unless STMs | No | Yes
      Relational/SQL | Standardized | Rigid schema, good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No
      Column families (NoSQL) | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes
      Document DBs | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes
      Object Databases | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No
  • 19. Data Storage Choices (unstructured and semi-structured data)
      Storage Type | Advantages | Disadvantages | Query by Key | Search | Transactions | Scale | Flexible schema
      Unstructured
      Files | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes
      Data Registries/Metadata Catalogs | Metadata search | | Yes | Property-based search (Where) | No | Moderate | Yes
      Semi-structured
      Queues | Representation of a flow of messages/tasks over time | | Yes | N/A | No | Yes | Yes
      Triple Stores | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes
      XML database | XML native | | | XPath/XQuery | | |
      Distributed Cache | Fast, replicated | No search | Yes | No | No | Yes | Yes
      Key-value pairs | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes
      Graph DBs | Very fast joins, natural way to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A
  • 20. Choosing the Right Data Solution
  • 21. How do We do this?
      §  Consider structured, semi-structured, and unstructured data separately.
          o  Then drill down based on the other 3 properties: scale, consistency, and search.
      §  The structured case is more complicated; the other two are a bit simpler.
      §  Start by giving a de facto choice for each case.
      Copyright 8664 and licensed for reuse under CC License, http://www.flickr.com/photos/80464769@N00/186598462/
  • 22. Handling Structured Data
      §  There are three main considerations: scale, consistency, and queries
      Scale | Query | Loose Consistency | Operation Consistency | ACID Transactions
      Small (1-3 nodes) | Primary Key | DB/KV/CF | DB/KV/CF | DB
      Small (1-3 nodes) | Where | DB/CF/Doc | DB/CF/Doc | DB
      Small (1-3 nodes) | JOIN | DB | DB | DB
      Small (1-3 nodes) | Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc
      Scalable (10 nodes) | Primary Key | KV/CF | KV/CF | Partitioned DB?
      Scalable (10 nodes) | Where | CF/Doc (?) | CF/Doc (?) | Partitioned DB?
      Scalable (10 nodes) | JOIN | ?? | ?? | ??
      Scalable (10 nodes) | Offline | CF/Doc | CF/Doc | No
      Highly Scalable (1000s nodes) | Primary Key | KV/CF | KV/CF | No
      Highly Scalable (1000s nodes) | Where | CF/Doc | CF/Doc | No
      Highly Scalable (1000s nodes) | JOIN | No | No | No
      Highly Scalable (1000s nodes) | Offline | CF/Doc | CF/Doc | No
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
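The structured-data table on slide 22 can be read as a lookup keyed by scale, query type, and consistency level. The sketch below transcribes the table as it appears on that slide (question marks included); the key names and the `recommend` helper are hypothetical and exist only for this illustration.

```python
# Column order of each entry: loose, operation, acid.
CONSISTENCY_LEVELS = ("loose", "operation", "acid")

# Cell values transcribed from slide 22; "?" marks are the slide's own doubts.
# KV = key-value systems, CF = column families, Doc = document-based systems.
TABLE = {
    ("small", "primary_key"): ("DB/KV/CF", "DB/KV/CF", "DB"),
    ("small", "where"): ("DB/CF/Doc", "DB/CF/Doc", "DB"),
    ("small", "join"): ("DB", "DB", "DB"),
    ("small", "offline"): ("DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc"),
    ("scalable", "primary_key"): ("KV/CF", "KV/CF", "Partitioned DB?"),
    ("scalable", "where"): ("CF/Doc (?)", "CF/Doc (?)", "Partitioned DB?"),
    ("scalable", "join"): ("??", "??", "??"),
    ("scalable", "offline"): ("CF/Doc", "CF/Doc", "No"),
    ("highly_scalable", "primary_key"): ("KV/CF", "KV/CF", "No"),
    ("highly_scalable", "where"): ("CF/Doc", "CF/Doc", "No"),
    ("highly_scalable", "join"): ("No", "No", "No"),
    ("highly_scalable", "offline"): ("CF/Doc", "CF/Doc", "No"),
}


def recommend(scale, query, consistency):
    """Return the slide's suggested storage types for one combination."""
    row = TABLE[(scale, query)]
    return row[CONSISTENCY_LEVELS.index(consistency)]
```

For instance, `recommend("highly_scalable", "join", "loose")` yields `"No"`, matching the slide's position that online JOINs are not an option at that scale.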
  • 23. Handling Small Scale Systems (1-3 nodes)
      §  In general, using a DB for every case here might work.
      §  Reasons for using options other than a DB
          o  When there is a potential need to scale later
          o  High write throughput
      §  KV is 1-D, whereas the other two are 2-D.
      Small (1-3 nodes) | Loose Consistency | Operation Consistency | ACID Transactions
      Primary Key | DB/KV/CF | DB/KV/CF | DB
      Where | DB/CF/Doc | DB/CF/Doc | DB
      JOIN | DB | DB | DB
      Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
  • 24. Handling Scalable Systems
      §  KV, CF, and Doc can easily handle this case.
      §  If DBs are used with data sharded across many nodes
          o  Transactions might work, provided that a single transaction does not involve too many participants.
          o  JOINs might need to transfer too much data between nodes.
          o  In-memory DBs like VoltDB should also be considered.
      §  Offline mode will work.
      §  Most systems let users choose the consistency level, and loose consistency can scale further (e.g. Cassandra).
      Scalable (10 nodes) | Loose Consistency | Operation Consistency | ACID Transactions
      Primary Key | KV/CF | KV/CF | Partitioned DB?
      Where | CF/Doc | CF/Doc | Partitioned DB?
      JOIN | ?? | ?? | Partitioned DB??
      Offline | CF/Doc | CF/Doc | No
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
  • 25. Highly Scalable Systems
      §  Transactions do not work at this scale (CAP theorem).
      §  The same goes for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
      §  The offline case is handled through Map-Reduce. Even the JOIN case is OK offline, since there is time.
      Highly Scalable (1000s nodes) | Loose Consistency | Operation Consistency | ACID Transactions
      Primary Key | KV/CF | KV/CF | No
      Where | CF/Doc | CF/Doc | No
      JOIN | No | No | No
      Offline | CF/Doc | CF/Doc | No
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
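An offline JOIN handled in the Map-Reduce style can be sketched in a single process as a reduce-side join. This is an illustration only, not a real Map-Reduce framework: the three phases are plain Python functions, and the `users` and `orders` datasets are invented.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join key, tagged record) pairs from both datasets."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    joined = []
    for user_id in sorted(groups):
        values = groups[user_id]
        names = [v for tag, v in values if tag == "user"]
        items = [v for tag, v in values if tag == "order"]
        for name in names:
            for item in items:
                joined.append((user_id, name, item))
    return joined


# Hypothetical datasets: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_phase(shuffle(map_phase(users, orders)))
```

The shuffle step is where the data transfer between nodes happens in a real cluster, which is why such joins are acceptable offline but too slow for online queries.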
  • 26. Highly Scalable Systems + Primary Key Retrieval
      §  This is (comparatively) the easy one.
      §  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
      §  Both Key-Value storage (KV) and Column Families (CF) can be used, but the Key-Value model is preferred as it is more scalable.
      Highly Scalable (1000s nodes) | Loose Consistency | Operation Consistency | ACID Transactions
      Primary Key | KV/CF | KV/CF | No
      Where | CF/Doc (?) | CF/Doc (?) | No
      JOIN | No | No | No
      Offline | CF/Doc | CF/Doc | No
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
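One common way DHT-style systems map primary keys to nodes is consistent hashing. The sketch below is a generic illustration rather than the design of any system named on the slide; the `HashRing` class, the node names, and the choice of MD5 with 64 virtual nodes are all assumptions.

```python
import bisect
import hashlib


def _hash(key):
    """Stable integer hash of a string (MD5 is used only for spreading keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring: each node owns several points on the ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


nodes = ["node-a", "node-b", "node-c"]
ring = HashRing(nodes)
owner = ring.node_for("user:42")

# Adding a node only moves keys onto the new node; other keys stay put.
bigger_ring = HashRing(nodes + ["node-d"])
```

Because a lookup needs only the key's hash and the ring, retrieval by primary key requires no coordination between nodes, which is why this case is the comparatively easy one.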
  • 27. Highly Scalable Systems + WHERE
      §  This is generally OK, but tricky.
      §  CF works through a secondary index that does scatter-gather (e.g. Cassandra).
      §  Doc works through Map-Reduce views (e.g. CouchDB).
      §  There is Bissa, which builds an index for all possible queries (no range queries).
      §  If you are doing this, you should do pilot runs and make sure things work.
      Highly Scalable (1000s nodes) | Loose Consistency | Operation Consistency | ACID Transactions
      Primary Key | KV/CF | KV/CF | No
      Where | CF/Doc (?) | CF/Doc (?) | No
      JOIN | No | No | No
      Offline | CF/Doc | CF/Doc | No
      *KV: Key-Value Systems, CF: Column Families, Doc: document-based Systems
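The scatter-gather pattern behind secondary-index queries can be sketched as follows. This is a simplified, single-process model and does not reflect how Cassandra is implemented: the `Node` and `Cluster` classes are hypothetical, rows are assumed to be written once (no index cleanup on update), and CRC32 stands in for the partitioner.

```python
import zlib


class Node:
    """One storage node: rows by primary key plus a local secondary index."""

    def __init__(self):
        self.rows = {}    # primary key -> row dict
        self.index = {}   # (column, value) -> set of primary keys

    def put(self, key, row):
        self.rows[key] = row
        for column, value in row.items():
            self.index.setdefault((column, value), set()).add(key)

    def query(self, column, value):
        """Answer a WHERE column = value query from the local index only."""
        return [(k, self.rows[k]) for k in self.index.get((column, value), ())]


class Cluster:
    """Rows are partitioned by primary key; WHERE queries hit every node."""

    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def put(self, key, row):
        owner = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        self.nodes[owner].put(key, row)

    def where(self, column, value):
        results = []
        for node in self.nodes:                 # scatter: ask every node
            results.extend(node.query(column, value))
        return sorted(results, key=lambda pair: pair[0])  # gather and merge


cluster = Cluster(3)
cluster.put("u1", {"city": "Colombo"})
cluster.put("u2", {"city": "Kandy"})
cluster.put("u3", {"city": "Colombo"})
matches = cluster.where("city", "Colombo")
```

Every WHERE query touches all nodes, so its cost grows with cluster size even when few rows match, which is one reason the slide recommends pilot runs.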
  • 28. Handling Unstructured Data
      §  Storage Options
          o  Distributed file systems: generally scalable (e.g. NFS), while HDFS (Hadoop) and Lustre are highly scalable versions.
          o  Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)
  • 29. Handling Semi-Structured Data
      Data type | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable
      XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ??
      Graphs | Graph DBs | Graph DBs if the graph can be partitioned | ??
      Data Structures | Data Structure Servers, Object Databases | |
      Queues | Distributed Queues | Distributed Queues | Distributed Queues
      §  Storage Options
          o  The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).
      §  Search
          o  Very much custom. E.g. XML or any tree is searched with XPath, and a graph can support very fast relationship search.
  • 30. Hybrid Approaches
      §  Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
      §  For example
          o  Using a DB for transactional data and CF for other data.
          o  Keeping metadata and actual data separate for large data archives.
          o  Using a graph DB to store relationship data while other data is in Column Family storage.
      §  However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos or ZooKeeper).
      Copyright Matthew Oliphant and licensed for reuse under CC License, http://www.flickr.com/photos/fajalar/3174131216/
  • 31. Other parameters
      §  The above list is not exhaustive, and there are other parameters
          o  Read/Write ratio: when high, it is easy to scale
          o  High write throughput
          o  Very large data products: you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
          o  Flexible schema
          o  Archival use cases
          o  Analytical use cases
          o  Others …
      §  So there is no silver bullet …
  • 32. Conclusion
      §  For the last 20 years or so, DBMSs were the de facto storage solution.
      §  However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead.
      §  As a result, it is no longer easy to find the best data solution for your problem.
      §  We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
      §  Your feedback and thoughts are most welcome. Contact me at srinath@wso2.com