Couchbase Server 2.0
  Indexing and Querying
     Quick Overview

        Dipti Borkar
        Director, Product Management




What we’ll talk about
 •    View basics
 •    Lifecycle of a view
      – Index definition, build, and query phase
      – Indexing details
 •    Replica indexes, failover and compaction
 •    Primary and Secondary indexes
 •    View best practices
 •    Additional patterns




JSON Documents

• Map more closely to objects or entities
• CRUD Operations, lightweight schema
  {
      "fields" : ["with basic types", 3.14159, true],
      "like" : "your favorite language"
  }
• Stored under an identifier key
  client.set("mydocumentid", myDocument);
  mySavedDocument = client.get("mydocumentid");




What are Views?
 • Extract fields from JSON documents and produce an index of
   the selected information
Views – The basics

• Define materialized views on JSON documents and then query
  across the data set
• Using views you can define
    •   Primary indexes
    •   Simple secondary indexes (most common use case)
    •   Complex secondary, tertiary and composite indexes
    •   Aggregations (reduction)
• Documents are eventually indexed
• Queries are eventually consistent with respect to documents
• Built using Map/Reduce technology
    • Map and Reduce functions are written in JavaScript
View Lifecycle
Define -> Build -> Query




Buckets & Design docs & Views
• Create design documents on a bucket
• Create views within a design document

[Diagram: BUCKET 1 contains Design document 1 (Views 1-3), Design document 2 (Views 4-5) and Design document 3 (Views 6-7); BUCKET 2 is shown separately.]
Eventually indexed Views – Data flow
[Diagram: an App Server sends Doc 1 to a Couchbase Server node. Components shown on the node: Managed Cache, Replication Queue (to other node), Disk Queue, Disk, and View engine.]
Distributed Indexing and Querying
• Indexing work is distributed amongst nodes
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes

[Diagram: App Server 1 and App Server 2, each with a Couchbase client library and cluster map, create an index / view and query a Couchbase Server cluster of three servers. Each server holds active documents and replica documents. User-configured replica count = 1.]
DEFINE  Index / View Definition in JavaScript


       CREATE INDEX City ON Brewery.City;




BUILD  Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• View reads are from disk (different performance profile than GET/SET)
• Views built against every document on every node
  – Group them in a design document
• Views are automatically kept up to date




QUERY  Dynamic Queries with Optional Aggregation
•   Eventually consistent with respect to document updates
•   Efficiently fetch a document or group of similar documents
•   Queries will use cached values from B-tree inner nodes when possible
•   Take advantage of in-order tree traversal with group_level queries
    Query ?startkey="J"&endkey="K"
    {"rows":[{"key":"Juneau","value":null}]}
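
The range query above can be mimicked in plain JavaScript. This is only a sketch, not Couchbase code: the `buildIndex` and `queryRange` helpers and the sample brewery documents are hypothetical, and plain string comparison stands in for the server's key collation. It shows how a map function that emits `doc.city` produces a sorted index that a `startkey` / `endkey` pair can slice.

```javascript
// Map function in the style of a view definition: index breweries by city.
// In a real view, emit is a global; it is passed in here so the snippet
// runs on its own.
function mapCity(doc, meta, emit) {
  if (doc.city) {
    emit(doc.city, null);
  }
}

// Hypothetical helper: run the map function over every document and
// return the emitted rows sorted by key, like an in-order index.
function buildIndex(docs, mapFn) {
  const rows = [];
  for (const [id, doc] of Object.entries(docs)) {
    mapFn(doc, { id }, (key, value) => rows.push({ id, key, value }));
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Hypothetical helper: keep rows with startkey <= key <= endkey.
function queryRange(index, startkey, endkey) {
  return index.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Sample documents, made up for illustration.
const breweries = {
  brewery_a: { name: "A", city: "Anchorage" },
  brewery_j: { name: "J", city: "Juneau" },
  brewery_k: { name: "K", city: "Ketchikan" },
};

const cityIndex = buildIndex(breweries, mapCity);
const result = queryRange(cityIndex, "J", "K");
console.log(JSON.stringify({ rows: result.map(({ key, value }) => ({ key, value })) }));
```

"Ketchikan" sorts after "K", so only keys that begin with "J" fall inside the range.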




Index building details

 – All the views within a design document are incrementally updated
   when the view is accessed or auto-indexing kicks in
 – Automatic view updates
   • In addition to forcing an index build at query time, active & replica indexes are
     updated every 3 seconds of inactivity if there are at least 5000 new changes
     (configurable)
 – The entire view is recreated if the view definition has changed
 – Views can be conditionally updated by specifying the “stale”
   argument to the view query
 – The index information stored on disk consists of the combination
   of both the key and value information defined within your view.
Queries run against stale indexes by default

• stale=update_after (default if nothing is specified)
  – always get fastest response
  – can take two queries to read your own writes
• stale=ok
  – auto update will trigger eventually
  – might not see your own writes for a few minutes
  – least frequent updates -> least resource impact
• stale=false
  – Use with “set with persistence” if data needs to be included in
    view results
  – BUT be aware of delay it adds, only use when really required
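
A minimal sketch of choosing a `stale` setting per query, assuming the view REST endpoint on port 8092. The host, bucket, design document and view names are placeholders, and `viewQueryUrl` is a hypothetical helper, not part of any client library.

```javascript
// Build a view query URL with an explicit stale setting.
// Only the three values listed on the slide are accepted.
function viewQueryUrl({ host, bucket, designDoc, view, stale }) {
  const allowed = ["update_after", "ok", "false"];
  if (!allowed.includes(stale)) {
    throw new Error("stale must be one of: " + allowed.join(", "));
  }
  const base = `http://${host}:8092/${bucket}/_design/${designDoc}/_view/${view}`;
  return `${base}?stale=${encodeURIComponent(stale)}`;
}

// Placeholder names; stale=false forces an index update before the query runs.
const url = viewQueryUrl({
  host: "localhost",
  bucket: "beer-sample",
  designDoc: "beer",
  view: "by_brewery",
  stale: "false",
});
console.log(url);
```

Keeping the allowed values in one place makes it harder to ship a typo such as `stale=no`.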
Views and Replica indexes
• In addition to replicas for data (up to 3 copies), optionally
  create replica for indexes
• Each node manages replica index data structures
• Set at a bucket level
• Replica index populated from replica data
• Replica index is used after a failover
Views and failover

• Replica indexes enabled on failover
• Replica indexes are rebuilt on replica nodes
  – Automatically incrementally built based on replica data
  – Updated every 3 seconds of inactivity if there are at least 5000
    new changes
  – Not copied/moved to be consistent with persisted replica data
View Compaction

 • Compaction is ONLINE
 • Reclaims empty allocated space from disk
 • Indexes are stored on disk for active vBuckets on each
   node and updated in append-only manner
 • Auto-compaction performed in the background
    – Set the database fragmentation levels
    – Set the index fragmentation levels
    – Choose a schedule
    – Global and bucket specific settings
Development vs. Production Views
• Development views index a subset of the data.
• Publishing a view builds the index across the entire cluster.
• Queries on production views are scattered to all cluster members and results are gathered and returned to the client.
Simple Primary and Secondary Indexing
Example Document
[Image: example document, with its document ID labeled]
Define a primary index on the bucket
• Look up the document ID / key by key, range, prefix, suffix

[Image: index definition]
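
A primary index of this kind could be sketched as a map function that emits `meta.id` as the key. The slide's own definition is an image that did not survive, so the function below is an assumption. Apart from `beer_Enlightened_Black_Ale`, the document IDs are made up, and the prefix filter is a plain-JavaScript stand-in for a key-range query.

```javascript
// Primary index sketch: emit the document ID as the key, no value.
// In a real view, emit is a global; it is passed in here so the
// snippet runs on its own.
function mapPrimary(doc, meta, emit) {
  emit(meta.id, null);
}

// Hypothetical documents keyed by ID.
const docsById = {
  brewery_New_Belgium: { type: "brewery" },
  beer_Enlightened_Black_Ale: { type: "beer" },
  beer_Fat_Tire: { type: "beer" },
};

// Collect the emitted keys and sort them, like an in-order index.
const primaryKeys = [];
for (const [id, doc] of Object.entries(docsById)) {
  mapPrimary(doc, { id }, (key) => primaryKeys.push(key));
}
primaryKeys.sort();

// Prefix lookup: every document ID that starts with "beer_".
const beerIds = primaryKeys.filter((key) => key.startsWith("beer_"));
console.log(beerIds);
```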
Define a secondary index on the bucket
• Look up an attribute by value, range, prefix, suffix

[Image: index definition]
Find documents by a specific attribute
• Let's find beers by brewery_id!
The index definition

[Image: index definition, with the emitted key and value labeled]
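
Since the index definition itself is only an image, here is a hedged sketch of what a secondary index on `brewery_id` might look like. The `type: "beer"` check and all sample documents are assumptions made for illustration. Only the `brewery_id` attribute comes from the slides.

```javascript
// Secondary index sketch: key each beer by the brewery it belongs to.
// In a real view, emit is a global; it is passed in here so the
// snippet runs on its own.
function mapBeersByBrewery(doc, meta, emit) {
  if (doc.type === "beer" && doc.brewery_id) {
    emit(doc.brewery_id, null);
  }
}

// Made-up sample documents.
const sampleDocs = {
  beer_1: { type: "beer", name: "Beer One", brewery_id: "brewery_x" },
  beer_2: { type: "beer", name: "Beer Two", brewery_id: "brewery_y" },
  beer_3: { type: "beer", name: "Beer Three", brewery_id: "brewery_x" },
  brewery_x: { type: "brewery", name: "Brewery X" },
};

// Each emitted row remembers the ID of the document that produced it.
const beerRows = [];
for (const [id, doc] of Object.entries(sampleDocs)) {
  mapBeersByBrewery(doc, { id }, (key, value) => beerRows.push({ id, key, value }));
}

// Key lookup: all beers whose brewery_id is "brewery_x".
const fromBreweryX = beerRows.filter((row) => row.key === "brewery_x");
console.log(fromBreweryX.map((row) => row.id));
```

Emitting `null` as the value keeps the index small. The matching documents can then be fetched by ID.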
The result set: beers keyed by brewery_id




View Best Practices




View writing guidance

• Move frequently used views out to a separate design document
  – All views in a design document are updated at the same time
  – This can result in increased index building time if all views are in a single design
    document, especially for frequently accessed views.
  – However, grouping views into a smaller number of design documents improves overall
    performance
• Try to avoid computing too many things with one view
• Use built-in reduces where possible - custom reduces are not optimized
• Check for attribute existence
         function(doc, meta){
           if (doc.ingredient)
           {
              emit(doc.ingredient.ingredtext, null);
           }
         }
View writing guidance

• Do not include the document in the view value
  – Instead either use the GET / SET API or the API that includes documents filtered by
    the query [example: willIncludeDocs()]
  – Emit either null or the ID instead (meta.id) in your key or value data
                             emit(doc.name, null)
• Don’t emit too much data into a view value
  – Use views to filter documents
  – Then use the data path to access the matched documents
• Use Document Types to make views more selective
                     function(doc, meta)
                     {
                       if(doc.type == "player")
                        emit(doc.experience, null);
                     }
What impact do views have on the system?

•   Complexity of the index -> CPU
•   Size of the value emitted and selectivity -> Disk size, I/O
•   Replica index -> Disk size, I/O, CPU
•   Number of design docs -> CPU, I/O, Disk size
    – 4 active and 2 replica design documents are built in parallel by default
    – Can be changed using the maxParallelIndexers and
      maxParallelReplicaIndexers parameters
•   Compaction of views -> CPU, I/O
•   Rebalance time -> Increases with views, to support consistent
    query results during rebalance
    – Can be disabled using the indexAwareRebalanceDisabled parameter
Views and OS caching

• File system cache availability for the index has a big impact
  on performance
• Indexes are disk based and should have sufficient file system
  cache available for faster query access
• In-house performance results show that by doubling system
  cache availability
  – query latency is halved
  – throughput increases by 50%
• Test runs used 10 million items with a 16 GB bucket quota and
  4 GB and 8 GB of system RAM available for indexes
Query Pattern
Basic Aggregations




Use a built-in reduce function with a group query

• Let's find the average abv for each brewery!
We are reducing doc.abv with _stats




Group reduce (reduce by unique key)




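
The group reduce can be illustrated outside the server. In the sketch below, `stats` is a plain-JavaScript stand-in for the built-in `_stats` reduce (sum, count, min, max, sumsqr), `groupReduce` is a hypothetical helper that applies it once per unique key, and the beer documents are made up. The average abv per brewery is then `sum / count`.

```javascript
// Map step: one row per beer, keyed by brewery, with abv as the value.
// In a real view, emit is a global; it is passed in here so the
// snippet runs on its own.
function mapAbv(doc, meta, emit) {
  if (doc.type === "beer" && typeof doc.abv === "number") {
    emit(doc.brewery_id, doc.abv);
  }
}

// Stand-in for the built-in _stats reduce.
function stats(values) {
  return {
    sum: values.reduce((a, b) => a + b, 0),
    count: values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    sumsqr: values.reduce((a, b) => a + b * b, 0),
  };
}

// Group reduce: apply the reduce function once per unique key.
function groupReduce(rows, reduceFn) {
  const groups = new Map();
  for (const { key, value } of rows) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups].map(([key, values]) => ({ key, value: reduceFn(values) }));
}

// Made-up beers for illustration.
const beers = [
  { type: "beer", brewery_id: "brewery_a", abv: 5.0 },
  { type: "beer", brewery_id: "brewery_a", abv: 7.0 },
  { type: "beer", brewery_id: "brewery_b", abv: 4.0 },
];

const abvRows = [];
beers.forEach((doc, i) =>
  mapAbv(doc, { id: "beer_" + i }, (key, value) => abvRows.push({ key, value }))
);

const grouped = groupReduce(abvRows, stats);
for (const { key, value } of grouped) {
  console.log(key, "average abv:", value.sum / value.count);
}
```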
Query Pattern
Time-based Rollups




Find patterns in beer comments by time


                {
                  "type": "comment",
                  "about_id": "beer_Enlightened_Black_Ale",
                  "user_id": 525,
                  "text": "tastes like college!",
                  "updated": "2010-07-22 20:00:20"    <- timestamp
                }
                {
                  "id": "f1e62"
                }
Query with group_level=2 to get monthly rollups




dateToArray() is your friend




• String or Integer based timestamps
• Output optimized for group_level queries
• Array of JSON numbers:
  [2012,9,21,11,30,44]
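
To make the output shape concrete, here is a stand-alone approximation of that conversion. `dateToArraySketch` is a hypothetical name and is not the server's built-in `dateToArray()`. It assumes string timestamps in the `YYYY-MM-DD HH:MM:SS` form used by the example documents and treats integer timestamps as milliseconds since the epoch in UTC.

```javascript
// Convert a timestamp into [year, month, day, hour, minute, second].
function dateToArraySketch(timestamp) {
  if (typeof timestamp === "number") {
    // Integer timestamps: milliseconds since the epoch, read as UTC.
    const d = new Date(timestamp);
    return [
      d.getUTCFullYear(),
      d.getUTCMonth() + 1, // months are 0-based in JavaScript
      d.getUTCDate(),
      d.getUTCHours(),
      d.getUTCMinutes(),
      d.getUTCSeconds(),
    ];
  }
  // String timestamps: parse the fields directly to avoid time zone surprises.
  const m = /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/.exec(timestamp);
  return m ? m.slice(1, 7).map(Number) : null;
}

const fromString = dateToArraySketch("2010-07-22 20:00:20");
const fromInteger = dateToArraySketch(Date.UTC(2012, 8, 21, 11, 30, 44));
console.log(fromString, fromInteger);
```

Emitting this array as the view key is what lets one index serve yearly, monthly, and daily groupings.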
group_level=2 results




• Monthly rollup
• Sorted by time. To rank by value, sort the query results in your
  application (there is no chained map-reduce)
group_level=3 - daily results - great for graphing




• Daily, hourly, minute or second rollup all possible with the
  same index.
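
The effect of `group_level` on array keys can be sketched as follows. `groupLevel` is a hypothetical helper that counts rows per key prefix, which mimics a count-style reduce. The rows are made up and assumed to be already sorted by key, as they would be in an index.

```javascript
// Group rows by the first `level` elements of their array key and count them.
function groupLevel(rows, level) {
  const counts = new Map();
  for (const { key } of rows) {
    const prefix = JSON.stringify(key.slice(0, level));
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  return [...counts].map(([prefix, value]) => ({ key: JSON.parse(prefix), value }));
}

// Made-up comment timestamps as [year, month, day, hour, minute, second].
const commentRows = [
  { key: [2010, 7, 22, 20, 0, 20], value: null },
  { key: [2010, 7, 23, 9, 15, 0], value: null },
  { key: [2010, 8, 1, 12, 0, 0], value: null },
];

const monthly = groupLevel(commentRows, 2); // keys like [2010, 7]
const daily = groupLevel(commentRows, 3); // keys like [2010, 7, 22]
console.log(JSON.stringify(monthly));
console.log(JSON.stringify(daily));
```

Changing only the `level` argument moves between monthly and daily rollups without touching the index.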

Query Pattern
 Leaderboard




Aggregate value stored in a document
• Let's find the top-rated beers!
                       {
                            "brewery": "New Belgium Brewing",
                            "name": "1554 Enlightened Black Ale",
                            "abv": 5.5,
                            "description": "Born of a flood...",
                            "category": "Belgian and French Ale",
                            "style": "Other Belgian-Style Ales",
                            "updated": "2010-07-22 20:00:20",
                            "ratings" : {
                              "jchris" : 5,
                              "scalabl3" : 4,
                              "damienkatz" : 1
                            }
                       }
Sort each beer by its average rating

• Let's find the top-rated beers!

[Image: index definition, with the average calculation labeled]
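
The map function on this slide is an image, so the following is only a guess at its shape: average the values in `doc.ratings` and emit that average as the key, so the index is ordered by rating. The second beer and the sorting step are made up for illustration. The first beer's ratings are the ones from the example document.

```javascript
// Map sketch: emit each beer's average rating as the key.
// In a real view, emit is a global; it is passed in here so the
// snippet runs on its own.
function mapByAverageRating(doc, meta, emit) {
  if (!doc.ratings) {
    return;
  }
  var total = 0;
  var count = 0;
  for (var user in doc.ratings) {
    total += doc.ratings[user];
    count += 1;
  }
  if (count > 0) {
    emit(total / count, doc.name);
  }
}

const ratedBeers = [
  // Ratings from the example document.
  { name: "1554 Enlightened Black Ale", ratings: { jchris: 5, scalabl3: 4, damienkatz: 1 } },
  // Made-up second beer for comparison.
  { name: "Hypothetical Pale Ale", ratings: { alice: 5, bob: 4 } },
];

const ratingRows = [];
ratedBeers.forEach((doc, i) =>
  mapByAverageRating(doc, { id: "beer_" + i }, (key, value) => ratingRows.push({ key, value }))
);

// Highest average first, like reading the index in descending order.
ratingRows.sort((a, b) => b.key - a.key);
console.log(ratingRows.map((row) => row.value));
```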
Questions?




THANK YOU

DIPTI@COUCHBASE.COM
     @DBORKAR





Mais conteúdo relacionado

Destaque

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slidesdbelova
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Rick Branson
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsSaltStack
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDBMongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Destaque (11)

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
OpenStack Heat slides
OpenStack Heat slidesOpenStack Heat slides
OpenStack Heat slides
 
Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)Cassandra at Instagram (August 2013)
Cassandra at Instagram (August 2013)
 
A user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management toolsA user's perspective on SaltStack and other configuration management tools
A user's perspective on SaltStack and other configuration management tools
 
storm at twitter
storm at twitterstorm at twitter
storm at twitter
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Hadoop on the Cloud
Hadoop on the CloudHadoop on the Cloud
Hadoop on the Cloud
 
Building Your First App with MongoDB
Building Your First App with MongoDBBuilding Your First App with MongoDB
Building Your First App with MongoDB
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Semelhante a Couchbase Server 2.0 - Indexing and Querying - Deep dive

Couchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveCouchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveDipti Borkar
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbaseDipti Borkar
 
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»Yandex
 
How companies use NoSQL & Couchbase - NoSQL Now 2014
How companies use NoSQL & Couchbase - NoSQL Now 2014How companies use NoSQL & Couchbase - NoSQL Now 2014
How companies use NoSQL & Couchbase - NoSQL Now 2014Dipti Borkar
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID
 
MongoDB as Search Engine Repository @ MongoTokyo2011
MongoDB as Search Engine Repository @ MongoTokyo2011MongoDB as Search Engine Repository @ MongoTokyo2011
MongoDB as Search Engine Repository @ MongoTokyo2011Preferred Networks
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?lucenerevolution
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesanynines GmbH
 
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...mfrancis
 
Nuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo
 
OpenProdoc 0.8 Load Tests
OpenProdoc 0.8 Load TestsOpenProdoc 0.8 Load Tests
OpenProdoc 0.8 Load Testsjhierrot
 
Nuxeo World Session: CMIS - What's Next?
Nuxeo World Session: CMIS - What's Next?Nuxeo World Session: CMIS - What's Next?
Nuxeo World Session: CMIS - What's Next?Nuxeo
 
Some key value stores using log-structure
Some key value stores using log-structureSome key value stores using log-structure
Some key value stores using log-structureZhichao Liang
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use CasesDATAVERSITY
 
Kubernetes overview 101
Kubernetes overview 101Kubernetes overview 101
Kubernetes overview 101Boskey Savla
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresmkorremans
 
Scratchpads past,present,future
Scratchpads past,present,futureScratchpads past,present,future
Scratchpads past,present,futureEdward Baker
 

Semelhante a Couchbase Server 2.0 - Indexing and Querying - Deep dive (20)

Couchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveCouchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep dive
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbase
 
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»
Исмаил Сенгор Алтинговде «Проблемы эффективности поисковых систем»
 
AxonShare_Scanning_Integration
AxonShare_Scanning_IntegrationAxonShare_Scanning_Integration
AxonShare_Scanning_Integration
 
How companies use NoSQL & Couchbase - NoSQL Now 2014
How companies use NoSQL & Couchbase - NoSQL Now 2014How companies use NoSQL & Couchbase - NoSQL Now 2014
How companies use NoSQL & Couchbase - NoSQL Now 2014
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
 
MongoDB as Search Engine Repository @ MongoTokyo2011
MongoDB as Search Engine Repository @ MongoTokyo2011MongoDB as Search Engine Repository @ MongoTokyo2011
MongoDB as Search Engine Repository @ MongoTokyo2011
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
 
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...
The Bundle Dilemma - Richard S. Hall, Researcher, Laboratoire d’Informatique ...
 
Nuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to NuxeoNuxeo World Session: Migrating to Nuxeo
Nuxeo World Session: Migrating to Nuxeo
 
Docker Kubernetes Istio
Docker Kubernetes IstioDocker Kubernetes Istio
Docker Kubernetes Istio
 
OpenProdoc 0.8 Load Tests
OpenProdoc 0.8 Load TestsOpenProdoc 0.8 Load Tests
OpenProdoc 0.8 Load Tests
 
Nuxeo World Session: CMIS - What's Next?
Nuxeo World Session: CMIS - What's Next?Nuxeo World Session: CMIS - What's Next?
Nuxeo World Session: CMIS - What's Next?
 
Some key value stores using log-structure
Some key value stores using log-structureSome key value stores using log-structure
Some key value stores using log-structure
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 
Kubernetes overview 101
Kubernetes overview 101Kubernetes overview 101
Kubernetes overview 101
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Scratchpads past,present,future
Scratchpads past,present,futureScratchpads past,present,future
Scratchpads past,present,future
 

Mais de Dipti Borkar

Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Dipti Borkar
 
Revolutionizing the customer experience - Hello Engagement Database
Revolutionizing the customer experience - Hello Engagement DatabaseRevolutionizing the customer experience - Hello Engagement Database
Revolutionizing the customer experience - Hello Engagement DatabaseDipti Borkar
 
How companies-use-no sql-and-couchbase-10152013
How companies-use-no sql-and-couchbase-10152013How companies-use-no sql-and-couchbase-10152013
How companies-use-no sql-and-couchbase-10152013Dipti Borkar
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databasesDipti Borkar
 
How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013Dipti Borkar
 
How companies use NoSQL and Couchbase
How companies use NoSQL and CouchbaseHow companies use NoSQL and Couchbase
How companies use NoSQL and CouchbaseDipti Borkar
 
Launch webinar-introducing couchbase server 2.0-01202013
Launch webinar-introducing couchbase server 2.0-01202013Launch webinar-introducing couchbase server 2.0-01202013
Launch webinar-introducing couchbase server 2.0-01202013Dipti Borkar
 
Part 2 of the webinar - Which freaking database should I use?
Part 2 of the webinar - Which freaking database should I use?Part 2 of the webinar - Which freaking database should I use?
Part 2 of the webinar - Which freaking database should I use?Dipti Borkar
 
Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0Introduction to Couchbase Server 2.0
Introduction to Couchbase Server 2.0Dipti Borkar
 
Transition from relational to NoSQL Philly DAMA Day
Transition from relational to NoSQL Philly DAMA DayTransition from relational to NoSQL Philly DAMA Day
Transition from relational to NoSQL Philly DAMA DayDipti Borkar
 
Introduction to NoSQL and Couchbase
Introduction to NoSQL and CouchbaseIntroduction to NoSQL and Couchbase
Introduction to NoSQL and CouchbaseDipti Borkar
 
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012Dipti Borkar
 
Couchbase Server and IBM BigInsights: One + One = Three
Couchbase Server and IBM BigInsights: One + One = ThreeCouchbase Server and IBM BigInsights: One + One = Three
Couchbase Server and IBM BigInsights: One + One = ThreeDipti Borkar
 
Introduction to Couchbase Server 2.0 - CouchConf SF - Tour and Demo
Introduction to Couchbase Server 2.0 - CouchConf SF - Tour and DemoIntroduction to Couchbase Server 2.0 - CouchConf SF - Tour and Demo
Introduction to Couchbase Server 2.0 - CouchConf SF - Tour and DemoDipti Borkar
 
Go simple-fast-elastic-with-couchbase-server-borkar
Go simple-fast-elastic-with-couchbase-server-borkarGo simple-fast-elastic-with-couchbase-server-borkar
Go simple-fast-elastic-with-couchbase-server-borkarDipti Borkar
 

Couchbase Server 2.0 - Indexing and Querying - Deep dive

  • 1. Couchbase Server 2.0: Indexing and Querying, Quick Overview. Dipti Borkar, Director, Product Management
  • 2. What we’ll talk about
      • View basics
      • Lifecycle of a view
          – Index definition, build, and query phase
          – Indexing details
      • Replica indexes, failover and compaction
      • Primary and secondary indexes
      • View best practices
      • Additional patterns
  • 3. JSON Documents
      • Map more closely to objects or entities
      • CRUD operations, lightweight schema
          { "fields" : ["with basic types", 3.14159, true], "like" : "your favorite language" }
      • Stored under an identifier key
          client.set("mydocumentid", myDocument);
          mySavedDocument = client.get("mydocumentid");
  • 4. What are Views? • Extract fields from JSON documents and produce an index of the selected information
  • 5. Views – The basics
      • Define materialized views on JSON documents and then query across the data set
      • Using views you can define
          – Primary indexes
          – Simple secondary indexes (most common use case)
          – Complex secondary, tertiary and composite indexes
          – Aggregations (reduction)
      • Documents are eventually indexed
      • Queries are eventually consistent with respect to documents
      • Built using Map/Reduce technology
          – Map and Reduce functions are written in JavaScript
  • 6. View Lifecycle: Define -> Build -> Query
  • 7. Buckets & Design docs & Views
      • Create design documents on a bucket
      • Create views within a design document
      [Diagram: BUCKET 1 and BUCKET 2; Design document 1 (Views 1 to 3), Design document 2 (Views 4 and 5), Design document 3 (Views 6 and 7)]
  • 8. Eventually indexed Views – Data flow
      [Diagram: App Server sends Doc 1 to a Couchbase Server node; inside the node: Managed Cache, Replication Queue (to other node), Disk Queue, Disk, View engine]
  • 9. Distributed Indexing and Querying
      • Indexing work is distributed amongst nodes
      • Parallelize the effort
      • Each node has the index for the data stored on it
      • Queries combine the results from the required nodes
      [Diagram: App Server 1 and App Server 2, each with a Couchbase client library and cluster map, create an index / view and query a Couchbase Server cluster of three servers; each server holds active and replica documents; user-configured replica count = 1]
  • 10. DEFINE: Index / View Definition in JavaScript
      CREATE INDEX City ON Brewery.City;
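The SQL-style statement on slide 10 has a rough view counterpart in a JavaScript map function. The following is a minimal sketch, assuming brewery documents carry a `city` attribute. The `runMap` harness, the `emit` stand-in, and the sample documents are hypothetical; in the server, the view engine supplies `emit` and stores the sorted rows.

```javascript
// Map function as it might appear in a view definition.
// Assumption: brewery documents have a "city" attribute.
function cityMap(doc, meta) {
  if (doc.city) {
    emit(doc.city, null);
  }
}

// Hypothetical local stand-ins for the view engine, for illustration only.
const rows = [];
let currentId = null;

function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

// Runs the map function over every document and returns rows sorted by key.
function runMap(mapFn, docs) {
  rows.length = 0;
  for (const id of Object.keys(docs)) {
    currentId = id;
    mapFn(docs[id], { id: id });
  }
  return rows.slice().sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Hypothetical sample data; the third document has no city and is skipped.
const sampleDocs = {
  brewery_21a: { type: "brewery", city: "San Francisco" },
  brewery_alaskan: { type: "brewery", city: "Juneau" },
  brewery_unknown: { type: "brewery" }
};

const cityIndex = runMap(cityMap, sampleDocs);
console.log(cityIndex);
```

Emitting `null` as the value keeps the index small; the document ID attached to each row is enough to fetch the full document afterwards.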
  • 11. BUILD: Distributed Index Build Phase
      • Optimized for lookups, in-order access and aggregations
      • View reads are from disk (different performance profile than GET/SET)
      • Views built against every document on every node
          – Group them in a design document
      • Views are automatically kept up to date
  • 12. QUERY: Dynamic Queries with Optional Aggregation
      • Eventually consistent with respect to document updates
      • Efficiently fetch a document or group of similar documents
      • Queries will use cached values from B-tree inner nodes when possible
      • Take advantage of in-order tree traversal with group_level queries
      Query: ?startkey="J"&endkey="K"
      Result: {"rows":[{"key":"Juneau","value":null}]}
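The range query on slide 12 can be pictured as a scan over key-sorted index rows. This is a sketch only: the index rows are hypothetical, plain JavaScript string comparison stands in for the server's collation rules, and both ends of the range are treated as inclusive.

```javascript
// Hypothetical index rows, already sorted by key as a view index would be.
const indexRows = [
  { key: "Anchorage", value: null },
  { key: "Juneau", value: null },
  { key: "Kodiak", value: null },
  { key: "Portland", value: null }
];

// Returns the rows whose key lies between startkey and endkey (inclusive).
// Assumption: simple string comparison approximates the index ordering.
function rangeQuery(rows, startkey, endkey) {
  return {
    rows: rows.filter((row) => row.key >= startkey && row.key <= endkey)
  };
}

const result = rangeQuery(indexRows, "J", "K");
console.log(JSON.stringify(result));
// prints {"rows":[{"key":"Juneau","value":null}]}
```

"Kodiak" sorts after "K", so it falls outside the range, which is why only "Juneau" comes back.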
  • 13. Index building details
      – All the views within a design document are incrementally updated when the view is accessed or auto-indexing kicks in
      – Automatic view updates
          • In addition to forcing an index build at query time, active & replica indexes are updated every 3 seconds of inactivity if there are at least 5000 new changes (configurable)
      – The entire view is recreated if the view definition has changed
      – Views can be conditionally updated by specifying the "stale" argument to the view query
      – The index information stored on disk consists of the combination of both the key and value information defined within your view
  • 14. Queries run against stale indexes by default
      • stale=update_after (default if nothing is specified)
          – always get fastest response
          – can take two queries to read your own writes
      • stale=ok
          – auto update will trigger eventually
          – might not see your own writes for a few minutes
          – least frequent updates -> least resource impact
      • stale=false
          – Use with "set with persistence" if data needs to be included in view results
          – BUT be aware of the delay it adds; only use when really required
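The three `stale` settings on slide 14 are passed as a query-string parameter. Below is a small sketch of building such a query URL, assuming the view REST endpoint layout `/<bucket>/_design/<ddoc>/_view/<view>` on port 8092. The helper name `viewQueryUrl` and the host, bucket, design document, and view names are hypothetical.

```javascript
// Values accepted for the stale parameter, as listed on the slide.
const STALE_VALUES = ["update_after", "ok", "false"];

// Builds a view query URL with an explicit stale setting.
// Assumption: names and host are placeholders; update_after is used
// when the caller does not specify a value.
function viewQueryUrl(base, bucket, designDoc, view, params) {
  const stale = params.stale === undefined ? "update_after" : String(params.stale);
  if (!STALE_VALUES.includes(stale)) {
    throw new Error("unsupported stale value: " + stale);
  }
  const query = new URLSearchParams({ ...params, stale: stale });
  return `${base}/${bucket}/_design/${designDoc}/_view/${view}?${query.toString()}`;
}

const url = viewQueryUrl("http://localhost:8092", "beer-sample", "beer", "by_city", {
  stale: false
});
console.log(url);
// prints http://localhost:8092/beer-sample/_design/beer/_view/by_city?stale=false
```

Making the default explicit in code keeps the trade-off visible: `stale=false` waits for the index to catch up, so it is worth reserving for reads that must include recent writes.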
  • 15. Views and Replica indexes
      • In addition to replicas for data (up to 3 copies), optionally create replicas for indexes
      • Each node manages replica index data structures
      • Set at a bucket level
      • Replica index populated from replica data
      • Replica index is used after a failover
  • 16. Views and failover
      • Replica indexes enabled on failover
      • Replica indexes are rebuilt on replica nodes
          – Automatically incrementally built based on replica data
          – Updated every 3 seconds of inactivity if there are at least 5000 new changes
          – Not copied/moved, to be consistent with persisted replica data
  • 17. View Compaction
      • Compaction is ONLINE
      • Reclaims empty allocated space from disk
      • Indexes are stored on disk for active vBuckets on each node and updated in an append-only manner
      • Auto-compaction performed in the background
          – Set the database fragmentation levels
          – Set the index fragmentation levels
          – Choose a schedule
          – Global and bucket-specific settings
  • 18. Development vs. Production Views
      • Development views index a subset of the data.
      • Publishing a view builds the index across the entire cluster.
      • Queries on production views are scattered to all cluster members, and results are gathered and returned to the client.
  • 19. Simple Primary and Secondary Indexing
  • 20. Example Document [Figure label: Document ID]
  • 21. Define a primary index on the bucket
      • Look up the document ID / key by key, range, prefix, suffix
      [Figure: index definition]
  • 22. Define a secondary index on the bucket
      • Look up an attribute by value, range, prefix, suffix
      [Figure: index definition]
  • 23. Find documents by a specific attribute
      • Let's find beers by brewery_id!
  • 24. The index definition [Figure labels: Key, Value]
  • 25. The result set: beers keyed by brewery_id
  • 27. View writing guidance
      • Move frequently used views out to a separate design document
          – All views in a design document are updated at the same time
          – This can result in increased index building time if all views are in a single design document, especially for frequently accessed views
          – However, grouping views into a smaller number of design documents improves overall performance
      • Try to avoid computing too many things with one view
      • Use built-in reduces where possible; custom reduces are not optimized
      • Check for attribute existence
          function(doc, meta) {
            if (doc.ingredient) {
              emit(doc.ingredient.ingredtext, null);
            }
          }
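The attribute-existence check on slide 27 matters because documents in a bucket need not share a shape. The sketch below contrasts a guarded and an unguarded map function in plain JavaScript. The `emit` stand-in and the sample documents are hypothetical, and how the view engine itself reports such an error is not shown here.

```javascript
// Hypothetical stand-in for the view engine's emit().
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Without the guard, a document lacking "ingredient" raises a TypeError.
function unguardedMap(doc, meta) {
  emit(doc.ingredient.ingredtext, null);
}

// With the guard, documents lacking "ingredient" are simply skipped.
function guardedMap(doc, meta) {
  if (doc.ingredient) {
    emit(doc.ingredient.ingredtext, null);
  }
}

// Hypothetical sample documents with different shapes.
const docs = [
  { ingredient: { ingredtext: "hops" } },
  { name: "document without an ingredient attribute" }
];

let unguardedFailed = false;
try {
  docs.forEach((doc, i) => unguardedMap(doc, { id: "doc" + i }));
} catch (err) {
  unguardedFailed = err instanceof TypeError;
}

emitted.length = 0; // discard rows from the failed run
docs.forEach((doc, i) => guardedMap(doc, { id: "doc" + i }));
console.log(unguardedFailed, emitted);
```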
  • 28. View writing guidance
      • Do not include the document in the view value
          – Instead, use either the GET / SET API or the API that includes documents filtered by the query [example: willIncludeDocs()]
          – Emit either null or the ID (meta.id) in your key or value data
              emit(doc.name, null)
      • Don’t emit too much data into a view value
          – Use views to filter documents
          – Then use the data path to access the matched documents
      • Use document types to make views more selective
          function(doc, meta) {
            if (doc.type == "player")
              emit(doc.experience, null);
          }
  • 29. What impact do views have on the system?
      • Complexity of the index -> CPU
      • Size of the value emitted and selectivity -> Disk size, I/O
      • Replica index -> Disk size, I/O, CPU
      • Number of design docs -> CPU, I/O, Disk size
          – 4 active and 2 replica design documents are built in parallel by default
          – Can be changed using the maxParallelIndexers and maxParallelReplicaIndexers parameters
      • Compaction of views -> CPU, I/O
      • Rebalance time -> Increases with views, to support consistent query results during rebalance
          – Can be disabled using the indexAwareRebalanceDisabled parameter
  • 30. Views and OS caching
      • File system cache availability for the index has a big impact on performance
      • Indexes are disk based and should have sufficient file system cache available for faster query access
      • In-house performance results show that by doubling system cache availability
          – query latency reduces to half
          – throughput increases by 50%
      • Runs based on 10 million items with 16GB bucket quota and 4GB, 8GB system RAM availability for indexes
  • 32. Use a built-in reduce function with a group query
      • Let's find the average abv for each brewery!
  • 33. We are reducing doc.abv with _stats
  • 34. Group reduce (reduce by unique key)
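Slides 32 to 34 combine the built-in `_stats` reduce with a grouped query. The sketch below emulates that locally to show where the average comes from: `_stats` yields sum, count, min, max and sumsqr per key, and the average is sum divided by count. The map rows, brewery IDs, and the `groupStats` helper are hypothetical.

```javascript
// Hypothetical rows, as a map like emit(doc.brewery_id, doc.abv) might produce.
const mapRows = [
  { key: "brewery_a", value: 5.0 },
  { key: "brewery_a", value: 6.0 },
  { key: "brewery_b", value: 4.5 }
];

// Local stand-in for the built-in _stats reduce with grouping by unique key.
function groupStats(rows) {
  const groups = new Map();
  for (const { key, value } of rows) {
    let stats = groups.get(key);
    if (!stats) {
      stats = { sum: 0, count: 0, min: value, max: value, sumsqr: 0 };
      groups.set(key, stats);
    }
    stats.sum += value;
    stats.count += 1;
    stats.min = Math.min(stats.min, value);
    stats.max = Math.max(stats.max, value);
    stats.sumsqr += value * value;
  }
  return Array.from(groups, ([key, value]) => ({ key: key, value: value }));
}

// The average is derived by the caller from sum and count.
const averages = groupStats(mapRows).map((row) => ({
  brewery: row.key,
  average: row.value.sum / row.value.count
}));
console.log(averages);
```

For `brewery_a` the sum is 11 and the count is 2, so the average is 5.5.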
  • 36. Find patterns in beer comments by time
      Document (the "updated" field is the timestamp):
          { "type": "comment", "about_id": "beer_Enlightened_Black_Ale", "user_id": 525, "text": "tastes like college!", "updated": "2010-07-22 20:00:20" }
      Metadata:
          { "id": "f1e62" }
  • 37. Query with group_level=2 to get monthly rollups
  • 38. dateToArray() is your friend
      • String or Integer based timestamps
      • Output optimized for group_level queries
      • Array of JSON numbers: [2012,9,21,11,30,44]
  • 39. group_level=2 results
      • Monthly rollup
      • Sorted by time; sort the query results in your application if you want to rank by value (no chained map-reduce)
  • 40. group_level=3: daily results, great for graphing
      • Daily, hourly, minute or second rollup all possible with the same index.
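The rollups on slides 37 to 40 follow from emitting a date array as the key and truncating it at query time. Here is a local sketch under several assumptions: `dateToArrayLocal` is a hypothetical stand-in for the built-in `dateToArray()` that only handles "YYYY-MM-DD hh:mm:ss" strings, the comment documents are made up, and a count is used as the reduce.

```javascript
// Hypothetical stand-in for the view engine's built-in dateToArray().
// Assumption: timestamps use the "YYYY-MM-DD hh:mm:ss" format from the slide.
function dateToArrayLocal(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(text);
  if (!match) {
    throw new Error("unexpected timestamp: " + text);
  }
  return match.slice(1).map(Number);
}

// Hypothetical comment documents.
const comments = [
  { type: "comment", updated: "2010-07-22 20:00:20" },
  { type: "comment", updated: "2010-07-23 09:15:00" },
  { type: "comment", updated: "2010-08-01 12:00:00" }
];

// Map step: the key is the date array, the value is 1 so rows can be counted.
const rows = comments
  .filter((doc) => doc.type === "comment")
  .map((doc) => ({ key: dateToArrayLocal(doc.updated), value: 1 }));

// Emulates a counting reduce with group_level=n: keys are truncated to
// their first n elements and the values of equal truncated keys are summed.
function countByGroupLevel(inputRows, level) {
  const counts = new Map();
  for (const row of inputRows) {
    const groupKey = JSON.stringify(row.key.slice(0, level));
    counts.set(groupKey, (counts.get(groupKey) || 0) + row.value);
  }
  return Array.from(counts, ([key, value]) => ({ key: JSON.parse(key), value: value }));
}

const monthly = countByGroupLevel(rows, 2); // keys like [2010,7]
const daily = countByGroupLevel(rows, 3); // keys like [2010,7,22]
console.log(JSON.stringify(monthly), JSON.stringify(daily));
```

The same rows serve both rollups; only the group level changes, which is why one index covers monthly, daily, and finer granularity.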
  • 42. Aggregate value stored in a document
      • Let's find the top-rated beers!
          { "brewery": "New Belgium Brewing", "name": "1554 Enlightened Black Ale", "abv": 5.5, "description": "Born of a flood...", "category": "Belgian and French Ale", "style": "Other Belgian-Style Ales", "updated": "2010-07-22 20:00:20", "ratings" : { "jchris" : 5, "scalabl3" : 4, "damienkatz" : 1 } }
  • 43. Sort each beer by its average rating
      • Let's find the top-rated beers!
      [Figure label: average]
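Slide 43 sorts beers by an average computed inside the map function. The sketch below shows one way that could look, assuming each beer document carries a `ratings` hash as on slide 42. The `emit` stand-in, the second beer document, and the descending sort that emulates a reversed query are hypothetical.

```javascript
// Hypothetical stand-in for the view engine's emit().
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function: the key is the average rating, the value is the beer name.
function averageRatingMap(doc, meta) {
  if (doc.ratings) {
    let total = 0;
    let count = 0;
    for (const user in doc.ratings) {
      total += doc.ratings[user];
      count += 1;
    }
    if (count > 0) {
      emit(total / count, doc.name);
    }
  }
}

// First document follows the slide; the second one is made up for contrast.
const beers = {
  beer_1554: {
    name: "1554 Enlightened Black Ale",
    ratings: { jchris: 5, scalabl3: 4, damienkatz: 1 }
  },
  beer_hypothetical: {
    name: "Hypothetical Pale Ale",
    ratings: { alice: 5, bob: 4 }
  }
};

for (const id of Object.keys(beers)) {
  averageRatingMap(beers[id], { id: id });
}

// Emulates reading the index in descending key order: best average first.
const topRated = emitted.slice().sort((a, b) => b.key - a.key);
console.log(topRated);
```

Because the index is ordered by key, putting the average in the key means the ranking comes from the index order rather than from sorting in the application.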
  • 44. Questions?

Editor's Notes

  1. If you are ingesting tweets, Git commits, and LinkedIn API data, there’s little value in transforming it before you save it. Just store it and sort it out later; the same holds for user data.
  2. Schemaless is good as far as it goes, but what it’s really saying is “don’t worry about the database”, so a lot of the patterns move to the application. That’s what this section is about.
  3. (1) A set request comes in from the application. (2) Couchbase Server responds that the key is written. (3) Couchbase Server then replicates the data out to memory in the other nodes. (4) At the same time, it puts the data into a write queue to be persisted to disk.
  4. Bulletize the text. Make sure the builds work.
  5. Defined via SDKs or the administration console. Deploying a new view to production is an online operation, but can be heavy.
  6. no downtime deploy?
  7. If Sum = 11 and Count = 2, the Average is 5.5
  8. Ratings are stored in a hash to ensure each user can only rate each beer once.