
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014

This presentation introduces the open-source, Lucene-based implementation of Cassandra secondary indexes developed by Stratio. It allows users to run complex queries in Cassandra using CQL3, including full-text search, top-k queries and free multivariable search. Relevance queries and filters can be combined to make searches such as “give me the 100 tweets that best match this phrase among those written in a certain date range”.
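
As a rough illustration of combining a relevance query with a filter, the sketch below builds the search clause as a Python dictionary and embeds it in a CQL statement. It assumes the clause syntax used in the slide transcript (`query`, `filter`, `type`, `field`, `value`, `lower`, `upper`) and assumes the index accepts standard JSON with quoted keys, which the slides do not confirm. The helper names `lucene_clause` and `top_k_statement`, and the field names `text` and `time`, are hypothetical.

```python
import json


def lucene_clause(search: dict) -> str:
    """Serialize a search definition and quote it as a CQL string literal."""
    body = json.dumps(search)
    # CQL escapes a single quote inside a string literal by doubling it.
    return "'" + body.replace("'", "''") + "'"


def top_k_statement(table: str, phrase: str, lower: str, upper: str, k: int) -> str:
    """Best k rows for a phrase, restricted to a date range."""
    search = {
        # Assumed: "query" contributes to relevance scoring.
        "query": {"type": "match", "field": "text", "value": phrase},
        # Assumed: "filter" only restricts the candidate rows.
        "filter": {"type": "range", "field": "time", "lower": lower, "upper": upper},
    }
    return f"SELECT * FROM {table} WHERE lucene = {lucene_clause(search)} LIMIT {k};"


cql = top_k_statement("tweets", "cassandra", "2014/04/25", "2014/04/30", 100)
print(cql)
```

Building the clause from a dictionary rather than by string concatenation keeps the nesting valid and leaves quote escaping in one place.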

Cluster-wide relevance search retrieves the N most relevant results that meet a given condition. It is done through a modified version of Cassandra’s storage proxy, in which the coordinator node requests the N best results from each node in the cluster in parallel and combines these partial results to get the N best overall.
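
The merge step described above can be sketched in a few lines. This is a simplified model, not Stratio's storage proxy code: nodes are plain lists, a hit is a hypothetical `(score, row id)` pair, and the parallel requests are replaced by a list comprehension.

```python
import heapq
from typing import Iterable, List, Tuple

# A scored hit: (relevance score, row identifier). Both fields are
# illustrative; the real index returns full Cassandra rows.
Hit = Tuple[float, str]


def node_top_n(hits: Iterable[Hit], n: int) -> List[Hit]:
    """What each node does locally: keep only its n best hits."""
    return heapq.nlargest(n, hits)


def coordinator_top_n(partials: Iterable[List[Hit]], n: int) -> List[Hit]:
    """Combine the per-node partial results and keep the n best overall."""
    merged = (hit for partial in partials for hit in partial)
    return heapq.nlargest(n, merged)


# Three hypothetical nodes, each holding a few scored tweets.
node_a = [(0.9, "t1"), (0.2, "t2"), (0.7, "t3")]
node_b = [(0.8, "t4"), (0.1, "t5")]
node_c = [(0.95, "t6"), (0.6, "t7"), (0.3, "t8")]

partials = [node_top_n(node, 2) for node in (node_a, node_b, node_c)]
best = coordinator_top_n(partials, 2)
print(best)
```

Each node only needs to return N rows because any row outside a node's local top N cannot be in the global top N.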

Stratio’s index is fully compatible with Apache Spark and Apache Hadoop because it supports all the key/token restrictions in the CQL3 statements. Filters are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Hadoop or, even better, Spark. Filtering the job input avoids full data scanning, dramatically reducing the amount of data to be processed.
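
To make the role of token restrictions concrete, the sketch below cuts the token ring into contiguous ranges and emits one filtered statement per range, which is roughly how a MapReduce-style job would divide its input. It assumes the Murmur3Partitioner token bounds; `token_splits` and `split_query` are hypothetical helper names, the search clause is copied from the slide transcript, and `userid` is used because it is the partition key of the `tweets` table defined there.

```python
from typing import List, Tuple

# Murmur3Partitioner token bounds (signed 64-bit range); an assumption,
# since the presentation does not say which partitioner is in use.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def token_splits(n: int) -> List[Tuple[int, int]]:
    """Cut the token ring into n contiguous (start, end] ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * width for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(clause: str, partition_key: str, start: int, end: int, limit: int) -> str:
    """One statement per split: index filter plus a token range restriction."""
    return (
        f"SELECT * FROM tweets WHERE lucene = '{clause}' "
        f"AND TOKEN({partition_key}) > {start} "
        f"AND TOKEN({partition_key}) <= {end} LIMIT {limit};"
    )


clause = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
splits = token_splits(4)
queries = [split_query(clause, "userid", s, e, 10000) for s, e in splits]
for q in queries:
    print(q)
```

Each statement touches only the rows that both fall in its token range and match the filter, so no split has to scan its full range.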

Any cell in the tables can be indexed, including primary keys as well as collections. CQL3 wide rows are also supported.

Published in: Technology

  1. Advanced search and Top-K queries in Cassandra
     Daniel Higuero, dhiguero@stratio.com, @dhiguero
     Andrés de la Peña, andres@stratio.com, @a_de_la_pena
  2. Who are we?
     • Stratio is a Big Data company
     • Founded in 2013
     • Commercially launched in 2014
     • 70+ employees in Madrid
     • Office in San Francisco
     • Certified Spark distribution
     #CassandraSummit 2014
  3. CONTENTS
     1. Cassandra query methods
     2. Stratio Lucene-based 2i implementation
     3. Integrating Lucene 2i with Apache Spark
  4. Cassandra query methods
     [Chart: primary key, secondary indexes and token ranges, positioned by throughput and expressiveness]
  5. Primary key queries
     • O(1) node lookup for partition key
     • Range slices for clustering key
     • Usually requires denormalization
     [Diagram: a client reads a clustering key range from a single partition; the partitions apena, aagea and dhiguero are spread over Node 1, Node 2 and Node 3, each holding date-keyed cells such as 2014-04-10:body]
  6. Cassandra query methods
     [Chart repeated from slide 4]
  7. Secondary index queries
     • Inverted index
     • Mitigates denormalization
     • Queries may involve all C* nodes
     • Queries limited to a single column
     [Diagram: a client queries three C* nodes, each with its own local 2i column family]
  8. Cassandra query methods
     [Chart: primary key, secondary indexes and token ranges, positioned by throughput and expressiveness]
  9. Token range queries
     • Used by MapReduce frameworks such as Hadoop or Spark
     • All kinds of queries are possible
     • Low throughput
     • Ad-hoc queries
     • Batch processing
     • Materialized views
     [Diagram: client, Spark master and three C* nodes; query = function(all data)]
  10. Combining 2i with MapReduce
     • Expressiveness avoiding full scans
     • Still limited by one indexed column per query
     [Diagram: client, Spark master and three C* nodes, each with a secondary index]
  11. What do we miss from 2i indexes? MORE EXPRESSIVENESS
     • Range queries
     • Multivariable search
     • Full text search
     • Sorting by fields
     • Top-k queries
  12. What do we like from the existing 2i? ITS ARCHITECTURE
     • Each node indexes its own data
     • The index implementations do not need to be distributed
     • Natural extension point
     • Can be created after design and ingestion
  13. Thinking about a custom secondary index implementation… WHY NOT USE [Lucene logo]?
  14. Why we like Lucene
     • Proven stable and fast indexing solution
     • Expressive queries: multivariable, ranges, full text, sorting, top-k, etc.
     • Mature distributed search solutions built on top of it: Solr, ElasticSearch
     • Can be fully embedded in application code
     • Published under the Apache License
  15. HOW IT WORKS
  16. Create index
     • Built in the background at any moment
     • Real time updates
     • Mapping eases ETL
     • Language aware
     CREATE TABLE tweets (
         id bigint,
         createdAt timestamp,
         message text,
         userid bigint,
         username text,
         PRIMARY KEY (userid, createdAt, id)
     );
     ALTER TABLE tweets ADD lucene TEXT;
     CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
     USING 'com.stratio.index.RowIndex'
     WITH OPTIONS = {
         'refresh_seconds' : '60',
         'schema' : '{
             default_analyzer : "EnglishAnalyzer",
             fields : {
                 createdat : {type : "date", pattern : "yyyy-MM-dd"},
                 message   : {type : "text", analyzer : "EnglishAnalyzer"},
                 userid    : {type : "string"},
                 username  : {type : "string"}
             }
         }'
     };
  17. Filtering query
     SELECT * FROM tweets WHERE lucene = '{
         filter : {type : "match", field : "text", value : "cassandra"}
     }' LIMIT 10;
     [Diagram: the client searches for 10 rows across C* nodes, each with a Lucene index; one node finds 6, another finds 4, and the search is done]
  18. Top-k query
     SELECT * FROM tweets WHERE lucene = '{
         query : {type : "match", field : "text", value : "cassandra"}
     }' LIMIT 5;
     [Diagram: the client searches the top 5 on three C* nodes, each with a Lucene index; the nodes find 5, 4 and 5 rows, and the 14 results are merged to the best 5]
  19. Modifying Cassandra for generic top-k queries
     Two new methods in SecondaryIndexSearcher:
         boolean requiresFullScan(List<IndexExpression> clause);
         List<Row> sort(List<IndexExpression> clause, List<Row> rows);
     Two new methods in AbstractRangeCommand:
         boolean requiresFullScan();
         List<Row> combine(List<Row> rows);
     And some changes in StorageProxy#getRangeSlice…
  20. Queries can be as complex as you want
     SELECT * FROM tweets WHERE lucene = '{
         filter : {
             type : "boolean",
             must : [
                 {type : "range", field : "time", lower : "2014/04/25"},
                 {type : "boolean", should : [
                     {type : "prefix", field : "user", value : "a"},
                     {type : "wildcard", field : "user", value : "*b*"},
                     {type : "match", field : "user", value : "fast"}
                 ]}
             ]
         },
         sort : {
             fields : [
                 {field : "time", reverse : true},
                 {field : "user", reverse : false}
             ]
         }
     }' LIMIT 10000;
  21. Some implementation details
     • A Lucene document per CQL row, and a Lucene field per indexed column
     • SortingMergePolicy keeps the index sorted in the same way that C* does
     • Index commits synchronized with column family flushes
     • Segment merges synchronized with column family compactions
     NO MAINTENANCE REQUIRED
  22. QUERY BUILDER
  23. Query builder: FLUENT QUERY BUILDER
     • Facilitates writing index-related clauses
     • Compatible with the existing C* Query Builder
  24. Query builder example
     SELECT * FROM tweets WHERE lucene = '{
         filter : {type : "range", field : "time", lower : "2014/04/25", upper : "2014/04/30"},
         query : {type : "match", field : "text", value : "cassandra"},
         sort : { fields : [ {field : "time", reverse : true} ] }
     }' LIMIT 10;
     String filter = SearchBuilders
         .filter(range("time")
                     .lower("2014/04/25")
                     .upper("2014/04/30"))
         .query(match("text", "cassandra"))
         .sort(sorting("time", true))
         .toJson();
     QueryBuilder.select()
         .from(KEYSPACE, TABLE)
         .where(eq("lucene", filter))
         .limit(10);
  25. LUCENE AND SPARK
  26. Integrating Lucene & Spark
     Split friendly: it supports searches within a token range
     SELECT * FROM tweets WHERE lucene = '{
         filter : {type : "match", field : "text", value : "cassandra"}
     }'
     AND TOKEN(userid) > 253653456456
     AND TOKEN(userid) <= 3456467456756
     LIMIT 10000;
  27. Integrating Lucene & Spark
     Paging friendly: it supports starting queries at a certain point
     SELECT * FROM tweets WHERE lucene = '{
         filter : {type : "match", field : "text", value : "cassandra"}
     }'
     AND userid = 3543534
     AND createdAt > '2011-02-03 04:05+0000'
     LIMIT 5000;
  28. Integrating Lucene & Spark
     • Compute large amounts of data
     • Avoid systematic full scan
     • Reduces the amount of data to be processed
     • Filtering push-down
     [Diagram: client, Spark master and three C* nodes, each with a Lucene index]
  29. WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
  30. Index performance in Spark
     [Chart: time against records returned, comparing full scan with Lucene 2i]
  31. Lucene indexes in C*: DEMO
  32. OTHER TOOLS
  33. Stratio Deep: INTEGRATING SPARK WITH DIFFERENT DATASTORES
     • Common Cell abstraction in the RDD
     • Maintain compatibility with Spark operations
     • Compatible with multiple datastore technologies
     • DeepSparkContext
     • DeepJobConfig
     • Compatible with Lucene indexes
  34. Stratio Crossdata: UNIFYING BATCH AND STREAMING QUERIES
     • Single SQL-like language
     • Compatible with multiple datastore technologies
     • Connector-based architecture
     • Ability to combine data from different datastores
     • Complement non-native operations with Spark, e.g., JOIN in Cassandra
     • Custom support for Lucene-based secondary indexes
  35. Stratio Crossdata
     CREATING INDEXES
         CREATE FULLTEXT INDEX ON app.users(name, email);
     QUERYING THE INDEXES
         SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
  36. Conclusions
     • Added new query methods
       - Multivariable queries (AND, OR, NOT)
       - Range queries (>, >=, <, <=) and regular expressions
       - Full text queries (match, phrase, fuzzy...)
     • Top-k query support
       - Lucene scoring formula
       - Sort by field values
     • Compatible with MapReduce frameworks
     • Preserves Cassandra’s functionality
  37. It’s open source
     • github.com/stratio/stratio-cassandra
       - Published as fork of Apache Cassandra
       - Apache License Version 2.0
     • stratio.github.io/crossdata
       - Apache License Version 2.0
  38. Advanced search and Top-K queries in Cassandra
     Daniel Higuero, dhiguero@stratio.com, @dhiguero
     Andrés de la Peña, andres@stratio.com, @a_de_la_pena
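
The filtering diagram on slide 17 (search 10, found 6, found 4, done) suggests that a plain filter, unlike a top-k query, can stop as soon as enough rows have been collected. The sketch below models that reading only; it is an interpretation of the diagram, not Stratio's or Cassandra's implementation. A node is modelled as a hypothetical function that returns at most `limit` matching rows, and `requested` records the limits each node was asked for.

```python
from typing import Callable, List, Sequence

# A node is modelled as a function returning at most `limit` matching rows.
Node = Callable[[int], List[str]]


def collect_filtered(nodes: Sequence[Node], limit: int) -> List[str]:
    """Ask nodes one after another until `limit` rows have been collected."""
    rows: List[str] = []
    for search in nodes:
        remaining = limit - len(rows)
        if remaining <= 0:
            break  # enough rows collected; later nodes are never contacted
        rows.extend(search(remaining))
    return rows


def make_node(matches: List[str], log: List[int]) -> Node:
    """Build a fake node that records the limit of every request it gets."""
    def search(limit: int) -> List[str]:
        log.append(limit)
        return matches[:limit]
    return search


requested: List[int] = []
nodes = [
    make_node([f"a{i}" for i in range(6)], requested),  # finds 6
    make_node([f"b{i}" for i in range(4)], requested),  # finds 4
    make_node([f"c{i}" for i in range(8)], requested),  # not needed
]
rows = collect_filtered(nodes, 10)
print(len(rows), requested)
```

A top-k query cannot take this shortcut, because the best rows may sit on any node, which is why slide 18 shows every node being searched before the merge.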
