4. It is Hard to Scale Horizontally!
■ Functions
- Trivial
- Load balancing of stateless services (macro-/microservices)
- More users -> more machines
- Nontrivial
- More machines -> faster response times
■ Data
- Trivial
- Linear distribution of data on multiple machines
- More machines -> more data
- Nontrivial
- Constant response times with growing datasets
4
5. 5
Solr Cloud
- Document-based NoSQL database with outstanding search capabilities
A document is a collection of fields (string, number, date, …)
Single and multiple fields (fields can be arrays)
Nested documents
Static and dynamic schema
Powerful query language (Lucene)
- Horizontally scalable with Solr Cloud
Distributed data in separate shards
Resilience by combination of ZooKeeper and replication
- Powerful aggregations (aka facets)
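An illustrative Solr document showing single, multi-valued, date, and numeric fields (all field names and values are made up for this sketch):

```json
{
  "id": "metric-0001",
  "process": "wls1",
  "host": "lpapp18",
  "tags": ["jmx", "heap"],
  "created": "2016-01-10T09:00:00Z",
  "value": 1073741824
}
```

Nested documents are additionally supported by indexing child documents under a parent document.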
6. 6
The Architecture of Solr Cloud
Two Levels of Distribution
[Diagram: a Solr Cloud collection spread over several Solr servers. Level one: the collection is split into shards (Shard1 … Shard9) to distribute the data. Level two: each shard is replicated (Replica1 … Replica9), with one leader per shard, for resilience and scale out. A ZooKeeper ensemble coordinates the cluster.]
8. READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
■Distributed computing (up to 100x faster than Hadoop M/R)
■Distributed Map/Reduce on distributed data can be done in-memory
■Supports online and batch workloads
■Written in Scala, with Java/Scala/Python APIs
■Processes data from distributed and local sources
-Text files (accessible from all nodes)
-Hadoop File System (HDFS)
-Databases (JDBC)
-Solr via the Lucidworks API
8
16. 16
Parallel Import with Spark Makes Import Scalable
[Diagram: Parallel Cloud Importer. Distributed input data is split over Node1 … NodeN; each node runs a Spark executor with a CloudSolrClient writing to a Solr server. Each executor: read input data, create a batch, add the batch to Solr via add(List<document> batch). Adding nodes scales the import up.]
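The per-executor logic (read input, create batch, add batch to Solr) can be sketched in plain Python; `add_batch` is a hypothetical stand-in for a real CloudSolrClient add call:

```python
from typing import Callable, Iterable, Iterator, List

def batches(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an input stream of documents into fixed-size batches."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

def import_partition(docs: Iterable[dict],
                     add_batch: Callable[[List[dict]], None],
                     batch_size: int = 1000) -> int:
    """Runs once per Spark partition: send documents to Solr in batches.
    `add_batch` stands in for CloudSolrClient.add(...)."""
    sent = 0
    for batch in batches(docs, batch_size):
        add_batch(batch)
        sent += len(batch)
    return sent
```

Batching matters here: sending documents one by one makes the import round-trip-bound instead of throughput-bound.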
20. 20
Indexing 14 million docs in 1:20 min
Import takes 78,411 ms
-> 180,000 docs per second
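The headline number checks out: 14 million documents in 78,411 ms is roughly 180,000 documents per second:

```python
docs = 14_000_000
millis = 78_411
docs_per_second = docs * 1000 / millis
print(f"{docs_per_second:,.0f} docs/s")  # roughly 180,000 docs per second
```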
21. SolrJ and Spark have Different Transitive Dependencies
Depending on the Software Version
■ Adding both libraries to your classpath leads, via transitive dependencies, to serious
problems at runtime (serialization errors / ClassNotFoundExceptions, …)
■ Pinning / exclusion helps, but can produce strange errors. There is
currently no satisfying solution for the big-data classpath hell.
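One common mitigation is to exclude the conflicting transitive artifact in your build file; a hypothetical Maven sketch (the excluded artifact is illustrative only, inspect your own tree with `mvn dependency:tree`):

```xml
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>${solr.version}</version>
  <exclusions>
    <!-- illustrative: exclude a transitive artifact that clashes with Spark's copy -->
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```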
21
23. 23
Using Solr Facet Queries for Aggregation
#
# Grouping per sub query
#
curl $SOLR/$COLLECTION/select -d '
q=process:wls1 AND metric:*.HeapMemoryUsage.used&
rows=0&
json.facet={
Hosts: {
type: terms,
field: host,
facet:{
Off : { query : "value: [* TO 0]" },
Idle : { query : "value: [0 TO 1000000000]" },
Busy : { query : "value: [1000000001 TO 10000000000]" },
Overload : { query : "value: [10000000001 TO *]" }
}
}
}
24. Why Do We Need Even More?
■ Data-centric applications need a scalable way of
- Post-processing search results or facets (business logic, ML,
data analytics)
- Post-filtering search results
- Processing denormalized data (if you store a one-to-many
relation in a single Solr document)
24
25. Accessing Solr from Spark with SolrRDD
■ https://github.com/lucidworks/spark-solr
■ You have to build the library locally. There is no released version at Maven Central.
■ Make sure to adjust the versions depending on your environment.
25
33. A Naive Solr Data Model
A single Solr document per CSV cell
‣ Advantage
You can use Solr for aggregation, sorting and
searching for values or time intervals
‣ Disadvantage
Data explosion (a single compressed CSV file of 3 MB
produces 1 million Solr documents)
33
34. Column Based Denormalization
wls1_lpapp18_jmx.csv
Date CPU % Usage Heap % Usage #GC Invocations
1/10/16 9:00,000 50 50 1000
1/10/16 10:00,000 60 60 1100
1/10/16 11:00,000 70 70 1300
1/10/16 12:00,000 80 80 1800
CSV
SolrDocument {
process: wls1
host: lpapp18
type: jmx
mindate: 1/10/16 9:00
maxdate: 1/10/16 12:00
metric: CPU % Usage
values: [BINARY (Date, Long)]
max: 80
min: 50
avg: 65
}
n : 1 (many CSV rows map to one Solr document)
Store 1000-10000 events in a single document
One document per column
34
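Building one such per-column document from parsed CSV rows is straightforward; a minimal Python sketch (field names follow the slide, the binary packing of `values` is elided):

```python
def column_document(process, host, typ, metric, rows):
    """rows: time-ordered list of (date_string, value) pairs for one CSV column."""
    values = [v for _, v in rows]
    return {
        "process": process,
        "host": host,
        "type": typ,
        "metric": metric,
        "mindate": rows[0][0],   # rows are assumed time-ordered
        "maxdate": rows[-1][0],
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "values": rows,  # would be binary-packed before indexing
    }
```

Precomputing min/max/avg at import time keeps range queries and sorting on the denormalized documents cheap.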
35. Storing a 1-to-1400 Relation in a Single Document
values: [{date: …, value: …}, … ]
Base64 encoded and gzipped
32k limit for DocValues
35
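The packing step can be sketched with the Python standard library; gzip the serialized events, Base64-encode the result, and check that it stays under the 32 KB DocValues limit (the JSON serialization here is an assumption, any compact binary format works the same way):

```python
import base64
import gzip
import json

DOCVALUES_LIMIT = 32 * 1024  # 32k limit for a single DocValues entry

def pack_values(events):
    """events: list of {date, value} dicts -> Base64 string of gzipped JSON."""
    raw = json.dumps(events).encode("utf-8")
    packed = base64.b64encode(gzip.compress(raw)).decode("ascii")
    if len(packed) > DOCVALUES_LIMIT:
        raise ValueError("packed values exceed the 32k DocValues limit")
    return packed

def unpack_values(packed):
    """Inverse of pack_values: Base64-decode, gunzip, parse."""
    return json.loads(gzip.decompress(base64.b64decode(packed)))
```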
36. Benefits of Denormalization
‣ Benefits
- You can scale from xxx million documents in a Solr Cloud up to
trillions of searchable events
- Import is vastly faster
‣ Drawbacks
- Searching on single values requires additional logic
- Counting and faceting requires additional logic
‣ Spark can solve these problems by parallel post processing
- Decompressing, aggregating, joining, grouping
36
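The kind of post-processing that Spark parallelizes (aggregating and grouping over decoded documents) can be sketched in plain Python; here a simple group-by-host average as a stand-in for a Spark groupBy/aggregate stage:

```python
from collections import defaultdict

def aggregate_by_host(docs):
    """Group decoded per-column documents by host and average their values.
    docs: dicts with a "host" field and a "values" list of (date, value) pairs."""
    sums = defaultdict(lambda: [0.0, 0])  # host -> [running total, count]
    for doc in docs:
        s = sums[doc["host"]]
        for _, value in doc["values"]:
            s[0] += value
            s[1] += 1
    return {host: total / count for host, (total, count) in sums.items()}
```

In Spark this logic runs per partition, so decompression and aggregation scale with the number of executors.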
38. 38
Indexing 19 million CSV values
in 13,500 Solr documents
now takes 24 seconds (before: 1:20 min)
-> 800,000 values per second
39. 39
Streaming one billion Solr values into Spark
now takes 34 seconds (before: 700 s)
-> 29,000,000 values per second
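Both throughput figures are consistent with the quoted times:

```python
indexing_rate = 19_000_000 / 24        # ~792,000 values/s, quoted as 800,000
streaming_rate = 1_000_000_000 / 34    # ~29.4 million values/s, quoted as 29,000,000
print(round(indexing_rate), round(streaming_rate))
```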
40. Summary
■ The combination of Solr Cloud and Spark gives you the power to
deal with big-data workloads in real time
■ Denormalization can make your Solr application vastly faster
■ Make use of the /export handler when using the SolrRDD
■ Parallel post processing is mandatory for nontrivial applications
■ If you want to learn more: come to the Chronix talk on Friday
40