Real-World Analytics with Solr Cloud and Spark
Solving Analytic Problems for Billions of Records Within Seconds
Vancouver, May 2016 | Johannes Weigend | QAware GmbH
Johannes Weigend
Apache Big Data North America 2016
May 2016
2. Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH
Any Questions?
Ask, or tweet with the hashtag #cloudnativenerd
The Problem We Want to Solve
■Interactive applications with response times below one second!
■Processing of billions of records (>10^9 rows / records)
■Continuous data import (near real time)
■Applications built on the principles of the Reactive Manifesto
Horizontal Scalability can be difficult!
■Horizontal scalability of functions
  ■Trivial
    ■Load balancing of (stateless) services (macro- / microservices)
    ■More users → more machines
  ■Not trivial
    ■More machines → faster response times
■Horizontal scalability of data
  ■Trivial
    ■Linear distribution of data across multiple machines
    ■More machines → more data
  ■Not trivial
    ■Constant response times with growing datasets
Hadoop Gives Answers for Horizontal Scalability of
Data and Functions
The Processing of Distributed Data Can Be Quite Slow!
[Diagram: data flow over HDFS / NFS / NoSQL. Each node runs Read → Filter → Map in parallel, followed by a single Reduce and foreach(). Runtime: minutes to hours.]
With Prior Indexing and Searching,
Less Data Has to Be Read and Filtered.
[Diagram: data flow over Search / NoSQL. Each node runs Search → Filter → Map in parallel, followed by a single Reduce and foreach(). Runtime: seconds to minutes.]
DEMO
1. Solr Cloud for Analytics
[Diagram: the Spark data flow Search → Filter → Map → Reduce over Search / NoSQL.]
Solr Cloud
■Document-based NoSQL database with outstanding search capabilities
■A document is a collection of fields (string, number, date, …)
■Single-valued and multi-valued fields (fields can be arrays)
■Nested documents
■Static and dynamic schema
■Powerful query language (Lucene)
■Horizontally scalable with Solr Cloud
■Distributed data in separate shards
■Resilience through the combination of ZooKeeper and replication
■Powerful aggregations (aka facets)
■Stable: version 6.0
The Architecture of Solr Cloud
[Diagram: a Solr Cloud cluster of Solr servers coordinated by a ZooKeeper ensemble. A collection is split into shards (Shard1 to Shard9); each shard has a leader and replicas (Replica1 to Replica9) spread across the servers. Adding servers scales the cluster out.]
Solr Stores Everything in a Single "Table" (BigTable).
Searching is Extremely Fast and Powerful.*
[Diagram: Customer (Name, Address) 1 : * Order (Amount, Product)]
Type      ID  Name  Address  Amount  Product  K2B
Customer  1   K 1   A 1      -       -        [3,5]
Customer  2   K 2   A 2      -       -        [4]
Order     3   -     -        Z 1     P 1      [1]
Order     4   -     -        Z 2     P 2      [2]
...
Each row is one SolrDocument.
(*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms
A Solr Cloud can be Started in Seconds.
■ Create a schema by reusing an existing set of Solr config files
■ There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified
cp -r $SOLR_HOME/server/solr/configsets/basic_configs $SOLR_HOME/server/solr/configsets/bigdata2016
■ Start Solr
$SOLR_HOME/bin/solr start -e cloud
■ When the wizard asks for a collection name, use "bigdata2016" (see above)
■ Make a first test
curl "localhost:8983/solr/bigdata2016/query?q=*:*"
With the Solr Cloud Collection API,
Shards can be Created, Changed or Deleted.
■ Create a collection
<<SOLR URL>>/solr/admin/collections?action=CREATE&
name=<<name of collection>>&
numShards=16&
replicationFactor=2&
maxShardsPerNode=8&
collection.configName=<<name of uploaded zookeeper configuration>>
■ Delete a collection
<<SOLR URL>>/solr/admin/collections?action=DELETE&
name=<<name of collection>>
https://cwiki.apache.org/confluence/display/solr/Collections+API
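The two Collections API calls above can be assembled programmatically. The following is a minimal sketch, not part of the original slides; the Solr URL, the collection name and the configuration name are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def collections_api_url(solr_url, params):
    """Build a Collections API URL; all parameter values are URL-encoded."""
    return f"{solr_url}/solr/admin/collections?{urlencode(params)}"


SOLR_URL = "http://localhost:8983"  # assumption: a local Solr node

create_url = collections_api_url(SOLR_URL, {
    "action": "CREATE",
    "name": "bigdata2016",                   # assumed collection name
    "numShards": 16,
    "replicationFactor": 2,
    "maxShardsPerNode": 8,
    "collection.configName": "bigdata2016",  # assumed uploaded config name
})
delete_url = collections_api_url(SOLR_URL, {"action": "DELETE", "name": "bigdata2016"})

print(create_url)
print(delete_url)
```

With these values the collection consists of 16 x 2 = 32 shard replicas, so with maxShardsPerNode=8 at least four Solr nodes are needed to host them.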
ZooKeeper has to be Started First and the Solr Configuration
must be Uploaded to Use a Solr Cloud.
1. Start ZooKeeper on 2n+1 nodes (odd number)
$ZOO_HOME/bin/zkServer.sh start
2. Upload the Solr configuration into ZooKeeper
$SOLR_HOME/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf
3. Start Solr on n nodes connected to the ZooKeeper cluster
$SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102
4. Create a collection with a number of shards and replicas
Example: Solr Cloud for Analytics of Insurance Data
■Insurance sample data with the following fields: Education, Income, Gender, ...
DEMO
Solr Supports JSON Queries via HTTP POST
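A JSON query is an ordinary HTTP POST with a JSON body. The sketch below only builds the request; the host and the collection name "bigdata2016" are assumptions, and the body uses the `query` and `limit` keys of the Solr JSON Request API.

```python
import json
import urllib.request


def json_query_request(solr_url, collection, body):
    """Build (but do not send) a POST request carrying a JSON query body."""
    return urllib.request.Request(
        f"{solr_url}/solr/{collection}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = json_query_request(
    "http://localhost:8983",         # assumption: a local Solr node
    "bigdata2016",                   # assumed collection name
    {"query": "*:*", "limit": 10},   # match all documents, return the first 10
)

# To send it against a running Solr:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```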
Term Facets Group and Count a Single Field.
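As an illustration, a term facet in the JSON Facet API could look like the sketch below. The field name `gender` is an assumption derived from the insurance sample data, and the response passed to the helper in a real run would come from Solr; nothing here was taken from the original slide.

```python
import json


def term_facet_query(field, limit=10):
    """JSON request body that groups all documents by one field and counts them."""
    return {
        "query": "*:*",
        "limit": 0,  # no documents needed, only the facet counts
        "facet": {f"by_{field}": {"type": "terms", "field": field, "limit": limit}},
    }


def buckets_to_counts(response, facet_name):
    """Turn the buckets of a term facet response into a {value: count} dict."""
    buckets = response["facets"][facet_name]["buckets"]
    return {bucket["val"]: bucket["count"] for bucket in buckets}


body = term_facet_query("gender")  # assumed field name
print(json.dumps(body, indent=2))
```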
Function Facets Aggregate Fields.
http://yonik.com/solr-facet-functions/
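A function facet is written as a string such as `avg(income)` inside the `facet` section of a JSON query. The helper below is a hypothetical convenience for building such a section; the field name `income` is an assumption based on the insurance sample data.

```python
import json


def function_facets(field, functions):
    """Map each aggregation function to a facet entry, e.g. avg -> "avg(income)"."""
    return {f"{function}_{field}": f"{function}({field})" for function in functions}


body = {
    "query": "*:*",
    "limit": 0,  # only the aggregated values are of interest
    "facet": function_facets("income", ["avg", "min", "max", "sum"]),
}
print(json.dumps(body, indent=2))
```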
Pivot Facets Compose Facets into Hierarchies.
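One way to express such a hierarchy is to nest term facets in the JSON Facet API, as sketched below; the classic request parameter `facet.pivot=gender,education` is an alternative. The field names are assumptions based on the insurance sample data, and the helper is hypothetical.

```python
import json


def nested_terms(fields, stats=None):
    """Nest term facets so that the first field is the outermost level.

    `stats` are optional function facets computed in the innermost buckets.
    """
    facet = dict(stats or {})
    for field in reversed(fields):
        level = {"type": "terms", "field": field}
        if facet:
            level["facet"] = facet  # attach the deeper level built so far
        facet = {f"by_{field}": level}
    return facet


body = {
    "query": "*:*",
    "limit": 0,
    "facet": nested_terms(["gender", "education"], {"avg_income": "avg(income)"}),
}
print(json.dumps(body, indent=2))
```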
Solr 6 Supports SQL
■ Solr 6 supports distributed SQL
■ The JDBC driver is part of the SolrJ client library
■ A collection is currently mapped as a single table
■ Collection -> Table
■ SolrDocument -> Row
■ Field -> Column
■ SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions
■ No database metadata, no prepared statements, no mapping to tables per type field
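Besides JDBC, Solr 6 accepts SQL over HTTP through the `/sql` request handler, which takes the statement in a `stmt` form parameter. The sketch below only builds such a request; the host, the collection name "bigdata2016" and the field `gender` are assumptions.

```python
import urllib.request
from urllib.parse import urlencode


def sql_request(solr_url, collection, stmt, aggregation_mode="facet"):
    """Build (but do not send) a POST request for the Solr /sql handler."""
    query = urlencode({"aggregationMode": aggregation_mode})
    return urllib.request.Request(
        f"{solr_url}/solr/{collection}/sql?{query}",
        data=urlencode({"stmt": stmt}).encode("ascii"),  # form-encoded statement
        method="POST",
    )


# The collection plays the role of the table; the field is a column.
statement = "SELECT gender, count(*) FROM bigdata2016 GROUP BY gender"
request = sql_request("http://localhost:8983", "bigdata2016", statement)
print(request.full_url)
```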
Resilience
■The number of replicas per shard is configurable (replication factor)
■This number determines how many nodes can fail silently: a shard stays available as long as one of its replicas is alive
■ZooKeeper is the single point of failure, but it can be made fault-tolerant by running multiple instances
■Solr knows all ZooKeeper instances and can silently switch over to the next available instance if the one it was connected to crashes
You Have Everything You Need! Or Not?
■Client-side processing of Solr documents does not scale
■No possibility to run parallel business logic inside Solr
■The Solr index is not a general-purpose store for huge data
  ■Images
  ■Videos
  ■Binaries / large text documents
■No interface to machine learning or typical statistics libraries (R) ...
Apache Spark
■Distributed computing (up to 100x faster than Hadoop MapReduce)
■Distributed Map/Reduce on distributed data can be done in memory
■Written in Scala (JVM)
■Java/Scala/Python APIs
■Processes data from distributed and non-distributed sources
  ■Text files (accessible from all nodes)
  ■Hadoop File System (HDFS)
  ■Databases (JDBC)
  ■Solr via the Lucidworks API
  ■...
READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[Diagram: Spark cluster architecture. A driver application on the driver node creates a SparkContext (configured with a master URL) and uses Resilient Distributed Datasets (RDDs). The master (standalone master / YARN / Mesos) has the workers (slaves) start executor JVMs; each executor runs tasks, and each task processes one partition of the RDD.]
A Very First Spark Application
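The code of this slide is not part of the transcript. As a stand-in, here is a minimal PySpark sketch of what a first application might look like: it distributes a range of numbers, squares each one on the executors and sums the results. The function takes the SparkContext as a parameter, so the file itself does not need Spark to be importable; all names are illustrative.

```python
from operator import add


def square(x):
    """The function that is shipped to the executors."""
    return x * x


def sum_of_squares(sc, n, partitions=4):
    """Distribute 1..n over `partitions` partitions, square each number, sum up."""
    numbers = sc.parallelize(range(1, n + 1), partitions)
    return numbers.map(square).reduce(add)


# Usage with a local Spark installation (requires the pyspark package):
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "first-app")
#   print(sum_of_squares(sc, 1000))
#   sc.stop()
```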
Spark Pattern 1: Distributed Task with Params
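The slide's code is not in the transcript, so the following is only one plausible reading of this pattern: the driver holds a list of parameters, and Spark runs the same task once per parameter on the cluster. The prime-counting task is a hypothetical CPU-bound workload chosen for illustration.

```python
def count_primes_below(limit):
    """Example task (hypothetical workload): count the primes smaller than limit."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )


def run_distributed(sc, task, params):
    """Run task(param) for every parameter, one Spark partition per parameter."""
    params = list(params)
    return sc.parallelize(params, max(1, len(params))).map(task).collect()


# Usage (requires the pyspark package and a SparkContext `sc`):
#   results = run_distributed(sc, count_primes_below, [10_000, 20_000, 30_000])
```

Using one partition per parameter lets each task be scheduled independently; for many small tasks a lower partition count would reduce scheduling overhead.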
Spark Pattern 2: Distributed Read from External Sources
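A possible shape of this pattern, sketched under assumptions since the slide's code is not in the transcript: the driver distributes only the names of the sources (files, shard URLs, ...), and each executor reads its sources itself, so the data never passes through the driver. The text-file reader is an illustrative example and assumes the paths are accessible from all nodes.

```python
def read_lines(path):
    """Example reader: return the lines of a text file without line endings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_distributed(sc, sources, read_from_source):
    """Return an RDD of records; every source is read on an executor.

    `read_from_source(source)` must return an iterable of records.
    """
    sources = list(sources)
    return sc.parallelize(sources, max(1, len(sources))).flatMap(read_from_source)


# Usage (requires the pyspark package and a SparkContext `sc`):
#   records = read_distributed(sc, ["/data/a.log", "/data/b.log"], read_lines)
#   print(records.count())
```

For Solr, the sources would be the shards of a collection and the reader a function that streams the documents of one shard.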
Spark Pattern 3: Caching and Further Processing with RDDs
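A minimal sketch of this pattern, assuming an RDD of numbers (the slide's code is not in the transcript): the RDD is marked for caching once, so the first action materializes it in executor memory and the following actions reuse it instead of reading the source again.

```python
def summarize(records):
    """Compute several results from one RDD of numbers while loading it only once."""
    records.cache()  # partitions stay in executor memory after the first action
    return {
        "count": records.count(),                            # first action fills the cache
        "positive": records.filter(lambda x: x > 0).count(),  # served from the cache
        "max": records.max(),                                # served from the cache
    }


# Usage (requires the pyspark package and a SparkContext `sc`):
#   print(summarize(sc.parallelize([3, -1, 7, 0, 5])))
```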
DEMO
How to Implement readFromShard()?
■ Several possibilities:
  ■ SolrJ: SolrStream
    ■ The /export handler can stream bulk data out of Solr
    ■ Supports JSON export only (no binary format!)
  ■ Or: SolrJ cursor marks
  ■ Or: Custom export handler
http://localhost:8983/solr/jax2016/export?q=*:*&sort=id%20asc&fl=id&wt=json
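The cursor mark option can be sketched over plain HTTP as follows. This is not the SolrJ code from the talk: the paging loop takes a `fetch_page` function as a parameter, and `http_fetcher` is a hypothetical helper that issues `/select` requests against one shard or collection URL. Cursor paging requires a sort that includes the unique key (assumed to be `id` here).

```python
import json
import urllib.request
from urllib.parse import urlencode


def read_all(fetch_page, query="*:*", sort="id asc", rows=1000):
    """Yield every matching document by following Solr cursor marks."""
    cursor = "*"  # "*" asks Solr for the first page
    while True:
        page = fetch_page({"q": query, "sort": sort, "rows": rows, "cursorMark": cursor})
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means there are no more results
            break
        cursor = next_cursor


def http_fetcher(base_url):
    """Return a fetch_page function for a URL like http://host:8983/solr/<core>."""
    def fetch_page(params):
        url = f"{base_url}/select?{urlencode({**params, 'wt': 'json'})}"
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    return fetch_page


# Usage against a running Solr (URL is an assumption):
#   for doc in read_all(http_fetcher("http://localhost:8983/solr/jax2016")):
#       print(doc["id"])
```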
LucidWorks has released a Spark/Solr Integration Library.
https://github.com/lucidworks/spark-solr
Lucidworks Solr-Spark Adapter V 2.1
[Code screenshot with numbered callouts 1 to 4]
Logfile Analytics with Solr and Spark
■Histogram of all exceptions from hosts A, B, C during time interval D
■Step 1: Search with Solr
  ■Solr query: q=*Exception AND server:(A OR B OR C) AND timestamp:[2015-01-01T00:00:00Z TO 2015-12-31T23:59:59Z]
■Step 2: Create a map with key = << exception name >>, value = count
  ■Group with Spark
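The two steps could be sketched as follows. The first function builds the Solr query of Step 1 in Lucene range syntax; the second groups with Spark, assuming an RDD of documents (dicts) that each carry the exception name in a field called `exception`. That field name, and the way the RDD is obtained (for example through the Lucidworks spark-solr library), are assumptions.

```python
from operator import add


def exception_query(hosts, start, end):
    """Step 1: Solr query for exceptions on the given hosts within [start, end]."""
    servers = " OR ".join(hosts)
    return f"*Exception AND server:({servers}) AND timestamp:[{start} TO {end}]"


def exception_histogram(documents):
    """Step 2: map every document to (exception name, 1) and sum the ones per name."""
    return (
        documents
        .map(lambda doc: (doc["exception"], 1))  # assumed field name
        .reduceByKey(add)
        .collectAsMap()
    )


print(exception_query(["A", "B", "C"], "2015-01-01T00:00:00Z", "2015-12-31T23:59:59Z"))
```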
DEMO
Specifications: Intel NUC6i5SYK
CPU: 6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
RAM: 32 GB dual-channel DDR4 SODIMMs (1.2 V, 2133 MHz)
Disk: 256 GB Samsung M.2 internal SSD
Total: 8 cores, 16 HT units, 128 GB RAM, 1 TB disk
→ This case is as powerful as four notebooks
Technical Cluster Architecture
[Diagram: four Ubuntu Linux nodes (#1 to #4) with HDFS spanning all of them. Every node runs Solr Cloud and Spark (master JVM, slave JVM, executor JVM). ZooKeeper runs on nodes #1 to #3, Zeppelin on nodes #1 and #2. The 16 shards s1 to s16 are distributed four per node.]
You Can Even Run Solr Cloud and Spark on $70 Odroid XU4 ARM Computers
■ 8 cores
■ Approx. 1/10 of the CPU performance of the Intel NUC 6 / Core i5
[Diagram: a cluster of five Odroid XU4 boards ($70 each; 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux). Every board runs Solr 5.3 and a Spark worker; the cluster also runs one Spark master and ZooKeeper. Total: 40 cores, 10 GB RAM, 320 GB eMMC disk.]
Summary
■Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
■Writing distributed software remains hard. Only distribute if you have to.
■100% open source
■A simple integration of Solr and Spark is easy. High-performance applications can be more complicated.
■If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform
@JohannesWeigend
@qaware
slideshare.net/qaware
blog.qaware.de