

Real-World Analytics with Solr Cloud and Spark
Solving Analytic Problems for Billions of Records Within Seconds
Vancouver, May 2016 | Johannes Weigend | QAware GmbH
Johannes Weigend
Apache Big Data North America 2016
May 2016
Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH
Any Questions?
Ask, or tweet with the hashtag #cloudnativenerd


The Problem We Want to Solve
■Interactive applications with response times below one second
■Processing of billions of records (> 10^9 rows / records)
■Continuous data import (near real time)
■Applications built on the principles of the Reactive Manifesto
Horizontal Scalability can be difficult!
■Horizontal scalability of functions
  ■Trivial
    ■Load balancing of (stateless) services (macro- / microservices)
    ■More users → more machines
  ■Not trivial
    ■More machines → faster response times
■Horizontal scalability of data
  ■Trivial
    ■Linear distribution of data across multiple machines
    ■More machines → more data
  ■Not trivial
    ■Constant response times with growing datasets
Hadoop Gives Answers for Horizontal Scalability of
Data and Functions
The Processing of Distributed Data can be Quite Slow!
[Diagram: data flow. Each node reads, filters and maps its part of the data (Read → Filter → Map), followed by Reduce and foreach(). Storage: HDFS / NFS / NoSQL. Runtime: minutes to hours.]
With Prior Indexing and Searching,
Less Data Has to Be Read and Filtered.
[Diagram: data flow. A distributed search takes over the filtering (Search → Map → Reduce → foreach()). Storage: Search / NoSQL. Runtime: seconds to minutes.]
[Diagram: architecture layers. Frontend; business layer; cluster processing with Spark (Search → Map → Reduce); distributed data.]
DEMO
1. Solr Cloud for Analytics
[Diagram: data flow Filter / Search → Map → Reduce on Search / NoSQL]
■Document-based NoSQL database with outstanding search capabilities
■A document is a collection of fields (string, number, date, …)
■Single- and multi-valued fields (fields can be arrays)
■Nested documents
■Static and dynamic schema
■Powerful query language (Lucene)
■Horizontally scalable with Solr Cloud
  ■Distributed data in separate shards
  ■Resilience through the combination of ZooKeeper and replication
■Powerful aggregations (aka facets)
■Stable: version 6.0
The Architecture of Solr Cloud
[Diagram: a Solr Cloud cluster. Several Solr servers are coordinated by a ZooKeeper ensemble. A collection is divided into shards (Shard1 to Shard9); each shard has a leader and replicas (Replica1 to Replica9). Adding Solr servers scales the cluster out.]
Solr Stores Everything in a Single "Table" (BigTable).
Searching is Extremely Fast and Powerful.*
Data model: Customer (Name, Address) and Order (Amount, Product), linked by a 1-to-* relation.
Each row below is one SolrDocument.

Type      ID  Name  Address  Amount  Product  K2B
Customer  1   K1    A1       -       -        [3,5]
Customer  2   K2    A2       -       -        [4]
Order     3   -     -        Z1      P1       [1]
Order     4   -     -        Z2      P2       [2]
...

(*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms
A Solr Cloud can be Started in Seconds.
■ Create a schema by reusing an existing set of Solr config files
■ There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified
    cp -r $SOLR_HOME/server/solr/configsets/basic_configs $SOLR_HOME/server/solr/configsets/bigdata2016
■ Start Solr
    $SOLR_HOME/bin/solr start -e cloud
■ When the wizard asks for a collection name, use "bigdata2016" (see above)
■ Make a first test
    curl "http://localhost:8983/solr/bigdata2016/query?q=*:*"
With the Solr Cloud Collection API,
Shards can be Created, Changed or Deleted.
■ Create a collection
    <<SOLR URL>>/solr/admin/collections?action=CREATE&name=<<name of collection>>&numShards=16&replicationFactor=2&maxShardsPerNode=8&collection.configName=<<name of uploaded zookeeper configuration>>
■ Delete a collection
    <<SOLR URL>>/solr/admin/collections?action=DELETE&name=<<name of collection>>
https://cwiki.apache.org/confluence/display/solr/Collections+API
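The Collections API calls are plain HTTP GET URLs, so they are easy to assemble in a script. A minimal Python sketch, assuming a Solr node at http://localhost:8983 and a collection and config both named "bigdata2016" (all three are illustrative placeholders); it only builds the URLs and sends nothing:

```python
from urllib.parse import urlencode


def collections_url(solr_url, action, params):
    """Build a Collections API URL. Nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{solr_url}/solr/admin/collections?{query}"


# Assumed values for illustration only: base URL, collection and config name.
SOLR_URL = "http://localhost:8983"

create_url = collections_url(SOLR_URL, "CREATE", {
    "name": "bigdata2016",
    "numShards": 16,
    "replicationFactor": 2,
    "maxShardsPerNode": 8,
    "collection.configName": "bigdata2016",
})
delete_url = collections_url(SOLR_URL, "DELETE", {"name": "bigdata2016"})

print(create_url)
print(delete_url)
```

The resulting strings can be passed to curl or urllib.request.urlopen against a running cluster.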
ZooKeeper has to be Started First and the Solr Configuration
must be Uploaded to Use a Solr Cloud.
1. Start ZooKeeper on 2n+1 nodes (odd number)
    $ZOO_HOME/bin/zkServer.sh start
2. Upload the Solr configuration into ZooKeeper
    $SOLR_HOME/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf
3. Start Solr on n nodes connected to the ZooKeeper cluster
    $SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102
4. Create a collection with a number of shards and replicas
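Why an odd number of ZooKeeper nodes? ZooKeeper stays available only while a majority of the ensemble is up, so an even-sized ensemble tolerates no more failures than the next smaller odd one. A small Python sketch of that arithmetic (the function name is made up for this example):

```python
def tolerated_failures(ensemble_size):
    """Nodes that may fail while a strict majority of the ensemble stays up."""
    return (ensemble_size - 1) // 2


for size in (1, 3, 4, 5):
    print(f"{size} ZooKeeper node(s) tolerate {tolerated_failures(size)} failure(s)")
```

Going from 3 to 4 nodes adds hardware but no extra fault tolerance; only the step to 5 does.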
Example: Solr Cloud for Analytics of Insurance Data
■Insurance sample data with the following fields
Education, Income, Gender, ...
DEMO
Solr Supports JSON Queries via HTTP POST
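The query shown on this slide is not preserved in the transcript. As a rough sketch of the idea, the following Python builds a JSON request for Solr's /query endpoint. The base URL, the collection name "bigdata2016" and the field names gender and income are assumptions based on the insurance example, not the original demo code:

```python
import json
from urllib.request import Request


def json_query_request(solr_url, collection, body):
    """Wrap a JSON query body in an HTTP POST request (not sent here)."""
    return Request(
        f"{solr_url}/solr/{collection}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed field names; adjust them to the actual schema.
body = {
    "query": "gender:F",
    "filter": ["income:[50000 TO *]"],
    "limit": 10,
}
req = json_query_request("http://localhost:8983", "bigdata2016", body)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Against a running Solr, urllib.request.urlopen(req) would send the request and return the JSON response.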
Term Facets Group and Count a Single Field.
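A term facet in the JSON Facet API might look like the sketch below. The field name education is an assumption taken from the insurance example, and the sample response is made up to show the bucket layout Solr returns:

```python
import json

# Term facet: group by one field and count the documents per value.
term_facet_query = {
    "query": "*:*",
    "limit": 0,  # only the facet counts are wanted, no documents
    "facet": {
        "by_education": {"type": "terms", "field": "education"},
    },
}


def bucket_counts(response, facet_name):
    """Turn the buckets of one term facet into a {value: count} dict."""
    buckets = response["facets"][facet_name]["buckets"]
    return {bucket["val"]: bucket["count"] for bucket in buckets}


# Made-up response in the layout of a JSON Facet API answer.
sample_response = {
    "facets": {
        "count": 5,
        "by_education": {"buckets": [
            {"val": "bachelor", "count": 3},
            {"val": "master", "count": 2},
        ]},
    },
}

print(json.dumps(term_facet_query))
print(bucket_counts(sample_response, "by_education"))
```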
Function Facets Aggregate Fields.
http://yonik.com/solr-facet-functions/
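Function facets compute an aggregate over a field instead of counting values. A minimal Python sketch that generates such a request body; the helper name and the field income are assumptions for illustration, while avg, min and max are aggregation functions of the JSON Facet API:

```python
import json


def stat_facets(field, functions):
    """Build one function facet per aggregation, e.g. avg_income: avg(income)."""
    return {f"{fn}_{field}": f"{fn}({field})" for fn in functions}


function_facet_query = {
    "query": "*:*",
    "limit": 0,
    "facet": stat_facets("income", ["avg", "min", "max"]),
}

print(json.dumps(function_facet_query, indent=2))
```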
Pivot Facets Compose Facets into Hierarchies.
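One way to express such a hierarchy is to nest term facets in the JSON Facet API; whether the original demo used this or the classic facet.pivot parameter is not visible in the transcript. The sketch below assumes the fields gender, education and income and a made-up helper name:

```python
import json


def pivot_facet(fields, leaf_stats):
    """Nest term facets from the outermost field inwards, with stats at the leaves."""
    facet = dict(leaf_stats)
    for field in reversed(fields):
        facet = {f"by_{field}": {"type": "terms", "field": field, "facet": facet}}
    return facet


pivot_query = {
    "query": "*:*",
    "limit": 0,
    "facet": pivot_facet(["gender", "education"], {"avg_income": "avg(income)"}),
}

print(json.dumps(pivot_query, indent=2))
```

Each gender bucket then contains education buckets, and each of those carries the average income.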
Solr 6 Supports SQL
■ Solr 6 supports distributed SQL
■ The JDBC driver is part of the SolrJ client library
■ A collection is currently mapped as a single table
  ■ Collection -> Table
  ■ SolrDocument -> Row
  ■ Field -> Column
■ SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions
  ■ No database metadata, no prepared statements, no mapping to tables per type field
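Besides JDBC, Solr 6 also accepts SQL over HTTP through the /sql request handler, which is easier to show in a short script. A minimal Python sketch, assuming a node at http://localhost:8983, a collection "bigdata2016" and a field education; the request is only constructed, not sent:

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def sql_request(solr_url, collection, stmt):
    """Build a POST request for the /sql handler with a form-encoded stmt."""
    return Request(
        f"{solr_url}/solr/{collection}/sql?aggregationMode=facet",
        data=urlencode({"stmt": stmt}).encode("utf-8"),
        method="POST",
    )


# The collection name doubles as the table name; education is an assumed field.
stmt = ("SELECT education, count(*) FROM bigdata2016 "
        "GROUP BY education ORDER BY count(*) desc LIMIT 10")
sql_req = sql_request("http://localhost:8983", "bigdata2016", stmt)

print(sql_req.full_url)
print(parse_qs(sql_req.data.decode("utf-8"))["stmt"][0])
```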
Resilience
■The number of replicas per shard is configurable (replication factor)
■With replication factor r, a shard stays available as long as at most r - 1 of the nodes hosting it fail
■ZooKeeper would be a single point of failure, but it can be made fault tolerant by running multiple instances
■Solr knows all ZooKeeper instances and can silently switch over to the next available instance if the one it was connected to crashes
You've Got Everything You Need! Or Not?
■Client-side processing of Solr documents does not scale
■No way to run parallel business logic inside Solr
■The Solr index is not a general-purpose store for huge data
  ■Images
  ■Videos
  ■Binaries / large text documents
■No interface to machine learning or typical statistics libraries (R) ...
Distributed In-Memory Computing
with Apache Spark
[Diagram: data flow Filter / Search → Map → Reduce on Search / NoSQL]
■Distributed computing (up to 100x faster than Hadoop MapReduce)
■Distributed Map/Reduce on distributed data can be done in memory
■Written in Scala (JVM)
■Java/Scala/Python APIs
■Processes data from distributed and non-distributed sources
  ■Text files (accessible from all nodes)
  ■Hadoop File System (HDFS)
  ■Databases (JDBC)
  ■Solr via the Lucidworks API
  ■...
READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
[Diagram: Spark cluster. The driver application on the driver node creates a SparkContext (configured with a master URL) and uses Resilient Distributed Datasets (RDDs), which are split into partitions. The master (standalone master / YARN / Mesos) has the workers on the slave hosts start executor JVMs, and the executors run the tasks for the partitions.]
A Very First Spark Application
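The code from this slide is not part of the transcript. A first Spark application could look roughly like the PySpark sketch below; the app name, the master URL "local[*]" and the even-number job are assumptions chosen for illustration. PySpark is imported inside the function so the file loads even where Spark is not installed:

```python
def count_even(numbers):
    """Plain-Python reference result for the Spark job below."""
    return sum(1 for n in numbers if n % 2 == 0)


def run_first_app(master_url="local[*]"):
    """Count the even numbers below 1000 on a Spark cluster."""
    # Lazy import: requires a PySpark installation only when actually called.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("first-app").setMaster(master_url)
    sc = SparkContext(conf=conf)
    try:
        rdd = sc.parallelize(range(1000), 4)  # distribute over 4 partitions
        return rdd.filter(lambda n: n % 2 == 0).count()
    finally:
        sc.stop()


print(count_even(range(1000)))
```

Calling run_first_app() with a cluster's master URL should return the same number as count_even(range(1000)).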
Spark Pattern 1: Distributed Task with Params
Spark Pattern 2: Distributed Read from External Sources
Spark Pattern 3: Caching and Further Processing with RDDs
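The three patterns named on these slides can be sketched in PySpark as follows. The slide code itself is not preserved, so simulate and read_from_shard are hypothetical stand-ins, and sc is assumed to be an existing SparkContext:

```python
import random


def simulate(seed, trials):
    """Hypothetical parameterized task: seeded pseudo-random coin flips."""
    rng = random.Random(seed)
    return sum(1 for _ in range(trials) if rng.random() < 0.5)


def read_from_shard(shard):
    """Hypothetical reader; a real one would fetch the documents of one shard."""
    return [f"{shard}-doc{i}" for i in range(3)]


def run_patterns(sc, shards):
    """Run all three patterns on an existing SparkContext `sc`."""
    # Pattern 1: distribute parameters, run one task per parameter.
    results = sc.parallelize(range(8), 8).map(lambda s: simulate(s, 1000)).collect()

    # Pattern 2: one partition per shard; every task reads its own shard.
    docs = sc.parallelize(shards, len(shards)).flatMap(read_from_shard)

    # Pattern 3: cache the RDD, then run several computations on it.
    docs.cache()
    total = docs.count()
    per_shard = (docs.map(lambda d: (d.split("-")[0], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .collectAsMap())
    return results, total, per_shard


print(read_from_shard("s1"))
```

Because the RDD is cached, the second computation (per_shard) reuses the documents already read for count() instead of reading the shards again.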
DEMO
Putting It All Together
Solr & Spark in Action
[Diagram: data flow Filter / Search → Map → Reduce on Search / NoSQL]
How to implement readFromShard()?
■ Several possibilities:
  ■ SolrJ: SolrStream
    ■ The /export handler can stream bulk data out of Solr
    ■ Supports JSON export only (no binary format!)
  ■ Or: SolrJ cursor marks
  ■ Or: Custom export handler
http://localhost:8983/solr/jax2016/export?q=*:*&sort=id%20asc&fl=id&wt=xml
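The cursor mark option can be sketched in a few lines. Solr starts a cursor with cursorMark=*, returns nextCursorMark with every page, and signals the end when the mark stops changing; the sort must include the unique key. In this Python sketch the fetch callable is an assumption: it stands for whatever sends the parameters to Solr and returns the decoded JSON, and fake_fetch merely imitates three pages:

```python
def read_all(fetch, query, rows=1000, sort="id asc"):
    """Yield every matching document by following Solr cursor marks.

    `fetch(params)` is supplied by the caller and returns the decoded
    JSON response for the given request parameters.
    """
    cursor = "*"  # initial cursor mark
    while True:
        response = fetch({"q": query, "sort": sort, "rows": rows, "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: no more results
            return
        cursor = next_cursor


def fake_fetch(params):
    """Stand-in for a real Solr call, for demonstration only."""
    pages = {"*": (["a", "b"], "c1"), "c1": (["c"], "c2"), "c2": ([], "c2")}
    ids, next_cursor = pages[params["cursorMark"]]
    return {"response": {"docs": [{"id": i} for i in ids]}, "nextCursorMark": next_cursor}


ids = [doc["id"] for doc in read_all(fake_fetch, "*:*", rows=2)]
print(ids)
```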
Lucidworks Has Released a Spark/Solr Integration Library.

https://github.com/lucidworks/spark-solr
[Figure with callouts 1 to 4: Lucidworks Solr-Spark Adapter V 2.1]
Logfile Analytics with Solr and Spark
■Histogram of all exceptions from hosts A, B and C during time interval D
■Step 1: Search with Solr
  ■Solr query: q=*Exception AND (server:A OR server:B OR server:C) AND timestamp between [1.1.2015, 31.12.2015]
■Step 2: Create a map with key = << exception name >>, value = count
  ■Group with Spark
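The two steps can be sketched in Python as follows. The field names message, server and timestamp, the regular expression and all function names are assumptions for illustration; the Spark variant expects an RDD of log message strings (for example one read from Solr) and is not executed here:

```python
import re
from collections import Counter

EXCEPTION = re.compile(r"\b([A-Za-z_][\w.]*Exception)\b")


def solr_query(hosts, start, end):
    """Step 1: Solr query string using range syntax for the time interval."""
    servers = " OR ".join(f"server:{host}" for host in hosts)
    return f"message:*Exception AND ({servers}) AND timestamp:[{start} TO {end}]"


def exception_name(message):
    """Extract the first exception class name from a log message, if any."""
    match = EXCEPTION.search(message)
    return match.group(1) if match else None


def histogram(messages):
    """Step 2 on a single machine: exception name -> count."""
    return Counter(name for name in map(exception_name, messages) if name)


def histogram_with_spark(rdd):
    """Step 2 on Spark: the same grouping, distributed with reduceByKey."""
    return (rdd.map(exception_name)
               .filter(lambda name: name is not None)
               .map(lambda name: (name, 1))
               .reduceByKey(lambda a, b: a + b)
               .collectAsMap())


print(solr_query(["A", "B", "C"], "2015-01-01T00:00:00Z", "2015-12-31T23:59:59Z"))
print(histogram(["java.lang.NullPointerException: x", "IOException while reading", "all fine"]))
```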
DEMO
Specifications – Intel NUC6i5SYK

CPU:   6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz, up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
RAM:   32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
Disk:  256 GB Samsung M.2 internal SSD
Total: 8 cores, 16 HT units, 128 GB RAM, 1 TB disk
→ This case is as powerful as four notebooks
Technical Cluster Architecture
[Diagram: four Ubuntu Linux nodes (#1 to #4). Each node runs Solr Cloud and Spark (master JVM, slave JVM and executor JVM). ZooKeeper runs on nodes #1 to #3, Zeppelin on nodes #1 and #2; HDFS is also shown. The collection has 16 shards (s1 to s16), four per node.]
You Can Even Run Solr Cloud and Spark on $70 Odroid XU4 ARM Computers
■ 8 cores per board
■ ca. 1/10 of the CPU performance of the Intel NUC 6 / Core i5
Per node: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70
Cluster: four nodes running a Spark worker and Solr 5.3; one node running the Spark master, a Spark worker, Solr 5.3 and ZooKeeper
Total: 40 cores, 10 GB RAM, 320 GB eMMC disk
Summary
■Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
■Writing distributed software stays hard. Only distribute if you have to.
■100% open source
■A simple integration of Solr and Spark is easy. For high-performance applications, things can get more complicated.
■If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform
@JohannesWeigend
@qaware
slideshare.net/qaware
blog.qaware.de
51

Mais conteúdo relacionado

Mais procurados

Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Alexey Kharlamov
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
DataWorks Summit
 

Mais procurados (20)

Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 

Destaque

Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
Secure Architecture and Programming 101
Secure Architecture and Programming 101Secure Architecture and Programming 101
Secure Architecture and Programming 101
QAware GmbH
 

Destaque (20)

Real World Analytics
Real World AnalyticsReal World Analytics
Real World Analytics
 
Hadoop assignment 1
Hadoop assignment 1Hadoop assignment 1
Hadoop assignment 1
 
Kubernetes 101 and Fun
Kubernetes 101 and FunKubernetes 101 and Fun
Kubernetes 101 and Fun
 
Hands-on K8s: Deployments, Pods and Fun
Hands-on K8s: Deployments, Pods and FunHands-on K8s: Deployments, Pods and Fun
Hands-on K8s: Deployments, Pods and Fun
 
JEE on DC/OS - MesosCon Europe
JEE on DC/OS - MesosCon EuropeJEE on DC/OS - MesosCon Europe
JEE on DC/OS - MesosCon Europe
 
Lightweight developer provisioning with gradle and seu as-code
Lightweight developer provisioning with gradle and seu as-codeLightweight developer provisioning with gradle and seu as-code
Lightweight developer provisioning with gradle and seu as-code
 
Microservices @ Work - A Practice Report of Developing Microservices
Microservices @ Work - A Practice Report of Developing MicroservicesMicroservices @ Work - A Practice Report of Developing Microservices
Microservices @ Work - A Practice Report of Developing Microservices
 
Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)
 
Cloud Native Unleashed
Cloud Native UnleashedCloud Native Unleashed
Cloud Native Unleashed
 
Everything as-code. Polyglotte Entwicklung in der Praxis. #oop2017
Everything as-code. Polyglotte Entwicklung in der Praxis. #oop2017Everything as-code. Polyglotte Entwicklung in der Praxis. #oop2017
Everything as-code. Polyglotte Entwicklung in der Praxis. #oop2017
 
Automotive Information Research driven by Apache Solr
Automotive Information Research driven by Apache SolrAutomotive Information Research driven by Apache Solr
Automotive Information Research driven by Apache Solr
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Secure Architecture and Programming 101
Secure Architecture and Programming 101Secure Architecture and Programming 101
Secure Architecture and Programming 101
 
Die Leichtigkeit des Seins: Bindings für Eclipse SmartHome entwickeln
Die Leichtigkeit des Seins: Bindings für Eclipse SmartHome entwickelnDie Leichtigkeit des Seins: Bindings für Eclipse SmartHome entwickeln
Die Leichtigkeit des Seins: Bindings für Eclipse SmartHome entwickeln
 
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
 
Der Cloud Native Stack in a Nutshell
Der Cloud Native Stack in a NutshellDer Cloud Native Stack in a Nutshell
Der Cloud Native Stack in a Nutshell
 
Clickstream Analysis with Spark - Understanding Visitors in Real Time
Clickstream Analysis with Spark - Understanding Visitors in Real TimeClickstream Analysis with Spark - Understanding Visitors in Real Time
Clickstream Analysis with Spark - Understanding Visitors in Real Time
 
Per Anhalter durch den Cloud Native Stack (extended edition)
Per Anhalter durch den Cloud Native Stack (extended edition)Per Anhalter durch den Cloud Native Stack (extended edition)
Per Anhalter durch den Cloud Native Stack (extended edition)
 
From pets to cattle - powered by CoreOS, docker, Mesos & nginx
From pets to cattle - powered by CoreOS, docker, Mesos & nginxFrom pets to cattle - powered by CoreOS, docker, Mesos & nginx
From pets to cattle - powered by CoreOS, docker, Mesos & nginx
 
Automotive Information Research driven by Apache Solr
Automotive Information Research driven by Apache SolrAutomotive Information Research driven by Apache Solr
Automotive Information Research driven by Apache Solr
 

Semelhante a Real World Analytics with Solr Cloud and Spark

Semelhante a Real World Analytics with Solr Cloud and Spark (20)

Leveraging the power of solr with spark
Leveraging the power of solr with sparkLeveraging the power of solr with spark
Leveraging the power of solr with spark
 
Infochimps: Cloud for Big Data
Infochimps: Cloud for Big DataInfochimps: Cloud for Big Data
Infochimps: Cloud for Big Data
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Learn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best PracticesLearn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best Practices
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
TDC2016SP - Trilha NoSQL
TDC2016SP - Trilha NoSQLTDC2016SP - Trilha NoSQL
TDC2016SP - Trilha NoSQL
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 

Mais de QAware GmbH

"Mixed" Scrum-Teams – Die richtige Mischung macht's!
"Mixed" Scrum-Teams – Die richtige Mischung macht's!"Mixed" Scrum-Teams – Die richtige Mischung macht's!
"Mixed" Scrum-Teams – Die richtige Mischung macht's!
QAware GmbH
 
Migration von stark regulierten Anwendungen in die Cloud: Dem Teufel die See...
 Migration von stark regulierten Anwendungen in die Cloud: Dem Teufel die See... Migration von stark regulierten Anwendungen in die Cloud: Dem Teufel die See...
Migration von stark regulierten Anwendungen in die Cloud: Dem Teufel die See...
QAware GmbH
 

Mais de QAware GmbH (20)

50 Shades of K8s Autoscaling #JavaLand24.pdf
50 Shades of K8s Autoscaling #JavaLand24.pdf50 Shades of K8s Autoscaling #JavaLand24.pdf
50 Shades of K8s Autoscaling #JavaLand24.pdf
 
Make Agile Great - PM-Erfahrungen aus zwei virtuellen internationalen SAFe-Pr...
Make Agile Great - PM-Erfahrungen aus zwei virtuellen internationalen SAFe-Pr...Make Agile Great - PM-Erfahrungen aus zwei virtuellen internationalen SAFe-Pr...
Make Agile Great - PM-Erfahrungen aus zwei virtuellen internationalen SAFe-Pr...
 
Fully-managed Cloud-native Databases: The path to indefinite scale @ CNN Mainz
Fully-managed Cloud-native Databases: The path to indefinite scale @ CNN MainzFully-managed Cloud-native Databases: The path to indefinite scale @ CNN Mainz
Real World Analytics with Solr Cloud and Spark

  • 1. Real-World Analytics with Solr Cloud and Spark: Solving Analytic Problems for Billions of Records Within Seconds. Johannes Weigend, QAware GmbH. Apache Big Data North America 2016, Vancouver, May 2016.
  • 2. Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH. Any questions? Ask, or tweet with the hashtag #cloudnativenerd
  • 3. The Problem We Want to Solve
      ■ Interactive applications with response times under one second
      ■ Processing of billions of records (> 10^9 rows)
      ■ Continuous data import (near real time)
      ■ Applications built on the principles of the Reactive Manifesto
  • 6. Horizontal Scalability can be Difficult!
      ■ Horizontal scalability of functions
        ■ Trivial: load balancing of (stateless) services (macro- / microservices); more users → more machines
        ■ Not trivial: more machines → faster response times
      ■ Horizontal scalability of data
        ■ Trivial: linear distribution of data across multiple machines; more machines → more data
        ■ Not trivial: constant response times with growing datasets
  • 7. Hadoop Gives Answers for Horizontal Scalability of Data and Functions
  • 9. The Processing of Distributed Data can be Quite Slow! [Diagram: data flow over HDFS / NFS / NoSQL with parallel Read, Filter and Map stages feeding Reduce and foreach(); runtime: minutes to hours]
  • 10. With Prior Indexing and Searching, Less Data has to be Read and Filtered. [Diagram: data flow over Search / NoSQL with parallel Search, Filter and Map stages feeding Reduce and foreach(); runtime: seconds to minutes]
  • 11. [Diagram: architecture layers: Frontend, Business Layer, Cluster Processing (Spark: Map / Reduce), Distributed Data (Search)]
  • 12. DEMO
  • 13. Part 1: Solr Cloud for Analytics [Diagram: Search, Filter, Map and Reduce data flow over Search / NoSQL and Spark]
  • 14. Solr Cloud
      ■ Document-based NoSQL database with outstanding search capabilities
      ■ A document is a collection of fields (string, number, date, …)
      ■ Single-valued and multi-valued fields (fields can be arrays)
      ■ Nested documents
      ■ Static and dynamic schema
      ■ Powerful query language (Lucene)
      ■ Horizontally scalable with Solr Cloud
      ■ Distributed data in separate shards
      ■ Resilience through the combination of Zookeeper and replication
      ■ Powerful aggregations (aka facets)
      ■ Stable: version 6.0
  • 15. The Architecture of Solr Cloud [Diagram: a Zookeeper cluster coordinates several Solr servers; a collection is split into shards (Shard1 to Shard9), each with a leader and replicas; adding Solr servers scales out]
  • 16. Solr Stores Everything in a Single "Table" (BigTable). Searching is Extremely Fast and Powerful.*
      Data model: Customer (Name, Address) 1 : * Order (Amount, Product)

      Type      ID  Name  Address  Amount  Product  K2B
      Customer  1   K1    A1       -       -        [3,5]
      Customer  2   K2    A2       -       -        [4]
      Order     3   -     -        Z1      P1       [1]
      Order     4   -     -        Z2      P2       [2]
      ...

      Each row is one SolrDocument.
      (*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms
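The single-table layout above can be imitated with plain dictionaries. This is only a sketch: the lowercase field names and the helper functions are hypothetical, and the list comprehension stands in for what a Solr filter query on the type field would do on the server.

```python
# Hypothetical documents mirroring the slide's table: all entity types share
# one collection and are told apart by a "type" field.
docs = [
    {"type": "Customer", "id": 1, "name": "K1", "address": "A1", "k2b": [3, 5]},
    {"type": "Customer", "id": 2, "name": "K2", "address": "A2", "k2b": [4]},
    {"type": "Order", "id": 3, "amount": "Z1", "product": "P1", "k2b": [1]},
    {"type": "Order", "id": 4, "amount": "Z2", "product": "P2", "k2b": [2]},
]


def by_type(documents, doc_type):
    """Keep only documents of one type (in spirit: a filter on the type field)."""
    return [d for d in documents if d["type"] == doc_type]


def orders_of(documents, customer_id):
    """Follow a customer's K2B references to the order documents that are present."""
    customer = next(
        d for d in documents if d["type"] == "Customer" and d["id"] == customer_id
    )
    wanted = set(customer["k2b"])
    return [d for d in by_type(documents, "Order") if d["id"] in wanted]
```

Order 5 is referenced by customer 1 but elided in the table, so orders_of(docs, 1) only returns order 3 here.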
  • 17. A Solr Cloud can be Started in Seconds.
      ■ Create a schema by reusing an existing set of Solr config files
      ■ The installation directory contains examples under $SOLR_HOME/server/solr/configsets which can be copied and modified
      ■ Start Solr
      ■ When the wizard asks for a collection name, use "bigdata2016"
      ■ Run a first test query
      cp -r $SOLR_HOME/server/solr/configsets/basic_configs $SOLR_HOME/server/solr/configsets/bigdata2016
      $SOLR_HOME/bin/solr start -e cloud
      curl 'localhost:8983/solr/bigdata2016/query?q=*:*'
  • 18. With the Solr Cloud Collection API, Shards can be Created, Changed or Deleted.
      ■ Create a collection
        <<SOLR URL>>/solr/admin/collections?action=CREATE&name=<<name of collection>>&numShards=16&replicationFactor=2&maxShardsPerNode=8&collection.configName=<<name of uploaded zookeeper configuration>>
      ■ Delete a collection
        <<SOLR URL>>/solr/admin/collections?action=DELETE&name=<<name of collection>>
      https://cwiki.apache.org/confluence/display/solr/Collections+API
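The two Collections API calls above are ordinary HTTP GET URLs, so they can be assembled programmatically. The following is a minimal sketch; the base URL, the collection name and the helper name collections_url are assumptions for illustration, while the parameter names are the ones shown on the slide.

```python
from urllib.parse import urlencode


def collections_url(solr_url, action, params):
    """Build a Collections API URL; nothing is sent over the network here."""
    query = urlencode({"action": action, **params})
    return f"{solr_url}/solr/admin/collections?{query}"


# Assumed values: a local Solr node and a config set uploaded as "bigdata2016".
create_url = collections_url(
    "http://localhost:8983",
    "CREATE",
    {
        "name": "bigdata2016",
        "numShards": 16,
        "replicationFactor": 2,
        "maxShardsPerNode": 8,
        "collection.configName": "bigdata2016",
    },
)
delete_url = collections_url("http://localhost:8983", "DELETE", {"name": "bigdata2016"})
```

The resulting strings can be passed to curl or to urllib.request.urlopen against a running cluster.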
  • 19. Zookeeper has to be Started First and the Solr Configuration must be Uploaded to Use a Solr Cloud.
      1. Start Zookeeper on 2n+1 nodes (odd number)
         $ZOO_HOME/bin/zkServer.sh start
      2. Upload the Solr configuration into Zookeeper
         $SOLR_HOME/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf
      3. Start Solr on n nodes connected to the Zookeeper cluster
         $SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102
      4. Create a collection with a number of shards and replicas
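The reason for the odd 2n+1 node count is that a Zookeeper ensemble stays available only while a majority of its servers is reachable. A small sketch of that arithmetic, plus a helper that assembles the host list passed to -z and -zkhost (both function names are hypothetical):

```python
def tolerated_failures(ensemble_size):
    """Servers that may fail while a strict majority of the ensemble survives."""
    return (ensemble_size - 1) // 2


def zk_connect_string(hosts, port=2181):
    """Join host names into the comma-separated form used by -z / -zkhost."""
    return ",".join(f"{host}:{port}" for host in hosts)
```

Three servers tolerate one failure and five tolerate two; a fourth server adds no extra tolerance over three, which is why even sizes are usually avoided.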
  • 20. Example: Solr Cloud for Analytics of Insurance Data
      ■ Insurance sample data with the following fields: Education, Income, Gender, ...
  • 21. DEMO
  • 22. Solr Supports JSON Queries via HTTP POST
  • 23. Term Facets Group and Count a Single Field.
  • 24. Function Facets Aggregate Fields. http://yonik.com/solr-facet-functions/
  • 25. Pivot Facets Compose Facets into Hierarchies.
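The queries shown on these slides are not part of the transcript. As a hedged illustration, a request body for Solr's JSON Facet API could combine the three ideas as follows; the lowercase field names gender, income and education are assumptions derived from the sample data, and nested terms facets are used here as the JSON-API counterpart of a pivot.

```python
import json


def facet_request(query="*:*"):
    """Build a JSON request body: term facet, function facet and a nested facet."""
    return {
        "query": query,
        "limit": 0,  # only the aggregations are of interest, not the documents
        "facet": {
            # Term facet: group by one field and count the documents per value.
            "by_gender": {
                "type": "terms",
                "field": "gender",
                "facet": {
                    # Function facet: aggregate a numeric field per bucket.
                    "avg_income": "avg(income)",
                    # Nested term facet: builds a gender -> education hierarchy.
                    "by_education": {"type": "terms", "field": "education"},
                },
            }
        },
    }


body = json.dumps(facet_request())
```

The string in body would be sent via HTTP POST with Content-Type application/json to the collection's /query endpoint.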
  • 26. Solr 6 Supports SQL
      ■ Solr 6 supports distributed SQL
      ■ The JDBC driver is part of the SolrJ client library
      ■ A collection is currently mapped to a single table
        ■ Collection -> Table
        ■ SolrDocument -> Row
        ■ Field -> Column
      ■ SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions
      ■ No database metadata, no prepared statements, no mapping to tables per type field
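Besides JDBC, Solr 6 also exposes SQL through an HTTP handler that takes the statement in a stmt parameter. The sketch below only builds such a request without sending it; the base URL, the collection name and the gender column are assumptions, and the table name equals the collection name as described in the mapping above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sql_request(solr_url, collection, stmt):
    """Prepare (but do not send) a POST request for the collection's /sql handler."""
    data = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(f"{solr_url}/solr/{collection}/sql", data=data, method="POST")


# Assumed example: count documents per gender in the bigdata2016 collection.
statement = "SELECT gender, count(*) FROM bigdata2016 GROUP BY gender"
request = sql_request("http://localhost:8983", "bigdata2016", statement)
```

Passing request to urllib.request.urlopen would execute the statement against a running Solr 6 cluster.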
  • 27. Resilience
      ■ The number of replicas per shard is configurable (replication factor)
      ■ This number determines how many nodes can fail without interrupting service
      ■ Zookeeper is the single point of failure, but it can be made failsafe by running multiple instances
      ■ Solr knows all Zookeeper instances and can silently switch over to the next available instance if the currently connected one crashes
  • 28. You Have Everything You Need! Or Not?
      ■ Client-side processing of Solr documents does not scale
      ■ No possibility to run parallel business logic inside Solr
      ■ The Solr index is not a general-purpose store for huge data
        ■ Images
        ■ Videos
        ■ Binaries / large text documents
      ■ No interface to machine learning or typical statistics libraries (R)
      ...
  • 29. Distributed In-Memory Computing with Apache Spark [Diagram: Search, Filter, Map and Reduce data flow over Search / NoSQL and Spark]
  • 30. Apache Spark
      ■ Distributed computing (up to 100x faster than Hadoop MapReduce)
      ■ Distributed Map/Reduce on distributed data can be done in memory
      ■ Written in Scala (JVM)
      ■ Java / Scala / Python APIs
      ■ Processes data from distributed and non-distributed sources
        ■ Text files (accessible from all nodes)
        ■ Hadoop File System (HDFS)
        ■ Databases (JDBC)
        ■ Solr via the Lucidworks API
        ■ ...
      READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 31. [Diagram: Spark cluster architecture: the driver application on the driver node creates a Spark Context (with the master URL) and uses Resilient Distributed Datasets (RDDs); the master (standalone, Yarn or Mesos) has the workers (slaves) start executor JVMs, which run tasks on the RDD partitions]
  • 32. A Very First Spark Application
  • 33. Spark Pattern 1: Distributed Task with Params
  • 34. Spark Pattern 2: Distributed Read from External Sources
  • 35. Spark Pattern 3: Caching and Further Processing with RDDs
  • 36. DEMO
  • 37. Putting it All Together: Solr & Spark in Action [Diagram: Search, Filter, Map and Reduce data flow over Search / NoSQL and Spark]
  • 38. How to Implement readFromShard()?
      ■ There are several possibilities:
        ■ SolrJ: SolrStream
          ■ The /export handler can stream bulk data from Solr
          ■ Supports JSON export only (no binary format!)
        ■ Or: SolrJ cursor marks
        ■ Or: a custom export handler
      http://localhost:8983/solr/jax2016/export?q=*:*&sort=id%20asc&fl=id&wt=xml
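The cursor-mark option can be sketched independently of SolrJ. Solr expects the first request to carry cursorMark=* together with a sort that includes the unique key, returns a nextCursorMark with every page, and signals the end by returning the same mark that was sent. In the sketch below, read_all and fake_solr are hypothetical names, and fake_solr is an in-memory stand-in that uses plain offsets where real Solr returns opaque strings.

```python
def read_all(fetch_page, rows=2):
    """Yield all documents by following cursor marks until they stop changing."""
    cursor = "*"  # initial cursor mark defined by Solr
    while True:
        response = fetch_page(
            {"q": "*:*", "sort": "id asc", "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # same mark returned: no more results
            break
        cursor = next_cursor


def fake_solr(ids):
    """In-memory stand-in for a shard; cursor marks are simple offsets here."""
    def fetch_page(params):
        start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
        page = ids[start:start + params["rows"]]
        next_mark = str(start + len(page)) if page else params["cursorMark"]
        return {"response": {"docs": [{"id": i} for i in page]}, "nextCursorMark": next_mark}
    return fetch_page
```

Replacing fake_solr with a function that issues the HTTP request against one shard would give one possible readFromShard() implementation.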
  • 39. LucidWorks has Released a Spark/Solr Integration Library. https://github.com/lucidworks/spark-solr
  • 40. Lucidworks Solr-Spark Adapter V 2.1 [Slide image with numbered callouts 1 to 4 is not part of the transcript]
  • 41. Logfile Analytics with Solr and Spark
      ■ Histogram of all exceptions from hosts A, B, C during time interval D
      ■ Step 1: Search with Solr
        ■ Solr query (pseudo syntax): q=*Exception AND (server:A OR server:B OR server:C) AND timestamp between [1.1.2015, 31.12.2015]
      ■ Step 2: Create a map with key = << exception name >>, value = count
        ■ Group with Spark
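Step 2 is a classic map/reduce: count exception names per partition, then merge the partial counts. The following plain-Python sketch imitates this without Spark; the regular expression, the function names and the idea that each partition is a list of log lines returned by the Solr search are assumptions for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Assumed convention: an exception name is a word ending in "Exception".
EXCEPTION_NAME = re.compile(r"\b(\w+Exception)\b")


def count_partition(lines):
    """Map step: count the first exception name found in each line of one partition."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def histogram(partitions):
    """Reduce step: merge the per-partition counters into one histogram."""
    return reduce(lambda left, right: left + right, map(count_partition, partitions), Counter())
```

In Spark, the same shape would typically be expressed with a per-partition map followed by a reduce (or a reduceByKey on name/count pairs).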
  • 43. DEMO
  • 44. Specifications: Intel NUC6i5SYK
      CPU    6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz up to 2.8 GHz Turbo, Dual Core, 4 MB Cache, 15 W TDP)
      RAM    32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
      Disk   256 GB Samsung M.2 internal SSD
      Total  8 cores, 16 HT units, 128 GB RAM, 1 TB disk
      → This case is as powerful as four notebooks
  • 45. Technical Cluster Architecture [Diagram: four Ubuntu Linux nodes, each running Solr Cloud and Spark (master, slave and executor JVMs); Zookeeper runs on nodes 1 to 3 and Zeppelin on nodes 1 and 2; HDFS; 16 Solr shards (s1 to s16) in four groups of four]
  • 46. You can even Run Solr Cloud and Spark on Odroid XU4 ARM Computers ($70 each)
      ■ 8 cores
      ■ Approx. 1/10 of the CPU performance of the Intel NUC 6 / Core i5
  • 47. [Diagram: cluster of Odroid XU4 boards (2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70 each) running Spark workers, a Spark master, Solr 5.3 and Zookeeper; total: 40 cores, 10 GB RAM, 320 GB eMMC disk]
  • 49. Summary
      ■ Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
      ■ Writing distributed software remains hard. Only distribute if you have to.
      ■ 100% open source
      ■ A simple integration of Solr and Spark is easy. For high-performance applications, things can get more complicated.
      ■ If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform
  • 50. @JohannesWeigend @qaware slideshare.net/qaware blog.qaware.de