2. Enterprises today are collecting and have access
to more data points in their ecosystem than ever.
Tuesday, 12 March 13
3. File Store Example
Regular File Store
• File / Folder Navigation
• Integration - Mount Points
• Limited Metadata
• Hierarchical Structure
4. File Store Example
• Find a document from December 2011 about transfer containing proposal and David
• Find the document received from John containing David and transfer
• Find the revisions of the transfer document
5. File Store Example
Regular File Store
• File / Folder Navigation
• Integration - Mount Points
• Limited Metadata
• Hierarchical Structure
Intelligent File Store
• Collections / Documents
• Local / Distributed Integrations
• Semantic Metadata
• Declarative Queries
• Automatic Indexing
• Provenance
• Automatic Organization
• Virtual Collections
6. ElasticSearch is an open source, scalable,
distributed, cloud-ready, highly-available full-text
search engine and database with powerful
aggregation features, communicating by JSON over
RESTful HTTP, based on Apache Lucene.
7. Playing with ElasticSearch
[Architecture diagram. Labels: Structured Data, Unstructured Data, Data Refinery, Message Queues; Capture & Curate, Index, Streams, Analyse, Search; Inverted index, Transaction Log, Versioning, Source Document; Memory, Shared FS, FS + Memory, Local FS; Document Store]
8. Playing with ElasticSearch
Rivers
• Data flows from sources using Rivers
• Continues to add data as it flows
• Can be added, removed, configured dynamically
9. Playing with ElasticSearch
Rivers
[Diagram: three Data Sources, each feeding a River; the Rivers run on an ES Node and write to an ES Index]
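The claim that Rivers can be added, removed and configured dynamically can be sketched as the HTTP requests involved. This is a minimal sketch, assuming the river API of the ElasticSearch releases current at the time of the talk, where a river is registered by writing a `_meta` document under the `_river` index and removed by deleting it. The river name `my_feed`, the CouchDB settings and the helper names are hypothetical, and nothing is sent over the network.

```python
import json

ES = "http://localhost:9200"  # same host and port as the deck's curl examples


def register_river(name, river_type, settings):
    """Describe the request that registers (or reconfigures) a river.

    Assumption: a river is configured by PUTting a `_meta` document
    under the special `_river` index, keyed by the river's name.
    """
    body = {"type": river_type, river_type: settings}
    return ("PUT", f"{ES}/_river/{name}/_meta", json.dumps(body, sort_keys=True))


def remove_river(name):
    """Describe the request that stops a river and drops its configuration."""
    return ("DELETE", f"{ES}/_river/{name}/", None)


# Hypothetical river pulling from a local CouchDB database.
method, url, body = register_river(
    "my_feed", "couchdb", {"host": "localhost", "port": 5984, "db": "my_db"}
)
print(method, url)
print(body)
print(*remove_river("my_feed")[:2])
```

Because both operations are plain HTTP calls against a running node, a river can be attached or detached without restarting anything, which is the "dynamic" part of the slide.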
11. Playing with ElasticSearch
River Modules
• CouchDB      • JDBC
• MongoDB      • Solr
• Wikipedia    • Jira
• Twitter      • CSV
• ActiveMQ     • FileSystem
• RabbitMQ     • SysInfo
• NSQ          • Logs
• RSS          • LDAP
12. Playing with ElasticSearch
Index
• The mapping describes document structure to the search engine
• Automatically created with sensible defaults
• Explicit mapping can be provided (generally, a good idea)
• Simple types: string, integer/long, float/double, boolean, and null
• Complex types: array, object
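An explicit mapping for this slide could look roughly like the following. It is a sketch only: the field names mirror the sample article document indexed later in the deck, the `date` type and its format string are assumptions (the slide does not list them), and the helper `field_types` is hypothetical. The code only builds and inspects the mapping; it does not talk to a server.

```python
import json

# Hypothetical explicit mapping for an "article" type.
article_mapping = {
    "article": {
        "properties": {
            "title": {"type": "string"},
            "body": {"type": "string"},
            # Assumption: a date field with a format matching "2013/02/06 10:00:00".
            "published_on": {"type": "date", "format": "yyyy/MM/dd HH:mm:ss"},
            # Arrays need no dedicated type: a field may hold several values.
            "tags": {"type": "string"},
            # An object is described by its own nested properties.
            "author": {
                "properties": {
                    "first_name": {"type": "string"},
                    "last_name": {"type": "string"},
                    "email": {"type": "string"},
                }
            },
        }
    }
}


def field_types(properties, prefix=""):
    """Flatten mapping properties into {dotted.path: type}."""
    flat = {}
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            flat.update(field_types(spec["properties"], path + "."))
        else:
            flat[path] = spec["type"]
    return flat


print(json.dumps(field_types(article_mapping["article"]["properties"]), indent=2))
```

The flattened view shows the dotted paths (such as `author.first_name`) that a query would address.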
13. Playing with ElasticSearch
[Architecture diagram, continued from slide 7. New layer: Distributed (Shards, Replication, Load Balancing, Nodes)]
14. Playing with ElasticSearch
Distributed Model
• Number of shards is the scaling unit [ #shards > #nodes ]
• Each shard is a separate Lucene index; thus, many per-index settings are available
• Moving shards around is faster than splitting them (no reindex)
• Replicas also serve reads, which lets search scale
• The number of replicas can be updated dynamically after index creation
[Diagram: three nodes joined by the Automatic Discovery Protocol, each holding shards and replicas. Node 1: user (0), user (1). Node 2: user1 (0), user (1). Node 3: user (0), user2 (0).]
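Why moving shards is cheap while splitting them implies a reindex can be illustrated with a toy routing function. The sketch assumes documents are placed by hashing a routing key (by default the document id) modulo the number of primary shards; `zlib.crc32` is only a stand-in for ElasticSearch's internal hash, and the function name `shard_for` is hypothetical.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a primary shard as hash(routing_key) mod shard count.

    Assumption: zlib.crc32 stands in for the real hash function;
    only the modulo structure matters for this illustration.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = [str(i) for i in range(1000)]

# With a fixed shard count every document has one stable home, so a whole
# shard can be relocated to another node without touching its documents.
before = {d: shard_for(d, 5) for d in doc_ids}

# Changing the shard count changes the modulus, so many documents would
# belong to a different shard: this is the reindex that splitting implies.
after = {d: shard_for(d, 6) for d in doc_ids}
moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved} of {len(doc_ids)} documents would change shard")
```

This is also one reason to create more shards than nodes up front: spare shards can later be spread over new nodes without changing the modulus.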
15. Playing with ElasticSearch
Index Aliases
# Filtered, routed alias: user_1 is a per-user view onto the shared users index
curl -X POST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "users",
        "alias" : "user_1",
        "filter" : { "term" : { "user" : "1" } },
        "routing" : "1"
      }
    } ]
}'
Indexing and search happen on the alias, with automatic use of routing and filtering
16. Playing with ElasticSearch
Index Aliases
# One alias (users) spanning two indices (user_1, user_2)
curl -X POST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "add" : { "index" : "user_1", "alias" : "users" } },
    { "add" : { "index" : "user_2", "alias" : "users" } }
  ]
}'
[Diagram: alias users pointing at indices user_1 and user_2]
# Searching the alias searches both indices
curl -X GET "http://localhost:9200/users/_search?q=..."
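The two alias patterns on the Index Aliases slides (a filtered, routed alias per user over one shared index, and a single alias spanning several indices) differ only in the actions sent to `_aliases`. The sketch below builds both request bodies; the helper names `add_alias` and `aliases_body` are hypothetical and the code does not contact a server.

```python
import json


def add_alias(index, alias, filter_=None, routing=None):
    """Build one "add" action; filter and routing are optional."""
    action = {"index": index, "alias": alias}
    if filter_ is not None:
        action["filter"] = filter_
    if routing is not None:
        action["routing"] = routing
    return {"add": action}


def aliases_body(*actions):
    """Wrap actions in the body expected by POST /_aliases."""
    return json.dumps({"actions": list(actions)}, indent=2)


# Pattern 1: per-user alias over a shared index.
per_user = aliases_body(
    add_alias("users", "user_1", filter_={"term": {"user": "1"}}, routing="1")
)

# Pattern 2: one alias spanning two indices.
combined = aliases_body(add_alias("user_1", "users"), add_alias("user_2", "users"))

print(per_user)
print(combined)
```

Either string could be passed as the `-d` payload of the curl commands on the slides.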
17. Playing with ElasticSearch
[Architecture diagram, continued from slide 13. New layer: Transport (HTTP, WebSockets, Thrift, ZeroMQ, memcached, TCP)]
18. Playing with ElasticSearch
[Architecture diagram, continued from slide 17. New layer: Extend / Modules (Data Sources, Tokenisers, Retrieval Models, Structured Results, Language Bindings, Transport)]
19. Playing with ElasticSearch
[Architecture diagram, continued from slide 18. New layer: Discovery (Zen, EC2)]
20. Playing with ElasticSearch
[Architecture diagram, continued from slide 19. New layer: Script (mvel, Python, Groovy, JavaScript)]
21. Playing with ElasticSearch
[Architecture diagram, continued from slide 20. New layer: Monitor]
22. Playing with ElasticSearch
[Architecture diagram, continued from slide 21. New layer: RESTful]
24. Playing with ElasticSearch
REST API : http://host:port/[index]/[type]/[_action/id]
HTTP Methods : GET, POST, PUT, DELETE
Some Definitions
• index -> Like a database
• type -> Like a table
• id -> Like a row in a table
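The URL pattern above can be read as "each bracketed segment is optional". A minimal sketch of that reading follows; the helper name `es_url` and its defaults (`localhost`, `9200`, taken from the deck's curl examples) are assumptions for illustration.

```python
def es_url(index=None, doc_type=None, action_or_id=None,
           host="localhost", port=9200):
    """Assemble http://host:port/[index]/[type]/[_action/id].

    Segments passed as None are simply left out of the path.
    """
    parts = [p for p in (index, doc_type, action_or_id) if p is not None]
    return f"http://{host}:{port}/" + "/".join(parts)


# A single document: index "articles", type "article", id "1".
print(es_url("articles", "article", "1"))

# An action on a whole index: the type segment is omitted.
print(es_url("articles", action_or_id="_search"))
```

Combined with the HTTP method, the same path shape covers storing a document (POST or PUT), fetching it (GET), removing it (DELETE) and running actions such as `_search`.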
26. Playing with ElasticSearch
request
curl -X POST "http://localhost:9200/articles/article/1" -d '
{
  "title" : "ElasticSearch Understands JSON!",
  "body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
  "published_on" : "2013/02/06 10:00:00",
  "tags" : ["search", "json"],
  "author" : {
    "first_name" : "Bruce",
    "last_name" : "Croft",
    "email" : "bruce@croft.org"
  }
}'
response
{
  "ok":true,
  "_index":"articles",
  "_type":"article",
  "_id":"1",
  "_version":1
}
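The same index request can be prepared from a program instead of curl. The sketch below uses only Python's standard library, builds the request without sending it, and then reads the acknowledgement shown on the slide. The `Content-Type` header is an assumption (the curl command does not set one), the article body is abbreviated, and the helper name `summarize_index_reply` is hypothetical.

```python
import json
import urllib.request

article = {
    "title": "ElasticSearch Understands JSON!",
    "published_on": "2013/02/06 10:00:00",
    "tags": ["search", "json"],
    "author": {"first_name": "Bruce", "last_name": "Croft",
               "email": "bruce@croft.org"},
}

# Build (but do not send) the POST to /articles/article/1.
req = urllib.request.Request(
    "http://localhost:9200/articles/article/1",
    data=json.dumps(article).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # assumption, see lead-in
    method="POST",
)


def summarize_index_reply(raw_reply):
    """Turn the acknowledgement into 'index/type/id vN'."""
    reply = json.loads(raw_reply)
    return f'{reply["_index"]}/{reply["_type"]}/{reply["_id"]} v{reply["_version"]}'


# The acknowledgement from the slide, used here as canned input.
sample_reply = '{"ok":true,"_index":"articles","_type":"article","_id":"1","_version":1}'

print(req.get_method(), req.full_url)
print(summarize_index_reply(sample_reply))
# Against a running node the request would be sent with urllib.request.urlopen(req).
```

The `_version` field in the reply starts at 1 for a new document, as on the slide.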
29. Playing with ElasticSearch
request
curl -X GET "http://localhost:9200/articles/_search?q=author.first_name:BRUCE"
response
{
  "took":1,
  "timed_out":false,
  "_shards":{"total":5,"successful":5,"failed":0},
  "hits":{
    "total":1,
    "max_score":0.30685282,
    "hits":[{
      "_index":"articles",
      "_type":"article",
      "_id":"1",
      "_score":0.30685282,
      "_source" :
      {
        "title" : "ElasticSearch Understands JSON!",
        "body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
        "published_on" : "2013/02/06 10:00:00",
        "tags" : ["search", "json"],
        "author" : {
          "first_name" : "Bruce",
          "last_name" : "Croft",
          "email" : "bruce@croft.org"
        }
      }
    } ] } }
Callouts on the response:
• Total number of documents: hits.total
• Location & ID: _index, _type, _id
• Document Source: _source
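The three callouts on this slide (total number of documents, location and ID, document source) correspond to fields that a client can read directly from the response. The sketch below parses an abbreviated copy of the slide's response; it assumes the response shape shown on the slide, and the helper name `summarize_hits` is hypothetical.

```python
import json

# Abbreviated copy of the search response from the slide.
raw = """
{
  "took": 1,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [{
      "_index": "articles", "_type": "article", "_id": "1",
      "_score": 0.30685282,
      "_source": {
        "title": "ElasticSearch Understands JSON!",
        "tags": ["search", "json"],
        "author": {"first_name": "Bruce", "last_name": "Croft"}
      }
    }]
  }
}
"""


def summarize_hits(response):
    """Pull out the three things the slide calls out."""
    hits = response["hits"]
    return {
        # Total number of documents matching the query.
        "total": hits["total"],
        # Location & ID of each hit.
        "locations": [f'{h["_index"]}/{h["_type"]}/{h["_id"]}' for h in hits["hits"]],
        # Document source: the JSON that was originally indexed.
        "titles": [h["_source"]["title"] for h in hits["hits"]],
    }


summary = summarize_hits(json.loads(raw))
print(summary["total"], summary["locations"], summary["titles"])
```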
30. Playing with ElasticSearch
[Architecture diagram, continued from slide 22. New layer: Micro Apps]
31. Playing with ElasticSearch
[Architecture diagram, continued from slide 30. Micro Apps layer detailed: HTML5/CSS3, JavaScript]
38. Micro Applications
Rich, interactive single-page web applications powered by JavaScript, HTML
and CSS.
• A self-described framework for ambitious applications
• Rails-inspired “convention over configuration” approach
• High level abstractions, two-way binding and auto-updating templates
• Ember Data
  • Client side storage adapter
  • Provides a common interface to persist application data
    • RESTful HTTP service - primary endpoint
    • Browser’s localStorage
    • Emerging web databases such as IndexedDB
[Diagram: Router, Controllers, Models (Data Model) and Views]
40. [Complete architecture diagram. Labels: Structured Data, Unstructured Data, Data Refinery, Message Queues; Capture & Curate, Index, Streams, Analyse, Search; Inverted index, Transaction Log, Versioning, Source Document; Memory, Shared FS, FS + Memory, Local FS; Document Store; Distributed (Shards, Replication, Load Balancing, Nodes); Transport (HTTP, WebSockets, Thrift, ZeroMQ, memcached, TCP); Extend / Modules (Data Sources, Tokenisers, Retrieval Models, Structured Results, Language Bindings, Transport); Discovery (Zen, EC2); Script (mvel, Python, Groovy, JavaScript); Monitor; RESTful; Micro Apps (HTML5/CSS3, JavaScript)]
An alternative that would allow scientists or even casual users to perform analysis of
distributed data regardless of where the data resides.
41. Search is the primary interface for getting
information today. Let’s build on it.
[Diagram: Search, Discover, Analyse]
43. Data Management Tools - Challenges
• Interactive queries, data exploration and iterative query refinement pose
significant challenges for current methods
• Building and running jobs and queries requires deep understanding of
cluster size and structure, job performance, etc.
• Time-consuming to set up, deploy and use