Since it became an Apache Top-Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing: running analyses that crunch petabytes of data is no longer science fiction. But the MapReduce framework has two major drawbacks: query latency and data freshness.
At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style – from simple key/value lookups to complex queries.
Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.
There is a lot of traction in this area today, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets. After discussing how deployments of a common Hadoop platform have evolved, a hybrid approach called the lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact.
4. What’s going on
• Mainframes are obsolete, replaced by clusters of commodity hardware
• 10GbE (10 Gb/s) links are the new standard
• RESTful APIs are everywhere
• Everybody wants to visit Paxos Island
• Firehoses do not only carry water
• Asynchronous non-blocking functional programming is taught at primary school
• NoSQL is the new way to store data at scale
• API management startups are rising (and raising)
• Hadoop keywords boost your LinkedIn profile by 2000%
• Public clouds are responsible for more than 50% of global Internet traffic
• … and counting …
Verisign Public
5. A Possible Deployment
Source: http://dev.datasift.com/blog/high-scalability
Note: the diagram dates from 2009; it is probably partially or even completely outdated today
8. • Copying internal and external sources of data into the cluster
• Pre-processing: data cleanup, proper format, …
• Time vs. block-size tradeoff
• Targeted property: Availability
[Diagram: Source of Data → Ingesting the flow → Local buffering → Uploading to HDFS → HDFS]
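The ingestion pipeline above can be sketched in a few lines. This is a minimal, purely illustrative model of the time vs. block-size tradeoff: records are buffered locally and flushed to HDFS either when the buffer reaches roughly one HDFS block, or when the oldest record is getting stale. The `flush_to_hdfs` callback and both thresholds are hypothetical placeholders, not anything prescribed by the talk.

```python
import time

class LocalBuffer:
    """Buffer records locally, then upload them to HDFS in large chunks."""

    def __init__(self, flush_to_hdfs, max_bytes=128 * 1024 * 1024, max_age_s=60.0):
        self.flush_to_hdfs = flush_to_hdfs   # e.g. wraps an `hdfs dfs -put`
        self.max_bytes = max_bytes           # aim for one full HDFS block
        self.max_age_s = max_age_s           # bound on data staleness
        self.records, self.size, self.oldest = [], 0, None

    def ingest(self, record: bytes, now=None):
        now = time.monotonic() if now is None else now
        if self.oldest is None:
            self.oldest = now
        self.records.append(record)
        self.size += len(record)
        # Flush when the buffer is block-sized OR the data is getting old.
        if self.size >= self.max_bytes or now - self.oldest >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.records:
            self.flush_to_hdfs(b"".join(self.records))
        self.records, self.size, self.oldest = [], 0, None
```

Larger `max_bytes` means fewer, more block-friendly uploads; smaller `max_age_s` means fresher data in HDFS at the cost of more small files.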
9. • Hadoop HDFS is a well-established distributed file system
• The file system is the central component of every data-driven approach
• Space vs. network tradeoff
• Targeted property: Reliability
[Diagram: File1 uploaded to HDFS and replicated across DataNode1–DataNode4]
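The space vs. network tradeoff above comes from replication: HDFS defaults to 3 replicas per block, paying 3x storage and extra network copies in exchange for reliability. The toy placement below uses simple round-robin, a deliberate simplification of HDFS's real rack-aware placement policy.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin).

    A simplified sketch: real HDFS placement is rack-aware and also
    considers node load and available space.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + i) % len(datanodes)]
                        for i in range(replication)]
    return placement
```

With 4 DataNodes and 3 replicas, any single node can be lost without losing a block, but the cluster stores 3x the raw data size.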
10. • Hadoop MapReduce
• Higher level tools (Hive, Pig, Impala) help
• Data catalog needs to be maintained
• Targeted property: Parallelism
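Higher-level tools like Hive and Pig ultimately compile queries down to the MapReduce model. A minimal in-memory sketch of that model (map emits key/value pairs, the shuffle groups them by key, reduce aggregates each group) using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The parallelism comes from the fact that map calls are independent per input split, and reduce calls are independent per key.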
11. • Only way to make use of the data
• Business-driven need
• At scale, data needs to be stored the way it will be queried
• DPI: Data Programmable Interfaces
• Targeted property: user friendliness, reliability
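One hypothetical reading of "stored as it is queried": pre-compute records under the exact key the REST API will ask for, so a `GET /api/v2/domain/<identifier>` call becomes a single key/value lookup instead of a scan. The route format and in-memory store below are illustrative assumptions, not an API from the talk.

```python
# Toy serving store: the URI itself is the storage key.
store = {}

def put(domain, identifier, record):
    """Write path: store the record under the URI it will be fetched by."""
    store[f"/api/v2/{domain}/{identifier}"] = record

def get(path):
    """Read path: answer a GET in O(1), REST-style (status, body)."""
    record = store.get(path)
    return (200, record) if record is not None else (404, None)
```

The design choice is to pay the cost at write time (one stored view per access pattern) so reads stay predictable at scale.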
13. Batch Processing
[Timeline diagram: Batch 1 starts processing at t1 and is ready to be served at t2; Batch 2 starts at t3 and is ready at t4; Batch 3 starts at t5. Queries between t2 and t4 are answered from Batch 1, afterwards from Batch 2, each time with a data gap: records that arrived after the serving batch started processing are not yet visible.]
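The batch timeline can be modelled in a few lines: given when each batch started processing and when it became ready, we can compute which batch serves a query at time t and the size of the resulting data gap. The time values are arbitrary units chosen for illustration.

```python
def serving_batch(batches, t):
    """batches: ordered list of (start_processing, ready_to_serve) times.

    Returns the most recent batch that is already ready at time t.
    """
    ready = [b for b in batches if b[1] <= t]
    return ready[-1] if ready else None

def data_gap(batches, t):
    """Age of the data gap: records ingested after the serving batch
    started processing are invisible until the next batch is served."""
    batch = serving_batch(batches, t)
    if batch is None:
        return None
    start, _ready = batch
    return t - start
```

The gap grows until the next batch becomes ready, which is exactly the freshness problem the hybrid approach addresses later.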
14. Batch Processing in Detail
[Diagram: a new batch is cut every granularity period; allow some time for the data upload to finish, run the batch with data from yesterday, load the results into a data store, then notify the retrieval system that a new batch is ready to be served. Until then, queries may effectively return data from the day before yesterday.]
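The "day before yesterday" effect is simple arithmetic: a record can wait almost one full granularity period before its batch is cut, then the upload-settle delay and the processing time pass before results are loaded and announced. The durations below are hypothetical examples, not figures from the talk.

```python
def worst_case_staleness_h(granularity_h, upload_settle_h, processing_h):
    """Maximum age, in hours, of the freshest queryable record:
    wait for the batch boundary, let uploads settle, then process."""
    return granularity_h + upload_settle_h + processing_h

# Example: daily batches (24h), 1h to let uploads finish, 6h of processing
# gives a worst case of 31 hours, i.e. data from the day before yesterday.
```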
15. Realtime Query
• Interactive queries
• REST-like request/response queries
• With an SLA
And
• Query the latest version of the data
• "Latest" means n seconds ago, with n predictable
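The two requirements above (a latency SLA and a predictable freshness bound n) can be made concrete in a small wrapper. Everything here is an illustrative assumption: the backend is any callable that returns a value together with the age of the data it was computed from.

```python
import time

def query_with_sla(backend, key, max_latency_s=0.1, max_staleness_s=5.0):
    """Run one query and report whether both real-time properties hold:
    the answer came back within the latency SLA, and it reflects data
    no older than max_staleness_s ("latest" = n seconds ago)."""
    start = time.monotonic()
    value, data_age_s = backend(key)      # backend reports its data age
    latency_s = time.monotonic() - start
    return {
        "value": value,
        "latency_ok": latency_s <= max_latency_s,
        "fresh": data_age_s <= max_staleness_s,
    }
```

A pure batch pipeline fails the `fresh` check for most of the day, which motivates the hybrid approach below.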
20. Hybrid Approach
[Timeline diagram: as before, Batch 1 starts processing at t1 and is ready to be served at t2; Batch 2 starts at t3 and is ready at t4. In the hybrid approach, each batch is paired with complementary real-time data covering the window since that batch started processing.]
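The serving side of the hybrid (lambda) approach is a merge: answer each query from the latest ready batch view, overlaid with the complementary real-time data accumulated since that batch started processing. The dictionary-based views below are an assumption made for illustration.

```python
def merged_view(batch_view, realtime_delta):
    """Lambda-style serving merge: real-time values override the (older)
    batch values; the batch view fills in everything else."""
    view = dict(batch_view)
    view.update(realtime_delta)
    return view
```

The real-time delta stays small because it only has to cover one batch window; it can be discarded once the next batch, which includes that data, is served.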
21. Realtime Search with Hadoop
[Architecture diagram: a Data In gateway feeds two Hadoop clusters (each with a NameNode, a JobTracker, and DataNodes); batch jobs generate indexes, while a Coordinator and a real-time Data Out gateway update and serve those indexes.]
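The two index paths in the architecture can be sketched side by side: a batch job builds a full inverted index from the documents in HDFS, while the real-time path folds new documents into the live index between rebuilds. Document shape and whitespace tokenization are simplifying assumptions.

```python
from collections import defaultdict

def generate_index(docs):
    """Batch path: build an inverted index term -> set of doc ids
    (what a MapReduce indexing job would produce from HDFS data)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Real-time path: incrementally fold one new document into the
    live index, so searches see it before the next batch rebuild."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(index, term):
    """Serving path: answer a term query from the merged index."""
    return sorted(index.get(term.lower(), set()))
```

Periodically regenerating the index from the batch layer keeps the incremental updates from accumulating errors, which is the same batch/real-time split as the hybrid approach above.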
The remainder of this presentation focuses on preventing DNS and DDoS attacks, and on global server load balancing to add resiliency to your eCommerce architecture.