Since it became an Apache Top-Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing: running analyses that crunch petabytes of data is no longer science fiction. But the MapReduce framework has two major drawbacks: query latency and data freshness.
At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style – from simple key/value lookups to complex queries.
Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind.
There is a lot of traction in this area today, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets. After discussing how deployments of a common Hadoop platform have evolved, a hybrid approach called the lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact.
4. What’s going on
• Mainframes are obsolete, replaced by clusters of commodity hardware
• 10GbE (10 Gb/s) links are the new standard
• RESTful APIs are everywhere
• Everybody wants to visit Paxos Island
• Firehoses do not only carry water
• Asynchronous non-blocking functional programming is taught at primary school
• NoSQL is the new way to store data at scale
• API management startups are rising (and raising)
• Hadoop keywords boost your LinkedIn profile by 2000%
• Public clouds are responsible for more than 50% of global Internet traffic
• … and counting …
Verisign Public
5. A Possible Deployment
Source: http://dev.datasift.com/blog/high-scalability
Note: the diagram dates from 2009; it is probably partially or even completely outdated today
8. • Copying internal and external sources of data into the cluster
• Pre-processing: data cleanup, proper format, …
• Time vs. block-size tradeoff
• Targeted property: Availability
[Diagram: Source of Data → Ingesting the flow → Local buffering → Uploading to HDFS → HDFS]
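The ingestion pipeline above can be sketched in a few lines. This is a minimal, purely illustrative model of the time vs. block-size tradeoff: records are buffered locally and flushed to HDFS either when the buffer reaches roughly one HDFS block, or when the oldest record is getting stale. The `flush_to_hdfs` callback and both thresholds are hypothetical placeholders, not anything prescribed by the talk.

```python
import time

class LocalBuffer:
    """Buffer records locally, then upload them to HDFS in large chunks."""

    def __init__(self, flush_to_hdfs, max_bytes=128 * 1024 * 1024, max_age_s=60.0):
        self.flush_to_hdfs = flush_to_hdfs   # e.g. wraps an `hdfs dfs -put`
        self.max_bytes = max_bytes           # aim for one full HDFS block
        self.max_age_s = max_age_s           # bound on data staleness
        self.records, self.size, self.oldest = [], 0, None

    def ingest(self, record: bytes, now=None):
        now = time.monotonic() if now is None else now
        if self.oldest is None:
            self.oldest = now
        self.records.append(record)
        self.size += len(record)
        # Flush when the buffer is block-sized OR the data is getting old.
        if self.size >= self.max_bytes or now - self.oldest >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.records:
            self.flush_to_hdfs(b"".join(self.records))
        self.records, self.size, self.oldest = [], 0, None
```

Larger `max_bytes` means fewer, more block-friendly uploads; smaller `max_age_s` means fresher data in HDFS at the cost of more small files.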
9. • Hadoop HDFS is a well-established distributed file system
• The file system is the central component of every data-driven approach
• Space vs. network tradeoff
• Targeted property: Reliability
[Diagram: File1 uploaded to HDFS and replicated across DataNode1–DataNode4]
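The space vs. network tradeoff above comes from replication: HDFS defaults to 3 replicas per block, paying 3x storage and extra network copies in exchange for reliability. The toy placement below uses simple round-robin, a deliberate simplification of HDFS's real rack-aware placement policy.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin).

    A simplified sketch: real HDFS placement is rack-aware and also
    considers node load and available space.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + i) % len(datanodes)]
                        for i in range(replication)]
    return placement
```

With 4 DataNodes and 3 replicas, any single node can be lost without losing a block, but the cluster stores 3x the raw data size.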
10. • Hadoop MapReduce
• Higher level tools (Hive, Pig, Impala) help
• Data catalog needs to be maintained
• Targeted property: Parallelism
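Higher-level tools like Hive and Pig ultimately compile queries down to the MapReduce model. A minimal in-memory sketch of that model (map emits key/value pairs, the shuffle groups them by key, reduce aggregates each group) using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The parallelism comes from the fact that map calls are independent per input split, and reduce calls are independent per key.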
11. • Only way to make use of the data
• Business-driven need
• At scale, data needs to be stored the way it will be queried
• DPI: Data Programmable Interfaces
• Targeted property: user friendliness, reliability
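One hypothetical reading of "stored as it is queried": pre-compute records under the exact key the REST API will ask for, so a `GET /api/v2/domain/<identifier>` call becomes a single key/value lookup instead of a scan. The route format and in-memory store below are illustrative assumptions, not an API from the talk.

```python
# Toy serving store: the URI itself is the storage key.
store = {}

def put(domain, identifier, record):
    """Write path: store the record under the URI it will be fetched by."""
    store[f"/api/v2/{domain}/{identifier}"] = record

def get(path):
    """Read path: answer a GET in O(1), REST-style (status, body)."""
    record = store.get(path)
    return (200, record) if record is not None else (404, None)
```

The design choice is to pay the cost at write time (one stored view per access pattern) so reads stay predictable at scale.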
13. Batch Processing
[Timeline diagram: Batch 1 starts processing at t1 and is ready to be served at t2; Batch 2 starts at t3 and is ready at t4; Batch 3 starts at t5. Queries between t2 and t4 are answered from Batch 1, afterwards from Batch 2, each time with a data gap: records that arrived after the serving batch started processing are not yet visible.]
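The batch timeline can be modelled in a few lines: given when each batch started processing and when it became ready, we can compute which batch serves a query at time t and the size of the resulting data gap. The time values are arbitrary units chosen for illustration.

```python
def serving_batch(batches, t):
    """batches: ordered list of (start_processing, ready_to_serve) times.

    Returns the most recent batch that is already ready at time t.
    """
    ready = [b for b in batches if b[1] <= t]
    return ready[-1] if ready else None

def data_gap(batches, t):
    """Age of the data gap: records ingested after the serving batch
    started processing are invisible until the next batch is served."""
    batch = serving_batch(batches, t)
    if batch is None:
        return None
    start, _ready = batch
    return t - start
```

The gap grows until the next batch becomes ready, which is exactly the freshness problem the hybrid approach addresses later.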
14. Batch Processing in Detail
[Diagram: a new batch is cut every granularity period; allow some time for the data upload to finish, run the batch with data from yesterday, load the results into a data store, then notify the retrieval system that a new batch is ready to be served. Until then, queries may effectively return data from the day before yesterday.]
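The "day before yesterday" effect is simple arithmetic: a record can wait almost one full granularity period before its batch is cut, then the upload-settle delay and the processing time pass before results are loaded and announced. The durations below are hypothetical examples, not figures from the talk.

```python
def worst_case_staleness_h(granularity_h, upload_settle_h, processing_h):
    """Maximum age, in hours, of the freshest queryable record:
    wait for the batch boundary, let uploads settle, then process."""
    return granularity_h + upload_settle_h + processing_h

# Example: daily batches (24h), 1h to let uploads finish, 6h of processing
# gives a worst case of 31 hours, i.e. data from the day before yesterday.
```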
15. Realtime Query
• Interactive queries
• REST-like request/response queries
• With an SLA
And
• Query the latest version of the data
• "Latest" means n seconds ago, with n predictable
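The two requirements above (a latency SLA and a predictable freshness bound n) can be made concrete in a small wrapper. Everything here is an illustrative assumption: the backend is any callable that returns a value together with the age of the data it was computed from.

```python
import time

def query_with_sla(backend, key, max_latency_s=0.1, max_staleness_s=5.0):
    """Run one query and report whether both real-time properties hold:
    the answer came back within the latency SLA, and it reflects data
    no older than max_staleness_s ("latest" = n seconds ago)."""
    start = time.monotonic()
    value, data_age_s = backend(key)      # backend reports its data age
    latency_s = time.monotonic() - start
    return {
        "value": value,
        "latency_ok": latency_s <= max_latency_s,
        "fresh": data_age_s <= max_staleness_s,
    }
```

A pure batch pipeline fails the `fresh` check for most of the day, which motivates the hybrid approach below.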
20. Hybrid Approach
[Timeline diagram: as before, Batch 1 starts processing at t1 and is ready to be served at t2; Batch 2 starts at t3 and is ready at t4. In the hybrid approach, each batch is paired with complementary real-time data covering the window since that batch started processing.]
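The serving side of the hybrid (lambda) approach is a merge: answer each query from the latest ready batch view, overlaid with the complementary real-time data accumulated since that batch started processing. The dictionary-based views below are an assumption made for illustration.

```python
def merged_view(batch_view, realtime_delta):
    """Lambda-style serving merge: real-time values override the (older)
    batch values; the batch view fills in everything else."""
    view = dict(batch_view)
    view.update(realtime_delta)
    return view
```

The real-time delta stays small because it only has to cover one batch window; it can be discarded once the next batch, which includes that data, is served.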
21. Realtime Search with Hadoop
[Architecture diagram: a Data In gateway feeds two Hadoop clusters (each with a NameNode, a JobTracker, and DataNodes); batch jobs generate indexes, while a Coordinator and a real-time Data Out gateway update and serve those indexes.]
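The two index paths in the architecture can be sketched side by side: a batch job builds a full inverted index from the documents in HDFS, while the real-time path folds new documents into the live index between rebuilds. Document shape and whitespace tokenization are simplifying assumptions.

```python
from collections import defaultdict

def generate_index(docs):
    """Batch path: build an inverted index term -> set of doc ids
    (what a MapReduce indexing job would produce from HDFS data)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Real-time path: incrementally fold one new document into the
    live index, so searches see it before the next batch rebuild."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(index, term):
    """Serving path: answer a term query from the merged index."""
    return sorted(index.get(term.lower(), set()))
```

Periodically regenerating the index from the batch layer keeps the incremental updates from accumulating errors, which is the same batch/real-time split as the hybrid approach above.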
The remainder of this presentation focuses on preventing DNS and DDoS attacks, and on global server load balancing to add resiliency to your eCommerce architecture.