1. © 2014 IBM Corporation
Advanced Data Retrieval and Analytics
with Apache Spark and OpenStack Swift
Gil Vernik
IBM Research - Haifa
2.
Topics Covered in This Talk
§ OpenStack Swift
§ Apache Spark
§ Basic integration between Spark and Swift
§ Advanced integration between Spark and Swift
using Storlets technology
3.
Digital Universe
More than 1.8 zettabytes
(1.8 trillion gigabytes)
Grows rapidly
80% owned by enterprises
75% generated by individuals
According to IDC iView, "Extracting Value from Chaos"
4.
MapReduce, databases, etc.
Data needs to be replicated, which costs time, money, etc.
6.
OpenStack Swift
§ A massively scalable object store
§ Known to run on thousands of
servers and store petabytes of data
§ Exposes REST API
§ Features:
– Storage policies
– Erasure codes
– Data replication
– ….
[Figure: PUT request, Proxy Nodes, Storage Nodes]
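The REST API bullet can be made concrete with a small sketch. It builds, but does not send, an object PUT using the Swift path layout /v1/{account}/{container}/{object} and the X-Auth-Token header. The endpoint, account, container, object name and token are made-up placeholders, and build_put_request is a hypothetical helper, not part of any Swift client.

```python
from urllib.parse import quote
from urllib.request import Request


def build_put_request(endpoint, account, container, obj, token, data):
    """Build (but do not send) a Swift object PUT request."""
    # Swift addresses objects as /v1/{account}/{container}/{object}.
    path = "/v1/{}/{}/{}".format(quote(account), quote(container), quote(obj))
    return Request(
        endpoint.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )


# All values below are placeholders chosen for illustration.
req = build_put_request(
    "http://swift.example.com:8080", "AUTH_demo", "photos", "cat.jpg",
    token="example-token", data=b"fake image bytes",
)
print(req.get_method(), req.full_url)
```

In a real deployment the request goes to a proxy node, which takes care of writing the object to the storage nodes.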
7.
Apache Spark
§ Apache Spark™ is a fast and general engine for
large-scale data processing
– Up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
§ Combines SQL, streaming, and complex analytics
§ Can read existing Hadoop data
§ Most active project in Apache today
8.
Swift enablement for data retrieval in Spark
§ Apache Spark implements Hadoop interfaces and can use
HDFS or Amazon S3 as a data source.
§ IBM Research enabled Spark to access data stored in
OpenStack Swift.
[Figure: Swift, Network]
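As a rough illustration of what this enablement looks like from the user's side, the sketch below assembles a swift:// URI and the matching Hadoop settings. It assumes the hadoop-openstack connector and its fs.swift.* property names; the slides do not say which connector version was used, so treat this as one possible setup. The service name, URL and credentials are placeholders, and both helper functions are hypothetical.

```python
def swift_uri(container, service, path):
    """Return a swift:// URI in the container.service form used by hadoop-openstack."""
    return "swift://{}.{}/{}".format(container, service, path.lstrip("/"))


def swift_hadoop_conf(service, auth_url, tenant, username, password):
    """Return Hadoop settings for one Swift service (hadoop-openstack key names)."""
    prefix = "fs.swift.service.{}.".format(service)
    return {
        "fs.swift.impl": "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
    }


# Placeholder values, for illustration only.
conf = swift_hadoop_conf(
    "demo", "http://keystone.example.com:5000/v2.0/tokens",
    tenant="analytics", username="spark", password="secret",
)
uri = swift_uri("logs", "demo", "/2014/app.log")
print(uri)
for key in sorted(conf):
    print(key, "=", conf[key])
```

With these settings in Spark's Hadoop configuration (for example as spark.hadoop.<key> properties), the URI can be passed to sc.textFile in the same way as an HDFS path.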
9.
What do we analyze?
[Figure: Swift, Network]
Stored Data    Input to Analytics
Images         EXIF metadata
PDF            Hidden metadata
Logs           Only 'ERROR' records
…              …
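The log row can be illustrated with a small sketch: only the lines containing 'ERROR' are needed by the analytics job, so most of the stored bytes are irrelevant to it. The sample log is made up, and error_records is a hypothetical helper.

```python
def error_records(lines):
    """Keep only the log lines that contain the marker 'ERROR'."""
    return [line for line in lines if "ERROR" in line]


# Made-up sample log, for illustration only.
log = [
    "2014-11-03 10:00:01 INFO  request served",
    "2014-11-03 10:00:02 ERROR disk timeout",
    "2014-11-03 10:00:03 INFO  request served",
    "2014-11-03 10:00:04 ERROR replica missing",
]

selected = error_records(log)
total_bytes = sum(len(line) for line in log)
needed_bytes = sum(len(line) for line in selected)
print(len(selected), "of", len(log), "lines;",
      needed_bytes, "of", total_bytes, "bytes are input to analytics")
```

If this filter runs on the analytics side, every line still has to cross the network before most of them are thrown away.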
10.
Yes! We can do it better.
11.
Storlets: Flexible Extensions for Swift
Advanced data processing inside Swift
§ Storlets are a way to extend
cloud computational capabilities
§ A storlet is compiled code that is
deployed to Swift and, when
triggered, is executed by the Storlet
Engine directly on the storage
nodes.
§ The Storlet Engine is responsible for
executing every storlet in a secure
environment
§ A storlet is standard Java code
12.
Storlets extend an object store by
moving computation to the data –
filtering, transforming, analyzing –
instead of bringing the data to the
computation
13.
Swift Storlets: How do they benefit Spark?
[Figure: Swift, Storlet, Network; Objects, Filter, Data processing]
14.
Storlets Enable Extending the Functionality of Spark
Example: analyzing EXIF metadata from photos
§ Object store is a
natural repository for
photos
§ Photos contain rich
capture metadata
§ Analyzing this
metadata for a set of
photos can show how
the camera is used
15.
Example: Analyzing EXIF metadata
Storlets can extract the metadata and return it as JSON
(rather than having Spark process the binary data directly)
[Figure: 10 MB image in, 1 KB of JSON out]
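Taking the slide's 10 MB and 1 KB figures at face value, the saving per photo is easy to work out. The sketch assumes binary units (1 MB = 1024 KB); the photo count of 1,000 is a hypothetical number added only to show how the saving scales.

```python
KB = 1024
MB = 1024 * KB

image_bytes = 10 * MB   # size of one photo, from the slide
json_bytes = 1 * KB     # size of the extracted metadata, from the slide

reduction = image_bytes // json_bytes
print(reduction)  # 10240

# Hypothetical collection size, for illustration only.
photos = 1000
moved_without_storlet = photos * image_bytes
moved_with_storlet = photos * json_bytes
print(moved_without_storlet // MB, "MB versus", moved_with_storlet // KB, "KB")
```

Under these assumptions, roughly four orders of magnitude less data crosses the network for each photo.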
16.
Example: Analyzing EXIF metadata.
• Spark accesses images via storlet
• No change to Spark; only the URI changes
• JSON file returned by storlet defines schema
• SQL from Spark processes metadata
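The flow in these bullets can be sketched end to end without a cluster. In the example below, Python's built-in sqlite3 stands in for Spark SQL, and the JSON text is a made-up stand-in for what a storlet might return; the field names (name, Model, ISO) are assumptions, not the talk's actual schema. The point is only that the JSON keys give the schema and that plain SQL then runs over the metadata.

```python
import json
import sqlite3

# Hypothetical JSON a storlet might return, one record per photo.
storlet_output = """[
  {"name": "a.jpg", "Model": "Camera A", "ISO": 100},
  {"name": "b.jpg", "Model": "Camera A", "ISO": 800},
  {"name": "c.jpg", "Model": "Camera B", "ISO": 200}
]"""

records = json.loads(storlet_output)

# The keys of the first record define the table schema.
columns = list(records[0].keys())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exif ({})".format(", ".join('"{}"'.format(c) for c in columns))
)
conn.executemany(
    "INSERT INTO exif VALUES ({})".format(", ".join("?" for _ in columns)),
    [tuple(r[c] for c in columns) for r in records],
)

# SQL over the metadata: photos per camera model and their average ISO.
rows = conn.execute(
    'SELECT "Model", COUNT(*), AVG("ISO") FROM exif '
    'GROUP BY "Model" ORDER BY "Model"'
).fetchall()
print(rows)  # [('Camera A', 2, 450.0), ('Camera B', 1, 200.0)]
```

In Spark itself, the same kind of query would run over a table whose schema is inferred from the JSON, with no image bytes ever reaching the Spark workers.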
17.
Example: Analyzing EXIF metadata.
18.
Summary
§ OpenStack Swift is the most popular open source
object store
§ Apache Spark is the next big thing in data analytics
§ Spark and Swift can be integrated
§ Storlets in Swift provide clear benefits for analytics
use cases.
Thank you!
More information
Gil Vernik, IBM Research - Haifa
gilv@il.ibm.com