During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence-style functionality on Hadoop. Hortonworks also provided a deep-dive demo of Stinger! You can access the slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
3. Agenda
7:00 Networking
Grab a slice of pizza and a drink...
7:15 Welcome: Joe Caserta, President, Caserta Concepts; Author, The Data Warehouse ETL Toolkit
About the Meetup and about Caserta Concepts
7:30 Intro to Real-Time Queries in Hadoop: Elliott Cordo, Principal Consultant, Caserta Concepts
7:50 Deep Dive into Hortonworks Stinger: Abhijit Lele, Solutions Engineer, Hortonworks
8:10 - 9:00 More Networking
Tell us what you’re up to…
4. About the BDW Meetup
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting
projects
• Next BDW Meetup: September 16.
• Topic: Real-World Use Case and
Solution in Financial Sector
• Want to present your idea/solution?
Contact joe@casertaconcepts.com
5. About Caserta Concepts
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
9. Contacts
Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com
Bob Eilbacher
VP, Sales & Marketing
P: (855) 755-2246 x345
E: bob@casertaconcepts.com
Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
10. Why talk about Interactive Queries?
[Architecture diagram: source systems (ERP, Finance, Legacy) feed two paths via ETL. One path is a traditional EDW serving ad-hoc/canned reporting and traditional BI. The other is a Big Data Cluster, a horizontally scalable environment optimized for analytics: the Hadoop Distributed File System (HDFS) across nodes N1-N5 running MapReduce, Pig/Hive, and Mahout, plus a NoSQL database (Cassandra), serving search/data analytics, canned reporting, and Big Data BI.]
12. The most ambiguous term of all..
REAL-TIME
What do we mean by this?
• Real-time ingest/processing
• Very Recent Data
• Real-time/interactive queries
• Fast Queries
What are the acceptable latencies, and how are they measured?
Would we categorize a one-minute latency from transaction occurrence to
query availability as real-time?
What is our threshold for query latency?
Let’s explore both aspects
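Before exploring those aspects, it helps to pin down how ingest latency would actually be measured. A minimal sketch in Python, where the event records, field names, and timestamps are all invented for illustration:

```python
# Hypothetical events, each tagged with when the transaction occurred and
# when it became queryable. Field names and values are illustrative only.
events = [
    {"txn_time": 0.0, "available_time": 42.0},
    {"txn_time": 10.0, "available_time": 70.0},
    {"txn_time": 20.0, "available_time": 95.0},
]

# Ingest latency: seconds from transaction occurrence to query availability.
latencies = sorted(e["available_time"] - e["txn_time"] for e in events)

def percentile(values, p):
    """Nearest-rank percentile over an already-sorted list."""
    k = max(0, min(len(values) - 1, round(p / 100 * len(values)) - 1))
    return values[k]

print("median latency:", percentile(latencies, 50))
print("p95 latency:", percentile(latencies, 95))
```

Reporting a percentile rather than a single number matters here: a pipeline can look "real-time" on average while its tail latency blows past the one-minute threshold.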
13. Plumbing: Ingest and Processing
• Let’s assume “freshness” of data is important for our “real-time”
requirement.
• Micro-batch: bring in data in incremental batches as quickly as
possible. Highly optimized ETL!
• Streaming: Continuously push messages into Hadoop
14. Micro-batch
Use our familiar tools, and build REALLY good ETL:
Traditional:
• Informatica
• Talend
Big Data tools:
• Sqoop
• Pig
• Hive
How “real-time” can we be? On the scale of minutes.
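The micro-batch pattern can be sketched in plain Python. This is an invented stand-in: in practice the source would be a relational table pulled via a Sqoop incremental import or an Informatica/Talend job, not an in-memory list.

```python
# Minimal micro-batch sketch: repeatedly pull only rows newer than the
# last high-water mark, then advance the mark. All names are illustrative.
source_table = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 150},
    {"id": 3, "updated_at": 200},
]

target = []          # stand-in for a Hive/HDFS landing zone
watermark = 0        # highest updated_at value loaded so far

def run_micro_batch():
    """Load one incremental batch and advance the watermark."""
    global watermark
    batch = [r for r in source_table if r["updated_at"] > watermark]
    if batch:
        target.extend(batch)
        watermark = max(r["updated_at"] for r in batch)
    return len(batch)

print(run_micro_batch())  # first run loads everything seen so far
source_table.append({"id": 4, "updated_at": 250})
print(run_micro_batch())  # subsequent runs load only new rows
```

The "highly optimized ETL" part is making that loop cheap enough to run every minute or two without re-reading the whole source.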
15. Streaming
Micro-batch is too slow; we need to push data into our analytics
system at low latency!
Several classes of products fit this requirement:
Stream data collection
• Flume
Complex Events Processors:
• Esper
• Streambase
Distributed Computation Systems:
• Storm
• Akka
How “real-time” are we now? Minutes to milliseconds!
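The shift from micro-batch to streaming is per-message handling instead of per-batch windows. A minimal sketch with Python's standard library, where the queue stands in for a message broker and the list stands in for HDFS:

```python
import queue
import threading

# Producer pushes messages onto a queue; a consumer thread continuously
# drains them into the analytics store one message at a time, so latency
# is per-message rather than per-batch. All names are illustrative.
events = queue.Queue()
store = []  # stand-in for HDFS / the analytics system

def consumer():
    while True:
        msg = events.get()
        if msg is None:          # sentinel: shut down cleanly
            break
        store.append(msg)        # in real life: write to Flume/HDFS sink

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    events.put({"trade_id": i})  # continuous feed (finite for the demo)
events.put(None)
t.join()
print(len(store))
```

Flume, Storm, and the CEP engines layer reliability, fan-out, and computation on top of exactly this consume-as-it-arrives loop.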
16. A quick Storm example:
A CONTINUOUS FEED OF DATA!
[Topology diagram: message queues publish a new “topic” per message; stock trades, market prices, and CRUD operations flow tuple-by-tuple (“next tuple…”) through bolts that look up the customer profile, calculate trade price vs. market, and calculate CRM opportunities.]
Are we fast enough yet to be considered “real-time”?
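The tuple flow on the slide can be imitated in plain Python. Real Storm topologies are defined in Java/Clojure with spout and bolt classes; the function names, lookup tables, and prices below are all invented for illustration.

```python
# Plain-Python imitation of the slide's topology: a spout emits trade
# tuples, one bolt enriches them with the customer profile, the next
# compares the trade price to the market price. Names are illustrative.
market_prices = {"ACME": 100.0}
crm = {"c1": {"name": "Alice"}}

def trade_spout():
    """Emit a continuous feed of trade tuples (finite here for the demo)."""
    yield {"customer": "c1", "symbol": "ACME", "price": 101.5}
    yield {"customer": "c1", "symbol": "ACME", "price": 98.0}

def lookup_profile_bolt(tup):
    """Enrich the tuple with the customer's CRM profile."""
    tup["profile"] = crm.get(tup["customer"])
    return tup

def price_vs_market_bolt(tup):
    """Compare the trade price against the current market price."""
    tup["vs_market"] = tup["price"] - market_prices[tup["symbol"]]
    return tup

results = [price_vs_market_bolt(lookup_profile_bolt(t)) for t in trade_spout()]
print([t["vs_market"] for t in results])  # [1.5, -2.0]
```

In Storm each bolt runs on its own workers, so every tuple is processed the moment it arrives, which is where the milliseconds-scale latency comes from.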
17. So we have options for fresh data,
Now queries need some attention!
• Hadoop is a batch system: queries are dispatched as
MapReduce jobs
• Simple queries take around a minute or two
• Complex queries (joins and aggregation) can take much
longer
18. So how have we run queries in Hadoop up
until now?
• Hive – compiles SQL code into MapReduce
• Pig – suited to data transformation, but query-capable!
• Third Party Tools – Such as Datameer
• HBase Low Latency!!!!!
• But query language is Spartan
• Low query flexibility!
• Anticipate your query and materialize it!
[Chart: query flexibility vs. data volume. RDBMSs offer high query flexibility at lower volumes; NoSQL stores scale to higher volumes with less query flexibility.]
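The HBase pattern above ("anticipate your query and materialize it") can be sketched in a few lines. The row-key layout and trade data are invented; a real implementation would write to an HBase table, not a dict.

```python
# Sketch of the low-latency HBase pattern: since the query language is
# Spartan, anticipate the query at ETL time and materialize its answer
# under a row key you can fetch directly. Key layout is illustrative.
table = {}  # stand-in for an HBase table: row_key -> value

# ETL time: pre-aggregate trade amounts per (customer, day).
trades = [("c1", "2013-08-01", 50.0),
          ("c1", "2013-08-01", 25.0),
          ("c2", "2013-08-01", 10.0)]
for cust, day, amount in trades:
    key = f"{cust}#{day}"
    table[key] = table.get(key, 0.0) + amount

# Query time: no joins, no aggregation, just a low-latency point get.
print(table["c1#2013-08-01"])  # 75.0
```

The trade-off is exactly the low flexibility noted above: a question you did not anticipate at ETL time has no row key to fetch.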
19. MPP Connectors
• Massively Parallel Processing: horizontally scalable database platforms (columnar
under the hood) that present themselves relationally
• Many have built sophisticated integrations to Hadoop
• Use MPP managed tables and Hadoop in same query
• Ship data to MPP from Hadoop for faster queries
20. Downstream Relational Databases
• Move aggregate data out to relational Datamarts using
ETL
• Both this solution and MPP connectors suffer from a few
problems:
• Batch/micro-batch latency
• Processing and the imposition of a relational model: loss of agility
• The majority of the data is left behind in Hadoop
[Diagram: ETL from the Hadoop cluster into a relational data mart]
21. Dremel
• Research published by Google in 2010
• Main Features
• Fast/interactive ad-hoc queries
• Scales to trillions of records, petabytes of data
• Relies on its own processing model outside MapReduce
• Leverages a special columnar storage
• Foundation for Google Big Query
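Why the "special columnar storage" matters can be shown with a toy example in Python. The records and field names are invented; Dremel's actual format also handles nested data, which this sketch ignores.

```python
# Toy contrast of row vs. columnar layout for an analytic scan.
rows = [
    {"user": "a", "bytes": 10, "country": "US"},
    {"user": "b", "bytes": 20, "country": "DE"},
    {"user": "c", "bytes": 30, "country": "US"},
]

# Row layout: SUM(bytes) must walk every full record.
row_sum = sum(r["bytes"] for r in rows)

# Columnar layout: the same data pivoted into per-column arrays, so a
# query touching one column reads only that column's contiguous values.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_sum = sum(columns["bytes"])  # scans one array, skips user/country

print(row_sum, col_sum)  # 60 60
```

Same answer either way, but at petabyte scale the columnar scan reads a small fraction of the bytes, which is a large part of how Dremel-style engines hit interactive latencies.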
22. Inspired by Dremel
• There are several new query engines!
• Drill (Incubator)
• Stinger (Hortonworks)
• Impala (Cloudera)
• All process outside the MapReduce framework
• Evolution or extension of Hive!
• Several MPP features have been adopted to deal with
query planning and join optimization, such as
co-location and broadcasting.
• Also note that to achieve the best performance, some
structure will need to be imposed on the data:
• ORC File
• Parquet
23. Now we have something that can provide
us “Real-time” in Hadoop
• At least most of the time
•Queries are significantly faster but
not always instantaneous
• Simple selects: a couple of seconds
• Join queries: tens of seconds
24. Where is this going?
• So is the roadmap for these engines to be a
Hadoop MPP?
• Likely not
• Are we ready to build an EDW on Hadoop?
• For the right use case, we are getting there!
Consider it a supplement to an MPP or relational
EDW.
• Will there be a “winner” in the open source
race?
• Maybe, or they will evolve and find their own
strengths, niches
25. What we do know about these new
engines
• They are made to fit a need for fast queries
on large sets of data!
• They present an exciting feature for the
Hadoop ecosystem
Editor's Notes
Alternative NoSQL: HBase, Cassandra, Druid, VoltDB