2. Big Data
Terabytes and petabytes of data
Sometimes per day
2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555
3. Example Use Cases Today
Transactional
•Fraud detection
•Financial services/stock markets
Sub-Transactional
•Weblogs
•Social/online media
•Telecoms events
4. Example Use Cases Today
Non-Transactional
•Web pages, blogs, etc.
•Documents
•Physical events
•Application events
•Machine events
In most cases structured or semi-structured
5. Data Lake
• Single source
• Large volume
• Not distilled
6. Data Lakes
• 0-2 lakes per company
• Known and unknown questions
• Multiple user communities
• $1-10k questions, not $1m ones
• Don’t fit in a traditional RDBMS at a reasonable cost
7. Data Lake Requirements
• Store all the data
• Satisfy routine reporting and analysis
• Satisfy ad-hoc query / analysis / reporting
• Balance performance and cost
8. Traditional BI
[Diagram: a Data Source feeds Data Mart(s); the rest of the data goes to Tape/Trash, leaving open questions (? ? ?)]
9. What if...
[Diagram: a Data Source feeds Data Lake(s), which in turn serve Data Mart(s), Ad-Hoc analysis, and the Data Warehouse]
10. Big Data Does Not Replace Data Marts
• It’s not a database
• High latency
• Optimized for massive data-crunching
• Databases are immature
• Databases are no-SQL
11. Big Data
Map/Reduce and Hadoop
12. What is Map/Reduce
• Obligatory Wikipedia quote: “... is a patented software
framework introduced by Google to support
distributed computing on large data sets on clusters
of computers”
• Invented by Google to index “The Internet”
• Apache Hadoop is an Open Source implementation of the
Map/Reduce algorithm
• Scalable & fault-tolerant, not efficient!
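To make the idea concrete, here is a minimal in-memory sketch of the three phases (map, shuffle, reduce) as a word count. It is plain Java and deliberately does not use the Hadoop API; the class and method names are hypothetical, chosen only for illustration. On a real cluster each phase would be spread across many nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a single-process word count, not Hadoop code.
public class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over the input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, lake=1}
        System.out.println(run(Arrays.asList("big data", "big lake")));
    }
}
```

Each call to map and reduce depends only on its own input, which is what lets a framework rerun a failed fragment on another node.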
13. What Hadoop Really Is
• Core components
• HDFS – a distributed file system allowing massive storage across a cluster of
commodity servers
• Map-Reduce
• Framework for distributed computation, common use cases include
aggregating, sorting, and filtering BIG data sets
• Problem is broken up into small fragments of work that can be computed or
recomputed in isolation on any node of the cluster
• Related Projects
• Hive – a data warehouse infrastructure on top of Hadoop
• Implements a SQL-like query language, including a JDBC driver
• Allows MapReduce developers to plug in custom mappers and reducers
• HBase – the Hadoop database – AH HA!
• A variant of NoSQL databases, problematic for traditional BI
• Best at storing large amounts of unstructured data
14. No seriously, what is Hadoop?
Java software framework that supports data-intensive distributed applications
• Apache project
• Developed largely at Yahoo, based on Google’s ideas
• Distributed filesystem + MapReduce engine
• Commodity hardware
• Scales out beyond the technology and/or economics of an RDBMS
15. Hadoop and BI?
• Distributed processing
• Distributed file system
• Commodity hardware
• Platform independent (in theory)
• Scales out beyond the technology and/or economics of an RDBMS
In many cases it’s the only viable solution
16. Hadoop and BI?
90% of new Hadoop use cases are transformations of structured or semi-structured data*
* of those companies we’ve talked to...
17. Hadoop and BI?
“The working conditions
within Hadoop are shocking”
ETL Developer
18. Hadoop and BI?
Instead of this...
19. Hadoop and BI?
You have to do this in Java...
    public void map(
        Text key,
        Text value,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException

    public void reduce(
        Text key,
        Iterator<Text> values,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException
20. People don’t use
Hadoop for BI because
they want to...
21. ...they do it because
they have to...
22. ... and unfortunately it
wasn’t designed
for most BI requirements
23. Why not add to Hadoop
the things it’s missing...
24. ... until it can do
what we need it to?
25. If only we had a
Java, embeddable,
data transformation engine...
26. Pentaho Data Integration
[Diagram: Pentaho Data Integration is used to Design, Deploy, and Orchestrate; it provides the Hadoop integration and feeds Data Marts, the Data Warehouse, and Analytical Applications]
27. [Architecture diagram, top to bottom]
Visualize: Reporting / Dashboards / Analysis
Web Tier
DM & DW: RDBMS
Optimize: Hive, Hadoop
Files / HDFS
Load: Applications & Systems
28. Reporting / Dashboards / Analysis
Web Tier
DM RDBMS
Hive
Hadoop
HDFS
29. 30,000 ft View
Host Machine
pentaho-hadoop-vm
Hadoop
PDI Client
HDFS Hive
Tasks and Jobs
30. Inside the VM
pentaho-hadoop-vm
Hadoop
HDFS Hive
Job
Mapper Reducer
31. Inside a job
Job
Mapper Reducer
*
Java Application Java Application
Scripting Scripting
* Combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
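As a rough illustration of what the combiner buys you, the sketch below counts words on two simulated mappers and compares how many records would travel to the reducer with and without pre-reduction. It is plain Java, not Hadoop code; the class name, method names, and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates mapper-side pre-reduction in one process.
public class CombinerSketch {

    // Combiner: sums counts per word in the mapper's memory,
    // so only one record per distinct word leaves the mapper.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer: merges the partial counts received from all mappers.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials) {
            for (Map.Entry<String, Integer> e : p.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Records sent without a combiner: one (word, 1) pair per occurrence.
    static int recordsWithoutCombiner(List<List<String>> mapperInputs) {
        int n = 0;
        for (List<String> words : mapperInputs) {
            n += words.size();
        }
        return n;
    }

    // Records sent with a combiner: one pair per distinct word per mapper.
    static int recordsWithCombiner(List<List<String>> mapperInputs) {
        int n = 0;
        for (List<String> words : mapperInputs) {
            n += combine(words).size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> mapperInputs = Arrays.asList(
                Arrays.asList("error", "error", "warn"),
                Arrays.asList("error", "info"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> words : mapperInputs) {
            partials.add(combine(words));
        }

        System.out.println(recordsWithoutCombiner(mapperInputs)); // prints 5
        System.out.println(recordsWithCombiner(mapperInputs));    // prints 4
        System.out.println(reduce(partials)); // prints {error=3, info=1, warn=1}
    }
}
```

The saving grows with the number of repeated keys per mapper. This only works because summing is associative and commutative; an operation without those properties cannot be safely pre-reduced this way.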
32. Inside a job with PDI
[Diagram: inside a Job, the Mapper and the Reducer each host a PDI Execution Engine that runs a Transformation made up of Steps]
33. Demo
34. The Single Threaded Transformation Engine
• Designed to use a single thread
• Processes rows per batch because Hadoop
delivers rows in batches
• Knows when the batch of rows is processed
• Is only initialized once and disposed of once
• Has reduced overhead for data passing
between steps
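The properties listed above can be sketched as follows. This is not the PDI engine or its API; the Step interface, the Engine class, and the two example steps are hypothetical names invented for illustration. The point is the lifecycle: initialize once, push each batch through all steps on the calling thread, dispose once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a toy single-threaded row pipeline, not PDI code.
public class SingleThreadedSketch {

    // Hypothetical step contract; returning null drops the row.
    interface Step {
        void init();
        String processRow(String row);
        void dispose();
    }

    static class DropBlank implements Step {
        public void init() {}
        public String processRow(String row) {
            String trimmed = row.trim();
            return trimmed.isEmpty() ? null : trimmed;
        }
        public void dispose() {}
    }

    static class UpperCase implements Step {
        public void init() {}
        public String processRow(String row) { return row.toUpperCase(); }
        public void dispose() {}
    }

    static class Engine {
        private final List<Step> steps;

        Engine(List<Step> steps) { this.steps = steps; }

        // Called once before the first batch.
        void init() { for (Step s : steps) s.init(); }

        // Runs one batch on the calling thread. Rows are handed from step
        // to step as plain method calls, with no queues in between, and
        // the method returns only when the whole batch is processed.
        List<String> processBatch(List<String> batch) {
            List<String> out = new ArrayList<>();
            for (String row : batch) {
                String current = row;
                for (Step s : steps) {
                    current = s.processRow(current);
                    if (current == null) break;
                }
                if (current != null) out.add(current);
            }
            return out;
        }

        // Called once after the last batch.
        void dispose() { for (Step s : steps) s.dispose(); }
    }

    static List<List<String>> runBatches(List<List<String>> batches) {
        Engine engine = new Engine(Arrays.<Step>asList(new DropBlank(), new UpperCase()));
        engine.init();
        List<List<String>> results = new ArrayList<>();
        for (List<String> batch : batches) {
            results.add(engine.processBatch(batch));
        }
        engine.dispose();
        return results;
    }

    public static void main(String[] args) {
        // prints [[A], [B]]
        System.out.println(runBatches(Arrays.asList(
                Arrays.asList(" a", ""), Arrays.asList("b"))));
    }
}
```

Because processBatch returns only after every row has passed through every step, the caller always knows when a batch is done, which matches the batch-wise way Hadoop delivers rows.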
35. The Single Threaded Transformation Engine
• Is no longer used inside of Hadoop thanks
to new developments. “The multi-threaded
engine is still faster” they said.
• Is being introduced into PDI 4.2.0 (CE)
• You will be able to specify a mapping to run
single threaded
• Allows you to reduce context switching in
large to huge transformations (lots of steps)
36. Pentaho for Hadoop Resources
Download: www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage (resources, press, events, partnerships and more): www.pentaho.com/hadoop
Big Data Analytics: 5 part video series with James Dixon, Pentaho CTO
Or contact me: mcasters at pentaho dot org
37. Thank You.
Join the conversation. You can find us on:
http://blog.pentaho.com
@Pentaho
Pentaho Facebook Group
Pentaho - Open Source Business Intelligence Group