Putting Analytics in Big Data Analytics
Jake Cornelius
Director of Product Management, Pentaho Corporation
Learn more @ http://www.cloudera.com/hadoop/
2. Traditional BI
Diagram labels: Data Source, Data Mart(s), Tape/Trash, unanswered questions (?)
3. Big Data Architecture
Diagram labels: Data Source, Data Lake(s), Data Mart(s), Data Warehouse, Ad-Hoc
4. Pentaho Data Integration
Diagram labels: Hadoop, Pentaho Data Integration, Data Marts / Data Warehouse / Analytical Applications, Design, Deploy, Orchestrate
5. Diagram labels: Load, Optimize, Visualize, Files / HDFS, Hive, DM & DW, Applications & Systems, Hadoop, RDBMS, Web Tier, Reporting / Dashboards / Analysis
6. Diagram labels: HDFS, Hive, DM, Hadoop, RDBMS, Web Tier, Reporting / Dashboards / Analysis
7. Demo
8. Pentaho for Hadoop Announcements
• Pentaho for Hadoop Download Capability
• Includes support for development; production support will follow with GA
• Collaborative effort between Pentaho and the Pentaho Community
• 60+ beta sites over a three-month beta cycle
• Pentaho contributed code for API integration with Hive to the open source Apache Foundation
• Pentaho and Cloudera Partnership
• Combines Pentaho's business intelligence and data integration capabilities with Cloudera's Distribution for Hadoop (CDH)
• Enables business users to take advantage of Hadoop, with the ability to easily and cost-effectively mine, visualize and analyze their Hadoop data
9. Pentaho for Hadoop Announcements (cont.)
• Pentaho and Impetus Technologies Partnership
• Incorporates Pentaho Agile BI and the Pentaho BI Suite for Hadoop into the Impetus Large Data Analytics practice
• First major SI to adopt Pentaho for Hadoop
• Facilitates large data analytics projects, including expert consulting services and best-practices support for Hadoop and nCluster implementations, including deployment on private and public clouds
10. Pentaho for Hadoop Resources & Events
Resources
Download: www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage - resources, press, events, partnerships and more: www.pentaho.com/hadoop
Big Data Analytics: 5-part video series with James Dixon, Pentaho CTO
Events
Hadoop World: NYC - Oct 12, Gold Sponsor, Exhibitor, Richard Daley presenting, ‘Putting Analytics in Big Data Analysis’
London Hadoop User Group - Oct 12, London
Agile BI Meets Big Data - Oct 13, New York City
11. Thank You.
Join the conversation. You can find us on:
Pentaho Facebook Group
@Pentaho
http://blog.pentaho.com
Pentaho - Open Source Business Intelligence Group
Editor's Notes
In a traditional BI system, where we have not been able to store all of the raw data, we have solved the problem by being selective.
First, we selected the attributes of the data that we knew we had questions about. Then we cleansed it, aggregated it to transaction level or higher, and packaged it up in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which, for the purposes of analysis, is the same as throwing it away.
TRANSITION
The problem is we don’t know what is in the data that we are throwing away or archiving. We can only answer the questions that we could predict ahead of time.
When we look at the Big Data architecture we described before, we recall that:
* We want to store all of the data, so we can answer both known and unknown questions
* We want to satisfy our standard reporting and analysis requirements
* We want to satisfy ad-hoc needs by providing the ability to dip into the lake at any time to extract data
* We want to balance performance and cost as we scale
We need the ability to take the data in the Data Lake and easily convert it into data suitable for a data mart, data warehouse or ad-hoc data set, without requiring custom Java code.
Fortunately, we have an embeddable data integration engine, written in Java.
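As a rough illustration of what "embeddable" means here, the sketch below runs a transformation designed in PDI from a plain Java program using the Kettle API. The file name weblog_transform.ktr is invented for the example, and exact package names can vary between PDI versions.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugins, logging, etc.)
        KettleEnvironment.init();

        // Load a transformation designed in Spoon (file name is hypothetical)
        TransMeta transMeta = new TransMeta("weblog_transform.ktr");

        // Run it in-process and wait for it to finish
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}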
We have taken our data integration engine, PDI, and integrated it with Hadoop in a number of different areas:
* We have the ability to move files between Hadoop and external locations (a hand-coded equivalent is sketched after this list)
* We have the ability to read and write HDFS files during data transformations
* We have the ability to execute data transformations within the MapReduce engine
* We have the ability to extract information from Hadoop and load it into external databases and applications
* And we have the ability to orchestrate all of this, so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging, etc.
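For comparison, here is roughly what the first of those capabilities looks like when coded by hand against the Hadoop FileSystem API. The NameNode URI and file paths are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWeblogToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020"); // NameNode URI is an assumption

        FileSystem fs = FileSystem.get(conf);
        Path source = new Path("/tmp/weblog.txt");          // local file, hypothetical
        Path target = new Path("/user/pentaho/weblog.txt"); // HDFS location, hypothetical

        // Only copy the file if it is not already in HDFS
        if (!fs.exists(target)) {
            fs.copyFromLocalFile(source, target);
        }
        fs.close();
    }
}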
Put into diagram form, so we can indicate the different layers in the architecture and also show the scale of the data, we get this Big Data pyramid.
* At the bottom of the pyramid we have Hadoop, containing our complete set of data.
* Higher up we have our data mart layer. This layer has less data in it, but has better performance.
* At the top we have application-level data caches.
* Looking down from the top, from the perspective of our users, they can see the whole pyramid - they have access to the whole structure. The only thing that varies is the query time, depending on what data they want.
* Here we see that the RDBMS layer lets us optimize access to the data. We can decide how much data we want to stage in this layer. If we add more storage in this layer, we can increase performance over a larger subset of the data lake, but it costs more money (illustrated in the sketch below).
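As a hypothetical illustration of that trade-off, the same question can be answered either from the bottom of the pyramid (Hive, scanning all of the raw data, slower) or from the RDBMS layer where an aggregate has been staged (faster). The connection URLs, the weblog table and the pageviews_by_day mart are all invented, and the HiveServer2 JDBC driver and URL shown are newer than this deck; details vary by version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class SameQuestionTwoLayers {
    public static void main(String[] args) throws Exception {
        // Bottom of the pyramid: ad-hoc query over all of the raw data in Hive (slow, complete)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection hive = DriverManager.getConnection("jdbc:hive2://namenode:10000/default", "", "");
        ResultSet fromLake = hive.createStatement().executeQuery(
            "SELECT request_date, COUNT(*) FROM weblog GROUP BY request_date ORDER BY request_date");

        // RDBMS layer: the same answer from a pre-aggregated mart table (fast, but only staged data)
        Connection mart = DriverManager.getConnection("jdbc:postgresql://dbhost/analytics", "user", "pass");
        ResultSet fromMart = mart.createStatement().executeQuery(
            "SELECT request_date, pageviews FROM pageviews_by_day ORDER BY request_date");

        while (fromLake.next() && fromMart.next()) {
            System.out.printf("%s  lake=%d  mart=%d%n",
                fromLake.getString(1), fromLake.getLong(2), fromMart.getLong(2));
        }
        hive.close();
        mart.close();
    }
}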
In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks. We are going to:
TRANSITION 1
Get a weblog file from an FTP server
TRANSITION 2
Make sure the source file does not already exist in the Hadoop file system
TRANSITION 3
Copy the weblog file into Hadoop
TRANSITION 4
Read the weblog and process it - add metadata about the URLs, add geocoding, and enrich the operating system and browser attributes
TRANSITION 5
Write the results of the data transformation to a new, improved data file
TRANSITION 6
Load the data into Hive (this and the next two steps are sketched in code after the list)
TRANSITION 7
Read an aggregated data set from Hadoop
TRANSITION 8
And write it into a database
TRANSITION 9
Slice and dice the data with the database
TRANSITION 10
And execute an ad-hoc query into Hadoop
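Outside of PDI, loading the data into Hive, reading an aggregate back, and writing it into a database would look roughly like the sketch below. All table, column and connection names are invented for the example, and the HiveServer2 JDBC URL is an assumption.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LoadAndAggregateWeblog {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection hive = DriverManager.getConnection("jdbc:hive2://namenode:10000/default", "", "");
        Statement hql = hive.createStatement();

        // Load: define a Hive table and move the processed weblog file (already in HDFS) into it
        hql.execute("CREATE TABLE IF NOT EXISTS weblog (" +
                    " request_date STRING, url STRING, country STRING, browser STRING)" +
                    " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
        hql.execute("LOAD DATA INPATH '/user/pentaho/weblog_out' INTO TABLE weblog");

        // Read an aggregated data set from Hadoop
        ResultSet rs = hql.executeQuery(
            "SELECT request_date, COUNT(*) AS pageviews FROM weblog GROUP BY request_date");

        // Write the aggregate into a relational database
        Connection db = DriverManager.getConnection("jdbc:postgresql://dbhost/analytics", "user", "pass");
        PreparedStatement insert = db.prepareStatement(
            "INSERT INTO pageviews_by_day (request_date, pageviews) VALUES (?, ?)");
        while (rs.next()) {
            insert.setString(1, rs.getString(1));
            insert.setLong(2, rs.getLong(2));
            insert.executeUpdate();
        }

        db.close();
        hive.close();
    }
}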