Putting Business Intelligence to Work on Hadoop Data Stores


Ian Fyfe, Chief Technology Evangelist, Pentaho

© 2010, Pentaho. All Rights Reserved. www.pentaho.com. Worldwide: +1 (866) 660-7555

Session Abstract
This presentation will cover how to overcome Hadoop's constraints to get more out of your business data analysis.
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to the lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to HIVE, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. HIVE also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds of) nodes with a user-friendly ETL (extract, transform and load) tool that then allows you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis.

Attendees will learn how an IT person without Java programming skills can:
Integrate with Hadoop and Hive to bring ETL, data warehousing and BI applications to the tasks of analyzing Big Data;
Provide key data integration and transformation functionality to Hadoop data;
Manage and control Hadoop jobs using a graphical interface;
Integrate Hadoop data with data from other sources to drive compelling reporting and analytics for today's massive volumes of data.

THE CASE FOR BIG DATA


The Case for Big Data
Enterprises increasingly face needs to store, process and maintain larger and larger volumes of structured and unstructured data
• Compliance
• Competitive Advantage
Challenges associated with big data
• Cost – storage and processing power
• Timeliness of data processing
Why Hadoop?
[Figure: Google Trends for ‘Hadoop’]
• Low cost, reliable scale-out architecture for storing massive amounts of data
• Parallel, distributed computing framework for processing data
• Proven success in solving Big Data problems at Fortune 500 companies like Google, Yahoo!, IBM and GE
• Vibrant community, exploding interest, strong commercial investments


Hadoop for Data Integration and BI
Top Use Cases for Hadoop*
1. “mine data for improved business intelligence”
2. “reducing cost of data analysis”
3. “log analysis”

Top Challenges with Hadoop*
1. Steep technical learning curve
2. Hiring qualified people
3. Availability of appropriate products and tools

Unfortunately, Hadoop was not designed specifically for ETL and BI use cases:
It’s not a database
High latency queries and jobs not ideal for all BI use cases
Skill set mismatch for traditional ETL users and BI solution architects

*Based on a survey of 100+ Hadoop users conducted by Karmasphere, Sept. 2010


ESTABLISHING AN ARCHITECTURE FOR BIG DATA


Example Use Cases Today
Transactional
•Fraud detection
•Financial services/stock markets

Sub-Transactional
•Weblogs
•Social/online media
•Telecoms events

Example Use Cases Today
Non-Transactional
•Web pages, blogs etc.
•Documents
•Physical events
•Application events
•Machine events

In most cases structured or semi-structured


Traditional Business Intelligence (BI)
[Diagram labels: Data Source; Data Mart(s); Tape/Trash]


Data Lake
• Single source
• Large volume
• Not distilled
• Typically no more than 0-2 lakes per company
• Known and unknown
questions
• Multiple user communities
• Don’t fit in traditional
RDBMS with a reasonable
cost


Data Lake Requirements
• Store all the data
• Satisfy routine reporting
and analysis
• Satisfy ad-hoc query /
analysis / reporting
• Balance performance and
cost


What if...
[Diagram labels: Data Source; Data Lake(s); Data Mart(s); Ad-Hoc; Data Warehouse]


Big Data Does Not Replace Data Marts

It’s not a database
High latency
Optimized for massive data-crunching
Big Data databases are immature
Databases are no-SQL

What Hadoop Really is...
Core Components

HDFS
A distributed file system allowing massive storage across a cluster of commodity servers
MapReduce
Framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
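
The fragment-by-fragment model described above can be illustrated with a small in-memory sketch. This is not the Hadoop API: the class name MapReduceSketch, the log-line input format, and the choice of the first field as the key are all assumptions made for illustration. It only shows the shape of the computation, in which each fragment is mapped independently, the emitted pairs are grouped by key, and each group is reduced.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map, group and reduce phases (hypothetical, not the Hadoop API). */
final class MapReduceSketch {

    /** Map phase: turn one log line into (key, 1) pairs; the key is assumed to be the first field. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        if (!trimmed.isEmpty()) {
            String key = trimmed.split("\\s+")[0];
            out.add(new AbstractMap.SimpleEntry<>(key, 1));
        }
        return out;
    }

    /** Reduce phase: combine all values collected for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Driver: map every fragment on its own, group the pairs by key, then reduce each group. */
    static Map<String, Integer> run(List<List<String>> fragments) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> fragment : fragments) {
            // On a real cluster each fragment could be processed on a different node.
            for (String line : fragment) {
                for (Map.Entry<String, Integer> pair : map(line)) {
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                           .add(pair.getValue());
                }
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> fragments = Arrays.asList(
            Arrays.asList("GET /index.html", "POST /login"),
            Arrays.asList("GET /about.html", "GET /index.html"));
        System.out.println(run(fragments)); // prints {GET=3, POST=1}
    }
}
```

Because map looks at one line at a time and reduce looks at one key at a time, any fragment can be recomputed in isolation if a node fails, which is the property the slide highlights.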


What Hadoop Really is...
Related Projects
Hive – a data warehouse infrastructure on top of Hadoop
Implements a SQL-like query language, including a JDBC driver
Allows MapReduce developers to plug in custom mappers and reducers
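
Since Hive exposes a SQL-like language through JDBC, a report query can in principle be issued with the standard java.sql classes. The sketch below is illustrative only: the table name weblogs, the column status, and the jdbc:hive2:// URL (the form used by later HiveServer2 releases) are assumptions, not details from the slides, and a Hive JDBC driver jar would need to be on the classpath for the connection step to work.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch of querying Hive through plain JDBC. */
final class HiveQuerySketch {

    /** Builds a HiveQL aggregate; the table and column names are hypothetical. */
    static String hitsPerStatusQuery(String table) {
        // Only allow simple identifiers, since the name is concatenated into the query.
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "SELECT status, COUNT(*) AS hits FROM " + table + " GROUP BY status";
    }

    /** Runs the query over any JDBC connection and prints one line per group. */
    static void printReport(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(hitsPerStatusQuery(table))) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            // No endpoint given: just show the query that would be sent.
            System.out.println(hitsPerStatusQuery("weblogs"));
            return;
        }
        // args[0] is assumed to be a JDBC URL such as jdbc:hive2://localhost:10000/default
        try (Connection conn = DriverManager.getConnection(args[0])) {
            printReport(conn, "weblogs");
        }
    }
}
```

Each such query is still executed as batch work on the cluster, so the latency caveat from the abstract applies: this suits scheduled reports better than interactive analysis.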
HBase – the Hadoop database – AH HA!
A variant of NoSQL databases, problematic for traditional BI
Best at storing large amounts of unstructured data

Hadoop and BI?
Distributed processing
Distributed file system
Commodity hardware
Platform independent (in theory)
Scales out beyond technology and/or economy of an RDBMS

In many cases it’s the only viable solution


Hadoop and BI?

90% of new Hadoop use cases are transformation of semi/structured data*

* of those companies we’ve talked to...


Hadoop and BI?

“The working conditions within Hadoop are shocking”

ETL Developer


Hadoop and BI?
Instead of this...


Hadoop and BI?
You have to do this in Java...
public void map(
    Text key,
    Text value,
    OutputCollector output,
    Reporter reporter)

public void reduce(
    Text key,
    Iterator values,
    OutputCollector output,
    Reporter reporter)


People don’t use Hadoop for BI because they want to...


...they do it because they have to...


... and unfortunately it wasn’t designed for most BI requirements


Why not add to Hadoop the things it’s missing...


... until it can do what we need it to?


If only we had a Java, embeddable, data transformation engine...


A Data Integration Engine for Hadoop
[Diagram labels: Design; Deploy; Orchestrate; Data Integration Engine; Hadoop; Data Marts, Data Warehouse, Analytical Applications]


[Diagram labels: Visualize; Optimize; Load; Reporting / Dashboards / Analysis; Web Tier; DM & DW; RDBMS; Hive; Hadoop; Files / HDFS; Applications & Systems]


[Diagram labels: Metadata; Reporting / Dashboards / Analysis; Web Tier; DM & DW; RDBMS; Hive; Hadoop; Files / HDFS; Applications & Systems]


[Diagram labels: Data Source; Data Lake(s); Data Mart(s); Ad-Hoc; Data Warehouse]


[Diagram labels: Reporting / Dashboards / Analysis; Web Tier; RDBMS; Data Lake; Hadoop; Applications & Systems]


Product Requirements for BI Against Hadoop
Lower technical barriers through a graphical ETL environment for creating and managing Hadoop MapReduce jobs
Extreme ETL scalability through deployment across the Hadoop cluster
Easily spin off high-performance data marts for interactive analysis
Easily integrate data from Hadoop with data from other sources
Provide end-to-end BI addressing common BI use cases with Hadoop, including reporting, ad hoc query and interactive analysis
Reduce costs through subscription-based pricing, reduced dependency on scarce technical resources, and easier maintainability
[Diagram labels: Agile BI; Interactive Analysis; Batch Reporting; Ad Hoc Query; Data Marts; Hive; Hadoop; Data Integration Jobs; Log Files; DBs and other sources]


THE ROAD AHEAD


The Road Ahead
Other NoSQL Integration
• Facilitate BI use cases on top of HBase, possibly others like MongoDB, Cassandra
Streaming Data Source Support
• In support of near-realtime use cases
• Long/always running data processing jobs
Contiguous Meta-data
• Data Lineage and Impact Analysis covering the entire big data architecture
The End of MapReduce (… as a concept ETL users need to understand)
• Push down optimization of Transformations that generate native MapReduce tasks in Hadoop

Hadoop Distro Wars

The Apache Software Foundation


Tools That Make Hadoop Easier
e.g. Apache Pig

Pig is a platform for analyzing large data sets
Produces sequences of MapReduce programs
Integrate Pig scripts into enterprise data integration workflows, e.g.
1. Submit and monitor a series of Pig and MapReduce jobs
2. Process a database bulk load step to ready data for ad-hoc analysis or report bursting
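
The two-step workflow above (run a series of jobs, then bulk load) amounts to running named steps in order and stopping when one fails. The following is a minimal sketch of that orchestration idea only; the class WorkflowSketch and the step names are hypothetical, and each step is a stand-in lambda rather than a real Pig or MapReduce submission.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical sequential workflow runner; steps are placeholders for real job submissions. */
final class WorkflowSketch {

    /** Runs steps in insertion order, stops at the first failure, returns the completed step names. */
    static List<String> run(Map<String, Callable<Boolean>> steps) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Callable<Boolean>> step : steps.entrySet()) {
            boolean ok;
            try {
                ok = Boolean.TRUE.equals(step.getValue().call());
            } catch (Exception e) {
                ok = false; // an exception counts as a failed step
            }
            if (!ok) {
                System.out.println("FAILED: " + step.getKey());
                break; // later steps depend on this one, so do not run them
            }
            System.out.println("done: " + step.getKey());
            completed.add(step.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which steps were added.
        Map<String, Callable<Boolean>> steps = new LinkedHashMap<>();
        steps.put("pig: clean weblogs", () -> true);
        steps.put("mapreduce: aggregate", () -> true);
        steps.put("bulk load data mart", () -> true);
        run(steps);
    }
}
```

Stopping at the first failure keeps a half-finished aggregate from being bulk loaded into the data mart.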


Growth in Adoption of Other NoSQL Big Data Platforms

HBase – the Hadoop database
mongoDB – scalable, high-performance, document-oriented database
LexisNexis HPCC – a data intensive computing system platform
Many others


Summary
Hadoop and other Big Data NoSQL platforms
• Great at storing and processing large, diverse data volumes
• Not designed for Business Intelligence

Choosing the right BI technology can unlock your Big Data to drive actionable insights
• Graphical user interfaces
• Scalable
• Spin-off data marts
• Integrate data into data warehouses
• Integrated dashboards, reporting, data analysis, data integration


Thank You!

ifyfe@pentaho.com

