How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data from a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and to convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
15. Data Lake Processes
Data sources feeding the lake: mobile apps; transactions, OLTP, OLAP; social media and web logs; machine/device and scientific data; documents and emails.
Numbered processes in the architecture:
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer "golden record" (MDM)
6. Refine & curate data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets (Data Integration Hub)
Supporting components: Data Integration Hub, Data Virtualization, MDM, Enterprise Data Warehouse, and Visualization & Analytics.
17. Use-Case: CDR Processing
• Each job picks up a number of files containing text CDRs (Call Detail Records)
• First task is to merge partial call records
• Some records may be partial – e.g., multiple records for a single call due to a dropped line or switching cell towers
• Partial records need to be merged, and the total call time needs to be added to the duration of the merged record
• Partial records for a single call may reside in multiple files or be included in different jobs
• Incomplete partial records need to be reprocessed by subsequent jobs
• Second task is to sort all processed CDRs by calling number
18. Input CDR File Example
• Three numbers uniquely identify a call
• Partial calls start with 1 and end with 0
• Some partial records are incomplete
• Processed completed records are sorted by caller
19. Output CDR Files
• Completed Calls – partial records are merged into a single completed record, and their duration times are added to the merged record
• Partial Calls – incomplete partial records that will be reprocessed by a later job
20. Logical Design
• Separate partial records from completed records, producing a completed-records-only stream and a partial-records-only stream
• From the partial records, separate incomplete and complete partial records
• Select incomplete partial records for reprocessing
• Aggregate all completed and partial-completed records
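The merge-and-split logic described above can be sketched in plain Python. The field names (`key`, `caller`, `flag`, `duration`) and the exact flag semantics are assumptions for illustration only; the slides show just that three numbers identify a call and that partial calls start with 1 and end with 0:

```python
from collections import defaultdict

def process_cdrs(records):
    """Separate completed from partial CDRs, merge complete partials,
    and return (completed, leftover_partials).

    Assumed record fields (hypothetical, not from the deck):
      key      - tuple of the three numbers identifying a call
      caller   - calling number
      flag     - 1 on the opening partial, 0 on the closing partial,
                 None for a record that is already complete
      duration - call time covered by this record
    """
    by_call = defaultdict(list)
    for rec in records:
        by_call[rec["key"]].append(rec)

    completed, leftover = [], []
    for parts in by_call.values():
        flags = [p["flag"] for p in parts]
        if None in flags:                 # already a complete record
            completed.extend(parts)
        elif 1 in flags and 0 in flags:   # opening and closing partials present
            merged = dict(parts[0])
            # Total call time is the sum of the partial durations.
            merged["duration"] = sum(p["duration"] for p in parts)
            completed.append(merged)
        else:                             # incomplete: reprocess in a later job
            leftover.extend(parts)

    # Processed CDRs are sorted by calling number.
    completed.sort(key=lambda r: r["caller"])
    return completed, leftover
```

A real job would run this per input batch and feed the leftover partials back into the next job's input, mirroring the reprocessing step in the logical design.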
30. CDR Pipeline
• Scenario – Filter records by Date, City and Province; aggregate and summarize records by a composite Key
• Pipeline stages: Read Files → Filter by Province ID → Filter by Collection Date → City Code Lookup → Sort records by Key → Summarize by Key Group → Write report
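The pipeline stages above can be sketched in Python with generator expressions for the filter steps. The field names (`province_id`, `collection_date`, `city_code`, `duration`) and the composite-key choice are assumptions, since the slide does not show the record schema:

```python
from itertools import groupby

def cdr_pipeline(records, province_id, date_range, city_names):
    """Filter, enrich, sort, and summarize CDRs by a composite key."""
    # Filter by Province ID
    rows = (r for r in records if r["province_id"] == province_id)
    # Filter by Collection Date (inclusive range)
    lo, hi = date_range
    rows = (r for r in rows if lo <= r["collection_date"] <= hi)
    # City Code Lookup: enrich each record with the city name
    rows = (dict(r, city=city_names.get(r["city_code"], "unknown"))
            for r in rows)
    # Sort records by the composite Key, then summarize by Key group
    key = lambda r: (r["city"], r["collection_date"])
    report = []
    for k, group in groupby(sorted(rows, key=key), key=key):
        g = list(group)
        report.append({"key": k, "calls": len(g),
                       "total_duration": sum(r["duration"] for r in g)})
    return report
```

In a Hadoop job the sort-then-group step would fall out of the shuffle phase; here `sorted` plus `itertools.groupby` plays that role in miniature.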
32. Adding Transactional Source
• Scenario – Report website use (Facebook, Twitter, etc.) by Age and by Postal Code
• Pipeline stages: Read WAP log records → Get MSISDN and URL fields → Lookup Age and Postal Code by MSISDN → Count URL frequency → Calculate percentages
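A minimal sketch of this join-and-count pipeline, assuming a hypothetical subscriber lookup table keyed by MSISDN (the deck does not show the actual log or CRM schema):

```python
from collections import Counter

def website_use_report(wap_log, subscribers):
    """Count URL hits per (age, postal_code, url) group and express
    each count as a percentage of all matched traffic."""
    counts = Counter()
    for entry in wap_log:
        # Get MSISDN and URL fields from the WAP log record
        msisdn, url = entry["msisdn"], entry["url"]
        # Lookup Age and Postal Code by MSISDN (transactional source)
        sub = subscribers.get(msisdn)
        if sub is None:
            continue  # skip log entries with no matching subscriber
        counts[(sub["age"], sub["postal_code"], url)] += 1

    total = sum(counts.values())
    return {k: {"hits": n, "pct": 100.0 * n / total}
            for k, n in counts.items()}
```

At scale the `subscribers` dict would be a lookup against CRM/EDW data (e.g., a map-side join), but the shape of the computation is the same.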
34. Result
• Easily combine big data sources with transactional data
• Example – Report website use (Facebook, Twitter, etc.) by Age and by Region
• Log files in HDFS supply the usage data; a lookup of Age and Region by MSISDN joins in transactional data from CRM and EDW systems