Faster, cheaper, easier… and successful!
Best practices for Big Data Integration
2014-06-05
Big Data Integration Is Critical For Success With Hadoop
Extract, Transform, and Load Big Data With Apache Hadoop - White Paper
https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
“By most accounts, 80% of the development effort in a big data project goes into data integration, and only 20% goes towards data analysis.”

Most Hadoop initiatives involve collecting, moving, transforming, cleansing, integrating, exploring, and analysing volumes of disparate data sources and types.
Why is 80% of the effort in “data integration”?
Inhibitors – both traditional & Big Data:
• Heterogeneity of data sources
• Diverse formats
• Data issues
• Missing or bad requirements
• Complexity
• Lack of understanding
• Optimizing performance

To be useful, the meaning and accuracy of the data should never be in question. As such, data needs to be made fit-for-purpose so that it is used correctly and consistently.
Isn’t there any good news?
YES. But without effective Big Data Integration you won’t have consumable data: most Hadoop initiatives will end up achieving “garbage in, garbage out” faster, against larger data volumes, at much lower total cost than without Hadoop.
Getting consumable data from the data lake
(Figure: the data lake unfettered, versus the data lake with integration & governance discipline added.)
Five Best Practices for Big Data Integration
1. No Hand Coding Anywhere For Any Purpose
2. One Data Integration And Governance Platform For The Enterprise
3. Massively Scalable Data Integration Wherever It Needs To Run
4. World-Class Data Governance Across The Enterprise
5. Robust Administration And Operations Control Across The Enterprise
Best Practice #1
No hand coding anywhere, for any purpose
What does this mean?
• No hand coding for any aspect of Big Data Integration:
• Data access and movement across the enterprise
• Data integration logic
• Assembling data integration jobs from logic objects
• Assembling larger workflows
• Data governance
• Operational and administrative management
Cost of hand coding vs market-leading tooling
(Pharmaceutical customer example: hand coding vs legacy DI tooling vs Information Server)

Hand coding (the loser):
• 30 man-days to write
• Almost 2,000 lines of code (71,000 characters)
• No documentation
• Difficult to re-use
• Difficult to maintain

Information Server (the winner):
• 2 days to write
• Graphical and self-documenting
• Reusable and more maintainable
• Improved performance

The result: an 87% saving in development costs. Our largest customers concluded years ago that they will not succeed with Big Data initiatives without Information Server.
Best Practice #1
No hand coding anywhere, for any purpose
• Lowers costs – DI tooling reduces labor costs by 90% over hand coding, and one set of skills and best practices is leveraged across all projects.
• Faster time to value – DI tooling reduces project timelines by 90% over hand coding, and much less time is required to add new sources and new DI processes.
• Higher quality data – data profiling and cleansing are very difficult to implement using hand coding.
• Effective data governance – requires world-class data integration tooling to support objectives like impact analysis and data lineage.
Best Practice #2
One Data Integration And Governance Platform For The Enterprise
What does this mean?
• Build a job once and run it anywhere on any platform in the enterprise without modification
• Access, move, and load data between a variety of sources and targets across the enterprise
• Support a variety of data integration paradigms:
• Batch processing
• Federation
• Change data capture
• SOA enablement of data integration tasks
• Real-time with transactional integrity
• Self-service for business users
• Support the establishment of world-class data governance across the enterprise
Self-Service Big Data Integration On-Demand
InfoSphere Data Click
• Provides a simple web-based
interface for any user
• Move data in batch or real-time in
a few clicks
• Policy choices that are then automated, without any coding
• Optimized runtime
• Automatically captures metadata
for built-in governance
"I have a feeling before long Gartner will
be telling us if we’re not doing this
something is wrong.”
– an IBM Customer
Optimized for Hadoop with blazing fast HDFS speeds
• Extends the same easy drag-and-drop paradigm: simply add your Hadoop server name and port number.
• The Information Server engine has streaming parallelization techniques to pipe data in and out at massive scale.
• A performance study ran at up to 15 TB/hr before the HDFS disks were completely saturated.
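To make the streaming idea concrete, here is a minimal, hypothetical Java sketch of partition-parallel writes to HDFS using the standard Hadoop FileSystem API. It is not Information Server code; the namenode address, paths, partition count, and fetchPartition() feed are all invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; in the tool this is the "server name and port" setting.
        URI hdfs = URI.create("hdfs://namenode.example.com:8020");
        int partitions = 8; // one independent writer stream per partition

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            pool.submit(() -> {
                // Each thread owns its own FileSystem handle and output file,
                // so the partitions stream into HDFS concurrently.
                try (FileSystem fs = FileSystem.get(hdfs, conf);
                     FSDataOutputStream out =
                             fs.create(new Path("/staging/orders/part-" + part))) {
                    for (String record : fetchPartition(part)) { // assumed data source
                        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the upstream parallel engine feeding each partition.
    private static Iterable<String> fetchPartition(int part) {
        return java.util.List.of("record-A-" + part, "record-B-" + part);
    }
}
```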
Make data available to Hadoop in real time
• Non-invasive record capture – read data from transactional database logs to minimize impact on source systems.
• High-speed data replication – low-latency capture and delivery of real-time information.
• Consistently current Hadoop data – data is available in Hadoop moments after it is committed in source databases, accelerating analytics currency.
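A hedged Java sketch of the log-based capture pattern described above: a stand-in ChangeLogReader (a real product would use the database's replication or log-reading API) delivers committed changes, which are appended to an HDFS file and flushed so readers see them promptly. All names and the append-based layout are assumptions, not the product's design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical log-based CDC consumer: ships committed changes to HDFS. */
public class CdcToHdfs {
    // Stand-in for a replication client that tails the database's transaction log.
    interface ChangeLogReader { String nextCommittedChange() throws Exception; }

    public static void main(String[] args) throws Exception {
        ChangeLogReader reader = connectToSourceLog(); // assumed helper
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf)) {
            Path target = new Path("/landing/orders/changes.log");
            if (!fs.exists(target)) fs.createNewFile(target);
            // Append each committed change shortly after it happens, so Hadoop
            // stays moments behind the source (requires HDFS append support).
            try (FSDataOutputStream out = fs.append(target)) {
                String change;
                while ((change = reader.nextCommittedChange()) != null) {
                    out.write((change + "\n").getBytes(StandardCharsets.UTF_8));
                    out.hflush(); // make the change visible to readers promptly
                }
            }
        }
    }

    private static ChangeLogReader connectToSourceLog() {
        // Illustration only: a real implementation would read transaction logs,
        // not replay a canned list of changes.
        java.util.Iterator<String> demo =
                java.util.List.of("INSERT orders 1001", "UPDATE orders 1001").iterator();
        return () -> demo.hasNext() ? demo.next() : null;
    }
}
```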
Best Practice #3
Massively Scalable Data Integration Wherever It Needs To Run
What does this mean?
Design once:
• Develop the logic in the same manner regardless of execution platform.
Scale anywhere:
• Execute the logic in any of the five patterns for scalable data integration; no single pattern is sufficient.
• Case 1: InfoServer parallel engine running against any traditional data source (outside the Hadoop environment)
• Case 2: Push processing into a parallel database (outside the Hadoop environment)
• Case 3: Move and process data in parallel between environments
• Case 4: Push processing into Hadoop MapReduce (within the Hadoop environment)
• Case 5: InfoServer parallel engine running against HDFS without MapReduce (within the Hadoop environment)
Information Server is the only Big Data Integration platform supporting all 5 use cases.
Information Server is Big Data Integration
• Dynamic – instantly get better performance as hardware resources are added.
• Extendable – add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
• Data partitioned – in true MPP fashion (like Hadoop), data is partitioned across the DI engine's parallel nodes to scale out the I/O.
Pipeline: Source Data → Transform → Cleanse → Enrich → EDW
(Figure: scaling from a sequential uniprocessor, to a 4-way parallel SMP system with shared memory, to a 64-way parallel MPP clustered system; each node contributes disk, CPU, and memory.)
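The partitioned, pipelined execution model can be illustrated with a small Java sketch (not Information Server internals): records are hash-partitioned by key across worker threads, and each worker runs the transform → cleanse → enrich stages on its own partition. All stage logic, record formats, and names are invented.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class PartitionedPipeline {
    public static void main(String[] args) throws Exception {
        List<String> source = List.of("alice,9", "BOB,17", "carol,3", "dave,11");
        int partitions = 2;

        // Stage functions: each record flows transform -> cleanse -> enrich.
        Function<String, String> transform = r -> r.toLowerCase();
        Function<String, String> cleanse   = r -> r.trim();
        Function<String, String> enrich    = r -> r + ",checked";

        ExecutorService workers = Executors.newFixedThreadPool(partitions);
        List<BlockingQueue<String>> queues = new java.util.ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.add(q);
            final int part = p;
            workers.submit(() -> {
                try {
                    String rec;
                    while (!"EOF".equals(rec = q.take())) {
                        String out = enrich.apply(cleanse.apply(transform.apply(rec)));
                        System.out.println("partition-" + part + ": " + out);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Hash-partition the stream so each worker sees a disjoint key range,
        // scaling out both the CPU work and the I/O.
        for (String rec : source) {
            int part = Math.abs(rec.split(",")[0].hashCode()) % partitions;
            queues.get(part).put(rec);
        }
        for (BlockingQueue<String> q : queues) q.put("EOF");
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```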
Information Server: customer stories with Big Data
Using the Information Server MPP engine (customers include two global banks, a data services company, and two health care organizations):
• 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware
• Text analytics across 200 million medical documents in a weekend, creating indexes to support optimal retrieval by users
• Desensitizes 200 TB of data one weekend each month to populate dev environments
• Processes 50,000 transactions per second with complex transformation and guaranteed delivery
• Information Server-powered grid processing over 40+ trillion records each month
Where should you run scalable data integration?
Run In the Database
Advantages:
• Exploit database MPP engine
• Minimize data movement
• Leverage database for joins/aggregations
• Works best when data is already clean
• Frees up cycles on ETL server
• Use excess capacity on RDBMS server
• Database faster for some processes
Disadvantages:
• Very expensive hardware and storage
• Can force 100% reliance on ELT
• Degradation of query SLAs
• Not all ETL logic can be pushed into
RDBMS (with ELT tool or hand coding)
• Can’t exploit commodity hardware
• Usually requires hand coding
• Limitations on complex transformations
• Limited data cleansing
• Database slower for some processes
• ELT can consume RDBMS capacity (capacity planning is nontrivial)
Run in the DI engine
Advantages:
• Exploit ETL MPP engine
• Exploit commodity hardware and
storage
• Exploit grid to consolidate SMP
servers
• Perform complex transforms (data
cleansing) that can’t be pushed into
RDBMS
• Free up capacity on RDBMS server
• Process heterogeneous data sources
(not stored in the database)
• ETL server faster for some
processes
Disadvantages:
• ETL server slower for some
processes (data already stored in
relational tables)
• May require extra hardware (albeit low-cost hardware)
Run in Hadoop
Advantages:
• Exploit MapReduce MPP engine
• Exploit commodity hardware and
storage
• Free up capacity on the database
server
• Support processing of unstructured
data
• Exploit Hadoop’s capabilities for persisting data (e.g. updating and indexing)
• Low cost archiving of history data
Disadvantages:
• Not all ETL logic can be pushed into MapReduce (with an ELT tool or hand coding)
• Can require complex programming
• MapReduce will usually be much
slower than parallel database or
scalable ETL tool
• Risk: Hadoop is still a young
technology
Big Data Integration requires a balanced approach that supports all of the above
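To make the “run in the database” versus “run in the DI engine” trade-off concrete, here is a hedged JDBC sketch in Java: the first statement pushes a set-based transformation down into the database (ELT), while the second pulls rows out and applies engine-side logic that is awkward to express in SQL. The connection string, tables, and cleansing rule are invented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Hedged illustration of ELT pushdown versus engine-side ETL.
 *  Table and column names are invented for the example. */
public class PushdownVsEngine {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://warehouse.example.com/dw", "etl", "secret")) {

            // ELT pushdown: the transformation (filtering, aggregation) executes
            // on the database's MPP engine; no data leaves the server.
            try (Statement s = db.createStatement()) {
                s.executeUpdate(
                    "INSERT INTO clean_orders (customer_id, total) " +
                    "SELECT customer_id, SUM(amount) " +
                    "FROM raw_orders WHERE amount > 0 " +
                    "GROUP BY customer_id");
            }

            // Engine-side ETL: stream rows out and apply logic that is hard to
            // express in SQL (e.g. fuzzy name cleansing) in the DI engine.
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery(
                         "SELECT customer_id, name FROM raw_customers")) {
                while (rs.next()) {
                    String cleansed = cleanseName(rs.getString("name"));
                    System.out.println(rs.getLong("customer_id") + " -> " + cleansed);
                }
            }
        }
    }

    private static String cleanseName(String raw) {
        // Stand-in for complex cleansing that can't be pushed into the RDBMS.
        return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
    }
}
```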
Automated MapReduce Job Generation
• Leverage the same UI and the same stages to automatically build MapReduce.
• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
• Push the processing to the data for patterns where you don’t want to move the data over the network.
Automated MapReduce Job Generation
• Build integration jobs with the same data flow tool and stages.
• The tool automatically creates the MapReduce code.
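For contrast, this is roughly the kind of boilerplate a generated job replaces: a minimal hand-coded Hadoop MapReduce job (standard org.apache.hadoop.mapreduce API) that groups delimited records by their first field and counts them. The record layout and class names are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hand-coded MapReduce job that groups and counts records by key. */
public class KeyCount {
    public static class KeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assume comma-delimited records; the first field is the grouping key.
            ctx.write(new Text(line.toString().split(",")[0]), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "key-count");
        job.setJarByClass(KeyCount.class);
        job.setMapperClass(KeyMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```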
So why is pushdown of ETL into MapReduce not sufficient for Big Data Integration?
Big Data Integration also requires running scalable data integration workloads outside of the Hadoop MapReduce environment. Complex data integration logic can’t be pushed into a parallel database or MapReduce easily and efficiently, or at all in some cases.
• IBM’s experiences with customers’ early Hadoop initiatives have shown that much of their data integration processing logic can’t be pushed into MapReduce.
• Without Information Server, these more complex data integration processes would have to be hand coded to run in MapReduce, increasing project time, cost, and complexity.
MapReduce has significant and known performance limitations:
• For processing large data volumes with complex transformations (including data integration).
• Many Big Data vendors and researchers are focusing on bypassing MapReduce performance limitations.
DataStage will process data integration 10X–15X faster than MapReduce.
Best Practice #4
World-Class Data Governance Across The Enterprise
What does this mean?
• Both IT and line of business need to have a high degree of confidence in the data.
• Confidence requires that data is understood to be of high quality, secure, and fit-for-purpose:
• Where does the data in my report come from?
• What is being done with it inside of Hadoop?
• Where was it before reaching our data lake?
• Oftentimes these requirements extend from regulations within the specific industry.
Why Is Data Governance Critical For Big Data?
• How well do your business users understand the content of the information in your Big Data stores?
• Are you measuring the quality of your information in Big Data?
All data needs to build confidence via a
Fully Governed Data Lifecycle
Find
• Leverage Terms, Labels
and Collections to find
governed, curated data
sources
Curate
• Add Labels, Terms,
Custom Properties to
relevant assets
Collect
• Use Collections to
capture assets for a
specific analysis or
governance effort
Collaborate
• Share Collections for
additional Curation and
Governance
Govern
• Create and reference IG
Policies and Rules
• Apply DQ, Masking,
Archiving, Cleansing, …
to data
Offload
• Copy data in one click to HDFS for analysis or warehouse augmentation
Analyze
• Perform analyses on
offloaded data
Reuse & Trust
• Understand how data is
being used today via
lineage for analyses and
reports
Best Practice #5
Robust Administration And Operations Control Across The Enterprise
What does this mean?
• Operations management for Big Data Integration:
• Provides quick answers for operators, developers, and other stakeholders as they monitor the run-time environment.
• Workload Management allocates resource priority in a shared services environment and queues workload on a busy system.
• Performance Analysis provides insight into resource consumption to help understand when the system may need more resources.
• Build workflows that include Hadoop activities defined via Oozie directly alongside other data integration activities (see the sketch after this list).
• Administrative management for Big Data Integration:
• Web-based installer for all Integration & Governance capabilities.
• Highly available configurations for meeting 24/7 requirements.
• Instantly provision/deploy a new project instance.
• Centralized authentication, authorization, and session management.
• Audit logging of security-related events to promote SOX compliance.
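As a hedged illustration of driving Oozie-defined Hadoop activities programmatically (not Information Server's own mechanism), the following Java sketch submits a workflow with the standard Oozie client API and polls it to completion. The Oozie URL, HDFS application path, and queue name are invented.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

/** Submits a combined workflow to Oozie and polls its status. */
public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Points at a workflow.xml in HDFS that mixes Hadoop actions
        // (e.g. map-reduce, hive) with other integration activities.
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode.example.com:8020/workflows/nightly-etl");
        conf.setProperty("queueName", "etl");

        String jobId = oozie.run(conf); // submit and start in one call
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000); // a monitoring console would watch this instead
        }
        System.out.println("Workflow " + jobId + " finished: "
                + oozie.getJobInfo(jobId).getStatus());
    }
}
```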
Combined Workflows For Big Data
• Simple design paradigm for workflows (the same as for job design).
• Mix any Oozie activity right alongside other data integration activities.
• Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single coordinating process.
• Monitor all stages through the Operations Console’s web-based interface.
Automotive manufacturer
uses Big Data Integration to
build out global data warehouse
Challenges
• Need to manage massive amounts of vehicle data – about
5TB per day
• Need to understand, incorporate, and correlate a variety
of data sources to better understand problems and
product quality issues
• Need to share information across functional teams for
improved decision making
Business Benefits
• Doubled the number of models earning JD Power initial quality study awards in 1 year
• Improved and streamlined decision making and system
efficiency
• Lowered warranty costs
IT Benefits
• Single infrastructure to consolidate structured, semi-structured, and unstructured data; simplified management
• Optimize the existing Teradata environment – size, performance, and TCO
• High-performance ETL for in-database transformations
Other Examples of Proven Value of IBM Big Data Integration
European telco
• ELT pushdown into
the database and
Hadoop was not
sufficient for Big
Data Integration
• InfoSphere
DataStage runs
some DI processes
faster than the
parallel database
and MapReduce
Wireless carrier
• InfoSphere Information
Server can transform a
dirty Hadoop lake into
a clean Hadoop lake
• IBM met requirements
for processing 25
terabytes in 24 hours
• Capabilities of
InfoSphere Information
Server and InfoSphere
Optim data masking all
helped to produce a
clean Hadoop lake
Insurance company
• IBM could shred complex XML claims messages and flatten them for Hadoop reporting and analysis (a shredding sketch follows this slide)
• IBM could meet all
requirements for large-
scale batch processing
• Information Server could
adjust for changes in
XML structure while
tolerating future
unanticipated changes
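A minimal Java sketch of the XML “shredding” idea, under an invented claim schema: nested repeating elements are flattened into one delimited row per item, with parent fields repeated, so downstream Hadoop jobs see simple flat records.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

/** Shreds a nested XML claim into flat, delimited records. */
public class ClaimShredder {
    public static void main(String[] args) throws Exception {
        String xml = "<claim id=\"42\"><insured>Ada</insured>"
                   + "<items><item code=\"A1\" amount=\"100\"/>"
                   + "<item code=\"B2\" amount=\"250\"/></items></claim>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Element claim = doc.getDocumentElement();
        String claimId = claim.getAttribute("id");
        String insured = claim.getElementsByTagName("insured").item(0).getTextContent();

        // One flat output row per repeating <item>: parent fields are repeated
        // so downstream Hadoop jobs can work with simple delimited records.
        NodeList items = claim.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(String.join("|",
                    claimId, insured, item.getAttribute("code"), item.getAttribute("amount")));
        }
    }
}
```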
Summing it up
Increase ROI for Hadoop via the 5 Best Practices for Big Data Integration
• Speed productivity – graphical design is easier to use than hand coding.
• Promote object reuse – build once, share, and run anywhere (ETL/ELT/real-time).
• Simplify heterogeneity – a common method for diverse data sources.
• Shorten project cycles – pre-built components reduce cost and timelines.
• Reduce operational cost – provides a robust framework to manage data integration.
• Protect from changes – isolation from underlying technologies as they continue to evolve.
Thank you

Mais conteúdo relacionado

Mais procurados

The Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceThe Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceEDB
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the OrganizationSeeling Cheung
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015Daniela Zuppini
 
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...Dell EMC World
 
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...Chad Lawler
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17David Spurway
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleHortonworks
 
IBM Spectrum Scale and Its Use for Content Management
 IBM Spectrum Scale and Its Use for Content Management IBM Spectrum Scale and Its Use for Content Management
IBM Spectrum Scale and Its Use for Content ManagementSandeep Patil
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Precisely
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalDiego Alberto Tamayo
 
Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Vinay Mistry
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
Transform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationTransform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationEDB
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15IBMInfoSphereUGFR
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTMT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTDell EMC World
 
Powerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsPowerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsEDB
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationHortonworks
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...VMware Tanzu
 

Mais procurados (20)

The Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceThe Value of Postgres to IT and Finance
The Value of Postgres to IT and Finance
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015
 
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
 
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at Scale
 
IBM Spectrum Scale and Its Use for Content Management
 IBM Spectrum Scale and Its Use for Content Management IBM Spectrum Scale and Its Use for Content Management
IBM Spectrum Scale and Its Use for Content Management
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
 
Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Transform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationTransform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement Innovation
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTMT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
 
Powerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsPowerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & Savings
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
Optimalisert datasenter
Optimalisert datasenterOptimalisert datasenter
Optimalisert datasenter
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
 

Destaque

Talend For Big Data : Secret Key to Hadoop
Talend For Big Data  : Secret Key to HadoopTalend For Big Data  : Secret Key to Hadoop
Talend For Big Data : Secret Key to HadoopEdureka!
 
Data Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyData Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyAlasdair Gray
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationSnapLogic
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integrationibi
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendEdureka!
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend Edureka!
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 

Destaque (7)

Talend For Big Data : Secret Key to Hadoop
Talend For Big Data  : Secret Key to HadoopTalend For Big Data  : Secret Key to Hadoop
Talend For Big Data : Secret Key to Hadoop
 
Data Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyData Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case Study
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data Integration
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integration
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with Talend
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 

Semelhante a Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration

Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Group
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsjdijcks
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWSAmazon Web Services
 
Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3IBMInfoSphereUGFR
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformEMC
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataHalo BI
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)Xavier Constant
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseRizaldy Ignacio
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantagePrecisely
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Cloudera, Inc.
 
7 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 20227 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 2022Safe Software
 

Semelhante a Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration (20)

Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analytics
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
 
Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big Data
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 
7 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 20227 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 2022
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration

  • 10. Best Practice #2: One Data Integration and Governance Platform for the Enterprise
  What does this mean:
  • Build a job once and run it anywhere, on any platform in the enterprise, without modification
  • Access, move, and load data between a variety of sources and targets across the enterprise
  • Support a variety of data integration paradigms:
    • Batch processing
    • Federation
    • Change data capture
    • SOA enablement of data integration tasks
    • Real-time with transactional integrity
    • Self-service for business users
  • Support the establishment of world-class data governance across the enterprise
  • 11. Self-Service Big Data Integration On-Demand: InfoSphere Data Click
  • Provides a simple web-based interface for any user
  • Moves data in batch or real time in a few clicks
  • Captures policy choices that are then automated, without any coding
  • Optimized runtime
  • Automatically captures metadata for built-in governance
  "I have a feeling before long Gartner will be telling us if we're not doing this something is wrong." – an IBM customer
  • 12. Optimized for Hadoop with blazing-fast HDFS speeds
  • Extends the same easy drag-and-drop design paradigm: simply add your Hadoop server name and port number
  • The Information Server engine uses streaming parallelization techniques to pipe data in and out of HDFS at massive scale
  • A performance study sustained up to 15 TB/hr before the HDFS disks were completely saturated
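Information Server's parallel HDFS connector is proprietary, but the pattern the slide describes (partition the load, then stream the partitions into HDFS concurrently) can be sketched with the open-source `hdfs` WebHDFS client for Python. This is a minimal sketch under assumptions: the namenode URL, target paths, and the `read_source_partition` reader are all invented for illustration, and this is not the engine's actual implementation.

```python
# Illustrative sketch of a partitioned, streaming HDFS load.
# Requires the open-source WebHDFS client: pip install hdfs
from concurrent.futures import ThreadPoolExecutor
from hdfs import InsecureClient

# Hypothetical namenode address and user.
client = InsecureClient("http://namenode.example.com:50070", user="etl")

def read_source_partition(partition_id: int):
    # Stand-in for a real partitioned source reader.
    yield from (f"{partition_id},{i},sample" for i in range(1000))

def load_partition(partition_id: int) -> str:
    """Stream one partition of records into its own HDFS file."""
    target = f"/data/landing/orders/part-{partition_id:05d}.csv"
    with client.write(target, encoding="utf-8", overwrite=True) as writer:
        for record in read_source_partition(partition_id):
            writer.write(record + "\n")
    return target

# Write partitions concurrently so throughput scales with HDFS bandwidth.
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(load_partition, range(8)):
        print("loaded", path)
```

The point of the sketch is the shape of the workload: many independent writers saturating HDFS in parallel is what lets a load approach the disk-bandwidth ceiling the slide's study hit.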
  • 13. Make data available to Hadoop in real time
  • Non-invasive record capture: read data from transactional database logs to minimize impact on source systems
  • High-speed data replication: low-latency capture and delivery of real-time information
  • Consistently current Hadoop data: data is available in Hadoop moments after it is committed in the source databases, improving the currency of analytics
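A real log-based CDC product, as described on this slide, reads the database's transaction log directly; that cannot be shown faithfully in a few lines. As a conceptual stand-in only, the sketch below uses a commit-timestamp high-water mark to show the capture-and-deliver loop. Everything here is invented for illustration: the table, columns, and the delivery stub.

```python
# Conceptual sketch of incremental capture and delivery. A timestamp
# high-water mark stands in for reading the transaction log; real
# log-based CDC avoids touching source tables at all.
import sqlite3

conn = sqlite3.connect("source.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders (
        id INTEGER PRIMARY KEY, payload TEXT, committed_at TEXT
    );
""")

def deliver_to_hadoop(rows):
    # Stand-in for low-latency delivery into HDFS/Hive.
    for row in rows:
        print("replicated:", row)

high_water_mark = "1970-01-01 00:00:00"
for _ in range(3):  # a real capture process runs continuously
    rows = conn.execute(
        "SELECT id, payload, committed_at FROM orders "
        "WHERE committed_at > ? ORDER BY committed_at",
        (high_water_mark,),
    ).fetchall()
    if rows:
        deliver_to_hadoop(rows)
        high_water_mark = rows[-1][2]  # advance the capture position
```

Polling like this adds latency and load; the reason the slide emphasizes log reading is precisely that it delivers changes moments after commit without either cost.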
  • 14. Best Practice #3: Massively Scalable Data Integration Wherever It Needs to Run
  What does this mean:
  • Design once: develop the logic in the same manner regardless of execution platform
  • Scale anywhere: execute the logic in any of the five patterns for scalable data integration; no single pattern is sufficient
  Outside the Hadoop environment:
  • Case 1: Information Server parallel engine running against any traditional data source
  • Case 2: Push processing into a parallel database
  Between environments:
  • Case 3: Move and process data in parallel between environments
  Within the Hadoop environment:
  • Case 4: Push processing into Hadoop MapReduce
  • Case 5: Information Server parallel engine running against HDFS without MapReduce
  Information Server is the only Big Data Integration platform supporting all five use cases.
  • 15. Information Server is Big Data Integration
  • Dynamic: instantly get better performance as hardware resources are added
  • Extendable: add a new server to scale out through a simple text-file edit (or, in a grid configuration, automatically via integration with grid-management software)
  • Data partitioned: in true MPP fashion (like Hadoop), data is partitioned across the DI engine's parallel nodes to scale out the I/O
  [Diagrams: a pipeline of Source Data → Transform → Cleanse → Enrich → EDW, and the scaling path from a sequential uniprocessor to a 4-way parallel SMP system to a 64-way parallel MPP clustered system]
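To make the "data partitioned" bullet concrete, here is a minimal sketch of the partition-parallel pattern: split the data into N partitions and run the same transform on every partition at once. The transform and data are toys; the real engine partitions across processes and machines via its configuration file, not a Python pool.

```python
# Partition parallelism in miniature: same logic, N partitions, N workers.
from multiprocessing import Pool

def transform(record: str) -> str:
    """Toy transform: trim fields and upper-case the name field."""
    fields = [f.strip() for f in record.split(",")]
    fields[1] = fields[1].upper()
    return ",".join(fields)

def process_partition(partition: list) -> list:
    return [transform(r) for r in partition]

if __name__ == "__main__":
    data = [f"{i}, customer{i}, active" for i in range(100_000)]
    nway = 4  # degree of parallelism; a text-file edit in the real engine
    partitions = [data[i::nway] for i in range(nway)]  # round-robin split
    with Pool(nway) as pool:
        results = pool.map(process_partition, partitions)
    print(sum(len(p) for p in results), "records transformed")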
  • 16. Information Server: customer stories with Big Data, using the Information Server MPP engine
  • Global bank: 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware
  • Data services company: runs text analytics across 200 million medical documents over a weekend, creating indexes to support optimal retrieval by users
  • Global bank: desensitizes 200 TB of data one weekend each month to populate its development environments
  • Health care: processes 50,000 transactions per second with complex transformation and guaranteed delivery
  • Health care: Information Server powered grid processing of over 40 trillion records each month
  • 17. Where should you run scalable data integration?
  Run in the database
  Advantages:
  • Exploits the database MPP engine; minimizes data movement
  • Leverages the database for joins/aggregations; works best when data is already clean
  • Frees up cycles on the ETL server; uses excess capacity on the RDBMS server
  • The database is faster for some processes
  Disadvantages:
  • Very expensive hardware and storage; can't exploit commodity hardware
  • Can force 100% reliance on ELT; usually requires hand coding
  • Degradation of query SLAs
  • Not all ETL logic can be pushed into the RDBMS (with an ELT tool or hand coding)
  • Limitations on complex transformations; limited data cleansing
  • The database is slower for some processes
  • ELT can consume RDBMS capacity (capacity planning is nontrivial)
  Run in the DI engine
  Advantages:
  • Exploits the ETL MPP engine, plus commodity hardware and storage
  • Exploits a grid to consolidate SMP servers
  • Performs complex transforms (such as data cleansing) that can't be pushed into the RDBMS
  • Frees up capacity on the RDBMS server
  • Processes heterogeneous data sources (not stored in the database)
  • The ETL server is faster for some processes
  Disadvantages:
  • The ETL server is slower for some processes (data already stored in relational tables)
  • May require extra (low-cost) hardware
  Run in Hadoop
  Advantages:
  • Exploits the MapReduce MPP engine, plus commodity hardware and storage
  • Frees up capacity on the database server
  • Supports processing of unstructured data
  • Exploits Hadoop's capabilities for persisting data (e.g., updating and indexing)
  • Low-cost archiving of historical data
  Disadvantages:
  • Not all ETL logic can be pushed into MapReduce (with an ELT tool or hand coding)
  • Can require complex programming
  • MapReduce will usually be much slower than a parallel database or a scalable ETL tool
  • Risk: Hadoop is still a young technology
  Big Data Integration requires a balanced approach that supports all of the above; a sketch contrasting the first two placements follows.
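To ground the "run in the database" versus "run in the DI engine" tradeoff, here is a small sketch of both placements for the same aggregation. The schema and tables are invented, and SQLite stands in for a parallel warehouse. With pushdown (ELT), one generated SQL statement runs inside the database and no rows cross the network; with engine-side ETL, rows are extracted so that arbitrary transform logic can be applied before loading.

```python
# Two placements for the same aggregation; schema invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    CREATE TABLE sales_summary (region TEXT, total REAL);
    INSERT INTO sales VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);
""")

# Option A: ELT pushdown. The generated SQL executes in-database;
# data never leaves the RDBMS.
conn.execute("""
    INSERT INTO sales_summary (region, total)
    SELECT region, SUM(amount) FROM sales GROUP BY region
""")

# Option B: engine-side ETL. Rows are extracted and transformed outside
# the database (where cleansing and complex logic are possible), then
# loaded back.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0.0) + amount  # complex logic here
conn.executemany(
    "INSERT INTO sales_summary (region, total) VALUES (?, ?)",
    totals.items(),
)
conn.commit()
```

Option A is hard to beat when the logic is expressible in SQL and the data is already in the warehouse; Option B wins once the logic involves cleansing, matching, or heterogeneous sources, which is exactly the balance the table above argues for.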
  • 18. Automated MapReduce job generation
  • Leverage the same UI and the same stages to automatically build MapReduce jobs
  • Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming
  • Push the processing to the data for patterns where you don't want to move the data across the network
  • 19. Automated MapReduce job generation
  • Build integration jobs with the same data-flow tool and stages; the tooling automatically generates the MapReduce code
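The code Information Server generates is not public, so as a stand-in for what a generated job replaces, here is a hand-written Hadoop Streaming equivalent of a simple per-customer aggregation in Python. The input layout (id, customer_id, amount) is invented; the point is how much plumbing even a trivial MapReduce job requires compared to dropping two stages on a canvas.

```python
#!/usr/bin/env python3
# mapper.py: emit (customer_id, amount) for each input line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) >= 3:
        print(f"{fields[1]}\t{fields[2]}")
```

```python
#!/usr/bin/env python3
# reducer.py: sum amounts per customer. Hadoop Streaming sorts by key,
# so all values for a key arrive contiguously.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

These would be run with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out` (the exact streaming-jar path varies by distribution).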
  • 20. So why is pushdown of ETL into MapReduce not sufficient for Big Data Integration?
  Big Data Integration also requires running scalable data integration workloads outside of the Hadoop MapReduce environment. Complex data integration logic can't be pushed into a parallel database or MapReduce easily and efficiently, or at all in some cases.
  • IBM's experience with customers' early Hadoop initiatives has shown that much of their data integration processing logic can't be pushed into MapReduce
  • Without Information Server, these more complex data integration processes would have to be hand coded to run in MapReduce, increasing project time, cost, and complexity
  MapReduce has significant and known performance limitations:
  • For processing large data volumes with complex transformations (including data integration)
  • Many Big Data vendors and researchers are focusing on bypassing MapReduce's performance limitations
  DataStage will process data integration workloads 10x to 15x faster than MapReduce.
  • 21. Best Practice #4: World-Class Data Governance Across the Enterprise
  What does this mean:
  • Both IT and the line of business need to have a high degree of confidence in the data
  • Confidence requires that data is understood to be of high quality, secure, and fit for purpose:
    • Where does the data in my report come from?
    • What is being done with it inside of Hadoop?
    • Where was it before reaching our data lake?
  • Oftentimes these requirements extend from regulations within the specific industry
  • 22. Why is data governance critical for Big Data?
  • How well do your business users understand the content of the information in your Big Data stores?
  • Are you measuring the quality of your information in Big Data?
  • 23. All data needs to build confidence via a fully governed data lifecycle
  • Find: leverage terms, labels, and collections to find governed, curated data sources
  • Curate: add labels, terms, and custom properties to relevant assets
  • Collect: use collections to capture assets for a specific analysis or governance effort
  • Collaborate: share collections for additional curation and governance
  • Govern: create and reference information governance policies and rules; apply data quality, masking, archiving, cleansing, and similar controls to data
  • Offload: copy data in one click to HDFS for analysis or warehouse augmentation
  • Analyze: perform analyses on the offloaded data
  • Reuse & trust: understand how data is being used today, via lineage for analyses and reports
  • 24. Best Practice #5: Robust Administration and Operations Control Across the Enterprise
  What does this mean:
  • Operations management for Big Data Integration:
    • Provides quick answers for operators, developers, and other stakeholders as they monitor the runtime environment
    • Workload management allocates resource priority in a shared-services environment and queues workload on a busy system
    • Performance analysis provides insight into resource consumption, to help understand when the system may need more resources
    • Build workflows that include Hadoop activities defined via Oozie, directly alongside other data integration activities
  • Administrative management for Big Data Integration:
    • Web-based installer for all integration and governance capabilities
    • Highly available configurations for meeting 24/7 requirements
    • Instantly provision/deploy a new project instance
    • Centralized authentication, authorization, and session management
    • Audit logging of security-related events to promote SOX compliance
  • 25. Combined workflows for Big Data
  • Simple design paradigm for workflows (the same as for job design)
  • Mix any Oozie activity right alongside other data integration activities (see the sketch below)
  • Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single coordinating process
  • Monitor all stages through the Operations Console's web-based interface
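To illustrate what "mixing in an Oozie activity" involves at the protocol level, here is a minimal sketch of submitting and checking a Hadoop workflow through Apache Oozie's REST job endpoint. Host names and HDFS paths are invented, and this is only one activity in isolation; a real Information Server sequence would coordinate this alongside its other data integration activities.

```python
# Sketch: submit an Oozie workflow via its REST API and check its status.
# Hypothetical hosts/paths; see the Apache Oozie Web Services API docs.
import requests

OOZIE = "http://oozie.example.com:11000/oozie"

config = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property><name>user.name</name><value>etl</value></property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://namenode.example.com:8020/workflows/claims-etl</value>
  </property>
</configuration>"""

# Submit and start the workflow in one call.
resp = requests.post(
    f"{OOZIE}/v1/jobs",
    params={"action": "start"},
    data=config,
    headers={"Content-Type": "application/xml"},
)
resp.raise_for_status()
job_id = resp.json()["id"]
print("submitted Oozie workflow:", job_id)

# Poll the workflow, mirroring what an operations console does.
info = requests.get(f"{OOZIE}/v1/job/{job_id}", params={"show": "info"})
print("status:", info.json()["status"])
```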
  • 26. Automotive manufacturer uses Big Data Integration to build out a global data warehouse
  Challenges:
  • Need to manage massive amounts of vehicle data: about 5 TB per day
  • Need to understand, incorporate, and correlate a variety of data sources to better understand problems and product-quality issues
  • Need to share information across functional teams for improved decision making
  Business benefits:
  • Doubled the number of models winning the J.D. Power Initial Quality Study award in one year
  • Improved and streamlined decision making and system efficiency
  • Lowered warranty costs
  IT benefits:
  • A single infrastructure to consolidate structured, semi-structured, and unstructured data, with simplified management
  • Optimized the existing Teradata environment: size, performance, and TCO
  • High-performance ETL for in-database transformations
  • 27. Other examples of the proven value of IBM Big Data Integration
  European telco:
  • ELT pushdown into the database and Hadoop was not sufficient for Big Data Integration
  • InfoSphere DataStage runs some DI processes faster than the parallel database and MapReduce
  Wireless carrier:
  • InfoSphere Information Server can transform a dirty Hadoop lake into a clean Hadoop lake
  • IBM met requirements for processing 25 terabytes in 24 hours
  • The capabilities of InfoSphere Information Server and InfoSphere Optim data masking all helped to produce a clean Hadoop lake
  Insurance company:
  • IBM could shred complex XML claims messages and flatten them for Hadoop reporting and analysis (a sketch of this pattern follows)
  • IBM could meet all requirements for large-scale batch processing
  • Information Server could adjust for changes in XML structure while tolerating future unanticipated changes
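The "shredding" pattern from the insurance example can be sketched in a few lines: walk a nested XML claim and emit one flat, delimited record per repeating element. The claim structure and tag names here are invented, and tolerating structural change is only approximated by looking elements up by tag rather than by fixed position; the real solution handled this far more robustly.

```python
# Illustrative sketch: flatten a nested XML claim into delimited records
# suitable for Hadoop-side reporting. Tag names invented for illustration.
import xml.etree.ElementTree as ET

claim_xml = """
<claim id="C-1001">
  <policy number="P-77"/>
  <lines>
    <line code="A12" amount="120.50"/>
    <line code="B07" amount="33.00"/>
  </lines>
</claim>"""

def shred_claim(xml_text: str):
    root = ET.fromstring(xml_text)
    claim_id = root.get("id", "")
    policy_el = root.find("policy")
    policy = policy_el.get("number", "") if policy_el is not None else ""
    # One flat record per claim line: claim_id|policy|code|amount.
    # iter() finds <line> by tag, so extra nesting levels don't break it.
    for line in root.iter("line"):
        yield "|".join(
            [claim_id, policy, line.get("code", ""), line.get("amount", "")]
        )

for record in shred_claim(claim_xml):
    print(record)   # e.g. C-1001|P-77|A12|120.50
```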
  • 28. Summing it up: increase ROI for Hadoop via the five best practices for Big Data Integration
  • Speed productivity: graphical design is easier to use than hand coding
  • Simplify heterogeneity: a common method for diverse data sources
  • Shorten project cycles: pre-built components reduce cost and timelines
  • Promote object reuse: build once, share, and run anywhere (ETL/ELT/real time)
  • Reduce operational cost: provides a robust framework to manage data integration
  • Protect from change: isolation from changes in underlying technologies as they continue to evolve
  • 29. Thank you