Faster, cheaper, easier… and successful!
Best practices for Big Data Integration
2014-06-05
Big Data Integration Is Critical For Success With Hadoop
Extract, Transform, and Load Big Data With Apache Hadoop - White Paper
https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
“By most accounts, 80% of the development effort in a big data project goes into data integration, and only 20% goes towards data analysis.”

Most Hadoop initiatives involve collecting, moving, transforming, cleansing, integrating, exploring, and analysing volumes of disparate data sources and types.
Why is 80% of the effort in “data integration”?
Inhibitors – both traditional & Big Data:
• Heterogeneity of data sources
• Diverse formats
• Data issues
• Missing or bad requirements
• Complexity
• Lack of understanding
• Optimizing performance

To be useful, the meaning and accuracy of the data should never be in question. As such, data needs to be made fit-for-purpose so that it is used correctly and consistently.
Isn’t there any good news?
YES. But without effective Big Data Integration you won’t have consumable data: most Hadoop initiatives will end up achieving “garbage in, garbage out” faster, against larger data volumes, at much lower total cost than without Hadoop.
Getting consumable data from the data lake
(Figure: the data lake unfettered, versus the data lake with integration & governance discipline added.)
Five Best Practices for Big Data Integration
1. No Hand Coding Anywhere For Any Purpose
2. One Data Integration And Governance Platform For The Enterprise
3. Massively Scalable Data Integration Wherever It Needs To Run
4. World-Class Data Governance Across The Enterprise
5. Robust Administration And Operations Control Across The Enterprise
Best Practice #1
No hand coding anywhere, for any purpose
What does this mean?
• No hand coding for any aspect of Big Data Integration:
• Data access and movement across the enterprise
• Data integration logic
• Assembling data integration jobs from logic objects
• Assembling larger workflows
• Data governance
• Operational and administrative management
Cost of hand coding vs market-leading tooling
(Pharmaceutical customer example: hand coding vs legacy DI tooling vs Information Server)

Hand coding (the loser):
• 30 man-days to write
• Almost 2,000 lines of code (71,000 characters)
• No documentation
• Difficult to re-use
• Difficult to maintain

Information Server (the winner):
• 2 days to write
• Graphical and self-documenting
• Reusable and more maintainable
• Improved performance

The result: an 87% saving in development costs. Our largest customers concluded years ago that they will not succeed with Big Data initiatives without Information Server.
Best Practice #1
No hand coding anywhere, for any purpose
• Lowers costs – DI tooling reduces labor costs by 90% over hand coding, and one set of skills and best practices is leveraged across all projects.
• Faster time to value – DI tooling reduces project timelines by 90% over hand coding, and much less time is required to add new sources and new DI processes.
• Higher quality data – data profiling and cleansing are very difficult to implement using hand coding.
• Effective data governance – requires world-class data integration tooling to support objectives like impact analysis and data lineage.
Best Practice #2
One Data Integration And Governance Platform For The Enterprise
What does this mean?
• Build a job once and run it anywhere on any platform in the enterprise without modification
• Access, move, and load data between a variety of sources and targets across the enterprise
• Support a variety of data integration paradigms:
• Batch processing
• Federation
• Change data capture
• SOA enablement of data integration tasks
• Real-time with transactional integrity
• Self-service for business users
• Support the establishment of world-class data governance across the enterprise
Self-Service Big Data Integration On-Demand
InfoSphere Data Click
• Provides a simple web-based
interface for any user
• Move data in batch or real-time in
a few clicks
• Policy choices that are then automated, without any coding
• Optimized runtime
• Automatically captures metadata
for built-in governance
"I have a feeling before long Gartner will
be telling us if we’re not doing this
something is wrong.”
– an IBM Customer
Optimized for Hadoop with blazing fast HDFS speeds
• Extends the same easy drag-and-drop paradigm: simply add your Hadoop server name and port number.
• The Information Server engine has streaming parallelization techniques to pipe data in and out at massive scale.
• A performance study ran at up to 15 TB/hr before the HDFS disks were completely saturated.
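To make the streaming idea concrete, here is a minimal, hypothetical Java sketch of partition-parallel writes to HDFS using the standard Hadoop FileSystem API. It is not Information Server code; the namenode address, paths, partition count, and fetchPartition() feed are all invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; in the tool this is the "server name and port" setting.
        URI hdfs = URI.create("hdfs://namenode.example.com:8020");
        int partitions = 8; // one independent writer stream per partition

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (int p = 0; p < partitions; p++) {
            final int part = p;
            pool.submit(() -> {
                // Each thread owns its own FileSystem handle and output file,
                // so the partitions stream into HDFS concurrently.
                try (FileSystem fs = FileSystem.get(hdfs, conf);
                     FSDataOutputStream out =
                             fs.create(new Path("/staging/orders/part-" + part))) {
                    for (String record : fetchPartition(part)) { // assumed data source
                        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the upstream parallel engine feeding each partition.
    private static Iterable<String> fetchPartition(int part) {
        return java.util.List.of("record-A-" + part, "record-B-" + part);
    }
}
```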
Make data available to Hadoop in real time
• Non-invasive record capture – read data from transactional database logs to minimize impact on source systems.
• High-speed data replication – low-latency capture and delivery of real-time information.
• Consistently current Hadoop data – data is available in Hadoop moments after it is committed in source databases, accelerating analytics currency.
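A hedged Java sketch of the log-based capture pattern described above: a stand-in ChangeLogReader (a real product would use the database's replication or log-reading API) delivers committed changes, which are appended to an HDFS file and flushed so readers see them promptly. All names and the append-based layout are assumptions, not the product's design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical log-based CDC consumer: ships committed changes to HDFS. */
public class CdcToHdfs {
    // Stand-in for a replication client that tails the database's transaction log.
    interface ChangeLogReader { String nextCommittedChange() throws Exception; }

    public static void main(String[] args) throws Exception {
        ChangeLogReader reader = connectToSourceLog(); // assumed helper
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf)) {
            Path target = new Path("/landing/orders/changes.log");
            if (!fs.exists(target)) fs.createNewFile(target);
            // Append each committed change shortly after it happens, so Hadoop
            // stays moments behind the source (requires HDFS append support).
            try (FSDataOutputStream out = fs.append(target)) {
                String change;
                while ((change = reader.nextCommittedChange()) != null) {
                    out.write((change + "\n").getBytes(StandardCharsets.UTF_8));
                    out.hflush(); // make the change visible to readers promptly
                }
            }
        }
    }

    private static ChangeLogReader connectToSourceLog() {
        // Illustration only: a real implementation would read transaction logs,
        // not replay a canned list of changes.
        java.util.Iterator<String> demo =
                java.util.List.of("INSERT orders 1001", "UPDATE orders 1001").iterator();
        return () -> demo.hasNext() ? demo.next() : null;
    }
}
```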
Best Practice #3
Massively Scalable Data Integration Wherever It Needs To Run
What does this mean?
Design once:
• Develop the logic in the same manner regardless of execution platform.
Scale anywhere:
• Execute the logic in any of the five patterns for scalable data integration; no single pattern is sufficient.
• Case 1: InfoServer parallel engine running against any traditional data source (outside the Hadoop environment)
• Case 2: Push processing into a parallel database (outside the Hadoop environment)
• Case 3: Move and process data in parallel between environments
• Case 4: Push processing into Hadoop MapReduce (within the Hadoop environment)
• Case 5: InfoServer parallel engine running against HDFS without MapReduce (within the Hadoop environment)
Information Server is the only Big Data Integration platform supporting all 5 use cases.
Information Server is Big Data Integration
• Dynamic – instantly get better performance as hardware resources are added.
• Extendable – add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
• Data partitioned – in true MPP fashion (like Hadoop), data is partitioned across the DI engine's parallel nodes to scale out the I/O.
Pipeline: Source Data → Transform → Cleanse → Enrich → EDW
(Figure: scaling from a sequential uniprocessor, to a 4-way parallel SMP system with shared memory, to a 64-way parallel MPP clustered system; each node contributes disk, CPU, and memory.)
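The partitioned, pipelined execution model can be illustrated with a small Java sketch (not Information Server internals): records are hash-partitioned by key across worker threads, and each worker runs the transform → cleanse → enrich stages on its own partition. All stage logic, record formats, and names are invented.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class PartitionedPipeline {
    public static void main(String[] args) throws Exception {
        List<String> source = List.of("alice,9", "BOB,17", "carol,3", "dave,11");
        int partitions = 2;

        // Stage functions: each record flows transform -> cleanse -> enrich.
        Function<String, String> transform = r -> r.toLowerCase();
        Function<String, String> cleanse   = r -> r.trim();
        Function<String, String> enrich    = r -> r + ",checked";

        ExecutorService workers = Executors.newFixedThreadPool(partitions);
        List<BlockingQueue<String>> queues = new java.util.ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.add(q);
            final int part = p;
            workers.submit(() -> {
                try {
                    String rec;
                    while (!"EOF".equals(rec = q.take())) {
                        String out = enrich.apply(cleanse.apply(transform.apply(rec)));
                        System.out.println("partition-" + part + ": " + out);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Hash-partition the stream so each worker sees a disjoint key range,
        // scaling out both the CPU work and the I/O.
        for (String rec : source) {
            int part = Math.abs(rec.split(",")[0].hashCode()) % partitions;
            queues.get(part).put(rec);
        }
        for (BlockingQueue<String> q : queues) q.put("EOF");
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```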
Information Server: customer stories with Big Data
Using the Information Server MPP engine (customers include two global banks, a data services company, and two health care organizations):
• 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware
• Text analytics across 200 million medical documents in a weekend, creating indexes to support optimal retrieval by users
• Desensitizes 200 TB of data one weekend each month to populate dev environments
• Processes 50,000 transactions per second with complex transformation and guaranteed delivery
• Information Server-powered grid processing over 40+ trillion records each month
Where should you run scalable data integration?
Run In the Database
Advantages:
• Exploit database MPP engine
• Minimize data movement
• Leverage database for joins/aggregations
• Works best when data is already clean
• Frees up cycles on ETL server
• Use excess capacity on RDBMS server
• Database faster for some processes
Disadvantages:
• Very expensive hardware and storage
• Can force 100% reliance on ELT
• Degradation of query SLAs
• Not all ETL logic can be pushed into
RDBMS (with ELT tool or hand coding)
• Can’t exploit commodity hardware
• Usually requires hand coding
• Limitations on complex transformations
• Limited data cleansing
• Database slower for some processes
• ELT can consume RDBMS capacity (capacity planning is nontrivial)
Run in the DI engine
Advantages:
• Exploit ETL MPP engine
• Exploit commodity hardware and
storage
• Exploit grid to consolidate SMP
servers
• Perform complex transforms (data
cleansing) that can’t be pushed into
RDBMS
• Free up capacity on RDBMS server
• Process heterogeneous data sources
(not stored in the database)
• ETL server faster for some
processes
Disadvantages:
• ETL server slower for some
processes (data already stored in
relational tables)
• May require extra hardware (albeit low-cost hardware)
Run in Hadoop
Advantages:
• Exploit MapReduce MPP engine
• Exploit commodity hardware and
storage
• Free up capacity on the database
server
• Support processing of unstructured
data
• Exploit Hadoop’s capabilities for persisting data (e.g. updating and indexing)
• Low cost archiving of history data
Disadvantages:
• Not all ETL logic can be pushed into MapReduce (with an ELT tool or hand coding)
• Can require complex programming
• MapReduce will usually be much
slower than parallel database or
scalable ETL tool
• Risk: Hadoop is still a young
technology
Big Data Integration requires a balanced approach that supports all of the above
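To make the “run in the database” versus “run in the DI engine” trade-off concrete, here is a hedged JDBC sketch in Java: the first statement pushes a set-based transformation down into the database (ELT), while the second pulls rows out and applies engine-side logic that is awkward to express in SQL. The connection string, tables, and cleansing rule are invented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Hedged illustration of ELT pushdown versus engine-side ETL.
 *  Table and column names are invented for the example. */
public class PushdownVsEngine {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://warehouse.example.com/dw", "etl", "secret")) {

            // ELT pushdown: the transformation (filtering, aggregation) executes
            // on the database's MPP engine; no data leaves the server.
            try (Statement s = db.createStatement()) {
                s.executeUpdate(
                    "INSERT INTO clean_orders (customer_id, total) " +
                    "SELECT customer_id, SUM(amount) " +
                    "FROM raw_orders WHERE amount > 0 " +
                    "GROUP BY customer_id");
            }

            // Engine-side ETL: stream rows out and apply logic that is hard to
            // express in SQL (e.g. fuzzy name cleansing) in the DI engine.
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery(
                         "SELECT customer_id, name FROM raw_customers")) {
                while (rs.next()) {
                    String cleansed = cleanseName(rs.getString("name"));
                    System.out.println(rs.getLong("customer_id") + " -> " + cleansed);
                }
            }
        }
    }

    private static String cleanseName(String raw) {
        // Stand-in for complex cleansing that can't be pushed into the RDBMS.
        return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
    }
}
```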
Automated MapReduce Job Generation
• Leverage the same UI and the same stages to automatically build MapReduce.
• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
• Push the processing to the data for patterns where you don’t want to move the data over the network.
Automated MapReduce Job Generation
• Build integration jobs with the same data flow tool and stages.
• The tool automatically creates the MapReduce code.
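For contrast, this is roughly the kind of boilerplate a generated job replaces: a minimal hand-coded Hadoop MapReduce job (standard org.apache.hadoop.mapreduce API) that groups delimited records by their first field and counts them. The record layout and class names are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hand-coded MapReduce job that groups and counts records by key. */
public class KeyCount {
    public static class KeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assume comma-delimited records; the first field is the grouping key.
            ctx.write(new Text(line.toString().split(",")[0]), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "key-count");
        job.setJarByClass(KeyCount.class);
        job.setMapperClass(KeyMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```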
So why is pushdown of ETL into MapReduce not sufficient for Big Data Integration?
Big Data Integration also requires running scalable data integration workloads outside of the Hadoop MapReduce environment. Complex data integration logic can’t be pushed into a parallel database or MapReduce easily and efficiently, or at all in some cases.
• IBM’s experiences with customers’ early Hadoop initiatives have shown that much of their data integration processing logic can’t be pushed into MapReduce.
• Without Information Server, these more complex data integration processes would have to be hand coded to run in MapReduce, increasing project time, cost, and complexity.
MapReduce has significant and known performance limitations:
• For processing large data volumes with complex transformations (including data integration).
• Many Big Data vendors and researchers are focusing on bypassing MapReduce performance limitations.
DataStage will process data integration 10X–15X faster than MapReduce.
Best Practice #4
World-Class Data Governance Across The Enterprise
What does this mean?
• Both IT and line of business need to have a high degree of confidence in the data.
• Confidence requires that data is understood to be of high quality, secure, and fit-for-purpose:
• Where does the data in my report come from?
• What is being done with it inside of Hadoop?
• Where was it before reaching our data lake?
• Oftentimes these requirements extend from regulations within the specific industry.
Why Is Data Governance Critical For Big Data?
• How well do your business users understand the content of the information in your Big Data stores?
• Are you measuring the quality of your information in Big Data?
All data needs to build confidence via a
Fully Governed Data Lifecycle
Find
• Leverage Terms, Labels
and Collections to find
governed, curated data
sources
Curate
• Add Labels, Terms,
Custom Properties to
relevant assets
Collect
• Use Collections to
capture assets for a
specific analysis or
governance effort
Collaborate
• Share Collections for
additional Curation and
Governance
Govern
• Create and reference IG
Policies and Rules
• Apply DQ, Masking,
Archiving, Cleansing, …
to data
Offload
• Copy data in one click to HDFS for analysis or warehouse augmentation
Analyze
• Perform analyses on
offloaded data
Reuse & Trust
• Understand how data is
being used today via
lineage for analyses and
reports
Best Practice #5
Robust Administration And Operations Control Across The Enterprise
What does this mean?
• Operations management for Big Data Integration:
• Provides quick answers for operators, developers, and other stakeholders as they monitor the run-time environment.
• Workload Management allocates resource priority in a shared services environment and queues workload on a busy system.
• Performance Analysis provides insight into resource consumption to help understand when the system may need more resources.
• Build workflows that include Hadoop activities defined via Oozie directly alongside other data integration activities (see the sketch after this list).
• Administrative management for Big Data Integration:
• Web-based installer for all Integration & Governance capabilities.
• Highly available configurations for meeting 24/7 requirements.
• Instantly provision/deploy a new project instance.
• Centralized authentication, authorization, and session management.
• Audit logging of security-related events to promote SOX compliance.
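As a hedged illustration of driving Oozie-defined Hadoop activities programmatically (not Information Server's own mechanism), the following Java sketch submits a workflow with the standard Oozie client API and polls it to completion. The Oozie URL, HDFS application path, and queue name are invented.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

/** Submits a combined workflow to Oozie and polls its status. */
public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Points at a workflow.xml in HDFS that mixes Hadoop actions
        // (e.g. map-reduce, hive) with other integration activities.
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode.example.com:8020/workflows/nightly-etl");
        conf.setProperty("queueName", "etl");

        String jobId = oozie.run(conf); // submit and start in one call
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000); // a monitoring console would watch this instead
        }
        System.out.println("Workflow " + jobId + " finished: "
                + oozie.getJobInfo(jobId).getStatus());
    }
}
```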
Combined Workflows For Big Data
• Simple design paradigm for workflows (the same as for job design).
• Mix any Oozie activity right alongside other data integration activities.
• Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single coordinating process.
• Monitor all stages through the Operations Console’s web-based interface.
Automotive manufacturer
uses Big Data Integration to
build out global data warehouse
Challenges
• Need to manage massive amounts of vehicle data – about
5TB per day
• Need to understand, incorporate, and correlate a variety
of data sources to better understand problems and
product quality issues
• Need to share information across functional teams for
improved decision making
Business Benefits
• Doubled the number of models earning JD Power initial quality study awards in 1 year
• Improved and streamlined decision making and system
efficiency
• Lowered warranty costs
IT Benefits
• Single infrastructure to consolidate structured, semi-structured, and unstructured data; simplified management
• Optimize the existing Teradata environment – size, performance, and TCO
• High-performance ETL for in-database transformations
Other Examples of Proven Value of IBM Big Data Integration
European telco
• ELT pushdown into
the database and
Hadoop was not
sufficient for Big
Data Integration
• InfoSphere
DataStage runs
some DI processes
faster than the
parallel database
and MapReduce
Wireless carrier
• InfoSphere Information
Server can transform a
dirty Hadoop lake into
a clean Hadoop lake
• IBM met requirements
for processing 25
terabytes in 24 hours
• Capabilities of
InfoSphere Information
Server and InfoSphere
Optim data masking all
helped to produce a
clean Hadoop lake
Insurance company
• IBM could shred complex XML claims messages and flatten them for Hadoop reporting and analysis (a shredding sketch follows this slide)
• IBM could meet all
requirements for large-
scale batch processing
• Information Server could
adjust for changes in
XML structure while
tolerating future
unanticipated changes
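A minimal Java sketch of the XML “shredding” idea, under an invented claim schema: nested repeating elements are flattened into one delimited row per item, with parent fields repeated, so downstream Hadoop jobs see simple flat records.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

/** Shreds a nested XML claim into flat, delimited records. */
public class ClaimShredder {
    public static void main(String[] args) throws Exception {
        String xml = "<claim id=\"42\"><insured>Ada</insured>"
                   + "<items><item code=\"A1\" amount=\"100\"/>"
                   + "<item code=\"B2\" amount=\"250\"/></items></claim>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Element claim = doc.getDocumentElement();
        String claimId = claim.getAttribute("id");
        String insured = claim.getElementsByTagName("insured").item(0).getTextContent();

        // One flat output row per repeating <item>: parent fields are repeated
        // so downstream Hadoop jobs can work with simple delimited records.
        NodeList items = claim.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(String.join("|",
                    claimId, insured, item.getAttribute("code"), item.getAttribute("amount")));
        }
    }
}
```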
Summing it up
Increase ROI for Hadoop via the 5 Best Practices for Big Data Integration
• Speed productivity – graphical design is easier to use than hand coding.
• Promote object reuse – build once, share, and run anywhere (ETL/ELT/real-time).
• Simplify heterogeneity – a common method for diverse data sources.
• Shorten project cycles – pre-built components reduce cost and timelines.
• Reduce operational cost – provides a robust framework to manage data integration.
• Protect from changes – isolation from underlying technologies as they continue to evolve.
Thank you

Mais conteúdo relacionado

Mais procurados

The Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceThe Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceEDB
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the OrganizationSeeling Cheung
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015Daniela Zuppini
 
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...Dell EMC World
 
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...Chad Lawler
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17David Spurway
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleHortonworks
 
IBM Spectrum Scale and Its Use for Content Management
 IBM Spectrum Scale and Its Use for Content Management IBM Spectrum Scale and Its Use for Content Management
IBM Spectrum Scale and Its Use for Content ManagementSandeep Patil
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Precisely
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalDiego Alberto Tamayo
 
Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Vinay Mistry
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
Transform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationTransform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationEDB
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15IBMInfoSphereUGFR
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTMT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTDell EMC World
 
Powerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsPowerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsEDB
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationHortonworks
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...VMware Tanzu
 

Mais procurados (20)

The Value of Postgres to IT and Finance
The Value of Postgres to IT and FinanceThe Value of Postgres to IT and Finance
The Value of Postgres to IT and Finance
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015
 
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
MT125 Virtustream Enterprise Cloud: Purpose Built to Run Mission Critical App...
 
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
The Executive View on Big Data Platform Hosting - Evaluating Hosting Services...
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at Scale
 
IBM Spectrum Scale and Its Use for Content Management
 IBM Spectrum Scale and Its Use for Content Management IBM Spectrum Scale and Its Use for Content Management
IBM Spectrum Scale and Its Use for Content Management
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
 
Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015Nippon It Solutions Data services offering 2015
Nippon It Solutions Data services offering 2015
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Transform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement InnovationTransform DBMS to Drive Apps of Engagement Innovation
Transform DBMS to Drive Apps of Engagement Innovation
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoTMT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
MT11 - Turn Science Fiction into Reality by Using SAP HANA to Make Sense of IoT
 
Powerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & SavingsPowerplay: Postgres and Lenovo for the Best Performance & Savings
Powerplay: Postgres and Lenovo for the Best Performance & Savings
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
Optimalisert datasenter
Optimalisert datasenterOptimalisert datasenter
Optimalisert datasenter
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
 

Destaque

Talend For Big Data : Secret Key to Hadoop
Talend For Big Data  : Secret Key to HadoopTalend For Big Data  : Secret Key to Hadoop
Talend For Big Data : Secret Key to HadoopEdureka!
 
Data Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyData Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyAlasdair Gray
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationSnapLogic
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integrationibi
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendEdureka!
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend Edureka!
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 

Destaque (7)

Talend For Big Data : Secret Key to Hadoop
Talend For Big Data  : Secret Key to HadoopTalend For Big Data  : Secret Key to Hadoop
Talend For Big Data : Secret Key to Hadoop
 
Data Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case StudyData Integration in a Big Data Context: An Open PHACTS Case Study
Data Integration in a Big Data Context: An Open PHACTS Case Study
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data Integration
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integration
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with Talend
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 

Semelhante a Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration

Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Group
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsjdijcks
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWSAmazon Web Services
 
Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3IBMInfoSphereUGFR
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformEMC
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataHalo BI
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)Xavier Constant
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseRizaldy Ignacio
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantagePrecisely
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Cloudera, Inc.
 
7 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 20227 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 2022Safe Software
 

Semelhante a Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration (20)

Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analytics
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS
 
Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3Présentation IBM InfoSphere Information Server 11.3
Présentation IBM InfoSphere Information Server 11.3
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big Data
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 
7 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 20227 Emerging Data & Enterprise Integration Trends in 2022
7 Emerging Data & Enterprise Integration Trends in 2022
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Faster, Cheaper, Easier... and Successful Best Practices for Big Data Integration

  • 10. Best Practice #2: One Data Integration and Governance Platform for the Enterprise
  What does this mean:
  • Build a job once and run it anywhere, on any platform in the enterprise, without modification
  • Access, move, and load data between a variety of sources and targets across the enterprise
  • Support a variety of data integration paradigms:
    • Batch processing
    • Federation
    • Change data capture
    • SOA enablement of data integration tasks
    • Real-time with transactional integrity
    • Self-service for business users
  • Support the establishment of world-class data governance across the enterprise
  • 11. Self-Service Big Data Integration On-Demand: InfoSphere Data Click
  • Provides a simple web-based interface for any user
  • Moves data in batch or real time in a few clicks
  • Captures policy choices that are then automated, without any coding
  • Optimized runtime
  • Automatically captures metadata for built-in governance
  "I have a feeling before long Gartner will be telling us if we're not doing this something is wrong." – an IBM customer
  • 12. Optimized for Hadoop with blazing-fast HDFS speeds
  • Extends the same easy drag-and-drop design paradigm: simply add your Hadoop server name and port number
  • The Information Server engine uses streaming parallelization techniques to pipe data in and out of HDFS at massive scale
  • A performance study sustained up to 15 TB/hr before the HDFS disks were completely saturated
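Information Server's parallel HDFS connector is proprietary, but the pattern the slide describes (partition the load, then stream the partitions into HDFS concurrently) can be sketched with the open-source `hdfs` WebHDFS client for Python. This is a minimal sketch under assumptions: the namenode URL, target paths, and the `read_source_partition` reader are all invented for illustration, and this is not the engine's actual implementation.

```python
# Illustrative sketch of a partitioned, streaming HDFS load.
# Requires the open-source WebHDFS client: pip install hdfs
from concurrent.futures import ThreadPoolExecutor
from hdfs import InsecureClient

# Hypothetical namenode address and user.
client = InsecureClient("http://namenode.example.com:50070", user="etl")

def read_source_partition(partition_id: int):
    # Stand-in for a real partitioned source reader.
    yield from (f"{partition_id},{i},sample" for i in range(1000))

def load_partition(partition_id: int) -> str:
    """Stream one partition of records into its own HDFS file."""
    target = f"/data/landing/orders/part-{partition_id:05d}.csv"
    with client.write(target, encoding="utf-8", overwrite=True) as writer:
        for record in read_source_partition(partition_id):
            writer.write(record + "\n")
    return target

# Write partitions concurrently so throughput scales with HDFS bandwidth.
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(load_partition, range(8)):
        print("loaded", path)
```

The point of the sketch is the shape of the workload: many independent writers saturating HDFS in parallel is what lets a load approach the disk-bandwidth ceiling the slide's study hit.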
  • 13. Make data available to Hadoop in real time
  • Non-invasive record capture: read data from transactional database logs to minimize impact on source systems
  • High-speed data replication: low-latency capture and delivery of real-time information
  • Consistently current Hadoop data: data is available in Hadoop moments after it is committed in the source databases, improving the currency of analytics
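A real log-based CDC product, as described on this slide, reads the database's transaction log directly; that cannot be shown faithfully in a few lines. As a conceptual stand-in only, the sketch below uses a commit-timestamp high-water mark to show the capture-and-deliver loop. Everything here is invented for illustration: the table, columns, and the delivery stub.

```python
# Conceptual sketch of incremental capture and delivery. A timestamp
# high-water mark stands in for reading the transaction log; real
# log-based CDC avoids touching source tables at all.
import sqlite3

conn = sqlite3.connect("source.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders (
        id INTEGER PRIMARY KEY, payload TEXT, committed_at TEXT
    );
""")

def deliver_to_hadoop(rows):
    # Stand-in for low-latency delivery into HDFS/Hive.
    for row in rows:
        print("replicated:", row)

high_water_mark = "1970-01-01 00:00:00"
for _ in range(3):  # a real capture process runs continuously
    rows = conn.execute(
        "SELECT id, payload, committed_at FROM orders "
        "WHERE committed_at > ? ORDER BY committed_at",
        (high_water_mark,),
    ).fetchall()
    if rows:
        deliver_to_hadoop(rows)
        high_water_mark = rows[-1][2]  # advance the capture position
```

Polling like this adds latency and load; the reason the slide emphasizes log reading is precisely that it delivers changes moments after commit without either cost.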
  • 14. Best Practice #3: Massively Scalable Data Integration Wherever It Needs to Run
  What does this mean:
  • Design once: develop the logic in the same manner regardless of execution platform
  • Scale anywhere: execute the logic in any of the five patterns for scalable data integration; no single pattern is sufficient
  Outside the Hadoop environment:
  • Case 1: Information Server parallel engine running against any traditional data source
  • Case 2: Push processing into a parallel database
  Between environments:
  • Case 3: Move and process data in parallel between environments
  Within the Hadoop environment:
  • Case 4: Push processing into Hadoop MapReduce
  • Case 5: Information Server parallel engine running against HDFS without MapReduce
  Information Server is the only Big Data Integration platform supporting all five use cases.
  • 15. Information Server is Big Data Integration
  • Dynamic: instantly get better performance as hardware resources are added
  • Extendable: add a new server to scale out through a simple text-file edit (or, in a grid configuration, automatically via integration with grid-management software)
  • Data partitioned: in true MPP fashion (like Hadoop), data is partitioned across the DI engine's parallel nodes to scale out the I/O
  [Diagrams: a pipeline of Source Data → Transform → Cleanse → Enrich → EDW, and the scaling path from a sequential uniprocessor to a 4-way parallel SMP system to a 64-way parallel MPP clustered system]
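To make the "data partitioned" bullet concrete, here is a minimal sketch of the partition-parallel pattern: split the data into N partitions and run the same transform on every partition at once. The transform and data are toys; the real engine partitions across processes and machines via its configuration file, not a Python pool.

```python
# Partition parallelism in miniature: same logic, N partitions, N workers.
from multiprocessing import Pool

def transform(record: str) -> str:
    """Toy transform: trim fields and upper-case the name field."""
    fields = [f.strip() for f in record.split(",")]
    fields[1] = fields[1].upper()
    return ",".join(fields)

def process_partition(partition: list) -> list:
    return [transform(r) for r in partition]

if __name__ == "__main__":
    data = [f"{i}, customer{i}, active" for i in range(100_000)]
    nway = 4  # degree of parallelism; a text-file edit in the real engine
    partitions = [data[i::nway] for i in range(nway)]  # round-robin split
    with Pool(nway) as pool:
        results = pool.map(process_partition, partitions)
    print(sum(len(p) for p in results), "records transformed")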
  • 16. Information Server: customer stories with Big Data, using the Information Server MPP engine
  • Global bank: 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware
  • Data services company: runs text analytics across 200 million medical documents over a weekend, creating indexes to support optimal retrieval by users
  • Global bank: desensitizes 200 TB of data one weekend each month to populate its development environments
  • Health care: processes 50,000 transactions per second with complex transformation and guaranteed delivery
  • Health care: Information Server powered grid processing of over 40 trillion records each month
  • 17. Where should you run scalable data integration?
  Run in the database
  Advantages:
  • Exploits the database MPP engine; minimizes data movement
  • Leverages the database for joins/aggregations; works best when data is already clean
  • Frees up cycles on the ETL server; uses excess capacity on the RDBMS server
  • The database is faster for some processes
  Disadvantages:
  • Very expensive hardware and storage; can't exploit commodity hardware
  • Can force 100% reliance on ELT; usually requires hand coding
  • Degradation of query SLAs
  • Not all ETL logic can be pushed into the RDBMS (with an ELT tool or hand coding)
  • Limitations on complex transformations; limited data cleansing
  • The database is slower for some processes
  • ELT can consume RDBMS capacity (capacity planning is nontrivial)
  Run in the DI engine
  Advantages:
  • Exploits the ETL MPP engine, plus commodity hardware and storage
  • Exploits a grid to consolidate SMP servers
  • Performs complex transforms (such as data cleansing) that can't be pushed into the RDBMS
  • Frees up capacity on the RDBMS server
  • Processes heterogeneous data sources (not stored in the database)
  • The ETL server is faster for some processes
  Disadvantages:
  • The ETL server is slower for some processes (data already stored in relational tables)
  • May require extra (low-cost) hardware
  Run in Hadoop
  Advantages:
  • Exploits the MapReduce MPP engine, plus commodity hardware and storage
  • Frees up capacity on the database server
  • Supports processing of unstructured data
  • Exploits Hadoop's capabilities for persisting data (e.g., updating and indexing)
  • Low-cost archiving of historical data
  Disadvantages:
  • Not all ETL logic can be pushed into MapReduce (with an ELT tool or hand coding)
  • Can require complex programming
  • MapReduce will usually be much slower than a parallel database or a scalable ETL tool
  • Risk: Hadoop is still a young technology
  Big Data Integration requires a balanced approach that supports all of the above; a sketch contrasting the first two placements follows.
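To ground the "run in the database" versus "run in the DI engine" tradeoff, here is a small sketch of both placements for the same aggregation. The schema and tables are invented, and SQLite stands in for a parallel warehouse. With pushdown (ELT), one generated SQL statement runs inside the database and no rows cross the network; with engine-side ETL, rows are extracted so that arbitrary transform logic can be applied before loading.

```python
# Two placements for the same aggregation; schema invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    CREATE TABLE sales_summary (region TEXT, total REAL);
    INSERT INTO sales VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);
""")

# Option A: ELT pushdown. The generated SQL executes in-database;
# data never leaves the RDBMS.
conn.execute("""
    INSERT INTO sales_summary (region, total)
    SELECT region, SUM(amount) FROM sales GROUP BY region
""")

# Option B: engine-side ETL. Rows are extracted and transformed outside
# the database (where cleansing and complex logic are possible), then
# loaded back.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0.0) + amount  # complex logic here
conn.executemany(
    "INSERT INTO sales_summary (region, total) VALUES (?, ?)",
    totals.items(),
)
conn.commit()
```

Option A is hard to beat when the logic is expressible in SQL and the data is already in the warehouse; Option B wins once the logic involves cleansing, matching, or heterogeneous sources, which is exactly the balance the table above argues for.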
  • 18. Automated MapReduce job generation
  • Leverage the same UI and the same stages to automatically build MapReduce jobs
  • Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming
  • Push the processing to the data for patterns where you don't want to move the data across the network
  • 19. Automated MapReduce job generation
  • Build integration jobs with the same data-flow tool and stages; the tooling automatically generates the MapReduce code
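The code Information Server generates is not public, so as a stand-in for what a generated job replaces, here is a hand-written Hadoop Streaming equivalent of a simple per-customer aggregation in Python. The input layout (id, customer_id, amount) is invented; the point is how much plumbing even a trivial MapReduce job requires compared to dropping two stages on a canvas.

```python
#!/usr/bin/env python3
# mapper.py: emit (customer_id, amount) for each input line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) >= 3:
        print(f"{fields[1]}\t{fields[2]}")
```

```python
#!/usr/bin/env python3
# reducer.py: sum amounts per customer. Hadoop Streaming sorts by key,
# so all values for a key arrive contiguously.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

These would be run with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out` (the exact streaming-jar path varies by distribution).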
  • 20. So why is pushdown of ETL into MapReduce not sufficient for Big Data Integration?
  Big Data Integration also requires running scalable data integration workloads outside of the Hadoop MapReduce environment. Complex data integration logic can't be pushed into a parallel database or MapReduce easily and efficiently, or at all in some cases.
  • IBM's experience with customers' early Hadoop initiatives has shown that much of their data integration processing logic can't be pushed into MapReduce
  • Without Information Server, these more complex data integration processes would have to be hand coded to run in MapReduce, increasing project time, cost, and complexity
  MapReduce has significant and known performance limitations:
  • For processing large data volumes with complex transformations (including data integration)
  • Many Big Data vendors and researchers are focusing on bypassing MapReduce's performance limitations
  DataStage will process data integration workloads 10x to 15x faster than MapReduce.
  • 21. Best Practice #4: World-Class Data Governance Across the Enterprise
  What does this mean:
  • Both IT and the line of business need to have a high degree of confidence in the data
  • Confidence requires that data is understood to be of high quality, secure, and fit for purpose:
    • Where does the data in my report come from?
    • What is being done with it inside of Hadoop?
    • Where was it before reaching our data lake?
  • Oftentimes these requirements extend from regulations within the specific industry
  • 22. Why is data governance critical for Big Data?
  • How well do your business users understand the content of the information in your Big Data stores?
  • Are you measuring the quality of your information in Big Data?
  • 23. All data needs to build confidence via a fully governed data lifecycle
  • Find: leverage terms, labels, and collections to find governed, curated data sources
  • Curate: add labels, terms, and custom properties to relevant assets
  • Collect: use collections to capture assets for a specific analysis or governance effort
  • Collaborate: share collections for additional curation and governance
  • Govern: create and reference information governance policies and rules; apply data quality, masking, archiving, cleansing, and similar controls to data
  • Offload: copy data in one click to HDFS for analysis or warehouse augmentation
  • Analyze: perform analyses on the offloaded data
  • Reuse & trust: understand how data is being used today, via lineage for analyses and reports
  • 24. Best Practice #5: Robust Administration and Operations Control Across the Enterprise
  What does this mean:
  • Operations management for Big Data Integration:
    • Provides quick answers for operators, developers, and other stakeholders as they monitor the runtime environment
    • Workload management allocates resource priority in a shared-services environment and queues workload on a busy system
    • Performance analysis provides insight into resource consumption, to help understand when the system may need more resources
    • Build workflows that include Hadoop activities defined via Oozie, directly alongside other data integration activities
  • Administrative management for Big Data Integration:
    • Web-based installer for all integration and governance capabilities
    • Highly available configurations for meeting 24/7 requirements
    • Instantly provision/deploy a new project instance
    • Centralized authentication, authorization, and session management
    • Audit logging of security-related events to promote SOX compliance
  • 25. Combined workflows for Big Data
  • Simple design paradigm for workflows (the same as for job design)
  • Mix any Oozie activity right alongside other data integration activities (see the sketch below)
  • Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single coordinating process
  • Monitor all stages through the Operations Console's web-based interface
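To illustrate what "mixing in an Oozie activity" involves at the protocol level, here is a minimal sketch of submitting and checking a Hadoop workflow through Apache Oozie's REST job endpoint. Host names and HDFS paths are invented, and this is only one activity in isolation; a real Information Server sequence would coordinate this alongside its other data integration activities.

```python
# Sketch: submit an Oozie workflow via its REST API and check its status.
# Hypothetical hosts/paths; see the Apache Oozie Web Services API docs.
import requests

OOZIE = "http://oozie.example.com:11000/oozie"

config = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property><name>user.name</name><value>etl</value></property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://namenode.example.com:8020/workflows/claims-etl</value>
  </property>
</configuration>"""

# Submit and start the workflow in one call.
resp = requests.post(
    f"{OOZIE}/v1/jobs",
    params={"action": "start"},
    data=config,
    headers={"Content-Type": "application/xml"},
)
resp.raise_for_status()
job_id = resp.json()["id"]
print("submitted Oozie workflow:", job_id)

# Poll the workflow, mirroring what an operations console does.
info = requests.get(f"{OOZIE}/v1/job/{job_id}", params={"show": "info"})
print("status:", info.json()["status"])
```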
  • 26. Automotive manufacturer uses Big Data Integration to build out a global data warehouse
  Challenges:
  • Need to manage massive amounts of vehicle data: about 5 TB per day
  • Need to understand, incorporate, and correlate a variety of data sources to better understand problems and product-quality issues
  • Need to share information across functional teams for improved decision making
  Business benefits:
  • Doubled the number of models winning the J.D. Power Initial Quality Study award in one year
  • Improved and streamlined decision making and system efficiency
  • Lowered warranty costs
  IT benefits:
  • A single infrastructure to consolidate structured, semi-structured, and unstructured data, with simplified management
  • Optimized the existing Teradata environment: size, performance, and TCO
  • High-performance ETL for in-database transformations
  • 27. Other examples of the proven value of IBM Big Data Integration
  European telco:
  • ELT pushdown into the database and Hadoop was not sufficient for Big Data Integration
  • InfoSphere DataStage runs some DI processes faster than the parallel database and MapReduce
  Wireless carrier:
  • InfoSphere Information Server can transform a dirty Hadoop lake into a clean Hadoop lake
  • IBM met requirements for processing 25 terabytes in 24 hours
  • The capabilities of InfoSphere Information Server and InfoSphere Optim data masking all helped to produce a clean Hadoop lake
  Insurance company:
  • IBM could shred complex XML claims messages and flatten them for Hadoop reporting and analysis (a sketch of this pattern follows)
  • IBM could meet all requirements for large-scale batch processing
  • Information Server could adjust for changes in XML structure while tolerating future unanticipated changes
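The "shredding" pattern from the insurance example can be sketched in a few lines: walk a nested XML claim and emit one flat, delimited record per repeating element. The claim structure and tag names here are invented, and tolerating structural change is only approximated by looking elements up by tag rather than by fixed position; the real solution handled this far more robustly.

```python
# Illustrative sketch: flatten a nested XML claim into delimited records
# suitable for Hadoop-side reporting. Tag names invented for illustration.
import xml.etree.ElementTree as ET

claim_xml = """
<claim id="C-1001">
  <policy number="P-77"/>
  <lines>
    <line code="A12" amount="120.50"/>
    <line code="B07" amount="33.00"/>
  </lines>
</claim>"""

def shred_claim(xml_text: str):
    root = ET.fromstring(xml_text)
    claim_id = root.get("id", "")
    policy_el = root.find("policy")
    policy = policy_el.get("number", "") if policy_el is not None else ""
    # One flat record per claim line: claim_id|policy|code|amount.
    # iter() finds <line> by tag, so extra nesting levels don't break it.
    for line in root.iter("line"):
        yield "|".join(
            [claim_id, policy, line.get("code", ""), line.get("amount", "")]
        )

for record in shred_claim(claim_xml):
    print(record)   # e.g. C-1001|P-77|A12|120.50
```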
  • 28. Summing it up: increase ROI for Hadoop via the five best practices for Big Data Integration
  • Speed productivity: graphical design is easier to use than hand coding
  • Simplify heterogeneity: a common method for diverse data sources
  • Shorten project cycles: pre-built components reduce cost and timelines
  • Promote object reuse: build once, share, and run anywhere (ETL/ELT/real time)
  • Reduce operational cost: provides a robust framework to manage data integration
  • Protect from change: isolation from changes in underlying technologies as they continue to evolve
  • 29. Thank you