Hadoop is not an Island in the Enterprise
1. Understanding Deployment Practices that Merge the
Strengths of Hadoop and the Data Warehouse
Joe Rao
PS Consultant, Teradata Corporation
HADOOP IS NOT AN ISLAND
IN THE ENTERPRISE:
2. 2 6/17/2014 Teradata Confidential
This presentation covers
• A comparison of the strengths of Hadoop and a Data
Warehouse
• Architectures that involve Hadoop and the data warehouse
working together
AGENDA
3.
• Our two platforms:
> The Data Platform – Hadoop
> The Enterprise Data Warehouse – Teradata
• Both platforms could handle everything by themselves
if we really wanted them to
• Biased organizations will favor one over the other, and
argue that everything can be done in one place
• And they're both right
FRAMING THE DISCUSSION
4.
• Let's consider a software startup or
company that has no IT department yet
• They need to:
> Acquire their technology from scratch
> Build business logic from scratch
> Staff their new department from scratch
• With no existing technology, how should
they structure their data center?
FRAMING THE DISCUSSION
5.
• Traditional data warehouses (like the Teradata
database) have been used as the central repository
of business data for years.
• Data warehouses are great with:
> Thousands of concurrent users and queries
> Full ANSI SQL interfaces
> Very complex SQL query logic
> Advanced workload management
> Transactional capabilities
> Secure access
DATA WAREHOUSE STRENGTHS
6.
• Many companies that have been doing things the
old way with a data warehouse don't think they need
to change anything
• What they've been doing has worked for years. Hadoop
is young and immature, they say. Why change?
• These companies are change-resistant. They are missing
out on advancements in big data and can fall behind
their competition.
DATA WAREHOUSE ONLY?
I’m lonely
7.
• Hadoop is changing the game in the enterprise
data landscape. Its major strengths include:
> Economical
> Able to process extremely large data sets
> Extremely flexible storage and processing
> Open, free, active community development
HADOOP STRENGTHS
8.
• Appliance Solution
> Purpose-built integrated hardware/software solution
> Optimized hardware for Hadoop, software, storage, and
networking in a single rack
> Delivered ready to run at a competitive price point
• Enterprise Ready
> 100% open-source Hadoop via Hortonworks HDP
> Integrated with Teradata Unified Data Architecture on 40 Gb/s
InfiniBand BYNET V5 for performance and reliability
> Support for major ETL tools, enhanced security, and
metadata management
> Management tools for monitoring system health
• Benefits
> Lowest TCO and fastest time to value
> Fully engineered and supported by Teradata
TERADATA APPLIANCE FOR HADOOP
9.
• Many companies are so eager to jump onto the Hadoop wave
that they think they can run their entire datacenter on
Hadoop.
• It's free, it has lots of development effort put into it, it's
flexible. Why go the “old way” with an EDW?
• These companies are using Hadoop beyond its design and
maturity level, and may run into technical problems
meeting requirements.
HADOOP ONLY?
I’m lonely
10.
CONCLUSIONS — TWO TCOD EXAMPLES
1. TCOD is NOT platform cost – it is total project cost
2. Each technology has large advantages in its sweet spot(s)
3. Neither platform is cost effective in the other’s sweet spot
4. Biggest differences for the data warehouse are the development of:
> Complex queries
> Analytics
Source: WinterCorp - Full report at www.wintercorp.com/tcod-report
[Figure: two bar charts of total system cost in millions of dollars, each comparing "On Hadoop" with "On Data Warehouse" (axes $0 to $35 million and $0 to $800 million). Panel captions: "Data Refining: Hadoop wins (also: Landing Zone, Archive)" and "EDW: Data W/H Platform Wins". Cost categories: System and Data Admin, Application Development, ETL, Complex Queries, Analysis.]
11.
• These two platforms are complementary!
• Successful enterprise datacenters merge the strengths
of both platforms.
EDW VS. HADOOP
13.
Insurance Use Case
Impact
• Quickly analyze data for informed decisions and ad hoc reporting
• Streamlined process to calculate vehicle and fleet scores
• Cost effectively quantify, adjust and manage risk premiums
Situation
A large diversified customer needed to accurately calculate scores and adjust risk
premiums for its enterprise fleets based on vehicle data, driver behavior, GPS data,
weather data, traffic and DW data. Its current custom-developed applications limit the
effectiveness of these scores.
Problem
The customer lacks the infrastructure and systems to handle the huge volumes of real-time data.
It has no ad hoc reporting systems to combine, enrich and analyze the data. Limited storage
capacity restricts the amount of data that can be captured, refined and stored.
Solution
Used Teradata Big Analytics Appliance to design a platform that streamlines the ingestion
process for telematics data from multiple sources, data types, structures, and frequencies,
and combines it with other data sources to perform meaningful analytics.
14.
HADOOP
Teradata INTEGRATED DATA WAREHOUSE
• The Data Warehouse and Hadoop run different workloads
on different data sets.
SPLIT WORKLOADS
Big Data
Operational Data
15.
• It is not economical to put gigantic, “value sparse” data
sets on an enterprise data warehouse.
• Hadoop was not built to be an accessible, highly concurrent
transactional database.
• The easiest natural architecture is to split up the two
platforms based on the data set and workload.
> Teradata handles the operational business data and queries
> Hadoop handles the cost prohibitive “big data” sets, such as
web, machine, social data
SPLIT WORKLOADS
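The split described on this slide can be pictured as a simple placement rule. The sketch below is illustrative only: the `Dataset` fields, the thresholds, and the `choose_platform` helper are assumptions made up for this example and are not part of any Teradata or Hadoop API.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_tb: float          # raw size in terabytes
    value_dense: bool       # True for curated operational business data
    concurrent_users: int   # expected number of simultaneous query users


def choose_platform(ds: Dataset, big_data_tb: float = 100.0) -> str:
    """Return 'teradata' or 'hadoop' using hypothetical rule-of-thumb thresholds."""
    # Highly concurrent, value-dense operational data stays on the warehouse.
    if ds.value_dense and ds.concurrent_users >= 100:
        return "teradata"
    # Very large or value-sparse sets (web, machine, social) go to Hadoop.
    if ds.size_tb >= big_data_tb or not ds.value_dense:
        return "hadoop"
    return "teradata"


orders = Dataset("orders", size_tb=5, value_dense=True, concurrent_users=2000)
clicks = Dataset("clickstream", size_tb=800, value_dense=False, concurrent_users=10)
print(choose_platform(orders))   # teradata
print(choose_platform(clicks))   # hadoop
```

A real deployment would weigh many more factors (cost per terabyte, SLAs, security requirements); the point is only that placement follows the data set and its workload.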
16.
• Both systems operate favorably on cost and performance
with respect to their given workloads.
• The business can analyze new data and gain new insights
that their existing platform couldn't handle before.
SPLIT WORKLOADS — BUSINESS VALUE
17.
LARGE COMPUTER MANUFACTURER
Analysis of Customer Web Interactions
Capture, Refine, Store ClickStream Data
Impact
• Reduced data inconsistencies and improved performance
• Capture and curate ALL the data and prepare for analysis
• Perform ad hoc analytics on multi-level interactions
• Improves the marketing campaigns and the customer support process
Situation
Customers interact with the public websites of a large PC vendor for various purposes, resulting in
huge volumes of raw Omniture data. Because of its nature, the data structure and format are not always
consistent, and because of the volumes, processing the data is difficult.
Problem
Inconsistencies such as file errors and corrupted file compression in the raw Omniture data make the
capturing and analysis process error-prone. The volume and velocity (70 files/hr, 1M files) add to the
complexity.
Solution
A Teradata Big Analytics solution provides a landing and staging area for incoming data at high
velocity. Hadoop nodes curate the data, check for data consistency, and prepare the data for
consumption by higher-end analytic platforms.
18.
HADOOP Teradata
TERADATA
PLATFORM FAMILY
• Hadoop can be used as a staging and ETL preprocessing
layer for the Data Warehouse.
ETL SYSTEM ARCHITECTURE
Source Data Transformed Data
19.
• The Data Warehouse is busy with operational queries.
We can reduce the workload on the DW by
migrating some ETL to Hadoop.
• ETL processing is a write once step, which fits
Hadoop's architecture.
• Hadoop can inexpensively retain the raw source
data for data lineage purposes.
*Note that there are many cases where this migration doesn't make sense,
such as when it's necessary to do referential integrity checks. The DW is
capable of handling its ETL if necessary.
ETL SYSTEM ARCHITECTURE
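As a rough illustration of the preprocessing step described above, the following sketch refines raw delimited records while leaving the raw input untouched for lineage. It is plain Python standing in for a Hadoop job; the `refine` function, the record layout, and the field count are assumptions made for this example.

```python
import csv
import io


def refine(raw_text: str, expected_fields: int):
    """Split raw delimited records into clean rows and rejected line numbers.

    Every clean row carries the line number of its raw source record, so the
    untouched raw file can stay in inexpensive storage as the lineage reference.
    """
    clean, rejects = [], []
    reader = csv.reader(io.StringIO(raw_text))
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != expected_fields:
            rejects.append(line_no)   # malformed record, kept only in raw form
            continue
        clean.append({"source_line": line_no,
                      "fields": [f.strip() for f in fields]})
    return clean, rejects


raw = "1001, widget ,4\nbroken-record\n1002,gadget,7\n"
clean, rejects = refine(raw, expected_fields=3)
print(len(clean), rejects)   # 2 [2]
```

Only the clean rows would be loaded into the warehouse. Checks that need warehouse data, such as referential integrity, are exactly the cases the note above says may be better left on the DW.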
20.
HADOOP TERADATA
PLATFORM FAMILY
• Command line interface for Hadoop / TD data transfer
• Batch MapReduce jobs
• Bidirectional
• Run on the Hadoop side
TERADATA CONNECTOR
FOR HADOOP (TDCH)
TDCH
21.
hadoop jar /home/jo845b/teradata-connector-1.0.10/lib/teradata-connector-1.0.10.jar \
  com.teradata.hadoop.tool.TeradataExportTool \
  -libjars $LIB_JARS \
  -classname com.teradata.jdbc.TeraDriver \
  -url jdbc:teradata://terarps.ca.boeing.com/DATABASE=SQLH_TEST \
  -username jo845b \
  -password Teradata14 \
  -jobtype hcat \
  -fileformat rcfile \
  -method internal.fastload \
  -sourcedatabase default \
  -sourcetable ontime_sqoop \
  -targettable ontime_sqoop \
  -usexviews true
• Many options are available to fine-tune data transfer
between Teradata and Hadoop
TERADATA CONNECTOR FOR HADOOP
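Because TDCH is driven entirely by command-line options, a job definition can be kept as data and turned into a command when needed. The sketch below only assembles and prints the argument list; it uses just the flags that appear in the command on this slide. The `build_tdch_export` helper and the `TD_PASSWORD` environment variable are hypothetical conventions for this example, not part of TDCH.

```python
import os
import shlex


def build_tdch_export(jar_path: str, options: dict) -> list:
    """Assemble the argument list for a TeradataExportTool run.

    Options are emitted in insertion order, so -libjars stays first as in
    the command shown on the slide.
    """
    args = ["hadoop", "jar", jar_path,
            "com.teradata.hadoop.tool.TeradataExportTool"]
    for flag, value in options.items():
        args += ["-" + flag, str(value)]
    return args


options = {
    "libjars": os.environ.get("LIB_JARS", ""),
    "classname": "com.teradata.jdbc.TeraDriver",
    "url": "jdbc:teradata://terarps.ca.boeing.com/DATABASE=SQLH_TEST",
    "username": "jo845b",
    # Assumption: read the password from the environment instead of inlining it.
    "password": os.environ.get("TD_PASSWORD", ""),
    "jobtype": "hcat",
    "fileformat": "rcfile",
    "method": "internal.fastload",
    "sourcedatabase": "default",
    "sourcetable": "ontime_sqoop",
    "targettable": "ontime_sqoop",
    "usexviews": "true",
}
cmd = build_tdch_export(
    "/home/jo845b/teradata-connector-1.0.10/lib/teradata-connector-1.0.10.jar",
    options,
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` on a Hadoop edge node; that step is left out here because it needs a live cluster.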
22.
• Hadoop frees up the Data Warehouse's limited storage and
processing resources, saving the business time and money.
• Data can now be kept in its raw form, adding new data
lineage capabilities to the data center.
ETL SYSTEM ARCHITECTURE — BUSINESS VALUE
23.
BANKING USE CASE
Impact
• Analyze multi-structured data types
• Keep data confidential to those with access rights
• SQL users have easy access to big data sources
Situation
A large national bank needed to securely and inexpensively store and analyze raw
financial data in varied nonrelational formats. The data needs strict access privileges and
should be generally accessible to SQL users in some way.
Problem
Current infrastructure is not flexible enough to handle the expected variations in data
formats and processing algorithms. Security requirements are too strict for vanilla
Hadoop.
Solution
Use Teradata Big Analytics Appliance to ingest and store the data. Data is accessed by
analysts through an access layer with the data warehouse, and power users manipulate
the data on the Hadoop system directly.
24.
HADOOP TERADATA
PLATFORM FAMILY
Sub-queries
Data
Queries
SECURE ACCESS ARCHITECTURE
• Teradata can be used as an access layer to the data
stored in Hadoop.
25.
• Data in Hadoop can be accessed by data
warehouse users with no knowledge of the
inner workings of Hadoop.
• The full Teradata SQL library is now available to
Hadoop users
• Teradata can be used as a secure gateway to
limit the authentication gap in Hadoop without
needing Kerberos.
SECURE ACCESS ARCHITECTURE
26.
HADOOP TERADATA
PLATFORM FAMILY
Query Grid
Data
TERADATA QUERY GRID:
TERADATA DATABASE TO HADOOP
• Direct data transfer from the Hadoop Distributed Filesystem
• Hadoop data referenced in normal SQL queries
• Transfers occur in a high speed, parallel, scalable fashion
• Data can be processed on the fly or stored long-term
27.
CREATE VIEW TOM AS (
  SELECT * FROM load_from_hcatalog(
    USING
      server('sdll4364.labs.teradata.com')
      port('9083')
      username('hive')
      dbname('vim')
      templeton_port('1880')
  ));
• Many options are available to fine-tune data transfer
between Teradata and Hadoop
• Access rights on the view can limit users' access to other
data sets.
TERADATA QUERY GRID
28.
• Businesses can leverage the much more widespread
SQL and EDW user community instead of the small,
expensive Hadoop expert community. This saves the
business money.
• Data can be stored inexpensively, securely, and
accessibly at the same time.
SECURE ACCESS ARCHITECTURE —
BUSINESS VALUE
29.
PHARMACY USE CASE
Impact
• Reduced storage costs for data variety
• Perform ad hoc analytics on the multiple versions of data
• Retrieve data in minutes (vs. days with tape archives)
• Reduced load and improved performance of DW/Databases
Situation
High-performance storage is expensive. A large integrated pharmacy HC provider deals
with a variety of data with different business value. Not all data can be stored on the same
system. Ever-expanding data is only adding to this challenge.
Problem
Data in long-term storage cannot be queried, and retrieval takes a long time. No analysis
can be performed on the archived data, so the business loses out on the value of this data.
Solution
Used Teradata Hadoop nodes to store all the data coming in from weblogs, medical
data, JSON files. Hadoop also serves as an enrichment layer to enhance data for high-end
analytics consumption. The complete solution provides easy movement of data among
Hadoop, Aster and Teradata.
30.
HADOOP TERADATA
PLATFORM FAMILY
ACTIVE ARCHIVE
• Hadoop can be used to store the data warehouse's
cold data, historical data, and regular backups.
Backups
Historical Data
31.
• Using Hadoop as an active archive allows database users to
access cold or historical data on the fly, unlike tape
archives.
• Hadoop data can be accessed in the EDW using Teradata
QueryGrid: Teradata-Hadoop.
• The data is no longer stored in the data warehouse,
freeing valuable space. Hadoop is a less expensive
platform to store this data on.
ACTIVE ARCHIVE
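One way to picture the archive decision is an age-based rule that keeps recently used data in the warehouse and moves the rest to Hadoop. The sketch below is a minimal illustration; the partition names, the last-access dates, the one-year threshold, and the `split_hot_cold` helper are all assumptions made for this example.

```python
from datetime import date


def split_hot_cold(partitions: dict, today: date, hot_days: int = 365):
    """Map of partition name -> last access date, split into (hot, cold) lists."""
    hot, cold = [], []
    for name, last_access in sorted(partitions.items()):
        age_days = (today - last_access).days
        # Partitions untouched for longer than hot_days are archive candidates.
        (cold if age_days > hot_days else hot).append(name)
    return hot, cold


partitions = {
    "sales_2011": date(2012, 1, 15),
    "sales_2013": date(2014, 5, 30),
    "sales_2014": date(2014, 6, 16),
}
hot, cold = split_hot_cold(partitions, today=date(2014, 6, 17))
print("keep in warehouse:", hot)        # ['sales_2013', 'sales_2014']
print("move to Hadoop archive:", cold)  # ['sales_2011']
```

Unlike a move to tape, the cold partitions remain queryable after the move, which is what makes the archive "active".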
32.
• Storing data on Hadoop frees up cold data storage space
on the relatively expensive data warehouse, saving the
business money.
• Unlike with tape, businesses can still analyze and
access their data on Hadoop. This saves time and effort.
ACTIVE ARCHIVE — BUSINESS VALUE
33.
• A successful DW / Hadoop coexistence system will see
varying uses of all four of these mechanisms concurrently.
• Replacing existing infrastructures with Hadoop is not a
feasible goal.
• In order to get Hadoop's foot in the door with large
established enterprises, we need to push Hadoop as an
integrated solution in tandem with a DW.
CONCLUDING REMARKS
PUSHING HADOOP FURTHER