Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
Testing Big Data: Automated ETL Testing of Hadoop
1. Webinar
Testing Big Data:
Automated ETL Testing of Hadoop
Laura Poggi
Marketing Manager
RTTS
Bill Hayduk
CEO/President
RTTS
Jeff Bocarsly, Ph.D.
Chief Architect
RTTS
built by
2. Today’s Agenda
AGENDA
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop
Topic: Testing Big Data: Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
3. About RTTS: the FACTS
Founded: 1996
Primary Focus: consulting services, software
Locations: New York, Atlanta, Philly, Phoenix
Geographic region: North America
Customer profile: Fortune 1000, > 600 clients
Software: RTTS is the leading provider of software quality for critical business systems
4. Facebook handles 300 million photos a day and
about 105 terabytes of data every 30 minutes.
- TechCrunch
The big data market will grow from $3.2 billion in
2010 to $32.4 billion in 2017.
- Research Firm IDC
65% of…advanced analytics will have Hadoop
embedded (in them) by 2015.
- Gartner
5. about Big Data
Big Data – data with too much
volume, velocity and variety to be
handled by conventional database
architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
6. What is Hadoop?
Hadoop is an open source project that
develops software for scalable, distributed computing.
• Hadoop enables distributed processing of large data sets across
clusters of computers using simple programming models.
• It easily deals with complexities of high volume, velocity and variety
of data.
• It scales up from single servers to 1,000’s of machines, each offering local
computation and storage.
• It detects and handles failures at the application layer.
7. Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
8. Basic Hadoop Architecture
MapReduce – the processing layer that manages
the programming jobs (a.k.a. Task Tracker).
HDFS (Hadoop Distributed File System) –
stores data on the machines (a.k.a. Data Node).
[diagram: a single machine running both a MapReduce Task Tracker and an HDFS Data Node]
9. Basic Hadoop Architecture (continued)
Cluster
Add more machines for scaling – from 1 to 100 to 1,000.
Job Tracker – accepts jobs, assigns tasks, identifies failed machines.
Name Node – coordination for HDFS; inserts and extractions are communicated through
the Name Node.
[diagram: a cluster of machines, each running a Task Tracker and a Data Node, coordinated by the Job Tracker and the Name Node]
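The MapReduce model behind this architecture can be illustrated with a minimal, single-machine word-count sketch in Python. This is purely illustrative: a real Hadoop job would run the map and reduce phases in parallel across the Task Trackers, with the shuffle/sort handled by the framework.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Toy input standing in for lines stored across HDFS blocks
lines = ["big data needs big tools", "hadoop handles big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["big"])  # "big" appears three times across the two lines
```

The same two functions scale conceptually to thousands of machines because each map call and each per-key reduce is independent of the others.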
10. Apache Hive
Apache Hive - a data warehouse infrastructure built on top
of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language
called HiveQL that interacts with the HDFS files.
[diagram: HiveQL statements – create, insert, update, delete, select – are compiled into MapReduce jobs (Task Tracker) that read and write HDFS files (Data Node)]
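The HiveQL flow can be sketched as below. To keep the snippet self-contained and runnable, SQLite stands in for Hive (table and column names are illustrative); against a real cluster, the same SQL-style statements would be submitted to Hive, which compiles them into MapReduce jobs over the HDFS files.

```python
import sqlite3

# SQLite stands in for Hive here so the example runs locally.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE and INSERT statements, as one would issue in HiveQL
cur.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
cur.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)

# A SELECT with aggregation – the kind of summarization query
# Hive would translate into a MapReduce job
cur.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
)
print(cur.fetchall())  # [('/home', 2), ('/pricing', 1)]
```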
12. about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather
than for transaction processing
• a place where historical data is stored for archival, analysis and security
purposes
• contains either raw data or formatted data
• combines data from multiple sources:
  • sales
  • salaries
  • operational data
  • human resource data
  • inventory data
  • web logs
  • social networks
  • Internet text and docs
  • other
13. Data Warehousing: the ETL process
ETL = Extract, Transform, Load
Why ETL?
Need to load the data warehouse regularly (daily/weekly) so that it
can serve its purpose of facilitating business analysis.
Extract – data is pulled from one or more OLTP systems and copied into
the warehouse.
Transform – removing inconsistencies, assembling to a common
format, adding missing fields, summarizing detailed data and
deriving new fields to store calculated data.
Load – map the data and load it into the DWH.
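A minimal sketch of the three ETL steps in Python. The source rows, the cleanup rules and the derived field are all illustrative, and SQLite stands in for the target warehouse so the snippet stays runnable:

```python
import sqlite3

# Extract – rows as they might arrive from an OLTP source system
source_rows = [
    {"id": 1, "name": " Alice ", "unit_price": "10.50", "qty": "3"},
    {"id": 2, "name": "bob",     "unit_price": "4.00",  "qty": "2"},
]

# Transform – remove inconsistencies, assemble to a common format,
# and derive a new calculated field (the line total)
def transform(row):
    return (
        row["id"],
        row["name"].strip().title(),                 # common name format
        float(row["unit_price"]) * int(row["qty"]),  # derived field
    )

# Load – map the cleaned rows into the warehouse table
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE fact_sales (id INTEGER, customer TEXT, total REAL)")
dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [transform(r) for r in source_rows])

print(dwh.execute("SELECT customer, total FROM fact_sales").fetchall())
# [('Alice', 31.5), ('Bob', 8.0)]
```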
14. Data Warehouse – the ETL process
[diagram: source data (Legacy DB, CRM/ERP DB, Finance DB) → Extract → Transform → Load → target DWH]
16. DWH & Hadoop: A Use Case
USE CASE***
Use Hadoop as a landing zone for big data & raw data:
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to the EDWH
4) extract, transform and load (ETL) pertinent data into the EDWH
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
17. DWH & Hadoop: A Use Case
Use case data flow
[diagram: source data → ETL → Hadoop landing zone → ETL process → target DWH]
19. Testing Big Data: Entry Points
Recommended functional test strategy: test every entry point in the
system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points.
[diagram: test entry points at each stage – source data → ETL → Hadoop → ETL process → target DWH → BI]
20. Testing Big Data: 3 Big Issues
• We need to verify more data and to do it faster
• We need to automate the testing effort
• We need to be able to test across different platforms
We need a testing tool!
22. What is QuerySurge?
QuerySurge is the premier test tool built to automate
Data Warehouse testing and the ETL Testing Process.
23. What does QuerySurge™ do?
QuerySurge finds bad data
• Most firms test < 1% of their data
• BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data
• CEOs, CFOs, CTOs and other executives rely on BI apps to make strategic decisions
• Bad data will cause execs to make decisions that cost them millions of dollars
• QuerySurge tests up to 100% of your data quickly and finds bad data
24. QuerySurge Roles & Uses
Testers
- functional testing
- regression testing
ETL Developers
- unit testing
Data Analysts
- review, analyze data
- verify mappings and failures
Operations teams
- monitoring
26. QuerySurge™ Modules
Design Library
Create Query Pairs (source & target queries)
Scheduling
Build groups of Query Pairs
Schedule Test Runs
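The Query Pair idea – a source query and a target query whose result sets should match – can be sketched as below. This is only an illustration of the concept, not QuerySurge itself: QuerySurge automates this comparison across JDBC-connected systems, while here SQLite stands in for both the source and the target, and all table names are hypothetical.

```python
import sqlite3

def fetch(conn, sql):
    """Run one side of a Query Pair and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())

# Source system and target DWH (SQLite stands in for both)
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.0)])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
tgt.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(1, 9.99), (2, 25.5)])

# The Query Pair: the two result sets should agree row for row
source_rows = fetch(src, "SELECT id, amount FROM orders")
target_rows = fetch(tgt, "SELECT id, amount FROM fact_orders")

# A minus-query style diff localizes the bad rows on each side
only_in_source = set(source_rows) - set(target_rows)
only_in_target = set(target_rows) - set(source_rows)
print(only_in_source, only_in_target)  # {(2, 25.0)} {(2, 25.5)}
```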
27. QuerySurge™ Modules
Run Dashboard
View real-time execution
Analyze real-time results
Deep-Dive Reporting
Examine and automatically
email test results
28. the QuerySurge solution…
verifies more data – verifies upwards of 100% of all data quickly
automates the testing effort – the kickoff, the tests, the comparison, emailing the results
tests across different platforms – any JDBC-compliant db, DWH, DMart, flat file, XML, Hadoop
speeds up testing – up to 1,000 times faster than manual testing
29. QuerySurge Value-Add
QuerySurge provides value through either:
• an increase in testing data coverage from < 1% to upwards of 100%
• a reduction in testing time by as much as 1,000x
• a combination of both: an increase in test coverage while reducing testing time
30. Return on Investment (ROI)
• redeployment of head count because of an increase in coverage
• a savings over manual testing (minus queries, manual compares, other)
• an increase in data quality due to a shorter, more thorough testing
cycle, possibly saving millions of dollars by preventing key
decisions made on bad data
Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive parts of a data warehouse project. Many data warehousing projects use ETL tools to manage this process; other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Beyond extraction, transformation, and loading, other tasks are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
Web browsers: Internet Explorer, Chrome, Firefox and Safari. Operating systems: Windows & Linux.