Big data migration testing for transferring relational database management system (RDBMS) files is a time-consuming, compute-intensive task. We offer a hands-on, detailed framework for data validation in an open source (Hadoop) environment, incorporating Amazon Web Services (AWS) for cloud capacity, S3 (Simple Storage Service) and EMR (Elastic MapReduce), Hive tables, the Sqoop tool, PIG scripting and Jenkins slave machines.
From Relational Database Management to Big Data: Solutions for Data Migration Testing
A successful approach to big data migration testing requires end-to-end automation and swift verification of huge volumes of data to produce quick and lasting results.
Executive Summary

Large enterprises face numerous challenges in connecting multiple CRM applications and data warehouse systems to reach end users across the multitude of products they offer. When their data is spread across disparate systems, these enterprises cannot:

• Conduct sophisticated analytics that substantially improve business decision-making.
• Offer better search and data sharing.
• Gain a holistic view of a single individual across multiple identities; customers may have multiple accounts due to multiple locations or devices, such as company or Facebook IDs.
• Unlock the power of data science to create reports using tools of their choice.

In such situations, companies lose the ability to understand customers. Overcoming these obstacles is critical to gaining the insights needed to customize user experience and personalize interactions. By applying Code Halo™¹ thinking – and distilling insights from the swirl of data that surrounds people, processes, organizations and devices – companies of all shapes and sizes and across all sectors can gain a deep understanding of customers. Such insight will reveal what customers are buying, doing, saying, thinking and feeling, as well as what they need.

But this requires capturing and analyzing huge pools of interactional and transactional data. Capturing such large data sets, however, has created a double-edged sword for many companies. On the plus side, it affords companies the opportunity to make meaning from Code Halo intersections; the downside is figuring out how and where to store all this data.

Enter Hadoop, the de facto open source standard that is increasingly being used by many companies in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop, the biggest challenge is data validation from source to Hadoop. In fact, according to a report published by IDG Enterprise, "70% of enterprises have either deployed or are planning to deploy big data projects and programs this year."²
With the huge amounts of data migrated to Hadoop and other big data platforms, the challenge of data quality emerges. The simple, widely used solution is manual validation, but it is cumbersome: it does not scale, may not add significant value for customers, and impacts project schedules – especially as testing cycle times get squeezed.

This white paper posits a solution: a framework that can be adopted across industries to perform effective big data migration testing with all open-source tools.
Challenges in RDBMS to Big Data Migration Testing
Big data migration typically involves multiple source systems and large volumes of data. However, most organizations lack the open-source tools to handle this important task. The right tool should be quick to set up and offer multiple customization options. Migration generally happens in entity batches: a set of entities is selected, migrated and tested, and this cycle repeats until all application data is migrated.

An easily scalable solution can shorten consecutive testing cycles, and even minimal human intervention can hinder testing efforts. Another challenge is defining effective scenarios for each entity. Performing 100% field-to-field validation of data is ideal, but when the data volume is in petabytes, test execution duration increases tremendously. A proper sampling method should be adopted, and solid data transformation rules should be considered in testing.
Big Data Migration Process
Hadoop as a service is offered by Amazon Web Services (AWS), a cloud computing solution that abstracts away the operational challenges of running Hadoop and makes medium- and large-scale data processing accessible, easy, fast and inexpensive. The typical services used include Amazon S3 (Simple Storage Service) and Amazon EMR (Elastic MapReduce). Also preferred is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.

The migration to the AWS Hadoop environment is a three-step process:

• Cloud service: Virtual or physical machines connect to the source databases and extract their tables using Sqoop, which pushes them to Amazon S3.
• Cloud storage: Amazon S3 cloud storage holds all the data sent by the virtual machines, storing it in flat-file format.
• Data processing: Amazon EMR processes and distributes vast amounts of data using Hadoop. The data is pulled from S3 and stored as Hive tables (see Glossary).
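As a rough illustration of the extraction step, a Sqoop import might look like the sketch below. The JDBC URL, credentials, table name and S3 bucket are hypothetical placeholders, and writing directly to an s3:// path assumes an EMR-style setup.

    #!/bin/sh
    # Hypothetical sketch: pull one RDBMS table with Sqoop and land it in S3
    # as comma-delimited flat files. All connection details are placeholders.
    sqoop import \
      --connect "jdbc:oracle:thin:@//source-db.example.com:1521/ORCL" \
      --username qa_user \
      --password "$DB_PASS" \
      --table S_CONTACT \
      --target-dir "s3://example-migration-bucket/landing/S_CONTACT" \
      --fields-terminated-by ',' \
      --num-mappers 4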
RDBMS to Big Data Migration Testing Solution
Step 1: Define Scenarios
To test migrated data, a one-to-one comparison of all the entities is required. Since big data volumes are (as the term suggests) huge, three test scenarios are performed for each entity:

• Count reconciliation for all rows.
• Find missing primary keys for all rows.
• Compare field-to-field data for sample records.

These steps are required to, first, verify that the record counts in the source and target DBs match; second, to ensure that all records from the source systems flow to the target DB, which is verified by checking the primary keys in the source and target systems for all records; and third, and most important, to compare the source and target databases across all columns for sample records. The third check ensures that data is not corrupted, date formats are maintained and data is not truncated. The number of records for sample testing can be decided according to the data volume; basic data corruption can be identified by testing 100 sample records.
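On the Hive side, these three scenarios reduce to a handful of queries whose output is later reconciled against the source DB. A minimal sketch, assuming a migrated table named s_contact with primary key column row_id (both placeholders):

    #!/bin/sh
    # Count reconciliation: total rows in the migrated Hive table.
    hive -e 'SELECT COUNT(*) FROM s_contact;' > s_contact_count.csv
    # Missing primary keys: every ROW_ID, for comparison with the source system.
    hive -e 'SELECT row_id FROM s_contact;' > s_contact_rowids.csv
    # Field-to-field sample: first 100 records, all columns.
    hive -e 'SELECT * FROM s_contact LIMIT 100;' > s_contact_sample.csv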
Step 2: Choose the Appropriate Method of Testing
Per our analysis, we shortlisted two methods of testing:

• UNIX shell script and T-SQL-based reconciliation.
• PIG scripting.
Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 compares the two methods. Based on this comparison, we recommend the first approach, where full end-to-end automation is possible. If any transformations are present, they need to be performed in the staging layer, which can then be treated as the source in order to apply the same solution. According to this analysis, PIG scripting is more appropriate for testing migrations with complex transformation logic; for simple migrations, the PIG scripting approach is very time-consuming and resource-intensive.
Figure 1: Testing Approach Comparison

UNIX Shell Script and T-SQL-Based Reconciliation
• Prerequisites: Load target Hadoop data into the central QA server (SQL Server) as different entities and validate it against the source tables. Requires a SQL Server database to store tables and perform comparisons using SQL queries, plus a preconfigured linked server in the SQL Server DB to connect to all source databases.
• Efforts: Initial coding for five to 10 tables takes one week; consecutive additions take two days for ~10 tables.
• Automation/Manual: Full automation possible.
• Performance (on Windows XP, 3 GB RAM, 1 CPU): Delivers results quickly compared to other methods. For 15 tables averaging 100K records: ~30 minutes for counts, ~20 minutes for a 100-record sample, ~1 hour for missing primary keys.
• Highlights: Full automation and job scheduling possible; fast comparison; no permission or security issues when accessing big data on AWS.
• Low points: Initial framework setup is time-consuming.

PIG Scripting
• Prerequisites: Migrate data from the RDBMS to HDFS and compare QA HDFS files with development HDFS files using PIG scripting; flat files for each entity are created using the Sqoop tool.
• Efforts: Compares flat files; scripting is needed for each column in each table, so effort is proportionate to the number of tables and their columns.
• Automation/Manual: No automation possible.
• Performance (on Windows XP, 3 GB RAM, 1 CPU): Requires migrating the source tables to HDFS files as a prerequisite, which is time-consuming; the processing itself can be faster than other methods.
• Highlights: Offers a lot of flexibility in coding; very useful for more complex transformations.
• Low points: Greater effort for decoding, reporting results and handling script errors.
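To see why the PIG route requires scripting for each column, consider this minimal, hypothetical Pig Latin sketch that compares QA and development HDFS extracts of a two-column table. The paths and schema are placeholders, and a real script must also handle NULLs and list every additional column explicitly.

    -- Load the same entity from the dev and QA HDFS extracts.
    dev = LOAD 'hdfs:///dev/S_CONTACT' USING PigStorage(',')
          AS (row_id:chararray, name:chararray);
    qa  = LOAD 'hdfs:///qa/S_CONTACT' USING PigStorage(',')
          AS (row_id:chararray, name:chararray);
    -- Pair records by primary key; FULL OUTER keeps rows missing on either side.
    joined = JOIN dev BY row_id FULL OUTER, qa BY row_id;
    -- Flag missing rows and per-column mismatches (repeated for every column).
    diffs = FILTER joined BY dev::row_id IS NULL OR qa::row_id IS NULL
                         OR dev::name != qa::name;
    STORE diffs INTO 'hdfs:///qa/results/S_CONTACT_diffs' USING PigStorage(',');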
Step 3: Build a Framework
Here we bring data from Hadoop into a SQL Server consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.
• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files; the LOAD DATA INPATH command then loads these files into the table definitions in the Hive metastore. The HDFS files are stored in Amazon S3.
To fetch data from Hive into flat files:
>> Store the table list in a CSV file on a UNIX server.
>> Write a UNIX shell script that takes the table-list CSV file as input and generates another shell script to extract the Hive data into CSV files for each table.
»» This generated shell script is executed from the Windows batch file.
>> Because the extraction script is generated dynamically, only the table-list CSV file needs to be updated in each iteration/release when new tables are added. (A sketch of such a generator appears after this list; Figure 3 shows sample code from the final shell script.)
• WinSCP: The next step is to transfer the files from the Hadoop environment to the Windows server. The WinSCP batch command interface can be implemented for this: the WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command, and a simple GET command with the file name copies each file to the Windows server (covered in the batch sketch after this list).
Figure 2: High-Level Validation Approach. Source systems (SQL Server, Oracle, MySQL and other RDBMSs) are migrated to the Hadoop Distributed File System and Hive on the AWS Hadoop server using Sqoop (ETL). On the Jenkins slave machine, a Windows batch script triggers a dynamically generated Get Data shell script that extracts from the Hive tables a CSV file with each table's record count, a CSV file with the ROW_IDs from all tables, and a CSV file with the first 100 records of all columns. WinSCP download commands copy the files from the UNIX server to the Windows server, and SQL batch files load their contents into QA tables on the QA DB server (SQL Server). Stored procedures then use a linked server to pull data from the various source DBs and (1) compute each source table's count and compare it with the Hive results; (2) pull ROW_IDs from all source tables and find missing/extra ones in the Hive results; and (3) pull source column data for the sample records pulled from Hive and compare the results, reporting any data mismatch.
Figure 3: Sample Code from Final Shell Script
• SQL Server database usage: SQL Server is the main database used for loading Hive data and storing the final reconciliation results. A SQL script is created to load data from the .CSV files into database tables using the "Bulk Insert" command (see the T-SQL sketch after this list).
• Windows batch command: The above-mentioned steps – transferring the data in .CSV files, importing the files into SQL Server and validating the source and target data – must run sequentially, and the whole validation process can be automated with a Windows batch file. The batch file executes the shell script on the Hadoop environment from the Windows server using the Plink command; in this way, all Hive data is loaded into the SQL Server tables. The next step is to execute the SQL Server procedures that perform the count/primary key/sample data comparisons; we use SQLCMD to run them from the batch file (a batch sketch appears after this list).
• Jenkins: The end-to-end validation process can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly/daily/weekly basis without manual intervention. On Jenkins, an auto-scheduled ANT script invokes a Java program that connects to SQL Server and generates an HTML report of the latest results; Jenkins jobs can then e-mail the report to predefined recipients in HTML format.
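The following is a minimal, hypothetical sketch of the generator idea behind the UNIX shell scripting step (the paper's actual version appears in Figure 3). The file names are placeholders, and only the count extract is shown; the ROW_ID and sample extracts follow the same pattern.

    #!/bin/sh
    # Read the table-list CSV and emit a 'Get Data' script that extracts
    # a row count per table; similar lines would be emitted for ROW_IDs
    # and 100-record samples.
    OUT=get_hive_data.sh
    echo '#!/bin/sh' > "$OUT"
    while IFS=, read -r TABLE
    do
      echo "hive -e 'SELECT COUNT(*) FROM $TABLE;' > /home/qa/extracts/${TABLE}_count.csv" >> "$OUT"
    done < hive_table_list.csv
    chmod +x "$OUT"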
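On the SQL Server side, the load-and-compare idea can be sketched as follows. This is an illustration rather than the paper's actual stored procedures: the staging table, file path and linked-server name are hypothetical, and the source is assumed to be reachable through the preconfigured linked server.

    -- Hypothetical staging table for one Hive count extract.
    CREATE TABLE dbo.HIVE_S_CONTACT_COUNT (cnt BIGINT);

    -- Load the downloaded file; 0x0a (the char(10) linefeed) is the row
    -- terminator needed for files produced on a UNIX server.
    BULK INSERT dbo.HIVE_S_CONTACT_COUNT
    FROM 'C:\qa\data\s_contact_count.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');

    -- Count reconciliation: compare the Hive count against the source count
    -- pulled over the linked server.
    SELECT src.cnt AS source_row_cnt,
           hive.cnt AS hive_row_cnt,
           src.cnt - hive.cnt AS diff,
           CASE WHEN src.cnt = hive.cnt THEN 'PASS' ELSE 'FAIL' END AS status
    FROM (SELECT COUNT_BIG(*) AS cnt
          FROM [SRC_LINKED].[SIEBELPRD].[dbo].[S_CONTACT]) AS src
    CROSS JOIN dbo.HIVE_S_CONTACT_COUNT AS hive;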
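Finally, a hypothetical sketch of the orchestrating Windows batch file; the host names, credentials, paths and procedure names are placeholders, and the inline password is shown only for brevity.

    @echo off
    rem 1. Run the generated Get Data script on the Hadoop environment via Plink.
    plink -ssh qa_user@hadoop-edge.example.com -pw %HADOOP_PW% "sh /home/qa/get_hive_data.sh"

    rem 2. Build a WinSCP script file (.SCP) and download the extracts over SFTP.
    echo open sftp://qa_user:%HADOOP_PW%@hadoop-edge.example.com/ > download.scp
    echo get /home/qa/extracts/*.csv C:\qa\data\ >> download.scp
    echo exit >> download.scp
    winscp.com /script=download.scp

    rem 3. Load the files and run the comparison procedures via SQLCMD.
    sqlcmd -S QASQLSERVER -d QA_RECON -E -i C:\qa\sql\load_hive_files.sql
    sqlcmd -S QASQLSERVER -d QA_RECON -E -Q "EXEC dbo.usp_count_reconciliation"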
Figure 4: Results of Count Reconciliation for Migrated Hive Tables (from the generated Web report)
SIEBEL HIVE COUNT CLUSTER RECON SUMMARY
AUD_ID: 153 | EXECUTION_DATE: 2015-04-14 21:55:27.787 | SCHEMA_NAME: SIEBELPRD | TARGET_DB: HIVE | TOTAL_PASS: 60 | TOTAL_FAIL: 94 | ENV: PRD

SIEBEL HIVE COUNT CLUSTER RECON DETAIL (all rows: AUD_ID 153, EXEC_DATE 2015-04-14 21:55:27.787)
AUD_SEQ | SOURCE_TAB_NAME | SOURCE_ROW_CNT | HIVE_TAB_NAME | HIVE_ROW_CNT | DIFF | PERCENTAGE_DIFF | STATUS
1 | S_ADDR_PER | 353420 | S_ADDR_PER | 343944 | 9476 | 2.68 | FAIL
2 | S_PARTY | 2730468 | S_PARTY | 2730468 | 0 | 0 | PASS
3 | S_ORG_GROUP | 16852 | S_ORG_GROUP | 16852 | 0 | 0 | PASS
4 | S_LST_OF_VAL | 29624 | S_LST_OF_VAL | 29624 | 0 | 0 | PASS
5 | S_GROUP_CONTACT | 413912 | S_GROUP_CONTACT | 413912 | 0 | 0 | PASS
6 | S_CONTACT | 1257758 | S_CONTACT | 1257758 | 0 | 0 | PASS
7 | S_CON_ADDR | 6220 | S_CON_ADDR | 6220 | 0 | 0 | PASS
8 | S_CIF_CON_MAP | 28925 | S_CIF_CON_MAP | 28925 | 0 | 0 | PASS
9 | S_ADDR_PER | 93857 | S_ADDR_PER | 93857 | 0 | 0 | PASS
10 | S_PROD_LN | 1114 | S_PROD_LN | 1106 | 8 | 0.72 | FAIL
11 | S_ASSET_REL | 696178 | S_ASSET_REL | 690958 | 5220 | 0.75 | FAIL
12 | S_AGREE_ITM_REL | 925139 | S_AGREE_ITM_REL | 917657 | 7482 | 0.81 | FAIL
13 | S_REVN | 131111 | S_REVN | 128949 | 2162 | 1.65 | FAIL
14 | S_ENTLMNT | 127511 | S_ENTLMNT | 125144 | 2367 | 1.86 | FAIL
15 | S_ASSET_XA | 5577029 | S_ASSET_XA | 5457724 | 119305 | 2.14 | FAIL
16 | S_BU | 481 | S_BU | 470 | 11 | 2.29 | FAIL
17 | S_ORG_EXT | 345276 | S_ORG_EXT | 336064 | 9212 | 2.67 | FAIL
18 | S_ORG_BU | 345670 | S_ORG_BU | 336424 | 9246 | 2.67 | FAIL
Implementation Issues and Resolutions

Organizations may face a number of implementation issues. Figure 5 lists common issues and their probable resolutions.

Impact of Testing

Figure 6 compares manual Excel-based testing with our framework for one customer's CRM application, which is based on Oracle and SQL Server databases.
Looking Forward

More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations conduct big data migration testing more quickly, efficiently and accurately. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper.

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix.
• Create an environment that can accommodate huge data sets. Cloud setups are recommended.
• Be aware of the Agile/Scrum cadence mismatch. Break data into smaller incremental blocks as a work-around.
• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.
Figure 5: Overcoming Test Migration Challenges

1. Issue: Column data-type mismatch errors while loading .CSV files and Hive data into the SQL Server table.
   Resolution: Create tables in SQL Server by matching the Hive table data types.
2. Issue: No FTP access to the Hadoop database for transferring files.
   Resolution: Use WinSCP software.
3. Issue: Column sequence mismatch between Hive tables and source tables, which causes failures when loading the .CSV files into the Hive_* tables.
   Resolution: Create tables in SQL Server for target entities by matching the Hive table column order.
4. Issue: Inability to load .CSV files due to an end-of-file issue in SQL Server bulk insert.
   Resolution: Update the SQL statement with the appropriate row terminator – the char(10) linefeed – which allows import of .CSV files from a UNIX server.
5. Issue: Performance issues on primary key validations.
   Resolution: Tune the SQL Server stored procedures, increase temp DB space, etc.
6. Issue: Handling commas in column values.
   Resolution: Create a TSV file so commas cause no issues while data is loading; remove NULL/null values from the TSV and generate a .txt file; finally, convert from UTF-8 to UTF-16 and generate an XLS file, which can be loaded into the SQL Server database.
Figure 6: The Advantages of a Big Data Migration Testing Framework*

Scenario            | Manual (mins.) | Framework (mins.) | Gain (mins.) | % Gain
Count               | 20             | 2                 | 18           | 90.00%
Sample 100 Records  | 100            | 1.3               | 98.7         | 98.70%
Missing Primary Key | 40             | 4                 | 36           | 90.00%

* Effort calculated for one table with around 500K records, including summary report generation.
Glossary
• AWS: Amazon Web Services is a collection of remote computing services, also called Web services, that make up a cloud computing platform from Amazon.com.
• Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-effectively process vast amounts of data.
• Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable, highly scalable object storage.
• Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
• Hive: Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon EMR on AWS.
• Jenkins: Jenkins is an open-source continuous-integration tool written in Java. Jenkins provides continuous integration services for software development.
• PIG scripting: PIG is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to that of SQL for RDBMS systems.
• RDBMS: A relational database management system is a database management system (DBMS) based on the relational model invented by E.F. Codd of IBM's San Jose Research Laboratory.
• Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into HDFS. This process is briefly called extract, transform and load (ETL).
• WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality.
• UNIX shell scripting: A shell script is a computer program designed to be run by the UNIX shell, a command-line interpreter.
• T-SQL: Transact-SQL is Microsoft's and Sybase's proprietary extension to Structured Query Language (SQL).