From Relational Database Management to Big Data: Solutions for Data Migration Testing

A successful approach to big data migration testing requires end-to-end automation and swift verification of huge volumes of data to produce quick and lasting results.
Cognizant 20-20 Insights | September 2015

Executive Summary
Large enterprises face numerous challenges in connecting their multiple CRM applications and data warehouse systems so they can reach end users across the multitude of products they offer. When their disparate data is spread across multiple systems, these enterprises cannot:

• Conduct sophisticated analytics that substantially improve business decision-making.
• Offer better search and data sharing.
• Gain a holistic view of a single individual across multiple identities; customers may have multiple accounts due to multiple locations or devices, or multiple IDs such as company or Facebook IDs.
• Unlock the power of data science to create reports using tools of their choice.

In such situations, companies lose the ability to understand customers. Overcoming these obstacles is critical to gaining the insights needed to customize user experience and personalize interactions. By applying Code Halo™1 thinking – distilling insights from the swirl of data that surrounds people, processes, organizations and devices – companies of all shapes and sizes and across all sectors can gain a deep understanding of customers. Such insight reveals what customers are buying, doing, saying, thinking and feeling, as well as what they need.

But this requires capturing and analyzing huge pools of interactional and transactional data. Capturing such large data sets, however, has created a double-edged sword for many companies. On the plus side, it affords companies the opportunity to make meaning from Code Halo intersections; the downside is figuring out how and where to store all this data.

Enter Hadoop, the de facto open-source standard that is increasingly being used in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop, the biggest challenge is data validation from source to Hadoop. In fact, according to a report published by IDG Enterprise, “70% of enterprises have either deployed or are planning to deploy big data projects and programs this year.”2
With huge amounts of data being migrated to Hadoop and other big data platforms, the challenge of data quality emerges. The simple, widely used but cumbersome solution is manual validation. Manual validation, however, is not scalable, offers little added value to customers and puts pressure on project schedules, especially as testing cycle times get squeezed.

This white paper proposes a solution: a framework that can be adopted across industries to perform effective big data migration testing using only open-source tools.
Challenges in RDBMS to Big Data Migration Testing
Big data migration typically involves multiple source systems and large volumes of data. However, most organizations lack the open-source tools to handle this important task. The right tool should be quick to set up and offer multiple customization options. Migration generally happens in entity batches: a set of entities is selected, migrated and tested, and this cycle continues until all application data is migrated.

An easily scalable solution can shorten these consecutive testing cycles, whereas even small amounts of manual intervention can hold testing efforts back. Another challenge lies in defining effective scenarios for each entity. Performing 100% field-to-field validation of the data is ideal, but when the data volume is in petabytes, test execution duration increases tremendously. A proper sampling method should therefore be adopted, and solid data transformation rules should be factored into testing.
Big Data Migration Process
Hadoop as a service is offered by Amazon Web Services (AWS), a cloud computing solution that abstracts the operational challenges of running Hadoop and makes medium- and large-scale data processing accessible, easy, fast and inexpensive. The typical services used include Amazon S3 (Simple Storage Service) and Amazon EMR (Elastic MapReduce). Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service, is also a common choice.

The migration to the AWS Hadoop environment is a three-step process:

• Cloud service: Virtual or physical machines connect to the source databases and extract the tables using Sqoop, which pushes them to Amazon S3 (a sample import command is sketched after this list).
• Cloud storage: Amazon S3 stores all the data sent by the virtual machines, in flat file format.
• Data processing: Amazon EMR processes and distributes vast amounts of data using Hadoop. The data is pulled from S3 and stored as Hive tables (see Glossary).
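To make the cloud service step concrete, below is a minimal sketch of the kind of Sqoop import the development team might run for one entity; the JDBC connection string, credentials file, table name and S3 bucket are illustrative placeholders rather than values from the project described here.

# Extract one source table and push it to S3 as comma-delimited flat files (all names hypothetical).
sqoop import \
  --connect jdbc:oracle:thin:@//source-db-host:1521/CRMPROD \
  --username qa_reader \
  --password-file /user/qa/.dbpassword \
  --table S_CONTACT \
  --target-dir s3n://example-migration-bucket/staging/S_CONTACT \
  --fields-terminated-by ',' \
  --num-mappers 4

One such import is run for every entity in the batch, which is why the testing framework described later is driven by a simple list of table names.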
RDBMS to Big Data Migration Testing Solution
Step 1: Define Scenarios
To test migrated data, a one-to-one comparison of all the entities is required. Since big data volumes are (as the term suggests) huge, three test scenarios are performed for each entity:

• Count reconciliation for all rows.
• Missing primary key detection for all rows.
• Field-to-field data comparison for sample records.

The first scenario verifies that the record counts in the source DB and target DB match. The second ensures that all records from the source systems flow to the target DB; this is done by checking the primary key of every record in both the source and target systems, confirming that all records are present in the target DB. The third, and most important, compares all columns of the source and target databases for sample records. This ensures that the data is not corrupted, date formats are maintained and values are not truncated. The number of records used for sample testing can be decided according to the data volume; basic data corruption can often be detected by testing as few as 100 sample records.

Step 2: Choose the Appropriate Method of Testing
Per our analysis, we shortlisted two methods of testing:

• UNIX shell script and T-SQL-based reconciliation.
• PIG scripting.
Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 compares the two methods. Based on this comparison, we recommend focusing on the first approach, where full end-to-end automation is possible. If any transformations are present, they need to be performed in the staging layer, which can then be treated as the source so that the same solution can be applied.
According to this analysis, PIG scripting is more appropriate for testing migrations with complex transformation logic. But for this type of simple migration, the PIG scripting approach is very time-consuming and resource-intensive.

Figure 1: Testing Approach Comparison

UNIX shell script and T-SQL-based reconciliation:
• Prerequisites: Load the target Hadoop data into a central QA server (SQL Server) as separate entities and validate it against the source tables. A SQL Server database is needed to store the tables and perform the comparison using SQL queries, along with a preconfigured linked server in the SQL Server DB to connect to all source databases.
• Effort: Initial coding for five to 10 tables takes one week; consecutive additions take two days for ~10 tables.
• Automation: Full automation possible.
• Performance (on Windows XP, 3 GB RAM, 1 CPU): Delivers results quickly compared with other methods. For 15 tables averaging 100K records each, it takes roughly 30 minutes for counts, 20 minutes for a 100-record sample comparison and one hour for missing primary keys.
• Highlights: Full automation and job scheduling are possible; comparison is fast; no permission or security issues were faced while accessing big data on AWS.
• Low points: Initial framework setup is time-consuming.

PIG scripting:
• Prerequisites: Migrate data from the RDBMS to HDFS and compare QA HDFS files with development HDFS files using Pig scripting; flat files for each entity are created using the Sqoop tool.
• Effort: Compares flat files; scripting is needed for each column of each table, so effort grows in proportion to the number of tables and their columns.
• Automation: No automation possible.
• Performance: Requires migrating the source tables to HDFS files as a prerequisite, which is time-consuming; the processing itself can be faster than other methods.
• Highlights: Offers a lot of flexibility in coding; very useful for more complex transformations.
• Low points: Greater effort for decoding, reporting results and handling script errors.
Step 3: Build a Framework
Here we bring data from Hadoop into a SQL Server consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.

Figure 2: High-Level Validation Approach. Data from the source systems (SQL Server, Oracle, MySQL and any other RDBMS) is migrated to HDFS on the AWS Hadoop server using Sqoop, and the HDFS files are exposed as Hive tables with LOAD DATA INPATH. On that server, a dynamically generated Get Data shell script writes the record count, the ROW_IDs and the first 100 records of every Hive table to CSV files. WinSCP download commands copy those files from the UNIX server to the Windows machine (a Jenkins slave), where SQL batch files bulk-load the file contents into QA tables on the QA DB server (SQL Server). Stored procedures then use a linked server to pull counts, ROW_IDs and sample column data from the various source databases, compare them with the Hive results and report any data mismatch.

• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files; the LOAD DATA INPATH command then loads these HDFS files into Hive tables, whose definitions reside in the Hive metastore. The HDFS files are stored in Amazon S3. To fetch data from Hive into flat files:
>> Store the table list in a CSV file on the UNIX server.
>> Write a UNIX shell script that takes the table list CSV file as input and generates another shell script (the Get Data script) that extracts the Hive data into CSV files for each table. The generated script is executed from the Windows batch file.
>> Because the Get Data script is generated dynamically, only the table list CSV file needs to be updated in each iteration/release when new tables are added. (A sketch of such a generated script appears with Figure 3 below.)
• WinSCP: The next step is to transfer the files from the Hadoop environment to the Windows server. The WinSCP batch command interface can be used for this: the WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command, and a simple GET command with the file name copies each file to the Windows server.

Figure 3: Sample Code from Final Shell Script
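The original Figure 3 shows the final shell script as an image. As a hedged reconstruction of what such a script might look like, the sketch below loops over the table list CSV and extracts the count, the ROW_IDs and a 100-record sample for each Hive table; the paths, file names and the ROW_ID column are illustrative assumptions rather than the project's exact code.

#!/bin/bash
# Get Data script (illustrative): reads hive_tables.csv and writes one CSV per check per table.
TABLE_LIST=/home/qa/hive_tables.csv   # hypothetical path: one Hive table name per line
OUT_DIR=/home/qa/recon_output         # hypothetical output location
mkdir -p "$OUT_DIR"

while read -r TABLE; do
    [ -z "$TABLE" ] && continue
    # 1. Count reconciliation input: table name and row count.
    hive -e "SELECT '${TABLE}', COUNT(*) FROM ${TABLE};" | sed 's/\t/,/g' > "${OUT_DIR}/${TABLE}_count.csv"
    # 2. Missing primary key input: every ROW_ID in the Hive table (assumes a ROW_ID column, as in Siebel tables).
    hive -e "SELECT row_id FROM ${TABLE};" > "${OUT_DIR}/${TABLE}_rowids.csv"
    # 3. Sample comparison input: first 100 records, all columns, converted from tab- to comma-delimited.
    hive -e "SELECT * FROM ${TABLE} LIMIT 100;" | sed 's/\t/,/g' > "${OUT_DIR}/${TABLE}_sample100.csv"
done < "$TABLE_LIST"

The Windows batch file then runs this script over Plink, downloads the resulting CSV files with WinSCP and hands them to the SQL Server steps described next.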
• SQL Server database usage: SQL Server is the main database used for loading the Hive data and holding the final reconciliation results. A SQL script loads the data from each .CSV file into a database table using the BULK INSERT command (a sketch of this load-and-compare step follows this list).
• Windows batch command: The process described above (transferring the data in .CSV files, importing the files into SQL Server and validating the source and target data) must run sequentially. All validation processes can be automated by creating a Windows batch file. The batch file executes the shell script on the Hadoop environment from the Windows server using the Plink command; in this way, all Hive data is loaded into the SQL Server tables. The next step is to execute the SQL Server procedures that perform the count, primary key and sample data comparisons. We use SQLCMD to execute the SQL Server procedures from the batch file.
• Jenkins: The end-to-end validation process can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly, daily or weekly basis without manual intervention. On Jenkins, an auto-scheduled ANT script invokes a Java program that connects to SQL Server and generates an HTML report of the latest records. Jenkins jobs can then e-mail the results to predefined recipients in HTML format.
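To make the SQL Server side concrete, the following is a rough sketch of how the load-and-compare step might look for the count scenario. It is shown as a shell wrapper around SQLCMD for readability; in the framework this logic lives in the Windows batch file and stored procedures, and the server, database, file and table names below are assumptions for illustration only.

#!/bin/bash
# Hypothetical load-and-compare step for count reconciliation.
cat > load_and_compare.sql <<'SQL'
-- Load the Hive count file. CHAR(10) is used as the row terminator because the
-- CSV files come from a UNIX server (see Figure 5, issue 4).
DECLARE @load nvarchar(max) = N'
    BULK INSERT dbo.HIVE_TABLE_COUNTS
    FROM ''C:\recon\S_CONTACT_count.csv''
    WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''' + CHAR(10) + N''');';
EXEC sp_executesql @load;

-- Compare Hive counts with source counts (assumed to be collected separately,
-- e.g., by a stored procedure reading the source DBs over the linked server).
SELECT  s.TABLE_NAME,
        s.ROW_CNT                         AS SOURCE_ROW_CNT,
        h.ROW_CNT                         AS HIVE_ROW_CNT,
        s.ROW_CNT - h.ROW_CNT             AS DIFF,
        CASE WHEN s.ROW_CNT = h.ROW_CNT THEN 'PASS' ELSE 'FAIL' END AS STATUS
FROM    dbo.SOURCE_TABLE_COUNTS s
JOIN    dbo.HIVE_TABLE_COUNTS   h ON h.TABLE_NAME = s.TABLE_NAME;
SQL

sqlcmd -S QASRV01 -d ReconDB -U qa_user -P "$QA_DB_PASSWORD" -i load_and_compare.sql

A primary key check can follow the same pattern, loading the ROW_ID files and using EXCEPT between the source and Hive ROW_ID sets to list missing or extra records.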
Figure 4: Results of Count Reconciliation for Migrated Hive Tables (from the Generated Web Page Report)

SIEBEL HIVE COUNT CLUSTER RECON SUMMARY
AUD_ID: 153 | EXECUTION_DATE: 2015-04-14 21:55:27.787 | SCHEMA_NAME: SIEBELPRD | TARGET_DB: HIVE | TOTAL_PASS: 60 | TOTAL_FAIL: 94 | ENV: PRD

SIEBEL HIVE COUNT CLUSTER RECON DETAIL (AUD_ID 153, EXEC_DATE 2015-04-14 21:55:27.787)
AUD_SEQ | SOURCE_TAB_NAME | SOURCE_ROW_CNT | HIVE_TAB_NAME | HIVE_ROW_CNT | DIFF | PERCENTAGE_DIFF | STATUS
1 | S_ADDR_PER | 353420 | S_ADDR_PER | 343944 | 9476 | 2.68 | FAIL
2 | S_PARTY | 2730468 | S_PARTY | 2730468 | 0 | 0 | PASS
3 | S_ORG_GROUP | 16852 | S_ORG_GROUP | 16852 | 0 | 0 | PASS
4 | S_LST_OF_VAL | 29624 | S_LST_OF_VAL | 29624 | 0 | 0 | PASS
5 | S_GROUP_CONTACT | 413912 | S_GROUP_CONTACT | 413912 | 0 | 0 | PASS
6 | S_CONTACT | 1257758 | S_CONTACT | 1257758 | 0 | 0 | PASS
7 | S_CON_ADDR | 6220 | S_CON_ADDR | 6220 | 0 | 0 | PASS
8 | S_CIF_CON_MAP | 28925 | S_CIF_CON_MAP | 28925 | 0 | 0 | PASS
9 | S_ADDR_PER | 93857 | S_ADDR_PER | 93857 | 0 | 0 | PASS
10 | S_PROD_LN | 1114 | S_PROD_LN | 1106 | 8 | 0.72 | FAIL
11 | S_ASSET_REL | 696178 | S_ASSET_REL | 690958 | 5220 | 0.75 | FAIL
12 | S_AGREE_ITM_REL | 925139 | S_AGREE_ITM_REL | 917657 | 7482 | 0.81 | FAIL
13 | S_REVN | 131111 | S_REVN | 128949 | 2162 | 1.65 | FAIL
14 | S_ENTLMNT | 127511 | S_ENTLMNT | 125144 | 2367 | 1.86 | FAIL
15 | S_ASSET_XA | 5577029 | S_ASSET_XA | 5457724 | 119305 | 2.14 | FAIL
16 | S_BU | 481 | S_BU | 470 | 11 | 2.29 | FAIL
17 | S_ORG_EXT | 345276 | S_ORG_EXT | 336064 | 9212 | 2.67 | FAIL
18 | S_ORG_BU | 345670 | S_ORG_BU | 336424 | 9246 | 2.67 | FAIL
Implementation Issues and Resolutions
Organizations may face a number of implementation issues. Figure 5 lists common issues and probable resolutions.

Impact of Testing
Figure 6 summarizes the gains over manual Excel-based testing achieved by using our framework for one customer's CRM applications, which are based on Oracle and SQL Server databases.
Looking Forward
More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations conduct big data migration testing more quickly, efficiently and accurately. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper:

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix.
• Create an environment that can accommodate huge data sets. Cloud setups are recommended.
• Be aware of the Agile/Scrum cadence mismatch. Break data into smaller incremental blocks as a work-around.
• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.
Figure 6: The Advantages of a Big Data Migration Testing Framework*

Scenario | Manual (mins.) | Framework (mins.) | Gain (mins.) | % Gain
Count | 20 | 2 | 18 | 90.00%
Sample 100 Records | 100 | 1.3 | 98.7 | 98.70%
Missing Primary Key | 40 | 4 | 36 | 90.00%

* Effort calculated for one table with around 500K records, including summary report generation.
Figure 5: Overcoming Test Migration Challenges

1. Issue: Column data type mismatch errors while loading .CSV files and Hive data into the SQL Server tables.
   Resolution: Create the tables in SQL Server by matching the Hive table data types.
2. Issue: No FTP access to the Hadoop environment for transferring files.
   Resolution: Use WinSCP software.
3. Issue: Column sequence mismatch between Hive tables and source tables, which causes the .CSV files to fail to load into the Hive_* tables.
   Resolution: Create the tables in SQL Server for target entities by matching the Hive table column order.
4. Issue: Inability to load .CSV files due to an end-of-file issue in SQL Server BULK INSERT.
   Resolution: Update the SQL statement with the appropriate row terminator, char(10) (linefeed), which allows import of .CSV files generated on a UNIX server.
5. Issue: Performance issues on primary key validations.
   Resolution: Tune the SQL Server stored procedures and increase the SQL Server temp DB space.
6. Issue: Commas appearing within column values.
   Resolution: Create a TSV file instead so the delimiter does not clash during data loading; remove NULL and null markers from the TSV and generate a .txt file; finally, convert from UTF-8 to UTF-16 and generate an XLS file that can be loaded into the SQL Server database (a sketch of this clean-up follows).
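For issue 6, a minimal shell sketch of the clean-up is shown below. It assumes the data is exported from Hive as tab-separated text; the table name, paths, and the use of GNU sed and iconv are illustrative choices rather than the project's exact steps, and the final XLS generation is omitted.

# Hypothetical clean-up for columns whose values contain commas.
TABLE=S_CONTACT
hive -e "SELECT * FROM ${TABLE} LIMIT 100;" > "${TABLE}.tsv"                 # Hive CLI output is tab-separated
sed -e 's/\bNULL\b//g' -e 's/\bnull\b//g' "${TABLE}.tsv" > "${TABLE}.txt"    # strip NULL markers (GNU sed)
iconv -f UTF-8 -t UTF-16 "${TABLE}.txt" > "${TABLE}_utf16.txt"               # re-encode before loading into SQL Server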
Glossary
•	AWS: Amazon Web Services is a collection of remote computing services, also called Web services,
that make up a cloud computing platform from Amazon.com.
•	Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-
effectively process vast amounts of data.
•	Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable,
highly-scalable object storage.
•	Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
•	Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon EMR on AWS.
•	Jenkins: Jenkins is an open-source, continuous-integration tool written in Java. Jenkins provides
continuous integration services for software development.
•	PIG scripting:	 PIG is a high-level platform for creating MapReduce programs used with Hadoop. The
language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of
SQL for RDBMS systems.
•	RDBMS: A relational database management system is a database management system (DBMS) that
is based on the relational model as invented by E.F. Codd, of IBM’s San Jose Research Laboratory.
•	Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from
non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into
HDFS. This process is briefly called extract, transform and load (ETL).
•	WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft
Windows. Its main function is to secure file transfer between a local and a remote computer. Beyond
this, WinSCP offers basic file manager and file synchronization functionality.
•	Unix shell scripting: A shell script is a computer program designed to be run by the Unix shell, a
command line interpreter.
•	T-SQL: Transact-SQL is Microsoft’s and Sybase’s proprietary extension to Structured Query Language
(SQL).
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
­­© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
About the Author
Rashmi Khanolkar is a Senior Architect within Cognizant’s Comms-Tech Business Unit. Proficient in application architecture, data architecture and technical design, Rashmi has 15-plus years of experience in the software industry. She has managed multiple data migration quality projects involving large volumes of data. Rashmi also has extensive experience on multiple development projects on .NET and MOSS 2007, and has broad knowledge within the CRM, insurance and banking domains. She can be reached at Rashmi.Khanolkar@cognizant.com.
Footnotes
1	 For more on Code Halos and innovation, read “Code Rules: A Playbook for Managing at the Crossroads,”
Cognizant Technology Solutions, June 2013, http://www.cognizant.com/Futureofwork/Documents/
code-rules.pdf, and the book, Code Halos: How the Digital Lives of People, Things, and Organizations Are
Changing the Rules of Business, by Malcolm Frank, Paul Roehrig and Ben Pring, published by John Wiley &
Sons, April 2014, http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118862074.html.
2	 2014 IDG Enterprise Big Data Report, http://www.idgenterprise.com/report/big-data-2.
Codex 1439
