DATA LINEAGE
A Major Project report submitted in partial fulfillment of the requirements for
award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
By
K. BHARGAVI Roll No: 12011A0507
CH. PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
Under the esteemed guidance of
Dr. J. UJWALA REKHA
Asst. Professor of C.S.E.
Department of Computer Science and Engineering
JNTUH College of Engineering Hyderabad (Autonomous)
Kukatpally, Hyderabad - 500 085, Telangana, India
DECLARATION BY THE CANDIDATE
We, K.Bhargavi (12011A0507), Ch.Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557), hereby declare that the major project
titled “Data Lineage”, carried out under the guidance of Dr. J. Ujwala Rekha, Asst.
Professor, is submitted in partial fulfillment of the requirements for the award of
Bachelor of Technology in Computer Science and Engineering. This is a record of
bonafide work carried out by us and the results produced by us have not been
reproduced/copied from any source.
The results embodied in this project report have not been submitted to any other
University or Institute for the award of any other degree or diploma.
K.BHARGAVI Roll No: 12011A0507
CH.PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
CERTIFICATE BY THE SUPERVISOR
This is to certify that the major project report titled “Data Lineage”, being submitted by
K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557), in the Department of Computer
Science and Engineering of JNTUH COLLEGE OF ENGINEERING HYDERABAD
is a record of bonafide work carried out by them under my guidance and supervision. The
results embodied in this project report have not been submitted to any other University or
Institute for the award of any other degree or diploma. The results have been verified and
found to be satisfactory.
Dr. J. Ujwala Rekha
Asst. Professor
CERTIFICATE BY THE HEAD OF THE DEPARTMENT
This is to certify that the major project report titled “Data Lineage”, being submitted by
K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557), is submitted in partial fulfillment
of the requirements for the award of Bachelor of Technology in Computer Science
and Engineering.
Dr. V. Kamakshi Prasad
Professor and Head of the Department
ACKNOWLEDGEMENT
We take this opportunity to thank all who have rendered their full
support to our work. The pleasure, the achievement, the glory, the
satisfaction, the reward, the appreciation and the completion of our
project cannot be imagined without a few who, apart from their regular
schedule, spared their valuable time for us. This acknowledgement is not
just a collection of words but also an account of our indebtedness. We thank
our guide, Dr. J. Ujwala Rekha, Asst. Professor, for giving us the
opportunity to do this project work and for her constant help and
guidance. We take the opportunity to express our gratitude to Dr. V.
Kamakshi Prasad, Professor and Head of the Department, Department
of CSE, JNTUH for giving us the opportunity to do this major project.
Finally, we thank JP Morgan Chase and Co. for giving us the
opportunity to work on this project and our mentors for guiding and
keeping us motivated throughout this process. Also, we are thankful to
the faculty of Department of CSE, JNTUH, our friends, and all our family
members who with their valuable suggestions and support, directly or
indirectly helped us in this project work.
K.BHARGAVI Roll No: 12011A0507
CH.PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
ABSTRACT
Ninety percent (90%) of the world’s data has been created in the last two
years alone. This explosion of data is the result of the ever-growing
number of systems and automation at all levels in all sizes
of organizations. While this data has made it easier to access information
in the working world, it has also led to a new set of problems.
Users need “clean” and conformed data to make informed decisions. Lack
of trust in data makes users move away from using information systems.
The solution to data integrity, uniformity and correctness is matured
data governance. And the first step to achieving it is to get a visual on the
existing data flow and data lineage.
The requirements of this project are:
● A code-parsing engine to extract the data elements (i.e. columns).
● Identifying the original source of the data elements.
● Identifying the transformation logic to create/populate the data elements.
● A graphical interactive interface for visualization.
The proposed solution is to introduce a system that handles the SQL
queries and maps the attributes to a database schema. With the help of
an interactive UI and a SQL parser, we can extract and collect
meaningful expressions from the parsed text, using declared
combinations of grammar rules and parsed text tokens. This application
will help us visualize the complex network of data flows and data
dependencies, which in turn will help us define strategies to improve
data quality.
TABLE OF CONTENTS
Abstract 6
Chapter 1 Introduction 11
1.1 Overview 11
Chapter 2 Literature Survey 13
2.1 Introduction 13
2.2 Data Lineage 13
2.3 Scope 14
2.4 Important definitions in Data Lineage 16
Chapter 3 Design 22
3.1 Project Background 22
3.2 Proposed Solution 22
3.3 Requirements List 24
3.3.1 Requirements Description 24
3.4 Assumptions and Dependencies 26
3.5 Risks 26
3.6 UML Diagrams 27
3.6.1 Class Diagram 27
3.6.2 Use Case Diagram 28
3.6.3 Activity Diagram 28
3.6.4 Sequence Diagram 30
Chapter 4 Implementation 31
4.1 User Module 31
4.2 Parser 31
4.3 Graphical User Interface 32
Chapter 5 Result 33
Chapter 6 Conclusion and Future Scope 39
Chapter 7 References 41
LIST OF FIGURES
Number Title Pg No.
Figure 3.1 Proposed/To-be Work Flow 23
Figure 3.2 Class Diagram 27
Figure 3.3 Use Case Diagram 28
Figure 3.4 Activity Diagram 29
Figure 3.5 Sequence Diagram 30
Figure 6.1 Login Form 33
Figure 6.2 Home Page 34
Figure 6.3 SQL Input Form 35
Figure 6.4 Graphical Representation of Data 36
Figure 6.5 Expanded graph 37
CHAPTER 1: INTRODUCTION
1.1 Overview
With the increase in the amount of data being created every day, it is
becoming difficult to govern and maintain such a huge amount of data.
Developers and managers are facing problems in business intelligence
(BI) and Data Warehouse (DW) environments where the chains of data
transformations are long and the complexity of structural changes is
high. The management of data integration processes becomes
unpredictable and the costs of changes can be very high due to the lack
of information about data flows and internal relations of system
components. The amount of different data flows and system component
dependencies in a traditional data warehouse environment is large.
Important contextual relations are coded into data transformation
queries and programs (e.g. SQL queries, data loading scripts, open or
closed DI system components etc.). Data lineage dependencies are
spread between different systems and frequently exist only in program
code or SQL queries. This leads to unmanageable complexity, lack of
knowledge and a large amount of technical work with uncomfortable
consequences like unpredictable results, wrong estimations, rigid
administrative and development processes, high cost, lack of flexibility
and lack of trust. We need clean and correct data to make informed
decisions. Lack of trust in data makes users move away from using
information systems. The solution to data integrity, uniformity and
correctness is matured data governance. And the first step to achieving
it is to get a visual on the existing data flow and data lineage.
A visual representation of data lineage helps to track data from its origin
to its destination. It explains the different processes involved in the data
flow and their dependencies. Metadata management is the key input to
capturing enterprise data flow and presenting data lineage. It consists of
metadata collection, integration, usage and repository maintenance. It
captures enterprise data flow and presents the data lineage in a
graphical manner, showing us the flow from source to the destination.
CHAPTER 2: LITERATURE SURVEY
2.1 Introduction
A literature survey or literature review means that we read and report on
what the literature in the field has to say about our topic or subject.
There may be a lot of literature on the topic or there may be a little.
Either way, the goal is to show that we have read and understood the
positions of other academics who have studied the problem/issue that
we are studying and include that in our project. We have done this by
comparing, contrasting, and summarizing the relevant work.
2.2 Data Lineage
The representation of data lineage broadly depends on the scope of the
metadata management and the reference point of interest. Backward data
lineage provides the sources of the data and the intermediate data-flow
hops from the reference point, while forward data lineage leads to the
final destination's data points and their intermediate data flows. These
views can be combined into end-to-end lineage for a reference point,
providing a complete audit trail of that data point of interest from its
source to its final destination. As the number of data points or hops
increases, the complexity of such a representation becomes
incomprehensible. Thus, the best feature of a data lineage view is the
ability to simplify the view by temporarily masking unwanted peripheral
data points. Tools that have this masking feature enable scalability of
the view and enhance analysis with the best user experience for both
technical and business users alike.[4]
2.3 Scope
The scope of data lineage determines the volume of metadata required to
represent it. Usually, Data Governance and Data Management determine
the scope of the data lineage based on their regulations, enterprise data
management strategy, data impact, reporting attributes, and critical data
elements of the organization.
Data lineage provides the audit trail of the data points at the lowest
granular level, but presentation of the lineage may be done at various
zoom levels to simplify the vast information, similar to analytic web
maps. It can be visualized at various levels based on the granularity of
the view. At a very high level, data lineage shows which systems the data
interacts with before it reaches its destination. As the granularity
increases, it goes down to the data-point level, where it can provide the
details of the data point, its historical behavior, its attribute properties
and trends, and the quality of the data passed through that specific data
point in the lineage.
Data Governance plays a key role in metadata management by providing
guidelines, strategies, and policies for implementation. Data quality and
data management help enrich the data lineage with more business value.
Even though the final representation of data lineage is provided in one
interface, the way the metadata is harvested and exposed to the data
lineage User Interface (UI) could be entirely different.
Thus, Data lineage can be broadly divided into three categories based on
the way metadata is harvested: Data lineage involving software packages
for structured data, Programming Languages, and Big Data.
Data lineage is expected to show at least the technical metadata involving
the data points and their various transformations. Along with technical data,
data lineage may enrich the metadata with their corresponding data
quality results, reference data values, data models, business vocabulary,
people, programs, and systems linked to the data points and
transformations. The masking feature in the data lineage visualization allows
the tools to incorporate all the enrichments that matter for the specific
use case. Metadata normalization may be done in data lineage to
represent disparate systems into one common view.
Data provenance documents the inputs, entities, systems, and
processes that influence data of interest, in effect providing a historical
record of the data and its origins. The generated evidence supports
essential forensic activities such as data-dependency analysis,
error/compromise detection and recovery, and auditing and compliance
analysis.
2.4 Important definitions in Data lineage
1. Metadata
Metadata describes other data. It provides information
about a certain item's content. For example, an image may
include metadata that describes how large the picture is, the color
depth, the image resolution, when the image was created, and
other data. A text document's metadata may contain information
about how long the document is, who the author is, when the
document was written, and a short summary of the document.
Metadata is essential for understanding information stored in data
warehouses and has become increasingly important in XML-based
Web applications. The main purpose of metadata is to facilitate
the discovery of relevant information, more often classified as
resource discovery. Metadata assists in resource discovery by
allowing resources to be found by relevant criteria, identifying
resources, bringing similar resources together, distinguishing
dissimilar resources, and giving location information. It is used to
summarize basic information about data which can make tracking
and working with specific data easier.[1] Some examples include:
● Means of creation of the data
● Purpose of the data
● Time and date of creation
● Creator or author of the data
● Location on a computer network where the data was created
● Standards used
● File size
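These attributes can also be collected programmatically. The sketch below, using only Python's standard library, gathers a few of them (location, size, modification time) for a file; the dictionary keys are our own naming choice for illustration, not any metadata standard:

```python
import os
import time

def file_metadata(path):
    """Collect basic metadata about a file: location, size, and last-change time."""
    st = os.stat(path)
    return {
        "location": os.path.abspath(path),    # where the data lives
        "file_size_bytes": st.st_size,        # file size
        "modified": time.ctime(st.st_mtime),  # time and date of last change
    }

# Example: write a small file, then describe it.
with open("example.txt", "w") as f:
    f.write("hello metadata")
meta = file_metadata("example.txt")
print(meta["file_size_bytes"])  # 14 bytes
```

A real metadata repository would persist such records centrally; here they only illustrate the kind of information involved.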
2. Data Warehouse
A data warehouse is a federated repository for
all the data that an enterprise's various business systems collect. The
repository may be physical or logical. It stores current and historical
data and is used for creating analytical reports for knowledge workers
throughout the enterprise.
Types of systems:
● A data mart is a simple form of a data warehouse that is focused
on a single subject (or functional area); hence, it draws data from
a limited number of sources such as sales, finance or marketing.
Data marts are often built and controlled by a single department
within an organization. The sources could be internal operational
systems, a central data warehouse, or external data. Given that
data marts generally cover only a subset of the data contained in a
data warehouse, they are often easier and faster to implement.
● Online analytical processing (OLAP) is characterized by a
relatively low volume of transactions. Queries are often very
complex and involve aggregations. For OLAP systems, response
time is an effectiveness measure. OLAP applications are widely
used in data mining. OLAP databases store aggregated, historical
data in multidimensional schemas (usually star schemas). OLAP
systems typically have data latency of a few hours, as opposed to
data marts, where latency is expected to be closer to one day. The
OLAP approach is used to analyze multidimensional data from
multiple sources and perspectives. The three basic operations in
OLAP are: Roll-up, Drill-down, and Slicing & Dicing.
● Online transaction processing (OLTP) is characterized by a
large number of short online transactions (INSERT, UPDATE,
DELETE). OLTP systems emphasize very fast query processing
and maintaining data integrity in multi-access environments. For
OLTP systems, effectiveness is measured by the number of
transactions per second. OLTP databases contain detailed and
current data.
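The basic OLAP operations named above can be illustrated on a tiny in-memory cube. The sales figures and dimensions below are invented, and the functions are only a sketch of the idea, not of a real OLAP engine:

```python
from collections import defaultdict

# A tiny fact table: (year, quarter, region) -> sales. Hypothetical figures.
facts = {
    (2015, "Q1", "East"): 100, (2015, "Q2", "East"): 120,
    (2015, "Q1", "West"): 80,  (2015, "Q2", "West"): 90,
    (2016, "Q1", "East"): 110, (2016, "Q1", "West"): 95,
}

def roll_up(facts):
    """Roll-up: aggregate the quarter and region dimensions up to the year level."""
    totals = defaultdict(int)
    for (year, _quarter, _region), sales in facts.items():
        totals[year] += sales
    return dict(totals)

def slice_cube(facts, region):
    """Slice: fix one dimension (region) and keep the remaining ones."""
    return {(y, q): s for (y, q, r), s in facts.items() if r == region}

print(roll_up(facts))             # {2015: 390, 2016: 205}
print(slice_cube(facts, "East"))  # quarters for the East region only
```

Drill-down is simply the inverse of roll-up: moving from the yearly totals back to the quarterly detail already present in the fact table.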
3. Data Management
Data management is the development and execution of architectures,
policies, practices and procedures in order to manage the information
lifecycle needs of an enterprise in an effective manner. Data lifecycle
management (DLM) is a policy-based approach to managing the flow of
an information system's data throughout its lifecycle: from creation and
initial storage to the time when it becomes obsolete and is deleted.
Several vendors offer DLM products but effective data management
involves well-thought-out procedures and adherence to best practices as
well as applications.
There are various approaches to data management. Master data
management (MDM), for example, is a comprehensive method of enabling
an enterprise to link all of its critical data to one file, called a master file,
that provides a common point of reference. The effective management of
corporate data has grown in importance as businesses are subject to an
increasing number of compliance regulations. Furthermore, the sheer
volume of data that must be managed by organizations has increased so
markedly that it is sometimes referred to as big data.[5]
4. Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with
detecting and removing errors and inconsistencies from data in order to
improve the quality of data. Data quality problems are present in single
data collections, such as files and databases, e.g., due to misspellings
during data entry, missing information or other invalid data. When
multiple data sources need to be integrated, e.g., in data warehouses,
federated database systems or global web-based information systems,
the need for data cleaning increases significantly. This is because the
sources often contain redundant data in different representations. In
order to provide access to accurate and consistent data, consolidation of
different data representations and elimination of duplicate information
become necessary.
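As a minimal illustration of this consolidation step, the sketch below removes duplicate records that differ only in case and spacing; the records themselves are invented:

```python
def clean(records):
    """Drop duplicates that differ only in spelling artifacts (case, spacing)."""
    seen, result = set(), []
    for rec in records:
        # Normalize each record to a canonical key before comparing.
        key = (rec["name"].strip().lower(), rec["city"].strip().lower())
        if key not in seen:
            seen.add(key)
            result.append({"name": rec["name"].strip(), "city": rec["city"].strip()})
    return result

raw = [
    {"name": "Ada Lovelace ", "city": "London"},
    {"name": "ada lovelace", "city": " london"},  # same entity, different entry
    {"name": "Alan Turing", "city": "London"},
]
print(clean(raw))  # only two distinct records remain
```

Real cleansing tools go much further (phonetic matching, reference-data lookups), but the principle of normalizing representations before comparison is the same.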
5. Data Quality
High-quality data needs to pass a set of quality criteria. These include:
● Validity: The degree to which the measures conform to defined
business rules or constraints.
● De-cleansing: Detecting errors and syntactically removing them
for better programming.
● Accuracy: The degree of conformity of a measure to a standard or
a true value.
● Completeness: The degree to which all required measures are
known.
● Consistency: The degree to which a set of measures is equivalent
across systems.
● Uniformity: The degree to which a set of data measures is specified
using the same units of measure in all systems.
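Several of these criteria can be measured directly as ratios over a data set. A small sketch with invented rows and a made-up business rule (age must be non-negative):

```python
def completeness(rows, required):
    """Completeness: fraction of rows where every required field is present."""
    ok = sum(1 for r in rows if all(r.get(f) not in (None, "") for f in required))
    return ok / len(rows)

def validity(rows, field, rule):
    """Validity: fraction of rows whose field satisfies a business rule."""
    return sum(1 for r in rows if rule(r.get(field))) / len(rows)

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},    # present but violates the rule age >= 0
    {"id": 3, "age": None},  # incomplete
]
print(completeness(rows, ["id", "age"]))  # 2/3: one row is missing age
print(validity(rows, "age", lambda a: a is not None and a >= 0))  # 1/3
```

Scores like these, tracked per data point, are the kind of quality information that can be attached to a lineage graph.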
6. User Interface (UI)
The user interface is one of the most important parts of any program
because it determines how easily you can make the program do what you
want. It is the space where interactions between humans and machines
occur. The goal of this interaction is to allow effective operation and
control of the machine from the human end, whilst the machine
simultaneously feeds back information that aids the operator's decision-
making process. Generally, the goal of user interface design is to produce
a user interface which makes it easy (self-explanatory), efficient, and
enjoyable (user-friendly) to operate a machine in the way which produces
the desired result. This generally means that the operator needs to
provide minimal input to achieve the desired output, and also that the
machine minimizes undesired outputs to the human. The user interface
can arguably include the total "user experience," which may include the
aesthetic appearance of the device, response time, and the content that
is presented to the user within the context of the user interface.
In our project we are using a web user interface (WUI) that accepts input
and provides output by generating web pages which are transmitted via
the Internet and viewed by the user in a web browser. This
kind of implementation utilizes Java, JavaScript, Bootstrap, AJAX, or
similar technologies to provide real-time control in a separate program,
eliminating the need to refresh a traditional HTML based web browser.
CHAPTER 3: DESIGN
3.1 Project Background
The objective of this project is to keep track of the data from its origin
to its destination with the help of metadata. Metadata contains
information about the origins of a particular data set and can be
granular enough to define information at the attribute level. It maintains
auditable information about users, location of data, applications, and
processes that create, delete, or change data, the exact timestamp of the
change, and the authorization that was used to perform these actions.
The amount of different data flows and system component dependencies
in a traditional data warehouse environment is large. Important
contextual relations are coded into data transformation queries and
programs (e.g. SQL queries.). Data lineage dependencies are spread
between different systems and frequently exist only in program code or
SQL queries. This leads to unmanageable complexity, lack of knowledge
and unpredictable results.
3.2 Proposed Solution
The proposed solution is to introduce a system
that handles the SQL queries and maps the attributes to a database
schema. With the help of an interactive UI and a SQL parser, we can
extract and collect meaningful expressions from the parsed text, using
declared combinations of grammar rules and parsed text tokens. This
application will help us visualize the complex network of data flows and
data dependencies, which in turn will help us define strategies to
improve data quality.
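To make this workflow concrete, the sketch below extracts source-to-target column pairs for one simple statement shape. The real system relies on General SQL Parser; this stand-in uses a plain regular expression and handles only `INSERT INTO ... SELECT ... FROM` statements with matching column lists:

```python
import re

def lineage_edges(sql):
    """For 'INSERT INTO target (cols) SELECT cols FROM source', pair each
    source column with the target column it populates."""
    m = re.search(
        r"INSERT\s+INTO\s+(\w+)\s*\(([^)]*)\)\s*SELECT\s+(.*?)\s+FROM\s+(\w+)",
        sql, re.IGNORECASE | re.DOTALL)
    if not m:
        return []  # unsupported statement shape
    target, tcols, scols, source = m.groups()
    tcols = [c.strip() for c in tcols.split(",")]
    scols = [c.strip() for c in scols.split(",")]
    return [(f"{source}.{s}", f"{target}.{t}") for s, t in zip(scols, tcols)]

sql = "INSERT INTO report (total, region) SELECT amount, area FROM sales"
print(lineage_edges(sql))
# [('sales.amount', 'report.total'), ('sales.area', 'report.region')]
```

Each returned pair is one edge of the lineage graph that the interface later draws.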
Figure 3.1 Proposed/To-be Work Flow
3.3 Requirements List
ID Requirement Name
1001 Code Parsing engine to extract the data elements (i.e. columns).
1002 Identifying the original source of the data elements.
1003 Graphical interactive interface for visualization.
Table 3.1
3.3.1 Requirements Description
The parser is an essential component of our system that takes SQL queries
and analyzes them.
Requirement ID:
1001
Version: 1.0 Created On: 23-FEB-2016
Requirement Name Code Parsing engine to extract the data elements (i.e.
columns).
Priority High
Description The application requires an appropriate Parsing engine
to extract the data elements from the database, handle
the SQL queries and map the attributes to the database
schema. Meaningful expressions can be collected from
the parsed data in the graphical user interface.
Table 3.2
SQL queries are taken as input data, which are further parsed into
XML code.
Requirement ID:
1002
Version: 1.0 Created On: 23-FEB-2016
Requirement Name The original source of the data elements (database).
Priority High
Description The database stores data elements that are sent to the
parsing engine, where the queries are decoded and
mapped to the respective attributes.
Table 3.3
The GUI is completely abstracted from the implementation to offer
intuitive interaction to users.
Requirement ID:
1003
Version: 1.0 Created On: 23-FEB-2016
Requirement Name Graphical interactive interface for visualization.
Priority High
Description A GUI is required to visualize the complex network of
data flows and data dependencies, hence creating graphic
illustrations of information in an efficient manner.
Table 3.4
3.4 Assumptions and Dependencies
● We assume that the input data are of a type that our
constructed tool supports.
● The connection established to the data warehouse doesn’t crash or
get disconnected.
● The General SQL Parser parses SQL queries to XML data only.
3.5 Risks
● The user interface through which we are visualizing data lineage
may lose connection with the database server.
● Web services that we are providing may be at a place different from
the repository location.
● Tracking lineage from mainframe applications and programs
doesn’t yield the exact workflow.
● Receiving files that are not compatible with our tool.
3.6 Unified Modeling Language Diagrams
3.6.1 Class Diagram
The class diagram is a static diagram. It represents the static view of an
application. The class diagram is not only used for visualizing, describing
and documenting different aspects of a system but also for constructing
executable code of the software application.
Figure 3.2 Class Diagram
3.6.2 Use Case Diagram
A use case diagram shows a set of use cases and actors (a special kind of
class) and their relationships. Use case diagrams address the static use
case view of a system. These diagrams are especially important in
organizing and modeling the behaviors of a system.
This UML diagram shows the relationship between various actors i.e.
user, parser and the UI.
Figure 3.3 Use Case Diagram
3.6.3 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency.
In the Unified Modeling Language, activity diagrams are intended to
model both computational and organizational processes (i.e.
workflows). These diagrams show the overall flow of control. They deal
with all types of flow control by using different elements like fork, join, etc.
Figure 3.4 Activity Diagram
3.6.4 Sequence Diagram
A sequence diagram is an interaction diagram that shows how processes
operate with one another and in what order. It is a construct of
a Message Sequence Chart. A sequence diagram shows object
interactions arranged in time sequence.[6]
CHAPTER 4: IMPLEMENTATION
4.1 User Module
The User Module consists of a Login form which takes input from the user.
The user provides login credentials like Username and Password in order
to connect to the webpage where the application runs.
The Home page gives a basic understanding of the data lineage concept
and provides options to explore the application.
4.2 Parser
An essential component of our system is the parser that is used to gather
information from mappings. In computer technology, a parser is a
program, usually part of a compiler, that receives input in the form of
sequential source program instructions, interactive online commands,
markup tags, or some other defined interface and breaks them up into
parts. In order to process mapping rules, we developed a parser that is
capable of extracting components from the SQL queries generically,
based on their semantic meaning and relation to each other. This parser
is developed using General SQL Parser.[2] The Java version of General
SQL Parser is a valuable tool because it provides an in-depth and detailed
analysis of SQL scripts for various databases, including SQL Server. The
parser can extract mapping components based on their semantic
meaning and recognizes the context in which components are used. By
providing a generic interface for this parser, we built a basis that can be
utilized to query for any element in the mapping structure.
The user is directed to the "Fill the details to know data lineage" form, where
they are provided with various database options. A MySQL database is used
here by default. The user can either upload an SQL query document or
type it in the input box. Upon "Send", the user is directed to a Graphical
User Interface.
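The parser's output step (SQL in, XML out) can be sketched with the standard library alone. The element and attribute names below are our own invention for illustration, not the schema that General SQL Parser actually emits, and only a bare `SELECT ... FROM` is handled:

```python
import re
import xml.etree.ElementTree as ET

def columns_to_xml(sql):
    """Extract the selected columns and source table from a simple SELECT
    and serialize them as XML, mirroring the parser's output step."""
    m = re.search(r"SELECT\s+(.*?)\s+FROM\s+(\w+)", sql, re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported statement")
    cols, table = m.groups()
    root = ET.Element("statement", {"type": "select"})
    src = ET.SubElement(root, "source", {"table": table})
    for col in (c.strip() for c in cols.split(",")):
        ET.SubElement(src, "column", {"name": col})
    return ET.tostring(root, encoding="unicode")

print(columns_to_xml("SELECT id, name FROM customers"))
```

A real SQL grammar also has to resolve aliases, joins, subqueries and expressions, which is exactly why the project delegates that work to a dedicated parser library.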
4.3 Graphical User Interface
The interface is completely abstracted from the implementation to offer
intuitive interaction to users. It uses the JSON files from the parser to
display graphs using D3.js (Data-Driven Documents).[3] D3.js is a
JavaScript library for producing dynamic, interactive data visualizations
in web browsers. It makes use of the widely implemented SVG, HTML5,
and CSS standards. The graphs consist of a sequence of mappings
defined as a data flow, i.e., the path that data items take in the system,
from their respective source to the final result. Consequently, it can be used
by any business user who has only minimal knowledge of mapping
structures. In combination with the complete abstraction from actual
data, this is a big step forward towards bridging the gap between
business users and IT systems.
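The hand-off between the parser and the interface can be sketched as an XML-to-JSON conversion into the node-link shape that D3 force layouts commonly consume. The XML layout below is a simplified assumption, not the parser's real output:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_d3(xml_text, target_table):
    """Convert parsed-SQL XML into {"nodes": [...], "links": [...]} for D3."""
    root = ET.fromstring(xml_text)
    nodes, links, index = [], [], {}

    def add(name):
        # Register a node once and reuse its index for every link.
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"id": name})
        return index[name]

    for src in root.iter("source"):
        for col in src.iter("column"):
            s = add(f'{src.get("table")}.{col.get("name")}')
            t = add(f'{target_table}.{col.get("name")}')
            links.append({"source": s, "target": t})
    return json.dumps({"nodes": nodes, "links": links})

xml_text = ('<statement type="select"><source table="sales">'
            '<column name="amount"/></source></statement>')
print(xml_to_d3(xml_text, "report"))
```

On the browser side, this JSON can be fed directly to a D3 force-directed graph, with each node a column and each link a lineage dependency.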
CHAPTER 6: CONCLUSION AND FUTURE SCOPE
Conclusion
As the volumes of data multiply, information about data becomes even
more critical. Data lineage methodology works like an x-ray for data flow
in an organization. It captures information from source to destination
along with the various processes and rules involved and shows how the
data is used. This knowledge about what data is available, its quality,
correctness and completeness leads to a mature data governance
process.
The metadata contains information about the origins of a particular data
set and can be granular enough to define information at the attribute
level. It maintains auditable information about users, location of data,
applications, and processes that create, delete, or change data, the exact
timestamp of the change, and the authorization that was used to perform
these actions.
Through this project we were able to keep track of our data starting from
the source to its destination. The parser takes the SQL queries and
generates XML. This XML data is then converted to JSON objects
and represented in a graphical manner using D3 graphs. This graph
helps us visualize the complex network of data flows and data
dependencies, which in turn will help us design strategies to improve
data quality.
Future Scope
Our team is scheduled to meet the leaders and mentors from JPMorgan
Chase & Co. this month to discuss the further extension of our project.
Considering the growing need of data lineage in the IT and banking
industry, this line of work is very valuable to the firm. Our project work
has been appreciated and might be used as a basic model by the
company to extend the scope and implementation of the Data Lineage
project.
CHAPTER 7: REFERENCES
[1] TechTarget, http://www.techtarget.com
[2] General SQL Parser, http://www.sqlparser.com/
[3] D3.js: Data-Driven Documents, https://d3js.org/
[4] Impact Analysis and Data Lineage, http://www.dlineage.com/impact-analysis-data-lineage.html
[5] Raghu Ramakrishnan and Johannes Gehrke, "Database Management Systems", 3rd Edition.
[6] Roger S. Pressman, "Software Engineering: A Practitioner's Approach", 6th Edition, McGraw-Hill International Edition.