DATA LINEAGE
A Major Project report submitted in partial fulfillment of the requirements for
award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
By
K. BHARGAVI Roll No: 12011A0507
CH. PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
Under the esteemed guidance of
Dr. J. UJWALA REKHA
Asst. Professor of C.S.E.
Department of Computer Science and Engineering
JNTUH College of Engineering Hyderabad (Autonomous)
Kukatpally, Hyderabad - 500 085, Telangana, India
Department of Computer Science and Engineering
JNTUH College of Engineering Hyderabad (Autonomous)
Kukatpally, Hyderabad - 500 085, Telangana, India
DECLARATION BY THE CANDIDATE
We, K.Bhargavi (12011A0507), Ch.Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557), hereby declare that the major project
titled “Data Lineage”, carried out under the guidance of Dr. J. Ujwala Rekha, Asst.
Professor, is submitted in partial fulfillment of the requirements for the award of
Bachelor of Technology in Computer Science and Engineering. This is a record of
bonafide work carried out by us and the results produced by us have not been
reproduced/copied from any source.
The results embodied in this project report have not been submitted to any other
University or Institute for the award of any other degree or diploma.
K.BHARGAVI Roll No: 12011A0507
CH.PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
Department of Computer Science and Engineering
JNTUH College of Engineering Hyderabad (Autonomous)
Kukatpally, Hyderabad - 500 085, Telangana, India
CERTIFICATE BY THE SUPERVISOR
This is to certify that the major project report titled “Data Lineage”, being submitted by
K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557), in the Department of Computer
Science and Engineering of JNTUH COLLEGE OF ENGINEERING HYDERABAD
is a record of bonafide work carried out by them under my guidance and supervision. The
results embodied in this project report have not been submitted to any other University or
Institute for the award of any other degree or diploma. The results have been verified and
found to be satisfactory.
Dr. J. Ujwala Rekha
Asst. Professor
Department of Computer Science and Engineering
JNTUH College of Engineering Hyderabad (Autonomous)
Kukatpally, Hyderabad - 500 085, Telangana, India
CERTIFICATE BY THE HEAD OF THE DEPARTMENT
This is to certify that the major project report titled “Data Lineage” is being submitted by
K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina
(12011A0551), and Rohan Reddy (11011A0557) in partial fulfillment of the
requirements for the award of Bachelor of Technology in Computer Science and
Engineering.
Dr. V. Kamakshi Prasad
Professor and Head of the Department
ACKNOWLEDGEMENT
We take this opportunity to thank all who have rendered their full
support to our work. The pleasure, the achievement, the glory, the
satisfaction, the reward, the appreciation and the completion of our
project cannot be imagined without a few who, apart from their regular
schedule, spared their valuable time for us. This acknowledgement is not
just a collection of words but a sincere account of our gratitude. We thank
our guide, Dr. J. Ujwala Rekha, Asst. Professor, for giving us the
opportunity to do this project work and for her constant help and
guidance. We take the opportunity to express our gratitude to Dr. V.
Kamakshi Prasad, Professor and Head of the Department, Department
of CSE, JNTUH, for giving us the opportunity to do this major project.
Finally, we thank JPMorgan Chase & Co. for giving us the
opportunity to work on this project, and our mentors for guiding us and
keeping us motivated throughout this process. We are also thankful to
the faculty of the Department of CSE, JNTUH, our friends, and all our family
members who, with their valuable suggestions and support, directly or
indirectly helped us in this project work.
K. BHARGAVI Roll No: 12011A0507
CH. PRAKYA SRI Roll No: 12011A0529
SHALINI RAINA Roll No: 12011A0551
ROHAN REDDY Roll No: 11011A0557
ABSTRACT
Ninety percent (90%) of the world’s data has been created in the last two
years alone. This explosion of data is the result of the ever-growing
number of systems and automation at all levels in organizations of all
sizes. While this data has made it easier to access information in the
working world, it has also led to a new set of problems.
Users need “clean” and conformed data to make informed decisions. Lack
of trust in data drives users away from information systems.
The solution to data integrity, uniformity and correctness is matured
data governance, and the first step to achieving it is to gain visibility into
the existing data flows and data lineage.
The requirements of this project are:
● Code Parsing engine to extract the data elements (i.e. columns).
● Identifying the original source of the data elements.
● Identifying the transformation logic to create/populate the data elements.
● Graphical interactive interface for visualization.
The proposed solution is to introduce a system that handles the SQL
queries and maps the attributes to a database schema. With the help of
an interactive UI and a SQL parser, we can extract and collect
meaningful expressions from the parsed text, using declared
combinations of grammar rules and parsed text tokens. This application
will help us visualize the complex network of data flows and data
dependencies, which in turn will help us define strategies to improve
data quality.
TABLE OF CONTENTS
Abstract
Chapter 1 Introduction
1.1 Overview
Chapter 2 Literature Survey
2.1 Introduction
2.2 Data Lineage
2.3 Scope
2.4 Important Definitions in Data Lineage
Chapter 3 Design
3.1 Project Background
3.2 Proposed Solution
3.3 Requirements List
3.3.1 Requirements Description
3.4 Assumptions and Dependencies
3.5 Risks
3.6 UML Diagrams
3.6.1 Class Diagram
3.6.2 Use Case Diagram
3.6.3 Activity Diagram
3.6.4 Sequence Diagram
Chapter 4 Implementation
4.1 User Module
4.2 Parser
4.3 Graphical User Interface
Chapter 5 Result
Chapter 6 Conclusion and Future Scope
Chapter 7 References
LIST OF FIGURES
Figure 3.1 Proposed/To-be Work Flow
Figure 3.2 Class Diagram
Figure 3.3 Use Case Diagram
Figure 3.4 Activity Diagram
Figure 3.5 Sequence Diagram
Figure 5.1 Login Form
Figure 5.2 Home Page
Figure 5.3 SQL Input Form
Figure 5.4 Graphical Representation of Data
Figure 5.5 Expanded Graph
CHAPTER 1 : INTRODUCTION
1.1 Overview
With the increase in the amount of data being created every day, it is
becoming difficult to govern and maintain such a huge amount of data.
Developers and managers are facing problems in business intelligence
(BI) and Data Warehouse (DW) environments where the chains of data
transformations are long and the complexity of structural changes is
high. The management of data integration processes becomes
unpredictable and the costs of changes can be very high due to the lack
of information about data flows and internal relations of system
components. The amount of different data flows and system component
dependencies in a traditional data warehouse environment is large.
Important contextual relations are coded into data transformation
queries and programs (e.g. SQL queries, data loading scripts, open or
closed DI system components etc.). Data lineage dependencies are
spread between different systems and frequently exist only in program
code or SQL queries. This leads to unmanageable complexity, lack of
knowledge and a large amount of technical work with uncomfortable
consequences like unpredictable results, wrong estimations, rigid
administrative and development processes, high cost, lack of flexibility
and lack of trust. We need clean and correct data to make informed
decisions. Lack of trust in data makes users move away from using
information systems. The solution to data integrity, uniformity and
correctness is matured data governance, and the first step to achieving
it is to gain visibility into the existing data flows and data lineage.
A visual representation of data lineage helps to track data from its origin
to its destination. It explains the different processes involved in the data
flow and their dependencies. Metadata management is the key input to
capturing enterprise data flow and presenting data lineage. It consists of
metadata collection, integration, usage and repository maintenance. It
captures enterprise data flow and presents the data lineage in a
graphical manner, showing us the flow from source to destination.
CHAPTER 2 : LITERATURE SURVEY
2.1 Introduction
A literature survey or literature review means that we read and report on
what the literature in the field has to say about our topic or subject.
There may be a lot of literature on the topic or there may be a little.
Either way, the goal is to show that we have read and understood the
positions of other academics who have studied the problem/issue that
we are studying and include that in our project. We have done this
through comparison, contrast, and simple summarization.
2.2 Data Lineage
The representation of data lineage depends broadly on the scope of
metadata management and the reference point of interest. Backward
data lineage provides the sources of the data and the intermediate
data-flow hops leading to the reference point, while forward data lineage
leads from the reference point to the final destination's data points and
their intermediate data flows. These views can be combined into
end-to-end lineage, providing a complete audit trail of the data point of
interest from source to final destination. As the number of data points or
hops increases, such a representation quickly becomes incomprehensible.
Thus, one of the most valuable features of a data lineage view is the
ability to simplify it by temporarily masking unwanted peripheral data
points. Tools with this masking feature make the view scalable and
enhance analysis, offering a good user experience for technical and
business users alike.[4]
2.3 Scope
The scope of data lineage determines the volume of metadata required to
represent it. Usually, Data Governance and Data Management determine
the scope of data lineage based on regulations, the enterprise data
management strategy, data impact, reporting attributes, and the
organization's critical data elements.
Data lineage provides an audit trail of data points at the lowest level of
granularity, but the lineage may be presented at various zoom levels to
simplify the vast information, much like analytic web maps. It can be
visualized at various levels depending on the granularity of the view. At
a very high level, data lineage shows which systems the data passes
through before it reaches its destination. As granularity increases, it
goes down to the data-point level, where it can provide the details of the
data point and its historical behavior, attribute properties, trends, and
the quality of the data that passed through that specific data point.
Data Governance plays a key role in metadata management through
guidelines, strategies, policies, and implementation. Data quality and
data management help enrich the data lineage with more business
value. Even though the final representation of data lineage is provided in
one interface, the way the metadata is harvested and exposed to the
data lineage User Interface (UI) can be entirely different.
Thus, data lineage can be broadly divided into three categories based on
the way metadata is harvested: data lineage involving software packages
for structured data, programming languages, and big data.
At a minimum, data lineage is expected to present the technical metadata
covering the data points and their various transformations. Along with
technical metadata, data lineage may be enriched with the corresponding
data quality results, reference data values, data models, business
vocabulary, and the people, programs, and systems linked to the data
points and transformations. The masking feature in data lineage
visualization allows the tools to incorporate whichever of these
enrichments matter for the specific use case. Metadata normalization
may also be performed to represent disparate systems in one common view.
Data provenance documents the inputs, entities, systems, and
processes that influence data of interest, in effect providing a historical
record of the data and its origins. The generated evidence supports
essential forensic activities such as data-dependency analysis,
error/compromise detection and recovery, and auditing and compliance
analysis.
2.4 Important definitions in Data lineage
1. Metadata
Metadata describes other data. It provides information about a certain
item's content. For example, an image may include metadata that
describes how large the picture is, the color depth, the image resolution,
when the image was created, and other data. A text document's metadata
may contain information about how long the document is, who the author
is, when the document was written, and a short summary of the document.
Metadata is essential for understanding information stored in data
warehouses and has become increasingly important in XML-based
Web applications. The main purpose of metadata is to facilitate the
discovery of relevant information, a process often classified as
resource discovery. Metadata assists in resource discovery by
allowing resources to be found by relevant criteria, identifying
resources, bringing similar resources together, distinguishing
dissimilar resources, and giving location information. It is used to
summarize basic information about data, which can make tracking
and working with specific data easier.[1] Some examples include:
● Means of creation of the data
● Purpose of the data
● Time and date of creation
● Creator or author of the data
● Location on a computer network where the data was created
● Standards used
● File size
2. Data Warehouse
A data warehouse is a federated repository for all the data that an
enterprise's various business systems collect. The repository may be
physical or logical. It stores current and historical data and is used for
creating analytical reports for knowledge workers throughout the enterprise.
Types of systems:
● A data mart is a simple form of a data warehouse that is focused
on a single subject (or functional area); hence, it draws data from
a limited number of sources, such as sales, finance or marketing.
Data marts are often built and controlled by a single department
within an organization. The sources could be internal operational
systems, a central data warehouse, or external data. Given that
data marts generally cover only a subset of the data contained in a
data warehouse, they are often easier and faster to implement.
● Online analytical processing (OLAP) is characterized by a
relatively low volume of transactions. Queries are often very
complex and involve aggregations. For OLAP systems, response
time is an effectiveness measure. OLAP applications are widely
used in data mining. OLAP databases store aggregated, historical
data in multidimensional schemas (usually star schemas). OLAP
systems typically have a data latency of a few hours, as opposed to
data marts, where latency is expected to be closer to one day. The
OLAP approach is used to analyze multidimensional data from
multiple sources and perspectives. The three basic operations in
OLAP are roll-up, drill-down, and slicing and dicing (see the sketch
after this list).
● Online transaction processing (OLTP) is characterized by a
large number of short online transactions (INSERT, UPDATE,
DELETE). OLTP systems emphasize very fast query processing
and maintaining data integrity in multi-access environments. For
OLTP systems, effectiveness is measured by the number of
transactions per second. OLTP databases contain detailed and
current data.
3. Data Management
Data management is the development and execution of architectures,
policies, practices and procedures in order to manage the information
lifecycle needs of an enterprise in an effective manner. Data lifecycle
management (DLM) is a policy-based approach to managing the flow of
an information system's data throughout its lifecycle: from creation and
initial storage to the time when it becomes obsolete and is deleted.
Several vendors offer DLM products but effective data management
involves well-thought-out procedures and adherence to best practices as
well as applications.
There are various approaches to data management. Master data
management (MDM), for example, is a comprehensive method of enabling
an enterprise to link all of its critical data to one file, called a master file,
that provides a common point of reference. The effective management of
corporate data has grown in importance as businesses are subject to an
increasing number of compliance regulations. Furthermore, the sheer
volume of data that must be managed by organizations has increased so
markedly that it is sometimes referred to as big data.[5]
4. Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with
detecting and removing errors and inconsistencies from data in order to
improve the quality of data. Data quality problems are present in single
data collections, such as files and databases, e.g., due to misspellings
during data entry, missing information or other invalid data. When
multiple data sources need to be integrated, e.g., in data warehouses,
federated database systems or global web-based information systems,
the need for data cleaning increases significantly. This is because the
sources often contain redundant data in different representations. In
order to provide access to accurate and consistent data, consolidation of
different data representations and elimination of duplicate information
become necessary.
5. Data Quality
High-quality data needs to pass a set of quality criteria, including the
following (a small programmatic sketch follows the list):
● Validity: the degree to which the measures conform to defined
business rules or constraints.
● De-cleansing: detecting errors and syntactically removing them so
that the data can be processed reliably.
● Accuracy: the degree of conformity of a measure to a standard or
a true value.
● Completeness: the degree to which all required measures are
known.
● Consistency: the degree to which a set of measures is equivalent
across systems.
● Uniformity: the degree to which a set of data measures is specified
using the same units of measure in all systems.
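As a minimal sketch of how two of these criteria can be checked programmatically, the following assumes a hypothetical column of weight strings; the class name, sample data, and unit suffix are all invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: simple checks for two of the criteria above
// (completeness and uniformity) over a single column of values.
public class DataQualityCheck {

    // Completeness: fraction of required values that are actually present.
    static double completeness(List<String> column) {
        long known = column.stream()
                           .filter(v -> v != null && !v.isEmpty())
                           .count();
        return (double) known / column.size();
    }

    // Uniformity: true if every present value carries the same unit suffix.
    static boolean uniform(List<String> column, String unit) {
        return column.stream()
                     .filter(v -> v != null && !v.isEmpty())
                     .allMatch(v -> v.endsWith(unit));
    }

    public static void main(String[] args) {
        List<String> weights = Arrays.asList("12kg", "7kg", null, "3kg");
        System.out.println("completeness = " + completeness(weights)); // 0.75
        System.out.println("uniform(kg)  = " + uniform(weights, "kg")); // true
    }
}
```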
6. User Interface (UI)
The user interface is one of the most important parts of any program
because it determines how easily you can make the program do what you
want. It is the space where interactions between humans and machines
occur. The goal of this interaction is to allow effective operation and
control of the machine from the human end, whilst the machine
simultaneously feeds back information that aids the operator's decision-
making process. Generally, the goal of user interface design is to produce
a user interface which makes it easy (self-explanatory), efficient, and
enjoyable (user-friendly) to operate a machine in the way which produces
the desired result. This generally means that the operator needs to
provide minimal input to achieve the desired output, and also that the
machine minimizes undesired outputs to the human. The user interface
can arguably include the total "user experience," which may include the
aesthetic appearance of the device, response time, and the content that
is presented to the user within the context of the user interface.
In our project we are using a web user interface (WUI) that accepts input
and provides output by generating web pages, which are transmitted via
the Internet and viewed by the user in a web browser. This kind of
implementation utilizes Java, JavaScript, Bootstrap, AJAX, and similar
technologies to provide real-time control in a separate program,
eliminating the need to refresh a traditional HTML-based web page.
CHAPTER 3 : DESIGN
3.1 Project Background
The objective of this project is to keep track of the data from its origin
to its destination with the help of metadata. Metadata contains
information about the origins of a particular data set and can be
granular enough to define information at the attribute level. It maintains
auditable information about users, location of data, applications, and
processes that create, delete, or change data, the exact timestamp of the
change, and the authorization that was used to perform these actions.
The amount of different data flows and system component dependencies
in a traditional data warehouse environment is large. Important
contextual relations are coded into data transformation queries and
programs (e.g., SQL queries). Data lineage dependencies are spread
between different systems and frequently exist only in program code or
SQL queries. This leads to unmanageable complexity, lack of knowledge
and unpredictable results.
3.2 Proposed Solution
The proposed solution is to introduce a system
that handles the SQL queries and maps the attributes to a database
schema. With the help of an interactive UI and a SQL parser, we can
extract and collect meaningful expressions from the parsed text, using
declared combinations of grammar rules and parsed text tokens. This
application will help us visualize the complex network of data flows and
data dependencies, which in turn will help us define strategies to
improve data quality.
Figure 3.1 Proposed/To-be Work Flow
3.3 Requirements List
ID    Requirement Name
1001  Code Parsing engine to extract the data elements (i.e. columns).
1002  Identifying the original source of the data elements.
1003  Graphical interactive interface for visualization.
Table 3.1: Requirements List
3.3.1 Requirements Description
Parser is an essential component of our system that takes SQL queries
and analyzes them.
Requirement ID: 1001    Version: 1.0    Created On: 23-FEB-2016
Requirement Name: Code Parsing engine to extract the data elements (i.e. columns).
Priority: High
Description: The application requires an appropriate parsing engine to extract the data elements from the database, handle the SQL queries, and map the attributes to the database schema. Meaningful expressions can then be collected from the parsed data in the graphical user interface.
Table 3.2: Requirement 1001
SQL queries are taken as input and parsed into XML.
Requirement ID: 1002    Version: 1.0    Created On: 23-FEB-2016
Requirement Name: The original source of the data elements (database).
Priority: High
Description: The database stores the data elements that are sent to the parsing engine, where the queries are decoded and mapped to the respective attributes.
Table 3.3: Requirement 1002
The GUI is completely abstracted from the implementation to offer
intuitive interaction to users.
Requirement ID: 1003    Version: 1.0    Created On: 23-FEB-2016
Requirement Name: Graphical interactive interface for visualization.
Priority: High
Description: A GUI is required to visualize the complex network of data flows and data dependencies, creating graphic illustrations of the information in an efficient manner.
Table 3.4: Requirement 1003
3.4 Assumptions and Dependencies
● We assume that the input data are of a type that our tool supports.
● The connection established to the data warehouse does not crash or get disconnected.
● The General SQL Parser parses SQL queries to XML data only.
3.5 Risks
● The user interface through which we visualize data lineage may lose connection with the database server.
● The web services that we provide may be hosted at a location different from the repository location.
● Tracking lineage from mainframe applications and programs does not yield the exact workflow.
● Files may be received that are not compatible with our tool.
3.6 Unified Modeling Language Diagrams
3.6.1 Class Diagram
The class diagram is a static diagram: it represents the static view of an
application. Class diagrams are used not only for visualizing, describing
and documenting different aspects of a system but also for constructing
the executable code of the software application.
Figure 3.2 Class Diagram
3.6.2 Use Case Diagram
A use case diagram shows a set of use cases and actors (a special kind of
class) and their relationships. Use case diagrams address the static use
case view of a system. These diagrams are especially important in
organizing and modeling the behaviors of a system.
This UML diagram shows the relationship between the various actors,
i.e., the user, the parser and the UI.
Figure 3.3 Use Case Diagram
3.6.3 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency.
In the Unified Modeling Language, activity diagrams are intended to
model both computational and organizational processes (i.e.
workflows). These diagrams show the overall flow of control. They deal
with all types of flow control using different elements such as fork and join.
Figure 3.4 Activity Diagram
3.6.4 Sequence Diagram
A sequence diagram is an interaction diagram that shows how processes
operate with one another and in what order. It is a construct of
a Message Sequence Chart. A sequence diagram shows object
interactions arranged in time sequence.[6]
Figure 3.5 Sequence Diagram
CHAPTER 4: IMPLEMENTATION
4.1 User Module
User Module consists of a Login form which takes input from the user.
The user provides login credentials like Username and Password in order
to connect to the webpage where the application runs.
The Home page gives a basic understanding of the data lineage concept
and provides options to explore the application.
4.2 Parser
An essential component of our system is the parser that is used to gather
information from mappings. In computer technology, a parser is a
program, usually part of a compiler, that receives input in the form of
sequential source program instructions, interactive online commands,
markup tags, or some other defined interface and breaks them up into
parts. In order to process mapping rules, we developed a parser that is
capable of extracting components from the SQL queries generically,
based on their semantic meaning and relation to each other. This parser
is developed using General SQL Parser.[2] The Java version of General
SQL Parser is a valuable tool because it provides an in-depth and detailed
analysis of SQL scripts for various databases, including SQL Server. The
parser can extract mapping components based on their semantic
meaning and recognizes the context in which components are used. By
providing a generic interface on top of this parser, we built a basis that
can be used to query for any element in the mapping structure.
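The snippet below is a minimal sketch of how such a parser can be driven. It assumes the vendor's published TGSqlParser API (exact class and field names may differ between library versions), and the query text is a hypothetical example, not a query from the project:

```java
import gudusoft.gsqlparser.EDbVendor;
import gudusoft.gsqlparser.TGSqlParser;

// Minimal sketch, assuming the gudusoft.gsqlparser API; names may differ
// between versions of the General SQL Parser library.
public class LineageParserDemo {
    public static void main(String[] args) {
        // Target the MySQL dialect, as MySQL is the project's default database.
        TGSqlParser parser = new TGSqlParser(EDbVendor.dbvmysql);
        parser.sqltext = "SELECT c.name, SUM(o.amount) AS total "
                       + "FROM customers c JOIN orders o ON o.cust_id = c.id "
                       + "GROUP BY c.name";

        // parse() returns 0 on success; non-zero signals a syntax error.
        if (parser.parse() == 0) {
            // Each parsed statement exposes its clauses, tables and columns,
            // which a lineage engine can walk to build source-to-target mappings.
            System.out.println(parser.sqlstatements.get(0).toString());
        } else {
            System.err.println(parser.getErrormessage());
        }
    }
}
```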
The user is directed to a "Fill the details to know data lineage" form,
where they are presented with various database options; MySQL is used
by default. The user can either upload an SQL query document or type
the query in the input box. Upon clicking "Send", the user is directed to
the graphical user interface.
4.3 Graphical User Interface
The interface is completely abstracted from the implementation to offer
intuitive interaction to users. It uses the JSON files produced from the
parser output to display graphs using D3.js (Data-Driven Documents).[3]
D3.js is a JavaScript library for producing dynamic, interactive data
visualizations in web browsers. It makes use of the widely implemented
SVG, HTML5, and CSS standards. The graphs consist of a sequence of
mappings that define the data flow, i.e., the path that data items take
through the system, from their respective sources to the final result.
Consequently, the application can be used by any business user who has
only minimal knowledge of mapping structures. In combination with the
complete abstraction from actual data, this is a big step towards bridging
the gap between business users and IT systems.
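As a rough illustration of the hand-off between the parser and the UI, the sketch below builds the node-link JSON shape that D3.js force-directed graphs conventionally consume from a list of column mappings. The class name and mapping values are hypothetical; the real application derives this structure from the parser's XML output rather than from hard-coded pairs:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: serialize source->target column mappings into the
// {"nodes": [...], "links": [...]} shape that D3.js force layouts expect.
public class LineageJsonBuilder {
    public static String toD3Json(List<String[]> mappings) {
        // Collect each distinct column once, preserving first-seen order.
        Set<String> nodes = new LinkedHashSet<>();
        for (String[] m : mappings) { nodes.add(m[0]); nodes.add(m[1]); }
        List<String> nodeList = new ArrayList<>(nodes);

        StringBuilder json = new StringBuilder("{\"nodes\":[");
        for (int i = 0; i < nodeList.size(); i++) {
            if (i > 0) json.append(',');
            json.append("{\"id\":\"").append(nodeList.get(i)).append("\"}");
        }
        json.append("],\"links\":[");
        for (int i = 0; i < mappings.size(); i++) {
            if (i > 0) json.append(',');
            json.append("{\"source\":\"").append(mappings.get(i)[0])
                .append("\",\"target\":\"").append(mappings.get(i)[1]).append("\"}");
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        List<String[]> mappings = List.of(
            new String[]{"sales.amount", "sales_summary.total"},
            new String[]{"sales.region", "sales_summary.region"});
        System.out.println(toD3Json(mappings));
    }
}
```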
CHAPTER 5: RESULT
Figure 5.1 Login Form for the User
Figure 5.2 Data Lineage Home Page
Figure 5.3 SQL Input Form
Figure 5.4 Graphical Representation of Data
Figure 5.5 Expanded Graph
CHAPTER 6: CONCLUSION AND FUTURE SCOPE
Conclusion
As the volumes of data multiply, information about data becomes even
more critical. Data lineage methodology works like an x-ray for data flow
in an organization. It captures information from source to destination
along with the various processes and rules involved and shows how the
data is used. This knowledge about what data is available, its quality,
correctness and completeness leads to a mature data governance
process.
The metadata contains information about the origins of a particular data
set and can be granular enough to define information at the attribute
level. It maintains auditable information about users, location of data,
applications, and processes that create, delete, or change data, the exact
timestamp of the change, and the authorization that was used to perform
these actions.
Through this project we were able to keep track of our data starting from
the source to its destination. The parser takes the SQL queries and
generates XML, which is then converted to JSON objects
and represented in a graphical manner using D3 graphs. This graph
helps us visualize the complex network of data flows and data
dependencies, which in turn will help us design strategies to improve
data quality.
Future Scope
Our team is scheduled to meet the leaders and mentors from JPMorgan
Chase & Co. this month to discuss the further extension of our project.
Considering the growing need of data lineage in the IT and banking
industry, this line of work is very valuable to the firm. Our project work
has been appreciated, and it may serve as a basic model from which the
company can extend the scope and implementation of the Data Lineage
project.
CHAPTER 7 : REFERENCES
[1] TechTarget, http://www.techtarget.com
[2] General SQL Parser, http://www.sqlparser.com/
[3] D3.js: Data-Driven Documents, https://d3js.org/
[4] Impact Analysis and Data Lineage, http://www.dlineage.com/impact-analysis-data-lineage.html
[5] Raghu Ramakrishnan and Johannes Gehrke, "Database Management Systems", 3rd Edition.
[6] Roger S. Pressman, "Software Engineering: A Practitioner's Approach", 6th Edition, McGraw-Hill International Edition.
Mais conteúdo relacionado

Mais procurados

Quiz app android ppt
Quiz app android pptQuiz app android ppt
Quiz app android pptAditya Nag
 
Blockchain Based voting system PPT.pptx
Blockchain Based voting system PPT.pptxBlockchain Based voting system PPT.pptx
Blockchain Based voting system PPT.pptxPrakash Zodge
 
Full report on blood bank management system
Full report on  blood bank management systemFull report on  blood bank management system
Full report on blood bank management systemJawhar Ali
 
Face detection and recognition
Face detection and recognitionFace detection and recognition
Face detection and recognitionPankaj Thakur
 
Attendance management system project report.
Attendance management system project report.Attendance management system project report.
Attendance management system project report.Manoj Kumar
 
Technical Seminar PPT
Technical Seminar PPTTechnical Seminar PPT
Technical Seminar PPTKshitiz_Vj
 
Campus news information system - Android
Campus news information system - AndroidCampus news information system - Android
Campus news information system - AndroidDhruvil Dhulia
 
IOT - Design Principles of Connected Devices
IOT - Design Principles of Connected DevicesIOT - Design Principles of Connected Devices
IOT - Design Principles of Connected DevicesDevyani Vasistha
 
Banking Management System Project
Banking Management System ProjectBanking Management System Project
Banking Management System ProjectChaudhry Sajid
 
CRYPTOCURRENCY TRACKER ppt.pptx
CRYPTOCURRENCY TRACKER ppt.pptxCRYPTOCURRENCY TRACKER ppt.pptx
CRYPTOCURRENCY TRACKER ppt.pptxSRUSHTIHINGE
 
Space internet and starlink
Space internet and starlinkSpace internet and starlink
Space internet and starlinkSahil Gupta
 
Virtual Surgery
Virtual SurgeryVirtual Surgery
Virtual Surgerybiomedicz
 
Smart attendance system
Smart attendance systemSmart attendance system
Smart attendance systempraful borad
 
Internship Presentation 1 Web Developer
Internship Presentation 1 Web DeveloperInternship Presentation 1 Web Developer
Internship Presentation 1 Web DeveloperHemant Sarthak
 
Android technical quiz app
Android technical quiz appAndroid technical quiz app
Android technical quiz appJagdeep Singh
 

Mais procurados (20)

Quiz app android ppt
Quiz app android pptQuiz app android ppt
Quiz app android ppt
 
Blockchain Based voting system PPT.pptx
Blockchain Based voting system PPT.pptxBlockchain Based voting system PPT.pptx
Blockchain Based voting system PPT.pptx
 
Full report on blood bank management system
Full report on  blood bank management systemFull report on  blood bank management system
Full report on blood bank management system
 
Face detection and recognition
Face detection and recognitionFace detection and recognition
Face detection and recognition
 
Attendance management system project report.
Attendance management system project report.Attendance management system project report.
Attendance management system project report.
 
Technical Seminar PPT
Technical Seminar PPTTechnical Seminar PPT
Technical Seminar PPT
 
Campus news information system - Android
Campus news information system - AndroidCampus news information system - Android
Campus news information system - Android
 
Nano computing
Nano computingNano computing
Nano computing
 
Bluejacking ppt
Bluejacking pptBluejacking ppt
Bluejacking ppt
 
IOT - Design Principles of Connected Devices
IOT - Design Principles of Connected DevicesIOT - Design Principles of Connected Devices
IOT - Design Principles of Connected Devices
 
Banking Management System Project
Banking Management System ProjectBanking Management System Project
Banking Management System Project
 
Quiz
QuizQuiz
Quiz
 
CRYPTOCURRENCY TRACKER ppt.pptx
CRYPTOCURRENCY TRACKER ppt.pptxCRYPTOCURRENCY TRACKER ppt.pptx
CRYPTOCURRENCY TRACKER ppt.pptx
 
Finger vein technology
Finger vein technologyFinger vein technology
Finger vein technology
 
Space internet and starlink
Space internet and starlinkSpace internet and starlink
Space internet and starlink
 
Virtual Surgery
Virtual SurgeryVirtual Surgery
Virtual Surgery
 
Edge Computing
Edge ComputingEdge Computing
Edge Computing
 
Smart attendance system
Smart attendance systemSmart attendance system
Smart attendance system
 
Internship Presentation 1 Web Developer
Internship Presentation 1 Web DeveloperInternship Presentation 1 Web Developer
Internship Presentation 1 Web Developer
 
Android technical quiz app
Android technical quiz appAndroid technical quiz app
Android technical quiz app
 

Semelhante a Project Documentation

Heterogeneous data transfer and loader
Heterogeneous data transfer and loaderHeterogeneous data transfer and loader
Heterogeneous data transfer and loadereSAT Journals
 
Heterogeneous data transfer and loader
Heterogeneous data transfer and loaderHeterogeneous data transfer and loader
Heterogeneous data transfer and loadereSAT Publishing House
 
PROJECT FOR CSE BY TUSHAR DHOOT
PROJECT FOR CSE BY TUSHAR DHOOTPROJECT FOR CSE BY TUSHAR DHOOT
PROJECT FOR CSE BY TUSHAR DHOOTTushar Dhoot
 
Report on Dental treatment & management system
Report on Dental treatment &  management system Report on Dental treatment &  management system
Report on Dental treatment & management system Zakirul Islam
 
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...IRJET - Scrutinizing Attributes Influencing Role of Information Communication...
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...IRJET Journal
 
Hospital management system
Hospital management systemHospital management system
Hospital management systemMehul Ranavasiya
 
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docx
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docxRunning head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docx
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docxjeanettehully
 
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Qazi Maaz Arshad
 
Report on Smart Blood Bank project
Report on Smart Blood Bank projectReport on Smart Blood Bank project
Report on Smart Blood Bank projectk Tarun
 
Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...ertekg
 
Comparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataComparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataIRJET Journal
 
E filling system (report)
E filling system (report)E filling system (report)
E filling system (report)Badrul Alam
 
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERINGIMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERINGijcsit
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringAIRCC Publishing Corporation
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringAIRCC Publishing Corporation
 
Hostel management system (5)
Hostel management system (5)Hostel management system (5)
Hostel management system (5)PRIYANKMZN
 
Data Exchange Design with SDMX Format for Interoperability Statistical Data
Data Exchange Design with SDMX Format for Interoperability Statistical DataData Exchange Design with SDMX Format for Interoperability Statistical Data
Data Exchange Design with SDMX Format for Interoperability Statistical DataNooria Sukmaningtyas
 

Semelhante a Project Documentation (20)

Heterogeneous data transfer and loader
Heterogeneous data transfer and loaderHeterogeneous data transfer and loader
Heterogeneous data transfer and loader
 
Heterogeneous data transfer and loader
Heterogeneous data transfer and loaderHeterogeneous data transfer and loader
Heterogeneous data transfer and loader
 
PROJECT FOR CSE BY TUSHAR DHOOT
PROJECT FOR CSE BY TUSHAR DHOOTPROJECT FOR CSE BY TUSHAR DHOOT
PROJECT FOR CSE BY TUSHAR DHOOT
 
Report on Dental treatment & management system
Report on Dental treatment &  management system Report on Dental treatment &  management system
Report on Dental treatment & management system
 
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...IRJET - Scrutinizing Attributes Influencing Role of Information Communication...
IRJET - Scrutinizing Attributes Influencing Role of Information Communication...
 
Hospital management system
Hospital management systemHospital management system
Hospital management system
 
Prasad_Resume
Prasad_ResumePrasad_Resume
Prasad_Resume
 
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docx
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docxRunning head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docx
Running head NETWORK DIAGRAM AND WORKFLOW1NETWORK DIAGRAM AN.docx
 
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
Cse443 Project Report - LPU (Modern Big Data Analysis with SQL Specialization)
 
FinalReviewReport
FinalReviewReportFinalReviewReport
FinalReviewReport
 
Report on Smart Blood Bank project
Report on Smart Blood Bank projectReport on Smart Blood Bank project
Report on Smart Blood Bank project
 
Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...
 
Comparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big dataComparing and analyzing various method of data integration in big data
Comparing and analyzing various method of data integration in big data
 
Project Report
Project ReportProject Report
Project Report
 
E filling system (report)
E filling system (report)E filling system (report)
E filling system (report)
 
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERINGIMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
IMPORTANCE OF PROCESS MINING FOR BIG DATA REQUIREMENTS ENGINEERING
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
 
Importance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements EngineeringImportance of Process Mining for Big Data Requirements Engineering
Importance of Process Mining for Big Data Requirements Engineering
 
Hostel management system (5)
Hostel management system (5)Hostel management system (5)
Hostel management system (5)
 
Data Exchange Design with SDMX Format for Interoperability Statistical Data
Data Exchange Design with SDMX Format for Interoperability Statistical DataData Exchange Design with SDMX Format for Interoperability Statistical Data
Data Exchange Design with SDMX Format for Interoperability Statistical Data
 

Project Documentation

  • 1. 1 DATA LINEAGE A Major Project report submitted in partial fulfillment of the requirements for award of the degree of Bachelor of Technology in Computer Science and Engineering By K. BHARGAVI Roll No:12011A0507 CH.PRAKYA SRI Roll No: 12011A0529 SHALINI RAINA ROHAN REDDY Roll No:12011A0551 Roll No:11011A0557 Under the esteemed guidance of Dr. J. UJWALA REKHA Asst. Professor Of C.S.E Department of Computer Science and Engineering JNTUH College of Engineering Hyderabad (Autonomous) Kukatpally, Hyderabad - 500 085, Telangana, India
  • 2. 2 Department of Computer Science and Engineering JNTUH College of Engineering Hyderabad (Autonomous) Kukatpally, Hyderabad - 500 085, Telangana, India DECLARATION BY THE CANDIDATE We, K.Bhargavi (12011A0507), Ch.Prakya Sri (12011A0529), Shalini Raina (12011A0551), and Rohan Reddy (11011A0557), hereby declare that the mini project titled “Data Lineage”, carried out under the guidance of Dr. J. Ujwala Rekha, Asst. Professor, is submitted in partial fulfillment of the requirements for the award of Bachelor of Technology in Computer Science and Engineering. This is a record of bonafide work carried out by us and the results produced by us have not been reproduced/copied from any source. The results embodied in this project report have not been submitted to any other University or Institute for the award of any other degree or diploma. K.BHARGAVI Roll No: 12011A0507 CH.PRAKYA SRI Roll No: 12011A0529 SHALINI RAINA ROHAN REDDY Roll No: 12011A0551 Roll No: 11011A0557
  • 3. 3 Department of Computer Science and Engineering JNTUH College of Engineering Hyderabad (Autonomous) Kukatpally, Hyderabad - 500 085, Telangana, India CERTIFICATE BY THE SUPERVISOR This is to certify that the major project report titled “Data Lineage”, being submitted by K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina (12011A0551), and Rohan Reddy (11011A0557), in the Department of Computer Science and Engineering of JNTUH COLLEGE OF ENGINEERING HYDERABAD is a record of bonafide work carried out by them under my guidance and supervision. The results embodied in this project report have not been submitted to any other University or Institute for the award of any other degree or diploma. The results have been verified and found to be satisfactory. Dr. J. Ujwala Rekha Asst. Professor
  • 4. 4 Department of Computer Science and Engineering JNTUH College of Engineering Hyderabad (Autonomous) Kukatpally, Hyderabad - 500 085, Telangana, India CERTIFICATE BY THE HEAD OF THE DEPARTMENT This is to certify that the major project report titled “Data Lineage”, being submitted by K.Bhargavi (12011A0507), Ch. Prakya Sri (12011A0529), Shalini Raina (12011A0551), and Rohan Reddy (11011A0557), in the partial fulfillment of requirements for award of Bachelor in Technology in Computer Science and Engineering. Dr. V. Kamakshi Prasad,
  • 5. 5 ACKNOWLEDGEMENT We take this opportunity to thank all who have rendered their full support to our work. The pleasure, the achievement, the glory, the satisfaction, the reward, the appreciation and the construction of our project cannot be thought without a few, who apart from their regular schedule spared their valuable time for us. This acknowledgement is not just a position of words but also an account of the indictment. We thank our guide, Ms.J. Ujwala Rekha, Asst.Professor for giving us the opportunity to do this project work and for her constant help and guidance. We take the opportunity to express our gratitude to Dr. V. Kamakshi Prasad, Professor and Head of the Department, Department of CSE, JNTUH for giving us the opportunity to do this major project. Finally, we thank JP Morgan Chase and Co. for giving us the opportunity to work on this project and our mentors for guiding and keeping us motivated throughout this process. Also, we are thankful to the faculty of Department of CSE, JNTUH, our friends, and all our family members who with their valuable suggestions and support, directly or indirectly helped us in this project work. K.BHARGAVI Roll No: 12011A0507 CH.PRAKYA SRI Roll No: 12011A0529 SHALINI RAINA ROHAN REDDY Roll No: 12011A0551 Roll No: 11011A0557
  • 6. 6 ABSTRACT Ninety percent (90%) of the world’s data has been created in the last two years alone. This explosion of data is the result of the ever-growing number of systems and automation at all levels in all sizes of organizations. While this data has made it easier to access information in the working world, it has also lead to a new set of problems. Users need “clean” and conformed data to make informed decisions. Lack of trust in data makes users move away from using information systems. The solution to data integrity, uniformity and correctness is matured data governance. And the first step to achieving it is to get a visual on the existing data flow and data lineage. The requirements of this project are:  Code Parsing engine to extract the data elements (i.e. Columns).  Identifying the original source of the data elements.  Identifying the transformation logic to create/populate the data elements.  Graphical interactive interface for visualization. The proposed solution is to introduce a system that handles the SQL queries and maps the attributes to a database schema. With the help of an interactive UI and a SQL parser, we can extract and collect meaningful expressions from the parsed text, using declared
  • 7. 7 combinations of grammar rules and parsed text tokens. This application will help us visualize the complex network of data flows and data dependencies, which in turn will help us define strategies to improve data quality.
  • 8. 8 TABLE OF CONTENTS Abstract 6 Chapter 1 Introduction 11 1.1 Overview Chapter 2 Literature Survey 11 2.1 Introduction 13 2.2 Data Lineage 13 2.3 Scope 14 2.4 Important definitions in Data Lineage 16 Chapter 3 Design 22 3.1 Project Background 22 3.2 Proposed Solution 22 3.3 Requirements List 24 3.3.1 Requirements Description 24 3.4 Assumptions and Dependencies 26 3.5 Risks 26 3.6 UML Diagrams 27 3.6.1 Class Diagram 27 3.6.2 Use Case Diagram 28 3.6.3 Activity Diagram 28
  • 9. 9 3.6.4 Sequence Diagram 30 Chapter 4 Implementation 31 4.1 User Module 31 4.2 Parser 31 4.3 Graphical User Interface 32 Chapter 5 Result 33 Chapter 6 Conclusion and Future Scope 39 Chapter 7 References 41
  • 10. 10 LIST OF FIGURES Number Title Pg No. Figure 3.1 Proposed/To-be Work Flow 23 Figure 3.2 Class Diagram 27 Figure 3.3 Use Case Diagram 28 Figure 3.4 Activity Diagram 29 Figure 3.5 Sequence Diagram 30 Figure 6.1 Login Form 33 Figure 6.2 Home Page 34 Figure 6.3 SQL Input Form 35 Figure 6.4 Graphical Representation of Data 36 Figure 6.5 Expanded graph 37
  • 11. 11 CHAPTER 1 : INTRODUCTION 1.1 Overview With the increase in the amount of data being created every day, it is becoming difficult to govern and maintain such a huge amount to data. Developers and managers are facing problems in business intelligence (BI) and Data Warehouse (DW) environments where the chains of data transformations are long and the complexity of structural changes is high. The management of data integration processes becomes unpredictable and the costs of changes can be very high due to the lack of information about data flows and internal relations of system components. The amount of different data flows and system component dependencies in a traditional data warehouse environment is large. Important contextual relations are coded into data transformation queries and programs (e.g. SQL queries, data loading scripts, open or closed DI system components etc.). Data lineage dependencies are spread between different systems and frequently exist only in program code or SQL queries. This leads to unmanageable complexity, lack of knowledge and a large amount of technical work with uncomfortable consequences like unpredictable results, wrong estimations, rigid administrative and development processes, high cost, lack of flexibility and lack of trust. We need clean and correct data to make informed decisions. Lack of trust in data makes users move away from using information systems. The solution to data integrity, uniformity and
  • 12. 12 correctness is matured data governance. And the first step to achieving it is to get a visual on the existing data flow and data lineage. A visual representation of data lineage helps to track data from its origin to its destination. It explains the different processes involved in the data flow and their dependencies. Metadata management is the key input to capturing enterprise data flow and presenting data lineage. It consists of metadata collection, integration, usage and repository maintenance. It captures enterprise data flow and presents the data lineage in a graphical manner, showing us the flow from source to the destination.
  • 13. 13 CHAPTER 2 : LITERATURE SURVEY 2.1 Introduction A literature survey or literature review means that we read and report on what the literature in the field has to say about our topic or subject. There may be a lot of literature on the topic or there may be a little. Either way, the goal is to show that we have read and understood the positions of other academics who have studied the problem/issue that we are studying and include that in our project. We have done this by comparing and contrasting, simple summarization. 2.2 Data Lineage Representation of Data Lineage broadly depends on scope of the metadata management and reference point of interest. Data Lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leads to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end to end lineage for a reference point that provides complete audit trail of that data point of interest from source to its final destination. As the data points or hops increases, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view would be to able to simplify the view by temporarily masking unwanted peripheral data points. Tools that have the masking feature enables
  • 14. 14 scalability of the view and enhances analysis with best user experience for both Technical and business users alike.[4] 2.3 Scope Scope of data lineage determines the volume of metadata required to represent its data lineage. Usually, Data Governance, and Data Management determines the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes, and critical data elements of the organization. Data Lineage provides the audit trail of the data points at the lowest granular level, but presentation of the lineage may be done at various zoom levels to simplify the vast information, similar to the analytic web maps. It can be visualized at various levels based on the granularity of the view. At a very high level data lineage provides what systems the data interacts before it reaches destination. As the granularity increases it goes up to the data point level where it can provide the details of the data point and its historical behavior, attribute properties, and trends and data quality of the data passed through that specific data point in the data lineage. Data Governance plays a key role in metadata management for guidelines, strategies, policies, implementation. Data quality, and data management helps in enriching the data lineage with more business value. Even though the final representation of data lineage is provided in
  • 15. 15 one interface but the way the metadata is harvested and exposed to the data lineage User Interface (UI) could be entirely different. Thus, Data lineage can be broadly divided into three categories based on the way metadata is harvested: Data lineage involving software packages for structured data, Programming Languages, and Big Data. Data lineage expects to view at least the technical metadata involving the data points and its various transformations. Along with technical data, data lineage may enrich the metadata with their corresponding data quality results, reference data values, data models, business vocabulary, people, programs, and systems linked to the data points and transformations. Masking feature in the data lineage visualization allows the tools to incorporate all the enrichments that matter for the specific use case. Metadata normalization may be done in data lineage to represent disparate systems into one common view. Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis.
  • 16. 16 2.4 Important definitions in Data lineage 1. Metadata Metadata describes other data. It provides information about a certain item's content. For example, an image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document. Metadata is essential for understanding information stored in data warehouses and has become increasingly important in XML-based Web applications. The main purpose of metadata is to facilitate in the discovery of relevant information, more often classified as resource discovery. Metadata assists in resource discovery by allowing resources to be found by relevant criteria, identifying resources, bringing similar resources together, distinguishing dissimilar resources, and giving location information. It is used to summarize basic information about data which can make tracking and working with specific data easier.[1] Some examples include:  Means of creation of the data  Purpose of the data  Time and date of creation  Creator or author of the data
2. Data Warehouse

A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. It stores current and historical data and is used for creating analytical reports for knowledge workers throughout the enterprise. Types of systems:

● A data mart is a simple form of data warehouse that is focused on a single subject (or functional area); hence, it draws data from a limited number of sources, such as sales, finance, or marketing. Data marts are often built and controlled by a single department within an organization. The sources could be internal operational systems, a central data warehouse, or external data. Given that data marts generally cover only a subset of the data contained in a data warehouse, they are often easier and faster to implement.

● Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used in data mining. OLAP databases store
aggregated, historical data in multidimensional schemas (usually star schemas). OLAP systems typically have a data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are roll-up, drill-down, and slicing and dicing.

● Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data.

3. Data Management

Data management is the development and execution of architectures, policies, practices, and procedures in order to manage the information lifecycle needs of an enterprise in an effective manner. Data lifecycle management (DLM) is a policy-based approach to managing the flow of an information system's data throughout its lifecycle: from creation and initial storage to the time when it becomes obsolete and is deleted. Several vendors offer DLM products, but effective data management involves well-thought-out procedures and adherence to best practices as well as applications.
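The policy-based character of DLM can be sketched as follows; this is an illustrative fragment only, and the method name and the seven-year threshold are hypothetical examples rather than rules from any vendor's product.

    import java.time.Duration;
    import java.time.Instant;

    public class RetentionDemo {
        // Hypothetical policy-based lifecycle rule: data older than the
        // retention period is flagged as obsolete and eligible for deletion.
        static boolean isObsolete(Instant createdAt, Duration retention) {
            return createdAt.plus(retention).isBefore(Instant.now());
        }

        public static void main(String[] args) {
            Duration sevenYears = Duration.ofDays(7 * 365);          // example threshold
            Instant created = Instant.parse("2009-01-01T00:00:00Z"); // example record age
            System.out.println("obsolete = " + isObsolete(created, sevenYears));
        }
    }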
There are various approaches to data management. Master data management (MDM), for example, is a comprehensive method of enabling an enterprise to link all of its critical data to one file, called a master file, that provides a common point of reference. The effective management of corporate data has grown in importance as businesses are subject to an increasing number of compliance regulations. Furthermore, the sheer volume of data that must be managed by organizations has increased so markedly that it is sometimes referred to as big data.[5]

4. Data Cleaning

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information, or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems, or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, the consolidation of different data representations and the elimination of duplicate information become necessary.
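A minimal sketch of the two steps just mentioned, consolidating representations and eliminating duplicates, might look like this in Java; the normalization rule shown (trimming and lower-casing names) is a deliberately simple stand-in for real cleansing logic.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class DataCleaning {
        // Consolidate different representations of the same value.
        static String normalize(String raw) {
            return raw.trim().toLowerCase();
        }

        public static void main(String[] args) {
            List<String> sources = List.of("  John Smith", "john smith", "JOHN SMITH ");

            // Eliminate duplicates after normalization; LinkedHashSet keeps order.
            Set<String> cleaned = sources.stream()
                    .map(DataCleaning::normalize)
                    .collect(Collectors.toCollection(LinkedHashSet::new));

            System.out.println(cleaned); // [john smith]
        }
    }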
5. Data Quality

High-quality data needs to pass a set of quality criteria, including:

● Validity: the degree to which the measures conform to defined business rules or constraints.
● Decleansing: detecting errors and syntactically removing them so that cleaner data can be processed.
● Accuracy: the degree of conformity of a measure to a standard or a true value.
● Completeness: the degree to which all required measures are known.
● Consistency: the degree to which a set of measures is equivalent across systems.
● Uniformity: the degree to which a set of data measures is specified using the same units of measure in all systems.
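By way of illustration, validity and completeness checks over a single record could be sketched as below; the Customer fields and the age constraint are hypothetical examples of business rules, not rules taken from this project.

    public class QualityCheck {
        // Hypothetical record under inspection.
        static class Customer {
            String name;  // required field
            Integer age;  // business rule: 0 <= age <= 120
        }

        // Completeness: all required measures are known.
        static boolean isComplete(Customer c) {
            return c.name != null && !c.name.isBlank() && c.age != null;
        }

        // Validity: measures conform to defined business rules.
        static boolean isValid(Customer c) {
            return isComplete(c) && c.age >= 0 && c.age <= 120;
        }

        public static void main(String[] args) {
            Customer c = new Customer();
            c.name = "Jane";
            c.age = 34;
            System.out.println("complete=" + isComplete(c) + ", valid=" + isValid(c));
        }
    }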
6. User Interface (UI)

The user interface is one of the most important parts of any program because it determines how easily users can make the program do what they want. It is the space where interactions between humans and machines occur. The goal of this interaction is to allow effective operation and control of the machine from the human end, whilst the machine simultaneously feeds back information that aids the operator's decision-making process.

Generally, the goal of user interface design is to produce a user interface that makes it easy (self-explanatory), efficient, and enjoyable (user-friendly) to operate a machine in the way that produces the desired result. This generally means that the operator needs to provide minimal input to achieve the desired output, and also that the machine minimizes undesired outputs to the human. The user interface can arguably include the total "user experience," which may include the aesthetic appearance of the device, the response time, and the content that is presented to the user within the context of the user interface.

In our project we use a web user interface (WUI) that accepts input and provides output by generating web pages, which are transmitted via the Internet and viewed by the user in a web browser. This kind of implementation utilizes Java, JavaScript, Bootstrap, AJAX, and similar technologies to provide real-time interaction, eliminating the need to refresh a traditional HTML-based page.
CHAPTER 3: DESIGN

3.1 Project Background

The objective of this project is to keep track of data from its origin to its destination with the help of metadata. The metadata contains information about the origins of a particular data set and can be granular enough to define information at the attribute level. It maintains auditable information about users, the location of the data, the applications and processes that create, delete, or change data, the exact timestamp of each change, and the authorization that was used to perform these actions.

The number of different data flows and system component dependencies in a traditional data warehouse environment is large. Important contextual relations are coded into data transformation queries and programs (e.g., SQL queries). Data lineage dependencies are spread between different systems and frequently exist only in program code or SQL queries. This leads to unmanageable complexity, lack of knowledge, and unpredictable results.

3.2 Proposed Solution

The proposed solution is to introduce a system that handles SQL queries and maps their attributes to a database schema. With the help of an interactive UI and a SQL parser, we can extract and collect meaningful expressions from the parsed text, using declared combinations of grammar rules and parsed text tokens. This application will help us visualize the complex network of data flows and data dependencies, which in turn will help us define strategies to improve data quality.
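To illustrate the kind of mapping the system extracts, consider the following fragment; the table and column names are invented purely for the example.

    public class LineageExample {
        public static void main(String[] args) {
            // An illustrative transformation query (names are hypothetical).
            String sql =
                "INSERT INTO report.daily_sales (sale_date, total_amount) " +
                "SELECT o.order_date, SUM(o.amount) " +
                "FROM shop.orders o GROUP BY o.order_date";

            // Column-level lineage the parser should derive from this query:
            //   shop.orders.order_date  -->  report.daily_sales.sale_date
            //   shop.orders.amount      -->  report.daily_sales.total_amount  (via SUM)
            System.out.println(sql);
        }
    }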
Proposed/To-be Work Flow

Figure 3.1 Proposed/To-be Work Flow
3.3 Requirements List

ID      Requirement Name
1001    Code parsing engine to extract the data elements (i.e., columns).
1002    Identifying the original source of the data elements.
1003    Graphical interactive interface for visualization.

Table 3.1 Requirements List

3.3.1 Requirements Description

The parser is an essential component of our system; it takes SQL queries and analyzes them.

Requirement ID: 1001        Version: 1.0        Created On: 23-FEB-2016
Requirement Name: Code parsing engine to extract the data elements (i.e., columns).
Priority: High
Description: The application requires an appropriate parsing engine to extract the data elements from the database, handle the SQL queries, and map the attributes to the database schema. Meaningful expressions can then be collected from the parsed data in the graphical user interface.

Table 3.2 Requirement 1001
SQL queries are taken as input data, which are then parsed into XML.

Requirement ID: 1002        Version: 1.0        Created On: 23-FEB-2016
Requirement Name: Identifying the original source of the data elements (database).
Priority: High
Description: The database stores the data elements that are sent to the parsing engine, where the queries are decoded and mapped to the respective attributes.

Table 3.3 Requirement 1002

The GUI is completely abstracted from the implementation to offer intuitive interaction to users.

Requirement ID: 1003        Version: 1.0        Created On: 23-FEB-2016
Requirement Name: Graphical interactive interface for visualization.
Priority: High
Description: A GUI is required to visualize the complex network of data flows and data dependencies, creating graphic illustrations of the information in an efficient manner.

Table 3.4 Requirement 1003

3.4 Assumptions and Dependencies

● We assume that the input data are of a type that our tool supports.
● The connection established to the data warehouse does not crash or get disconnected.
● The General SQL Parser parses SQL queries to XML data only.

3.5 Risks

● The user interface through which we visualize data lineage may lose its connection with the database server.
● The web services that we provide may be hosted at a location different from the repository location.
● Tracking lineage from mainframe applications and programs does not give the exact workflow.
● We may receive files that are not compatible with our tool.
3.6 Unified Modeling Language Diagrams

3.6.1 Class Diagram

The class diagram is a static diagram; it represents the static view of an application. Class diagrams are used not only for visualizing, describing, and documenting different aspects of a system but also for constructing executable code of the software application.

Figure 3.2 Class Diagram
3.6.2 Use Case Diagram

A use case diagram shows a set of use cases and actors (a special kind of class) and their relationships. Use case diagrams address the static use case view of a system. These diagrams are especially important in organizing and modeling the behavior of a system. This UML diagram shows the relationships between the various actors, i.e., the user, the parser, and the UI.

Figure 3.3 Use Case Diagram

3.6.3 Activity Diagram

Activity diagrams are graphical representations of workflows of stepwise activities and actions, with support for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams are intended to model both computational and organizational processes (i.e., workflows).
These diagrams show the overall flow of control. They deal with all types of flow control by using different elements such as fork and join.

Figure 3.4 Activity Diagram

3.6.4 Sequence Diagram

A sequence diagram is an interaction diagram that shows how processes operate with one another and in what order. It is a construct of a message sequence chart. A sequence diagram shows object interactions arranged in time sequence.[6]
CHAPTER 4: IMPLEMENTATION

4.1 User Module

The User Module consists of a login form that takes input from the user. The user provides login credentials, such as a username and password, in order to connect to the webpage where the application runs. The home page gives a basic understanding of the data lineage concept and provides options to explore the application.

4.2 Parser

An essential component of our system is the parser, which is used to gather information from mappings. In computer technology, a parser is a program, usually part of a compiler, that receives input in the form of sequential source program instructions, interactive online commands, markup tags, or some other defined interface, and breaks it up into parts. In order to process mapping rules, we developed a parser that is capable of extracting components from SQL queries generically, based on their semantic meaning and relation to each other. This parser is developed using General SQL Parser.[2]

General SQL Parser (Java version) is a valuable tool because it provides an in-depth and detailed analysis of SQL scripts for various databases, including SQL Server. The parser can extract mapping components based on their semantic meaning, and it recognizes the context in which components are used, as sketched below.
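The following is a minimal sketch of invoking the parser, based on the vendor's published Java API as we understand it (the TGSqlParser entry point and the EDbVendor dialect selector); the query text itself is only a placeholder.

    import gudusoft.gsqlparser.EDbVendor;
    import gudusoft.gsqlparser.TGSqlParser;

    public class ParseDemo {
        public static void main(String[] args) {
            // Select the SQL dialect; MySQL is our default database.
            TGSqlParser parser = new TGSqlParser(EDbVendor.dbvmysql);

            // Placeholder query; in the application this comes from the
            // input form or an uploaded document.
            parser.sqltext = "SELECT o.order_date, SUM(o.amount) "
                           + "FROM orders o GROUP BY o.order_date";

            // parse() returns 0 on success.
            if (parser.parse() == 0) {
                System.out.println("Statements parsed: " + parser.sqlstatements.size());
            } else {
                System.out.println("Syntax error: " + parser.getErrormessage());
            }
        }
    }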
By providing a generic interface for this parser, we built a basis that can be used to query for any element in the mapping structure.

The user is directed to the "Fill the details to know data lineage" form, where various database options are provided; MySQL is used here by default. The user can either upload an SQL query document or type the query in the input box. Upon clicking "Send", the user is directed to a graphical user interface.

4.3 Graphical User Interface

The interface is completely abstracted from the implementation to offer intuitive interaction to users. It uses the JSON files produced from the parser output to display graphs using D3.js (Data-Driven Documents).[3] D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It makes use of the widely implemented SVG, HTML5, and CSS standards. The graphs consist of a sequence of mappings defined as a data flow, i.e., the path that data items take in the system from their respective sources to the final result. Consequently, the application can be used by any business user who has only minimal knowledge of mapping structures. In combination with the complete abstraction from the actual data, this is a big step forward towards bridging the gap between business users and IT systems.
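D3 graph layouts conventionally consume a nodes-and-links structure; the fragment below sketches, in Java, the kind of JSON the UI might be fed. The exact field names follow common D3 force-layout examples and are an assumption, not a specification of our files.

    public class GraphJson {
        public static void main(String[] args) {
            // Hypothetical lineage graph in the nodes/links shape commonly
            // used by D3 force-layout examples: each node is a data point,
            // each link a transformation from source to target.
            String json =
                "{\n" +
                "  \"nodes\": [ {\"id\": \"orders.amount\"},\n" +
                "              {\"id\": \"daily_sales.total_amount\"} ],\n" +
                "  \"links\": [ {\"source\": \"orders.amount\",\n" +
                "               \"target\": \"daily_sales.total_amount\",\n" +
                "               \"transform\": \"SUM\"} ]\n" +
                "}";
            System.out.println(json);
        }
    }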
CHAPTER 5: RESULT

Figure 5.1 Login Form for the User
Figure 5.2 Data Lineage Home Page
Figure 5.3 SQL Input Form
Figure 5.4 Graphical Representation of Data
CHAPTER 6: CONCLUSION AND FUTURE SCOPE

Conclusion

As the volume of data multiplies, information about that data becomes even more critical. The data lineage methodology works like an X-ray for the data flow in an organization. It captures information from source to destination, along with the various processes and rules involved, and shows how the data is used. This knowledge about what data is available, and about its quality, correctness, and completeness, leads to a mature data governance process. The metadata contains information about the origins of a particular data set and can be granular enough to define information at the attribute level. It maintains auditable information about users, the location of the data, the applications and processes that create, delete, or change data, the exact timestamp of each change, and the authorization that was used to perform these actions.

Through this project we were able to keep track of our data from its source to its destination. The parser takes the SQL queries and generates XML output. This XML data is then converted to JSON objects and represented graphically using D3 graphs. The resulting graph helps us visualize the complex network of data flows and data dependencies, which in turn helps us design strategies to improve data quality.
Future Scope

Our team is scheduled to meet leaders and mentors from JPMorgan Chase & Co. this month to discuss further extensions of our project. Considering the growing need for data lineage in the IT and banking industries, this line of work is very valuable to the firm. Our project work has been appreciated, and the company may use it as a base model and extend the scope and implementation of the Data Lineage project.
CHAPTER 7: REFERENCES

[1] TechTarget, http://www.techtarget.com
[2] General SQL Parser, http://www.sqlparser.com/
[3] D3.js: Data-Driven Documents, https://d3js.org/
[4] Impact Analysis and Data Lineage, http://www.dlineage.com/impact-analysis-data-lineage.html
[5] Raghu Ramakrishnan and Johannes Gehrke, "Database Management Systems", 3rd Edition, McGraw-Hill.
[6] Roger S. Pressman, "Software Engineering: A Practitioner's Approach", 6th Edition, McGraw-Hill International Edition.