COMBINING DATA LAKE AND DATA WRANGLING
FOR ENSURING DATA QUALITY IN CRIS
15th International Conference on Current Research Information Systems (CRIS2022)
Dubrovnik, Croatia, May 12-14, 2022
Otmane Azeroual1,
Joachim Schöpfel2,
Dragan Ivanovic3,
Anastasija Nikiforova4,5
(1) German Centre for Higher Education Research and Science Studies (DZHW), Germany
(2) GERiiCO-Labor, University of Lille, France
(3) University of Novi Sad, Serbia
(4) University of Tartu, Institute of Computer Science, Estonia
(5) European Open Science Cloud Task Force «FAIR metrics and data quality», Belgium
BACKGROUND AND MOTIVATION
Today, billions of data sources continuously generate, collect, process, and exchange data. With the rapid increase in
the number of devices and information systems in use, the amount and variety of data are increasing. This is also the
case for the research / scientific domain.
Researchers, as the end-users of RIS, should be able to integrate ever-increasing volumes of data into their
institutional database, such as a Current Research Information System (CRIS), regardless of the source, format or
amount/size of the research information; data quality, flexibility and scalability in connecting and processing
different data sources are crucial here.
➔ An effective mechanism should be employed to ensure faster value creation from these data.
DATA LAKE + DATA WRANGLING
This study sets out the concept of a data lake combined with a data wrangling process to be used in CRIS to clean
up data from heterogeneous data sources as they are ingested and integrated.
BACKGROUND AND MOTIVATION
DATA LAKE
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
Image source: https://towardsdatascience.com/augment-your-data-lake-analytics-with-snowflake-b417f1186615
➢ A data lake provides a scalable platform for storing and processing large amounts of research data from various
sources in their original raw format, regardless of their type, i.e. structured or unstructured data such as text,
numeric data, images, video, etc.
➢ The raw data are not cleaned, validated, or transformed ➔ they remain original data in their original format.
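Below is a minimal sketch of what such raw ingestion can look like in practice, assuming a simple file-based landing zone; the folder layout, helper name and metadata fields are illustrative, not part of the presented architecture.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("data_lake/raw")  # illustrative landing-zone folder

def ingest_raw(source_file: str, source_system: str) -> Path:
    """Copy a file into the data lake unchanged and record basic metadata next to it."""
    src = Path(source_file)
    target_dir = LANDING_ZONE / source_system / datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / src.name
    shutil.copy2(src, target)  # no cleaning, validation or transformation

    metadata = {
        "source_system": source_system,
        "original_name": src.name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    target.with_name(target.name + ".meta.json").write_text(json.dumps(metadata, indent=2))
    return target

# Example (hypothetical export file):
# ingest_raw("exports/publications_2022.csv", source_system="repository_export")
```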
Image source: https://www.ecloudvalley.com/what-is-datalake-and-datawarehouse/
➢ The concept of a data lake allows a variety of data to be stored within a single storage repository,
BUT
➢ dirty data need to be cleaned up and enriched in a pre-processing step, for which data wrangling is well suited.
➢ The goal is to convert complex data types and data formats into structured data without programming effort ➔
users should be able to prepare and transform their research information without needing ETL tools or
programming-language skills; the required transformations should be suggested automatically after the data are
read, based on machine learning algorithms, which greatly speeds up the process.
➢ When storing data / research information, the completeness of the data and the reduction of the cycle time
between data generation and availability are important.
➢ Because no pre-processing takes place at ingestion, data supply is not slowed down and no data are lost.
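As an illustration of the automatic-suggestion idea, a minimal sketch of how transformations could be proposed after reading the data; simple heuristics stand in here for the machine-learning component, and the thresholds and example columns are made up.

```python
import pandas as pd

def suggest_transformations(df: pd.DataFrame) -> list[str]:
    """Suggest simple cleaning/typing steps per column based on the observed values."""
    suggestions = []
    for col in df.columns:
        series = df[col].dropna().astype(str)
        if series.empty:
            suggestions.append(f"{col}: column is empty -> consider dropping it")
            continue
        if pd.to_numeric(series, errors="coerce").notna().mean() > 0.9:
            suggestions.append(f"{col}: mostly numeric -> cast to a numeric type")
        elif pd.to_datetime(series, errors="coerce").notna().mean() > 0.9:
            suggestions.append(f"{col}: mostly dates -> parse as datetime")
        if df[col].isna().mean() > 0.3:
            suggestions.append(f"{col}: >30% missing values -> decide on imputation or removal")
        if series.str.strip().ne(series).any():
            suggestions.append(f"{col}: leading/trailing whitespace -> strip values")
    return suggestions

# Example with made-up research-information columns
df = pd.DataFrame({"year": ["2020", "2021 ", "n/a"], "doi": ["10.1/x", None, "10.1/y"]})
print("\n".join(suggest_transformations(df)))
```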
DATA WRANGLING
Image source: https://pediaa.com/what-is-the-difference-between-data-wrangling-and-data-cleaning/
With the amount of data and the number of data sources rapidly growing, it is increasingly essential that the large
amounts of available data are organized for analysis.
Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.
Source: https://monkeylearn.com/blog/data-wrangling/, https://www.altair.com/what-is-data-wrangling/
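A minimal illustration of "cleaning and unifying", assuming two hypothetical publication exports with inconsistent column names and formatting (all values are invented):

```python
import pandas as pd

# Two hypothetical, inconsistently formatted publication exports
a = pd.DataFrame({"Title ": ["Data Lakes in CRIS  "], "Year": ["2021"], "DOI": ["10.1000/ABC"]})
b = pd.DataFrame({"title": ["data lakes in cris"], "year": [2021], "doi": ["10.1000/abc "]})

def unify(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns=lambda c: c.strip().lower())  # harmonize column names
    df["title"] = df["title"].str.strip()
    df["doi"] = df["doi"].str.strip().str.lower()
    df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
    return df

# Unify both sources and drop records that describe the same publication
merged = pd.concat([unify(a), unify(b)]).drop_duplicates(subset="doi")
print(merged)
```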
DATA LAKE + DATA WRANGLING
=
DATA QUALITY IN CRIS
In this study:
➢ an architectural model is first designed and specified, which analyzes the research information, adjusts it, and transforms it for integration into CRIS;
➢ a data lake makes both structured and unstructured data available in a reliable, trustworthy, secure and controlled way;
➢ the data wrangling process is used to verify and improve the quality of the data, which also protects data from misuse ➔ data are properly
updated, retained, and eventually deleted according to the stage of their lifecycle.
The data wrangling process consists of several sequential steps. Depending on the IS and the desired or required target quality*, these individual
steps should be carried out several times ➔ data wrangling is a continuous process that is repeated at regular intervals.
SEVERAL ASPECTS AFFECTING A DATA LAKE
➢ Metadata: describe a dataset in more detail, containing data about the origin, structure and content of the data, plus sorting, filtering or
categorizing properties; they are also used for system management and administration.
➢ Data mapping: describes the context of the data ➔ an integration map, i.e. a detailed specification of which application data from which data
sources are linked / associated with which characteristics (mostly metadata).
➢ Data lake context: describes the higher-level use case on which the data lake is based ➔ the selection of the required data sources becomes
more targeted. This avoids misuse of the data lake as a data swamp!
➢ Data context: the individual datasets and their context, so that they can be better classified for analysis purposes. The context of a record can
be its data origin, categorization, or another contextual feature in the metadata.
➢ Processing logging: refers to the raw data processing that takes place in the data lake; the data record and its metadata are manipulated in the
process ➔ of particular interest to data analysts for analyzing data lake usage per data set and use case.
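To make these aspects concrete, here is a sketch of an illustrative catalog record that captures them for each ingested dataset; the class and field names are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataLakeEntry:
    """Illustrative catalog record combining the aspects listed above."""
    dataset_id: str
    origin: str                       # metadata: where the data come from
    structure: str                    # metadata: e.g. "csv", "json", "free text"
    mapped_attributes: dict[str, str] = field(default_factory=dict)  # data mapping
    use_case: str = ""                # data lake context: higher-level use case
    data_context: str = ""            # data context: categorization of the dataset
    processing_log: list[str] = field(default_factory=list)          # processing logging

entry = DataLakeEntry(
    dataset_id="pub-2022-001",
    origin="institutional repository export",
    structure="csv",
    mapped_attributes={"doi": "publication.doi", "aut_name": "person.full_name"},
    use_case="reporting on publication output",
    data_context="publications",
)
entry.processing_log.append("2022-05-01: ingested raw export, 12,430 records")
```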
➢ Analog data: data sources automatically generate data in a specific, predefined and therefore known data format. Due to the automatic generation, they
accumulate in very large amounts and are mostly repeated / duplicated. For this reason, they are usually stored in tabular form in so-called "log tapes".
➢ Application data: also have a known structure, but differ significantly from analog data in their origin - analog data typically represent physical
measurement data, whereas application data arise during the operation and transactions of an application (e.g. transmitted system data or analysis data).
So-called "records", characterized by their uniform / homogeneous structure, are the common storage solution for these data.
A data record usually consists of a key attribute K, an index attribute I, and other predefined attributes A (a minimal illustration follows after this list).
Depending on the data origin and the data type of the application data, the predefined attributes may differ from each other. This application data
structure follows DBMS conventions.
➢ Text-based data: also closely related to the application, but stored as separate files with metadata. A transformation is required before these data can
be processed further. The process of converting them into analytically processable data is called textual disambiguation.
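A minimal illustration of such an application-data record with a key attribute K, an index attribute I and predefined attributes A (all attribute names and values are invented):

```python
from typing import Any, NamedTuple

class ApplicationRecord(NamedTuple):
    """Record with a key attribute K, an index attribute I and predefined attributes A."""
    key: str                    # K: unique identifier of the record
    index: int                  # I: position/sequence used for fast lookup
    attributes: dict[str, Any]  # A: predefined, application-specific attributes

record = ApplicationRecord(
    key="proj-0421",
    index=421,
    attributes={"title": "Open Science Monitor", "start_year": 2021, "budget_eur": 150000},
)
```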
DATA WRANGLING
➢ In the context of research information, data wrangling refers to the process of identifying, extracting, preparing and
integrating data into a database system such as CRIS.
➢ Data wrangling eliminates low-quality data, i.e. redundant, incomplete, inaccurate or incorrect data, etc., in order
to preserve only high-quality research information from which reliable and value-adding knowledge can be obtained.
➢ This adjusted research information is then entered into the appropriate target CRIS to be used in further phases of
the analysis (e.g. by analytical applications), protected from unauthorized access by access control.
➢ This should minimize the effort of analyzing and enriching large volumes of data and metadata and deliver far-reaching
added value in information procurement for staff, developers and end users of the CRIS.
Image source: https://towardsdatascience.com/data-wrangling-raw-to-clean-transformation-b30a27bf4b3b
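A sketch of how low-quality records could be filtered out before research information reaches the CRIS; the required fields, thresholds and example data are assumptions for illustration only.

```python
import pandas as pd

REQUIRED_FIELDS = ["doi", "title", "year"]  # assumption: minimal publication fields for CRIS

def keep_high_quality(records: pd.DataFrame) -> pd.DataFrame:
    """Drop redundant, incomplete and implausible publication records."""
    df = records.drop_duplicates(subset="doi")   # redundant entries
    df = df.dropna(subset=REQUIRED_FIELDS)       # incomplete entries
    df = df[df["year"].between(1900, 2030)]      # implausible years
    return df

raw = pd.DataFrame({
    "doi":   ["10.1/x", "10.1/x", None,  "10.1/z"],
    "title": ["A",      "A",      "B",   "C"],
    "year":  [2021,     2021,     2020,  3020],
})
print(keep_high_quality(raw))  # only the first record survives
```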
The data wrangling process (steps of the process are indicated by numbers) to prepare research information and integrate it into CRIS
Select data: the required data records are identified in different data sources. When selecting data, a record is evaluated by its value → if there is added
value, the availability and terms of use of the data, and of subsequent data from this data source, are checked.
Structuring: in most cases, there is little or no structure in the data ➔ change the structure of the data for easier accessibility.
Cleaning: almost every dataset contains some outliers that can skew the analysis results ➔ the data are extensively cleaned for better analysis (processing
of null values, removal of duplicates and special characters, and standardization of the formatting to improve data consistency).
Enrichment: the data need to be enriched - an inventory of the data set and a strategy for improving it by adding additional data should be carried out.
The data set is enriched with various metadata:
✓ Schematic metadata provide basic information about the processing and ingestion of data ➔ the data wrangler analyzes / parses data records according to an
existing schema.
✓ Conversation metadata are exchanged between accessing instances with the idea of documenting information obtained during the processing or analysis of
these data for subsequent users.
The recognized peculiarities / features of a data set can be saved.
*Data lake: the physical transfer of data into the data lake. Although the data are prepared using metadata, the record itself is not pre-processed.
The goal is to avoid a data swamp ➔ estimate the value of the data and decide on their lifespan depending on the data quality and their interconnectedness
with the rest of the DB.
Analyses are not performed directly in the data lake, but only on the relevant data. To be able to use the data, the requester needs the appropriate access
rights ➔ the data wrangler performs the data extraction; however, general viewing and exploration of the data should be possible directly in the data lake.
*Data governance: the contents of the data lake and the technologies and hardware used are subject to change ➔ an audit is required to take care of the
maintenance of the data lake. The main principles / guidelines and measures that regulate data maintenance, coordinate all processes in the data lake, and
define responsibilities are established.
Validating: the data are checked one more time before they are integrated into the target CRIS, to identify problems with the quality and consistency of the
data, or to confirm that the transformation has been successful.
Verify that the attribute values are correct and conform to the syntactic and distribution constraints, thus ensuring high data quality, AND document every
change so that older versions can be restored or the history of changes can be viewed. If new data are generated during data analysis in CRIS, they can be
re-included in the data lake**.
**New data go through the data wrangling process again, starting with step 2, i.e. structuring the data.
At the end of this process, research information can be used by analytical applications and protected from unauthorized access by access control
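A sketch of the validating step described above: syntactic checks (e.g. DOI and ORCID patterns) and a simple distribution check before integration into the target CRIS; the patterns, thresholds and field names are illustrative.

```python
import pandas as pd

DOI_PATTERN = r"^10\.\d{4,9}/\S+$"                    # illustrative syntactic constraint
ORCID_PATTERN = r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$"

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality findings instead of loading bad data into CRIS."""
    findings = []
    bad_doi = ~df["doi"].fillna("").str.match(DOI_PATTERN)
    if bad_doi.any():
        findings.append(f"{bad_doi.sum()} record(s) with syntactically invalid DOI")
    bad_orcid = ~df["orcid"].fillna("").str.match(ORCID_PATTERN)
    if bad_orcid.any():
        findings.append(f"{bad_orcid.sum()} record(s) with invalid ORCID")
    # distribution constraint: publication years should not all collapse onto one value
    if (df["year"].value_counts(normalize=True) > 0.9).any():
        findings.append("more than 90% of records share the same year -> check the source")
    return findings

df = pd.DataFrame({"doi": ["10.1000/xyz", "not-a-doi"],
                   "orcid": ["0000-0002-1825-0097", "1234"],
                   "year": [2021, 2021]})
print(validate(df))
```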
USE-CASE
✓ Data formatting
✓ Correction of incorrect data (e.g. address data)
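A minimal sketch of this use case, harmonizing mixed date formats and correcting known misspellings in address data via a lookup table; all values and mappings are invented.

```python
import pandas as pd
from dateutil import parser

# Made-up affiliation records with mixed date formats and misspelled city names
df = pd.DataFrame({
    "start_date": ["01.03.2021", "2021-03-15", "15/04/2021"],
    "city": ["Berln", "Novi Sad", "Tartuu"],
})

# Data formatting: bring all dates into a single ISO representation
df["start_date"] = df["start_date"].apply(lambda s: parser.parse(s, dayfirst=True).date())

# Correction of incorrect data: fix known misspellings in address data
CITY_FIXES = {"Berln": "Berlin", "Tartuu": "Tartu"}
df["city"] = df["city"].replace(CITY_FIXES)
print(df)
```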
USE-CASE
✓ Normalization and standardization (e.g. phone numbers, titles, etc.)
✓ Structuring (e.g. separation of names into titles, first and last names, etc.)
✓ Identification and cleaning of duplicates
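A minimal sketch covering these three tasks with deliberately simplistic rules; the title list, phone-normalization rule and sample data are invented.

```python
import re
import pandas as pd

people = pd.DataFrame({
    "name":  ["Prof. Dr. Anna Example", "Anna Example", "Dr. Max Mustermann"],
    "phone": ["+49 (0)30 123-456", "0049 30 123456", "030/98 76 54"],
})

TITLES = {"Prof.", "Dr."}

def split_name(full_name: str) -> pd.Series:
    """Separate leading academic titles, first name and last name."""
    parts = full_name.split()
    titles = [p for p in parts if p in TITLES]
    rest = [p for p in parts if p not in TITLES]
    return pd.Series({"title": " ".join(titles),
                      "first_name": " ".join(rest[:-1]),
                      "last_name": rest[-1] if rest else ""})

def normalize_phone(raw: str) -> str:
    """Keep digits only and normalize the German country-code prefix (illustrative rule)."""
    digits = re.sub(r"\D", "", raw)
    digits = re.sub(r"^(0049|49)", "", digits).lstrip("0")
    return "+49 " + digits

people = people.join(people["name"].apply(split_name))   # structuring
people["phone"] = people["phone"].apply(normalize_phone) # normalization
people = people.drop_duplicates(subset=["first_name", "last_name"])  # duplicate cleaning
print(people)
```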
USE-CASE: TRIFACTA FOR DATA WRANGLING
CONCLUSIONS
✓ As the volume of research information and the number of data sources increase, the prerequisites for data to be complete, findable, comprehensively
accessible, interoperable and reusable (compliant with the FAIR principles), but also securely stored, structured, and networked in order to be useful,
remain critical but at the same time become more difficult to fulfill ➔ data wrangling can be seen as a valuable asset in ensuring this.
✓ The goal is to counteract the growing number of data silos that isolate research information from different areas of the organization. Once
successfully implemented, data can be retrieved, managed and made available and accessible to everyone within the entity.
✓ A data lake and data wrangling can be implemented to improve and simplify IT infrastructure and architecture, governance and compliance.
They provide valuable support for predictive analytics and self-service analysis by making it easier and faster to access large amounts of data
from multiple sources.
✓ The proper organization of the data lake makes it easier to find the research information the user needs. Managing data that have already
been pre-processed results in increased efficiency and cost savings, as preparing data for further use is the most resource-consuming part of
data analysis. By providing pre-processed research information, users with limited or no experience in data preparation (a low level of data
literacy) can be supported, and analyses can be carried out faster and more accurately.
EOSC Data Quality survey
THANK YOU FOR YOUR ATTENTION!
QUESTIONS?
For more information, see ResearchGate,
anastasijanikiforova.com
For questions or any other queries,
contact us via email - Nikiforova.Anastasija@gmail.com,
azeroual@dzhw.eu