SlideShare uma empresa Scribd logo
1 de 33
Data Science Seminar: When, why and How? The important of Business Intelligence
9.11.2022, Tartu, Estonia
DATA LAKE OR DATA WAREHOUSE?
DATA CLEANING OR DATA WRANGLING?
HOW TO ENSURE THE QUALITY OF
YOUR DATA?
Anastasija Nikiforova
Assistant Professor of Information Systems, Faculty of Science and Technology,
Institute of Computer Science, Chair of Software Engineering, University of Tartu
European Open Science CLoud (EOSC) Task Force “FAIR metrics and data quality”
BACKGROUND
Today, billions of data sources continuously generate, collect, process, and exchange data ⇒ with the rapid
increase in the number of devices and IS, the amount and variety of data are increasing.
There is a need to integrate ever-increasing volumes of data, regardless of the source, format or amount, where
the data quality, flexibility and scalability in connecting and processing different data sources are crucial.
an effective mechanism should be employed to ensure
faster value creation from these data
DATA QUALITY
Why data quality? Again? and still?
DATA QUALITY - WHAT, WHY, HOW, 10 BEST PRACTICES & MORE - Enterprise Master Data Management • Profisee
DATA QUALITY - WHAT, WHY, HOW, 10 BEST PRACTICES & MORE - Enterprise Master Data Management • Profisee
Among other “nuances”, data quality is use-case dependent and dynamic (as
well as relative) in nature!
*** “absolute data quality” == a level of data quality at which the data would satisfy all possible use
cases - is not possible to achieve, but this is the objective to be pursued
DATA REPOSITORY
DATA WAREHOUSE
DATA LAKE?
Maybe even something more?
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
schema on read
schema on write
“single source of truth”
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
DW were considered to be «a
silver bullet» for Business
Intelligence …
Implementing a Data Lake or Data Warehouse Architecture for Business Intelligence? | by Lan Chu | Towards Data Science
Image source: https://towardsdatascience.com/augment-your-data-lake-analytics-with-snowflake-b417f1186615
So how to get its benefits?
Image source: https://www.google.com/url?sa=i&url=https%3A%2F%2Ftwitter.com%2Frokar9%2Fstatus%2F1452339921629302784&psig=AOvVaw2IUSKtgUWxeaplk56f7CoK&ust=1668004535620000&source=images&cd=vfe&ved=0CA4QjhxqFwoTCJDHwbjnnvsCFQAAAAAdAAAAABAM
DATA LAKE & DATA WAREHOUSE
DATA LAKEHOUSE
Data lakehouse is seen as a combination of data warehousing workloads & data lake economics
Running Analytics on the Data Lake - The Databricks Blog
Running Analytics on the Data Lake - The Databricks Blog, Build a Lake House Architecture on AWS | AWS Big Data Blog (amazon.com), The Data Lakehouse, the Data Warehouse and a Modern Data platform architecture - Microsoft Community Hub
DATA LAKE FOR BUSINESS
INTELLIGENCE
BUSINESS DATA LAKE
https://www.capgemini.com/wp-content/uploads/2017/07/pivotal_data_lake_vs_traditional_bi_20140805.pdf
https://www.capgemini.com/wp-content/uploads/2017/07/pivotal_data_lake_vs_traditional_bi_20140805.pdf
Image source: The abstracted future of data engineering | by Justin Gage | Datalogue | Medium
Or how to avoid GIGO*?
*“garbage in, garbage out”
DATA CLEANING or DATA WRANGLING?
https://pediaa.com/what-is-the-difference-between-data-wrangling-and-data-cleaning/
✔
a process of iterative data exploration
and transformation that enables their
further analysis by making them (1)
usable, (2) credible and (3) useful
Image source: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.ecloudvalley.com%2Fwhat-is-datalake-and-datawarehouse%2F&psig=AOvVaw1vGuz42Qfu2-0J_bpmhpbJ&ust=1651908882490000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCNiaquD6yvcCFQAAAAAdAAAAABAO
⮚ The nature of data lake allows to store a variety of data within the memory
BUT
⮚ there is a need to clean up dirty data and enrich them in a pre-processing process, where data wrangling is found to be suitable
for these purposes.
⮚ The goal is to convert complex data types and data formats into structured data without programming efforts  users should
be able to prepare and transform their data without the need of using the ETL tools or familiarity and use of programming
languages, where these transformations should be automatically suggested after reading the data based on machine learning
algorithms that greatly speeds up this process.
Source: https://monkeylearn.com/blog/data-wrangling/, https://www.altair.com/what-is-data-wrangling/
DATA LAKE + DATA WRANGLING
=
DATA QUALITY IN IS
[an asset, not a silver bullet]
The data wrangling process to prepare research information and integrate it into CRIS
Depending on the IS and the desired or required target quality*, individual steps should be carried out several times  data wrangling is a continuous process
that repeats itself repeatedly at regular intervals.
Step Description
Select data The required data records are identified in different data sources. When selecting data, a record is evaluated by its value 🡪 if there is added value, the availability and terms of use
of the data and subsequent data from this data source are checked
Structure In most cases, there is little or no structure in the data 🡪 change the structure of the data for easier accessibility.
Clean Almost every dataset contains some outliers that can skew the analysis results 🡪 the data are extensively cleaned for better analysis (processing of null values, removing duplicates and
special characters, and standardization of the formatting to improve data consistency)
Enrich The data needs to be enriched - an inventory of the data set and a strategy for improving it by adding additional data should be carried out. The data set is enriched with various
metadata:
✔ Schematic metadata provide basic information about the processing and ingestion of data 🡪 the data wrangler analyzes / parses data records according to an existing
schema.
✔ Conversation metadata are exchanged between accessing instances with the idea to document information obtained during the processing or analysis of these data for
subsequent users.
The recognized peculiarities/ features of a data set can be saved.
*Data lake The physical transfer of data in the data lake. Although data are prepared using metadata, the record is not pre-processed.
The goal is to avoid a data swamp 🡪 estimate the value of the data and decide on their lifespan depending on the data quality and its interconnectedness with the rest of the DB.
Analyzes are not performed directly in the data lake, but only on the relevant data. To be able to use the data, the requester needs the appropriate access rights 🡪 Data Wrangler
performs data extraction, however, general viewing and exploration of the data should be possible directly in the data lake.
*Data
governance
The contents of the data lake, technologies and hardware used are subject to change 🡪 an audit is required to take care of the care and maintenance of the data lake. The main
principles / guidelines and measures that regulates data maintenance coordinating all processes in the data lake and responsibilities are defined
Validate the data are checked one more time before they are integrated into the target CRIS to identify problems with the data quality and consistency of the data, or to confirm that the
transformation has been successful.
Verify that the values of the attribute are correct and conform to the syntactic and distribution constraints, thus ensuring high data quality AND document every change so that
older versions can be restored, or history of changes can be viewed. If new data are generated during data analysis in CRIS, it can be re-included in Data Lake**
**New data go through the data wrangling process, starting with the step 2 of data validating and structuring the data.
At the end of this process, research information can be used by analytical applications and protected from unauthorized access by access control
USE-CASE
✔ Data formatting
✔ Correction of incorrect data (e.g. address data)
USE-CASE
✔Normalization and standardization (e.g. phone numbers, titles, etc.)
✔Structuring (e.g. separation of names into titles, first and last names, etc.)
✔Identification and cleaning of duplicates
USE-CASE: TRIFACTA FOR DATA WRANGLING
CONCLUSIONS
✔ As the volume of research information and data sources increases, the prerequisite for data to be complete, findable, comprehensively accessible,
interoperable, reusable (compliant with FAIR principles), but also securely stored, structured, and networked in order to be useful remain critical but
at the same time become more difficult to fulfill  data wrangling can be seen a valuable asset in ensuring this.
✔ The goal is to counteract the growing number of data silos that isolate data from different areas of the organization. Once successfully
implemented, data can be retrieved, managed and made available and accessible to everyone within the entity.
✔ A data lake and data wrangling can be implemented to improve and simplify IT infrastructure and architecture, governance and compliance.
They provide valuable support for predictive analytics and self-service analysis by making it easier and faster to access large amount of data from
multiple sources.
✔ The proper organization of the data lake makes it easier to find the data the user needs. Managing the data that have already been pre-processed
results in an increased efficiency and cost saving, as preparing data for their further use is the most resource-consuming part of data analysis.
✔ By providing pre-processed data, users with limited or no experience in data preparation (low level of data literacy) can be supported and analyzes
can be carried out faster and more accurately.
THANK YOU FOR
ATTENTION!
QUESTIONS?
For more information, see ResearchGate,
anastasijanikiforova.com
For questions or any queries, contact me via
Nikiforova.Anastasija@gmail.com,

Mais conteúdo relacionado

Mais procurados

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InSnapLogic
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWSGary Stafford
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOpsMeasuring Data Quality with DataOps
Measuring Data Quality with DataOpsSteven Ensslen
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 

Mais procurados (20)

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOpsMeasuring Data Quality with DataOps
Measuring Data Quality with DataOps
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 

Semelhante a Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure the Quality of Your Data?

Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRISCombining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRISAnastasija Nikiforova
 
Best Practices To Build a Data Lake
Best Practices To Build a Data LakeBest Practices To Build a Data Lake
Best Practices To Build a Data LakeFibonalabs
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDatavalley.ai
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business EnablerSrinivasan Sankar
 
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data ProjectCitiusTech
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxAkash527744
 
Snowflake Time Travel.pdf
Snowflake Time Travel.pdfSnowflake Time Travel.pdf
Snowflake Time Travel.pdfVishnuGone
 
Data Mining – A Perspective Approach
Data Mining – A Perspective ApproachData Mining – A Perspective Approach
Data Mining – A Perspective ApproachIRJET Journal
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data LakeIRJET Journal
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architectureCosta Pissaris
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
Data Engineering.pdf
Data Engineering.pdfData Engineering.pdf
Data Engineering.pdfDatacademy.ai
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster LEARN Project
 
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSBIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSTIBCO Spotfire
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageIRJET Journal
 
Chief Data & Analytics Officer Fall Boston - Presentation
Chief Data & Analytics Officer Fall Boston - PresentationChief Data & Analytics Officer Fall Boston - Presentation
Chief Data & Analytics Officer Fall Boston - PresentationSrinivasan Sankar
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.DurgaDeviP2
 

Semelhante a Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure the Quality of Your Data? (20)

Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRISCombining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
 
Best Practices To Build a Data Lake
Best Practices To Build a Data LakeBest Practices To Build a Data Lake
Best Practices To Build a Data Lake
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
 
Quality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data WarehouseQuality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data Warehouse
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business Enabler
 
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptx
 
Snowflake Time Travel.pdf
Snowflake Time Travel.pdfSnowflake Time Travel.pdf
Snowflake Time Travel.pdf
 
Data Mining – A Perspective Approach
Data Mining – A Perspective ApproachData Mining – A Perspective Approach
Data Mining – A Perspective Approach
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data Lake
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
Data Engineering.pdf
Data Engineering.pdfData Engineering.pdf
Data Engineering.pdf
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster
 
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSBIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their Usage
 
Chief Data & Analytics Officer Fall Boston - Presentation
Chief Data & Analytics Officer Fall Boston - PresentationChief Data & Analytics Officer Fall Boston - Presentation
Chief Data & Analytics Officer Fall Boston - Presentation
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.
 

Mais de Anastasija Nikiforova

Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...Anastasija Nikiforova
 
Towards High-Value Datasets determination for data-driven development: a syst...
Towards High-Value Datasets determination for data-driven development: a syst...Towards High-Value Datasets determination for data-driven development: a syst...
Towards High-Value Datasets determination for data-driven development: a syst...Anastasija Nikiforova
 
Public data ecosystems in and for smart cities: how to make open / Big / smar...
Public data ecosystems in and for smart cities: how to make open / Big / smar...Public data ecosystems in and for smart cities: how to make open / Big / smar...
Public data ecosystems in and for smart cities: how to make open / Big / smar...Anastasija Nikiforova
 
Artificial Intelligence for open data or open data for artificial intelligence?
Artificial Intelligence for open data or open data for artificial intelligence?Artificial Intelligence for open data or open data for artificial intelligence?
Artificial Intelligence for open data or open data for artificial intelligence?Anastasija Nikiforova
 
Overlooked aspects of data governance: workflow framework for enterprise data...
Overlooked aspects of data governance: workflow framework for enterprise data...Overlooked aspects of data governance: workflow framework for enterprise data...
Overlooked aspects of data governance: workflow framework for enterprise data...Anastasija Nikiforova
 
Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Anastasija Nikiforova
 
Framework for understanding quantum computing use cases from a multidisciplin...
Framework for understanding quantum computing use cases from a multidisciplin...Framework for understanding quantum computing use cases from a multidisciplin...
Framework for understanding quantum computing use cases from a multidisciplin...Anastasija Nikiforova
 
Putting FAIR Principles in the Context of Research Information: FAIRness for ...
Putting FAIR Principles in the Context of Research Information: FAIRness for ...Putting FAIR Principles in the Context of Research Information: FAIRness for ...
Putting FAIR Principles in the Context of Research Information: FAIRness for ...Anastasija Nikiforova
 
Open data hackathon as a tool for increased engagement of Generation Z: to h...
Open data hackathon as a tool for increased engagement of Generation Z:  to h...Open data hackathon as a tool for increased engagement of Generation Z:  to h...
Open data hackathon as a tool for increased engagement of Generation Z: to h...Anastasija Nikiforova
 
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...Anastasija Nikiforova
 
The role of open data in the development of sustainable smart cities and smar...
The role of open data in the development of sustainable smart cities and smar...The role of open data in the development of sustainable smart cities and smar...
The role of open data in the development of sustainable smart cities and smar...Anastasija Nikiforova
 
Data security as a top priority in the digital world: preserve data value by ...
Data security as a top priority in the digital world: preserve data value by ...Data security as a top priority in the digital world: preserve data value by ...
Data security as a top priority in the digital world: preserve data value by ...Anastasija Nikiforova
 
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...Anastasija Nikiforova
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Anastasija Nikiforova
 
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...Anastasija Nikiforova
 
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERSOPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERSAnastasija Nikiforova
 
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...Anastasija Nikiforova
 
Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...Anastasija Nikiforova
 
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...Anastasija Nikiforova
 

Mais de Anastasija Nikiforova (20)

Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
 
Towards High-Value Datasets determination for data-driven development: a syst...
Towards High-Value Datasets determination for data-driven development: a syst...Towards High-Value Datasets determination for data-driven development: a syst...
Towards High-Value Datasets determination for data-driven development: a syst...
 
Public data ecosystems in and for smart cities: how to make open / Big / smar...
Public data ecosystems in and for smart cities: how to make open / Big / smar...Public data ecosystems in and for smart cities: how to make open / Big / smar...
Public data ecosystems in and for smart cities: how to make open / Big / smar...
 
Artificial Intelligence for open data or open data for artificial intelligence?
Artificial Intelligence for open data or open data for artificial intelligence?Artificial Intelligence for open data or open data for artificial intelligence?
Artificial Intelligence for open data or open data for artificial intelligence?
 
Overlooked aspects of data governance: workflow framework for enterprise data...
Overlooked aspects of data governance: workflow framework for enterprise data...Overlooked aspects of data governance: workflow framework for enterprise data...
Overlooked aspects of data governance: workflow framework for enterprise data...
 
Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...
 
Framework for understanding quantum computing use cases from a multidisciplin...
Framework for understanding quantum computing use cases from a multidisciplin...Framework for understanding quantum computing use cases from a multidisciplin...
Framework for understanding quantum computing use cases from a multidisciplin...
 
Putting FAIR Principles in the Context of Research Information: FAIRness for ...
Putting FAIR Principles in the Context of Research Information: FAIRness for ...Putting FAIR Principles in the Context of Research Information: FAIRness for ...
Putting FAIR Principles in the Context of Research Information: FAIRness for ...
 
Open data hackathon as a tool for increased engagement of Generation Z: to h...
Open data hackathon as a tool for increased engagement of Generation Z:  to h...Open data hackathon as a tool for increased engagement of Generation Z:  to h...
Open data hackathon as a tool for increased engagement of Generation Z: to h...
 
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
 
The role of open data in the development of sustainable smart cities and smar...
The role of open data in the development of sustainable smart cities and smar...The role of open data in the development of sustainable smart cities and smar...
The role of open data in the development of sustainable smart cities and smar...
 
Data security as a top priority in the digital world: preserve data value by ...
Data security as a top priority in the digital world: preserve data value by ...Data security as a top priority in the digital world: preserve data value by ...
Data security as a top priority in the digital world: preserve data value by ...
 
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
 
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
 
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERSOPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
 
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
 
Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...Towards enrichment of the open government data: a stakeholder-centered determ...
Towards enrichment of the open government data: a stakeholder-centered determ...
 
Atvērto datu potenciāls
Atvērto datu potenciālsAtvērto datu potenciāls
Atvērto datu potenciāls
 
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
 

Último

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 

Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure the Quality of Your Data?

  • 1. Data Science Seminar: When, why and How? The important of Business Intelligence 9.11.2022, Tartu, Estonia DATA LAKE OR DATA WAREHOUSE? DATA CLEANING OR DATA WRANGLING? HOW TO ENSURE THE QUALITY OF YOUR DATA? Anastasija Nikiforova Assistant Professor of Information Systems, Faculty of Science and Technology, Institute of Computer Science, Chair of Software Engineering, University of Tartu European Open Science CLoud (EOSC) Task Force “FAIR metrics and data quality”
  • 2. BACKGROUND Today, billions of data sources continuously generate, collect, process, and exchange data ⇒ with the rapid increase in the number of devices and IS, the amount and variety of data are increasing. There is a need to integrate ever-increasing volumes of data, regardless of the source, format or amount, where the data quality, flexibility and scalability in connecting and processing different data sources are crucial. an effective mechanism should be employed to ensure faster value creation from these data
  • 3. DATA QUALITY Why data quality? Again? and still?
  • 4. DATA QUALITY - WHAT, WHY, HOW, 10 BEST PRACTICES & MORE - Enterprise Master Data Management • Profisee
  • 5. DATA QUALITY - WHAT, WHY, HOW, 10 BEST PRACTICES & MORE - Enterprise Master Data Management • Profisee Among other “nuances”, data quality is use-case dependent and dynamic (as well as relative) in nature! *** “absolute data quality” == a level of data quality at which the data would satisfy all possible use cases - is not possible to achieve, but this is the objective to be pursued
  • 7. DATA WAREHOUSE DATA LAKE? Maybe even something more?
  • 8. Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
  • 9. Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/ schema on read schema on write “single source of truth”
  • 10. Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
  • 11. Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/
  • 12. Image source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for/, https://www.qubole.com/data-lakes-vs-data-warehouses-the-co-existence-argument/ DW were considered to be «a silver bullet» for Business Intelligence …
  • 13. Implementing a Data Lake or Data Warehouse Architecture for Business Intelligence? | by Lan Chu | Towards Data Science
  • 14.
  • 17. DATA LAKE & DATA WAREHOUSE DATA LAKEHOUSE
  • 18. Data lakehouse is seen as a combination of data warehousing workloads & data lake economics Running Analytics on the Data Lake - The Databricks Blog
  • 19. Running Analytics on the Data Lake - The Databricks Blog, Build a Lake House Architecture on AWS | AWS Big Data Blog (amazon.com), The Data Lakehouse, the Data Warehouse and a Modern Data platform architecture - Microsoft Community Hub
  • 20. DATA LAKE FOR BUSINESS INTELLIGENCE BUSINESS DATA LAKE https://www.capgemini.com/wp-content/uploads/2017/07/pivotal_data_lake_vs_traditional_bi_20140805.pdf
  • 22. Image source: The abstracted future of data engineering | by Justin Gage | Datalogue | Medium Or how to avoid GIGO*? *“garbage in, garbage out”
  • 23. DATA CLEANING or DATA WRANGLING?
  • 24. https://pediaa.com/what-is-the-difference-between-data-wrangling-and-data-cleaning/ ✔ a process of iterative data exploration and transformation that enables their further analysis by making them (1) usable, (2) credible and (3) useful
  • 25. Image source: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.ecloudvalley.com%2Fwhat-is-datalake-and-datawarehouse%2F&psig=AOvVaw1vGuz42Qfu2-0J_bpmhpbJ&ust=1651908882490000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCNiaquD6yvcCFQAAAAAdAAAAABAO ⮚ The nature of data lake allows to store a variety of data within the memory BUT ⮚ there is a need to clean up dirty data and enrich them in a pre-processing process, where data wrangling is found to be suitable for these purposes. ⮚ The goal is to convert complex data types and data formats into structured data without programming efforts  users should be able to prepare and transform their data without the need of using the ETL tools or familiarity and use of programming languages, where these transformations should be automatically suggested after reading the data based on machine learning algorithms that greatly speeds up this process. Source: https://monkeylearn.com/blog/data-wrangling/, https://www.altair.com/what-is-data-wrangling/
  • 26. DATA LAKE + DATA WRANGLING = DATA QUALITY IN IS [an asset, not a silver bullet]
  • 27. The data wrangling process to prepare research information and integrate it into CRIS Depending on the IS and the desired or required target quality*, individual steps should be carried out several times  data wrangling is a continuous process that repeats itself repeatedly at regular intervals.
  • 28. Step Description Select data The required data records are identified in different data sources. When selecting data, a record is evaluated by its value 🡪 if there is added value, the availability and terms of use of the data and subsequent data from this data source are checked Structure In most cases, there is little or no structure in the data 🡪 change the structure of the data for easier accessibility. Clean Almost every dataset contains some outliers that can skew the analysis results 🡪 the data are extensively cleaned for better analysis (processing of null values, removing duplicates and special characters, and standardization of the formatting to improve data consistency) Enrich The data needs to be enriched - an inventory of the data set and a strategy for improving it by adding additional data should be carried out. The data set is enriched with various metadata: ✔ Schematic metadata provide basic information about the processing and ingestion of data 🡪 the data wrangler analyzes / parses data records according to an existing schema. ✔ Conversation metadata are exchanged between accessing instances with the idea to document information obtained during the processing or analysis of these data for subsequent users. The recognized peculiarities/ features of a data set can be saved. *Data lake The physical transfer of data in the data lake. Although data are prepared using metadata, the record is not pre-processed. The goal is to avoid a data swamp 🡪 estimate the value of the data and decide on their lifespan depending on the data quality and its interconnectedness with the rest of the DB. Analyzes are not performed directly in the data lake, but only on the relevant data. To be able to use the data, the requester needs the appropriate access rights 🡪 Data Wrangler performs data extraction, however, general viewing and exploration of the data should be possible directly in the data lake. *Data governance The contents of the data lake, technologies and hardware used are subject to change 🡪 an audit is required to take care of the care and maintenance of the data lake. The main principles / guidelines and measures that regulates data maintenance coordinating all processes in the data lake and responsibilities are defined Validate the data are checked one more time before they are integrated into the target CRIS to identify problems with the data quality and consistency of the data, or to confirm that the transformation has been successful. Verify that the values of the attribute are correct and conform to the syntactic and distribution constraints, thus ensuring high data quality AND document every change so that older versions can be restored, or history of changes can be viewed. If new data are generated during data analysis in CRIS, it can be re-included in Data Lake** **New data go through the data wrangling process, starting with the step 2 of data validating and structuring the data. At the end of this process, research information can be used by analytical applications and protected from unauthorized access by access control
  • 29. USE-CASE ✔ Data formatting ✔ Correction of incorrect data (e.g. address data)
  • 30. USE-CASE ✔Normalization and standardization (e.g. phone numbers, titles, etc.) ✔Structuring (e.g. separation of names into titles, first and last names, etc.) ✔Identification and cleaning of duplicates
  • 31. USE-CASE: TRIFACTA FOR DATA WRANGLING
  • 32. CONCLUSIONS ✔ As the volume of research information and data sources increases, the prerequisite for data to be complete, findable, comprehensively accessible, interoperable, reusable (compliant with FAIR principles), but also securely stored, structured, and networked in order to be useful remain critical but at the same time become more difficult to fulfill  data wrangling can be seen a valuable asset in ensuring this. ✔ The goal is to counteract the growing number of data silos that isolate data from different areas of the organization. Once successfully implemented, data can be retrieved, managed and made available and accessible to everyone within the entity. ✔ A data lake and data wrangling can be implemented to improve and simplify IT infrastructure and architecture, governance and compliance. They provide valuable support for predictive analytics and self-service analysis by making it easier and faster to access large amount of data from multiple sources. ✔ The proper organization of the data lake makes it easier to find the data the user needs. Managing the data that have already been pre-processed results in an increased efficiency and cost saving, as preparing data for their further use is the most resource-consuming part of data analysis. ✔ By providing pre-processed data, users with limited or no experience in data preparation (low level of data literacy) can be supported and analyzes can be carried out faster and more accurately.
  • 33. THANK YOU FOR ATTENTION! QUESTIONS? For more information, see ResearchGate, anastasijanikiforova.com For questions or any queries, contact me via Nikiforova.Anastasija@gmail.com,