Data Quality
Standards and Application to Open Data
February 21, 2018 – Brunel University, UK
Marco Torchiano
marco.torchiano@polito.it
Version 1.1.0
© Marco Torchiano, 2018
About me
 Marco Torchiano
 Associate Professor, Politecnico di Torino
 Senior Member IEEE
 Faculty Fellow – Nexa Center for Internet
and Society
 Member UNI CT504–Software Engineering
 Contacts:
– mailto:marco.torchiano@polito.it
– http://softeng.polito.it/torchiano/
– Twitter: @mtorchiano
3
Current Research Interests
 Mobile UI Automated Testing
 PhD student working on fragility
 (Open)Data Quality
 PhD student working on KB quality
 Software Energy Consumption
 Several collaborations
 Also: MDD, Survey methodology, code
obfuscation, SE education, …
4
Acknowledgments
 Antonio Vetrò
 The counterpart for
this line of research
 Many other people
 L.Canova, R.Iemma, F.Iuliano, F.Morando,
C.Orozco Minotas, G.Procaccianti,
R.Rashid
5
OPEN DATA QUALITY
7
Open Coesione
 portal about the fulfilment of
investments using the 2007-2013
European Cohesion funds
 Interactive Interface
 Downloadable .csv datasets
 ~100 billion Euros are being tracked,
~100K projects
 http://www.opencoesione.gov.it/
9
Errors in data
10
43 !
* extraction, transformation, and loading
11
Accuracy
12
» Always refer to the raw data
» If that is not possible, estimate the accuracy in the analysis (e.g., about 5% in the example above)
43 !
13
Missing data
14
15
»Outliers can point to interesting facts
Outliers
16
»… or to something which deserves a second look
Outliers
17
Value
pcvc = percentage of cells with correct value
18
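The pcvc measure defined above can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the slides: the function name `pcvc`, the sample rows, and the per-column validators are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def pcvc(rows: List[Row], validators: Dict[str, Callable[[str], bool]]) -> float:
    """Return the fraction of checked cells whose value passes its validator."""
    total = 0
    correct = 0
    for row in rows:
        for column, is_valid in validators.items():
            total += 1
            if is_valid(row.get(column, "")):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical sample: ages must be digits, country codes two uppercase letters.
sample_rows = [
    {"age": "43", "country": "IT"},
    {"age": "4e", "country": "UK"},
    {"age": "27", "country": "italy"},
]
sample_validators = {
    "age": str.isdigit,
    "country": lambda v: len(v) == 2 and v.isalpha() and v.isupper(),
}

score = pcvc(sample_rows, sample_validators)
print(f"pcvc = {score:.2%}")
```

Four of the six sample cells pass their validator, so the score is two thirds. What counts as a "correct" value depends on the validators, which have to be chosen per dataset.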
ISO DATA QUALITY
STANDARDS
19
ISO - SQuaRE
Family of standards
 2500x Quality Management
 2501x Quality Model
 2502x Quality Measurement
 2503x Quality Requirements
 2504x Quality Evaluation
20
ISO SQuaRE
 Internal Quality
 Values, formats, relation
 External Quality
 Technological environment
 Quality in Use
 Context of use of the data user
21
ISO 25012
Data Quality Model
22
Roles
 Data Quality evaluator
 Data Producer
 Data Acquirer
 Data User
23
Data evaluator
 Defines/adapts a quality model
 Evaluate and act
 Data correction
 Technological adjustments
 Organizational measures
24
Model structure
 Characteristic
 Main aspects, e.g., usability
 Sub-Characteristic (optional)
 A detailed aspect of a characteristic, e.g.
Understandability
 Metric
 A set of rules to assign and interpret a
(numerical) evaluation to a specific (sub)-
characteristic
25
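The three-level structure above (characteristic, optional sub-characteristic, metric) maps naturally onto a small data structure. The sketch below is one possible representation, assuming hypothetical class names and a made-up metric; it is not taken from the standard.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Metric:
    """A rule that assigns a numerical evaluation to a (sub-)characteristic."""
    identifier: str
    description: str
    evaluate: Callable[[Sequence[str]], float]


@dataclass
class SubCharacteristic:
    """A detailed aspect of a characteristic, e.g. Understandability."""
    name: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Characteristic:
    """A main quality aspect, e.g. Usability."""
    name: str
    sub_characteristics: List[SubCharacteristic] = field(default_factory=list)
    metrics: List[Metric] = field(default_factory=list)


def share_non_empty(values: Sequence[str]) -> float:
    """Hypothetical metric: fraction of values that are not blank."""
    if not values:
        return 0.0
    return sum(1 for v in values if v.strip()) / len(values)


# The identifier "example-1" is invented for this sketch.
usability = Characteristic(
    name="Usability",
    sub_characteristics=[
        SubCharacteristic(
            name="Understandability",
            metrics=[Metric("example-1", "Share of non-empty column headers", share_non_empty)],
        )
    ],
)

score = usability.sub_characteristics[0].metrics[0].evaluate(["id", "", "name", "date"])
print(score)
```

Metrics can be attached either directly to a characteristic or to one of its sub-characteristics, reflecting that the sub-characteristic level is optional.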
Characteristics
 Accuracy
 Completeness
 Consistency
 Credibility
 Currentness
 Accessibility
 Compliance
 Confidentiality
 Efficiency
 Precision
 Traceability
 Understandability
 Availability
 Portability
 Recoverability
26
Characteristics
 Accuracy
 Correspondence between data and reality
(syntactic and semantic)
 Completeness
 Computer: presence of all necessary
values
 User: how much the data is able to satisfy
the needs
 Consistency
 Absence of contradictions in the data
27
Characteristics
 Credibility
 The extent to which data are regarded as
true and credible by users
 Currentness
 the extent to which data is up-to-date
 Accessibility
 The capability of data to be accessed,
particularly by people who need
supporting technology or special
configuration because of some disability
28
Characteristics
 Regulatory compliance
 The capability of data to adhere to standards,
conventions or regulations in force and similar
rules relating to data quality
 Confidentiality
 The capability of the data to be accessed and
interpreted only by authorized users
 Efficiency
 The capability of data to be processed (accessed,
acquired, updated, etc) and to provide
appropriate levels of performance using the
appropriate amounts and types of resources
under stated conditions
29
Characteristics
 Precision
 Capability of the value assigned to an
attribute to provide the degree of
information needed in a stated context of
use
 Traceability
 Presence of attributes providing an audit trail
of access and changes made to data
 Understandability
 The extent to which data can be read and
interpreted by users
30
Characteristics
 Availability
 The capability of data to be always
retrievable.
 Recoverability
 The capability to preserve a specified level of
operations and its physical and logical
integrity, even in the event of failure
 Portability
 The capability of data to be moved to
another platform preserving quality
31
Perspectives
 Inherent: Facts (Data)
 Accuracy
 Completeness
 Consistency
 Credibility
 Currentness
 Inherent and System Dependent
 Accessibility
 Compliance
 Confidentiality
 Efficiency
 Precision
 Traceability
 Understandability
 System Dependent: Artefacts (D+Hw+Sw+Sys)
 Availability
 Portability
 Recoverability
Additional labels in the figure: HCI, Support
32
ISO 25024
Measurement of Data Quality
33
Relationships among standards
 Quality models: ISO/IEC 25010 (System and Software Product Quality), ISO/IEC 25012 (Data Quality)
 Quality models are composed of Quality characteristics and Quality sub-characteristics
 Quality Measures (ISO/IEC 25022, 25023, 25024) are composed of Quality Measure Elements, combined by a Measurement function
 Quality Measure Elements (QME, ISO/IEC 25021) are defined by a Measurement method applied to a Property to quantify of a Target Entity
Source: ISO/IEC 25024
34
Data Life Cycle: examples
 Data design
 Data collection
 Data integration
 External data acquisition
 Data processing
 Presentation
 Other use
 Data store
 Delete
Source: ISO/IEC 25024
35
Data design: target entities
 Architecture
 Contextual schema
 Data models (conceptual, logical,
physical)
 Data dictionary
 Document
36
Data design: properties
 Attribute
 Element
 Information
 Metadata
 Vocabulary
37
Other stages: target entities
 Data file
 DBMS
 RDBMS
 Form
 Presentation device
38
Properties
 Data format
 Data item
 Data value
 Information item
 Information item content
 Data record
39
Metrics definition
a) ID: abbreviated code of the quality characteristic + (I/D) + serial number
b) Name: QM name related to data;
c) Description
d) Measurement function: formula showing how the QMEs
are combined to produce the QM;
e) DLC, Target entities, Properties: DLC includes stages of
the DLC where the data QMEs are applicable, target
entities and properties of target entities;
f) Note: in the note, additional information such as an
acceptable range of values, reference to other standards,
explanations or interpretation or criteria, measurement
method used to obtain the
40
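The ID convention in item a) can be illustrated with a small parser. This sketch assumes that identifiers look like `Acc-I-1` or `Tra-D-2`, that I stands for the inherent perspective and D for the system dependent one, and that the abbreviations are those used in the measure tables of this deck. The names `parse_measure_id` and `MeasureId` are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern assumed from identifiers such as "Acc-I-1": code, I or D, serial number.
ID_PATTERN = re.compile(r"^(?P<code>[A-Za-z]{3})-(?P<view>[ID])-(?P<serial>\d+)$")

# Abbreviations that appear in this deck; other characteristics are not covered.
ABBREVIATIONS = {
    "Acc": "Accuracy",
    "Com": "Completeness",
    "Cur": "Currentness",
    "Cmp": "Compliance",
    "Tra": "Traceability",
    "Und": "Understandability",
}

PERSPECTIVES = {"I": "inherent", "D": "system dependent"}


class MeasureId(NamedTuple):
    characteristic: str
    perspective: str
    serial: int


def parse_measure_id(text: str) -> MeasureId:
    """Split an identifier into characteristic, perspective and serial number."""
    match = ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a measure identifier: {text!r}")
    code = match["code"]
    if code not in ABBREVIATIONS:
        raise ValueError(f"unknown characteristic code: {code!r}")
    return MeasureId(ABBREVIATIONS[code], PERSPECTIVES[match["view"]], int(match["serial"]))


print(parse_measure_id("Acc-I-1"))
print(parse_measure_id("Tra-D-2"))
```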
ACCURACY (Acc-I-1)
Copyright: ISO/IEC 25024
42
CASE STUDIES
Open Government Data
50
Open Government Data
OD: open data, data that can be
 Used
 Reused
 Redistributed
 By anyone and with any goal
G: Government produced or commissioned
by a government or an institutional
entity controlled by the government
http://opengovernmentdata.org
51
Why OGD ?
 Transparency
 Social and commercial value
 Participation
52
Case 1: Open Coesione
 Published data
 Structured
 Open data format
OpenCoesione
Statistical data from municipalities
 Residents
 Weddings
 Commercial activities
60
Datasets analyzed
61
Orchestrated disclosure
 Open Coesione
 portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
 85 billion Euros are being tracked, 850K projects
Decentralized disclosure
Dataset: Torino, Roma, Milano, Firenze, Bologna
 Residents: X X X X X
 Weddings: X X X
 Business Activities: X X X
Open Coesione
Measures
Characteristic     | Description                                     | ISO name
Completeness       | Percentage of complete cells                    | Com-I-1 (cell)
                   | Percentage of complete rows                     | Com-I-1 (row)
Accuracy           | Percentage of syntactically accurate cells      | Acc-I-1
Traceability       | Track of creation                               | Tra-D-2 (c)
                   | Track of update                                 | Tra-D-2 (u)
Currentness        | Percentage of current rows                      | Cur-I-2
                   | Delay in publication                            | ~Cur-I-1
Compliance         | e-GMS compliance                                | Cmp-D-1
                   | Five stars open data                            | Cmp-D-1
Understandability  | Percentage of columns with metadata             | Und-I-3
                   | Percentage of columns in comprehensible format  | Und-I-4
63
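The two completeness measures listed above, Com-I-1 (cell) and Com-I-1 (row), could be computed along the following lines. This is a minimal sketch: the function names, the sample table, and the rule that blank strings and `None` count as missing are assumptions, not the definitions used in the study.

```python
from typing import List, Optional, Sequence

Table = List[Sequence[Optional[str]]]


def is_complete(cell: Optional[str]) -> bool:
    """Assumed rule: a cell is complete if it is neither None nor blank."""
    return cell is not None and cell.strip() != ""


def cell_completeness(table: Table) -> float:
    """Fraction of complete cells over all cells."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 0.0
    return sum(is_complete(cell) for cell in cells) / len(cells)


def row_completeness(table: Table) -> float:
    """Fraction of rows in which every cell is complete."""
    if not table:
        return 0.0
    return sum(all(is_complete(cell) for cell in row) for row in table) / len(table)


# Hypothetical 3x3 table with one blank cell and one missing cell.
sample_table: Table = [
    ["a", "b", "c"],
    ["d", "", "f"],
    ["g", "h", None],
]

print(cell_completeness(sample_table))
print(row_completeness(sample_table))
```

The same table scores 7/9 at cell level but only 1/3 at row level, which shows why the two variants are reported separately.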
e-GMS
1. Accessibility (mandatory if applicable)
2. Addressee (optional)
3. Aggregation (optional)
4. Audience (optional)
5. Contributor (optional)
6. Coverage (recommended)
7. Creator (mandatory)
8. Date (mandatory)
9. Description (optional)
10. Digital signature (optional)
11. Disposal (optional)
12. Format (optional)
13. Identifier (mandatory if applicable)
14. Language (recommended)
15. Location (optional)
16. Mandate (optional)
17. Preservation (optional)
18. Publisher (mandatory if applicable)
19. Relation (optional)
20. Rights (optional)
21. Source (optional)
22. Status (optional)
23. Subject (mandatory)
24. Title (mandatory)
25. Type (optional)
UK e-Government Metadata Standard
https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
64
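A compliance score against the e-GMS element list can be sketched as the share of required elements that a dataset's metadata actually provides. The obligation levels below are copied from the list above; the function name, the scoring rule, and the sample metadata are assumptions for illustration and may differ from how Cmp-D-1 was computed in the study.

```python
from typing import Iterable, List, Mapping, Tuple

MANDATORY = ("Creator", "Date", "Subject", "Title")
MANDATORY_IF_APPLICABLE = ("Accessibility", "Identifier", "Publisher")


def egms_compliance(
    metadata: Mapping[str, str],
    applicable: Iterable[str] = (),
) -> Tuple[float, List[str]]:
    """Return (share of required elements present, list of missing elements)."""
    applicable_set = set(applicable)
    required = list(MANDATORY) + [e for e in MANDATORY_IF_APPLICABLE if e in applicable_set]
    missing = [e for e in required if not metadata.get(e, "").strip()]
    return (len(required) - len(missing)) / len(required), missing


# Hypothetical metadata record: Date is blank, Subject and Publisher are absent.
sample_metadata = {
    "Title": "Residents 2013",
    "Creator": "Comune di Torino",
    "Date": "",
}

score, missing = egms_compliance(sample_metadata, applicable=("Publisher",))
print(score, missing)
```

Recommended and optional elements are ignored here; weighting them is a design choice left open by this sketch.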
Results – Open Coesione
65
[Bar chart: value of each measure on a 0 to 1 scale, for Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4]
Chart annotations:
 Null/zero values: domain uncertain
 Track updates missing
 Missing metadata
 Data not linked
Results – Municipality data
66
[Bar chart: value of each measure on a 0 to 1 scale, for Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4]
Chart annotations:
 Discrepancies of values with domain
 No info on updates
 Missing metadata
Findings
 Disclosure strategy implies different
data quality
 Centralized vs.
 Decentralized
 Traceability is generally lacking
 Proposal: use software configuration management tools
 Metadata is often missing or
incomplete
67
Case 2: Public Contracts
 Published data
 Structured
 Open format
Data on public contracts pursuant to Art. 37 of the
Transparency Decree, plus the prescriptions of
ANAC
68
Public contracts
 Transparency Decree (14 March 2013, n. 33)
 Public contracts (Art. 37 and Art. 9)
 Open Data Publication
 XML Standard Format (ANAC)
 Selected administrations: Italian
universities
69
Data Structure
XML
METADATA
DATA
LOTS
PARTICIPANTS
WINNER
70
<lotto>
  <cig>4421574E47</cig>
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <oggetto>Procedura di cottimo fiduciario per affidamento servizio di manutenzione e assistenza di primo livello stazioni self-service</oggetto>
  <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
71
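A lot record like the one above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below uses a shortened copy of the lot and assumes the element names shown on the slide; the helper names and the returned dictionary layout are hypothetical.

```python
from decimal import Decimal
from typing import List, Optional
import xml.etree.ElementTree as ET

# Shortened copy of the lot; the ampersand is written as &amp; because
# a bare "&" is not well-formed XML and would make the parser fail.
SAMPLE_LOT = """
<lotto>
  <cig>4421574E47</cig>
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
"""


def _amount(lot: ET.Element, tag: str) -> Optional[Decimal]:
    """Return the amount in <tag>, or None if the element is missing or blank."""
    text = (lot.findtext(tag) or "").strip()
    return Decimal(text) if text else None


def _fiscal_codes(lot: ET.Element, path: str) -> List[str]:
    """Collect the text of every element matching path."""
    return [(element.text or "").strip() for element in lot.findall(path)]


def parse_lot(xml_text: str) -> dict:
    """Extract the fields needed for quality checks from one <lotto> element."""
    lot = ET.fromstring(xml_text)
    return {
        "cig": (lot.findtext("cig") or "").strip(),
        "proposer": (lot.findtext("strutturaProponente/denominazione") or "").strip(),
        "participants": _fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale"),
        "winners": _fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale"),
        "awarded": _amount(lot, "importoAggiudicazione"),
        "paid": _amount(lot, "importoSommeLiquidate"),
    }


lot = parse_lot(SAMPLE_LOT)
print(lot["cig"], lot["awarded"], lot["winners"])
```

Amounts are parsed as `Decimal` rather than `float` so that comparisons between awarded and paid sums are exact.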
Quality Evaluation Framework
Intrinsic Dimensions
Dimension     | Measure
Accuracy      | Percentage of elements with correct values.
Completeness  | Percentage of complete elements.
              | Percentage of complete aggregate elements.
Domain Dependent Dimensions
Dimension     | Measure
Consistency   | Percentage of lots that meet the Intrarelational and Interrelational Integrity Constraints.
Duplication   | Number of duplicates.
72
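The two domain dependent measures can be sketched as follows. The integrity constraints used here (amount paid not above the awarded amount, start date not after end date, every winner listed among the participants) and the choice of the CIG as duplicate key are assumptions for illustration; the slides do not list the full set of constraints used in the study. All sample lots are invented.

```python
from collections import Counter
from datetime import date
from decimal import Decimal
from typing import Dict, List

Lot = Dict[str, object]


def is_consistent(lot: Lot) -> bool:
    """Check the assumed integrity constraints on a single lot."""
    return (
        lot["paid"] <= lot["awarded"]
        and lot["start"] <= lot["end"]
        and set(lot["winners"]) <= set(lot["participants"])
    )


def consistency(lots: List[Lot]) -> float:
    """Fraction of lots that satisfy all constraints."""
    if not lots:
        return 0.0
    return sum(is_consistent(lot) for lot in lots) / len(lots)


def duplicates(lots: List[Lot]) -> int:
    """Number of lots whose CIG repeats one already seen."""
    counts = Counter(lot["cig"] for lot in lots)
    return sum(count - 1 for count in counts.values() if count > 1)


def make_lot(cig: str, awarded: str, paid: str, participants, winners) -> Lot:
    """Build a hypothetical lot with fixed, valid dates."""
    return {
        "cig": cig,
        "awarded": Decimal(awarded),
        "paid": Decimal(paid),
        "start": date(2014, 9, 1),
        "end": date(2014, 11, 30),
        "participants": participants,
        "winners": winners,
    }


sample_lots = [
    make_lot("A000000001", "7500.00", "7500.00", ["P1"], ["P1"]),
    make_lot("A000000002", "1000.00", "1200.00", ["P1", "P2"], ["P2"]),  # paid > awarded
    make_lot("A000000003", "500.00", "100.00", [], ["P3"]),  # winner not a participant
    make_lot("A000000001", "7500.00", "7500.00", ["P1"], ["P1"]),  # repeated CIG
]

print(consistency(sample_lots), duplicates(sample_lots))
```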
Identification of datasets
 The first 25 universities in the 2014 overall ranking published by the newspaper Il Sole 24 Ore.
 Only 12 universities provide summary tables in XML format.
Total number of assessed lots: 123702
Average number of published lots: 10308.5
 The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
73
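The figures on this slide are mutually consistent, as a quick check shows. The variable names are only for illustration.

```python
ranked_universities = 25
universities_with_xml = 12
total_lots = 123702

# Universities without usable XML summary tables.
remaining = ranked_universities - universities_with_xml

# Average number of lots per university that publishes XML.
average_lots = total_lots / universities_with_xml

print(remaining, average_lots)
```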
CIG
74
The University of Torino
publishes summary tables with
100% cig completeness, that is,
100% of the lots have the cig
element, but about 32% of them
are out of domain.
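[Bar chart, horizontal axis from 0.60 to 1.00, two bars per university. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. Values as extracted, first series: 1, 0.94, 0.9999, 0.999, 0.67, 1, 0.99, 1, 0.998, 0.997, 0.9998, 0.99; second series: 1 for all twelve universities]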
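The difference between completeness and accuracy of the cig element can be made concrete with a small check. The domain rule below (exactly 10 uppercase alphanumeric characters, not all zeros) is an assumption chosen for illustration, not the rule applied in the study, and the sample values are invented apart from the CIG shown in the XML example.

```python
import re
from typing import List

# Assumed domain rule for a CIG value.
CIG_PATTERN = re.compile(r"[0-9A-Z]{10}")


def cig_present(value: str) -> bool:
    """A value counts as present if it is not blank."""
    return value.strip() != ""


def cig_in_domain(value: str) -> bool:
    """A value is in domain if it matches the assumed pattern and is not all zeros."""
    text = value.strip()
    return CIG_PATTERN.fullmatch(text) is not None and set(text) != {"0"}


def share(flags: List[bool]) -> float:
    """Fraction of True values in a list of flags."""
    return sum(flags) / len(flags) if flags else 0.0


sample_cigs = ["4421574E47", "0000000000", "N/A", "Z1A0B2C3D4"]

completeness = share([cig_present(v) for v in sample_cigs])
accuracy = share([cig_in_domain(v) for v in sample_cigs])
print(completeness, accuracy)
```

In this sample every lot has a cig value, so completeness is 1.0, yet only half of the values are in domain. A dataset can therefore be fully complete and still score poorly on accuracy.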
Unique Tender Identifier
A lot of “00000000000”.
The element is present for
each lot but it is always
empty.
Choice of
contracting part
75
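[Bar chart, horizontal axis from 0.00 to 1.00, two bars per university. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. Values as extracted, first series: 0.9999, 0.998, 0.9999, 0, 1, 1, 1, 0.9991, 1, 1, 1, 1; second series: 1, 1, 1, 1, 1, 1, 1, 0.999, 1, 1, 1, 1]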
All the lots published by
University of Milano have a
winner but no information about
the participants.
Fiscal Code
76
[Bar chart, horizontal axis from 0.90 to 1.00. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. Values as extracted: 1, 0.97, 0.99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.974, 1, 1, 0.996, 1, 0.951]
In 14% of lots the amount paid is
greater than the awarded amount.
Amount paid
vs. Total paid
78
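[Bar chart "Paid less than or equal to awarded", horizontal axis from 0.80 to 1.00, one bar per university. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. Values as extracted: 0.87, 0.97, 0.96, 0.9999, 0.998, 0.99, 0.999, 0.93, 0.995, 0.9999, 0.98, 0.98]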
Final considerations
 ISO standard provides several
predefined measures
 Must be adapted to the case at hand
 Can be aggregated in different ways
 Possibility to define new measures
 ISO standard is intended for
structured data
 What about semantic knowledge bases?
79
References
 ISO/IEC 25012:2008, Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model
 ISO/IEC 25024:2015, Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Measurement of data quality
 Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico. “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”, Government Information Quarterly, Vol. 33, pp. 325-337, ISSN: 0740-624X
 Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca. “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”, in IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
80

 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 

Data Quality - Standards and Application to Open Data

  • 1. Data Quality: Standards and Application to Open Data. February 21, 2018 – Brunel University, UK. Marco Torchiano, marco.torchiano@polito.it. Version 1.1.0. © Marco Torchiano, 2018
  • 2. About me: Marco Torchiano. Associate Professor, Politecnico di Torino; Senior Member IEEE; Faculty Fellow, Nexa Center for Internet and Society; Member UNI CT504 (Software Engineering). Contacts: mailto:marco.torchiano@polito.it; http://softeng.polito.it/torchiano/; Twitter: @mtorchiano
  • 3. Current Research Interests: Mobile UI Automated Testing (PhD student working on fragility); (Open) Data Quality (PhD student working on KB quality); Software Energy Consumption (several collaborations). Also: MDD, survey methodology, code obfuscation, SE education, …
  • 4. Acknowledgments: Antonio Vetrò, the counterpart for this line of research. Many other people: L. Canova, R. Iemma, F. Iuliano, F. Morando, C. Orozco Minotas, G. Procaccianti, R. Rashid
  • 6. Open Coesione: portal about the fulfilment of investments using the 2007-2013 European Cohesion funds. Interactive interface; downloadable .csv datasets; ~100 billion Euros are being tracked, ~100K projects. http://www.opencoesione.gov.it/
  • 7. [Slide with no extractable text]
  • 9. 43 ! * extraction, transformation, and loading
  • 11. Always refer to raw data. If that is not possible, estimate the accuracy in the analysis (e.g., about 5% in the example above). 43 !
  • 13. [Slide with no extractable text]
  • 14. Outliers: outliers can point to interesting facts
  • 15. Outliers: … or to something which deserves a second look
  • 16. Value. pcvc = percentage of cells with correct value
  • 18. ISO - SQuaRE family of standards: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation
  • 19. ISO SQuaRE: Internal Quality (values, formats, relation); External Quality (technological environment); Quality in Use (context of use of the data user)
  • 20. ISO 25012 Data Quality Model. SQuaRE divisions: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation
  • 21. Roles: Data Quality evaluator; Data Producer; Data Acquirer; Data User
  • 22. Data evaluator: defines/adapts a quality model; evaluates and acts (data correction, technological adjustments, organizational measures)
  • 23. Model structure. Characteristic: main aspects, e.g., usability. Sub-characteristic (optional): a detailed aspect of a characteristic, e.g., understandability. Metric: a set of rules to assign and interpret a (numerical) evaluation to a specific (sub-)characteristic
  • 24. Characteristics: Accuracy, Completeness, Consistency, Credibility, Currentness, Accessibility, Compliance, Confidentiality, Efficiency, Precision, Traceability, Understandability, Availability, Portability, Recoverability
  • 25. Characteristics. Accuracy: correspondence between data and reality (syntactic and semantic). Completeness: for the computer, presence of all necessary values; for the user, how much the data is able to satisfy the needs. Consistency: absence of contradictions in the data
  • 26. Characteristics. Credibility: the extent to which data are regarded as true and credible by users. Currentness: the extent to which data is up to date. Accessibility: the capability of data to be accessed, particularly by people who need supporting technology or special configuration because of some disability
  • 27. Characteristics. Regulatory compliance: the capability of data to adhere to standards, conventions or regulations in force and similar rules relating to data quality. Confidentiality: the capability of the data to be accessed and interpreted only by authorized users. Efficiency: the capability of data to be processed (accessed, acquired, updated, etc.) and to provide appropriate levels of performance using the appropriate amounts and types of resources under stated conditions
  • 28. Characteristics. Precision: capability of the value assigned to an attribute to provide the degree of information needed in a stated context of use. Traceability: presence of attributes providing an audit trail of access and changes made to data. Understandability: the extent to which data can be read and interpreted by users
  • 29. Characteristics. Availability: the capability of data to be always retrievable. Recoverability: the capability to preserve a specified level of operations and its physical and logical integrity, even in the event of failure. Portability: the capability of data to be moved to another platform preserving quality
  • 31. ISO 25024 Measurement of Data Quality. SQuaRE divisions: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation
  • 32. Relationships among standards (diagram; source: ISO/IEC 25024). Elements: ISO/IEC 25010 System and Software Product Quality; ISO/IEC 25012 Data Quality; quality characteristics; quality sub-characteristics; quality measures (ISO/IEC 25022, 25023, 25024); measurement function; Quality Measure Elements (QME, ISO/IEC 25021); measurement method; property to quantify; target entity. Relation labels: "composed of", "defines"
  • 33. Data Life Cycle, examples (source: ISO/IEC 25024): data design; data collection; data integration; external data acquisition; data processing; presentation; other use; data store; delete
  • 34. Data design, target entities: architecture; contextual schema; data models (conceptual, logical, physical); data dictionary; document
  • 35. Data design, properties: attribute; element; information; metadata; vocabulary
  • 36. Other stages, target entities: data file; DBMS; RDBMS; form; presentation device
  • 37. Properties: data format; data item; data value; information item; information item content; data record
  • 38. Metrics definition. a) ID: abbreviated code of the quality characteristic + (I/D) + serial number; b) Name: QM name related to data; c) Description; d) Measurement function: formula showing how the QMEs are combined to produce the QM; e) DLC, target entities, properties: DLC includes the stages of the DLC where the data QMEs are applicable, target entities and properties of target entities; f) Note: additional information such as an acceptable range of values, reference to other standards, explanations or interpretation or criteria, measurement method used to obtain the …
  • 41. Open Government Data. OD (open data): data that can be used, reused, and redistributed by anyone and with any goal. G (government): produced or commissioned by a government or an institutional entity controlled by the government. http://opengovernmentdata.org
  • 42. Why OGD? Transparency; social and commercial value; participation
  • 43. Case 1: Open Coesione. Published data: structured, open data format. Datasets: OpenCoesione; statistical data from municipalities (residents, weddings, commercial activities)
  • 44. Datasets analyzed. Orchestrated disclosure: Open Coesione, portal about the fulfilment of investments using the 2007-2013 European Cohesion funds; 85 billion Euros are being tracked, 850K projects. Decentralized disclosure: municipal datasets for Torino, Roma, Milano, Firenze, Bologna; Residents available for all five cities, Weddings for three cities, Business Activities for three cities
  • 46. Measures (characteristic: description (ISO name)). Completeness: percentage of complete cells (Com-I-1, cell); percentage of complete rows (Com-I-1, row). Accuracy: percentage of syntactically accurate cells (Acc-I-1). Traceability: track of creation (Tra-D-2 (c)); track of update (Tra-D-2 (u)). Currentness: percentage of current rows (Cur-I-2); delay in publication (~Cur-I-1). Compliance: e-GMS compliance (Cmp-D-1); five stars open data (Cmp-D-1). Understandability: percentage of columns with metadata (Und-I-3); percentage of columns in comprehensible format (Und-I-4)
  • 47. e-GMS (UK e-Government Metadata Standard) elements: 1. Accessibility (mandatory if applicable); 2. Addressee (optional); 3. Aggregation (optional); 4. Audience (optional); 5. Contributor (optional); 6. Coverage (recommended); 7. Creator (mandatory); 8. Date (mandatory); 9. Description (optional); 10. Digital signature (optional); 11. Disposal (optional); 12. Format (optional); 13. Identifier (mandatory if applicable); 14. Language (recommended); 15. Location (optional); 16. Mandate (optional); 17. Preservation (optional); 18. Publisher (mandatory if applicable); 19. Relation (optional); 20. Rights (optional); 21. Source (optional); 22. Status (optional); 23. Subject (mandatory); 24. Title (mandatory); 25. Type (optional). https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
  • 48. Results – Open Coesione. [Bar chart, scale 0.00 to 1.00, over the measures Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4.] Annotations: null/zero values, domain uncertain; track of updates missing; missing metadata; data not linked
  • 49. Results – Municipality data. [Bar chart, scale 0 to 1, over the measures Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4.] Annotations: discrepancies of values with domain; no info on updates; missing metadata
  • 50. Findings. Disclosure strategy implies different data quality (centralized vs. decentralized). Traceability is generally lacking (proposal: use software configuration management tools). Metadata is often missing or incomplete
  • 51. Case 2: Public Contracts. Published data: structured, open format. Data on public contracts ex Art. 37 of the Transparency Decree, plus ANAC prescriptions
  • 52. Public contracts. Transparency Decree (14 March 2013, n. 33): public contracts (Art. 37 and Art. 9); open data publication; XML standard format (ANAC). Selected administrations: Italian universities
  • 54. Example lot (XML):
      <lotto>
        <cig>4421574E47</cig>
        <strutturaProponente>
          <codiceFiscaleProp>00518460019</codiceFiscaleProp>
          <denominazione>Politecnico di Torino</denominazione>
        </strutturaProponente>
        <oggetto>Procedura di cottimo fiduciario per affidamento servizio di manutenzione e assistenza di primo livello stazioni self-service</oggetto>
        <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
        <partecipanti>
          <partecipante>
            <codiceFiscale>06267040019</codiceFiscale>
            <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
          </partecipante>
        </partecipanti>
        <aggiudicatari>
          <aggiudicatario>
            <codiceFiscale>06267040019</codiceFiscale>
            <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
          </aggiudicatario>
        </aggiudicatari>
        <importoAggiudicazione>7500.00</importoAggiudicazione>
        <tempiCompletamento>
          <dataInizio>2014-09-01</dataInizio>
          <dataUltimazione>2014-11-30</dataUltimazione>
        </tempiCompletamento>
        <importoSommeLiquidate>7500.00</importoSommeLiquidate>
      </lotto>
  • 55. Quality Evaluation Framework. Intrinsic dimensions: Accuracy (percentage of elements with correct values); Completeness (percentage of complete elements; percentage of complete aggregate elements). Domain-dependent dimensions: Consistency (percentage of lots that meet the intrarelational and interrelational integrity constraints); Duplication (number of duplicates)
  • 56. Identification of datasets. First 25 universities of the overall 2014 ranking provided by the newspaper Il Sole 24 Ore. Only 12 universities provide summary tables in XML format. Total number of assessed lots: 123702. Average number of published lots: 10308.5. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML
  • 57. CIG (Unique Tender Identifier). The University of Torino publishes summary tables that have 100% cig completeness, that is, 100% of lots have the cig element, but about 32% of them are out of domain: a lot of "00000000000". [Bar chart, axis 0.60 to 1.00, two series. Values in extraction order for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. First series: 1, 0.94, 0.9999, 0.999, 0.67, 1, 0.99, 1, 0.998, 0.997, 0.9998, 0.99. Second series: 1 for all twelve.]
  • 58. Choice of contracting part. The element is present for each lot but it is always empty. [Bar chart, axis 0.00 to 1.00, two series. Values in extraction order for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. First series: 0.9999, 0.998, 0.9999, 0, 1, 1, 1, 0.9991, 1, 1, 1, 1. Second series: 1, 1, 1, 1, 1, 1, 1, 0.999, 1, 1, 1, 1.]
  • 59. Fiscal Code. All the lots published by University of Milano have a winner but no information about the participants. [Bar chart, axis 0.90 to 1.00; universities on the axis: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm. The 22 values in extraction order: 1, 0.97, 0.99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.974, 1, 1, 0.996, 1, 0.951.]
  • 60. Amount paid vs. Total paid. In 14% of lots the amount paid is greater than the awarded amount. [Bar chart, axis 0.80 to 1.00, labelled "Paid less than or equal to awarded". Values in extraction order: UniBo 0.87, PoliMi 0.97, PoliTo 0.96, UniMi 0.9999, UniTo 0.998, UniVe 0.99, UniUpo 0.999, UniFe 0.93, UniMib 0.995, UniPv 0.9999, UniSa 0.98, UnivPm 0.98.]
  • 61. Final considerations. The ISO standard provides several predefined measures: they must be adapted to the case at hand, they can be aggregated in different ways, and new measures can be defined. The ISO standard is intended for structured data: what about semantic knowledge bases?
  • 62. References. (1) ISO/IEC 25012:2008, Software engineering – Software product Quality Requirements and Evaluation (SQuaRE) – Data quality model. (2) ISO/IEC 25024:2015, Software engineering – Software product Quality Requirements and Evaluation (SQuaRE) – Measurement of data quality. (3) Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico. "Open Data Quality Measurement Framework: Definition and Application to Open Government Data". Government Information Quarterly, Vol. 33, pp. 325-337, ISSN 0740-624X. (4) Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca. "Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study". In IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
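
The cell-level and row-level measures listed on slide 46 (percentage of complete cells, percentage of complete rows, percentage of syntactically accurate cells) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' tooling: the sample CSV, the column names, the per-column patterns in DOMAINS, and the choice to compute accuracy over non-empty cells only are all invented here.

```python
import csv
import io
import re

# Hypothetical sample data: column names and values are invented for illustration.
SAMPLE_CSV = """project_id,start_date,funding_eur
P001,2010-03-01,150000
P002,,98000
P003,2011-13-40,
P004,2012-07-15,abc
"""

# Assumed syntactic domain of each column (not taken from ISO/IEC 25024).
DOMAINS = {
    "project_id": re.compile(r"P\d{3}"),
    "start_date": re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"),
    "funding_eur": re.compile(r"\d+(\.\d+)?"),
}


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def is_complete(value):
    """A cell counts as complete when it holds a non-blank value."""
    return value is not None and value.strip() != ""


def cell_completeness(rows):
    """Share of cells that are complete (in the spirit of Com-I-1, cell)."""
    cells = [value for row in rows for value in row.values()]
    return sum(is_complete(v) for v in cells) / len(cells)


def row_completeness(rows):
    """Share of rows whose cells are all complete (Com-I-1, row)."""
    complete = sum(all(is_complete(v) for v in row.values()) for row in rows)
    return complete / len(rows)


def syntactic_accuracy(rows, domains):
    """Share of non-empty cells matching their column pattern (Acc-I-1).

    Assumption: empty cells are excluded here, since they are already
    counted by the completeness measures.
    """
    checked = correct = 0
    for row in rows:
        for column, value in row.items():
            if not is_complete(value):
                continue
            checked += 1
            if domains[column].fullmatch(value.strip()):
                correct += 1
    return correct / checked
```

On the sample above, two of twelve cells are empty, two of four rows are fully populated, and two of the ten non-empty cells (the date "2011-13-40" and the amount "abc") fall outside their assumed domains.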

Editor's Notes

  1. Transparency isn't just about access; it is also about sharing and reuse. Often, to understand material it needs to be analyzed and visualized, and this requires that the material be open so that it can be freely used and reused.
  2. To assess quality we consider different dimensions: intrinsic dimensions, which do not depend on the type of data, and domain-dependent dimensions. As intrinsic dimensions we evaluate Accuracy, computed as the percentage of elements with correct values, and Completeness, computed as the percentage of complete elements and the percentage of complete aggregate elements. An element is considered incorrect or incomplete if it does not meet the specification of its domain or the number of occurrences specified in the XML schema. For the domain-dependent dimensions we evaluate Consistency by defining a set of integrity constraints that strictly depend on the public contracts domain: for example, the amount paid must be less than or equal to the award amount, and if a public contract does not have a successful tenderer, the amount paid must be equal to zero.
  3. To conduct the evaluation we selected the first 25 universities of the general 2014 ranking provided by the newspaper Il Sole 24 Ore. Only 12 of them provide summary tables in XML format, for a total of 123702 assessed lots. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
  4. Accuracy and completeness were computed for all elements, but we show only the most interesting ones; likewise, we show only some of the integrity constraints defined to assess consistency. The cig is the unique identifier of a lot. The University of Torino has a cig completeness of 100%: the cig element is present in all analysed lots, but in 32% of cases it is out of domain.
  5. The sceltaContraente is one of the most important elements because it specifies the procedure for selecting the contractor, and the authorities can use it to detect illegal awards of contracts. High accuracy and completeness improve the transparency of contracts. For the University of Milano the completeness of the sceltaContraente element is 100% but the percentage of correct elements is 0, because the element is present in all the lots provided by the university but its value is always empty.
  6. The codiceFiscale is the unique identifier of the participants. An interesting aspect is that the University of Milano is not classified, because none of the summary tables provided by the university contains information about the participants.
  7. This result is highlighted by two interrelational constraints: "lot has participant" and "successful tenderer is participant". The first computes the percentage of lots that have a successful tenderer and at least one participant, while the second computes the percentage of cells in which the successful tenderer of a lot is a participant for the same lot. In both cases the percentage for the University of Milano is zero, because there is no information on participants in the analysed files. For the University of Milano-Bicocca the "lot has participant" percentage is slightly higher than the "successful tenderer is participant" percentage: although some lots have participants, the successful tenderer is not one of them.
  8. The first intrarelational constraint computes the percentage of lots in which the amount paid is less than or equal to the award amount. 14% of the lots of the University of Bologna have an amount paid greater than the award amount, which shows that more public money than requested is spent. The successfulTenderer_amountPaid constraint computes the percentage of cells in which there is no information about the successful tenderer but the amount paid is different from zero. For 40% of the lots of the University of Bologna there is no information about the successful tenderer, yet money is paid out and cannot be tracked: it is not known who receives it.
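
The element checks and integrity constraints described in these notes can be sketched in Python with the standard library XML parser. This is a hedged illustration, not the framework used in the study: the element names follow the example lot on slide 54, while the `<lotti>` wrapper, the second (faulty) lot, the `CIG_PATTERN` domain (ten uppercase alphanumeric characters), and the function names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumed cig domain: exactly ten uppercase alphanumeric characters.
CIG_PATTERN = re.compile(r"[0-9A-Z]{10}")

# Hypothetical input: the <lotti> wrapper and the second lot are invented.
SAMPLE_XML = """<lotti>
  <lotto>
    <cig>4421574E47</cig>
    <partecipanti>
      <partecipante><codiceFiscale>06267040019</codiceFiscale></partecipante>
    </partecipanti>
    <aggiudicatari>
      <aggiudicatario><codiceFiscale>06267040019</codiceFiscale></aggiudicatario>
    </aggiudicatari>
    <importoAggiudicazione>7500.00</importoAggiudicazione>
    <importoSommeLiquidate>7500.00</importoSommeLiquidate>
  </lotto>
  <lotto>
    <cig>00000000000</cig>
    <partecipanti/>
    <aggiudicatari/>
    <importoAggiudicazione>1000.00</importoAggiudicazione>
    <importoSommeLiquidate>1200.00</importoSommeLiquidate>
  </lotto>
</lotti>"""


def fiscal_codes(lot, path):
    """Collect the non-empty codiceFiscale values found under the given path."""
    return {e.text.strip() for e in lot.findall(path) if e.text and e.text.strip()}


def amount(lot, tag):
    """Read a monetary element as Decimal, or None when missing or empty."""
    text = lot.findtext(tag)
    return Decimal(text.strip()) if text and text.strip() else None


def check_lot(lot):
    """Evaluate one lot against the checks sketched in the notes."""
    cig = (lot.findtext("cig") or "").strip()
    participants = fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale")
    winners = fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale")
    awarded = amount(lot, "importoAggiudicazione")
    paid = amount(lot, "importoSommeLiquidate")
    return {
        # Completeness: the cig element carries a value.
        "cig_complete": cig != "",
        # Accuracy: the value lies inside the assumed domain.
        "cig_in_domain": bool(CIG_PATTERN.fullmatch(cig)),
        # Intrarelational: amount paid <= awarded amount.
        "paid_le_awarded": paid is not None and awarded is not None and paid <= awarded,
        # Intrarelational: without a successful tenderer, nothing may be paid.
        "no_winner_implies_zero_paid": bool(winners) or paid == 0,
        # Interrelational: every successful tenderer is also a participant
        # (vacuously true when the lot has no successful tenderer).
        "winner_is_participant": winners <= participants,
    }


def evaluate(xml_text):
    """Run check_lot on every <lotto> element of the document."""
    return [check_lot(lot) for lot in ET.fromstring(xml_text).findall("lotto")]


def share(results, key):
    """Fraction of lots that satisfy the named check."""
    return sum(r[key] for r in results) / len(results)
```

Calling `share(evaluate(SAMPLE_XML), "paid_le_awarded")` then gives the fraction of lots satisfying that constraint, which is the kind of per-university figure plotted on slides 57 to 60. Using `Decimal` rather than `float` avoids rounding surprises when comparing monetary amounts.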