2. About me
Marco Torchiano
Associate Professor, Politecnico di Torino
Senior Member IEEE
Faculty Fellow – Nexa Center for Internet and Society
Member UNI CT504–Software Engineering
Contacts:
– mailto:marco.torchiano@polito.it
– http://softeng.polito.it/torchiano/
– Twitter: @mtorchiano
3. Current Research Interests
Mobile UI Automated Testing
PhD student working on fragility
(Open)Data Quality
PhD student working on KB quality
Software Energy Consumption
Several collaborations
Also: MDD, Survey methodology, code
obfuscation, SE education, …
4. Acknowledgments
Antonio Vetrò
The counterpart for
this line of research
Many other people
L.Canova, R.Iemma, F.Iuliano, F.Morando,
C.Orozco Minotas, G.Procaccianti,
R.Rashid
6. Open Coesione
portal about the fulfilment of
investments using the 2007-2013
European Cohesion funds
Interactive Interface
Downloadable .csv datasets
~100 billion Euros are being tracked,
~100K projects
http://www.opencoesione.gov.it/
19. ISO SQuaRE
Internal Quality
Values, formats, relation
External Quality
Technological environment
Quality in Use
Context of use of the data user
20. ISO 25012
Data Quality Model
SQuaRE divisions:
2500x Quality Management
2501x Quality Model
2502x Quality Measurement
2503x Quality Requirements
2504x Quality Evaluation
22. Data evaluator
Defines/adapts a quality model
Evaluate and act
Data correction
Technological adjustments
Organizational measures
23. Model structure
Characteristic
Main aspects, e.g., usability
Sub-Characteristic (optional)
A detailed aspect of a characteristic, e.g.
Understandability
Metric
A set of rules to assign and interpret a
(numerical) evaluation to a specific (sub)-
characteristic
25. Characteristics
Accuracy
Correspondence between data and reality
(syntactic and semantic)
Completeness
Computer: presence of all necessary
values
User: how much the data is able to satisfy
the needs
Consistency
Absence of contradictions in the data
26. Characteristics
Credibility
The extent to which data are regarded as
true and credible by users
Currentness
the extent to which data is up-to-date
Accessibility
The capability of data to be accessed,
particularly by people who need
supporting technology or special
configuration because of some disability
27. Characteristics
Regulatory compliance
The capability of data to adhere to standards,
conventions or regulations in force and similar
rules relating to data quality
Confidentiality
The capability of the data to be accessed and
interpreted only by authorized users
Efficiency
The capability of data to be processed (accessed,
acquired, updated, etc) and to provide
appropriate levels of performance using the
appropriate amounts and types of resources
under stated conditions
28. Characteristics
Precision
Capability of the value assigned to an
attribute to provide the degree of
information needed in a stated context of
use
Traceability
Presence of attributes providing an audit trail
of access and changes made to data
Understandability
The extent to which data can be read and
interpreted by users
29. Characteristics
Availability
The capability of data to be always
retrievable.
Recoverability
The capability to preserve a specified level of
operations and its physical and logical
integrity, even in the event of failure
Portability
The capability of data to be moved to
another platform preserving quality
31. ISO 25024
Measurement of Data Quality
32. Relationships among standards
[Diagram: relationships among the SQuaRE standards]
Quality models: ISO/IEC 25010 (System and Software Product Quality), ISO/IEC 25012 (Data Quality)
composed of: Quality characteristics and Quality sub-characteristics
composed of: Quality Measures (ISO/IEC 25022, 25023, 25024), each defined by a Measurement function
composed of: Quality Measure Elements, QME (ISO/IEC 25021), each obtained with a Measurement method
Property to quantify, Target Entity
Source: ISO/IEC 25024
33. Data Life Cycle: examples
[Diagram: stages of the data life cycle]
Data design
Data collection
Data integration
External data acquisition
Data processing
Presentation
Other use
Data store
Delete
Source: ISO/IEC 25024
34. Data design: target entities
Architecture
Contextual schema
Data models (conceptual, logical,
physical)
Data dictionary
Document
36. Other stages: target entities
Data file
DBMS
RDBMS
Form
Presentation device
37. Properties
Data format
Data item
Data value
Information item
Information item content
Data record
38. Metrics definition
a) ID: abbreviated code of the quality characteristic + (I/D) + serial number;
b) Name: QM name related to data;
c) Description;
d) Measurement function: formula showing how the QMEs are combined to produce the QM;
e) DLC, Target entities, Properties: DLC includes stages of the DLC where the data QMEs are applicable, target entities and properties of target entities;
f) Note: in the note, additional information such as an acceptable range of values, reference to other standards, explanations or interpretation or criteria, measurement method used to obtain the
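The fields listed on this slide map naturally onto a small record type. The sketch below is only an illustration: the class `QualityMeasure`, the helpers `parse_measure_id` and `ratio`, and the example description are hypothetical and are not taken from ISO/IEC 25024. Only the shape of the ID (characteristic code, I/D flag, serial number) follows field a).

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed ID shape, following field a): code + "-" + I or D + "-" + serial number.
ID_PATTERN = re.compile(r"([A-Za-z]+)-([ID])-(\d+)")


def parse_measure_id(measure_id: str) -> Tuple[str, str, int]:
    """Split an ID such as "Com-I-1" into (characteristic code, I/D flag, serial)."""
    match = ID_PATTERN.fullmatch(measure_id)
    if match is None:
        raise ValueError(f"not a valid measure ID: {measure_id!r}")
    code, flag, serial = match.groups()
    return code, flag, int(serial)


@dataclass(frozen=True)
class QualityMeasure:
    """Hypothetical record mirroring fields a) to f) of the slide."""
    id: str                                     # a) ID
    name: str                                   # b) Name
    description: str                            # c) Description
    function: Callable[[float, float], float]   # d) combines two QMEs into the QM
    dlc_stages: Tuple[str, ...]                 # e) applicable life-cycle stages
    target_entities: Tuple[str, ...]            # e) target entities
    properties: Tuple[str, ...]                 # e) properties of the target entities
    note: str = ""                              # f) Note


def ratio(a: float, b: float) -> float:
    """Measurement function X = A / B; 0.0 when B is zero (an assumed convention)."""
    return a / b if b else 0.0


complete_cells = QualityMeasure(
    id="Com-I-1",
    name="Percentage of complete cells",
    description="Share of cells that hold a value",  # illustrative wording
    function=ratio,
    dlc_stages=("Data store",),
    target_entities=("Data file",),
    properties=("Data value",),
)

print(parse_measure_id(complete_cells.id))
print(complete_cells.function(90, 100))
```

Keeping the measurement function as a field makes it possible to swap in a different formula per measure without changing the record layout.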
41. Open Government Data
OD: open data, data that can be
Used
Reused
Redistributed
By anyone and with any goal
G: Government, i.e. data produced or commissioned
by a government or by an institutional
entity controlled by the government
http://opengovernmentdata.org
42. Why OGD ?
Transparency
Social and commercial value
Participation
43. Case 1: Open Coesione
Published data
Structured
Open data format
OpenCoesione
Statistical data from municipalities
Residents
Weddings
Commercial activities
44. Datasets analyzed
Orchestrated disclosure
● Open Coesione
● portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
● 85 billion Euros are being tracked, 850K projects
Decentralized disclosure
Municipalities: Torino, Roma, Milano, Firenze, Bologna
Dataset              Municipalities covered
Residents            5 of 5
Weddings             3 of 5
Business Activities  3 of 5
46. Measures
Characteristic     Description                                     ISO name
Completeness       Percentage of complete cells                    Com-I-1 (cell)
                   Percentage of complete rows                     Com-I-1 (row)
Accuracy           Percentage of syntactically accurate cells      Acc-I-1
Traceability       Track of creation                               Tra-D-2 (c)
                   Track of update                                 Tra-D-2 (u)
Currentness        Percentage of current rows                      Cur-I-2
                   Delay in publication                            ~Cur-I-1
Compliance         e-GMS compliance                                Cmp-D-1
                   Five stars open data                            Cmp-D-1
Understandability  Percentage of columns with metadata             Und-I-3
                   Percentage of columns in comprehensible format  Und-I-4
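As an illustration of the two completeness measures in the table, the following sketch computes the share of complete cells and of complete rows for a CSV file. The rule for what counts as a complete cell (non-empty after trimming whitespace), the function name `completeness`, and the sample data are assumptions made for this example, not definitions from the standard or from the study.

```python
import csv
import io
from typing import List, Tuple


def completeness(rows: List[List[str]]) -> Tuple[float, float]:
    """Return (share of complete cells, share of complete rows).

    A cell counts as complete when it is non-empty after stripping whitespace
    (an assumed rule); a row is complete when all of its cells are.
    """
    cell_flags = [[bool(cell.strip()) for cell in row] for row in rows]
    total_cells = sum(len(flags) for flags in cell_flags)
    complete_cells = sum(sum(flags) for flags in cell_flags)
    complete_rows = sum(1 for flags in cell_flags if all(flags))
    cell_share = complete_cells / total_cells if total_cells else 0.0
    row_share = complete_rows / len(rows) if rows else 0.0
    return cell_share, row_share


# Hypothetical sample in the spirit of the municipal "Residents" datasets.
SAMPLE_CSV = """district,year,residents
Centro,2014,1200
Aurora,,950
Lingotto,2014,
Mirafiori,2014,1430
"""

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)            # skip the header row
data_rows = list(reader)
cell_share, row_share = completeness(data_rows)
print(f"Com-I-1 (cell) = {cell_share:.3f}, Com-I-1 (row) = {row_share:.3f}")
```

The row-level figure is always less than or equal to the cell-level one, which is why the two are reported separately.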
47. e-GMS
1. Accessibility (mandatory if applicable)
2. Addressee (optional)
3. Aggregation (optional)
4. Audience (optional)
5. Contributor (optional)
6. Coverage (recommended)
7. Creator (mandatory)
8. Date (mandatory)
9. Description (optional)
10. Digital signature (optional)
11. Disposal (optional)
12. Format (optional)
13. Identifier (mandatory if applicable)
14. Language (recommended)
15. Location (optional)
16. Mandate (optional)
17. Preservation (optional)
18. Publisher (mandatory if applicable)
19. Relation (optional)
20. Rights (optional)
21. Source (optional)
22. Status (optional)
23. Subject (mandatory)
24. Title (mandatory)
25. Type (optional)
UK e-Government Metadata Standard
https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
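One possible way to turn the obligation levels above into a compliance score is sketched below. The scoring rule (share of required elements that carry a non-empty value), the function name `egms_compliance`, and the sample metadata are assumptions for illustration; the slides do not state how the Cmp-D-1 value was actually computed.

```python
from typing import Dict, Iterable

# Obligation levels copied from the e-GMS list on the slide.
MANDATORY = ("Creator", "Date", "Subject", "Title")
MANDATORY_IF_APPLICABLE = ("Accessibility", "Identifier", "Publisher")


def egms_compliance(metadata: Dict[str, str],
                    applicable: Iterable[str] = MANDATORY_IF_APPLICABLE) -> float:
    """Share of required e-GMS elements that carry a non-empty value.

    `applicable` names the "mandatory if applicable" elements that apply
    to the dataset at hand (by default, all of them).
    """
    applicable_set = set(applicable)
    required = list(MANDATORY) + [
        element for element in MANDATORY_IF_APPLICABLE if element in applicable_set
    ]
    present = [e for e in required if metadata.get(e, "").strip()]
    return len(present) / len(required)


# Hypothetical metadata record for a municipal dataset.
sample_metadata = {
    "Title": "Residents by district",
    "Creator": "Comune di Torino",
    "Date": "2014-12-31",
    "Subject": "",          # present but empty, so it does not count
}

score = egms_compliance(sample_metadata, applicable=("Identifier",))
print(f"e-GMS compliance = {score:.2f}")
```

Recommended and optional elements are ignored here; a stricter variant could weight them instead.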
49. Results – Municipality data
[Figure: bar chart, scale 0 to 1, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4]
Callouts: discrepancies of values with domain; no info on updates; missing metadata
50. Findings
Disclosure strategy implies different
data quality
Centralized vs.
Decentralized
Traceability is generally lacking
Proposals to use software configuration management tools
Metadata is often missing or
incomplete
51. Case 2: Public Contracts
Published data
Structured
Open format
Data on public contracts pursuant to Art. 37
of the Transparency Decree, plus ANAC
prescriptions
52. Public contracts
Transparency Decree (Legislative Decree
No. 33 of 14 March 2013)
Public contracts (Art. 37 and Art. 9)
Open Data Publication
XML Standard Format (ANAC)
Selected administrations: Italian
universities
54. <lotto>
  <cig>4421574E47</cig>
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <oggetto>
    Procedura di cottimo fiduciario per affidamento servizio di manutenzione e
    assistenza di primo livello stazioni self-service
  </oggetto>
  <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
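A lot like the one above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is a minimal example: the element names come from the sample lot, while the function name `parse_lot`, the shortened copy of the lot, and the choice to treat a missing or empty amount as 0 are assumptions made for illustration. Note that the ampersand in the company name must be escaped as `&amp;` for the XML to be well-formed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Shortened copy of the sample lot shown on the slide.
LOT_XML = """<lotto>
  <cig>4421574E47</cig>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>"""


def parse_lot(xml_text: str) -> Dict[str, object]:
    """Extract the fields used by the quality checks from one <lotto> element."""
    lot = ET.fromstring(xml_text)

    def fiscal_codes(path: str) -> List[str]:
        # Collect the non-empty codiceFiscale values found under the given path.
        return [e.text.strip() for e in lot.findall(path) if e.text and e.text.strip()]

    return {
        "cig": (lot.findtext("cig") or "").strip(),
        "participants": fiscal_codes("partecipanti/partecipante/codiceFiscale"),
        "winners": fiscal_codes("aggiudicatari/aggiudicatario/codiceFiscale"),
        # Assumption: a missing or empty amount is read as 0.
        "awarded": float(lot.findtext("importoAggiudicazione") or 0),
        "paid": float(lot.findtext("importoSommeLiquidate") or 0),
    }


lot = parse_lot(LOT_XML)
print(lot["cig"], lot["winners"], lot["awarded"], lot["paid"])
```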
55. Quality Evaluation Framework
Intrinsic Dimensions
Dimension     Measure
Accuracy      Percentage of elements with correct values.
Completeness  Percentage of complete elements.
              Percentage of complete aggregate elements.
Domain Dependent Dimensions
Dimension     Measure
Consistency   Percentage of lots that meet the Intrarelational and Interrelational Integrity Constraints.
Duplication   Number of duplicates.
56. Identification of datasets
First 25 universities of the overall 2014 ranking
provided by the newspaper Il Sole 24 Ore.
Only 12 universities provide summary tables in
XML format.
Total number of assessed lots: 123702
Average number of published lots: 10308.5
The remaining 13 universities either do not
provide the summary tables or provide them
in a format other than XML.
57. CIG
Unique Tender Identifier
The University of Torino publishes summary tables that have 100% cig completeness, that is, 100% of lots have the cig element, but about 32% of them are out of domain.
A lot of “00000000000”.
[Figure: bar chart of cig accuracy and completeness per university, axis from 0.60 to 1.00]
University  Accuracy  Completeness
UniBo       1         1
PoliMi      0.94      1
PoliTo      0.9999    1
UniMi       0.999     1
UniTo       0.67      1
UniVe       1         1
UniUpo      0.99      1
UniFe       1         1
UniMib      0.998     1
UniPv       0.997     1
UniSa       0.9998    1
UnivPm      0.99      1
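The distinction between completeness (the element is there) and accuracy (its value is in the domain) can be sketched as follows. The domain rule used here, ten upper-case alphanumeric characters that are not all zeros, is an assumption inferred from the sample value `4421574E47` and from the placeholder values mentioned above; it is not the official ANAC definition. The function name `cig_quality` and the choice of denominators are likewise illustrative.

```python
import re
from typing import List, Optional, Tuple

# Assumed domain of a cig value: exactly ten upper-case letters or digits.
CIG_PATTERN = re.compile(r"[0-9A-Z]{10}")


def cig_quality(cigs: List[Optional[str]]) -> Tuple[float, float]:
    """Return (accuracy, completeness) for the cig values of a list of lots.

    None means the <cig> element is missing; an empty string means the
    element is present but empty. Completeness is measured over all lots,
    accuracy over the lots where the element is present (an assumption).
    """
    present = [c for c in cigs if c is not None]
    in_domain = [
        c for c in present
        if CIG_PATTERN.fullmatch(c) and set(c) != {"0"}   # reject all-zero placeholders
    ]
    completeness = len(present) / len(cigs) if cigs else 0.0
    accuracy = len(in_domain) / len(present) if present else 0.0
    return accuracy, completeness


# One valid value, two all-zero placeholders, one empty element, one missing element.
sample = ["4421574E47", "00000000000", "0000000000", "", None]
accuracy, completeness = cig_quality(sample)
print(f"accuracy = {accuracy:.2f}, completeness = {completeness:.2f}")
```

A dataset can therefore score 100% on completeness while a large share of its identifiers is unusable, which is the situation reported for the University of Torino.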
58. Choice of contracting part
The element is present for each lot but it is always empty.
[Figure: bar chart of sceltaContraente accuracy and completeness per university, axis from 0.00 to 1.00]
University  Accuracy  Completeness
UniBo       0.9999    1
PoliMi      0.998     1
PoliTo      0.9999    1
UniMi       0         1
UniTo       1         1
UniVe       1         1
UniUpo      1         1
UniFe       0.9991    0.999
UniMib      1         1
UniPv       1         1
UniSa       1         1
UnivPm      1         1
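59. Fiscal Code
All the lots published by University of Milano have a winner but no information about the participants.
[Figure: bar chart of codiceFiscale accuracy and completeness per university, axis from 0.90 to 1.00; universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
Plotted values, first series: 1, 0.97, 0.99, 1, 1, 1, 1, 1, 1, 1, 1
Plotted values, second series: 1, 1, 1, 1, 1, 0.974, 1, 1, 0.996, 1, 0.951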
60. Amount paid vs. Total paid
In 14% of lots the amount paid is greater than the awarded amount.
[Figure: bar chart of the share of lots with amount paid less than or equal to amount awarded, per university, axis from 0.80 to 1.00]
University  Paid less than or equal to awarded
UniBo       0.87
PoliMi      0.97
PoliTo      0.96
UniMi       0.9999
UniTo       0.998
UniVe       0.99
UniUpo      0.999
UniFe       0.93
UniMib      0.995
UniPv       0.9999
UniSa       0.98
UnivPm      0.98
61. Final considerations
ISO standard provides several
predefined measures
Must be adapted to the case at hand
Can be aggregated in different ways
Possibility to define new measures
ISO standard is intended for
structured data
What about semantic knowledge bases?
62. References
ISO/IEC 25012:2008, Software engineering — Software
product Quality Requirements and Evaluation (SQuaRE) —
Data quality model
ISO 25024:2015, Software engineering — Software product
Quality Requirements and Evaluation (SQuaRE) —
Measurement of data quality
Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico. “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”. Government Information Quarterly, Vol. 33, pp. 325-337, ISSN: 0740-624X
Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca. “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”. In IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
Editor's Notes
Transparency isn’t just about access; it is also about sharing and reuse. Often, to understand material it needs to be analyzed and visualized, and this requires that the material be open so that it can be freely used and reused.
To assess quality we consider two kinds of dimensions: intrinsic dimensions, which do not depend on the type of data, and domain-dependent dimensions. As intrinsic dimensions we evaluate Accuracy, computed as the percentage of elements with correct values, and Completeness, computed as the percentage of complete elements and the percentage of complete aggregate elements. An element is considered incorrect or incomplete if it does not meet the specification of its domain or the number of occurrences specified in the XML schema. For the domain-dependent dimensions we evaluate Consistency by defining a set of integrity constraints that strictly depend on the public contracts domain: for example, the amount paid must be less than or equal to the award amount, and if a public contract does not have a successful tenderer the amount paid must be equal to zero.
To conduct the evaluation we selected the first 25 universities of the general 2014 ranking provided by the newspaper Il Sole 24 Ore. Only 12 of them provide summary tables in XML format, for a total of 123702 assessed lots. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
Accuracy and completeness were computed for all elements, but we will show only the most interesting ones, and we will see only some of the integrity constraints defined to assess consistency. The cig is the unique identifier of a lot. The University of Torino has a cig completeness of 100%: the cig element is present in all analysed lots, but in 32% of cases it is out of domain.
The sceltaContraente is one of the most important elements because it specifies the procedure for the selection of the contractor, and it can be used by the authorities to detect illegal awards of contracts. High accuracy and completeness improve the transparency of contracts. The completeness of the sceltaContraente element for the University of Milano is 100%, but the percentage of correct elements is 0: the element is present in all the lots provided by the University of Milano, but its value is always empty.
The codiceFiscale is the unique identifier of the participants. An interesting aspect is that the University of Milano is not classified, because none of the summary tables provided by the University contains information about the participants.
This result is highlighted by two interrelational constraints, "lot has participant" and "successful tenderer is participant". The first computes the percentage of lots that have a successful tenderer and at least one participant, while the second computes the percentage of lots in which the successful tenderer is a participant for the same lot. In both cases the percentage for the University of Milano is equal to zero, because there is no information on participants in the analysed files. For the University of Milano-Bicocca the percentage for the "lot has participant" constraint is slightly higher than for "successful tenderer is participant", which means that although some lots have participants, the successful tenderer is not one of them.
The first intrarelational constraint computes the percentage of lots in which the amount paid is less than or equal to the award amount. We can see that 14% of the lots of the University of Bologna have an amount paid greater than the award amount, which shows that more public money was spent than was awarded. The successfulTenderer_amountPaid constraint computes the percentage of lots in which there is no information about the successful tenderer but the amount paid is different from zero. For 40% of the lots of the University of Bologna there is no information about the successful tenderer but an amount of money is paid out, so it is not possible to track the money: it is not known who receives it.
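The integrity constraints described in these notes can be expressed as simple predicates over a lot. The sketch below is one possible formulation, not the study's actual implementation: the constraint names, the dictionary layout of a lot, and the convention that an implication with a false premise counts as satisfied are all assumptions, so the percentages it yields are not directly comparable with the figures quoted above.

```python
from typing import Callable, Dict, List

Lot = Dict[str, object]

# Each constraint maps a lot to True when the lot satisfies it.
CONSTRAINTS: Dict[str, Callable[[dict], bool]] = {
    # Intrarelational: the amount paid must not exceed the awarded amount.
    "paid_le_awarded": lambda lot: lot["paid"] <= lot["awarded"],
    # Intrarelational: without a successful tenderer nothing should be paid.
    "no_winner_implies_nothing_paid": lambda lot: bool(lot["winners"]) or lot["paid"] == 0,
    # Interrelational: a lot with a successful tenderer has at least one participant.
    "winner_implies_participants": lambda lot: not lot["winners"] or bool(lot["participants"]),
    # Interrelational: every successful tenderer is also listed as a participant.
    "winners_are_participants": lambda lot: set(lot["winners"]) <= set(lot["participants"]),
}


def consistency(lots: List[Lot]) -> Dict[str, float]:
    """Share of lots that satisfy each constraint (0.0 for an empty list)."""
    n = len(lots)
    return {
        name: (sum(1 for lot in lots if check(lot)) / n if n else 0.0)
        for name, check in CONSTRAINTS.items()
    }


# Hypothetical lots; fiscal codes are shortened to single letters.
sample_lots: List[Lot] = [
    {"awarded": 7500.0, "paid": 7500.0, "winners": ["A"], "participants": ["A"]},
    {"awarded": 1000.0, "paid": 1200.0, "winners": ["B"], "participants": ["B", "C"]},
    {"awarded": 0.0, "paid": 500.0, "winners": [], "participants": []},
    {"awarded": 2000.0, "paid": 1500.0, "winners": ["D"], "participants": []},
]

scores = consistency(sample_lots)
for name, share in scores.items():
    print(f"{name}: {share:.2f}")
```

Keeping the constraints in a dictionary makes it easy to add further domain rules without touching the scoring function.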