TELL ME QUALITY
19 July 2017
MARCO BERLOT
SOFTWARE DOCUMENTATION
This work is released under the terms of the Creative Commons Public License. The full text of the
license, version 4.0, can be found at: http://creativecommons.org/licenses/by-sa/4.0/deed.it.
You are free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if
changes were made. You may do so in any reasonable manner, but not in any way that suggests the
licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your
contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that
legally restrict others from doing anything the license permits.
Image on the cover page: Airiesummer/iStock
TABLE OF CONTENTS
ABSTRACT
The Nexa Center for Internet & Society
Preliminary Design Choices: Technologies
Node.js
D3.js
GIT
JSON
MUSTACHE
REST
Shapes Constraint Language (SHACL)
ARCHITECTURE
FRONTEND
HOME PAGE
UPLOAD PAGE
PAGE FOR THE CONFIGURATION FILE
AVAILABLE MEASUREMENTS PAGE
DATA VISUALISATION PAGE
SHOW DETAIL WINDOW
VALUES FOR FIELDS PAGE
BACKEND
UPLOAD FUNCTION
CONFIGURATION OF THE SHAPE FILE
SELECT TYPE OF MEASUREMENT
EXPORT MEASUREMENTS
A REAL CASE
Uploading the file
Choosing the Measure
Reading the Results
Final Remarks
ABSTRACT
The aim of this internship project is to develop a graphical user interface for the Tell Me
Quality tool. The tool performs quality measurements on sets of data and displays
the results to the user. The owner of the data can upload different types of data to the
platform. He can also choose either to upload a Shape File [1] or to use a Wizard Menu on the
platform to give more information about his files. The system then performs some
preliminary tests to see whether it will be able to perform all the different measurements.
Once the user sees the results of these tests, he can choose to provide more metadata or to run
the measurements. The final result shows graphs that explain the outcome of the
measurements. The design of the system was divided into two main parts: one related to the
server side, which performs the measurements, and one focused on the front end
and on the study of how to implement effective visualisations. This documentation deals
with the second part.
The Nexa Center for Internet & Society
“The Nexa Center for Internet & Society is born from the activities of an initially informal
interdisciplinary group – with expertise in technology, law and economics – that grew up in
Torino from 2003 and that has conceived, designed and implemented a number of
initiatives: Creative Commons Italia (2003-present), CyberLaw Torino (2004), Harvard
Internet Law Program Torino (2005), SeLiLi, free legal advice on open licenses for creators
and programmers (2006-2013), COMMUNIA, the European Commission-funded thematic
network of 50 partners aimed at studying the digital public domain (2007-2011), Neubot, a
software project on network neutrality (2008-present), and LAPSI, the European thematic
network on legal aspects of public sector information funded by the European Commission
(2010-2012).
The Nexa Center cooperates with international partners for the development of joint
interdisciplinary research projects and initiatives. From October 2014 to October 2016, the
Nexa Center took the role of coordinator of the Global Network of Internet & Society
Research Centers (NoC); launched by the Berkman Center for Internet & Society at
Harvard University in December 2012, the Network supports the cooperation among the
main Internet & society research centers worldwide. The Nexa Center has partnership
agreements with the Berkman Center for Internet & Society at Harvard University (2011)
and with the Internet & Society Lab at Keio University, Tokyo (2012). In 2013 the Nexa
Center has become part of the Global Network Initiative (GNI), a multi-stakeholder
international group devoted to protect and advance freedom of expression and privacy in
the ICT sector.”
Reference here: https://nexa.polito.it/
[1] A Shape File is a file containing metadata that gives the system the specific
information it needs to perform more accurate measurements.
Preliminary Design Choices: Technologies
The first choice we faced in the design of the system concerned the programming language
best suited to the task. JavaScript was chosen for all the parts that concern the server side,
with Node.js as the run-time environment.
For the Data Visualisation part we chose the D3.js library. It is one of the best-known and
most widely used libraries for representing data, and it makes it possible to dynamically
create SVG graphics that take data as input.
Finally, in order to work efficiently as a team, we chose to use a Git repository.
The main technologies employed in the development of the front end of the system are briefly
described below. Each description is taken directly from the relevant website,
and a reference link is provided at the end of each paragraph.
Node.js
“Node.js is an open-source, cross-platform JavaScript runtime environment for developing
a diverse variety of server tools and applications. Although Node.js is not a JavaScript
framework, many of its basic modules are written in JavaScript, and developers can write
new modules in JavaScript. The Node.js distributed development project, governed by the
Node.js Foundation, is facilitated by the Linux Foundation's Collaborative Projects
program.”
For the whole documentation: http://nodejs.org/api/
D3.js
“D3.js (or just D3 for Data-Driven Documents) is a JavaScript library for producing
dynamic, interactive data visualisations in web browsers. It makes use of the widely
implemented SVG, HTML5, and CSS standards. Embedded within an HTML webpage, the
JavaScript D3.js library uses pre-built JavaScript functions to select elements, create SVG
objects, style them, or add transitions, dynamic effects or tooltips to them. These objects can
also be widely styled using CSS. Large datasets can be easily bound to SVG objects using
simple D3.js functions to generate rich text/graphic charts and diagrams. The data can be
in various formats, most commonly JSON, comma-separated values (CSV) or geoJSON, but,
if required, JavaScript functions can be written to read other data formats.”
For the whole documentation: https://d3js.org/
GIT
“Git is a version control system (VCS) for tracking changes in computer files and
coordinating work on those files among multiple people. It is primarily used for software
development, but it can be used to keep track of changes in any files. As a distributed
revision control system it is aimed at speed, data integrity, and support for distributed, non-
linear workflows.”
For the whole documentation: http://git-scm.com/doc
JSON
“JSON is an open-standard format that uses human-readable text to transmit data objects
consisting of attribute–value pairs. It is the most common data format used for
asynchronous browser/server communication, largely replacing XML, and is used by AJAX.
JSON is a language-independent data format. It was derived from JavaScript, but as of
2017 many programming languages include code to generate and parse JSON-format data.
The official Internet media type for JSON is application/json. JSON filenames use the
extension .json.”
For the whole documentation: https://tools.ietf.org/html/rfc7159
MUSTACHE
Mustache is a “logic-less” template syntax. “Logic-less” means that it doesn’t rely on
procedural statements (if, else, for, etc.): Mustache templates are entirely defined with tags.
Mustache is implemented in different languages: Ruby, JavaScript, Python, PHP, Perl,
Objective-C, Java, .NET, Android, C++, Go, Lua, Scala, etc. Mustache.js is the JavaScript
implementation.
For the whole documentation: http://mustache.github.io/
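To illustrate what “logic-less” means in practice, here is a minimal sketch of a tag-based renderer. It is a simplified stand-in written for this documentation, not the Mustache.js API: it only supports {{name}} variables and {{#list}}...{{/list}} sections, and it does not escape HTML. The function name render and the sample field names are assumptions.

```javascript
// Simplified stand-in for a Mustache-style renderer (NOT the Mustache.js API).
function render(template, view) {
  // Sections: {{#key}}...{{/key}} repeat their body once per array element.
  const withSections = template.replace(
    /\{\{#(\w+)\}\}([\s\S]*?)\{\{\/\1\}\}/g,
    (match, key, body) => {
      const value = view[key];
      if (Array.isArray(value)) {
        return value.map((item) => render(body, item)).join("");
      }
      // Non-array values: render the body once if truthy, otherwise drop it.
      return value ? render(body, view) : "";
    }
  );
  // Variables: {{key}} is replaced by the matching property (empty if missing).
  return withSections.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    view[key] === undefined ? "" : String(view[key])
  );
}

// Hypothetical view: the fields found in an uploaded file.
const fieldTemplate = "<ul>{{#fields}}<li>{{name}}</li>{{/fields}}</ul>";
const fieldPage = render(fieldTemplate, {
  fields: [{ name: "ID" }, { name: "NAME" }, { name: "EMAIL" }],
});
console.log(fieldPage);
```

The template contains no if or for statement: the repetition is driven entirely by the data bound to the section tag.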
REST
“Representational state transfer (REST) or RESTful web services is a way of providing
interoperability between computer systems on the Internet. REST-compliant Web services
allow requesting systems to access and manipulate textual representations of Web
resources using a uniform and predefined set of stateless operations. Other forms of Web
service exist, which expose their own arbitrary sets of operations such as WSDL and
SOAP.
"Web resources" were first defined on the World Wide Web as documents or files identified
by their URLs, but today they have a much more generic and abstract definition
encompassing every thing or entity that can be identified, named, addressed or handled, in
any way whatsoever, on the Web. In a RESTful Web service, requests made to a
resource's URI will elicit a response that may be in XML, HTML, JSON or some other
defined format. The response may confirm that some alteration has been made to the
stored resource, and it may provide hypertext links to other related resources or collections
of resources. Using HTTP, as is most common, the kind of operations available include
those predefined by the HTTP verbs GET, POST, PUT, DELETE and so on.”
For the whole documentation: https://en.wikipedia.org/wiki/Representational_state_transfer
Shapes Constraint Language (SHACL)
“[The] SHACL Shapes Constraint Language [is] a language for validating RDF graphs against a
set of conditions. These conditions are provided as shapes and other constructs
expressed in the form of an RDF graph. RDF graphs that are used in this manner are
called "shapes graphs" in SHACL and the RDF graphs that are validated against a shapes
graph are called "data graphs". As SHACL shape graphs are used to validate that data
graphs satisfy a set of conditions they can also be viewed as a description of the data
graphs that do satisfy these conditions. Such descriptions may be used for a variety of
purposes beside validation, including user interface building, code generation and data
integration.
The following example data graph contains three SHACL instances of the class ex:Person.
The following conditions are shown in the example:
• A SHACL instance of ex:Person can have at most one value for the property ex:ssn,
and this value is a literal with the datatype xsd:string that matches a specified regular
expression.
• A SHACL instance of ex:Person can have unlimited values for the property
ex:worksFor, and these values are IRIs and SHACL instances of ex:Company.
• A SHACL instance of ex:Person cannot have values for any other property apart from
ex:ssn, ex:worksFor and rdf:type.”
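The three conditions listed above can be illustrated with a few lines of plain JavaScript. This is only a sketch and not a SHACL engine: the data model (an object mapping each property name to an array of values), the regular expression for ex:ssn, and the function name validatePerson are all assumptions made for the example.

```javascript
// Plain-JavaScript illustration of the three conditions; NOT a SHACL engine.
// Assumed data model: each node maps a property name to an array of values.
const SSN_PATTERN = /^\d{3}-\d{2}-\d{4}$/; // assumed regular expression
const ALLOWED_PROPERTIES = new Set(["ex:ssn", "ex:worksFor", "rdf:type"]);

function validatePerson(person, companyIris) {
  const violations = [];

  // Condition 1: at most one ex:ssn, a string matching the pattern.
  const ssnValues = person["ex:ssn"] || [];
  if (ssnValues.length > 1) {
    violations.push("ex:ssn has more than one value");
  }
  for (const value of ssnValues) {
    if (typeof value !== "string" || !SSN_PATTERN.test(value)) {
      violations.push(`ex:ssn value "${value}" does not match the pattern`);
    }
  }

  // Condition 2: any number of ex:worksFor values, each a known ex:Company.
  for (const value of person["ex:worksFor"] || []) {
    if (!companyIris.has(value)) {
      violations.push(`ex:worksFor value "${value}" is not an ex:Company`);
    }
  }

  // Condition 3: no property other than the three allowed ones.
  for (const property of Object.keys(person)) {
    if (!ALLOWED_PROPERTIES.has(property)) {
      violations.push(`property ${property} is not allowed`);
    }
  }
  return violations;
}

// Hypothetical data: one conforming and one non-conforming node.
const knownCompanies = new Set(["ex:Acme"]);
const conformingPerson = {
  "rdf:type": ["ex:Person"],
  "ex:ssn": ["123-45-6789"],
  "ex:worksFor": ["ex:Acme"],
};
const nonConformingPerson = {
  "rdf:type": ["ex:Person"],
  "ex:ssn": ["123", "987-65-4321"],
  "ex:birthDate": ["1990-01-01"],
};
console.log(validatePerson(conformingPerson, knownCompanies));
console.log(validatePerson(nonConformingPerson, knownCompanies));
```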
ARCHITECTURE
This section deals with the whole architecture of the system. For this reason it is divided
into a front end part and a back end part. Every function is explained from the point of view
of the user’s experience. Topics such as the purpose of every button, the specific
choices of graph implementation and how the user should read the measures are explained
in this section.
In addition, for every functionality of the front end there is, at the end of the paragraph, a
section dedicated to the main technical aspects of that specific page. Its purpose is to
give some further information to developers.
The technical section is marked by an icon and by the heading “Technical Aspects”.
FRONTEND
HOME PAGE
The main page of the front end (figure 1) gives a brief description of the purpose of
the system and presents a button that allows the user to start uploading the file he
wants to measure.
Figure 1. Homepage
UPLOAD PAGE
On this page the user can upload his own set of files. The system is able to process two
file formats: JSON and CSV.
A function shows the user that the upload was successful by displaying the
name of the file under the UPLOAD button (figure 3).
Once the file has been uploaded, the user is presented with two alternatives: either
upload a Shape File or configure one manually. The manual configuration is managed by the “Page for
the configuration file” (analysed in the following section). The purpose of a Shape File is to
give more information about the metadata of a file. The more information is provided to the
system, the larger the number of available measurements.
Technical Aspects
The animation at the top of the page is based on open-source code
implementing a JavaScript function. This function animates dots, drawing a
line between two dots when they come within a minimum distance of each other and removing
it when they move beyond a maximum distance. It is contained in the animation.js file.
Further information: https://blog.alexwendland.com/2015/particle-network-js-animations/
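The linking rule of the animation can be sketched as follows. This is not the code of animation.js: it is a minimal illustration that assumes the lines are the elements being added and removed, and the names updateLinks, minDistance and maxDistance are hypothetical.

```javascript
// Euclidean distance between two dots of the form { x, y }.
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Returns the new set of links ("i-j" keys) given the current dot positions:
// a link appears when two dots come within minDistance and
// disappears once they are farther apart than maxDistance.
function updateLinks(dots, links, minDistance, maxDistance) {
  const next = new Set(links);
  for (let i = 0; i < dots.length; i++) {
    for (let j = i + 1; j < dots.length; j++) {
      const key = `${i}-${j}`;
      const d = distance(dots[i], dots[j]);
      if (d <= minDistance) {
        next.add(key);
      } else if (d > maxDistance) {
        next.delete(key);
      }
      // Between the two thresholds the link keeps its previous state.
    }
  }
  return next;
}

// Hypothetical frame: dots 0 and 1 are close, dot 2 is far away.
const frameDots = [{ x: 0, y: 0 }, { x: 3, y: 4 }, { x: 100, y: 0 }];
const frameLinks = updateLinks(frameDots, new Set(), 10, 50);
console.log([...frameLinks]);
```

In a real animation this function would be called once per frame, after moving the dots and before drawing.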
Figure 2. Upload button
Figure 3. Remove button
PAGE FOR THE CONFIGURATION FILE
By choosing each metadata item and filling in its fields, the user creates a Shape File that is
then sent to the server. One of the challenges of this page was to make it entirely dynamic.
The fields depend on the uploaded file, so it is impossible to know their values in advance.
Several JavaScript functions that use the Mustache framework
dynamically create the page based on the data contained in the file.
Another problem was that the user first uploads the file, then manually
configures the Shape File and then returns to the upload page. It was necessary to save the
name of the uploaded file in order to show the user that, even after the configuration, the file was
still uploaded. To achieve this, the name of the uploaded file (and likewise the name of an
uploaded configuration file, if there is one) is passed through the URL as an argument
(figure 5).
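A minimal sketch of this technique, using the standard URL and URLSearchParams classes, is given here. The base address and the parameter names file and shape are assumptions; the actual names used by the tool are not documented in this text.

```javascript
// Builds the address of the upload page, carrying the uploaded file names
// as query arguments. Parameter names "file" and "shape" are hypothetical.
function buildUploadPageUrl(baseUrl, dataFileName, shapeFileName) {
  const url = new URL(baseUrl);
  url.searchParams.set("file", dataFileName);
  if (shapeFileName) {
    url.searchParams.set("shape", shapeFileName);
  }
  return url.toString();
}

// Reads the names back when the upload page is loaded again.
function readUploadedNames(urlString) {
  const params = new URL(urlString).searchParams;
  return { file: params.get("file"), shape: params.get("shape") };
}

// Hypothetical local address; the file names are the ones used in this document.
const uploadPageUrl = buildUploadPageUrl(
  "http://localhost:3000/upload",
  "Dataset.txt",
  "conf_file.txt"
);
console.log(uploadPageUrl);
console.log(readUploadedNames(uploadPageUrl));
```

Using searchParams instead of string concatenation means that file names containing spaces or special characters are encoded and decoded automatically.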
Once the file has been uploaded and the configuration file has been uploaded and modified, the page
will look like this:
Figure 4. Configuration page
Figure 5. Uploaded file name passed as a URL argument
Looking at the page, it’s clear that the Dataset.txt file has been uploaded and that conf_file.txt
is the Shape File that will be used. From this view the user can decide either to manually
modify the configuration file he uploaded (“conf_file.txt”) or to run the measurement directly.
If the uploaded file is removed, the frames “Shape Your File” and “Start the Measure”
disappear, forcing the user to upload another file. When the user clicks the “RUN THE
MEASUREMENT” button, the system sends the two files to the server (or only one, if
the configuration file was only configured manually) and receives a response listing all the
available measurements. These measurements are shown on the following page.
Technical Aspects
The data entered in the configuration file page modifies the JSON object
shapeResult. This object is received from the server through a GET request. It contains
all the possible fields, without metadata. Every time a metadata item is entered, the
information is stored in the JSON object, and when the SEND button is pressed a POST request
containing shapeResult is sent to the server.
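The life cycle of shapeResult can be sketched as follows. The internal structure of the object is not described in this document, so the layout used here (a fields map with one entry per field) and the helper names emptyShapeResult, setMetadata and toPostBody are assumptions. The network calls themselves are left out.

```javascript
// Hypothetical layout of the object returned by the GET request:
// every field of the uploaded file, with no metadata yet.
function emptyShapeResult(fieldNames) {
  const fields = {};
  for (const name of fieldNames) {
    fields[name] = {};
  }
  return { fields };
}

// Called every time the user enters a metadata item for a field.
function setMetadata(shapeResult, fieldName, key, value) {
  if (!Object.prototype.hasOwnProperty.call(shapeResult.fields, fieldName)) {
    throw new Error(`Unknown field: ${fieldName}`);
  }
  shapeResult.fields[fieldName][key] = value;
  return shapeResult;
}

// Called when the SEND button is pressed: this string would be the POST body.
function toPostBody(shapeResult) {
  return JSON.stringify(shapeResult);
}

// Hypothetical fields and metadata.
const shapeResult = emptyShapeResult(["ID", "NAME", "EMAIL"]);
setMetadata(shapeResult, "ID", "datatype", "integer");
setMetadata(shapeResult, "EMAIL", "pattern", "^.+@.+$");
console.log(toPostBody(shapeResult));
```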
Figure 6. Upload page
AVAILABLE MEASUREMENTS PAGE
After processing the configuration file, the system shows the user all the
measurements it can perform (figure 7). If the available measurements do not
satisfy the user’s requirements, it is possible to go back to the configuration file menu in
order to give more metadata to the platform. For every type of measurement (Accuracy,
Completeness, Consistency, Credibility and Compliance), clicking
on the respective button opens a new page (figure 8) with more detailed information.
Figure 7. Available measurements page
Figure 8. Page of specific type of measurements
On this page the user can choose to perform different types of a given measure on
some specific fields (in the example the fields are ID, NAME and EMAIL). This allows the
system to run the measurement only on the attributes that interest the user, leading to a more
efficient computation. The “SELECT ALL” buttons make it
possible to select all the fields of the same column. Since the file can contain many fields,
this can save the user some time.
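The selection logic behind the “SELECT ALL” buttons can be sketched as a small state object. The layout is an assumption: each column is taken to be one type of measure, and each row one field of the file. The function names are hypothetical.

```javascript
// Assumed state: selection[measure][field] is true when the measure
// has to be run on that field. Nothing is selected at the start.
function createSelection(measures, fields) {
  const selection = {};
  for (const measure of measures) {
    selection[measure] = {};
    for (const field of fields) {
      selection[measure][field] = false;
    }
  }
  return selection;
}

// Ticks or unticks a single checkbox.
function toggle(selection, measure, field) {
  selection[measure][field] = !selection[measure][field];
}

// "SELECT ALL": ticks every field in the column of one measure.
function selectAll(selection, measure) {
  for (const field of Object.keys(selection[measure])) {
    selection[measure][field] = true;
  }
}

// Lists the fields on which a measure will be run.
function selectedFields(selection, measure) {
  return Object.keys(selection[measure]).filter(
    (field) => selection[measure][field]
  );
}

// Hypothetical columns; the field names are the ones of the example page.
const selection = createSelection(["Syntactic", "Semantic"], ["ID", "NAME", "EMAIL"]);
toggle(selection, "Syntactic", "EMAIL");
selectAll(selection, "Semantic");
console.log(selectedFields(selection, "Syntactic"));
console.log(selectedFields(selection, "Semantic"));
```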
The interesting point of the implementation of these two pages is that they are entirely
dynamic. The front end only receives from the server a JSON object
containing all the measurements that can be performed. This was again realised through the
Mustache framework. Starting from the JSON object received from the server, a
specific function counts how many measurements can be performed out of the
total number that the system is able to execute, and displays both numbers in the “Available
measurements” page. After that, every time the user presses the button of a specific
measurement, the front end dynamically creates a page containing all the possible measurements with
their related fields. With this approach there is only one HTML page for all the
measurements, and it is generated automatically when the user presses a specific button. This
makes the code more efficient and manageable.
Technical Aspects
The system generates the two pages from two JSON objects received from
the server through a POST request. One JSON object contains all the possible fields; the other contains the
possible measurements and the fields the user can choose. The system dynamically counts
how many types of measurement are available in total and how many it is able to perform
at that moment. It then displays the two numbers in the AVAILABLE MEASUREMENTS
PAGE. The names of all the measures and their descriptions are contained, respectively,
in the classes and definition objects. In this way, every time a
definition needs to be shown, the system receives the name of the measurement from the server, matches
the string in the definition object and then displays it.
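The counting and lookup steps can be sketched as follows. The names classes and definition come from the text above, but their contents, the layout of the server response and the helper names are assumptions. The definition strings are placeholders, not the real descriptions.

```javascript
// Names of all the measurement types the system implements.
const classes = ["Accuracy", "Completeness", "Consistency", "Credibility", "Compliance"];

// Placeholder descriptions; the real texts are not reproduced here.
const definition = {
  Accuracy: "<definition of Accuracy>",
  Completeness: "<definition of Completeness>",
  Consistency: "<definition of Consistency>",
  Credibility: "<definition of Credibility>",
  Compliance: "<definition of Compliance>",
};

// Counts how many measurement types are available out of the total.
// Assumed response layout: one key per available type, listing its fields.
function countMeasurements(available) {
  const availableCount = classes.filter((name) =>
    Object.prototype.hasOwnProperty.call(available, name)
  ).length;
  return { available: availableCount, total: classes.length };
}

// Matches the name received from the server against the definition object.
function definitionFor(name) {
  return Object.prototype.hasOwnProperty.call(definition, name)
    ? definition[name]
    : "No definition available";
}

// Hypothetical server response.
const serverResponse = {
  Accuracy: ["ID", "NAME", "EMAIL"],
  Completeness: ["ID", "NAME"],
  Compliance: ["EMAIL"],
};
console.log(countMeasurements(serverResponse));
console.log(definitionFor("Accuracy"));
```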
DATA VISUALISATION PAGE
The purpose of this page is to show the user a set of general results concerning the
measurements. Considerable study went into finding the most effective way to show the
data, following some specific rules (Torchiano, 2017). The first
graph, a Radar Chart (figure 9), gives the big picture of the whole set of
measurements. Since the area of the chart reflects the results of the entire set of measurements, it
should be quite clear to the user whether his data has a good overall quality or not. It
should also be evident whether there is a marked imbalance between the different types of
measurement. Going into more detail, the graph reports the percentage obtained for each
measurement. Each of these results is the average of all the percentages obtained from
the different standards of measurement. From this point the user can see
the details of every type of measurement he performed by clicking on the “Show Details”
button.
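The averaging step, and the way an averaged value becomes a vertex of the radar polygon, can be sketched as follows. The layout of the results object, the function names and all the percentages are assumptions made up for illustration. The geometry is the usual polar-to-Cartesian conversion, not necessarily the one used by the tool.

```javascript
// Arithmetic mean of a list of percentages (0 for an empty list).
function average(values) {
  if (values.length === 0) return 0;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// One radar axis per measurement type, valued with the average of its results.
function radarAxes(results) {
  return Object.entries(results).map(([name, partial]) => ({
    name,
    value: average(Object.values(partial)),
  }));
}

// Converts each axis value (0-100) into a point around the centre (cx, cy).
// The first axis points straight up; the others follow clockwise on screen.
function radarVertices(axes, cx, cy, radius) {
  return axes.map((axis, i) => {
    const angle = (2 * Math.PI * i) / axes.length - Math.PI / 2;
    const r = (radius * axis.value) / 100;
    return { x: cx + r * Math.cos(angle), y: cy + r * Math.sin(angle) };
  });
}

// Made-up percentages, for illustration only.
const sampleResults = {
  Accuracy: { Syntactic: 80, Semantic: 60, "Data Assurance": 100, "Risk of Data Set Inaccuracy": 40 },
  Completeness: { "measure A": 50, "measure B": 100 },
  Compliance: {},
};
const axes = radarAxes(sampleResults);
const vertices = radarVertices(axes, 100, 100, 100);
console.log(axes);
console.log(vertices);
```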
A collapsible menu was chosen for the detailed graphs so that all the information is available
on one page and the user can switch between the desired graphs without having to
scroll the entire page.
Figure 9. Radar chart
SHOW DETAIL WINDOW
The aim of this frame is to show specific results for every measurement (in the
picture, for example, the “Accuracy” measurement). The Accuracy, in this case, was
measured according to four measures: “Syntactic”, “Semantic”, “Data Assurance” and “Risk of
Data Set Inaccuracy”.
From the horizontal bar chart (figure 10) it’s possible to get all the information related to every
specific measurement that was employed. A bar chart was chosen because,
in this section, it is fundamental for the user to compare the specific results. Through
the width of the bars this task becomes quite intuitive. Since a bar can sometimes be very
short (a low result), there may not be enough space to display the percentage. For this
reason, every time the user moves the mouse over the graph, a label containing the type of
measurement and the percentage appears (figure 11).
The average of all these results is the value reported on the Radar Chart. On the right
side of the graph a paragraph gives the definition of the type of measurement the
user is looking at, so that the meaning of the result is always clear. Clicking on the
“Show Values for Fields” button (figure 12), the user is redirected to the “Values For
Fields” page.
Figure 10. Horizontal graph
Figure 11. Detail of the horizontal graph
VALUES FOR FIELDS PAGE
On this last page it is possible to see the results related to the fields that were chosen by
the user. For every selected field there is a dedicated graph (figure 13) that includes all
the measurements. In this way it is possible to compare how the results of a single measurement
change across the different fields of a file.
As in the previous page, the bar chart was considered the best
solution for this type of data because it makes comparisons easy.
Technical Aspects
All the graphs were generated using the D3 library. In this way, starting from a JSON
object, the visualisations can show the results dynamically. This object is retrieved from
the server through a GET request.
Some modifications were made in order to show the graphs correctly and with the
right dimensions. Since both pages display the same type of graph more than once, a
template was built so that the graphs can be drawn in a for loop.
Here again, all the names and descriptions of the measurements are taken from the JSON
object stored on the local server. If a measurement is impossible to perform, it is not
present in the set of results sent by the server. For this reason, before displaying
the graphs the system checks whether any measures are missing. If so, it writes
next to the missing measure: “MEASUREMENT IMPOSSIBLE TO PERFORM”.
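The check for missing measures can be sketched without D3, as a step that prepares one row per expected measure before the drawing loop runs. The function name, the row layout and the percentages are assumptions; only the label text is taken from this document.

```javascript
const MISSING_LABEL = "MEASUREMENT IMPOSSIBLE TO PERFORM";

// Builds one row per expected measure: a value when the server sent a result,
// the warning label when the measure is absent from the results.
function rowsForChart(expectedNames, results) {
  const rows = [];
  for (const name of expectedNames) {
    if (Object.prototype.hasOwnProperty.call(results, name)) {
      rows.push({ name, value: results[name] });
    } else {
      rows.push({ name, label: MISSING_LABEL });
    }
  }
  return rows;
}

// Made-up results: "Data Assurance" could not be measured.
const expectedMeasures = ["Syntactic", "Semantic", "Data Assurance"];
const receivedResults = { Syntactic: 92, Semantic: 71 };
const chartRows = rowsForChart(expectedMeasures, receivedResults);
for (const row of chartRows) {
  // In the real page this is where one bar (or the warning) would be drawn.
  console.log(row.name, "value" in row ? `${row.value}%` : row.label);
}
```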
Figure 13. Graph for single field
BACKEND
This section analyses the main functions of the backend server. Each function that
involves both the backend and the front end is represented by a Use Case.
HOW TO READ THE USE CASES
Every Use Case represented in this section was drawn according to
the Unified Modeling Language (UML) standard.
“UML is a general-purpose, developmental, modeling language in the field
of software engineering, that is intended to provide a standard way to
visualize the design of a system.” [2]
Each diagram has two main areas: the Client side
and the Server side. Every action is represented by an oval, and
some actions lead to an interface element, which stands for the display of a new
page or window. Finally, the human figure represents the user, and
the links from it to the actions are the tasks the user can perform.
[2] The Unified Modeling Language User Guide (2nd ed.). Addison-Wesley, 2005.
UPLOAD FUNCTION
The same function is used to upload both the data file that is the object of the quality
measurements and the Shape File. The user clicks the Upload button, which opens
a new window. This window shows the file system of the hard disk of the machine
on which the software is running. The user chooses the files he wants to upload,
and once he confirms, the files are sent to the server and stored there.
Figure 14. Upload function
CONFIGURATION OF THE SHAPE FILE
This Use Case represents the manual configuration of the Shape File. In this
situation the user clicks on the “Configure The Shape File” button in order to open
another window. This new page is the one described in the frontend part as the “Page for
the configuration file”. Every time the user starts typing a metadata value, there is an
interaction with the server, which suggests some possible alternatives. At the end the
user is able to export his configuration.
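The suggestion step can be sketched as a simple prefix search on the server side. How the real server produces its suggestions is not described here, so the function suggest, the limit of five results and the vocabulary (a few SHACL constraint names) are assumptions.

```javascript
// Returns the vocabulary terms that start with what the user has typed so far.
// The match ignores case and surrounding spaces; an empty input gives no hints.
function suggest(prefix, vocabulary, limit = 5) {
  const needle = prefix.trim().toLowerCase();
  if (needle === "") return [];
  return vocabulary
    .filter((term) => term.toLowerCase().startsWith(needle))
    .slice(0, limit);
}

// Hypothetical vocabulary of metadata names.
const metadataVocabulary = [
  "datatype",
  "maxCount",
  "maxLength",
  "minCount",
  "minLength",
  "pattern",
];
console.log(suggest("max", metadataVocabulary));
console.log(suggest("P", metadataVocabulary));
```

Each keystroke in the configuration page would send the current text to the server and display the returned list under the input.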
Figure 15. Configuration of the shape file function
SELECT TYPE OF MEASUREMENT
When the user chooses the type of measurements to perform and wants to select some
particular fields for them, he goes through the process represented by this Use
Case. By clicking on the type of measurement he is redirected to a new page, the one
represented by the interface entity in the Use Case. From this interface the user can
choose the fields for every type of measurement. At the end of this selection, all the data are
sent to the server as the shapeFile object.
Alternatively, a user who already has a shapeFile can upload it
directly.
Figure 16. Selection of type of measurement
EXPORT MEASUREMENTS
Every time a user sees the results of some measurements, he is able to export the
visualisations of those results. This is done through an interaction between the front end
and the server, represented by this Use Case: the front end asks the server for the
files and receives them, so that the user can download them.
Figure 16. Export of measurements
A REAL CASE
Uploading the file
The aim of this section is to show the main functions of the tool using a real set of data. To
do so, information coming from more than 300,000 XML files published by
15,000 Italian public bodies is used.
The following picture represents part of the Shape File used for the
measurement. The file shows that two of the fields taken into consideration
for the measurements are the Identifier and the Payment.
Once the file is uploaded, the system redirects the user to the page with all the
possible measurements he can perform.
Figure 17. Shape File
Choosing the Measure
The system now shows the user how many measurements can be performed
on the uploaded dataset out of the whole set of measurements implemented by the
system.
A further step is to choose a specific type of measurement for every field. In particular,
according to the uploaded Shape File, the fields are the following:
• Identifier
• Start date
• End date
• Agreed price
• Payment
• Procedure type
• Business Entity ID
The resulting page used for choosing the specific fields for the Accuracy measurement is shown in
the following picture.
Figure 18. List of measurements
Reading the Results
The radar chart gives an overall idea of the results of the whole set of measurements.
Measurements shown at 0% are measurements that have not been performed.
Figure 19. Fields per measurement
Figure 20. Radar chart
The Accuracy graph would look like this:
Figure 21. Accuracy results
Finally, looking at the results for every single field:
Final Remarks
Even with all the information the graphs can give to the user, it’s important to keep in mind
the following points:
• High percentages must be read relative to the amount of data. In this case, for example, even if
90% could seem a high result, 10% of wrong data over 5,783,968 data sets is a
considerable number.
• Good compliance with the format results in better data quality.
• Currentness and Completeness are linked, since some data are provided only at the end (e.g.
payment and end date).
• The presence of different accuracy errors suggests some issues in the manual entry of
data. This could lead to other wrong data that are not detected by the measurements
performed by the system.
• It’s interesting to note that, in this case, semantic errors are more frequent than
syntactic errors in the Accuracy measurement.
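The first remark can be made concrete with a one-line computation. The sketch below uses the figures quoted in that remark (5,783,968 data sets and a 10% error rate, which the text gives only as an example); the function name is hypothetical.

```javascript
// Number of wrong data sets implied by an error percentage, rounded to a whole item.
function countAffected(total, errorPercent) {
  return Math.round((total * errorPercent) / 100);
}

const wrongDataSets = countAffected(5783968, 10);
console.log(wrongDataSets); // 578397
```

A result of 90% therefore still leaves well over half a million data sets with errors.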
Bibliography
Torchiano, M. 2017, Visualizzazione dell’informazione quantitativa url
(last visited on: 15/05/2017)
Murray, S. 2013, Interactive Data Visualisation for the web, O’Reilly Media
Unified Modeling Language User Guide, The (2 ed.). Addison-Wesley. 2005.