Presentation of PhD thesis on Location Data Fusion
1. Information fusion for location
data analysis
Candidate: Alket Cecaj Supervisor: Prof. Marco Mamei
Doctorate School in Industrial Innovation Engineering
2. Thesis outline
• Introduction to Data Fusion Methods
• Location Data and Application Scenarios
• Data Fusion for Event Detection and Event Description
• Re-identification of Anonymized CDR Records Using Information Fusion
• Privacy issues
• Conclusions
3. Location data and application scenarios
Data
• Location data such as CDR (Call
Detail Records)
• Geo-tagged social network data or
data from LBS
• Open data with a location
dimension such as census data
Applications
• Socio-economic development
(D4D)
• Smart mobility applications, land use
and city management
• Ground truth information for
validation analysis
5. Introduction to data fusion methods
• Stage-based methods
• Feature-level-based methods
• Semantic meaning-based data fusion methods
6. Location data fusion : side effect
• Data fusion enables a huge number of applications
• Privacy risks for individual data
7. Data fusion for event detection / description by
using aggregated CDR data and geo-tagged social
network data
Detecting and describing events happening in urban
areas by analysing spatio-temporal data
Reference to the article
19. By combining the results from
the two datasets
• Improvement of the precision-recall
performance of the method
• The improvement is limited in the
long run by the main dataset.
• The same improvement can be
observed also by joining the
results of the other datasets.
Improving event detection results by data fusion
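A minimal sketch of how the precision-recall improvement could be measured. It assumes that events are identified by hypothetical (grid cell, day) pairs and that the two result sets are fused by a simple union; the slides do not state the combination rule, so the union is an assumption, as are all names and values below.

```python
def precision_recall(detected, ground_truth):
    """Return (precision, recall) for a set of detected events."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical events, each identified by a (grid cell id, day) pair.
ground_truth = {(10, "d1"), (11, "d2"), (12, "d3"), (13, "d4")}
cdr_detected = {(10, "d1"), (11, "d2"), (20, "d5")}
social_detected = {(11, "d2"), (12, "d3")}

# Assumed fusion rule: union of the events detected in each data set.
fused = cdr_detected | social_detected

cdr_scores = precision_recall(cdr_detected, ground_truth)
fused_scores = precision_recall(fused, ground_truth)
```

In this toy example the social results add one true event that the CDR results missed, so both precision and recall of the fused set are higher than for CDR alone.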
20. By using the CDR the events
can be detected but not
described:
• By joining the results the data
can complement and enrich
each other.
• In this case the social dataset
can be used to describe
the events semantically
Data fusion for Event description
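As an illustration of how social data might describe an event detected from CDR data, the sketch below collects the most frequent tags posted in the event's cell and time window. The record layout (`cell`, `hour`, `tags`) and the sample posts are hypothetical; the slides do not specify how the description is built.

```python
from collections import Counter


def describe_event(event, posts, top_n=3):
    """Return the most frequent tags posted in the event's cell and time window."""
    cell, start_hour, end_hour = event
    tags = Counter(
        tag
        for post in posts
        if post["cell"] == cell and start_hour <= post["hour"] <= end_hour
        for tag in post["tags"]
    )
    return [tag for tag, _ in tags.most_common(top_n)]


# Hypothetical geo-tagged posts already mapped to grid cells.
posts = [
    {"cell": 7, "hour": 20, "tags": ["concert", "stadium"]},
    {"cell": 7, "hour": 21, "tags": ["concert"]},
    {"cell": 7, "hour": 9, "tags": ["coffee"]},
    {"cell": 3, "hour": 20, "tags": ["dinner"]},
]

# Event detected from CDR activity: cell 7, between hours 19 and 23.
event = (7, 19, 23)
description = describe_event(event, posts, top_n=2)
```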
21. Comparing the results with other works on event
detection
• Two other similar works
• Using much more sophisticated algorithms
• Comparable results
22. Re-identification of CDR data by using social
network geo-tagged data
• Fine grained social and CDR user data
• Mobility paths
• Uniqueness of mobility prints
• Matching of user’s mobility path
• Re-identification probability evaluation
• The ground-truth problem
23. Location data : CDR and social
CDR data
1. Massive dataset about millions of
users
2. Released in an anonymized format
3. Regularly sampled
4. Tower granularity (400 m to several km)
Geo-tagged social data
1. Sparse data following an exponential distribution (many
users, few events per user)
2. Not anonymized
3. Irregular sampling
4. Precise (GPS or triangulated location)
24. Re-identification of CDR data by using social
network geo-tagged data
• Anonymization... and re-identification
• Movie ratings from the Netflix Prize dataset
• Medical records of Massachusetts Hospital using a voters list
• Re-identification of anonymous volunteers in a DNA study for Personal Genome
Project
• In line with our domain
• Unique in the Crowd: the privacy bounds of Human Mobility
• Markov chain models for de-anonymization of geo-located data
35. • Matching by chance: Bonferroni principle
• False social user events are created:
a) in a random way
b) by cloning events (+1 km, +30 min)
• As a result, the number of matchings drops by 60% in the first
case and by 40% in the second case
Data fusion : considerations
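The two ways of creating false social events can be sketched as follows. This assumes events are `(x_km, y_km, t_min)` tuples in planar coordinates and that the +1 km shift is applied along the x axis; the slides give only the offsets, so the direction, the function names and the parameters are assumptions.

```python
import random


def random_events(n, bbox, t_range, seed=0):
    """Create n false events placed uniformly at random in space and time."""
    rng = random.Random(seed)
    x_min, x_max, y_min, y_max = bbox
    t_min, t_max = t_range
    return [
        (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max), rng.uniform(t_min, t_max))
        for _ in range(n)
    ]


def cloned_events(events, dx_km=1.0, dt_min=30.0):
    """Clone real events, shifting each by dx_km in space and dt_min in time."""
    # Assumption: the spatial shift is applied along the x axis only.
    return [(x + dx_km, y, t + dt_min) for x, y, t in events]


real = [(0.0, 0.0, 0.0), (2.5, 1.0, 600.0)]
clones = cloned_events(real)
noise = random_events(5, bbox=(0.0, 10.0, 0.0, 10.0), t_range=(0.0, 1440.0))
```

Re-running the matching on such false events gives a rough estimate of how many matches arise by chance alone.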
36. As the real identity of CDR users is missing, validating these results is
difficult.
Flickr user is Twitter user (mobility traces overlapping and similar
usernames) and (the only) CDR user.
MCC field of the CDR record matching with the language used for
describing pictures and tweets content.
Data fusion : groundtruth validation
38. Reidentifying CDR users : probabilistic approach
Given that CDR user Ci has Ni events (points) in common with FTi, how likely is it that the two
users are the same?
39. • Question which is both novel (no other works addressing it in this
domain) and fundamental
• Conditional probability
Re-identification : probabilistic approach
Given that CDR user Ci has Ni events (points) in common with FTi, how likely is it that the two
users are the same?
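One generic way to write this conditional probability is Bayes' theorem. Here S denotes the event "Ci and FTi are the same person" and M_{N_i} the event "the two users have N_i points in common"; these symbols are introduced only for illustration, and the formula is the textbook form, not necessarily the exact expression used in the thesis.

```latex
P(S \mid M_{N_i}) =
  \frac{P(M_{N_i} \mid S)\, P(S)}
       {P(M_{N_i} \mid S)\, P(S) + P(M_{N_i} \mid \neg S)\, P(\neg S)}
```

The term P(M_{N_i} | not S) is the probability of matching by chance, which is what the false-event experiments try to estimate.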
41. Privacy risks for personal data
The revelatory potential power of location data
• The location of a person’s home. What kind of city area does the person live in?
• The locations of the stores a person frequents. From this information, shopping
patterns, preferences and in some cases religious beliefs can be inferred.
• There are also other types of very sensitive data such as health records. These can be
deduced from the locations of the doctors and hospitals the person visits.
• By linking two or more locations in time and space, mobility
paths may be inferred.
42. Privacy risks : privacy preserving techniques
• Data Anonymization
a) K-anonymity in different improved versions
b) Possible re-identification of location data, as already shown
• Data Suppression
a) Suppression and aggregation
b) Utility of the dataset after suppression dramatically reduced
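The trade-off between suppression and utility can be illustrated with a minimal sketch. It assumes each record is reduced to a hypothetical quasi-identifier (here a pair of home and work cells) and that records in groups smaller than k are suppressed; this is a simplified illustration, not one of the improved k-anonymity variants mentioned above.

```python
from collections import Counter


def suppress_for_k_anonymity(records, k):
    """Keep only records whose quasi-identifier appears at least k times."""
    counts = Counter(records)
    return [record for record in records if counts[record] >= k]


# Hypothetical quasi-identifiers: (home cell, work cell) per user.
records = [(1, 5), (1, 5), (1, 5), (2, 7), (3, 9), (3, 9)]

kept = suppress_for_k_anonymity(records, k=2)
# Share of records that survive suppression, a crude proxy for utility.
utility = len(kept) / len(records)
```

Raising k removes more records, which is the utility loss noted in the slide.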
43. Challenges
• One of the main challenges is the lack of common engineering standards for data
fusion systems. It has been one of the main impediments to integration and data
fusion.
• As different methods of data fusion behave differently in different applications, it
is not trivial to choose the best method for a specific task.
• Challenges during the data fusion design phase. At which level of abstraction,
reduction and simplification the data should be fused ?
• The lack of a unified framework that could orient the process of data fusion
towards a “structured data fusion” vision.
44. Conclusions and future work
• Information fusion as an enabling process for novel applications
- Future work oriented towards the “structured data fusion” idea
• Privacy
- Assessment of variations of existing privacy preserving techniques (D.P.)
45. Publications
• Nicola Bicocchi, Alket Cecaj, Damiano Fontana, Marco Mamei, Andrea Sassi, Franco Zambonelli: “Collective Awareness
for Human ICT Collaboration in Smart Cities”. IEEE WETICE International conference on state-of-the art research in
enabling technologies for collaboration 17-20 2013.
• Alket Cecaj, Marco Mamei, Nicola Bicocchi: “Re-identification of Anonymized CDR datasets Using Social Network
Data”. IEEE Percom International conference on Pervasive Computing and Communications. Budapest, Hungary 24-28, 2014.
• Cecaj Alket, Marco Mamei (2016): “Data Fusion for City Life Event Detection”. In: Journal of Ambient Intelligence and
Humanized Computing, pp. 1–15.
• Nicola Bicocchi, Alket Cecaj, Damiano Fontana, Marco Mamei, Andrea Sassi, Franco Zambonelli (2014): “Social
Collective Awareness in Socio-Technical Urban Superorganisms”. Social Collective Intelligence Combining the Powers
Of Humans and Machines to Build a Smarter Society, Part III, Applications and Case studies, page 227.
• Cecaj, Alket, Marco Mamei, and Franco Zambonelli (2015): “Re-identification and Information Fusion Between
Anonymized CDR and Social Network Data”. In: Journal of Ambient Intelligence and Humanized Computing, pp. 1–14.
Editor's notes
Introduction to data/information fusion methods. In particular, one speaks of data fusion or information fusion depending on whether the integration is low-level or high-level.
The various types of geo-referenced data and the different applications these data can have.
A study on the automatic detection of large events in urban areas using aggregated mobile phone data and geo-referenced social data.
From aggregated data we move to anonymized CDR data, which expose individual mobility traces. This work studies several characteristics, such as the uniqueness of these traces and how it can affect privacy.
Finally, together with the conclusions, several open challenges are presented, concerning both data fusion and privacy preservation.
The large volume of data generated during daily routines, such as:
Geo-referenced data such as CDRs (Call Detail Records), geo-referenced data obtainable from social networks or LBS (such as Foursquare), or open data such as census data.
On the other hand, many applications follow from these data. For social development, one can mention works that study geo-referenced data to understand how diseases spread or to estimate poverty levels across urban areas; such studies help guide possible interventions.
In a smart city context, these data help to understand the dynamics of large cities, such as commute patterns and land use, all of which is useful for understanding and managing a city better.
Although each of these data sets is very useful on its own for the applications mentioned above, they can be much more powerful when combined or integrated into a single representation. For example, CDRs indicate that a large crowd has gathered in a certain area, and once combined with social data they can also reveal the reason for the event.
This process of combining and integrating data, or data fusion, aims to analyze the data so that each data set can interact with, inform and complete the others.
Record matching vs knowledge fusion.
Stage-based methods use different data sets at different stages of the data mining process. The data sets are loosely coupled, with no requirements on their consistency.
Feature-level methods take the features extracted from different data sets and concatenate them into a single array, which can then be used in clustering and classification methods.
Semantic meaning-based methods take into consideration the relations between features in different data sets.
This implies that the data miner knows what each data set represents, and why the data sets can be fused
or why they reinforce each other by enriching the information.
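The feature-level method described above can be sketched as a concatenation of per-cell feature vectors. The feature names and values below are hypothetical; the resulting vectors would then be passed to whatever clustering or classification method is chosen.

```python
def fuse_features(cdr_features, social_features):
    """Concatenate per-cell feature vectors from two data sets."""
    # Only cells present in both data sets can be fused.
    shared_cells = sorted(cdr_features.keys() & social_features.keys())
    return {cell: cdr_features[cell] + social_features[cell] for cell in shared_cells}


# Hypothetical features per grid cell.
cdr_features = {1: [120.0, 35.0], 2: [80.0, 12.0], 3: [15.0, 2.0]}  # e.g. calls, SMS
social_features = {1: [14.0], 2: [3.0]}  # e.g. geo-tagged posts

fused = fuse_features(cdr_features, social_features)
```

Cell 3 is dropped here because it has no social features; filling such gaps with a default value would be an alternative design choice.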
Data such as anonymized CDR or social network data sets
Following the diagram in the first chapter, we present the steps for applying the data fusion methods.
Milano Grid and time series of the activity levels of one of the cells during the two months period
Big data challenge 2014 : aggregated CDR data and geo-tagged social network data tables .
Faster computation as there are fewer entries
The data used in the previous study were aggregated. This means that no personal data were provided; the data only give the level of mobile phone activity in a geographic area identified by a square cell inside a grid. However, in many cases CDR data are released at a fine-grained temporal and spatial scale, with anonymized personal data.
That means that individual mobility traces can be spotted and analyzed.
In the same way, geo-tagged social data from location-based services such as Foursquare, or social networking services such as Twitter or Flickr, can reveal the location traces of their users.
Conclusions for this part: the uniqueness test shows the number of points needed to single out the mobility traces of 80-95% of all users. At most 7 points are needed. This number is not affected by the time interval of the matching process. The same can be said for the percentage of user paths singled out.
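A minimal sketch of a uniqueness test of this kind, assuming each trace is a set of hypothetical `(cell, hour)` points: for every user, a few points are sampled from the user's own trace and the user counts as singled out if no other trace contains all of them. The function name, data layout and sampling scheme are assumptions made for illustration.

```python
import random


def uniqueness(traces, n_points, seed=0):
    """Fraction of users singled out by n_points sampled from their own trace."""
    rng = random.Random(seed)
    unique = 0
    tested = 0
    for user, points in traces.items():
        if len(points) < n_points:
            continue  # not enough points to sample from this trace
        tested += 1
        sample = set(rng.sample(sorted(points), n_points))
        # Users whose trace contains every sampled point.
        matches = [other for other, trace in traces.items() if sample <= trace]
        if matches == [user]:
            unique += 1
    return unique / tested if tested else 0.0


# Hypothetical traces: sets of (cell, hour) points per user.
traces = {
    "a": {(1, 8), (2, 9)},
    "b": {(1, 8), (3, 9)},
    "c": {(4, 8), (5, 9)},
}

share_two_points = uniqueness(traces, n_points=2)
share_one_point = uniqueness(traces, n_points=1)
```

Repeating the test for an increasing number of points shows how quickly the share of singled-out users grows.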
Having discovered the number of points sufficient to single out a CDR user, we proceed to match the CDR users with the social ones. The figure shows a simple matching process between CDR and social data.
While C4 and C3 can be excluded because they produce data in different locations at the same moment, nothing can be said about C2 and C1. That is why we use a probabilistic approach that can tell us (within a reasonable limit) whether a CDR user is the same as the social user whose events match.
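The exclusion step described above can be sketched as follows. The sketch assumes that both data sets have been discretized to hypothetical `(cell, time slot)` points on a common grid and that a social user has at most one event per time slot; the user labels mirror the C1 to C4 example, but the data are invented.

```python
def match_candidates(social_trace, cdr_traces):
    """Count common points per CDR user, excluding users with conflicting events."""
    social_by_time = {t: cell for cell, t in social_trace}
    candidates = {}
    for cdr_user, trace in cdr_traces.items():
        common = 0
        conflict = False
        for cell, t in trace:
            if t not in social_by_time:
                continue  # no social event at this time, nothing to compare
            if social_by_time[t] == cell:
                common += 1
            else:
                # Same moment, different location: cannot be the same person.
                conflict = True
                break
        if not conflict and common > 0:
            candidates[cdr_user] = common
    return candidates


social_trace = {(1, 8), (2, 12), (3, 18)}
cdr_traces = {
    "C1": {(1, 8), (2, 12)},
    "C2": {(1, 8), (9, 20)},
    "C3": {(5, 8)},
    "C4": {(1, 8), (7, 12)},
}

candidates = match_candidates(social_trace, cdr_traces)
```

C3 and C4 are excluded by the conflict rule, while C1 and C2 remain as candidates with different numbers of common points, which is where a probabilistic evaluation becomes necessary.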
Conclusions for this part: the matching test shows the number of CDR users that the social users match for a given number of points.
That means that every social user has at least one point in common with (on average) 1000 CDR users, two points in common with (on average) 100 CDR users, and 13 points in common with (on average) 50 CDR users.
Analogously, the percentage of CDR users with which the social users have 1, 2, 3…15 points in common decreases.