An analysis and presentation of the Telecom Italia Big Data Challenge, with a discussion of how open Big Data challenges help uncover new insights in the collected data. Such challenges also motivate other people to work on data analysis.
2. Summary of the presentation:
Short Introduction of Telecom Italia Big Data Challenge – Donagh
Summary of Paper 1 and Paper 2 – Rajesh
Other interesting insights we can draw from this dataset – Malika
3. A contest designed to stimulate the creation and development of innovative technological ideas in the Big Data field
4. History
• Early 2014: Telecom Italia released the first edition, which was closed
• Its success meant that the next iteration was open
• Freely available for anyone to use
• https://dandelion.eu/datamine/open-big-data/
5. Data sets
• Geo-referenced (Milan and the Autonomous Province of Trento)
• Anonymised
• Millions of records
• November to December 2013
• Extracted from telecom records, energy, weather, public and private transport, social networks
9. Milano datasets
Domain: Datasets
Telecommunications: SMS, Call, Internet; MI to Provinces; MI to MI
Weather: Weather Station Data; Precipitation
Environment: Air Quality
News: Milano Today
Social: Tweets
11. Paper 1
(Anatomy and efficiency of urban multimodal mobility)
Main goal: To find the optimal time-respecting path between two geographic locations in the multimodal (multi-layer) network
Where l(a,b) is the length of the quickest (time-respecting and minimal) trip on the network and
d(a,b) is the Euclidean distance from the origin 'a' to the destination 'b'
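A "time-respecting" path is one where each leg departs no earlier than the traveller arrived at its starting stop. The sketch below illustrates that idea with a simple scan over departure-sorted connections. It is not the paper's own algorithm. The `Connection` record, the toy timetable, the stop coordinates, and the zero transfer time are all assumptions made for illustration.

```python
import math
from typing import NamedTuple


class Connection(NamedTuple):
    """One scheduled hop between two stops (hypothetical record layout)."""
    dep_stop: str
    arr_stop: str
    dep_time: int  # minutes after midnight
    arr_time: int  # minutes after midnight, assumed later than dep_time
    layer: str     # e.g. "bus", "metro", "rail"


def earliest_arrival(connections, origin, destination, start_time):
    """Earliest time-respecting arrival at destination, or math.inf if unreachable."""
    best = {origin: start_time}
    # Scanning in departure order guarantees that best[dep_stop] is final
    # by the time a connection leaving that stop is examined.
    for c in sorted(connections, key=lambda c: c.dep_time):
        reachable = best.get(c.dep_stop, math.inf) <= c.dep_time
        if reachable and c.arr_time < best.get(c.arr_stop, math.inf):
            best[c.arr_stop] = c.arr_time
    return best.get(destination, math.inf)


# Toy timetable (made-up values, not taken from the dataset).
timetable = [
    Connection("A", "B", 480, 490, "bus"),
    Connection("B", "C", 495, 520, "metro"),
    Connection("A", "C", 485, 540, "bus"),
    Connection("B", "C", 488, 500, "rail"),  # leaves B before we can get there
]

# Made-up planar stop coordinates in km, used for the Euclidean distance d(a,b).
coords = {"A": (0.0, 0.0), "C": (3.0, 4.0)}

arrival = earliest_arrival(timetable, "A", "C", start_time=475)
trip_minutes = arrival - 475
d_ac = math.dist(coords["A"], coords["C"])
```

Comparing `trip_minutes` with `d_ac` for many origin-destination pairs is one way to relate the quickest trip to the straight-line distance. The layers are mixed freely here because every connection lives in the same timetable.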
12. Rail then becomes dominant at 40 km, and air travel is dominant
for trips of the order of 700 km. Other transportation modes
play a secondary role, with peaks at 22 km for the Metro, 40 km
for Ferries and 70 km for Coaches
13. The bus system covers most of the
short trips, whereas the advantage of
using the Metro and Rail systems emerges
progressively for longer distances
14. The total number of stop events
Omega grows proportionally with the
urban area population P.
Where C(alpha) is the
number of stop events in the
layer 'alpha' and Delta-t is the
duration of the time interval
15. Paper 2
(High resolution population estimates from telecommunications data)
Data sources: Telecommunications data (provided by Telecom Italia)
Census data
Satellite images (provided by Landsat)
Main goal: Create high-resolution (235 m × 235 m) population estimates in time and space
Difficulties: Population counts can change rapidly, which makes it hard to acquire local census estimates
in a timely and accurate manner. The correlation coefficient between call volume and the
underlying population distribution varies with time.
16. Building map:
41% of the area on the map is generated
directly.
To classify the remaining 59%, they train a
random forest classifier using OpenStreetMap
data as labeled training examples.
17. Population distribution:
At the low end, population is distributed exponentially:
29% of grid squares have zero population
5% of grid squares have a population of 1
3% of grid squares have a population of 2, and so on.
39% of grid squares have a population over 100
Beyond that, the population follows a normal distribution with a mean of 400 persons
18. Telecommunications activity:
Recorded at 10-minute intervals for each of the 235 m × 235 m grid cells.
Communication activity is approximately log-normal.
There are 5 types of communication activity: SMSIN,
SMSOUT, CALLIN, CALLOUT, and INTERNET.
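Because the activity is reported in 10-minute intervals, comparing it with hourly or daily quantities requires rolling the intervals up first. A minimal sketch of that roll-up follows. It assumes a hypothetical record layout of `(cell_id, minute_of_day, activity_type, value)`, and the sample values are invented.

```python
from collections import defaultdict

# The five activity types named on the slide.
ACTIVITY_TYPES = ("SMSIN", "SMSOUT", "CALLIN", "CALLOUT", "INTERNET")


def hourly_totals(records):
    """Sum 10-minute activity values into (cell_id, hour, activity_type) buckets."""
    totals = defaultdict(float)
    for cell_id, minute_of_day, kind, value in records:
        if kind not in ACTIVITY_TYPES:
            raise ValueError(f"unknown activity type: {kind}")
        hour = minute_of_day // 60  # 600..659 -> hour 10
        totals[(cell_id, hour, kind)] += value
    return dict(totals)


# Invented sample records (not real dataset values).
sample = [
    (1, 600, "CALLOUT", 2.0),
    (1, 610, "CALLOUT", 3.0),
    (1, 660, "CALLOUT", 1.0),
    (2, 600, "SMSIN", 4.0),
]
totals = hourly_totals(sample)
```

Keying the result by cell, hour and type keeps the five activity types separate, so each can be compared with population on its own.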
19. Elementary model:
Previous research has suggested a relation between population and telecommunication activity at location i:
(w stands for call volume, p stands for population)
Not perfect:
The relationship between call volume and population
in this region is much weaker below a threshold of
351 persons.
The main reason is that densely populated areas tend to
have more cell towers from which we can observe the relationship.
Model(1):
20. Model(2):
Try to find the best hour of call volume data:
Each type correlates most strongly during the hour
from 10 am to 11 am, and, as with the total call
volumes, CALLOUT has the greatest correlation,
approximately 0.68. Thus we use CALLOUT from
10 am to 11 am for the w_i in
model(2).
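The hour selection described above amounts to computing, for each hour, the correlation between per-cell activity and per-cell population, then keeping the hour with the highest value. The sketch below shows that procedure with a plain Pearson correlation. The choice of Pearson, the function names, and the toy numbers are assumptions, and the result is not meant to reproduce the 0.68 reported on the slide.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def best_hour(activity_by_hour, population):
    """Return (hour, correlation) for the hour that tracks population best.

    activity_by_hour maps an hour (0-23) to a list of activity volumes,
    one per grid cell, in the same cell order as the population list.
    """
    scores = {h: pearson(v, population) for h, v in activity_by_hour.items()}
    hour = max(scores, key=scores.get)
    return hour, scores[hour]


# Toy data for four grid cells (invented, for illustration only).
population = [0, 50, 400, 900]
callout_by_hour = {
    3: [5, 1, 4, 2],
    10: [1, 6, 40, 95],
    20: [10, 12, 30, 25],
}
hour, r = best_hour(callout_by_hour, population)
```

Running the same function once per activity type (SMSIN, SMSOUT, CALLIN, CALLOUT, INTERNET) would give a per-type ranking of hours.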
22. Analyzing cities using the space-time structure of the mobile phone network
• Attempts to connect telecom usage data from Telecom Italia Mobile to the geography of human activity
• Uses telecom data to enhance the understanding of cities as spaces of flows
23. Using the telecom dataset for social network analysis
Social network analysis: investigating social structures through the use
of network and graph theories.
Anthropology, Biology, Communication Studies, … etc.
24. Traffic monitoring in urban areas
• Use telecom data to track dense regions
• Rerouting strategies
• Increase public transport in dense areas
• Provide more taxis in dense areas
At the beginning of 2014, Telecom Italia, in collaboration with several international partners, launched the Telecom Italia Big Data Challenge.
The contest made available to developers, designers and scientists a large dataset of 30+ kinds of data (mobile, weather, energy, etc.)
The datasets were initially released for use by the participants only.
After the end of the contest, demand for those datasets rose.
They want people to reuse the data.
The data provided in the dataset of the Big Data Challenge is geo-referenced (areas: Milan and the Autonomous Province of Trento – Italy) and anonymized. The dataset contains millions of records of data covering the period from November to December 2013 extracted from telecommunications records, energy, weather, public and private transport, social networks and events.
Some of the datasets referring to the Milano urban area are spatially aggregated using a grid. We refer to this grid as the Milano Grid. The schema of the grid is: cellId, and geometry expressed as GeoJSON.
The grid has the following spatial description:
The square id numbering starts from the bottom left corner of the grid and grows until its top right corner.
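The numbering rule above can be turned into a small helper that maps a square id to a grid position. The sketch assumes a 100 × 100 grid, ids that start at 1, and a row-by-row ordering (left to right, then upward). None of these is stated in the notes, so check them against the grid's GeoJSON before relying on them.

```python
def cell_to_row_col(cell_id, n_cols=100, first_id=1):
    """Map a square id to (row counted from the bottom, column counted from the left).

    n_cols=100 and first_id=1 are assumptions, not values given in the notes.
    """
    index = cell_id - first_id
    if index < 0:
        raise ValueError(f"cell_id must be >= {first_id}")
    return divmod(index, n_cols)


def row_col_to_cell(row, col, n_cols=100, first_id=1):
    """Inverse of cell_to_row_col under the same assumptions."""
    return row * n_cols + col + first_id


bottom_left = cell_to_row_col(1)
top_right = cell_to_row_col(10000)
```

With a row and column in hand, neighbouring squares can be found by simple arithmetic instead of geometric queries.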
Datasets are divided up into domains.
Telecommunications has 3 datasets:
SMS, call and Internet activity
Call data from Milan to provinces
Call data within Milan
Each row corresponds to a tweet. For privacy reasons the user id has been obfuscated and the text has been replaced with a list of entities extracted by the Entity Extraction API tool. Entities are provided as links to DBpedia.
Fields: user, entities, language, municipality, date created, timestamp, geometry