Telecom Italia Big Data Challenge

Big Data Challenge
COMP 41700
Seminars in Data Science

Summary of the presentation:

Short introduction to the Telecom Italia Big Data Challenge – Donagh

Summary of Paper 1 and Paper 2 – Rajesh

Other interesting insights we can draw from this dataset – Malika

A contest designed to stimulate the creation and development of innovative technological ideas in the Big Data field

history
• Early 2014: Telecom Italia released the first edition, which was closed
• Its success meant that the next iteration was open
• The data is freely available for anyone to use
• https://dandelion.eu/datamine/open-big-data/

data sets
• Geo-referenced (Milan and the Autonomous Province of Trento)
• Anonymised
• Millions of records
• November to December 2013
• Extracted from telecom records, energy, weather, public and private transport, and social networks

Milano datasets
Domain: Datasets
Telecommunications: SMS, Call, Internet; MI to Provinces; MI to MI
Weather: Weather Station Data; Precipitation
Environment: Air Quality
News: Milano Today
Social: Tweets

tweets
• Username (anonymised)
• Entities
• Language
• Municipality
• Tweet time
• Geometry

Paper 1
(Anatomy and efficiency of urban multimodal mobility)
Main goal: to find the optimal time-respecting path between two geographic locations across the multimodal layers
Where l(a,b) is the length of the quickest (time-respecting, minimal-duration) trip on the network between a and b, and
d(a,b) is the Euclidean distance from the origin a to the destination b
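
To make "time-respecting" concrete, here is a minimal sketch of an earliest-arrival search over a timetable. It is not the method used in Paper 1; it is a generic connection-scan style search, and the `Connection` record, the stop names and the sample timetable are hypothetical, chosen only for illustration. It assumes every connection arrives strictly after it departs.

```python
from typing import NamedTuple, Optional


class Connection(NamedTuple):
    """One timetabled hop (hypothetical record layout)."""
    dep_stop: str
    arr_stop: str
    dep_time: int  # minutes after midnight
    arr_time: int  # minutes after midnight, assumed > dep_time
    mode: str      # e.g. "bus", "metro", "rail"


def earliest_arrival(connections, origin, destination, start_time) -> Optional[int]:
    """Return the earliest arrival time at destination, or None if unreachable."""
    best = {origin: start_time}
    # Scanning in departure order means a connection is only boarded if the
    # traveller has already reached its departure stop: this is what makes
    # the resulting path time-respecting.
    for c in sorted(connections, key=lambda c: c.dep_time):
        reached = best.get(c.dep_stop)
        if reached is not None and reached <= c.dep_time:
            if c.arr_time < best.get(c.arr_stop, float("inf")):
                best[c.arr_stop] = c.arr_time
    return best.get(destination)


# Hypothetical timetable mixing several transport layers.
timetable = [
    Connection("A", "B", 480, 490, "bus"),
    Connection("B", "C", 488, 500, "rail"),   # leaves B before the bus arrives
    Connection("B", "C", 495, 520, "metro"),
    Connection("A", "C", 485, 540, "bus"),
]

print(earliest_arrival(timetable, "A", "C", 475))
```

The rail hop from B at 488 looks fastest on a static map, but it cannot be used because the bus only reaches B at 490. A shortest-path search on the static network would miss this constraint.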

Rail then becomes dominant at 40 km, and air travel is dominant
for trips of the order of 700 km. Other transportation modes
play a secondary role, with peaks at 22 km for the Metro, 40 km
for Ferries and 70 km for Coaches.

The bus system covers most of the
short trips, whereas the advantage of
using the Metro and Rail systems emerges
progressively for longer distances.

The total number of stop events Omega grows proportionally with the
urban area population P,
where C(alpha) is the number of stop events in the layer alpha
and Delta-t is the duration of the time interval.

Paper 2
(High resolution population estimates from telecommunications data)
Data sources: Telecommunications data (provided by Telecom Italia)
Census data
Satellite images (provided by Landsat)
Main goal: Create high-resolution (235m × 235m) population estimates in time and space
Difficulties: Population counts can change rapidly, which makes it hard to acquire local census estimates
in a timely and accurate manner. The correlation coefficient between call volume and the
underlying population distribution varies with time.

Building map:
41% of the area on the map is directly
generated.
To classify the remaining 59%, they train a
random forest classifier using OpenStreetMap
data as labeled training examples.

Population distribution:
At the low end, population is distributed exponentially:
29% of grid squares have zero population
5% of grid squares have a population of 1
3% of grid squares have a population of 2, and so on.
39% of grid squares have a population over 100;
these follow a normal distribution with a mean of 400 persons.

Telecommunications activity:
Recorded in 10-minute intervals for each of the 235m × 235m grid cells.
Communication activity is approximately log-normal.
There are 5 types of communication activity: SMSIN,
SMSOUT, CALLIN, CALLOUT, and INTERNET.

Model(1):
Elementary model:
Previous research has suggested a relation between population and telecommunications activity at each location i:
(w stands for call volume, p stands for population)
Not perfect:
The relationship between call volume and population
in this region is much weaker below a threshold of
351 persons.
The main reason is that densely populated areas tend to
have more cell towers, which makes the relationship easier to observe.

Model(2):
Try to find the best hour of call volume data:
Each type correlates most strongly during the hour
from 10 am to 11 am and, as with the total call
volumes, CALLOUT has the greatest correlation,
approximately 0.68. Thus we use CALLOUT from
10 am to 11 am for the wi in
model(2).
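
The hour selection described above can be sketched as a search for the hour whose per-cell call volume has the highest Pearson correlation with population. The function names and the toy numbers below are illustrative assumptions, not data or code from Paper 2, and the sketch assumes neither series is constant (otherwise the correlation is undefined).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_hour(volume_by_hour, population):
    """Return (hour, scores), where hour has the highest correlation.

    volume_by_hour maps an hour of day to a list of per-cell volumes,
    aligned index by index with the population list.
    """
    scores = {h: pearson(v, population) for h, v in volume_by_hour.items()}
    return max(scores, key=scores.get), scores


# Toy data for four grid cells (made up for illustration).
population = [10, 200, 400, 800]
callout_by_hour = {
    3: [5, 4, 6, 5],       # night: little relation to population
    10: [1, 20, 41, 79],   # mid-morning: tracks population closely
    20: [30, 10, 50, 20],  # evening: no clear relation
}

hour, scores = best_hour(callout_by_hour, population)
print(hour, round(scores[hour], 3))
```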

Where else can we use the Telecom Italia
Dataset?

Analyzing cities using the space-time structure of the mobile phone
network
• Attempts to connect telecom usage data from Telecom Italia Mobile to the geography of human activity
• Uses telecom data to enhance the understanding of cities as spaces of flows

social network analysis
• Using the telecom dataset for social network analysis
• Investigating social structures through the use of network and graph theories
• Anthropology, Biology, Communication Studies, etc.

Traffic monitoring in urban areas
• Use telecom data to track dense regions
• Rerouting strategies
• Increase public transport in dense areas
• Provide more taxis in dense areas

Other usages
User localization

Security

Health care: tracking users' exercise

Thank you...
Special Thanks to my team members:
Hao Wu and He Ping
