Mobile data offloading can greatly reduce the load on cellular data networks by exploiting opportunistic and frequent access to Wi-Fi connectivity. Unfortunately, Wi-Fi access from mobile devices can be difficult during typical work commutes, e.g., on trains or in cars on highways. In this paper, we propose a new approach: preloading the mobile device with content that a user might be interested in, thereby avoiding the need for cellular data access. We demonstrate the feasibility of this approach by developing a supervised machine learning model that learns from users' preferences for different types of content, and their propensity to be guided by the UI of the player, and predictively preloads entire TV shows. Testing on a dataset of nearly 3.9 million sessions to BBC TV shows from all over the UK, we find that predictive preloading can save a significant share of mobile data for an average user.
2. Large-scale study of BBC iPlayer
Users - 32 M/month
IP addresses - 20 M/month
Sessions - 1.9 billion
May 2013 – Jan 2014
UK population - 64 M
≈ 50% of population
2 x INFOCOM’2015, ToN’2015, JSAC’2016
3. Longitudinal View across ISPs
Fixed-line Internet market
(5 representative providers)
Mobile market is more dynamic than the fixed-line Internet market
Mobile Internet market
(5 representative providers)
4. Data caps decrease market share
All-you-can-eat data
(M1, M5)
Limited-cap data packages
(M2 – M4)
All-you-can-eat plans boost user consumption
5. Temporal Patterns in different ISPs
Fixed-line access (F1-F5) peaks
in the evening hours
Mobile users watch more
during commutes
Fixed-line ISPs
Mobile, limited data caps
6. There is a problem…
Internet on trains in the UK is no good
A study shows that 23.2% of 3G packets and 37.2% of 4G packets
failed on the major train routes
7. A useful insight: users watch across networks
Users complete watching across different sessions and networks
Fixed-line ISPs Mobile ISPs
Per user completion ratio
14. User Focus on Different Content Types
[Figure: share of users with all their sessions from a given number of channels, categories, genres, or shows]
Channels (out of 11): 1, 2, 3 channels; axis marks 0%, 20%, 40%, 60%, 100%
Categories (out of 11): 1, 2, 3 categories; axis marks 0%, 30%, 55%, 75%, 100%
Genres (out of 171): 1, 2, 3, 4 genres; axis marks 0%, 15%, 30%, 40%, 50%, 100%
Shows (out of thousands of shows): 1, 2, 3, 4 shows; axis marks 0%, 10%, 20%, 25%, 35%, 100%
15. Engineering Features
User Preferences (total importance: 0.555)
content category: 0.038
content genre: 0.063
category affinity: 0.042
genre affinity: 0.103
show affinity: 0.179
channel affinity: 0.043
content age: 0.087
UI Guidance (total importance: 0.292)
featured content: 0.061
featured position: 0.061
content popularity rank: 0.071
popularity position: 0.008
featured probability: 0.091
Repeatedly Watched Content (total importance: 0.154)
previously watched: 0.066
completion ratio: 0.081
probability of re-watching: 0.007
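The group totals on this slide can be re-derived from the per-feature importance values. The snippet below is a small check, not the authors' code: the dictionary layout and the names `FEATURE_IMPORTANCE` and `group_totals` are assumptions made for illustration, while the numbers are copied from the slide.

```python
# Importance values copied from the slide; the dictionary layout is an assumption.
FEATURE_IMPORTANCE = {
    "User Preferences": {
        "content category": 0.038,
        "content genre": 0.063,
        "category affinity": 0.042,
        "genre affinity": 0.103,
        "show affinity": 0.179,
        "channel affinity": 0.043,
        "content age": 0.087,
    },
    "UI Guidance": {
        "featured content": 0.061,
        "featured position": 0.061,
        "content popularity rank": 0.071,
        "popularity position": 0.008,
        "featured probability": 0.091,
    },
    "Repeatedly Watched Content": {
        "previously watched": 0.066,
        "completion ratio": 0.081,
        "probability of re-watching": 0.007,
    },
}


def group_totals(table):
    """Sum the feature importances inside each group, rounded to 3 decimals."""
    return {group: round(sum(feats.values()), 3) for group, feats in table.items()}


totals = group_totals(FEATURE_IMPORTANCE)

# The single most important feature across all groups, as a (name, score) pair.
top_feature = max(
    ((name, score) for feats in FEATURE_IMPORTANCE.values() for name, score in feats.items()),
    key=lambda pair: pair[1],
)

for group, total in totals.items():
    print(f"{group}: {total}")
print("most important feature:", top_feature[0])
```

The three totals add up to 1.001 rather than 1 because the per-feature values on the slide are rounded.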
16. Supervised Learning
Problem: for a given user U and an episode E, predict whether U will watch E
Binary classification problem: f(U,E) -> {0,1}
Random Forest: fast, good performance on high-dimensional data
Negative examples: randomly sample from what users did not watch
Predictions: predict probability, rank all episodes by probability
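The data preparation and ranking steps of this slide might look roughly as follows. This is a minimal sketch under stated assumptions: the function names, the `negatives_per_positive` parameter, and the toy data are hypothetical, and a stand-in `predict_proba` callable takes the place of the trained Random Forest, whose features and settings the slides do not specify.

```python
import random


def build_training_pairs(watched, catalogue, negatives_per_positive=1, seed=0):
    """Label watched (user, episode) pairs 1 and randomly sampled unwatched pairs 0."""
    rng = random.Random(seed)
    pairs = []
    for user, seen in watched.items():
        for episode in sorted(seen):
            pairs.append((user, episode, 1))
        # Negative examples come only from episodes this user did not watch.
        unseen = sorted(set(catalogue) - set(seen))
        n_neg = min(len(unseen), negatives_per_positive * len(seen))
        for episode in rng.sample(unseen, n_neg):
            pairs.append((user, episode, 0))
    return pairs


def rank_episodes(user, candidates, predict_proba):
    """Sort candidate episodes by predicted watch probability, highest first."""
    return sorted(candidates, key=lambda ep: predict_proba(user, ep), reverse=True)


# Toy data (hypothetical).
watched = {"u1": {"e1", "e2"}, "u2": {"e3"}}
catalogue = ["e1", "e2", "e3", "e4", "e5"]
pairs = build_training_pairs(watched, catalogue)

# Stand-in for a trained classifier's probability output (assumption).
toy_scores = {"e1": 0.2, "e4": 0.9, "e5": 0.5}
ranking = rank_episodes("u1", ["e1", "e4", "e5"], lambda u, e: toy_scores[e])
print(ranking)
```

In a real pipeline each (U, E) pair would be turned into a feature vector before training, and the ranking would use the classifier's predicted probability for the positive class.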
17. Accuracy of Personalized Predictions
For 50% of users, the chance of a watched episode
appearing in the Top-10 predictions is over 70%
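One way to compute a Top-10 accuracy figure of this kind is a per-user hit rate. The slides do not define the metric precisely, so the definition below (the fraction of a user's watched episodes that fall inside the top-k ranked list) and the function names are assumptions.

```python
def hit_rate_at_k(ranked, watched, k=10):
    """Fraction of a user's watched episodes that appear in the top-k of the ranking."""
    if not watched:
        return 0.0
    top_k = set(ranked[:k])
    return sum(1 for ep in watched if ep in top_k) / len(watched)


def share_of_users_above(per_user_rates, threshold):
    """Share of users whose hit rate is strictly above the threshold."""
    if not per_user_rates:
        return 0.0
    return sum(1 for r in per_user_rates if r > threshold) / len(per_user_rates)


# Toy example (hypothetical): a ranking of 20 episodes and one user's watched set.
ranked = [f"e{i}" for i in range(20)]
user_rate = hit_rate_at_k(ranked, {"e0", "e3", "e9", "e15"}, k=10)

# Hit rates of four hypothetical users; two of them exceed 70%.
share = share_of_users_above([0.75, 0.2, 0.9, 0.5], threshold=0.7)
print(user_rate, share)
```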
18. When do we do predictions?
Front pages are updated overnight…
19. When do we do predictions?
… and remain largely unchanged for 24h
20. How much traffic can be saved?
Predictive pre-fetching can potentially
save nearly 71% of mobile usage
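A savings figure of this kind can be estimated by replaying session logs against the set of preloaded episodes. The sketch below is illustrative only: the session tuple layout, the `"mobile"` network label, and the function name are assumptions, and it ignores the cost of preloading content that is never watched.

```python
def mobile_savings(sessions, preloaded):
    """Share of mobile bytes that would have been served from preloaded episodes.

    sessions:  iterable of (episode, network, bytes) tuples (assumed layout)
    preloaded: set of episode ids assumed to be on the device before the session
    """
    mobile = [(ep, size) for ep, net, size in sessions if net == "mobile"]
    total = sum(size for _, size in mobile)
    if total == 0:
        return 0.0
    saved = sum(size for ep, size in mobile if ep in preloaded)
    return saved / total


# Toy log (hypothetical): 500 mobile bytes in total, 400 of them for episode "e1".
sessions = [
    ("e1", "mobile", 300),
    ("e2", "mobile", 100),
    ("e3", "fixed", 500),
    ("e1", "mobile", 100),
]
savings = mobile_savings(sessions, preloaded={"e1"})
print(f"{savings:.0%} of mobile traffic saved")
```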
23. Content Delivery for Home Broadband
Install more
distributed caches
May require
significant investments
Any alternatives?
Problem: how to handle peak load from 32M users
24. Alternative: Peer-assisted Content Delivery
[Diagram: content servers delivering the same content to many users]
An average of 5K users online every second in the first day after release
5K duplicates every second!
Ask users for assistance
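To see why so many concurrent viewers make peer assistance attractive, consider an idealised bound. It is not the model from the slides: it assumes that in each time slot one viewer fetches the content from the server and every other concurrent viewer obtains it from a peer, with no upload or locality limits. The function name and inputs are hypothetical.

```python
def peer_offload_share(concurrent_viewers):
    """Idealised share of deliveries that peers could serve instead of the server.

    concurrent_viewers: number of viewers of one item in each time slot.
    Assumption: one server copy per non-empty slot, all others served by peers.
    """
    total = sum(concurrent_viewers)
    if total == 0:
        return 0.0
    from_peers = sum(max(n - 1, 0) for n in concurrent_viewers)
    return from_peers / total


# Two slots with 5K concurrent viewers each, as in the figure quoted on the slide.
busy = peer_offload_share([5000, 5000])
# A quiet item: slots with 1, 0 and 4 viewers.
quiet = peer_offload_share([1, 0, 4])
print(busy, quiet)
```

Under these assumptions the bound approaches 1 for very popular content and drops quickly for content with few simultaneous viewers.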
25. Elegant Theoretical Model for very Complex Behavior
around 88% of savings can be achieved
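[Figure: data analysis compared with the theoretical model]
[Equation: savings G expressed as a function of c, with an exponential term in c; the operators were lost in extraction]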
26. Why does it work?
Top-5% of the content
corpus accounts for 80% of traffic
Most accesses happen in the
first day after release
Yes, it’s all about very popular content
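A concentration figure such as "top 5% of the corpus accounts for 80% of traffic" can be computed from per-item traffic counts. The sketch below uses made-up numbers; the function name and the rounding of the cut-off to the nearest whole item are assumptions.

```python
def top_share(traffic_per_item, top_fraction=0.05):
    """Share of total traffic generated by the most popular items.

    The number of items in the top group is top_fraction of the corpus,
    rounded to the nearest integer and never less than one.
    """
    if not traffic_per_item:
        return 0.0
    ordered = sorted(traffic_per_item, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Toy corpus (hypothetical): 20 items, one of which attracts most of the traffic.
traffic = [81] + [1] * 19
share_top5 = top_share(traffic, top_fraction=0.05)
print(f"top 5% of items carry {share_top5:.0%} of traffic")
```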
27. Dmytro Karamshuk
King’s College London
“True genius resides in the capacity for evaluation of
uncertain, hazardous, and conflicting information” -
Winston Churchill