Randomly Sampling YouTube Users:
 An Introduction to Random Prefix
         Sampling Method




             Cheng-Jun Wang

               Web Mining Lab
        City University of Hong Kong
                  20121225
YouTube growth curve




http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/



https://gdata.youtube.com/feeds/api/standardfeeds/most_recent
Contents
Plan A: Sampling Users

∗ Unfortunately, YouTube’s user identifiers do not follow a
  standard format; they are user-specified strings. Mislove et al.
  were therefore unable to create a random sample of YouTube
  users.




  Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
Plan B: Sampling Videos

∗ Using the YouTube search API, Zhou et al. developed a random
  prefix sampling method and estimated that there were roughly
  500 million YouTube videos by May 2011.
∗ Sample the videos first, and then find the respective users.




  Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Get proportional users?

∗ Limitation: sampling videos first creates a selection bias
  towards users who upload more videos. Weighting each user
  inversely to their number of videos (scaled by the max value)
  is therefore necessary to get a random sample of YouTube users.
∗ Is it possible?



  [Figure: videos crawled → users detected]
Users detected, with weight factors:

User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

After weighting cases (each row repeated Weight Factor times):

User ID   Video Num   Active Days
1         10          20
2         5           15
2         5           15
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
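The weighting step shown in the two tables can be sketched in Python. This is only an illustration: it assumes the weight factor is the maximum video count divided by each user's video count (which reproduces the factors 1, 2 and 10 in the example), and the variable names are made up for the sketch.

```python
# Example rows from the table: (user_id, video_num, active_days)
users = [(1, 10, 20), (2, 5, 15), (3, 1, 1)]

# Assumed rule: weight factor = max video count / user's video count.
# Integer division is enough for this example; real data would need
# fractional weights instead of row replication.
max_videos = max(video_num for _, video_num, _ in users)
weights = {uid: max_videos // video_num for uid, video_num, _ in users}

# "Weight cases": repeat each user's row weight-many times.
weighted_rows = [row for row in users for _ in range(weights[row[0]])]

print(weights)             # {1: 1, 2: 2, 3: 10}
print(len(weighted_rows))  # 13
```

After weighting, each user contributes the same total number of videos (10) to the weighted table, which offsets the bias towards heavy uploaders.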
Strategy




∗   64^10*16 = 1.844674e+19 (11-character ids over a 64-character alphabet; the last character takes only 16 values)
∗   YouTube video ids are randomly generated from the id space
∗   The sampling space is far too large to guess ids directly!
∗   Any good ideas?
∗   http://www.youtube.com/watch?v=1yo0zBFCMxo
∗   http://www.youtube.com/watch?v=_OBlgSz8sSM
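To see why guessing ids blindly is hopeless, here is a rough Python sketch. It assumes a 64-character id alphabet (A-Z, a-z, 0-9, "-", "_"), 11-character ids whose last character takes only 16 values, and about 500 million existing videos (the Zhou et al. estimate); the function name is hypothetical.

```python
import random
import string

# Assumed 64-character alphabet of YouTube video ids.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

# Assumed id space: 10 free characters, last character limited to 16 values.
ID_SPACE = len(ALPHABET) ** 10 * 16

def random_id_guess(rng=random):
    """Draw 11 random characters; almost never an existing video id."""
    return "".join(rng.choice(ALPHABET) for _ in range(11))

# Chance that a single blind guess hits one of ~500 million videos.
hit_probability = 500_000_000 / ID_SPACE

print(ID_SPACE == 2 ** 64)        # True
print(f"{hit_probability:.2e}")   # 2.71e-11
print(random_id_guess())
```

At that hit rate, tens of billions of guesses would be needed per video found, which is why a search-based method is used instead.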
YouTube Search API
∗ One unique property of the YouTube search API, reported by Zhou et al., is that when searching
  with a keyword string of the format “watch?v=xy...z” (including the quotes),
  where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id
  that does not contain the literal “-”, YouTube returns a list
  of videos whose ids begin with this prefix followed by “-”, if any exist.
∗ YouTube limits the number of returned results for any query.


∗ When the prefix is short (e.g., L = 1 or 2), the returned search results
  are more likely to contain “noisy” video ids; a short prefix may also
  match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., L = 6 or 7), the search engine
  may return no results.
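The query format described above can be sketched in Python as follows. The helper names are hypothetical, and the 64-character id alphabet (letters, digits, "-", "_") is an assumption of the sketch; no real API call is made here.

```python
import random
import string

# Prefix characters: the assumed id alphabet minus "-", because the
# prefix itself must not contain the literal "-".
PREFIX_CHARS = string.ascii_letters + string.digits + "_"

def random_prefix(length, rng=random):
    """Return a random prefix of the given length (1 <= length <= 11)."""
    if not 1 <= length <= 11:
        raise ValueError("prefix length must be between 1 and 11")
    return "".join(rng.choice(PREFIX_CHARS) for _ in range(length))

def build_query(prefix):
    """Wrap the prefix in the quoted keyword format, quotes included."""
    return f'"watch?v={prefix}"'

def is_valid_match(video_id, prefix):
    """Keep only ids that begin with the prefix followed by '-'."""
    return video_id.startswith(prefix + "-")

print(build_query("1yo0"))                    # "watch?v=1yo0"
print(is_valid_match("1yo0-abcdef", "1yo0"))  # True
print(is_valid_match("1yo0zBFCMxo", "1yo0"))  # False
```

Filtering with `is_valid_match` is one way to discard "noisy" results that the search returns but that do not fit the prefix-plus-dash pattern.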
Practice

∗ In practice, a prefix of length L < 5 usually matches more
  than one hundred videos, while the YouTube API returns at
  most 30 ids for each prefix query.
∗ On the other hand, based on Zhou et al.'s experimental results,
  a prefix of length L = 5 always matches fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.
∗ They find that querying prefixes of length four returns ids
  that have a “-” in the fifth place. This result set is big enough
  that each prefix returns some results and small enough that it
  never reaches the result limit set by the API.
∗ Zhou et al. estimated that there were about 500 million YouTube
  videos by May 2011!




        Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
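How per-prefix counts turn into a total can be sketched as below. This is a simplified estimator, not necessarily the exact one in Zhou et al.: it assumes ids are spread uniformly over the id space and a 64-character alphabet, so a fixed prefix of length L followed by "-" covers 1/64^(L+1) of all ids. The counts in the example are made up.

```python
def estimate_total_videos(counts, prefix_length, alphabet_size=64):
    """Scale the mean number of ids per sampled prefix up to all ids.

    counts: number of valid ids (prefix + "-" + rest) found for each
            randomly sampled prefix.
    """
    if not counts:
        raise ValueError("need at least one sampled prefix")
    mean_count = sum(counts) / len(counts)
    # Fixing prefix_length characters plus the "-" pins down
    # prefix_length + 1 positions of the id.
    return mean_count * alphabet_size ** (prefix_length + 1)

# Made-up counts for ten sampled prefixes of length 4 (not real data).
example_counts = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1]
print(estimate_total_videos(example_counts, prefix_length=4))  # 536870912.0
```

Because every sampled prefix stands for the same share of the id space, the estimate only depends on the average count, which is why many cheap queries can replace an exhaustive crawl.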
Python and gdata


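∗ gdata is a Python module for connecting to Google data services
  (including YouTube) via their APIs.

    import gdata.youtube.service

    def SearchAndPrint(search_terms):
        # Client for the YouTube Data API
        yt_service = gdata.youtube.service.YouTubeService()
        query = gdata.youtube.service.YouTubeVideoQuery()
        query.vq = search_terms        # search keywords
        query.orderby = 'viewCount'    # most viewed first
        query.racy = 'include'         # include restricted content
        feed = yt_service.YouTubeQuery(query)
        PrintVideoFeed(feed)           # helper from the developer's guide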
Test Validity

∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music
  Video
∗ searchApi("watch?v=1yo0z")
  Result: can't find the video!
Restricted query term

∗ searchApi('"watch?v=1yo0"')
Compare two random samples

    # summary(da$Freq)
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   1.00    7.00   25.00   17.15   25.00   75.00

    # summary(db$Freq)
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   1.00    8.00   25.00   17.57   25.00   50.00
There are about 604 million videos on
        YouTube by December 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361, so x = 34361^2/125
∗ X = x*64 = (34361^2/125)*64 ≈ 604507300
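The arithmetic above has the same form as the Lincoln-Petersen capture-recapture estimator (two samples and their overlap). A small Python check, using only the numbers on the slide; reading the factor 64 as the number of possible characters in the position fixed to "-" is an assumption of this sketch.

```python
def lincoln_petersen(n_first, n_second, n_overlap):
    """Population estimate from two samples and the size of their overlap."""
    if n_overlap <= 0:
        raise ValueError("the two samples must overlap")
    return n_first * n_second / n_overlap

# Numbers from the slide: two samples of 34361 ids with 125 in common.
sampled_slice = lincoln_petersen(34361, 34361, 125)

# Assumed interpretation: the samples cover only ids with "-" in one
# position, i.e. 1/64 of the id space, so scale up by 64.
total_videos = sampled_slice * 64

print(round(total_videos))  # 604507300
```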
Numeric simulation of random
                 prefix sampling
    # Use degreenet to simulate a discrete Pareto distribution
    library(degreenet)
    a <- simdp(n = 100000, v = 3.5, maxdeg = 10000)   # videos per user

    # Expand to one row per video: user id, user's video count, video id
    b <- data.frame(uid = seq_along(a), count = a)
    videos <- b[rep(seq_len(nrow(b)), b$count), ]
    videos$vid <- seq_len(nrow(videos))

    # Sample 2000 videos without replacement, then keep each user once
    id <- sample(videos$vid, 2000, replace = FALSE)
    ds <- subset(videos, vid %in% id)
    dat <- subset(ds, !duplicated(uid))

    hist(dat$count)

    # Frequency tables of video counts: population vs. sample
    da <- as.data.frame(table(a))
    ds <- as.data.frame(table(dat$count))

    plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))),
         xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
    points(log(ds[,2]) ~ log(as.numeric(as.character(ds[,1]))), pch = 2, col = "red")
    legend("topright", c("population", "sample"),
           col = c("black", "red"),
           cex = 0.9, pch = c(1, 2))   # match the plotted symbols
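The same experiment can be sketched in plain Python to show the selection bias numerically. This is not a port of the R code: a rounded-down Pareto draw from the standard library stands in for degreenet's simdp, and the shape parameter 2.5 and the seed are arbitrary choices for the sketch.

```python
import random

rng = random.Random(42)

# Population: 100,000 users with heavy-tailed video counts (>= 1 each).
counts = [min(int(rng.paretovariate(2.5)), 10000) for _ in range(100_000)]

# One entry per video, holding the index of the uploading user.
video_owner = [uid for uid, n in enumerate(counts) for _ in range(n)]

# Sample 2000 videos without replacement, then keep each uploader once.
sampled_videos = rng.sample(range(len(video_owner)), 2000)
sampled_users = {video_owner[v] for v in sampled_videos}

population_mean = sum(counts) / len(counts)
naive_mean = sum(counts[u] for u in sampled_users) / len(sampled_users)

# Weight each sampled user by 1 / video count to offset the bias.
weights = {u: 1 / counts[u] for u in sampled_users}
weighted_mean = (
    sum(counts[u] * w for u, w in weights.items()) / sum(weights.values())
)

print(f"population mean videos per user: {population_mean:.2f}")
print(f"naive sample mean:               {naive_mean:.2f}")
print(f"inverse-weighted sample mean:    {weighted_mean:.2f}")
```

The naive mean overstates videos per user because heavy uploaders are more likely to be reached through their videos; the inverse weighting pulls the estimate back towards the population value.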
Reference

∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix
  Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social
  Networks. IMC
∗ YouTube developer's guide for Python
  https://developers.google.com/youtube/1.0/developers_guide_python

∗ Introduction to the gdata.youtube library
  http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry
