last.fm crawler
RW vs RWRW
             Mário Almeida      malmeida@kth.se
                 Zafar Gilani     szuhgi@kth.se
             Arinto Murdopo        arinto@kth.se
Outline
●   Parameters
●   Methodology
●   Results
●   Challenges
●   Conclusion
Parameters
1.   Playcounts
2.   Playlists
3.   Ages
4.   IDs
5.   Number of friends (degrees)

Goal: compare the averages estimated with RW (random walk) and RWRW (re-weighted random walk).
Methodology
We used the Last.fm APIs to obtain:
  ● user info
  ● number of friends (degree)
RW with UIS-WR
We applied the following RW formula:
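The formula itself did not survive in this transcript. As a minimal sketch, assuming the RW estimate is simply the plain mean of the attribute over all visited users, the walk and the estimator could look like this. The names `random_walk`, `rw_average` and the toy star graph are illustrative and are not taken from the authors' code.

```python
import random

def random_walk(neighbors, start, steps, rng):
    """Return the nodes visited by a simple random walk of `steps` samples."""
    visited = []
    current = start
    for _ in range(steps):
        visited.append(current)
        # Move to a friend chosen uniformly at random.
        current = rng.choice(neighbors[current])
    return visited

def rw_average(samples, value):
    """Plain mean of value[v] over the visited nodes (repeats included)."""
    return sum(value[v] for v in samples) / len(samples)

# Toy star graph (assumption for illustration): node 0 is friends with 1..4.
neighbors = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
degree = {v: len(friends) for v, friends in neighbors.items()}

walk = random_walk(neighbors, start=0, steps=1000, rng=random.Random(42))
print(rw_average(walk, degree))  # 2.5, while the true mean degree is 1.6
```

On this toy graph the walk returns to the hub on every second step, so the plain average of the degree comes out well above the true mean.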
Methodology
For RWRW, we apply:
The weight Wv is set to the number of friends (degree) of node v.
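The RWRW formula is also missing from this transcript. A minimal sketch, assuming the usual re-weighted estimator in which each sample is divided by its weight Wv and the result is normalized by the sum of 1/Wv, is shown here. The function name `rwrw_average` and the sample list are illustrative assumptions.

```python
def rwrw_average(samples, value, weight):
    """Re-weighted mean: sum(x_v / W_v) / sum(1 / W_v) over the samples."""
    numerator = sum(value[v] / weight[v] for v in samples)
    denominator = sum(1.0 / weight[v] for v in samples)
    return numerator / denominator

# Toy star graph (assumption): node 0 has 4 friends, nodes 1..4 have 1 each.
degree = {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}

# A walk on a star graph alternates between the hub and the leaves.
samples = [0, 1, 0, 2, 0, 3, 0, 4]

plain = sum(degree[v] for v in samples) / len(samples)
print(plain)                                  # 2.5 (biased toward the hub)
print(rwrw_average(samples, degree, degree))  # 1.6 (the true mean degree)
```

Dividing by the degree cancels the extra visits that high-degree users receive, which is why the second figure matches the true mean on this toy graph.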
Results
Crawled for ~10 hours
Number of samples: 48000
Number of age samples: 36,363 (not all users show their age)
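Since some users hide their age, the age estimate can only use a subset of the samples. One way to handle this, sketched here under the assumption that users without an age are dropped from both the numerator and the denominator of the re-weighted estimate, is the following. Whether the authors did exactly this is not stated, and the name `rwrw_average_known` and the toy data are hypothetical.

```python
def rwrw_average_known(samples, value, weight):
    """Re-weighted mean over the samples whose value is known (not None)."""
    known = [v for v in samples if value.get(v) is not None]
    if not known:
        raise ValueError("no sample carries the requested attribute")
    numerator = sum(value[v] / weight[v] for v in known)
    denominator = sum(1.0 / weight[v] for v in known)
    return numerator / denominator

# Hypothetical samples: user "u2" does not show an age.
age = {"u1": 20, "u2": None, "u3": 30}
degree = {"u1": 2, "u2": 5, "u3": 1}
samples = ["u1", "u2", "u1", "u3"]

print(rwrw_average_known(samples, age, degree))  # 25.0
```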
Results - Ages
● RW estimates lower average age values.
● After about 25k samples, the average age stabilizes.
● There is a strong correlation between age and degree.
Results - Playlists
● Most users do not have playlists.
● RW estimates higher numbers of playlists. Users with higher degrees tend to have more playlists.
Results - Playcounts
● We found some users with playcounts on the order of millions.
● RW estimates higher playcounts. Users with higher degree tend to have higher playcounts.
Results - IDs
● The estimates are not yet stable.
● RW estimates a lower average ID than RWRW. A user with a lower ID generally has a higher degree.
Results - Degrees
● RWRW reduces the bias toward high-degree nodes, which a random walk visits with higher probability.
● The RWRW estimate is indeed close to the expected degree value.
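A short derivation shows why this happens. It is a sketch that assumes a connected, undirected friendship graph with node set V and edge set E, a walk long enough for sample averages to converge, and weights equal to the degree. The symbols n, d_v and \pi_v are introduced here for illustration and do not appear in the original slides.

```latex
\begin{align*}
\pi_v &= \frac{d_v}{2|E|}
  && \text{(long-run fraction of visits to node } v\text{)} \\[4pt]
\mathbb{E}_{\pi}[d_v] &= \sum_{v \in V} \pi_v \, d_v
  = \frac{\sum_{v \in V} d_v^{2}}{\sum_{v \in V} d_v}
  \;\ge\; \frac{1}{|V|} \sum_{v \in V} d_v
  && \text{(the plain RW average overestimates)} \\[4pt]
\hat{d}_{\mathrm{RWRW}} &= \frac{\sum_{i=1}^{n} d_{v_i} / d_{v_i}}{\sum_{i=1}^{n} 1 / d_{v_i}}
  = \frac{n}{\sum_{i=1}^{n} 1 / d_{v_i}}
  && \text{(harmonic mean of the sampled degrees)} \\[4pt]
\frac{1}{n} \sum_{i=1}^{n} \frac{1}{d_{v_i}} &\;\longrightarrow\;
  \sum_{v \in V} \frac{d_v}{2|E|} \cdot \frac{1}{d_v} = \frac{|V|}{2|E|}
  && \Longrightarrow \quad \hat{d}_{\mathrm{RWRW}} \;\longrightarrow\; \frac{2|E|}{|V|}
\end{align*}
```

The limit 2|E|/|V| is exactly the mean degree of the graph, so the re-weighted estimate converges to the true average while the plain RW average converges to a larger value whenever degrees vary.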
Conclusion
● A simple random walk in a social network
  generally results in biased averages.
  ○ A node with higher degree has a higher probability of
    being discovered.
● RWRW normalizes the averages.
  ○ High variations do not abruptly impact the
    estimation.
  ○ RWRW reduces the biases of RW.
● Low variance means a smaller difference
  between RW and RWRW.
● Crawling Last.fm poses many challenges
  ○ e.g. users with 0 degree, banned users, huge playcounts
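Two of these challenges, users with 0 degree and banned users, affect how the walk moves. A minimal sketch of one possible way to handle them follows. It is not the authors' implementation: the helpers `usable`, `pick_seed` and `step`, the `friends_of` dictionary and the `banned` set are all hypothetical.

```python
import random

def usable(user, friends_of, banned):
    """A user can host the walk if not banned and has at least one friend."""
    return user not in banned and len(friends_of.get(user, [])) > 0

def pick_seed(candidates, friends_of, banned):
    """Return the first candidate seed that the walk can actually leave."""
    for user in candidates:
        if usable(user, friends_of, banned):
            return user
    raise ValueError("no usable seed among the candidates")

def step(current, friends_of, banned, rng):
    """Move to a random non-banned friend; stay put if there is none."""
    # Assumption: banned friends are ignored, so the effective degree is
    # the number of non-banned friends, not the raw friend count.
    options = [f for f in friends_of[current] if f not in banned]
    return rng.choice(options) if options else current

# Hypothetical data: "a" has no friends, "x" is banned.
friends_of = {"a": [], "b": ["c", "x"], "c": ["b"], "x": ["b"]}
banned = {"x"}

seed = pick_seed(["a", "x", "b"], friends_of, banned)
nxt = step(seed, friends_of, banned, random.Random(0))
print(seed, nxt)  # b c
```

If banned friends are skipped in this way, the weight used by RWRW would arguably need to be the count of non-banned friends rather than the friend count reported for the user.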
Questions
Check the code in:
● http://code.google.com/p/lastfm-rwrw/
