At bit.ly, we study behaviour on the internet by capturing clicks on shortened URLs. This link traffic comes in many forms, yet when studying human behaviour, we are only interested in 'organic' traffic: the traffic patterns caused by actual humans clicking on links that have been shared on the social web. This session will look at a model to extract and analyze these patterns by employing Python/Numpy, Streaming Hadoop, and machine learning. This model lets us extract the traffic we’re interested in from the variety of patterns generated by inorganic entities following bit.ly links.
28. SUN CITY PALM DESERT - SUN CITY SHADOW HILLS CA Tours, MLS, Plans, Info, MORE http://bit.ly/nda6QH+ TOP SALE Viagra from USD 0.90 per pill, Cialis from USD 1.75 per pill
30. SUN CITY PALM DESERT - SUN CITY SHADOW HILLS CA Tours, MLS, Plans, Info, MORE http://imageshack.us/clip/my-videos/8/tcib.mp4/ 2011 CMA Nominees | Playlist | VEVO TOP SALE Viagra from USD 0.90 per pill, Cialis from USD 1.75 per pill National Association for College Admission Counselling log-in
32. Free Game For Kids Registration - Pittsburgh Penguins - Fan Zone (2756 clicks)
Video: Carra's top five transfers - Liverpool FC (915 clicks)
Clip of the Week: Toews Nails Tot | NBC Chicago (179 clicks)
Allegro.pl nie działa (919 clicks)
DallasCowboys.com - Official Site of the Dallas Cowboys (683 clicks)
Mao’s Room (2339 clicks)
Manchester United Official Web Site - Ashley Young was long term United target (526 clicks)
Shocking! Lady Gaga Poses Sans Makeup for Harper's Bazaar Cover - UsMagazine.com (12288 clicks)
Runway - Runway TV Collections Fashion Magazine - Nick Carter: From the Backstreet to Taking Off (3662 clicks)
The GQ&A: Drive Director Nicolas Winding Refn (601 clicks)
Thank you for having me. Who I am (Ph.D. in CS, Caverlee’s Infolab at Texas A&M, scientist at bitly). This was joint work with Mike Dewar, a fellow scientist working at bitly. Unfortunately, due to some rather lame visa issues, he can’t be here.
Bitly is a URL shortening and analytics service. We have been around for a little over three years. We are located in the Meatpacking District here in New York City. We have a five-member science team.
What is a URL shortener, and why do you need such a thing?
... then you share. Bitly powers many custom URL shorteners, such as NYTimes, WashingtonPost, ESPN and O’Reilly.
Shortening allows analytics. Add a ‘+’ to the end of any bitly link to see them: traffic patterns, referrer details, location data. The information is free to all. What can we build on top of this data?
Internally, we’ve built a variety of search services. So, this is a screenshot of the bitly search engine.
We can also use the data to track trends. We can determine when phrases are occurring abnormally often.
agent / country / timezone / global hash / ip address / cookie / user language / referrer / url / timestamp / hash creation / city / lat lon. Line-delimited JSON, one record per line.
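A minimal sketch of reading such a log in Python, where each line is parsed on its own as one JSON object. The key names (`global_hash`, `timestamp`, `country`, `referrer`) and the sample values are hypothetical stand-ins for the fields listed above, not bitly's actual schema.

```python
import json

# Two made-up click records, one JSON object per line.
# All key names and values are hypothetical placeholders.
SAMPLE_LOG = """\
{"global_hash": "abc123", "timestamp": 1316000000, "country": "US", "referrer": "http://twitter.com/"}
{"global_hash": "abc123", "timestamp": 1316000042, "country": "GB", "referrer": "direct"}
"""

def parse_clicks(lines):
    """Yield one dict per non-empty line, skipping lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # Malformed lines are dropped rather than aborting the whole run.
            continue

clicks = list(parse_clicks(SAMPLE_LOG.splitlines()))
```

Because every record sits on its own line, a file in this format can be split at any line boundary, which is what makes it convenient for streaming jobs.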
minimum over 500 clicks per second
bitly is primarily a Python shop. We use streaming on Hadoop, and many of us make heavy use of the MrJob framework from Yelp, which is awesome.
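To give a flavour of the streaming style, here is a sketch of a mapper and reducer that count clicks per link. It is written in plain Python rather than with MrJob, and the `global_hash` key is a hypothetical field name. In a real Hadoop Streaming job the two functions would live in separate scripts that read `sys.stdin` and print their output; here the sort between them is simulated locally.

```python
import json
from itertools import groupby

def mapper(lines):
    """Emit 'hash<TAB>1' for every click record that carries a hash."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue
        key = record.get("global_hash")  # hypothetical field name
        if key:
            yield "%s\t1" % key

def reducer(sorted_lines):
    """Sum the counts per key; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))

# Local stand-in for the shuffle/sort phase between map and reduce.
raw = [
    '{"global_hash": "abc123"}',
    '{"global_hash": "xyz789"}',
    '{"global_hash": "abc123"}',
]
totals = list(reducer(sorted(mapper(raw))))
```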
Types of questions we use Hadoop/ MapReduce to answer.
What is organic traffic?
Time series binned on the minute.
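A small sketch of per-minute binning, assuming the timestamps are Unix epoch seconds (an assumption; the notes do not state the unit). The function name is illustrative.

```python
from collections import Counter

def bin_by_minute(timestamps):
    """Return click counts per minute, from the first click's minute to the last."""
    if not timestamps:
        return []
    minutes = [int(ts) // 60 for ts in timestamps]
    counts = Counter(minutes)
    start, end = min(minutes), max(minutes)
    # Include empty minutes so the series is evenly spaced in time.
    return [counts.get(m, 0) for m in range(start, end + 1)]

series = bin_by_minute([0, 10, 59, 60, 185])
```

Filling in the empty minutes matters: without them, a quiet gap in the traffic would silently disappear from the series.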
typical click stream / cumulative representation / binning (makes the baby shannon cry) / horrible derivatives made everything noisy
Probability density function. What is the likelihood that there will be a click at this time?
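The notes do not say how the density was estimated. One common way to get a smooth estimate without binning is a Gaussian kernel density estimate, sketched here purely as an assumption. The bandwidth value, the function name and the click times are all illustrative.

```python
import math

def gaussian_kde(click_times, bandwidth):
    """Return a function t -> estimated click density at time t.

    Expects at least one click time and a positive bandwidth.
    """
    n = len(click_times)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(
            math.exp(-0.5 * ((t - c) / bandwidth) ** 2) for c in click_times
        )

    return density

# Hypothetical click times, in minutes since the link was created.
pdf = gaussian_kde([1.0, 2.0, 2.5, 3.0, 10.0], bandwidth=1.0)
```

Each click contributes one small bump centred on its own time, so the estimate is high where clicks cluster and low in the gaps, and the whole curve integrates to one.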
quick factoid before moving on to the model: 3 hours (most), 7 hours on YouTube
using bitly for promotion & decision making / links go ‘viral’ / talk about today / gotta do this very very quickly
The first thing I’ve done, which I’ll talk about today, is to throw away the rise and just look at the decay. Autoregression model (polynomial regression for time series; line fitting, curve fitting).
using least squares - fit one model to 1000 different time series
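One reading of "fit one model to 1000 different time series" is a single set of autoregressive coefficients shared across all the series, found by stacking every series' lagged values into one least-squares problem. The sketch below follows that reading as an assumption. The function names, the lack of an intercept term and the synthetic data are all illustrative choices.

```python
import numpy as np

def lagged_design(series, order):
    """Build (X, y) where each row of X holds the `order` values preceding y."""
    series = np.asarray(series, dtype=float)
    rows = [series[t - order:t] for t in range(order, len(series))]
    return np.array(rows), series[order:]

def fit_pooled_ar(all_series, order):
    """Least-squares AR coefficients shared across every series in all_series."""
    designs = [lagged_design(s, order) for s in all_series if len(s) > order]
    X = np.vstack([d[0] for d in designs])
    y = np.concatenate([d[1] for d in designs])
    coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_one_step(series, coeffs):
    """One-step-ahead predictions for series[order:]."""
    X, _ = lagged_design(series, len(coeffs))
    return X @ coeffs

# Synthetic decaying series (made up): each value is 0.8 times the previous one.
decays = [[100.0 * 0.8 ** t for t in range(20)],
          [40.0 * 0.8 ** t for t in range(15)]]
coeffs = fit_pooled_ar(decays, order=1)
```

On this noiseless toy data the single fitted coefficient recovers the decay factor, even though the two series start at different heights.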
the only real difficulty is model selection / rich on computation short on time / fit a bunch of models with different temporal orders and look at their model predicted output
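A sketch of that brute-force approach: fit autoregressive models of several orders on one series and score each on a second series it has not seen. The held-out scoring, the orders tried and the synthetic decay curves are assumptions for illustration, not the procedure from the talk.

```python
import numpy as np

def lagged_design(series, order):
    """Build (X, y) where each row of X holds the `order` values preceding y."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - order:t] for t in range(order, len(series))])
    return X, series[order:]

def heldout_rmse(train, test, order):
    """Fit an AR(order) model on `train`, return one-step RMSE on `test`."""
    X_train, y_train = lagged_design(train, order)
    coeffs = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    X_test, y_test = lagged_design(test, order)
    return float(np.sqrt(np.mean((X_test @ coeffs - y_test) ** 2)))

# Made-up noisy decay curves; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
t = np.arange(60)
train = 100.0 * np.exp(-t / 15.0) + rng.normal(0.0, 1.0, size=t.size)
test = 80.0 * np.exp(-t / 15.0) + rng.normal(0.0, 1.0, size=t.size)

errors = {order: heldout_rmse(train, test, order) for order in (1, 3, 9)}
```

Comparing the entries of `errors`, alongside plots of the predicted output, is one way to see whether a higher order is actually helping or just fitting noise.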
The blue line is the inferred rate based on the PDF; the green line is the model prediction.
r squared / blunt threshold / many false negatives / gets job done
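As a sketch of the blunt threshold: compute R squared between a link's observed rate and the model's prediction, and call the link organic if it clears a cut-off. The `THRESHOLD` value and the function names are hypothetical; the notes do not give the number that was used.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    if ss_tot == 0:
        return 0.0  # a flat series carries no variance to explain
    return 1.0 - ss_res / ss_tot

THRESHOLD = 0.5  # hypothetical cut-off, not a value from the talk

def looks_organic(observed, predicted, threshold=THRESHOLD):
    """True when the model explains enough of the observed variance."""
    return r_squared(observed, predicted) >= threshold

good = looks_organic([10, 8, 6, 5], [10, 8, 6, 5])
bad = looks_organic([10, 8, 6, 5], [1, 9, 1, 9])
```

A single cut-off like this is easy to run at scale, at the cost of rejecting some genuinely organic links whose traffic happens to fit the model poorly.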
we can dig a bit deeper into this data using the AR model, learn something about the model as well
not “spam” but also not interesting
most uncorrelated links by correlation with model predicted output
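A sketch of that ranking, assuming Pearson correlation between each link's observed rate and the model-predicted rate (the notes do not name the correlation measure). The link names and numbers are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Hypothetical links: name -> (observed rate, model-predicted rate).
links = {
    "link_a": ([9, 7, 5, 4], [9, 7, 5, 4]),
    "link_b": ([5, 1, 6, 1], [6, 5, 4, 3]),
}

# Least correlated first, so the most suspicious links come out on top.
ranked = sorted(links, key=lambda name: pearson(*links[name]))
```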
(sort of) overfitting: a 9th order model is too flexible
3rd order model does much better
Bottom link: we see abnormal traffic to sites like this, which are spam
let’s just check that the highly correlated links look less suspect