9. How we addressed it
• Custom JavaScript injected onto the page so we can start
measuring things
• Custom web server modules to handle cookies over multiple
domains
• Custom log processing infrastructure to push data onto HDFS
every 15m
• Map‐Reduce jobs to provide reports & create MySQL databases
• Built a co‐visitation algorithm to produce related pages
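A minimal sketch of a co‐visitation count, to illustrate the idea behind the related‐pages bullet above. The talk does not show the actual implementation; the function names, the session input format, and the tie‐breaking rule here are all assumptions, and the production version ran as Map‐Reduce jobs rather than in memory.

```python
from collections import defaultdict
from itertools import combinations


def covisitation(sessions):
    """Count how often each pair of pages appears in the same session.

    `sessions` is assumed to be an iterable of page-URL lists,
    one list per visitor session (hypothetical input format).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        # De-duplicate so a page viewed twice in one session counts once.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related_pages(counts, page, k=3):
    """Return up to k pages most often co-visited with `page`."""
    neighbours = counts.get(page, {})
    # Highest count first; ties broken alphabetically for determinism.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:k]]


sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b"],
    ["/b", "/c"],
    ["/a", "/b", "/d"],
]
counts = covisitation(sessions)
print(related_pages(counts, "/a", k=2))
```

Raw pair counts favour globally popular pages; a real system would likely normalise them, but that step is not described in the slides.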
15. Major issues
• Hadoop
• Map‐reduce was slow to write and inflexible
• Hadoop kept hanging: both the name server and our custom push jobs would stall
• Operations
• how to move from 0.18 to 0.19?
• Jobs failing meant we were getting paged, and restart‐ability was never really designed
• Felt like we were building our house on quicksand.
• we were running off factory‐defaults
• network wasn’t optimized at ALL
• People
• zero experience going in
• people were learning by doing.
• lots of new things made fault detection ‘interesting’
• our group started becoming a bottleneck
• Map reduce hard to learn
17. Operational issues
• Got ‘real’ machines
• put onto same switches/racks
• built the filesystem to better match how we used Hadoop
• upgraded to 0.19 at same time
• took 48 hours to migrate
• Spent some time listening to experts
• tuned our cluster a bit better
• removed developer access to the ‘hadoop’ user
• Still not a 100% “production” system
• but close enough for my liking
32. The current deliverables
• Get more information about our customers
• Increase recirculation
• Increase RPM of our pages
• Build metrics into our platform
• What works on pages
• How are we performing
• Build intelligence on the page
• Collaborative filtering
• Product recommendations
• Top‐K type lists
• Make it closer to real time
• not the focus of this talk
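As an illustration of the "Top‐K type lists" deliverable above, here is a small sketch that ranks pages by view count. It is an assumption about what such a list could look like, not the talk's implementation; `top_k_pages` and the flat list of page views are hypothetical.

```python
import heapq
from collections import Counter


def top_k_pages(page_views, k):
    """Return the k most-viewed pages as (page, views) pairs.

    `page_views` is assumed to be an iterable of page URLs,
    one entry per recorded page view (hypothetical input format).
    """
    counts = Counter(page_views)
    # nsmallest on (-views, page) yields highest counts first,
    # with ties broken alphabetically.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


views = ["/a", "/b", "/a", "/c", "/b", "/a"]
print(top_k_pages(views, 2))
```

Using a heap avoids sorting every page when only a short list is needed.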
33. What data are we processing?
• Beacon Web servers
• Tracking beacon injected into the HTML page via custom javascript
• Tracks
• Page views
• Page clicks
• Custom event that the content developer wants
• Tracks standard things like referrers, user agents, and location
• Developer can add custom parameters to tell us about the page
• Needed to write a custom module to generate anonymous user IDs + 3rd‐party domain tracking
• Custom module to map IP addresses to geographic WOEID‐based locations
• Ad impressions
• User viewed a campaign
• Integrate it with the campaign manager to determine actual revenue
• URL context (through Relegence)
• We can determine who & what an article is about
• through Relegence, similar to what OpenCalais does
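The beacon flow described above (standard fields, developer‐supplied custom parameters, and an anonymous user ID) could be sketched roughly as follows. This is not the custom web server module from the talk: the query parameter names (`event`, `url`, `ref`), the hostname, and the use of a random UUID as the anonymous ID are all assumptions made for illustration.

```python
import uuid
from urllib.parse import parse_qs, urlsplit

# Hypothetical names for the parameters the beacon always sends.
STANDARD_FIELDS = {"event", "url", "ref"}


def parse_beacon(request_url, user_agent, cookie_uid=None):
    """Turn one beacon request into an event record.

    If the visitor has no ID cookie yet, a fresh anonymous ID is
    generated (assumed here to be a random UUID).
    """
    query = parse_qs(urlsplit(request_url).query)
    # Keep only the first value of each parameter.
    params = {key: values[0] for key, values in query.items()}
    return {
        "uid": cookie_uid or uuid.uuid4().hex,
        "event": params.get("event", "pageview"),
        "url": params.get("url"),
        "referrer": params.get("ref"),
        "user_agent": user_agent,
        # Anything else is treated as a developer-defined custom parameter.
        "custom": {k: v for k, v in params.items() if k not in STANDARD_FIELDS},
    }


event = parse_beacon(
    "http://b.example.com/b.gif?event=click"
    "&url=http%3A%2F%2Fexample.com%2Fstory&section=sports",
    user_agent="Mozilla/5.0",
    cookie_uid="abc123",
)
print(event)
```

Geographic lookup and ad‐impression joins are left out, since the slides give no detail on how those modules work.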
34. The Data Layer Infrastructure today
[Architecture diagram; labels recovered from the slide: Web Page, Publishing Platforms, Advertising Webservers, Beacon Webservers, Relegence, Hadoop, Reporting, Analytics, Tools, MySQL, Cassandra, Redis]