SlideShare uma empresa Scribd logo
1 de 16
Winning the Big Data Spam Challenge


 Erich Nachbar   Stefan Groschupf   Florian Leibert
Spam Types - Email Spam

• What do spammers do? 
  • Many domains
  • Cycle through IPs (TOR, bulk blocks)
  • Bulk account creation (increase IP reputation)
     • Break captchas (Mechanical Turk)
     • Common names
       (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
Spam Types - Social Media

• Spam Carriers
  • Blog Postings
  • Comments
  • Friend Requests
  • ...
• Spam Generation through
  • Actual User Accounts (Hacked / User Virus)
  • Bot Accounts
Spam Types - Social Media

• Differences
   • Detection is the same
   • Account treatment is different (cancel vs. clean)
• 99% of all Spam contains URLs:
   • Ignore text-only messages.
   • Look at the URL not the text.
Spam Types - Web Spam

• Goal: influence search engine results
  • Link farms
  • Keyword, Meta tag stuffing
  • Hidden or invisible unrelated text
  • Scraper sites
  • Spam blogs
Why process Spam in Hadoop?

• Easy to parallelize  
  • Bucketization
     • User
     • Date
     • Source
     • etc.
  • Count models (probabilities) are very "hadoopy"
Why process Spam in Hadoop?

• Large data sets
   • More samples ~ better results
• Algorithms require preprocessing
• Existing code
   • e.g. url-parsers, bayes implementations
   •
Sample System Architecture
Heuristics for Spam Detection

• Easy to compute, Group-By & Count
• Captcha solving rates
• Source IP/Email
   • Historic vs. current volume
   • Reputation
• Link
   • Rrequency, position, ratio 
Heuristics for Spam Detection

• Content
   • Self similarity
   • Hash of media content
   • Keywords
Looking at arrival times


• Inter-arrival times
• Fast Fourier Transform 
   • Timestamps
    -> frequency space
Evaluating content
• Jaccard similarity
   • Bucketize by user / source email
      • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6)
   • Easy with Hadoop
      • map: emit user_id (K), text (V)
      • reduce: 

• Even simpler: 
  • # links / user
  • # complaints, spam tags / user
  • etc.
Solutions - Pay Level Domain

• Requires payment at Top Level Domain
• Simple heuristic
• Much simpler than Trust-Rank, Page-Rank, etc.




  Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov
  IRLbot: Scaling to 6 Billion Pages and Beyond
Demo
Take Aways

• Spam Reports are important
• Rolling real-time Ham & Spam Samples
• Have Knobs to turn (e.g. over JMX)
• Simple solutions can get you pretty far, the easy 80% 
• Spammers adapt very fast, stay agile
• Try to break your own system
Thank you!

erich@quantifind.com, @enachb

sg@datameer.com, @datameer

  flo@leibert.de, @floleibert

Mais conteúdo relacionado

Mais procurados

The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked DataJoshua Shinavier
 
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...Anand Hemmige
 
A recommendation engine for your php application
A recommendation engine for your php applicationA recommendation engine for your php application
A recommendation engine for your php applicationMichele Orselli
 
Austin Day of Rest - Introduction
Austin Day of Rest - IntroductionAustin Day of Rest - Introduction
Austin Day of Rest - IntroductionHandsOnWP.com
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Demand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLDemand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLSean Timm
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingCynthiaCruz55
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with PythonMaris Lemba
 
On Again; Off Again - Benjamin Young - ebookcraft 2017
On Again; Off Again - Benjamin Young - ebookcraft 2017On Again; Off Again - Benjamin Young - ebookcraft 2017
On Again; Off Again - Benjamin Young - ebookcraft 2017BookNet Canada
 
Mikkel Heisterberg - An introduction to developing for the Activity Stream
Mikkel Heisterberg - An introduction to developing for the Activity StreamMikkel Heisterberg - An introduction to developing for the Activity Stream
Mikkel Heisterberg - An introduction to developing for the Activity StreamLetsConnect
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?Abhishek Mitra
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis Vikram Parmar
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingMichelle Minkoff
 
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr..."PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...Stefan Adam
 

Mais procurados (19)

The state of the art in Linked Data
The state of the art in Linked DataThe state of the art in Linked Data
The state of the art in Linked Data
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...
South JVM Users Group Talk - Building Social Media Tools using JVM Supported ...
 
A recommendation engine for your php application
A recommendation engine for your php applicationA recommendation engine for your php application
A recommendation engine for your php application
 
Austin Day of Rest - Introduction
Austin Day of Rest - IntroductionAustin Day of Rest - Introduction
Austin Day of Rest - Introduction
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Demand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLDemand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOL
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen Scraping
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
Vít Listík - Email.cz workshop
Vít Listík - Email.cz workshopVít Listík - Email.cz workshop
Vít Listík - Email.cz workshop
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
On Again; Off Again - Benjamin Young - ebookcraft 2017
On Again; Off Again - Benjamin Young - ebookcraft 2017On Again; Off Again - Benjamin Young - ebookcraft 2017
On Again; Off Again - Benjamin Young - ebookcraft 2017
 
Mikkel Heisterberg - An introduction to developing for the Activity Stream
Mikkel Heisterberg - An introduction to developing for the Activity StreamMikkel Heisterberg - An introduction to developing for the Activity Stream
Mikkel Heisterberg - An introduction to developing for the Activity Stream
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without Programming
 
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr..."PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
"PageRank" - "The Anatomy of a Large-Scale Hypertextual Web Search Engine” pr...
 

Destaque

Final paper
Final paperFinal paper
Final paperJDonpfd3
 
Link Analysis for Web Information Retrieval
Link Analysis for Web Information RetrievalLink Analysis for Web Information Retrieval
Link Analysis for Web Information RetrievalCarlos Castillo (ChaTo)
 
Webspam (English Version)
Webspam (English Version)Webspam (English Version)
Webspam (English Version)Dirk Haun
 
ASSP: Extracting the Ham from Spam -- by David J. Young
ASSP: Extracting the Ham from Spam -- by David J. YoungASSP: Extracting the Ham from Spam -- by David J. Young
ASSP: Extracting the Ham from Spam -- by David J. Youngtotojepasttrain
 
Job seekers defense against spammers/spambots Sept 7, 2012
Job seekers defense against spammers/spambots Sept 7, 2012Job seekers defense against spammers/spambots Sept 7, 2012
Job seekers defense against spammers/spambots Sept 7, 2012chuckthomassql
 
Top Cyber Threats of 2009
Top Cyber Threats of 2009Top Cyber Threats of 2009
Top Cyber Threats of 2009Symantec
 
Le novità di Ubuntu 10.04
Le novità di Ubuntu 10.04Le novità di Ubuntu 10.04
Le novità di Ubuntu 10.04Paolo Sammicheli
 
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward Online
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward OnlineYou’re Not A Dog: How Lawyers Can Put Their Best Foot Forward Online
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward OnlineRocket Matter, LLC
 
How to Trace an E-mail Part 1
How to Trace an E-mail Part 1How to Trace an E-mail Part 1
How to Trace an E-mail Part 1Lebowitzcomics
 
Online Advertising Year Of Living Dangerously
Online Advertising Year Of Living DangerouslyOnline Advertising Year Of Living Dangerously
Online Advertising Year Of Living DangerouslyInternet Law Center
 
Online Advt: Year of Living. Dangerously
Online Advt: Year of Living. DangerouslyOnline Advt: Year of Living. Dangerously
Online Advt: Year of Living. DangerouslyBennet Kelley
 
Google Summer of Code™ (in German; neutral version)
Google Summer of Code™ (in German; neutral version)Google Summer of Code™ (in German; neutral version)
Google Summer of Code™ (in German; neutral version)Dirk Haun
 
Mobile WebKit Optimizations & Tools
Mobile WebKit Optimizations & ToolsMobile WebKit Optimizations & Tools
Mobile WebKit Optimizations & ToolsAndrew Hedges
 

Destaque (20)

Final paper
Final paperFinal paper
Final paper
 
Link Analysis for Web Information Retrieval
Link Analysis for Web Information RetrievalLink Analysis for Web Information Retrieval
Link Analysis for Web Information Retrieval
 
Webspam (English Version)
Webspam (English Version)Webspam (English Version)
Webspam (English Version)
 
ASSP: Extracting the Ham from Spam -- by David J. Young
ASSP: Extracting the Ham from Spam -- by David J. YoungASSP: Extracting the Ham from Spam -- by David J. Young
ASSP: Extracting the Ham from Spam -- by David J. Young
 
Job seekers defense against spammers/spambots Sept 7, 2012
Job seekers defense against spammers/spambots Sept 7, 2012Job seekers defense against spammers/spambots Sept 7, 2012
Job seekers defense against spammers/spambots Sept 7, 2012
 
Top Cyber Threats of 2009
Top Cyber Threats of 2009Top Cyber Threats of 2009
Top Cyber Threats of 2009
 
E spam
E spamE spam
E spam
 
Immigration Related Audits (DHN Presentation) Japan Tokyo - Revised
Immigration Related Audits (DHN Presentation) Japan Tokyo - RevisedImmigration Related Audits (DHN Presentation) Japan Tokyo - Revised
Immigration Related Audits (DHN Presentation) Japan Tokyo - Revised
 
Le novità di Ubuntu 10.04
Le novità di Ubuntu 10.04Le novità di Ubuntu 10.04
Le novità di Ubuntu 10.04
 
RIC
RICRIC
RIC
 
You’re not a dog
You’re not a dogYou’re not a dog
You’re not a dog
 
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward Online
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward OnlineYou’re Not A Dog: How Lawyers Can Put Their Best Foot Forward Online
You’re Not A Dog: How Lawyers Can Put Their Best Foot Forward Online
 
Cyber Harassment
Cyber HarassmentCyber Harassment
Cyber Harassment
 
How to Trace an E-mail Part 1
How to Trace an E-mail Part 1How to Trace an E-mail Part 1
How to Trace an E-mail Part 1
 
Net Neutrality Overview
Net Neutrality OverviewNet Neutrality Overview
Net Neutrality Overview
 
Online Advertising Year Of Living Dangerously
Online Advertising Year Of Living DangerouslyOnline Advertising Year Of Living Dangerously
Online Advertising Year Of Living Dangerously
 
Online Advt: Year of Living. Dangerously
Online Advt: Year of Living. DangerouslyOnline Advt: Year of Living. Dangerously
Online Advt: Year of Living. Dangerously
 
Can Spam At 5
Can Spam At 5Can Spam At 5
Can Spam At 5
 
Google Summer of Code™ (in German; neutral version)
Google Summer of Code™ (in German; neutral version)Google Summer of Code™ (in German; neutral version)
Google Summer of Code™ (in German; neutral version)
 
Mobile WebKit Optimizations & Tools
Mobile WebKit Optimizations & ToolsMobile WebKit Optimizations & Tools
Mobile WebKit Optimizations & Tools
 

Semelhante a HadoopSummit_2010_big dataspamchallange_hadoopsummit2010

Email Address Harvesting
Email Address HarvestingEmail Address Harvesting
Email Address HarvestingMichael Lamont
 
OSINT for Attack and Defense
OSINT for Attack and DefenseOSINT for Attack and Defense
OSINT for Attack and DefenseAndrew McNicol
 
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Joshua Kamdjou
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacksFrank Victory
 
Gates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringGates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringChris Gates
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Internet and Social Media for Beginners
Internet and Social Media for BeginnersInternet and Social Media for Beginners
Internet and Social Media for Beginnersbecarreno
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersRob Fuller
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producingkurtgessler
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015Nathan Halko
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBMihnea Giurgea
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsHisham Arafat
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopPeter Skomoroch
 
DataSploit - Tool Demo at Null Bangalore - March Meet.
DataSploit - Tool Demo at Null Bangalore - March Meet. DataSploit - Tool Demo at Null Bangalore - March Meet.
DataSploit - Tool Demo at Null Bangalore - March Meet. Shubham Mittal
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrBrooke Ganz
 

Semelhante a HadoopSummit_2010_big dataspamchallange_hadoopsummit2010 (20)

Email Address Harvesting
Email Address HarvestingEmail Address Harvesting
Email Address Harvesting
 
Geek basics
Geek basicsGeek basics
Geek basics
 
OSINT for Attack and Defense
OSINT for Attack and DefenseOSINT for Attack and Defense
OSINT for Attack and Defense
 
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
Voight-Kampff for Email Addresses: Quantifying Email Address Reputation to Id...
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Lesson 6 web based attacks
Lesson 6 web based attacksLesson 6 web based attacks
Lesson 6 web based attacks
 
Gates Toorcon X New School Information Gathering
Gates Toorcon X New School Information GatheringGates Toorcon X New School Information Gathering
Gates Toorcon X New School Information Gathering
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Internet and Social Media for Beginners
Internet and Social Media for BeginnersInternet and Social Media for Beginners
Internet and Social Media for Beginners
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for Pentesters
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDB
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
Web hacking
Web hackingWeb hacking
Web hacking
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 
DataSploit - Tool Demo at Null Bangalore - March Meet.
DataSploit - Tool Demo at Null Bangalore - March Meet. DataSploit - Tool Demo at Null Bangalore - March Meet.
DataSploit - Tool Demo at Null Bangalore - March Meet.
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache Solr
 

Mais de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Mais de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Último

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

HadoopSummit_2010_big dataspamchallange_hadoopsummit2010

  • 1. Winning the Big Data Spam Challenge Erich Nachbar Stefan Groschupf Florian Leibert
  • 2. Spam Types - Email Spam • What do spammers do?  • Many domains • Cycle through IPs (TOR, bulk blocks) • Bulk account creation (increase IP reputation) • Break captchas (Mechanical Turk) • Common names (e.g. http://www.census.gov/genealogy/names/dist.male.first) 
  • 3. Spam Types - Social Media • Spam Carriers • Blog Postings • Comments • Friend Requests • ... • Spam Generation through • Actual User Accounts (Hacked / User Virus) • Bot Accounts
  • 4. Spam Types - Social Media • Differences • Detection is the same • Account treatment is different (cancel vs. clean) • 99% of all Spam contains URLs: • Ignore text-only messages. • Look at the URL not the text.
  • 5. Spam Types - Web Spam • Goal: influence search engine results • Link farms • Keyword, Meta tag stuffing • Hidden or invisible unrelated text • Scraper sites • Spam blogs
  • 6. Why process Spam in Hadoop? • Easy to parallelize   • Bucketization • User • Date • Source • etc. • Count models (probabilities) are very "hadoopy"
  • 7. Why process Spam in Hadoop? • Large data sets • More samples ~ better results • Algorithms require preprocessing • Existing code • e.g. url-parsers, bayes implementations •
  • 9. Heuristics for Spam Detection • Easy to compute, Group-By & Count • Captcha solving rates • Source IP/Email • Historic vs. current volume • Reputation • Link • Rrequency, position, ratio 
  • 10. Heuristics for Spam Detection • Content • Self similarity • Hash of media content • Keywords
  • 11. Looking at arrival times • Inter-arrival times • Fast Fourier Transform  • Timestamps -> frequency space
  • 12. Evaluating content • Jaccard similarity • Bucketize by user / source email • s1 = S(x1,x3,x5), s2 = S(x2,x4,x6) • Easy with Hadoop • map: emit user_id (K), text (V) • reduce:  • Even simpler:  • # links / user • # complaints, spam tags / user • etc.
  • 13. Solutions - Pay Level Domain • Requires payment at Top Level Domain • Simple heuristic • Much simpler than Trust-Rank, Page-Rank, etc. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov IRLbot: Scaling to 6 Billion Pages and Beyond
  • 14. Demo
  • 15. Take Aways • Spam Reports are important • Rolling real-time Ham & Spam Samples • Have Knobs to turn (e.g. over JMX) • Simple solutions can get you pretty far, the easy 80%  • Spammers adapt very fast, stay agile • Try to break your own system
  • 16. Thank you! erich@quantifind.com, @enachb sg@datameer.com, @datameer flo@leibert.de, @floleibert