Mining Large Scale Temporal Dynamics
                with Hadoop
                           Nish Parikh
                     Senior Research Scientist
                     eBay Research Labs


       Joint Work With: Evan Chiu, Hui Hong, Neel Sundaresan
About eBay Research Labs

•  Who we are
   –  The group's charter is to conduct research and deliver innovative
      solutions.
•  What we do
   –  Computer Vision
   –  Economics and Optimization
   –  Human Language Technologies
   –  Data Science
   –  Machine Learning
   –  Infrastructure & Systems
   –  Innovation Programs & Incubations


•  Basically we “Dig” into data!
Why study temporal dynamics?

•  Stock Markets
•  Bio-Medical Signals
•  Traffic, Weather and Network Systems
•  Web Search & Ranking
•  Recommender Systems
•  eCommerce…
Challenges

•  Large Scale data
   –  100M+ users
   –  Petabytes of click-stream logs
   –  Billions of user sessions
   –  Billions of unique queries

•  Noisy Data
   –  Robots
   –  API Calls
   –  Crawlers, Spiders
   –  Tools, Scripts
   –  Data Biases

•  Data spread across long time frames
   –  Differences in collection methodologies

•  Complexity of certain algorithms
Hadoop Cluster at eBay

•  Nodes
   –  CentOS 4, 64-bit
   –  Intel dual hex-core Xeon, 2.4 GHz
   –  72 GB RAM
   –  2 * 12 (24TB) HDD
   –  SSD for OS

•  Network
   –  TOR 1Gbps
   –  Core Switches uplink 40 Gbps

•  Cluster
   –  532n – 1008n
   –  4000+ cores – 24000 vCPUs
   –  5 – 18 PB
Mobius – Computation Platform

Application
              Click Stream Visualizer          Metrics Dashboard          Research Projects
  Layer

                                      Mobius Studio (Eclipse plugin)

 Mobius           Generic Java Dataset API                         Query Language
 Layer

                                       Low level Dataset access API

   eBay
   Infra-                                                Hadoop Cluster
structure &
   Data
  Source
   Layer                                  eBay Data (Logs, Tables)

                    E. Chiu, W. Mann, J. Shen, G. Singh, Z. Yang
Mobius – Generic Java Dataset API


•  Java-based, high-level data processing framework built
   on top of Apache Hadoop.
•  Tuple oriented.
•  Supports job chaining.
•  Supports high level operators such as join (inner or
   outer) or grouping.
•  Supports filtering.
•  Used internally at eBay for various data science
   applications.
•  http://labs.ebay.com/openmobius/
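
The operators listed above (filtering, inner join, grouping, chaining) can be illustrated with a minimal sketch in plain Python. This is not the Mobius API, whose Java signatures are not shown in these slides; the tuple fields and helper names below are hypothetical and only convey the tuple-oriented semantics.

```python
from collections import defaultdict

# Hypothetical tuples represented as dicts; field names are illustrative only.
searches = [
    {"guid": "u1", "query": "ipod"},
    {"guid": "u2", "query": "lego"},
    {"guid": "u3", "query": "ipod"},
]
users = [
    {"guid": "u1", "country": "US"},
    {"guid": "u2", "country": "DE"},
]

def filter_tuples(rows, predicate):
    """Keep only the tuples for which the predicate holds."""
    return [row for row in rows if predicate(row)]

def inner_join(left, right, key):
    """Hash join: emit a merged tuple for every matching pair on `key`."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def group_count(rows, key):
    """Group tuples by `key` and count the members of each group."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

# Chaining: the output of one step is the input of the next.
joined = inner_join(searches, users, "guid")
us_only = filter_tuples(joined, lambda t: t["country"] == "US")
per_query = group_count(us_only, "query")
```

On a cluster each step would run as one or more MapReduce jobs; here the chain simply passes in-memory lists along.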
Mobius – Inner Join
Mobius – Filtering
Mobius – Grouping
Mobius – Chaining
Mining Temporal Data

•  When it's in your mind, it's in the Query Logs!
   –  Queries as a proxy for demand
Mining Temporal Data

•  Data Preparation
   –  Robot Filtering
   –  Session Log Analysis

•  Data Cleaning
   –  Normalization
   –  De-duplication

[Figures: Christmas trend, raw data; Christmas trend, prepared data]
Mining Temporal Data – What’s Buzzing?

•  Automatic Buzz Detection
Mining Temporal Data – Does History Repeat
Itself?

•  Seasonality and Trend Prediction




     Why are searches related to monopoly pieces popular every
     October?
Air conditioner searches become popular as summer approaches
Mining Temporal Data – Temporal Similarity




Similar patterns for queries related to Hanukkah
Preparing Data – Getting Queries from User
Sessions

                              Typical eBay flow


        Search                       View                      Purchase



•  Search: specify a query, with optional constraints
•  View: click on an item shown on search results page
•  Purchase: buy a fixed-price item or place winning bid on an auction item


Consider only queries typed in by humans. Ignore page views from robots
and views arriving from paid advertisements, campaigns or natural search links.
Cleaning Data

•  Apply default robot detection and removal algorithm
   –  Based on IP, number of actions per day, agent information.

•  Find the right flows from the sessions.
   –  Filter out noisy search events.
   –  Remove anomalies due to outlier users.
   –  Limit the impact a single user can have on aggregated data (de-
      duplication).
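
A minimal sketch of the two cleaning ideas above, assuming a rule-based robot check and per-user de-duplication. The thresholds, agent markers, blocked IP and function names are hypothetical; the slides do not give the actual rules.

```python
from collections import Counter

# Hypothetical values, chosen only for illustration.
MAX_ACTIONS_PER_DAY = 1000
BOT_AGENT_MARKERS = ("bot", "crawler", "spider")
BLOCKED_IPS = {"10.0.0.99"}

def is_robot(ip, agent, actions_per_day):
    """Flag a user-day as robotic using IP, activity volume and agent string."""
    agent = agent.lower()
    return (
        ip in BLOCKED_IPS
        or actions_per_day > MAX_ACTIONS_PER_DAY
        or any(marker in agent for marker in BOT_AGENT_MARKERS)
    )

def deduplicated_volume(events):
    """Count each (user, day, query) once so one user cannot inflate a query.

    events: iterable of (guid, date, query) tuples.
    Returns a Counter keyed by (date, query).
    """
    unique = set(events)
    return Counter((date, query) for _guid, date, query in unique)

events = [
    ("u1", "2012-12-01", "christmas tree"),
    ("u1", "2012-12-01", "christmas tree"),  # repeat by the same user
    ("u2", "2012-12-01", "christmas tree"),
]
volume = deduplicated_volume(events)
```

Here the repeated search by u1 contributes only once, so the query's volume for the day is 2 rather than 3.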
Finding the right flow in the session

  Session 1:  Search → Exit
     May not be considered: flows without any interesting activity such as clicks.

  Session 2:  Ads/paid search → View → Purchase
     May not be considered: searches coming from advertisements.

  Session 3:  Search → View → Purchase
     Sessions of this kind are considered, and their information is aggregated.
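
The three session types can be told apart with a small state machine. The sketch below assumes a session is a time-ordered list of (event_type, payload) pairs; the event names are hypothetical labels, not eBay's log schema.

```python
def qualifying_queries(session):
    """Return queries from organic searches that led to at least one view.

    session: time-ordered list of (event_type, payload) tuples, where
    event_type is one of "search", "ad_search", "view", "purchase", "exit".
    """
    queries = []
    pending = None  # most recent organic search not yet credited with a view
    for event_type, payload in session:
        if event_type == "search":
            pending = payload
        elif event_type == "ad_search":
            pending = None  # traffic from ads is not a typed query
        elif event_type == "view" and pending is not None:
            queries.append(pending)
            pending = None  # credit each search at most once
    return queries

session_1 = [("search", "ipod"), ("exit", None)]
session_2 = [("ad_search", "ipod"), ("view", "item-1"), ("purchase", "item-1")]
session_3 = [("search", "ipod"), ("view", "item-1"), ("purchase", "item-1")]
```

Only the third session yields a query; the first has no click and the second originates from an advertisement.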
Data Preparation - Map Reduce Flow


Preprocessing stage (M → R)
   Map:     Read raw events.
   Reduce:  •  Group events into sessions.
            •  Group sessions by GUID.
            •  Apply bot filtering algorithm.
   Save the result so it can be reused by other apps.

Collecting stage (M → R)
   Map:     •  Find the right flow.
            •  Emit query as key.
            •  Emit de-duplicated query volume as value.
   Reduce:  Calculate sum per key.

Output: query volume, written daily as dailyQueryData.
Time Series Generation

Input: dailyQueryData for multi-year time-frames

Mapper    •  Data Cleaning.
          •  Query normalization.

             Key: query
             Value: date: query volume

[Figure: example time series; data not to scale and only shown as an example]

Reducer   •  Time Series formation for all unique queries
          •  Time Series indicating total daily activity volume

Output: Vectors of Query → Volume Time Series
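
A minimal sketch of turning daily (query, date, volume) records into one dense vector per query. The normalization rule (lowercasing and whitespace collapsing) is an assumption; the slides do not say which normalization is applied.

```python
from collections import defaultdict
from datetime import date

def normalize_query(query):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())

def build_time_series(daily_records, start, end):
    """Build a daily volume vector per normalized query.

    daily_records: iterable of (query, day, volume) with `day` a datetime.date.
    Returns {query: [volume for each day from start to end inclusive]}.
    """
    n_days = (end - start).days + 1
    series = defaultdict(lambda: [0] * n_days)
    for query, day, volume in daily_records:
        if start <= day <= end:
            series[normalize_query(query)][(day - start).days] += volume
    return dict(series)

records = [
    ("Air  Conditioner", date(2012, 6, 1), 5),
    ("air conditioner", date(2012, 6, 1), 3),
    ("air conditioner", date(2012, 6, 3), 7),
]
ts = build_time_series(records, date(2012, 6, 1), date(2012, 6, 3))
```

Days with no records stay at zero, so every vector has the same length and can be compared position by position.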
Buzz Detection – 2 state automaton model


•  Arrival of queries as a stream.
•  “low rate” state (q0) and a “high rate” state (q1).
•  Density of the gap x between arrivals in each state:
       $f_0(x) = \alpha_0 e^{-\alpha_0 x}$,   $f_1(x) = \alpha_1 e^{-\alpha_1 x}$,   where $\alpha_1 > \alpha_0$.
•  The automaton changes state with probability $p \in (0, 1)$ between query arrivals.
•  Let $Q = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$ be a state sequence. Each
   state sequence Q induces a density function $f_Q$ over
   sequences of gaps, which has the form
       $f_Q(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t)$




     N. Parikh, N. Sundaresan. KDD 2008.
     Scalable and Near Real-time Burst Detection from eCommerce Queries.
Buzz Detection – Modeling Queries as a
Stream




[Figure: frequency of a query over time, and the gaps between arrival times for queries]
Buzz Detection – 2 state automaton model

•  Let b denote the number of state transitions in sequence Q.
•  The prior probability of Q is given by
       $\left( \prod_{i_t \neq i_{t+1}} p \right) \left( \prod_{i_t = i_{t+1}} (1 - p) \right) = p^b (1 - p)^{n-b} = \left( \frac{p}{1 - p} \right)^b (1 - p)^n$
•  Using Bayes' theorem, the cost equation is
       $C(Q \mid X) = b \ln\left( \frac{1 - p}{p} \right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$

•  Sequence that minimizes the cost would depend on
   –  Ease of jumps between 2 states.
   –  How well the sequence conforms to the rate of query arrivals.
•  Configurable parameters for the model are α0, α1 and the transition probability p.
    –  α0, α1 are calculated from data in the MR job.
    –  Heuristically determined value of p = 0.38 is used.
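
The cost-minimizing state sequence can be found with Viterbi-style dynamic programming over the two states. The sketch below assumes the automaton starts in the low-rate state q0 and takes α0 and α1 as inputs; how they are estimated in the MR job is not described in the slides.

```python
import math

def two_state_sequence(gaps, alpha0, alpha1, p=0.38):
    """Return the state sequence (0 = low rate, 1 = high rate) minimizing
    C(Q|X) = b * ln((1-p)/p) + sum_t -ln f_{i_t}(x_t).

    Assumption: the automaton is in q0 before the first gap.
    """
    if not gaps:
        return []
    alphas = (alpha0, alpha1)
    jump = math.log((1 - p) / p)  # cost of one state transition

    def emit(state, x):
        # -ln(alpha * exp(-alpha * x)) for the exponential gap density
        return alphas[state] * x - math.log(alphas[state])

    cost = [0.0, math.inf]  # start in q0 with certainty
    back = []               # back[t][s] = best predecessor of state s at step t
    for x in gaps:
        new_cost, pointers = [], []
        for s in (0, 1):
            stay, switch = cost[s], cost[1 - s] + jump
            pointers.append(s if stay <= switch else 1 - s)
            new_cost.append(min(stay, switch) + emit(s, x))
        back.append(pointers)
        cost = new_cost

    # Trace the cheapest path backwards from the final step.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    return states[::-1]

# Long gaps (quiet period), then short gaps (burst), then long gaps again.
gaps = [10, 10, 10, 1, 1, 1, 1, 10, 10, 10]
states = two_state_sequence(gaps, alpha0=0.1, alpha1=1.0)
```

With these illustrative rates, the four short gaps are assigned to the high-rate state and the rest to the low-rate state. A smaller p raises the transition cost and suppresses short bursts.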
Query Volume Time Series – 2 State
Representation
Time Series Normalization and Buzz Detection

Input: 4-7 Years Query Time Series Vectors

Mapper    •  Normalize Time Series
          •  Transform Time Series to two state model
                •  Calculate parameters α0, α1 for every query and
                   apply dynamic programming for 2 state calculation
          •  Calculate probability of being a periodic event query, e.g.
             Super Bowl

             Key: query
             Value: normalized time series, two state model,
                    probability of being a seasonal event query

             Key: time-frame
             Value: query that buzzes during that time frame

Reducer   •  Group queries buzzing at similar time intervals

Output: time-frame → Queries Buzzing during that time-period
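
The regrouping from per-query state sequences to per-time-frame query lists might look like the sketch below. It assumes one state per day and a fixed seven-day time-frame; the actual definition of a time-frame is not given in the slides.

```python
from collections import defaultdict

def buzzing_frames(states, frame_length=7):
    """Return the time-frame indices that contain at least one high-state day."""
    return {day // frame_length for day, s in enumerate(states) if s == 1}

def group_by_frame(query_states, frame_length=7):
    """Emit (time-frame, query) pairs, then collect the queries per time-frame."""
    frames = defaultdict(set)
    for query, states in query_states.items():
        for frame in buzzing_frames(states, frame_length):
            frames[frame].add(query)
    return {frame: sorted(queries) for frame, queries in sorted(frames.items())}

# Hypothetical two-state representations over 14 days.
query_states = {
    "menorah": [0] * 7 + [1] * 3 + [0] * 4,
    "dreidel": [0] * 8 + [1] * 2 + [0] * 4,
    "ipod": [0] * 14,
}
buzz = group_by_frame(query_states)
```

Both Hanukkah-related queries land in the second week (frame index 1), while the query that never enters the high state is not emitted at all.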
Catman – http://labs.ebay.com/Catman/
Trends Application for eBay sellers & buyers
Binary data structure generation from MR job


•  Created new FileOutputFormat
•  Write time series data to two files
   – Binary File with fixed sized records indicating time
     series volume
   – Text file mapping each unique query string to binary
     file and offset
•  Index created by reducers directly loaded by custom
   servers written in C++.
•  Used for an internal Query Trends Application
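
The two-file layout can be sketched as follows. The original was a custom Hadoop FileOutputFormat read by C++ servers; this Python version only illustrates the idea, and the record encoding (little-endian unsigned 32-bit counts) and the tab-separated index format are assumptions.

```python
import struct
import tempfile
from pathlib import Path

def write_index(series, bin_path, idx_path):
    """Write fixed-size binary records plus a text index of byte offsets.

    series: {query: [daily volumes]}, all vectors of the same length.
    Index lines have the form: query <TAB> binary file name <TAB> offset.
    """
    with open(bin_path, "wb") as bin_file, \
            open(idx_path, "w", encoding="utf-8") as idx_file:
        for query in sorted(series):
            volumes = series[query]
            offset = bin_file.tell()
            bin_file.write(struct.pack(f"<{len(volumes)}I", *volumes))
            idx_file.write(f"{query}\t{Path(bin_path).name}\t{offset}\n")

def read_series(query, idx_path, n_days):
    """Look up a query in the text index and seek straight to its record."""
    idx_path = Path(idx_path)
    with open(idx_path, encoding="utf-8") as idx_file:
        for line in idx_file:
            name, file_name, offset = line.rstrip("\n").split("\t")
            if name == query:
                with open(idx_path.parent / file_name, "rb") as bin_file:
                    bin_file.seek(int(offset))
                    data = bin_file.read(4 * n_days)
                return list(struct.unpack(f"<{n_days}I", data))
    return None

with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "series.bin"
    idx_path = Path(tmp) / "series.idx"
    write_index({"ipod": [1, 2, 3], "lego": [4, 5, 6]}, bin_path, idx_path)
    lego = read_series("lego", idx_path, n_days=3)
```

Because every record has the same size, a server can load the index into memory and fetch a time series with a single seek. The sketch assumes query strings contain no tab characters.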
Query Trends
Query Trends – Mapping to External Events
Trends – Comparing Queries
Temporal Similarity

•  1+ Billion Queries
•  Naïve Algorithm – Quadratic Complexity
•  Pearson’s Correlation




•  Candidate Set Reduction
   –  Correlations useful only for event-based or seasonal queries
   –  Correlations useful in applications only for head and torso queries
   –  These filters reduce the candidate space from billions of queries to a few million.
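
A minimal sketch of candidate set reduction followed by all-pairs Pearson correlation on the reduced set. The volume cut-off, the precomputed set of seasonal queries and the correlation threshold are hypothetical parameters, not values from the talk.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length series.

    Returns 0.0 for a constant series, where the coefficient is undefined
    (a convention chosen for this sketch).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(series, seasonal, min_volume, threshold=0.9):
    """Filter to seasonal head/torso queries, then correlate all pairs."""
    candidates = [
        q for q, volumes in series.items()
        if q in seasonal and sum(volumes) >= min_volume
    ]
    pairs = []
    for a, b in combinations(sorted(candidates), 2):
        r = pearson(series[a], series[b])
        if r >= threshold:
            pairs.append((a, b, r))
    return pairs

series = {
    "menorah": [1, 2, 10, 2, 1],
    "dreidel": [2, 4, 20, 4, 2],
    "ipod": [5, 5, 5, 5, 5],   # not seasonal
    "rare": [0, 0, 1, 0, 0],   # tail query, too little volume
}
pairs = correlated_pairs(series, seasonal={"menorah", "dreidel", "rare"}, min_volume=10)
```

The pairwise loop is still quadratic, but it runs over the reduced candidate set rather than over every unique query.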
Exact Correlations amongst candidates – All
pairs similarity on Reduced Set
Applications of Temporal Correlations – Query
Suggestions
Remarks


•  Log Mining and Time Series mining algorithms are
   parallelizable.
•  Easy to scale such algorithms using Hadoop.
•  Hadoop empowers us to look at data-sets spanning
   years and years.
•  Hadoop enables us to iterate faster and hence run more
   user-facing experiments.
References

•  Parikh et al. Scalable and near real-time burst detection from
   eCommerce queries. KDD 2008.
•  Sundaresan et al. Scalable Stream Processing and Map Reduce.
   Hadoop World 2009.
•  Anil Madan. Hadoop at eBay.
   http://www.slideshare.net/madananil/hadoop-at-ebay.
•  N Sundaresan. Popup Commerce, Towards Building Transient and
   Thematic Stores. X.Innovate 2011.
•  Pantel et al. Web-Scale Distributional Similarity and Entity Set
   Expansion. EMNLP 2009.
Acknowledgments : Project Members &
Alumni

•  Long Hoang
•  Rifat Joyee
•  Wallace Mann
•  Karin Mauge
•  Hill Nguyen
•  Jack Shen
•  Gyanit Singh
•  Zhou Yang
Questions?

•  nparikh@ebay.com
