+




    Engineering Challenges
    in Vertical Search Engines
    Aleksandar Bradic, Senior Director,
    Engineering and R&D, Vast.com
+
    Introduction

        Vertical Search
             Search focused on vertical data
             Vertical Data – data inherently described by its structure:
                Items/Properties for sale (Automotive, Real Estate..)

                  Geographical Data (Neighborhoods, Locations..)
                  Services (Hotels, Transportation..)
                  Businesses (Restaurants, Nightlife..)
                  Events (Concerts, Plays..)
                  Auction items (Collectibles, Art..)
                  Metadata (News, Social Data, Reviews..)
                  …
+
    Introduction

        Vertical Search != Full Text Search
             Full Text Search queries:
                “Cheap tickets for Broadway shows this week”
                “Trendy Restaurants in San Francisco near SoMa”
                “3-day trips from NYC to anywhere under $1000”
             Vertical Search queries:
                “price-sorted results below two standard deviations from tickets
                 category with Broadway as location and date range of 2010-04-11 to
                 2010-04-18”
                “distance-sorted results relative to center of SF/SoMa matching the
                 appropriate threshold of composite score of user review scores and
                 historical change in query/review volume”
                “total cost-sorted results for all 3-day intervals within next 6 months
                 combining hotel and airfare price below max value of $1000 for all
                 valid locations”
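
A minimal sketch of the first vertical query above, assuming a hypothetical `Listing` record with `category`, `location`, `event_date` and `price` fields (none of these names come from the slides). It only shows the field filter, the date range and the price sort; the standard-deviation threshold is left out because the slide does not say which distribution it refers to.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Listing:
    # Hypothetical record layout; field names are illustrative only.
    category: str
    location: str
    event_date: date
    price: float


def structured_query(listings, category, location, start, end):
    """Filter on exact field values and a date range, then sort by price."""
    matches = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.event_date <= end
    ]
    return sorted(matches, key=lambda item: item.price)


listings = [
    Listing("tickets", "Broadway", date(2010, 4, 12), 120.0),
    Listing("tickets", "Broadway", date(2010, 4, 15), 45.0),
    Listing("tickets", "Broadway", date(2010, 5, 1), 30.0),       # outside date range
    Listing("tickets", "Off-Broadway", date(2010, 4, 13), 25.0),  # wrong location
]

results = structured_query(
    listings, "tickets", "Broadway", date(2010, 4, 11), date(2010, 4, 18)
)
prices = [item.price for item in results]
```

Unlike a full-text query, every condition here is a predicate on a typed field, so no text matching is involved.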
+
    Introduction

        Vertical Search = search on structured data

        Vertical Search at Web-Scale:
             Web-Scale datasets
             Web-Scale query volumes
             Interactive operation
             Low latency requirements
             Utility maximization across all involved parties

        => loads of fun ! : )
+
    @Vast.com

        Vast.com : Vertical Search & Analytics Platform

        Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
         Airlines, etc.
+
    @Vast.com

        Daily processing of up to 1 TB of unstructured and
         semi-structured Web data

        Managing an operational dataset of ~150M records across
         multiple verticals

        Handling peak search loads of > 1000 queries/sec



        We’re hiring ! : )
+
    Challenges in Vertical Search
    Engines
        Web Data Retrieval

        Unstructured Data

        Data Processing Infrastructures

        Vertical Search

        Data Analytics

        Computational Advertising
+
    Web Data Retrieval

        Crawler Architecture
             Queue Management
             Crawl Ordering Policies
             Duplicate URL Detection
             Content Hash Management
             Politeness Management
             Coverage Measurement
             Freshness Optimization
             Incremental Crawling
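
Three of the items above (queue management, duplicate URL detection, politeness management) can be sketched together as a toy crawl frontier. This is an illustration under simple assumptions, not a description of any production crawler: the `Frontier` class, its fixed per-host delay and the fragment-stripping normalization are all hypothetical choices.

```python
import heapq
from urllib.parse import urldefrag, urlsplit


class Frontier:
    """Toy crawl queue: de-duplicates URLs and spaces out requests per host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.seen = set()        # normalized URLs already enqueued
        self.next_allowed = {}   # host -> earliest time of next fetch
        self.heap = []           # (ready_time, sequence, url)
        self.sequence = 0        # tie-breaker that keeps FIFO order

    def add(self, url, now=0.0):
        url = urldefrag(url).url            # drop "#fragment" before comparing
        if url in self.seen:
            return False                    # duplicate URL detection
        self.seen.add(url)
        host = urlsplit(url).netloc
        ready = max(now, self.next_allowed.get(host, 0.0))
        self.next_allowed[host] = ready + self.delay   # politeness gap
        heapq.heappush(self.heap, (ready, self.sequence, url))
        self.sequence += 1
        return True

    def pop(self):
        """Return (ready_time, url) for the next URL allowed to be fetched."""
        ready, _, url = heapq.heappop(self.heap)
        return ready, url


frontier = Frontier(delay_seconds=2.0)
added = [
    frontier.add("http://a.example/1"),
    frontier.add("http://a.example/1#top"),   # same page, different fragment
    frontier.add("http://a.example/2"),
    frontier.add("http://b.example/1"),
]
order = [frontier.pop() for _ in range(3)]
```

The second URL on `a.example` is scheduled after the delay, so the URL on `b.example` is served first even though it was added later.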
+
    Web Data Retrieval

        “Deep Web” crawling
             Locating Deep Web Content Sources
             Selecting Relevant Sources
             Estimating Database Size
             Understanding Content / Form Detection
             Automatic Dispatch of HTML Forms
             Predicting content in free text forms
             Crawling non-HTML Content
             Estimating Query Result Sparsity
             URL Generation problem
             Query Covering Problem
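
The URL generation problem can be illustrated with a small sketch that enumerates GET URLs for every combination of candidate form values. The form action URL and field names are hypothetical, and the sketch assumes the form submits via GET with select-style inputs.

```python
from itertools import product
from urllib.parse import urlencode


def generate_form_urls(action_url, field_values):
    """Yield one GET URL per combination of candidate field values."""
    names = sorted(field_values)                    # fixed order for stable output
    for combo in product(*(field_values[name] for name in names)):
        yield action_url + "?" + urlencode(list(zip(names, combo)))


# Hypothetical search form with two select-style inputs.
urls = list(generate_form_urls(
    "http://cars.example/search",
    {"make": ["ford", "honda"], "year": ["2008", "2009"]},
))
```

The number of generated URLs is the product of the value counts per field, which is one reason result sparsity estimation and query covering matter before dispatching forms.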
+
    Web Data Retrieval

        Focused (Topical) Crawling
             Content Classification
             Link Content Prediction
             Topic Relevance Estimation

        Modeling Temporal Characteristics
             Site-Level Evolution
             Page-Level Evolution

        Adversarial Crawling
             Web Spam Detection
             Cloaked Content Detection
+
    Unstructured Data

        Unstructured Data – information that does not have a
         pre-defined data model

        Handling Unstructured Data:
             Data Cleaning
             Tagging with Metadata
             Vertical Classification
             Schema Matching
             Information Extraction


    Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

    Ford     Focus    2008    Convertible    $7000    Absolute Beauty !!!!
    make     model    year    trim           price    ???
+
    Unstructured Data

        Information extraction from unstructured, ungrammatical
         data
             Reference Sets - relational data sets that consist of a collection
              of known entities with associated common attributes
             Reference Set Selection
             Reference Set Generation
             Record Linkage : Finding the “best matching” member of the
              reference set for a given post
             Challenge : Automatic Generation of Reference Sets
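
Record linkage against a reference set can be sketched as follows. The reference set contents, the token-overlap score and the regular expressions for year and price are all assumptions made for illustration; real posts would need fuzzier matching than exact token overlap.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Coupe"},
]


def tokens(text):
    """Lower-case alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set):
    """Return the reference entry sharing the most attribute tokens with the post."""
    post_tokens = tokens(post)

    def score(entry):
        return len(post_tokens & tokens(" ".join(entry.values())))

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post, reference_set):
    """Combine the linked reference entry with pattern-based year and price."""
    record = dict(link_record(post, reference_set) or {})
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group())
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record


post = "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!"
record = extract(post, REFERENCE_SET)
```

The free-text remainder ("Absolute Beauty !!!!") stays unassigned here.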
+
    Data Processing Infrastructures

        Infrastructures for continuous processing of unbounded streams
         of unstructured data
        Information Extraction as part of processing (non-trivial
         computation for each processed entry)

        Inherently distributed infrastructures - in order to support
         performance and scalability

        Time-to-site constraints. Ability to process out-of-band data.

        Support for complex operations on aggregated data
         (de-duplication, static ranking, data enrichment, data
         cleaning/filtering …)

        Support for data archival and off-line analysis
+
    Data Processing Infrastructures

        Distributed Computing Platforms:

             Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

             Stream-oriented (Flume, S4, Stream SQL…)

             Distributed Data Stores (Dynamo/Cassandra/Riak…)

        The curse of the CAP Theorem:
             It is impossible for a distributed system to simultaneously provide
              all three of the following guarantees:
                Consistency
                Availability
                Partition tolerance
+
    Vertical Search

        Large-Scale structured data search

        Providing both analytic and canonical Information Retrieval
         functionality

        Entries are represented in Vector Space Model

        Each result is represented as a data point – a tuple consisting
         of the appropriate number of fields :

         (make, model, year, trim …)
+
    Vertical Search

        Search in Vector Space Model
             Resulting subset generation
             Sorting as linearization using selected metric
             Dynamic subset criteria calculation
             Search Result Clustering
             “Similar” result search
             …



… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
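
A small sketch of three of the operations listed above: resulting subset generation, sorting as linearization under a metric, and “similar” result search. The index contents, the choice of fields (year, price, mileage) and the per-field scales in the distance function are hypothetical. A linear scan like this would not meet the latency figures above at 10M+ records; it only illustrates the operations.

```python
import math

# Hypothetical index: each entry is a point (year, price, mileage).
INDEX = {
    "a": (2008, 7000.0, 60000.0),
    "b": (2009, 9500.0, 40000.0),
    "c": (2005, 4000.0, 120000.0),
    "d": (2008, 7200.0, 65000.0),
}
YEAR, PRICE, MILEAGE = 0, 1, 2


def subset(index, predicate):
    """Resulting subset generation: keep the points satisfying the predicate."""
    return {key: point for key, point in index.items() if predicate(point)}


def linearize(points, metric):
    """Sorting as linearization: order keys by a scalar metric of each point."""
    return sorted(points, key=lambda key: metric(points[key]))


def similar(index, key, scales, k=1):
    """'Similar' result search: k nearest neighbours under scaled Euclidean distance."""
    origin = index[key]

    def distance(point):
        return math.sqrt(
            sum(((a - b) / s) ** 2 for a, b, s in zip(origin, point, scales))
        )

    others = {other: point for other, point in index.items() if other != key}
    return linearize(others, distance)[:k]


recent = subset(INDEX, lambda p: p[YEAR] >= 2008)
by_price = linearize(recent, lambda p: p[PRICE])
nearest = similar(INDEX, "a", scales=(1.0, 1000.0, 10000.0))
```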
+
    Vertical Search

        Faceted Search
             fac-et (fas’it) :
                1. One of the flat polished surfaces cut on a gemstone or occurring
                 naturally on a crystal.
                2. One of numerous aspects, as of a subject.


             Vocabulary problem for faceted data
                “the keywords that are assigned by indexers are often at
                  odds with those tried by searchers.”
             Facet Design / selection
                Selection of information-distinguishing facet values
             User-specific faceted search
             Dynamic correlated facet generation
             Distributing facet computation
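
The basic facet computation can be sketched as counting values per field over the current result set, then recomputing after the user selects a value. The result records and field names below are hypothetical.

```python
from collections import Counter, defaultdict


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    counts = defaultdict(Counter)
    for result in results:
        for field in facet_fields:
            if field in result:
                counts[field][result[field]] += 1
    return counts


def refine(results, **selected):
    """Narrow the result set to entries matching every selected facet value."""
    return [r for r in results if all(r.get(f) == v for f, v in selected.items())]


# Hypothetical result set for an automotive query.
results = [
    {"make": "Ford", "model": "Focus", "year": 2008},
    {"make": "Ford", "model": "Fiesta", "year": 2009},
    {"make": "Honda", "model": "Civic", "year": 2008},
]

counts = facet_counts(results, ["make", "year"])
fords = refine(results, make="Ford")
ford_years = facet_counts(fords, ["year"])
```

Because `Counter` objects can be added together, per-shard counts computed this way could be merged, which is one simple way to approach distributing the computation.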
+
    Data Analytics

        Clickstream Data Analysis

        Learning from implicit user feedback

        Anonymous user clustering

        Learning to rank

        Inventory/Market Trends

        Rare Event detection

        Price Prediction

        Spam Content detection
+
    Data Analytics

        Challenges:
             “Good Deal” detection
             Recommendation Systems for Vertical Data with no explicit user
              feedback
             Accuracy of Automatic Valuation Models
             Data-driven feature design
             Click Prediction
             User Behavior Modeling
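
One naive way to approach “Good Deal” detection is to compare each listing's price with comparable listings. The sketch below assumes listings are comparable when they share make, model and year, and flags those at least `threshold` sample standard deviations below the group mean; the grouping key, the minimum group size and the threshold value are all illustrative assumptions.

```python
from statistics import mean, stdev


def good_deals(listings, threshold=1.5):
    """Flag listings priced at least `threshold` std devs below their group's mean."""
    groups = {}
    for listing in listings:
        key = (listing["make"], listing["model"], listing["year"])
        groups.setdefault(key, []).append(listing)

    deals = []
    for group in groups.values():
        if len(group) < 3:
            continue                      # too few comparables to judge
        prices = [item["price"] for item in group]
        mu, sigma = mean(prices), stdev(prices)
        if sigma == 0:
            continue                      # identical prices: nothing stands out
        deals.extend(
            item["id"] for item in group
            if (item["price"] - mu) / sigma <= -threshold
        )
    return deals


# Hypothetical listings: six comparable cars plus one with no comparables.
focus_prices = [9000, 9200, 8800, 9100, 8900, 6000]
listings = [
    {"id": i, "make": "Ford", "model": "Focus", "year": 2008, "price": p}
    for i, p in enumerate(focus_prices)
]
listings.append({"id": 99, "make": "Honda", "model": "Civic", "year": 2008, "price": 100})

deals = good_deals(listings)
```

A very low price can also indicate spam or a data extraction error, so a check like this would normally be combined with the other signals on this slide.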
+
    Computational Advertising

        The central problem of computational advertising is to find
         the "best match" between a given user in a given context and a
         suitable advertisement.


    [Figure: page layout with ad blocks placed around the search results]
+
    Computational Advertising

        Vertical Search presents an additional challenge in the sense
         that any of the actual search results can be “sponsored”

    [Figure: search results list in which individual results are marked “ad ?”]
+
    Computational Advertising

        Central challenge:
             Find the “best match” between a given user in a given context
              and a suitable advertisement
             “best match” – maximizing the value for :
                  Users
                  Advertisers
                  Publishers
             Each party has a different set of utilities:
                  Users want relevance
                  Advertisers want ROI and volume
                  Publishers want revenue per impression/search
+
    Computational Advertising

        CTR (Click-Through Rate) Estimation:
             Reactive (statistically significant historical CTR)
             Predictive (CTR estimated from features of ads)
             Hybrid (historical + predictive)


             Personalization of CTR Computation ?
             Dynamic CTR Estimation (online algorithms)




                                  P(click) = ?
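
The three estimation styles above can be sketched side by side. The feature names, weights and the `prior_strength` value are made up for illustration; the hybrid estimator shown is one simple option (treating the predicted CTR as pseudo-impressions), not necessarily the method the slides have in mind.

```python
import math


def reactive_ctr(clicks, impressions):
    """Historical CTR; undefined until there is at least one impression."""
    return clicks / impressions if impressions else None


def predictive_ctr(features, weights, bias):
    """Logistic model over ad features (weights here are made up)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def hybrid_ctr(clicks, impressions, prior_ctr, prior_strength=100.0):
    """Blend: the prior acts like `prior_strength` pseudo-impressions."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)


weights = {"has_photo": 0.8, "price_below_market": 1.2}
prior = predictive_ctr({"has_photo": 1.0, "price_below_market": 0.0}, weights, bias=-3.0)

new_ad = hybrid_ctr(clicks=0, impressions=0, prior_ctr=prior)
old_ad = hybrid_ctr(clicks=500, impressions=10_000, prior_ctr=prior)
```

With no history the hybrid estimate falls back to the predicted value; with 10,000 impressions it sits close to the observed 5% rate.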
+
    Computational Advertising

        Analytical Apparatus:
             Regression Analysis (linear, logistic, probit,
              high-dimensional methods)
             Game Theory (Nash Equilibria, dominant strategy)
             Auction Theory (Vickrey, GSP, VCG…)
             Graph Theory (random walks on graphs, graph matching, etc.)
             Information Retrieval Techniques (similarity metrics, etc.)
             …
+
    Conclusion

        Vertical Search & Analytics at Web Scale == fun !!!

        A source of a large number of relevant research & engineering
         problems !

        An opportunity to apply a wide spectrum of techniques across
         all areas of Computer Science and Engineering !




                                       Jump on the bandwagon ! : )

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

Engineering challenges in vertical search engines

  • 1. + Engineering Challenges in Vertical Search Engines Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
  • 2. + Introduction   Vertical Search   Search focused on vertical data   Vertical Data – data inherently described by its structure:   Items/Properties for sale (Automotive, Real Estate..)   Geographical Data (Neighborhoods, Locations..)   Services (Hotels, Transportation..)   Businesses (Restaurants, Nightlife..)   Events (Concerts, Plays..)   Auction items (Collectibles, Art..)   Metadata (News, Social Data, Reviews..)   …
  • 3. + Introduction   Vertical Search != Full Text Search   Full Text Search queries:   “Cheap tickets for Broadway shows this week”   “Trendy Restaurants in San Francisco near SoMa”   “3-day trips from NYC to anywhere under $1000”   Vertical Search queries:   “price-sorted results below two standard deviations from tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”   “distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”   “total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price below max value of $1000 for all valid locations”
  • 4. + Introduction   Vertical Search = search on structured data   Vertical Search at Web-Scale:   Web-Scale datasets   Web-Scale query volumes   Interactive operation   Low latency requirements   Utility maximization across all involved parties   => loads of fun ! : )
  • 5. + @Vast.com   Vast.com : Vertical Search & Analytics Platform   Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..
  • 6. + @Vast.com   Daily processing of up to 1 TB of unstructured and semi-structured Web data   Managing an operational dataset of ~150M records across multiple verticals   Handling peak search loads of > 1000 queries/sec   We’re hiring ! : )
  • 7. + Challenges in Vertical Search Engines   Web Data Retrieval   Unstructured Data   Data Processing Infrastructures   Vertical Search   Data Analytics   Computational Advertising
  • 8. + Web Data Retrieval   Crawler Architecture   Queue Management   Crawl Ordering Policies   Duplicate URL Detection   Content Hash Management   Politeness Management   Coverage Measurement   Freshness Optimization   Incremental Crawling
  • 9. + Web Data Retrieval   ”Deep Web” crawling   Locating Deep Web Content Sources   Selecting Relevant Sources   Estimating Database Size   Understanding Content / Form Detection   Automatic Dispatch of HTML Forms   Predicting content in free text forms   Crawling non-HTML Content   Estimating Query Result Sparsity   URL Generation problem   Query Covering Problem
  • 10. + Web Data Retrieval   Focused (Topical) Crawling   Content Classification   Link Content Prediction   Topic Relevance Estimation   Modeling Temporal Characteristics   Site-Level Evolution   Page-Level Evolution   Adversarial Crawling   Web Spam Detection   Cloaked Content Detection
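Two of the crawler items listed in the Web Data Retrieval slides, Duplicate URL Detection and Content Hash Management, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the talk: the helper names (`canonical_url`, `content_hash`, `SeenFilter`), the normalization rules, and the use of in-memory sets (a crawler at web scale would more likely need a disk-backed or probabilistic structure).

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, query, ""))


def content_hash(body):
    """Hash the page text after collapsing whitespace and case."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers canonical URLs and content hashes already processed."""

    def __init__(self):
        self.urls = set()
        self.hashes = set()

    def is_new_url(self, url):
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, body):
        key = content_hash(body)
        if key in self.hashes:
            return False
        self.hashes.add(key)
        return True
```

An exact hash only catches copies that are identical after normalization; near-duplicate pages would need a similarity-based fingerprint instead.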
  • 11. + Unstructured Data   Unstructured Data – information that does not have a pre-defined data model   Handling Unstructured Data:   Data Cleaning   Tagging with Metadata   Vertical Classification   Schema Matching   Information Extraction   Example post: “Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!”   Target fields: make, model, year, trim, price, ???
  • 12. + Unstructured Data   Information extraction from unstructured, ungrammatical data   Reference Sets - relational data sets that consist of a collection of known entities with associated common attributes   Reference Set Selection   Reference Set Generation   Record Linkage : Finding the “best matching” member of the reference set corresponding to a post   Challenge : Automatic Generation of Reference Sets
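The record linkage idea can be sketched on the example post from the Unstructured Data slide. This is a toy illustration under stated assumptions: the three-entry `REFERENCE_SET`, the token-overlap score, the `min_score` cut-off and the regular expressions for year and price are all hypothetical, and production systems would use far richer matching.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link(post, min_score=0.5):
    """Return the reference entry sharing the largest fraction of its tokens with the post."""
    post_tokens = tokens(post)
    best, best_score = None, 0.0
    for ref in REFERENCE_SET:
        ref_tokens = tokens(" ".join(ref.values()))
        score = len(post_tokens & ref_tokens) / len(ref_tokens)
        if score > best_score:
            best, best_score = ref, score
    if best_score < min_score:
        return None, best_score
    return best, best_score


def extract(post):
    """Combine the linked reference entry with year and price found by pattern."""
    ref, _ = link(post)
    record = dict(ref) if ref else {}
    year = re.search(r"\b(?:19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group())
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record


post = "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!"
print(extract(post))
```

The reference set supplies the clean attribute values (make, model, trim), while the free-form attributes (year, price) are pulled directly from the text.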
  • 13. + Data Processing Infrastructures   Infrastructures for continuous processing of unbounded streams of unstructured data   Information Extraction as part of processing (non-trivial computation per processed entry)   Inherently distributed infrastructures - in order to support performance and scalability   Time-to-site constraints. Ability to process out-of-band data.   Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)   Support for data archival and off-line analysis
  • 14. + Data Processing Infrastructures
  • 15. + Data Processing Infrastructures   Distributed Computing Platforms:   Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)   Stream-oriented (Flume, S4, Stream SQL…)   Distributed Data Stores (Dynamo/Cassandra/Riak…)   The curse of CAP Theorem:   It is impossible for a distributed system to simultaneously provide all three of the following guarantees:   Consistency   Availability   Partition tolerance
  • 16. + Vertical Search   Large-Scale structured data search   Providing both analytic and canonical set of Information Retrieval functionalities   Entries are represented in Vector Space Model   Each result is represented as data point – tuple consisting of appropriate number of fields : (make, model, year, trim …)
  • 17. + Vertical Search   Search in Vector Space Model   Resulting subset generation   Sorting as linearization using selected metric   Dynamic subset criteria calculation   Search Result Clustering   “Similar” result search   … with up to ~100 ms response time … at 10M+ records in index … handling 100+ queries/sec/host
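The operations named on this slide (subset generation, sorting as linearization by a selected metric, and “similar” result search) can be sketched over a handful of tuples. The sample `RECORDS`, the function names and the min-max scaled Euclidean distance are assumptions for illustration only; a linear scan like this would not meet the latency figures quoted above on an index of 10M+ records.

```python
from math import sqrt

# Hypothetical data points: each result is a tuple of fields.
RECORDS = [
    {"id": 1, "make": "Ford", "model": "Focus", "year": 2008, "price": 7000},
    {"id": 2, "make": "Ford", "model": "Focus", "year": 2010, "price": 9500},
    {"id": 3, "make": "Honda", "model": "Civic", "year": 2008, "price": 8200},
    {"id": 4, "make": "Ford", "model": "Fiesta", "year": 2011, "price": 8800},
]


def search(records, equals=None, ranges=None, sort_by=None, descending=False):
    """Generate the matching subset, then linearize it by one field."""
    equals = equals or {}
    ranges = ranges or {}
    hits = [
        r for r in records
        if all(r[f] == v for f, v in equals.items())
        and all(lo <= r[f] <= hi for f, (lo, hi) in ranges.items())
    ]
    if sort_by is not None:
        hits.sort(key=lambda r: r[sort_by], reverse=descending)
    return hits


def similar(records, target, fields=("year", "price"), k=2):
    """Return the k records closest to target in min-max scaled numeric space."""
    spans = {}
    for f in fields:
        values = [r[f] for r in records]
        spans[f] = (max(values) - min(values)) or 1  # avoid division by zero

    def distance(r):
        return sqrt(sum(((r[f] - target[f]) / spans[f]) ** 2 for f in fields))

    others = [r for r in records if r["id"] != target["id"]]
    return sorted(others, key=distance)[:k]


cheap_fords = search(RECORDS, equals={"make": "Ford"},
                     ranges={"price": (0, 9000)}, sort_by="price")
nearest = similar(RECORDS, RECORDS[0], k=1)
```

Scaling each field by its range keeps price (thousands of dollars) from dominating year (single digits) in the distance.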
  • 18. + Vertical Search   Faceted Search   fac-et (fas’it) :   1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.   2. One of numerous aspects, as of a subject.   Vocabulary problem for faceted data   Facet Design / selection   “the keywords that are assigned by indexers are often at odds with those tried by searchers.”   Selection of information-distinguishing facet values   User-specific faceted search   Dynamic correlated facet generation   Distributing facet computation
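A small sketch of facet computation over a result set follows, together with one possible reading of “information-distinguishing” facets: ranking facets by the Shannon entropy of their value distribution. The entropy criterion, the `LISTINGS` sample and the function names are assumptions, not something the slide specifies.

```python
from collections import Counter
from math import log2

# Hypothetical result set for a query.
LISTINGS = [
    {"make": "Ford", "year": 2008},
    {"make": "Ford", "year": 2008},
    {"make": "Ford", "year": 2010},
    {"make": "Honda", "year": 2011},
]


def facet_counts(records, facet_fields):
    """Count how many records fall under each value of each facet."""
    counts = {f: Counter() for f in facet_fields}
    for r in records:
        for f in facet_fields:
            if f in r:
                counts[f][r[f]] += 1
    return counts


def facet_entropy(counter):
    """Shannon entropy (bits) of a facet's value distribution."""
    total = sum(counter.values())
    return -sum((n / total) * log2(n / total) for n in counter.values())


def rank_facets(records, facet_fields):
    """Order facets from most to least evenly split over the records."""
    counts = facet_counts(records, facet_fields)
    return sorted(facet_fields, key=lambda f: facet_entropy(counts[f]), reverse=True)


counts = facet_counts(LISTINGS, ["make", "year"])
ranking = rank_facets(LISTINGS, ["make", "year"])
```

A facet whose values split the results evenly narrows the set more per click than one dominated by a single value, which is why higher entropy ranks first here.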
  • 19. + Data Analytics   Clickstream Data Analysis   Learning from implicit user feedback   Anonymous user clustering   Learning to rank   Inventory/Market Trends   Rare Event detection   Price Prediction   Spam Content detection
  • 20. + Data Analytics   Challenges:   “Good Deal” detection   Recommendation Systems for Vertical Data with no explicit user feedback   Accuracy of Automatic Valuation Models   Data-driven feature design   Click Prediction   User Behavior Modeling
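As a naive baseline for “Good Deal” detection, one could flag listings priced well below comparable items, echoing the “below two standard deviations” query from the introduction. The grouping by (make, model), the z-score rule, the threshold and the sample data are all assumptions; an actual valuation model would account for many more features.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical listings of one model; listing 6 is priced far below the rest.
LISTINGS = [
    {"id": 1, "make": "Ford", "model": "Focus", "price": 9000},
    {"id": 2, "make": "Ford", "model": "Focus", "price": 9200},
    {"id": 3, "make": "Ford", "model": "Focus", "price": 8800},
    {"id": 4, "make": "Ford", "model": "Focus", "price": 9100},
    {"id": 5, "make": "Ford", "model": "Focus", "price": 9000},
    {"id": 6, "make": "Ford", "model": "Focus", "price": 5000},
]


def good_deals(listings, threshold=2.0):
    """Return ids of listings priced more than `threshold` std devs below their group mean."""
    groups = defaultdict(list)
    for item in listings:
        groups[(item["make"], item["model"])].append(item)

    deals = []
    for items in groups.values():
        prices = [i["price"] for i in items]
        mu, sigma = mean(prices), pstdev(prices)
        if sigma == 0:
            continue  # all prices identical, nothing stands out
        for i in items:
            if (i["price"] - mu) / sigma < -threshold:
                deals.append(i["id"])
    return deals
```

The same outlier can also be spam or a data error rather than a bargain, which is where the Spam Content detection and Rare Event detection items from the previous slide come in.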
  • 21. + Computational Advertising   The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.   [Figure: ads placed around the search results]
  • 22. + Computational Advertising   Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”   [Figure: search results, each possibly an ad]
  • 23. + Computational Advertising   Central challenge:   Find the “best match” between a given user in a given context and a suitable advertisement   “best match” – maximizing the value for :   Users   Advertisers   Publishers   Each of the parties has different set of utilities:   Users want relevance   Advertisers want ROI and volume   Publishers want revenue per impression/search
  • 24. + Computational Advertising   CTR (Click-Through Rate) Estimation:   Reactive (statistically significant historical CTR)   Predictive (CTR estimated from features of ads)   Hybrid (historical + predictive)   Personalization of CTR Computation ?   Dynamic CTR Estimation (online algorithms)   P(click) = ?
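One common way to combine the reactive and predictive estimates into a hybrid is to treat the predicted CTR as a prior and let observed clicks gradually override it. The sketch below assumes this smoothing scheme; the function name and the `prior_strength` parameter (the number of pseudo-impressions the prediction is worth) are hypothetical and not from the talk.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical CTR with a feature-based prediction.

    With no impressions the result equals predicted_ctr (purely predictive);
    as impressions grow it approaches clicks / impressions (purely reactive).
    """
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)


new_ad = hybrid_ctr(clicks=0, impressions=0, predicted_ctr=0.05)
seasoned_ad = hybrid_ctr(clicks=50, impressions=1000, predicted_ctr=0.02)
```

Here `new_ad` falls back on the prediction, while `seasoned_ad` lands between its predicted 0.02 and its observed 0.05, closer to the observed rate because it has ten times more real impressions than pseudo-impressions.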
  • 25. + Computational Advertising   Analytical Apparatus:   Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)   Game Theory (Nash Equilibria, dominant strategy)   Auction Theory (Vickrey, GSP, VCG…)   Graph Theory (random walks on graphs, graph matching, etc.)   Information Retrieval Techniques (similarity metrics, etc.)   …
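Of the auction formats listed, the generalized second-price (GSP) auction is the simplest to show in code: ads are ranked by bid, and each winner pays the bid of the advertiser ranked immediately below. This sketch is deliberately simplified and assumes bid-only ranking with no quality scores and no reserve price; the function and advertiser names are hypothetical.

```python
def gsp_auction(bids, num_slots):
    """Allocate ad slots by bid and charge each winner the next-highest bid.

    bids: mapping of advertiser name to per-click bid.
    Returns a list of (advertiser, price_per_click), best slot first.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for position, (advertiser, _bid) in enumerate(ranked[:num_slots]):
        # The last ranked bidder has nobody below, so pays 0 in this sketch.
        if position + 1 < len(ranked):
            price = ranked[position + 1][1]
        else:
            price = 0.0
        results.append((advertiser, price))
    return results


outcome = gsp_auction({"a": 3.0, "b": 2.0, "c": 1.0}, num_slots=2)
```

With a single slot this reduces to the Vickrey (second-price) auction also named on the slide.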
  • 26. + Conclusion   Vertical Search & Analytics at Web Scale == fun !!!   Source of a large number of relevant research & engineering problems !   Opportunity to tackle a wide spectrum of techniques across all areas of Computer Science and Engineering !   Jump on the bandwagon ! : )