This presentation describes a project done as part of course work at IIIT Hyderabad. The goal is to re-rank web pages based on the personal preferences of the user.
2. Personalisation… Why?
• Each individual is unique.
• Search engines should re-rank their results based on each user's profile so that his specific requirements can be met.
3. • Personalisation works using individualisation and contextualisation.
• Individualisation means creating each user's individual profile from his long-term history.
• Contextualisation means identifying relations between sessions using short-term history.
4. Dataset… Fully Anonymized
• The dataset was provided by Yandex (almost 16 GB of training data and 400 MB of test data).
• Search data from a large city was collected over 30 days.
• To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of users, queries, query terms, sessions, URLs, and their domains are released.
• Training was done on the data of the first 27 days and testing on the last 3 days.
5. Data Format
The log represents a stream of user actions, with each line representing session metadata, a query action, or a click action. Each line contains tab-separated values in the following format:

Session metadata (TypeOfRecord = M):
SessionID TypeOfRecord Day USERID

Query action (TypeOfRecord = Q or T):
SessionID TimePassed TypeOfRecord SERPID QueryID ListOfTerms ListOfURLsAndDomains

Click action (TypeOfRecord = C):
SessionID TimePassed TypeOfRecord SERPID URLID
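The three record layouts above can be read with a small dispatching parser. The sketch below is illustrative and assumes a particular sub-field encoding (comma-separated term lists and "URLID,DomainID" pairs); it is not an official reader for the dataset.

```python
# Hypothetical parser for the tab-separated log format described above.
def parse_line(line):
    """Parse one log line into a dict keyed by its TypeOfRecord."""
    parts = line.rstrip("\n").split("\t")
    if parts[1] == "M":  # session metadata
        session_id, _, day, user_id = parts
        return {"type": "M", "session": int(session_id),
                "day": int(day), "user": int(user_id)}
    if parts[2] in ("Q", "T"):  # query action (T marks test queries)
        session_id, time_passed, rec_type, serp_id, query_id = parts[:5]
        terms = [int(t) for t in parts[5].split(",")]
        # Assumed encoding: each remaining field is "URLID,DomainID".
        urls = [tuple(map(int, u.split(","))) for u in parts[6:]]
        return {"type": rec_type, "session": int(session_id),
                "time": int(time_passed), "serp": int(serp_id),
                "query": int(query_id), "terms": terms, "urls": urls}
    if parts[2] == "C":  # click action
        session_id, time_passed, _, serp_id, url_id = parts
        return {"type": "C", "session": int(session_id),
                "time": int(time_passed), "serp": int(serp_id),
                "url": int(url_id)}
```

Because the type marker sits in a different column for metadata rows than for query and click rows, the metadata check must come first.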
6. Approach
[Pipeline diagram] Train and test data → feature engineering (basic, user-specific, and pair-wise features) → feature representation → regression model → trained model → re-ranked URLs.
7. Feature Engineering
• Initially each URL is assigned a relevance label:
o 0 (irrelevant): no clicks, or dwell time of less than 50 time units
o 1 (relevant): dwell time of 50 to 399 time units
o 2 (highly relevant): dwell time of 400 time units or more
• Basic features are built from URL–query instances, such as the original position of the URL for the query, the URLId, the DomainId of the URL, and the TermIds of the query, plus various joint features such as URLId × position, URLId × QueryId, DomainId × TermId, etc.
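The labelling rule above can be written directly as a small function. This is a minimal sketch using the thresholds from the slide; the function name and the convention that dwell time is measured until the next user action are assumptions for illustration.

```python
# Dwell-time thresholds (50 and 400 time units) taken from the slide.
def relevance_label(clicked, dwell_time):
    """Map click behaviour on a URL to a relevance label in {0, 1, 2}."""
    if not clicked or dwell_time < 50:
        return 0  # irrelevant: no click or very short dwell
    if dwell_time < 400:
        return 1  # relevant: 50-399 time units
    return 2      # highly relevant: 400 time units or more
```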
8. Feature Engineering
• User-specific features include, for each user, dividing each URL into 5 categories based on its relevance labels and on whether the URL was previously displayed to or skipped by the user.
• Pairwise features compare, for each query, the relative positions of a pair of URLs:
f(URLi, URLj) = 1 if URLi occurs at a better position than URLj
f(URLi, URLj) = −1 if URLj occurs at a better position than URLi
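The pairwise comparison above is a one-line rule; the sketch below expresses it over rank positions (lower position number = better rank). The 0 return for equal positions is an added convention for illustration, since the slide only defines the two strict cases.

```python
# Pairwise position feature: compares the ranks of two URLs for a query.
def pairwise_feature(pos_i, pos_j):
    """Return 1 if URL i is ranked better (smaller position) than URL j,
    -1 if URL j is ranked better, and 0 (assumed convention) on a tie."""
    if pos_i < pos_j:
        return 1
    if pos_j < pos_i:
        return -1
    return 0
```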
9. Logistic Regression
• A simple logistic regression model was used.
• URLs with positive relevance were labelled 1; the rest were labelled −1.
• The relevance label assigned to each URL w.r.t. each user was used as the example weight.
• Out-of-core learning algorithms were used, as they can process huge data sets in hours.
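The key idea of out-of-core learning is that each example is streamed past the model once and the weights are updated in place, so memory use stays constant regardless of the size of the log. A toy pure-Python SGD version of weighted logistic regression (not the project's actual setup, which used Vowpal Wabbit) might look like:

```python
import math

# Streaming (out-of-core) logistic regression via SGD. Each example is a
# (sparse_features, label, weight) triple with label in {-1, +1}; the
# learning rate and sparse representation are illustrative choices.
def sgd_logistic(stream, n_features, lr=0.1):
    w = [0.0] * n_features
    for x, y, weight in stream:          # x: list of (index, value) pairs
        z = sum(w[i] * v for i, v in x)  # sparse dot product
        p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = +1)
        grad = (p - (y + 1) / 2) * weight  # weighted logistic-loss gradient
        for i, v in x:
            w[i] -= lr * grad * v
    return w

def predict(w, x):
    """Sign of the linear score: +1 or -1."""
    z = sum(w[i] * v for i, v in x)
    return 1 if z >= 0 else -1
```

Only one example is held in memory at a time, which is what makes a single pass over tens of gigabytes of training data feasible.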
10. Machine Learning: Vowpal Wabbit for Logistic Regression
• A fast machine learning tool.
• Works on huge data in a parallel manner.
• RAM-efficient.
• Converges quickly to a good predictor.
• Handles terabytes of data in a matter of hours.
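Vowpal Wabbit reads plain-text examples of the form "label weight |namespace feature ...", and hashes feature names internally, which is part of what keeps it RAM-efficient. A sketch of emitting training lines in that format, with illustrative (assumed) feature names, could be:

```python
# Emit one training example in Vowpal Wabbit's input format:
#   "<label> <weight> |<namespace> <feature> <feature> ..."
# With logistic loss, VW expects labels in {-1, +1}. The per-example
# weight here carries the relevance label, as described on the slide.
def to_vw_line(label, weight, url_id, domain_id, query_id, position):
    features = [
        f"url_{url_id}",
        f"dom_{domain_id}",
        f"q_{query_id}",
        f"pos_{position}",
        f"url_{url_id}_x_pos_{position}",  # joint (crossed) feature
    ]
    return f"{label} {weight} |f {' '.join(features)}"
```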
11. Evaluation and Results
• We used NDCG (Normalized Discounted Cumulative Gain) to evaluate our results.
• It is calculated using the ranking of URLs for each query, and then averaged over queries.
DCG@10 = Σ_{i=1}^{10} (2^rel_i − 1) / log2(i + 1)
where rel_i, i = 1, 2, …, 10, are the relevance values of the 10 ranked URLs. IDCG@10 is the maximum possible (ideal) DCG for a given set of queries, documents, and relevance values. Then NDCG@10 is given by
NDCG@10 = DCG@10 / IDCG@10
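The two formulas above translate directly into code. A minimal per-query implementation (function names are illustrative):

```python
import math

def dcg_at_10(rels):
    """DCG@10 = sum over i=1..10 of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:10], start=1))

def ndcg_at_10(rels):
    """Normalize by the ideal DCG: relevances sorted in descending order."""
    ideal = dcg_at_10(sorted(rels, reverse=True))
    return dcg_at_10(rels) / ideal if ideal > 0 else 0.0
```

A perfect ranking (relevances already in descending order) scores exactly 1.0; any mis-ordering scores less.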
12. • The final result file was uploaded to the Kaggle portal for evaluation.
• Our team (Satarupa Guha) achieved an NDCG score of 0.76857.
• The team that secured first position scored 0.80725.