Ranking data, i.e., ordered lists of items, naturally appear in a wide variety of situations; understanding how to adapt a specific dataset and design the best approach to solve a ranking problem in a real-world scenario is therefore crucial. This talk illustrates how to set up and build a Learning to Rank (LTR) project starting from the available data, in our case a Spotify dataset (available on Kaggle) on the Worldwide Daily Song Ranking, and ending with the implementation of a ranking model. A step-by-step (phased) approach to this task using open-source libraries will be presented. We will examine in depth the most important part of the pipeline, the data preprocessing, and in particular how to model and manipulate the features in order to create the proper input dataset, tailored to the machine learning algorithm's requirements.
A Learning to Rank Project on a Daily Song Ranking Problem
1. London Information Retrieval Meetup
A Learning to Rank Project on a
Daily Song Ranking Problem
Ilaria Petreti, Information Retrieval/ML
Engineer
3rd November 2020
2. London Information Retrieval Meetup
Ilaria Petreti
! Information Retrieval/Machine Learning Engineer
! Master in Data Science
! Passionate about Data Mining and Machine Learning technologies
! Sports and Healthy Lifestyle lover
Who I Am
3. London Information Retrieval Meetup
● Headquartered in London / distributed team
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
www.sease.io
Search Services
6. London Information Retrieval Meetup
How to create a Learning to Rank pipeline using Spotify’s Kaggle dataset?
Problem Statement
https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
7. London Information Retrieval Meetup
LTR is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems.
Training data consists of lists of items, and each item is composed of:
• Query ID
• Relevance Rating
• Feature Vector (N features in <id>:<value> format)
Learning to Rank
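To make this layout concrete, here is a hypothetical training sample in the LibSVM/SVMrank-style format commonly used by LTR tools (all values invented for illustration):

```text
# <relevance rating> qid:<query ID> <feature id>:<value> ...
10 qid:1 1:7.2 2:0.0 3:1.0
7  qid:1 1:3.1 2:1.0 3:0.0
0  qid:2 1:0.4 2:0.0 3:1.0
```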
8. London Information Retrieval Meetup
Spotify’s Worldwide
Daily Song Ranking:
• The 200 most listened songs in 53 countries
• From 1st January 2017 to 9th January 2018
• More than 3 million rows
• 6629 artists and 18598 songs
• A total of 105 billion stream counts
Dataset Description
9. London Information Retrieval Meetup
Learning to Rank: Our Approach
Trained Ranking Model
QUERY is the Region
DOCUMENT is the Song
Relevance Rating = estimated from Position on Chart
Feature Vector = all the other N features
Spotify Search Engine
11. London Information Retrieval Meetup
Feature Level

Each sample is a <query, document> pair; the feature vector describes this pair numerically. Features fall into three levels:

Document level: this feature describes a property of the DOCUMENT; its value depends only on the document instance.
e.g. Document Type = Digital Music Service Product: Track Name, Artist, Streams

Query level: this feature describes a property of the QUERY; its value depends only on the query instance.
e.g. Query Type = Digital Music Service Search: Month, Day, Weekday

Query dependent: this feature describes a property of the QUERY in correlation with the DOCUMENT; its value depends on both the query and the document instance.
e.g. Query Type = Digital Music Service Search, Document Type = Digital Music Service Product: Matching query Region-Title Language, Matching query Region-Artist Nationality
12. London Information Retrieval Meetup
Data Preprocessing: Data Cleaning

Data Cleaning dimensions: Validity, Accuracy, Consistency, Completeness, Uniformity

Handle Missing Values: a total of 657 NaN values in the Track Name and Artist features, filled using a DICTIONARY keyed by the song ID (derived from the URL):
{0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje', 2: 'Otra Vez (feat. J Balvin)', 3:
"Vente Pa' Ca", 4: 'Safari', 5: 'La Bicicleta', 6: 'Ay Mi Dios', 7: 'Andas En Mi Cabeza',
8: 'Traicionera', 9: 'Shaky Shaky', 10: 'Vacaciones', 11: 'Dile Que Tu Me Quieres', 12:
'Let Me Love You', 13: 'DUELE EL CORAZON', 14: 'Chillax', 15: 'Borro Cassette', 16:
'One Dance', 17: 'Closer', …}
ID (URL)  Track Name
0         Reggaetón Lento (Bailemos)
1         Chantaje
2         Otra Vez (feat. J Balvin)
0         NaN
3         Vente Pa' Ca
4         Safari
3         NaN
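As a sketch of this step (column names follow the slides; the toy frame below reuses the slide's values), each NaN can be filled by mapping the row's ID onto the dictionary:

```python
import pandas as pd

# Dictionary mapping song IDs (derived from the URL) to track names,
# as shown on the slide
id_to_track = {0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje',
               2: 'Otra Vez (feat. J Balvin)', 3: "Vente Pa' Ca", 4: 'Safari'}

ds = pd.DataFrame({'ID': [0, 1, 2, 0, 3, 4, 3],
                   'Track Name': ['Reggaetón Lento (Bailemos)', 'Chantaje',
                                  'Otra Vez (feat. J Balvin)', None,
                                  "Vente Pa' Ca", 'Safari', None]})

# Fill each missing Track Name with the dictionary entry for that row's ID
ds['Track Name'] = ds['Track Name'].fillna(ds['ID'].map(id_to_track))
```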
13. London Information Retrieval Meetup
Feature Engineering

Feature Engineering:
! Prepare the proper input dataset, compatible with the machine learning algorithm’s requirements
! Improve the performance of machine learning models

Main areas: Feature Selection, Feature Extraction, Feature Transformation, Feature Importance, Categorical Encoding
14. London Information Retrieval Meetup
Feature Engineering: Grouping

Target - Relevance Rating

Position: song's position on chart (1 to 200)

Position values have been grouped in two different ways:
1. Relevance Labels (Ranking) from 0 to 10
2. Relevance Labels (Ranking) from 0 to 20

Grouping for the 0-10 labels:

Position   Ranking
1          10
2          9
3          8
4-5        7
6-10       6
11-20      5
21-35      4
36-55      3
56-80      2
81-130     1
131-200    0
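The grouping table above can be sketched with `pandas.cut` (a minimal example; bin edges reproduce the table):

```python
import pandas as pd

positions = pd.Series([1, 2, 3, 4, 6, 15, 30, 50, 70, 100, 200])

# Bin edges reproduce the grouping: 1, 2, 3, 4-5, 6-10, 11-20,
# 21-35, 36-55, 56-80, 81-130, 131-200 -> labels 10 down to 0
bins = [0, 1, 2, 3, 5, 10, 20, 35, 55, 80, 130, 200]
labels = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

ranking = pd.cut(positions, bins=bins, labels=labels).astype(int)
```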
15. London Information Retrieval Meetup
Feature Engineering: Categorical Encoding

Track Name: song title (Document Level Feature)
e.g. Reggaetón Lento (Bailemos), Chantaje, Otra Vez (feat. J Balvin), …, Let Her Go

The goal is to create a numeric representation of a document/sentence, regardless of its length. Two different approaches:
1. Hash Encoding: feature hashing maps each category in a categorical feature to an integer within a pre-determined range
2. doc2vec
16. London Information Retrieval Meetup
Categorical Encoding: Hash Encoding

Feature Hashing or “The Hashing Trick” is a fast and space-efficient way of vectorising features.

! Use of the category_encoders library (as ce)
! Main arguments:
• cols: a list of columns to encode
• n_components: how many bits to use to represent the feature (default is 8)
• hash_method: which hashing method to use (default is the “md5” algorithm)

title_encoder = ce.HashingEncoder(cols=['Track Name'], n_components=8)
newds = title_encoder.fit_transform(ds2)
https://contrib.scikit-learn.org/category_encoders/hashing.html
17. London Information Retrieval Meetup
Categorical Encoding: Doc2Vec
! Adaptation of Word2Vec, adding another feature vector named Paragraph ID
! Use of the gensim library
! Represent each sentence as a list of words (tokens)
! Create a new instance of TaggedDocument (tokens, tags)
! Build the Vocabulary
! Train the Doc2Vec model; the main parameters are:
• documents: iterable list of TaggedDocument elements;
• dm ({1, 0}): defines the training algorithm; by default dm=1, that is the Distributed Memory version of Paragraph Vector (PV-DM);
• min_count: ignores all words with total frequency lower than this;
• vector_size: dimensionality of the feature vectors (100 by default).

Pipeline: TaggedDocument → Trained Document Vectors
https://radimrehurek.com/gensim/models/doc2vec.html
18. London Information Retrieval Meetup
Feature Engineering: Language Detection from the Song Titles

! langdetect, guess_language-spirit: no usage limitations, but low accuracy on short song titles (they are built for large texts)
! TextBlob, Googletrans: high accuracy, but limited access (they rely on an external API)
https://pypi.org/
https://textblob.readthedocs.io/en/dev/api_reference.html
19. London Information Retrieval Meetup
Feature Engineering: Categorical Encoding

Artist: name of the musician/singer or group (Document Level Feature)

Leave One Out Encoding:
! Use of the category_encoders library
! It excludes the current row’s target when calculating the mean target for a level

Artist          Artists (encoded)
CNCO            78.12742
Shakira         68.62432
Zion & Lennox   61.62190
…               …
Passengers      167.15266

(Slide graphic: a toy example with a categorical FEATURE column (A, B, C) and a numeric TARGET column, with an overall target mean of 1.06.)
https://contrib.scikit-learn.org/category_encoders/leaveoneout.html
20. London Information Retrieval Meetup
Feature Engineering: Extracting Date

Date: chart date (Query Level Feature)

Date         Year  Month  Day  Weekday
2017/01/01   2017  1      1    6
2017/01/02   2017  1      2    0
2017/01/03   2017  1      3    1
…            …     …      …    …
2018/01/09   2018  1      9    1
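The date extraction above can be sketched with pandas (weekday is 0=Monday … 6=Sunday, matching the table):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2017/01/01', '2017/01/02', '2018/01/09']})
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d')

# Extract query-level calendar features from the chart date
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.weekday
```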
21. London Information Retrieval Meetup
Feature Engineering

Region: country code (the QUERY)

pandas.factorize() is used to obtain a numeric representation of an array when all that matters is identifying distinct values:

Region   query_ID
ec       0
fi       1
cr       2
…        …
hn       53
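A minimal factorize sketch (toy region codes; factorize assigns one integer per distinct value, in order of appearance):

```python
import pandas as pd

regions = pd.Series(['ec', 'fi', 'cr', 'fi', 'ec'])

# codes holds the integer query_ID per row; uniques maps codes back to regions
codes, uniques = pd.factorize(regions)
```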
24. London Information Retrieval Meetup
Model Training: XGBoost
XGBoost is an optimised distributed gradient boosting library
designed to be highly efficient, flexible and portable.
https://github.com/dmlc/xgboost
! It implements machine learning algorithms under the Gradient
Boosting framework.
! It is Open Source
! It supports both pairwise and list-wise models
25. London Information Retrieval Meetup
Model Training: XGBoost
1. Split the entire dataset in:
• Training Set, used to build and train the model (80%)
• Test Set, used to evaluate the model performance on unseen data (20%)
2. Separate the Relevance Label, query_ID and training vectors as different components to create the XGBoost matrices.

DMatrix is an internal data structure used by XGBoost, optimised for both memory efficiency and training speed.
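The slides do not show how the 80/20 split was made; for LTR all rows of a query should stay on the same side. One common way is scikit-learn's GroupShuffleSplit (a sketch on toy data; column names follow the next slide):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame with the column names used on the next slide
df = pd.DataFrame({'query_ID': [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
                   'Ranking':  [3, 2, 0, 1, 0, 3, 1, 0, 2, 0],
                   'Streams':  [9, 5, 1, 4, 2, 8, 3, 1, 6, 2]})

# Keep every row of a query on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df['query_ID']))

training_set_data_frame = df.iloc[train_idx]
test_set_data_frame = df.iloc[test_idx]
```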
26. London Information Retrieval Meetup
Training and Test Set Creation

training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
training_query_id_column = training_set_data_frame['query_ID']
training_query_groups = training_query_id_column.value_counts(sort=False)
training_label_column = training_set_data_frame['Ranking']
training_xgb_matrix = xgboost.DMatrix(training_data_set,
                                      label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)

test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
test_query_id_column = test_set_data_frame['query_ID']
test_query_groups = test_query_id_column.value_counts(sort=False)
test_label_column = test_set_data_frame['Ranking']
test_xgb_matrix = xgboost.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
27. London Information Retrieval Meetup
Train and test the model with LambdaMART method:
Model Training: XGBoost
! The LambdaMART model uses gradient boosted decision trees with a cost function derived from LambdaRank for solving a ranking task.
! The model performs list-wise ranking where Normalised Discounted Cumulative Gain (NDCG) is maximised.
! List-wise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it.
! The evaluation measure is an average across the queries.
28. London Information Retrieval Meetup
Train and test the model with LambdaMART:
params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@10', 'verbosity': 2}
watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]

print('- - - - Training The Model')
xgb_model = xgboost.train(params, training_xgb_matrix, num_boost_round=999,
                          evals=watch_list, early_stopping_rounds=10)

print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                     dump_format='json')
Model Training: LambdaMART
29. London Information Retrieval Meetup
Evaluation Metric: List-wise and NDCG

• DCG@K = Discounted Cumulative Gain@K: it measures the usefulness, or gain, of a document based on its position in the result list.
• NDCG@K = DCG@K / Ideal DCG@K
• It will be in the range [0, 1]

Relevance weights by result position:

Position   Model1   Model2   Model3   Ideal
1          1        2        2        4
2          2        3        4        3
3          3        2        3        2
4          4        4        2        2
5          2        1        1        1
6          0        0        0        0
7          0        0        0        0
DCG        14.01    15.76    17.64    22.60
NDCG       0.62     0.70     0.78     1.0
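The table values can be reproduced with the exponential-gain DCG formula (a minimal sketch: gain 2^rel − 1, log2 position discount):

```python
import math

def dcg_at_k(relevances, k=10):
    """DCG with exponential gain (2^rel - 1), as in the table above."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, ideal_relevances, k=10):
    return dcg_at_k(relevances, k) / dcg_at_k(ideal_relevances, k)

ideal = [4, 3, 2, 2, 1, 0, 0]
model1 = [1, 2, 3, 4, 2, 0, 0]

# Reproduces the table: DCG(Model1) = 14.01, DCG(Ideal) = 22.60, NDCG = 0.62
```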
30. London Information Retrieval Meetup
Let’s see the common mistakes to avoid during model creation:
! One sample per query group
! One Relevance Label for all the samples in a query group
Under-sampled query IDs can potentially skyrocket your average NDCG.
Common Mistakes
32. London Information Retrieval Meetup
Results

Encoding            Relevance Labels   train-ndcg@10   eval-ndcg@10
Hash Encoding       (0-10)             0.7179          0.7351
Hash Encoding       (0-20)             0.8018          0.7740
doc2vec Encoding    (0-10)             0.8235          0.7633
doc2vec Encoding    (0-20)             0.8215          0.8244

NDCG@10, where ‘@10’ denotes that the metric is evaluated only on the top 10 documents/songs.
33. London Information Retrieval Meetup
! Importance of Data Preprocessing and Feature Engineering
! Language Detection as an additional feature
! doc2vec and Relevance Rating [0, 20] as the best approaches
! Online testing in LTR evaluation
! Use of the Tree SHAP library for feature importance
https://github.com/slundberg/shap
Conclusions