
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation

For more details:
https://sease.io/2020/04/the-importance-of-online-testing-in-learning-to-rank-part-1.html
https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html

Learning to rank (LTR from now on) is the application of machine learning techniques, typically supervised, to the formulation of ranking models for information retrieval systems.
With LTR becoming more and more popular (Apache Solr has supported it since January 2017 and Elasticsearch has had an Open Source plugin since 2018), organizations struggle with the problem of how to evaluate the quality of the models they train.

This talk explores all the major points in both Offline and Online evaluation.
Setting up correct infrastructure and processes for a fair and effective evaluation of the trained models is vital for measuring the improvements/regressions of an LTR system.
The talk is intended for:
– Product Owners, Search Managers, Business Owners
– Software Engineers, Data Scientists, and Machine Learning Enthusiasts
Expect to learn:

the importance of Offline testing from a business perspective
how Offline testing can be done with Open Source libraries
how to build a realistic test set from the original input data set, avoiding common mistakes in the process
the importance of Online testing from a business perspective
A/B testing and Interleaving approaches: details and Pros/Cons
common mistakes and how they can distort the results obtained
Join us as we explore real-world scenarios and dos and don’ts from the e-commerce industry!

Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Evaluation

  1. 1. London Information Retrieval Meetup Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/ Online Evaluation Alessandro Benedetti, Director Anna Ruggero, R&D Software Engineer 23rd June 2020
  2. 2. London Information Retrieval Meetup Who We Are Alessandro Benedetti ! Born in Tarquinia (ancient Etruscan city) ! R&D Software Engineer ! Search Consultant ! Director ! Master in Computer Science ! Apache Lucene/Solr Committer ! Passionate about Semantic, NLP, and Machine Learning technologies ! Beach Volleyball player and Snowboarder
  3. 3. London Information Retrieval Meetup ! R&D Search Software Engineer ! Master Degree in Computer Science Engineering ! Big Data, Information Retrieval ! Organist, Music lover Who We Are Anna Ruggero
  4. 4. London Information Retrieval Meetup ● Headquarter in London/distributed ● Open Source Enthusiasts ● Apache Lucene/Solr/Es experts ● Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning www.sease.io Search Services
  5. 5. London Information Retrieval Meetup Clients
  6. 6. London Information Retrieval Meetup Overview Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  7. 7. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  8. 8. London Information Retrieval Meetup ! Find anomalies in the data, like: weird feature distributions, strange collected values, … ! Check how the model performs before using it in production: implement improvements, fix bugs, tune parameters, … ! Save time and money: putting a bad model in production can worsen the user experience on the website. Advantages: [Offline] A Business Perspective
  9. 9. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  10. 10. London Information Retrieval Meetup [Offline] XGBoost XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. It is Open Source. https://github.com/dmlc/xgboost
  11. 11. London Information Retrieval Meetup[Offline] Build a Test Set Relevance Label QueryId DocumentId Feature1 Feature2 3 1 1 3.0 2.0 2 1 2 0.0 1.0 4 2 2 3.0 2.5 1 2 1 9.0 4.0 0 3 2 8.0 4.0 2 3 1 3.0 1.0 Create a training set with XGBoost: training_data_set = training_set_data_frame[ training_set_data_frame.columns.difference( [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])] Feature1 Feature2 3.0 2.0 0.0 1.0 3.0 2.5 9.0 4.0 8.0 4.0 3.0 1.0 training_data_set
  12. 12. London Information Retrieval Meetup[Offline] Build a Test Set Relevance Label QueryId DocumentId Feature1 Feature2 3 1 1 3.0 2.0 2 1 2 0.0 1.0 4 2 2 3.0 2.5 1 2 1 9.0 4.0 0 3 2 8.0 4.0 2 3 1 3.0 1.0 Create the query Id groups: training_query_id_column = training_set_data_frame[features.QUERY_ID] training_query_groups = training_query_id_column.value_counts(sort=False) training_query_id_column QueryId 1 1 2 2 3 3 QueryId Count 1 2 2 2 3 2 training_query_groups
  13. 13. London Information Retrieval Meetup[Offline] Build a Test Set Relevance Label QueryId DocumentId Feature1 Feature2 3 1 1 3.0 2.0 2 1 2 0.0 1.0 4 2 2 3.0 2.5 1 2 1 9.0 4.0 0 3 2 8.0 4.0 2 3 1 3.0 1.0 Create the relevance label column: training_label_column = training_set_data_frame[features.RELEVANCE_LABEL] Relevance Label 3 2 4 1 0 2 training_label_column
  14. 14. London Information Retrieval Meetup Create a training set with XGBoost: training_xgb_matrix = xgb.DMatrix(training_data_set, label=training_label_column) training_xgb_matrix.set_group(training_query_groups) training_data_set = training_set_data_frame[ training_set_data_frame.columns.difference( [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])] training_query_id_column = training_set_data_frame[features.QUERY_ID] training_query_groups = training_query_id_column.value_counts(sort=False) training_label_column = training_set_data_frame[features.RELEVANCE_LABEL] [Offline] Build a Test Set
  15. 15. London Information Retrieval Meetup Create a test set with XGBoost: test_xgb_matrix = xgb.DMatrix(test_data_set, label=test_label_column) test_xgb_matrix.set_group(test_query_groups) test_data_set = test_set_data_frame[ test_set_data_frame.columns.difference( [features.RELEVANCE_LABEL, features.DOCUMENT_ID, features.QUERY_ID])] test_query_id_column = test_set_data_frame[features.QUERY_ID] test_query_groups = test_query_id_column.value_counts(sort=False) test_label_column = test_set_data_frame[features.RELEVANCE_LABEL] [Offline] Build a Test Set
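To make the preceding steps easier to reproduce, here is a minimal, self-contained sketch that condenses slides 11 to 15 into one helper. The column names are illustrative placeholders for the `features` module constants used on the slides, and the toy DataFrame mirrors the table shown on slide 11; it assumes rows are already grouped by query id.

    # Condensed sketch of the training/test DMatrix preparation shown on the slides.
    import pandas as pd
    import xgboost as xgb

    RELEVANCE_LABEL, QUERY_ID, DOCUMENT_ID = "relevance_label", "query_id", "document_id"

    def to_ranking_dmatrix(data_frame: pd.DataFrame) -> xgb.DMatrix:
        # Keep only the feature columns as the matrix content.
        feature_columns = data_frame.columns.difference(
            [RELEVANCE_LABEL, QUERY_ID, DOCUMENT_ID])
        matrix = xgb.DMatrix(data_frame[feature_columns],
                             label=data_frame[RELEVANCE_LABEL])
        # Ranking objectives need the group sizes: how many consecutive rows
        # belong to each query (rows must already be grouped by query id).
        query_groups = data_frame.groupby(QUERY_ID, sort=False).size()
        matrix.set_group(query_groups.to_numpy())
        return matrix

    # Toy data matching the table on the slide.
    training_set_data_frame = pd.DataFrame({
        "relevance_label": [3, 2, 4, 1, 0, 2],
        "query_id":        [1, 1, 2, 2, 3, 3],
        "document_id":     [1, 2, 2, 1, 2, 1],
        "feature1":        [3.0, 0.0, 3.0, 9.0, 8.0, 3.0],
        "feature2":        [2.0, 1.0, 2.5, 4.0, 4.0, 1.0],
    })
    training_xgb_matrix = to_ranking_dmatrix(training_set_data_frame)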
  16. 16. London Information Retrieval Meetup Train and test the model with XGBoost: params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@4', 'verbosity': 2} watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')] print('- - - - Training The Model') xgb_model = xgb.train(params, training_xgb_matrix, num_boost_round=999, evals=watch_list, early_stopping_rounds=10) print('- - - - Saving XGBoost model') xgboost_model_json = output_dir + "/xgboost-" + name + ".json" xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True, dump_format='json') [Offline] Train/Test
  17. 17. London Information Retrieval Meetup Save an XGBoost model: logging.info('- - - - Saving XGBoost model') xgboost_model_name = output_dir + "/xgboost-" + name xgb_model.save_model(xgboost_model_name) logging.info('- - - - Loading xgboost model') xgb_model = xgb.Booster() xgb_model.load_model(model_path) [Offline] Save/Load Models Load an XGBoost model:
  18. 18. London Information Retrieval Meetup [Offline] Metrics • precision = ratio of relevant results among the search results returned • precision@K = ratio of relevant results among the top-K search results returned • recall = ratio of relevant results found among all the relevant results • recall@K = ratio of all the relevant results that appear in the top K. What happens if a metric changes? A lower precision@K means fewer relevant results in the top K; a higher precision@K means more relevant results in the top K. A lower recall@K means fewer relevant results found among all the relevant ones; a higher recall@K means more relevant results found among all the relevant ones.
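For readers who want to compute these numbers themselves, here is a small illustrative sketch (not taken from the slides) of precision@K and recall@K for a single query, given binary relevance judgements of the ranked results.

    # Illustrative precision@K / recall@K helpers for one query (1 = relevant result).
    from typing import Sequence

    def precision_at_k(relevance: Sequence[int], k: int) -> float:
        # Fraction of the top-k returned results that are relevant.
        return sum(relevance[:k]) / k if k else 0.0

    def recall_at_k(relevance: Sequence[int], total_relevant: int, k: int) -> float:
        # Fraction of all the relevant results that appear in the top-k.
        if total_relevant == 0:
            return 0.0
        return sum(relevance[:k]) / total_relevant

    # Example: 3 relevant documents exist, the top-4 ranking retrieves 2 of them.
    ranked_relevance = [1, 0, 1, 0]
    print(precision_at_k(ranked_relevance, 4))   # 0.5
    print(recall_at_k(ranked_relevance, 3, 4))   # 0.666...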
  19. 19. London Information Retrieval Meetup [Offline] NDCG • DCG@K = Discounted Cumulative Gain at K • NDCG@K (Normalised Discounted Cumulative Gain) = DCG@K / Ideal DCG@K. Example (relevance labels by result position): Model1 = 1, 2, 3, 4, 2, 0, 0 (NDCG 0.64); Model2 = 2, 3, 2, 4, 1, 0, 0 (NDCG 0.73); Model3 = 2, 4, 3, 2, 1, 0, 0 (NDCG 0.79); Ideal = 4, 3, 2, 2, 1, 0, 0 (NDCG 1.0). A lower NDCG@K means fewer relevant results, in worse positions and with worse relevance; a higher NDCG@K means more relevant results, in better positions and with better relevance. The relevance label acts as a weight and the result position as a discount.
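A compact sketch of DCG@K and NDCG@K follows. It is not the exact implementation used in the talk: it uses the exponential gain (2^relevance - 1), so the absolute values can differ slightly from the numbers on the slide depending on the gain and discount convention. A helper like this could also play the role of the ndcg_at_k function called in the evaluation code later on.

    # Illustrative DCG@K / NDCG@K; the exponential gain is one common convention.
    import math
    from typing import Sequence

    def dcg_at_k(relevance: Sequence[float], k: int) -> float:
        # Relevance labels must be given in ranked order (position 1 first).
        return sum((2 ** rel - 1) / math.log2(position + 2)
                   for position, rel in enumerate(relevance[:k]))

    def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
        # Normalise by the DCG of the ideal, descending-relevance ordering.
        ideal = dcg_at_k(sorted(relevance, reverse=True), k)
        # Degenerate case (all labels zero): every ordering is "ideal".
        return dcg_at_k(relevance, k) / ideal if ideal > 0 else 1.0

    # Relevance labels of Model1 on the slide, read top to bottom; the result is in
    # the ballpark of the slide's 0.64, the exact value depends on the convention.
    print(round(ndcg_at_k([1, 2, 3, 4, 2, 0, 0], k=7), 2))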
  20. 20. London Information Retrieval Meetup[Offline] Test a Trained Model Relevance Label QueryId DocumentId Feature1 Feature2 3 1 1 3.0 2.0 2 1 2 0.0 1.0 4 2 2 3.0 2.0 1 2 1 9.0 4.0 0 3 2 8.0 4.0 2 3 1 3.0 1.0 test_relevance_labels_per_queryId = [np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID]) for query_id, data_frame in test_set_data_frame[[features.RELEVANCE_LABEL, features.QUERY_ID]].groupby(features.QUERY_ID)] QueryId Relevance Label 1 [3,2] 2 [4,1] 3 [0,2] Relevance Labels [3,2] [4,1] [0,2] test_relevance_labels _per_queryIddata_frame
  21. 21. London Information Retrieval Meetup[Offline] Test a Trained Model Relevance Label QueryId DocumentId Feature1 Feature2 3 1 1 3.0 2.0 2 1 2 0.0 1.0 4 2 2 3.0 2.0 1 2 1 9.0 4.0 0 3 2 8.0 4.0 2 3 1 3.0 1.0 test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference( [features.RELEVANCE_LABEL,features.DOCUMENT_ID])] QueryId Feature1 Feature2 1 3.0 2.0 1 0.0 1.0 2 3.0 2.0 2 9.0 4.0 3 8.0 4.0 3 3.0 1.0
  22. 22. London Information Retrieval Meetup[Offline] Test a Trained Model test_data_per_queryId = [data_frame.loc[:, data_frame.columns != features.QUERY_ID] for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)] QueryId Feature1 Feature2 1 3.0 2.0 1 0.0 1.0 2 3.0 2.0 2 9.0 4.0 3 8.0 4.0 3 3.0 1.0 QueryId Feature1 Feature2 1 [3,0] [2,1] 2 [3,9] [2,4] 3 [8,3] [4,1] test_data_per_queryId Feature1 Feature2 [3,0] [2,1] [3,9] [2,4] [8,3] [4,1] data_frame
  23. 23. London Information Retrieval Meetup Test an already trained XGBoost model. Prepare the test set: test_relevance_labels_per_queryId = [np.array(data_frame.loc[:, data_frame.columns != features.QUERY_ID]) for query_id, data_frame in test_set_data_frame[[features.RELEVANCE_LABEL, features.QUERY_ID]].groupby(features.QUERY_ID)] test_relevance_labels_per_queryId = [test_relevance_labels.reshape(len(test_relevance_labels),) for test_relevance_labels in test_relevance_labels_per_queryId] test_set_data_frame = test_set_data_frame[test_set_data_frame.columns.difference( [features.RELEVANCE_LABEL, features.DOCUMENT_ID])] test_data_per_queryId = [data_frame.loc[:, data_frame.columns != features.QUERY_ID] for query_id, data_frame in test_set_data_frame.groupby(features.QUERY_ID)] test_xgb_matrix_list = [xgb.DMatrix(test_set) for test_set in test_data_per_queryId ] [Offline] Test a Trained Model
  24. 24. London Information Retrieval Meetup Test an already trained XGBoost model: predictions_with_relevance = [] logging.info('- - - - Making predictions') predictions_list = [xgb_model.predict(test_xgb_matrix) for test_xgb_matrix in test_xgb_matrix_list] for predictions, labels in zip(predictions_list, test_relevance_labels_per_queryId): to_data_frame = [list(row) for row in zip(predictions, labels)] predictions_with_relevance.append(pd.DataFrame(to_data_frame, columns=['predicted_score', 'relevance_label'])) predictions_with_relevance = [predictions_per_query.sort_values(by='predicted_score', ascending=False) for predictions_per_query in predictions_with_relevance] logging.info('- - - - NDCG computation') ndcg_scores_list = [] for predictions_per_query in predictions_with_relevance: ndcg = ndcg_at_k(predictions_per_query['relevance_label'], len(predictions_per_query)) ndcg_scores_list.append(ndcg) final_ndcg = statistics.mean(ndcg_scores_list) logging.info('- - - - The final NDCG is: ' + str(final_ndcg)) [Offline] Test a Trained Model
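As a sanity check for the hand-rolled evaluation above, the per-query NDCG can be cross-checked against an off-the-shelf implementation such as scikit-learn's ndcg_score (an optional dependency, not used on the slides). Note that scikit-learn uses the raw relevance labels as gain, so its numbers can differ from an exponential-gain implementation.

    # Optional cross-check of a per-query NDCG with scikit-learn (not in the slides).
    from sklearn.metrics import ndcg_score

    relevance_labels = [[3, 2, 0, 1]]           # true labels for one query's documents
    predicted_scores = [[0.9, 0.3, 0.1, 0.4]]   # model scores for the same documents
    print(ndcg_score(relevance_labels, predicted_scores, k=4))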
  25. 25. London Information Retrieval Meetup Let’s see the common mistakes to avoid during the test set creation: ! One sample per query group: ! If we have a small number of interactions it could happen during the split that we obtain some queries with just a single training sample. In this case the NDCG@K for the query group will be 1! (independently of the model) [Offline] Common Mistakes Model1 Model2 Model3 Ideal 1 1 1 1 1 1 1 1 Model1 Model2 Model3 Ideal 3 3 3 3 7 7 7 7 Query1 Query2 DCG
  26. 26. London Information Retrieval Meetup Let’s see the common mistakes to avoid during the test set creation: ! One sample per query group ! One relevance label for all the samples in a query group: ! During the split, a query group in the test set could end up containing only samples that share the same relevance label; in that case any ordering is ideal, so the NDCG@K for that group is again 1, regardless of the model. [Offline] Common Mistakes
  27. 27. London Information Retrieval Meetup Let’s see the common mistakes to avoid during the test set creation: ! One sample per query group ! One relevance label for all the samples of a query group ! Samples considered for the data set creation: ! We have to be sure that we are using a realistic set of samples for the test set creation. These <query,document> pairs represent the possible user behavior, so they must include a balance of known/unknown queries, with results of mixed relevance. [Offline] Common Mistakes
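One possible, purely illustrative guard against the first two mistakes is to drop degenerate query groups from the test set before evaluating: groups with a single sample, or where every sample shares the same relevance label, score a perfect NDCG no matter what the model does. The column names below are placeholders, and the third mistake (a realistic mix of queries) still needs human judgement.

    # Drop query groups that would trivially score NDCG = 1 (illustrative helper).
    import pandas as pd

    def drop_trivial_query_groups(test_set: pd.DataFrame,
                                  query_id_col: str = "query_id",
                                  label_col: str = "relevance_label") -> pd.DataFrame:
        def is_informative(group: pd.DataFrame) -> bool:
            # Keep the group only if it has more than one sample
            # and more than one distinct relevance label.
            return len(group) > 1 and group[label_col].nunique() > 1
        return test_set.groupby(query_id_col, sort=False).filter(is_informative)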
  28. 28. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  29. 29. London Information Retrieval Meetup ► An incorrect or imperfect test set brings us model evaluation results that aren’t reflecting the real model improvement/regressions. ► We may get an extremely high evaluation metric offline, but only because we improperly designed the test, the model is unfortunately not a good fit There are several problems that are hard to be detected with an offline evaluation: [Online] A Business Perspective
  30. 30. London Information Retrieval Meetup ► An incorrect or imperfect test set brings us model evaluation results that aren’t reflecting the real model improvement/regressions. ► Finding a direct correlation between the offline evaluation metrics and the parameters used for the online model performance evaluation (e.g. revenues, click through rate…). There are several problems that are hard to be detected with an offline evaluation: [Online] A Business Perspective
  31. 31. London Information Retrieval Meetup ► An incorrect or imperfect test set brings us model evaluation results that aren’t reflecting the real model improvement/regressions. ► Finding a direct correlation between the offline evaluation metrics and the parameters used for the online model performance evaluation (e.g. revenues, click through rate…). ► It is based on generated relevance labels that do not always reflect the real user need. [Online] A Business Perspective There are several problems that are hard to be detected with an offline evaluation:
  32. 32. London Information Retrieval Meetup ► The reliability of the results: we directly observe the user behaviour. ► The interpretability of the results: we directly observe the impact of the model in terms of the online metrics the business cares about. ► The possibility to observe the model behavior: we can see how users interact with the model and figure out how to improve it. Using online testing can lead to many advantages: [Online] Business Advantages
  33. 33. London Information Retrieval Meetup ! Click Through Rates (views, downloads, add to cart …) ! Sale/Revenue Rates ! Dwell time (time spent on a search result after the click) ! Query reformulations / Bounce rates ! … Recommendation: test for direct correlation! When training the model, we probably chose one objective to optimise (there are also multi-objective learning to rank models) [Online] Signals to measure
  34. 34. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  35. 35. London Information Retrieval Meetup [Online] A/B Testing. Diagram: traffic is split 50%/50% between Model A (Control) and Model B (Variation); the example shows conversion rates of 20% and 40% for the two groups.
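A minimal sketch of how such a 50/50 split might be implemented: hash the user id so the same user is always served the same model, then compare conversion rates per bucket. This is an assumption-level illustration, not the setup used by the speakers.

    # Illustrative deterministic 50/50 bucketing for an A/B test.
    import hashlib
    from collections import defaultdict

    def assign_bucket(user_id: str) -> str:
        # Hashing keeps the assignment stable across sessions for the same user.
        digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    stats = defaultdict(lambda: {"sessions": 0, "conversions": 0})

    def record_search_session(user_id: str, converted: bool) -> None:
        # Only count sessions whose result page was ranked by the models under test.
        bucket = assign_bucket(user_id)
        stats[bucket]["sessions"] += 1
        stats[bucket]["conversions"] += int(converted)

    def conversion_rate(bucket: str) -> float:
        sessions = stats[bucket]["sessions"]
        return stats[bucket]["conversions"] / sessions if sessions else 0.0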
  36. 36. London Information Retrieval Meetup ► Be sure to consider only interactions from result pages ranked by the models you are comparing, i.e. do not count every click, sale or download happening on the site. [Online] A/B Testing Noise Extra care is needed when implementing A/B Testing.
  37. 37. London Information Retrieval Meetup ► Be sure to consider only interactions from result pages ranked by the models you are comparing. Extra care is needed when implementing A/B Testing. ► Suppose we are analyzing model A. We obtain: 10 sales from the homepage and 5 sales from the search page. ► Suppose we are analyzing model B. We obtain: 4 sales from the homepage and 10 sales from the search page. Model A is better than Model B(?) [Online] A/B Testing Noise 1
  38. 38. London Information Retrieval Meetup ► Suppose we are analyzing model A. We obtain: 10 sales from the homepage and 5 sales from the search page. ► Suppose we are analyzing model B. We obtain: 4 sales from the homepage and 10 sales from the search page. [Online] A/B Testing Noise 1 Model A is better than Model B(?) Extra care is needed when implementing A/B Testing. ► Be sure to consider only interactions from result pages ranked by the models you are comparing.
  39. 39. London Information Retrieval Meetup ► Suppose we are analyzing model B. We obtain: 5 sales from the homepage and 10 sales from the search page. ► Suppose we are analyzing model A. We obtain: 12 sales from the homepage and 11 sales from the search page. [Online] A/B Testing Noise 2 Extra care is needed when implementing A/B Testing. ► Be sure to consider only interactions from result pages ranked by the models you are comparing. Model A is better than Model B(?)
  40. 40. London Information Retrieval Meetup Model A is better than Model B(?) ► Suppose we are analyzing model B. We obtain: 5 sales from the homepage and 10 sales from the search page. ► Suppose we are analyzing model A. We obtain: 12 sales from the homepage and 11 sales from the search page. [Online] A/B Testing Noise 2 ► Be sure to consider only interactions from result pages ranked by the models you are comparing. Extra care is needed when implementing A/B Testing.
  41. 41. London Information Retrieval Meetup Offline Testing for Business Build a Test Set Online Testing for Business A/B Testing Interleaving
  42. 42. London Information Retrieval Meetup ► It reduces the problem of user variance caused by splitting users into separate groups (group A and group B). ► It is more sensitive when comparing models. ► It requires less traffic. ► It requires less time to achieve reliable results. ► It doesn’t necessarily expose a bad model to a sub-population of users. [Online] Interleaving Advantages
  43. 43. London Information Retrieval Meetup [Online] Interleaving. Diagram: 100% of the traffic is served a single result list obtained by interleaving the rankings produced by Model A and Model B.
  44. 44. London Information Retrieval Meetup[Online] Balanced Interleaving There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority.
  45. 45. London Information Retrieval Meetup There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority. DRAWBACK ► When comparing two very similar models. ► Model A: lA = (a, b, c, d) ► Model B: lB = (b, c, d, a) ► The comparison phase will lead Model B to win more often than Model A. This happens regardless of the model chosen as prior. ► This drawback arises due to: ► the way in which the evaluation of the results is done. ► the fact that model_B ranks every document except a higher than model_A does. [Online] Balanced Interleaving
  46. 46. London Information Retrieval Meetup [Online] Team-Draft Interleaving There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority. ► Team-Draft Interleaving: the method team captains use to pick players in team matches. https://issues.apache.org/jira/browse/SOLR-14560
  47. 47. London Information Retrieval Meetup There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority. ► Team-Draft Interleaving: method of team captains in team-matches. DRAWBACK ► When comparing two very similar models. ► Model A: lA = (a, b, c, d) ► Model B: lB = (b, c, d, a) ► Suppose c to be the only relevant document. ► With this approach we can obtain four different interleaved lists: ► lI1 = (aA, bB, cA, dB) ► lI2 = (bB, aA, cB, dA) ► lI3 = (bB, aA, cA, dB) ► lI4 = (aA, bB, cB, dA) ► All of them putting c at the same rank. Tie! But Model B should be chosen as the best model! [Online] Team-Draft Interleaving
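To make the Team-Draft mechanics concrete, here is an illustrative sketch (not the Apache Solr implementation referenced above): in each round the team with fewer picks drafts next, ties are broken by a coin flip, and each pick is that model's highest-ranked document not yet in the interleaved list. Clicks are then credited to the team that contributed the clicked document; that credit-counting step is left out here.

    # Illustrative Team-Draft Interleaving of two rankings.
    import random
    from typing import List, Optional, Tuple

    def team_draft_interleave(ranking_a: List[str], ranking_b: List[str],
                              rng: Optional[random.Random] = None) -> List[Tuple[str, str]]:
        rng = rng or random.Random()
        interleaved: List[Tuple[str, str]] = []   # (document, contributing team)
        seen = set()
        picks = {"A": 0, "B": 0}
        all_documents = set(ranking_a) | set(ranking_b)
        while len(seen) < len(all_documents):
            # The team with fewer picks drafts next; a coin flip breaks ties.
            if picks["A"] < picks["B"] or (picks["A"] == picks["B"] and rng.random() < 0.5):
                team, ranking = "A", ranking_a
            else:
                team, ranking = "B", ranking_b
            for document in ranking:
                if document not in seen:
                    interleaved.append((document, team))
                    seen.add(document)
                    break
            picks[team] += 1
        return interleaved

    # The two very similar rankings from the slide.
    print(team_draft_interleave(["a", "b", "c", "d"], ["b", "c", "d", "a"]))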
  48. 48. London Information Retrieval Meetup There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority. ► Team-Draft Interleaving: the method team captains use to pick players in team matches. ► Probabilistic Interleaving: relies on probability distributions. Every document has a non-zero probability of being added to the interleaved result list. [Online] Probabilistic Interleaving
  49. 49. London Information Retrieval Meetup There are different types of interleaving: ► Balanced Interleaving: alternate insertion with one model having the priority. ► Team-Draft Interleaving: the method team captains use to pick players in team matches. ► Probabilistic Interleaving: relies on probability distributions. Every document has a non-zero probability of being added to the interleaved result list. DRAWBACK The use of probability distributions could lead to a worse user experience: less relevant documents could be ranked higher. [Online] Probabilistic Interleaving
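For completeness, a sketch of Probabilistic Interleaving in the spirit of the method proposed by Hofmann et al.: each ranking is turned into a distribution that strongly favours its top positions but leaves every document a non-zero probability, and each position of the interleaved list is filled by picking one of the two models at random and sampling from its distribution. The temperature parameter tau is an assumption (3 is the value commonly used in the literature), and the credit-assignment step is again left out.

    # Illustrative Probabilistic Interleaving: every document keeps a non-zero
    # chance of being placed, with higher-ranked documents much more likely.
    import random
    from typing import List, Optional

    def probabilistic_interleave(ranking_a: List[str], ranking_b: List[str],
                                 tau: float = 3.0,
                                 rng: Optional[random.Random] = None) -> List[str]:
        rng = rng or random.Random()

        def sample_from(ranking: List[str], available: set) -> str:
            candidates = [doc for doc in ranking if doc in available]
            if not candidates:
                # This model has no unused documents left: fall back to the remainder.
                candidates, weights = list(available), None
            else:
                # Weight proportional to 1 / rank^tau, so rank 1 dominates but
                # lower-ranked documents still have a non-zero probability.
                weights = [1.0 / (ranking.index(doc) + 1) ** tau for doc in candidates]
            return rng.choices(candidates, weights=weights, k=1)[0]

        available = set(ranking_a) | set(ranking_b)
        interleaved: List[str] = []
        while available:
            source = ranking_a if rng.random() < 0.5 else ranking_b
            document = sample_from(source, available)
            interleaved.append(document)
            available.discard(document)
        return interleaved

    print(probabilistic_interleave(["a", "b", "c", "d"], ["b", "c", "d", "a"]))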
  50. 50. London Information Retrieval Meetup ► Both Offline/Online Learning To Rank evaluations are vital for a business ► Offline - doesn’t affect production - allows research and experimentation of wild ideas - reduces the number of Online Experiments to run ► Online - measures improvements/regressions with real users - isolates the benefits coming from the Learning To Rank models Conclusions
  51. 51. London Information Retrieval Meetup Thanks!
