Interactive Questions and Answers - London Information Retrieval Meetup

Answers to some questions about Natural Language Search, Language Modelling (Google BERT, OpenAI GPT-3), Neural Search and Learning to Rank asked during our London Information Retrieval Meetup (December).


  1. Interactive Q&A - 10th December 2020
  2. Question 1
  3. General Considerations - Language Models
     https://en.wikipedia.org/wiki/BERT_(language_model)
     https://en.wikipedia.org/wiki/GPT-3
     https://rajpurkar.github.io/SQuAD-explorer/
     • Pre-trained on large corpora (expensive)
     • Ad hoc fine-tuning to solve Natural Language tasks (inexpensive)
     • Ability to encode terms and sentences as high-dimensional vectors
     e.g. https://github.com/google-research/bert#pre-trained-models
     https://github.com/hanxiao/bert-as-service/
     BERT vectors for the sentences ['access the bank', 'walking by the street', 'tigers are big cats']:
     [[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174 -0.31721866]
      [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355  -0.11345179]
      [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536  0.07640187]]
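     As a concrete illustration of the bert-as-service encoding above, here is a minimal sketch (not part of the original slides; it assumes the client package is installed and a bert-serving-start server is already running with a downloaded BERT model):

         # Minimal sketch: encode sentences with bert-as-service.
         # Assumes `pip install bert-serving-client` and a running server:
         #   bert-serving-start -model_dir /path/to/bert -num_worker=1
         from bert_serving.client import BertClient

         bc = BertClient()  # connects to localhost:5555 by default
         vectors = bc.encode(['access the bank',
                              'walking by the street',
                              'tigers are big cats'])
         print(vectors.shape)  # (3, 768) for BERT-Base: one vector per sentence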
  4. General Considerations - Language Models in Search
     • Indexing time: encode sentences (or full field contents) and store the vectors
     • Searching time: encode the query
     • Score each query-document vector pair, calculating a vector distance/similarity:
       Euclidean distance, Cosine similarity, …
     Limitations
     • Rank the entire corpus of documents? Or apply an (Approximate) Nearest Neighbour approach?
     • Performance of embedding extraction?
     • Unintuitive results -> should be combined with traditional Information Retrieval
     • Explainability
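     The scoring step is plain vector math. A minimal NumPy sketch of cosine similarity between a query vector and a set of document vectors (illustrative only; the sample vectors are borrowed from the food_collection example later in the deck):

         import numpy as np

         def cosine_similarity(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
             """Cosine similarity between one query vector and each row of docs."""
             query_norm = query / np.linalg.norm(query)
             docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
             return docs_norm @ query_norm

         query = np.array([5.1, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0])
         docs = np.array([[5.0, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0],   # "donut"
                          [1.0, 5.0, 0.0, 0.0, 0.0, 4.0, 4.0, 3.0]])  # "apple juice"
         print(cosine_similarity(query, docs))  # the donut vector scores highest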
  5. Apache Lucene
     Ideally you want to avoid scoring every document in your corpus for each query.
     The algorithms for vector retrieval can be roughly classified into four categories:
     1. Tree-based algorithms, such as KD-tree
     2. Hashing methods, such as LSH (Locality Sensitive Hashing)
     3. Product quantization based algorithms, such as IVFFlat
     4. Graph-based algorithms, such as HNSW, SSG, NSG
     Specific file format (Nov 2020):
     • https://issues.apache.org/jira/browse/LUCENE-9004 Hierarchical Navigable Small World graphs - DONE
     • https://issues.apache.org/jira/browse/LUCENE-9322 Unified vector format - DONE
     • https://issues.apache.org/jira/browse/LUCENE-9136 IVFFlat - In Progress
  6. Apache Lucene - Follow-ups
     • reducing heap usage during graph construction
     • adding a Query implementation
     • exposing index hyper-parameters
     • benchmarks
     • testing on public datasets
     • implementing a diversity heuristic for neighbour selection during graph construction
     • making the graph hierarchical
     • exploring more efficient search across multiple per-segment graphs…
     Keep an eye on the Lucene JIRA! https://issues.apache.org/jira/browse/LUCENE-9004
  7. Apache Solr - Status of Deep Learning Vector-Based Search
     • Lucene's latest codecs and file format are not used yet
       https://issues.apache.org/jira/browse/SOLR-14397 -> develop an official out-of-the-box solution
       https://issues.apache.org/jira/browse/SOLR-12890 -> summary
     Ready-to-use approaches
     • Vector scoring using Streaming Expressions (Point Fields)
     • Solr Vector Search Plugin - https://github.com/saaay71/solr-vector-scoring (Payloads)
       https://medium.com/@dmitry.kan/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
     • Solr Vector Search Plugin with LSH Hashing (Payloads)
     Limitations
     • Generally slow solutions
     • Re-use existing data structures instead of ad hoc codecs/file formats
     • Generally support only one vector per field
  8. Apache Solr - Streaming Expressions
     Index time
     Multi-valued Point field (org.apache.solr.schema.PointField, https://www.elastic.co/blog/lucene-points-6.0):
     <fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
     <dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
     Sample docs:
     curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/food_collection/update?commit=true --data-binary '
     [
       {"id": "1", "name_s": "donut", "vector_fs": [5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
       {"id": "2", "name_s": "apple juice", "vector_fs": [1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
       …
     ]'
  9. Apache Solr - Streaming Expressions
     Query time
     Streaming Expression:
     sort(
       select(
         search(food_collection, q="*:*", fl="id,vector_fs", sort="id asc", rows=3),
         cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim,
         id),
       by="sim desc")
     Response:
     {
       "result-set": {
         "docs": [
           { "sim": 0.99996111, "id": "1" },
           { "sim": 0.98590279, "id": "10" },
           { "sim": 0.55566643, "id": "2" },
           { "EOF": true, "RESPONSE_TIME": 10 }
         ]
       }
     }
     https://lucene.apache.org/solr/guide/8_7/vector-math.html
     Drawbacks:
     1) it doesn't apply to normal search -> you need to use Streaming Expressions
     2) requires traversing all vectors and scoring them
     3) no support for multiple vectors per field - SOLR-11077
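     The same expression can be sent to Solr's /stream endpoint over HTTP; a minimal Python sketch (collection name and vectors follow the example above; the requests package is assumed):

         import requests

         # Streaming Expression from the slide, posted to the /stream endpoint
         expr = '''sort(select(search(food_collection, q="*:*", fl="id,vector_fs", sort="id asc", rows=3),
                   cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim, id),
                   by="sim desc")'''

         resp = requests.post("http://localhost:8983/solr/food_collection/stream",
                              data={"expr": expr})
         for doc in resp.json()["result-set"]["docs"]:
             print(doc)  # {'sim': ..., 'id': ...} plus a trailing EOF marker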
  10. Apache Solr - Solr Vector Search Plugin
      Index time
      <fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
        </analyzer>
      </fieldType>
      <field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true" termPositions="true" termVectors="true" multiValued="true"/>
      curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection-name}/update?commit=true --data-binary '
      [
        {"name": "example 0", "vector": "0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "},
        {"name": "example 1", "vector": "0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "},
        …
      ]'
  11. Apache Solr - Solr Vector Search Plugin
      Query time
      http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0" cosine=false}
      N.B. Adding the parameter cosine=false calculates the dot product instead of cosine similarity.
      "response":{"numFound":6,"start":0,"maxScore":40.1675,"docs":[
        { "name":["example 3"], "vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "], "score":40.1675},
        { "name":["example 0"], "vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "], "score":30.180502},
        …
      ]}
      Drawbacks:
      1) payloads are used for storing vectors -> slow
      2) requires traversing all vectors and scoring them
      3) support for multiple vectors per field must be investigated
      N.B. https://github.com/DmitryKey/solr-vector-scoring is a fork ported to Apache Solr 8.6
  12. Apache Solr - LSH Hashing Plugin
      Index time
      <fieldType name="VectorField" class="solr.BinaryField" stored="true" indexed="false" multiValued="false"/>
      <field name="_vector_" type="VectorField"/>
      <field name="_lsh_hash_" type="string" indexed="true" stored="true" multiValued="true"/>
      <field name="vector" type="string" indexed="true" stored="true"/>
      <updateRequestProcessorChain name="LSH">
        <processor class="com.github.saaay71.solr.updateprocessor.LSHUpdateProcessorFactory">
          <int name="seed">5</int>
          <int name="buckets">50</int>
          <int name="stages">50</int>
          <int name="dimensions">6</int>
          <str name="field">vector</str>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory"/>
      </updateRequestProcessorChain>
      curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/{your-collection-name}/update?update.chain=LSH&commit=true" --data-binary '
      [{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"},
       {"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}]'
  13. Apache Solr - LSH Hashing Plugin
      Query time
      http://localhost:8983/solr/{your-collection-name}/query?q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
      "response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[
        { "id": "1",
          "vector":"1.55,3.53,2.3,0.7,3.44,2.33",
          "_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
          "_lsh_hash_":["0_8", "1_35", "2_7", … "49_43"],
          "score":36.65736}
      ]
      Drawbacks:
      1) performance must be investigated (binary fields with encoded vectors)
      2) latest commit October 2018
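      To give an intuition for the hashing step, here is a minimal random-hyperplane LSH sketch in Python (an illustration of the general technique only; it does not reproduce the plugin's exact scheme or the meaning of its stages/buckets parameters):

          import numpy as np

          def lsh_signature(vector: np.ndarray, hyperplanes: np.ndarray) -> str:
              """One bit per random hyperplane: set when the vector lies on
              the positive side. Similar vectors tend to share many bits."""
              bits = (hyperplanes @ vector) >= 0
              return ''.join('1' if b else '0' for b in bits)

          rng = np.random.default_rng(seed=5)         # mirrors the plugin's <int name="seed">
          hyperplanes = rng.standard_normal((16, 6))  # 16 bits for 6-dimensional vectors

          v1 = np.array([1.55, 3.53, 2.3, 0.7, 3.44, 2.33])
          v2 = np.array([3.54, 0.4, 4.16, 4.88, 4.28, 4.25])
          print(lsh_signature(v1, hyperplanes))
          print(lsh_signature(v2, hyperplanes))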

  14. Elasticsearch - Status of Deep Learning Vector-Based Search
      • Lucene's latest codecs and file format are not used yet
        https://github.com/elastic/elasticsearch/issues/42326 - work in progress for covering Approximate Nearest Neighbour techniques
      Ready-to-use approaches
      • X-Pack enterprise features - https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
      • https://github.com/alexklibisz/elastiknn
      • https://github.com/opendistro-for-elasticsearch/k-NN
      Limitations
      • Performance must be investigated (https://elastiknn.com/performance/)
      • Re-use existing data structures instead of ad hoc codecs/file formats
      • Support only one vector per field
  15. Elasticsearch - X-Pack
      Index time
      PUT my-index-000001
      {
        "mappings": {
          "properties": {
            "my_dense_vector": { "type": "dense_vector", "dims": 3 },
            "status": { "type": "keyword" }
          }
        }
      }
      PUT my-index-000001/_doc/1
      { "my_dense_vector": [0.5, 10, 6], "status": "published" }
      PUT my-index-000001/_doc/2
      { "my_dense_vector": [-0.5, 10, 10], "status": "published" }
      N.B. Lucene's latest codecs and file format are not used yet; vectors are stored as binary doc values.
  16. Elasticsearch - X-Pack
      Query time
      GET my-index-000001/_search
      {
        "query": {
          "script_score": {
            "query": { "bool": { "filter": { "term": { "status": "published" } } } },
            "script": {
              "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
              "params": { "query_vector": [4, 3.4, -0.2] }
            }
          }
        }
      }
      N.B. Various distance functions are supported.
      Drawbacks:
      1) requires traversing all vectors returned by the initial query and scoring them
      2) no support for multiple vectors per field
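      The same query can be issued from Python via the official elasticsearch client (a minimal sketch; index name, field and vector follow the example above, and an elasticsearch-py 7.x-style client is assumed):

          from elasticsearch import Elasticsearch

          es = Elasticsearch("http://localhost:9200")
          resp = es.search(index="my-index-000001", body={
              "query": {
                  "script_score": {
                      "query": {"bool": {"filter": {"term": {"status": "published"}}}},
                      "script": {
                          "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
                          "params": {"query_vector": [4, 3.4, -0.2]},
                      },
                  }
              }
          })
          for hit in resp["hits"]["hits"]:
              print(hit["_score"], hit["_id"])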
  17. Next Steps
      ● Keep an eye on our blog, https://sease.io/blog, as more is coming!
  18. Question 2
  19. Learning to Rank Libraries
      RankLib - https://github.com/codelibs/ranklib
      XGBoost (University of Washington) - https://github.com/dmlc/xgboost
      TensorFlow Ranking (Google) - https://github.com/tensorflow/ranking
      LightGBM (Microsoft) - https://github.com/Microsoft/LightGBM
      CatBoost (Yandex) - https://github.com/catboost
      SVMRank - http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
      LightFM - https://github.com/lyst/lightfm
      QuickRank (ISTI-CNR) - https://github.com/hpclab/quickrank
      JForests - https://github.com/yasserg/jforests
  20. RankLib - Overview
      https://sourceforge.net/p/lemur/wiki/RankLib/
      RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been implemented:
      • MART (Multiple Additive Regression Trees, a.k.a. Gradient Boosted Regression Trees) [6]
      • RankNet [1]
      • RankBoost [2]
      • AdaRank [3]
      • Coordinate Ascent [4]
      • LambdaMART [5]
      • ListNet [7]
      • Random Forests [8]
  21. RankLib - Our Experience
      https://sourceforge.net/p/lemur/wiki/RankLib/
      • Multiple learning to rank algorithms supported, including LambdaMART
      • Relatively easy to use
      • Command Line Interface application -> not meant to be integrated with other apps
      • Java code, minimal test coverage
      • Hosted on SVN (there's an unofficial GitHub port: https://github.com/codelibs/ranklib)
      • Small community
  22. XGBoost - Overview
      https://github.com/dmlc/xgboost
      XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
      • It implements machine learning algorithms under the Gradient Boosting framework.
      • XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
      • The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
  23. XGBoost - Our Experience
      • Multiple learning to rank algorithms supported, including LambdaMART
      • Relatively easy to use
      • Library easy to integrate
      • Python code, huge project, test coverage seems fair
      • Hosted on GitHub (https://github.com/dmlc/xgboost)
      • Extremely popular
      • Huge community
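      For a flavour of how this looks in practice, a minimal sketch of training a LambdaMART-style ranker with XGBoost's scikit-learn API (random data stands in for real query/document features; all names here are illustrative):

          import numpy as np
          import xgboost as xgb

          rng = np.random.default_rng(42)
          X = rng.standard_normal((100, 10))  # 100 query-document feature vectors
          y = rng.integers(0, 4, size=100)    # graded relevance labels (0-3)
          groups = [20] * 5                   # 5 queries with 20 documents each

          # rank:ndcg trains gradient boosted trees with a LambdaMART-style objective
          ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50)
          ranker.fit(X, y, group=groups)

          scores = ranker.predict(X[:20])     # scores for the first query's documents
          print(np.argsort(-scores))          # documents ranked best-first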
  24. Learning to Rank Libraries
      Limitations:
      ‣ Developed for small data sets
      ‣ Limited support for sparse features
      ‣ Require extensive feature engineering
      ‣ Do not support the recent advances in unbiased learning-to-rank
      The TensorFlow Ranking library addresses these gaps.
  25. TensorFlow Ranking - Overview
      ‣ Open source library for solving large-scale ranking problems in a deep learning framework
      ‣ Developed by Google AI
      ‣ Fast and easy to use
      ‣ Flexible and highly configurable
      ‣ Supports pointwise, pairwise, and listwise losses
      ‣ Supports popular ranking metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG)
      GitHub: https://github.com/tensorflow/ranking
  26. TensorFlow Ranking
      Additional components:
      ‣ Fully integrated with the rest of the TensorFlow ecosystem
      ‣ Can handle textual features using text embeddings
      ‣ Multi-item (also known as groupwise) scoring functions
      ‣ LambdaLoss implementation
      ‣ Unbiased learning-to-rank
      TF-Ranking article: https://arxiv.org/abs/1812.00073
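      As a rough sketch of how the library's Keras losses and metrics plug into a scoring model (illustrative only: shapes, layers and data are assumptions, and a recent tensorflow_ranking release with the tfr.keras API is assumed):

          import tensorflow as tf
          import tensorflow_ranking as tfr

          list_size, num_features = 20, 10

          # Tiny scoring model: maps (batch, list_size, num_features)
          # to one score per document, shape (batch, list_size).
          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(list_size, num_features)),
              tf.keras.layers.Dense(16, activation="relu"),
              tf.keras.layers.Dense(1),
              tf.keras.layers.Reshape((list_size,)),
          ])

          # Listwise softmax loss and an NDCG@10 metric from TF-Ranking
          model.compile(optimizer="adam",
                        loss=tfr.keras.losses.SoftmaxLoss(),
                        metrics=[tfr.keras.metrics.NDCGMetric(topn=10)])

          features = tf.random.normal((32, list_size, num_features))
          labels = tf.cast(tf.random.uniform((32, list_size), maxval=4,
                                             dtype=tf.int32), tf.float32)
          model.fit(features, labels, epochs=1)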
  27. XGBoost vs TensorFlow - Main Differences
      XGBoost                  TensorFlow
      Tree-based Ranker        Neural Ranker
      Handle Missing Values    Handle Missing Values
      Run Efficiently on CPU   Run Efficiently on CPU
      Large Scale Training     Large Scale Training
