Large-scale Data Processing for
Information Retrieval
Edgar Meij
Informatics Institute
Joint work with Amit Bronner, Hendrike Peetz,
Wouter Weerkamp, Anne Schuth, Maarten de Rijke




                           Large-scale Data Processing for IR   2
[Figure: topic cloud of research areas, shown in several zoomed views. Labels: Big data; Information retrieval; Multi-lingual information access; Machine translation; Theory and models; Evaluation methodology; Text mining; Intelligent information services; Information retrieval for information access; Political information; Storytelling; Human-computer information retrieval; Knowledge representation & reasoning; Information integration; Exploratory search; Foundations of XML; Semantic search; Real-time analytics; Social signal analysis; Synchronize content; Multi-modal summaries; Open data]

Me

¢   Information retrieval (~ search engines)
¢   Semantic search/annotations
¢   Use knowledge bases (Wikipedia, Freebase, etc.) as
     £   primary information source for search or
     £   as complement to traditional retrieval




Search engines




Search engines – a bird’s eye view

¢   Main ingredient: Counting words
     £   Query ~ distribution over words
     £   Document ~ distribution over words
     £   Ranking ~ comparing distributions
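
The "counting words" view above can be sketched as follows. This is a minimal illustration, not the ranking function of any particular search engine: it assumes whitespace tokenization, a smoothed unigram model per document, and query log-likelihood as the way of "comparing distributions". The function names and the smoothing weight are hypothetical.

```python
import math
from collections import Counter

def word_distribution(text):
    """Turn a text into a relative-frequency distribution over its words."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def score(query, document, background, smoothing=0.5):
    """Log-likelihood of the query words under a smoothed document distribution.

    The document distribution is mixed with a collection-wide (background)
    distribution so that a missing query word does not zero out the score.
    """
    doc_dist = word_distribution(document)
    total = 0.0
    for word in query.lower().split():
        p = (1 - smoothing) * doc_dist.get(word, 0.0) \
            + smoothing * background.get(word, 0.0)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def rank(query, documents):
    """Order documents by how well their word distribution explains the query."""
    background = word_distribution(" ".join(documents))
    return sorted(documents, key=lambda d: score(query, d, background), reverse=True)

docs = [
    "tropical storms pose hurricane threats",
    "home prices rise in the city",
]
best = rank("hurricane storms", docs)[0]
```

Here `best` is the storm document, because both query words have non-zero probability in its distribution.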




[Figure: an example document (“Forecasters are watching tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”) next to a cloud of words: forecast, hurricane, tropical, wind, weather, home, fun]

Search engines – a bit of history

¢   Anno 1995
     £   Counting words (only)...
     £   Stopwords
     £   Linguistic normalization




Search engines – a bit of history

¢   Anno 2000: 2nd generation
     £   Link structure
          ˜   Anchor text
          ˜   PageRank
     £   Document structure
          ˜   title, top/bottom, etc.
          ˜   boilerplate
     £   Click-through data




Search engines – a bit of history

¢   Anno now
     £   Real-time indexing/search
     £   Increasingly personalized
     £   Increasingly social
     £   Apply “observations” of human behavior to
          improve, to evaluate
          ˜   Search behavior, click behavior, dwell behavior, reading time,
               …, other things that are happening in the world
     £   Rich signals




Signals

¢   Users/Personalisation
     £   group: country, region, language, device, browser, etc.
     £   individual: profile, history, sessions, etc.
¢   Linguistics (e.g., spell-checking)
¢   Semantics (e.g., entities)
¢   Popularity (e.g., PageRank)
¢   Social (e.g., G+)
¢   And more...
     £   readability, relevance assessments, clicks, etc.

[Inset slide: Why “learning to rank”? More and more features are found to be useful for ranking documents. How should we combine these? Image: http://www.flickr.com/photos/sameli/540933604/. Source: KH&MdR (U. Amsterdam), Advanced Information Retrieval]

Applying signals

¢   Typically at query time...
     £   Leaning heavily on machine learning
¢   Not the focus here...

What generates (non-monetary) value?

¢   What is value?
     £   Better/Richer UX
          ˜   Clever term/phrase suggestions
          ˜   Clever, rich snippets
     £   Finding what you need faster/better/...
          ˜   Homing in on what you want to find
          ˜   Task/Problem solving
     £   and more...




For instance...




  good camera under
      300 euro




Or...




So, where else do you get value from?

¢   Improving signals...
     £   Richer/Better/More focused signals
          ˜   Richer data/better extraction/...
          ˜   "Google acquires Freebase"

¢   ... or the application thereof
     £   Algorithmic innovations
     £   Training data
          ˜   Logs (queries, clicks, ...) – from toolbars, redirects, etc.
          ˜   Relevance assessments – manual, professionals, Mechanical Turk, etc.

¢   "More intelligent systems"

Intelligence?

¢   Need analysis of (large quantities of) data
     £   Typically, "transformations"
          ˜   graphs (PageRank, FriendRank)
          ˜   text => structure
          ˜   aggregations
          ˜   etc.

¢   Then, aggregate analyses to obtain "value"
     £   count/sum/min/max/avg/etc.
¢   Hadoop!
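
The "transform, then aggregate" pattern above maps naturally onto MapReduce. The following is a minimal in-memory sketch of that pattern, not the Hadoop API: the phases are plain Python functions, and the input records (query, number of clicks) are a hypothetical example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def click_mapper(record):
    """Transformation step: emit the query as key and its click count as value."""
    query, clicks = record
    yield query, clicks

def stats_reducer(key, values):
    """Aggregation step: count/sum/min/max/avg over all values of one key."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

def run_job(records):
    return reduce_phase(shuffle(map_phase(records, click_mapper)), stats_reducer)

result = run_job([("grammys", 3), ("grammys", 1), ("hurricane", 2)])
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be spread over many machines, which is what a Hadoop cluster provides.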


Use-cases




Use-case 1: Search and analysis on tweets

¢   Even getting them is not quite trivial
¢   Example: TREC Microblog track
     £   16M tweets
          ˜   Published as ID
          ˜   Default HTML download option without metadata (geo data, original
               tweet when retweeted, reply-to, etc.)
          ˜   JSON format has all the beautiful stuff
     £   HTML crawling vs getting the JSON objects
          ˜   JSON download limited to 150 tweets per hour per IP address
               ™ On a single machine: more than 12 years
               ™ 884 nodes running for close to a week




And once you have millions of tweets…

¢   Text analytics on Twitter streams
     £   Information extraction, sentiment analysis, …
     £   Given an entity (company, product, …), what is being said
          about it?
          Which aspects?
          Which attitudes?
     £   Extract triples
          X–R–Y
     £   Dependency parsing

Example tweet:
                               Obama almost 15mins late...
                               wonder if he's watching college
                               hoops. Less than 2mins left in
                               Texas Oakland game #NCAA
                               #Marc ...

Some numbers

¢   Data
     £   ~10% public English tweets in 2010
     £   ~250M tweets
¢   Performance
     £   Single machine (1 dual-core, 2.2GHz, 3GB RAM)
          ˜   ~2 years
     £   SARA Hadoop cluster (20 nodes x dual-core, 2.6GHz, 16GB RAM)
          ˜   ~30 days
     £   DAS4 Hadoop cluster (36 nodes x dual quad-core, 2.4GHz,
          24GB RAM)
          ˜   ~1 day
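
To put these run times in perspective, the speedup can be compared with the increase in core count. This is a rough sketch based only on the approximate figures on the slide; it reads "dual-core" as 2 cores per node and "dual quad-core" as 2 x 4 = 8 cores per node, which is an interpretation rather than something the slide states.

```python
# Approximate figures from the slide; core counts are an interpretation.
setups = {
    "single": {"cores": 1 * 2, "days": 2 * 365},
    "SARA": {"cores": 20 * 2, "days": 30},
    "DAS4": {"cores": 36 * 2 * 4, "days": 1},
}

def speedup(name, baseline="single"):
    """How many times faster a setup is than the baseline."""
    return setups[baseline]["days"] / setups[name]["days"]

def core_ratio(name, baseline="single"):
    """How many times more cores a setup has than the baseline."""
    return setups[name]["cores"] / setups[baseline]["cores"]

for name in ("SARA", "DAS4"):
    print(name, round(speedup(name), 1), core_ratio(name))
```

With these numbers the speedup (about 24x and 730x) exceeds the growth in core count (20x and 144x), so core count alone does not account for the difference; the slide does not say which other factors are responsible.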


Intermezzo: The-Web-as-a-corpus

¢   Web retrieval
     £   TREC Web track – ClueWeb09
          ˜   1,040,809,705 web pages, in 10 languages
          ˜   25TB uncompressed

¢   Parse TBs of web data
     £   SARA Hadoop
     £   Cloud9/Ivory (/Elasticsearch/Solr/Lucene)
     £   POS tagging, dependency parsing, entities
     £   easy peasy



Use-case 2: Temporal patterns for IR

¢   Temporal relevance?
¢   Relevant documents
     £   query: ‘grammys’
     £   time (in days) along the x-axis
     £   nr. of judged relevant documents along the y-axis
¢   Value: detect “temporal” queries

[Figure: panel (a) “Relevant documents”, taken from “Using Bursts for Query Modeling”; caption fragment: “Temporal distributions for the que…”]
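
One simple way to flag a "temporal" query from such a per-day histogram is to look for days whose count is far above average. The sketch below is an illustrative thresholding heuristic, not the burst detection method of the work shown in the figure; the threshold of two standard deviations and the sample counts are assumptions.

```python
from statistics import mean, pstdev

def burst_days(daily_counts, threshold=2.0):
    """Return the indices of days whose count exceeds mean + threshold * stdev."""
    mu = mean(daily_counts)
    sigma = pstdev(daily_counts)
    if sigma == 0:
        return []  # perfectly flat distribution: no bursts
    return [day for day, count in enumerate(daily_counts)
            if count > mu + threshold * sigma]

def is_temporal(daily_counts, threshold=2.0):
    """A query counts as temporal here if at least one day bursts."""
    return bool(burst_days(daily_counts, threshold))

# Hypothetical counts of relevant documents per day for one query.
counts = [1, 0, 2, 1, 0, 1, 40, 1, 0, 1, 2, 1]
print(burst_days(counts))  # [6]
```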
Use-case 2: Temporal patterns for IR

¢   “Term lifespan” plot
     £   time (in days) along the x-axis
     £   terms along the y-axis
     £   every dot represents an occurrence on that day
     £   terms are ordered by their first occurrence in the webpages
          on allrecipes.com

[Figure: “Figure 5. Term lifespan plots for several pages”, shown over a page of a paper on Web content change]

Or from Wikipedia access logs...

¢   1 year = ~ 555GB of
     raw Wikipedia logs
     £   filter
     £   aggregate
     £   link
     £   visualize
¢   Inherently parallelizable
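
The filter and aggregate steps can be sketched per log line, which is also why the job parallelizes so easily. The sketch assumes each raw log line has four space-separated fields (project, URL-encoded page title, request count, bytes transferred); the sample lines and function names are made up for illustration.

```python
from collections import Counter
from urllib.parse import unquote

def parse_line(line):
    """Parse 'project title count bytes'; return None for malformed lines."""
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, count, _size = parts
    try:
        return project, unquote(title), int(count)
    except ValueError:
        return None

def aggregate(lines, project="en", prefix=""):
    """Filter by project and title prefix, then sum request counts per title."""
    totals = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        proj, title, count = parsed
        if proj == project and title.startswith(prefix):
            totals[title] += count
    return totals

sample = [
    "en Christmas%20Island 3 630",
    "en Christmas%20Island 2 700",
    "nl Kerstmis 5 100",
    "en Hadoop 1 50",
]
totals = aggregate(sample, prefix="Christmas")
```

Each line is handled independently and the per-title sums can be merged afterwards, so the same logic fits a map step (parse and filter) followed by a reduce step (sum per title).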




[Figure: sample of the raw log file pagecounts-20101231-230000, showing lines for English Wikipedia (“en”) pages whose titles start with “Christmas” (e.g. “Christmas%20Island”), each followed by two numbers]

Use-case 3: Mining user edits on Wikipedia

¢   As a social signal …
¢   As a language resource …
     £   Target: User edits, textual differences between revisions of
          the same document
     £   Objective: Distinguish between factual edits (alter the
          meaning) and fluency edits (address style or readability)
     £   Dataset: Full revision history of the English Wikipedia




The data

¢   Average of 3.5 to 4 million revisions per month
     £   English Wikipedia, August 2006 to August 2011
     £   Each revision may contain multiple edits (many are
          irrelevant)
     £   342GB compressed
          text (snapshot of
          15/01/2011)




What to do?

¢   A lot of pre-processing
     £   Filtering out irrelevant revisions
     £   Parsing wiki markup
     £   Word tokenization
     £   Sentence splitting
     £   Computing textual diff between revisions
     £   Indexing user edits at sentence level and across sentence boundaries
     £   Computing classification features per user edit
¢   And then
     £   Execution: 15 nodes, each processes a data stream
     £   Average of 2-3 days per node
¢   Outcome: 6.3 million textual diff segments, 4.3 million user edits
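
Two of the pre-processing steps, sentence splitting and computing the textual diff between revisions, can be sketched with the Python standard library. This is an illustration only, not the pipeline used for the numbers above: it assumes wiki markup has already been stripped, uses a naive regular-expression sentence splitter, and the function names are hypothetical.

```python
import difflib
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def user_edits(old_revision, new_revision):
    """Return the sentence-level differences between two revisions.

    Each edit is a tuple (operation, old sentences, new sentences), where
    operation is 'replace', 'delete' or 'insert'.
    """
    old = split_sentences(old_revision)
    new = split_sentences(new_revision)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, old[i1:i2], new[j1:j2]))
    return edits

old = "Amsterdam is the capital. It has 700,000 inhabitants. It is old."
new = "Amsterdam is the capital. It has 800,000 inhabitants. It is old."
edits = user_edits(old, new)
```

In this example `edits` holds a single 'replace' edit on the middle sentence; classification features (factual versus fluency) would then be computed per such edit.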

What’s next?




Real-time semantic analysis

¢   Example: reputation management
¢   Follow the Twitter stream
     £   Am I being mentioned?
     £   What are they saying about me?
     £   Is this potentially damaging?
¢   Why a challenge?
     £   Ambiguity
     £   Noise
     £   “I need to know now!”
¢   Big data
Extreme personalisation

¢   “Zero click”, “zero query”
¢   Tell me what I should
     know
     £   Summarize a few million
          documents
     £   Show a semantically
          meaningful result on my
          screen
¢   Big data

Social search

¢   Socially improved search
     £   General search, personalized search
     £   Thousands of users of social networks actively share
          content and attitudes and opinions and experiences
     £   Use this to “push content”
     £   Return results that you care about, with a broad “subjective
          context”
¢   Big data



Thanks!




¢   Edgar Meij
     £   http://edgar.meij.pro
     £   edgar.meij@uva.nl
     £   @edgarmeij





SQL Database Design For Developers at php[tek] 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Large-scale Data Processing for Information Retrieval #nlhug

  • 1. Large-scale Data Processing for Information Retrieval Edgar Meij Informatics Institute
  • 2. Joint work with Amit Bronner, Hendrike Peetz, Wouter Weerkamp, Anne Schuth, Maarten de Rijke Large-scale Data Processing for IR 2
  • 3–7. [Diagram, built up over five slides: research themes grouped around “Information retrieval” and intelligent information services/access: big data, semantic search, real-time analytics, social signal analysis, synchronize content, multi-lingual information access, machine translation, theory and models, evaluation methodology, text mining, political information, storytelling, human-computer information retrieval, knowledge representation & reasoning, information integration, exploratory search, foundations of XML, multi-modal summaries, open data]
  • 8. Me
      • Information retrieval (~ search engines)
      • Semantic search/annotations
      • Use knowledge bases (Wikipedia, Freebase, etc.) as
          • primary information source for search, or
          • a complement to traditional retrieval
  • 9. Search engines
  • 10. Search engines – a bird’s eye view
      • Main ingredient: counting words
          • Query ~ distribution over words
          • Document ~ distribution over words
          • Ranking ~ comparing distributions
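The "query ~ distribution, document ~ distribution, ranking ~ comparing distributions" view can be sketched in a few lines. This is a minimal illustration, not the speaker's system: it assumes unigram distributions built by whitespace tokenization, linear smoothing with a collection-wide background model (the weight `lam` is an arbitrary choice), and negative KL divergence as the comparison. All function names are hypothetical.

```python
import math
from collections import Counter

def word_distribution(text):
    """Turn text into a unigram distribution (word -> relative frequency)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def kl_score(query_dist, doc_dist, background, lam=0.5):
    """Negative KL divergence of the query from a smoothed document model.

    Higher is better. Query words unseen in the whole collection are skipped
    for every document alike (an assumption of this sketch).
    """
    score = 0.0
    for word, p_q in query_dist.items():
        p_d = lam * doc_dist.get(word, 0.0) + (1 - lam) * background.get(word, 0.0)
        if p_d > 0:
            score -= p_q * math.log(p_q / p_d)
    return score

def rank(query, docs):
    """Order documents by how closely their word distribution matches the query's."""
    background = word_distribution(" ".join(docs))
    q = word_distribution(query)
    scored = [(kl_score(q, word_distribution(d), background), d) for d in docs]
    return [d for _, d in sorted(scored, key=lambda pair: pair[0], reverse=True)]

docs = [
    "tropical storms pose hurricane threats",
    "good camera under 300 euro",
    "weather at home",
]
print(rank("hurricane storms", docs)[0])
```

Smoothing with the background model keeps a document from being ruled out just because it misses one query word.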
  • 12. “Forecasters are watching tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …” (shown alongside the words: forecast, hurricane, fun, tropical, wind, weather, home)
  • 13. Search engines – a bit of history
      • Anno 1995
          • Counting words (only)...
          • Stopwords
          • Linguistic normalization
  • 14. Search engines – a bit of history
      • Anno 2000: 2nd generation
          • Link structure
              • Anchor text
              • PageRank
          • Document structure
              • title, top/bottom, etc.
              • boilerplate
          • Click-through data
  • 15. Search engines – a bit of history
      • Anno now
          • Real-time indexing/search
          • Increasingly personalized
          • Increasingly social
          • Apply “observations” of human behavior to improve, to evaluate
              • Search behavior, click behavior, dwell behavior, reading time, …, other things that are happening in the world
          • Rich signals
  • 16. Signals
      • Users/Personalisation
          • group: country, region, language, device, browser, etc.
          • individual: profile, history, sessions, etc.
      • Linguistics (e.g., spell-checking)
      • Semantics (e.g., entities)
      • Popularity (e.g., PageRank)
      • Social (e.g., G+)
      • And more...
          • readability, relevance assessments, clicks, etc.
      • [Inset slide, KH&MdR (U. Amsterdam), Advanced Information Retrieval MS: Why “learning to rank”? More and more features are found to be useful for ranking documents. How should we combine these?]
      • Footnote 1: http://www.flickr.com/photos/sameli/540933604/
  • 17. Applying signals
      • Typically at query time...
          • Leaning heavily on machine learning
      • Not the focus here...
      • [Same inset slide and footnote as on slide 16]
  • 18. What generates (non-monetary) value?
  • 19. What generates (non-monetary) value?
      • What is value?
          • Better/Richer UX
              • Clever term/phrase suggestions
              • Clever, rich snippets
          • Finding what you need faster/better/...
              • Homing in on what you want to find
              • Task/Problem solving
          • and more...
  • 20. For instance... “good camera under 300 euro”
  • 21. Or...
  • 22. Or...
  • 27. So, where else do you get value from?
      • Improving signals...
          • Richer/Better/More focused signals
              • Richer data/better extraction/...
              • "Google acquires Freebase"
      • ... or the application thereof
          • Algorithmic innovations
          • Training data
              • Logs (queries, clicks, ...) – from toolbars, redirects, etc.
              • Relevance assessments – manual, professionals, Mechanical Turk, etc.
      • "More intelligent systems"
  • 28. Intelligence?
      • Need analysis of (large quantities of) data
          • Typically, "transformations"
              • graphs (PageRank, FriendRank)
              • text => structure
              • aggregations
              • etc.
      • Then, aggregate analyses to obtain "value"
          • count/sum/min/max/avg/etc.
      • Hadoop!
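The "transform, then aggregate" pattern on this slide is what a MapReduce job expresses. The sketch below imitates the three phases in plain, single-process Python for a word-count aggregation; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Transformation: emit one (word, 1) pair per token in the record."""
    return [(word, 1) for word in record.lower().split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregation: here a sum; min/max/avg would follow the same shape."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce over an in-memory list of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(run_job(["tropical storms", "tropical weather"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is the part a cluster framework takes care of.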
  • 29. Use-cases
  • 30. Use-case 1: Search and analysis on tweets
      • Even getting them is not quite trivial
      • Example: TREC Microblog track
          • 16M tweets
              • Published as IDs
              • Default HTML download option comes without metadata (geo data, original tweet when retweeted, reply-to, etc.)
              • JSON format has all the beautiful stuff
          • HTML crawling vs getting the JSON objects
              • JSON download limited to 150 tweets per hour per IP address
                  • On a single machine: more than 12 years
                  • 884 nodes running for close to a week
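The two durations on this slide follow from the rate limit. The arithmetic below reproduces them under two assumptions that the slide does not state: every node has its own IP address, and the download runs without interruption.

```python
TWEETS = 16_000_000    # size of the collection, from the slide
RATE_PER_HOUR = 150    # JSON downloads per hour per IP address, from the slide
NODES = 884            # cluster size, from the slide

def crawl_hours(tweets, rate_per_hour, nodes=1):
    """Hours needed when each node downloads at the per-IP rate limit."""
    return tweets / (rate_per_hour * nodes)

single_years = crawl_hours(TWEETS, RATE_PER_HOUR) / (24 * 365)
cluster_days = crawl_hours(TWEETS, RATE_PER_HOUR, NODES) / 24

print(f"single machine: {single_years:.1f} years")
print(f"{NODES} nodes: {cluster_days:.1f} days")
```

The ideal figure for the cluster is about five days; failed requests and retries would plausibly stretch that towards the "close to a week" reported on the slide.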
  • 31. And once you have millions of tweets…
      • Text analytics on twitter streams
          • Information extraction, sentiment analysis, …
          • Given an entity (company, product, …), what is being said about it?
      • Example tweet: “Obama almost 15mins late... wonder if he's watching college hoops. Less than 2mins left in Texas Oakland game #NCAA #Marc ...”
  • 32. And once you have millions of tweets…
      • Text analytics on twitter streams
          • Information extraction, sentiment analysis, …
          • Given an entity (company, product, …), what is being said about it? Which aspects? Which attitudes?
          • Extract triples X–R–Y
          • Dependency parsing
  • 34. Some numbers
      • Data
          • ~10% of public English tweets in 2010
          • ~250M tweets
      • Performance
          • Single machine (1 dual-core, 2.2GHz, 3GB RAM)
              • ~2 years
          • Sara Hadoop cluster (20 nodes x dual-core, 2.6GHz, 16GB RAM)
              • ~30 days
          • DAS4 Hadoop cluster (36 nodes x dual quad-core, 2.4GHz, 24GB RAM)
              • ~1 day
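To put the three timings side by side, the sketch below compares the number of cores with the reported speed-up. It takes the slide's rounded figures at face value ("~2 years" is read as 730 days) and counts a dual quad-core node as 8 cores; both are assumptions of this example.

```python
# (cores, wall-clock days) per setup, read off the slide.
SETUPS = {
    "single machine": (1 * 2, 2 * 365),
    "Sara Hadoop cluster": (20 * 2, 30),
    "DAS4 Hadoop cluster": (36 * 8, 1),
}

def compare(setups, baseline="single machine"):
    """Return core ratio and speed-up of every setup relative to the baseline."""
    base_cores, base_days = setups[baseline]
    return {
        name: {"core_ratio": cores / base_cores, "speedup": base_days / days}
        for name, (cores, days) in setups.items()
    }

for name, row in compare(SETUPS).items():
    print(f"{name}: {row['core_ratio']:.0f}x cores, {row['speedup']:.1f}x faster")
```

The reported speed-ups are larger than the core ratios. The slide does not explain this; the much larger memory per node and the coarse rounding of the timings are plausible contributors.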
  • 35. Intermezzo: The-Web-as-a-corpus
      • Web retrieval
          • TREC Web track – ClueWeb09
              • 1,040,809,705 web pages, in 10 languages
              • 25TB uncompressed
      • Parse TBs of web data
          • SARA Hadoop
          • cloud9/Ivory(/Elasticsearch/SOLR/Lucene)
          • POS, DEP, entities
          • easy peasy
  • 36. Use-case 2: Temporal patterns for IR
      • Temporal relevance?
      • Relevant documents
          • query: ‘grammys’
          • time (in days) along the x-axis
          • nr. of judged relevant documents along the y-axis
      • Value: detect “temporal” queries
      • [Figure from “Using Bursts for Query Modeling”, panel (a) Relevant documents; caption fragment: “Temporal distributions for the que…”]
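One simple way to flag a "temporal" query from such a plot is to look for days on which the number of relevant documents is far above average. The sketch below uses a mean-plus-k-standard-deviations threshold; this rule, the value of `k`, and the function names are assumptions of the example, not the method behind the figure.

```python
from statistics import mean, pstdev

def bursty_days(daily_counts, k=2.0):
    """Return the indices of days whose count exceeds mean + k * std."""
    threshold = mean(daily_counts) + k * pstdev(daily_counts)
    return [day for day, count in enumerate(daily_counts) if count > threshold]

def is_temporal_query(daily_counts, k=2.0):
    """A query counts as temporal here if at least one day is bursty."""
    return bool(bursty_days(daily_counts, k))

# Invented counts of judged relevant documents per day, for illustration only.
counts = [1, 0, 2, 1, 30, 1, 0, 1, 2, 1]
print(bursty_days(counts), is_temporal_query(counts))
```

A query whose relevant documents are spread evenly over time produces no bursty days under this rule and is treated as non-temporal.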
  • 37. Use-case 2: Temporal patterns for IR
      • “Term lifespan” plot
          • time (in days) along the x-axis
          • terms along the y-axis
          • every dot represents an occurrence on that day
          • terms are ordered by their first occurrence
      • [Figure 5. Term lifespan plots for several pages; the screenshot also shows fragments of the surrounding paper text and mentions allrecipes.com and the BestBuy homepage]
  • 38. Or from Wikipedia access logs...
      • 1 year = ~555GB of raw Wikipedia logs
          • filter
          • aggregate
          • link
          • visualize
      • Inherently parallelizable
  • 39. [Same slide with a screenshot of raw pagecounts log lines for English Wikipedia titles starting with “Christmas”; the sample text is not recoverable]
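The filter and aggregate steps can be sketched as follows. The example assumes the whitespace-separated hourly pagecounts layout `project title views bytes` with URL-encoded titles; the sample lines are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote

def parse_line(line):
    """Parse one log line into (project, decoded title, views), or None if malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return project, unquote(title), int(views)
    except ValueError:
        return None

def aggregate(lines, project="en", prefix="Christmas"):
    """Filter on project and title prefix, then sum the views per title."""
    totals = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        proj, title, views = record
        if proj == project and title.startswith(prefix):
            totals[title] += views
    return totals

sample = [
    "en Christmas%20Carol 3 5000",   # invented example lines
    "en Christmas%20Carol 2 100",
    "nl Kerstmis 4 900",
    "not a valid line",
]
print(aggregate(sample))
```

Each line is handled on its own and the per-title sums can be merged in any order, which is why the slide calls the task inherently parallelizable.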
  • 40. Use-case 3: Mining user edits on Wikipedia
      • As a social signal …
      • As a language resource …
          • Target: user edits, textual differences between revisions of the same document
          • Objective: distinguish between factual edits (alter the meaning) and fluency edits (address style or readability)
          • Dataset: full revision history of the English Wikipedia
  • 41. The data
      • Average of 3.5 to 4 million revisions per month
          • English Wikipedia, August 2006 to August 2011
          • Each revision may contain multiple edits (many are irrelevant)
          • 342GB compressed text (snapshot of 15/01/2011)
  • 42. What to do?
      • A lot of pre-processing
          • Filtering out irrelevant revisions
          • Parsing wiki markup
          • Word tokenization
          • Sentence splitting
          • Computing textual diff between revisions
          • Indexing user edits at sentence level and across sentence boundaries
          • Computing classification features per user edit
      • And then
          • Execution: 15 nodes, each processing a data stream
          • Average of 2-3 days per node
      • Outcome: 6.3 million textual diff segments, 4.3 million user edits
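The "textual diff between revisions" step can be illustrated with Python's standard `difflib`. This is a word-level sketch under simple assumptions (whitespace tokenization, one sentence per revision) and not the pipeline used in the project; `user_edits` is a hypothetical helper.

```python
import difflib

def user_edits(old_sentence, new_sentence):
    """Return the non-matching segments between two revisions of a sentence.

    Each edit is a tuple (operation, old text, new text), where operation is
    one of 'replace', 'delete' or 'insert'.
    """
    old_tokens = old_sentence.split()
    new_tokens = new_sentence.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return edits

print(user_edits("The film was released in 1995",
                 "The movie was released in 1996"))
```

In this invented pair, "film" to "movie" would be a fluency edit and "1995" to "1996" a factual one; telling the two apart is the classification task described on slide 40.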
  • 44. What’s next?
  • 45. Real-time semantic analysis
      • Example: reputation management
      • Follow twitter stream
          • Am I being mentioned?
          • What are they saying about me?
          • Is this potentially damaging?
      • Why a challenge
          • Ambiguity
          • Noise
          • “I need to know now!”
      • Big data
  • 46. Extreme personalisation
      • “Zero click”, “zero query”
      • Tell me what I should know
          • Summarize a few million documents
          • Show a semantically meaningful result on my screen
      • Big data
  • 47. Social search
      • Socially improved search
          • General search, personalized search
          • Thousands of users of social networks actively share content, attitudes, opinions and experiences
          • Use this to “push content”
          • Return results that you care about, with a broad “subjective context”
      • Big data
  • 48. Thanks!
      • Edgar Meij
          • http://edgar.meij.pro
          • edgar.meij@uva.nl
          • @edgarmeij