Building a Global Listening Platform with Solr
Steve Kearns
Rosette Product Manager
Basis Technology
October 7, 2010




Monday, October 04, 2010



Agenda
•   Agenda
•   Who Am I?
•   What is a “Listening Platform”?
•   Challenges (Technical & Global)
•   Details
•   Demonstration




About me
• Product Manager at Basis Technology
  – Rosette linguistics platform
     • Language ID
     • Language Support for Search
     • Entity Extraction
     • Entity Translation/Search
     •…
• Related history
  – Media Monitoring at BBN Technologies
     • Video, Web content extraction: STT, MT, Search


What is a Listening Platform?
• Content aggregator for online media
• Targets:
  – Social/Brand monitoring
  – Government OSINT
• Functions:
  –   Content acquisition
  –   Content analysis
  –   Search indexing
  –   Search (UI)
  –   Visualization


Content Acquisition
• What:
  – News Articles
  – Social Media
• How:
  – Web Crawler
     • Nutch!
  – RSS feed reader/aggregator
     • ROME/Curn
  – Pay a 3rd party aggregator
     • Good option, if you can afford it.


Content Acquisition Problems
• Web Crawling
  – Content Extraction!
     • BoilerPipe, Readability
  – Crawl History
     • Duplicate detection
     • Article updates?
  – Crawl Control
     • Crawl Depth?
     • Crawl Restrictions?

• Per-site configuration doesn’t scale.
Content Acquisition: How
• CURN – Customizable Utilitarian RSS Notifier
• [Flow diagram]
  – RSS Feed List → CURN (with CURN History) → RSS Feed → New story?
     • No → End
     • Yes → Output Plug-In
  – Output Plug-In:
     • Download Story URL
     • Extract Content (BoilerPipe || Readability)
     • Create Solr Message → Solr
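
The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CURN plug-in itself: `process_feed`, `build_solr_add_message`, and the field names `id`, `title`, and `content` are hypothetical, and `download` and `extract` are stand-ins for an HTTP fetch and for BoilerPipe or Readability. The output uses Solr's XML `<add><doc>` update format.

```python
import xml.etree.ElementTree as ET


def build_solr_add_message(doc):
    """Serialize a dict of field name -> value into a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def process_feed(items, history, download, extract):
    """Apply the slide's flow to each feed item.

    items    -- list of dicts with "link" and "title" (hypothetical shape)
    history  -- set of URLs already seen (plays the role of the CURN History)
    download -- callable: url -> raw HTML (stand-in for an HTTP fetch)
    extract  -- callable: raw HTML -> article text (stand-in for BoilerPipe/Readability)
    """
    messages = []
    for item in items:
        url = item["link"]
        if url in history:  # "New story?" -> No: end
            continue
        history.add(url)    # remember the story so it is not processed twice
        text = extract(download(url))
        messages.append(build_solr_add_message(
            {"id": url, "title": item["title"], "content": text}
        ))
    return messages
```

Each returned string could then be posted to a Solr update handler; the transport is left out here on purpose.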




What is a Listening Platform?
• Content aggregator for online media
• Targets:
  – Social/Brand monitoring
  – Government OSINT
• Functions:
  –   Content acquisition
  –   Content analysis
  –   Search indexing
  –   Other visualization



Content Analysis
•   Language Identification
•   Entity Extraction
•   Relationship Extraction
•   Classification
•   Near-Duplicate Detection
•   Story Tracking
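
The slides do not say how near-duplicate detection is implemented. As one common approach (an assumption, not necessarily what this platform uses), two stories can be compared by the Jaccard similarity of their word shingles. The function names and the 0.5 threshold below are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Treat two texts as near-duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Comparing every pair of stories this way does not scale to a large crawl; it is meant only to show what "near-duplicate" can mean in practice.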




Content Analysis: How?
• Preprocessor to Solr
  – Custom
  – OpenPipeline
• Solr UpdateRequestProcessors
  – Chain of URPs defined in solrconfig.xml
  – Add, edit, remove fields
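
The chain idea can be illustrated with a small Python analogy. Real UpdateRequestProcessors are Java classes configured in Solr; nothing below is Solr's API, and every function and field name is hypothetical. The sketch only shows a document passing through an ordered chain in which each step adds, edits, or removes a field.

```python
def add_word_count(doc):
    """Add a field: number of whitespace-separated words in the content."""
    doc = dict(doc)
    doc["word_count"] = len(doc.get("content", "").split())
    return doc


def normalize_title(doc):
    """Edit a field: trim surrounding whitespace from the title."""
    doc = dict(doc)
    doc["title"] = doc.get("title", "").strip()
    return doc


def drop_raw_html(doc):
    """Remove a field: the raw HTML is not needed in the index."""
    doc = dict(doc)
    doc.pop("raw_html", None)
    return doc


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc
```

Because each step takes and returns a plain document, steps can be reordered or swapped without touching the others, which is the main appeal of the chain design.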




Content Analysis: How
• Custom distributed processing pipeline
  – Complexity of components
  – Number of components
  – Some components require their own data storage


• Solr Indexing is the final processing step




Content Analysis Details
• Language Identification for:
   – Indexing
   – Faceting/Searching
   – Entity Extraction
• Language-specific indexing for:
   – Improved recall with high precision
• Entity Extraction for:
   – Faceting
   – Entity search
   – Input to relationship extraction


Language Identification
• Detect dominant language
• Find language regions
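
To make "dominant" versus "regions" concrete, here is a toy sketch that works on Unicode scripts rather than languages. This is a deliberate simplification and an assumption of this example: script is only a crude proxy, and a real language identifier also has to separate languages that share a script (English vs. German, for instance). All function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a rough script label (e.g. LATIN, CYRILLIC, CJK) or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text):
    """Return the script that accounts for the most letters in the text."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def script_regions(text):
    """Split text into consecutive (script, substring) runs."""
    regions, current, buf = [], None, []
    for ch in text:
        s = script_of(ch)
        if s is not None and current is not None and s != current:
            regions.append((current, "".join(buf).strip()))
            buf = []
        if s is not None:
            current = s
        buf.append(ch)
    if current is not None:
        regions.append((current, "".join(buf).strip()))
    return regions
```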




Language-Specific Indexing
• Every language has unique challenges:
  – Tokenization
     • Morphological Analysis vs. N-Gram
  – Stemming vs. Lemmatization
     • All European and Middle Eastern languages
  – Compound words
     • Swedish, Danish, Norwegian, Dutch, German




Morphological Analysis vs. N-Gram
• Search Term: 東京 ルパン上映時間
• N-Gram:




• Morphological:
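
The N-Gram case can be reproduced with a few lines of Python; the morphological case needs a dictionary-based analyzer and is not sketched here. The function below is a hypothetical illustration of character bigram tokenization, not the tokenizer used in the talk.

```python
def char_ngrams(text, n=2):
    """Split text on whitespace, then emit overlapping character n-grams per chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) < n:
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


# The slide's search term, tokenized as bigrams.
tokens = char_ngrams("東京 ルパン上映時間")
```

For the slide's search term this yields 東京, ルパ, パン, ン上, 上映, 映時, 時間. The bigram ン上 straddles the boundary between ルパン and 上映時間, so it corresponds to no word; tokens like this are why n-gram indexing can match text that a word-based analysis would not.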




Stemming vs. Lemmatization
• Stemming:
  – Set of rules for removing characters from words
  – Increased recall at the expense of precision
  – Example EN rule: Remove trailing “ing” or “al”


• Lemmatization:
  – Complex set of approaches for producing the
    dictionary form of a word
  – Increased recall without hurting precision
  – Uses context to disambiguate candidates
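
A tiny sketch shows why rule-based stemming costs precision. It applies the example rule from this slide (strip a trailing "ing" or "al") and compares it with a dictionary lookup standing in for lemmatization. The lookup table is hypothetical and far simpler than a real lemmatizer, which, as noted above, also uses context.

```python
SUFFIXES = ("ing", "al")  # the example EN rule from the slide


def naive_stem(word):
    """Strip the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


# Hypothetical two-entry dictionary; a real lemmatizer covers the whole language.
LEMMAS = {"spoken": "speak", "conferences": "conference"}


def lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)
```

With these definitions, `naive_stem("speaking")` gives "speak", but `naive_stem("king")` gives "k" and `naive_stem("animal")` gives "anim", while `lemmatize("spoken")` gives "speak" and leaves "king" alone.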


Stemming vs. Lemmatization
• English: “I have spoken at several conferences”
• Stemming:




• Lemmatization:




Stemming vs. Lemmatization
• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
• Stemming:




• Lemmatization (and decompounding!)




Stemming and Lemmatization Challenges
  • Can I index text from many languages into
    the same field?
    – Yes, but it’s not always a good idea!
       • Query language ID is not accurate.
    – You need a custom Query Analyzer that does
      stemming/lemmatization in many languages for
      the same query.


  • How do I query text in multiple fields?
    – Dismax parser allows you to specify multiple
      fields to search.
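
As a sketch of the multi-field approach, the request below asks the Dismax parser to search one field per language. `defType=dismax` and `qf` are standard Solr request parameters; the host, path, and the field names `text_en` and `text_de` are assumptions made for this example.

```python
from urllib.parse import urlencode


def build_dismax_query(base_url, user_query, fields):
    """Build a Solr select URL that searches several fields via the Dismax parser."""
    params = {
        "q": user_query,
        "defType": "dismax",
        "qf": " ".join(fields),  # query fields, space separated
    }
    return base_url + "?" + urlencode(params)


url = build_dismax_query(
    "http://localhost:8983/solr/select",  # assumed local Solr instance
    "Boston",
    ["text_en", "text_de"],               # hypothetical per-language fields
)
```

Each per-language field can then carry its own analyzer, so the same query text is stemmed or lemmatized appropriately for every field it is matched against.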

Entity Extraction
• Process of identifying people, places, organizations, dates,
  times, etc. in unstructured text.
• Methods:
   – List-based
   – Rules-based
   – Statistical

• Define your goals upfront!
   – Some extraction methods work better for certain entity types
       • Rules work well for dates, email addresses, and URLs, but not people
       • Lists work well for titles, but not locations
       • Statistical extractors work well for ambiguous entities like people,
         locations, organizations
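
The rule-based and list-based methods are simple enough to sketch. The patterns and the title list below are illustrative assumptions, deliberately loose, and not the extractors used in the talk; statistical extraction for people, locations, and organizations needs a trained model and is not shown.

```python
import re

# Rules: regular expressions for entity types with a regular surface form.
RULES = (
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("URL", re.compile(r"https?://[^\s<>\"]+")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),  # ISO-style dates only
)

# List: a hypothetical gazetteer of job titles.
TITLES = ("Product Manager", "President")


def extract_entities(text):
    """Return (label, surface string) pairs found by the rules, then by the list."""
    entities = []
    for label, pattern in RULES:
        entities.extend((label, m.group()) for m in pattern.finditer(text))
    for title in TITLES:
        if title in text:
            entities.append(("TITLE", title))
    return entities
```

Neither method has any way to decide whether "Boston" is a city, a band, or a surname, which is the gap statistical extractors are meant to fill.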


Entity Extraction Example




User Interface
• Google-style search results aren’t enough!

• Design UI around workflow
           and/or
• Design for Exploration




Dashboard/Summary




Faceting




Link Analysis




Details
• Data Acquisition: Curn RSS Aggregator
• Analysis:
  – Basis Technology:
     • Language ID
     • Search Enablement
     • Entity Extraction
     • Relationship Extraction
     • Entity Search
• Indexing: Solr
• UI:
  – JSP
  – Yahoo UI (YUI)
  – Javascript InfoViz Toolkit (theJit.org)
  – gRaphaël (g.raphaeljs.com)



Architecture

[Architecture diagram; components shown:]
• CURN (RSS Harvester)
• Rosette Analysis Components
  – Lang ID
  – Entity Extraction
  – Relationship Extraction
• Document Classification and Clustering
• MySQL (Long Term Datastore)
• Name Indexer
• Solr
• Indexing / Query Service
• User Interface

Demo
• Listening Platform built on Solr

• I built this version in 3 months using Solr
  and products from Basis Technology

• I would be happy to show you the Solr
  config and let you try it out





2010 10-building-global-listening-platform-with-solr

  • 1. Building a Global Listening Platform with Solr Steve Kearns Rosette Product Manager Basis Technology October 7, 2010 Monday, October 04, 2010 2
  • 2. Agenda
      • Agenda
      • Who Am I?
      • What is a “Listening Platform”?
      • Challenges (Technical & Global)
      • Details
      • Demonstration
  • 3. About me
      • Product Manager at Basis Technology
          – Rosette linguistics platform
              • Language ID
              • Language Support for Search
              • Entity Extraction
              • Entity Translation/Search
              • …
      • Related history
          – Media Monitoring at BBN Technologies
              • Video, Web content extraction: STT, MT, Search
  • 4. What is a Listening Platform?
      • Content aggregator for online media
      • Targets:
          – Social/Brand monitoring
          – Government OSINT
      • Functions:
          – Content acquisition
          – Content analysis
          – Search indexing
          – Search (UI)
          – Visualization
  • 5. Content Acquisition
      • What:
          – News Articles
          – Social Media
      • How:
          – Web Crawler
              • Nutch!
          – RSS feed reader/aggregator
              • ROME/Curn
          – Pay a 3rd party aggregator
              • Good option, if you can afford it.
  • 6. Content Acquisition Problems
      • Web Crawling
          – Content Extraction!
              • BoilerPipe, Readability
          – Crawl History
              • Duplicate detection
              • Article updates?
          – Crawl Control
              • Crawl Depth?
              • Crawl Restrictions?
      • Per-site configuration doesn’t scale.
  • 7. Content Acquisition: How
      • CURN – Customizable Utilitarian RSS Notifier
      • [Flowchart: RSS Feed List → CURN reads each RSS Feed → “New story?” (checked against CURN History) → No: End; Yes: Solr Output Plug-In → Download Story URL → Extract Content (BoilerPipe || Readability) → Create Solr Message]
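
The flow on this slide (check the history, download the story, extract the content, create a Solr message) can be sketched roughly as follows. This is a minimal illustration in Python, not CURN's actual plug-in code: `process_feed`, `build_solr_add_message`, the `download`/`extract` callables, and the field names `id`, `title`, and `content` are all hypothetical. The only Solr-specific piece is the standard XML `<add><doc><field name="…">` update message format.

```python
import xml.etree.ElementTree as ET


def build_solr_add_message(doc):
    """Serialize a dict of field name -> value as a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def process_feed(entries, history, download, extract):
    """Yield one Solr message per feed entry whose URL is not in the history.

    entries  -- iterable of dicts with "link" and "title" keys (assumed shape)
    history  -- a set of already-seen story URLs (stand-in for a crawl history)
    download -- callable: URL -> raw HTML (hypothetical, injected)
    extract  -- callable: raw HTML -> article text (stand-in for a
                BoilerPipe- or Readability-style extractor)
    """
    for entry in entries:
        url = entry["link"]
        if url in history:  # "New story?" -> No: nothing to do
            continue
        history.add(url)
        html = download(url)
        text = extract(html)
        yield build_solr_add_message(
            {"id": url, "title": entry["title"], "content": text}
        )
```

Passing `download` and `extract` in as callables keeps the sketch independent of any particular HTTP client or extraction library; a real deployment would also persist the history between runs.
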
  • 8. What is a Listening Platform?
      • Content aggregator for online media
      • Targets:
          – Social/Brand monitoring
          – Government OSINT
      • Functions:
          – Content acquisition
          – Content analysis
          – Search indexing
          – Other visualization
  • 9. Content Analysis
      • Language Identification
      • Entity Extraction
      • Relationship Extraction
      • Classification
      • Near-Duplicate Detection
      • Story Tracking
  • 10. Content Analysis: How?
      • Preprocessor to Solr
          – Custom
          – OpenPipeline
      • Solr UpdateRequestProcessors
          – Chain of URPs defined in solrconfig.xml
          – Add, edit, remove fields
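
The "chain of processors that add, edit, or remove fields" idea can be illustrated with a small sketch. This is plain Python showing only the pattern; real Solr UpdateRequestProcessors are Java classes configured in solrconfig.xml, and every function and field name below (`add_language`, `trim_title`, `drop_raw_html`, `language`, `raw_html`) is hypothetical.

```python
def add_language(doc):
    """Add a field. The constant "en" stands in for a real language identifier."""
    doc["language"] = "en"
    return doc


def trim_title(doc):
    """Edit a field: strip surrounding whitespace from the title."""
    doc["title"] = doc["title"].strip()
    return doc


def drop_raw_html(doc):
    """Remove a field that should not reach the index."""
    doc.pop("raw_html", None)
    return doc


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


chain = [add_language, trim_title, drop_raw_html]
indexed = run_chain({"title": "  Solr news ", "raw_html": "<p>...</p>"}, chain)
```

Because each step only sees and returns a document, steps can be reordered or swapped without touching the others, which is the main appeal of the chain design.
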
  • 11. Content Analysis: How
      • Custom distributed processing pipeline
          – Complexity of components
          – Number of components
          – Some components require their own data storage
      • Solr Indexing is the final processing step
  • 12. Content Analysis Details
      • Language Identification for:
          – Indexing
          – Faceting/Searching
          – Entity Extraction
      • Language-specific indexing for:
          – Improved recall with high precision
      • Entity Extraction for:
          – Faceting
          – Entity search
          – Input to relationship extraction
  • 13. Language Identification
      • Detect dominant language
      • Find language regions
  • 14. Language-Specific Indexing
      • Every language has unique challenges:
          – Tokenization
              • Morphological Analysis vs. N-Gram
          – Stemming vs. Lemmatization
              • All European and Middle Eastern languages
          – Compound words
              • Swedish, Danish, Norwegian, Dutch, German
  • 15. Morphological Analysis vs. N-Gram
      • Search Term: 東京 ルパン上映時間
      • N-Gram: [token output shown as an image on the slide]
      • Morphological: [token output shown as an image on the slide]
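
To make the N-Gram side of this comparison concrete, here is a minimal sketch of character bigram tokenization applied to the search term from the slide. It is an illustration only, not the tokenizer used in the talk; `char_ngrams` is a hypothetical helper. A morphological analyzer would instead aim to emit dictionary words, which a short sketch cannot reproduce.

```python
def char_ngrams(text, n=2):
    """Split text on whitespace, then emit overlapping character n-grams.

    Chunks no longer than n characters are kept whole.
    """
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


tokens = char_ngrams("東京 ルパン上映時間")
# tokens == ['東京', 'ルパ', 'パン', 'ン上', '上映', '映時', '時間']
```

Note that bigrams such as `ン上` and `映時` straddle word boundaries. They are indexed like any other token, which is where N-Gram indexing tends to trade precision for recall.
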
  • 16. Stemming vs. Lemmatization
      • Stemming:
          – Set of rules for removing characters from words
          – Increased recall at the expense of precision
          – Example EN rule: Remove trailing “ing” or “al”
      • Lemmatization:
          – Complex set of approaches for producing the dictionary form of a word
          – Increased recall without hurting precision
          – Uses context to disambiguate candidates
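
The example English rule (remove a trailing “ing” or “al”) is easy to sketch, and doing so shows where the precision loss comes from. This is a deliberately naive illustration, not a real stemmer such as Porter's; `naive_stem` is a hypothetical name.

```python
def naive_stem(word, suffixes=("ing", "al")):
    """Strip the first matching suffix, with no dictionary or context check."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


naive_stem("speaking")  # 'speak'  (helpful: now matches "speak")
naive_stem("sing")      # 's'      (harmful: not a word at all)
naive_stem("animal")    # 'anim'   (harmful: "al" was not a suffix here)
naive_stem("spoken")    # 'spoken' (missed: the rule never maps it to "speak")
```

A lemmatizer, by contrast, would map "spoken" to its dictionary form "speak" and leave "sing" and "animal" alone, which is why the slide credits it with better recall without the precision cost.
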
  • 17. Stemming vs. Lemmatization
      • English: “I have spoken at several conferences”
      • Stemming: [output shown as an image on the slide]
      • Lemmatization: [output shown as an image on the slide]
  • 18. Stemming vs. Lemmatization
      • German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
      • Stemming: [output shown as an image on the slide]
      • Lemmatization (and decompounding!): [output shown as an image on the slide]
  • 19. Stemming and Lemmatization Challenges
      • Can I index text from many languages into the same field?
          – Yes, but it’s not always a good idea!
      • Query language ID is not accurate.
          – You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query.
      • How do I query text in multiple fields?
          – Dismax parser allows you to specify multiple fields to search.
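
As a rough sketch of the last point, a dismax request names the fields to search in the `qf` parameter. The example below only builds the query string; the per-language field names `text_en` and `text_de` are assumptions for illustration, not fields from the talk's schema, and `dismax_query_string` is a hypothetical helper.

```python
from urllib.parse import urlencode


def dismax_query_string(user_query, fields):
    """Build the query string for a dismax search across several fields.

    defType, q and qf are standard Solr request parameters; the field
    names passed in are whatever the schema defines.
    """
    params = {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(fields),  # space-separated list of fields to search
    }
    return urlencode(params)


qs = dismax_query_string("Boston", ["text_en", "text_de"])
# qs == 'defType=dismax&q=Boston&qf=text_en+text_de'
```

The resulting string would be appended to a Solr select handler URL. Each field in `qf` can also carry a boost, such as `text_en^2`, if one language should weigh more.
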
  • 20. Entity Extraction
      • Process of identifying people, places, organizations, dates, times, etc. in unstructured text.
      • Methods:
          – List-based
          – Rules-based
          – Statistical-based
      • Define your goals upfront!
          – Some extraction methods work better for certain entity types
              • Rules work well for dates, email addresses, and URLs, but not people
              • Lists work well for titles, but not locations
              • Statistical extractors work well for ambiguous entities like people, locations, organizations
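
The rules-based method can be sketched with a few regular expressions, which is roughly why it suits dates, email addresses, and URLs: these have a regular surface form. The patterns below are simplified assumptions written for illustration (the date rule only covers the "October 7, 2010" style) and are not the rules of any particular product.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Simplified, illustrative patterns; real rule sets are far more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "date": re.compile(r"(?:" + MONTHS + r") \d{1,2}, \d{4}"),
}


def extract_entities(text):
    """Return (label, matched text) pairs for every rule that fires."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found


sample = "Contact steve@example.com or visit http://example.com/demo on October 7, 2010."
entities = extract_entities(sample)
# entities == [('email', 'steve@example.com'),
#              ('url', 'http://example.com/demo'),
#              ('date', 'October 7, 2010')]
```

No comparable pattern exists for person names, which have no fixed shape and depend on context; that is the case where the slide recommends statistical extractors.
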
  • 22. User Interface
      • Google-style search results aren’t enough!
      • Design UI around workflow and/or
      • Design for Exploration
  • 24. Faceting
  • 26. Details
      • Data Acquisition: Curn RSS Aggregator
      • Analysis:
          – Basis Technology:
              • Language ID
              • Relationship Extraction
              • Search Enablement
              • Entity Search
              • Entity Extraction
      • Indexing: Solr
      • UI:
          – JSP
          – Javascript InfoViz Toolkit (theJit.org)
          – Yahoo UI (YUI)
          – gRaphaël (g.raphaeljs.com)
  • 27. Architecture
      • [Architecture diagram; components shown: CURN (RSS Harvester); Rosette Analysis Components (Lang ID, Entity Extraction, Relationship Extraction); Document Classification and Clustering; Indexing / Query Service; User Interface; MySQL (Long Term Datastore); Solr; Name Indexer]
  • 28. Demo
      • Listening Platform built on Solr
      • I built this version in 3 months using Solr and products from Basis Technology
      • I would be happy to show you the Solr config and let you try it out