lots of facets, fast

     Anne Veling, BeyondTrees
anne@beyondtrees.com, May 26th 2011
introduction
 ▪ Anne Veling
   • Freelance Search Architect
   • Lucene Trainer
 ▪ ProQuest
 ▪ New York Times




                                 3
visualization
 ▪ Data
   • 1851 up to 2006: almost 60k newspapers
 ▪ How to give a semantic overview
   • Context: where am I?
   • Detail
 ▪ Exploration and discovery




                                             4
zoom
 ▪ Present all newspapers on one canvas
 ▪ Dynamic zooming and panning
 ▪ Search interface
   • for discovery
 ▪ Front-end by Q42
   • HTML5 app
   • iPad app
 ▪ Not yet live

                                         5
architecture
 [Diagram: two pipelines feeding the client]
 images -> Tile Generator -> tiles -> Web Server -> client
 text -> Indexer -> Solr index -> Solr server (with facet plugin) -> client




                                                       6
tiling
 ▪ Newspaper images, old ones scanned
   • TIFF format
   • Wrinkles, coffee stains
 ▪ Tile generator
   • Convert to JPEG
   • One virtual canvas of 512 gigapixels
   • Multiple layers, 3M tiles: ~100 GB in 11 levels




                                                7
search
 ▪ 25,072,989 articles
 ▪ 867M Solr index
 ▪ DataImportHandler
   • Memory issue: it loads all XML URLs into memory first
   • Solved by indexing in batches
 ▪ Special
   • Nothing stored, not even IDs
   • We need nothing returned from search…
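
The batching fix can be sketched as follows. This is a minimal illustration and not the DataImportHandler setup used in the project: `indexInBatches`, `indexBatch` and the batch size are hypothetical names. The only point is that an iterator is consumed in fixed-size groups, so the full URL list never has to sit in memory at once.

```scala
object BatchIndexing {
  /** Feeds `urls` to `indexBatch` in groups of at most `batchSize`.
    * Returns the number of batches that were processed.
    * `indexBatch` is a hypothetical callback standing in for the real indexing call. */
  def indexInBatches(urls: Iterator[String], batchSize: Int)(indexBatch: Seq[String] => Unit): Int = {
    var batches = 0
    // grouped() pulls only `batchSize` elements from the iterator at a time
    urls.grouped(batchSize).foreach { batch =>
      indexBatch(batch)
      batches += 1
    }
    batches
  }
}
```

Because the input is an `Iterator`, the caller can stream URLs from a file or a directory listing instead of building a list first.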



                                              8
 [Diagram: the query "results" and the "facets" drawn as columns over doc IDs 0 … maxDoc, with example facet counts 4 and 2]


                               9
faceting memory
 ▪ Store each facet as a BitSet over 25M articles
   • 58k facets x 25M docs x 1 bit = 169 GB (memory!)
 ▪ So we use DocSet from Solr
   • Sparse bit array -> now fits in 1 GB of memory
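
The arithmetic behind these two numbers can be reproduced with a small sketch. The dense figure follows directly from the slide. The sparse figure rests on two assumptions that are not stated in the deck: every article belongs to exactly four facets (one decade, one year, one month, one day), and a sparse set costs about 4 bytes per stored doc ID. Solr's `DocSet` has both bitset-backed and sorted-int-backed implementations; which one is picked per facet is not modeled here.

```scala
object FacetMemory {
  val GiB: Double = 1024.0 * 1024 * 1024

  /** Dense storage: one bit for every (facet, document) pair. */
  def denseBytes(facets: Long, docs: Long): Long = facets * docs / 8

  /** Sparse storage as doc-id lists: assumed 4 bytes per (document, facet it belongs to). */
  def sparseBytes(docs: Long, facetsPerDoc: Int): Long = docs * facetsPerDoc * 4L

  def main(args: Array[String]): Unit = {
    println(f"dense:  ${denseBytes(58000L, 25000000L) / GiB}%.1f GiB")
    println(f"sparse: ${sparseBytes(25000000L, 4) / GiB}%.2f GiB")
  }
}
```

Under these assumptions the sparse representation stays well below the 1 GB mentioned on the slide, because its size depends on the number of documents and not on the number of facets.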




                                                      10
faceting performance
 ▪ Facet initialization
   • Takes ~1.5 minutes
   • Cached
 ▪ Facet evaluation
   • At runtime!
   • #docs x #facets




                                             11
performance
 ▪ Facet initialization/creation
 ▪ Runtime faceting

 ▪ Solr LRU cache
 ▪ Creation of all facets: ~72 s
 ▪ Runtime evaluation out of the box: 71 seconds…
   /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
   &facet.date=thedate
   &facet.date.start=1850-01-01T00:00:00Z
   &facet.date.end=2007-01-01T00:00:00Z
   &facet.date.gap=%2B1DAY
   &facet=true

 ▪ Client-side bottleneck vs server-side
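
To get a feel for why the out-of-the-box request is expensive, it helps to count how many `+1DAY` buckets the date facet above asks for. The sketch below only does that calendar arithmetic with `java.time`; the object and method names are illustrative, and it says nothing about how Solr evaluates each bucket internally.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DateFacetBuckets {
  /** Number of one-day gaps between `start` (inclusive) and `end` (exclusive). */
  def dayBuckets(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)

  // The facet.date.start and facet.date.end values from the request above
  val buckets: Long = dayBuckets(LocalDate.of(1850, 1, 1), LocalDate.of(2007, 1, 1))

  def main(args: Array[String]): Unit =
    println(s"day buckets requested: $buckets")
}
```

That is roughly 57 thousand buckets for a single query, which is the same order of magnitude as the facet counts discussed in the rest of the deck.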
                                                               12
<filterCache class="solr.FastLRUCache" size="70000"
             initialSize="512" autowarmCount="0"/>
 ▪ Improved performance to ~300 ms for the
   “Amsterdam” [1825] query!
   • 2.3 MB output…
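
One plausible reading of `size="70000"` is that the cache has to hold every cached facet set at once, since an LRU cache that is smaller than the working set keeps evicting entries just before they are needed again. The sketch below shows that effect with a plain `java.util.LinkedHashMap` in access order. `LruCache` and `CacheDemo` are hypothetical stand-ins and not Solr's `FastLRUCache`.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

/** Minimal LRU cache: an access-ordered LinkedHashMap that evicts the eldest entry
  * once more than `capacity` entries are present. */
class LruCache[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}

object CacheDemo {
  /** Cycles through `facets` keys `passes` times and counts cache misses. */
  def misses(capacity: Int, facets: Int, passes: Int): Int = {
    val cache = new LruCache[Int, String](capacity)
    var missCount = 0
    for (_ <- 1 to passes; f <- 0 until facets) {
      if (cache.get(f) == null) {
        missCount += 1
        cache.put(f, s"docset-$f") // placeholder for an expensive-to-build facet set
      }
    }
    missCount
  }
}
```

With a capacity below the number of facets, every lookup in a repeated scan misses; once the capacity covers all facets, only the first pass pays the cost.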
<requestHandler name="/zoomr"
                class="com.proquest.zoom.ZoomrRequestHandler">
</requestHandler>
 ▪ Custom JSON output
   • Base 36 encoded heatmap
     01111111111111111122111222777986878768885568855899beddbce
     bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
     mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
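
A base 36 heatmap string like the one above uses one character (`0`-`9`, `a`-`z`) per bucket. The sketch below shows one way to produce and read such a string. How the real handler maps counts to the 36 levels is not stated in the deck, so the linear scaling against the largest count is an assumption, and the `Heatmap` object is a hypothetical name.

```scala
object Heatmap {
  /** Scales each count to a level 0..35 relative to the largest count
    * (assumed scaling) and emits one base 36 digit per bucket. */
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max <= 0) 0 else math.round(c.toDouble * 35 / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  /** Turns a heatmap string back into levels 0..35. */
  def decode(heatmap: String): Seq[Int] =
    heatmap.map(ch => Character.digit(ch, 36))
}
```

One character per bucket keeps the payload at roughly one byte per facet, which is much smaller than a JSON object with a label and a count for each bucket.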




                                                                                      13
runtime facet optimization
 16 decades, 160 years, 1,920 months, 58,560 days
 ▪ 60,656 facets in total
 ▪ Worst case: number of DocSet.exists(doc) checks
   • Originally: 25M x 60k = 1.5E12 checks, 60k per doc
   • Now: on average half of each level is checked,
     0.5 x (16 + 10 + 12 + 31) = 34.5 per doc
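
The per-document numbers can be checked with a few lines. The level sizes (16 decades, 10 years per decade, 12 months per year, at most 31 days per month) are inferred from the hierarchy on this slide; the names in the sketch are illustrative.

```scala
object FacetChecks {
  // Candidates examined at each level of the decade/year/month/day hierarchy
  val levelSizes: Seq[Int] = Seq(16, 10, 12, 31)

  /** Flat evaluation: every facet is tested for every document. */
  val flatPerDoc: Int = 16 + 160 + 1920 + 58560

  /** Top-down worst case: every candidate at every level is tested. */
  val topDownWorstPerDoc: Int = levelSizes.sum

  /** Top-down average: the match is found halfway through each level. */
  val topDownAveragePerDoc: Double = levelSizes.map(_ * 0.5).sum

  def main(args: Array[String]): Unit =
    println(s"flat=$flatPerDoc worst=$topDownWorstPerDoc average=$topDownAveragePerDoc")
}
```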
                                                          14
optimization
 ▪ Custom facet runtime Collector
   • Break if facet matched
       ➢ single value per doc per facet
       ➢ each doc has only 1 day
   • Top-down facet selection
       ➢ decade – year – month – day
 ▪ Performance for 1850 docs and 60k docs
   improved from 300 ms to 10 ms
 ▪ Custom optimized heatmap JSON
 ▪ Bottleneck now in the client/canvas/JS
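
The two ideas on this slide, stopping at the first matching facet and walking the hierarchy top-down, can be sketched without any Lucene or Solr dependency. In the sketch a plain `Set[Int]` stands in for a Solr `DocSet`, and `FacetNode` and `TopDownFacets` are hypothetical names; it illustrates the technique and is not the project's Collector.

```scala
import scala.collection.mutable

/** A facet with the doc IDs it contains and its child facets (for example a decade and its years). */
final case class FacetNode(label: String, docs: Set[Int], children: Seq[FacetNode] = Nil)

object TopDownFacets {
  /** Counts hits per facet. Returns the counts by label and the number of membership checks made. */
  def count(roots: Seq[FacetNode], hits: Iterable[Int]): (Map[String, Int], Long) = {
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    var checks = 0L
    for (doc <- hits) {
      var level = roots
      while (level.nonEmpty) {
        // find() stops at the first match: each doc has a single value per level
        val matched = level.find { node => checks += 1; node.docs.contains(doc) }
        matched.foreach(node => counts(node.label) += 1)
        // Descend only into the children of the facet that matched
        level = matched.map(_.children).getOrElse(Nil)
      }
    }
    (counts.toMap, checks)
  }
}
```

Only the children of a matching facet are ever tested, so the work per document grows with the depth and fan-out of the hierarchy and not with the total number of facets.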


                                           15
show us or it didn’t happen
 ▪ Web Application
 ▪ iPad App




                                 16
zooming




          17
facet heatmap

        “television”




                       “inflation”




                                     18
conclusions
 ▪ Great exploratory UI
 ▪ Use domain knowledge to optimize for
   performance
   • If you can
 ▪ Next
   • Bring it live on the Web and in the App Store
   • Use it for 1.2M books/CDs/DVDs of Belgium
   • More search options
   • Multipage



                                                    19
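enhancement suggestions
 ▪ Lucene Collector
   • Proposed signature, so that a collector can stop the search early:
     def collect(doc: Int): Boolean

     // Sketch against the proposed signature, not the current Lucene API
     class ExistsCollector extends Collector {
       var exists = false

       // The first hit is enough: record it and return false to stop collecting
       def collect(doc: Int) = {
         exists = true
         false
       }

       def acceptsDocsOutOfOrder() = true
       def setNextReader(reader: IndexReader, base: Int): Unit = {}
       def setScorer(scorer: Scorer): Unit = {}
     }

 ▪ Solr SingleValueFacet
       ➢ Break after first find
       ➢ Automatic order based on #counts?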


                                                                                  20
lessons learned
 ▪ Java Graphics has limitations for large fonts
   (>26,000)
 ▪ Handling large data sets is tricky
   • Indexing
   • Copying
 ▪ There’s technology and there are corporate
   agendas
 ▪ You can always make things 10x faster
   • Lucene is ridiculously fast
       ➢ If you configure it well
   • Using domain knowledge can get you far
                                                  21
thank you




      anne@beyondtrees.com
              @anneveling



                             22
