Text Tagging with Finite State Transducers
David Smiley
Software Systems Engineer, Lead
Lucene/Solr Revolution 2013
© 2012 The MITRE Corporation. All rights reserved.
About David Smiley
 Working at MITRE for 13 years
   web development, Java, search
 Published 1st book on Solr; then 2nd edition (2009, 2011)
 Apache Lucene / Solr committer/PMC member (2012)
 Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
 Taught Solr classes at MITRE (2010, 2011, 2012)
 Solr search consultant within MITRE and its sponsors, and privately
What is “Text Tagging” and “FSTs”?
 First, I need to establish the context:
   JIEDDO’s OpenSextant project
   Though this presentation is not about OpenSextant or geotagging
 Ultimately, I want to convey how cool Lucene’s FSTs are
 And you may have a need for a text tagger
   Or a geotagger (like OpenSextant)
OpenSextant
A DoD Funded Project: JIEDDO/COIC & NGA
Open Source approval recently obtained
OpenSextant Project
 A geotagging solution for unstructured text
 Finds place name references in natural language
   “… I live near Boston …”
   Finds “Boston” with input character offsets
   Often resolves to multiple gazetteer entries: “Boston” has 73
 What’s a Gazetteer?
   A dictionary of place names with metadata like latitude & longitude
How does it work?
The “Naïve” Tagger
 AKA “Text Tagger”
 Simply consults a dictionary/gazetteer; no fancy NLP
   There’s nothing geospatial about it
   Subsequent NLP processing eliminates low-confidence tags
 Actually, not so simple
   Names vary in word length
   Must find overlapping names
   but not names within names
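The "overlapping names, but not names within names" rule can be illustrated with a small span filter. This is a minimal sketch, not OpenSextant's code: the function name `drop_nested` and the use of (start, end) word positions with an exclusive end are assumptions made for illustration.

```python
def drop_nested(spans):
    """Keep spans that are not fully contained in another span.

    Each span is a (start, end) pair with an exclusive end.
    Overlapping spans survive; a span inside a longer one is dropped.
    """
    unique = sorted(set(spans))
    kept = []
    for s, e in unique:
        contained = any(
            (s2 <= s and e <= e2) and (s2, e2) != (s, e)
            for s2, e2 in unique
        )
        if not contained:
            kept.append((s, e))
    return kept


# Hypothetical matches over the words "new york city hall":
# (0, 2) "new york", (0, 3) "new york city", (1, 2) "york", (2, 4) "city hall"
matches = [(0, 2), (0, 3), (1, 2), (2, 4)]
print(drop_nested(matches))
```

Here "new york" and "york" sit inside "new york city" and are dropped, while "city hall" only overlaps it and is kept.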
The Gazetteer
 13 million place name records
 8.1M distinct place names
   Why not 13M?
     Ambiguous names (e.g. San Diego)
     Text analysis normalization (e.g. diacritic removal)
   2.8M are single-word names (about 1/3)
   2.3 avg. words / name
   14 avg. chars / name
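To see why there are fewer distinct names than records, here is a minimal sketch of normalization collapsing records. The records are invented and `normalize_name` is a hypothetical stand-in for the real text analysis chain; it only lowercases and strips diacritics.

```python
import unicodedata


def normalize_name(name):
    """Lowercase a name and strip diacritics (a stand-in for real text analysis)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical gazetteer records: (primary key, place name).
records = [(1, "San Diego"), (2, "San Diego"), (3, "Québec"), (4, "Quebec")]

# Group primary keys under their normalized name.
distinct = {}
for key, name in records:
    distinct.setdefault(normalize_name(name), []).append(key)

print(len(records), "records ->", len(distinct), "distinct names")
```

Both effects from the slide show up: two places share the name "San Diego", and "Québec" and "Quebec" become the same key once diacritics are removed.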
3 Naïve Tagger Implementations
 GATE’s Tagger
   In-memory Aho-Corasick string-matching algorithm
   Requires an estimated 80 GB RAM !! (for our data)
   FAST
 A JIEDDO-developed MySQL-based Tagger
   “Reasonable” RAM requirements: ~4 GB
   SLOW (roughly 15x to 20x slower; not certain), ~1 doc/second
 A JIEDDO-developed Solr/FST-based Tagger …
Finite State Transducers
Applied to text tagging
Finite State Automata (FSA)
 SortedSet<char[]>:
 mop, moth, pop, slop, sloth, stop, top
Note: a “Trie” data structure is similar but only shares prefixes
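The difference between a trie and a minimal automaton can be checked on the seven words above. The sketch below is illustrative only and is not Lucene's builder: it builds a plain trie, then counts how many states remain when nodes that accept the same set of suffixes are merged. All function names are assumptions.

```python
WORDS = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]


def build_trie(words):
    """Build a prefix-sharing trie of nested dicts."""
    root = {"final": False, "arcs": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"final": False, "arcs": {}})
        node["final"] = True
    return root


def count_trie_nodes(node):
    """Count every node in the trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["arcs"].values())


def count_minimal_states(root):
    """Count states after merging nodes that accept the same set of suffixes."""
    registry = {}

    def state_id(node):
        # Two nodes are equivalent when they agree on finality and on
        # every (label, target state) pair of their outgoing arcs.
        signature = (
            node["final"],
            tuple(sorted((ch, state_id(child)) for ch, child in node["arcs"].items())),
        )
        return registry.setdefault(signature, len(registry))

    state_id(root)
    return len(registry)


trie = build_trie(WORDS)
print("trie nodes:", count_trie_nodes(trie))
print("minimal FSA states:", count_minimal_states(trie))
```

For these words the trie needs 21 nodes, while sharing suffixes such as "op" and "oth" brings the automaton down to 8 states.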
Finite State Transducer (FST)
 Adds optional output to each arc
 SortedMap<char[],int>
 mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
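The idea of outputs on arcs can be sketched as follows: a key's value is the sum of the outputs along its path. This is a simplified illustration that assumes non-negative integer outputs and uses a trie, so it omits the suffix sharing a real FST performs. The names `new_node`, `insert` and `lookup` are hypothetical and are not Lucene's API.

```python
def new_node():
    return {"arcs": {}, "final": False, "final_out": 0}


def insert(root, key, value):
    """Add key -> value, keeping outputs as close to the root as possible."""
    node, remaining = root, value
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            # New arc: it carries whatever output is still unassigned.
            arc = node["arcs"][ch] = [remaining, new_node()]
            remaining = 0
        else:
            common = min(arc[0], remaining)
            excess = arc[0] - common
            if excess:
                # Push the excess down so existing keys keep their sums.
                child = arc[1]
                for child_arc in child["arcs"].values():
                    child_arc[0] += excess
                if child["final"]:
                    child["final_out"] += excess
                arc[0] = common
            remaining -= common
        node = arc[1]
    node["final"] = True
    node["final_out"] = remaining


def lookup(root, key):
    """Return the summed output for key, or None if key is absent."""
    node, total = root, 0
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None
        total += arc[0]
        node = arc[1]
    return total + node["final_out"] if node["final"] else None


PAIRS = [("mop", 0), ("moth", 1), ("pop", 2), ("slop", 3),
         ("sloth", 4), ("stop", 5), ("top", 6)]
root = new_node()
for key, value in PAIRS:
    insert(root, key, value)

print({key: lookup(root, key) for key, _ in PAIRS})
print({ch: arc[0] for ch, arc in root["arcs"].items()})
```

With the sorted input above, the arcs leaving the root carry 0, 2, 3 and 6 for m, p, s and t, and deeper arcs add only the small remainder (for example +1 on the "t" of "moth").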
Lucene’s FST Implementation
 FST encoded as a byte[]
   Memory efficient! And fast to load from disk.
 Write-once API (immutable)
 Build minimal, acyclic FST from pre-sorted inputs
   Fast (linear time with input size), low memory
   Optional two-pass packing can shrink by ~25%
 SortedMap<int[],T>: arcs are sorted by label
   getByOutput also possible if outputs are sorted
 http://s.apache.org/LuceneFSTs
Based on a Mihov & Maurel paper, 2001
FSTs and Text Tagging
 My approach involves two layers of FSTs:
 A word dictionary FST to hold each unique word
   Enables using integers as substitutes for char[]
   Via getByOutput(12345) -> “New”
   Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
 A word phrase FST composed of word id string keys
   Ex: “New York City” -> [12345, 5522111, 345]
   Values are arrays of gazetteer primary keys
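The two layers can be sketched with ordinary Python containers standing in for the FSTs. This is only an illustration of the data layout under stated assumptions: a sorted word list plays the word dictionary (a word's id is its ordinal), a dict keyed by tuples of word ids plays the phrase FST, and the gazetteer primary keys are invented.

```python
from bisect import bisect_left

# Hypothetical gazetteer: place name -> primary keys.
names = {
    "New York": [101],
    "New York City": [102, 103],
    "York": [104],
}

# Layer 1: word dictionary. Sorted unique words; a word's id is its ordinal.
words = sorted({w for name in names for w in name.split()})


def word_id(word):
    """Word -> ordinal id, or None if the word is unknown."""
    i = bisect_left(words, word)
    return i if i < len(words) and words[i] == word else None


def word_for_id(ordinal):
    """Ordinal id -> word (the reverse lookup, like getByOutput)."""
    return words[ordinal]


# Layer 2: phrase dictionary keyed by sequences of word ids.
phrases = {
    tuple(word_id(w) for w in name.split()): keys
    for name, keys in names.items()
}


def lookup(name):
    """Phrase -> gazetteer primary keys, or None if there is no such phrase."""
    ids = tuple(word_id(w) for w in name.split())
    return phrases.get(ids)


print(words)
print(lookup("New York City"))
print([word_for_id(i) for i in (1, 2, 0)])
```

Because phrase keys hold small integers rather than characters, a word that appears in many names is stored as text only once, in the word dictionary.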
Memory Use
 Word Dict FST:
   3.3M words with ordinal ids in 26 MB of RAM
 Name Phrase FST:
   8.1M word id phrases in 90 MB of RAM
   Plus 82 MB of arrays of gazetteer primary key ids
 Total: 198 MB (compare to 80 GB for GATE Aho-Corasick)
 Building it consumes ~1.5 GB of Java heap for 2 minutes
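The memory budget can be checked with a few lines of arithmetic. The figures come from the slide; treating 80 GB as 80 × 1024 MB is an assumption, so the ratio is only approximate.

```python
# Figures from the slide, in MB.
word_dict_mb = 26
phrase_fst_mb = 90
key_arrays_mb = 82
total_mb = word_dict_mb + phrase_fst_mb + key_arrays_mb

# Assumption: the 80 GB estimate for the Aho-Corasick tagger is 80 * 1024 MB.
gate_mb = 80 * 1024
ratio = gate_mb / total_mb

print(f"total: {total_mb} MB, roughly {ratio:.0f}x smaller than 80 GB")
```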
Experimental measurements
 Single FST Experiment
   1 FST of analyzed character word phrase -> int id
   “new york city” -> 6344207
   Theory: more than 2x the memory
   Result: 69 MB! (compare to 26+90 = 116 MB), a 41% reduction
 Retrospective: What I would have done differently
   Index a field of concatenated terms (custom TokenFilter). More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.
   Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
Tagging Algorithm
It’s complicated! Single-pass (streaming) algorithm
 For each input word, look up its ordinal id, then:
  1. Create an FST arc iterator for the name phrase FST
  2. Append the iterator onto a queue of active ones
  3. Try to advance all iterators
      Remove those that don’t advance
Iterator linked-list queue:
  Head: New, York, City ✔
  Head+1: York, City
  Head+2: City …
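The steps above can be sketched in a few lines. This is a simplified illustration, not the SolrTextTagger implementation: a trie of nested dicts stands in for the name phrase FST, words are used directly instead of ordinal ids, positions are word positions rather than character offsets, and the dictionary ids are invented. It emits every match, so names within names would still have to be filtered afterwards.

```python
def build_phrase_trie(names):
    """names maps a phrase (words separated by spaces) to a list of ids."""
    root = {"arcs": {}, "ids": None}
    for phrase, ids in names.items():
        node = root
        for word in phrase.lower().split():
            node = node["arcs"].setdefault(word, {"arcs": {}, "ids": None})
        node["ids"] = ids
    return root


def tag(words, root):
    """Single pass over words; returns (start, end, ids) with an exclusive end."""
    active = []  # (start position, current trie node), oldest first
    tags = []
    for pos, word in enumerate(words):
        active.append((pos, root))  # a new cursor starts at every word
        still_active = []
        for start, node in active:
            nxt = node["arcs"].get(word)
            if nxt is None:
                continue  # this cursor could not advance: drop it
            if nxt["ids"] is not None:
                tags.append((start, pos + 1, nxt["ids"]))
            still_active.append((start, nxt))
        active = still_active
    return tags


# Hypothetical dictionary and input.
NAMES = {"New York": [1], "New York City": [2], "York": [3], "City Hall": [4]}
trie = build_phrase_trie(NAMES)
words = "i live in new york city hall".split()
for start, end, ids in tag(words, trie):
    print(start, end, " ".join(words[start:end]), ids)
```

Each input word is looked at once per active cursor, and a cursor lives only as long as the dictionary still has a longer name to offer, so the queue stays short on ordinary text.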
Speed Benchmarks
Tagger                             Docs/Sec   RAM (GB)
OpenSextant: GATE Tagger           ?          80
OpenSextant: MySQL based Tagger    1.1        4
OpenSextant: Solr/FST Tagger       15.9       2*
Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
Integrated with Solr
 As a custom Solr Request Handler
   Builds the FSTs from the index (the gazetteer)
 Configurable
   Text analysis (e.g. phonetic)
   Exclude gazetteer docs by configured query
   Optional partial word phrase matching
   Optional sub-tags tagging
 Solr integration benefits
   Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston"
{
  "responseHeader":{
    "status":0,
    "QTime":1898},
  "tagsCount":1,
  "tags":[[
    "startOffset",12,
    "endOffset",18,
    "ids",[1190927,
      1099063,
      2562742,
      2667203,
      2684629,
      2695904,
      2653982,
      2657690,
      2585165,
      2597292,
      …
      … 11890986,
      11891415]]],
  "matchingDocs":{"numFound":73,"start":0,"docs":[
    {
      "id":12719030,
      "place_id":"USGS1893700",
      "name":"Boston",
      "lat":65.01667,
      "lon":-163.28333,
      "feat_class":"L",
      "feat_code":"AREA",
      "FIPS_cc":"US",
      "ISO_cc":["US"],
      "cc":"US",
      "ISO3_cc":"USA",
      "adm1":"US02",
      "adm2":"US02.0180",
      "name_bias":0.0,
      "id_bias":0.04,
      "geo":"65.01667,-163.28333"},
    …
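Each entry in "tags" in the response shown above is a flat list of alternating names and values. A client might turn these into dicts and recover the matched text from the offsets, as in this minimal sketch. The helper name `parse_tags` and the added `matchText` key are assumptions, and the response literal is an abbreviated, hand-copied subset of the output shown.

```python
def parse_tags(response, text):
    """Turn each flat [name, value, name, value, ...] tag into a dict."""
    parsed = []
    for flat in response["tags"]:
        tag = dict(zip(flat[0::2], flat[1::2]))
        # Offsets are character positions into the posted text (end exclusive).
        tag["matchText"] = text[tag["startOffset"]:tag["endOffset"]]
        parsed.append(tag)
    return parsed


text = "I live near Boston"
# Abbreviated subset of the response; the real "ids" list is much longer.
response = {
    "tagsCount": 1,
    "tags": [["startOffset", 12, "endOffset", 18, "ids", [1190927, 1099063]]],
}

for tag in parse_tags(response, text):
    print(tag["matchText"], tag["startOffset"], tag["endOffset"], len(tag["ids"]))
```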
Where can you get this?
 https://github.com/openSextant/SolrTextTagger
 An independent module of OpenSextant
 Might seek incubator status at http://www.osgeo.org
 Includes documentation, tests
Concluding Remarks
 Lucene FSTs are awesome!
   Great for storing large numbers of strings in-memory
   Or other string-like data: e.g. IP addresses, geohashes
   The API is hard to use, however
 The text tagger should be useful independent of OpenSextant
   Tag people/org names or special keywords
   Might be ported to Lucene as an alternative to its synonym token filter
 I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
CONTACT
David Smiley
dsmiley@mitre.org

 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 

Último (20)

FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 

Text tagging with finite state transducers

  • 1. TEXT TAGGING WITH FINITE STATE TRANSDUCERS David Smiley Software Systems Engineer, Lead
  • 2. Text Tagging with Finite State Transducers David Smiley Lucene/Solr Revolution 2013 © 2012 The MITRE Corporation. All rights reserved.
  • 3. About David Smiley
      - Working at MITRE for 13 years
      - Web development, Java, search
      - Published 1st book on Solr; then 2nd edition (2009, 2011)
      - Apache Lucene / Solr committer/PMC member (2012)
      - Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
      - Taught Solr classes at MITRE (2010, 2011, 2012)
      - Solr search consultant within MITRE and its sponsors, and privately
  • 4. What is “Text Tagging” and “FSTs”?
      - First, I need to establish the context:
          - JIEDDO’s OpenSextant project
          - Though this presentation is not about OpenSextant or geotagging
      - Ultimately, I want to convey how cool Lucene’s FSTs are
      - And you may have a need for a text tagger
      - Or a geotagger (like OpenSextant)
  • 5. OpenSextant: a DoD funded project (JIEDDO/COIC & NGA). Open Source approval recently obtained.
  • 6. OpenSextant Project
      - A geotagging solution for unstructured text
      - Finds place name references in natural language
          - “… I live near Boston … ”
          - Finds “Boston” with input character offset #s
          - Often resolves to multiple gazetteer entries: “Boston” has 73
      - What’s a Gazetteer?
          - A dictionary of place names with metadata like latitude & longitude
  • 7.
  • 8. How does it work?
  • 9. The “Naïve” Tagger
      - AKA “Text Tagger”
      - Simply consults a dictionary/gazetteer; no fancy NLP
      - There’s nothing geospatial about it
      - Subsequent NLP processing eliminates low-confidence tags
      - Actually, not so simple
          - Names vary in word length
          - Must find overlapping names
          - but not names within names
  • 10. The Gazetteer
      - 13 million place name records
      - 8.1M distinct place names
      - Why not 13M?
          - Ambiguous names (e.g. San Diego)
          - Text analysis normalization (e.g. diacritic removal, etc.)
      - 2.8M are single-word names (1/3rd)
      - 2.3 avg. words / name
      - 14 avg. chars / name
  • 11. 3 Naïve Tagger Implementations
      - GATE’s Tagger
          - In-memory Aho-Corasick string-matching algorithm
          - Requires an estimated 80 GB RAM !! (for our data)
          - FAST
      - A JIEDDO developed MySQL based Tagger
          - “Reasonable” RAM requirements ~4GB
          - SLOW (~15x, 20x? not certain). ~1 doc/second
      - A JIEDDO developed Solr/FST based Tagger …
  • 13. Finite State Automata (FSA)
      - SortedSet<char[]>:
          - mop, moth, pop, slop, sloth, stop, top
      - Note: a “Trie” data structure is similar but only shares prefixes
  • 14. Finite State Transducer (FST)
      - Adds optional output to each arc
      - SortedMap<char[],int>
          - mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
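
The `SortedMap<char[],int>` view on slide 14 can be illustrated with plain JDK collections. The sketch below is not an FST and does not use Lucene; it only reproduces the mapping the slide lists (each word paired with its rank in sorted order) plus a reverse lookup. The class name `OrdinalMapSketch` and both method names are made up for this illustration.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class OrdinalMapSketch {
    /** Maps each distinct word to its position in sorted order. */
    static SortedMap<String, Integer> build(Collection<String> words) {
        SortedMap<String, Integer> map = new TreeMap<>();
        int ordinal = 0;
        for (String word : new TreeSet<>(words)) { // TreeSet sorts and de-duplicates
            map.put(word, ordinal++);
        }
        return map;
    }

    /** Reverse lookup by ordinal; a linear scan, which is fine for a sketch. */
    static String getByOutput(SortedMap<String, Integer> map, int ordinal) {
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            if (e.getValue() == ordinal) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> map =
                build(List.of("top", "sloth", "mop", "stop", "pop", "moth", "slop"));
        System.out.println(map);                 // word -> ordinal, in sorted key order
        System.out.println(getByOutput(map, 4)); // the word whose ordinal is 4
    }
}
```

A `TreeMap` stores every key in full, so it gives none of the prefix and suffix sharing that makes the FST compact; it only shows the shape of the data.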
  • 15. Lucene’s FST Implementation
      - FST encoded as a byte[]
          - Memory efficient! And fast to load from disk.
      - Write-once API (immutable)
      - Build minimal, acyclic FST from pre-sorted inputs
          - Fast (linear time with input size), low memory
          - Optional two-pass packing can shrink by ~25%
      - SortedMap<int[],T>: arcs are sorted by label
      - getByOutput also possible if outputs are sorted
      - http://s.apache.org/LuceneFSTs
      - Based on a Mihov & Maurel paper, 2001
  • 16. FSTs and Text Tagging
      - My approach involves two layers of FSTs:
      - A word dictionary FST to hold each unique word
          - Enables using integers as substitutes for char[]
          - Via getByOutput(12345) -> “New”
          - Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
      - A word phrase FST comprised of word id string keys
          - Ex: “New York City” -> [12345, 5522111, 345]
          - Values are arrays of gazetteer primary keys
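
The two layers on slide 16 can be sketched with ordinary maps standing in for the two FSTs. This is an assumption-laden illustration, not the tagger's code: `TwoLayerDictionarySketch` and its methods are hypothetical, word ids are handed out in insertion order rather than as sorted ordinals, and the gazetteer keys in the test values are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoLayerDictionarySketch {
    // Layer 1: word <-> integer id (stands in for the word dictionary FST).
    private final Map<String, Integer> wordToId = new HashMap<>();
    private final List<String> idToWord = new ArrayList<>();
    // Layer 2: sequence of word ids -> gazetteer primary keys (stands in for the phrase FST).
    private final Map<List<Integer>, List<Integer>> phraseToKeys = new HashMap<>();

    /** Returns the id for a word, assigning the next free id on first sight. */
    int wordId(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = idToWord.size();
            wordToId.put(word, id);
            idToWord.add(word);
        }
        return id;
    }

    /** Reverse lookup, in the spirit of the slide's getByOutput(12345) -> "New". */
    String getByOutput(int id) {
        return idToWord.get(id);
    }

    /** Registers a space-separated name under one (hypothetical) gazetteer key. */
    void addName(String name, int gazetteerKey) {
        List<Integer> phrase = new ArrayList<>();
        for (String word : name.split(" ")) {
            phrase.add(wordId(word));
        }
        phraseToKeys.computeIfAbsent(phrase, k -> new ArrayList<>()).add(gazetteerKey);
    }

    /** Returns all gazetteer keys for an exact name, or an empty list. */
    List<Integer> lookup(String name) {
        List<Integer> phrase = new ArrayList<>();
        for (String word : name.split(" ")) {
            Integer id = wordToId.get(word);
            if (id == null) {
                return List.of(); // an unknown word cannot be part of any name
            }
            phrase.add(id);
        }
        return phraseToKeys.getOrDefault(phrase, List.of());
    }
}
```

Keying the phrase layer on integer sequences rather than on strings is the point of the first layer: each distinct word is stored once, and the phrase layer only stores ids.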
  • 17. Memory Use
      - Word Dict FST:
          - 3.3M words with ordinal ids in 26MB of RAM
      - Name Phrase FST:
          - 8.1M word id phrases in 90 MB of RAM
          - Plus 82MB of arrays of gazetteer primary key ids
      - Total: 198 MB (compare to 80GB GATE Aho-Corasick)
      - Building it consumes ~1.5GB Java heap, for 2 minutes
  • 18. Experimental measurements
      - Single FST Experiment
          - 1 FST of analyzed character word phrase -> int id
          - “new york city” -> 6344207
          - Theory: more than 2x the memory
          - Result: 69 MB! (compare to 26+90) 41% reduction
      - Retrospective: What I would have done differently
          - Index a field of concatenated terms (custom TokenFilter).
          - More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.
          - Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
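
The "41% reduction" on slide 18 follows from the figures on slides 17 and 18, assuming the comparison is between the single 69 MB FST and the two FSTs of 26 MB and 90 MB (the 82 MB of key arrays is left out, as the slide's "compare to 26+90" suggests):

```latex
\[
1 - \frac{69}{26 + 90} = 1 - \frac{69}{116} \approx 0.405 \approx 41\%
\]
```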
  • 19. Tagging Algorithm
      It’s complicated! Single-pass (streaming) algorithm
      - For each input word, look up its ordinal id, then:
          1. Create an FST arc iterator for name phrase
          2. Append the iterator onto a queue of active ones
          3. Try to advance all iterators
      - Remove those that don’t advance
      Iterator linked-list queue:
          Head: New, York, City ✔
          Head+1: York, City
          Head+2: City
          …
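
The three steps on slide 19 can be sketched as a single pass over the input words. This is a simplified stand-in rather than the tagger's implementation: a word-keyed trie replaces the name phrase FST, words are matched as strings instead of ordinal ids, and every complete name is reported, so the "but not names within names" rule from slide 9 is not applied. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class StreamingTaggerSketch {
    /** Trie node keyed by word; stands in for the name phrase FST. */
    static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean isName; // true if the path from the root spells a complete name
    }

    /** A partial match: the word index where it began and its current trie position. */
    static final class Cursor {
        final int startWord;
        Node node;
        Cursor(int startWord, Node node) {
            this.startWord = startWord;
            this.node = node;
        }
    }

    final Node root = new Node();

    void addName(String... words) {
        Node n = root;
        for (String w : words) {
            n = n.next.computeIfAbsent(w, k -> new Node());
        }
        n.isName = true;
    }

    /** Returns {startWord, endWordExclusive} pairs for every name found. */
    List<int[]> tag(List<String> words) {
        List<int[]> tags = new ArrayList<>();
        LinkedList<Cursor> active = new LinkedList<>();
        for (int i = 0; i < words.size(); i++) {
            active.addLast(new Cursor(i, root)); // steps 1 and 2: new cursor joins the queue
            Iterator<Cursor> it = active.iterator();
            while (it.hasNext()) {               // step 3: try to advance every cursor
                Cursor c = it.next();
                Node advanced = c.node.next.get(words.get(i));
                if (advanced == null) {
                    it.remove();                 // did not advance: drop it
                    continue;
                }
                c.node = advanced;
                if (advanced.isName) {
                    tags.add(new int[] {c.startWord, i + 1});
                }
            }
        }
        return tags;
    }
}
```

Because each cursor only ever moves forward, the input is read once and the queue holds at most one cursor per still-viable starting word, which is what lets the approach stream.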
  • 20. Speed Benchmarks
      Tagger                              Docs/Sec   RAM (GB)
      OpenSextant: GATE Tagger            ?          80
      OpenSextant: MySQL based Tagger     1.1        4
      OpenSextant: Solr/FST Tagger        15.9       2*
      Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
  • 21. Integrated with Solr
      - As a custom Solr Request Handler
      - Builds the FSTs from the index (the gazetteer)
      - Configurable
          - Text analysis (e.g. phonetic)
          - Exclude gazetteer docs by configured query
          - Optional partial word phrase matching
          - Optional sub-tags tagging
      - Solr integration benefits
          - Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
  • 22. ~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston"
      {
        "responseHeader":{
          "status":0,
          "QTime":1898},
        "tagsCount":1,
        "tags":[[
          "startOffset",12,
          "endOffset",18,
          "ids",[1190927, 1099063, 2562742, 2667203, 2684629, 2695904, 2653982, 2657690, 2585165, 2597292, … … 11890986, 11891415]]],
        "matchingDocs":{"numFound":73,"start":0,"docs":[
          {
            "id":12719030,
            "place_id":"USGS1893700",
            "name":"Boston",
            "lat":65.01667,
            "lon":-163.28333,
            "feat_class":"L",
            "feat_code":"AREA",
            "FIPS_cc":"US",
            "ISO_cc":["US"],
            "cc":"US",
            "ISO3_cc":"USA",
            "adm1":"US02",
            "adm2":"US02.0180",
            "name_bias":0.0,
            "id_bias":0.04,
            "geo":"65.01667,-163.28333"},
          …
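
The same request can be issued from Java with the JDK's built-in `java.net.http` client (Java 11 or later). The sketch assumes the endpoint and parameters shown in the curl command on slide 22 and a Solr instance with the tagger handler running locally; the class name `TagRequestSketch` and the `buildRequest` helper are made up here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TagRequestSketch {
    /** Builds a POST carrying plain text, mirroring the curl flags on the slide. */
    static HttpRequest buildRequest(String handlerUrl, String text) {
        return HttpRequest.newBuilder(URI.create(handlerUrl + "?fl=*&wt=json&indent=2"))
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the tagger request handler is reachable at this URL, as in the slide.
        HttpRequest request = buildRequest("http://localhost:8983/solr/tag", "I live near Boston");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with tags and matching gazetteer docs
    }
}
```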
  • 23. Where can you get this?
      - https://github.com/openSextant/SolrTextTagger
      - An independent module of OpenSextant
      - Might seek incubator status at http://www.osgeo.org
      - Includes documentation, tests
  • 24. Concluding Remarks
      - Lucene FSTs are awesome!
          - Great for storing large amounts of strings in-memory
          - Or other string-like data: e.g. IP addresses, geohashes
          - The API is hard to use, however
      - The text tagger should be useful independent of OpenSextant
          - Tag people/org names or special keywords
          - Might be ported to Lucene as an alternative to its synonym token filter
      - I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
  • 25. CONFERENCE PARTY: The Tipsy Crow, 770 5th Ave. Starts after Stump The Chump. Your conference badge gets you in the door. TOMORROW: Breakfast starts at 7:30. Keynotes start at 8:30. CONTACT: David Smiley, dsmiley@mitre.org