What's the story with open source?
Searching and monitoring news media with open source technology
Charlie Hull, Flax
BCS IRSG Search Solutions 2010
Photo source: http://www.flickr.com/photos/shironekoeuro/
www.flax.co.uk 2
What is Flax?
Search engine specialists
Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
Based in Cambridge, UK
Contributors to and users of Xapian
Recently selected as UK Authorized Partner by Lucid Imagination
Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation
The challenges
Content is created for publication, not for search
Content isn't published consistently or available to all
Ranking is never simple
“We just want something like Google”
Every system will have to scale beyond its originally planned size
- Every project is different
So how do we build news search?
Indexing
Historical, daily & updates (i.e. later editions)
Must cope with high volume, quickly
Essential metadata – byline, title, source
File format translation not always necessary
BUT Pre-processing sometimes required
Content restriction & embargo data
Solution: lightweight, customisable index scripts using powerful open source libraries
from datetime import datetime  # needed for the date field below

import xapian
import flax.core

# Create a new Xapian database in ./db
db = xapian.WritableDatabase('db', xapian.DB_CREATE)
fm = flax.core.Fieldmap()
fm.language = 'en' # stem for English
fm.setfield('mytext', False) # freetext field
fm.setfield('mydate', True) # filter field
fm.save(db)
doc = fm.document()
doc.index('mytext', "I don't like spam.")
doc.index('mydate', datetime(2010, 2, 3, 12, 0))
fm.add_document(db, doc)
db.flush()
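
The "content restriction & embargo data" item can be pictured as extra fields carried with each article at index time and checked again at search time. The following is only a sketch in plain Python, independent of Xapian and flax.core; the Article record, its field names and the is_visible helper are hypothetical and not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    """Hypothetical record for one story; field names are illustrative only."""
    title: str
    source: str
    body: str
    embargo_until: Optional[datetime] = None  # hidden before this time
    allowed_groups: frozenset = frozenset()   # empty set means unrestricted


def is_visible(article, user_groups, now):
    """Return True if the article may be shown to this user at time `now`."""
    # Embargoed stories stay hidden until the embargo time has passed.
    if article.embargo_until is not None and now < article.embargo_until:
        return False
    # Restricted stories need at least one group in common with the user.
    if article.allowed_groups and not (article.allowed_groups & set(user_groups)):
        return False
    return True
```

In a real index the same two values would more likely be stored as filter fields so that the engine can exclude documents itself, rather than checking each hit afterwards.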
So how do we build news search?
Searching
Free text with Boolean operators
Filters for metadata & date ranges
Combine date and relevance ranking
Faceted search where appropriate
Saved searches & Alerting
'More like this'
Content restriction & embargo filters
Solution: template-based user interface scripts, again using open source libraries
Beware Javascript & older browsers!
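
One way to picture "combine date and relevance ranking" is to blend the engine's relevance score with a recency score that decays with age. This is a sketch under assumptions, not the method used in the projects described here: the function names, the seven-day half-life and the equal weighting are all illustrative choices, and relevance is assumed to be already normalised to the range 0 to 1.

```python
from datetime import datetime


def blended_score(relevance, published, now, half_life_days=7.0, date_weight=0.5):
    """Blend a relevance score in [0, 1] with a recency score in (0, 1].

    The recency score halves every `half_life_days`; both tuning values
    are assumptions chosen for illustration.
    """
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - date_weight) * relevance + date_weight * recency


def rank(hits, now):
    """Sort (doc_id, relevance, published) tuples, best blended score first."""
    return sorted(hits, key=lambda h: blended_score(h[1], h[2], now), reverse=True)
```

With these settings a moderately relevant story from yesterday outranks a highly relevant one from months ago; raising half_life_days or lowering date_weight shifts the balance back towards relevance.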
So how do we build news search?
Administration
Indexing failures common
Logging is essential
Log to text as a first pass, reports later
Scalability
Content is always growing
Both indexing & searching must scale
Open source search libraries provide distributed indexing, replication, remote indexes
Not simple to get this right!
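
The advice to "log to text as a first pass" could look roughly like the sketch below, which uses Python's standard logging module. The logger name, the tab-separated line format and the index_one callable are assumptions for illustration; index_one stands in for whatever actually adds a document to the index.

```python
import logging


def make_index_logger(path):
    """Return a logger that appends tab-separated lines to a text file."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s\t%(levelname)s\t%(message)s"))
    logger.addHandler(handler)
    return logger


def index_all(items, index_one, logger):
    """Index each (doc_id, payload) pair, logging failures and carrying on.

    Returns a (succeeded, failed) pair of counts.
    """
    ok = failed = 0
    for doc_id, payload in items:
        try:
            index_one(payload)
            ok += 1
        except Exception:
            # Record the traceback but keep going: one bad file
            # should not stop the rest of the batch.
            logger.exception("failed to index %s", doc_id)
            failed += 1
    logger.info("indexed=%d failed=%d", ok, failed)
    return ok, failed
```

A plain text log like this can be searched with ordinary tools straight away, and a reporting layer can be built on the same lines later.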
So how do we build news search?
Available open source technologies
Languages – C/C++, Java, Python, JavaScript
Search libraries – Xapian, Lucene
Search bindings/servers – Xappy, Flax.core, Solr
External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!
Some examples
Newspaper Licensing Agency – NLA Clipshare
20 million newspaper stories
6500 users
Content from every major newspaper (and most regionals)
Used by journalists, clippings agencies, media monitors
Replacing internal systems at major newspapers
One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com
Some examples
Financial Times – press cuttings
Web Service for easy integration
XML source data
Faceted search
Area filters (whole article, body, headline, byline or any combination)
Synonyms, spelling suggestions
Built from scratch in a fortnight
Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com
A different task – news monitoring
Non-traditional use of search
Many automated searches on incoming content
Searches reflect complex client needs
False positives require human checking
False negatives should never occur!
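
Monitoring inverts the usual search pattern: the queries are stored and each incoming article is run against all of them. The sketch below illustrates that idea in plain Python. The profile format (lists of "all", "any" and "none" terms) and the function names are hypothetical, and exact word matching stands in for a real query language.

```python
import re


def tokens(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(profile, words):
    """Check one stored profile against the words of one article.

    `profile` is a dict with optional 'all', 'any' and 'none' lists of
    lower-case terms (a made-up format for this sketch).
    """
    if not all(t in words for t in profile.get("all", [])):
        return False
    any_terms = profile.get("any", [])
    if any_terms and not any(t in words for t in any_terms):
        return False
    return not any(t in words for t in profile.get("none", []))


def monitor(article_text, profiles):
    """Return the names of every stored profile the incoming article satisfies."""
    words = tokens(article_text)
    return [name for name, profile in profiles.items() if matches(profile, words)]
```

Looping over every profile for every article is the simplest possible approach; with thousands of profiles and a high article volume, a production system would need something more selective.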
A different task – news monitoring
An example
Durrants Ltd.
Thousands of client search profiles
Hundreds of thousands of articles per day
Complex publication hierarchy
Established pipeline
Solution
Flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting
Supports features of previous engine
Scalable master-slave architecture
Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
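
To give a feel for what tolerating OCR errors involves, the sketch below accepts a word if it is close enough to the search term, using difflib from the Python standard library. This is a stand-in technique and not the query language built for Durrants; the fuzzy_contains name and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_contains(term, text, threshold=0.8):
    """True if any word in `text` is at least `threshold` similar to `term`."""
    term = term.lower()
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False
```

Lowering the threshold catches more OCR damage but also produces more false positives for the human checkers, so the setting is a trade-off rather than a fixed value.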
Why open source?
Flexible, extendable
Powerful & scalable
Lower cost
Commercial support available as necessary
- Freedom to innovate
Looking to the future
More and more content including social media
Multiple delivery platforms
Search-powered websites & applications
'No-SQL'
Cloud
Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice
Thank you!
Questions?
charlie@flax.co.uk
www.flax.co.uk/blog
Twitter: @FlaxSearch
Photo source: http://www.flickr.com/photos/katerha/4259440136/

Mais conteúdo relacionado

Mais procurados

From Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data IntelligenceFrom Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data IntelligenceTrieu Nguyen
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesMarin Dimitrov
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j
 
GraphQL - The new "Lingua Franca" for API-Development
GraphQL - The new "Lingua Franca" for API-DevelopmentGraphQL - The new "Lingua Franca" for API-Development
GraphQL - The new "Lingua Franca" for API-Developmentjexp
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databasesjexp
 
Full Stack Graph in the Cloud
Full Stack Graph in the CloudFull Stack Graph in the Cloud
Full Stack Graph in the CloudNeo4j
 
Slide 3 Fast Data processing with kafka, rfx and redis
Slide 3 Fast Data processing with kafka, rfx and redisSlide 3 Fast Data processing with kafka, rfx and redis
Slide 3 Fast Data processing with kafka, rfx and redisTrieu Nguyen
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to CypherNeo4j
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB
 
Mastering On-Site Search / Custom Site Search
Mastering On-Site Search / Custom Site SearchMastering On-Site Search / Custom Site Search
Mastering On-Site Search / Custom Site SearchRalf Schwoebel
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!Richard Wallis
 
NSGIC 2011 Presentation on geo open source
NSGIC 2011 Presentation on geo open sourceNSGIC 2011 Presentation on geo open source
NSGIC 2011 Presentation on geo open sourceMichael Terner
 
Fire kit ios (r-baldwin)
Fire kit ios (r-baldwin)Fire kit ios (r-baldwin)
Fire kit ios (r-baldwin)DevDays
 
Micro-Servicing Linked Data
Micro-Servicing Linked DataMicro-Servicing Linked Data
Micro-Servicing Linked DataopenCypher
 
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuGraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuNeo4j
 
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...Databricks
 
The Kasabi Information Marketplace
The Kasabi Information MarketplaceThe Kasabi Information Marketplace
The Kasabi Information MarketplaceKnud Möller
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesRichard Wallis
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsRandy Shoup
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiTimothy Spann
 

Mais procurados (20)

From Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data IntelligenceFrom Data Analytics to Fast Data Intelligence
From Data Analytics to Fast Data Intelligence
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL Queries
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
 
GraphQL - The new "Lingua Franca" for API-Development
GraphQL - The new "Lingua Franca" for API-DevelopmentGraphQL - The new "Lingua Franca" for API-Development
GraphQL - The new "Lingua Franca" for API-Development
 
A whirlwind tour of graph databases
A whirlwind tour of graph databasesA whirlwind tour of graph databases
A whirlwind tour of graph databases
 
Full Stack Graph in the Cloud
Full Stack Graph in the CloudFull Stack Graph in the Cloud
Full Stack Graph in the Cloud
 
Slide 3 Fast Data processing with kafka, rfx and redis
Slide 3 Fast Data processing with kafka, rfx and redisSlide 3 Fast Data processing with kafka, rfx and redis
Slide 3 Fast Data processing with kafka, rfx and redis
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to Cypher
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
Mastering On-Site Search / Custom Site Search
Mastering On-Site Search / Custom Site SearchMastering On-Site Search / Custom Site Search
Mastering On-Site Search / Custom Site Search
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!
 
NSGIC 2011 Presentation on geo open source
NSGIC 2011 Presentation on geo open sourceNSGIC 2011 Presentation on geo open source
NSGIC 2011 Presentation on geo open source
 
Fire kit ios (r-baldwin)
Fire kit ios (r-baldwin)Fire kit ios (r-baldwin)
Fire kit ios (r-baldwin)
 
Micro-Servicing Linked Data
Micro-Servicing Linked DataMicro-Servicing Linked Data
Micro-Servicing Linked Data
 
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas SuravarapuGraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
GraphConnect Europe 2016 - Faster Lap Times with Neo4j - Srinivas Suravarapu
 
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 
The Kasabi Information Marketplace
The Kasabi Information MarketplaceThe Kasabi Information Marketplace
The Kasabi Information Marketplace
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of Entities
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
 

Semelhante a What's the story with Open Source?

Five Ways To Calais V01
Five Ways To Calais V01Five Ways To Calais V01
Five Ways To Calais V01Thomas Tague
 
Wuhan Wednesday Discussion Breakout Session Keiser
Wuhan Wednesday Discussion Breakout Session KeiserWuhan Wednesday Discussion Breakout Session Keiser
Wuhan Wednesday Discussion Breakout Session KeiserBEKINC
 
Boost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentBoost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentOntotext
 
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web Strategy
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web StrategyIWMW 2004: Trials, Trips and Tribulations of an Integrated Web Strategy
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web StrategyIWMW
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Cloudsinovex GmbH
 
Search01 /certified fixed orthodontic courses by Indian dental academy
Search01 /certified fixed orthodontic courses by Indian dental academy Search01 /certified fixed orthodontic courses by Indian dental academy
Search01 /certified fixed orthodontic courses by Indian dental academy Indian dental academy
 
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018BookNet Canada
 
Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?therealgaston
 
Content Management
Content ManagementContent Management
Content Managementsanand0
 
Semantic Search on the Public Web with Creative Commons
Semantic Search on the Public Web with Creative CommonsSemantic Search on the Public Web with Creative Commons
Semantic Search on the Public Web with Creative CommonsMike Linksvayer
 
Best Practices in Managing e-resources
Best Practices in Managing e-resourcesBest Practices in Managing e-resources
Best Practices in Managing e-resourcesslimkm
 
02.Branding and identity
02.Branding and identity02.Branding and identity
02.Branding and identityJulian Matthews
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Oxford Seo.Com Presentation
Oxford Seo.Com PresentationOxford Seo.Com Presentation
Oxford Seo.Com PresentationIgorgold
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 

Semelhante a What's the story with Open Source? (20)

Lecture08
Lecture08Lecture08
Lecture08
 
TYPO3 at UNESCO.org
TYPO3 at UNESCO.orgTYPO3 at UNESCO.org
TYPO3 at UNESCO.org
 
Five Ways To Calais V01
Five Ways To Calais V01Five Ways To Calais V01
Five Ways To Calais V01
 
Wuhan Wednesday Discussion Breakout Session Keiser
Wuhan Wednesday Discussion Breakout Session KeiserWuhan Wednesday Discussion Breakout Session Keiser
Wuhan Wednesday Discussion Breakout Session Keiser
 
Boost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentBoost your data analytics with open data and public news content
Boost your data analytics with open data and public news content
 
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web Strategy
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web StrategyIWMW 2004: Trials, Trips and Tribulations of an Integrated Web Strategy
IWMW 2004: Trials, Trips and Tribulations of an Integrated Web Strategy
 
Information update May 2010 Alternative Search engine
Information update May 2010 Alternative Search engineInformation update May 2010 Alternative Search engine
Information update May 2010 Alternative Search engine
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Clouds
 
Search01 /certified fixed orthodontic courses by Indian dental academy
Search01 /certified fixed orthodontic courses by Indian dental academy Search01 /certified fixed orthodontic courses by Indian dental academy
Search01 /certified fixed orthodontic courses by Indian dental academy
 
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018
New from BookNet Canada: BNC BiblioShare - Tim Middleton - Tech Forum 2018
 
Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?
 
Content Management
Content ManagementContent Management
Content Management
 
Semantic Search on the Public Web with Creative Commons
Semantic Search on the Public Web with Creative CommonsSemantic Search on the Public Web with Creative Commons
Semantic Search on the Public Web with Creative Commons
 
Semantic Web, e-commerce
Semantic Web, e-commerceSemantic Web, e-commerce
Semantic Web, e-commerce
 
20minutes Quart
20minutes Quart20minutes Quart
20minutes Quart
 
Best Practices in Managing e-resources
Best Practices in Managing e-resourcesBest Practices in Managing e-resources
Best Practices in Managing e-resources
 
02.Branding and identity
02.Branding and identity02.Branding and identity
02.Branding and identity
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Oxford Seo.Com Presentation
Oxford Seo.Com PresentationOxford Seo.Com Presentation
Oxford Seo.Com Presentation
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 

Mais de Charlie Hull

Lucene, Solr and java 9 - opportunities and challenges
Lucene, Solr and java 9 - opportunities and challengesLucene, Solr and java 9 - opportunities and challenges
Lucene, Solr and java 9 - opportunities and challengesCharlie Hull
 
Making sense of big data
Making sense of big dataMaking sense of big data
Making sense of big dataCharlie Hull
 
Search Solutions 2015: Towards a new model of search relevance testing
Search Solutions 2015:  Towards a new model of search relevance testingSearch Solutions 2015:  Towards a new model of search relevance testing
Search Solutions 2015: Towards a new model of search relevance testingCharlie Hull
 
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015Charlie Hull
 
Bio solr building a better search for bioinformatics
Bio solr   building a better search for bioinformaticsBio solr   building a better search for bioinformatics
Bio solr building a better search for bioinformaticsCharlie Hull
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 

Mais de Charlie Hull (6)

Lucene, Solr and java 9 - opportunities and challenges
Lucene, Solr and java 9 - opportunities and challengesLucene, Solr and java 9 - opportunities and challenges
Lucene, Solr and java 9 - opportunities and challenges
 
Making sense of big data
Making sense of big dataMaking sense of big data
Making sense of big data
 
Search Solutions 2015: Towards a new model of search relevance testing
Search Solutions 2015:  Towards a new model of search relevance testingSearch Solutions 2015:  Towards a new model of search relevance testing
Search Solutions 2015: Towards a new model of search relevance testing
 
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015
BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015
 
Bio solr building a better search for bioinformatics
Bio solr   building a better search for bioinformaticsBio solr   building a better search for bioinformatics
Bio solr building a better search for bioinformatics
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 

What's the story with Open Source?

  • 1. What's the story with open source? Searching and monitoring news media with open source technology Charlie Hull, Flax BCS IRSG Search Solutions 2010 Photo source: http://www.flickr.com/photos/shironekoeuro/
  • 3. www.flax.co.uk 3 What is Flax? Search engine specialists Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd Based in Cambridge UK Contributors to and users of Xapian Recently selected as UK Authorized Partner by Lucid Imagination Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen Apache Lucene and Solr are trademarks of The Apache Software Foundation
  • 5. www.flax.co.uk 5 The challenges Content is created for publication, not for search
  • 6. www.flax.co.uk 6 The challenges Content is created for publication, not for search Content isn't published consistently or available to all
  • 7. www.flax.co.uk 7 The challenges Content is created for publication, not for search Content isn't published consistently or available to all Ranking is never simple
  • 8. www.flax.co.uk 8 The challenges Content is created for publication, not for search Content isn't published consistently or available to all Ranking is never simple “We just want something like Google”
  • 9. www.flax.co.uk 9 The challenges Content is created for publication, not for search Content isn't published consistently or available to all Ranking is never simple “We just want something like Google” Every system will have to scale beyond its originally planned size
  • 10. The challenges: Content is created for publication, not for search; Content isn't published consistently or available to all; Ranking is never simple; “We just want something like Google”; Every system will have to scale beyond its originally planned size; Every project is different
  • 19. So how do we build news search? Indexing: Historical, daily & updates (i.e. later editions); Must cope with high volume, quickly; Essential metadata – byline, title, source; File format translation not always necessary, BUT pre-processing sometimes required; Content restriction & embargo data. Solution: Lightweight, customisable index scripts using powerful open source libraries
  • 20. So how do we build news search?
      from datetime import datetime

      import xapian
      import flax.core

      # Create the Xapian database that will hold the index
      db = xapian.WritableDatabase('db', xapian.DB_CREATE)

      # Describe the fields and store the field map in the database
      fm = flax.core.Fieldmap()
      fm.language = 'en'            # stem for English
      fm.setfield('mytext', False)  # freetext field
      fm.setfield('mydate', True)   # filter field
      fm.save(db)

      # Build one document, index its fields and add it
      doc = fm.document()
      doc.index('mytext', "I don't like spam.")
      doc.index('mydate', datetime(2010, 2, 3, 12, 0))
      fm.add_document(db, doc)
      db.flush()
  • 30. So how do we build news search? Searching: Free text with Boolean operators; Filters for metadata & date ranges; Combine date and relevance ranking; Faceted search where appropriate; Saved searches & alerting; 'More like this'; Content restriction & embargo filters. Solution: Template-based user interface scripts, again using open source libraries. Beware JavaScript & older browsers!
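
One way to picture "combine date and relevance ranking" together with an embargo filter is the minimal sketch below. It is not Flax's or Xapian's API: the `Hit` class, the `blended_score` and `rank` functions and the `half_life_days` parameter are all hypothetical names, and the exponential recency decay is just one plausible blending rule among many.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Hit:
    """A search result as a hypothetical engine might return it."""
    title: str
    relevance: float                           # engine score, higher is better
    published: datetime
    embargo_until: Optional[datetime] = None   # None means no embargo


def blended_score(hit: Hit, now: datetime, half_life_days: float = 7.0) -> float:
    """Scale relevance by a recency factor that halves every half_life_days."""
    age_days = max((now - hit.published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return hit.relevance * recency


def rank(hits: List[Hit], now: datetime, half_life_days: float = 7.0) -> List[Hit]:
    """Drop hits that are still under embargo, then sort by blended score."""
    visible = [h for h in hits
               if h.embargo_until is None or h.embargo_until <= now]
    return sorted(visible,
                  key=lambda h: blended_score(h, now, half_life_days),
                  reverse=True)


now = datetime(2010, 11, 1, 12, 0)
hits = [
    Hit("old but relevant", 10.0, now - timedelta(days=28)),
    Hit("fresh", 2.0, now),
    Hit("embargoed", 50.0, now, embargo_until=now + timedelta(hours=6)),
]
for hit in rank(hits, now):
    print(hit.title)
```

With a seven-day half-life, a four-week-old story keeps only one sixteenth of its relevance score, so the fresh story ranks first here; the embargoed story is filtered out before ranking regardless of its score.
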
  • 34. So how do we build news search? Administration: Indexing failures common; Logging is essential; Log to text as a first pass, reports later. Scalability: Content is always growing; Both indexing & searching must scale; Open source search libraries provide distributed indexing, replication, remote indexes. Not simple to get this right!
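
The "log to text as a first pass" advice could look roughly like the sketch below, which uses only Python's standard `logging` module. The function names `make_index_logger` and `index_batch`, the tab-separated line format and the `index_one` callback are assumptions made for illustration; they are not taken from the talk.

```python
import logging


def make_index_logger(log_path):
    """Return a logger that writes one tab-separated line per event to log_path."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s\t%(levelname)s\t%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep indexing noise out of the root logger
    return logger


def index_batch(articles, index_one, logger):
    """Index (article_id, text) pairs; log each failure and carry on."""
    failed = []
    for article_id, text in articles:
        try:
            index_one(article_id, text)
            logger.info("indexed\t%s", article_id)
        except Exception:
            # logger.exception records the traceback along with the message
            logger.exception("failed\t%s", article_id)
            failed.append(article_id)
    return failed
```

Because each event is a plain tab-separated line, the later reporting step can start as something as simple as counting the lines that contain `failed`, and the list returned by `index_batch` gives a retry queue.
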
  • 36. So how do we build news search? Available open source technologies: Languages – C/C++, Java, Python, JavaScript; Search libraries – Xapian, Lucene; Search bindings/servers – Xappy, Flax.core, Solr; External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...; Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ... We can use whatever works!
  • 38. Some examples. Newspaper Licensing Agency – NLA Clipshare: 20 million newspaper stories; 6500 users; Content from every major newspaper (and most regionals); Used by journalists, clippings agencies, media monitors; Replacing internal systems at major newspapers; One of very few ways to search content from all the papers within hours of publication. http://www.nla-clipshare.com
  • 43. Some examples. Financial Times – press cuttings: Web Service for easy integration; XML source data; Faceted search; Area filters (whole article, body, headline, byline or any combination); Synonyms, spelling suggestions; Built from scratch in a fortnight; Designed as a prototype, scaled to production use without significant change. http://presscuttings.ft.com
  • 49. A different task – news monitoring: Non-traditional use of search; Many automated searches on incoming content; Searches reflect complex client needs; False positives require human checking; False negatives should never occur!
  • 53. A different task – news monitoring. An example: Durrants Ltd. Thousands of client search profiles; Hundreds of thousands of articles per day; Complex publication hierarchy; Established pipeline. Solution: Flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting; Supports features of previous engine; Scalable master-slave architecture. Accuracy improved in some cases from 95% rejected to 95% accepted; Hardware budget 15% of previous system
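
Monitoring turns search around: the queries are stored and each incoming article is tested against them. The sketch below illustrates that idea under heavy simplification and is not the Durrants system or its query language. It swaps in Python's standard `difflib` for fuzzy matching so that OCR errors still match; the profile format (a name mapped to terms that must all appear), the function names and the `cutoff` value are assumptions.

```python
import difflib
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_matches(term, words, cutoff=0.8):
    """True if some word is similar enough to term (tolerates OCR slips)."""
    return bool(difflib.get_close_matches(term, words, n=1, cutoff=cutoff))


def matching_profiles(article_text, profiles, cutoff=0.8):
    """Return the names of profiles whose terms all appear in the article."""
    words = sorted(set(tokens(article_text)))
    return [name for name, terms in profiles.items()
            if all(term_matches(t.lower(), words, cutoff) for t in terms)]


profiles = {
    "telecoms": ["vodafone", "broadband"],
    "retail": ["tesco"],
}
# "Vodaf0ne" contains a typical OCR error: a zero in place of the letter o
article = "Vodaf0ne announces new broadband tariff"
print(matching_profiles(article, profiles))
```

Lowering `cutoff` trades false negatives for false positives, which fits the priorities on the earlier slide: false positives can be caught by human checking, while a missed article cannot. Looping over every profile for every article would not scale to thousands of profiles and hundreds of thousands of articles per day, which is where the search engine earns its place.
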
  • 58. Why open source? Flexible, extendable; Powerful & scalable; Lower cost; Commercial support available as necessary; Freedom to innovate
  • 66. Looking to the future: More and more content including social media; Multiple delivery platforms; Search-powered websites & applications; 'No-SQL'; Cloud; Search no longer a bolt-on, but a platform for innovation; Open source no longer an outsider, but the obvious choice