Web scale
Named Entity Mining
"There's simply too much information out there"
WI-IAT 2011
in memoriam of
Herbert A. Simon …
stuck
April 2011
Herbert Simon's Brookings Institute Lecture
"Designing Organizations for an Information-Rich World"
Johns Hopkins University, September 1, 1969
1.Tales & legends
Find & procure a crystal plastic replacement for polycarbonate LEXAN 943
Main constraints:
•more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress
and exposure to detergent agents)
•compatible with existing tools - shrinkage must be close to LEXAN 943
•optical characteristics close to LEXAN 943
•weldable by ultrasonic welding
•compliant with fire & smoke resistance requirements 2 according to NFF16-101/102 and V0 according to standard UL 94
deadline : one week
organization centric search
Where is the SA-24 Grinch 9K338 Igla-S portable air
defense missile system sold or operated ?
location centric search
Recent information (past month)
about the call for proposals
"outils Web innovants en entreprise" (innovative Web tools for business) ?
time centric search
"pro" searches focus on
named entities:
Orgs, People, Location, Time
2.Introducing
WebNEM
relevant
query ?
query
again ?
where ?
+ browsing/ranking
results
Attention-greedy & burdensome
product
specifications
get
manufacturer
or distributor
find
compliant
products
"SA-24 Grinch
9K338 Igla-S"
Goal : Attention-saver process
exploratory data analysis
of high dimensional data
"In exploratory data analysis of high dimensional data
one of the main tasks is the formation of a
simplified, usually visual, overview of data sets.
....
Clustering and projection
are among the examples of useful methods
to achieve this task."
Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their
application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
WebNEM
collection of
relevant data,
anywhere in the web
+ projection on
Named Entities space
topical web crawler
named entity recognition
visualization/exploratory analysis tools
"Web scale" collection : brute force
never-ending crawl
fast answer,
"any" topic
a priori
"whole" Web indexing
general index
"everywhere"
huge resources required
(data size based)
user
query
"Web scale" collection : our approach
"close to optimal" resources
(usage based)
user
query
on-demand topical crawl
delayed answer,
but less garbage
tailored index
anywhere
relevant
built on order
Web slices
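The on-demand topical crawl can be pictured as a best-first traversal that only follows links from pages judged relevant. The sketch below is a minimal illustration under stated assumptions: the relevance score is a plain term-overlap ratio, and `fetch`, `topic_terms`, `max_pages` and `min_score` are hypothetical names, not part of Squido.

```python
import heapq


def topical_crawl(seeds, fetch, topic_terms, max_pages=100, min_score=0.1):
    """Best-first crawl building a small topical collection (a "Web slice").

    `fetch(url)` is supplied by the caller and must return (text, outlinks).
    Relevance is a naive share of topic terms in the page (an assumption).
    """
    terms = {t.lower() for t in topic_terms}
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    collection = []
    while frontier and len(collection) < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        words = text.lower().split()
        score = sum(w in terms for w in words) / len(words) if words else 0.0
        if score < min_score:
            continue  # off-topic page: not indexed, its links are not followed
        collection.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Links inherit the score of the page that cites them.
                heapq.heappush(frontier, (-score, link))
    return collection
```

Dropping off-topic pages before following their links is what keeps the collected slice small compared with a general index, at the price of a delayed answer.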
Projection : when to extract entities ?
Named Entity Recognition is resource intensive
process step    data size            required response time
crawl time      whole web (10^10)    asynchronous
crawl time      web slice (10^4)     asynchronous
query time      collection (10^2)    real-time
www.squido.fr
our SaaS Web mining system
large scale
Named Entity extraction (EN/FR)
beta released to customers
June 2011
WebNEM with Squido
index
focused
crawl
search
topic
shallow
entity extraction
page
cleaning
user
queries
user
collections
deep
entity extraction
visualization
visualization
Page cleaning
instead
of
this
work
on
this
fast heuristic
DOM processing
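One possible form of fast heuristic DOM processing is a link-density filter: blocks whose text is mostly anchor text are treated as navigation and dropped. This is a sketch under that assumption only; the class name, block tags and thresholds are illustrative and are not taken from the presented system.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, chars inside links) per block element. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block and normalize its whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def clean_page(html, max_link_density=0.5, min_chars=25):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

A single pass over the markup with no rendering keeps the extra processing step cheap, which matters when it runs on every crawled page.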
Shallow extraction
detect
language
tokenize
sentence
split
gazetteers grammar
Web
docs
format
parse
index
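The shallow steps (sentence split, tokenize, gazetteers, grammar) can be sketched as follows. This is a toy illustration, not the presented pipeline: the gazetteer contents, the single grammar rule (known first name followed by a capitalized token) and all function names are assumptions, and language detection, format parsing and indexing are left out.

```python
import re

# Tiny illustrative gazetteers; a real system would load large lists.
GAZETTEERS = {
    "Location": {"austin", "taipei", "lisbon"},
    "Organization": {"honda", "argonne national laboratory"},
}
FIRST_NAMES = {"takanobu", "herbert"}

TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s for s in SENTENCE_RE.split(text.strip()) if s]


def tokenize(sentence):
    return TOKEN_RE.findall(sentence)


def shallow_extract(text, max_len=3):
    """Return (entity, type) pairs: longest gazetteer match, then one grammar rule."""
    entities = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        i = 0
        while i < len(tokens):
            match = None
            # Gazetteer lookup, longest candidate first.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                for etype, names in GAZETTEERS.items():
                    if candidate.lower() in names:
                        match = (candidate, etype, n)
                        break
                if match:
                    break
            # Grammar rule: known first name followed by a capitalized token.
            if (match is None and tokens[i].lower() in FIRST_NAMES
                    and i + 1 < len(tokens) and tokens[i + 1][:1].isupper()):
                match = (" ".join(tokens[i:i + 2]), "Person", 2)
            if match:
                entities.append(match[:2])
                i += match[2]
            else:
                i += 1
    return entities
```

Because the grammar rule relies on capitalization, it fails on lower-cased or title-cased text, which is the kind of quality penalty a shallow approach has to accept.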
Deep extraction
POS
tagger
grammar
ortho
matcher index
morpho
analyzer
NP/VP
chunker
≅ shallow extraction + elaborate linguistics
3.Annoyances
Linguistic processing throughput
deep extraction
too expensive
when crawling
shallow
extraction
OK
penalty
on
quality
workaround :
asynch deep extraction
on smaller collections
query time sanitization
Page cleaning
need evaluation
goal : ↗accuracy ? cost : ↘ recall ?
performance impact ?
↘ +1 processing step
↗ less text in later steps
"Multiple dates" usage ?
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>
<DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
<DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>
<DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>
<DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>
<DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
<DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>
<DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>
<DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>
<DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>
<DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>
<DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
<DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>
<DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>
<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
retrieve
by date
sort
by date
?
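Retrieving and sorting by date becomes straightforward once the annotations above are turned into comparable values. The sketch below parses the `<DATE>` markup with a regular expression; anchoring month-only dates to day 1 is an assumption made here for sorting purposes, and the function names are hypothetical.

```python
import re
from datetime import date

DATE_RE = re.compile(
    r'<DATE TYPE="(?P<type>\w+)"(?: D="(?P<d>\d+)")? M="(?P<m>\d+)" Y="(?P<y>\d+)">'
    r'(?P<text>.*?)</DATE>'
)


def parse_dates(annotated):
    """Return (date, type, surface text) for every DATE annotation."""
    results = []
    for m in DATE_RE.finditer(annotated):
        # DateMonth annotations carry no day; day 1 is used (assumption).
        day = int(m.group("d")) if m.group("d") else 1
        value = date(int(m.group("y")), int(m.group("m")), day)
        results.append((value, m.group("type"), m.group("text")))
    return results


def sort_by_date(annotated):
    return sorted(parse_dates(annotated), key=lambda item: item[0])


def retrieve_by_month(annotated, year, month):
    return [item for item in parse_dates(annotated)
            if item[0].year == year and item[0].month == month]
```

Keeping the annotation type next to the value lets a caller tell a real day from a month-level approximation.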
Publishing date ?
critical for
time centric
searches
published 05/2011,
tagged as 7 jul 2011
& many more…
wrong
spelling
Tapei→Taipei
location is also a first name
"University of Michigan, Ann Arbor, MI"→Ann Arbor (person)
compound first names
"Jean-Claude Marin"→Claude Marin
wrong character case (very frequent on titles)
breaks all case-based rules
barrack obama→not extracted
How To Buy Electric Trucks→Buy Electric (organization)
In Virginia Life Is Sweet→Virginia Life (person)
polymorphism
"Nagy Bocsa", "Nagy-Bocsa", "Nagy"
sanitize parser output
for tokenization
transliteration, case, punctuation, …
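A minimal sketch of such sanitization, assuming that surface forms are compared after accent stripping, punctuation removal and case folding. The function names are illustrative, and real transliteration (for non-Latin scripts) is out of scope here.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name):
    """Build a comparison key: no accents, no punctuation, case-folded."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    return " ".join(no_punct.casefold().split())


def group_variants(mentions):
    """Group entity mentions that share the same normalized key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)
```

With this key, "Nagy Bocsa" and "Nagy-Bocsa" fall into one group, while the short form "Nagy" stays separate; merging it would need extra evidence such as co-occurrence in the same document.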
4. Results
Reminder
Next results are obtained
automatically
from unstructured content
picked on the web
by an autonomous system,
without previous knowledge
of the topic or the visited Web sites
Let's try it with a use case
"hydrogen storage for fuel cells"
What's inside a collection
of 66 highly ranked documents ?
run a few cycles
(shallow extraction only)
entity
weight function
(tf-idf, …)
some
10^4 pages
People, Orgs, Location, Time
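As one example of an entity weight function, the sketch below computes a plain tf-idf weight per entity over a collection of pages and keeps the best-ranked ones. The exact weighting used in the talk is not specified, so the formula (relative frequency times log inverse document frequency, summed over pages) and the function names are assumptions.

```python
import math
from collections import Counter


def entity_tfidf(docs):
    """docs: list of entity lists, one per page. Returns {entity: summed tf-idf}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for entities in docs:
        doc_freq.update(set(entities))
    weights = Counter()
    for entities in docs:
        if not entities:
            continue
        for entity, count in Counter(entities).items():
            tf = count / len(entities)
            idf = math.log(n_docs / doc_freq[entity])
            weights[entity] += tf * idf
    return dict(weights)


def top_entities(docs, k=50):
    """Highest-weighted entities, ties broken alphabetically."""
    weights = entity_tfidf(docs)
    return sorted(weights, key=lambda e: (-weights[e], e))[:k]
```

An entity present on every page gets a zero weight under this formula, which is one way to surface distinctive entities instead of the ones a topic mentions everywhere.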
Special attention paid
to so-called outliers
Organizations > 900 : overload…
page cleaning + entity sanitization
=> better details & accuracy
↗attention ↘information : top 50
academic
team ?
H2 military
usage ?
new questions are instantly popping up
?
People
authors lead to
relevant content
(classic IR method,
even in libraries !)
?
Countries
political threats
on Lithium battery
supplies
argument in favor of
H2 technology
Cities
"Austin is in a unique position
to offer its electric grid as a
real world proving ground"
"Direct Methanol Fuel Cells"
⇒alternative to H2
!
!
!
changeover from nickel to lithium
will be complete by 2016 and 2018
Multiple-dates timeline
history / outlook
domains
time
Honda President Takanobu Ito says
around 10 percent of Honda’s global sales
will be hybrids by 2015
In a few clicks...
DMFC alternative to H2
Austin,
TX
hydrogen storage
for fuel cells ?
changeover from
nickel to lithium
by 2016/2018
5. Perspectives
To clean or not to clean ?
performance impact / "attention" impact
run pipeline with/without cleaning
corpus
label examples +/-
clean
set
full
set
time full
pipeline
Publishing date extraction
heuristic
DOM processing
prototype ready
need large scale
evaluation
build gold
standard from
RSS feeds
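A heuristic of this kind could start from date hints that pages commonly expose in their markup. The sketch below is an assumption-laden illustration, not the prototype mentioned above: it checks a few `<meta>` keys and the `datetime` attribute of `<time>` elements, in a fixed priority order chosen here for the example.

```python
from html.parser import HTMLParser

# Candidate meta keys in priority order (an illustrative choice).
META_KEYS = ("article:published_time", "date", "dc.date", "pubdate")


class PublishDateFinder(HTMLParser):
    """Collects (priority, raw date string) candidates from the markup."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_KEYS and attrs.get("content"):
                self.candidates.append((META_KEYS.index(key), attrs["content"]))
        elif tag == "time" and attrs.get("datetime"):
            # <time> elements rank below every meta key.
            self.candidates.append((len(META_KEYS), attrs["datetime"]))


def publishing_date(html):
    """Return the best-ranked raw date string, or None when no hint is found."""
    finder = PublishDateFinder()
    finder.feed(html)
    finder.close()
    return min(finder.candidates)[1] if finder.candidates else None
```

A gold standard built from RSS feeds would then allow comparing the returned string against the date each feed declares for the same page.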
A zest of Linked Data ?
too slow & fat
for crawling...
use it "offline"
disambiguation, gazetteers, infoboxes, ...
Play with graphs
entity co-occurrence, page similarity, ...
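Both graph ideas can be prototyped from the per-page entity lists alone. The sketch below counts how often two entities appear on the same page and measures page similarity with the Jaccard index on entity sets; choosing Jaccard is an assumption, since the slides do not name a similarity measure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages):
    """pages: iterable of entity collections, one per page.

    Returns a Counter mapping sorted entity pairs to the number of pages
    on which both entities occur (the edge weights of the graph).
    """
    edges = Counter()
    for entities in pages:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def jaccard(entities_a, entities_b):
    """Similarity of two pages based on their shared entities."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

The resulting edge weights can feed any graph layout or clustering tool; the pair keys are sorted so that (a, b) and (b, a) count as the same edge.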
UI/user experience
search facets
word clouds
maps
dashboards
infoboxes
highlighting
graphs
Lexical Taxonomies Induction
22nd International Joint Conference on Artificial Intelligence (IJCAI 2011),
Barcelona, Spain, July 19-22nd, 2011
another kind of projection
6. Digest
a. A real need of Attention-saving…
b. WebNEM results are encouraging
c. Work in progress, lots of paths to explore
"There's simply
too much
information out
there."
"Leaders feel
misled. Stupid.
Trapped."
Final word by Herbert Simon
"Filtering by intelligent programs
is the main part of the answer"
[to information overload]
www.ixxo.fr
www.slideshare.net/fpouilloux
www.linkedin.com/pub/st%C3%A9phanie-jacquemont/20/271/767
www.linkedin.com/in/fpouilloux
MANY THANKS!
joint work of
CREDITS
Photos
2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web
Intelligence
4. Designing Organizations for an Information-Rich World, The Herbert A.
Simon Collection
5.Vlad the Impaler, Wikimedia commons
7. Missile 9M342 of the portable anti-aircraft missile system Igla-S,
©vitalykuzmin.net
10. Internet Map 2005, ©www.opte.org
33. The Inspector, ©DePatie-Freleng Enterprises
36. Nanomaterials for Solid State Hydrogen Storage, book cover,
©springer.com
40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory
40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole
41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy
44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja
Jentzsch, lod-cloud.net
44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter
48. Views of the solar corona by the Transition Region and Coronal
Explorer, Stanford-Lockheed Institute for Space Research, NASA Small
Explorer program
49. Hyperformance book cover, www.tjwaters.com
50. Dr Simon solving puzzles, The Herbert A. Simon Collection
Websites
wi-iat-2011.org
The Herbert A. Simon Collection, Carnegie Mellon University Libraries,
diva.library.cmu.edu/webapp/simon/index.html
www.google.com
online.barrons.com
www.me.utexas.edu/~dmfc-muri
www.alsace-industrie.fr
www.hybridcars.com
www.me.utexas.edu/blogs/meyersresearchgroup
Bibliography
Simon, H. A. (1971), "Designing Organizations for an Information-Rich
World", Carnegie Mellon University Libraries,
diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl0002/doc0001
Waters, T. J. (2011), "Hyperformance",
www.tjwaters.com/hyperformance-excerpt.html
R. Navigli, P. Velardi, S. Faralli. A Graph-based Algorithm for Inducing
Lexical Taxonomies from Scratch. Proc. of the 22nd International Joint
Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July
19-22nd, 2011, pp. 1872-1877.

Web Scale Named Entity Mining

  • 1. Web scale Named Entity Mining "There's simply too much information out there" WI-IAT 2011
  • 2. in memoriam of Herbert A. Simon …
  • 4. Herbert Simon's Brookings Institute Lecture "Designing Organizations for an Information-Rich World" Johns Hopkins University, September 1, 1969
  • 6. Find & procure a crystal plastic replacement for polycarbonate LEXAN 943. Main constraints: • more resistant to detergent agents than LEXAN 943 (problem of cracking under the combined effect of mechanical stress and exposure to detergent agents) • compatible with existing tools: shrinkage must be close to that of LEXAN 943 • optical characteristics close to LEXAN 943 • weldable by ultrasonic welding • compliant with resistance to fire & smoke requirements 2 according to NFF16-101/102 and V0 according to standard UL 94. Delay: one week. Organization-centric search
  • 7. Where is the SA-24 Grinch 9K338 Igla-S portable air defense missile system sold/operated? Location-centric search
  • 8. Recent information (past month) about the call for proposals "outils Web innovants en entreprise" (innovative Web tools in the enterprise)? Time-centric search
  • 9. "Pro" searches focus on named entities: Location, Orgs, People, Time
  • 11. Attention-greedy & burdensome: relevant query? query again? where? + browsing/ranking results. Product specifications; get manufacturer or distributor; find compliant products
  • 12. "SA-24 Grinch 9K338 Igla-S" Goal : Attention-saver process
  • 13. exploratory data analysis of high dimensional data
  • 14. "In exploratory data analysis of high dimensional data one of the main tasks is the formation of a simplified, usually visual, overview of data sets. ... Clustering and projection are among the examples of useful methods to achieve this task." Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
  • 15. WebNEM: collection of relevant data, anywhere on the web + projection onto the Named Entities space. Building blocks: topical web crawler; named entity recognition; visualization/exploratory analysis tools
  • 16. "Web scale" collection: brute force. Never-ending crawl; a priori "whole" Web indexing; general index; "everywhere"; user query; fast answer, "any" topic; huge resources required (data size based)
  • 17. "Web scale" collection: our approach. User query; on-demand topical crawl; Web slices built on order; tailored index; anywhere relevant; delayed answer, but less garbage; "close to optimal" resources (usage based)
  • 18. Projection: when to extract entities? Named Entity Recognition is resource intensive.
        process step   data size           required response time
        crawl time     whole web, 10^10    asynchronous
        query time     collection, 10^2    real-time
        crawl time     web slice, 10^4     asynchronous
  • 19. www.squido.fr: our SaaS Web mining system; large scale Named Entity extraction (EN/FR); beta released to customers June 2011
  • 20. WebNEM with Squido (pipeline diagram): index; focused crawl; search topic; shallow entity extraction; page cleaning; user queries; user collections; deep entity extraction; visualization; visualization
  • 25. Linguistic processing throughput: deep extraction too expensive when crawling; shallow extraction OK, with a penalty on quality. Workaround: asynchronous deep extraction on smaller collections; query-time sanitization
  • 26. Page cleaning needs evaluation. Goal: ↗ accuracy? Cost: ↘ recall? Performance impact? ↘ +1 processing step; ↗ less text in later steps
  • 27. "Multiple dates" usage?
        <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE>
        <DATE TYPE="DateDay" D="11" M="2" Y="2008">February 9-13, 2008</DATE>
        <DATE TYPE="DateDay" D="12" M="11" Y="2007">November 11-13, 2007</DATE>
        <DATE TYPE="DateDay" D="14" M="10" Y="2008">October 12-17, 2008</DATE>
        <DATE TYPE="DateDay" D="16" M="2" Y="2009">February 15-18, 2009</DATE>
        <DATE TYPE="DateDay" D="17" M="9" Y="2007">September 16-19, 2007</DATE>
        <DATE TYPE="DateDay" D="2" M="5" Y="2008">May 2, 2008</DATE>
        <DATE TYPE="DateDay" D="26" M="5" Y="2009">May 24-29, 2009</DATE>
        <DATE TYPE="DateDay" D="27" M="10" Y="2009">October 25-29, 2009</DATE>
        <DATE TYPE="DateDay" D="7" M="10" Y="2008">October 5-9 2008</DATE>
        <DATE TYPE="DateDay" D="8" M="2" Y="2009">February 7-10, 2009</DATE>
        <DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>
        <DATE TYPE="DateDay" D="9" M="10" Y="2007">October 7-12, 2007</DATE>
        <DATE TYPE="DateMonth" M="11" Y="2009">November, 2009</DATE>
        <DATE TYPE="DateMonth" M="2" Y="2009">February, 2009</DATE>
        <DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE>
        Retrieve by date? Sort by date?
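The "retrieve by date / sort by date" question on slide 27 can be illustrated with a minimal Python sketch. It assumes the annotations arrive as plain text containing `<DATE>` tags shaped like the ones listed on the slide; the names `DateTag`, `parse_dates`, `sort_key` and `retrieve` are hypothetical helpers and are not part of WebNEM or Squido.

```python
import re
from typing import List, NamedTuple, Optional


class DateTag(NamedTuple):
    year: int
    month: int
    day: Optional[int]  # None for DateMonth annotations, which carry no D attribute
    text: str           # surface form found in the page


# Assumption: tags look like <DATE TYPE="..." D=".." M=".." Y="..">text</DATE>
DATE_RE = re.compile(r"<DATE\s+([^>]*)>(.*?)</DATE>", re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_dates(markup: str) -> List[DateTag]:
    """Turn every DATE annotation in the markup into a DateTag."""
    tags = []
    for attrs, text in DATE_RE.findall(markup):
        a = dict(ATTR_RE.findall(attrs))
        day = int(a["D"]) if "D" in a else None
        tags.append(DateTag(int(a["Y"]), int(a["M"]), day, text.strip()))
    return tags


def sort_key(tag: DateTag):
    """Chronological key; month-only dates sort first within their month."""
    return (tag.year, tag.month, tag.day or 0)


def retrieve(tags: List[DateTag], year: int, month: Optional[int] = None) -> List[DateTag]:
    """Keep the tags that fall in the given year (and month, if provided)."""
    return [t for t in tags if t.year == year and (month is None or t.month == month)]


sample = (
    '<DATE TYPE="DateDay" D="11" M="2" Y="2008">February 10-13, 2008</DATE> '
    '<DATE TYPE="DateMonth" M="8" Y="2008">August 2008</DATE> '
    '<DATE TYPE="DateDay" D="8" M="5" Y="2007">May 6-11, 2007</DATE>'
)
timeline = sorted(parse_dates(sample), key=sort_key)
for tag in timeline:
    print(tag.year, tag.month, tag.day, tag.text)
```

Placing month-only dates before day-precise dates of the same month is an arbitrary choice made for this sketch; a real timeline may prefer to treat them as intervals.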
  • 28. Publishing date? Critical for time-centric searches. Example: published 05/2011, tagged as 7 jul 2011
  • 29. & many more… Wrong spelling: Tapei → Taipei. Location is also a first name: "University of Michigan, Ann Arbor, MI" → Ann Arbor (person). Compound first names: "Jean-Claude Marin" → Claude Marin. Wrong character case (very frequent in titles) breaks all case-based rules: barrack obama → not extracted; How To Buy Electric Trucks → Buy Electric (organization); In Virginia Life Is Sweet → Virginia Life (person). Polymorphism: "Nagy Bocsa", "Nagy-Bocsa", "Nagy". Remedies: sanitize parser output for tokenization; transliteration, case, punctuation, …
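The sanitization steps named on slide 29 (transliteration, case, punctuation) can be sketched as a normalization key that groups surface variants of the same entity. This is only an illustration under simple assumptions, not the sanitizer used in Squido; `entity_key` and `group_variants` are hypothetical names.

```python
import re
import unicodedata
from collections import defaultdict


def entity_key(surface: str) -> str:
    """Build a normalized key for an entity surface form."""
    # Transliteration: decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", surface)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Punctuation: hyphens, commas, dots and the like become spaces.
    no_punct = re.sub(r"[^\w\s]", " ", unaccented)
    # Case and whitespace: case-fold, then collapse runs of spaces.
    return " ".join(no_punct.casefold().split())


def group_variants(surfaces):
    """Group surface forms that share the same normalized key."""
    groups = defaultdict(list)
    for surface in surfaces:
        groups[entity_key(surface)].append(surface)
    return dict(groups)


variants = group_variants(["Nagy Bocsa", "Nagy-Bocsa", "NAGY BOCSA", "Nagy"])
for key, forms in variants.items():
    print(key, "<-", forms)
```

The short form "Nagy" stays in its own group: merging it with the longer variants would need context (co-occurrence on the same page, a gazetteer), which a purely string-based key cannot provide.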
  • 31. Reminder: the following results are obtained automatically from unstructured content picked from the web by an autonomous system, without prior knowledge of the topic or the visited Web sites
  • 32. Let's try it with a use case: "hydrogen storage for fuel cells". What's inside a collection of 66 highly ranked documents? Run a few cycles (shallow extraction only); entity weight function (tf-idf, …); some 10^4 pages. Entity types: People, Orgs, Location, Time
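Slide 32 mentions an entity weight function "(tf-idf, …)" without giving its exact form. A minimal sketch of one common tf-idf variant, applied to entities instead of words, could look as follows; the function name `entity_tfidf`, the choice of the plain log(N/df) formulation, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entity_tfidf(docs):
    """Sum tf-idf weights per entity over a collection.

    docs: one list of extracted entity strings per page.
    Returns a dict mapping each entity to its accumulated weight.
    """
    n_docs = len(docs)
    # Document frequency: in how many pages does each entity appear?
    df = Counter()
    for entities in docs:
        df.update(set(entities))
    weights = Counter()
    for entities in docs:
        if not entities:
            continue
        tf = Counter(entities)
        for entity, count in tf.items():
            idf = math.log(n_docs / df[entity])
            weights[entity] += (count / len(entities)) * idf
    return dict(weights)


# Made-up toy collection; the entity names are borrowed from the slides.
pages = [
    ["Honda", "Austin", "Honda"],
    ["Honda", "Argonne"],
    ["Honda"],
]
weights = entity_tfidf(pages)
for entity, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {weight:.3f}")
```

With this formulation an entity present on every page gets weight zero, which pushes rarer entities (the "outliers" of slide 33) to the top of the ranking.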
  • 33. Special attention paid to so-called outliers
  • 34. Organizations > 900 : overload… page cleaning + entity sanitization => better details & accuracy
  • 35. ↗attention ↘information : top 50 academic team ? H2 military usage ? new questions are instantly popping up ?
  • 36. People: authors lead to relevant content (classic IR method, even in libraries!)
  • 37. Countries: political threats on lithium battery supplies; an argument in favor of H2 technology
  • 38. Cities: "Austin is in a unique position to offer its electric grid as a real world proving ground"; "Direct Methanol Fuel Cells" ⇒ alternative to H2!
  • 39. Multiple-dates timeline (labels: outlook, history, domains, time): "changeover from nickel to lithium will be complete by 2016 and 2018"; "Honda President Takanobu Ito says around 10 percent of Honda’s global sales will be hybrids by 2015"
  • 40. In a few clicks... hydrogen storage for fuel cells? DMFC alternative to H2; Austin, TX; changeover from nickel to lithium by 2016/2018
  • 42. To clean or not to clean? Performance impact; "attention" impact. Run pipeline with/without cleaning; corpus label examples +/-; clean set; full set; time; full pipeline
  • 43. Publishing date extraction: heuristic DOM processing; prototype ready; needs large-scale evaluation; build gold standard from RSS feeds
  • 44. A zest of Linked Data? Too slow & fat for crawling... use it "offline": disambiguation, gazetteers, infoboxes, ...
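The slides do not describe the publishing-date heuristic itself. Purely as an illustration of what "heuristic DOM processing" might involve, the sketch below scans a page for two widely used conventions: a `<meta>` tag such as the Open Graph `article:published_time` property, and an HTML5 `<time datetime="...">` element. The class `PublishDateFinder`, the function `find_publish_date`, the list of meta keys and the preference order are assumptions, not the authors' prototype.

```python
from html.parser import HTMLParser


class PublishDateFinder(HTMLParser):
    """Collect candidate publishing dates from meta tags and <time> elements."""

    # Assumed metadata keys; real pages use many other conventions.
    META_KEYS = ("article:published_time", "date", "dc.date")

    def __init__(self):
        super().__init__()
        self.candidates = []  # (source, raw value) pairs in document order

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = (a.get("property") or a.get("name") or "").lower()
            if key in self.META_KEYS and a.get("content"):
                self.candidates.append(("meta", a["content"]))
        elif tag == "time" and a.get("datetime"):
            self.candidates.append(("time", a["datetime"]))


def find_publish_date(html):
    """Return the first candidate, preferring metadata over in-body dates."""
    finder = PublishDateFinder()
    finder.feed(html)
    for source in ("meta", "time"):
        for kind, value in finder.candidates:
            if kind == source:
                return value
    return None


page = (
    '<html><head><meta property="article:published_time" content="2011-05-20">'
    '</head><body><time datetime="2011-07-07">7 jul 2011</time></body></html>'
)
print(find_publish_date(page))
```

Preferring metadata is one way to avoid the mismatch shown on slide 28, where a page published in 05/2011 was tagged with a later date found in its body; whether that preference holds at large scale is exactly what the gold standard built from RSS feeds would have to measure.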
  • 45. Play with graphs: entity co-occurrence, page similarity, ...
  • 46. UI/user experience: search facets, word clouds, maps, dashboards, infoboxes, highlighting, graphs
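The two graph ideas on slide 45 can be sketched with the standard library alone. Below, co-occurrence edges are counted per page, and page similarity is measured with the Jaccard index over entity sets; Jaccard is an assumption chosen for simplicity, since the slides do not name a similarity measure. The function names and toy pages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(pages):
    """Count, for each unordered entity pair, the pages where both appear."""
    edges = Counter()
    for entities in pages:
        # Sort so that each pair always appears in the same order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def jaccard(entities_a, entities_b):
    """Page similarity as the overlap of two entity sets (Jaccard index)."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up toy pages; the entity names are borrowed from the slides.
pages = [
    ["Austin", "DMFC", "Honda"],
    ["Austin", "DMFC"],
    ["Honda"],
]
edges = cooccurrence_edges(pages)
for (a, b), count in edges.most_common():
    print(a, "--", b, count)
print(jaccard(pages[0], pages[1]))
```

The resulting edge counts can be fed to any graph library or visualization tool; heavier edges point at entity pairs worth a closer look.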
  • 47. Lexical Taxonomies Induction: another kind of projection. 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011
  • 48. 6. Digest: a. A real need of Attention-saving… b. WebNEM results are encouraging c. Work in progress, lots of paths to explore
  • 49. "There's simply too much information out there." "Leaders feel misled. Stupid. Trapped."
  • 50. Final word by Herbert Simon "Filtering by intelligent programs is the main part of the answer" [to information overload]
  • 52. CREDITS
        Photos: 2. Home page, The 2011 IEEE/WIC/ACM International Conference on Web Intelligence; 4. Designing Organizations for an Information-Rich World, The Herbert A. Simon Collection; 5. Vlad the Impaler, Wikimedia Commons; 7. Missile 9M342 of the portable anti-aircraft missile system Igla-S, ©vitalykuzmin.net; 10. Internet Map 2005, ©www.opte.org; 33. The Inspector, ©DePatie-Freleng Enterprises; 36. Nanomaterials for Solid State Hydrogen Storage, book cover, ©springer.com; 40. EnerDel/Argonne lithium-ion battery, ©Argonne National Laboratory; 40. Pennybacker Bridge - Austin, TX, ©Andy Heatwole; 41. 20060206211301_132363.jpg, pulpo.org, ©Jumpedforjoy; 44. Linking Open Data cloud diagram, ©Richard Cyganiak and Anja Jentzsch, lod-cloud.net; 44. Taji crawl, ©The U.S. Army, www.flickr.com/soldiersmediacenter; 48. Views of the solar corona by the Transition Region and Coronal Explorer, Stanford-Lockheed Institute for Space Research, NASA Small Explorer program; 49. Hyperformance book cover, www.tjwaters.com; 50. Dr Simon solving puzzles, The Herbert A. Simon Collection
        Websites: wi-iat-2011.org; The Herbert A. Simon Collection, Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/index.html; www.google.com; online.barrons.com; www.me.utexas.edu/~dmfc-muri; www.alsace-industrie.fr; www.hybridcars.com; www.me.utexas.edu/blogs/meyersresearchgroup
        Bibliography:
        Simon, H. A. (1971), "Designing Organizations for an Information-Rich World", Carnegie Mellon University Libraries, diva.library.cmu.edu/webapp/simon/item.jsp?q=/box00055/fld04178/bdl 0002/doc0001
        Waters, T. J. (2011), "Hyperformance", www.tjwaters.com/hyperformance-excerpt.html
        Navigli, R., Velardi, P., Faralli, S. (2011), "A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch", Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22nd, 2011, pp. 1872-1877.