Slides on "Use Cases for Information Extraction with UIMA" for the "Information Management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Information Extraction with UIMA - Use Cases
1. Information Extraction
with UIMA - Use Cases
Gestione delle Informazioni su Web (Information Management on the Web) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Friday, 16 April 2010
2. Use Cases - Agenda
UC1: Real estate market analysis
UC2: Automatic information extraction
from tenders
3. UC1 : Source
An online classifieds site for sellers and
buyers
General purpose (cars, real estate, hi-fi, etc.)
Local scope (Rome and surroundings)
4. UC1 - Goals
Are you looking for houses?
A specific subcategory of the site is dedicated to
real estate
I would like to monitor the Rome real estate market to
make smart decisions
predict how things will go in the (near) future
6. UC1 - Goals
I want to build a separate web application to
monitor such estate listings
I have to use a crawler to automatically
download selected pages periodically from the
source
Estate listing text is unstructured
I want to make aggregate queries on structured
information
7. UC1 - Information
Extraction
I have to write an information extraction
engine to populate a relational DB
with structured information extracted from
the free text of estate listings
9. UC1 - Crawler
A specialized crawler extracts data from the
source
Estate listing data is stored, grouped by
zone, in files in a directory on a
managed machine
10. UC1 - Crawler
The navigation of the site is defined in one XML
file for each city zone
The crawler downloads page fragments twice
a week
The extracted free text of the estate listings is
saved as XML, grouped by zone
13. UC1 - Crawler
Issues:
Cookies must be enabled
Some HTTP headers are needed
Fixed sleep intervals are needed
between requests
14. UC1 - Domain
EstateListing (Announcement)
Zone
MagazineNumber (Uscita, the magazine issue)
HouseStructure with its properties
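The domain above can be sketched as plain Java classes; any field beyond the slide's list (e.g. `rawText`, `rooms`) is an assumption for illustration:

```java
// Minimal sketch of the UC1 domain model; field names beyond the
// slide's list (rawText, rooms, garden, ...) are assumptions.
public class EstateListing {
    String zone;              // city zone the listing belongs to
    int magazineNumber;       // "Uscita": the magazine issue it appeared in
    String rawText;           // unstructured free text of the listing
    HouseStructure structure; // extracted structural properties

    static class HouseStructure {
        int rooms;
        int bathrooms;
        boolean garden;
        boolean parkingSpot;
    }

    public static void main(String[] args) {
        EstateListing listing = new EstateListing();
        listing.zone = "APPIA";
        listing.magazineNumber = 12;
        System.out.println(listing.zone); // prints APPIA
    }
}
```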
15. UC1 - Information
Extraction Engine
Goal: extract price, zone, and telephone
number
The first version contained a specialized IE
engine that used huge regular expressions
Hard to maintain and inefficient
Not much information extracted
16. UC1 - IE Engine
New requirement: also extract the structure
of the house
Number of rooms, garage, garden(s), external
spaces, number of bathrooms, kitchen, etc.
Using regexes again proved hard to
maintain and inefficient
17. UC1 - IE Engine
Replace the regex-based IE engine with a UIMA-based
IE engine to:
exploit previous work (regexes can live inside UIMA
too)
exploit existing components
modify and enhance IE rules easily
gain much more efficiency
extract more information
21. Sample text
“ven 26 Dic APPIA via grottaferrata metro 2°
piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto e 295.000”
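As an illustration, the zone and price in a listing like the one above could be pulled out with patterns of this kind; both patterns are simplifying assumptions, not the engine's actual rules:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative extraction over a raw listing: here the zone is taken as
// the first all-caps token and the price as the number after the "e"
// (euro) marker. Both patterns are assumptions for demonstration.
public class ListingExtractor {
    private static final Pattern ZONE = Pattern.compile("\\b[A-Z]{2,}\\b");
    private static final Pattern PRICE = Pattern.compile("\\be\\s+([\\d.]+)");

    public static String extractZone(String text) {
        Matcher m = ZONE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static String extractPrice(String text) {
        Matcher m = PRICE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String listing = "ven 26 Dic APPIA via grottaferrata metro 2° piano assolato "
                + "ingresso salone americana cucina camera cameretta bagno "
                + "soppalco posto auto e 295.000";
        System.out.println(extractZone(listing) + " " + extractPrice(listing));
        // prints APPIA 295.000
    }
}
```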
22. UC1 - ContentAnnotator
From the XML produced by the crawler, only
the estate listings must be extracted
A simple parser gets each node containing
an estate listing (whose content is in turn
unstructured)
Create a ContentAnnotation over the
document
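A minimal sketch of that parsing step, assuming the crawler's XML wraps each listing's free text in a `<listing>` element (the element name is an assumption; in the real pipeline each extracted span then backs a ContentAnnotation):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Pulls the free text of each listing node out of the crawler's XML.
public class ListingXmlParser {
    public static List<String> extractListings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> listings = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("listing"); // assumed element name
            for (int i = 0; i < nodes.getLength(); i++) {
                listings.add(nodes.item(i).getTextContent().trim());
            }
            return listings;
        } catch (Exception e) {
            throw new RuntimeException("failed to parse crawler XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<zone name=\"Appia\">"
                + "<listing>APPIA via grottaferrata e 295.000</listing>"
                + "<listing>TUSCOLANA bilocale e 180.000</listing></zone>";
        System.out.println(extractListings(xml).size()); // prints 2
    }
}
```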
30. UC1 - Consuming
extracted information
the previous version of the IE engine
produced XMLs that (again) needed to be
parsed to store structured data in the
DB
with UIMA, a CAS Consumer at the end of
the analysis pipeline can automatically store
the extracted information in the DB
31. UC1 - Analyzing real
estate market data
a simple webapp written in Java with Spring
framework modules (Spring Core, DAO, JDBC,
MVC), querying aggregate data on a MySQL DB
UI enriched with jQuery
34. UC2 - Monitor of
tenders/announcements
Monitor various sources that publish
announcements and tenders which interested
people and companies can subscribe to
We want to automate the long process of
monitoring such sources and also
automatically extract useful common
information from the announcements' text
40. UC2 - Crawling
Similar to the UC1 crawler, but using a Firefox
plugin we can define navigation patterns for
the pages of each source
We can also mark metadata encountered during
navigation that carries information
Again, an XML is generated so that it
can be saved to storage and executed
periodically
42. UC2 - Domain
annotations
Language
Abstract
Activity
Beneficiary
Budget
Expiration date
Funding type
Geographic region
Sector
Subject
Title
Tags
43. UC2 - Domain entities
First and most important is an entity that
represents the entire tender or
announcement
Annotations inside the domain will finally fill
such entity's properties
45. UC2 - Simple first
Each annotator first looks:
for metadata extracted during navigation
for the most common patterns used to express
information inside such announcements
e.g.: “Budget: 200000$” or “Financial information: ......”
Such patterns tend to be language independent
(although this often does not hold)
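The "simple first" lookup can be sketched as a plain pattern match; the exact label variants and the tolerance for spacing are assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the "simple first" strategy: try the common
// "Label: value" pattern before any deeper NLP.
public class BudgetPatternMatcher {
    // Assumed label variant; real annotators would carry one pattern
    // per label ("Financial information", etc.) and per language.
    private static final Pattern BUDGET =
            Pattern.compile("Budget\\s*:\\s*([\\d.,]+\\s*\\$?)", Pattern.CASE_INSENSITIVE);

    public static String findBudget(String text) {
        Matcher m = BUDGET.matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(findBudget("Budget: 200000$ for the first year"));
        // prints 200000$
    }
}
```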
46. UC2 - AbstractAnnotator
The abstract is usually in the first part of the
document
We use a Tokenizer and a Tagger to get Tokens (with
PoS tags) and Sentences
We use a Dictionary to provide a list of “good”
words
We scan the first sentences of the document
for the objectives of the announcement
(mixing good words and regular expressions)
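A rough sketch of that heuristic, where a regex sentence split and a small word set stand in for the UIMA Tokenizer, Tagger, and Dictionary (the word list and window size are assumptions):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Heuristic abstract detection: keep the earliest of the first few
// sentences that contains one of the dictionary's "good" words.
public class AbstractFinder {
    // Assumed "good" word list; the real Dictionary is much larger.
    private static final Set<String> GOOD_WORDS =
            new HashSet<>(Arrays.asList("objective", "aim", "purpose", "goal"));
    private static final int WINDOW = 3; // only look at the first sentences

    public static String findAbstract(String text) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        for (int i = 0; i < Math.min(WINDOW, sentences.length); i++) {
            for (String word : sentences[i].toLowerCase().split("\\W+")) {
                if (GOOD_WORDS.contains(word)) return sentences[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String doc = "Call for proposals 2010. "
                + "The objective of this call is to fund SMEs. Deadline follows.";
        System.out.println(findAbstract(doc));
        // prints The objective of this call is to fund SMEs.
    }
}
```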
47. UC2 -
ExpirationDateAnnotator
A DateAnnotator is executed beforehand
Iterate over the DateAnnotations
Get the sentences wrapping such DateAnnotations
Check whether terms like “deadline” appear in
the same sentence as a DateAnnotation
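The steps above can be sketched as follows; a simple date regex stands in for the upstream DateAnnotator, and the trigger terms are assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A date counts as an expiration date only when a trigger term like
// "deadline" occurs in the same sentence.
public class ExpirationDateFinder {
    // Simplified dd/mm/yyyy pattern standing in for the DateAnnotator.
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    public static String findExpirationDate(String text) {
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            String lower = sentence.toLowerCase();
            if (lower.contains("deadline") || lower.contains("expires")) {
                Matcher m = DATE.matcher(sentence);
                if (m.find()) return m.group();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String doc = "The call was published on 01/02/2010. "
                + "The deadline for submission is 30/06/2010.";
        System.out.println(findExpirationDate(doc)); // prints 30/06/2010
    }
}
```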
53. Conclusions on IE
UC1: simple and stable sentence patterns
UC2: multi-language, with much more complex
and varied sentence structures and
patterns
Fine-grained metadata are very important
Need to play with NLP
Need to establish good test cases