Using opinion mining techniques for early crisis detection
1. “Al. I. Cuza”, University of Iasi, Romania
Faculty of Computer Science
Adrian Iftene, Alexandru Lucian Gînscă
ICCCC 2012, 8-12 May, Băile Felix, Oradea, Romania
2. System overview
Data acquisition
Topic detection
Data processing
Identification of opinions
Results
Visualization
Conclusions
4. Scenario: Street protests in Romania (between 13
and 26 January, 2012)
Crawler component, RSS feeds
Scraping: removed links, photos, menus, special
characters
Data locally stored
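A minimal sketch of the acquisition step, assuming a standard RSS 2.0 feed whose item descriptions contain HTML. The function names (`parse_rss`, `clean_html`) and the sample feed are hypothetical and only illustrate how links, photos and other markup could be removed before the text is stored:

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags such as links and images."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(fragment):
    """Strip markup from an HTML fragment and collapse whitespace."""
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


def parse_rss(xml_text):
    """Return (title, cleaned description) pairs, one per RSS <item>."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        articles.append((title.strip(), clean_html(description)))
    return articles


# Hypothetical feed content, used only to exercise the functions above.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item>
  <title>Protest in the city centre</title>
  <description>&lt;p&gt;People gathered &lt;a href="http://example.org"&gt;downtown&lt;/a&gt;&lt;img src="x.png"/&gt;&lt;/p&gt;</description>
</item>
</channel></rss>"""

articles = parse_rss(SAMPLE_FEED)
```

The cleaned pairs could then be written to local files, one per article.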
5. The topic is very important in detecting articles
referring to a crisis situation
Latent Dirichlet Allocation: state of the art topic model
Problems:
• The number of topics needs to be specified from start
• The results are lists of representative words for each topic, so
human intervention is needed to interpret them
Solution: WordNet based similarity measures
• Wu and Palmer
• Lin
• Resnik (best results)
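As an illustration of a taxonomy-based measure, the sketch below computes the Wu and Palmer similarity, 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor. It runs on a tiny hand-made hierarchy rather than on WordNet: the `PARENT` table and its concepts are assumptions for the example. Wu and Palmer is shown because it needs only the hierarchy; Resnik and Lin additionally require corpus-based information content.

```python
# Hypothetical toy taxonomy: each concept maps to its parent (root has None).
PARENT = {
    "event": None,
    "conflict": "event",
    "protest": "conflict",
    "fight": "conflict",
    "celebration": "event",
}


def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a, b):
    """Wu and Palmer similarity on the toy taxonomy."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

On this toy hierarchy, "protest" and "fight" share the parent "conflict" and therefore score higher than "protest" and "celebration", which only share the root.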
6. Computing the similarity between 2 sets of words
T1, T2 = two sets of words.
sim(t1, t2) = one of the Wu and Palmer, Resnik or Lin similarity measures.
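The slide's formula is not reproduced in this text, so the sketch below is not the authors' definition. It shows one common way to lift a word-level measure sim(t1, t2) to two sets T1 and T2: each word is matched with its most similar word in the other set and the best scores are averaged. The aggregation and the stand-in `exact_match` measure are assumptions.

```python
def set_similarity(t1, t2, sim):
    """Average best-match similarity between two word sets (assumed scheme)."""
    if not t1 or not t2:
        return 0.0
    # Best partner in T2 for every word of T1, and the other way round.
    best_for_t1 = [max(sim(a, b) for b in t2) for a in t1]
    best_for_t2 = [max(sim(a, b) for a in t1) for b in t2]
    return (sum(best_for_t1) + sum(best_for_t2)) / (len(t1) + len(t2))


def exact_match(a, b):
    """Stand-in word similarity; a WordNet-based measure would go here."""
    return 1.0 if a == b else 0.0


score = set_similarity({"protest", "street"}, {"protest", "police"}, exact_match)
```

Any of the Wu and Palmer, Resnik or Lin measures could be passed as `sim`, provided it returns a number for every pair of words.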
7. LDA results for our street protests corpus when tracking 3
topics
8. Language-specific resources that contain cities (Iasi,
Bucuresti, Ploiesti, etc.) and regions (Bucovina, Moldova,
Transilvania, etc.) (Iftene et al., 2011)
Introducing a more localized approach: new resources
and rules for identifying streets (Iasi, Bulevardul Independentei,
Bucuresti, Calea Victoriei, etc.) and smaller inner-city
regions (Pacurari district, center of Iasi,
Arch of Triumph Square)
Examples of rules: to identify streets (Street + entity,
Boulevard + entity, etc.); to identify small regions (the
area between street A and street B, or the area of
building A)
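The "keyword + entity" street rules can be sketched as a regular expression. The keyword list and the definition of an entity as a run of capitalized words are assumptions made for this example; the actual resources and rules of the system are not shown on the slide.

```python
import re

# Hypothetical trigger words; "Bulevardul" and "Calea" come from the slide's examples.
STREET_KEYWORDS = ["Street", "Boulevard", "Strada", "Bulevardul", "Calea"]

# Rule: keyword followed by one or more capitalized words (the entity).
STREET_RULE = re.compile(
    r"\b(" + "|".join(STREET_KEYWORDS) + r")\s+"
    r"([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)"
)


def find_streets(text):
    """Return every 'keyword entity' span matched by the street rule."""
    return [f"{m.group(1)} {m.group(2)}" for m in STREET_RULE.finditer(text)]


streets = find_streets(
    "Protesters met on Bulevardul Independentei and later on Calea Victoriei."
)
```

A rule of this kind cannot tell a street name from a person name that follows a keyword, which is one source of classification errors.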
9. 538 files with 2,806 entities of "street" and "area"
types
The overall quality of the NE identification component
is around 92% and the quality of the NE classification
component is around 67%
Problems:
◦ incorrect spelling
◦ anaphora resolution
◦ ambiguous situations in which the context does not
tell us whether the NE is a person name or a street
name
10. Rule based opinion mining system (Gînscă et al., 2011)
Easily adaptable from one crisis scenario to another, in
contrast to a statistical approach
Use of manually built resources to identify opinion
keywords (good, bad etc.), amplifiers (most, more etc.),
diminishers (less, etc.), negation (not, never etc.)
Calculate the valences for groups of feelings and pair
named entities with scores based on distance,
punctuation and context
Use a dedicated vocabulary for a specific crisis situation
with 21 initial words (protest, conflict, fight, etc.) + similar
words from WordNet (synonyms, hypernyms, etc.)
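A minimal sketch of how such rules might combine, assuming tiny hand-made lexicons and arbitrary weights (amplifiers multiply the valence, diminishers reduce it, negation flips its sign). The word lists and numbers are illustrative assumptions, not the system's resources, and the pairing with named entities is omitted.

```python
# Hypothetical lexicons and weights, for illustration only.
OPINION = {"good": 1.0, "bad": -1.0}
AMPLIFIERS = {"most": 2.0, "more": 1.5}
DIMINISHERS = {"less": 0.5}
NEGATIONS = {"not", "never"}


def sentence_valence(sentence):
    """Sum opinion-word valences, each adjusted by the modifiers before it."""
    multiplier = 1.0
    total = 0.0
    for token in sentence.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            multiplier *= -1.0
        elif word in AMPLIFIERS:
            multiplier *= AMPLIFIERS[word]
        elif word in DIMINISHERS:
            multiplier *= DIMINISHERS[word]
        elif word in OPINION:
            total += OPINION[word] * multiplier
            multiplier = 1.0  # modifiers apply to the next opinion word only
    return total
```

Adapting the sketch to another crisis scenario would mean replacing the lexicons while leaving the scoring loop untouched.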
11. Greedy approach – iteratively adding
intermediate green points to the current path
until the solution cannot be improved
Advantages – we reduce the search space for
optimal routes and the Greedy solution is
obtained very fast
Disadvantages – the Greedy solution is only close
to the optimal solution
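The greedy loop can be sketched as follows. The cost function (path length minus a fixed bonus for every green point on the path) and all names are assumptions for this example; the slide does not state how a solution is scored.

```python
import math


def path_cost(path, green_points, bonus):
    """Assumed cost: Euclidean length minus a bonus per green point visited."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    visited = sum(1 for p in path if p in green_points)
    return length - bonus * visited


def greedy_route(start, end, green_points, bonus):
    """Insert green points one at a time while the cost keeps improving."""
    path = [start, end]
    remaining = set(green_points)
    best_cost = path_cost(path, green_points, bonus)
    while remaining:
        best_candidate = None
        # Try every remaining green point at every interior position.
        for point in remaining:
            for i in range(1, len(path)):
                candidate = path[:i] + [point] + path[i:]
                cost = path_cost(candidate, green_points, bonus)
                if cost < best_cost:
                    best_cost, best_candidate, chosen = cost, candidate, point
        if best_candidate is None:
            break  # no insertion improves the current solution
        path = best_candidate
        remaining.remove(chosen)
    return path


route = greedy_route((0, 0), (10, 0), [(5, 1), (5, 50)], bonus=3.0)
```

Here the nearby green point (5, 1) is inserted because its bonus outweighs the small detour, while the distant point (5, 50) is never added.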
13. Location-type entity mentions by day
[Figure: chart of location-type entity mentions per day; vertical axis from 0 to 250 mentions, horizontal axis days 13 to 23 and 25]
14. GoogleMaps API
Our algorithm is able to find a longer alternative path
that passes near, rather than through, the red islands and
prefers routes near the green islands
Thus, at every step it is possible to apply penalties
when the partial solution crosses red islands (with
potential risks) and to add bonuses when the partial
solution crosses green islands (without potential
risk)
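A sketch of such a scoring scheme, assuming that islands are modelled as circles (center, radius) and that a path crosses an island when any of its segments comes within the radius of the center. The geometry, the names and the penalty and bonus values are all assumptions for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.dist(p, a)
    # Projection of p onto the segment, clamped to its end points.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def crosses(path, island):
    """True if any segment of the path enters the circular island."""
    center, radius = island
    return any(
        point_segment_distance(center, a, b) <= radius
        for a, b in zip(path, path[1:])
    )


def score(path, red_islands, green_islands, penalty, bonus):
    """Path length plus penalties for red islands, minus bonuses for green."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    reds = sum(crosses(path, island) for island in red_islands)
    greens = sum(crosses(path, island) for island in green_islands)
    return length + penalty * reds - bonus * greens


red = [((5, 0), 1.0)]
green = [((5, 3), 1.5)]
direct = [(0, 0), (10, 0)]
detour = [(0, 0), (5, 2), (10, 0)]
direct_score = score(direct, red, green, penalty=5.0, bonus=2.0)
detour_score = score(detour, red, green, penalty=5.0, bonus=2.0)
```

With these values the detour is longer than the direct path but scores better, because it avoids the red island and touches the green one.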
17. When there are no green islands, we must specify another
method for selecting intermediate points in order to
improve the quality of the current solution
For streets and boulevards, the
GoogleMaps API is able to place these entities on the
map, but for specific squares and areas it is not.
In such cases we built an additional resource
which specifies their GIS coordinates
18. We present a system that can be easily adapted from one
crisis situation to another (by changing the dictionaries
and the topics of interest)
Efficient topic identification using LDA
Suggestive visualization using the GoogleMaps API