Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

Erin MacMurray*, Marguerite Leenhardt **
SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3
*erin.macmurray@gmail.com
** marguerite.leenhardt@gmail.com

ICAI’11 Workshop on Intelligent Linguistic Technologies

In a nutshell
• Introduction & background
• Textometry and Web Mining: why?
• Textometry and Web Mining: how?
• Textometry and Web Mining: application?
• Conclusion

22/07/2011 E. MacMurray & M. Leenhardt, ICAI’11 Workshop on Intelligent Linguistic Technologies

Introduction & background
Structure?
Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. »
(July 2011 – IKS Semantic Workshop, France)

Man versus machine?
Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. »
(May 2011 – Sentiment Analysis Symposium review)

Textometry and Web Mining: why?
• Improving Linguistic Models
– Semantic complexity of simple units such as named entities (NE)
– Identifying paraphrases of an NE

NE: a heterogeneous category (Ehrmann M. (2008), Les EN de la linguistique au TAL : statut théorique et méthodes de désambiguïsation)
INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym, The 4th of July, 20GB, Dulles International Airport, Le Tour de France, www.nytimes.com

Paraphrases of a single NE
Nicolas Sarkozy, Le président de la République, Sarko, Sarkoland, M. Sarkozy, Mr Sarkozy, Sarkozyste, Sarkozysme


Textometry and Web Mining: why?
• Text is considered to have its own internal structure
• Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus


Textometry and Web Mining: how?

[Figure: specificness of a form, computed with the hypergeometric distribution. Values shown on the slide:]

Date            Form   Specificness
July 4th 2011   b      23.43, 12.68, 5.57, 5.66
July 5th 2011   d      13.73, 21.86, 7.75, 6.55
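
The specificness calculation named on this slide rests on the hypergeometric distribution. The following is a minimal sketch, assuming the usual setup in which a corpus of T occurrences contains F occurrences of a form and one part of t occurrences contains f of them. The function name `specificness`, the log10 scaling and the sign convention are illustrative assumptions, not the authors' implementation.

```python
from math import comb, log10

def specificness(f, F, t, T):
    """Signed specificness of a form in one part of a corpus (illustrative).

    f: occurrences of the form in the part, F: in the whole corpus,
    t: size of the part, T: size of the corpus (all in occurrences).
    Positive = over-represented in the part, negative = under-represented.
    """
    lo, hi = max(0, t - (T - F)), min(F, t)  # feasible values of f
    if not lo <= f <= hi:
        raise ValueError("f is not a feasible count for this part")

    def ways(k):
        # Number of parts of size t holding exactly k occurrences of the form.
        return comb(F, k) * comb(T - F, t - k)

    upper = sum(ways(k) for k in range(f, hi + 1))  # proportional to P(X >= f)
    lower = sum(ways(k) for k in range(lo, f + 1))  # proportional to P(X <= f)
    total = comb(T, t)                              # all parts of size t
    # Exact integers are used and logs taken last, so very small
    # probabilities do not underflow to 0.0.
    if upper <= lower:
        return log10(total) - log10(upper)  # -log10 P(X >= f)
    return log10(lower) - log10(total)      # +log10 P(X <= f)

# Toy corpus: 100 occurrences, 10 of them the form; the part has 20 occurrences.
over = specificness(8, 10, 20, 100)   # well above the 2 occurrences expected
under = specificness(0, 10, 20, 100)  # form absent from the part
```

Under these assumptions a large positive score marks a form that is unusually frequent in a part (for example, one day of articles), and a negative score marks a form that is unusually rare there.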


Textometry and Web Mining: how?
Co-occurrence: two or more words that appear together within a predetermined span of text, i.e. lexical relationships around a pivot form (William Martinez, 2003)
Result: a network of associative relationships

---A---C---B---D.
---B---C---H---E.
---B---C---A---E.
---E---B---D---F.
---C---A---D---H.
---F---C---B---D.
---E---B---D---A.
[Figure: network of co-occurrence relationships linking the forms A, B, C and E]
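
The idea of co-occurrents around a pivot form can be sketched on the seven toy sentences above. This is a simplified illustration that counts raw co-presence within a sentence and applies a hypothetical frequency threshold; it does not reproduce Martinez's method, and the names `cooccurrents`, `threshold` and `network` are assumptions made for the example.

```python
from collections import Counter

# The seven toy "sentences" from the slide, one letter per form.
sentences = ["ACBD", "BCHE", "BCAE", "EBDF", "CADH", "FCBD", "EBDA"]

def cooccurrents(pivot, spans):
    """Count, for each form, the spans in which it appears with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span)
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts

around_b = cooccurrents("B", sentences)

# Keep the forms that share at least `threshold` spans with the pivot
# (hypothetical cut-off, chosen only for this toy data).
threshold = 4
network = sorted(form for form, n in around_b.items() if n >= threshold)
```

Repeating the same step with each retained form as the new pivot would grow the network of associative relationships outward from the starting form.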

Textometry and Web Mining: how?
1/ POINT OF ENTRY
NE (companies and people), used for article selection
Company NE = Xerox
People NE = Nicolas Sarkozy

2/ CORPUS
160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax
103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax

3/ TEXTOMETRIC ANALYSIS
Hypergeometric distribution: specificness, co-occurrences

4/ INTERPRETATION OF RESULTS
Quantitative information to formulate qualitative interpretations.
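
The corpus figures on this slide (occurrences, forms, hapax) can be illustrated with a short sketch. It assumes a very simple tokenisation (lowercased runs of word characters), which is a choice made for the example and may differ from the segmentation used by the authors' tools; `corpus_profile` is a hypothetical helper name.

```python
import re
from collections import Counter

def corpus_profile(text):
    """Return the number of occurrences, distinct forms and hapax in a text."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokenisation
    freq = Counter(tokens)
    return {
        "occurrences": len(tokens),                         # running words
        "forms": len(freq),                                 # distinct words
        "hapax": sum(1 for n in freq.values() if n == 1),   # forms seen once
    }

profile = corpus_profile("The cat and the dog")
```

Here "the" occurs twice, so the five occurrences reduce to four forms, three of which are hapax.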


Textometry and Web Mining: results?
Observing the forms and repeated segments for « Nicolas Sarkozy »
makes it possible to identify polarities of opinion in paraphrases,
providing clues to how the NE is perceived.

Figure - Monthly variation of specificness for paraphrases of the NE « Nicolas Sarkozy ». The figure groups the paraphrases into two sets: contextually dependent and negative.


When a current event is discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity for the NE (the « buzz effect »).
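
One way to observe this effect is to measure the size of the pivot's network per period. The sketch below uses hypothetical toy data and placeholder tokens, and it takes network size to mean the number of distinct forms found in the same context as the pivot, which is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: (period, forms of one context) pairs.
contexts = [
    ("month-1", ["NE", "a"]),
    ("month-2", ["NE", "a", "b"]),
    ("month-2", ["NE", "b", "c", "d"]),
    ("month-3", ["NE", "e"]),
]

def network_size_by_period(contexts, pivot):
    """Number of distinct co-occurrents of the pivot in each period."""
    neighbours = defaultdict(set)
    for period, forms in contexts:
        if pivot in forms:
            neighbours[period].update(f for f in forms if f != pivot)
    return {period: len(forms) for period, forms in neighbours.items()}

sizes = network_size_by_period(contexts, "NE")
busiest = max(sizes, key=sizes.get)  # period with the largest network
```

In this toy data the second period stands out with four distinct co-occurrents against one in each of the others, which is the kind of contrast the slide describes.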


Conclusion
• Two intelligence use cases, on Le Monde and The New York Times
• Two complementary approaches: specificness and co-occurrence analysis

• Three main contributions:
– Building corpus-driven linguistic resources (saving time and cost)
– Identifying trends with the specificness calculation
– Targeting zones of activity or events through co-occurrence networks

• In sum, this method:
– Helps derive knowledge from corpora without predefined information models
– Provides adequate functions enabling interaction between the expertise of the user and the processing tools

References
Bloom K., Stein S. & Argamon S. Appraisal extraction for news opinion analysis at NTCIR-6. Proceedings of NTCIR-6, 2007, p. 279-289.
Bollier D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.
Delanoë A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings JADT’2010, 2010.
Delaplace R., Leenhardt M. & Wu L-C. Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur. VSST Conference, Toulouse, France, 2010.
Fayyad U.M., Piatetsky-Shapiro G., Smyth P. & Uthurusamy R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
Feldman R. & Sanger J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006, 422 p.
Firth J.R. A Synopsis of Linguistic Theory 1930-1955. Linguistic Analysis, Philological Society, Oxford, 1957.
Grishman R. & Sundheim B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Copenhagen, 1996, p. 466–471.
Kodratoff Y. Knowledge discovery in texts: A definition and applications. Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, LNAI 1609, p. 16–29.
Lebart L. & Salem A. Statistique textuelle. Paris: Dunod, 1994.
Lent B., Agrawal R. & Srikant R. Discovering trends in text databases. Proceedings KDD’1997, AAAI Press, 14–17, p. 227–230.
MacMurray E. & Shen L. Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events. VSST Conference, Toulouse, France, 2010.
Martin J.R. & White P.R.R. The Language of Evaluation: Appraisal in English. London: Palgrave, 2005.
Martinez W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels. Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
Née E. Insécurité et élections présidentielles dans le journal Le Monde. Lexicometrica, numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.
Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris: Hermès Sciences, 2003.
Poibeau T. Sur le statut référentiel des entités nommées. Proceedings TALN’05, Dourdan, France, 2005.
Salem A. Introduction à la résonance textuelle. In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986-992.
Sandhaus E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
Tufféry S. Data mining et statistique décisionnelle: l'intelligence des données. Paris: Editions Technip, 2007.
Wright K. Using Open Source Common Sense Reasoning Tools in Text Mining Research. The International Journal of Applied Management and Technology, 2006, vol. 4, n°2, p. 349-387.

