Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web
1. Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web
Erin MacMurray*, Marguerite Leenhardt **
SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne
Nouvelle Paris 3
*erin.macmurray@gmail.com
** marguerite.leenhardt@gmail.com
ICAI’11 Workshop on Intelligent Linguistic Technologies
2. In a nutshell
• Introduction & background
• Textometry and Web Mining: why?
• Textometry and Web Mining: how?
• Textometry and Web Mining: application?
• Conclusion
22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2
3. Introduction & background
Structure?
Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. »
(July 2011 – IKS Semantic Workshop, France)
Man versus machine?
Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. »
(May 2011 – Sentiment Analysis Symposium review)
4. Textometry and Web Mining: why ?
• Improving Linguistic Models
– Semantic complexity of simple units such as named entities (NE)
– Identifying paraphrases of NE
NE: a heterogeneous category
INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym, The 4th of July, 20GB, Dulles International Airport, Le Tour de France, www.nytimes.com
(Ehrmann M. (2008), Les EN, de la linguistique au TAL : statut théorique et méthodes de désambiguïsation.)
Paraphrases of a single NE
Nicolas Sarkozy, Le président de la République, Sarko, Sarkoland, M. Sarkozy, Mr Sarkozy, Sarkozyste, Sarkozysme
5. Textometry and Web Mining: why ?
• Text is considered to have its own internal structure
• Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus
6. Textometry and Web Mining: how?
Hypergeometric distribution
[Figure: specificness scores by form for two dates]
July 4th 2011 | Form: b | Specificness: 23.43, 12.68, 5.57, 5.66
July 5th 2011 | Form: d | Specificness: 13.73, 21.86, 7.75, 6.55
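The specificness calculation named on this slide can be sketched as follows. This is a minimal illustration, assuming the usual textometric setup in which the frequency of a form in one part of the corpus is compared with what the hypergeometric model predicts; the function names, the parameter names (T, F, t, f) and the signed log10 score are assumptions made for this example, not the authors' implementation.

```python
from math import comb, log10, inf

def hypergeom_pmf(k, T, F, t):
    """P(X = k): probability that a form occurring F times in a corpus
    of T tokens shows up exactly k times in a part of t tokens."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)

def specificness(f, T, F, t):
    """Signed score for observing f occurrences in the part.
    Positive: the form is over-represented; negative: under-represented.
    The magnitude is -log10 of the corresponding tail probability."""
    upper = sum(hypergeom_pmf(k, T, F, t) for k in range(f, min(F, t) + 1))
    lower = sum(hypergeom_pmf(k, T, F, t) for k in range(0, f + 1))
    if upper <= lower:
        return -log10(upper) if upper > 0 else inf
    return log10(lower) if lower > 0 else -inf

# Hypothetical counts: a form seen 10 times in a 1,000-token corpus,
# 8 of them inside a 100-token part (far more than the ~1 expected).
score = specificness(f=8, T=1000, F=10, t=100)
print(round(score, 2))
```

A common convention is to treat a form as specific to a part only when the absolute score exceeds a chosen threshold; the threshold is a user decision and is not given on the slide.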
7. Textometry and Web Mining: how?
Co-occurrence: two or more words that appear together within a predetermined span of text, forming lexical relationships around a pivot form (William Martinez, 2003)
Result: a network of associative relationships
Example sequences:
---A---C---B---D.
---B---C---H---E.
---B---C---A---E.
---E---B---D---F.
---C---A---D---H.
---F---C---B---D.
---E---B---D---A.
[Figure: resulting co-occurrence network linking the forms A, B, C and E]
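The idea of counting co-occurrents around a pivot form can be sketched with the letter sequences shown on this slide. This is a simplified illustration: the function name, the `span` parameter and the choice to count each co-occurrent once per pivot occurrence are assumptions, and no statistical filtering of the links is applied.

```python
from collections import Counter

# The seven example sequences from the slide, one token per letter.
SEQUENCES = [list(s) for s in
             ["ACBD", "BCHE", "BCAE", "EBDF", "CADH", "FCBD", "EBDA"]]

def cooccurrents(sequences, pivot, span=None):
    """Count the forms found within `span` tokens of each occurrence of
    `pivot`. With span=None the whole sequence is the context."""
    counts = Counter()
    for tokens in sequences:
        for i, token in enumerate(tokens):
            if token != pivot:
                continue
            width = len(tokens) if span is None else span
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            # A set counts each neighbouring form once per pivot occurrence.
            counts.update(set(left + right))
    return counts

whole = cooccurrents(SEQUENCES, "B")
print(whole.most_common())
narrow = cooccurrents(SEQUENCES, "B", span=1)
print(narrow.most_common())
```

Repeating the count for each retained co-occurrent as a new pivot gives the edges of a network like the one pictured.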
8. Textometry and Web Mining : how?
1/ POINT OF ENTRY
NE (companies and people)
Company NE = Xerox
People NE = Nicolas Sarkozy
2/ CORPUS
Article selection
160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax
103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax
3/ TEXTOMETRIC ANALYSIS
Hypergeometric distribution
Specificness
Cooccurrences
4/ INTERPRETATION OF RESULTS
Quantitative information to formulate qualitative interpretations.
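The corpus figures on this slide (occurrences, forms, hapax) are standard lexicometric counts. A minimal sketch of how such a profile could be computed is shown below; the regular-expression tokenizer and the lowercasing are assumptions for this example and may differ from the segmentation the authors used.

```python
import re
from collections import Counter

def corpus_profile(text):
    """Return the number of occurrences (tokens), forms (distinct tokens)
    and hapax (forms that occur exactly once) in a text."""
    tokens = re.findall(r"\w+", text.lower())
    frequencies = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(frequencies),
        "hapax": sum(1 for count in frequencies.values() if count == 1),
    }

# Hypothetical one-sentence corpus, for illustration only.
profile = corpus_profile("The cat saw the dog")
print(profile)
```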
9. Textometry and Web Mining: results?
Observing the forms and repeated segments around « Nicolas Sarkozy » makes it possible to identify polarities of opinion in paraphrases, providing clues for determining how the NE is perceived.
[Figure: paraphrases grouped by polarity: contextually dependent, negative]
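Repeated segments are sequences of consecutive forms that recur in the corpus. A minimal sketch of how they could be extracted is given below; the function name, the length bounds and the frequency threshold are assumptions for this example, and sentence boundaries are ignored for simplicity.

```python
from collections import Counter

def repeated_segments(tokens, min_len=2, max_len=4, min_freq=2):
    """Return every run of min_len..max_len consecutive tokens that
    occurs at least min_freq times, with its frequency."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {seg: freq for seg, freq in counts.items() if freq >= min_freq}

# Hypothetical token stream, for illustration only.
segments = repeated_segments("a b c a b d a b c".split())
print(segments)
```

On a real corpus, the segments that contain the NE or one of its paraphrases are the ones an analyst would then read for polarity.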
10. Textometry and Web Mining: results?
Figure - Monthly variation of specificness for paraphrases for the NE « Nicolas Sarkozy ».
11. Textometry and Web Mining: results?
When a current event involving the NE is discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity for the NE (the « buzz effect »).
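One simple way to observe this effect is to compare the size of the pivot's lexical network from one period to the next. The sketch below is a simplification: it counts every distinct form sharing a document with the pivot, without the statistical filtering a full co-occurrence calculation would apply, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def network_size_by_period(documents, pivot):
    """documents: iterable of (period, tokens) pairs.
    Returns, per period, the number of distinct forms that appear in
    at least one document together with the pivot."""
    neighbours = defaultdict(set)
    for period, tokens in documents:
        if pivot in tokens:
            neighbours[period].update(t for t in tokens if t != pivot)
    return {period: len(forms) for period, forms in neighbours.items()}

# Hypothetical documents, for illustration only.
docs = [
    ("2011-06", ["X", "a"]),
    ("2011-07", ["X", "a", "b"]),
    ("2011-07", ["X", "c", "d"]),
    ("2011-07", ["y", "z"]),  # no pivot: ignored
]
sizes = network_size_by_period(docs, "X")
print(sizes)
```

A sharp rise in the count for one period relative to its neighbours would be a candidate event to inspect by hand.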
12. Textometry and Web Mining: results?
13. Textometry and Web Mining: results?
14. Conclusion
• Two intelligence use-cases on Le Monde and The New York Times
• Two complementary approaches : specificness and co-occurrence analysis
• Three main contributions :
– Building corpus-driven linguistic resources (saving time and cost)
– Identifying trends with specificness calculation
– Targeting zones of activity or events through co-occurrence networks
• In sum, this method :
– Helps derive knowledge from corpora without predefined information models
– Provides functions that enable interaction between the user's expertise and the processing tools
15. References
Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p 279-289.
Bollier, D. The Promise and Peril of Big Data. Washington, DC : The Aspen Institute, 2010.
Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.
Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST
Conference, Toulouse, France, 2010.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
Feldman R. & Sanger J., The Text Mining Handbook : Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.
Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.
Grishman, R. & Sundheim, B. Message Understanding Conference-6 : A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, 1996, p. 466–471.
Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent
Systems, 1999, volume LNAI 1609, p. 16–29.
Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.
Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.
MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.
Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.
Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en
Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.
Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris : Hermès Sciences, 2003.
Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.
Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7 èmes Journées internationales d’Analyse Statistique des Données Textuelles),
2004, p 986-992.
Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris : Editions Technip, 2007.
Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology,
2006 vol 4 n°2 p.349-387.