Textometry and Information Discovery : A New
 Approach to Mining Textual Data on the Web

         Erin MacMurray*, Marguerite Leenhardt **
    SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne
                      Nouvelle Paris 3
               *erin.macmurray@gmail.com
            ** marguerite.leenhardt@gmail.com




     ICAI’11 Workshop on Intelligent Linguistic Technologies
In a nutshell
             •   Introduction & background
             •   Textometry and Web Mining: why?
             •   Textometry and Web Mining: how?
             •   Textometry and Web Mining: application?
             •   Conclusion




22/07/2011        E. MacMurray & M. Leenhardt   ICAI’11 Workshop on Intelligent Linguistic Technologies 2
Introduction & background
Structure?
    Seth Grimes sees « three categories of data: (i) Quantities, whether
    measured, observed, or computed (ii) Content, which I’ll characterize as
    non-quantitative information (iii) Metadata describing quantities and
    content. Structured/unstructured is a false dichotomy. »
    (July 2011 – IKS Semantic Workshop, France)

Man versus machine?
    Neil Glassman: « between those on one side who feel the accuracy of
    automated [content analysis] is sufficient and those on the other side
    who feel we can only rely on human analysis […] most in the field concur
    with the idea that we need to define a methodology where the software and
    the analyst collaborate to get over the noise and deliver accurate
    analysis. »
    (May 2011 – Sentiment Analysis Symposium review)
Textometry and Web Mining: why ?
• Improving Linguistic Models
   – Semantic complexity of simple units such as named entities (NE)
   – Identifying paraphrases of NE


NE: a heterogeneous category
(Ehrmann M. (2008) les EN de la linguistique au TAL statut théorique et méthodes de désambiguïsation.)
   INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym,
   The 4th of July, 20GB, Dulles International Airport, Le Tour de France,
   www.nytimes.com

Paraphrases of a single NE
   Le président de la République, Sarko, Nicolas Sarkozy, Sarkoland,
   M. Sarkozy, Sarkozyste, Mr Sarkozy, Sarkozysme

Textometry and Web Mining: why ?
         •   Text is considered to have its own internal structure
         •   Statistical and probabilistic calculations are applied directly to the
             textual units of comparable texts in a corpus




Textometry and Web Mining: how?

Hypergeometric Distribution

July 4th 2011
    Form    Specificness
    b       23.43
    b       12.68
    b       5.57
    b       5.66

July 5th 2011
    Form    Specificness
    d       13.73
    d       21.86
    d       7.75
    d       6.55
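
A specificness score built on the hypergeometric distribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter names (f, F, t, T) and the choice to report the score as a signed base-10 exponent of the tail probability are assumptions, and whether the values on the slide were computed in exactly this way is not confirmed by the slides.

```python
from math import comb, log10

def specificness(f, F, t, T):
    """Signed hypergeometric specificness of a form in one part of a corpus.

    Hypothetical parameter names (not from the slides):
      f: occurrences of the form in the part
      F: occurrences of the form in the whole corpus
      t: number of tokens in the part
      T: number of tokens in the whole corpus
    Returns a positive score when the form is over-represented in the part
    and a negative score when it is under-represented.
    """
    lowest, highest = max(0, t - (T - F)), min(F, t)
    if not lowest <= f <= highest:
        raise ValueError("f is impossible for the given F, t and T")

    def ways(ks):
        # Number of draws of t tokens that contain k occurrences of the form
        return sum(comb(F, k) * comb(T - F, t - k) for k in ks)

    total = comb(T, t)  # all possible draws of t tokens out of T
    if f * T >= F * t:  # observed count at or above the expected F * t / T
        # -log10 P(X >= f), computed on exact integers to avoid underflow
        return log10(total) - log10(ways(range(f, highest + 1)))
    # log10 P(X <= f), negative for under-represented forms
    return log10(ways(range(lowest, f + 1))) - log10(total)

# Toy figures: a form seen 20 times in 1,000 tokens, 8 of them in a 100-token part
score = specificness(f=8, F=20, t=100, T=1000)
print(round(score, 2))
```

Working with exact integer counts and taking the logarithm only at the end keeps the score usable even when the probability itself is too small for a float.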




Textometry and Web Mining: how?
Co-occurrence: two or more words that appear together within a predetermined span of
text, forming lexical relationships around a pivot form (William Martinez, 2003)
Result: a network of associative relationships




  ---A---C---B---D.
  ---B---C---H---E.
  ---B---C---A---E.
  ---E---B---D---F.
  ---C---A---D---H.
  ---F---C---B---D.
  ---E---B---D---A.

  [Figure: co-occurrence network linking the forms A, B, C and E]
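
Counting co-occurrents around a pivot form can be sketched with the seven toy sequences from this slide. This is only a raw count under stated assumptions: each sequence is treated as one span, and the function name `cooccurrents` is hypothetical. The method cited from Martinez (2003) selects co-occurrents with a statistical criterion, which this sketch does not reproduce.

```python
from collections import Counter

# The seven toy sequences from the slide, one span per line
SEQUENCES = [
    "A C B D",
    "B C H E",
    "B C A E",
    "E B D F",
    "C A D H",
    "F C B D",
    "E B D A",
]

def cooccurrents(sequences, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for sequence in sequences:
        forms = set(sequence.split())
        if pivot in forms:
            # Every other form in the span co-occurs once with the pivot
            counts.update(forms - {pivot})
    return counts

counts = cooccurrents(SEQUENCES, "B")
for form, n in counts.most_common():
    print(form, n)
```

Repeating the count with each retained co-occurrent as the new pivot would give the edges of a network like the one in the figure.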
Textometry and Web Mining : how?
1/ POINT OF ENTRY
     NE (companies and people)
     Company NE = Xerox
     People NE = Nicolas Sarkozy
     Article selection

2/ CORPUS
     184,761 occurrences / 13,075 forms / 5,194 hapax (160 articles)
     197,341 occurrences / 17,807 forms / 9,416 hapax (103 articles)

3/ TEXTOMETRIC ANALYSIS
     Hypergeometric distribution
     Specificness
     Cooccurrences

4/ INTERPRETATION OF RESULTS
     Quantitative information to formulate qualitative interpretations.
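
The three corpus measures quoted on this slide (occurrences, forms, hapax) can be computed along the following lines. This is a rough sketch: the function name `corpus_profile` and the regular-expression tokenizer are assumptions, and textometric tools segment text with their own delimiter rules, so counts on real articles would differ.

```python
import re
from collections import Counter

def corpus_profile(text):
    """Return the number of occurrences, distinct forms and hapax in a text."""
    # Assumed tokenization: maximal runs of word characters, case preserved
    tokens = re.findall(r"\w+", text)
    frequencies = Counter(tokens)
    return {
        "occurrences": len(tokens),      # running words
        "forms": len(frequencies),       # distinct word forms
        "hapax": sum(1 for n in frequencies.values() if n == 1),  # forms seen once
    }

profile = corpus_profile("the cat saw the dog")
print(profile)
```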




Textometry and Web Mining: results?
   Observing the forms and repeated segments associated with « Nicolas Sarkozy »
   makes it possible to identify polarities of opinion in paraphrases,
   which provides clues about how the NE is perceived.




   [Figure: paraphrases grouped under two labels, « contextually dependent » and « negative »]
Textometry and Web Mining: results?




             Figure - Monthly variation of specificness for paraphrases of the NE « Nicolas Sarkozy ».

Textometry and Web Mining: results?
When a current event is discussed in the media, the lexical network produced by the
co-occurrence calculation is larger than during periods of calm or low activity
for the NE (« buzz effect »).




Conclusion
• Two intelligence use cases on Le Monde and The New York Times
• Two complementary approaches: specificness and co-occurrence analysis

• Three main contributions:
   – Building corpus-driven linguistic resources (saving time and cost)
   – Identifying trends with specificness calculation
   – Targeting zones of activity or events through co-occurrence networks

• In sum, this method:
   – Helps derive knowledge from corpora without predefined information
     models
   – Provides functions that enable interaction between the expertise of
     the user and the processing tools
References
Bloom, K., Stein, S. & Argamon, S. Appraisal extraction for news opinion analysis at NTCIR-6. Proceedings of NTCIR-6, 2007, p. 279–289.
Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.
Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.
Delaplace, R., Leenhardt, M. & Wu, L.-C. Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur. VSST Conference, Toulouse, France, 2010.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
Feldman, R. & Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006, 422 p.
Firth, J.R. A Synopsis of Linguistic Theory 1930-1955. Linguistic Analysis Philological Society, Oxford, 1957.
Grishman, R. & Sundheim, B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I. Copenhagen, 1996, p. 466–471.
Kodratoff, Y. Knowledge discovery in texts: A definition and applications. Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.
Lebart, L. & Salem, A. Statistique textuelle. Paris: Dunod, 1994.
Lent, B., Agrawal, R. & Srikant, R. Discovering trends in text databases. Proceedings KDD’1997, AAAI Press, 14–17, p. 227–230.
MacMurray, E. & Shen, L. Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events. VSST Conference, Toulouse, France, 2010.
Martin, J.R. & White, P.R.R. The Language of Evaluation: Appraisal in English. Palgrave, London, 2005.
Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels. Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
Née, E. Insécurité et élections présidentielles dans le journal Le Monde. Lexicometrica, numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.
Poibeau, T. Extraction automatique d’information. Du texte brut au web sémantique. Paris: Hermès Sciences, 2003.
Poibeau, T. Sur le statut référentiel des entités nommées. Proceedings TALN’05, Dourdan, France, 2005.
Salem, A. Introduction à la résonance textuelle. In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986–992.
Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris: Editions Technip, 2007.
Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research. International Journal of Applied Management and Technology, 2006, vol. 4, n°2, p. 349–387.




22/07/2011                   E. MacMurray & M. Leenhardt                              ICAI’11 Workshop on Intelligent Linguistic Technologies 15

Mais conteúdo relacionado

Semelhante a Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Gilbert Paquette
 
Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Touradj Ebrahimi
 
Computing for Human Experience and Wellness
Computing for Human Experience and WellnessComputing for Human Experience and Wellness
Computing for Human Experience and WellnessAmit Sheth
 
Network of future 2011 befemto
Network of future 2011   befemtoNetwork of future 2011   befemto
Network of future 2011 befemtoThierry Lestable
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1STIinnsbruck
 
Uncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsUncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsYiwei Cao
 
Invited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationInvited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationNational Taiwan Normal University
 
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...IOSRJVSP
 
e-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levele-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levelVassilis Goulandris
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsMustafa Jarrar
 
Analyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseAnalyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseEURECOM
 
Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Nuno Carvalho
 
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Mathieu d'Aquin
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE ijmpict
 
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringGOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringChristophe Debruyne
 
The EM network October 2010 Newsletter
The EM network October 2010 NewsletterThe EM network October 2010 Newsletter
The EM network October 2010 Newsletterphovdenak
 
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...Ionel Gabriel Niculescu
 

Semelhante a Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web (20)

Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...
 
Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...
 
Computing for Human Experience and Wellness
Computing for Human Experience and WellnessComputing for Human Experience and Wellness
Computing for Human Experience and Wellness
 
Network of future 2011 befemto
Network of future 2011   befemtoNetwork of future 2011   befemto
Network of future 2011 befemto
 
Reference Ontology Presentation
Reference Ontology PresentationReference Ontology Presentation
Reference Ontology Presentation
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1
 
Uncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsUncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information Systems
 
Invited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationInvited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigation
 
Geolinkeddata 07042011 1
Geolinkeddata 07042011 1Geolinkeddata 07042011 1
Geolinkeddata 07042011 1
 
GeoLinkedData
GeoLinkedDataGeoLinkedData
GeoLinkedData
 
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
 
e-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levele-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local level
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnets
 
Analyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseAnalyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their Reuse
 
Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012
 
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
 
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringGOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
 
The EM network October 2010 Newsletter
The EM network October 2010 NewsletterThe EM network October 2010 Newsletter
The EM network October 2010 Newsletter
 
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
 

Último

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Último (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘

Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

  • 1. Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web Erin MacMurray*, Marguerite Leenhardt ** SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3 *erin.macmurray@gmail.com ** marguerite.leenhardt@gmail.com ICAI’11 Workshop on Intelligent Linguistic Technologies
  • 2. In a nutshell • Introduction & background • Textometry and Web Mining: why? • Textometry and Web Mining: how? • Textometry and Web Mining: application? • Conclusion 22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2
  • 3. Introduction & background. Structure? Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. » (July 2011 – IKS Semantic Workshop, France). Man versus machine? Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. » (May 2011 – Sentiment Analysis Symposium review).
  • 4. Textometry and Web Mining: why? • Improving Linguistic Models – Semantic complexity of simple units such as named entities (NE) – Identifying paraphrases of NE. NE, a heterogeneous category: INTEL; Paris; Gone with the wind; Harry Potter; JIF Peanut Butter; lyzozym; The 4th of July; 20GB; Dulles International Airport; Le Tour de France; www.nytimes.com. Paraphrases of a single NE: Le président de la République; Sarko; Nicolas Sarkozy; Sarkoland; M. Sarkozy; Sarkozyste; Mr Sarkozy; Sarkozysme. Source: Ehrmann M. (2008), Les EN de la linguistique au TAL : statut théorique et méthodes de désambiguïsation.
  • 5. Textometry and Web Mining: why? • Text is considered to have its own internal structure • Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus
  • 6. Textometry and Web Mining: how? Hypergeometric Distribution. Form / Specificness, July 4th 2011: b 23.43; b 12.68; b 5.57; b 5.66. Form / Specificness, July 5th 2011: d 13.73; d 21.86; d 7.75; d 6.55.
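The slide names the hypergeometric distribution but does not spell out the calculation. The sketch below is one common way a specificness score is derived from it in textual statistics: the probability of seeing a form at least (or at most) `f` times in a corpus part is computed under the hypergeometric model, and the base-10 logarithm of that probability is reported with a sign. The function names, the parameter names (`T`, `F`, `t`, `f`) and the signed-log convention are assumptions made for illustration, not details taken from the slides.

```python
from math import comb, log10

def hypergeom_pmf(k, T, F, t):
    """P(X = k): a form with F occurrences in a corpus of T tokens
    appears exactly k times in a part of t tokens."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)

def specificness(f, T, F, t):
    """Signed score (assumed convention): positive when the form is
    over-represented in the part, negative when under-represented."""
    lowest = max(0, t - (T - F))   # smallest possible count in the part
    highest = min(F, t)            # largest possible count in the part
    p_over = sum(hypergeom_pmf(k, T, F, t) for k in range(f, highest + 1))
    p_under = sum(hypergeom_pmf(k, T, F, t) for k in range(lowest, f + 1))
    tiny = 1e-300                  # guard against log10(0) on underflow
    if p_over <= p_under:
        return -log10(max(p_over, tiny))
    return log10(max(p_under, tiny))

# Toy figures (hypothetical): corpus of 10 tokens, form occurs 4 times,
# part of 5 tokens containing all 4 occurrences.
score_over = specificness(4, 10, 4, 5)
score_under = specificness(0, 10, 4, 5)
```

With these toy figures the form is over-represented when all four occurrences fall in the part and under-represented when none do, so the two scores have opposite signs.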
  • 7. Textometry and Web Mining: how? Co-occurrence: two or more words that appear together within a predetermined span of text; lexical relationships around a pivot-form (William Martinez, 2003). Example spans: A-C-B-D. B-C-H-E. B-C-A-E. E-B-D-F. C-A-D-H. F-C-B-D. E-B-D-A. Result: network of associative relationships (nodes shown: A, B, C, E).
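To make the idea of co-occurrents around a pivot-form concrete, here is a minimal sketch that counts, for the seven toy spans on the slide, how many spans each form shares with a chosen pivot. It uses raw co-frequency only; it is not the statistical method of Martinez (2003), and the function name `cooccurrents` is a hypothetical helper introduced for this example.

```python
from collections import Counter

# The seven toy spans shown on the slide, one string per span.
spans = ["A C B D", "B C H E", "B C A E", "E B D F",
         "C A D H", "F C B D", "E B D A"]

def cooccurrents(spans, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span.split())
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts

counts = cooccurrents(spans, "B")
```

Forms with the highest counts are the strongest candidates for edges around the pivot in the resulting network; a real analysis would replace the raw count with a significance test.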
  • 8. Textometry and Web Mining: how? 1/ POINT OF ENTRY: NE (companies and people); Company NE = Xerox, People NE = Nicolas Sarkozy. 2/ CORPUS: article selection; 160 articles (184,761 occurrences / 13,075 forms / 5,194 hapax) and 103 articles (197,341 occurrences / 17,807 forms / 9,416 hapax). 3/ TEXTOMETRIC ANALYSIS: hypergeometric distribution, specificness, co-occurrences. 4/ INTERPRETATION OF RESULTS: quantitative information to formulate qualitative interpretations.
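The corpus figures on this slide (occurrences, forms, hapax) are standard lexicometric counts. The sketch below shows how they can be computed for a text, assuming a simple lowercase word tokenization; the tokenizer, the function name `corpus_profile` and the sample sentence are assumptions for illustration, and the authors' own segmentation rules are not given in the slides.

```python
import re
from collections import Counter

def corpus_profile(text):
    """Return the number of occurrences (tokens), forms (distinct tokens)
    and hapax (forms that occur exactly once)."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokenization
    freq = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(freq),
        "hapax": sum(1 for count in freq.values() if count == 1),
    }

profile = corpus_profile("Sarko said Sarko will win")
```

Different tokenization choices (case folding, handling of apostrophes and hyphens) change all three counts, so figures are only comparable between corpora segmented the same way.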
  • 9. Textometry and Web Mining: results? Observing the forms and repeated segments of « Nicolas Sarkozy » makes it possible to identify polarities of opinion in paraphrases, providing clues to how the NE is perceived. [Figure labels: contextually dependent; negative]
  • 10. Textometry and Web Mining: results? Figure: Monthly variation of specificness for paraphrases of the NE « Nicolas Sarkozy ».
  • 11. Textometry and Web Mining: results? When a current event is discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity of the NE (« buzz effect »).
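One simple way to observe this effect is to measure, per period, how many distinct forms co-occur with the NE. The sketch below does this on invented data; the period labels, the sample spans and the function name `network_size_by_period` are all hypothetical, and the size measure (number of distinct co-occurrents) is an assumption rather than the authors' stated metric.

```python
from collections import defaultdict

def network_size_by_period(dated_spans, pivot):
    """Map each period to the number of distinct forms that share
    at least one span with the pivot during that period."""
    neighbours = defaultdict(set)
    for period, forms in dated_spans:
        if pivot in forms:
            neighbours[period].update(f for f in forms if f != pivot)
    return {period: len(forms) for period, forms in neighbours.items()}

# Hypothetical spans: (period, list of forms in the span).
dated_spans = [
    ("2011-06", ["NE", "visit"]),
    ("2011-07", ["NE", "summit", "crisis"]),
    ("2011-07", ["NE", "crisis", "vote", "debate"]),
    ("2011-07", ["other", "vote"]),
]
sizes = network_size_by_period(dated_spans, "NE")
```

A period whose size stands well above the others is a candidate zone of activity to inspect by hand.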
  • 12. Textometry and Web Mining: results?
  • 13. Textometry and Web Mining: results?
  • 14. Conclusion • Two intelligence use-cases on Le Monde and The New York Times • Two complementary approaches: specificness and co-occurrence analysis • Three main contributions: – Building corpus-driven linguistic resources (time and cost-cutting) – Identifying trends with specificness calculation – Targeting zones of activity or events through co-occurrence networks • In sum, this method: – Helps derive knowledge from corpora without predefined information models – Provides adequate functions enabling interaction between the expertise of the user and processing tools
  • 15. References
    Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p. 279-289.
    Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.
    Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.
    Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST Conference, Toulouse, France, 2010.
    Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
    Feldman R. & Sanger J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.
    Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, in Studies in Linguistic Analysis, Philological Society, Oxford, 1957.
    Grishman, R. & Sundheim, B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I. Copenhagen, 1996, p. 466–471.
    Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.
    Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.
    Lent, B., Agrawal, R. & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17, p. 227–230.
    MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.
    Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.
    Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
    Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.
    Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris: Hermès Sciences, 2003.
    Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.
    Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986-992.
    Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
    Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris: Editions Technip, 2007.
    Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology, 2006, vol. 4 n°2, p. 349-387.