General Assembly (AG) of Corpus-écrits, 21 November 
Consortium Corpus-écrits 
SIG 
TEI-CMC 
Open Resources and 
TOols for LANGuage 
http://comere.org 
http://hdl.handle.net/11403/comere 
Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, 
Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, 
Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
http://www.tei-c.org/Activities/SIG/CMC/ 
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
Our subject and goals 
Our subject: 
• building and annotating corpora of computer-mediated 
communication (CMC) – as resources for empirical research on 
CMC phenomena in the Humanities (linguistics, communication 
science, language technology, …) 
This resource must therefore be freely accessible (open 
access research data) so that it can be reused by 
research communities. 
We will come back to this point later.
Our subject and goals 
Computer-mediated communication (CMC): 
All genres of interpersonal communication mediated 
through computer networks (the internet) and used 
via personal computers and/or mobile devices: chats, 
online forums, instant messaging, tweets, comments 
on weblogs, discussions in wikis and on “social network” 
sites, interactions in multimodal communication 
environments such as Skype, MMORPGs or “virtual 
worlds” (e.g., SecondLife), SMS, WhatsApp, …
Our subject and goals 
Our vision: These corpora shall be … 
• interoperable (i) with each other and (ii) with other types of 
linguistic corpora (text corpora, speech corpora) 
• represented in conformance with established encoding standards in 
the field of Digital Humanities 
• linguistically annotated in order to allow for sophisticated 
queries and language-focused research
Our subject and goals 
The problem / challenge: 
• So far, there are no established standards for the 
representation of CMC genres 
• Established standards for the representation of text genres do 
not include models for the representation of the peculiarities of 
CMC 
• “Off the shelf” NLP tools for automatic linguistic analysis and 
annotation (tokenizers, part-of-speech taggers, lemmatizers, 
normalizers, parsers) do not perform well on CMC data 
(because they have usually been trained on edited text and 
therefore can’t handle “non-standard” phenomena and 
multimodal elements in CMC discourse)
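The tokenization problem named above can be illustrated with a minimal sketch. It compares a naive tokenizer of the kind that suits edited text with a pattern that looks for CMC-specific units first. Both patterns, the sample message and the names NAIVE, CMC_AWARE and tokenize are invented for this illustration; they are not tools or data from the project.

```python
import re

# Naive tokenizer suited to edited text: runs of word characters,
# or any single non-space symbol.
NAIVE = re.compile(r"\w+|[^\w\s]")

# Hypothetical CMC-aware pattern: URLs, hashtags, mentions and a few
# emoticons are tried before falling back to words and single symbols.
CMC_AWARE = re.compile(
    r"https?://\S+"          # URLs
    r"|[#@]\w+"              # hashtags and @mentions
    r"|[:;=][-']?[()DPp]"    # a small set of Western-style emoticons
    r"|\w+"                  # ordinary words (and forms such as "2m1")
    r"|[^\w\s]"              # any other single symbol
)


def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text, left to right."""
    return pattern.findall(text)


# Invented SMS-style message, for illustration only.
message = "slt @marie tu viens 2m1 ? :-) #exam"

naive_tokens = tokenize(NAIVE, message)
cmc_tokens = tokenize(CMC_AWARE, message)

print(naive_tokens)
print(cmc_tokens)
```

The naive pattern breaks the emoticon, the mention and the hashtag into separate symbols, while the second pattern keeps each as one token. A real annotation pipeline would need far broader coverage than these few alternatives.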
Our subject and goals 
Our goals: 
• work on solutions for these desiderata 
• develop suggestions for standards for 
- packaging and sharing (mono- and multimodal) CMC 
corpora, 
- modeling these types of “texts” within a framework which 
conforms to the encoding framework of the Text 
Encoding Initiative (TEI) and thus to a widely accepted de facto 
standard in the field of Digital Humanities, 
- processing and annotating these corpora (part-of-speech, 
normalization, ...) with NLP tools.
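As a rough sketch of what modeling a CMC message in a TEI-style framework could look like, the snippet below builds one chat post as XML with Python's standard library. The TEI namespace URI is the real one. The element name post, the attributes who and when, the identifiers and the timestamp are assumptions made for this illustration and should not be read as the schema proposed by the SIG.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"           # TEI namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:id
ET.register_namespace("", TEI_NS)                # serialize TEI as default namespace


def make_post(post_id, who, when, text):
    """Build one <post> element; element and attribute names are assumptions."""
    post = ET.Element(
        f"{{{TEI_NS}}}post",
        {f"{{{XML_NS}}}id": post_id, "who": who, "when": when},
    )
    paragraph = ET.SubElement(post, f"{{{TEI_NS}}}p")
    paragraph.text = text
    return post


# Invented example data: one textchat message from a hypothetical participant.
body = ET.Element(f"{{{TEI_NS}}}body")
div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": "textchat"})
div.append(make_post("post1", "#participant1", "2014-11-21T10:15:00", "bonjour :-)"))

xml_string = ET.tostring(body, encoding="unicode")
print(xml_string)
```

Recording author and time as attributes of each post is one way to make chat, SMS and forum messages queryable in the same manner, which is the kind of interoperability the goals above aim for.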
Who belongs to our community (so far)? 
Our kernel projects 
and founding members 
http://glottoweb.org/web2corpus/ 
http://hdl.handle.net/11403/comere 
French CMC corpora 
Infrastructure for languages 
National consortium on corpora 
National infrastructure 
for Digital Humanities 
Scientific network 
“Empirical research of CMC” 
http://www.empirikom.net 
Dortmund Chat Corpus 
http://www.chatkorpus.tu-dortmund.de 
German Reference Corpus of CMC 
http://www.tinyurl.com/derik-llc 
Wikipedia corpus in DeReKo 
(Mannheim) 
German CMC corpora 
Dutch CMC corpora 
SoNaR 
(Stevin Nederlandstalig Referentiecorpus) 
Italian CMC pilot corpus
Activities and initiatives (past and future) 
2013, 2014 
- European workshops on CMC corpora (Dortmund) 
- special journal issue (JLCL) 
Our 
pathway 
2013 
creation of the TEI-CMC SIG 
End of 2014 
Publication of CMC French 
corpora (CoMeRe) in open 
access, all TEI-CMC 
2015 
Application to CLARIN-DE 
Transform existing German 
corpora into TEI-CMC 
2015 October 
International 
CMC conference 
Rennes (Ledegen) 
2015 
Submission 
of TEI-CMC 
model 
2015 
Launch larger 
CMC-corpora 
community 
2016 
Common system 
of basic CMC-annotations 
(POS tagging)
Project supported by the national 
consortium Corpus-écrits, sub-part of 
Huma-Num, and Ortolang 
Consortium Corpus-écrits 
Objective: Kernel corpus assembling existing corpora of different CMC 
genres and new corpora built on data extracted from the Internet. These 
heterogeneous corpora will be structured and processed in a uniform way, 
complemented with metadata. CoMeRe will be released as OpenData 
through the national infrastructure Ortolang, following constraints which will 
be reused for the forthcoming “Corpus de Référence du Français”. 
Variety + Standards + Open Access 
http://comere.org 
http://hdl.handle.net/11403/comere
Individual depositor 
Server 
Local LRL 
Engineer: 
Kun Jin 
Quality group 
Discussion with 
depositor 
Tagging group 
NLP: TEI-v2 
TEI-v1 
Funding: ORTOLANG > Corpus-écrits > LRL
Ref                               Tokens    Participants    Posts                                           Environment
(Antoniadis, 2014)                449 313   359             22 052                                          SMS
(Falaise, 2014)                   35 M      25 000          3 M                                             textchat
(Ledegen, 2014)                   357 000   850             22 000                                          SMS
(Reffay et al., 2014)             600 000   67 + 4 groups   textchat 6 790, emails 2 030, forums 2 686      LMS
(Yun, Chanier, 2014)              77 605    31 + 2 courses  7 750                                           textchat
(Abendroth-Timmer et al., 2014)   273 546   26 + 4 groups   1 200                                           Blog
(Longhi, Marinica, 2014)          567 851   205             34 273                                          Tweet
Context labels as listed on the slide: informal, business, informal, informal, education, education, education, politics
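The figures in the table above allow a quick comparison of average message length across genres. The sketch below divides tokens by posts for each corpus. It assumes that the token counts cover exactly the posts listed, treats the rounded values 35 M and 3 M as exact, and sums the three post types of the LMS corpus. The dictionary name and labels are chosen for this illustration.

```python
# (tokens, posts) per corpus, copied from the table; 35 M and 3 M are rounded.
corpora = {
    "Antoniadis 2014 (SMS)": (449_313, 22_052),
    "Falaise 2014 (textchat)": (35_000_000, 3_000_000),
    "Ledegen 2014 (SMS)": (357_000, 22_000),
    # LMS corpus: textchat + emails + forums posts added together.
    "Reffay et al. 2014 (LMS)": (600_000, 6_790 + 2_030 + 2_686),
    "Yun & Chanier 2014 (textchat)": (77_605, 7_750),
    "Abendroth-Timmer et al. 2014 (blog)": (273_546, 1_200),
    "Longhi & Marinica 2014 (tweets)": (567_851, 34_273),
}

# Average number of tokens per post for each corpus.
tokens_per_post = {
    name: tokens / posts for name, (tokens, posts) in corpora.items()
}

# Print from the longest to the shortest average post.
for name, ratio in sorted(tokens_per_post.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {ratio:8.1f}")
```

Under these assumptions, blog posts average roughly 228 tokens while textchat turns average roughly 10 to 12, which shows how heterogeneous the genres in the kernel corpus are.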
Verbal 
Mono-mode, mono-modality: 
- Textchat 
- Forum 
- SMS 
- Tweets 
- Email 
- Blogs (image is not a means of interaction) 
Multiple modalities: 
- LMS: email, forum, chat 
Multiple modes: 
- Conference system: audiochat, textchat 
Verbal & non-verbal 
Conference system, 3D environment, etc.: 
- Audiochat 
- Textchat 
- Icons 
- Collective production: whiteboard, word processor, semantic maps 
- Avatars 
- …
Time(s) 
Interaction 
Space 
Locations 
Course 
Session 
Channel 
Simultaneity 
Participants 
Environments 
Author 
Addressee(s) 
Group 
Network
http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI 
New macro-level elements
1.5 min video 
* Paper: (Wigham & Chanier, 2013) CALL 
journal 
* Data: (Wigham, 2013) LETEC corpus 
Modality interplay 
Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
Multimodality: verbal and non-verbal 
(Wigham & Chanier, 2013) 
Context: Lyceum conf environment, 3 learners (English L2) working in 
a word processor: one writing, others helping 
Collab word 
processor 
Audio: 
clarification 
Textchat: 
Correction 
(with error) 
Textchat: 
Request 
confirmation 
Now in 
TEI-speech
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
• the user is authorised to download a copy of the corpus […] 
• reuse (reproduction, distribution) of non-substantial parts of the corpus XXX is authorised […] 
• reuse is subject to the condition of citing in full, as credits: […] 
• reuse (reproduction, distribution) of substantial parts of the corpus XXX is not permitted on 
the basis of this licence. 
I agree to these terms of use (required to access the corpus) 
Example of a corpus licence displayed on the National Infrastructure for Digital 
Humanities and considered to be "open access" 
Viewing but not re-using: 
is that OA? 
AG Corpus-écrits, 21 novembre meeting highlights

  • 1. AG Corpus-écrits, 21 novembre Consortium Corpus-écrits SIG TEI-CMC Open Resources and TOols for LANGuage http://comere.org http://hdl.handle.net/11403/comere Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
  • 3. Our subject and goals. Our subject: building and annotating corpora of computer-mediated communication (CMC) – as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …). This resource must therefore be freely accessible (open access research data) so that it can be reused by research communities. We will return to this point later.
  • 4. Our subject and goals. Computer-mediated communication (CMC): all genres of interpersonal communication mediated through computer networks (the internet) and used via personal computers and/or mobile devices: chats, online forums, instant messaging, tweets, comments on weblogs, discussions in wikis and on “social network” sites, interactions in multimodal communication environments such as Skype, MMORPGs or “virtual worlds” (e.g., SecondLife), SMS, WhatsApp, ...
  • 5. Our subject and goals. Our subject: building and annotating corpora of computer-mediated communication (CMC) – as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …). Our vision: these corpora shall be (1) interoperable (i) with each other and (ii) with other types of linguistic corpora (text corpora, speech corpora); (2) represented in conformance with established encoding standards in the field of Digital Humanities; (3) linguistically annotated in order to allow for sophisticated queries and language-focused research.
  • 6. Our subject and goals. The problem / challenge: (1) To date, there are no established standards for the representation of CMC genres. (2) Established standards for the representation of text genres do not include models for the peculiarities of CMC. (3) “Off the shelf” NLP tools for automatic linguistic analysis and annotation (tokenizers, part-of-speech taggers, lemmatizers, normalizers, parsers) do not perform well on CMC data, because they have usually been trained on edited text and therefore cannot handle “non-standard” phenomena and multimodal elements in CMC discourse.
  • 7. Our subject and goals. Our goals: (1) work on solutions for these desiderata; (2) develop suggestions for standards for (a) packaging and sharing (mono- and multimodal) CMC corpora, (b) modeling these types of “texts” within a framework that conforms to the encoding framework of the Text Encoding Initiative (TEI), a widely accepted de facto standard in the field of Digital Humanities, and (c) processing and annotating these corpora (part-of-speech, normalization, ...) with NLP tools.
  • 8. Who belongs to our community (so far)? Our kernel projects and founding members. French CMC corpora: http://glottoweb.org/web2corpus/ http://hdl.handle.net/11403/comere (infrastructure for languages; national consortium on corpora; national infrastructure for Digital Humanities). German CMC corpora: scientific network “Empirical research of CMC” http://www.empirikom.net; Dortmund Chat Corpus http://www.chatkorpus.tu-dortmund.de; German Reference Corpus of CMC http://www.tinyurl.com/derik-llc; Wikipedia corpus in DeReKo (Mannheim). Dutch CMC corpora: SoNaR (Stevin Nederlandstalig Referentiecorpus). Italian CMC pilot corpus.
  • 9. Activities and initiatives (past and future). Our pathway: 2013: creation of the TEI-CMC SIG; 2013, 2014: European workshops on CMC corpora (Dortmund), special journal issue (JLCL); end of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC; 2015: application to CLARIN-DE, transform existing German corpora into TEI-CMC; October 2015: international CMC conference, Rennes (Ledegen); 2015: submission of the TEI-CMC model; 2015: launch of a larger CMC-corpora community; 2016: common system of basic CMC annotations (POS tagging).
  • 10. Project supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang. Objective: a kernel corpus assembling existing corpora of different CMC genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”. Variety + Standards + Open Access. http://comere.org http://hdl.handle.net/11403/comere
  • 11. Individual depositor; local LRL server; engineer: Kun Jin; quality group; discussion with the depositor; NLP tagging group: TEI-v2; TEI-V1. Funding: ORTOLANG > Corpus-écrits > LRL
  • 14.
        Ref                               Tokens    Partici.        Posts                                            Envir.
        (Antoniadis, 2014)                449 313   359             22 052                                           SMS
        (Falaise, 2014)                   35 M      25 000          3 M                                              textchat
        (Ledegen, 2014)                   357 000   850             22 000                                           SMS
        (Reffay et al., 2014)             600 000   67 + 4 groups   textchat: 6 790; emails: 2 030; forums: 2 686    LMS
        (Yun, Chanier, 2014)              77 605    31 + 2 courses  7 750                                            textchat
        (Abendroth-Timmer et al., 2014)   273 546   26 + 4 groups   1 200                                            Blog
        (Longhi, Marinica, 2014)          567 851   205             34 273                                           Tweet
        Context labels shown beside the table: informal, business, informal, informal, education, education, education, politics
  • 25. Mono-mode, mono-modality: textchat, forum, SMS, tweets, email, blogs (image not a means of interaction). Labels: Verbal; Verbal & Non-verbal. Multiple modalities: LMS (email, forum, chat). Multiple modes: conference system (audiochat, textchat); conference system, 3D environment, etc. (audiochat, textchat, icons, collec. prod.: whiteboard, word processor, semantic maps; avatars; …)
  • 26. Time(s) Interaction Space Locations Course Session Channel Simultaneity Participants Environments Author Adresse(s) Group Network
  • 28. Modality interplay. 1.5 min video. * Paper: (Wigham & Chanier, 2013) CALL journal * Data: (Wigham, 2013) LETEC corpus. Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
  • 29. Multimodality: verbal and non-verbal (Wigham & Chanier, 2013)
  • 30. Context: Lyceum conference environment, 3 learners (English L2) working in a word processor: one writing, the others helping. Collaborative word processor. Audio: clarification. Textchat: correction (with error). Textchat: request for confirmation. Now in TEI-speech
  • 33. the user is authorized to download a copy of the corpus […] • reuse (reproduction, distribution) of non-substantial parts of the XXX corpus is authorized […] • reuse is subject to the condition of citing in full, as credits: […] • reuse (reproduction, distribution) of substantial parts of the XXX corpus is not permitted under this licence of use. I consent to these conditions of use (mandatory in order to access the corpus). Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered to be "open access". Viewing but not re-using: is that OA?

Editor's Notes

  1. I'll do a final edit of this slide in the next few days (in Rome, before the meeting...)
  2. Talk about citations / references
  3. Structured research log: created by the researcher, not by a documentalist. CoMeRe repository, Polititweet. OLAC: reduced metadata for CLARIN. Skip one level (Polititweet). Go into the detail of Polititweet: PDF manual. Then Simuligne: diversity with LMS, participants
  4. In the first case, one can correct by hand. Unfortunately, discussions are organized in very varied ways. Quite often authors do not follow these guidelines. Figure 3‑3 gives an illustration. One person explicitly types the characters "Réponse :" at the beginning of their text and then seems to sign using the indentation mark, only for this signature. Here the signature gives only an IP address and the date. It is hard to tell where the first author's text ends. The person replying apparently intervenes twice, without respecting the formats, and seems to end with a signature indication, Curry (though not in the Wikipedia sense). If we examine the link associated with this last word, we find not an author page but a general Wikipedia page (cf. Figure 3‑4)! Processing such pages automatically is therefore problematic.
  5. An Interaction Space is an abstract concept, located in time (with a beginning and an ending date in absolute time, hence a time frame), where interactions between a set of participants occur within an online location. The online location is defined by the properties of the set of environments used by the set of participants.
  6. In one of our papers, which will appear in the CALL journal and whose corresponding data are already online in Mulce, Ciara Wigham discusses the interplay between audio and textchat. Here is an extract from Archi21. In the left column you have the transcription of the audio of one learner, who presents his feelings about the ongoing process of his architectural project. He is a native French speaker and speaks English as his L2. In the 3 other columns on the right, you find textchat turns from the tutor and two other learners belonging to the same architectural project group. Let me show you a short video. **** In this example of conversation doubling, the acts in the text chat respond to the voice chat (blue arrows), but equally acts in the voice chat respond to the text chat (orange arrows), and text chat acts respond to interaction in both voice chat and text chat modalities and prompt interaction in both modalities.
  7. http://88milsms.huma-num.fr/corpus.html
  8. There are 3 main criteria that research data should meet in order to be considered OpenData. Besides the data being available, the interesting point is that the data can be accessed in order to be reused and mixed with other data, and the licence should explicitly mention this. The second interesting point is that the constraints on reuse should be reduced to a minimum: the definition stipulates that "non-commercial" restrictions that would prevent "commercial" use, or restrictions of use for certain purposes, are not allowed.
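
The Interaction Space defined in note 5 can be read as a small data structure. The following is only an illustrative sketch in Python, not the TEI-CMC model itself: the class name, field names, and the two helper methods are hypothetical, chosen to mirror the wording of the note (a time frame, a set of participants, and an online location derived from the set of environments).

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class InteractionSpace:
    """Hypothetical sketch of the Interaction Space from note 5."""

    begin: datetime               # absolute start of the time frame
    end: datetime                 # absolute end of the time frame
    participants: frozenset[str]  # identifiers of the participants
    environments: frozenset[str]  # environments used, e.g. "textchat"

    def __post_init__(self) -> None:
        # A time frame that ends before it begins is not meaningful.
        if self.end < self.begin:
            raise ValueError("time frame must not end before it begins")

    @property
    def online_location(self) -> tuple[str, ...]:
        # Assumption: the online location is represented simply by the
        # sorted names of the environments used by the participants.
        return tuple(sorted(self.environments))

    def contains(self, moment: datetime) -> bool:
        # True if an interaction at this moment falls in the time frame.
        return self.begin <= moment <= self.end
```

A space built with the environments "textchat" and "audiochat" would then report both as its online location, and `contains` can be used to check whether a timestamped post belongs to that space.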