SlideShare uma empresa Scribd logo
1 de 14
Baixar para ler offline
Processing Extensive Text Data 
at the Leipzig Corpora Collection 
Dirk Goldhahn 
MLODE 2014, Leipzig 
Natural Language Processing Group 
Institute of Computer Science 
University of Leipzig
Leipzig Corpora Collection 
2 
● Project that offers 
● Corpus-based monolingual full form dictionaries 
● Started as project „Deutscher Wortschatz“ 
● Accessible at http://corpora.uni-leipzig.de 
● Web interface 
● Web services 
● Integration into CLARIN 
● Export as rdf (Sebastian Hellmann)
Available Resources (Online) 
3 
● Corpora in >220 languages 
● Text material 
● Word frequencies 
● Word co-occurrences 
● Semantic maps (strongest word co-occurrences) 
● Topics 
● Subject areas (partially) 
● Morphological information (partially) 
● POS-tagged sentences (partially) 
● Further semantic and grammatical information
Example 
4
Example 
5
Number of Languages over Time 
6
World Map Interface 
7
Data Stock 
●Number of languages: >300, >1000 including religious texts 
●Number of sentences (overall): ~4 billion 
●Biggest language: English (1.1 billion) 
8
Text Acquisition Methods 
9 
●Generic web crawler (Heritrix) 
● Whole TLDs 
● News (abyzNewsLinks) 
●RSS feeds (daily) 
→ http://wortschatz.uni-leipzig.de/wort-des-tages/ 
●Distributed text crawling (FindLinks) 
●Bootstrapping web corpora 
●Text dumps (Wikipedia, Watchtower) 
●Parallel corpora (Bible)
Corpus Creation 
10 
●Processing steps: 
● Stripping 
● Language Separation (about 500 languages) 
● Sentence Segmentation 
● Cleaning 
● Sentence Scrambling (copyright reasons) 
● Tokenization 
● Database Creation 
● Statistical Evaluation
Statistical Evaluation 
11 
●Enrichment of corpora with statistical annotations 
● Word frequencies 
● Co-occurrence frequencies and significance (based on left or right 
neighbours or whole sentences) 
● Topics, POS 
●Creation of further corpora statistics 
● >200 statistical properties based on letter, word, sentence, sources level 
etc. 
● e.g.: For automatic quality assurance using typical distributions 
Hindi newspaper 2010 % 
0 50 100 150 200 250 
1 
0,8 
0,6 
0,4 
0,2 
0 
0 50 100 150 200 250 
6 
5 
4 
3 
2 
1 
0 
Sundanese Wikipedia 2007 
%
LCC as Linked Data 
12 
●LD starts with data and metadata 
● Data + metadata for hundreds of languages 
● All data processed and stored identically, 
including metadata 
● Still at the beginning of turning into LD
LCC as Linked Data 
13 
●Monolingual Metadata: 
● Frequencies, word co-occurrences, NER, POS-tags, topics 
● Integration of external sources (DBPedia, FreeBase, Wordnet) 
● Still problems to solve (linking, disambiguation, ...) 
→ New Web Interface 
●Multilingual Metadata: 
● Crosslingual word co-occurrences (parallel text) 
● External sources (DBPedia...)
14 
●Data available via: 
● Web interface 
(http://corpora.informatik.uni-leipzig.de) 
● Download 
(http://corpora.informatik.uni-leipzig.de/download.html) 
● Webservices 
(http://wortschatz.uni-leipzig.de/Webservices/) 
Thank you!

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
Linking knowledge spaces
Linking knowledge spacesLinking knowledge spaces
Linking knowledge spaces
 
Linked Open Data: A simple how-to
Linked Open Data: A simple how-toLinked Open Data: A simple how-to
Linked Open Data: A simple how-to
 
Normalizing Data for Migrations
Normalizing Data for MigrationsNormalizing Data for Migrations
Normalizing Data for Migrations
 
XML: A New Standard for Data
XML: A New Standard for DataXML: A New Standard for Data
XML: A New Standard for Data
 
Resident good: NoSQL
Resident good: NoSQLResident good: NoSQL
Resident good: NoSQL
 
Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
Are Linked Datasets fit for Open-domain Question Answering? A Quality AssessmentAre Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
 
A Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big DataA Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big Data
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
Comparison with storing data using NoSQL(CouchDB) and a relational database.
Comparison with storing data using NoSQL(CouchDB) and a relational database.Comparison with storing data using NoSQL(CouchDB) and a relational database.
Comparison with storing data using NoSQL(CouchDB) and a relational database.
 
Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
DBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in Dublin
 
Open Data Node - Platform and Methodology - 2015-May
Open Data Node - Platform and Methodology - 2015-MayOpen Data Node - Platform and Methodology - 2015-May
Open Data Node - Platform and Methodology - 2015-May
 
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
 
NoSql evaluation
NoSql evaluationNoSql evaluation
NoSql evaluation
 
DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
On a web of data streams
On a web of data streamsOn a web of data streams
On a web of data streams
 
Kalp Corporate MongoDB Tutorials
Kalp Corporate MongoDB TutorialsKalp Corporate MongoDB Tutorials
Kalp Corporate MongoDB Tutorials
 

Semelhante a Dirk Goldhahn: Introduction to the German Wortschatz Project

The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
Sören Auer
 

Semelhante a Dirk Goldhahn: Introduction to the German Wortschatz Project (20)

Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
 
Translating software with SDL Passolo
Translating software with SDL PassoloTranslating software with SDL Passolo
Translating software with SDL Passolo
 
Introduction to SDL Passolo
Introduction to SDL PassoloIntroduction to SDL Passolo
Introduction to SDL Passolo
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
Lemon at-mlw3
Lemon at-mlw3Lemon at-mlw3
Lemon at-mlw3
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Introduction to HDF 3.0
Introduction to HDF 3.0Introduction to HDF 3.0
Introduction to HDF 3.0
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
Standardisation Challenges in Data Management Lifecycle
Standardisation Challenges in Data Management LifecycleStandardisation Challenges in Data Management Lifecycle
Standardisation Challenges in Data Management Lifecycle
 
The Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New TechnologiesThe Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New Technologies
 
Php packages
Php packagesPhp packages
Php packages
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...
 
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
 
Oc wg-nif-20130711
Oc wg-nif-20130711Oc wg-nif-20130711
Oc wg-nif-20130711
 
5 Programming Languages for Databases to Learn In 2023.pptx
5 Programming Languages for Databases to Learn In 2023.pptx5 Programming Languages for Databases to Learn In 2023.pptx
5 Programming Languages for Databases to Learn In 2023.pptx
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
LODStats (Presentation for KESW2013 System Demo)
LODStats (Presentation for KESW2013 System Demo)LODStats (Presentation for KESW2013 System Demo)
LODStats (Presentation for KESW2013 System Demo)
 
How community software supports language documentation and data analysis
How community software supports language documentation and data analysisHow community software supports language documentation and data analysis
How community software supports language documentation and data analysis
 

Mais de mbruemmer

Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
mbruemmer
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
mbruemmer
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
mbruemmer
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
mbruemmer
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
mbruemmer
 

Mais de mbruemmer (7)

Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
 
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
 
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content ManagementTatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
 

Último

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 

Último (20)

Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

Dirk Goldhahn: Introduction to the German Wortschatz Project

  • 1. Processing Extensive Text Data at the Leipzig Corpora Collection Dirk Goldhahn MLODE 2014, Leipzig Natural Language Processing Group Institute of Computer Science University of Leipzig
  • 2. Leipzig Corpora Collection 2 ● Project that offers ● Corpus-based monolingual full form dictionaries ● Started as project „Deutscher Wortschatz“ ● Accessible at http://corpora.uni-leipzig.de ● Web interface ● Web services ● Integration into CLARIN ● Export as rdf (Sebastian Hellmann)
  • 3. Available Resources (Online) 3 ● Corpora in >220 languages ● Text material ● Word frequencies ● Word co-occurrences ● Semantic maps (strongest word co-occurrences) ● Topics ● Subject areas (partially) ● Morphological information (partially) ● POS-tagged sentences (partially) ● Further semantic and grammatical information
  • 6. Number of Languages over Time 6
  • 8. Data Stock ●Number of languages: >300, >1000 including religious texts ●Number of sentences (overall): ~4 billion ●Biggest language: English (1.1 billion) 8
  • 9. Text Acquisition Methods 9 ●Generic web crawler (Heritrix) ● Whole TLDs ● News (abyzNewsLinks) ●RSS feeds (daily) → http://wortschatz.uni-leipzig.de/wort-des-tages/ ●Distributed text crawling (FindLinks) ●Bootstrapping web corpora ●Text dumps (Wikipedia, Watchtower) ●Parallel corpora (Bible)
  • 10. Corpus Creation 10 ●Processing steps: ● Stripping ● Language Separation (about 500 languages) ● Sentence Segmentation ● Cleaning ● Sentence Scrambling (copyright reasons) ● Tokenization ● Database Creation ● Statistical Evaluation
  • 11. Statistical Evaluation 11 ●Enrichment of corpora with statistical annotations ● Word frequencies ● Co-occurrence frequencies and significance (based on left or right neighbours or whole sentences) ● Topics, POS ●Creation of further corpora statistics ● >200 statistical properties based on letter, word, sentence, sources level etc. ● e.g.: For automatic quality assurance using typical distributions Hindi newspaper 2010 % 0 50 100 150 200 250 1 0,8 0,6 0,4 0,2 0 0 50 100 150 200 250 6 5 4 3 2 1 0 Sundanese Wikipedia 2007 %
  • 12. LCC as Linked Data 12 ●LD starts with data and metadata ● Data + metadata for hundreds of languages ● All data processed and stored identically, including metadata ● Still at the beginning of turning into LD
  • 13. LCC as Linked Data 13 ●Monolingual Metadata: ● Frequencies, word co-occurrences, NER, POS-tags, topics ● Integration of external sources (DBPedia, FreeBase, Wordnet) ● Still problems to solve (linking, disambiguation, ...) → New Web Interface ●Multilingual Metadata: ● Crosslingual word co-occurrences (parallel text) ● External sources (DBPedia...)
  • 14. 14 ●Data available via: ● Web interface (http://corpora.informatik.uni-leipzig.de) ● Download (http://corpora.informatik.uni-leipzig.de/download.html) ● Webservices (http://wortschatz.uni-leipzig.de/Webservices/) Thank you!