SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Big (Language) Data – From research strategies
to proof-of-concept and implementation
projects in linguistics research
Gerhard Budin
Centre for Translation Studies
University of Vienna
LEARN Workshop Vienna
6th of April, 2016
Introduction & Overview
• What is Big (Language) Data
• The “ZTW Cloud” – a digital humanities research
infrastructure
• Research data management strategies
• Case study: FWF-SFB “DiÖ German in Austria”
• Conclusions
Big Data + linguistics =
Big Language Data
• Big Data has become a crucial concept for linguistics
research due to its statistical paradigm of the analysis
of language usage in computational corpus linguistics
and computational translation studies.
• -> “More data are good data” – for training tools,
testing algorithms for language data analysis, for
linguistic corpus analysis
• Thus Big Language Data is becoming a cross-
disciplinary (computer science – linguistics – digital
humanities) methodology that is increasingly applied
in relevant research projects
ZTW Cloud: a digital humanities
research infrastructure
• ZTW = Zentrum für Translationswissenschaft = Centre
for Translation Studies at the University of Vienna
• The ZTW Cloud (located in the HPSC Arsenal) is a
research infrastructure for language and multimedia
data consisting of
– A HADOOP server cluster with about 100 TB storage and
processing capacity mainly used for:
• Machine translation research on MOSES
• Text Data mining and corpus research (Cloudera, ProTerm, etc.)
• CAT & Terminology research (SDL Suite, Multiterm, etc.)
• Multimedia conference interpreting corpus research
– A conference interpreting facility (20 booths in 2 rooms)
and other digital interpreting tools
– 3 media labs with 80 student multimedia workplaces
Research data management Strategies
• The ZTW Cloud is embedded in (and in fact the core of)
the CLARIN Core Facility at the University of Vienna
• …which in turn is embedded in the Austrian CLARIAH
(CLARIN+DARIAH) digital humanities research
infrastructure network (Vienna, Graz, Innsbruck, et al.)
• Thus the CLARIN and DARIAH data management best
practices focusing on the needs of linguistics and other
humanities disciplines (metadata standards, data
curation principles, corpus annotation standards, etc.)
are strictly followed (for CLARIN based on TEI and ISO)
• In addition, the generic research data management
principles of LERU, LEARN, and other initiatives at EU
level and at national levels are also followed
Case study: FWF-SFB
“DiÖ German in Austria”
• -> A showcase for the genesis of a large research project
(in this case in linguistics research) – from research
questions to a project that is meant to be a proof-of-
concept and an implementation of a systematic research
data management strategy
• SFB = Spezialforschungsbereich funded by FWF (Fond
zur Förderung der Wissenschaftlichen Forschung)
• F 60 DiÖ: A consortium consisting of UniVie (coord. with
3 departments from 2 faculties), Uni Graz, Uni Salzburg,
Austrian Academy of Sciences, and other members)
• The project started on January 1st 2016 and lasts in its
first phase for 4 years with an option for prolongation
for another 4 years
Long Term Goals
Comprehensive, interdisciplinary and multidimensional
linguistic analyses on DiÖ and its varieties
Set up of a digital research infrastructure on DiÖ
(sustainability)
Variation and Varieties Language Contact
Perception and Attitudes
Research Program: Task Clusters
8
A: Coordination
B: Variation and Change
of German in Austria −
Perspectives of Variationist Linguistics
C: German and
other Languages in Austria −
Perspectives of Language Contact
D: ‘German in the Minds’ −
Language Attitudes and Perception
E: Collaborative Online Research Platform −
‘German in Austria’
Examples of research methods
9
Data Processing … … and Data Analysis:
(segmented) recorded data,
transcriptions and annotations (EXMERaLDA)

orthographic transcriptions plus:
– conversation analytic transcription  – morphological and syntactic analysis
– phonetic transcription  – acoustic-phonetic analysis with STx
(selected
vowels/diphthongs, liquids)
Data Documentation:
– research platform (PP11)
– first ‘two-dimensional’
talking dialect atlas for
whole of Austria
(cf. SAO 1998ff.)

– analysis of variation and change in
apparent and real time
– dialectometric analysis (e.g. with
GeoLing ) in stage II
(cf. Rumpf/Pickl/Elspaß/König/Schmidt 2009
...)
Key Methods: Locations and
Informants
10
Informants (10 per location)
160 informants at 16 villages
Groups
Autoch-
thony
Number
(gender)
Age
(years)
Education
type
Occupation
type
Regional
mobility
‘Old’
+ high
1f / 1m > 65
˗ high + manual
- high
‘Young-I’ 2f / 2m
20–30 + high
‘Young-II’ 2f / 2m + high ˗ manual
Key Methods: Data Survey
11
• reading tasks
• translation tasks
• ‘experiments’ / tests
• interviews (formal)
• translation tasks
• ‘experiments’ / tests
• informal conversations
Providing data of diverse ‘sections’ (varieties/styles) of
an informant’s individual variation repertoire
SOUND
D
ST
ST
D
Key Methods: Data Handling and
Analyses
12
Stage I
quantitative & qualitative analyses (phonology, morphology, syntax)
various transcriptions
orthographic transcriptions, phonetic transcriptions, grammatical
annotations & (partly) conversation-analytic transcriptions
( ‘stylistic practice’)
various analyses
type-token analyses, phonetic distance measurements, analysis of
co-occurrence restrictions, ‘local-sequence analyses’ a. o.
comprehensive statistical analyses
Key Methods: Approach
13
analyses of connections
mobility and languages
Linguistic Target Levels
Stage I
› phonetic/phonological
› morphological/syntactic
› pragmatic
Stage II
› suprasegmental (prosodic)
› lexical
› pragmatic
Integrative Approach
Correlative Global
Analyses
Conversational Local
Analyses
analyses of preferred
variants/groups of variants
analyses of discourse data
Key Methods: Locations
14
Research Locations (urban!)
Vienn
a
Graz
Vienna
population: 1,766,746 (Statistik Austria)
migration background: 40-50% (wien.at)
Graz
population: 309,323 (Statistik Austria)
migration background: 18% (graz.at)
Key Methods: Informants
15
Stage I
40 informants per city
40 informants per regional surroundings of a city
total sample size: 160 Informants
Stage II
additional sample of informants with migration background
Selection of Informants
central criteria: local autochthony (permanent primary residence)
sample balanced by age (two groups: 20-30 and >65), gender
(1:1), educational background, type of occupation
Key Methods: Informants
16
Additional urban specific criteria of selection
2 districts per city
migration background in
population high
20 pers. 10 pers. (born and raised native speaker)
10 pers. (native speaker inner Austrian migration)
migration background in
population low
20 pers. 10 pers. (born and raised native speaker)
10 pers. (native speaker inner Austrian migration)
4 agglomeration locations per city
2 with preserved
demographical structures
20 pers. 10 pers. (born and raised native speaker)
10 pers. (native speaker inner Austrian migration)
2 with changing
demographical structures
(high influx)
20 pers. 10 pers. (born and raised native speaker)
10 pers. (native speaker inner Austrian migration)
Key Methods: Data Survey
17
Instruments of Data Collection
• analytical interviews
• questionnaires
• tests
• free everyday conversations
• conversations among friends
• long-time recordings
(stage II - 12h-recordings)
2. analyses regarding
situative-pragmatic and
interactional variation
→ intra-speaker variation
1. analyses regarding
socio-pragmatic and
individual variation
→ inter-speaker variation
P
P
0
4
formal situation –
close to
intended standard
informal situation –
far from
intended standard
(dialect)
Key Methods: Data Survey /
Selection
18
a) edited data from former research projects
b) data on the situational embedding: material revealing the factual
situation of language use at that time
(e. g. official statistical returns, annual school programmes and school reports,
publications on the occasion of anniversaries and other jubilees etc.)
18
Key Methods: Data Survey / Data
Selection
19
a) edited data from former research projects
b) data on language contact: material from all available primary
and secondary sources concerning Slavic-German contact
phenomena, e. g. language corpora, dictionaries, relevant
literature and collections, popular descriptions and recurring
assumptions etc.
19
Key Methods: Data Handling
and Analyses
2020
evaluation and annotation of written sources
application of philological and etymological methods
application of methods from comparative linguistics
(feature-by-feature comparision & internal reconstruction)
Slavic Loans in DiÖ
Buchtel (B-/W-) ‘yeast pastry’ < Czech buchta
Liwanze ‘pancakes’ < Czech lívanec
Klobasse (-e/-i) ‘hard smoked sausage’ < Czech klobása
Kolatsche (K-/G-) ‘small yeast cake with filling’ < Czech koláč
Oblate
‘fine wafer’ < Czech oplatka; the DiÖ form is
stressed on the first syllable as in Czech
Palatschinke
‘jam-filled pancake’ < Czech palačinka
(< Hungarian palacsinta < Romanian plăcintă)
Powidl ‘plum jam’ < Czech povidla
Haluschka
‘chopped cabbage fried in butter & served over
boiled noodles’ < Slovak haluška
Czech Loans in DiÖ
heidipritsch ‘totally gone’ < Czech hajdy + pryč
Hubitschko ‘peck on the cheek’ < Czech hubička
Kaluppe
‘dilapidated, ramshackle hut’ < Czech chalupa also in
the German diminutive form Kalupperl
Leschak ‘lay-about’ < Czech ležák
nemam ‘have-not’ < Czech nemám
petschieren ‘seal’ < Czech zapečetit
powidalen
‘tell’ derived from the preterite form of Czech
povídat
Key Methods: Data Handling
and Analyses
2121
evaluation and annotation of written sources
application of philological and etymological methods
application of methods from comparative linguistics
(feature-by-feature comparison & internal reconstruction)
Böhmak ‘Czech male’
Feschak ‘dashing young man’
Tränak ‘camp follower’ (< French train and -ák)
Tschunkerl ‘mucky pup’ < čuně ‘piglet’ + Bavarian diminutive suffix –erl
Armutschkerl
‘poor wretch’ with two combined diminutive suffixes, i. e.
Czech -č(e)k- + Bavarian –erl
verdobrischen ‘squander, blow’ < dobrý ‘good’
Na servus Březina! in order to express unpleasant surprise
Er ist immer der Nowak in the sense of ‘he is always the victim, he is always abused’
auf Lepschi gehen ‘enjoy oneself’ equivalent to Czech jít na lepší
pomāli, pomāli! ‘not so fast!’ < Moravian Czech or Slovak pomaly ‘slow’
Czech stem with German word
formation suffix -erl
German word with Czech word
formation suffix -ák
Mixed suffix forms
German verbs derived from Czech
stems
Mixed idiomatic expressions
Key Methods: Data Survey and
Selection
22
Multivariate Data Survey and Selection:
quantitative & qualitative methods
focusing on attitudes & perception
Prestudies
II: In-depth Interviews
(round I)
(Task Cluster B)
I: Online Questionnaires
III: ‘Listener Judgement Tests’
& Interviews (round II)
TASK CLUSTER E
COLLABORATIVE ONLINE RESEARCH
PLATFORM
‘GERMAN IN AUSTRIA’
23
Goals
24
• efficiency: collaborative research work
• persistency: consistent & persistent data preservation
• accessibility: enhancing accessibility & usability of research data
• extent: linking and correlation of all data
Construction and use of a collaborative online
research platform (DiÖ Virtual Research Environment)
for the entire project in order to facilitate or enable
Goals
25
• administration: coordination of individual PPs (with PP01)
• dissemination: fast and easy publication of results
• security: single-sign-on, copyright and access management
• progress: further development of digital humanities methods
Construction and use of a collaborative online
research platform (DiÖ Virtual Research Environment)
for the entire project in order to facilitate or enable
Expected Results
26
A comprehensive and consistent annotation
framework covering
• phonology, morphology, syntax/grammar, lexis, discourse/text,
multimedia, etc.
• based on international best practices in relevant R&D (Linguistic
Annotation Framework etc.., see CLARIN strategic standing document on
interoperability and standards (TEI, ISO, EAGLES, etc.))
• adapted if needed to variationist linguistics research perspectives and to
specific research questions in any of the project parts of DiÖ
DiÖ – Virtual Research Environment
27
processing & annotation
PoS tagging, using
LAF/LMF/MAF/SynAF standards
due to research questions
collaborative research
Corpus-driven vs. -based
detailed linguistic research
CLARIN-(AT)-Repository
corpora
documentation
describing
digitising/curating
research data, results
documentation/meta-data
visualisation/aggregation
networking/linking/curation
DiÖ – VRE Functional Architecture
28
collaborative workingspace repository
data
resources
DiÖ-data
dissemination
material
publications
transcription
annotation
analysis
analysis
peer discussion
collaborative
corpus building
collaborative
writing
Key Methods
29
Based on methods from computational linguistics, language technologies and
digital humanities we will use data modelling, research process modelling and
user modelling in order to provide the appropriate environment for the following
use cases:
• data search and data access
• data collection and analysis, aggregation
• collaborative research work (e.g. annotation)
• publication and long-term preservation
• reusing research results, also for teaching
t27 Heuer ADV heuer
t28 war VAFIN sein
t29 ich PPER ich
t30 fischn ADJD <unknown>
t31 in APPR in
t32 Malnitz NE <unknown>
t33 . $. .
t34 Mhm NN <unknown>
t35 Eine ART ein
t36 Woche NN Woche
t37 , $, ,
t38 Weu ADJD <unknown>
t39 i ADJD <unknown>
t40 bin VAFIN sein
t41 auch ADV auch
t42 Fliegenfischer NN <unknown>
- <div>
- <u who="#SPK1">
<anchor synch="#T37"/>
Oh Gott, o gott. Heuerwar
ich fischn in Mallnitz.
<anchor synch="#T38"/>
</u>
</div>
- <div>
- <u who="#SPK2">
<anchor synch="#T37"/>
PP
<anchor synch="#T38"/>
</u>
</div>
- <div>
- <u who="#SPK0">
<anchor synch="#T39"/>
Mhm
<anchor synch="#T40"/>
</u>
</div>
Data Processing
30
repository
Digitalisation of
material
harmonisation in line with TEI
(Text Encoding Initiative)
standards
adaptation of the tag set and training
of the POS-tagger on ‘German in
Austria’, especially spoken variety
data conversion into XML
POS tagging, lemmatising
storing, archiving,
publishing, reusing
Meta-data & workflow
management
The DiÖ Data management plan –
basic principles
• All research data are open access by default, designed for re-use by all
researchers in follow-up research
• All research data have been generated in a specific stage of a research workflow
(in a project part) and additional meta-data include relevant details
• Research data sets are to be published as such, in order to raise their visibility
(and that of their creators) and citability, but also linked to publications that
refer to them, both with meta-data
• Data scientists in PP 11 assist researchers in PPs 02-10 in coherent and
sustainable (meta-)data management, use of research tools and annotation
standards, data validation, etc.
• All research data are stored on the virtual reserach environment for re-use
within the project consortium and by default by researchers outside the
consortium. Final project results (data and publications) are stored at PHAIDRA
or a related research data archival infrastructure
• Roles and responsibilities are clearly defined for the whole project and all its
parts (data scientist, data curator, data collector, principal investigator, etc.
Conclusions
• Research data management has been an integral part
of the DiÖ project –
– during its conception and planning phase
– Clearly specified in the project application text
– Being implemented right from the beginning of the project
itself as integral part of the (emerging) virtual research
environment for DiÖ
– Embedded in and as part of the national and European
research infrastructure CLARIN
– It is not a goal by itself but specifically serves to
• increase the quality and efficiency of the project by generating
research data according to systematic research data management
workflows with coherent data generation methods
• Raise the visibility of the project, its research data, their creators
as researchers, thus increasing the long-term impact of the project
Conclusions
• The systematic research data management approach
in DiÖ serves as
– its methodological proof-of-concept as a showcase for the
necessity of research data policies beyond the
organisational level
– as a test lab for improving the RDM methodology in digital
humanities and all other fields in particlar for large data
sets thus providing a showcase for „Big Language Data“
• The role of data scientist and data curator requires
specific training and qualifications of researchers (in
our case linguists or other humanities researchers),
research librarians and archivists, IT specialists, etc.
Thank you for your attention!
Questions? Discussion?
gerhard.budin@univie.ac.at
Acknowledgements:
• Paolo Budroni and team for constant support and encouragement
in research data management strategies
• DiÖ Team (Speaker: Alexandra Lenz and team as well as co-project
part leaders Stephan Elspaß, Arne Ziegler, Stephan Newerkla, etc.)
for their research visions and their interest in digital humanities
research data management

Mais conteúdo relacionado

Mais procurados

Mais procurados (8)

Partner Presenation: Armenian State Pedagogical University named after Kh. Ab...
Partner Presenation: Armenian State Pedagogical University named after Kh. Ab...Partner Presenation: Armenian State Pedagogical University named after Kh. Ab...
Partner Presenation: Armenian State Pedagogical University named after Kh. Ab...
 
UZH - Role in SUPERSEDE project
UZH - Role in SUPERSEDE projectUZH - Role in SUPERSEDE project
UZH - Role in SUPERSEDE project
 
The German Science Landscape
The German Science LandscapeThe German Science Landscape
The German Science Landscape
 
Fhnw - role in SUPERSEDE
Fhnw - role in SUPERSEDEFhnw - role in SUPERSEDE
Fhnw - role in SUPERSEDE
 
'Comparing Cross-Bordering City-Regional Strategies Beyond Nation States in O...
'Comparing Cross-Bordering City-Regional Strategies Beyond Nation States in O...'Comparing Cross-Bordering City-Regional Strategies Beyond Nation States in O...
'Comparing Cross-Bordering City-Regional Strategies Beyond Nation States in O...
 
Exploring comparative evaluation of semantic enrichment tools for cultural he...
Exploring comparative evaluation of semantic enrichment tools for cultural he...Exploring comparative evaluation of semantic enrichment tools for cultural he...
Exploring comparative evaluation of semantic enrichment tools for cultural he...
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
 
STATVIEW
STATVIEWSTATVIEW
STATVIEW
 

Semelhante a Big (Language) Data – From research strategies to proof-of-concept and implementation projects in linguistics research, by Gerhard Budin

Pioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
Pioneers of Information Science in Europe: The Oeuvre of Norbert HenrichsPioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
Pioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
Wolfgang Stock
 
Dariah Advisory Board June 2009 Peter
Dariah Advisory Board June 2009 PeterDariah Advisory Board June 2009 Peter
Dariah Advisory Board June 2009 Peter
pkdoorn
 

Semelhante a Big (Language) Data – From research strategies to proof-of-concept and implementation projects in linguistics research, by Gerhard Budin (20)

Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
DARIAH Athens May 2009
DARIAH  Athens  May 2009DARIAH  Athens  May 2009
DARIAH Athens May 2009
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 1...
 
Estermann Linked Data Ecosystem for Heritage Data - 29 Feb 2020
Estermann Linked Data Ecosystem for Heritage Data - 29 Feb 2020Estermann Linked Data Ecosystem for Heritage Data - 29 Feb 2020
Estermann Linked Data Ecosystem for Heritage Data - 29 Feb 2020
 
State of the Archives: questionnaire on the archives and heritage of UNPO
State of the Archives: questionnaire on the archives and heritage of UNPOState of the Archives: questionnaire on the archives and heritage of UNPO
State of the Archives: questionnaire on the archives and heritage of UNPO
 
Estermann Panel on Authority Files, 3 June 2020
Estermann Panel on Authority Files, 3 June 2020Estermann Panel on Authority Files, 3 June 2020
Estermann Panel on Authority Files, 3 June 2020
 
What's up, Europeana Newspapers?
What's up, Europeana Newspapers?What's up, Europeana Newspapers?
What's up, Europeana Newspapers?
 
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality
Evaluating Data Quality in Europeana: Metrics for MultilingualityEvaluating Data Quality in Europeana: Metrics for Multilinguality
Evaluating Data Quality in Europeana: Metrics for Multilinguality
 
The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020
 
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
Knowledge exchange consensus on monitoring OA: recommendations from the Copen...
 
Innovating European Studies - CVCE Status Report 2015
Innovating European Studies - CVCE Status Report 2015Innovating European Studies - CVCE Status Report 2015
Innovating European Studies - CVCE Status Report 2015
 
Digital research infrastructure on European integration - CVCE Status report ...
Digital research infrastructure on European integration - CVCE Status report ...Digital research infrastructure on European integration - CVCE Status report ...
Digital research infrastructure on European integration - CVCE Status report ...
 
Open statistics Belgium
Open statistics BelgiumOpen statistics Belgium
Open statistics Belgium
 
Pioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
Pioneers of Information Science in Europe: The Oeuvre of Norbert HenrichsPioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
Pioneers of Information Science in Europe: The Oeuvre of Norbert Henrichs
 
Developing a research data centre for Germany: IANUS and its IT-guidelines
Developing a research data centre for Germany: IANUS and its IT-guidelinesDeveloping a research data centre for Germany: IANUS and its IT-guidelines
Developing a research data centre for Germany: IANUS and its IT-guidelines
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Archiving In Digital Cartography And Geoinformation
Archiving In Digital Cartography And GeoinformationArchiving In Digital Cartography And Geoinformation
Archiving In Digital Cartography And Geoinformation
 
Alastair Dunning, Europeana Cloud: The Project and the Challenges of Assessin...
Alastair Dunning, Europeana Cloud: The Project and the Challenges of Assessin...Alastair Dunning, Europeana Cloud: The Project and the Challenges of Assessin...
Alastair Dunning, Europeana Cloud: The Project and the Challenges of Assessin...
 
Dariah Advisory Board June 2009 Peter
Dariah Advisory Board June 2009 PeterDariah Advisory Board June 2009 Peter
Dariah Advisory Board June 2009 Peter
 

Mais de LEARN Project

Mais de LEARN Project (20)

Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster
 
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM PolicyLEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
 
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM ToolkitLEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
 
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3mResearch Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
 
Data, Science, Society - Claudio Gutierrez, University of Chile
Data, Science, Society - Claudio Gutierrez, University of ChileData, Science, Society - Claudio Gutierrez, University of Chile
Data, Science, Society - Claudio Gutierrez, University of Chile
 
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | How To Engage Early Career ResearchersLEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
 
LEARN Final Conference: Tutorial Group | Costing RDM
LEARN Final Conference: Tutorial Group | Costing RDMLEARN Final Conference: Tutorial Group | Costing RDM
LEARN Final Conference: Tutorial Group | Costing RDM
 
Paolo Budroni at COAR Annual Meeting
Paolo Budroni at COAR Annual MeetingPaolo Budroni at COAR Annual Meeting
Paolo Budroni at COAR Annual Meeting
 
LEARN Webinar
LEARN WebinarLEARN Webinar
LEARN Webinar
 
Developing a Framework for Research Data Management Protocols
Developing a Framework for Research Data Management ProtocolsDeveloping a Framework for Research Data Management Protocols
Developing a Framework for Research Data Management Protocols
 
The Needs of Stakeholders in the RDM Process - the role of LEARN
The Needs of Stakeholders in the RDM Process - the role of LEARNThe Needs of Stakeholders in the RDM Process - the role of LEARN
The Needs of Stakeholders in the RDM Process - the role of LEARN
 
Opening Research Data in EU Universities: Policies, Motivators and Challenges
Opening Research Data in EU Universities: Policies, Motivators and ChallengesOpening Research Data in EU Universities: Policies, Motivators and Challenges
Opening Research Data in EU Universities: Policies, Motivators and Challenges
 
About Data From A Machine Learning Perspective
About Data From A Machine Learning PerspectiveAbout Data From A Machine Learning Perspective
About Data From A Machine Learning Perspective
 
LEARN Carribean Workshop Opening Remarks
LEARN Carribean Workshop Opening RemarksLEARN Carribean Workshop Opening Remarks
LEARN Carribean Workshop Opening Remarks
 
Managing Research Data in the Caribbean: Good practices and challenges
Managing Research Data in the Caribbean: Good practices and challengesManaging Research Data in the Caribbean: Good practices and challenges
Managing Research Data in the Caribbean: Good practices and challenges
 
LEARN Project: The Story So Far
LEARN Project: The Story So FarLEARN Project: The Story So Far
LEARN Project: The Story So Far
 
The Data Deluge: the Role of Research Organisations
The Data Deluge: the Role of Research OrganisationsThe Data Deluge: the Role of Research Organisations
The Data Deluge: the Role of Research Organisations
 
Data for Development in the Caribbean
Data for Development in the CaribbeanData for Development in the Caribbean
Data for Development in the Caribbean
 
Open Data in a Big World by Fernando Ariel López
Open Data in a Big World by Fernando Ariel López Open Data in a Big World by Fernando Ariel López
Open Data in a Big World by Fernando Ariel López
 
CENTRO DE DATOS
CENTRO DE DATOSCENTRO DE DATOS
CENTRO DE DATOS
 

Último

👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 

Último (20)

👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 

Big (Language) Data – From research strategies to proof-of-concept and implementation projects in linguistics research, by Gerhard Budin

  • 1. Big (Language) Data – From research strategies to proof-of-concept and implementation projects in linguistics research Gerhard Budin Centre for Translation Studies University of Vienna LEARN Workshop Vienna 6th of April, 2016
  • 2. Introduction & Overview • What is Big (Language) Data • The “ZTW Cloud” – a digital humanities research infrastructure • Research data management strategies • Case study: FWF-SFB “DiÖ German in Austria” • Conclusions
  • 3. Big Data + linguistics = Big Language Data • Big Data has become a crucial concept for linguistics research due to its statistical paradigm of the analysis of language usage in computational corpus linguistics and computational translation studies. • -> “More data are good data” – for training tools, testing algorithms for language data analysis, for linguistic corpus analysis • Thus Big Language Data is becoming a cross- disciplinary (computer science – linguistics – digital humanities) methodology that is increasingly applied in relevant research projects
  • 4. ZTW Cloud: a digital humanities research infrastructure • ZTW = Zentrum für Translationswissenschaft = Centre for Translation Studies at the University of Vienna • The ZTW Cloud (located in the HPSC Arsenal) is a research infrastructure for language and multimedia data consisting of – A HADOOP server cluster with about 100 TB storage and processing capacity mainly used for: • Machine translation research on MOSES • Text Data mining and corpus research (Cloudera, ProTerm, etc.) • CAT & Terminology research (SDL Suite, Multiterm, etc.) • Multimedia conference interpreting corpus research – A conference interpreting facility (20 booths in 2 rooms) and other digital interpreting tools – 3 media labs with 80 student multimedia workplaces
  • 5. Research data management Strategies • The ZTW Cloud is embedded in (and in fact the core of) the CLARIN Core Facility at the University of Vienna • …which in turn is embedded in the Austrian CLARIAH (CLARIN+DARIAH) digital humanities research infrastructure network (Vienna, Graz, Innsbruck, et al.) • Thus the CLARIN and DARIAH data management best practices focusing on the needs of linguistics and other humanities disciplines (metadata standards, data curation principles, corpus annotation standards, etc.) are strictly followed (for CLARIN based on TEI and ISO) • In addition, the generic research data management principles of LERU, LEARN, and other initiatives at EU level and at national levels are also followed
  • 6. Case study: FWF-SFB “DiÖ German in Austria” • -> A showcase for the genesis of a large research project (in this case in linguistics research) – from research questions to a project that is meant to be a proof-of- concept and an implementation of a systematic research data management strategy • SFB = Spezialforschungsbereich funded by FWF (Fond zur Förderung der Wissenschaftlichen Forschung) • F 60 DiÖ: A consortium consisting of UniVie (coord. with 3 departments from 2 faculties), Uni Graz, Uni Salzburg, Austrian Academy of Sciences, and other members) • The project started on January 1st 2016 and lasts in its first phase for 4 years with an option for prolongation for another 4 years
  • 7. Long Term Goals Comprehensive, interdisciplinary and multidimensional linguistic analyses on DiÖ and its varieties Set up of a digital research infrastructure on DiÖ (sustainability) Variation and Varieties Language Contact Perception and Attitudes
  • 8. Research Program: Task Clusters 8 A: Coordination B: Variation and Change of German in Austria − Perspectives of Variationist Linguistics C: German and other Languages in Austria − Perspectives of Language Contact D: ‘German in the Minds’ − Language Attitudes and Perception E: Collaborative Online Research Platform − ‘German in Austria’
  • 9. Examples of research methods 9 Data Processing … … and Data Analysis: (segmented) recorded data, transcriptions and annotations (EXMERaLDA)  orthographic transcriptions plus: – conversation analytic transcription  – morphological and syntactic analysis – phonetic transcription  – acoustic-phonetic analysis with STx (selected vowels/diphthongs, liquids) Data Documentation: – research platform (PP11) – first ‘two-dimensional’ talking dialect atlas for whole of Austria (cf. SAO 1998ff.)  – analysis of variation and change in apparent and real time – dialectometric analysis (e.g. with GeoLing ) in stage II (cf. Rumpf/Pickl/Elspaß/König/Schmidt 2009 ...)
  • 10. Key Methods: Locations and Informants 10 Informants (10 per location) 160 informants at 16 villages Groups Autoch- thony Number (gender) Age (years) Education type Occupation type Regional mobility ‘Old’ + high 1f / 1m > 65 ˗ high + manual - high ‘Young-I’ 2f / 2m 20–30 + high ‘Young-II’ 2f / 2m + high ˗ manual
  • 11. Key Methods: Data Survey 11 • reading tasks • translation tasks • ‘experiments’ / tests • interviews (formal) • translation tasks • ‘experiments’ / tests • informal conversations Providing data of diverse ‘sections’ (varieties/styles) of an informant’s individual variation repertoire SOUND D ST ST D
  • 12. Key Methods: Data Handling and Analyses 12 Stage I quantitative & qualitative analyses (phonology, morphology, syntax) various transcriptions orthographic transcriptions, phonetic transcriptions, grammatical annotations & (partly) conversation-analytic transcriptions ( ‘stylistic practice’) various analyses type-token analyses, phonetic distance measurements, analysis of co-occurrence restrictions, ‘local-sequence analyses’ a. o. comprehensive statistical analyses
  • 13. Key Methods: Approach 13 analyses of connections mobility and languages Linguistic Target Levels Stage I › phonetic/phonological › morphological/syntactic › pragmatic Stage II › suprasegmental (prosodic) › lexical › pragmatic Integrative Approach Correlative Global Analyses Conversational Local Analyses analyses of preferred variants/groups of variants analyses of discourse data
  • 14. Key Methods: Locations 14 Research Locations (urban!) Vienn a Graz Vienna population: 1,766,746 (Statistik Austria) migration background: 40-50% (wien.at) Graz population: 309,323 (Statistik Austria) migration background: 18% (graz.at)
  • 15. Key Methods: Informants 15 Stage I 40 informants per city 40 informants per regional surroundings of a city total sample size: 160 Informants Stage II additional sample of informants with migration background Selection of Informants central criteria: local autochthony (permanent primary residence) sample balanced by age (two groups: 20-30 and >65), gender (1:1), educational background, type of occupation
  • 16. Key Methods: Informants 16 Additional urban specific criteria of selection 2 districts per city migration background in population high 20 pers. 10 pers. (born and raised native speaker) 10 pers. (native speaker inner Austrian migration) migration background in population low 20 pers. 10 pers. (born and raised native speaker) 10 pers. (native speaker inner Austrian migration) 4 agglomeration locations per city 2 with preserved demographical structures 20 pers. 10 pers. (born and raised native speaker) 10 pers. (native speaker inner Austrian migration) 2 with changing demographical structures (high influx) 20 pers. 10 pers. (born and raised native speaker) 10 pers. (native speaker inner Austrian migration)
  • 17. Key Methods: Data Survey 17 Instruments of Data Collection • analytical interviews • questionnaires • tests • free everyday conversations • conversations among friends • long-time recordings (stage II - 12h-recordings) 2. analyses regarding situative-pragmatic and interactional variation → intra-speaker variation 1. analyses regarding socio-pragmatic and individual variation → inter-speaker variation P P 0 4 formal situation – close to intended standard informal situation – far from intended standard (dialect)
  • 18. Key Methods: Data Survey / Selection 18 a) edited data from former research projects b) data on the situational embedding: material revealing the factual situation of language use at that time (e. g. official statistical returns, annual school programmes and school reports, publications on the occasion of anniversaries and other jubilees etc.) 18
  • 19. Key Methods: Data Survey / Data Selection 19 a) edited data from former research projects b) data on language contact: material from all available primary and secondary sources concerning Slavic-German contact phenomena, e. g. language corpora, dictionaries, relevant literature and collections, popular descriptions and recurring assumptions etc. 19
  • 20. Key Methods: Data Handling and Analyses 2020 evaluation and annotation of written sources application of philological and etymological methods application of methods from comparative linguistics (feature-by-feature comparision & internal reconstruction) Slavic Loans in DiÖ Buchtel (B-/W-) ‘yeast pastry’ < Czech buchta Liwanze ‘pancakes’ < Czech lívanec Klobasse (-e/-i) ‘hard smoked sausage’ < Czech klobása Kolatsche (K-/G-) ‘small yeast cake with filling’ < Czech koláč Oblate ‘fine wafer’ < Czech oplatka; the DiÖ form is stressed on the first syllable as in Czech Palatschinke ‘jam-filled pancake’ < Czech palačinka (< Hungarian palacsinta < Romanian plăcintă) Powidl ‘plum jam’ < Czech povidla Haluschka ‘chopped cabbage fried in butter & served over boiled noodles’ < Slovak haluška Czech Loans in DiÖ heidipritsch ‘totally gone’ < Czech hajdy + pryč Hubitschko ‘peck on the cheek’ < Czech hubička Kaluppe ‘dilapidated, ramshackle hut’ < Czech chalupa also in the German diminutive form Kalupperl Leschak ‘lay-about’ < Czech ležák nemam ‘have-not’ < Czech nemám petschieren ‘seal’ < Czech zapečetit powidalen ‘tell’ derived from the preterite form of Czech povídat
  • 21. Key Methods: Data Handling and Analyses 2121 evaluation and annotation of written sources application of philological and etymological methods application of methods from comparative linguistics (feature-by-feature comparison & internal reconstruction) Böhmak ‘Czech male’ Feschak ‘dashing young man’ Tränak ‘camp follower’ (< French train and -ák) Tschunkerl ‘mucky pup’ < čuně ‘piglet’ + Bavarian diminutive suffix –erl Armutschkerl ‘poor wretch’ with two combined diminutive suffixes, i. e. Czech -č(e)k- + Bavarian –erl verdobrischen ‘squander, blow’ < dobrý ‘good’ Na servus Březina! in order to express unpleasant surprise Er ist immer der Nowak in the sense of ‘he is always the victim, he is always abused’ auf Lepschi gehen ‘enjoy oneself’ equivalent to Czech jít na lepší pomāli, pomāli! ‘not so fast!’ < Moravian Czech or Slovak pomaly ‘slow’ Czech stem with German word formation suffix -erl German word with Czech word formation suffix -ák Mixed suffix forms German verbs derived from Czech stems Mixed idiomatic expressions
  • 22. Key Methods: Data Survey and Selection 22 Multivariate Data Survey and Selection: quantitative & qualitative methods focusing on attitudes & perception Prestudies II: In-depth Interviews (round I) (Task Cluster B) I: Online Questionnaires III: ‘Listener Judgement Tests’ & Interviews (round II)
  • 23. TASK CLUSTER E COLLABORATIVE ONLINE RESEARCH PLATFORM ‘GERMAN IN AUSTRIA’ 23
  • 24. Goals 24 • efficiency: collaborative research work • persistency: consistent & persistent data preservation • accessibility: enhancing accessibility & usability of research data • extent: linking and correlation of all data Construction and use of a collaborative online research platform (DiÖ Virtual Research Environment) for the entire project in order to facilitate or enable
  • 25. Goals 25 • administration: coordination of individual PPs (with PP01) • dissemination: fast and easy publication of results • security: single-sign-on, copyright and access management • progress: further development of digital humanities methods Construction and use of a collaborative online research platform (DiÖ Virtual Research Environment) for the entire project in order to facilitate or enable
  • 26. Expected Results 26 A comprehensive and consistent annotation framework covering • phonology, morphology, syntax/grammar, lexis, discourse/text, multimedia, etc. • based on international best practices in relevant R&D (Linguistic Annotation Framework etc.., see CLARIN strategic standing document on interoperability and standards (TEI, ISO, EAGLES, etc.)) • adapted if needed to variationist linguistics research perspectives and to specific research questions in any of the project parts of DiÖ
  • 27. DiÖ – Virtual Research Environment 27 processing & annotation PoS tagging, using LAF/LMF/MAF/SynAF standards due to research questions collaborative research Corpus-driven vs. -based detailed linguistic research CLARIN-(AT)-Repository corpora documentation describing digitising/curating research data, results documentation/meta-data visualisation/aggregation networking/linking/curation
  • 28. DiÖ – VRE Functional Architecture 28 collaborative workingspace repository data resources DiÖ-data dissemination material publications transcription annotation analysis analysis peer discussion collaborative corpus building collaborative writing
  • 29. Key Methods 29 Based on methods from computational linguistics, language technologies and digital humanities we will use data modelling, research process modelling and user modelling in order to provide the appropriate environment for the following use cases: • data search and data access • data collection and analysis, aggregation • collaborative research work (e.g. annotation) • publication and long-term preservation • reusing research results, also for teaching
  • 30. t27 Heuer ADV heuer t28 war VAFIN sein t29 ich PPER ich t30 fischn ADJD <unknown> t31 in APPR in t32 Malnitz NE <unknown> t33 . $. . t34 Mhm NN <unknown> t35 Eine ART ein t36 Woche NN Woche t37 , $, , t38 Weu ADJD <unknown> t39 i ADJD <unknown> t40 bin VAFIN sein t41 auch ADV auch t42 Fliegenfischer NN <unknown> - <div> - <u who="#SPK1"> <anchor synch="#T37"/> Oh Gott, o gott. Heuerwar ich fischn in Mallnitz. <anchor synch="#T38"/> </u> </div> - <div> - <u who="#SPK2"> <anchor synch="#T37"/> PP <anchor synch="#T38"/> </u> </div> - <div> - <u who="#SPK0"> <anchor synch="#T39"/> Mhm <anchor synch="#T40"/> </u> </div> Data Processing 30 repository Digitalisation of material harmonisation in line with TEI (Text Encoding Initiative) standards adaptation of the tag set and training of the POS-tagger on ‘German in Austria’, especially spoken variety data conversion into XML POS tagging, lemmatising storing, archiving, publishing, reusing Meta-data & workflow management
  • 31. The DiÖ Data management plan – basic principles • All research data are open access by default, designed for re-use by all researchers in follow-up research • All research data have been generated in a specific stage of a research workflow (in a project part) and additional meta-data include relevant details • Research data sets are to be published as such, in order to raise their visibility (and that of their creators) and citability, but also linked to publications that refer to them, both with meta-data • Data scientists in PP 11 assist researchers in PPs 02-10 in coherent and sustainable (meta-)data management, use of research tools and annotation standards, data validation, etc. • All research data are stored on the virtual reserach environment for re-use within the project consortium and by default by researchers outside the consortium. Final project results (data and publications) are stored at PHAIDRA or a related research data archival infrastructure • Roles and responsibilities are clearly defined for the whole project and all its parts (data scientist, data curator, data collector, principal investigator, etc.
  • 32. Conclusions • Research data management has been an integral part of the DiÖ project – – during its conception and planning phase – Clearly specified in the project application text – Being implemented right from the beginning of the project itself as integral part of the (emerging) virtual research environment for DiÖ – Embedded in and as part of the national and European research infrastructure CLARIN – It is not a goal by itself but specifically serves to • increase the quality and efficiency of the project by generating research data according to systematic research data management workflows with coherent data generation methods • Raise the visibility of the project, its research data, their creators as researchers, thus increasing the long-term impact of the project
  • 33. Conclusions • The systematic research data management approach in DiÖ serves as – its methodological proof-of-concept as a showcase for the necessity of research data policies beyond the organisational level – as a test lab for improving the RDM methodology in digital humanities and all other fields in particlar for large data sets thus providing a showcase for „Big Language Data“ • The role of data scientist and data curator requires specific training and qualifications of researchers (in our case linguists or other humanities researchers), research librarians and archivists, IT specialists, etc.
  • 34. Thank you for your attention! Questions? Discussion? gerhard.budin@univie.ac.at Acknowledgements: • Paolo Budroni and team for constant support and encouragement in research data management strategies • DiÖ Team (Speaker: Alexandra Lenz and team as well as co-project part leaders Stephan Elspaß, Arne Ziegler, Stephan Newerkla, etc.) for their research visions and their interest in digital humanities research data management