2. A simple text mining exercise
Average length of dissertations (doctorates and master's theses) by major
3,037 records from the University of Minnesota
https://beckmw.wordpress.com/2014/07/15/average-dissertation-and-thesis-length-take-two/
3. A simple text mining exercise
• R script to scrape PDFs from an institutional repository (sketched below)
• Extract text
• Parse text
• Plot the data
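A minimal sketch of what such a script might look like, assuming the pdftools and ggplot2 packages; the repository URLs and the majors below are illustrative placeholders, not the actual University of Minnesota data:

```r
# Sketch of the dissertation-length exercise; URLs and majors are
# illustrative placeholders, not the real repository records.
library(pdftools)
library(ggplot2)

# Hypothetical list of PDF links scraped from the institutional repository
pdf_urls <- c("https://example.edu/repo/thesis1.pdf",
              "https://example.edu/repo/thesis2.pdf")

# Download each PDF and count its pages (pdf_text returns one string per page)
pages <- sapply(pdf_urls, function(u) {
  tmp <- tempfile(fileext = ".pdf")
  download.file(u, tmp, mode = "wb", quiet = TRUE)
  length(pdf_text(tmp))
})

# Assume the repository metadata supplies each record's major
records <- data.frame(major = c("Biology", "History"), pages = pages)

# Plot the average length by major
ggplot(records, aes(x = major, y = pages)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Major", y = "Average length (pages)")
```

The original exercise ran over 3,037 records; the same pipeline scales simply by extending pdf_urls.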
4. The challenge of TDM
• 90% of a TDM project [1] is spent on:
– Collecting data
– Harmonising data
– Pre-processing data (see the sketch below)
• Magnitude of data
• Heterogeneity of data
[1] Jisc Open Mirror Report, Oct 2013
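To make the pre-processing step concrete, here is a minimal base-R sketch; the regular expression and the short-token filter are arbitrary illustrative choices, and real projects often reach for packages such as tm or quanteda instead:

```r
# Minimal text pre-processing: normalise, strip noise, tokenise.
preprocess <- function(txt) {
  txt <- tolower(txt)                      # harmonise case
  txt <- gsub("[^a-z\\s]", " ", txt)       # drop digits and punctuation
  tokens <- unlist(strsplit(txt, "\\s+"))  # tokenise on whitespace
  tokens[nchar(tokens) > 2]                # drop very short tokens
}

preprocess("Pre-processing, e.g., takes 90% of a TDM project's time!")
```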
5. The case of scientific literature
• Identifying levels of access
– Transactional information access
– Analytical information access
– Raw data (programmatic) access
• Google Scholar estimated at 100 million papers
6. 3 levels of access
How are the "big" guys doing?

Access type   | Google Scholar                                      | MS Academic Research
Transactional | Browser interface (portal)                          | Browser interface (portal)
Analytical    | Citation analysis, researcher profiles              | Visualisations, citation analysis, author connections
Raw data      | No API; scraping possible (a violation of the T&Cs) | Limited API; explicitly forbidden to download the full corpus
7. Other scholarly data sources

Name of service | Transactional | Analytical | Raw
CiteSeerX       |               |            |
PubMed          |               |            |
arXiv           |               |            |
Scopus          |               |            |
Web of Science  |               |            |
SpringerLink    |               |            |
Elsevier        |               |            |
8. The case of the Elsevier API
• Sufficient for some tasks but not for all; no access to the full corpus
• Restricts the usability of the API, controlling the access
• Potential loss of information (what you see in the portal differs from the API)
• (May) require special dispensation from authors
"It takes a lot of time and a lot of energy and doesn't scale at all"
Heather Piwowar
9. Is this enough for the TDM community?
• If you are a TDM-er, you need truly unrestricted programmatic access
• APIs
– Offer programmatic access to individual articles (see the sketch below)
– Results offered in XML/JSON
– Lack a standard schema; providers use proprietary formats
– (May be) sufficient for some TDM tasks (e.g. information extraction)
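As an illustration of this kind of per-article programmatic access, here is a sketch that fetches one article's metadata from the CrossRef REST API, which returns JSON; the jsonlite package is assumed, and the DOI is just a well-known example:

```r
# Fetch metadata for a single article from the CrossRef REST API.
library(jsonlite)

doi <- "10.1038/171737a0"   # Watson & Crick (1953), used as an example
rec <- fromJSON(paste0("https://api.crossref.org/works/", doi))

rec$message$title               # article title
rec$message$`container-title`   # journal name
# Note: each provider structures its responses differently;
# there is no standard schema across scholarly APIs.
```

This works well for individual records, but it offers no route to the complete corpus, which is exactly the limitation the next slide picks up.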
10. APIs are not enough
• Another family of TDM tasks requires access to the full corpus
– E.g. recommender systems, text summarisation
• Need access to the complete collection of articles
• Data dumps
11. Data dumps are not enough
• Even if you can access the whole corpus, you need special hardware resources
• arXiv.org data dump, compressed: ~300 GB
– How do you get it?
– Where do you store it? (*)
– How do you analyse it? (see the sketch below)
– How do you disseminate your findings?
– In what format?
– How can someone else verify your findings?
Research should be reproducible!
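One partial workaround for the analysis question is to stream the corpus one document at a time instead of loading it whole. A minimal sketch, assuming the dump has already been unpacked into a hypothetical corpus/ directory of plain-text files:

```r
# Count tokens across a large corpus without holding it all in memory.
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

total_tokens <- 0
for (f in files) {
  txt <- readLines(f, warn = FALSE)   # one document at a time
  total_tokens <- total_tokens + sum(lengths(strsplit(txt, "\\s+")))
}
cat("Corpus:", total_tokens, "tokens in", length(files), "documents\n")
```

Even so, this only eases the local analysis step; acquiring, storing, and re-sharing the data so that others can verify the findings remains the harder part.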
12. Non-technological barriers
• Legal uncertainty
• Copyright, database rights, licensing
• (Some) publishers require special dispensation
• Skills gap
• Researchers' lack of awareness of TDM's potential
13. Summary
• Collecting data is a tedious and time-consuming task, perhaps even impossible
• Scientific literature lacks a programmatic level of access
• APIs and data dumps (though nice) are not enough
• Other barriers remain
14. The OpenMinTeD vision
• Data and algorithms in one place
• Interoperable framework
• "Safe" environment (clear legal status)
• Trusted environment
Show an example of a simple, minimalistic (perhaps even useless) TDM task: figure out the length (a very simple metric) of academic dissertations and classify them by research area (major).
Transactional -> through a portal -> researchers, students
Analytical -> funders, government, business intelligence
Raw -> developers, digital libraries, companies
MS-AR: representations of the data but not the data itself
Either aggregators or original publishers; verify that the ticks in the table are correct.
Some people even suggest that the API loses information.
(*) Though it certainly fits on a laptop disk nowadays, it is not sustainable to expect everyone to get a copy of this data and run their own processing locally; there is a need for something more central and reproducible.
Legal uncertainty: e.g. scraping is a violation of most providers' terms and conditions, but under UK law, if you are allowed to view content through a browser then you are allowed to crawl it (provided that you are not compromising the provider's infrastructure).
Copyright: e.g. arXiv provides as a dump only those articles under the default arXiv license, with only a vague description of what you can do with the processed information.
Special dispensation: e.g. Elsevier.
Skills gap: it would be hard to expect a (traditional) historian to write their own R scripts to dig through historical texts and extract perhaps important historical information,
BUT there is HUGE potential if they decide to use TDM.
Researchers' lack of awareness of TDM's potential -> it is our responsibility as the TDM community to demonstrate this potential to the rest of the scientific community.
Skills gap (2): even for sciences close to TDM (computer scientists, mathematicians, statisticians), TDM has a high barrier that you need to overcome if you want to do something useful.
Each data source has its own way of providing access.
No legal guarantees (you may be doing something illegal).