2. A simple text mining exercise
Average length of dissertations (doctorates and master's theses) by major
3,037 records from the University of Minnesota
https://beckmw.wordpress.com/2014/07/15/average-dissertation-and-thesis-length-take-two/
3. A simple text mining exercise
• R script to scrape PDFs from an institutional repository (sketched below)
• Extract text
• Parse text
• Plot the data
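A minimal sketch of what such a script might look like, assuming the pdftools and ggplot2 packages; the repository URLs and the majors below are illustrative placeholders, not the actual University of Minnesota data:

```r
# Sketch of the dissertation-length exercise; URLs and majors are
# illustrative placeholders, not the real repository records.
library(pdftools)
library(ggplot2)

# Hypothetical list of PDF links scraped from the institutional repository
pdf_urls <- c("https://example.edu/repo/thesis1.pdf",
              "https://example.edu/repo/thesis2.pdf")

# Download each PDF and count its pages (pdf_text returns one string per page)
pages <- sapply(pdf_urls, function(u) {
  tmp <- tempfile(fileext = ".pdf")
  download.file(u, tmp, mode = "wb", quiet = TRUE)
  length(pdf_text(tmp))
})

# Assume the repository metadata supplies each record's major
records <- data.frame(major = c("Biology", "History"), pages = pages)

# Plot the average length by major
ggplot(records, aes(x = major, y = pages)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(x = "Major", y = "Average length (pages)")
```

The original exercise ran over 3,037 records; the same pipeline scales simply by extending pdf_urls.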
4. The challenge of TDM
• 90% of a TDM project [1] is spent on:
– Collecting data
– Harmonising data
– Pre-processing data (see the sketch below)
• Magnitude of data
• Heterogeneity of data
[1] Jisc Open Mirror Report, Oct 2013
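To make the pre-processing step concrete, here is a minimal base-R sketch; the regular expression and the short-token filter are arbitrary illustrative choices, and real projects often reach for packages such as tm or quanteda instead:

```r
# Minimal text pre-processing: normalise, strip noise, tokenise.
preprocess <- function(txt) {
  txt <- tolower(txt)                      # harmonise case
  txt <- gsub("[^a-z\\s]", " ", txt)       # drop digits and punctuation
  tokens <- unlist(strsplit(txt, "\\s+"))  # tokenise on whitespace
  tokens[nchar(tokens) > 2]                # drop very short tokens
}

preprocess("Pre-processing, e.g., takes 90% of a TDM project's time!")
```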
5. The case of scientific literature
• Identifying levels of access
– Transactional information access
– Analytical information access
– Raw data (programmatic) access
• Google Scholar estimated at 100 million papers
6. 3 levels of access
How are the "big" guys doing?

Access type   | Google Scholar                                      | MS Academic Research
Transactional | Browser interface (portal)                          | Browser interface (portal)
Analytical    | Citation analysis, researcher profiles              | Visualisations, citation analysis, author connections
Raw data      | No API; scraping possible (a violation of the T&Cs) | Limited API; explicitly forbidden to download the full corpus
7. Other scholarly data sources

Name of service | Transactional | Analytical | Raw
CiteSeerX       |               |            |
PubMed          |               |            |
arXiv           |               |            |
Scopus          |               |            |
Web of Science  |               |            |
SpringerLink    |               |            |
Elsevier        |               |            |
8. The case of the Elsevier API
• Sufficient for some tasks but not for all; no access to the full corpus
• Restricts the usability of the API, controlling the access
• Potential loss of information (what you see in the portal differs from the API)
• (May) require special dispensation from authors
"It takes a lot of time and a lot of energy and doesn't scale at all"
Heather Piwowar
9. Is this enough for the TDM community?
• If you are a TDM-er, you need truly unrestricted programmatic access
• APIs
– Offer programmatic access to individual articles (see the sketch below)
– Results offered in XML/JSON
– Lack a standard schema; providers use proprietary formats
– (May be) sufficient for some TDM tasks (e.g. information extraction)
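As an illustration of this kind of per-article programmatic access, here is a sketch that fetches one article's metadata from the CrossRef REST API, which returns JSON; the jsonlite package is assumed, and the DOI is just a well-known example:

```r
# Fetch metadata for a single article from the CrossRef REST API.
library(jsonlite)

doi <- "10.1038/171737a0"   # Watson & Crick (1953), used as an example
rec <- fromJSON(paste0("https://api.crossref.org/works/", doi))

rec$message$title               # article title
rec$message$`container-title`   # journal name
# Note: each provider structures its responses differently;
# there is no standard schema across scholarly APIs.
```

This works well for individual records, but it offers no route to the complete corpus, which is exactly the limitation the next slide picks up.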
10. APIs are not enough
• Another family of TDM tasks requires access to the full corpus
– E.g. recommender systems, text summarisation
• Need access to the complete collection of articles
• Data dumps
11. Data dumps are not enough
• Even if you can access the whole corpus, you need special hardware resources
• arXiv.org data dump, compressed: ~300 GB
– How do you get it?
– Where do you store it? (*)
– How do you analyse it? (see the sketch below)
– How do you disseminate your findings?
– In what format?
– How can someone else verify your findings?
Research should be reproducible!
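One partial workaround for the analysis question is to stream the corpus one document at a time instead of loading it whole. A minimal sketch, assuming the dump has already been unpacked into a hypothetical corpus/ directory of plain-text files:

```r
# Count tokens across a large corpus without holding it all in memory.
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

total_tokens <- 0
for (f in files) {
  txt <- readLines(f, warn = FALSE)   # one document at a time
  total_tokens <- total_tokens + sum(lengths(strsplit(txt, "\\s+")))
}
cat("Corpus:", total_tokens, "tokens in", length(files), "documents\n")
```

Even so, this only eases the local analysis step; acquiring, storing, and re-sharing the data so that others can verify the findings remains the harder part.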
12. Non-technological barriers
• Legal uncertainty
• Copyright, database rights, licensing
• (Some) publishers require special dispensation
• Skills gap
• Researchers' lack of awareness of TDM's potential
13. Summary
• Collecting data is a tedious and time-consuming task, perhaps even impossible
• Scientific literature lacks a programmatic level of access
• APIs and data dumps (though nice) are not enough
• Other barriers remain
14. The OpenMinTeD vision
• Data and algorithms in one place
• Interoperable framework
• "Safe" environment (clear legal status)
• Trusted environment
Show an example of a simple, minimalistic (perhaps even useless) TDM task: figure out the length (a very simple metric) of academic dissertations and classify them by research area (major).
Transactional -> through a portal -> researchers, students
Analytical -> funders, government, business intelligence
Raw -> developers, digital libraries, companies
MS-AR: representations of the data but not the data itself
Either aggregators or original publishers; verify that the ticks in the table are correct.
Some people even suggest that the API loses information.
(*) Though it certainly fits on a laptop disk nowadays, it is not sustainable to expect everyone to get a copy of this data and run their own processing locally; there is a need for something more central and reproducible.
Legal uncertainty: e.g. scraping is a violation of most providers' terms and conditions, but under UK law, if you are allowed to view content through a browser then you are allowed to crawl it (provided that you are not compromising the provider's infrastructure).
Copyright: e.g. arXiv provides as a dump only those articles under the default arXiv license, with only a vague description of what you can do with the processed information.
Special dispensation: e.g. Elsevier.
Skills gap: it would be hard to expect a (traditional) historian to write their own R scripts to dig through historical texts and extract perhaps important historical information,
BUT there is HUGE potential if they decide to use TDM.
Researchers' lack of awareness of TDM's potential -> it is our responsibility as the TDM community to demonstrate this potential to the rest of the scientific community.
Skills gap (2): even for sciences close to TDM (computer scientists, mathematicians, statisticians), TDM has a high barrier that you need to overcome if you want to do something useful.
Each data source has its own way of providing access.
No legal guarantees (you may be doing something illegal).