Scott Edmunds' slides from class 7 of the HKU Data Curation course (module MLIM7350, Faculty of Education), covering open data policy and practice, and the Hong Kong context.
1. Class 7…giant balancing
'if I have seen further it is by
standing on the shoulders of
giants'.
Scott Edmunds, HKU Data Curation MLIM7350
2. Communicating in-class
• Chat channel:
• http://backchannelchat.com/chat/dw131
• Feel free to ask questions, requests to speed
up/slow down
Also feel free to email: scott@gigasciencejournal.com
3. About me:
• Scott Edmunds
• Molecular biology, sci editing & comms
• Scientific journal & (big) data publishing
• Reproducibility & open science
• Open Data Hong Kong & Citizen Science
Journal, data-platform and database for
large-scale biological data
www.gigasciencejournal.com
5. About my employer:
• Formerly Beijing Genomics Institute
• Founded in 1999 (1% of HGP)
• China’s 1st citizen-managed not-for-profit research
institute funded by commercial sequencing-as-a-service
(BGI Tech)
• Now largest genomic organization in the world
• HQ in Shenzhen, international data production in BGI HK
(Tai Po)
6. Open Data Hong Kong
ExCom member for Open Science
Open Science Working Group
15. Research Data ≈ Government Data
Canada's Action Plan on Open Government 2014-16
http://open.canada.ca/en/content/canadas-action-plan-open-government-2014-16
16. Research Data policies growing globally
http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=researchdata#1
18. Why Licensing is Important for:
http://dx.doi.org/10.1186/1756-0500-5-494
Placing restrictions on the reuse of scientific information,
particularly data, slows down the pace of research. Furthermore,
legal requirements for attribution ingrained in licenses such as CC-BY
can prohibit future research across large collections of content – as
commonly happens in data mining.
Therefore, to eliminate legal impediments to integration and re-use
of data, such as this stacking of attribution requirements in large
collections of data, and to help enable long-term interoperability an
appropriate license or waiver specific to data should be applied.
21. Levels of openness: 5★’s of open data
http://5stardata.info
★ - make your stuff available on the Web (whatever format)
under an open license
★★ - make it available as structured data (e.g., Excel instead of
image scan of a table)
★★★ - make it available in a non-proprietary open format (e.g.,
CSV instead of Excel)
★★★★ - use URIs to denote things, so that people can point at
your stuff
★★★★★ - link your data to other data to provide context
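The star levels above are cumulative: each star assumes the ones before it. A minimal sketch in Python of how a dataset could be scored against them, assuming the five criteria are answered as simple yes/no flags; the `Dataset` class and its field names are hypothetical and not part of the 5-star scheme itself.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Yes/no answers to the five criteria (field names are illustrative)."""
    on_web_with_open_license: bool  # 1 star
    structured: bool                # 2 stars, e.g. Excel rather than a scanned table
    non_proprietary_format: bool    # 3 stars, e.g. CSV rather than Excel
    uses_uris: bool                 # 4 stars
    linked_to_other_data: bool      # 5 stars


def star_rating(d: Dataset) -> int:
    """Count stars cumulatively, stopping at the first criterion not met."""
    criteria = [
        d.on_web_with_open_license,
        d.structured,
        d.non_proprietary_format,
        d.uses_uris,
        d.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# Hypothetical examples, not real datasets.
scanned_table = Dataset(True, False, False, False, False)
open_csv = Dataset(True, True, True, False, False)
print(star_rating(scanned_table), star_rating(open_csv))
```

Under this reading, a dataset with no open license scores zero stars however well structured it is, because the first criterion is not met.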
22. Levels of openness: 5★’s of open data
Exercise: What star rating is this data?
Example: Hong Kong: Dengue Mosquito Breeding Habitats
http://www.fehd.gov.hk/english/safefood/dengue_fever/images/montlyOvitrap_2003-2016.pdf
http://www.fehd.gov.hk/english/safefood/dengue_fever/
Static PDFs, images, not on data.gov.hk, no licensing information = ?
23. Levels of openness: 5★’s of open data
http://5stardata.info
Exercise: What star rating is this data?
1. HK FEHD: Distribution of the number of live pigs sold at different
auction prices on the day
https://data.gov.hk/en-data/dataset/hk-fehd-fehdsh-daily-auction
2. Singapore: Dengue Mosquito Breeding Habitats
https://data.gov.sg/dataset/dengue-mosquito-breeding-habitats
3. Linked Drug-Drug Interactions (LIDDI)
https://datahub.io/dataset/linked-drug-drug-interactions-liddi
24. Why closed data sucks?
https://commons.wikimedia.org/wiki/File:Inner_door_in_forbidden_city.jpg
25. Hong Kong Edition
https://data.gov.hk
Gov't spend on open data platform =
$1.2M
Gov't spend on 20 rubbish apps =
$20M
https://www.hongkongfp.com/2015/09/14/public-finance-concern-group-raps-10-rubbish-govt-apps-one-has-only-10-downloads/
Why closed data sucks?
26. What the Gov't builds for $20M vs. what open data can build for free
http://gazetteer.hk/
Hong Kong Edition
Why closed data sucks?
27. Open Data as a revenue stream...
Hong Kong Edition
Why closed data sucks?
28. Open Data as a revenue stream means can't share conservation data...
Why closed data kills spoonbills?
29. Climate change, global hunger, pollution, cancer,
disease outbreaks…
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966
Why closed data kills people?
30. Open Data as a revenue stream means can't share cancer data...
https://www.change.org/p/mark-c-capone-ceo-of-myriad-genetics-myriad-genetics-give-us-our-damn-brca-data
Why closed data kills women?
31. Open Data as a revenue (publishing) stream means nobody is sharing ethnic Chinese
control data to enable pharmacogenomics to work on Chinese populations...
Why closed data kills Chinese populations?
34. Consequences of 351 year old incentive systems…
Buckheit & Donoho: Scholarly articles are
merely advertisement of scholarship. The
actual scholarly artifacts, i.e. the data and
computational methods, which support
the scholarship, remain largely
inaccessible.
35. The consequences: growing replication gap
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results
from 10 could not be reproduced
37. Replication rates as low as 11%
http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
https://osf.io/e81xl/wiki/home/
38. Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
39. Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
At the current rate of increase, by 2045 as
many papers will be retracted as
published!
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
40. Problem: growing replication gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
More retractions:
>15X increase in last decade
At the current rate of increase, by 2045 as many papers will be retracted as published
Insufficient methods
41. The Cost of Scientific Retractions?
A: $400,000 per paper
https://elifesciences.org/content/3/e02956
43. What is the journal Impact Factor (jIF)?
• Citation Index concept first developed
by Eugene Garfield in 1955 (Science)
• Formed Institute of Scientific
Information (ISI) in 1960
• Science Citation Index (SCI) launched
in 1963.
• Web version (Web of Science)
launched in 1997.
• ISI purchased by Thomson (later Thomson Reuters) in
1992.
• Sold as part of their Intellectual Property & Science portfolio in July 2016
for $3.55B USD to private equity funds.
https://commons.wikimedia.org/wiki/File:Eugene_Garfield_HD2007_Richard_J._Bolte_Sr._Award.TIF
44. How do you calculate the jIF?
1. Count the total number of citations received in the IF
year to papers published in the two preceding years.
2. Count the total number of papers published in those
two years.
3. Divide the number of citations by the number of papers.
2015 IF = (# citations in 2015 to papers from 2013-2014) / (# papers published in 2013-2014)
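The three steps above amount to a single division. A minimal sketch in Python, using made-up counts for a hypothetical journal (the numbers and the function name `impact_factor` are illustrative assumptions, not real citation data):

```python
def impact_factor(citations_in_if_year: int, papers_in_prior_two_years: int) -> float:
    """Citations received in the IF year to papers from the two preceding
    years, divided by the number of papers published in those two years."""
    if papers_in_prior_two_years <= 0:
        raise ValueError("need at least one paper in the two-year window")
    return citations_in_if_year / papers_in_prior_two_years


# Hypothetical journal: counts invented for illustration only.
citations_2015_to_2013_2014 = 1200
papers_2013_2014 = 400

jif_2015 = impact_factor(citations_2015_to_2013_2014, papers_2013_2014)
print(f"2015 IF = {jif_2015}")

# Same citations, but fewer items counted in the denominator.
jif_smaller_denominator = impact_factor(citations_2015_to_2013_2014, 300)
print(f"2015 IF with 300 counted papers = {jif_smaller_denominator}")
```

The second call shows why the choice of what counts as a paper matters: with the citation count unchanged, counting fewer papers raises the result.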
45. TWO PROBLEMS
46. TWO PROBLEMS
1. Rewards/incentivizes short term citations only
48. JIFBAIT Network
more
GWAS
GWAS
JIFBAIT NEWS
Arsenic Life forms, will
they take over the planet?
By Melba Ketchum, PhD
Which Overhyped, Unreproducible
Experiment Are You?
Want rapid citations for 2 years only? Carry out this quiz.
You got: STAP Cells
Of course dipping cells in
coffee will make them
pluripotent. Even if the
research gets discredited, it’ll
still get 100’s of citations in
two years.
49. TWO PROBLEMS
2. How do you count denominator? Negotiated.
56. http://reproducibility.cs.arizona.edu/
Arizona Repeatability in
Computer Science Experiment
• 2015 study examining the extent to which Computer Systems
researchers share their research artifacts (code)
• NSF policies on sharing code since 2005
• Examined 613 papers from ACM conferences & journals
• Attempted to locate source code that backed up results
• If found, tried to build the code.
60. The Hong Kong context
http://web.archive.org/web/20131127073400/http://openaccess.hk/about.html
61. Asia’s Academic City?
8 Universities, many ranked top 50 worldwide
100K students (UG/PG/FT/PT)
1 major research funder (UGC/RGC)
Grant budget = $17.5 BN HKD/yr ($2.3BN USD)
UGC Policy: “Realization of
making Hong Kong Asia's
world city is only possible if it
is based upon the platform of
a very strong education and
higher education sector.”
http://www.ugc.edu.hk/eng/ugc/policy/policy.htm
64. Hong Kong’s focus…
“The plot earmarked for expansion of Hong Kong Science Park might now be used to
build apartment blocks instead. Is the government backing down on its commitment to
project Hong Kong as a major technology hub?” http://bit.ly/1TxCRj3
67. Science & Technology players in HK
Political forum: Legislative Council (LegCo)
Policy makers: Government Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government, EB, Private Sector; ITC -> ITF, Innov. & Tech. Venture Fund, RGC, UGC
Operators: Universities, Public Technology Support Organizations, Private Sector; R&D Centres, ASTRI
Facilitators: HKPC, HKTDC, HKSTPC, Cyberport, HKIB
Commercialization Agents: Business Enterprises, New High Tech Ventures, Multinational Corporations
Researched policy, collected case studies,
FOI, interviewed many key players (funders,
libraries, administrators…)
68. HK: good with some parts of open…
http://hub.hku.hk/
78. Q: How much is spent on Open/Closed Access in HK?
A: Nobody has any idea!
https://lists.okfn.org/pipermail/open-access/2014-May/001888.html
79. In China publication + JIF = money = fraud
Attempts to “game the peer-review system on an industrial
scale”
Companies offering authorship of papers made to order by “paper
mills” [1]. Ghostwriting of medical papers by pharma is common [2].
Guaranteed publication in JIF journal, often using fake referees, ID
theft, etc.
1. http://www.scientificamerican.com/article/for-sale-your-name-here-in-a-prestigious-science-journal/
2. http://www.grassley.senate.gov/sites/default/files/about/upload/Senator-Grassley-Report.pdf
81. 1. http://www.scmp.com/comment/insight-opinion/article/1758662/china-must-restructure-its-academic-incentives-curb-research
Created by skewed incentive systems in China…
“While we are rightly proud of Hong Kong’s highly regarded and ranked
universities system, we are not immune to the same pressures. While
funders in Europe have moved away from using citation based metrics such
as JIF in their research assessments, the Hong Kong University Grants
Committee states in their Research Assessment Exercise guidelines that they
may informally use it.”
83. How to fight back: Sign DORA.
http://www.ascb.org/dora/
84. Science & Technology players in HK
Who needs to provide leadership?
What new infrastructure do we need?
85. Science & Technology players in HK
Who needs to provide leadership?
RGC/UGC & new ITB
What new infrastructure do we need?
New “HK Data Service”, stewardship & platforms
Political forum: Legislative Council (LegCo)
Policy makers: Government Advisory Committee on Innovation and Technology; Innovation and Technology Bureau (ITB); Innovation and Technology Commission (ITC)
Financing: Government, EB, Private Sector; ITC -> ITF, Innov. & Tech. Venture Fund, RGC, UGC
Operators: Universities, Public Technology Support Organizations, Private Sector; R&D Centres, ASTRI
Data Curators & Stewards (Libraries, OGCIO, Data Studio@SP)
Facilitators: HKPC, HKTDC, HKSTPC, Cyberport, HKIB
Data Disseminators (HARNET, data.gov.hk, "HK Data Service")
Commercialization Agents: Business Enterprises, New High Tech Ventures, Multinational Corporations
Downstream Users (Researchers, Innovators, Citizens)
Academic/commercial cloud
86. If Government doesn’t act,
Universities need to lead way
http://hub.hku.hk/advanced-search?location=crisdataset
87. If Government doesn’t act,
Universities need to lead way
http://www.rss.hku.hk/integrity/research-data-records-management
88. First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
89. First CRIS in HK, built upon ScholarsHub
http://lib.hku.hk/researchdata/rpg.htm
“Beginning with the September 2017 intake, all HKU
research postgraduate (rpg) students have responsibility
for 1) using a data management plan (DMP), where
applicable, to describe the use of data in preparation for,
or in the generation of their theses, and 2) depositing,
where applicable, a dataset in the HKU Scholars Hub.”
92. First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
CC-BY NC by default
93. First CRIS in HK, built upon ScholarsHub
http://hub.hku.hk/advanced-search?location=crisdataset
Licensing T&Cs
94. HK CRIS: Further reading/resources
https://youtu.be/focv1z3lpPI
RPg Students -- Instructions for Data:
http://lib.hku.hk/researchdata/rpg.htm
Depositor's User Guide:
http://lib.hku.hk/researchdata/deposit_page.htm
Seminar slides from HKU Library
http://www.rss.hku.hk/integrity/rcr/rcr-info/seminars
See also ReShare
video guide:
95. The cost to Hong Kong of not doing this?
HK UGC grant budget = $17.5 billion HKD/yr (4% of Gov spending)
• Taking lowest reported reproducibility rates (11%) = >$15 billion wasted [1]
• Estimated loss of citation impact from not being OA = 50% ($8.75B?) [2]
• How much is the HK taxpayer losing through missing out on potential
collaborations, wider engagement & unrepeatable work?
1. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
2. http://www.ecs.soton.ac.uk/~harnad/Temp/research-australia.doc
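The back-of-envelope figures on this slide can be reproduced with two multiplications. A minimal sketch in Python, assuming (as the slide appears to) that the share of spending counted as wasted is one minus the lowest reported reproducibility rate, and that the open access figure is half the annual budget; the function and variable names are illustrative only.

```python
# Figures taken from the slide; amounts are in billions of HKD per year.
BUDGET_BN_HKD = 17.5        # UGC grant budget
LOWEST_REPRO_RATE = 0.11    # lowest reported reproducibility rate
OA_IMPACT_LOSS = 0.50       # estimated citation impact lost by not being OA


def wasted_if_unreproducible(budget: float, repro_rate: float) -> float:
    """Spending attributed to work that could not be reproduced."""
    return budget * (1 - repro_rate)


def oa_impact_loss(budget: float, loss_fraction: float) -> float:
    """Share of the budget matching the estimated lost citation impact."""
    return budget * loss_fraction


wasted = wasted_if_unreproducible(BUDGET_BN_HKD, LOWEST_REPRO_RATE)
lost_impact = oa_impact_loss(BUDGET_BN_HKD, OA_IMPACT_LOSS)

print(f"Unreproducible work: {wasted:.3f} bn HKD/yr")
print(f"Lost OA impact:      {lost_impact:.2f} bn HKD/yr")
```

Both results are worst-case illustrations: they apply a single reported rate to the whole budget, which is a simplification rather than a measurement.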
96. Reading/Reflection for next class
Open Science (Open Access & Open Data) survey of Hong Kong
https://osf.io/cgpzb/
Thoughts and ideas on why Hong Kong is lagging behind the US/EU?
Any ideas what we need to do to move forward?
Any feedback on the survey?
98. HKU Repeatability in HK
Research Experiment
• HKU policy on data sharing from 2015
• PLOS policy mandating sharing of supporting data from March 1,
2014
• HKU has published 267 PLOS ONE papers 2014-date
• Can we quantify reproducibility in a sample of these?
• Easy exercise in literature curation
• 2016 HKU PLOS publications = 49 papers
http://hub.hku.hk/simple-search?query=&location=publication&sort_by=bi_sort_2_sort&order=asc&rpp=25&filter_field_1=journal&filter_type_1=equals&filter_value_1=plos+one&filter_field_2=dateIssued&filter_type_2=equals&filter_value_2=[2014+TO+2017]&filter_field_3=dctype&filter_type_3=equals&filter_value_3=article&etal=0&filtername=dateIssued&filterquery=2016&filtertype=equals
99. HKU Repeatability in HK
Research Experiment
• Everyone is assigned five 2016 HKU PLOS papers
• Quickly scan each paper looking for supporting data
• If no data, ignore
• If it uses data, is it all associated with the paper?
• If external data, is it available from a URL or accession?
• If “data available on request”, are the authors contactable?
• Don’t spend more than 5 mins per article
• Add data into the Google Doc, and we’ll go through results &
feedback next class
Homework/Case study: literature curation exercise
100. HKU Repeatability in HK
Research Experiment
Example 1.
https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
101. HKU Repeatability in HK
Research Experiment
Example 1.
Is there data presented in the paper? – Yes
Is there external data, and if so what is the
link/accession? – No
Is all the data in the paper available? – No
Comments - Has questionnaire but no data, as it
says "minimal anonymized dataset will be made
available upon request"
Enter data here:
https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
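The questions in the exercise map onto one record per paper, which makes the class results easy to tally. A minimal sketch in Python, assuming each row of the shared spreadsheet holds the four answers shown in Example 1; the `CurationRecord` class, its field names, and the second and third records are hypothetical and are not taken from the actual spreadsheet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CurationRecord:
    """One paper's answers to the exercise questions (illustrative fields)."""
    paper: str
    has_data: bool                 # Is there data presented in the paper?
    external_link: Optional[str]   # Link/accession for external data, if any
    all_data_available: bool       # Is all the data in the paper available?
    comments: str = ""


def summarize(records: List[CurationRecord]) -> dict:
    """Tally availability, skipping papers without data ("If no data, ignore")."""
    with_data = [r for r in records if r.has_data]
    available = sum(1 for r in with_data if r.all_data_available)
    share = available / len(with_data) if with_data else None
    return {
        "papers_with_data": len(with_data),
        "fully_available": available,
        "share_available": share,
    }


records = [
    # Answers from Example 1 on the slide.
    CurationRecord("Example 1", True, None, False,
                   "Has questionnaire but no data; dataset available upon request"),
    # Hypothetical records, for illustration only.
    CurationRecord("Hypothetical paper A", True, "hypothetical-accession", True),
    CurationRecord("Hypothetical paper B", False, None, False),
]

summary = summarize(records)
print(summary)
```

Keeping the free-text comment alongside the yes/no answers preserves the reason a paper failed, which a simple count would lose.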
102. HKU Repeatability in HK
Research Experiment
Example 1.
Optional: If data is missing, do the authors respond if contacted?
Enter data here:
https://docs.google.com/spreadsheets/d/15BszEhUodygyu4eGckR2b5p153nyeYmB3Uh4U23HX-o/edit?usp=sharing
103. Final Project
• For the final project for this course, you can
choose from 3 assignment options.
• The assignment is due on the 15th May and it
is worth 40% of your grade.
• Time will be set aside for presenting a
provisional draft of this during the final class
on the 24th April.
104. Final Project: Option 1
Write an Annotated Bibliography about data curation practices in an
academic discipline of your choosing.
• Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of
“open data.”
• Summarize data practices in your chosen discipline or topic. (5-7 sentences)
• Find 7-10 sources that relate that discipline or topic to data creation, management,
and/or curation.
• Provide a citation for the source in APA style.
• Write a short annotation that summarizes the content of the source. You may
include quotes from the source sparingly, but the annotations should be mostly, if
not entirely, in your own words. (3-5 sentences)
• Explain the relevance of the source with relation to the data practices of your
chosen discipline or topic. (1-2 sentences)
• Find a few example public datasets to demonstrate the above points. Cite the data
in the relevant places in the Bibliography according to the Data Citation Principles.
• Refer to this guide for more information about annotated bibliographies:
http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation
should be in the “Descriptive” style.
105. Final Project: Option 2
Using a relevant dataset (this can either be from the literature
curation exercise, a BYO dataset, or one given to you), write a report
that includes a description of the dataset, a Data Management Plan,
and a guidelines document for the researcher(s).
• Description of the dataset that explains the form of the data and the academic discipline in which it
was created. This paragraph should provide context for the Data Management Plan. (3-5 sentences)
• 1-2 page Data Management Plan following the guidelines from HKU or a granting body such as NSF.
• 1 page guidelines document that could be presented to the researcher(s) that provides
guidelines for their data (extant and forthcoming):
– Preservation
– Appraisal
– Documentation
• For the DMP and the guidelines document, you can extrapolate from your dataset to
imagine additional details about the research practices that created the dataset and will create
more data in the future.
• Look for suitable data repositories that can host this data (institutional, general purpose, or
subject specific), and if there is one relevant then publish the data if you have permission, and
correctly cite the data in the relevant places in your report.
106. Final Project: Option 3
Prepare a 30 minute data curation workshop that you could teach to
researchers that would provide them the necessary details to
understand why data curation is relevant to them and best practices
they should follow.
• Slide deck that introduces data curation for a researcher audience. (No
more than 40 slides.)
• Presenter outline that describes the important points for each slide.
• Topics that might be addressed in your workshop: the value of data
management, writing a data management plan, data repository options.
You can assume your audience is researchers at HKU.
• Make sure all of the content is copyright free, and share the final material
openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient
metadata to make it discoverable.
107. Looking ahead…
• Next class on Monday 27th March we’ll go
from open to FAIR data
• We’ll also go through the reflection & curation
case studies
– Bring ideas & feedback, and we’ll look at the data
• Final project due 10th May
– Need to present preliminary version on 26th April
to get feedback before completion