If Data Are The New Oil, How Do We Prevent Global Warming?
1. If Data Are The New Oil,
How Do We Prevent Global
Warming?
Philip E. Bourne, PhD, FACMI
The National Institutes of Health
http://www.slideshare.net/pebourne
philip.bourne@nih.gov
University of Cincinnati Data Day 2017
March 23, 2017
2. Who am I representing and what is my bias?
• I am presenting my views, not necessarily those of NIH
• Total data parasite
• Unnatural interest in scholarly communication
• Co-founder and founding EIC of PLOS Computational Biology – OA advocate
• Former co-Director of the Protein Data Bank
• Amateur student researcher in scholarly communication
3. I appreciate this is a day to focus on data, but ...
I don’t think you can consider data in isolation
from the analytics associated with that data and
indeed the knowledge derived from both.
4. The Knowledge versus Data Landscape
• Knowledge
• Largely a for-profit business with
limited input into that business from
the producers of scholarship
• Some open access (OA), costs shifted
from consumer to producer
• Full accessibility for non-OA is
constrained/controlled
• Funders are able to influence the
landscape, e.g., PubMed Central
• Sustainable!
• An analog system functioning in a
digital world – aka not born digital
• Data
• Largely left to governments to
support
• Mostly OA
• Funders control the landscape
• Not sustainable
• Mostly born digital
5. Some Shared Issues …
• Reproducibility
• Comprehension / communication
• Quality
6. Reproducibility Examples From My Own Work
It took several months to replicate this work
… And just last week…
Phew…
http://www.sdsc.edu/pb/kinases/
7. Tools Fix This Problem, Right?
• Extracted all PMC papers with associated Jupyter notebooks available
• Approx 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient details to run etc.
It takes more than tools ... It takes incentives ...
Daniel Mietchen 2017 Personal Communication
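One failure mode listed above, missing libraries, can be screened for before a notebook is ever executed. The following is a minimal sketch under stated assumptions, not the procedure used in the survey: it assumes each notebook is an `.ipynb` JSON document, and the helper names (`imported_modules`, `missing_modules`, `sample_notebooks`) are hypothetical.

```python
import ast
import importlib.util
import json
import random


def imported_modules(notebook_json):
    """Return the top-level module names imported by a notebook's code cells."""
    modules = set()
    for cell in json.loads(notebook_json).get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines or one string
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [ln for ln in source.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # unparsable cell: skip it rather than guess
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


def _is_installed(name):
    """True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def missing_modules(notebook_json):
    """Sorted list of imported modules that are not installed here."""
    return sorted(m for m in imported_modules(notebook_json) if not _is_installed(m))


def sample_notebooks(paths, k=25, seed=0):
    """Draw a reproducible random sample of up to k notebook paths."""
    rng = random.Random(seed)
    paths = list(paths)
    return rng.sample(paths, min(k, len(paths)))
```

A notebook with an empty `missing_modules` result can still fail for other reasons (absent data files, version drift, undocumented steps), so a check like this narrows the problem rather than solving it.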
8. One Hypothetical End Point
• Paper is one attributable view of the knowledge
• User clicks on a static image
• Metadata and data provide direct further analysis - an executable paper
• Private and public annotations revealed
• Selecting a feature forms a query for yet further knowledge
• That knowledge rendered as a knowledge graph rather than a paper
0. Full text of paper stored in a database – one view
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
PLoS Comp. Biol. 2005 1(3) e34
10. We Need Relationships Built on Trust
On November 6, 2012, Donald Trump tweeted: "The concept of
global warming was created by and for the Chinese in order to
make U.S. manufacturing non-competitive."
Source: Washington Post
11. Trust Becomes Even More Important as We
Move to Platforms
Sangeet Paul Choudary https://www.slideshare.net/sanguit
13. Tools and Resources Will Continue To Be Developed
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tool categories: Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
14. And Become More Interconnected
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tool categories: Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
15. Until We Become a Platform
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tool categories: Authoring Tools, Lab Notebooks, Data Capture, Software, Analysis Tools, Visualization, Scholarly Communication
Supporting elements: Commercial & Public Tools, Git-like Resources By Discipline, Data Journals, Discipline-Based Metadata Standards, Community Portals, Institutional Repositories, New Reward Systems, Commercial Repositories, Training
17. • Airbnb is a platform that supports a trusted relationship between
consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of services
between supplier and consumer and maximizing the amount of trust
associated with a given stakeholder
• It seems to be working:
• 60 million users searching 2 million listings in 192 countries
• Average of 500,000 stays per night.
• Valuation of US $25bn
Bonazzi & Bourne 2017, PLOS Biology, In Press
21. Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is simple
compared to what is required of a platform to support
biomedical research
Nevertheless there is much to be learnt
22. Platforms - The Situation Today
Supplier / Consumer
Paper Author / Paper Reader
Data Provider / Data Consumer
Employer / Employee
Reagent Provider / Reagent Consumer
Software Provider / Software Consumer
Grant Writer / Grant Reviewer
Educator / Student
Platforms: MS Project, Google Drive, Coursera, Researchgate, Academia.edu, Open Science Framework, Synapse, F1000, Rio
23. In summary there is not currently a widely
adopted single platform for the exchange of
services in biomedical research. Either there is a
platform per service or no platform at all. Why
have we not done better and what are the
impediments today?
24. Impediments to a biomedical platform
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
25. The NIH through the Big Data to Knowledge
(BD2K) and others are experimenting with a
platform, keeping in mind the need to overcome
these impediments
Enter The Commons
https://en.wikipedia.org/wiki/Ealing_Common#/media/File:Ealing_Common_-_geograph.org.uk_-_17075.jpg
26. Commons – Initial focus is on integrating two layers of the scholarly workflow
Supplier / Consumer
Paper Author / Paper Reader
Data Provider / Data Consumer
Employer / Employee
Reagent Provider / Reagent Consumer
Software Provider / Software Consumer
Grant Writer / Grant Reviewer
Educator / Student
Platforms: MS Project, Google Drive, Coursera, Researchgate, Academia.edu, Open Science Framework, Synapse, F1000, Rio
27. Commons Topology
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
PaaS
SaaS
IaaS
https://datascience.nih.gov/commons
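The topology lists digital object compliance as a layer but does not say what compliance requires. As a hedged illustration only, the sketch below assumes a compliant object carries a non-empty identifier plus a few metadata fields; the class `DigitalObject`, the function `compliance_problems`, and the field list `REQUIRED_METADATA` are all hypothetical and are not taken from the Commons specification.

```python
from dataclasses import dataclass, field

# Hypothetical minimum metadata; the slide does not specify required fields.
REQUIRED_METADATA = ("title", "creator", "license")


@dataclass
class DigitalObject:
    identifier: str  # assumed: some persistent identifier string
    kind: str        # assumed categories, e.g. "data", "software", "workflow"
    metadata: dict = field(default_factory=dict)


def compliance_problems(obj):
    """Return a list of reasons the object fails the assumed checks."""
    problems = []
    if not obj.identifier.strip():
        problems.append("missing identifier")
    for key in REQUIRED_METADATA:
        if not obj.metadata.get(key):
            problems.append(f"missing metadata: {key}")
    return problems


# Usage: an empty list means the object passes these illustrative checks.
example = DigitalObject(
    identifier="doi:10.0000/example",  # placeholder, not a real DOI
    kind="data",
    metadata={"title": "Example set", "creator": "A. Researcher", "license": "CC0"},
)
print(compliance_problems(example))
```

Returning a list of problems rather than a single boolean lets a platform tell a contributor what to fix, which matters if the aim is to encourage use rather than only gatekeep.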
28. “I really admire Airbnb as a pioneer of the sharing
economy and for building community. They've
found an elegant way to help hosts make more
money and for guests to have authentic
experiences. It brings those people together in a
unique way.”
Logan Green
29. “The Commons is one effort at creating a sharing
economy and for building community. We hope
for a more cost effective and productive research
environment while bringing people together in a
unique way.”
Phil Bourne
30. Acknowledgements
• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso,
Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Mike Huerta, George Komatsoulis
• NHGRI: Valentina di Francesco
• NIGMS: Susan Gregurick
• CIT: Debbie Sinmao, Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean
Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI),
Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke
Bonazzi & Bourne 2017, PLOS Biology, In Press
Editor's Notes
8
A detailed description of the Commons Framework can be found at: https://datascience.nih.gov/commons