What Does Responsible Data Science Mean?
1. What Does Responsible Data
Science Mean?
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
08/09/19 Data Science for the Public Good
@pebourne
Thanks to Claudia Scholz for some slides
2. Context – Our new School of Data Science is intent on practicing
responsible data science as our hallmark
From our draft strategic plan –
The practice of data science
through education, research and
service whereby all aspects of these
endeavors consider the ethical,
legal and policy aspects of all we
do such that the reputation and
integrity of the SDS are never in
question.
3. Opportunity – In 40+ years in academia I have never seen
anything as transformative as what is happening today
Data Science Initiatives Nationwide
Effect
Cause
https://surgery.duke.edu/divisions/trauma-and-critical-care-surgery
The story of the trauma surgeon
5. What is happening now is across all verticals – but
there is a precedent we can learn from …
https://avora.com/blog/rise-of-the-data-warehouse/
https://individualizedmedicineblog.mayoclinic.org/2013/04/16/celebrating-10th-anniversary-of-human-genome-project/
https://science.sciencemag.org/content/291/5507/1304
6. What is happening now is across all verticals – but
there is a precedent we can learn from …
https://avora.com/blog/rise-of-the-data-warehouse/
DNA Sequence Data Since the Human Genome
http://synbio.info/display/synbio/Genetic+data+likely+to+become+the+biggest+big+data+in+2025
7. What can we learn from what has come before …
Lesson 1
Responsible data science means recognizing that
exponential growth of data leads to unexpected
consequences
8.
https://www.montana.edu/news/17886/public-forum-exploring-the-science-and-ethics-of-gene-editing-set-for-aug-7
http://theconversation.com/five-things-to-consider-before-ordering-an-online-dna-test-92504
https://www.cnbc.com/2019/05/02/ubiome-what-really-happened-at-health-start-up-raided-by-fbi.html
Accuracy
Do you want to know?
You can do it at home
What is ethical in the research lab is not
when commercialized
9. The 6 D’s provide one description of
the consequences …
10. Lesson 1
Exponential growth of data leads to unexpected
consequences
Responsible data science anticipates or at least
prepares to deal with such consequences ahead of
time
11. Lesson 2 – It’s all too easy to forget the negative
consequences when …
[Courtesy Eric Green, NHGRI]
12. Lesson 3 – Policies and laws lag…
http://www.navajo-nsn.gov/News%20Releases/OPVP/2019/may/FOR%20IMMEDIATE%20RELEASE%20-%20Navajo%20Nation%20signs%20data%20sharing%20agreement%20to%20advance%20uranium%20exposure%20research%20efforts.pdf
13. Lesson 4 – Data sharing is a double-edged sword …
14. On the plus side, data sharing can save lives …
Use case: Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1:100,000 individuals
• Peak incidence 6-8 years of age
• Median survival 9-12 months
• Surgery is not an option
• Chemotherapy ineffective and radiotherapy only transitory
[From Adam Resnick]
15. Timeline of genomic studies in DIPG
• 2012: Landmark studies identify histone mutations as recurrent driver mutations in DIPG
• The data were not shared for 3 years
• In 2015, in largely the same datasets, others identify ACVR1 mutations as a secondary, co-occurring mutation
• ACVR1 is targetable by a drug
• 3 years = 180 lives
[From Adam Resnick]
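The "3 years = 180 lives" figure can be unpacked with simple arithmetic. The sketch below is illustrative only: the slide does not say how the number was derived, so the rate-times-delay model and the function names are assumptions, not the original calculation.

```python
def implied_annual_cases(lives: float, years: float) -> float:
    """Annual figure implied by a total spread evenly over a delay (assumed model)."""
    if years <= 0:
        raise ValueError("years must be positive")
    return lives / years


def lives_over_delay(cases_per_year: float, years_delayed: float) -> float:
    """Total affected if a constant annual figure persists for the whole delay."""
    return cases_per_year * years_delayed


# Figures quoted on the slide: a 3-year sharing delay equated with 180 lives.
rate = implied_annual_cases(180, 3)
print(rate)  # 60.0 per year under the even-spread assumption
```

Under this reading, every additional year of withheld data corresponds to roughly 60 more patients who could not benefit from the later finding.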
16. NIH Strategic Plan for Data Science
• Support a Highly Efficient and Effective Biomedical Research Data Infrastructure
• Promote Modernization of the Data-Resources Ecosystem
• Support the Development and Dissemination of Advanced Data Management, Analytics, and Visualization Tools
• Enhance Workforce Development for Biomedical Data Science
• Enact Appropriate Policies to Promote Stewardship and Sustainability
https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf
17. Lesson 4 – Data sharing is a double-edged sword …
18. STATE HEALTH SURVEILLANCE: NEWBORN SCREENING CASE STUDY
From Bonnie R and Bernheim R, Public Health Law, Policy and
Ethics, Foundation Press (2015)
Category    Variables
Infant      Patient ID, birth date, birth time, ethnicity, weight in grams, feeding type, transfusion status, zip code of mother
Sample      Sample ID, collection date, received date, disposition code for sample (satisfactory/not satisfactory)
Submitter   Submitter ID, submitter name
Test        36 different tests
Diagnosis   Diagnosis, diagnosis date, sample ID
The final dataset contained more than 1.6 million sample
records and nearly 29,000 diagnosis records
19. Zip Code Level Sickle Cell Prevalence
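A zip-code-level prevalence map of this kind can be sketched from the sample and diagnosis records described in the case study. The code below is a minimal illustration, not the method used in the study: the record layout, the function name, and the small-cell suppression threshold `min_cases` are all assumptions, and the data are made up.

```python
from collections import Counter


def prevalence_by_zip(samples, diagnosed_sample_ids, min_cases=5):
    """Return {zip_code: prevalence}, or None where the case count is too small to publish.

    samples: iterable of (sample_id, zip_code) pairs, one per screened infant.
    diagnosed_sample_ids: sample IDs carrying the diagnosis of interest.
    min_cases: hypothetical suppression threshold (an assumption, not from the study).
    """
    samples = list(samples)
    diagnosed = set(diagnosed_sample_ids)
    screened = Counter(zip_code for _, zip_code in samples)
    cases = Counter(zip_code for sample_id, zip_code in samples if sample_id in diagnosed)

    result = {}
    for zip_code, n_screened in screened.items():
        n_cases = cases[zip_code]
        if 0 < n_cases < min_cases:
            # Few cases in a small area could point to identifiable families.
            result[zip_code] = None
        else:
            result[zip_code] = n_cases / n_screened
    return result


# Made-up records with placeholder zip codes.
toy_samples = [("s1", "00001"), ("s2", "00001"), ("s3", "00002"), ("s4", "00002")]
toy_result = prevalence_by_zip(toy_samples, ["s1", "s2", "s3"], min_cases=2)
```

The suppression step is one way to hold both edges of the sword: the aggregate remains useful for public health while very small counts, which are the ones most likely to re-identify individuals, are withheld.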
20. Given these lessons – there are many others – from
just one vertical, what should we be doing as a
School of Data Science to be responsible while
undertaking data science for the public good?
21. Guiding Principles …
Be open, transparent & collaborative in all we do
• Make ourselves known – use persistent identifiers, e.g., ORCID
• Use preprints to accelerate progress
• Only publish Open Access (OA)
• Recognize openness, transparency & collaboration in hiring and P&T
• Promote institutional openness – Open Data Lab, Wikimedian in Residence
• Support institutional open data governance
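Persistent identifiers are only useful if they are recorded correctly. As a small illustration, an ORCID iD ends in a check character computed with ISO 7064 MOD 11-2, so a mistyped iD can be caught before it enters a dataset. The function names below are illustrative, and the sample iD is the fictitious one used in ORCID's own documentation.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length, digits, and the final check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

This only confirms that an iD is well formed; it does not confirm that the iD is registered or belongs to a particular person.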
22. Guiding Principles …
Consider the ethical consequences across the complete data
workflow
23. Acquisition, Engineering, Analysis, Communication, Dissemination, Ethics
Data Acquisition:
Information → Data
● Census, surveys
● Data mining, digitization
● Sensors, Internet of Things (IoT)
Ethical Issues:
● Mass surveillance
● Privacy, terms of service
● Data sovereignty
Job titles:
● IoT engineer
● Chief privacy officer
● Survey designer
https://www.wired.com/story/all-of-us-launches/
28. Take home
• The fourth paradigm is upon us and will change society
• Forming a new school is an opportunity to do it right – we need help!
• Look to fields like genomics that have been doing data science for some time and consider best (and worst) practices
• Responsible data science involves working by a set of guiding principles and …
• Considering the consequences of what we do across the complete data lifecycle
Only then will we truly be undertaking
data science for the public good
29. Acknowledgements
The BD2K Team at NIH
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0