Biosb2017_Repositive
1. We are always looking for data
Finding & Accessing
Human Genomic
Data for research
BioSB 2017
Tweets welcome
#dataeureka
@repositiveio
2. Genomic data is important for research
Pre-clinical
drug discovery
Diagnostics and treatments
of genetic diseases
3. “Consensus among researchers, clinicians,
politicians & the public that
genomics will transform biomedical
research, healthcare and lifestyle choices”
Stephan Beck, UCL
OPPORTUNITY
4. Genome Technology Evolution
2001: 1 human genome
2005: Personal Genome Project
Human Genome Diversity Project
HapMap
2008: 1000 Genomes (1092 genomes, since increased to ~2500)
Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE)
2011: H3Africa
2012: International Cancer Genome Consortium
2016: 2M AstraZeneca - HLI
5. Large amounts of data, but not accessible
≈ 0.5 PB sequence available
80+ PB sequenced every year
Exponential growth rate
[Chart, 2006 to 2015: WGS data available in public repos]
Under-utilised data
has huge potential for
medical research
6. How many data sources?
How many sources of human
genomic data do you know about?
7. Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://tinyurl.com/plos-biology-repositive
[Chart: Data Sources Identified. X-axis: Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16, Jun-16, Sep-16, Dec-16, Mar-17. Data labels, in order: 10, 25, 33, 35, 102, 174, 239, 314, 506, 582]
9. [Bar charts: number of data sources per country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and per US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR)]
Data sources across the globe
Geolocation of 278
data sources analysed,
found by tracking the IP address
of each source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortiums
11. • Required by funders
• Cannot publish unless accession
number given
• Specialised for genomics
• ArrayExpress
• EGA
• dbGaP
• GEO…
• Generalist
• Dryad
• Figshare…
See http://discover.repositive.io for more
Public Repositories
12. The researchers’ pain points
FRAGMENTED
No holistic approach
to discover new data
HIDDEN
13. The researchers’ pain points
FRAGMENTED
No holistic approach
to discover new data
ADMIN
BURDEN
14. Open Access
• E.g. PGP, CC0
• Bermuda Accord
Managed (Restricted or Controlled Access)
• Data Access Committee
• No effective agreement (policy vacuum)
GOVERNANCE Models
15. Data accessibility
Can download the
data straight away
or after logging in.
Need to apply for
access to the data.
Has both Open and Restricted
access data within one repository.
16. Access to Restricted Data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full
responsibility for governance
Disadvantages:
• No control of data once access
is given
• High barrier for access – too
high?
17. Often a long process
Bottlenecks:
• Finding relevant and usable
data
• Getting authorisation to
access data
• Formatting data
• Storing and moving data
We studied the problem with
qualitative interviews followed
by a survey of researchers in
human genetics
T. A. van Schaik et al
The need to redefine genomic data sharing: a focus on
data accessibility, Applied & Translational Genomics, 2014
http://tinyurl.com/schaik-dnadigest
18. NIH / eRA Commons login
No
Yes
Organisation registered with eRA
Organisation has DUNS number
No
No
Write research proposal
Yes
+ 2-3 days
+ 1-2 weeks
+ 1 week
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 1-4 weeks
Science…
+ 1-2 days
Pro tip: if you use human
genomic data, apply for the
GRU (General Research Use) datasets in dbGaP:
one application gives access to all the
GRU datasets.
dbGaP application process
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
19. Sanger eDAM Account
No
Write research proposal
+ 1 hour
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 2-7 days
Science…
+ 1-2 days
EGA application process
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
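The delays shown on the dbGaP and EGA application flowcharts can be added up to give a rough end-to-end waiting time. The sketch below is only an illustration: it assumes every delay listed on each slide is incurred (no login, registration or DUNS number already in place), that 1 week = 7 days and 1 hour = 1/24 day. The names `DBGAP_DELAYS`, `EGA_DELAYS` and `total_range` are hypothetical and not part of either repository's tooling.

```python
# Delays copied from the two flowcharts as (min_days, max_days).
# Assumptions: 1 week = 7 days, 1 hour = 1/24 day, and every listed
# delay applies (worst-case path with no prerequisites already met).
HOUR = 1 / 24

DBGAP_DELAYS = [
    (2, 3),    # + 2-3 days
    (7, 14),   # + 1-2 weeks
    (7, 7),    # + 1 week
    (1, 2),    # + 1-2 days
    (7, 28),   # + 1-4 weeks
    (1, 2),    # + 1-2 days
]

EGA_DELAYS = [
    (HOUR, HOUR),  # + 1 hour
    (1, 2),        # + 1-2 days
    (2, 7),        # + 2-7 days
    (1, 2),        # + 1-2 days
]


def total_range(delays):
    """Sum per-step (min, max) delays into an overall (min, max) in days."""
    return (sum(lo for lo, _ in delays), sum(hi for _, hi in delays))


if __name__ == "__main__":
    for name, delays in (("dbGaP", DBGAP_DELAYS), ("EGA", EGA_DELAYS)):
        lo, hi = total_range(delays)
        print(f"{name}: {lo:.1f} to {hi:.1f} days")
```

Under these assumptions the dbGaP delays add up to 25 to 56 days and the EGA delays to roughly 4 to 11 days. Applicants who already have the accounts and registrations in place would skip some of the steps.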
22. We are enabling best practices
MAKE DATA
DISCOVERABLE
SIMPLIFY
WORKFLOWS
CONTRIBUTE TO
COMMUNITY
A platform to make human genomic data accessible for research
23. 1-click to human genomic data access
to make finding data as easy as finding a book
on Amazon or booking a hotel on Expedia!
Repositive
24. Simpler workflow
for data access
Our expertise is data search platforms
Discover and
access
Search, see
related results
Find colleagues &
their data interests
Co-annotate data &
community feedback
Genomics data is needed for research and drug discovery
It enables researchers to develop diagnostics and treatments for genetic diseases
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetics researcher (our target customer) seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about how to handle secure access to sensitive data and data governance, e.g. vetting of users
Population scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date, only around 0.5 PB of data is available in public repositories
Data is fragmented in unconnected silos, which makes it very difficult to discover data
There are many public repositories, but it can be hugely confusing to know where to look for the right kind of data
Data privacy is a concern, and controlled access is a requirement for many clinical datasets
Accessing data is a time-consuming and bureaucratic exercise
Liz was also a researcher struggling to get hold of the genomic data she needed for her research.
So… she quit her job at Illumina and decided to try to do something about that problem.
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data
Our vision is to make genomic data access as easy as finding a book on Amazon or booking a hotel on Expedia
KEY POINTS:
Repositive builds tools for genomics data search & access.
We’re really good at it. We have the expertise in-house. It’s what we do.
Aside from building a highly functional tool, we’ve taken the time to prioritise User Experience, streamlining of user workflows & presentation.
Within a month of our formal platform launch we have over 600 registered users.
The Repositive platform is an online community and marketplace connecting data consumers with data providers.
On Repositive, Jenn has:
Easy, interactive search
A faster data access workflow
Easy access to new data collaborators
Feedback on data from the community and colleagues, to help assess data quality and utility
The Repositive platform and technology will remove barriers to data sharing and will incentivise users to explore, contribute and collaborate in alignment with best practices
DNA.land
OpenSNP
Personal Genome Project
Direct to consumer genetic tests & microbiome