B4OS-2012
1. Data management and curation:
the other side of bioinformatics
Susanna-Assunta Sansone, PhD
Principal Investigator and Team Leader,
University of Oxford e-Research Centre, Oxford, UK
http://uk.linkedin.com/in/sasansone
http://www.slideshare.net/SusannaSansone/B4OS-2012
Bioinformatics for Omics Sciences (B4OS),
CNR Naples, 25-27 Sep 2012
4. Oxford e-Research Centre
Providing research
computing, high-
performance
computing
Integrating with
national and
international
infrastructure
Supporting leading
edge facilities through
education and training
5. Oxford e-Research Centre
Collaborating with European and wider
international groups in, e.g.:
• energy,
• radio astronomy,
• biological data federation,
• life sciences simulation,
• biodiversity,
• computational chemistry,
• neuroscience,
• digital humanities tools,
• digital music analysis
Research in
• computation,
• data infrastructure and analysis,
• visualisation
6. My team’s activities and groups we work with
data management, biocuration, development of software,
databases and community-driven standards and ontology
env
agro
tox/pharma
health
8. Today:
“The buzz around reproducible bioscience data -
the policies, the communities and the standards”
Thursday:
“The reality from the buzz: how to deliver
reproducible bioscience data”
9. Preserve
institutional /
corporate
memory
Harmonize collection across sites
Find matching studies
Data dissemination
Long-term data stewardship
13. Address
reproducibility /
reuse
of public data
Ioannidis et al., Repeatability of published microarray
gene expression analyses. Nature Genetics 41(2),
149-55 (2009) doi:10.1038/ng.295
17. Growing, worldwide movement for reproducible research
Shared, annotated research data and methods offer new discovery
opportunities and prevent unnecessary repetition of work.
Improved data sharing underpins science of the future
“Publicly-funded research data are a public good,
produced in the public interest”
“Publicly-funded research data should be openly available
to the maximum extent possible”
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone
www.ebi.ac.uk/net-project
21. [Figure labels: reasoning, visualization, analysis, browsing, integration, exchange, retrieval; Community Standards; Software Tools; Well-annotated & Structured Data; Reproducible & Reusable Bioscience Research]
22. Today’s bioscience research
Publications
Experimental
and
computational
data
§ Is interdisciplinary and integrative in character
• need to deal with new and existing datasets
• deal with a variety of data types
§ ‘How the organism works’ is the focus
• Twenty years ago data was the center
Source of the figure: EBI website
23. Example from the toxicogenomics domain
Study looking at the effect of a
compound inducing liver damage
by characterizing/measuring
- the metabolic profile by MS and
NMR
- protein expression in liver by MS
- gene expression by DNA
microarray
- conducting genetic and
phenotypical analysis
Information contributing to the
construction and validation of
systems biology models
24. Example of experiments by
InnoMed PredTox, an FP6 public-private consortium
25. Structured description of datasets
§ Capture all salient features
of the experimental workflow
§ Make annotation explicit and
discoverable
§ Structure the descriptions
for consistency, tracking
• independent variables
• dependent variables
using cross-references and
resolvable identifiers
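As a rough illustration of the points above, here is a minimal Python sketch of a structured study description. The class names, field names and the example.org identifier are hypothetical and only show the idea of tracking independent and dependent variables with resolvable identifiers; they are not taken from any specific standard.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str       # human-readable label
    term_id: str    # ideally a resolvable identifier (URI); values below are made up
    unit: str = ""


@dataclass
class StudyDescription:
    identifier: str
    workflow_steps: list                      # salient steps of the experimental workflow
    independent: list = field(default_factory=list)
    dependent: list = field(default_factory=list)

    def unresolvable(self):
        """Names of variables whose identifier is not an http(s) URI."""
        return [
            v.name
            for v in self.independent + self.dependent
            if not v.term_id.startswith(("http://", "https://"))
        ]


# Hypothetical toxicogenomics-style study; all values are placeholders.
study = StudyDescription(
    identifier="study-001",
    workflow_steps=["dosing", "liver sampling", "MS metabolic profiling"],
    independent=[Variable("compound dose", "http://example.org/term/dose", "mg/kg")],
    dependent=[Variable("metabolite level", "metabolite_level")],
)
print(study.unresolvable())
```

A check like `unresolvable()` makes it explicit which annotations still need a cross-reference before the description can be considered discoverable.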
26. Not too much, not too little, just ‘right’
§ We must strike a balance
between
• depth and breadth of
information; and
• sufficient information
required to reuse the data
28. Information intensive experiments
To make the experiments
comprehensible and reusable,
underpinning future
investigations, we need
common ways to report and
share the experimental details
and the associated data.
Consistent reporting will have a
positive and long-lasting impact
on the value of collective
scientific outputs.
29. Common ways to report and share
§ The challenges we face
• Large in volume: lots of data types and metadata!
• Lots of free text descriptions: hard to mine, subject to mistakes!
• Babel of terminologies: lack of definitions, hard to map!
• Heterogeneous file formats: software lock-in!
§ Need for reporting standards
• Minimal reporting descriptors
- Report the same ‘core essentials’
• Controlled vocabularies or ontology
- Use the same word and mean the same thing
• Common exchange formats
- Make tools interoperable, allow data exchange and integration
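The three kinds of reporting standards listed above can be sketched together in a few lines of Python. This is only an assumption-laden toy: the required fields, the synonym table and the use of JSON as the exchange format are placeholders chosen for illustration, not an actual community checklist, ontology or format.

```python
import json

# Hypothetical 'core essentials' (minimal reporting descriptors).
REQUIRED_FIELDS = ("organism", "tissue", "assay_type")

# Hypothetical controlled vocabulary: free-text synonym -> preferred term.
ORGANISM_TERMS = {
    "rat": "Rattus norvegicus",
    "rattus norvegicus": "Rattus norvegicus",
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def check_record(record):
    """Return a normalised copy of the record and a list of problems found."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)
    ]
    normalised = dict(record)
    organism = record.get("organism")
    if organism:
        preferred = ORGANISM_TERMS.get(organism.strip().lower())
        if preferred is None:
            problems.append(f"unknown organism term: {organism}")
        else:
            normalised["organism"] = preferred
    return normalised, problems


clean, issues = check_record({"organism": "Rat", "tissue": "liver"})
print(issues)
# A common exchange format (JSON here, as a stand-in) lets other tools read the record.
print(json.dumps(clean, sort_keys=True))
```

The free-text value "Rat" is mapped to one preferred term, the missing descriptor is reported, and the cleaned record is serialised in a form other tools can parse.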
30. Reporting standards – the benefits
§ Describe and communicate the information to others, in an
unambiguous manner
§ To unlock the value in the data
• Compare, query and evaluate data
- Facilitate scientific validation of the findings
• Understand variability within/between different technologies and
protocols
- Facilitate technical validation
- Enable optimization of the experimental designs
- Identify critical checkpoints and develop quality metrics
§ To define submission and/or publication requirements
• Journals
• Databases
§ To ensure data integrity, reproducibility and (re)use
31. Escalating number of standardization efforts in bioscience,
e.g.:
• Genome annotation: www.geneontology.org
• Genomics Standards Consortium (GSC): gensc.org
• Functional Genomics Data Society (FGED): www.fged.org
• Enzymology data standards: www.strenda.org
• HUPO-Proteomics Standards Initiative (PSI): http://www.psidev.info
• Systems modelling standards: www.sbml.org
• Cheminformatics: www.ebi.ac.uk/chebi
• Pathways: www.biopax.org
• Metabolomics Standards Initiative (MSI): http://www.metabolomicssociety.org
32. Different community, different norms and standards, e.g.:
• allow data to flow from one system to another
• report the same core, essential information
• use the same word and refer to the same ‘thing’
34. Different community, different norms and standards, e.g.:
• allow data to flow from one system to another
• report the same core, essential information
• use the same word and refer to the same ‘thing’
Challenges:
lack of coordination, fragmentation and uneven coverage
35. Is this ‘general mobilization’ good or bad?
§ Difference in structures and processes:
• organization types (open, closed to members, society, WG…)
• standards development (how to design, develop, evaluate, maintain…)
• adoption, uptake, outreach (link to journals, funders, commercial sector…)
• funds (sponsors, memberships, grants, volunteering…)
36. Is this ‘general mobilization’ good or bad?
§ Fragmentation of the standards is a major issue
• Being focused on particular communities’ interests, be they individual
technologies or biological/biomedical disciplines, leads to duplication of effort
and, more seriously, the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and ultimately
the integration of datasets
39. But how much do we know about these standards?
MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, SRAxml, OBI, MIRIAM, VO,
SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN,
GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR,
CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
40. But how much do we know about these standards?
• Which tools and databases implement which standards?
• I use high-throughput sequencing technologies, which ones are applicable to me?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• I work on plants, are these just for biomedical applications?
41. But how much do we know about these standards?
§ A bewildering array of standards is available, but
• these are hard to find, at different levels of maturity; in
some areas duplications or gaps in coverage also exist
§ Standards are just a ‘means to an end’, therefore
• we want to make them discoverable and accessible,
maximizing their use to assist the virtuous data cycle,
from generation to standardization through publication to
subsequent sharing and reuse
42. A catalogue to map the
landscape of standards and the
systems implementing them:
Over 400 bio-standards
(public and in curation)
Field*, Sansone* et al., Omics data sharing. Science
326, 234-36 (2009) doi:10.1126/science.1180598
43. • A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
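To make the idea of a searchable catalogue with relationships among standards more concrete, here is a small Python sketch. The `Standard` class and the `search` and `related_to` helpers are hypothetical and do not reflect how the actual catalogue is implemented; the three entries are only illustrative examples (MIAME as a reporting guideline, MAGE-Tab as a format that can carry MIAME-compliant information, ENVO as a terminology).

```python
from dataclasses import dataclass, field


@dataclass
class Standard:
    name: str
    kind: str       # "reporting guideline", "terminology" or "format"
    domains: set
    related: set = field(default_factory=set)   # names of related standards


# Illustrative entries only; a real catalogue would be curated.
CATALOGUE = [
    Standard("MIAME", "reporting guideline", {"transcriptomics"}, {"MAGE-Tab"}),
    Standard("MAGE-Tab", "format", {"transcriptomics"}, {"MIAME"}),
    Standard("ENVO", "terminology", {"environment"}),
]


def search(catalogue, domain=None, kind=None):
    """Names of standards matching the optional domain and kind filters."""
    return sorted(
        s.name
        for s in catalogue
        if (domain is None or domain in s.domains)
        and (kind is None or s.kind == kind)
    )


def related_to(catalogue, name):
    """Names of standards recorded as related to the given one."""
    by_name = {s.name: s for s in catalogue}
    return sorted(by_name[name].related)


print(search(CATALOGUE, domain="transcriptomics"))
print(related_to(CATALOGUE, "MIAME"))
```

Recording relations explicitly is what lets a user go from "which guideline applies to me" to "which formats and tools implement it".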
48. Smith et al, 2007
49. Smith et al, 2007
Taylor, Field, Sansone et al, 2008
50. List of databases, linked to standards; a collaboration with Database Issue
53. Major challenge: define ‘relations’ among standards
The relationship among popular standard formats for pathway information:
BioPAX and PSI-MI are designed for data exchange to and from databases and
pathway and network data integration. SBML and CellML are designed to
support mathematical simulations of biological systems and SBGN represents
pathway diagrams.
CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.
55. This is not just a technical but also
a social engineering challenge!
56. Ownership of open standards
can be problematic in broad,
grass-root collaborations; it
requires improved models, to
encourage maintenance of and
contributions to these efforts,
supporting their evolution
57. The extensive ‘social
engineering’ and community
liaison needs to be managed
and funded; rewards and
incentives need to be identified
for all contributors
60. The cost of implementing a
standards-supported data
sharing vision is as large as the
number of stakeholders that
must operate synchronously
61. 1. Funders actively developing data policies
§ Several data preservation, management and sharing policies have
emerged in response to increased funding for omics domains
§ Even if in general terms, standards are recognized as necessary ‘tools’ to
unambiguously represent, describe and communicate research data
63. 2. Similar trend in the regulatory arena
§ “… lack of standardized data affects CDER’s review processes by curtailing a
reviewer’s ability to perform integral tasks such as rapid acquisition, storage,
analysis … efficient management of a portfolio of standards projects will
require coordinated efforts and clear roles for multiple participants within/outside
FDA”
65. 3. Publishers have become strong advocates
§ Continue to support the development of open standards and tools
• to support sharing of sufficiently well annotated datasets
• to enable comprehensible, reusable, reproducible research
66. … the rise of data-driven journals, e.g.:
partnering with:
69. 4. Similar trend in the commercial sector
§ R&D has invested heavily in procedures and tools that integrate external
information with their own data to enhance the decision-making process
• Now joining forces to streamline non-competitive elements of the life
science workflow by the specification of common standards, business
terms, relationships and processes
70. … their information landscape is evolving
[Figure: the information landscape Yesterday, Today and Tomorrow. Players shown: proprietary content provider, public content provider, big life science company, CRO, academic group, regulatory authorities, service provider, software vendor]
                 | Yesterday | Today | Tomorrow
Innovation model | Innovation inside | Searching for innovation | Heterogeneity of collaborations; part of the wider ecosystem
IT               | Internal apps & data | Struggling with change, security and trust | Cloud, services
Data             | Mostly inside | In and out | Distributed
Portfolio        | Internally driven and owned | Partially shared | Shared portfolio
Credit to: Pistoia Alliance
71. Take home messages
• Contribute to the reproducible research movement
• Think about data management as a career path
• Learn more about open community-standards
• Get involved, e.g.: Open Bioinformatics Foundation
72. Data is not like a $ bill….
http://www.flickr.com/photos/jackofspades/4500411648/ CC BY
73. Your research and all (publicly
funded) research should
make an … impact
http://www.flickr.com/photos/equinoxefr/2620239993/ CC BY
74. …..the biggest possible impact!
http://www.flickr.com/photos/webhamster/2582189977/ CC BY
75. Today:
“The buzz around reproducible bioscience data -
the policies, the communities and the standards”
Thursday:
“The reality from the buzz: how to deliver
reproducible bioscience data”
76. Is it possible to achieve a common, structured
representation of diverse bioscience experiments that:
• follows the appropriate community standards and
• delivers richly-annotated datasets?
78. Increasing level of structure
• Notes in Lab Books (information for humans)
• Spreadsheets and Tables (the compromise)
• Facts as RDF statements (information for machines)
TOWARDS INTEROPERABLE BIOSCIENCE DATA, Feb 2012, doi:10.1038/ng.1054
Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann
S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B,
Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S,
Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland
L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A,
Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B,
Wolstencroft K, Xenarios J, Hide W.
www.biosharing.org www.isacommons.org
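The step from spreadsheets and tables to facts as RDF statements can be sketched in Python as below. This is a simplified illustration using only the standard library: the example.org namespace, the column names and the `row_to_triples` helper are assumptions made for the sketch, every value is emitted as a plain string literal, and real conversions would map columns to community ontology terms instead.

```python
import csv
import io

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def row_to_triples(row, id_column):
    """Turn one table row into N-Triples-style statements (all objects as plain literals)."""
    subject = f"<{BASE}sample/{row[id_column]}>"
    triples = []
    for column, value in row.items():
        if column == id_column or value == "":
            continue  # the id becomes the subject; empty cells carry no fact
        predicate = f"<{BASE}property/{column}>"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples


# A tiny spreadsheet-like table (the 'compromise' level of structure).
table = io.StringIO("sample,organism,tissue\ns1,Rattus norvegicus,liver\n")
statements = []
for row in csv.DictReader(table):
    statements.extend(row_to_triples(row, id_column="sample"))

for line in statements:
    print(line)
```

Each cell becomes one subject-predicate-object statement, which is the form machines can merge and query across datasets.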
79. References
1. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K,
Ireland A, Mungall CJ; OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA,
Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO Foundry: coordinated evolution of
ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251-1255 (2007)
2. Taylor CF*, Field D*, Sansone SA*, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA,
Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J,
Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK Jr,
Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, et al.: Promoting coherent
minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.
Nat Biotechnol 26(8):889-896 (2008)
3. Field D*, Sansone SA*, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P,
Kolker E, Maxon M, Millard S, Mugabushaka AM, Perrin N, Remacle JE, Remington K, Rocca-
Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Megascience. 'Omics data sharing.
Science 326(5950):234-236 (2009)
4. Harland L, Larminie C, Sansone SA, Popa S, Marshall MS, Braxenthaler M, Cantor M,
Filsell W, Forster MJ, Huang E, Matern A, Musen M, Saric J, Slater T, Wilson J, Lynch N, Wise
J, Dix I: Empowering industrial research with shared biomedical vocabularies. Drug Discov
Today 16(21-22):940-947 (2011)
5. Sansone SA and Rocca-Serra P: On the evolving portfolio of community-standards and data
sharing policies: turning challenges into new opportunities. GigaScience 1:10 (2012)