A Big Picture in Research Data Management

10 Sep 2018

  1. A Big Picture in Research Data Management Carole Goble The University of Manchester Head of Node: ELIXIR-UK Coordinator: FAIRDOM Chair RDM User Group: University of Manchester carole.goble@manchester.ac.uk GFBio - de.NBI Summer School 2018 Riding the Data Life Cycle! Braunschweig Integrated Centre of Systems Biology (BRICS) 03 - 07 September 2018
  2. Open Science Open Data Reuse Science Reproducible Science Personally Productive Science
  3. Governments spend a lot of public money on research Much (all?) of it uses data or generates data or both.
  4. Vahan Simonyan, Center for Biologics Evaluation and Research Food and Drug Administration USA
  5. Stodden, Seiler, Ma. An empirical analysis of journal policy effectiveness for computational reproducibility, PNAS March 13, 2018. 115 (11) 2584-2589; https://doi.org/10.1073/pnas.1708290115 Since 2011
  6. sharing/publishing assets in public archives… Data Models *top three most popular The evolution of standards and data management practices in systems biology (2015). Stanford et al, Molecular Systems Biology, 11(12):851
  7. NIH Rigor and Reproducibility https://www.nih.gov/research-training/rigor-reproducibility Plenty of advice cos.io/top
  8. Plenty of Funder Data Policies http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
  9. Pontika et al, Fostering Open Science to Research using a Taxonomy and an eLearning Portal at iKnow: 15th International Conference on Knowledge Technologies and Data Driven Business, http://dx.doi.org/10.1145/2809563.2809571 Open Science Taxonomy
  10. https://wellcomeopenresearch.org/ Nature Scientific Data Data Publishing and Citation http://www.scholix.org/ https://datacite.org/ https://www.force11.org/datacitationprinciples https://www.nature.com/sdata/
  11. “The FAIR Guiding Principles for scientific data management and stewardship” Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18 Principles Metadata Identifiers Access policies Standards Technical: Political Social Economic: A rallying cry ….
  12. European Commission https://github.com/FAIR-Data-EG/action-plan http://bit.ly/interim_FAIR_report https://ec.europa.eu/info/events/2nd-eosc-summit-2018-jun-11_en Open Data by Default
  13. Research Data Management Retain (or dispose) Review (replicate & validate) Reproduce (verify, compare) By the researcher and their collaborators By their peers, the public and competitors (include, combine)
  14. Fifty Shades of FAIR Workflows SOPs Containers, cloud services, common services Packaging platforms (Research Objects) Markup languages, reporting guidelines and checklists, ontologies, catalogues Sounds hard….Catalogues Search markup
  15. …. RDM Lifecycles Collection Sharing Stewardship Integration Primary & secondary data, models, SOPs Metadata Experimental context Integration with in house data infrastructures FAIR Organise & link assets Standardised, consistent reporting Reproducible publications Yellow pages Exchange among colleagues How and when to share and publish Get and give credit Retain and find beyond project Span across legacy, in house, external systems, community archives Integrate with tools, analysis platforms, in house data infrastructures Curation support Capacity building Metadata practices Policies and governance Knowing what to throw away
  16. …. Curation Lifecycles + RDM Lifecycles https://www.nrel.colostate.edu/the-data-lifecycle-part-1-data-management-for-open-access-5-questions-to-ask-about-your-data/
  17. Do Research Research Infrastructure Services Assemble Methods, Materials Experiment Observe Simulate Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Marketplace Services Publish Share Results Any research product Selected products Manage Results Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015 Science 2.0 Repositories
  18. 101 Innovations in Scholarly Communication - the Changing Research Workflow, Bosman and Kramer, 2015, http://figshare.com/articles/101_Innovations_in_Scholarly_Communication_the_Changing_Research_Workflow/1286826 A RDM Ecosystem
  19. Team Science …….Of Individuals Collaborating and Competing Simultaneously Self-deposit, self-curating, variable stewardship skills The RDM Team… A RDM Egosystem
  20. FAIR RDM in the Team multi-partner, multi-disciplinary projects What methods are being used to determine enzyme activity? What SOP was used for this sample? Where is the validation data for this model? Is there any group generating kinetic data? Is this data available? Track versions of my model What's the relationship between the data and model? Which data belong to which publications?
  21. Organise Share For Projects Disseminate
  22. Open source RDM Platform supports standards Free Public Resource Fairdomhub.org Stewardship Support For Projects 50+ 100+ projects
  23. Project Managed Spaces: Organisation -> Sharing -> Dissemination Project Investigation Programme Self-controlled spaces managed spaces One entry point over external systems A Project Commons
  24. X = data, software, method, article I can access your X Your X is (re)usable by me and with my tools/data I get credit for using your X You can’t use my X Only access/use my X if I say so I don’t have resources and skills to make my X reusable and reproducible I must get credit if you use X Someone else will be paying for X stewardship and archiving. X will always be there & free for me. Maturing this view. FAIR RDM outside the Team
  25. “Getting it published, not getting it right” Matt Spitzer, COS, Jisc-CNI Leadership Conference 2018 Reuse Debt Annotate for strangers Organise Share Disseminate Data decreases Metadata increases Reach increases • Metadata quality and quantity • Identifier hygiene
  26. me ME my team close colleagues peers Access Spiral: Staged sharing organisation – collaboration - dissemination The number of assets reduces Reach of sharing increases The richness of metadata needed increases Burden of work increases
  27. Data Science Analytics Machine learning Discovery, New algorithms Data stewardship Standardisation, Harmonisation, Annotation and enrichment, Maintaining access, preserving Software stewardship Updates, versions, porting Prep & Processing Data wrangling & curation Instrument pipelines Simulation sweeps
  28. Personal Productivity reviewers want additional work statistician wants more runs analysis needs to be repeated post-doc leaves, student arrives new/revised datasets updated/new versions of algorithms/codes sample was contaminated better kit - longer simulations new partners, new projects Means educating PIs and Supervisors Personal Productivity Retention, reuse Publish driven Public Good Sharing & Reproducibility Access driven
  29. Favourite excuses … The results are embedded in a figure in the paper I don’t know where the data is You can have it but the metadata is so bad you will need me to interpret it You can have it but only if you put me on your paper Pseudo Sharing Data Flirting Data Hugging The Reward Norms of Science… more later You won’t credit me or cite my data but you’ll demand work from me and use it for your own research reputation… Don’t have the resources or skills You will ask me questions
  30. RDM Stakeholders data managers librarians, IT admin Global Enterprises Standards, International Research Infrastructures RDM
  31. Capitalising on investments Retaining results post-project Pooling, transfer, sharing results Public collections Skilling workforce Compliance audit/metrics Community productivity Reproducibility Productivity Doing science with collaborators Publishing & getting credit Access to resources, results, collections Retention of my results post student Repeatability - reviewer wants more Competitiveness, protecting assets Managing costs Compliance Stakeholder Accountability Values overlaps, mismatches? Stakeholder Agendas New publishable assets Business models Reproducibility
  32. Knowledge Exchange Report: http://www.knowledge-exchange.info/event/ke-approach-open-scholarship RDM Knowledge Exchange Public Good Private Good Institutional Facility Community Organisation’s Good National centres Publishers, Funders Policy makers, Government Public archives Shared Infrastructure Shared Data Centres Global National Researcher Personal Researchers Trainers Students PIs Lab books Group infrastructure Data managers Lab managers Libraries Institutional repositories
  33. republic of science* regulation of science *Merton’s four norms of scientific behaviour (1942)
  34. Publishing in Public Central Repository Repertoire Stanford et al. The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  35. The RDM Ecosystem • public collections & archives • data centres • journals • Institutional repositories • most researchers • labs & universities • my resources Stanford et al. The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
  36. Squaring the FAIR Circle
  37. https://research.northumbria.ac.uk/support/2015/11/19/report-from-jisc-research-data-management-shared-service-requirements-workshop/ Jisc Ideal RDM System Architecture: An Institutional Perspective
  38. Global & National RDM Global “Moonshot” Projects NIH Data Commons Standards Organisations International Organisations
  39. Global & National RDM
  40. Services & Activities Training Communities Policy Data, Tools, Compute, Interoperability Engage European International National Industry domains technologies techniques
  41. RDM select, support, and sustain public and national data resources support development of new ones CDRs DDs NDRs support and advocate for standards, their adoption and provide support services Identifiers.org run registries, discovery and analysis tools coordinate integration efforts BioTools support researchers for their data management: training, DMP, infrastructure, consultancy by nodes for nodes in their national settings Nodes
  42. 1k+ Databases 1k+ Standards 100+ Policies https://dsw.fairdata.solutions Data Stewardship Wizard Practice identifier hygiene A unique identifier for each record 800+ data collections 10 Rules for Identifiers 10 Rules for Selecting a BioOntology 200+ Ontologies https://www.ebi.ac.uk/ols https://doi.org/10.1371/journal.pbio.2001414 https://doi.org/10.1371/journal.pcbi.100743
  43. European Open Science Cloud
  44. A trusted virtual environment to store, share & re-use research information. Reduce reinvention. Avoid duplication Simplify access. Support interdisciplinary re-use. Serve Europe's 1.7 million researchers (of all disciplines) and 70 million science and technology professionals Open Science Move, share and re-use data seamlessly • across global markets and borders • among institutions and research disciplines • trusted free flow of data • data infrastructure to store and manage data • high-speed connectivity to transport data • High Performance Computers to process data Realising the EOSC doi:10.2777/940154
  45. eucli d Pan-European e-Infrastructures Research Infrastructures HPC Centres of Excellence National Regional e-Infrastructures Policy and Best Practice National Local Research Infrastructures Integration Projects Thematic e-Infrastructures [Per Oster]
  46. Data and tools from contributors National Nodes, Site monitoring Community oriented Integration [Based on Massimo Cocco, ENVRI] e-Infrastructures Cloud Research Infrastructures Commons
  47. A Research Commons? collectively created, owned and shared, with governance “… a cloud-based platform where investigators can store, share, access, and interact with digital objects (data, software, etc.) generated from …. research. By connecting the digital objects and making them accessible, the Data Commons is intended to allow novel scientific research that was not possible before, including hypothesis generation, discovery, and validation.” https://commonfund.nih.gov/commons Pooled Resources Federation Access NIH Data Commons
  48. • Overcoming fragmentation – Across scattered resources, platforms, people • Improving flow of information – Coordination, collaboration • Cumulative, dynamic [original figure: Josh Sommer] Cumulative A Commons Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013, isbn: 978-3-642-37186-8 http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation
  49. multi-object multi-repositories Experimental context All together Type specific archives Fragmented silos Models Presentations events Articles Workflows Samples metadata Data Standard Operating Procedures version, tracking provenance parameters citation
  50. De-contextualised Static, Fragmented Lost Semantic linking Contextualised Active, Unified Semantic linking Buried in a PDF figure Reading and Writing Scattered…. Fragmented Dissemination
  51. 3 Studies Model analysis, construction, validation 24 Assays/Analysis Simulations, characterisations 16 19 13 2 1 Structured organisation Retain context in one place Deposit in the fragmented resources [Penkler, Snoep]
  52. FAIRDOMHub : A Federated “Virtual” Data Commons based on aggregation http://fairdomhub.org External Databases In House Stores Secure Stores Modelling Resources Distributed Commons, Integrated View Analytical Resources In progress
  53. FAIRDOMHub Federated with e-Infrastructure https://nels.bioinfo.no https://bio.tools/nels https://f1000research.com/articles/7-968/v1
  54. Knowledge Exchange Report: http://www.knowledge-exchange.info/event/ke-approach-open-scholarship project based asset management and collaboration (inter)national archives and infrastructures Automated deposition & harvesting institutional repositories and infrastructures Federation Standardised hygienic identifiers Standardised metadata exchange Standardised protocol/APIs
  55. Data-Literature Interoperability evolving lightweight set of guidelines http://www.scholix.org/
  56. Standardised metadata mark-up Metadata published & harvested without APIs or special feeds Commodity Off the Shelf tools App eco-system schema.org tailored to the Biosciences for FAIR simple structured metadata markup on web pages & sitemaps MarRef Marine Metagenomics Database BioSamples Deposition Database Metadata Federation & SEARCH of course!
  57. The First and Last Mile “ramps” onto the Research Data Infrastructures FAIR data at source – data deposition, validation and upload pipelines into public repositories FAIR access from my tools Bench Benefit The ‘last mile’ challenge for European research e-infrastructures https://doi.org/10.3897/rio.2.e9933 EOSC Harvesting Templates Automation Tracking pipelines Notebooks Spreadsheet wrangling Data2Paper Data Tracking Sheets
  58. https://ncip.nci.nih.gov/blog/face-new-tragedy-commons-remedy-better-metadata/ “Creating good metadata takes considerable work …. when investigators act in their own self-interest, taking short cuts to generate metadata as quickly as possible, we should expect that the overall utility of the resource will decline. … a need for easy-to-use solutions that are generic to provide guidance over the entire life cycle of metadata — streamlining metadata creation, discovery, and access, as well as supporting metadata publication to third-party repositories” Mark Musen Stanford The First Mile: Metadata at Source Reduce complexity
  59. Specialist databases Local Biochem4j ICE Global Brenda, wikipathways, Biomodels ICE Public Deposition Databases Public Catalogues Tracking in Specialist Systems Institutional Catalogue & Repository Scientists workflow drives the RDM workflow, not the other way round…… “metadata transaction tools”
  60. Research Infrastructure Services Assemble Methods, Materials Experiment Observe Simulate Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Marketplace Services Share Results Manage Results Building a FAIR Research Commons Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano DOI: 10.1045/january2015-assante Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010) Born FAIR Elsewhere on-date Within during
  61. Research Infrastructure Services Assemble Methods, Materials Experiment Observe Simulate Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Marketplace Services Share Results Manage Results Releasing Portable Reproducible Objects Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano DOI: 10.1045/january2015-assante Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010) Supporting researchers to make & exchange FAIR content as they go… Credit for all products Value quality Data + the Methods
  62. Packaging: data + methods + models Scharm M, Wendland F, Peters M, Wolfien M, Theile T, Waltemath D SEMS, University of Rostock zip-like file with a manifest & metadata - Bundling files - Keeping provenance - Exchanging data - Shipping results Bergmann, F.T., Adams, R., Moodie, S., Cooper, J., Glont, M., Golebiewski, M., ... & Olivier, B. G. (2014). COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project. BMC bioinformatics, 15(1), 1. Combine Archive https://sems.unirostock.de/projects/combinearchive/
  63. The Cinderella of RDM: Standard Operating Procedures Record your processing steps
  64. Research Object Bundling Provenance Dependencies Versions Checklists Variance Portability Transparent Processes
  65. Precision medicine NGS pipelines Alterovitz, Dean, Goble, Crusoe, Soiland-Reyes et al Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org, 2017, https://doi.org/10.1101/191783 Assemble, share, and analyze large and complex multi-element datasets distributed across multiple locations, referenced because too big Secure large scale moving of patient data. Chard et al I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets, https://doi.org/10.1109/BigData.2016.7840618
  66. FAIR Exchange of Research Goods Governance Stewardship Credit Tracking Lifecycles Fixity… Arxiv, my Lab myExperiment GitHub, Web Service myWebSite bioModels.org, openModeller PubMed Spreadsheet in figshare ArrayExpress, BioSamples, PRIDE, GBIF, my Lab, institutional repository Overlaying the Research Commons Ecosystem
  67. Tracking, credit mining, comparison, auto-metadata, blockchain, boundary objects…. 1 3 2 A FAIR Knowledge Web of Research Objects Map across metadata Threaded publications Navigate, Pivot-Focus, Cite Self-describing
  68. http://www.researchobject.org/ro2018/
  69. Releasing Research: “within during” Analogous to software products & practices rather than articles An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”. Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”. Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”. http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article Demands different ideas of credit and citation
  70. Living Entry Published Snapshot Entry FAIRDOM Commons Releasing…. G. Penkler, F. DuToit, W. Adams, M. Rautenbach, D. C. Palm, D. D. Van Niekerk, & J. L. Snoep. (2014). Glucose metabolism in Plasmodium falciparum trophozoites. FAIRDOMHub. http://doi.org/10.15490/seek.1.investigation.56
  71. Research Infrastructure Services Assemble Methods, Materials Experiment Observe Simulate Analyse Results Quality Assessment Track and Credit Disseminate Deposit & Licence Marketplace Services Share Results Manage Results Releasing Portable Reproducible Objects Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano DOI: 10.1045/january2015-assante Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010) Supporting researchers to make & exchange FAIR content as they go… Credit for all products Value quality Data + the Methods
  72. FAIR Play: Walled Gardens Open science applies to you but not me … not available = not citable Jurgen Haanstra Vrije Universiteit, Amsterdam Using FAIRDOM my own lab colleagues saw what I was doing and called to collaborate! • Licenses • Negotiated access • Embargos • Permission controls • Staged sharing • Private spaces • enclave sharing • consortia pressures • within project mistrusts • patterns (models vs data) • hoarding & flirting • personal dowries • ex-member divorces • asymmetrical reciprocity • credit and citation • “on date” not “during” publishing
  73. FAIR Play: RDM Stewardship Value Systems • of assets, of reproducibility, of metadata • public vs personal good • economics of infrastructure • priorities • stewards and stewardship • credit & reward Sweatshops • competing • burden - time, skills • short term, shortcuts • untrained • leadership sets the tone The reward norms of science need to change Everyone knows this. No-one knows how to fix it. All research products and all scholarly labour are equally valued (except by institutional promotion boards, funding panels, and review committees)
  74. Data Journals Data Citation Data Policies: Open Data by Default Credit & Citation Infrastructure (altmetrics based) Data Stewardship Careers
  75. Credit – giving and taking CreDiT Stop conflating credit with authorship Getting people to cite data Data Citation Metadata Landing Pages Persistent Identifiers Data citation mining https://project-thor.eu/ https://casrai.org/credit/ https://www.nature.com/articles/sdata201539 Making Data Count Linking Data to Literature https://www.project-freya.eu/
  76. Data Stewardship Career Recognition 500,000 needed in Europe Stewards – skilling and rewarding
  77. Commons Production Incentives
  78. http://www.rightfield.org.uk Semantic Annotation by Stealth
  79. Stable & Sustained Infrastructure & Support FAIR ≠ FREE Countless expectations to do RDM Much less in how to sustain the archives, infrastructure and the skills needed “we want FAIR data but we will only support research” Complexity of funding federated commons with project-based national funds Funding models need an update!
  80. http://www.nature.com/news/empty-rhetoric-over-data-sharing-slows-science-1.22133
  81. Why FAIR isn’t FREE…..
  82. data managers librarians Global Enterprises Standards, International Research Infrastructures FAIR Research Commons
  83. A Bigger RDM Picture Fragmentation Federation Ecosystem Embed in working practice Born FAIR Ramps First & Last Mile Egosystem Stakeholders Research Objects Stewardship Professionalisation Cultural norms Interoperability FAIR is not FREE Releasing Credit, reward
  84. What can you do? Five steps to better data better research Get expert help and give stewards credit Train your Team incl. your PI Publish your Data and credit others Develop a DMP and resource it Annotate for strangers Create analysis-friendly data Record your processing steps Use a unique identifier for each record Use standards Save and backup raw data Submit to a repository. Get a DOI Try to use platforms and tools that work together
  85. Acknowledgements • David De Roure • Tim Clark • Sean Bechhofer • Robert Stevens • Christine Borgman • Victoria Stodden • Marco Roos • Jose Enrique Ruiz del Mazo • Oscar Corcho • Ian Cottam • Steve Pettifer • Magnus Rattray • Chris Evelo • Katy Wolstencroft • Robin Williams • Pinar Alper • C. Titus Brown • Greg Wilson • Kristian Garza • Matthew Dovey • Nick Juty • Helen Parkinson • Juliana Freire • Jill Mesirov • Simon Cockell • Paolo Missier • Paul Watson • Gerhard Klimeck • Matthias Obst • Jun Zhao • Daniel Garijo • Yolanda Gil • James Taylor • Alex Pico • Sean Eddy • Cameron Neylon • Barend Mons • Kristina Hettne • Stian Soiland-Reyes • Rebecca Lawrence • Michael Crusoe • Raphael Jimenez • Alasdair Gray
  86. Jon Olav Vik, Norwegian University of Life Science Maksim Zakhartsev University Hohenheim, Stuttgart, Germany Alexey Kolodkin Siberian Branch Russian Academy of Sciences Tomasz Zieliński, SynthSys Centre University Edinburgh, UK Martin Peters, Martin Scharm Systems Biology Bioinformatics University of Rostock, Germany Hadas Leonov
  87. EXTRA
  88. From: EOSC Stakeholder Forum, Brussels 28-29 November 2017 Soap-box session: Intermediaries, Research communities & Libraries, Valentino Cavalli
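
Slide 62 describes packaging as a "zip-like file with a manifest & metadata", slide 56 mentions simple schema.org markup, and slide 84 advises a unique identifier for each record. The following is a minimal sketch of how those three ideas might fit together, using only the Python standard library. It is not the COMBINE archive/OMEX format or the Research Object specification; the file names `manifest.json` and `metadata.jsonld`, the manifest layout, and the functions `build_bundle` and `verify_bundle` are all hypothetical choices made for illustration.

```python
import hashlib
import io
import json
import uuid
import zipfile
from typing import Dict


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 checksum of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def build_bundle(files: Dict[str, bytes], title: str) -> bytes:
    """Pack files, a checksum manifest, and JSON-LD metadata into one zip."""
    # Hypothetical manifest layout: one entry per file with its checksum.
    manifest = {
        "id": f"urn:uuid:{uuid.uuid4()}",  # unique identifier for this bundle
        "files": [
            {"path": path, "sha256": sha256_hex(data)}
            for path, data in sorted(files.items())
        ],
    }
    # Minimal schema.org Dataset description, serialised as JSON-LD.
    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": manifest["id"],
        "name": title,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, data in files.items():
            archive.writestr(path, data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("metadata.jsonld", json.dumps(metadata, indent=2))
    return buffer.getvalue()


def verify_bundle(bundle: bytes) -> bool:
    """Check that every file listed in the manifest matches its checksum."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        return all(
            sha256_hex(archive.read(entry["path"])) == entry["sha256"]
            for entry in manifest["files"]
        )


# Example data is made up for illustration only.
bundle = build_bundle({"data/raw.csv": b"sample,value\ns1,0.42\n"}, "Example assay")
print(verify_bundle(bundle))
```

Keeping the checksums inside the bundle lets a recipient confirm that the files arrived unchanged, and the JSON-LD file carries the same identifier so that the description and the data stay linked. A real deposit would more likely use an established packaging format and a persistent identifier such as a DOI, as the slides recommend.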