EGU 2018 Ian McHarg Lecture

Looking at the past of infrastructure development for research data in the context of infrastructure development patterns and experiences from the evolution of the IEDA data facility to inform future pathways and developments. A major focus of the lecture is on the FAIR principles and the issues surrounding reusability of data.

Published in: Data & Analytics

  1. 1. Data Infrastructure for the Earth & Space Science: How Far Have We Come, Where Are We Heading? Kerstin Lehnert, Lamont-Doherty Earth Observatory, Columbia University. April 10, 2018, Ian McHarg Lecture 2018
  2. 2. Before I start, a short detour ... The Kaiserstuhl, Germany
  3. 3. Making this lecture
  4. 4. My goal: “Study the past if you would define the future.” Confucius
  5. 5. Learning from the past: (1) The Big Picture. 2007, 2018. https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf
  6. 6. Learning from the past: (2) The Real World. The story of IEDA (Interdisciplinary Earth Data Alliance), www.iedadata.org ... there was a database named PetDB
  7. 7. A biased perspective: I am a geoscientist who directs a US data facility for primarily investigator-based data (“long tail”) funded by the National Science Foundation. www.iedadata.org
  8. 8. Defining the Topic: Data infrastructure is a digital infrastructure promoting data sharing and consumption. Its goal is to enable researchers to make the best use of the world’s growing wealth of data for the advancement of science and the benefit of society.
  9. 9. Data drive Earth science: A new way of understanding the world. Data: The 4th Paradigm; The 5th Dimension
  10. 10. We have been talking about it for a while ... 2006
  11. 11. EGU ESSI Abstract titles: 2008, 2013, 2018
  12. 12. Growth of Earth & Space Science Informatics. AGU Fall Meeting 2017: • 63 ESSI session proposals, an increase of 40% • 729 ESSI abstracts, an increase of ~18.7% • 35 ESSI oral sessions, an increase of ~40% • 4 Data Fair Town Halls • Machine Learning/Deep Learning: biggest increase in any theme • Big increases also in FAIR, Repositories & Data Storage, and Adoption & Adaption. Credit: Lesley Wyborn, AGU FM Program Committee Member
  14. 14. Learning from the past: The Big Picture. Insights into the development of infrastructures
  15. 15. Revolutionary! • Roman water supply system • Railroad systems • Global electrification • Internet
  16. 16. Patterns of Infrastructure Development. Edwards et al. 2007: 1. Deliberate and successful design of ‘local’ systems. 2. Technology transfer across domains and locations. 3. Infrastructure forms via gateways that allow dissimilar systems to be linked into networks. Wittenburg & Strawn 2018: 1. Inventions and development of start-up systems. 2. Technology transfer between regions and also society (creolization). 3. Planning for system growth where "reverse salients" need to be tackled. 4. Substantial momentum (mass, velocity, direction). Phases: System Building, Growth, Consolidation
  18. 18. Creolization • New components are continuously introduced to solve specific challenges • Capabilities grow unevenly (e.g. big vs small data) • Fragmentation leads to: • Inefficiencies in use and costs • Winners & losers: some solutions are more promising and attract more attention • Better understanding of the underlying rules, principles and limitations. (After Wittenburg & Strawn, 2018)
  19. 19. Attraction via “Universals” • “Simple” principles, broadly supported • Directly influence only a specific part of the overall infrastructure; enable efficiency at the top layers • Form a stable basis for new developments. “Universals are ... essential to create a momentum by overcoming fragmentation and achieving economies of scale.” (After Wittenburg & Strawn, 2018)
  20. 20. Attraction is happening! • Relevance of community organizations that define principles, procedures, and component specifications • RDA: global & cross-disciplinary • ESIP: Earth Science & US (others coming?) • New: RDA Interest Group “ESIP/RDA Earth, Space, and Environmental Sciences”
  21. 21. Universal: FAIR principles • Represent a guideline for data providers to enhance the reusability of their data holdings: • Data can be found on the Internet. • Data are accessible in a usable format with clear rights and licenses. • Data access is reliable & persistent. • Data are identified in a unique and persistent way so that they can be referred to and cited. • Data are documented with rich metadata.
  22. 22. Universal: Standards for data repositories • Cooperative effort between Data Seal of Approval (DSA) and the World Data System (WDS) under the umbrella of the Research Data Alliance (RDA) • Harmonized requirements & procedures for certification of repositories • Confidence for publishers and funders about which repositories to trust • Basis for development of new repositories
  23. 23. “Enabling FAIR Data” project @ AGU • Develop & implement standards that will connect researchers, publishers, and data repositories in the Earth and space sciences to enable FAIR data • Grant from the Laura and John Arnold Foundation (LJAF) to the AGU • FAIR-compliant data repositories (CoreTrustSeal certified, preferably domain-specific) • FAIR-compliant Earth and space science publishers • Align their policies for data to be deposited in certified repositories • Gives researchers a similar experience. Slide after S. Stall et al., presentation at RDA P11 Berlin, March 2018
  24. 24. All publishers who are part of the Coalition on Publishing Data in the Earth and Space Sciences (COPDESS) support the efforts of trusted repositories that aggregate research data, software, and physical samples for the use of the scientific community. “These Data Guidelines align the Author’s instructions for the submission of data sets in the Earth and Space Sciences, for all affiliated publishers.”
  25. 25. Universal: Persistent Identifiers. Founded 2009; Founded 2011; Founded 2012. “The intention of this cross-disciplinary report is to overcome still existing confusions about PIDs and the lack of detail knowledge in many disciplines. ... to identify agreements across documents that have been suggested to be included by experts.” From: “Common Patterns in Revolutionary Infrastructures and Data”, P. Wittenburg & G. Strawn, February 2018
  26. 26. Learning from the past: (2) The Real World. The story of IEDA (Interdisciplinary Earth Data Alliance) ... there was a database named PetDB
  27. 27. Once upon a time ... PetDB web site in 1999
  28. 28. Note: PetDB is a database that allows access to data at the level of individual data points, not files!
  29. 29. Success: New data-driven science in geochemistry. Meyzen et al. (2007): “Isotopic portrayal of the Earth's upper mantle flow field.” Putirka et al. (2007); Stracke & Hofmann (2005); Class & Goldstein (2007). 2018: 740 citations
  30. 30. An analysis in 2007. T. Plank, 1999: “Within about 5 minutes of logging on for the first time, I was staring at an EXCEL file that had all the REE on basalt glasses from the EPR from 10°N to 20°S. And the answer to my La/Sm question. I am very impressed, we are looking at the future of geochemistry.” GSA 2007 talk: “My Data, Your Data, Our Data!”
  31. 31. Attraction - but partners disappeared
  32. 32. Another failed network attempt • PaleoStrat not funded • Development of interoperability with CoreWall not funded • Too many political obstacles. “Promises, Achievements, and Challenges of Networking Global Geoinformatics Resources”, EGU General Assembly 2008
  33. 33. Growth of data systems at Lamont
  34. 34. Consolidation: “This Cooperative Agreement converts a series of proposal/award-driven activities into a community-based facility that serves to support, sustain, and advance the geosciences by providing a centralized location for the registry of and access to data essential for research in the solid-earth and polar sciences.” • Continue operating & maintaining existing systems • Develop tools for investigators to comply with NSF data policies (IEDA Data Management Plan Tool & Data Compliance Reporting Tool) • Develop tools and modify architecture to provide integrated access to holdings
  35. 35. IEDA’s layered architecture. The EUDAT model: Shared, Partners, Shared
  36. 36. IEDA Today: Data Holdings & Growth • > 70 terabytes of marine geophysical sensor data in the MGDS • > 20 million analytical measurements for > 1 million samples in EarthChem • > 4.2 million samples registered and searchable in SESAR (System for Earth Sample Registration)
  37. 37. IEDA Today • Thousands of download requests per month • > 2,000 citations in the literature • ~10,000 start-ups of GeoMapApp per month • > 2,700 GeoPass users* • Demonstrated impact on science. *GeoPass accounts are required to submit data to EarthChem/Geochron, SESAR, & USAP-DC, and to use the DMP Tool. [Chart: Citations of IEDA Systems in the Scientific Literature; y-axis: number of citations per year; series: EarthChem/PetDB/SedDB and MGDS/GMRT/GMA]
  38. 38. IEDA is “attracting” 👍 • Certification: Member of World Data System since 2011 (CoreTrustSeal certification underway) • Use of Persistent Identifiers: Publication agent of DataCite since 2011; DOI registration of datasets since 2009 via TIB Hannover; The International Geo Sample Number, a PID for physical samples • FAIR data: Findable/accessible: DOIs, landing pages, GUIs, APIs; Interoperable: CSW, DataONE member node, schema.org (EarthCube project P418); Reusable: disciplinary expertise for data curation, rich provenance metadata
  39. 39. Lessons Learned
  40. 40. Merger of EarthChem & MGDS created tensions • Partner system needs versus overarching IEDA level needs • Budget • Staff expertise • Staff allocations • Distribution among different funding sources (3 different NSF programs) • Scientific utility versus trustworthiness of operations • Operation & maintenance versus innovation
  41. 41. Merger did not lead to the expected ‘economies of scale’ • Disciplinary data curation continues as the most relevant component. • Additional resources/effort needed for coordination and alignment of activities and practices across partners. • More project management required due to budget level and status as facility. • Building useful data search and discovery across multi-disciplinary systems is a challenging problem. [Chart axis: cost per system]
  42. 42. Achievements: IEDA Data Browser
  43. 43. Achievements: IEDA Integrated Catalog • Access to all IEDA repositories in one place • Free text, map, and facet-based search options • ISO metadata available for other catalogs to harvest • Major work to align concepts and vocabularies in the different repositories • Challenge to agree on facets • Relevance to different data types • Availability of metadata • Granularity of datasets
  44. 44. A changing ecosystem: “IEDA’s cross-disciplinary services for data discovery (IEDA Data Browser) and data access (IEDA Integrated Catalog) across all IEDA systems are increasingly superseded by tools developed with substantially larger resources as part of EarthCube, Google (Google’s new Research Data Search based on schema.org), or perhaps DataONE. These recent developments aim to provide researchers with the tools to find and use data in a highly distributed and fragmented data infrastructure based on new approaches for interoperability, metadata registries, and hubs such as SCHOLIX to link data and literature.” IEDA: Future Scope and Structure (IEDA internal report, K. Lehnert & S. Carbotte, January 2018)
  45. 45. We need to adapt • Reduce complexity of operations • Adjust to and better leverage external CI developments (e.g. EarthCube) • Enhance opportunities to grow partnerships relevant to the disciplinary systems to target needs of the disciplinary communities • Systems and/or services that serve broader audiences should be funded independently (SESAR, GeoMapApp, GMRT) • Create a new management/governance structure • More independence for IEDA partners and funders to allow growth • Rely on external developments for cross-disciplinary services
  46. 46. Where are we heading from here?
  47. 47. Oh no, that diagram again ... According to Wittenburg & Strawn (2018), the implementation of data infrastructure can be guided by 4 statements: • A Digital Object has a structured bit sequence stored in a trustworthy repository. • A Digital Object has a PID and metadata. • The PID is associated with all relevant kernel information that allows humans and machines to enable FAIR. • Kernel information and Digital Object have types allowing humans and machines to associate operations with them.
  48. 48. My take on priorities: Reusability, Impact on Science, Sustainability. Data type specific best practices; Metadata quality; Granularity of access, data fusion; Metrics; Data Science; Education; Business models; Consolidation. The impact of data infrastructure on science & society depends on the reusability of data and will ultimately justify its continued funding.
  49. 49. Reusability problem: Metadata quality • Discipline-specific and data type specific metadata not well defined and enforced • Lack of consistent vocabularies • Automated metadata enrichment (e.g. CINERGI) has not yet had convincing results • Manual data curation still best, but too costly. “The Geochemical Data(base) Factory: From Heterogeneous Input to Homogeneous Output.” AGU FM 2009
  50. 50. Reusability problem: data wrangling. Surveys in recent years show that data scientists still spend 75-80% of their time ‘data wrangling’. • RDA EU survey 2013 (75%) • Brodie 2015 (80%) • CrowdFlower 2017 (80%). Source: CrowdFlower
  51. 51. Reusability solution: Data Fusion. Harmonize & integrate data so that disparate pieces of information form a picture that can be explored to reveal patterns in space, time, and properties.
  52. 52. Reusability solution: Data Fusion • Structure data so they can be accessed and understood at a more granular level • Approaches are available and improving • ISO/OGC Observations & Measurements • Observation Data Model ODM2 (Horsburgh et al. 2017) • Schema.org • Open Core Data. S. Cox et al. “Mainstream web standards now support science data too”; AGU FM 2017
  53. 53. Reusability problem: The Long Tail • Small data volumes, but big potential • Culture is not open to sharing • Data fragmented and highly heterogeneous • Lots of .xls files • Many data never see the light of day. ESIP Winter Meeting, January 2016
  54. 54. Reusability hope: Generation change. “A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.” Max Planck
  55. 55. Credit: Jon Stelling, Lehigh University
  56. 56. • Steps in the data life cycle are siloed in many communities and disciplines • Recommendation: focus on the full data life cycle. Final Report from the NSF Computer and Information Science and Engineering Advisory Committee, Data Science Working Group. Communications of the ACM, Vol. 61 No. 4, Pages 67-72, April 2018
  57. 57. A trend toward large facilities
  58. 58. Education in Data Science or Data Science in Education • Data Science as a new field in academia • Different organizational models emerging at academic institutions to integrate with domain sciences
  59. 59. I’ll leave the funding question to the experts. • Trust of the science community
  60. 60. Funding: “Funding research data management and related infrastructures”, May 2016. Authors: Knowledge Exchange Research Data Expert Group and Science Europe Working Group on Research Data.
  61. 61. Did we move at all? 2007
  62. 62. Success! The International Geo Sample Number • Grew from a local, centralized system started in 2004 to an international organization founded in 2011 • Now has 24 members on 5 continents • Currently 5 active Allocating Agents • Adoption by researchers, collection curators, publishers, and funding agencies growing • Adoption spreading to other disciplines • Biology, archeology, material sciences. Number of IGSNs issued by active IGSN Allocating Agents: IEDA 4,261,436; Geoscience Australia 2,100,273; MARUM 100,342; CSIRO 30,925; GFZ 4,809. Newest members since 2017: USGS (USA), BGS (UK), CNRS (France), IFREMER (France), ANDS (Australia)
  63. 63. The final message: Let’s work together! • It is relevant that we leverage existing capabilities and expertise. • We do not have the luxury of duplicating effort. • We need to break down barriers between communities and stakeholders that compete for their piece of the pie. NSF Workshop Cyberinfrastructure for Large Facilities, Nov 2015
  64. 64. Back to the beginning: “Do what excites you. Follow your passion. Don't necessarily worry about what obstacles might be there, because there are always ways to overcome them. But the most exciting thing is to be able to do what you love, and just don't let anything stand in the way of that.” Carol Greider, 2009 Nobel Prize winner
  65. 65. For my parents
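
Slides 21 and 47 describe a Digital Object as something that carries a PID, metadata, and a type, and list the FAIR guideline points a data provider should meet. A minimal Python sketch of that idea follows. Everything in it is an assumption made for illustration: the `DigitalObject` class, the `fair_gaps` function, the field names, and the `REQUIRED_METADATA` keys are hypothetical and do not come from the lecture, from Wittenburg & Strawn (2018), or from any FAIR specification.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical model of a Digital Object (PID, type, metadata)."""
    pid: str                  # persistent identifier, e.g. a DOI or an IGSN
    object_type: str          # declared type, so machines can pick operations
    repository: str           # repository that stores the bit sequence
    license: str = ""         # rights statement (empty if none is given)
    metadata: dict = field(default_factory=dict)


# Illustrative minimum set of metadata keys; an assumption, not a standard.
REQUIRED_METADATA = ("title", "creator", "description")


def fair_gaps(obj: DigitalObject) -> list:
    """List the ways an object falls short of the guideline points on slide 21."""
    gaps = []
    if not obj.pid:
        gaps.append("no persistent identifier")
    if not obj.license:
        gaps.append("no license or rights statement")
    if not obj.object_type:
        gaps.append("no declared type")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("missing metadata: " + ", ".join(missing))
    return gaps


if __name__ == "__main__":
    # Made-up example record; the PID and names are placeholders.
    sample = DigitalObject(
        pid="10.1234/example",
        object_type="geochemical-table",
        repository="example-repository",
        license="CC-BY-4.0",
        metadata={"title": "Basalt glass REE", "creator": "A. Researcher"},
    )
    print(fair_gaps(sample))
```

A check like this only covers what can be read off the record itself. Reliable, persistent access and repository trustworthiness, which the slides also name, would have to be assessed outside such a function.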
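
Slides 51 and 52 argue for data fusion: structuring data at the level of individual observations so that heterogeneous sources can be harmonized and explored together. The Python sketch below illustrates the idea under stated assumptions. The `Observation` record is only loosely inspired by the observation-centered view of models such as ISO/OGC Observations & Measurements and does not implement that standard; the source names, column names, and values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One measured value for one sample, in a common unit (ppm)."""
    sample_id: str
    observed_property: str
    value: float
    unit: str


# Hypothetical column mappings per source: column -> (property, source unit).
SOURCE_SCHEMAS = {
    "lab_a": {"La_ppm": ("La", "ppm"), "Sm_ppm": ("Sm", "ppm")},
    "lab_b": {"la (ppb)": ("La", "ppb"), "sm (ppb)": ("Sm", "ppb")},
}

# Divisors that convert a source unit to ppm.
UNITS_PER_PPM = {"ppm": 1.0, "ppb": 1000.0}


def harmonize(source, sample_id_key, row):
    """Turn one source-specific row (a dict) into a list of Observations."""
    observations = []
    for column, (prop, unit) in SOURCE_SCHEMAS[source].items():
        raw = row.get(column)
        if raw is None or raw == "":
            continue  # skip values the source did not report
        value = float(raw) / UNITS_PER_PPM[unit]
        observations.append(
            Observation(str(row[sample_id_key]), prop, value, "ppm")
        )
    return observations


def ratios(observations, numerator, denominator):
    """Compute numerator/denominator per sample from fused observations."""
    by_sample = {}
    for obs in observations:
        by_sample.setdefault(obs.sample_id, {})[obs.observed_property] = obs.value
    return {
        sample_id: props[numerator] / props[denominator]
        for sample_id, props in by_sample.items()
        if numerator in props and props.get(denominator)
    }


if __name__ == "__main__":
    fused = []
    fused += harmonize("lab_a", "id", {"id": "S1", "La_ppm": "3.0", "Sm_ppm": "1.5"})
    fused += harmonize("lab_b", "sample", {"sample": "S2", "la (ppb)": 2500, "sm (ppb)": 1250})
    print(ratios(fused, "La", "Sm"))
```

Once both sources are expressed as observations with one unit, a question such as the La/Sm ratio mentioned on slide 30 becomes a single query over the fused records rather than a per-file spreadsheet exercise.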
