SlideShare uma empresa Scribd logo
1 de 41
a centre of expertise in data curation and preservation




Curation of Scientific Data:
                 Challenges for Repositories

                  Chris Rusbridge
           JISC Repositories Conference
              5 June 2007, Manchester
                                                                                                    Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License, excluding content property of others. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative
Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
a centre of expertise in data curation and preservation




                         Contents
     •   Audience?
     •   Science and digital curation
     •   Why are data important?
     •   What kinds of data?
     •   What to do with data?
     •   Repository options
     •   Changing practice


JISC Repositories 2007
a centre of expertise in data curation and preservation




                         Audience
     • I assume you are either…
         • A Repository Manager concerned about adding
           data to your collections of ePrints (most likely), or
         • A research data manager or other researcher,
           concerned about finding an appropriate repository
           to curate your data (possibly), or
         • Neither of the above, in the wrong room, just come
           in to get out of the sun…




JISC Repositories 2007
a centre of expertise in data curation and preservation




  Digital Curation Centre Mission
        “The over-riding purpose of the DCC is to
        support and promote continuing improvement
        in the quality of data curation, and of
        associated digital preservation”




JISC Repositories 2007
a centre of expertise in data curation and preservation




JISC Repositories 2007
a centre of expertise in data curation and preservation




       “The Records of Science”
     • Data increasingly important as evidence
         • Key part of the scholarly record (public good)
             • Unrepeatable observations & experiments
             • Value for public money (eg OECD)
     • Experimental verifiability (the basis of science)
         • Would Chang retractions have been reduced if his first data
           were available?
                 CHANG, G., ROTH, C. B., REYES, C. L., PORNILLOS, O., CHEN, Y.-J. & CHEN, A. P. (2006)
                 Retraction of Pornillos et al., Science 310 (5756) 1950-1953. Retraction of Reyes and Chang,
                 Science 308 (5724) 1028-1031. Retraction of Chang and Roth, Science 293 (5536) 1793-1800.
                 Science Magazine, 314. http://www.sciencemag.org/cgi/content/full/314/5807/1875b

     • Allows additional interpretations
     • Legal and compliance (eg emerging RC mandates)

JISC Repositories 2007
a centre of expertise in data curation and preservation




                OECD declaration
     • “…Work towards the establishment of access regimes
       for digital research data from public funding in
       accordance with the following objectives and principles:
         •   Openness
         •   Transparency
         •   Legal conformity
         •   Formal responsibility
         •   Professionalism
         •   Protection of intellectual property
         •   Interoperability
         •   Quality and security
         •   Efficiency
         •   Accountability”
JISC Repositories 2007
a centre of expertise in data curation and preservation




Retaining research data means…
     • Data secure against loss (within group)
     • Communal repository (secure data store)
     • Re-usable, sharable information
     • As above, plus active curation (eg bio-
       informatics)
     • Long term preservation of information

     • Be clear what you are trying to do!

JISC Repositories 2007
a centre of expertise in data curation and preservation




     … or the data trajectory is…
     • Hard drive → lost (crash)
     • Hard drive →DVD →Cardboard box →Loft
       →Skip/dumpster → lost




     • Sometimes this is a very bad thing
     • Sometimes these are the right options!


JISC Repositories 2007                                          •© Marita Bushell
a centre of expertise in data curation and preservation




         Long term bit storage…
     • A solved problem? Just requires well-
       understood good data management
       practices?
     • Wrong! For very large datasets over very long
       time, there are significant problems…




                   BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOLOUS, M., MANIATIS, P., GIULI, T.
                   J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys
                   '06. Leuven, Belgium, ACM.

JISC Repositories 2007
a centre of expertise in data curation and preservation




   How Well Must We Preserve?
   Keep a petabyte for a century
   – With   50% chance of remaining completely undamaged

   Consider each bit decaying independently
   – Analogy   with radioactive decay

   That's a bit half- life of 10**18 years
   – One    hundred million times the age of the universe

   That's a very demanding requirement
   – Hard   to measure
    – Even very unlikely faults will matter a lot

JISC Repositories 2007      •Slide from David Rosenthal, LOCKSS
a centre of expertise in data curation and preservation




       What to do about curation
     • Build curation/reusability into science workflow
         • Curation begins before creation
         • What’s easy at first becomes (impossibly) hard later
         • Describe data (metadata schemas, “representation info”,
           etc)
         • Keep experimental parameters (technical, who, what, when,
           where)
         • Keep ability to process
         • Keep data!




JISC Repositories 2007
a centre of expertise in data curation and preservation




    What to do about curation - 2
     • Use standard/agreed formats for data
     • Make ownership & restrictions clear, &
       explain how to cite data
     • Offer for deposit in institutional or discipline
       repository
         • Appraisal and selection essential
         • Possible time-limited embargos
     • “Publish” data in support of articles



JISC Repositories 2007
a centre of expertise in data curation and preservation



 Internet Archaeology: publication with
                 data




JISC Repositories 2007
a centre of expertise in data curation and preservation




            Database as book…
     • Buneman (early pilot)
       work on IUPHAR
       database
     • MySQL to XML
       database
         • Historic to logical
           schema
     • XML via XSLT to LaTeX




JISC Repositories 2007
a centre of expertise in data curation and preservation




                  The StORe vision
     • Seamless transport                                   Source
       from research data to
       research publications
       and vice versa                                        ware
     • Bi-directional links                                 Middle
       proven in social science
       e-research but capable
       of export to other
       disciplines
                                                            Output




                   •http://jiscstore.jot.com/WikiHome/
JISC Repositories 2007                              •Slide from Graham Pryor
a centre of expertise in data curation and preservation




  StORe survey: linkage value?
   The value of
                       University   University
   direct links                                    PG       Contract    Independent
                       academic     research                                           Other    Totals
   from source to                                student   researcher    researcher
                         staff      assistant
   output data

         Significant
        advantage         85           18         33           11            2          26       175
            Useful        78            9         41            5            4           9       146
        Interesting       24            4          5            3            0           5        41
     Of no interest        9            0          0            0            0           1        10
          Not sure         7            0          7            0            1           2        17
             Other         1            1          0            0            0           1         3
            Totals       204           32         86           19            7          44       392
             •But: “researchers’ attitudes to enabling access depend to a large
             •extent on whether they are behaving as producers or users of data”

JISC Repositories 2007                                              •Slide from StORe project
a centre of expertise in data curation and preservation




        What to do about data (3)
     • Institutional repository managers
         • Make contact with emerging institutional data services
         • Start raising awareness of the need to curate rather than just
           dump data
         • Start thinking about the relationship of data to publications
           (especially e-theses)
         • Start thinking about the metadata needed to find and re-use
           data
         • Make contact with key researchers
         • Start thinking about their data…




JISC Repositories 2007
a centre of expertise in data curation and preservation




             What kinds of data?
     • Observations
         • eg UARS (Upper Atmosphere) Level 0: telemetry
         • UARS Level 1: measured physical parameters (post
           calibration?)
     • Derived data
         • UARS Level 2: calculated geophysical? profiles
         • UARS level 3: gridded, interpolated?
     • Combined data
     • Crafted data
         • Eg annotated gene/protein databases
     • Descriptive (meta)data


JISC Repositories 2007
a centre of expertise in data curation and preservation




StORe: Source data formats
                                                  CAD/GIS:                       39

                 Extensible mark -up language (XML):                             35

                Database files (e.g. Access, MySQL):                            117

                                     Flat files (e.g. FITS):                     66

                Hypertext mark -up language (HTML):                              60

                 Image files (e.g. .jpg, .tif, .bmp, .gif):                     228

                                           Plain text (.txt):                   179

                    Portable document format (.pdf):                            156

                                      Rich text files (.rtf):                    53

                        Spreadsheets (e.g. Excel/.xls):                         220

                                     Statistical software:                       75

                                      Tables/catalogues:                        102

               Word processed files (e.g. Word/.doc):                           220

                                 Other (please specify) :                        76




JISC Repositories 2007                                          •Slide from StORe project
a centre of expertise in data curation and preservation




StORe: the other data formats?
     They said the 76 other formats included:
       +latex+.cc source code, .cif (crystallographic data),
       .pdb, .mtz, .pool, .root, .raw, .swf, .fla, .raw, .mpg,
       binary files, chemdraw cdx, xwin nmr files, .ps files,
       .fla, .swf, masslynx files, derived data in PAw-format
       ntuples, raw mass spectrometry data, X-ray
       diffraction data, kaleidagraphs, Atlas/ti hermeneutic
       unit files, C++/shell scripts, Fourier induction decay
       files, etc., etc., etc., etc………..



JISC Repositories 2007                         •Slide from StORe project
a centre of expertise in data curation and preservation




StORe: the other data formats - more
  They also said such things as:
    “It is stored in a database, but nothing so simple as an
    Access file! It's one of the largest databases in the world!
    The format is Kanga/Root and previously was
    Objectivity. I think it's of the order of Picobytes in size.”
  And:
    “God preserve us from idiots who archive data in
    proprietary commercial formats (Excel spreadsheets and
    MS-word documents)!”




JISC Repositories 2007                         •Slide from StORe project
a centre of expertise in data curation and preservation




  What are the reusability issues?
     • Data not neutral; highly contextual!
     • Hard to know the risks & pitfalls of a particular
       dataset
     • Data not self-describing: hard to find
       appropriate data (but see Murray-Rust on
       Googling InChI etc)
     • Hard to “understand” data once found
         • Really need information, not data!
     • Hard to use data once understood

JISC Repositories 2007
a centre of expertise in data curation and preservation




                         Context
     • Data meaningless without context
         • Metadata of many kinds
         • Representation information… from data to
           information
         • Linkage and connection between datasets
     • Provenance
         • Authenticity/integrity
         • Computational lineage



JISC Repositories 2007
a centre of expertise in data curation and preservation




              Access and re-use
     • Ethics and rights control access
         • Weak in expressing this long-term
     • Collaboration tools
         • Annotation, discussion, review (see DART…)
         • Re-use leading to change and development
     • “Publication”
         • Not just in “print”
         • Underlying data should be “published”, too


JISC Repositories 2007
a centre of expertise in data curation and preservation




           Data citation issues…
     • Citation for human readers and machine use cases
     • Granularity: database, record, item
     • Citation of changing objects
         • Version change (eg W3C practice: no version = latest, vs bibliographic:
           no version = first)
         • An efficient way to reference and access “archived” past states of
           more rapidly changing dataset, eg Genomics… datasets that result
           from the combined work of curators, or contain opinions or facts likely
           to change (work in progress, Buneman et al)
     • Standards conflict and immature (NLM best?)

     • Citation ESSENTIAL for motivating quality academic work on data
       management and curation


JISC Repositories 2007
a centre of expertise in data curation and preservation




             Repository challenges
     • Data are different: you’ll need access to some domain
       knowledge
     • Appraisal/selection harder
     • Broader range of formats
         • Appropriate “standards” for longevity? XML-based?
     • What metadata are needed?
         •   Descriptive, to find the dataset
         •   Context and background
         •   Provenance
         •   “Representation information” to connect data to information
             (whatever gives meaning to data for the “designated
             community”)

JISC Repositories 2007
a centre of expertise in data curation and preservation




        Repository challenges - 2
     • May distort your repository
         •   Size
         •   Number of objects
         •   Rate of deposit
         •   Nature of use
     • Databases may be dynamic
     • Databases may need to be accessed in situ
     • Rights and ethical limitations hard to describe and
       enforce
     • Need to build links to publications (cf StORe)
     • Need to build discipline links across repositories…

JISC Repositories 2007
a centre of expertise in data curation and preservation




        Repository challenges - 3
     • Is your platform suitable?
     • Most successful (ie older) data repositories
       are DIY
     • Data also held in repositories built on Dspace,
       ePrints and Fedora




JISC Repositories 2007
a centre of expertise in data curation and preservation




JISC Repositories 2007   •Data from MIT DSpace Political Science
a centre of expertise in data curation and preservation




JISC Repositories 2007
a centre of expertise in data curation and preservation




JISC Repositories 2007
a centre of expertise in data curation and preservation




         Who does data curation?
     •   Individuals
     •   Departments or groups
     •   Institutions, often through libraries
     •   Communities
     •   Disciplines
     •   Publishers
     •   National services
     •   Other 3rd parties…

JISC Repositories 2007
a centre of expertise in data curation and preservation




   Who are the curation players?




JISC Repositories 2007
a centre of expertise in data curation and preservation




       Disciplinary repositories…
     • >900 Nucleic Acids datasets!
     • ESDS/UKDA and NERC data centres, but…
     • “AHRC Council has decided to cease funding the Arts
       and Humanities Data Service (AHDS) from March
       2008. […] Grant holders must make materials they
       had planned to deposit with the AHDS available in an
       accessible depository for at least three years after the
       end of their grant”
             • AHRC Press Release 14/05/2007
             • (Note petition at http://petitions.pm.gov.uk/AHDSfunding/)
         • Does not apply to Archaeology: ADS still funded?



JISC Repositories 2007
a centre of expertise in data curation and preservation




        Institutional Repositories
     • OpenDOAR: only 5 Institutional Repositories claim to
       include datasets
         •   Bristol
         •   Cambridge
         •   Edinburgh
         •   Leicester
         •   Southampton
     • …and some of these seem doubtful on inspection!
         • … of course not all research data are “datasets”




JISC Repositories 2007
a centre of expertise in data curation and preservation




                 Cultural change
     • If we build it, will they come? NO!!
     • Outreach important: communication with
       scientists and researchers is hard graft
     • Cultural change to new approach requires more:
         • Incentives, rewards and mandates
         • Successful exemplars (well publicised)
         • Discipline-oriented approach (one size does not fit all)




JISC Repositories 2007
a centre of expertise in data curation and preservation




Need for advocacy?
       What functionality is missing from source repositories?
                    Academic     Research        Post-             Independent
                    staff        assistants      graduates         researchers



       None               9           2                 7
       Don’t use          7                            10                   1
       Lack of            3                             4                   2
       knowledge
       Don’t know         5           3                13                   1

       No reply          129          20               45                  13

JISC Repositories 2007                            •Slide from StORe project
a centre of expertise in data curation and preservation




Need for advocacy?
       What functionality is missing from output repositories?
                    Academic     Research        Post-             Independent
                    staff        assistants      graduates         researchers



       None               3           2                 5                   1
       Don’t use          1           1
       Lack of                                          2                   1
       knowledge
       Don’t know                     2                 6                   1

       No reply          123          15               48                  15

JISC Repositories 2007                            •Slide from StORe project
a centre of expertise in data curation and preservation




Need for advocacy?
        “The majority of academics do not know
        what repositories are nor are they
        familiar with the issues around new
        means of dissemination”
        – UKOLN/Eduserv Foundation: Digital
        Repositories Roadmap: looking forward, April
        2006



JISC Repositories 2007                    •Slide from StORe project
a centre of expertise in data curation and preservation




                               Thank you
                         c.rusbridge@ed.ac.uk




JISC Repositories 2007

Mais conteúdo relacionado

Mais procurados

Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Deborah McGuinness
 
Public data archiving: Who does? Who doesn't? What can we do about it?
Public data archiving: Who does?  Who doesn't?  What can we do about it?Public data archiving: Who does?  Who doesn't?  What can we do about it?
Public data archiving: Who does? Who doesn't? What can we do about it?Heather Piwowar
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Jeroen Rombouts
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and PolicyAnton Van de Putte
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015LYRASIS
 

Mais procurados (7)

Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
 
Public data archiving: Who does? Who doesn't? What can we do about it?
Public data archiving: Who does?  Who doesn't?  What can we do about it?Public data archiving: Who does?  Who doesn't?  What can we do about it?
Public data archiving: Who does? Who doesn't? What can we do about it?
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
 
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data ServicesNISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
 
Working with Global Infrastructure at a National Level
Working with Global Infrastructure at a National LevelWorking with Global Infrastructure at a National Level
Working with Global Infrastructure at a National Level
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and Policy
 
Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015Ala cspace aspace rep services demo 2015
Ala cspace aspace rep services demo 2015
 

Semelhante a Curation of scientifica data: Challenges for repositories

The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management EcosystemJohn Kunze
 
Research data lifecycle diagram
Research data lifecycle diagramResearch data lifecycle diagram
Research data lifecycle diagramSteven Cracknell
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemASIS&T
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated scienceChris Rusbridge
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated scienceChris Rusbridge
 
Moving the repository upstream
Moving the repository upstreamMoving the repository upstream
Moving the repository upstreamChris Rusbridge
 
Data Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach DataData Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach Datacunera
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...John Scally
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypseENUG
 
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...Anna Maria Tammaro
 
Data presentation and transfer
Data presentation and transferData presentation and transfer
Data presentation and transferIyad Abou Rabii
 
Presentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesPresentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesSEAD
 
Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...rmacneil88
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and LibariesRob Grim
 
Curation of Research Data
Curation of Research DataCuration of Research Data
Curation of Research DataMichael Day
 
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...datacite
 
Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012Sarah Callaghan
 
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...Sarah Anna Stewart
 
DataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefDataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefCrossref
 

Semelhante a Curation of scientifica data: Challenges for repositories (20)

The future of the DCC
The future of the DCCThe future of the DCC
The future of the DCC
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management Ecosystem
 
Research data lifecycle diagram
Research data lifecycle diagramResearch data lifecycle diagram
Research data lifecycle diagram
 
RDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management EcosystemRDAP13 John Kunze: The Data Management Ecosystem
RDAP13 John Kunze: The Data Management Ecosystem
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
 
Curating data for integrated science
Curating data for integrated scienceCurating data for integrated science
Curating data for integrated science
 
Moving the repository upstream
Moving the repository upstreamMoving the repository upstream
Moving the repository upstream
 
Data Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach DataData Literacy: Creating and Managing Reserach Data
Data Literacy: Creating and Managing Reserach Data
 
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...Scally The Library's Role in Research Data Management. OCLC partnership meeti...
Scally The Library's Role in Research Data Management. OCLC partnership meeti...
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
Data curator: who is s / he?
Findings of the IFLA Library Theory and Research...
 
Data presentation and transfer
Data presentation and transferData presentation and transfer
Data presentation and transfer
 
Presentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesPresentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research Series
 
Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...Improving RDM through closer integration of electronic lab notebooks and data...
Improving RDM through closer integration of electronic lab notebooks and data...
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
Curation of Research Data
Curation of Research DataCuration of Research Data
Curation of Research Data
 
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...
 
Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012Presentation to EASE, Tallinn, June 2012
Presentation to EASE, Tallinn, June 2012
 
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
 
DataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRefDataCite: the Perfect Complement to CrossRef
DataCite: the Perfect Complement to CrossRef
 

Mais de Chris Rusbridge

The Distributed National Electronic Resource and the Electronic Libraries Pro...
The Distributed National Electronic Resource and the Electronic Libraries Pro...The Distributed National Electronic Resource and the Electronic Libraries Pro...
The Distributed National Electronic Resource and the Electronic Libraries Pro...Chris Rusbridge
 
JISC Digital Library initiatives
JISC Digital Library initiativesJISC Digital Library initiatives
JISC Digital Library initiativesChris Rusbridge
 
Practical steps towards digital preservation at institutional levels
Practical steps towards digital preservation at institutional levelsPractical steps towards digital preservation at institutional levels
Practical steps towards digital preservation at institutional levelsChris Rusbridge
 
Cautious Optimism: Cultivate your Garden
Cautious Optimism: Cultivate your GardenCautious Optimism: Cultivate your Garden
Cautious Optimism: Cultivate your GardenChris Rusbridge
 
Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Chris Rusbridge
 
Issues in long-term knowledge retention in engineering
Issues in long-term knowledge retention in engineeringIssues in long-term knowledge retention in engineering
Issues in long-term knowledge retention in engineeringChris Rusbridge
 
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stageChris Rusbridge
 
LOCKSS UK, with a focus on reporting experience
LOCKSS UK, with a focus on reporting experienceLOCKSS UK, with a focus on reporting experience
LOCKSS UK, with a focus on reporting experienceChris Rusbridge
 
Trust and repository audit: can repository managers assure trustworthiness?
Trust and repository audit: can repository managers assure trustworthiness?Trust and repository audit: can repository managers assure trustworthiness?
Trust and repository audit: can repository managers assure trustworthiness?Chris Rusbridge
 
Disciplinary dimensions of digital curation: introduction and synthesis
Disciplinary dimensions of digital curation: introduction and synthesisDisciplinary dimensions of digital curation: introduction and synthesis
Disciplinary dimensions of digital curation: introduction and synthesisChris Rusbridge
 
Reference Model for Economically Sustainable Digital Curation
Reference Model for Economically Sustainable Digital CurationReference Model for Economically Sustainable Digital Curation
Reference Model for Economically Sustainable Digital CurationChris Rusbridge
 
Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Chris Rusbridge
 
Blue Ribbon Task Force on Sustainable Digital Preservation
Blue Ribbon Task Force on Sustainable Digital PreservationBlue Ribbon Task Force on Sustainable Digital Preservation
Blue Ribbon Task Force on Sustainable Digital PreservationChris Rusbridge
 
Sustainable Digital Preservation and Access
Sustainable Digital Preservation and AccessSustainable Digital Preservation and Access
Sustainable Digital Preservation and AccessChris Rusbridge
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositoriesChris Rusbridge
 

Mais de Chris Rusbridge (18)

The Distributed National Electronic Resource and the Electronic Libraries Pro...
The Distributed National Electronic Resource and the Electronic Libraries Pro...The Distributed National Electronic Resource and the Electronic Libraries Pro...
The Distributed National Electronic Resource and the Electronic Libraries Pro...
 
JISC Digital Library initiatives
JISC Digital Library initiativesJISC Digital Library initiatives
JISC Digital Library initiatives
 
Practical steps towards digital preservation at institutional levels
Practical steps towards digital preservation at institutional levelsPractical steps towards digital preservation at institutional levels
Practical steps towards digital preservation at institutional levels
 
The Licence Trap
The Licence TrapThe Licence Trap
The Licence Trap
 
Cautious Optimism: Cultivate your Garden
Cautious Optimism: Cultivate your GardenCautious Optimism: Cultivate your Garden
Cautious Optimism: Cultivate your Garden
 
Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...
 
Dcc endeavour-2006
Dcc endeavour-2006Dcc endeavour-2006
Dcc endeavour-2006
 
Issues in long-term knowledge retention in engineering
Issues in long-term knowledge retention in engineeringIssues in long-term knowledge retention in engineering
Issues in long-term knowledge retention in engineering
 
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
 
LOCKSS UK, with a focus on reporting experience
LOCKSS UK, with a focus on reporting experienceLOCKSS UK, with a focus on reporting experience
LOCKSS UK, with a focus on reporting experience
 
Dcc jsr phase 3
Dcc jsr phase 3Dcc jsr phase 3
Dcc jsr phase 3
 
Trust and repository audit: can repository managers assure trustworthiness?
Trust and repository audit: can repository managers assure trustworthiness?Trust and repository audit: can repository managers assure trustworthiness?
Trust and repository audit: can repository managers assure trustworthiness?
 
Disciplinary dimensions of digital curation: introduction and synthesis
Disciplinary dimensions of digital curation: introduction and synthesisDisciplinary dimensions of digital curation: introduction and synthesis
Disciplinary dimensions of digital curation: introduction and synthesis
 
Reference Model for Economically Sustainable Digital Curation
Reference Model for Economically Sustainable Digital CurationReference Model for Economically Sustainable Digital Curation
Reference Model for Economically Sustainable Digital Curation
 
Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...Frequently-asked questions on Freedom of Information and Environmental Inform...
Frequently-asked questions on Freedom of Information and Environmental Inform...
 
Blue Ribbon Task Force on Sustainable Digital Preservation
Blue Ribbon Task Force on Sustainable Digital PreservationBlue Ribbon Task Force on Sustainable Digital Preservation
Blue Ribbon Task Force on Sustainable Digital Preservation
 
Sustainable Digital Preservation and Access
Sustainable Digital Preservation and AccessSustainable Digital Preservation and Access
Sustainable Digital Preservation and Access
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
 

Último

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Curation of scientifica data: Challenges for repositories

  • 1. a centre of expertise in data curation and preservation Curation of Scientific Data: Challenges for Repositories Chris Rusbridge JISC Repositories Conference 5 June 2007, Manchester Funded by: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
  • 2. a centre of expertise in data curation and preservation Contents • Audience? • Science and digital curation • Why are data important? • What kinds of data? • What to do with data? • Repository options • Changing practice JISC Repositories 2007
  • 3. a centre of expertise in data curation and preservation Audience • I assume you are either… • A Repository Manager concerned about adding data to your collections of ePrints (most likely), or • A research data manager or other researcher, concerned about finding an appropriate repository to curate your data (possibly), or • Neither of the above, in the wrong room, just come in to get out of the sun… JISC Repositories 2007
  • 4. a centre of expertise in data curation and preservation Digital Curation Centre Mission “The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation” JISC Repositories 2007
  • 5. a centre of expertise in data curation and preservation JISC Repositories 2007
  • 6. a centre of expertise in data curation and preservation “The Records of Science” • Data increasingly important as evidence • Key part of the scholarly record (public good) • Unrepeatable observations & experiments • Value for public money (eg OECD) • Experimental verifiability (the basis of science) • Would Chang retractions have been reduced if his first data were available? CHANG, G., ROTH, C. B., REYES, C. L., PORNILLOS, O., CHEN, Y.-J. & CHEN, A. P. (2006) Retraction of Pornillos et al., Science 310 (5756) 1950-1953. Retraction of Reyes and Chang, Science 308 (5724) 1028-1031. Retraction of Chang and Roth, Science 293 (5536) 1793-1800. Science Magazine, 314. http://www.sciencemag.org/cgi/content/full/314/5807/1875b • Allows additional interpretations • Legal and compliance (eg emerging RC mandates) JISC Repositories 2007
  • 7. a centre of expertise in data curation and preservation OECD declaration • “…Work towards the establishment of access regimes for digital research data from public funding in accordance with the following objectives and principles: • Openness • Transparency • Legal conformity • Formal responsibility • Professionalism • Protection of intellectual property • Interoperability • Quality and security • Efficiency • Accountability” JISC Repositories 2007
  • 8. a centre of expertise in data curation and preservation Retaining research data means… • Data secure against loss (within group) • Communal repository (secure data store) • Re-usable, sharable information • As above, plus active curation (eg bio- informatics) • Long term preservation of information • Be clear what you are trying to do! JISC Repositories 2007
  • 9. a centre of expertise in data curation and preservation … or the data trajectory is… • Hard drive → lost (crash) • Hard drive →DVD →Cardboard box →Loft →Skip/dumpster → lost • Sometimes this is a very bad thing • Sometimes these are the right options! JISC Repositories 2007 •© Marita Bushell
  • 10. a centre of expertise in data curation and preservation Long term bit storage… • A solved problem? Just requires well- understood good data management practices? • Wrong! For very large datasets over very long time, there are significant problems… BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOLOUS, M., MANIATIS, P., GIULI, T. J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys '06. Leuven, Belgium, ACM. JISC Repositories 2007
  • 11. a centre of expertise in data curation and preservation How Well Must We Preserve? Keep a petabyte for a century – With 50% chance of remaining completely undamaged Consider each bit decaying independently – Analogy with radioactive decay That's a bit half- life of 10**18 years – One hundred million times the age of the universe That's a very demanding requirement – Hard to measure – Even very unlikely faults will matter a lot JISC Repositories 2007 •Slide from David Rosenthal, LOCKSS
  • 12. a centre of expertise in data curation and preservation What to do about curation • Build curation/reusability into science workflow • Curation begins before creation • What’s easy at first becomes (impossibly) hard later • Describe data (metadata schemas, “representation info”, etc) • Keep experimental parameters (technical, who, what, when, where) • Keep ability to process • Keep data! JISC Repositories 2007
  • 13. a centre of expertise in data curation and preservation What to do about curation - 2 • Use standard/agreed formats for data • Make ownership & restrictions clear, & explain how to cite data • Offer for deposit in institutional or discipline repository • Appraisal and selection essential • Possible time-limited embargos • “Publish” data in support of articles JISC Repositories 2007
  • 14. a centre of expertise in data curation and preservation Internet Archaeology: publication with data JISC Repositories 2007
  • 15. a centre of expertise in data curation and preservation Database as book… • Buneman (early pilot) work on IUPHAR database • MySQL to XML database • Historic to logical schema • XML via XSLT to LaTeX JISC Repositories 2007
  • 16. a centre of expertise in data curation and preservation The StORe vision • Seamless transport Source from research data to research publications and vice versa ware • Bi-directional links Middle proven in social science e-research but capable of export to other disciplines Output •http://jiscstore.jot.com/WikiHome/ JISC Repositories 2007 •Slide from Graham Pryor
  • 17. a centre of expertise in data curation and preservation StORe survey: linkage value? The value of University University direct links PG Contract Independent academic research Other Totals from source to student researcher researcher staff assistant output data Significant advantage 85 18 33 11 2 26 175 Useful 78 9 41 5 4 9 146 Interesting 24 4 5 3 0 5 41 Of no interest 9 0 0 0 0 1 10 Not sure 7 0 7 0 1 2 17 Other 1 1 0 0 0 1 3 Totals 204 32 86 19 7 44 392 •But: “researchers’ attitudes to enabling access depend to a large •extent on whether they are behaving as producers or users of data” JISC Repositories 2007 •Slide from StORe project
  • 18. a centre of expertise in data curation and preservation What to do about data (3) • Institutional repository managers • Make contact with emerging institutional data services • Start raising awareness of the need to curate rather than just dump data • Start thinking about the relationship of data to publications (especially e-theses) • Start thinking about the metadata needed to find and re-use data • Make contact with key researchers • Start thinking about their data… JISC Repositories 2007
  • 19. a centre of expertise in data curation and preservation What kinds of data? • Observations • eg UARS (Upper Atmosphere) Level 0: telemetry • UARS Level 1: measured physical parameters (post calibration?) • Derived data • UARS Level 2: calculated geophysical? profiles • UARS level 3: gridded, interpolated? • Combined data • Crafted data • Eg annotated gene/protein databases • Descriptive (meta)data JISC Repositories 2007
  • 20. a centre of expertise in data curation and preservation StORe: Source data formats CAD/GIS: 39 Extensible mark -up language (XML): 35 Database files (e.g. Access, MySQL): 117 Flat files (e.g. FITS): 66 Hypertext mark -up language (HTML): 60 Image files (e.g. .jpg, .tif, .bmp, .gif): 228 Plain text (.txt): 179 Portable document format (.pdf): 156 Rich text files (.rtf): 53 Spreadsheets (e.g. Excel/.xls): 220 Statistical software: 75 Tables/catalogues: 102 Word processed files (e.g. Word/.doc): 220 Other (please specify) : 76 JISC Repositories 2007 •Slide from StORe project
  • 21. a centre of expertise in data curation and preservation StORe: the other data formats? They said the 76 other formats included: +latex+.cc source code, .cif (crystallographic data), .pdb, .mtz, .pool, .root, .raw, .swf, .fla, .raw, .mpg, binary files, chemdraw cdx, xwin nmr files, .ps files, .fla, .swf, masslynx files, derived data in PAw-format ntuples, raw mass spectrometry data, X-ray diffraction data, kaleidagraphs, Atlas/ti hermeneutic unit files, C++/shell scripts, Fourier induction decay files, etc., etc., etc., etc……….. JISC Repositories 2007 •Slide from StORe project
  • 22. a centre of expertise in data curation and preservation StORe: the other data formats - more They also said such things as: “It is stored in a database, but nothing so simple as an Access file! It's one of the largest databases in the world! The format is Kanga/Root and previously was Objectivity. I think it's of the order of Picobytes in size.” And: “God preserve us from idiots who archive data in proprietary commercial formats (Excel spreadsheets and MS-word documents)!” JISC Repositories 2007 •Slide from StORe project
  • 23. a centre of expertise in data curation and preservation What are the reusability issues? • Data not neutral; highly contextual! • Hard to know the risks & pitfalls of a particular dataset • Data not self-describing: hard to find appropriate data (but see Murray-Rust on Googling InChI etc) • Hard to “understand” data once found • Really need information, not data! • Hard to use data once understood JISC Repositories 2007
  • 24. a centre of expertise in data curation and preservation Context • Data meaningless without context • Metadata of many kinds • Representation information… from data to information • Linkage and connection between datasets • Provenance • Authenticity/integrity • Computational lineage JISC Repositories 2007
  • 25. a centre of expertise in data curation and preservation Access and re-use • Ethics and rights control access • Weak in expressing this long-term • Collaboration tools • Annotation, discussion, review (see DART…) • Re-use leading to change and development • “Publication” • Not just in “print” • Underlying data should be “published”, too JISC Repositories 2007
  • 26. a centre of expertise in data curation and preservation Data citation issues… • Citation for human readers and machine use cases • Granularity: database, record, item • Citation of changing objects • Version change (eg W3C practice: no version = latest, vs bibliographic: no version = first) • An efficient way to reference and access “archived” past states of more rapidly changing dataset, eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change (work in progress, Buneman et al) • Standards conflict and immature (NLM best?) • Citation ESSENTIAL for motivating quality academic work on data management and curation JISC Repositories 2007
  • 27. a centre of expertise in data curation and preservation Repository challenges • Data are different: you’ll need access to some domain knowledge • Appraisal/selection harder • Broader range of formats • Appropriate “standards” for longevity? XML-based? • What metadata are needed? • Descriptive, to find the dataset • Context and background • Provenance • “Representation information” to connect data to information (whatever gives meaning to data for the “designated community”) JISC Repositories 2007
  • 28. a centre of expertise in data curation and preservation Repository challenges - 2 • May distort your repository • Size • Number of objects • Rate of deposit • Nature of use • Databases may be dynamic • Databases may need to be accessed in situ • Rights and ethical limitations hard to describe and enforce • Need to build links to publications (cf StORe) • Need to build discipline links across repositories… JISC Repositories 2007
  • 29. a centre of expertise in data curation and preservation Repository challenges - 3 • Is your platform suitable? • Most successful (ie older) data repositories are DIY • Data also held in repositories built on Dspace, ePrints and Fedora JISC Repositories 2007
  • 30. a centre of expertise in data curation and preservation JISC Repositories 2007 •Data from MIT DSpace Political Science
  • 31. a centre of expertise in data curation and preservation JISC Repositories 2007
  • 32. a centre of expertise in data curation and preservation JISC Repositories 2007
  • 33. a centre of expertise in data curation and preservation Who does data curation? • Individuals • Departments or groups • Institutions, often through libraries • Communities • Disciplines • Publishers • National services • Other 3rd parties… JISC Repositories 2007
  • 34. a centre of expertise in data curation and preservation Who are the curation players? JISC Repositories 2007
  • 35. a centre of expertise in data curation and preservation Disciplinary repositories… • >900 Nucleic Acids datasets! • ESDS/UKDA and NERC data centres, but… • “AHRC Council has decided to cease funding the Arts and Humanities Data Service (AHDS) from March 2008. […] Grant holders must make materials they had planned to deposit with the AHDS available in an accessible depository for at least three years after the end of their grant” • AHRC Press Release 14/05/2007 • (Note petition at http://petitions.pm.gov.uk/AHDSfunding/) • Does not apply to Archaeology: ADS still funded? JISC Repositories 2007
  • 36. a centre of expertise in data curation and preservation Institutional Repositories • OpenDOAR: only 5 Institutional Repositories claim to include datasets • Bristol • Cambridge • Edinburgh • Leicester • Southampton • …and some of these seem doubtful on inspection! • … of course not all research data are “datasets” JISC Repositories 2007
  • 37. a centre of expertise in data curation and preservation Cultural change • If we build it, will they come? NO!! • Outreach important: communication with scientists and researchers is hard graft • Cultural change to new approach requires more: • Incentives, rewards and mandates • Successful exemplars (well publicised) • Discipline-oriented approach (one size does not fit all) JISC Repositories 2007
  • 38. a centre of expertise in data curation and preservation Need for advocacy? What functionality is missing from source repositories? Academic Research Post- Independent staff assistants graduates researchers None 9 2 7 Don’t use 7 10 1 Lack of 3 4 2 knowledge Don’t know 5 3 13 1 No reply 129 20 45 13 JISC Repositories 2007 •Slide from StORe project
  • 39. a centre of expertise in data curation and preservation Need for advocacy? What functionality is missing from output repositories? Academic Research Post- Independent staff assistants graduates researchers None 3 2 5 1 Don’t use 1 1 Lack of 2 1 knowledge Don’t know 2 6 1 No reply 123 15 48 15 JISC Repositories 2007 •Slide from StORe project
  • 40. a centre of expertise in data curation and preservation Need for advocacy? “The majority of academics do not know what repositories are nor are they familiar with the issues around new means of dissemination” – UKOLN/Eduserv Foundation: Digital Repositories Roadmap: looking forward, April 2006 JISC Repositories 2007 •Slide from StORe project
  • 41. a centre of expertise in data curation and preservation Thank you c.rusbridge@ed.ac.uk JISC Repositories 2007