Escaping Datageddon
                                 Dorothea Salo
                                 Ryan Schryver
                        Graduate Support Series




 Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
Why are you here?

• You’re managing data (your own or your lab’s)
• Or you think you maybe should be
• You’re not sure why it matters
• You’re not sure how best to do it
• You’d like to know whether you’re on the right
  track
            Adapted from Graham et al. “Managing Research Data 101.” http://libraries.mit.edu/guides/subjects/data-management/Managing_Research_Data_101_IAP_2010.pdf



                                                             Photo: Jaysin, http://www.flickr.com/photos/orijinal/3539418133/
Why manage data?
• To make your research easier!
• Because somebody else said so
  • Your lab PI
  • Your lab PI’s funder
• In case you need it later
• To avoid accusations of fraud or bad science
• To share it for others to use and learn from
• To get credit for producing it
• To keep from drowning in irrelevant stuff
  • ... especially at grant/project end
                           Photo: Shashi Bellamkonda, http://www.flickr.com/photos/drbeachvacation/2874078655/
Research is changing...
• Research datasets were second-class citizens.
  • Publications were all that mattered!
  • And publishing data in print was uneconomical even when possible.
  • So nobody saw anybody’s data.
• Data are now digital. The game changes!
  • Data are shared more, and more openly! Open Source, Open Access,
    Open Data.
  • There’s a lot still to be worked out about how to share, cite, credit, and
    license digital data.
  • But data will unquestionably matter more to your research careers
    than they do to your advisors’ generation.
• Learn good data habits now! You’ll need
  them later.
                           Photo: Karl-Ludwig Poggemann, http://www.flickr.com/photos/hinkelstone/2435823037/
Did you know?
• Gene expression microarray data: “Publicly
  available data was significantly (p=0.006)
  associated with a 69% increase in citations,
  independently of journal impact factor, date
  of publication, and author country of origin.”
  • Piwowar, Heather et al. “Sharing detailed research data is associated
    with increased citation rate.” PLoS ONE 2007.
    DOI: 10.1371/journal.pone.0000308
• Maybe there’s an advantage here!


                                        Photo: ynse, http://www.flickr.com/photos/ynse/2341095044/
Did you see?
How to plan to keep data



 Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
Step 1: Inventory

• What data are you collecting or making?
  • Observational, experimental, simulation? Raw, derived, compiled?
  • Can it be recreated? How much would that cost?
• How much of it? How fast is it growing? Does it change?
• What file format(s)?
• What’s your infrastructure for data collection and
  storage like?
  • How do you find it, or find what you’re looking for in it?
  • How easy is it to get new people up to speed? Or share data with others?


                                           Photo: Anssi Koskinen, http://www.flickr.com/photos/ansik/304526237/
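
A first pass at the inventory questions above (what formats, how much of each) can be scripted. This is a minimal sketch, not a prescribed tool: it assumes your data live under one folder, and the function names `inventory` and `print_inventory` are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def inventory(root):
    """Summarize files under `root` by extension: file count and total bytes."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Treat ".CSV" and ".csv" as the same format.
            ext = path.suffix.lower() or "(no extension)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += path.stat().st_size
    return dict(summary)


def print_inventory(root):
    """Print one line per file format, largest total size first."""
    rows = sorted(inventory(root).items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    for ext, info in rows:
        print(f"{ext:16} {info['files']:6d} files {info['bytes']:12d} bytes")
```

Running `print_inventory("path/to/your/data")` every few months and saving the output is one rough way to see how fast the data are growing.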
Step 2: Needs
• Who are the audiences for your data?
  • You (including Future You), your lab colleagues (including future ones), your PIs
  • Disciplinary colleagues, at your institution or at others
  • Colleagues in allied disciplines
  • The world!
• What are your obligations to others?
  • Funder requirements
  • Confidentiality issues
  • IP questions
  • Security
• How long do you need to keep your data?

                              Photo: Celeste “Vitamin C9000,” http://www.flickr.com/photos/celestemarie/2193327230/
Step 3: Process planning
• How do you and your lab get from where
  you are to where you need to be?
• Document, document, document all decisions
  and all processes!
• Secret sauce: the more you strategize up front,
  the less angst and panic later.
  • “Make it up as you go along” is very bad practice!
  • But the best-laid plans go agley... so be flexible.
  • And watch your field! Best practices are still in flux.


                                     Photo: Kevin Utting, http://www.flickr.com/photos/tallkev/256810217/
Things to think about



Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
File formats
• Will anybody be able to read these files at
  the end of your time horizon?
• Where possible, prefer file formats that are:
  • Open, standardized
  • Documented
  • In wide use
  • Easy to data-mine, transform, recast
• If you need to transform data for durability,
  do it now, not later.


                                   Photo: Bart Everson, http://www.flickr.com/photos/editor/859824333/
Documentation

• Fundamental question: What would someone
  unfamiliar with your data need in order to
  find, evaluate, understand, and reuse them?
  • Consider the differences between someone inside your lab, someone
    outside your lab but in your field, and someone outside your field.
• Two parts: metadata and methods




                                  Photo: “striatic,” http://www.flickr.com/photos/striatic/2144933705/
Metadata

• About the project
  • Title, people, key dates, funders and grants
• About the data
  • Title, key dates, creator(s), subjects, rights, included files, format(s),
    versions, checksums
• Keep this with the data.



                                        Photo: Paul Downey, http://www.flickr.com/photos/psd/422206144/
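
One possible way to keep metadata with the data is a small sidecar file written into the data folder itself. The sketch below is an illustration under assumptions: the file name `metadata.json`, the record layout, and the function names are invented here, not a standard. Check whether your field has a metadata standard before settling on a layout of your own.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(data_dir, project, dataset, filename="metadata.json"):
    """Write project and dataset metadata, plus a checksummed file list,
    into the data folder itself. `project` and `dataset` are plain dicts
    (title, people, key dates, funders, rights, and so on)."""
    data_dir = Path(data_dir)
    files = [
        {
            "path": p.relative_to(data_dir).as_posix(),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(data_dir.rglob("*"))
        # Skip the metadata file itself so reruns give a stable list.
        if p.is_file() and p.name != filename
    ]
    record = {"project": project, "dataset": dataset, "files": files}
    (data_dir / filename).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

The checksums let Future You confirm that a file is still byte-for-byte what it was when the record was written.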
Methods
• Reason #1 for not reusing someone else’s
  data: “I don’t know enough about how it was
  gathered to trust it.”
• Document what you did. (A published article
  may or may not be enough.)
• Document any limitations of what you did.
• If you ran code on the data, document the
  code and keep it with the data.
• Need a codebook? Or a data dictionary?
  • If I can’t identify at sight what each bit of your dataset means, yes, you
    do need a codebook or data dictionary.
  • DO NOT FORGET UNITS!
                                  Photo: Joe Sullivan, http://www.flickr.com/photos/skycaptaintwo/90415435/
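
A data dictionary can be as simple as one entry per column, and a few lines of code can flag columns that nobody has described yet. This is a hedged sketch: the column names, descriptions, and units below are invented examples, and `undocumented_columns` is a hypothetical helper, not part of any library.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column, each with units.
DATA_DICTIONARY = {
    "site_id": {"description": "Sampling site identifier", "units": "none (code)"},
    "water_temp": {"description": "Water temperature at 1 m depth", "units": "degrees Celsius"},
    "depth": {"description": "Depth of the sensor below the surface", "units": "meters"},
}


def undocumented_columns(csv_text, dictionary):
    """Return column names that lack a dictionary entry, a description, or units."""
    header = next(csv.reader(io.StringIO(csv_text)))
    problems = []
    for name in header:
        entry = dictionary.get(name, {})
        if not entry.get("description") or not entry.get("units"):
            problems.append(name)
    return problems
```

For example, `undocumented_columns("site_id,ph\nA1,7.2\n", DATA_DICTIONARY)` reports `ph`, because nothing in the dictionary says what it measures or in what units.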
Standards

• Why reinvent the wheel? If there’s a standard
  format for your data or how to describe it,
  use that!
• The tricky part is finding the right standard.
  • Standards are like toothbrushes...
  • But using standards is good hygiene!
  • Your librarian can often help you find relevant standards.



                                    Photo: Kenneth Lu, http://www.flickr.com/photos/toasty/412580888/
Where to put your data



 Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
Storage, short-term
• Your own drive (PC, server, flash drive, etc.)
  • And if you lose it? Or it breaks?
• Somebody else’s drive
  • Departmental drive
  • “Cloud” drive
  • Do they care as much about your data as you do?
• What about versioning?
• Library motto: Lots Of Copies Keeps Stuff Safe.
  • Two onsite copies, one offsite copy.
  • Keep confidentiality and security requirements in mind, of course.

                            Photo: Vadim Molochnikov, http://www.flickr.com/photos/molotalk/3305001454/
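
Lots of copies only keep stuff safe if the copies still match. The following is a rough sketch of one way to check a backup against the original by comparing checksums; it assumes both copies are reachable as folders, and the function names are made up for illustration. It reads each file fully into memory, so very large files would need a chunked approach instead.

```python
import hashlib
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def compare_copies(original, copy):
    """Return files that are missing from, or differ in, the copy."""
    source, target = checksums(original), checksums(copy)
    missing = sorted(set(source) - set(target))
    changed = sorted(
        name for name in source if name in target and source[name] != target[name]
    )
    return {"missing": missing, "changed": changed}
```

An empty result for both `missing` and `changed` means every file in the original has an identical twin in the copy. It says nothing about extra files that exist only in the copy.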
Storage, long-term
• No, gold CD-ROMs don’t cut it.
• If data need to persist beyond project end, you
  have to deal with a new kind of risk:
  organizational risk.
  • Servers come and go. So do labs. So do entire departments.
  • In the churn, your data may well be lost or destroyed.
  • This is especially important if you share data! Don’t let it 404!
• You need to find a trustworthy partner.
  • On campus: try the library.
  • Off campus: look for a disciplinary data repository, or a journal that accepts
    data. (It’s a good idea to do this as part of your planning process.)
  • Let somebody else worry! You have new projects to get on with.
                                Photo: Simon Davison, http://www.flickr.com/photos/suzanneandsimon/84038024/
Summing up



Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
So, these “data management plans...”
• Here’s what MIT suggests should be in them:
  • name of the person responsible for data management within your research project
  • description of data to be collected
  • how data will be documented
  • data quality issues
  • backup procedures
  • how data will be made available for public use and potential secondary uses
  • preservation plans
  • any exceptional arrangements that might be needed to protect participant
    confidentiality
• Feel like common sense now? Good.
                                          Source: http://libraries.mit.edu/guides/subjects/data-management/
Help on campus

• “What’s Your Data Plan?” website:
  http://dataplan.wisc.edu/
  • Use the contact page!
• Your department’s liaison librarian
  • We can help you find how-tos, relevant standards, on- and off-campus
    archiving services, etc.
• MINDS@UW: http://minds.wisconsin.edu/
  • Data in final form that make sense as discrete files.


                                Photo: Jordan Pérez Nobody, http://www.flickr.com/photos/jp-/2548073841/
Thank you!
This presentation is available under a Creative Commons Attribution 3.0 license.

If you reuse it, please remember to credit the included photographs.

    Photo: Steve Punter, http://www.flickr.com/photos/spunter/2554405690/
