Good data practices for graduate students
1. Graduate Office Student Success Series
GOOD DATA PRACTICES FOR RESEARCH
January 12, 2012
Heather Coates, MLS, MS | Digital Scholarship & Data Management Librarian
2. CONTEXT: DATA LIFECYCLE
Source: DDI Structural Reform Group. “DDI Version 3.0 Conceptual Model.” DDI Alliance. 2004. Accessed on 11 August 2008.
<http://www.icpsr.umich.edu/DDI/committee-info/Concept-Model-WD.pdf>.
4. PLAN AHEAD: BEYOND THE PROTOCOL
Plan early, before data collection
Identify ethical and legal issues
Define the data model
Think about a data organization strategy
Identify the most appropriate tools: instruments & software
5. ETHICAL & LEGAL ISSUES
Privacy
Are there people (human subjects) involved in your project? Animals?
Does the study involve personal or health information? Can it be used to
identify an individual?
Copyright
Are you using copyrighted data?
Have you sought permission?
Intellectual Property
You should cite any product that you use for your project: data,
publications, software, etc.
6. DESCRIBING YOUR DATA
Describe the research project
Describe overall organization of your dataset
Describe your data files
Describe the methods used to create your data
Describe measurement techniques (protocols, instruments)
Data processing – why, how, assumptions
Sensor network, taxonomic information, spatial location
Choose & use standard terminology (concepts, methods, tools)
Identify and use relevant metadata standards
Data citation
Describe the timeframe
7. HANDLING DATA FILES
Create, manage, and document your data storage system
Use descriptive file names
Define formats for date and time, units of measurement, parameters, missing value codes, and estimated values
Use consistent codes
Use appropriate field delimiters
Store data values separately from data annotations or notes
Store data at the right level of precision
Quality assurance & data integrity
Version control & authenticity
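As an illustration of the file-handling points above, here is a minimal Python sketch that reads a small data file with explicitly defined missing value codes, an unambiguous date format, units in the column name, and notes kept in their own column. The column names, the missing codes (-999, NA), and the sample rows are all hypothetical and chosen only for the example; your own codebook would define the real ones.

```python
import csv
import io
from datetime import date

# Hypothetical codebook entry: values that mean "missing" in this dataset.
MISSING_CODES = {"-999", "NA", ""}


def clean(value):
    """Strip whitespace and map any documented missing code to None."""
    value = value.strip()
    return None if value in MISSING_CODES else value


def parse_row(row):
    """Convert one raw CSV row into typed values."""
    visit = clean(row["visit_date"])
    weight = clean(row["weight_kg"])
    return {
        "subject_id": row["subject_id"].strip(),
        # Dates are stored as ISO 8601 (YYYY-MM-DD), so parsing is unambiguous.
        "visit_date": date.fromisoformat(visit) if visit else None,
        # The unit (kilograms) is recorded in the column name.
        "weight_kg": float(weight) if weight else None,
        # Free-text notes stay in their own column, apart from the data values.
        "notes": clean(row["notes"]),
    }


# Hypothetical sample data, invented for this example.
SAMPLE = """subject_id,visit_date,weight_kg,notes
S001,2012-01-12,70.5,
S002,2012-01-13,-999,scale malfunction
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r)
```

Mapping the missing codes to None at read time keeps a placeholder such as -999 from being averaged into results by accident.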
8. STORAGE & BACKUP
Back up your data at regular intervals, keeping 3 copies: local, semi-local, and remote
Document your backup strategy
Make sure backup locations are secure and accessible
Use standard file formats
Non-proprietary, open format
Commonly used in your community
Unencrypted*
Uncompressed*
9. PROCESSING & ANALYSIS
Defining your research questions and documenting your data are
iterative processes that inform each other and are never done until
the project is complete
Developing good documentation will make analysis easier and more
efficient
Having good documentation will make writing your
paper/thesis/dissertation much easier
Use your readme or codebook files as source documents for your
methods sections
Having good documentation will identify problems sooner, when
it may be possible to resolve them or minimize the damage to
your data
10. RESOURCES
@IU
IUWare
IUanyWARE
StatMath
ITTraining
RFS & SDA
Open access/public use data sets
DataCite
ICPSR
Data.gov
Subject liaison librarians can assist in locating data on your topic
11. THANK YOU
Find us at http://ulib.iupui.edu/digitalscholarship
Heather Coates, MLS, MS
Digital Scholarship & Data Management Librarian
hcoates@iupui.edu
317-278-7125
Editor's Notes
Be aware of the research process, so you have some context for your experience. This can also help you organize your thoughts about carrying out your projects.
Goal: help you translate your research protocol into a practical plan to carry out your project/study. Although these steps take some extra time at the beginning of your project, they will make analysis and writing much, much easier because you will be clear about what was done.
-data model: map out relationships between data, especially aggregated or calculated variables; translate research questions into analyses, then map them to the data to be used; this can be particularly important if you are integrating data from multiple sources or have large quantitative datasets
-data organization strategy: it should be part of the planning process and answer where, when, and how; more on this in the next slide
-software: IUWare, IUanyWare, StatMath, RFS, SDA (links on handout)
-ethical & legal issues: confidentiality, privacy, HIPAA, intellectual property, and copyright issues may arise; discuss these potential problems with your advisor (links for further information on handout)
-although facts cannot be copyrighted, specific instances of them (such as a database) can be
-research project: one option is to write a structured abstract (see handout)
-dataset organization: use your plan and update it as things change (more on the next slide)
-describe your data files: what do you need to know to interpret the data? parameters, units, definitions of coded values and missing values
-methods:
-standards: don’t deviate from standards in your discipline or research community unless you have a good reason for doing so; these standards reflect a common understanding and help to make data interoperable
-citation: if you use someone else’s data, you should document and cite it: source, URL/DOI, detailed title of dataset, version information, date retrieved, authors/creators, brief description
-timeframe: particularly if you’re using data from multiple sources or collecting data over a period of time, this needs to be documented clearly
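One way to make sure each of the citation elements listed above gets recorded is to keep them in a small structured record. The Python sketch below is only an illustration: the class name, field names, output layout, and the example values (including the example.org URL) are hypothetical, and your discipline's citation style should determine the final format.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """The citation elements named in the notes, kept as one record."""
    creators: str
    title: str
    version: str
    source: str
    identifier: str   # URL or DOI
    retrieved: str    # date retrieved, written as YYYY-MM-DD
    description: str = ""

    def format(self):
        """Join the elements into one line; the layout is an assumption."""
        return (f"{self.creators}. {self.title}, version {self.version}. "
                f"{self.source}. {self.identifier}. Retrieved {self.retrieved}.")


# Placeholder values only; this is not a real dataset.
citation = DatasetCitation(
    creators="Example Research Group",
    title="Example Survey Dataset",
    version="1.0",
    source="Example Data Archive",
    identifier="https://example.org/dataset",
    retrieved="2012-01-12",
    description="Placeholder entry for illustration.",
)
print(citation.format())
```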
-data typing: use the appropriate field type for each value (a date field for dates); put comments in a separate column
-document your folder structure & file naming system
-don’t rely on the computer’s time and date metadata; it’s not reliable and can be manipulated
-keep file names short but descriptive; use a coding system to include project name, file contents, date, etc.
-QA & data integrity: minimize opportunities to introduce human error, automate processing, check and verify periodically
-version control & authenticity: especially important if multiple people are working on the same dataset; keep copies of your data before and after each major processing step; this can save you lots of work if errors creep in, because you won’t have to start all over from the raw data; document how this is done
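The file-naming advice above can be sketched as a small helper that puts the project name, file contents, date, and version into the name itself, so nothing depends on the computer's time and date metadata. The pattern used here (project_contents_YYYYMMDD_vNN.ext) and the function name are assumptions for illustration, not a prescribed standard.

```python
import re
from datetime import date


def build_file_name(project, contents, on, version, extension):
    """Build a short, descriptive name: project_contents_YYYYMMDD_vNN.ext."""
    def slug(text):
        # Lowercase the text and replace runs of other characters with hyphens.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return f"{slug(project)}_{slug(contents)}_{on:%Y%m%d}_v{version:02d}.{extension}"


# Hypothetical project and contents labels.
name = build_file_name("Sleep Study", "raw survey", date(2012, 1, 12), 1, "csv")
print(name)  # sleep-study_raw-survey_20120112_v01.csv
```

Writing the date as YYYYMMDD means that files sort chronologically when listed by name.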
-backup strategy: a quick and dirty way to verify backups is to check file quantity and file size, and to randomly check values in the original and the copies
-if you need to share or transfer files, use Slashtmp instead of a flash drive, especially if the files contain human subjects data
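The quick backup check described above (file quantity, file size, and a random spot check of contents) could be automated along these lines. This is a minimal Python sketch using only the standard library; the function names and the default sample size are assumptions, and the spot check compares whole files rather than individual values.

```python
import random
from pathlib import Path


def list_files(root):
    """Map each file's path (relative to root) to its size in bytes."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): p.stat().st_size
            for p in root.rglob("*") if p.is_file()}


def verify_backup(original, backup, sample_size=5, seed=None):
    """Return a list of problems found; an empty list means the checks passed."""
    orig_files, copy_files = list_files(original), list_files(backup)
    problems = []
    # Check 1: file quantity (every original file should exist in the backup).
    for name in sorted(orig_files.keys() - copy_files.keys()):
        problems.append(f"missing from backup: {name}")
    # Check 2: file size.
    same_size = []
    for name in sorted(orig_files.keys() & copy_files.keys()):
        if orig_files[name] == copy_files[name]:
            same_size.append(name)
        else:
            problems.append(f"size differs: {name}")
    # Check 3: compare the contents of a random sample of same-size files.
    rng = random.Random(seed)
    for name in rng.sample(same_size, min(sample_size, len(same_size))):
        if Path(original, name).read_bytes() != Path(backup, name).read_bytes():
            problems.append(f"contents differ: {name}")
    return problems
```

For example, `verify_backup("project_data", "/mnt/backup/project_data")` (both paths hypothetical) returns an empty list when all three checks pass. Because the contents check only samples files, it can miss a corrupted file that was not drawn.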