Reliability and Validity (Am I hitting the target?)
[Figure: three targets labelled "Reliable not valid", "Valid not reliable" and "Reliable and valid"]
Question: What Is Reliability?
Answer: Reliability refers to the consistency of a measure. A measure is considered
reliable if we get the same result repeatedly. A research method is considered
reliable if we can repeat it and get the same results. There are several different ways
to estimate or improve reliability depending on your research method.
Retest Reliability (most common way of testing reliability is to simply repeat
the test or experiment)
Retest reliability is best used for things that are stable over time, for example research
methods that are highly controlled and therefore should be consistent (the same)
when carried out in the same way again. This is why laboratory experiments have
the advantage of usually being highly reliable: they are controlled, so they can
be repeated and the same results found. To improve retest reliability, improve
control over extraneous variables.
Test-retest reliability is assessed by administering the same test or questionnaire
again at different points in time. This kind of reliability is used to assess the
consistency of a test or questionnaire over time. For example, an IQ test would be
given at least twice at different times to check that the same IQ score is registered,
making the score reliable.
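One way to put a number on test-retest consistency is to correlate the scores from the
first and second sittings. The sketch below assumes Pearson's correlation as the measure
(these notes do not prescribe one for retest reliability) and uses made-up IQ scores; the
function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Made-up IQ scores for the same five participants, tested twice (illustration only)
first_test = [100, 110, 95, 120, 105]
second_test = [102, 108, 97, 118, 106]

r = pearson_r(first_test, second_test)
print(f"test-retest correlation: {r:.2f}")
```

A value close to 1 suggests that participants scored much the same on both occasions; a
value near 0 suggests the test is not giving consistent scores over time.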
Inter-rater Reliability (most often used in observations which are difficult to
simply repeat due to lack of control over extraneous variables. We have to
come up with more creative ways of ensuring reliability in observations!)
This type of reliability is assessed by having two or more independent observers.
Observation studies are often carried out in natural conditions where we have no
control – so we cannot simply repeat them as so many things (extraneous variables)
will change like the weather or the number or type of people present etc.
The scores of the different observers are collected together and are then compared
to determine the consistency of the raters’ scores. Next, you would calculate the
correlation between the ratings to determine the level of inter-rater reliability.
Another way of assessing inter-rater reliability is to determine the percentage of
agreement between the raters. So, if the raters agree 8 out of 10 times, the
observation has an 80% inter-rater reliability rate.
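The percentage-of-agreement calculation above can be sketched as follows. The observer
codings are made up for illustration, and the function name percent_agreement is
hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of observations on which two raters recorded the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of observations")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)


# Made-up codings of the same 10 behaviours by two independent observers
observer_1 = ["smile", "frown", "smile", "smile", "frown",
              "smile", "smile", "frown", "smile", "smile"]
observer_2 = ["smile", "frown", "smile", "smirk", "frown",
              "smile", "smile", "frown", "frown", "smile"]

print(percent_agreement(observer_1, observer_2))  # 8 of 10 codes match -> 80.0
```

Percentage agreement suits category codes like these; when the observers give numerical
ratings, the correlation between the two sets of ratings is the more usual check.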
Parallel-Forms Reliability (use another method to compare and see if the
results are the same)
Parallel-forms reliability is gauged by comparing two different tests or two different
measures of the same thing (strictly, two equivalent versions of the same test). For
example, if I want to make sure my observation is reliable, I could also use a
self-report method like a questionnaire on the same topic to see whether it gives
the same results.
Internal Consistency Reliability
This form of reliability is used to judge the consistency of results across items on the
same test or on the same questionnaire. You are then comparing test items or
questions that measure the same behaviour to determine the test's internal
consistency. When you see a question that seems very similar to another question, it
may indicate that the two questions are being used to gauge reliability. Because the
two questions are similar and designed to measure the same thing, the participant
should answer both questions the same. This would indicate that the test has
internal consistency and so is ‘reliable’.
Question: What is Validity?
Answer: Validity is the extent to which a test measures what it claims to measure.
For example, does an IQ test really measure your intelligence, or does a personality
test really measure your personality? When we measure behaviour in a laboratory,
are we measuring the same behaviour as in real life? When we measure behaviour
on a questionnaire, are people telling us the truth, or are we just measuring what they
think we want them to tell us?
Ecological Validity: A test has ecological validity when it measures behaviour that is
true to real life. Most experiments are carried out in settings that are not true to real
life. This can cause participants to be more co-operative and to adopt different
behaviours to suit the experimental conditions. This is why researchers like interviews,
questionnaires and observations, as they are more likely to represent real life.
Experimental Validity or Content Validity: Have I defined and measured my
variable accurately? Sometimes this is easy (your height, weight, blood pressure,
speed or accuracy etc) and sometimes this is difficult (your mood, thoughts,
personality or your IQ etc). If my observation is measuring happiness, have I defined
happiness so accurately that I am sure I am really measuring it and not leaving
anything out? Internal validity is difficult to achieve with observations alone, as we
cannot see what people are thinking, so we may be recording their behaviours
inaccurately, or we may not have thought of some of the different ways people might
show happiness etc. If the variable you are measuring is difficult to define, then it
will also be difficult to measure, and you can argue it might lack validity.
Predictive Validity occurs when the data obtained helps predict future behaviour.
For example, if my IQ test says you are very clever, then it should predict that you will
do well in other tests, and it should help determine who is likely to succeed or fail in
certain subjects or occupations. If it does, it has predictive validity; if it doesn’t,
it has low predictive validity. Do you think Milgram’s study on
obedience would help us predict that we might all do things to harm someone else if
ordered to do so by someone in authority, in a range of different situations?
Two factors most affect validity (validity stinks: B.O.): bias and
operationalising (measuring) your variable. So whether it is an experiment, an
observation or a self report, if you want to improve your validity you need to make
sure you are measuring your variable accurately and that you are reducing any
bias like experimenter bias, observer bias, demand characteristics or social
desirability.
Experiments and increasing validity: don’t tell the participants what you are
researching to reduce the demand characteristics. Be careful how you measure the
variables. Make sure there is no experimenter bias; perhaps get someone else to
record the data.
Observations and increasing validity: don’t let the participants know you are
watching to avoid demand characteristics. Have more than one observer to reduce
observer bias. Don’t let the observers know the aim of the research. Do not design
a participant observation as this might make your observation even more biased.
Make sure you operationalise your variable carefully and do pilot studies to make
sure you have not missed out anything that should have been recorded when
measuring the particular type of behaviour. Do a pilot study to make sure each
observer agrees that they are ‘seeing’ behaviours in the same way (for example, was
that a smile, a smirk or a frown?).
Self Reports and increasing validity: try to hide the purpose of the research to
avoid demand characteristics (ask a range of different questions which you do not
intend to measure, to make it difficult to guess the real aim of the questions). In an
interview, have a third party who does not know the aim of the research do the
interviewing, to avoid researcher bias. If you are asking private or personal questions,
get people to answer online or via a computer, as they are more likely to tell the truth
and not give socially desirable answers. Put lie detector questions into your
questionnaire, like ‘do you always clean your teeth twice a day?’, as people who answer
yes to this may not be answering accurately or with care.
Simple things to remember:
Experiments are reliable because they are controlled and so should be repeatable.
Observations are not reliable because they are not controlled and so are not
repeatable.
To improve reliability in an experiment, try to control more extraneous
variables, i.e. improve the consistency. To check reliability just repeat it!
To improve the reliability of an observation have more observers. To check
reliability collate their data and see if it matches!
Observations, self reports and experiments can all have their validity reduced
if they do not measure the behaviour clearly and accurately.
To improve validity, reduce bias by: not disclosing yourself in an observation;
not telling the participants the true nature of an experiment; hiding the true
purpose of a questionnaire in the detail of distractor questions etc!