Questionnaire validation is the process in which the creators review a questionnaire to determine whether it measures what it was designed to measure. If validation succeeds, the creators label the questionnaire as valid. This validity comes in different forms, depending on the method used for the validation procedure.
3. Rorschach test
• At the end of the test, the tester says …
–you need therapy
–or you can't work for this company
4. Psychological Testing
• Occurs widely …
– in personnel selection
– in clinical settings
– in education
What constitutes a good test?
5. Validity and Reliability
• Validity: How well does the measure or
design do what it purports to do?
• Reliability: How consistent or stable is
the instrument?
–Is the instrument dependable?
7. Overview of Validity and Reliability
• Validity
– Logical: Face, Content
– Statistical: Criterion (Predictive, Concurrent), Construct (Convergent, Divergent/Discriminant)
• Reliability: Consistency, Objectivity
8. Face Validity
– Infers that a test is valid by face value
– It is clear that the test measures what it is supposed to
– As a check on face validity, test/survey items are sent to experts to obtain suggestions for modification.
– Because of its vagueness and subjectivity, psychometricians abandoned this concept long ago.
9. Content Validity
– Infers that the test measures all aspects contributing to the variable of interest
– Face validity vs. content validity:
• Face validity can be established by one person
• Content validity should be checked by a panel, and thus it usually goes hand in hand with inter-rater reliability (Kappa!)
10. Example:
• Computer literacy includes skills in operating systems, word processing, spreadsheets, databases, graphics, the internet, and many others.
• It is difficult to administer a test covering all aspects of computing. Therefore, only several tasks are sampled from the universe of computer skills.
• A test of computer literacy should be written or reviewed by computer science professors or senior programmers in the IT industry, because it is assumed that computer scientists know what is important in their own discipline.
11. Overall:
A logically valid test simply appears to measure the right variable in its entirety.
Subjective!!!
12. The Content Validity Index
Content validity has been defined as follows:
• (1) "...the degree to which an instrument has an appropriate sample of items for the construct being measured" (Polit & Beck, 2004, p. 423);
• (2) "...whether or not the items sampled for inclusion on the tool adequately represent the domain of content addressed by the instrument" (Waltz, Strickland, & Lenz, 2005, p. 155);
• (3) "...the extent to which an instrument adequately samples the research domain of interest when attempting to measure phenomena" (Wynd, Schmidt, & Schaefer, 2003, p. 509).
13. Two types of CVIs
• Content validity of individual items (I-CVI)
• Content validity of the overall scale (S-CVI)
• Researchers use I-CVI information to guide them in revising, deleting, or substituting items
• I-CVIs tend only to be reported in methodological studies that focus on descriptions of the content validation process
• Most often reported in scale development studies is the scale-level CVI (S-CVI)
14. CVI terms
• CVI: Degree to which an instrument has an appropriate sample of items for the construct being measured
• I-CVI: Content validity of individual items
• S-CVI: Content validity of the overall scale
• S-CVI/UA: Proportion of items on a scale that achieve a relevance rating of 3 or 4 by all the experts
• S-CVI/Ave: Average of the I-CVIs for all items on the scale
15. Example expert rating form
Each expert rates every item (Q1-Q5) on a 1-4 scale against four questions, with space for comments:
• Consistency: Is each item in the instrument consistent?
• Representativeness: Are the items representative of concepts related to the dissertation topic?
• Relevance: Are the items relevant to concepts related to the dissertation topic?
• Clarity: Are the items clear in their wording?
Ratings:
1 = not relevant, 2 = somewhat relevant, 3 = quite relevant, 4 = highly relevant
17. The recommended acceptable standard for the S-CVI is a minimum of .80.
If the I-CVI is higher than .79, the item is appropriate.
If it is between .70 and .79, the item needs revision.
If it is less than .70, the item is eliminated.
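A minimal sketch of how these indices can be computed from an expert rating matrix. The ratings below are hypothetical; the "relevant" rule (a rating of 3 or 4) follows the scale on the rating form above.

```python
import numpy as np

# Hypothetical ratings: rows = items (Q1..Q5), columns = experts, values on the 1-4 relevance scale.
ratings = np.array([
    [4, 3, 4, 4],
    [3, 4, 3, 4],
    [2, 3, 4, 3],
    [4, 4, 4, 4],
    [1, 2, 3, 2],
])

relevant = ratings >= 3            # a rating of 3 or 4 counts as "relevant"
i_cvi = relevant.mean(axis=1)      # I-CVI: proportion of experts rating each item 3 or 4
s_cvi_ua = (i_cvi == 1.0).mean()   # S-CVI/UA: proportion of items rated relevant by ALL experts
s_cvi_ave = i_cvi.mean()           # S-CVI/Ave: average of the I-CVIs across items

for idx, value in enumerate(i_cvi, start=1):
    status = "appropriate" if value > 0.79 else "needs revision" if value >= 0.70 else "eliminate"
    print(f"Q{idx}: I-CVI = {value:.2f} -> {status}")
print(f"S-CVI/UA = {s_cvi_ua:.2f}, S-CVI/Ave = {s_cvi_ave:.2f} (recommended minimum S-CVI = .80)")
```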
18. The kappa statistic is a consensus index of inter-rater agreement that adjusts for chance agreement. It is an important supplement to the CVI because kappa provides information about the degree of agreement beyond chance.
Evaluation criteria for kappa values:
above 0.74 = excellent
between 0.60 and 0.74 = good
between 0.40 and 0.59 = fair
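For the simple two-rater case, chance-corrected agreement can be computed as Cohen's kappa, for example with scikit-learn (assumed installed); the expert judgments below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical item-level judgments from two experts (1 = relevant, 0 = not relevant).
expert_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
expert_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"kappa = {kappa:.2f}")  # interpret with the criteria above: >0.74 excellent, 0.60-0.74 good, 0.40-0.59 fair
```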
23. Criterion Validity
• This type of validity is used to measure the ability of an instrument to predict future outcomes.
• Validity is usually determined by comparing two instruments' ability to predict a similar outcome when a single variable is being measured.
• There are two major types of criterion validity: predictive and concurrent.
24. Criterion validity
• A test has high criterion validity if
– it correlates highly with some external benchmark (concurrent)
– it correlates well with outcome criteria (predictive)
• E.g.
– If you have lost 30 pounds and your scale reported that you lost 30 pounds, you would expect that your clothes would also feel looser
– A 'Warwick spider phobia questionnaire' should show a positive correlation with the SPQ
25. Concurrent Criterion Validity
• Concurrent criterion validity is used when
the two instruments are used to measure the
same event at the same time.
• Example:
26. Predictive Criterion Validity
• Predictive validity is used when the instrument is administered, time is allowed to pass, and the result is then measured against another outcome.
• Example:
27. Criterion validity
• When the focus of the test is on criterion validity, we draw an inference from test scores to performance. A high score on a valid test indicates that the test taker has met the performance criteria.
• Regression analysis can be applied to establish criterion validity: the test score is used as the predictor variable and the outcome as the criterion variable. The correlation coefficient between them is called the validity coefficient.
28. How is Criterion Validity Measured?
• The correlation coefficient tells the degree to which the instrument is valid based on the measured criteria.
• What does it look like in an equation?
– The symbol "r" denotes the correlation coefficient.
– A higher positive "r" value shows a stronger positive relationship between the instruments.
– A negative "r" value shows an inverse relationship.
30. • As a rule of thumb, for absolute value of r:
• 0.00-0.19: very weak
• 0.20-0.39: weak
• 0.40-0.59: moderate
• 0.60-0.79: strong
• 0.80-1.00: very strong.
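A minimal sketch of estimating a validity coefficient: correlate scores from a new instrument with an established benchmark administered to the same people, here with hypothetical data and scipy.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores: a new questionnaire and an established benchmark for the same respondents.
new_instrument = np.array([12, 18, 25, 30, 22, 15, 28, 20, 33, 10])
benchmark      = np.array([14, 20, 27, 31, 21, 17, 30, 19, 35, 12])

r, p_value = pearsonr(new_instrument, benchmark)
print(f"validity coefficient r = {r:.2f} (p = {p_value:.3f})")
# An |r| of 0.80-1.00 would be "very strong" by the rule of thumb above.
```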
31. Overview of Validity and Reliability (repeated)
• Validity
– Logical: Face, Content
– Statistical: Criterion (Predictive, Concurrent), Construct (Convergent, Divergent/Discriminant)
• Reliability: Consistency, Objectivity
32. Construct validity
• Measuring things that are in our theory of a domain.
• The construct is sometimes called a latent variable
– You can't directly observe the construct
– You can only measure its surface manifestations
– Because it is concerned with abstract and theoretical constructs, construct validity is also known as theoretical validity
33. What are Latent Variables?
• Most/all variables in the social world are not directly
observable.
• This makes them ‘latent’ or hypothetical constructs.
• We measure latent variables with observable
indicators, e.g. questionnaire items.
• We can think of the variance of an observable
indicator as being partially caused by:
– The latent construct in question
– Other factors (error)
34. Example: observed indicators of the latent construct "Math anxiety"
• I cringe when I have to go to math class.
• I am uneasy about going to the board in a math class.
• I am afraid to ask questions in math class.
• I am always worried about being called on in math class.
• I understand math now, but I worry that it's going to get really difficult soon.
35. • Specifying formative versus reflective constructs is a
critical preliminary step prior to further statistical
analysis. Specification follows these guidelines:
• Formative
– Direction of causality is from measure to construct
– No reason to expect the measures are correlated
– Indicators are not interchangeable
• Reflective
– Direction of causality is from construct to measure
– Measures expected to be correlated
– Indicators are interchangeable
– An example of formative versus reflective constructs is given in the
figure below.
36. [Figure: example of formative versus reflective constructs, referenced above]
38. Factor model
• A factor model identifies the relationship between observed items and latent factors. For example, when a psychologist wants to study the causal relationships between math anxiety and job performance, he or she first has to define the constructs "math anxiety" and "job performance." To accomplish this step, the psychologist needs to develop items that measure the defined constructs.
39. Construct, dimension,
subscale, factor, component
• This construct has eight dimensions (e.g.
Intelligence has eight aspects)
• This scale has eight subscales (e.g. the survey
measures different but weakly related things)
• The factor structure has eight
factors/components (e.g. in factor
analysis/PCA)
40. Exploratory Factor Analysis
• (EFA) is a statistical approach to determining the
correlation among the variables in a dataset.
• This type of analysis provides a factor structure (a
grouping of variables based on strong correlations).
• EFA is good for detecting "misfit" variables. In
general, an EFA prepares the variables to be used for
cleaner structural equation modeling. An EFA
should always be conducted for new datasets.
41. The Kaiser-Meyer-Olkin measure of sampling adequacy tests whether the partial correlations among variables are small.
KMO statistics:
Marvelous: .90s
Meritorious: .80s
Middling: .70s
Mediocre: .60s
Miserable: .50s
Unacceptable: <.50
42. Bartlett’s Test of Sphericity
• Tests hypothesis that correlation matrix is an identity matrix.
• Diagonals are ones
• Off-diagonals are zeros
• A significant result (Sig. < 0.05) indicates matrix is not an
identity matrix; i.e., the variables do relate to one another enough
to run a meaningful EFA.
Anti-image
• The anti-image correlation matrix contains the negatives of the partial correlation
coefficients, and the anti-image covariance matrix contains the negatives of the
partial covariances.
• In a good factor model, most of the off-diagonal elements will be small. The
measure of sampling adequacy for a variable is displayed on the diagonal of the
anti-image correlation matrix.
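Both checks can be run in Python with the factor_analyzer package (assumed installed); the item responses below are hypothetical, and in practice you would load your own survey data.

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical item responses (rows = respondents, columns = survey items).
df = pd.DataFrame({
    "item1": [3, 4, 2, 5, 4, 3, 2, 4, 5, 3, 4, 2],
    "item2": [3, 5, 2, 4, 4, 3, 1, 4, 5, 2, 4, 3],
    "item3": [2, 4, 3, 5, 3, 2, 2, 5, 4, 3, 5, 2],
    "item4": [4, 4, 2, 5, 5, 3, 2, 3, 5, 3, 4, 1],
    "item5": [3, 3, 1, 4, 4, 2, 2, 4, 4, 2, 3, 2],
    "item6": [4, 5, 2, 5, 4, 3, 1, 4, 5, 3, 4, 2],
})

chi_square, p_value = calculate_bartlett_sphericity(df)  # significant p (< .05): not an identity matrix
kmo_per_item, kmo_overall = calculate_kmo(df)            # judge the overall KMO against the bands above

print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_overall:.2f}")
```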
46. Communalities
• A communality is the extent to which an item correlates with all other items.
• Higher communalities are better. If the communality for a particular variable is low (between 0.0 and 0.4), then that variable will struggle to load significantly on any factor.
• In the communalities output, identify low values in the "Extraction" column. Low values indicate candidates for removal after you examine the pattern matrix.
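Continuing the sketch above, an EFA can be fitted and the communalities and loadings inspected with factor_analyzer's FactorAnalyzer; the two-factor choice here is purely illustrative.

```python
from factor_analyzer import FactorAnalyzer

# Fit an exploratory factor analysis with an oblique rotation; pick the number of factors
# in practice with parallel analysis or a scree plot rather than fixing it at 2.
fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa.fit(df)  # df: the item-response DataFrame from the KMO/Bartlett sketch

print(fa.get_communalities())  # items with communalities below ~0.4 are removal candidates
print(fa.loadings_)            # pattern of item-factor loadings
```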
47. Parallel analysis
• Parallel analysis is a method for determining the number of components or factors to retain from PCA or factor analysis. Essentially, the program works by creating a random dataset with the same numbers of observations and variables as the original data.
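A minimal numpy sketch of the idea: compare the observed eigenvalues with those obtained from random data of the same size, and keep the components whose observed eigenvalue is larger.

```python
import numpy as np

def parallel_analysis(data, n_simulations=100, seed=0):
    """Suggest how many components to retain: keep those whose observed eigenvalue
    exceeds the mean eigenvalue from random data of the same shape."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_simulations, n_vars))
    for i in range(n_simulations):
        random_data = rng.standard_normal((n_obs, n_vars))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False)))[::-1]
    threshold = random_eigs.mean(axis=0)
    return int((observed > threshold).sum()), observed, threshold
```

With the DataFrame from the earlier sketches, `parallel_analysis(df.to_numpy())` would return the suggested number of components along with the observed and random-data eigenvalues.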
54. Using the FACTOR software, parallel analysis can also be run for binary data.
55. Establishing construct validity
• Convergent validity
– Agrees with other measures of the same thing
• Divergent/Discriminant validity
– Different tests measure different things
– Does the test have the “ability to discriminate”?
– (Campbell & Fiske, 1959)
56. Construct validity
Construct validity is the extent to which a set of measured items actually reflects the theoretical latent construct those items are designed to measure. Thus, it deals with the accuracy of measurement.
Construct validity is made up of TWO important components, which are:
1) Convergent validity: the items that are indicators of a specific construct should converge, or share a high proportion of variance in common. The ways to estimate the relative amount of convergent validity among item measures:
57. Discriminant Validity:
the extent to which a construct is truly distinct from other constructs. To test discriminant validity, the AVE for two factors should be greater than the square of the correlation between the two factors to provide evidence of discriminant validity.
• Discriminant validity can be tested by examining the AVE for each construct against the squared correlations (shared variance) between that construct and all other constructs in the model.
• A construct has adequate discriminant validity if its AVE exceeds the squared correlations among the constructs (Fornell & Larcker, 1981; Hair et al., 2006).
58. Factor Loading:
at a minimum, all factor loadings should be statistically significant. A good rule of thumb is that standardized loading estimates should be .5 or higher, and ideally .7 or higher.
Average Variance Extracted (AVE):
the average squared factor loading. An AVE of .5 or higher is a good rule of thumb suggesting adequate convergence. An AVE less than .5 indicates that, on average, more error remains in the items than variance explained by the latent factor structure imposed on the measure (Hair et al., 2006, p. 777).
Construct Reliability: construct reliability should be .7 or higher to indicate adequate convergence or internal consistency.
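A minimal sketch of these convergent and discriminant checks from standardized loadings, using the usual AVE and composite reliability formulas; the loadings and correlation are hypothetical numbers.

```python
import numpy as np

def ave(loadings):
    """Average Variance Extracted: mean of the squared standardized loadings."""
    loadings = np.asarray(loadings)
    return np.mean(loadings ** 2)

def composite_reliability(loadings):
    """Construct (composite) reliability from standardized loadings."""
    loadings = np.asarray(loadings)
    numerator = loadings.sum() ** 2
    return numerator / (numerator + np.sum(1 - loadings ** 2))

# Hypothetical standardized loadings for two constructs and their correlation.
construct_a = [0.72, 0.81, 0.75, 0.69]
construct_b = [0.70, 0.77, 0.66]
corr_ab = 0.45

print(f"AVE(A) = {ave(construct_a):.2f}, CR(A) = {composite_reliability(construct_a):.2f}")
print(f"AVE(B) = {ave(construct_b):.2f}, CR(B) = {composite_reliability(construct_b):.2f}")
# Fornell-Larcker check: discriminant validity is supported when each AVE exceeds the squared correlation.
print(f"Squared correlation = {corr_ab ** 2:.2f}; discriminant validity supported: "
      f"{ave(construct_a) > corr_ab ** 2 and ave(construct_b) > corr_ab ** 2}")
```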
70. Developing Assessments
• Purpose: What are you trying to measure?
• Review: What assessments do you already have that purport to measure this?
• Purchase / Develop: If necessary, consider commercial assessments or create a new assessment.
71. Considerations
• Using what you already have
– Is it carefully aligned to your purpose?
– Is it carefully matched to your purpose?
– Do you have the funds (for assessment, equipment, training)?
• Developing a new assessment
– Do you have the in-house content knowledge?
– Do you have the in-house assessment knowledge?
– Does your team have time for development?
– Does your team have the knowledge and time needed for proper scoring?
– Identify the goal of your questionnaire.
– What kind of information do you want to gather with your questionnaire?
– What is your main objective?
– Is a questionnaire the best way to go about collecting this information?
72. Adopting an Instrument
• Adopting an instrument is
quite simple and requires
very little effort. Even when
an instrument is adopted,
though, there still might be
a few modifications that
are necessary
Adapting an Instrument
• Adapting an instrument
requires more substantial
changes than adopting an
instrument. In this situation,
the researcher follows the
general design of another
instrument but adds items,
removes items, and/or
substantially changes the
content of each item
73. • Whenever possible, it is best for an instrument to be adopted.
• When this is not possible, the next best option is to adapt an
instrument.
• However, if there are no other instruments available, then the
last option is to develop an instrument.
74. Which types of validity apply, by strategy (Development / Adaptation / Adoption)
Pretest, Logical validity:
• Face: Development +, Adaptation +/-, Adoption +/-
• Content: Development +, Adaptation +, Adoption +/-
Pilot / main study, Criterion validity:
• Concurrent: Development +, Adaptation +, Adoption -
• Predictive: Development +, Adaptation +, Adoption -
Pilot / main study, Construct validity:
• Convergent: Development +, Adaptation +, Adoption +
• Divergent: Development +, Adaptation +, Adoption +
76. Types of Reliability
• Test-Retest Reliability: Degree of temporal stability of the
instrument.
– Assessed by having instrument completed by same
people during two different time periods.
• Alternate-Forms Reliability: Degree of relatedness of
different forms of test.
– Used to minimize inflated reliability correlations due to
familiarity with test items.
77. Types of Reliability (cont.)
• Internal-Consistency Reliability: Overall degree of
relatedness of all test items or raters.
– Also called reliability of components.
• Item-to-Item Reliability: The reliability of any single
item on average.
• Judge-to-Judge Reliability: The reliability of any single
judge on average.
78. • Cronbach's alpha is used to evaluate the internal consistency of observed items; factor analysis is then applied to extract latent constructs from these consistent observed variables.
• A value above 0.90 means the questions are asking the same things.
• 0.7 to 0.9 is the acceptable range.
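A minimal numpy sketch of Cronbach's alpha for a set of item scores; the responses below are hypothetical.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (observations x items) array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 respondents answering 4 items.
scores = [
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 4, 3, 3],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # aim for the 0.7-0.9 range noted above
```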
80. Remember!
An assessment that is highly reliable is
not necessarily valid. However, for an
assessment to be valid, it must also be
reliable.
81. Improving Validity & Reliability
• Ensure questions are based on standards
• Ask purposeful questions
• Ask concrete questions
• Use time periods based on importance of the questions
• Use conventional language
• Use complete sentences
• Avoid abbreviations
• Use shorter questions
82. Overall Cronbach Coefficient Alpha
• One may argue that because a high Cronbach's alpha indicates a high degree of internal consistency, the test or survey must be unidimensional rather than multidimensional, and thus there is no need to further investigate its subscales. This is a common misconception.
83. Performing the Pilot Test
• A pilot test involves conducting
the survey on a small,
representative set of
respondents in order to reveal
questionnaire errors before the
survey is launched.
• It is important to run the pilot
test on respondents that are
representative of the target
population to be studied.
Cronbach's alpha measures the intercorrelations among test items, and is thus known as an internal consistency estimate of the reliability of test scores.
Test-retest reliability refers to the degree
to which test results are consistent over
time. In order to measure test-retest
reliability, we must first give the
same test to the same individuals on two
occasions and correlate the scores
Thank you
Dr Mahmoud Danaee
mdanaee@um.edu.my
Senior Visiting Research Fellow, Academic Enhancement and Leadership Development Center (ADeC)
Associate Prof. Dr. Anne Yee
annyee17@um.edu.my
Addiction psychiatrist, Department of Psychological Medicine, University of Malaya Center of Addiction Science (UMCAS)