MASTER’S COMPREHENSIVE EXAMINATION
AND DOCTORAL PRELIMINARY EXAMINATION
SAMPLE RESPONSES – SPRING 2009
1. A university research group in Eastern Maryland is interested in investigating the
relationship between two constructs in elderly populations, “Mental Agility” and “Physical
Activity”. The researchers decide to conduct a study with 18 residents at a retirement
community in Annapolis. The age range of the 18 residents is 72-79 years and four of the
residents are male.
To investigate the relationship between the two constructs of interest, the researchers
administer an IQ-test to the residents and also ask them about their exercise habits.
Specifically, they ask the subjects how they would rate their exercise activity on a five-point
scale from 1 = “not very active” to 5 = “highly active”.
The researchers correlate the scores from the IQ test with the activity ratings and find a
correlation of .31, which is statistically significant at α = .10. They claim that their findings
prove that physical activity causes the elderly to have a high degree of mental agility.
As an independent researcher you were asked to comment on the validity of the conclusions
drawn by the primary researchers.
a. Identify at least four threats to the inferential validity in this study and explain their
First, the retirement community is locally selected, which makes the sample of
respondents a convenience sample. This may or may not be an issue depending on the
degree to which the sample is representative in its composition of the general population.
Given that the retirement community is in a rather wealthy location – Annapolis – it is
questionable to what degree the sample shares key characteristics with the general
population of elderly about which inferences are desired. Furthermore, the responses of
the retirees are going to be dependent due to the shared context. This may or may not be a
statistical issue depending on the degree of dependency.
Second, the sample is rather small, which means that any model effects are going to
be rather imprecisely estimated and that non-parametric methods might be the better
choice. Third, both measurements contain some degree of measurement error. The rating
scale is an imprecise measurement of the degree of physical activity because “active” can
be interpreted in a variety of different ways. The IQ measure also contains measurement
error, because intelligence is a latent characteristic. Moreover, it is questionable whether
mental agility should be equated with general intelligence.
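The imprecision that comes with n = 18 can be made concrete with an approximate confidence interval for the reported correlation. The following is a minimal sketch, assuming Fisher's z transformation and a normal approximation; the function name fisher_ci is introduced here for illustration only and is not part of the original study.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.90):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                                # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)                      # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # two-sided normal quantile
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Values taken from the study description: r = .31 with 18 residents.
low, high = fisher_ci(r=0.31, n=18, level=0.90)
print(f"90% interval for r = .31, n = 18: ({low:.2f}, {high:.2f})")
```

With so few respondents the interval spans a wide range of plausible population correlations, which is the sense in which the effect is imprecisely estimated.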
Fourth, the study is an observational study without any controlled intervention by the
researchers. Thus, any correlational effects reflect a composite of characteristics, only
one of which is physical activity measured in a particular manner. It is not surprising,
therefore, that the correlation is relatively low in this study given the nature of the study
and the measurement error in the variables.
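The dampening effect of measurement error on an observed correlation can be sketched with Spearman's attenuation formula, under which the expected observed correlation equals the construct-level correlation multiplied by the square root of the product of the two reliabilities. The reliabilities and the construct-level correlation below are hypothetical values chosen for illustration; the study reports none of them.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the reliabilities of both measures."""
    return true_r * math.sqrt(rel_x * rel_y)

# Hypothetical values: a construct-level correlation of .50, a fairly reliable
# IQ test (.90), and a single-item activity rating with modest reliability (.50).
observed = attenuated_r(true_r=0.50, rel_x=0.90, rel_y=0.50)
print(round(observed, 3))
```

Under these assumed values a moderate relationship between the constructs would show up as a noticeably smaller correlation between the two scores.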
b. Describe an alternative research design that would be more appropriate to investigate
the link of mental agility and physical activity in the elderly. Provide a thorough
justification for the structure of this alternative design.
A designed experiment with a larger cohort of retirees would be desirable to investigate
the research question of interest. From a sampling perspective, it would be important to
stratify the population of retirees in an area of interest (e.g., the Mid-Atlantic region)
according to relevant characteristics that impact their level of mental agility. These might
include things like sex, type of job held, and some measure of the severity of medical
illnesses in the past. In addition, it would seem necessary to develop particular exercise
regimens – or design interventions based on existing exercise regimens – that vary
systematically in their intensity and to assign retirees to these programs.
In practice, this would be somewhat challenging because retirees might be used to exercising
with friends and the amount of physical activity has to be matched to the physical state of
the retirees for health safety. Hence, if there are different interventions they should
always be offered at the same retirement community and retirees should be matched on
key characteristics of interest before being randomly assigned to the interventions.
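Matching followed by random assignment can be sketched as follows. This is one simple way to do it, assuming a single numeric matching variable and two programs; the field names (id, health), the program labels, and the function name are hypothetical.

```python
import random

def matched_pair_assignment(participants, key, seed=None):
    """Pair participants with similar covariate values, then randomize within pairs."""
    rng = random.Random(seed)
    ordered = sorted(participants, key=key)   # neighbours in this order are most similar
    assignment = {}
    # With an odd number of participants, the last one is left unassigned.
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)                     # chance decides who gets which program
        assignment[pair[0]["id"]] = "high_intensity"
        assignment[pair[1]["id"]] = "low_intensity"
    return assignment

# Hypothetical retirees with a baseline health score as the matching variable.
retirees = [{"id": i, "health": h}
            for i, h in enumerate([55, 72, 60, 81, 58, 79, 66, 70])]
groups = matched_pair_assignment(retirees, key=lambda p: p["health"], seed=1)
print(groups)
```

Because each pair contributes one retiree to each program, the two groups are balanced on the matching variable by construction, while chance alone decides which member of a pair receives which program.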
Rather than using a general intelligence measure to measure mental agility, it might seem
more appropriate to use an age-appropriate measure of fluid intelligence and to
supplement this measure with tasks that require a reliance on quick information
processing of particular types. These tasks could be used to supplement the quantitative
profile from the standardized measure and may include such things as crossword puzzles,
Sudoku, or games that use geographical information.
2. Answer the following questions based on the research designs represented by the schematics
Design 1        Design 2
O X O           R O X O
O   O           R O   O
a. Name and briefly describe each design. Include all critical characteristics of each design
in your explanation.
Design 1 is an example of a quasi-experimental design. More specifically, the schematic
represents a non-equivalent control group, pre-test/post-test design (Campbell & Stanley,
1963). The groups here are intact – meaning that the assignment of subjects to groups is
not under the experimenter’s control. The “O’s” in the schematic represent observations
(pre-test and post-test, respectively), and the assignment of X (the treatment) to one
group or the other is assumed to be random and under the control of the investigator.
Design 2 is an example of a true experimental design. One could classify this design as
an equivalent control group, pre-test/post-test design. The equivalency is achieved
through randomization. That is, subjects are randomly assigned to one of the two groups
– either the group that receives the treatment (with X) or the control group (without X).
b. State any threats to internal validity that may exist for each of the two designs. Provide a
brief explanation describing how these threats might produce effects confounded with the
treatment effect if they are not controlled for in the design.
Campbell and Stanley (1963) and Cook and Campbell (1979) suggest that there are 7 – 8
different classes of nuisance variables that, if not controlled, might produce effects
confounded with the treatment effect. They are: history, maturation, testing,
instrumentation, regression to the mean, selection, mortality (attrition), and selection-maturation interaction.
Design 2, the true experiment, controls these rival hypotheses listed above. However, this
is not the case for the quasi-experimental design outlined in design 1.
According to Campbell and Stanley (1963), there are two potential threats to internal
validity with the use of design 1. Of primary concern is the interaction between selection
and maturation. For example, if the experimental group consists of children with low
reading ability, and the control group some other convenient population tested and
retested, a gain specific to the experimental group (afterschool reading program) might be
interpreted as a spontaneous gain in reading achievement specific to such an extreme
group. That is, a gain would have occurred even without the presence of X, and therefore
is confounded with X.
Regression to the mean provides the other major threat to internal validity. This can occur
when either of the comparison groups (treatment or control), has been selected on the
basis of its extreme scores on O1. Then, the degree of difference between pre-test and
post-test between the two groups could very well be the by-product of regression –
groups selected on the basis of extreme scores – and not of the experimental condition.
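Regression to the mean can be illustrated with a small simulation in which no treatment is applied at all. This is a sketch under assumed values: the score distributions, the error variances, and the 10% selection rule are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(42)

# Each simulated student has a stable true score; both tests add independent error.
true_scores = [rng.gauss(100, 10) for _ in range(5000)]
pre = [t + rng.gauss(0, 10) for t in true_scores]
post = [t + rng.gauss(0, 10) for t in true_scores]   # no treatment is applied

# Select roughly the lowest-scoring 10% on the pre-test, as an extreme group would be.
cutoff = sorted(pre)[len(pre) // 10]
selected = [i for i, score in enumerate(pre) if score <= cutoff]

pre_mean = mean(pre[i] for i in selected)
post_mean = mean(post[i] for i in selected)
print(f"selected group: pre = {pre_mean:.1f}, post = {post_mean:.1f}")
```

The selected group scores higher on the post-test even though nothing was done to it, because part of its low pre-test standing was measurement error that does not recur. In a real study such a gain could be mistaken for a treatment effect.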
c. Provide a discussion of the concept of causality and how it relates to the selection or
construction of a suitable research design.
Margenau (1950) pointed out that “the words cause and effect are among the most loosely
used in our language” (p. 389), and declared that science and scientists cannot be of much
help in clarifying their meaning as they “are not primarily scientific terms.” With that
said, the type of design has important implications for the validity of conclusions,
inferences, and generalizations from research. While this discussion could take many
paths, the focus here is the link among theory, internal validity,
and causal inference.
Theory. One important link in establishing causation is a theoretical model. Theory is
aimed at organizing and explaining specific aspects of the environment. A defining feature of
scientific theory is that testable hypotheses may be derived from it. Theory provides the
researcher with a selective point of view – an orientation for what to look for, which
variables are relevant and which are superfluous.
Internal Validity. Internal validity refers to the validity of assertions regarding effects of
the independent variable(s) on the dependent variable(s). In the form of a question, this
can be summarized as: Is what has taken place (i.e., the phenomenon observed) due to the
variables the researcher claims to be operating (e.g., the manipulated variables), or
attributable to other extraneous variables? It is clear that the “validity” of answering this
question relies on the plausibility of alternative explanations. And it is here where control
plays such an integral part. In practice, two general approaches are used to control the
effects of extraneous variables: (i) experimental control and (ii) statistical control. These
relate to the aforementioned design types in the following manner.
In quasi-experimental designs, like that corresponding to design 1 above, in which
random assignment of subjects to treatments is not possible or potentially unethical,
statistical control is achieved through adjusting the estimated treatment effect by means
of controlling for pre-existing group differences on the extraneous (concomitant)
variable(s). This adjustment can be striking especially when the difference on the
concomitant variable(s) across intact treatment groups is dramatic. Using ANCOVA or
other statistical methods to equate groups on important covariates should not be viewed
as a substitute for randomization. Control of all potential concomitant variables is not
possible in quasi-experimental designs; such designs are therefore always subject to threats to
internal validity from unidentified covariates.
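The adjustment described here can be sketched for the simplest case of two intact groups and one covariate (the pre-test). The sketch assumes the classical one-covariate ANCOVA adjustment based on the pooled within-group slope; the function name and all data values are hypothetical.

```python
from statistics import mean

def ancova_adjusted_difference(x_t, y_t, x_c, y_c):
    """Treatment-minus-control difference on y, raw and adjusted for covariate x.

    The adjustment uses the pooled within-group regression slope of y on x.
    """
    def sums(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return mx, my, sxy, sxx

    mx_t, my_t, sxy_t, sxx_t = sums(x_t, y_t)
    mx_c, my_c, sxy_c, sxx_c = sums(x_c, y_c)
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)   # pooled within-group slope
    raw = my_t - my_c
    adjusted = raw - slope * (mx_t - mx_c)      # remove the part due to the pre-test gap
    return raw, adjusted

# Hypothetical intact groups: the treated group starts 10 points higher on the pre-test.
pre_c, post_c = [40, 45, 50, 55, 60], [42, 47, 52, 57, 62]
pre_t, post_t = [50, 55, 60, 65, 70], [57, 62, 67, 72, 77]
raw, adjusted = ancova_adjusted_difference(pre_t, post_t, pre_c, post_c)
print(raw, adjusted)
```

In this constructed example most of the raw post-test difference reflects the pre-existing gap between the groups. The adjustment can only remove differences on covariates that were actually measured, which is why it does not replace randomization.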
Random assignment to experimental conditions, like that represented in design 2 above,
ensures that any idiosyncratic differences among the groups are not systematic at the
outset of the experiment. Random assignment does not guarantee that the groups are
equivalent; rather, it ensures that any observed differences at the outset are due only to chance. Various authors
have asserted that it is only through variable manipulation (e.g., random assignment) that
one may hope to study causation (e.g., Holland, 1986).
In sum, causation is a nebulous term that is often misused in the social/behavioral
sciences. Establishing causation necessitates, among numerous other things, providing a
theoretical model which shows the relationships between relevant variables as well as
excluding irrelevant ones. Choosing a research design which provides the strongest
internal validity evidence is crucial. In terms of internal validity, an experiment where
subjects are randomly assigned to treatment conditions is the “gold standard” against
which other designs are measured. When random assignment is not possible, internal
validity evidence can be accumulated through the use of statistical control.
1. Imagine that you are the district coordinator of three large school systems in Northern
Virginia. You have been directed to develop an integrated assessment system for measuring
third- and fourth-grade students’ learning progress in basic arithmetic skills (e.g., addition,
subtraction, multiplication, division, and fractions).
Each school system has adopted a different textbook for teaching the relevant content, and
they are using the benchmark assessments (for unit tests and end-of-marking-period tests)
that were provided by the publisher of the textbooks. In addition, each school system is using
a standards-based standardized large-scale assessment administered to all students at the
end of each grade.
The results of the new integrated assessment system will be utilized by three different groups
for instructional and policy decision-making: Teachers will use the results to make decisions
about instruction; parents will use the results to determine how their children are doing in
school; school principals and district representatives will use the results to make school
a. Describe the assessment components and associated feedback mechanisms that you
believe should be put into place to provide accurate, reliable, and valid information that
will meet the needs of these different stakeholder groups.
In order to develop an integrated assessment system, one would have to, ideally, align the
targeted curriculum, the enacted curriculum, the non-standardized assessment practices,
and the standardized assessment practices. To do this efficiently and effectively, intensive
amounts of professional development will be required for teachers and district specialists.
Furthermore, parents will need to be informed about the structure of the assessment
systems and its impact on their children’s learning progressions.
A meaningful starting point would be to link and augment existing standards-based
assessments, benchmark assessments, and key classroom assessments. One could convene
an interdisciplinary expert group with teachers, curriculum developers, textbook
developers, and district specialists to review (a) the key competencies targeted by each
assessment, (b) the linkages between the assessments, and (c) the desired interpretations
from each assessment.
The next step could be to augment these assessments with additional tasks that strengthen
existing linkages and to develop remedial materials that can be deployed into the state for
use by the teachers. To make such a system effective it would seem that continual
training of teachers through workshops would be required alongside the development and
deployment of computer-assisted management systems for uploading data and
b. Using two assessment examples in this integrated system, describe the key principles that
you would use to design, implement, and score the assessments given their respective
feedback functions in the system.
The resulting assessment system could target more narrow competencies in particular
units in more depth, via classroom and benchmark assessments, and broader competency
domains across units in less depth, via large-scale standardized assessments. Reporting
for all stakeholders would require that simple statistics such as sum-scores, percentages
correct, or transformed scale scores be used, which are based mostly on classical test
theory. More complicated models such as item response theory models would be used
“behind-the-scenes” for the large-scale standardized assessments only. The
interdisciplinary expert team, in collaboration with a measurement specialist, might use
these models to create simple score reports.
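The simple statistics named above can be computed directly from scored item responses. The following is a minimal sketch, assuming dichotomously scored items and a linear transformation to a reporting scale; the target mean of 500 and standard deviation of 100, the student labels, and the function name are hypothetical.

```python
from statistics import mean, pstdev

def score_report(item_scores, mean_target=500, sd_target=100):
    """Sum score, percent correct, and a linear scale score for each student.

    item_scores maps a student label to a list of 0/1 item scores.
    """
    sums = {student: sum(items) for student, items in item_scores.items()}
    totals = list(sums.values())
    m, sd = mean(totals), pstdev(totals)
    report = {}
    for student, items in item_scores.items():
        z = (sums[student] - m) / sd if sd > 0 else 0.0   # standardized sum score
        report[student] = {
            "sum": sums[student],
            "percent": 100.0 * sums[student] / len(items),
            "scale": round(mean_target + sd_target * z),
        }
    return report

# Hypothetical responses of four students to a five-item quiz.
responses = {
    "A": [1, 1, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [1, 1, 1, 1, 1],
    "D": [0, 0, 1, 0, 0],
}
report = score_report(responses)
for student, row in report.items():
    print(student, row)
```

All three numbers rank the students identically. They differ only in how easily teachers, parents, and administrators can read them.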
As an example of what teachers might do, consider having them use an on-line interface
or an Excel spreadsheet. They could be trained to check off particular competencies that
they measure with their classroom assessments. Only those assessments that provide
reasonably reliable information about individual students should be tracked while more
impressionistic data could be added and weighted accordingly. As stated above, the
classroom assessments should target narrow competencies (e.g., reasoning with simple
rational numbers) linked to state standards such that reliable interpretations about
subskills could be obtained.
The benchmark assessments should cover broader competency domains that arch over the
content covered by a series of classroom assessments. They would probably contain a
larger fraction of selected response items and simple constructed response questions than
other assessments. Ideally, they would be administered and scored in a standardized
manner and would be linked across the districts via common items. The results would
provide information about the performance of individual students to the teachers,
students, and parents, and information about the performance of student groups (e.g., by
sex, school, or school district) for district specialists.
2. Two of the most critical concepts in measurement are reliability and validity. Despite the fact
that these concepts are relatively simple to describe at a broad conceptual level, many
people find it challenging to address their more practical applications. One context that is
often seen as particularly challenging is non-standardized educational classroom assessment.
a. Explain the similarities and differences between the operationalization of reliability in
large-scale standardized assessment contexts and its operationalization in classroom
assessment contexts. Provide at least two examples to support your explanation of the
similarities and differences. Choose examples that can be considered realistic, rather than contrived.
Reliability is concerned with the consistency or replicability of scores, rather than the
interpretations that are drawn from the scores. This meaning of score quality is the same in
both large-scale and classroom assessment even though the means by which it is
empirically assessed can differ. In a classical test theory framework, reliability is the ratio
of true-score variance to observed-score variance or the amount of observed score
variation that can be attributed to true inter-individual differences. In an item response
theory framework, reliability is a marginal quantity that is less frequently used. More
commonly, the information of a test at particular values of the latent score range is of
most interest, similar to the conditional standard error of measurement in classical test theory.
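The classical definition of reliability as a variance ratio can be illustrated by simulation, since true scores are known in simulated data even though they are never observed in practice. This is a sketch under assumed values: a true-score standard deviation of 8 and an error standard deviation of 4 are hypothetical.

```python
import random
from statistics import pvariance

rng = random.Random(7)

# Hypothetical population: true-score SD of 8 and error SD of 4.
true_scores = [rng.gauss(50, 8) for _ in range(20000)]
observed = [t + rng.gauss(0, 4) for t in true_scores]

# Reliability as the ratio of true-score variance to observed-score variance.
reliability = pvariance(true_scores) / pvariance(observed)

# Overall standard error of measurement implied by that reliability.
sem = (pvariance(observed) * (1 - reliability)) ** 0.5
print(f"reliability = {reliability:.2f}, SEM = {sem:.1f}")
```

Under these assumptions the theoretical ratio is 64 / (64 + 16) = 0.80, and the simulated value should fall close to it. The standard error of measurement computed here is a single overall figure, whereas the conditional version varies across the score range.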
In large-scale assessment, we typically distinguish between three types of reliability or
consistency: (a) parallel forms, (b) test-retest, and (c) internal consistency. It is quite
common in large-scale assessment to develop structurally similar test forms to minimize
cheating or to re-administer the same test form to obtain an estimate of temporal stability.
In classroom assessment this is not realistic, however. If teachers utilize assessments
from a textbook or make up their own assessment, they may compute an internal
consistency coefficient such as coefficient alpha for their scores. Realistically, however,
this only happens when such activities are supported by professional development efforts,
a commitment of the district to data-driven instruction, and easy-to-use interfaces for
tracking the data.
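Computing coefficient alpha for a classroom quiz takes only a few lines once the item scores are in a table. The following is a minimal sketch using the standard formula for coefficient alpha; the quiz data are hypothetical.

```python
from statistics import variance

def coefficient_alpha(rows):
    """Coefficient alpha; rows are students, columns are item scores."""
    k = len(rows[0])                                          # number of items
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])          # variance of sum scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical quiz: six students, four items scored 0/1.
quiz = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = coefficient_alpha(quiz)
print(round(alpha, 2))
```

With so few students and items the estimate is itself very unstable, which is one more reason why such coefficients are rarely informative for a single classroom quiz.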
b. Explain the similarities and differences between the investigation of validity in large-
scale standardized assessment contexts and its investigation in classroom assessment
contexts. Provide at least two examples to support your explanation of the similarities
and differences. Choose examples that can be considered realistic, rather than contrived.
Validity is concerned with the defensibility of interpretations that are drawn from test
scores, at least in the most common conceptualization. Thus, scores need to be reliable to
support potentially valid interpretations, but even if they are reliable they may be
interpreted in inappropriate ways by decision-makers. In large-scale assessment, a lot of
effort is devoted to comprehensively documenting evidence for different aspects of validity.
For example, the test development process, from the construct definition in the target
domain to its operationalization via test items, their administration and scoring, and the
reporting, needs to be meticulously documented. In addition, secondary measures that are
assumed to correlate positively with test scores (e.g., general ability measures,
working-memory measures, alternative achievement measures) or to correlate only weakly
with test scores (e.g., measures of unrelated constructs) are typically administered
alongside the test of interest to investigate criterion-related evidence for validity.
In classroom assessments, the primary emphasis is on content validity, which is often
documented by the design of the assessment and the sequence of assessments overall.
Investigations of criterion-related evidence for validity are typically not conducted
even though school principals might correlate particular grades across particular domains
for documentation or research purposes. If the assessment system that is in use in a
particular district was thoughtfully designed, then the weighting of the individual
assessments within the system might also be rationally used to argue that they structurally
match the relative importance of the measured competencies in the target domain.