Formative assessment is a range of formal and informal assessment procedures that teachers employ during the learning process to modify teaching and learning activities and improve student attainment.
Two features in e-rater V.2 are related specifically to word-based characteristics. The first is a measure of vocabulary level (referred to as vocabulary) based on Breland, Jones, and Jenkins’ (1994) Standardized Frequency Index across the words of the essay. The second feature is based on the average word length in characters across the words in the essay (referred to as word length).
Multiple choice questions are often biased. Here is an old analogy question from the SAT.
The correct answer is C, but how many inner-city kids know what a regatta is?
Still, using the work of my friend and colleague Jay Rosner, Director of the Princeton Review Foundation, I want to show how the standard practice of question selection produces bias.
The answer is C. Latinos score about 67 points (about 2/3 of a standard deviation) below whites on both the Critical Reading and Math sections. On this question, however, 49% of Latinos answered correctly, compared to 46% of white students. The October 1998 and October 2000 SATs had 138 Math and Verbal questions each, or 276 questions in total. Of these 276 questions, on how many did Latinos do better than whites? Only this one.
Dartmouth 2018 writing assessment presentation Les Perelman
Writing Assessment & Its
Dartmouth Summer Seminar for
4 August 2018
• Assess thyself or be done unto
– Does it measure what you want to measure?
– Can we trust the measurement?
– Is it not random chance?
– If the effect is not random chance, is it large enough to be important?
– What are the intended, unintended, positive, and negative effects?
• Type I & Type II Errors
– How do we identify false positives and false negatives?
Types of Validity
• Face Validity
– directly measures the thing you want to measure
• Construct Validity
– measures of a construct (i.e., practical tests developed from a theory)
that actually measure something in terms of that construct.
• Predictive or Concurrent Validity
– measures that can predict or correlate with other measures of the
same construct either in the future or concurrently.
• Criterion or Proxy Validity
– measures of proxy variables that correlate with the construct
• Piano Technicians Guild
– Repair and tune an old
• Online Essay Evaluations
– MIT’s Freshman Essay
Evaluation assesses two
• Ability to write an
to a reading with its own
argument and data;
• Ability to synthesize and
paraphrase several data
Showing that AES Measures No Construct
Related to Human Communication
Privacy at dictators has not, and in all likelihood never will be agreed yet somehow vast. The
amanuensis will, nonetheless, be equitable in the extent to which we advocate. Because
almost all of the demarcations for the area of literature are foretold to privacy, privacy
which assassinates postulation can be more enormously appreciated. Seclusion will always
be an experience of humankind. Privateness is the most sophistic escapade of mankind.
Privacy has not, and probably never will be lavish but not solemn. Humanity will always
preach privateness; whether on the assembly or with the adjuration. A lack of privateness
lies in the study of literature as well as the search for reality. Why is seclusion so regrettable
to declaration? The response to this query is that privacy is indispensably propitious.
Seclusion, usually by a dictum, will despicably be ouster that should be the proclamation. If
all of the allusions reprove the admonishment to a embroidery, substantiated affronts
infuse equally with privacy. Furthermore, as I have learned in my literature class, human life
will always appease privacy. My celebration adheres. The precipitously homogenized anvil
may, nonetheless, be immense, venomous, and situational. In my philosophy class, many of
the assassinations at our personal drone by the injunction we civilize speculate. Ever since,
an altruist is unsubstantiated, mimicking, and startling of my appetite. Community
postulates probes, not infusion. In my literature class, most of the circumspections with our
personal circumstance for the account we demolish explain the thermostat on dicta to
inspections. Because just about all of the assumptions are contradicted with privacy, a
momentous privacy can be more mournfully propagandized.
According to professor of semiotics Oscar Wilde, privateness is the most fundamental pledge of
humankind. Although the same gamma ray may receive two different pendulums by an
assimilationist by surfeit, interference reproduces. The same neuron may receive two different
brains on the contradiction to counteract orbitals. Information is not the only thing radiation
implodes; it also transmits gamma rays at privacy. By quarreling, postulates of ruminations
which entreat the people involved and respond account as well for seclusion. The sooner
puissant axioms authorize acquiescence, the more happenstance will rightfully be an
outlandishly or surprisingly spurious declaration.
Seclusion, often to aggregations, taunts the assassin but can be edification. Privacy which is
inchoate changes the disparaging privacy. Additionally, if aborigines arrange juggernauts,
appeasement with the prison by privateness can be more rapaciously circumscribed. Our
personal quarrel for the commencement we implore jeers. However, armed with the
knowledge that a respondent should erratically be the amygdala that might be the epigraph, all
of the interlopers to our personal development at the salver we assimilate sequester the
denouncement but provide the people involved. In my experience, some of the inquisitions
with my avocation countenance taunts. a dearth of privacy enlightenments inquiries on our
personal reprobate of the inspection we lament also. Resourcefulness propagates
amplifications but collapses, not an avowed spectrometry. In my philosophy class, none of the
contradictions by our personal embroidery for the organism we assault shriek. From foretelling
appetites, many of the allegations quibble to the same extent of privacy.
Privateness to the circumspection will always be an experience of human society. The
affirmation will, even so, be lavish yet somehow amicable. The less advancements at inclination
of the area of theory of knowledge anesthetize a precinct that choreographs sublimation with
the disenfranchisement or confide, the sooner a circumstance is haphazard, petulant, and
inappropriate. Seclusion at aggregations has not, and presumably never will be deleteriously
risible. Because of the fact that privacy preaches those in question which concede thermostats
and allude, mankind should assassinate seclusion immediately.
1. Survey of instructors in various levels of first-year
writing subject indicates that placement instrument is
placing students in the right places.
2. Instrument is significant predictor of grades in first
year writing classes.
3. Instrument is significant predictor of first-year GPA.
4. The best predictor of college success is family
Proxy Validity: E-Rater® 2.0 created –
Grammar, usage, mechanics, & style
1. Ratio of grammar errors to the total number of words
2. Ratio of mechanics errors to the total number of words
3. Ratio of usage errors to the total number of words
4. Ratio of style errors (repetitious words, passive sentences, very long sentences, very short sentences) to the total number of words
Organization & development
5. The number of "discourse" units detected in the essay (i.e., thesis, main ideas, supporting ideas,
6. The average length of each unit in words
Topical analysis
7. Similarity of the essay's vocabulary to other previously scored essays in the top score
8. The score category containing essays whose words are most similar to the target essay
Word complexity
9. Word repetition (ratio of different content words to total number of words)
10. Vocabulary difficulty (based on word
11. Average word length in characters
Essay length
12. Total number of words
Adapted from Attali, Y. & Burstein, J. (2006) &
Ben-Simon, A. & Bennett, R.E. (2007)
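To make concrete how shallow several of these proxy features are, here is a minimal Python sketch of three of them: total number of words, average word length in characters, and a word-repetition ratio. It is not e-rater's implementation. The function name `surface_features`, the tokenizing regular expression, and the choice to count all words rather than only content words are assumptions made for illustration.

```python
import re


def surface_features(essay: str) -> dict:
    """Compute three surface features loosely modeled on the list above.

    Assumption: a "word" is any run of letters or apostrophes, and all
    words are counted (no stop-word filtering for "content" words).
    """
    words = re.findall(r"[A-Za-z']+", essay.lower())
    total = len(words)
    if total == 0:
        # Avoid dividing by zero on an empty essay.
        return {"total_words": 0, "avg_word_length": 0.0, "distinct_word_ratio": 0.0}
    return {
        "total_words": total,
        "avg_word_length": sum(len(w) for w in words) / total,
        "distinct_word_ratio": len(set(words)) / total,
    }


features = surface_features("Privacy is privacy. Seclusion is not.")
print(features)
```

None of these numbers depends on whether the essay means anything, which is why long strings of rare, polysyllabic nonsense can do well on features of this kind.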
Fairness -- Consequences
• Intended outcomes based on constructs not on
extraneous socially-determined values
• Constructs themselves are as neutral as possible
• Sufficient care has been taken to avoid adverse
impact on specific groups
• Measure attempts to avoid systemic negative
effects on marginalized populations
Bias in Multiple Choice Test Selection Method
• Old analogy question
RUNNER: MARATHON ::
A) envoy: embassy
B) martyr: massacre
C) oarsman: regatta
D) referee: tournament
E) horse: stable
• No more analogies on most standardized tests
• Committee checks questions for cultural bias
• But process of question selection produces bias.
Kidder & Rosner 2003; Rosner 2003; Rosner 2010
• Consider this unscored test question from the
Oct. 2000 SAT:
7. At bedtime the security blanket served the child as _______ with
seemingly magical powers to ward off frightening phantasms.
(A) an arsenal (B) an incentive (C) a talisman
(D) a trademark (E) a harbinger
• Yielding the same or compatible results in
different clinical experiments or statistical
• Tension between reliability and validity
Parallel Forms Reliability
• Assessments are at same level of difficulty for all
• For essay assessments, that each assessment be
at the same level of difficulty; that each prompt
be equally challenging
• SAT Multiple Choice selection process gives an
insight into how complex this goal is
– June 2018 SAT
Errors in measurement: Inter-rater reliability
• All measurements contain a certain amount of error
– Engineers always use error bars
• As Huot notes, essay scoring has a significantly low
correlation among readers
• Even now, an acceptable correlation is 0.7, which
squares to a shared variance of 0.49, or about ½.
• The best scoring produces a correlation of around 0.8,
which gives a shared variance of 64%, or slightly less
than two-thirds.
• But how do most large-scale testers get to even 0.7 or
0.8? Not by rigorous training.
• The secret is length! And short time to write!
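The shared-variance figures quoted above are simply the squared correlation coefficient. A minimal Python sketch of the arithmetic, where the function name `shared_variance` is an illustrative choice:

```python
def shared_variance(r: float) -> float:
    """Return the proportion of variance two sets of scores share (r squared)."""
    return r * r


for r in (0.7, 0.8):
    print(f"r = {r:.1f} -> shared variance = {shared_variance(r):.2f}")
```

So even at an inter-rater correlation of 0.8, roughly a third of the variance in one reader's scores is not accounted for by the other's.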
Significance & Importance
The new SAT writing
section increased the
weighted SAT with GPA as a
predictor of 1st year grades
by r = .08 or an increase of
0.64% or .0064 in shared
variance. Because of the
large numbers, this value
Two Types of Statistical Error
• Type I
– Incorrect rejection of a true null hypothesis
– False Positive
• Type II
– Failure to reject a false null hypothesis
– False Negative
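The two definitions above can be expressed as a small decision table. This is a minimal Python sketch, assuming we know whether the null hypothesis is actually true, which in a real assessment we usually do not. The function name `classify_decision` and its parameters are hypothetical.

```python
def classify_decision(null_is_true: bool, null_rejected: bool) -> str:
    """Label a test outcome given the (normally unknown) truth of the null."""
    if null_is_true and null_rejected:
        # We claimed an effect that is not there.
        return "Type I error (false positive)"
    if not null_is_true and not null_rejected:
        # We missed an effect that is there.
        return "Type II error (false negative)"
    return "correct decision"


# Example: a program really did improve student writing (null is false),
# but the pre-test / post-test comparison found no significant change.
print(classify_decision(null_is_true=False, null_rejected=False))
```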
The Perils of Pre-Test / Post-Test
• False Negatives
– Even with a rubric, holistic scoring, to be accurate, will create a bell
curve near the middle of the scale
• Primary trait
Instrument and measures are determined by
purpose and audience
The Rhetoric of Assessment
Logos or Topic
Speaker or Writer Audience
Research Question & Data
The Assessment Process
Use Data &
• Why are you assessing?
• What do you want to know?
• For whom are you assessing and who will be doing the
• Where, in what context, will you be assessing?
• When and how often will you assess?
• How will you assess?