Reliability: What is it, and how is it measured?

by Anne Bruton, Joy H Conway, Stephen T Holgate

Key Words
Reliability, measurement, quantitative measures, statistical method.

Summary Therapists regularly perform various measurements. How reliable these measurements are in themselves, and how reliable therapists are in using them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value. The aim of this paper is to explain the nature of reliability, and to describe some of the commonly used estimates that attempt to quantify it. An understanding of reliability, and how it is estimated, will help therapists to make sense of their own clinical findings, and to interpret published studies.
   Although reliability is generally perceived as desirable, there is no firm definition as to the level of reliability required to reach clinical acceptability. As with hypothesis testing, statistically significant levels of reliability may not translate into clinically acceptable levels, so that some authors’ claims about reliability may need to be interpreted with caution. Reliability is generally population specific, so that caution is also advised in making comparisons between studies.
   The current consensus is that no single estimate is sufficient to provide the full picture about reliability, and that different types of estimate should be used together.

Bruton, A, Conway, J H and Holgate, S T (2000). ‘Reliability: What is it and how is it measured?’ Physiotherapy, 86, 2, 94-99.

Introduction
Therapists regularly perform various measurements of varying reliability. The term ‘reliability’ here refers to the consistency or repeatability of such measurements. Irrespective of the area in which they work, therapists take measurements for any or all of the reasons outlined in table 1. How reliable these measurements are in themselves, and how reliable therapists are in performing them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value.

Table 1: Common reasons why therapists perform measurements
As part of patient assessment.
As baseline or outcome measures.
As aids to deciding upon treatment plans.
As feedback for patients and other interested parties.
As aids to making predictive judgements, eg about outcome.

  This article focuses on the reliability of measures that generate quantitative data, and in particular ‘interval’ and ‘ratio’ data. Interval data have equal intervals between numbers but these are not related to true zero, so do not represent absolute quantity. Examples of interval data are IQ and degrees Centigrade or Fahrenheit. In the temperature scale, the difference between 10° and 20° is the same as between 70° and 80°, but is based on the numerical value of the scale, not the true nature of the variable itself. Therefore the actual difference in heat and molecular motion generated is not the same and it is not appropriate to say that someone is twice as hot as someone else.
  With ratio data, numbers represent units with equal intervals, measured from true zero, eg distance, age, time, weight, strength, blood pressure, range of motion, height. Numbers therefore reflect actual amounts of the variable being measured, and it is appropriate to say that one person is twice as heavy, tall, etc, as another. The kind of quantitative measures that therapists often carry out are outlined in table 2.

Table 2: Examples of quantitative measures performed by physiotherapists
Strength measures (eg in newtons of force, kilos lifted).
Angle or range of motion measures (eg in degrees, centimetres).
Velocity or speed measures (eg in litres per minute for peak expiratory flow rate).
Length or circumference measures (eg in metres, centimetres).

  The aim of this paper is to explain the nature of reliability, and to describe, in general terms, some of the commonly used methods for quantifying it. It is not intended to be a detailed account of the statistical


Physiotherapy February 2000/vol 86/no 2
Professional articles                                                                                                               95


minutiae associated with reliability measures, for which readers are referred to standard books on medical statistics.

Measurement Error
It is very rare to find any clinical measurement that is perfectly reliable, as all instruments and observers or measurers (raters) are fallible to some extent and all humans respond with some inconsistency. Thus any observed score (X) can be thought of as a function of two components, ie a true score (T) and an error component (E): X = T ± E
  The difference between the true value and the observed value is measurement error. In statistical terms, ‘error’ refers to all sources of variability that cannot be explained by the independent (also known as the predictor, or explanatory) variable. Since the error components are generally unknown, it is only possible to estimate the amount of any measurement that is attributable to error and the amount that represents an accurate reading. This estimate is our measure of reliability.
  Measurement errors may be systematic or random. Systematic errors are predictable errors, occurring in one direction only, constant and biased. For example, when using a measurement that is susceptible to a learning effect (eg strength testing), a retest may be consistently higher than a prior test (perhaps due to improved motor unit co-ordination). Such a systematic error would not therefore affect reliability, but would affect validity, as test values are not true representations of the quantity being measured. Random errors are due to chance and unpredictable, thus they are the basic concern of reliability.

Types of Reliability
Baumgarter (1989) has identified two types of reliability, ie relative reliability and absolute reliability.
  Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements. Tables 3 and 4 give some maximum inspiratory pressure (MIP) measures taken on two occasions, 48 hours apart. In table 3, although the differences between the two measures vary from –16 to +22 centimetres of water, the ranking remains unchanged. That is, on both day 1 and day 2 subject 4 had the highest MIP, subject 1 the second highest, subject 5 the third highest, and so on. This form of reliability is often assessed by some type of correlation coefficient, eg Pearson’s correlation coefficient, usually written as r. For table 3 the data give a Pearson’s correlation coefficient of r = 0.94, generally accepted to indicate a high degree of correlation. In table 4, however, although the differences between the two measures look similar to those in table 3 (ie –15 to +22 cm of water), on this occasion the ranking has changed. Subject 4 has the highest MIP on day 1, but is second highest on day 2, subject 1 had the second highest MIP on day 1, but the lowest MIP on day 2, and so on. For table 4 data r = 0.51, which would be interpreted as a low degree of correlation. Correlation coefficients thus give information about association between two variables, and not necessarily about their proximity.

Table 3: Repeated maximum inspiratory pressure measures data demonstrating good relative reliability

                        MIP (cm of water)                         Rank
Subject         Day 1         Day 2       Difference      Day 1          Day 2

1               110            120           +10            2              2
2                94            105           +11            4              4
3                86             70           –16            5              5
4               120            142           +22            1              1
5               107            107             0            3              3

Table 4: Repeated maximum inspiratory pressures measures data demonstrating poor relative reliability

                        MIP (cm of water)                         Rank
Subject         Day 1         Day 2       Difference      Day 1          Day 2

1               110             95           –15            2              5
2                94            107           +13            4              3
3                86             97           +11            5              4
4               120            120             0            1              2
5               107            129           +22            3              1

  Absolute reliability is the degree to which repeated measurements vary for individuals, ie the less they vary, the higher the reliability. This type of reliability is expressed either in the actual units of measurement, or as a proportion of the measured values. The standard error of measurement (SEM), coefficient of variation (CV) and Bland and Altman’s 95% limits of agreement (1986) are all examples of measures of absolute reliability. These will be described later.
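
The contrast between tables 3 and 4 can be checked numerically. What follows is a minimal sketch in Python, assuming the table values exactly as printed. The function names (`pearson_r`, `ranks`, `limits_of_agreement`) are illustrative and not taken from the article. The limits of agreement are computed as the mean difference plus or minus two standard deviations of the differences, the form given later in this article under Bland and Altman agreement tests.

```python
from math import sqrt
from statistics import mean, stdev

# MIP values (cm of water) as printed in tables 3 and 4; day 1 is shared.
day1 = [110, 94, 86, 120, 107]
table3_day2 = [120, 105, 70, 142, 107]
table4_day2 = [95, 107, 97, 120, 129]


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank 1 is the highest value (assumes no ties, as in these data)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def limits_of_agreement(x, y):
    """Mean difference plus or minus two SDs of the differences."""
    diffs = [b - a for a, b in zip(x, y)]
    centre, spread = mean(diffs), 2 * stdev(diffs)
    return centre - spread, centre + spread


r3 = pearson_r(day1, table3_day2)
r4 = pearson_r(day1, table4_day2)
loa3 = limits_of_agreement(day1, table3_day2)
loa4 = limits_of_agreement(day1, table4_day2)

print(f"Table 3: r = {r3:.2f}, ranks kept = {ranks(day1) == ranks(table3_day2)}")
print(f"Table 4: r = {r4:.2f}, ranks kept = {ranks(day1) == ranks(table4_day2)}")
print(f"Limits of agreement: table 3 {loa3}, table 4 {loa4}")
```

Run as a script, this gives r of about 0.94 for table 3 and roughly 0.5 for table 4, while the two sets of limits of agreement are almost the same width. Correlation here reflects the ranking of subjects, not how close the paired values are.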



Authors
Anne Bruton MA MCSP is currently involved in postgraduate research, Joy H Conway PhD MSc MCSP is a lecturer in physiotherapy, and Stephen T Holgate MD DSc FRCP is MRC professor of immunopharmacology, all at the University of Southampton.
This article was received on November 16, 1998, and accepted on September 7, 1999.

Address for Correspondence
Ms Anne Bruton, Health Research Unit, School of Health Professions and Rehabilitation Sciences, University of Southampton, Highfield, Southampton SO17 1BJ.

Funding
Anne Bruton is currently sponsored by a South and West Health Region R&D studentship.

Why Estimate Reliability?
Reliability testing is usually performed to assess one of the following:

• Instrumental reliability, ie the reliability of the measurement device.
• Rater reliability, ie the reliability of the researcher/observer/clinician administering the measurement device.
• Response reliability, ie the reliability/stability of the variable being measured.

How is Reliability Measured?
As described earlier, observed scores consist of the true value ± the error component. Since it is not possible to know the true value, the true reliability of any test is not calculable. It can however be estimated, based on the statistical concept of variance, ie a measure of the variability of differences among scores within a sample. The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance.
  If a single measurer (rater) were to record the oxygen saturation of an individual 10 times, the resulting scores would not all be identical, but would exhibit some variance. Some of this total variance is due to true differences between scores (since oxygen saturation fluctuates), but some can be attributed to measurement error (E). Reliability (R) is the measure of the amount of the total variance attributable to true differences and can be expressed as the ratio of true score variance (T) to total variance or:

$R = \frac{T}{T + E}$

  This ratio gives a value known as a reliability coefficient. As the observed score approaches the true score, reliability increases, so that with zero error there is perfect reliability and a coefficient of 1, because the observed score is the same as the true score. Conversely, as error increases reliability diminishes, so that with maximal error there is no reliability and the coefficient approaches 0. There is, however, no such thing as a minimum acceptable level of reliability that can be applied to all measures, as this will vary depending on the use of the test.

Indices of Reliability
In common with medical literature, physiotherapy literature shows no consistency in authors’ choice of reliability estimate calculated for their data. Table 5 summarises the more common reliability indices found in the literature, which are described below.

Table 5: Reliability indices in common use
Hypothesis tests for bias, eg paired t-test, analysis of variance.
Correlation coefficients, eg Pearson’s, ICC.
Standard error of measurement (SEM).
Coefficient of variation (CV).
Repeatability coefficient.
Bland and Altman 95% limits of agreement.

Indices Based on Hypothesis Testing for Bias
The paired t-test and analysis of variance techniques are statistical methods for detecting systematic bias between groups of data. These estimates, based upon hypothesis testing, are often used in reliability studies. However, they give information only about systematic differences between the means of two sets of data, not about individual differences. Such tests should, therefore, not be used in isolation, but be complemented by other methods, eg Bland and Altman agreement tests (1986).

Correlation Coefficients (r)
As stated earlier, correlation coefficients give information about the degree of association between two sets of data, or the consistency of position within the two distributions. Provided the relative positions of each subject remain the same from test to test, high measures of correlation will be obtained. However, a correlation coefficient will not detect any systematic errors. So it is possible to have two sets of scores that are highly correlated, but not highly repeatable, as in table 6 where the hypothetical data give a Pearson’s correlation coefficient of r = 1, ie perfect correlation despite a systematic difference of 40 cm of water for each subject.
  Thus correlation only tells how two sets of scores vary together, not the extent of agreement between them. Often researchers need to know that the actual values obtained by two measurements are the same, not just proportional to one another. Although published studies abound with correlation used as the sole indicator of reliability, their results can be misleading, and it is now recommended that they be no longer used in isolation (Keating and Matyas, 1998; Chinn, 1990).
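
The limitation of hypothesis tests for bias can be illustrated with the data in tables 3 and 4. This is a minimal sketch of a paired t statistic, assuming the table values as printed. `paired_t` is a hypothetical helper name, and 2.776 is the standard two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# MIP values (cm of water) as printed in tables 3 and 4; day 1 is shared.
day1 = [110, 94, 86, 120, 107]
table3_day2 = [120, 105, 70, 142, 107]
table4_day2 = [95, 107, 97, 120, 129]

# Two-sided 5% critical value of Student's t for n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def paired_t(x, y):
    """Return (mean difference, paired t statistic) for two repeated measures."""
    diffs = [b - a for a, b in zip(x, y)]
    mean_diff = mean(diffs)
    std_error = stdev(diffs) / sqrt(len(diffs))
    return mean_diff, mean_diff / std_error


for label, day2 in (("Table 3", table3_day2), ("Table 4", table4_day2)):
    mean_diff, t = paired_t(day1, day2)
    largest = max(abs(b - a) for a, b in zip(day1, day2))
    print(f"{label}: mean difference = {mean_diff:.1f}, t = {t:.2f}, "
          f"bias detected = {abs(t) > T_CRIT_DF4}, "
          f"largest individual change = {largest}")
```

For both tables the statistic falls well below the critical value, so no systematic bias is detected, even though individual subjects changed by up to 22 cm of water between days. A test on the means says nothing about those individual differences.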



          Table 6: Repeated maximum inspiratory pressures measures data
          demonstrating a high Pearson’s correlation coefficient, but poor absolute
          reliability

                                  MIP                                       Rank
          Subject         Day 1         Day 2       Difference      Day 1          Day 2

          1               110            150           +40            2              2
          2                94            134           +40            4              4
          3                86            126           +40            5              5
          4               120            160           +40            1              1
          5               107            147           +40            3              3
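
The table 6 figures can be checked with a short sketch, assuming the values as printed. The helper name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean

# Hypothetical MIP values (cm of water) as printed in table 6.
day1 = [110, 94, 86, 120, 107]
day2 = [150, 134, 126, 160, 147]


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(day1, day2)
diffs = [b - a for a, b in zip(day1, day2)]

print(f"r = {r:.2f}")
print(f"differences = {diffs}, mean difference = {mean(diffs)}")
```

The correlation is perfect because every subject shifts by the same amount, yet no day 2 value agrees with its day 1 value.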




Intra-class Correlation Coefficient (ICC)
The intra-class correlation coefficient (ICC) is an attempt to overcome some of the limitations of the classic correlation coefficients. It is a single index calculated using variance estimates obtained through the partitioning of total variance into between and within subject variance (known as analysis of variance or ANOVA). It thus reflects both degree of consistency and agreement among ratings.
   There are numerous versions of the ICC (Shrout and Fleiss, 1979) with each form being appropriate to specific situations. Readers interested in using the ICC can find worked examples relevant to rehabilitation in various published articles (Rankin and Stokes, 1998; Keating and Matyas, 1998; Stratford et al, 1984; Eliasziw et al, 1994). The use of the ICC implies that each component of variance has been estimated appropriately from sufficient data (at least 25 degrees of freedom), and from a sample representing the population to which the results will be applied (Chinn, 1991). In this instance, degrees of freedom can be thought of as the number of subjects multiplied by the number of measurements.
   As with other reliability coefficients, there is no standard acceptable level of reliability using the ICC. It will range from 0 to 1, with values closer to one representing higher reliability. Chinn (1991) recommends that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful. The ICC is useful when comparing the repeatability of measures using different units, as it is a dimensionless statistic. It is most useful when three or more sets of observations are taken, either from a single sample or from independent samples. It does, however, have some disadvantages as described by Rankin and Stokes (1998) that make it unsuitable for use in isolation. As described earlier, any reliability coefficient is determined as the ratio of variance between subjects to the sum of error variance and subject variance. If the variance between subjects is sufficiently high (that is, the data come from a heterogeneous sample) then reliability will inevitably appear to be high. Thus if the ICC is applied to data from a group of individuals demonstrating a wide range of the measured characteristic, reliability will appear to be higher than when applied to a group demonstrating a narrow range of the same characteristic.

Standard Error of Measurement (SEM)
As mentioned earlier, if any measurement test were to be applied to a single subject an infinite number of times, it would be expected to generate responses that vary a little from trial to trial, as a result of measurement error. Theoretically these responses could be plotted and their distribution would follow a normal curve, with the mean equal to the true score, and errors occurring above and below the mean.
  The more reliable the measurement response, the less error variability there would be around the mean. The standard deviation of measurement errors is therefore a reflection of the reliability of the test response, and is known as the standard error of measurement (SEM). The value for the SEM will vary from subject to subject, but there are equations for calculating a group estimate, eg $SEM = s_x \sqrt{1 - r_{xx}}$ (where $s_x$ is the standard deviation of the set of observed test scores and $r_{xx}$ is the reliability coefficient for those data; often the ICC is used here).
  The SEM is a measure of absolute reliability and is expressed in the actual units of measurement, making it easy to interpret, ie the smaller the SEM, the greater the reliability. It is only appropriate, however, for use with interval data (Atkinson and Nevill, 1998) since with ratio data the amount of random error may increase as the measured values increase.
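
The variance-ratio idea behind the ICC, and the group SEM estimate that uses it, can be sketched as follows. This is a rough illustration only. It assumes the table 3 data as printed, and it uses the one-way random-effects, single-measure form of the ICC, often labelled ICC(1,1). That form is one of several in Shrout and Fleiss (1979) and is chosen here for simplicity, not because the article prescribes it. The names `icc_oneway` and `sem` are hypothetical. With five subjects and two measurements the data fall far short of the 25 degrees of freedom mentioned above, so the numbers should not be read as real estimates.

```python
from math import sqrt
from statistics import mean, stdev

# Table 3 MIP data (cm of water): one row per subject, [day 1, day 2].
scores = [[110, 120], [94, 105], [86, 70], [120, 142], [107, 107]]


def icc_oneway(rows):
    """One-way random-effects ICC for single measures, ICC(1,1)."""
    n, k = len(rows), len(rows[0])
    grand = mean([v for row in rows for v in row])
    row_means = [mean(row) for row in rows]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(rows, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def sem(rows, reliability):
    """Group SEM = s_x * sqrt(1 - r_xx), with s_x the SD of all observed scores."""
    s_x = stdev([v for row in rows for v in row])
    return s_x * sqrt(1 - reliability)


icc = icc_oneway(scores)
print(f"ICC(1,1) = {icc:.2f}, SEM = {sem(scores, icc):.1f} cm of water")
```

Because the SEM is in the units of measurement, it can be compared directly with the day-to-day differences in table 3, whereas the ICC is unit free.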



Coefficient of Variation (CV)
The CV is an often-quoted estimate of measurement error, particularly in laboratory studies where multiple repeated tests are standard procedure. One form of the CV is calculated as the standard deviation of the data, divided by the mean and multiplied by 100 to give a percentage score. This expresses the standard deviation as a proportion of the mean, making it unit independent. However, as Bland (1987) points out, the problem with expressing the error as a percentage is that x% of the smallest observation will differ markedly from x% of the largest observation. Chinn (1991) suggests that it is preferable to use the ICC rather than the CV, as the former relates the size of the error variation to the size of the variation of interest. It has been suggested that the above form of the CV should no longer be used to estimate reliability, and that other more appropriate methods should be employed based on analysis of variance of logarithmically transformed data (Atkinson and Nevill, 1998).

Repeatability Coefficient
Another way to present measurement error over two tests, as recommended by the British Standards Institution (1979), is the value below which the difference between the two measurements will lie with probability 0.95. This is based upon the within-subject standard deviation (s). Provided the measurement errors are from a normal distribution this can be estimated by $1.96 \times \sqrt{2s^2}$, ie about 2.77s (or 2.83s if the multiplier is rounded up to 2), and is known as the repeatability coefficient (Bland and Altman, 1986). This name is rather confusing, as other coefficients (eg reliability coefficient) are expected to be unit free and in a range from zero to one. The method of calculation varies slightly in two different references (Bland and Altman, 1986; Bland, 1987), and to date it is not a frequently quoted statistic.

Bland and Altman Agreement Tests
In 1986 The Lancet published a paper by Bland and Altman that is frequently cited and has been instrumental in encouraging a change in the use of reliability estimates in the medical literature. In the past, studies comparing the reliability of two different instruments designed to measure the same variable (eg two different types of goniometer) often quoted correlation coefficients and ICCs. These can both be misleading, however, and are not appropriate for method comparison studies for reasons described by Bland and Altman in their 1986 paper. These authors have therefore proposed an approach for assessing agreement between two different methods of clinical measurement. This involves calculating the mean for each method and using this in a series of agreement tests.
   Step 1 consists of plotting the difference in the two results against the mean value from the two methods. Step 2 involves calculating the mean and standard deviation of the differences between the measures. Step 3 consists of calculating the 95% limits of agreement (as the mean difference plus or minus two standard deviations of the differences), and 95% confidence intervals for these limits of agreement. The advantages of this approach are that by using scatterplots, data can be visually interpreted fairly swiftly. Any outliers, bias, or relationship between variance in measures and size of the mean can therefore be observed easily. The 95% limits of agreement provide a range of error that may relate to clinical acceptability, although this needs to be interpreted with reference to the range of measures in the raw data.
   In the same paper, Bland and Altman have a section headed ‘Repeatability’ in which they recommend the use of the ‘repeatability coefficient’ (described earlier) for studies involving repeated measures with the same instrument. In their final discussion, however, they suggest that their agreement testing approach may be used either for analysis of repeatability of a single measurement method, or for method comparison studies. Worked examples using Bland and Altman agreement tests can be found in their original paper, and more recently in papers by Atkinson and Nevill (1998) and Rankin and Stokes (1998).

Nature of Reliability
Unfortunately, the concept of reliability is complex, with less of the straightforward ‘black and white’ statistical theory that surrounds hypothesis testing. When testing a research hypothesis there are clear guidelines to help researchers and clinicians decide whether results indicate that the hypothesis can be supported or not. In contrast, the decision as to whether a particular measurement tool or method is reliable or not is more open to interpretation. The decision to be made is whether the level of measurement error is

Physiotherapy February 2000/vol 86/no 2
Professional articles                                                                                                                       99


considered acceptable for practical use. There are no firm rules for making this decision, which will inevitably be context based. An error of ±5° in goniometry measures may be clinically acceptable in some circumstances, but may be less acceptable if definitive clinical decisions (eg surgical intervention) are dependent on the measure. Because reliability estimates depend on the context in which they are produced, it is very difficult to compare reliability across different studies, except in very general terms.

Conclusion
This paper has attempted to explain the concept of reliability and describe some of the estimates commonly used to quantify it. Key points to note about reliability are summarised in the panel below. Reliability should not necessarily be conceived as a property that a particular instrument or measurer does or does not possess. Any instrument will have a certain degree of reliability when applied to certain populations under certain conditions. The issue to be addressed is what level of reliability is considered to be clinically acceptable. In some circumstances there may be a choice only between a measure with lower reliability or no measure at all, in which case the less than perfect measure may still add useful information.

In recent years several authors have recommended that no single reliability estimate should be used for reliability studies. Opinion is divided over exactly which estimates are suitable for which circumstances. Rankin and Stokes (1998) have recently suggested that a consensus needs to be reached to establish which tests should be adopted universally. In general, however, it is suggested that no single estimate is universally appropriate, and that a combination of approaches is more likely to give a true picture of reliability.
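
As an illustration of the repeatability coefficient described earlier, the following is a minimal sketch in Python rather than a definitive implementation. It uses the Day 1 and Day 2 maximum inspiratory pressure values from table 3, and assumes that there are exactly two measurements per subject, that there is no systematic change between days, and that the within-subject standard deviation (s) can be estimated from the paired differences. The function names are hypothetical and chosen only for this example.

```python
import math


def within_subject_sd(first, second):
    """Estimate the within-subject SD (s) from two measurements per subject.

    Assumption: with two readings per subject, each subject's variance is
    d^2 / 2 (d = difference between readings), averaged over subjects.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    squared_diffs = [(a - b) ** 2 for a, b in zip(first, second)]
    return math.sqrt(sum(squared_diffs) / (2 * len(first)))


def repeatability_coefficient(s, multiplier=1.96):
    """Return multiplier * sqrt(2 * s^2), the repeatability coefficient."""
    return multiplier * math.sqrt(2 * s ** 2)


# Table 3 data: repeated maximum inspiratory pressure for five subjects.
day1 = [110, 94, 86, 120, 107]
day2 = [120, 105, 70, 142, 107]

s = within_subject_sd(day1, day2)
rc = repeatability_coefficient(s)
print(f"within-subject SD s = {s:.2f}")
print(f"repeatability coefficient = {rc:.2f}")
```

Under these assumptions the result is expressed in the same units as the measurements, which is why it cannot be read like a unit-free coefficient ranging from zero to one. Passing `multiplier=2` instead of 1.96 reproduces the 2.83s form of the calculation.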


References
Atkinson, G and Nevill, A M (1998). ‘Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine’, Sports Medicine, 26, 217-238.
Baumgarter, T A (1989). ‘Norm-referenced measurement: reliability’ in: Safrit, M J and Wood, T M (eds) Measurement Concepts in Physical Education and Exercise Science, Champaign, Illinois, pages 45-72.
Bland, J M (1987). An Introduction to Medical Statistics, Oxford University Press.
Bland, J M and Altman, D G (1986). ‘Statistical methods for assessing agreement between two methods of clinical measurement’, The Lancet, February 8, 307-310.
British Standards Institution (1979). ‘Precision of test methods. 1: Guide for the determination and reproducibility for a standard test method’ BS5497, part 1. BSI, London.
Chinn, S (1990). ‘The assessment of methods of measurement’, Statistics in Medicine, 9, 351-362.
Chinn, S (1991). ‘Repeatability and method comparison’, Thorax, 46, 454-456.
Eliasziw, M, Young, S L, Woodbury, M G et al (1994). ‘Statistical methodology for the concurrent assessment of inter-rater and intra-rater reliability: Using goniometric measurements as an example’, Physical Therapy, 74, 777-788.
Keating, J and Matyas, T (1998). ‘Unreliable inferences from reliable measurements’, Australian Journal of Physiotherapy, 44, 5-10.
Rankin, G and Stokes, M (1998). ‘Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses’, Clinical Rehabilitation, 12, 187-199.
Shrout, P E and Fleiss, J L (1979). ‘Intraclass correlations: Uses in assessing rater reliability’, Psychological Bulletin, 86, 420-428.
Stratford, P, Agostino, V, Brazeau, C and Gowitzke, B A (1984). ‘Reliability of joint angle measurement: A discussion of methodology issues’, Physiotherapy Canada, 36, 1, 5-9.




Key Messages
Reliability is:
• Not an all-or-none phenomenon.
• Open to interpretation.
• Not the same as clinical acceptability.
• Population specific.
• Related to the variability in the group studied.
• Best estimated by more than one index.




Reliability what is it, and how is it measured

  • 1. Reliability: What is it, and how is it measured?

by Anne Bruton, Joy H Conway, Stephen T Holgate

Key Words: Reliability, measurement, quantitative measures, statistical method.

Bruton, A, Conway, J H and Holgate, S T (2000). ‘Reliability: What is it and how is it measured?’ Physiotherapy, 86, 2, 94-99.

Physiotherapy February 2000/vol 86/no 2

Summary
Therapists regularly perform various measurements. How reliable these measurements are in themselves, and how reliable therapists are in using them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value. The aim of this paper is to explain the nature of reliability, and to describe some of the commonly used estimates that attempt to quantify it. An understanding of reliability, and how it is estimated, will help therapists to make sense of their own clinical findings, and to interpret published studies.

Although reliability is generally perceived as desirable, there is no firm definition as to the level of reliability required to reach clinical acceptability. As with hypothesis testing, statistically significant levels of reliability may not translate into clinically acceptable levels, so that some authors’ claims about reliability may need to be interpreted with caution. Reliability is generally population specific, so that caution is also advised in making comparisons between studies.

The current consensus is that no single estimate is sufficient to provide the full picture about reliability, and that different types of estimate should be used together.

Table 1: Common reasons why therapists perform measurements
- As part of patient assessment.
- As baseline or outcome measures.
- As aids to deciding upon treatment plans.
- As feedback for patients and other interested parties.
- As aids to making predictive judgements, eg about outcome.

Table 2: Examples of quantitative measures performed by physiotherapists
- Strength measures (eg in newtons of force, kilos lifted).
- Angle or range of motion measures (eg in degrees, centimetres).
- Velocity or speed measures (eg in litres per minute for peak expiratory flow rate).
- Length or circumference measures (eg in metres, centimetres).

Introduction
Therapists regularly perform various measurements of varying reliability. The term ‘reliability’ here refers to the consistency or repeatability of such measurements. Irrespective of the area in which they work, therapists take measurements for any or all of the reasons outlined in table 1. How reliable these measurements are in themselves, and how reliable therapists are in performing them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value.

This article focuses on the reliability of measures that generate quantitative data, and in particular ‘interval’ and ‘ratio’ data. Interval data have equal intervals between numbers but these are not related to true zero, so do not represent absolute quantity. Examples of interval data are IQ and degrees Centigrade or Fahrenheit. In the temperature scale, the difference between 10° and 20° is the same as between 70° and 80°, but is based on the numerical value of the scale, not the true nature of the variable itself. Therefore the actual difference in heat and molecular motion generated is not the same and it is not appropriate to say that someone is twice as hot as someone else.

With ratio data, numbers represent units with equal intervals, measured from true zero, eg distance, age, time, weight, strength, blood pressure, range of motion, height. Numbers therefore reflect actual amounts of the variable being measured, and it is appropriate to say that one person is twice as heavy, tall, etc, as another. The kind of quantitative measures that therapists often carry out are outlined in table 2.

The aim of this paper is to explain the nature of reliability, and to describe, in general terms, some of the commonly used methods for quantifying it. It is not intended to be a detailed account of the statistical
  • 2. minutiae associated with reliability measures, for which readers are referred to standard books on medical statistics.

Measurement Error
It is very rare to find any clinical measurement that is perfectly reliable, as all instruments and observers or measurers (raters) are fallible to some extent and all humans respond with some inconsistency. Thus any observed score (X) can be thought of as a function of two components, ie a true score (T) and an error component (E):

X = T ± E

The difference between the true value and the observed value is measurement error. In statistical terms, ‘error’ refers to all sources of variability that cannot be explained by the independent (also known as the predictor, or explanatory) variable. Since the error components are generally unknown, it is only possible to estimate the amount of any measurement that is attributable to error and the amount that represents an accurate reading. This estimate is our measure of reliability.

Measurement errors may be systematic or random. Systematic errors are predictable errors, occurring in one direction only, constant and biased. For example, when using a measurement that is susceptible to a learning effect (eg strength testing), a retest may be consistently higher than a prior test (perhaps due to improved motor unit co-ordination). Such a systematic error would not affect reliability, but would affect validity, as test values are not true representations of the quantity being measured. Random errors are due to chance and are unpredictable, thus they are the basic concern of reliability.

Types of Reliability
Baumgarter (1989) has identified two types of reliability, ie relative reliability and absolute reliability.

Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements. Tables 3 and 4 give some maximum inspiratory pressure (MIP) measures taken on two occasions, 48 hours apart. In table 3, although the differences between the two measures vary from -16 to +22 centimetres of water, the ranking remains unchanged. That is, on both day 1 and day 2 subject 4 had the highest MIP, subject 1 the second highest, subject 5 the third highest, and so on. This form of reliability is often assessed by some type of correlation coefficient, eg Pearson’s correlation coefficient, usually written as r. For table 3 the data give a Pearson’s correlation coefficient of r = 0.94, generally accepted to indicate a high degree of correlation.

Table 3: Repeated maximum inspiratory pressure (MIP) measures demonstrating good relative reliability (MIP in cm of water)

Subject   MIP day 1   MIP day 2   Difference   Rank day 1   Rank day 2
1         110         120         +10          2            2
2         94          105         +11          4            4
3         86          70          -16          5            5
4         120         142         +22          1            1
5         107         107         0            3            3

In table 4, however, although the differences between the two measures look similar to those in table 3 (ie -15 to +22 cm of water), on this occasion the ranking has changed. Subject 4 has the highest MIP on day 1, but is second highest on day 2, subject 1 had the second highest MIP on day 1, but the lowest MIP on day 2, and so on. For table 4 data r = 0.51, which would be interpreted as a low degree of correlation. Correlation coefficients thus give information about association between two variables, and not necessarily about their proximity.

Table 4: Repeated maximum inspiratory pressure (MIP) measures demonstrating poor relative reliability (MIP in cm of water)

Subject   MIP day 1   MIP day 2   Difference   Rank day 1   Rank day 2
1         110         95          -15          2            5
2         94          107         +13          4            3
3         86          97          +11          5            4
4         120         120         0            1            2
5         107         129         +22          3            1

Absolute reliability is the degree to which repeated measurements vary for individuals, ie the less they vary, the higher the reliability. This type of reliability is expressed either in the actual units of measurement, or as a proportion of the measured values. The standard error of measurement (SEM), coefficient of variation (CV) and Bland and Altman’s 95% limits of agreement (1986) are all examples of measures of absolute reliability. These will be described later.
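The contrast between tables 3 and 4 can be checked with a few lines of code. The following is a minimal sketch, not part of the original article: the function names `pearson_r` and `ranks` are illustrative, and the data are simply the day 1 and day 2 MIP values from the two tables. Computed directly from the tabulated values, the table 3 coefficient rounds to 0.94; the table 4 coefficient comes out at roughly 0.49, a little below the 0.51 quoted in the text, which may reflect rounding in the source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient for paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank each score within its set, 1 = highest (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

# MIP in cm of water; day 1 is identical in tables 3 and 4.
day1 = [110, 94, 86, 120, 107]
table3_day2 = [120, 105, 70, 142, 107]
table4_day2 = [95, 107, 97, 120, 129]

r_table3 = pearson_r(day1, table3_day2)
r_table4 = pearson_r(day1, table4_day2)

print("Table 3: r =", round(r_table3, 2),
      "| ranking preserved:", ranks(day1) == ranks(table3_day2))
print("Table 4: r =", round(r_table4, 2),
      "| ranking preserved:", ranks(day1) == ranks(table4_day2))
```

The rank comparison makes the point of the text explicit: the size of the day-to-day differences is similar in both tables, but only in table 3 do subjects keep their position in the sample.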
  • 3. Authors
Anne Bruton MA MCSP is currently involved in postgraduate research, Joy H Conway PhD MSc MCSP is a lecturer in physiotherapy, and Stephen T Holgate MD DSc FRCP is MRC professor of immunopharmacology, all at the University of Southampton.

This article was received on November 16, 1998, and accepted on September 7, 1999.

Address for Correspondence
Ms Anne Bruton, Health Research Unit, School of Health Professions and Rehabilitation Sciences, University of Southampton, Highfield, Southampton SO17 1BJ.

Funding
Anne Bruton is currently sponsored by a South and West Health Region R&D studentship.

Why Estimate Reliability?
Reliability testing is usually performed to assess one of the following:
- Instrumental reliability, ie the reliability of the measurement device.
- Rater reliability, ie the reliability of the researcher/observer/clinician administering the measurement device.
- Response reliability, ie the reliability/stability of the variable being measured.

How is Reliability Measured?
As described earlier, observed scores consist of the true value ± the error component. Since it is not possible to know the true value, the true reliability of any test is not calculable. It can however be estimated, based on the statistical concept of variance, ie a measure of the variability of differences among scores within a sample. The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance.

If a single measurer (rater) were to record the oxygen saturation of an individual 10 times, the resulting scores would not all be identical, but would exhibit some variance. Some of this total variance is due to true differences between scores (since oxygen saturation fluctuates), but some can be attributable to measurement error (E). Reliability (R) is the measure of the amount of the total variance attributable to true differences and can be expressed as the ratio of true score variance (T) to total variance, or:

R = T / (T + E)

This ratio gives a value known as a reliability coefficient. As the observed score approaches the true score, reliability increases, so that with zero error there is perfect reliability and a coefficient of 1, because the observed score is the same as the true score. Conversely, as error increases reliability diminishes, so that with maximal error there is no reliability and the coefficient approaches 0. There is, however, no such thing as a minimum acceptable level of reliability that can be applied to all measures, as this will vary depending on the use of the test.

Indices of Reliability
In common with medical literature, physiotherapy literature shows no consistency in authors’ choice of reliability estimate calculated for their data. Table 5 summarises the more common reliability indices found in the literature, which are described below.

Table 5: Reliability indices in common use
- Hypothesis tests for bias, eg paired t-test, analysis of variance.
- Correlation coefficients, eg Pearson’s, ICC.
- Standard error of measurement (SEM).
- Coefficient of variation (CV).
- Repeatability coefficient.
- Bland and Altman 95% limits of agreement.

Indices Based on Hypothesis Testing for Bias
The paired t-test and analysis of variance techniques are statistical methods for detecting systematic bias between groups of data. These estimates, based upon hypothesis testing, are often used in reliability studies. However, they give information only about systematic differences between the means of two sets of data, not about individual differences. Such tests should, therefore, not be used in isolation, but be complemented by other methods, eg Bland and Altman agreement tests (1986).

Correlation Coefficients (r)
As stated earlier, correlation coefficients give information about the degree of association between two sets of data, or the consistency of position within the two distributions. Provided the relative positions of each subject remain the same from test to test, high measures of correlation will be obtained. However, a correlation coefficient will not detect any systematic errors. So it is possible to have two sets of scores that are highly correlated, but not highly repeatable, as in table 6 where the hypothetical data give a Pearson’s correlation coefficient of r = 1, ie perfect correlation despite a systematic difference of 40 cm of water for each subject.

Thus correlation only tells how two sets of scores vary together, not the extent of agreement between them. Often researchers need to know that the actual values obtained by two measurements are the same, not just proportional to one another. Although published studies abound with correlation used as the sole indicator of reliability, their results can be misleading, and it is now recommended that they be no longer used in isolation (Keating and Matyas, 1998; Chinn, 1990).
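The two ideas in this part of the article, the reliability coefficient as a variance ratio and the blindness of correlation to systematic error, can be illustrated together. This is a minimal sketch under stated assumptions: the function names are illustrative, the variance values passed to `reliability_coefficient` are hypothetical, and the paired scores are the hypothetical MIP data the article presents as table 6.

```python
from math import sqrt

def reliability_coefficient(true_variance, error_variance):
    """R = T / (T + E): share of total variance due to true differences."""
    return true_variance / (true_variance + error_variance)

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient for paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical variances: with no error R is 1; as error grows R falls towards 0.
r_no_error = reliability_coefficient(9.0, 0.0)
r_some_error = reliability_coefficient(9.0, 1.0)
r_large_error = reliability_coefficient(9.0, 81.0)

# Table 6 data (MIP, cm of water): every subject scores 40 higher on day 2.
day1 = [110, 94, 86, 120, 107]
day2 = [150, 134, 126, 160, 147]
differences = [b - a for a, b in zip(day1, day2)]
mean_difference = sum(differences) / len(differences)
r_table6 = pearson_r(day1, day2)

print("R with zero, some, large error:", r_no_error, r_some_error, r_large_error)
print("Table 6: r =", round(r_table6, 2), "| mean difference =", mean_difference)
```

The correlation is perfect even though no subject obtained the same value twice, which is why the text recommends reporting a measure of agreement alongside any correlation coefficient.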
  • 4. Table 6: Repeated maximum inspiratory pressure (MIP) measures demonstrating a high Pearson’s correlation coefficient, but poor absolute reliability (MIP in cm of water)

Subject   MIP day 1   MIP day 2   Difference   Rank day 1   Rank day 2
1         110         150         +40          2            2
2         94          134         +40          4            4
3         86          126         +40          5            5
4         120         160         +40          1            1
5         107         147         +40          3            3

Intra-class Correlation Coefficient (ICC)
The intra-class correlation coefficient (ICC) is an attempt to overcome some of the limitations of the classic correlation coefficients. It is a single index calculated using variance estimates obtained through the partitioning of total variance into between and within subject variance (known as analysis of variance or ANOVA). It thus reflects both degree of consistency and agreement among ratings.

There are numerous versions of the ICC (Shrout and Fleiss, 1979) with each form being appropriate to specific situations. Readers interested in using the ICC can find worked examples relevant to rehabilitation in various published articles (Rankin and Stokes, 1998; Keating and Matyas, 1998; Stratford et al, 1984; Eliasziw et al, 1994). The use of the ICC implies that each component of variance has been estimated appropriately from sufficient data (at least 25 degrees of freedom), and from a sample representing the population to which the results will be applied (Chinn, 1991). In this instance, degrees of freedom can be thought of as the number of subjects multiplied by the number of measurements.

As with other reliability coefficients, there is no standard acceptable level of reliability using the ICC. It will range from 0 to 1, with values closer to 1 representing higher reliability. Chinn (1991) recommends that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful. The ICC is useful when comparing the repeatability of measures using different units, as it is a dimensionless statistic. It is most useful when three or more sets of observations are taken, either from a single sample or from independent samples. It does, however, have some disadvantages as described by Rankin and Stokes (1998) that make it unsuitable for use in isolation. As described earlier, any reliability coefficient is determined as the ratio of variance between subjects to the sum of error variance and subject variance. If the variance between subjects is sufficiently high (that is, the data come from a heterogeneous sample) then reliability will inevitably appear to be high. Thus if the ICC is applied to data from a group of individuals demonstrating a wide range of the measured characteristic, reliability will appear to be higher than when applied to a group demonstrating a narrow range of the same characteristic.

Standard Error of Measurement (SEM)
As mentioned earlier, if any measurement test were to be applied to a single subject an infinite number of times, it would be expected to generate responses that vary a little from trial to trial, as a result of measurement error. Theoretically these responses could be plotted and their distribution would follow a normal curve, with the mean equal to the true score, and errors occurring above and below the mean.

The more reliable the measurement response, the less error variability there would be around the mean. The standard deviation of measurement errors is therefore a reflection of the reliability of the test response, and is known as the standard error of measurement (SEM). The value for the SEM will vary from subject to subject, but there are equations for calculating a group estimate, eg

SEM = s_x × √(1 - r_xx)

where s_x is the standard deviation of the set of observed test scores and r_xx is the reliability coefficient for those data (often the ICC is used here).

The SEM is a measure of absolute reliability and is expressed in the actual units of measurement, making it easy to interpret, ie the smaller the SEM, the greater the reliability. It is only appropriate, however, for use with interval data (Atkinson and Nevill, 1998) since with ratio data the amount of random error may increase as the measured values increase.
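The group estimate of the SEM described above is straightforward to compute once a standard deviation and a reliability coefficient are available. The sketch below is illustrative only: the function name is an assumption, and the standard deviation of 10 cm of water and the reliability values are hypothetical figures chosen to show how the SEM shrinks as reliability rises.

```python
from math import sqrt

def standard_error_of_measurement(sd_observed, reliability):
    """Group estimate SEM = s_x * sqrt(1 - r_xx), in the units of the scores.

    sd_observed: standard deviation of the observed test scores (s_x).
    reliability: reliability coefficient for those scores (r_xx), eg an ICC.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd_observed * sqrt(1.0 - reliability)

# Hypothetical example: observed scores with a standard deviation of 10 cm of water.
sd_scores = 10.0
for r_xx in (0.99, 0.91, 0.75, 0.51):
    sem = standard_error_of_measurement(sd_scores, r_xx)
    print(f"r_xx = {r_xx:.2f} -> SEM = {sem:.1f} cm of water")

sem_example = standard_error_of_measurement(sd_scores, 0.75)
```

Because the result is in the units of measurement, it can be set directly against whatever size of error is considered clinically acceptable for the measure in question.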
  • 5. Coefficient of Variation (CV)
The CV is an often-quoted estimate of measurement error, particularly in laboratory studies where multiple repeated tests are standard procedure. One form of the CV is calculated as the standard deviation of the data, divided by the mean and multiplied by 100 to give a percentage score. This expresses the standard deviation as a proportion of the mean, making it unit independent. However, as Bland (1987) points out, the problem with expressing the error as a percentage is that x% of the smallest observation will differ markedly from x% of the largest observation. Chinn (1991) suggests that it is preferable to use the ICC rather than the CV, as the former relates the size of the error variation to the size of the variation of interest. It has been suggested that the above form of the CV should no longer be used to estimate reliability, and that other more appropriate methods should be employed based on analysis of variance of logarithmically transformed data (Atkinson and Nevill, 1998).

Repeatability Coefficient
Another way to present measurement error over two tests, as recommended by the British Standards Institution (1979), is the value below which the difference between the two measurements will lie with probability 0.95. This is based upon the within-subject standard deviation (s). Provided the measurement errors are from a normal distribution this can be estimated by 1.96 × √(2s²), which is approximately 2.77s (or 2.83s when 2 is used in place of 1.96), and is known as the repeatability coefficient (Bland and Altman, 1986). This name is rather confusing, as other coefficients (eg reliability coefficient) are expected to be unit free and in a range from zero to one. The method of calculation varies slightly in two different references (Bland and Altman, 1986; Bland, 1987), and to date it is not a frequently quoted statistic.

Bland and Altman Agreement Tests
In 1986 The Lancet published a paper by Bland and Altman that is frequently cited and has been instrumental in encouraging a changing use of reliability estimates in the medical literature. In the past, studies comparing the reliability of two different instruments designed to measure the same variable (eg two different types of goniometer) often quoted correlation coefficients and ICCs. These can both be misleading, however, and are not appropriate for method comparison studies for reasons described by Bland and Altman in their 1986 paper. These authors have therefore proposed an approach for assessing agreement between two different methods of clinical measurement. This involves calculating the mean for each method and using this in a series of agreement tests.

Step 1 consists of plotting the difference in the two results against the mean value from the two methods. Step 2 involves calculating the mean and standard deviation of the differences between the measures. Step 3 consists of calculating the 95% limits of agreement (as the mean difference plus or minus two standard deviations of the differences), and 95% confidence intervals for these limits of agreement. The advantages of this approach are that by using scatterplots, data can be visually interpreted fairly swiftly. Any outliers, bias, or relationship between variance in measures and size of the mean can therefore be observed easily. The 95% limits of agreement provide a range of error that may relate to clinical acceptability, although this needs to be interpreted with reference to the range of measures in the raw data.

In the same paper, Bland and Altman have a section headed ‘Repeatability’ in which they recommend the use of the ‘repeatability coefficient’ (described earlier) for studies involving repeated measures with the same instrument. In their final discussion, however, they suggest that their agreement testing approach may be used either for analysis of repeatability of a single measurement method, or for method comparison studies. Worked examples using Bland and Altman agreement tests can be found in their original paper, and more recently in papers by Atkinson and Nevill (1998) and Rankin and Stokes (1998).

Nature of Reliability
Unfortunately, the concept of reliability is complex, with less of the straightforward ‘black and white’ statistical theory that surrounds hypothesis testing. When testing a research hypothesis there are clear guidelines to help researchers and clinicians decide whether results indicate that the hypothesis can be supported or not. In contrast, the decision as to whether a particular measurement tool or method is reliable or not is more open to interpretation. The decision to be made is whether the level of measurement error is
  • 6. considered acceptable for practical use. There are no firm rules for making this decision, which will inevitably be context based. An error of ±5° in goniometry measures may be clinically acceptable in some circumstances, but may be less acceptable if definitive clinical decisions (eg surgical intervention) are dependent on the measure. Because of this dependence on the context in which they are produced, it is therefore very difficult to make comparisons of reliability across different studies, except in very general terms.

Conclusion
This paper has attempted to explain the concept of reliability and describe some of the estimates commonly used to quantify it. Key points to note about reliability are summarised in the panel below. Reliability should not necessarily be conceived as a property that a particular instrument or measurer does or does not possess. Any instrument will have a certain degree of reliability when applied to certain populations under certain conditions. The issue to be addressed is what level of reliability is considered to be clinically acceptable. In some circumstances there may be a choice only between a measure with lower reliability or no measure at all, in which case the less than perfect measure may still add useful information.

In recent years several authors have recommended that no single reliability estimate should be used for reliability studies. Opinion is divided over exactly which estimates are suitable for which circumstances. Rankin and Stokes (1998) have recently suggested that a consensus needs to be reached to establish which tests should be adopted universally. In general, however, it is suggested that no single estimate is universally appropriate, and that a combination of approaches is more likely to give a true picture of reliability.

References
Atkinson, G and Nevill, A M (1998). ‘Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine’, Sports Medicine, 26, 217-238.
Baumgarter, T A (1989). ‘Norm-referenced measurement: reliability’ in: Safrit, M J and Wood, T M (eds) Measurement Concepts in Physical Education and Exercise Science, Champaign, Illinois, pages 45-72.
Bland, J M (1987). An Introduction to Medical Statistics, Oxford University Press.
Bland, J M and Altman, D G (1986). ‘Statistical methods for assessing agreement between two methods of clinical measurement’, The Lancet, February 8, 307-310.
British Standards Institution (1979). ‘Precision of test methods. 1: Guide for the determination and reproducibility for a standard test method’ BS5497, part 1. BSI, London.
Chinn, S (1990). ‘The assessment of methods of measurement’, Statistics in Medicine, 9, 351-362.
Chinn, S (1991). ‘Repeatability and method comparison’, Thorax, 46, 454-456.
Eliasziw, M, Young, S L, Woodbury, M G et al (1994). ‘Statistical methodology for the concurrent assessment of inter-rater and intra-rater reliability: Using goniometric measurements as an example’, Physical Therapy, 74, 777-788.
Keating, J and Matyas, T (1998). ‘Unreliable inferences from reliable measurements’, Australian Journal of Physiotherapy, 44, 5-10.
Rankin, G and Stokes, M (1998). ‘Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses’, Clinical Rehabilitation, 12, 187-199.
Shrout, P E and Fleiss, J L (1979). ‘Intraclass correlations: Uses in assessing rater reliability’, Psychological Bulletin, 86, 420-428.
Stratford, P, Agostino, V, Brazeau, C and Gowitzke, B A (1984). ‘Reliability of joint angle measurement: A discussion of methodology issues’, Physiotherapy Canada, 36, 1, 5-9.

Key Messages
Reliability is:
- Population specific.
- Related to the variability in the group studied.
- Best estimated by more than one index.
- Not an all-or-none phenomenon.
- Open to interpretation.
- Not the same as clinical acceptability.
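Worked illustration of the absolute reliability indices

As a closing illustration of the indices of absolute reliability described in the article (coefficient of variation, repeatability coefficient, and Bland and Altman limits of agreement), the following sketch applies them to the table 3 MIP data. It is not taken from the article: the function names are illustrative, the readings passed to `coefficient_of_variation` are hypothetical, and the within-subject standard deviation is estimated from paired differences as √(Σd² / 2n), which is an assumption about method that the article does not spell out. The confidence intervals for the limits (part of step 3) are omitted.

```python
from math import sqrt
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """CV (%) = sample standard deviation / mean * 100."""
    return stdev(scores) / mean(scores) * 100.0

def limits_of_agreement(test1, test2, multiplier=2.0):
    """Mean difference and limits: mean +/- multiplier * SD of differences.

    multiplier=2.0 follows the 'two standard deviations' wording of the
    article; 1.96 is the other value in common use.
    """
    diffs = [b - a for a, b in zip(test1, test2)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - multiplier * sd, bias + multiplier * sd

def repeatability_coefficient(test1, test2):
    """1.96 * sqrt(2 * s^2), with s the within-subject standard deviation.

    Assumption: for two measurements per subject, s^2 = sum(d^2) / (2n).
    """
    diffs = [b - a for a, b in zip(test1, test2)]
    s_squared = sum(d * d for d in diffs) / (2 * len(diffs))
    return 1.96 * sqrt(2 * s_squared)

# Table 3 data: MIP in cm of water on two occasions.
day1 = [110, 94, 86, 120, 107]
day2 = [120, 105, 70, 142, 107]

bias, lower, upper = limits_of_agreement(day1, day2)
rc = repeatability_coefficient(day1, day2)
cv_example = coefficient_of_variation([98, 100, 102])  # hypothetical repeated readings

print(f"Mean difference (bias): {bias:.1f} cm of water")
print(f"Limits of agreement: {lower:.1f} to {upper:.1f} cm of water")
print(f"Repeatability coefficient: {rc:.1f} cm of water")
print(f"CV of the hypothetical readings: {cv_example:.1f}%")
```

Although table 3 shows a high correlation, the limits of agreement span tens of centimetres of water, which illustrates why the article recommends combining relative and absolute indices.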