Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011

Estimating Dyslexia in the Web

Ricardo Baeza-Yates
Yahoo! Research & Web Research Group,
Pompeu Fabra University, Barcelona, Spain

Luz Rello
Web Research and NLP Groups,
Pompeu Fabra University, Barcelona, Spain

W4A 2011, Hyderabad

Outline

— What

— Why

— How
    to distinguish dyslexic errors
    to build a sample
    to measure dyslexia

— Results


What

Dyslexia

Dyslexia is a neurologically-based disorder which interferes with the
acquisition and processing of language. It manifests itself with
difficulties in receptive and expressive language, including
phonological processing, in reading, writing, spelling and handwriting
and sometimes in arithmetic.
(Committee of Members Orton Dyslexia Society. Definition of Dyslexia, 1994.)

Dysphonetic dyslexia (The Boder’s Test of Reading-Spelling Patterns)

The largest of the three subtypes of dyslexia that the author presents.
Dysphonetic dyslexia is viewed as a disability in associating symbols
with sounds. The misspellings typical of this disorder are due to
phonetic inaccuracy. (Boder & Jarrico, 1982)


Why

All languages

There is a universal neuro-cognitive basis for dyslexia.
(Paulesu et al. 2001)

Its manifestations are culture-specific due to different orthographies.
(Alegria, 2006)

English is a language with deep orthography: the mapping between
letters, speech sounds, and whole-word sounds is often highly ambiguous,
and therefore examples of dyslexic errors are more widespread than in
languages with transparent or shallow orthography.
(Paulesu et al. 2001)


Why

Frequent

Researchers estimate that 10-17% of the population in the U.S.A. has
dyslexia, and that only 30% of dyslexics have trouble with reversing
letters and numbers. On the other hand, the level of dyslexia in other
regions such as Europe or China is lower.
(H. Meng et al., 2005)

There are around 38 million dyslexics in Europe.
(Ruiz del Árbol, 2008)


Why

Useful

Detecting the presence of dyslexic texts in the Web helps us to know the
real impact of dyslexia in the Web as well as to value
dyslexic-accessible practices.

There is common agreement in the studies that applying
dyslexic-accessible practices also benefits readability for non-dyslexic
users, as well as for users with other disabilities such as low vision.
(McCarthy & Swierenga, 2010) (Evett & Brown, 2005)

Spelling error rates have proven to be a useful index of website content
quality.
(Gelman & Barletta, 2008)


Why

Novel

Estimating dyslexia in a group of web pages depending on their domain.
(Ringlstetter et al. 2006)

This is the first attempt to estimate the amount of text containing
English dyslexic errors in the Web.


How

Two examples of dyslexic texts

There seams to be some confusetion. Althrow
he rembers the situartion, he is not clear on
detailes. With regard to deleteing parts,
could you advice me of the excat nature of the
promblem and I will investgate it imeaditly.

I halve a spelling chequer
It cam with my pea see
Eye now I’ve gut the spilling rite
Its plane fore al too sea ...
Its latter prefect awl the weigh
My chequer tolled mi sew.
(Pedler, 2007)


How

How many kinds of errors can be produced by a dyslexic?

Dyslexic errors

  Simple errors           53%
  Multi errors            39%
  Word boundary errors     8%
  Total                  100%

  Real-word errors        17%
  Non-word errors         83%
  Total                  100%

  First letter errors      5%

(Pedler, 2007)


How

How many kinds of errors in the Web?

1. Dyslexic errors: the different kinds of errors commonly made by
dyslexics, e.g. unfinished words or letters, omitted words, and inconsistent
spaces between words and letters (Vellutino, 1979). Example: *reiecve instead
of receive.

2. Regular spelling errors produced by non-impaired native English speakers,
such as the transposition error *recieve.

3. Regular typos caused by the adjacency of letters on the keyboard, e.g. *teceive.

4. OCR errors, due to letters of similar shape, e.g. *ieceive.

5. Errors made by non-native speakers who use English as a foreign
language. For example, *receibe is a typical error made by Spanish learners of
English, since in Spanish the graphemes ‘b’ and ‘v’ are both pronounced /b/, and
the phoneme /v/ does not exist in the standard Spanish phonemic system.


How

Selection criteria

To avoid the overlap of dyslexic errors and other errors:

— We consider only words written by dyslexics containing multi-errors,
that is, the dyslexic word differs from the intended correct word by
more than one letter. For example, the dyslexic word *konwlegde from
knowledge.

To avoid the overlap of dyslexic errors and real words:

— Errors which coincide with other existing words in English are
omitted, e.g. *trust when the intended word is truth.

— Errors which result in a proper name are also filtered; for
instance, the typo *wirries from worries is also a proper name.
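
The selection criteria above can be sketched as a small filter. This is an illustration only, under stated assumptions: the slides do not say how "differs by more than one letter" is computed, so the sketch uses the optimal string alignment distance (where swapping two adjacent letters counts as one edit), and `english_words` and `proper_names` are hypothetical stand-in lexicons.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def keep_candidate(error: str, intended: str,
                   english_words: set, proper_names: set) -> bool:
    """Keep an error only if it is neither a real word nor a proper name
    and it is a multi-error (assumed here: OSA distance > 1)."""
    if error in english_words or error in proper_names:
        return False
    return osa_distance(error, intended) > 1


# Hypothetical toy lexicons, for illustration only.
english_words = {"trust", "truth", "knowledge", "worries"}
proper_names = {"wirries"}

print(keep_candidate("konwlegde", "knowledge", english_words, proper_names))
print(keep_candidate("trust", "truth", english_words, proper_names))
```

With this distance, *recieve is a single edit away from receive and is therefore not kept, which matches its treatment as a regular spelling error rather than a dyslexic multi-error.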


How

Selection criteria

— All the dyslexic spelling errors are extracted from samples of text written by adults
with diagnosed dyslexia (extracted from a corpus compiled for this purpose) and from
the literature (Pedler, 2007).

— Among the dyslexic errors, we take into account the ones which include the letters
that produce the most confusion among dyslexic individuals, such as ‘b’, ‘d’, ‘p’, ‘m’, ‘n’,
‘u’ and ‘w’, together with other similar-looking letters. For instance, it is especially
frequent to find reversals of similar letters, such as ‘b’ and ‘d’ (Deloche et al. 1982),
e.g. *impossidle for the intended word impossible.

— Errors due to homophone confusion, that is, words which have a similar
pronunciation such as witch and which (Pedler, 2007), are not selected, even though
15% of the dyslexic errors in a corpus of dyslexic texts presented homophone confusion.
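
As a rough sketch of the letter-confusion criterion, the function below reports substitutions between confusable letters. It assumes the error and the intended word have the same length and compares them position by position; `CONFUSABLE` holds only the seven letters named on the slide, not the "other similar-looking letters", and both names are introduced here for illustration.

```python
# Letters the slide lists as most confusing for dyslexic writers.
CONFUSABLE = set("bdpmnuw")


def confusable_substitutions(error: str, intended: str) -> list:
    """Return (intended_letter, written_letter) pairs where one confusable
    letter was written in place of another. Only equal-length words are
    compared, position by position."""
    if len(error) != len(intended):
        return []
    return [(ok, bad) for ok, bad in zip(intended, error)
            if ok != bad and ok in CONFUSABLE and bad in CONFUSABLE]


print(confusable_substitutions("impossidle", "impossible"))  # [('b', 'd')]
```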


How

Sample D, an example for the word comparison

1. Dyslexic error: *comaprsion.

2. Spelling errors: *comparision, *conparison and *coparison.

3. Typos: *vomparison, *xomparison, *cimparison, *cpmparison,
*conparison, *co,parison, *comoarison, *com[arison,
*comprison, *compsrison, *compaeison, *compatison,
*comparuson, *comparoson, *compariaon,*comparidon,
*comparisin, *comparispn, *comparisob and *comparisom.

4. OCR errors: *compaiison and *comparisom.

5. Non-native speakers errors: *comparition and *comparizon.
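
The typo list for comparison can be approximated with a short generator. This is a sketch, assuming typos are single-letter substitutions by the key immediately to the left or right on a QWERTY keyboard; `QWERTY_ROWS` and `adjacent_key_typos` are names introduced here, not part of the original work.

```python
# Letter rows of a QWERTY keyboard, including the punctuation keys
# that sit next to letters.
QWERTY_ROWS = ["qwertyuiop[", "asdfghjkl;", "zxcvbnm,"]


def adjacent_key_typos(word: str) -> list:
    """Replace each letter in turn with its left and right keyboard neighbor."""
    typos = []
    for pos, ch in enumerate(word):
        for row in QWERTY_ROWS:
            idx = row.find(ch)
            if idx == -1:
                continue
            for neighbor in (idx - 1, idx + 1):
                if 0 <= neighbor < len(row):
                    typos.append(word[:pos] + row[neighbor] + word[pos + 1:])
    return typos


typos = adjacent_key_typos("comparison")
print(len(typos))  # 19
print(typos[:4])   # ['xomparison', 'vomparison', 'cimparison', 'cpmparison']
```

Under this assumption the sketch yields 19 of the 20 typos listed for comparison; *comprison drops a letter rather than substituting one, so it is not generated.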


How

Sample D, dyslexic errors

comparison *comaprsion
understanding *understangind
knowledge *knwolegde
impossible *inpossbile
tomorrow *torromow
worries *worires
explain *exaplin
interesting *intersenting
situation *situartion
confusion *confusetion


How


— Let us define:

f : fraction of Web pages with lexical errors.
d : fraction of dyslexic errors among all lexical errors.

— Then, the fraction of Web pages with dyslexic errors is f × d.

— We find lower bounds for f and d, to obtain a lower bound for the
fraction of dyslexic pages in the Web.


How


— We use the main search engines (Bing, Google and Yahoo!)
to estimate the document frequency of a word.

— Each of the words in our list is searched only in English web
pages, to avoid cases of wrong words that may have a meaning
in another language.


How


— We bound the relative fraction of documents with lexical errors, f, by
using a sample of frequent words that appear in most documents,
usually called stopwords in information retrieval (for example, the
misspellings *becuase and *trhough).

— We use the largest relative fraction of misspellings over all these words to
estimate f, as we cannot assume that all of them appear in different pages.

— To bound d we do the same frequency search with a sample of non-frequent
words (Sample D), where we can distinguish the different types of
errors without ambiguity.
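
The bound on f can be sketched as follows. The slides do not define "relative fraction" precisely, so this assumes it is the document frequency of the misspelling divided by the combined document frequency of the misspelled and correct forms. The counts are placeholder numbers, not measurements from any search engine.

```python
def relative_misspelling_fraction(df_misspelled: int, df_correct: int) -> float:
    """Share of documents containing the misspelled form (assumed definition)."""
    return df_misspelled / (df_misspelled + df_correct)


def lower_bound_f(counts: dict) -> float:
    """Take the largest per-word fraction. Summing the fractions could
    overcount, because several misspellings may occur on the same page."""
    return max(relative_misspelling_fraction(m, c) for m, c in counts.values())


# Placeholder document frequencies: (misspelled, correct). Not real data.
placeholder_counts = {
    "because": (3, 997),
    "through": (1, 999),
}
print(lower_bound_f(placeholder_counts))  # 0.003
```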


Results


Range of percentages and average for the different error classes.

We use the real document frequencies of the terms from one of
the search engines to validate the results obtained, finding very
similar results.


Results


— From Sample D, the percentage of dyslexic errors among all
lexical errors is very low, with an average of 0.67%.

— From Pedler (2007), only 39% of dyslexic errors are multi-errors.

— Since Sample D contains only multi-errors, the measured value
underestimates d by a factor of 1/0.39; we can safely use a factor of 3
to correct for this.

— We have that f is at least 0.27%, from the word becuase.

— Then, we can estimate d as 3 × 0.67% = 2.01%.

— The lower bound for dyslexia in the Web is f × d ≈ 0.005%.
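
The arithmetic behind these figures can be checked with a few lines. The variable names are introduced here for illustration; the numeric values are the ones stated on the slide. Note that 1/0.39 is about 2.56, while the slides apply a rounded factor of 3.

```python
measured_multi_error_fraction = 0.0067  # 0.67%, average from Sample D
multi_error_share = 0.39                # share of multi-errors (Pedler, 2007)
correction_factor = 3                   # rounded factor used on the slide
f_lower = 0.0027                        # 0.27%, from the word "becuase"

exact_factor = 1 / multi_error_share
d_estimate = measured_multi_error_fraction * correction_factor
web_lower_bound = f_lower * d_estimate

print(f"exact factor = {exact_factor:.2f}")  # 2.56
print(f"d = {d_estimate:.2%}")               # 2.01%
print(f"f x d = {web_lower_bound:.4%}")      # 0.0054%
```

The product is about 0.0054%, which the slide reports as 0.005%.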


Conclusions

• The amount of dyslexic text in the Web is not as large as it could
be. This suggests that the widespread use of spell checkers
ameliorates dyslexia in the Web.

• Particular words can be used to detect dyslexic texts, and hence
dyslexic users. This can be used to improve Web accessibility as
well as future spell checkers or other tools targeted at dyslexic users.


Conclusions

• Since this is the first attempt to estimate text written by dyslexic
individuals in the Web, a comparison with previous work is not possible.

• Previous research on dyslexia reveals that error frequency is related
to word length (Pedler, 2007). Short words such as there, where, form,
etc. are misspelled much more frequently in dyslexic texts than long words
like the ones used in our experiments. Hence, we can obtain a better estimate
by using a larger sample of stopwords as well as long dyslexic words.

• As a byproduct, we have found that other types of errors are much more
frequent in the Web, and this can be used to assess the quality of Web
text.


On-going Work

New methodology.
Sample enlarged to 50 words.
Real data extracted from a leading search engine.
Up-down/Left-right typos.
New lower bound: 0.8 % (16 times better).

Range of percentages and average for the different error classes.


Future Work

1 — Identification of dyslexic errors. Dyslexia diagnosis.

2 — NLP techniques for making text more accessible for
dyslexic users.

3 — Web quality estimation (Gelman & Barletta, 2008),
across countries, domains and social media.



Zank u beri mach

