Ricardo Baeza-Yates, Luz Rello - Estimating Dyslexia in the Web - W4A - 2011
1. Estimating Dyslexia in the Web
Ricardo Baeza-Yates, Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain
Luz Rello, Web Research and NLP Groups, Pompeu Fabra University, Barcelona, Spain
W4A 2011, Hyderabad
2. Outline
— What
— Why
— How
  to distinguish dyslexic errors
  to build a sample
  to measure dyslexia
— Results
Ricardo Baeza-Yates and Luz Rello W4A 2011, Hyderabad Estimating Dyslexia in the Web
3. What
Dyslexia
Dyslexia is a neurologically-based disorder which interferes with the
acquisition and processing of language. It manifests itself with difficulties
in receptive and expressive language, including phonological processing, in
reading, writing, spelling and handwriting and sometimes in arithmetic.
(Committee of Members Orton Dyslexia Society. Definition of Dyslexia, 1994.)
Dysphonetic dyslexia (The Boder’s Test of Reading-Spelling Patterns)
The largest of the three subtypes of dyslexia that the author presents.
Dysphonetic dyslexia is viewed as a disability in associating symbols with
sounds. The misspellings typical of this disorder are due to phonetic
inaccuracy. (Boder & Jarrico, 1982)
4. Why
All languages
There is a universal neuro-cognitive basis for dyslexia. (Paulesu et al. 2001)
Its manifestations are culture-specific due to different orthographies.
(Alegria, 2006)
English is a language with deep orthography: the mapping between letters,
speech sounds, and whole-word sounds is often highly ambiguous, and therefore
examples of dyslexia are more widespread than in other languages with
transparent or shallow orthography. (Paulesu et al. 2001)
5. Why
Frequent
Researchers estimate that 10-17% of the population in the U.S.A. has dyslexia,
and only 30% of dyslexics have trouble with reversing letters and numbers. On
the other hand, the level of dyslexia in other regions such as Europe or China
is lower. (H. Meng et al., 2005)
There are around 38 million dyslexics in Europe. (Ruiz del Árbol, 2008)
6. Why
Useful
Detecting the presence of dyslexic texts in the Web helps us to know the real
impact of dyslexia in the Web as well as to value dyslexic-accessible practices.
There is common agreement in previous studies (McCarthy & Swierenga, 2010;
Evett & Brown, 2005) that the application of dyslexic-accessible practices also
benefits readability for non-dyslexic users as well as for other users with
disabilities such as low vision.
Spelling error rates have proven to be a useful index for website content
quality. (Gelman & Barletta, 2008)
7. Why
Novel
Estimating dyslexia in a group of web pages depending on their domain.
(Ringlstetter et al. 2006)
This is the first attempt to estimate the number of texts containing English
dyslexic errors in the Web.
8. How
Two examples of dyslexic texts
There seams to be some confusetion. Althrow
he rembers the situartion, he is not clear on
detailes. With regard to deleteing parts,
could you advice me of the excat nature of the
promblem and I will investgate it imeaditly.

I halve a spelling chequer
It cam with my pea see
Eye now I’ve gut the spilling rite
Its plane fore al too sea ...
Its latter prefect awl the weigh
My chequer tolled mi sew.
(Pedler, 2007)
9. How
How many kinds of errors can be produced by a dyslexic?
Dyslexic errors:
Simple errors           53%
Multi errors            39%
Word boundary errors     8%
Total                  100%

Real-word errors        17%
Non-word errors         83%
Total                  100%

First letter errors      5%
(Pedler, 2007)
10. How
How many kinds of errors in the Web?
1. Dyslexic errors: the different kinds of errors commonly made by dyslexics,
e.g. unfinished words or letters, omitted words, and inconsistent spaces
between words and letters (Vellutino, 1979). For example, *reiecve instead of receive.
2. Regular spelling errors produced by non-impaired native English speakers,
such as the transposition error *recieve.
3. Regular typos caused by the adjacency of letters on the keyboard, e.g. *teceive.
4. OCR errors, due to letters of similar shape, such as *ieceive.
5. Errors made by non-native speakers who use English as a foreign
language. For example, *receibe is a typical error made by Spanish learners of
English, since in Spanish the graphemes ‘b’ and ‘v’ are both pronounced as /b/, and
the phoneme /v/ does not exist in the standard Spanish phonemic system.
11. How
Selection criteria
To avoid the overlap of dyslexic errors and other errors:
— We consider only words written by dyslexics containing multi-errors,
that is, the dyslexic word differs from the intended correct
word by more than one letter. For example, the dyslexic word
*konwlegde from knowledge.
To avoid the overlap of dyslexic errors and real words:
— Errors which coincide with other existing words in English are
omitted, e.g. *trust being the intended word truth.
— Errors which result in a proper name are also filtered; for
instance, the typo *wirries from worries is also a proper name.
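The selection criteria above can be sketched as a small filter. This is only an illustration, not the authors' implementation: the function names, the toy lexicon and the proper-name set are assumptions, and "differs by more than one letter" is interpreted here as an optimal string alignment distance greater than 1, so that a single adjacent transposition such as *recieve counts as one error.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    prev2: list[int] = []
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "ie" written for "ei".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def keep_candidate(error: str, intended: str,
                   lexicon: set[str], proper_names: set[str]) -> bool:
    """Keep an error only if it is a multi-error and is neither an
    existing English word nor a proper name."""
    if osa_distance(error, intended) <= 1:
        return False
    return error not in lexicon and error not in proper_names


# Toy word lists, assumed for illustration only.
lexicon = {"knowledge", "truth", "trust", "worries", "receive"}
proper_names = {"wirries"}

print(keep_candidate("konwlegde", "knowledge", lexicon, proper_names))
print(keep_candidate("trust", "truth", lexicon, proper_names))
print(keep_candidate("wirries", "worries", lexicon, proper_names))
```

With these toy lists, only *konwlegde passes the filter; *trust is rejected as a real word and *wirries as both a single-letter error and a proper name.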
12. How
Selection criteria
— All the dyslexic spelling errors are extracted from samples of text written by adults
with diagnosed dyslexia (extracted from a corpus compiled for this purpose) and from
the literature (Pedler, 2007).
— Among the dyslexic errors, we take into account the ones which include the letters
that produce the most confusion among dyslexic individuals, such as ‘b’, ‘d’, ‘p’, ‘m’, ‘n’,
‘u’ and ‘w’, together with other similar-looking letters. For instance, it is especially
frequent to find reversals of similar letters, such as ‘b’ and ‘d’ (Deloche et al. 1982),
e.g. *impossidle being the intended word impossible.
— Errors due to homophone confusion, that is, words which have a similar
pronunciation such as witch and which (Pedler, 2007), are not selected, even though
15% of the dyslexic errors in a corpus of dyslexic texts presented homophone confusion.
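As a rough illustration of the confusable-letter criterion, the check below flags an error when a letter from the set listed above replaces another letter from the same set at the same position. The function name and the exact rule are assumptions made for this sketch; the slides do not specify how the criterion was applied.

```python
# Letters named in the selection criteria as producing the most confusion.
CONFUSABLE = set("bdpmnuw")


def has_confusable_swap(error: str, intended: str) -> bool:
    """True if some position differs and both letters are confusable.
    Only same-length pairs are compared (a simplifying assumption)."""
    if len(error) != len(intended):
        return False
    return any(e != c and e in CONFUSABLE and c in CONFUSABLE
               for e, c in zip(error, intended))


print(has_confusable_swap("impossidle", "impossible"))  # 'd' written for 'b'
print(has_confusable_swap("recieve", "receive"))        # no confusable letters involved
```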
13. How
Sample D, an example for the word comparison
1. Dyslexic error: *comaprsion.
2. Spelling errors: *comparision, *conparison and *coparison.
3. Typos: *vomparison, *xomparison, *cimparison, *cpmparison,
*conparison, *co,parison, *comoarison, *com[arison,
*comprison, *compsrison, *compaeison, *compatison,
*comparuson, *comparoson, *compariaon, *comparidon,
*comparisin, *comparispn, *comparisob and *comparisom.
4. OCR errors: *compaiison and *comparisom.
5. Non-native speakers' errors: *comparition and *comparizon.
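The typo class in the list above appears to follow left and right neighbours on a QWERTY keyboard row. A minimal sketch of how such candidates could be generated is shown below; the row layout and the function names are assumptions, and the slide's *comprison (a deletion) is not produced by this substitution-only sketch.

```python
# Assumed US QWERTY letter rows, including the punctuation keys that
# show up in the slide's examples (',' next to 'm', '[' next to 'p').
QWERTY_ROWS = ["qwertyuiop[]", "asdfghjkl;'", "zxcvbnm,./"]


def row_neighbours(ch: str) -> list[str]:
    """Keys immediately left and right of ch on its keyboard row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []


def adjacent_typos(word: str) -> list[str]:
    """All variants of word with one letter replaced by a row neighbour."""
    typos = []
    for i, ch in enumerate(word):
        for neighbour in row_neighbours(ch):
            typos.append(word[:i] + neighbour + word[i + 1:])
    return typos


for typo in adjacent_typos("comparison"):
    print("*" + typo)
```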
14. How
Sample D, dyslexic errors
comparison *comaprsion
understanding *understangind
knowledge *knwolegde
impossible *inpossbile
tomorrow *torromow
worries *worires
explain *exaplin
interesting *intersenting
situation *situartion
confusion *confusetion
15. How
Estimating Dyslexia in the Web
— Let us define:
f : fraction of Web pages with lexical errors.
d : fraction of dyslexic errors among all lexical errors.
— Then, the fraction of Web pages with dyslexia is f × d.
— We find a lower bound for f and d, to obtain a lower bound for the
fraction of dyslexic pages in the Web.
16. How
Estimating Dyslexia in the Web
— We use the main search engines (Bing, Google and Yahoo!)
to estimate the document frequency of a word.
— Each of the words in our list is searched only in English web
pages to avoid cases of wrong words that may have a meaning
in other languages.
17. How
Estimating Dyslexia in the Web
— We bound the relative fraction of documents with lexical errors, f, by
using a sample of frequent words that appear in most documents, usually
called stopwords in information retrieval, and their misspellings
(becuase, trhough, etc.).
— We use the largest relative fraction of misspellings over all these words to
estimate f, as we cannot assume that all of them appear in different pages.
— To bound d we do the same frequency search with a sample of non-frequent
words (Sample D) where we can distinguish the different types of
errors without ambiguity.
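The bound on f described above can be sketched as follows. The document-frequency counts are hypothetical placeholders, not measurements from any search engine, and the function name is an assumption; the "relative fraction" is taken here to be the document frequency of the misspelling divided by that of the correct stopword.

```python
def lower_bound_f(doc_freqs: dict[str, tuple[int, int]]) -> float:
    """Largest ratio df(misspelling) / df(correct word) over all stopwords.

    The maximum is used rather than the sum because the misspellings
    may occur in the same pages, so summing could count pages twice."""
    return max(miss / correct for correct, miss in doc_freqs.values())


# Hypothetical counts: correct word -> (df of correct form, df of misspelling).
doc_freqs = {
    "because": (1_000_000, 2_700),   # misspelling: becuase
    "through": (800_000, 800),       # misspelling: trhough
}

f = lower_bound_f(doc_freqs)
print(f"lower bound for f: {f:.2%}")
```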
18. Results
Estimating Dyslexia in the Web
Range of percentages and average for the
different error classes.
We use the real document frequencies of the terms from one of
the search engines to validate the results obtained, finding very
similar results.
19. Results
Estimating Dyslexia in the Web
— From Sample D, the percentage of dyslexic errors among all
lexical errors is very low, with an average of 0.67%.
— From Pedler (2007), only 39% of dyslexic errors are multi-errors.
— This implies that the lower bound for d is at least the measured value
divided by 0.39, but we can safely use a factor of 3 to correct for this.
— We have that f is at least 0.27%, from the word becuase.
— Then, we can estimate d as 3 × 0.67% = 2.01%.
— Lower bound for dyslexia in the Web is f × d = 0.27% × 2.01%, about 0.005%.
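The arithmetic behind these figures can be checked with a few lines. The variable names are chosen for this sketch; all numeric values are the ones given on the slide.

```python
measured_d = 0.0067          # average share of dyslexic errors in Sample D
multi_error_share = 0.39     # share of dyslexic errors that are multi-errors (Pedler, 2007)
correction_factor = 3        # factor used on the slide
f = 0.0027                   # lower bound for f, from the word "becuase"

# Factor implied by dividing by the multi-error share, for comparison.
implied_factor = 1 / multi_error_share

d = measured_d * correction_factor
lower_bound = f * d

print(f"implied factor 1/0.39 = {implied_factor:.2f}")
print(f"d = {d:.2%}")
print(f"f x d = {lower_bound:.4%}")
```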
20. Conclusions
• The amount of dyslexic text in the Web is not as large as it could
be. This suggests that the widespread use of spell checkers
ameliorates dyslexia in the Web.
• Particular words can be used to detect dyslexic texts, and hence
dyslexic users. This can be used to improve Web accessibility as
well as future spell checkers or other tools targeted to dyslexic users.
21. Conclusions
• Since this is the first attempt to estimate text written by dyslexic
individuals in the Web, a comparison with previous work is not possible.
• Previous research on dyslexia reveals that error frequency is related
to word length (Pedler, 2007). Short words such as there, where, form,
etc. are misspelled much more frequently in dyslexic texts than long words
like the ones used in our experiments. Hence, we can obtain a better estimate
by using a larger sample of stopwords as well as long dyslexic words.
• As a byproduct we have found that other types of errors are much more
frequent in the Web and this can be used to assess the quality of Web
text.
22. On-going Work
New methodology.
Sample enlarged to 50 words.
Real data extracted from a leading search engine.
Up-down/Left-right typos.
New lower bound: 0.8 % (16 times better).
Range of percentages and average for the
different error classes.
23. Future Work
1 — Identification of dyslexic errors. Dyslexia diagnosis.
2 — NLP techniques for making text more accessible for
dyslexic users.
3 — Web quality estimation (Gelman & Barletta, 2008),
across countries, domains and social media.
24. Outline
Zank u beri mach