Building a corpus from the WWW for Arabic
1. Building a corpus from the
WWW for Arabic
Arabic NLP group at Imam University 2013
Al-Fridi A., Bhattab R., Al-Rakaf N.
2. Outline
• Introduction
• Data collection
• Data processing
• Architecture
• Problems
• Tools and methodology
• Conclusion
3. Introduction
• Building a corpus requires considerable time and effort.
• Texts may not be easily available for building a
corpus.
• A new strand of research has developed around web data.
• The web is immense, free and available.
• The web serves as a source of language data because
it is far larger than other sources.
• The idea of building corpora dates back to 1897, with
the German linguist Kaeding.
4. Data collection
• There are many ways to collect data from
websites.
• We used a locally developed spider program to get the
data from each site.
• We used the Arabic Optical Character Recognition (OCR)
program Automatic Reader.
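
The spider itself is not shown in these slides, so the following is only a minimal sketch of how a same-site crawler could work. The names `LinkAndTextParser`, `crawl`, and the `fetch` parameter are hypothetical; `fetch` stands for any function that returns the HTML of a URL (or None on failure), which keeps the sketch independent of a particular download library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is an assumed callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to its extracted text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in as a parameter is a design choice for this sketch: the crawl logic can be tried on a small in-memory dictionary of pages before it is pointed at a real site.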
8. Data processing
The processing of the data to obtain the corpus
consisted of the following steps:
• Language classification.
• Linguistic filtering.
• Processing.
• Corpus indexing.
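
The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the group's actual implementation: language classification is approximated by the share of Arabic letters, linguistic filtering by a minimum token count plus exact-duplicate removal, processing by stripping diacritics and tatweel, and indexing by a simple inverted index. All function names and thresholds are hypothetical.

```python
import re
from collections import defaultdict

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
DIACRITICS_AND_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")
TOKEN = re.compile(r"[\u0621-\u064A]+")


def is_arabic(text, threshold=0.5):
    """Language classification: share of Arabic letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= threshold


def normalize(text):
    """Processing: remove diacritics and tatweel, collapse whitespace."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(documents, min_tokens=3):
    """Classify, normalize, and filter documents; return the kept texts."""
    corpus, seen = [], set()
    for doc in documents:
        if not is_arabic(doc):
            continue
        clean = normalize(doc)
        # Linguistic filtering: drop very short texts and exact duplicates.
        if len(TOKEN.findall(clean)) < min_tokens or clean in seen:
            continue
        seen.add(clean)
        corpus.append(clean)
    return corpus


def build_index(corpus):
    """Corpus indexing: map each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        for token in TOKEN.findall(doc):
            index[token].add(doc_id)
    return index
```

A real pipeline would likely use a trained language identifier and near-duplicate detection; the character-ratio test is only the simplest stand-in.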
14. BootCaT
• BootCaT was the first to propose a full procedure for
the automated extraction of specialized corpora and
technical terms by web mining.
• Let's try to build a corpus.
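
As a rough illustration of the first stage of such a procedure, the sketch below turns a handful of seed terms into random tuples that could be sent to a search engine as queries. It assumes the usual seed-and-tuple approach described for BootCaT and does not call the BootCaT tool itself; `seed_tuples`, `to_query`, and their parameters are hypothetical names.

```python
import itertools
import random


def seed_tuples(seeds, tuple_size=3, n_tuples=10, rng_seed=0):
    """Draw distinct random combinations of seed terms to use as queries."""
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    return rng.sample(combos, min(n_tuples, len(combos)))


def to_query(terms):
    """Format one tuple as a query string with each term quoted."""
    return " ".join(f'"{term}"' for term in terms)
```

The pages returned for each query would then be downloaded and cleaned, and new terms extracted from them can serve as seeds for another round.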
15. Sketch Engine
Introduction
• The Sketch Engine is a corpus processing system
developed in 2002.
• The basic elements of the Sketch Engine are
concordances, word sketches, grammatical
relations, and a distributional thesaurus.
• The Sketch Engine service makes a number of
large web corpora available for online
analysis through a web-based corpus
query interface.
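
To make the notion of a concordance concrete, here is a minimal keyword-in-context sketch over a tokenized text. It is not Sketch Engine code; the function name `concordance` and the `width` parameter are assumptions chosen for this illustration.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

For example, calling `concordance("the web is a big source and the web is free".split(), "web", width=2)` yields one line per occurrence of "web", each with up to two tokens of context on either side.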
16. Sketch Engine
Implementation and Design
• The Sketch Engine has its own query system.
• A word sketch summarizes a word's grammatical
relations: subject, object, prepositional object,
and modifier.
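
A word sketch can be pictured as collocates grouped by grammatical relation. The sketch below assumes the text has already been parsed into (relation, headword, collocate) triples and ranks collocates by raw frequency, which is a simplification: the Sketch Engine uses a statistical salience score rather than plain counts. The name `word_sketch` and the triple format are hypothetical.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group a headword's collocates by relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}
```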
20. Conclusion
• Building a corpus from the WWW for Arabic.
• Ways of collecting data from the web.
• The problems we faced and the tools that
supported us in building the corpus.