4. Shallow parsing (also called chunking or "light parsing") is an
analysis of a sentence that identifies the
constituents (noun groups, verbs, verb groups, etc.)
but specifies neither their internal structure nor their
role in the main sentence.
It is a technique widely used in natural language
processing. It is similar to the concept of lexical
analysis for computer languages.
Shallow Parser
5. A "parser" is a system that transforms sentences (strings of
characters) into a representation that describes the groupings
of words (phrases) and their relations (e.g. subject and
object). The representation of choice for such information is a
syntactic tree in which nodes refer to phrases, word
categories, or words, and links refer to relations between
these objects.
Why Shallow Parser?
6. A full parser turns the sentence into a tree whose leaves hold POS tags (which
correspond to the words in the sentence), while the rest of the tree tells
you exactly how these words join together to make the overall
sentence.
For example, an adjective and a noun might combine into a 'Noun Phrase',
which might combine with another adjective to form another Noun
Phrase (e.g. quick brown fox). The exact way the pieces combine depends
on the parser in question.
A shallow parser, or 'chunker', sits somewhere between a POS tagger and a
full parser. A plain POS tagger is very fast but does not give you enough
information, and a full-blown parser is slow and gives you too much. A POS
tagger can be thought of as a parser which only returns the bottom-most tier
of the parse tree to you.
A chunker might be thought of as a parser that returns some other tier of
the parse tree to you instead. Sometimes you just need to know that a
bunch of words together form a Noun Phrase but don't care about the
sub-structure of the tree within those words (i.e. which words are
adjectives, determiners, nouns, etc., and how they combine). In such
cases you can use a chunker to get exactly the information you need
instead of wasting time generating the full parse tree for the sentence.
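The idea of returning only one tier of the tree can be illustrated with a minimal sketch. It assumes the words are already POS-tagged with a small, made-up tag set (DT, JJ, NN, and so on), and the noun-phrase rule "optional determiner, any adjectives, one or more nouns" is an assumption chosen for illustration, not the rule set used in this project.

```python
import re

# Map each POS tag to one character so a chunk rule can be a plain regex.
# Any tag that is not listed is treated as "O" (outside a noun phrase).
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N"}

# Assumed NP rule: optional determiner, any adjectives, one or more nouns.
NP_RULE = re.compile(r"D?J*N+")


def np_chunks(tagged_words):
    """Return flat noun-phrase chunks from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_words)
    chunks = []
    for match in NP_RULE.finditer(codes):
        words = [word for word, _ in tagged_words[match.start():match.end()]]
        chunks.append(("NP", words))
    return chunks


tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VB"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(np_chunks(tagged))
```

Each chunk is a flat list of words: the sketch reports that "the quick brown fox" is a noun phrase without building the nested structure inside it.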
Difference b/w Shallow
Parser and POS Tagger
7. Morphology
Morphology is the part of linguistics that deals with the
study of words, their internal structure and, partially, their
meanings. In NLP it also covers identifying a word's stem from its
full word form. A morpheme is the smallest unit
that carries meaning or fulfils some grammatical function.
Morphology
8. Morphological analysis
Morphological analysis is the process of providing grammatical
information about a word given its suffix.
Models
There are three principal approaches to morphology, each of which tries to
capture the distinctions above in a different way. These are:
• Morpheme-based morphology also known as Item-and-Arrangement
approach.
• Lexeme-based morphology also known as Item-and-Process
approach.
• Word-based morphology also known as Word-and-Paradigm
approach.
Morphological Analysis
and Models
9. Morphological Analyzer
A morphological analyzer is a program for analyzing the
morphology of an input word; it detects the morphemes in a
text.
We currently consider two types of morphological analyzers
for Indian languages:
1. Phrase level Morph Analyzer
2. Word level Morph Analyzer
Morphological Analyzer
10. Transliteration is the conversion of a text from one script to
another.
For instance:
kaay kam karato = काय कम करतो
kyaa chal rahaa hai = क्या चल रहा है
Transliteration can form an essential part
of transcription, which converts text from one writing
system into another. Transliteration is not concerned with
representing the phonemics of the original.
Transliteration
11. We surveyed research papers, blogs and other internet sources in
detail for our project. There are various approaches to developing
morphological analyzers, such as the Finite State Automata (FSA)
approach, the Two-Level Morphology approach, the Finite State
Transducer (FST) approach, stemmer algorithms, the corpus-based
approach, DAWG (Directed Acyclic Word Graph) and the
paradigm-based approach. Among these, the FST-based approach is
reported to be the most efficient for developing a morphological
analyzer for Hindi, which is a highly inflectional language.
There are several approaches to constructing a shallow parser, such as
chunker-based, HMM-based and memory-based shallow parsers, shallow
parsers based on conditional random fields, and shallow parsers based
on the Winnow algorithm. Among these, the shallow parser based on
conditional random fields is reported to be the most efficient and
flexible approach. Shallow parsers are essential tools for various NLP
applications because they provide a useful partial analysis of
natural-language text while avoiding the complexity inherent in a
complete parser. Thus, shallow parsers are important for
applications that require only a syntactic analysis of the sentence and
do not require the relationships between its chunks. These
include applications such as automatic text summarization,
speech-to-speech translation systems and text mining.
Literature Survey-Summary
12. Many cultures around the world use different scripts to
represent their languages. By transliterating, people can make
their languages more accessible to people who do not
understand their scripts. For example, to someone who knows
the Roman alphabet, the name محمد is incomprehensible.
However, when it is transliterated as Muhammad, readers of the
Roman alphabet understand that it means the Muslim prophet
Muhammad.
A transliterator therefore helps non-native speakers type a Hindi
phrase in Roman script on any keyboard, and its output provides
the input for the shallow parser.
Literature Survey-Summary
13. We intend to develop a ‘Shallow Parser for Hindi Language’ and
an FST-based Morphological Analyzer that can be used as tools
for building more application-specific tools such as automatic text
summarizers, speech-to-speech translators, etc. The key objective of
the project is to provide the shallow parser and morphological
analyzer as open source software.
We also want to develop a simple tool to convert Roman script to
Indic (Devanagari) script. Because most keyboards have an English
layout, writing in Indic script is difficult, whereas writing Hindi in
Roman script is easy. This motivates a tool for Linux that makes it
easy to write Hindi text.
Problem Statement
14. Plan of Action
1. Transliteration
2. Lexicon Generator
3. Morphological Analyzer
4. Shallow Parsing
15. 1. Transliterator
Figure: Block Diagram of transliteration process
It is a simple tool to convert Roman script to Indic (Devanagari) script. Because
most keyboards have an English layout, writing in Indic script is difficult, whereas
writing Hindi in Roman script is easy. This motivates a tool for Linux that makes it
easy to write Hindi text.
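A minimal sketch of such a Roman-to-Devanagari converter is given below. The romanization table is a small, ad hoc assumption made for illustration (it ignores retroflex and most aspirated consonants, nasalization and many other details), and the three rules it applies are also assumptions: a consonant followed directly by another consonant gets a virama, a vowel after a consonant is written as a vowel sign, and a word-final consonant keeps its inherent vowel unmarked.

```python
# Assumed, deliberately tiny romanization table (not a standard scheme).
CONSONANTS = {
    "k": "क", "g": "ग", "ch": "च", "j": "ज", "t": "त", "d": "द",
    "n": "न", "p": "प", "b": "ब", "m": "म", "y": "य", "r": "र",
    "l": "ल", "v": "व", "s": "स", "h": "ह",
}
# Each vowel maps to (independent letter, dependent vowel sign).
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "i": ("इ", "ि"), "ee": ("ई", "ी"),
    "u": ("उ", "ु"), "oo": ("ऊ", "ू"), "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}
VIRAMA = "्"
# Try longer keys first so that "aa" wins over "a" and "ch" over "h".
KEYS = sorted(list(CONSONANTS) + list(VOWELS), key=len, reverse=True)


def tokenize(word):
    """Split a romanized word into table keys by greedy longest match."""
    tokens, i = [], 0
    while i < len(word):
        for key in KEYS:
            if word.startswith(key, i):
                tokens.append(key)
                i += len(key)
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r} in {word!r}")
    return tokens


def transliterate_word(word):
    out, after_consonant = [], False
    for token in tokenize(word.lower()):
        if token in CONSONANTS:
            if after_consonant:          # consonant cluster, e.g. "ky"
                out.append(VIRAMA)
            out.append(CONSONANTS[token])
            after_consonant = True
        else:
            independent, sign = VOWELS[token]
            out.append(sign if after_consonant else independent)
            after_consonant = False
    return "".join(out)


def transliterate(text):
    return " ".join(transliterate_word(word) for word in text.split())


print(transliterate("kaay kam karato"))
print(transliterate("kyaa chal rahaa hai"))
```

With this table, the two phrases from the Transliteration slide should come out in the Devanagari forms shown there. Spellings that follow English conventions, such as "School", are not handled by these rules and would need a larger table or a dictionary.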
16. 2. Lexicon Generator
Figure: Block Diagram of Lexicon Generation
The corpus is processed in three steps to extract the words. The first step
extracts the words from the sentences of the given corpus. The next step removes
duplicate words so that only unique words remain. Finally, the words are sorted,
which makes manual processing, such as classifying the words, easier. The lexicon
files for each word class are organized according to its inflection and
derivation types.
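The three automatic steps (extract, remove duplicates, sort) can be sketched as follows. The function name, the punctuation set and the lower-casing of romanized input are assumptions made for this illustration; the manual classification into word classes is not covered.

```python
# Punctuation stripped from word edges (assumed set, includes the danda).
PUNCTUATION = ".,!?;:\"'()|।"


def build_word_list(sentences):
    """Extract words from sentences, drop duplicates and sort them."""
    words = []
    for sentence in sentences:                 # step 1: extract the words
        for token in sentence.split():
            word = token.strip(PUNCTUATION).lower()
            if word:
                words.append(word)
    unique_words = set(words)                  # step 2: remove duplicates
    return sorted(unique_words)                # step 3: sort


corpus = ["Ram school jaata hai.", "Sita school jaati hai."]
print(build_word_list(corpus))
# ['hai', 'jaata', 'jaati', 'ram', 'school', 'sita']
```

Whitespace splitting is used instead of a `\w+` regular expression because Devanagari vowel signs are combining marks, which Python's `\w` does not match, so a regex-based split would break Devanagari words apart.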
17. 3. Morphological Analyzer
Figure: Architecture of the Morphological Processor
The analyzer takes a word in its surface form as input and produces the
grammatical structure of the word, i.e. its lexical form, as the result. The
generator takes the grammatical structure of the word (the lexical
form) as input and produces the corresponding surface form.
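The two directions can be illustrated with a small table-driven sketch. It is a plain lookup, not an FST, and the stem lexicon, the paradigm table, the tag names and the romanized spellings (ladakaa, ladake, ladakon for "boy") are all assumptions made for this example.

```python
# Hypothetical paradigm table: word class -> list of (suffix, grammatical tag).
PARADIGMS = {
    "noun_masc_aa": [
        ("aa", "N.masc.sg.direct"),
        ("e", "N.masc.sg.oblique"),
        ("e", "N.masc.pl.direct"),
        ("on", "N.masc.pl.oblique"),
    ],
}
# Hypothetical stem lexicon: stem -> word class.
LEXICON = {"ladak": "noun_masc_aa"}


def analyze(surface):
    """Surface form -> list of lexical forms (there may be several)."""
    analyses = []
    for stem, word_class in LEXICON.items():
        if surface.startswith(stem):
            suffix = surface[len(stem):]
            analyses.extend(f"{stem}+{tag}"
                            for candidate, tag in PARADIGMS[word_class]
                            if candidate == suffix)
    return analyses


def generate(lexical):
    """Lexical form (stem+tag) -> surface form."""
    stem, tag = lexical.split("+", 1)
    for suffix, candidate in PARADIGMS[LEXICON[stem]]:
        if candidate == tag:
            return stem + suffix
    raise ValueError(f"no surface form for {lexical!r}")


print(analyze("ladake"))
print(generate("ladak+N.masc.pl.oblique"))
```

The analyzer returns a list because one surface form can have more than one analysis: in this toy table "ladake" is both singular oblique and plural direct. The generator is the inverse lookup over the same table.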
18. 4. Shallow Parsing by CFG
A CFG is a 4-tuple <N, Σ, R, S>:
A set of non-terminals N
(e.g. N = {S, NP, VP, PP, Noun, Verb, ...})
A set of terminals Σ
(e.g. Σ = {In, the, popular, mythology, computer, is, a, mathematics,
machine})
A set of rules R
A start symbol S (sentence)
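The 4-tuple can be written down directly as data. The sketch below uses a toy grammar for the romanized example sentence from the flow chart slide; the rules, the lower-cased terminals and the function names are assumptions for illustration and are not the project's actual rule set. A naive top-down search with backtracking finds a derivation and reports the labels directly under S as the chunks.

```python
# The 4-tuple <N, Sigma, R, S> as plain Python values (toy grammar, assumed).
NONTERMINALS = {"S", "NP", "VP", "Noun", "Verb", "Aux"}
TERMINALS = {"ram", "school", "jaata", "hai"}
RULES = {
    "S": [["NP", "NP", "VP"], ["NP", "VP"]],
    "NP": [["Noun"]],
    "VP": [["Verb", "Aux"], ["Verb"]],
    "Noun": [["ram"], ["school"]],
    "Verb": [["jaata"]],
    "Aux": [["hai"]],
}
START = "S"


def derive(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way symbol derives tokens[pos:next_pos]."""
    if symbol in TERMINALS:
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for right_hand_side in RULES[symbol]:
        for children, end in derive_sequence(right_hand_side, tokens, pos):
            yield (symbol, children), end


def derive_sequence(symbols, tokens, pos):
    """Yield (trees, next_pos) for each way the symbol sequence matches."""
    if not symbols:
        yield [], pos
        return
    for tree, middle in derive(symbols[0], tokens, pos):
        for rest, end in derive_sequence(symbols[1:], tokens, middle):
            yield [tree] + rest, end


def chunk_labels(sentence):
    """Return the labels directly under S, or None if there is no parse."""
    tokens = sentence.lower().rstrip(".").split()
    for tree, end in derive(START, tokens, 0):
        if end == len(tokens):
            return [child[0] for child in tree[1]]
    return None


print(chunk_labels("Ram School Jaata Hai."))
# ['NP', 'NP', 'VP']
```

The search terminates only because this toy grammar has no left-recursive rules; a grammar with rules such as NP → NP PP would need a chart-based method instead.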
20. Flow Chart
Input: Ram School Jaata Hai.
Transliterator
Output 1: राम स्कूल जाता है।
Shallow Parser
Output 2: NP NP VP
NP – Noun Phrase
VP – Verb Phrase
21. Findings and Conclusion
It is challenging to translate names and technical terms across
languages with different alphabets and sound inventories.
These items are commonly transliterated, i.e., replaced with
approximate phonetic equivalents. An efficient shallow parser
for Hindi is needed to build a full-blown parser.
Since proper nouns and technical terms, which need
phonetic translation, are part of most text documents,
transliteration is an important problem to study.
Only a few shallow parsers were found for Hindi.
Different approaches for creating a shallow parser were analysed.
Parsing by CFG is the approach used.
This approach is labour-intensive, as the rules are crafted manually.