Basic concepts of Probability
and Statistics
Thennarasu Sakkan
Department of Linguistics
Central University of Kerala
A probability provides a quantitative description of
the chances or likelihoods associated with various
outcomes.
Probability is the tool that statistical methods use in
order to make inferences about the characteristics of a
population given a random sample of data.
Understanding probability is therefore key to
understanding statistics.
The probability of an event A:
P(A) = N_A / N
where N is the number of equally likely possible outcomes of the
random experiment and N_A is the number of
outcomes favourable to the event A.
For example,
for a 6-sided die there are 6 outcomes and 3 of
them are even, and thus
P(even) = 3/6
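The die example can be checked with a few lines of Python. This is a minimal sketch that assumes all outcomes are equally likely; the helper name classical_probability is made up for illustration.

```python
from fractions import Fraction

def classical_probability(outcomes, is_favourable):
    """Return N_A / N for a finite set of equally likely outcomes."""
    outcomes = list(outcomes)
    favourable = [o for o in outcomes if is_favourable(o)]
    return Fraction(len(favourable), len(outcomes))

die = range(1, 7)  # the six faces of the die
p_even = classical_probability(die, lambda face: face % 2 == 0)
print(p_even)  # 3/6 reduced to 1/2
```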
Probability theory is a formal way of representing
probabilistic concepts and describing uncertain
events.
Probability is a mapping from the set of events defined on a
sample space into the interval [0, 1].
Naturally, the probability of a particular event or set
of events is the fraction of the time that the particular
event or set of events occurs.
Thus, a probability mapping goes from the set of all
possible events to their respective probabilities of
occurring.
Probability’s empirical counterparts are proportions
(between 0 and 1) and percentages (between 0 and
100).
Since something must always occur, the probabilities of
mutually exclusive events add up to 1 (as long as all possible
events are included in the sum).
Since no one event can happen less than 0% of the
time or more than 100% of the time, an individual
probability must be between 0 and 1.
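The two properties above can be illustrated with empirical proportions. The sketch below uses a small made-up list of die rolls (an assumption, not data from the slides) and checks that every relative frequency lies between 0 and 1 and that they sum to 1.

```python
from collections import Counter
from fractions import Fraction

def relative_frequencies(observations):
    """Map each observed outcome to its proportion of all observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: Fraction(n, total) for outcome, n in counts.items()}

rolls = [1, 3, 3, 6, 2, 6, 6, 4, 5, 1]  # made-up sample of ten rolls
freqs = relative_frequencies(rolls)

# Each proportion is between 0 and 1, and together they add up to 1.
print(all(0 <= p <= 1 for p in freqs.values()))
print(sum(freqs.values()) == 1)
print(freqs[6], "as a percentage:", freqs[6] * 100)
```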
LANGUAGE MODEL
Language modelling refers to the task of modelling a
language using probabilities.
A language model is one of the important components
of statistical machine translation.
This component takes care of the fluency of the output
in the given language,
i.e. it quantifies how probable a given sentence is;
it assigns high probability to plausible sentences.
A language model does not give any guarantee on
the syntax or semantics of the language being modelled.
An n-gram is a contiguous sequence of n items from a
given sequence of text.
Let us start with word prediction using simple n-grams.
Our goal is to calculate the probability of a word w given
some history h, or mathematically Pr(w|h).
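For a history of one word, Pr(w|h) can be estimated from counts as count(h, w) / count(h). The sketch below is a plain maximum likelihood estimate on a toy sentence; the function name and the sample text are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def bigram_probability(tokens, history, word):
    """MLE estimate of Pr(word | history) = count(history, word) / count(history)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])  # the last token is never a history
    if histories[history] == 0:
        return 0.0  # unseen history: no estimate without smoothing
    return bigrams[(history, word)] / histories[history]

tokens = "the cat sat on the mat and the cat slept".split()
print(bigram_probability(tokens, "the", "cat"))  # "the" occurs 3 times, "the cat" twice
```

Unseen word pairs get probability 0 under this estimate, which is why practical language models add some form of smoothing.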
The n-gram model is a widely used language modelling tool,
crucial in applications such as speech recognition (SR), spelling
correction, word prediction, POS tagging, natural
language generation and word similarity.
An n-gram model
An n-gram model is a type of probabilistic model for
predicting the next item in a text sequence.
n-grams are used in various areas of statistical
natural language processing and genetic sequence
analysis.
It uses the previous n-1 words in a sequence to predict
the next word.
The items in question can be phonemes, syllables,
letters, words or base pairs according to the
application.
N-gram models can be imagined as placing a small
window over a sentence or a text, in which only n
words are visible at the same time.
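The sliding-window picture translates directly into code. A minimal sketch, where the function name ngrams and the sample sentence are illustrative assumptions:

```python
def ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    items = list(items)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "word order is constrained by grammar".split()
print(ngrams(words, 2))   # word bigrams
print(ngrams(words, 3))   # word trigrams
print(ngrams("gram", 3))  # the items can also be letters
```

The same function works for phonemes, syllables or base pairs, as long as the input is a sequence of items.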
The simplest n-gram model is therefore a so-called
unigram model.
This is a model in which we only look at one word at
a time.
An n-gram of size 1 is referred to as a "unigram"; size
2 is a "bigram" (or, less commonly, a "digram"); size 3
is a "trigram"; and size 4 or more is simply called an
"n-gram"
http://guidetodatamining.com/ngramAnalyzer/
Collocations
The notion of collocation was already used in lexicography in the 19th
century.
What is a collocation?
A collocation is a pair or group of words that are
often used together.
These combinations sound natural to native speakers,
but learners of the language have to make a special
effort to learn them because they are often difficult
to guess.
A straightforward application of bigrams is the
identification of so-called collocations.
Recall that bigram language models exploit the
observation that words do not simply combine in
any random order; that is, word order is constrained by
grammatical structure (e.g. phrases).
However, some combinations of words are subject to
an additional law of constraint.
Such combinations are commonly known as collocations.
– Examples of collocations are:
• United States
• vice president
• chief executive, chief officer, etc.
Corpus linguists study such collocations to answer
interesting questions about the combinatory properties
of words.
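One way to find candidate collocations from bigram counts is to score each adjacent word pair with an association measure. The sketch below uses pointwise mutual information (PMI), which is one common choice; the slides do not name a specific measure, and the toy text and the min_count threshold are assumptions for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("the vice president met the chief executive and the vice president "
        "thanked the chief executive of the united states")
ranking = pmi_scores(text.split())
for pair, score in ranking:
    print(pair, round(score, 2))
```

In this toy text, pairs such as "vice president" score higher than pairs such as "the vice", because "the" combines freely with many other words.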
Collocations are a feature of natural languages that are not
well addressed by current language teaching and current
models used for NLP.
According to Benson et al., there are two types of
collocations: (i) lexical and (ii) grammatical
collocations.
(i) Lexical collocations, such as
noun + noun,
adjective + noun
(ii) Grammatical collocations, such as
noun + suffixes, etc.
See collocations of panam ‘money’ in
Tamil
How to generate collocation out of a
corpus text?….
To take a list of modern collocations….
POS tagging and approaches
Part of Speech (POS) tagging is the process of assigning
a part-of-speech category to each word in
a text.
POS tagging is considered to be an important process
in speech recognition, natural language parsing,
morphological parsing, information retrieval and
machine translation.
An automatic part-of-speech tagger can help in
building automatic word-sense disambiguation
algorithms.
Parts of speech are very often used for shallow parsing
of texts, or for finding noun phrases and other phrases for
information extraction applications.
Corpora that have been marked for part of speech
are very useful for linguistic research,
for example, to find the frequencies of particular words
or sentence constructions in large corpora.
Apart from these, many Natural Language Processing
(NLP) activities such as summarization, Natural
Language Understanding (NLU) and Question
Answering (QA) systems are dependent on Part-of-
Speech Tagging.
Approaches to POS Tagging
POS taggers are broadly classified into three categories:
rule based, empirical and hybrid.
In the rule based approach, hand-written rules
are used to resolve tag ambiguity.
The empirical POS taggers are further classified
into example based and stochastic taggers.
Stochastic taggers are either HMM based, choosing the
tag sequence which maximizes the product of word
likelihood and tag sequence probability, or cue-based,
using decision trees or maximum entropy models to
combine probabilistic features.
The stochastic taggers are further classified into
supervised and unsupervised taggers. Each of these
supervised and unsupervised taggers is categorized
into different groups as below:
Maximum Entropy Part of Speech Tagger
by Stanford University
[Figure: Classification of POS tagging models. POS tagging is divided into supervised and unsupervised approaches; each is subdivided into rule based (Brill), stochastic and neural. Techniques listed under the stochastic branches: N-gram based, maximum likelihood, Hidden Markov Model, Viterbi algorithm, Baum-Welch algorithm.]
Rule-based taggers generally involve a large database
of hand-written disambiguation rules.
For example, a rule may state that an ambiguous word is a noun rather
than a verb if it follows a determiner.
Among those rule-based part-of-speech taggers, the
one built by Brill has the advantage of learning
tagging rules automatically.
Stochastic taggers generally resolve tagging
ambiguities by using a training corpus to compute the
probability of a given word having a given tag in a
given context.
Supervised POS tagging
The supervised POS tagging models require pre-
tagged corpora which are used for training to learn
rule sets, information about the tagset, word-tag
frequencies etc.
The learning tool generates trained models along
with the statistical information.
The performance of the models generally increases
as the size of the pre-tagged corpus increases.
Unsupervised POS tagging
Unlike the supervised models, the unsupervised POS
tagging models do not require a pre-tagged corpus.
Instead, they use advanced computational methods
like the Baum-Welch algorithm to automatically induce
tagsets, transformation rules etc.
Based on this information, they either calculate the
probabilistic information needed by the stochastic
taggers or induce the contextual rules needed by rule-based
systems or transformation based systems.
Rule based POS tagging
The rule based POS tagging models apply a set of hand-written
rules and use contextual information to assign POS tags to
words in a sentence.
These rules are often known as context frame rules. For example,
a context frame rule might say something like:
“If an ambiguous/unknown word X is preceded by a Determiner
and followed by a Noun, tag it as an Adjective.”
On the other hand, the transformation based approaches use a
pre-defined set of handcrafted rules as well as automatically
induced rules that are generated during training.
Some models also use information about capitalization and
punctuation, the usefulness of which is largely dependent
on the language being tagged.
The earliest algorithms for automatically assigning Part-of-
Speech were based on a two-stage architecture [Harris Z. S,
1962].
The first stage used a dictionary to assign each word a list of
potential parts of speech.
The second stage used large lists of hand-written disambiguation
rules to bring down this list to a single Part-of-Speech for each
word.
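The two-stage architecture can be sketched in a few lines. The lexicon, the tag names and the second rule below are hypothetical and chosen only for illustration; the first rule is the determiner rule quoted earlier in these slides. This is not the rule set of any actual tagger.

```python
# Stage 1 resource: a hypothetical toy lexicon of candidate tags per word.
LEXICON = {
    "the": ["DET"],
    "can": ["NOUN", "VERB"],
    "hit": ["NOUN", "VERB"],
    "we": ["PRON"],
    "it": ["PRON"],
}

def lookup(words):
    """Stage 1: assign each word its list of potential parts of speech."""
    return [list(LEXICON.get(w.lower(), ["UNK"])) for w in words]

def disambiguate(candidates):
    """Stage 2: apply hand-written rules to leave a single tag per word."""
    tags = []
    for options in candidates:
        previous = tags[-1] if tags else None
        if len(options) == 1:
            tags.append(options[0])
        elif previous == "DET" and "NOUN" in options:
            # Rule from the text: a noun rather than a verb after a determiner.
            tags.append("NOUN")
        elif previous == "PRON" and "VERB" in options:
            # Extra illustrative rule (an assumption, not from the source).
            tags.append("VERB")
        else:
            tags.append(options[0])  # fallback: first tag listed in the lexicon
    return tags

def tag(sentence):
    words = sentence.split()
    return list(zip(words, disambiguate(lookup(words))))

print(tag("We hit the can"))
```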
The ENGTWOL [Voutilainen Atro, 1995] tagger is based on the
same two-stage architecture, although both the lexicon and the
disambiguation rules are much more sophisticated than the early
algorithms.
The ENGTWOL lexicon is based on the two-level morphology.
It has about 56,000 entries for English word stems, counting a
word with multiple parts of speech (e.g. nominal and verbal
senses of hit) as separate entries, and of course not counting
inflected and many derived forms.
Each entry is annotated with a set of morphological and
syntactic features. In the first stage of the tagger, each word is
run through the two-level lexicon transducer and the entries for
all possible parts of speech are returned.
Stochastic POS tagging
A stochastic approach includes frequency, probability or
statistics. The simplest stochastic approach finds out the
most frequently used tag for a specific word in the
annotated training data and uses this information to tag
that word in the unannotated text.
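A minimal sketch of this most-frequent-tag approach is given below. The tiny training set and the default tag for unknown words are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_words(model, words, default="NOUN"):
    """Tag each word independently; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

training = [  # made-up annotated sentences
    [("the", "DET"), ("can", "NOUN"), ("fell", "VERB")],
    [("we", "PRON"), ("can", "VERB"), ("swim", "VERB")],
    [("they", "PRON"), ("can", "VERB"), ("run", "VERB")],
]
model = train_most_frequent_tag(training)
print(tag_words(model, ["the", "can", "fell"]))
```

Because every word is tagged in isolation, "can" is tagged as a verb here even directly after "the".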
The problem with this approach is that it can come up with
sequences of tags for sentences that are not acceptable
according to the grammar rules of a language.
An alternative to the word frequency approach is known as the
n-gram approach that calculates the probability of a given
sequence of tags.
It determines the best tag for a word by calculating the
probability that it occurs with the n previous tags, where the
value of n is set to 1, 2 or 3 for practical purposes.
The most common algorithm for implementing an n-gram
approach for tagging a new text is known as the Viterbi
algorithm, a search algorithm that avoids the
exponential expansion of a breadth-first search by trimming
the search tree at each level using the best m Maximum
Likelihood Estimates (MLE), where m represents the number
of tags of the following word.
The n-gram models with n = 1, 2 and 3 are known as the unigram, bigram and trigram models.
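The sketch below shows the Viterbi algorithm for a bigram HMM tagger in its usual dynamic-programming form: at each word, only the best path ending in each tag is kept. All probabilities are made-up toy values, and the small floor used for unseen events is an assumption, not a recommended smoothing method.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for a non-empty list of words."""
    def lp(p):
        return math.log(max(p, floor))  # log-probability with a floor for zeros

    # best[i][t]: log-probability of the best tag path ending in tag t at word i.
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev  # remember which previous tag gave the best score
    # Follow the back-pointers from the best final tag.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"VERB": 0.6, "NOUN": 0.3, "DET": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"can": 0.4, "rust": 0.3, "dog": 0.3},
    "VERB": {"can": 0.5, "rust": 0.5},
}
print(viterbi(["the", "can", "rust"], tags, start_p, trans_p, emit_p))
```

In this toy model "can" is slightly more likely as a verb when considered alone, but the tag transition from a determiner makes the noun reading win in context.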
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s1_pg1.html
Advantages of Statistical Approach
• Very robust, can process any input strings
• Training is automatic, very fast
• Can be retrained for different corpora/tagsets
without much effort
• Language independent
• Minimizes human effort and human error.
Apart from these, quite a few different approaches to
tagging have been developed.
Support Vector Machines: This is a powerful machine
learning method used for various applications in NLP and
other areas like bio-informatics, data mining, etc.
Neural Networks: These are potential candidates for the
classification task since they learn abstractions from
examples [Schmid H, 1994].
Decision Trees:
A decision tree is a decision support tool that uses a tree-
like graph. It is one way to display an algorithm.
These are classification devices based on hierarchical
clusters of questions. They have been used for natural
language processing such as POS Tagging [Schmid
H, 1994].
The software “Weka” can be used for classifying the
ambiguous words.
Maximum Entropy Models: These avoid certain
problems of statistical interdependence and have
proven successful for tasks such as parsing and
POS tagging.
Example-Based Techniques: These techniques find
the training instance that is most similar to the
current problem instance and assume the same
class for the new problem instance as for the
similar one.
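A minimal sketch of the example-based idea follows: the new instance receives the class of the stored training instance that shares the most features with it. The feature set, the tag names and the training instances are hypothetical and chosen only for illustration.

```python
def features(prev_tag, word):
    """Hypothetical features: previous tag, last two letters, capitalization."""
    return {("prev", prev_tag), ("suffix", word[-2:].lower()), ("cap", word[0].isupper())}

def classify(instance, training):
    """Return the class of the training instance most similar to the new one."""
    _, best_tag = max(training, key=lambda example: len(instance & example[0]))
    return best_tag

training = [  # (feature set, class) pairs; made-up examples
    (features("DET", "walking"), "NOUN"),
    (features("PRON", "walked"), "VERB"),
    (features("DET", "London"), "PROPN"),
]
print(classify(features("PRON", "jumped"), training))
```

Similarity here is simply the number of shared features; memory-based taggers typically use weighted similarity measures instead.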
Freely downloadable Part of Speech Taggers for English
and other languages
Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
hunpos
An HMM tagger with models available for English and
Hungarian. A reimplementation of TnT (see below) in
OCaml. pre-compiled models. Runs on Linux, Mac OS X,
and Windows.
MBT: Memory-based Tagger
Based on TiMBL
TreeTagger
A decision tree based tagger from the University of
Stuttgart. It is language independent, but comes complete
with parameter files for English, German, Italian, Dutch,
French, Old French, Spanish, Bulgarian, and Russian.
(Linux, Sparc-Solaris, Windows, and Mac OS X versions.
Binary distribution only.) Page has links to sites where
one can run it online.
http://nlp.stanford.edu/links/statnlp.html
SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST)
Open source C taggers originally written by Ingo
Schröder. Implements maximum entropy, HMM trigram,
and transformation-based learning. C source available
under GNU public license.
MXPOST
Adwait Ratnaparkhi's Maximum Entropy part of
speech tagger
Java POS tagger
A sentence boundary detector (MXTERMINATOR)
is also included. Original version was only JDK1.1; later
version worked with JDK1.3+. Class files, not source.
fnTBL
A fast and flexible implementation of Transformation-
Based Learning in C++. Includes a POS tagger, but also
NP chunking and general chunking models.
mu-TBL
An implementation of a Transformation-based Learner
(a la Brill), usable for POS tagging and other things by
Torbjörn Lager. Web demo also available.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER,
etc. C/C++ open source. Won CoNLL 2000 shared task.
(Less automatic than a specialized POS tagger for an end
user.)
QTAG Part of speech tagger
An HMM-based Java POS tagger from
Birmingham U. (Oliver Mason). English and
German parameter files. [Java class files, not
source.]
The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the
decision to make this famous system available is very
interesting from an historical perspective, and for software
sharing in academia more generally. LOB tag set.
Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a
canonical location, but one may find a version from the
Wikipedia page or one can try a reimplementation such as
fnTBL.
Original Xerox Tagger
A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron
Coburn. Version 0.11. (A bigram HMM tagger.)
Development of POS Annotated Corpora
Corpus linguistics seeks to further the understanding of
language through the analysis of large quantities of naturally
occurring data.
Text corpora are used in a number of different ways.
Traditionally, corpora have been used for the study and analysis
of language at different levels of linguistic description.
Corpora have been constructed for the specific purpose of
acquiring knowledge for information extraction systems,
knowledge-based systems and e-business systems.
Corpora have been used for studying child language
development. Speech corpora play a vital role in the
specification, design and implementation of telephonic
communication and for the broadcast media.
There is a long tradition of corpus linguistic studies in
Europe. The need for a corpus for a language is
multifarious (of various types).
Starting from the preparation of a dictionary or lexicon to
machine translation, the corpus has become an indispensable resource
for the technological development of languages.
A corpus is a large body of text incorporating various
types of textual materials, including newspapers, weeklies,
fiction, scientific writings, literary writings, and so on.
A corpus should represent all the styles of a language. It must
be very large as it is going to be used for many
language applications such as the preparation of lexicons of
different sizes, purposes and types, NLP tools, machine
translation programs and so on.
Corpora can be distinguished as tagged corpora, parallel
corpora and aligned corpora.
The tagged corpus is that which is tagged for Part-of-Speech,
morphology, lemma, phrases etc.
A parallel corpus contains texts and translations in each of the
languages involved in it. It allows wider scopes for double-
checking of the translation equivalents.
Aligned corpus is a kind of bilingual corpus where text
samples of one language and their translations into another
language are aligned, sentence by sentence, phrase by
phrase, word by word, or even character by character.
Applications of POS tagged corpus
The POS tagged corpus is used in the following tasks:
– Chunking
– Parsing
– Information extraction and retrieval
– Tree bank creation
– Document classification
– Question answering
– Automatic dialogue system
– Speech processing
– Summarization
– Statistical training of Language models
– Machine Translation using multilingual corpora
– Text checkers for evaluating spelling and grammar
– Computer Lexicography
– Educational application like Computer Assisted
Language Learning
Complexity in Dravidian POS tagging
As the Dravidian languages are agglutinative, nouns are
inflected for number and case. Verbs are inflected with
suffixes marking tense, person, number and
gender.
Verbs are adjectivalized and adverbialized. Also, verbs
and adjectives are nominalized by means of certain
nominalizers. Adjectives and adverbs do not inflect.
Many post-positions in Tamil [Arden 1942; Rajendran S,
2007] are from nominal and verbal sources. So, many
times one has to depend on the syntactic function or
context to decide upon whether one is a noun or adjective
or adverb or postposition.
This leads to the complexity of Tamil in POS tagging.
Root ambiguity
The root word can be ambiguous. It can have more than one
sense, and sometimes a root belongs to more than one POS
category.
Though the POS can be disambiguated using contextual
information like co-occurring morphemes, this is not always
possible.
These issues should be taken care of when POS taggers are
built for the Tamil language.
For example, Tamil root words like adi, padi, isai, mudi and
kudi can be either nouns or verbs, which leads to the
root ambiguity problem in POS tagging.
Noun complexity
Nouns are the words which denote a person, place, thing,
time, etc. In Tamil, nouns are inflected for
number and case at the morphological level.
Morphological level inflection
Noun (+ number) (+ case)
Example: pUk-kaL-ai <NN>
Flower-plural-accusative case suffix
Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: pUk-kaL-in-Al <NN>
Flower-plural-euphonic suffix-instrumental case suffix
Nouns further need to be annotated into common noun,
compound noun, proper noun, compound proper noun,
pronoun, cardinal and ordinal.
Pronouns need to be further annotated for personal pronoun.
Ambiguity arises between common noun and
compound noun, and also between proper noun and
compound proper noun. A common noun can also occur as part of a
compound noun, for example
UrAdci <NNC> thalaivar <NNC>
When UrAdci and thalaivar occur together, they form a
compound noun (<NNC>), but when UrAdci and thalaivar
occur separately in a sentence, each should be tagged as a
common noun (<NN>). The same ambiguity occurs between the
proper noun (<NNP>) and the compound proper noun (<NNPC>).
Moreover, ambiguity arises between noun and adverb, and between
pronoun and emphasis, at the syntactic level.
Verb complexity
The verbal forms are complex in Tamil. A finite verb
shows the following morphological structure
Verb stem + Tense + Person-Number + Gender
Example: nada +nth +En <VF>
‘I walked’
A number of non-finite forms are possible: adverbial forms,
adjectival forms, infinitive forms, and conditional.
Verb stem + Adverbial participle
Example: cey + thu = ceythu <VNAV>
‘having done’
Verb stem + relative_participle
Example: cey + tha = ceytha <VNAJ>
‘who did’
Verb stem + infinitive suffix
Example: azu + a = aza <VINT>
‘to weep’
Verb stem + conditional suffix
Example: kEL + d + Al = kEddAl <CVB>
‘if asked’
A distinction needs to be made between a main verb followed
by a main verb and a main verb followed by an auxiliary
verb.
A main verb followed by an auxiliary verb needs to be
interpreted together, whereas a main verb followed by a
main verb needs to be interpreted separately. This leads to
functional ambiguity.
Developing Part-of-Speech taggers for
Indian languages
For Bengali, Sandipan et al. (2007) have developed a
corpus based semi-supervised learning algorithm for POS
tagging based on HMMs.
Their system uses a small tagged corpus (500 sentences) and a
large unannotated corpus along with a Bengali morphological
analyzer. When tested on a corpus of 100 sentences (1003
words), their system obtained an accuracy of 95%.
Smriti Singh et al. (2006) have proposed a tagger for Hindi that
uses the affix information stored in a word and assigns a
POS tag using no contextual information. By considering
the previous and the next word in the Verb Group (VG), it
correctly identifies the main verb and the auxiliaries.
Lexicon lookup was used for identifying the other POS
categories.
In the NLPAI ML contest, Dalal et al. (2006) have achieved
accuracies of 82.22 % and 82.4% for Hindi POS tagging and
chunking respectively using maximum entropy models.
Karthik et al. (2006) got 81.59 % accuracy for Telugu POS tagging
using HMMs.
Sivaji et al (2006) came up with a rule based chunker for Bengali
which gave an accuracy of 81.64 %. The training data for all the
three languages contained approximately 20,000 words and the
testing data had approximately 5000 words.
For Telugu, three POS taggers have been proposed using
different POS tagging approaches, viz. (1) the rule-based
approach, (2) the Transformation Based Learning (TBL)
approach of Eric Brill and (3) the Maximum Entropy Model, a
machine learning technique [Ramasree, R.J and Kusuma
Kumari, P, 2007].
A Hidden Markov Model (HMM) based tagger for Hindi was
proposed by Manish Shrivastava and Pushpak Bhattacharyya
(2008). The authors attempted to utilize the morphological
richness of the languages without resorting to complex and
expensive analysis. The core idea of their approach was to
explode the input in order to increase the length of the input
and to reduce the number of unique types encountered during
learning. This in turn increases the probability score of the
correct choice while simultaneously decreasing the ambiguity
of the choices at each stage.
A stochastic Hidden Markov Model (HMM) based part-of-speech
tagger has been proposed for Malayalam. To perform
part-of-speech tagging using a stochastic approach, an annotated
corpus is needed. Due to the non-availability of an annotated
corpus, a morphological analyzer was also developed to
generate a tagged corpus from the training set [Manju K et al.,
2009].
Various methodologies have been developed for POS Tagging
for Tamil language. A rule-based POS tagger for Tamil was
developed and tested [Arulmozhi et al., 2004]. This system
gives only the major tags and the sub tags are overlooked
during evaluation. A hybrid POS tagger for Tamil using HMM
technique and a rule based system was also developed
[Arulmozhi P and Sobha L, 2006].
Lakshmana Pandian S and Geetha T V (2008) have developed a
Morpheme based Language Model for Tamil Part-of-Speech
Tagging. A language model based on the information of the
stem type, last morpheme, and previous to the last morpheme
part of the word for categorizing its part of speech was
developed. For estimating the contribution factors of the model,
they have followed the generalized iterative scaling technique.
Dhanalakshmi et al. (2008) proposed an SVM based tagger
using linear programming and developed their own POS tagset
for Tamil, which has 32 tags. They used this tagset to annotate
their corpus, then trained their model and reported an
accuracy of 95.63%. Dhanalakshmi et al. (2009) have also
proposed another tagger where they used machine learning
techniques to extract linguistic information, which was then
used to train a tagger based on the SVM approach. They used
their own 32-tag tagset for annotating the corpus and reported
an accuracy of 95.64%.
Considerable effort has also been put into developing POS
taggers for other Indian languages. For Malayalam, an
HMM based tagger was proposed by Manju et al.; since
they did not have an annotated corpus, they used a
morphological analyzer to generate the corpus, which was
then used for training the HMM. Another tagger
for Malayalam was developed by Anthony et al. [2009], who
used Support Vector Machines (SVM). They used
SVMTool, a tagger developed by Giménez and
Màrquez. For developing this tagger, Anthony et al. first
proposed a tagset which they claim is suitable for
Malayalam and then created an annotated corpus using this
tagset. Their tagger reported 94% accuracy with their tagset.
Word Sense Disambiguation
• Word sense disambiguation (WSD) is the ability to
identify the meaning of words in context in a
computational manner. WSD is considered an AI-
complete problem, that is, a task whose solution is at least
as hard as the most difficult problems in artificial
intelligence.
A striking feature of Natural Language is that many words
and sentences have more than one meaning (i.e. are
semantically ambiguous), and which meaning is correct
depends on the context. This problem arises at several
levels.
There are problems at the level of individual words. Consider
this example
The man went to the (old ladies hostel)/bank.
What kind of 'bank'? A river bank or a source of money or blood
bank? Here we have three distinct English words with the
same spelling/pronunciation.
Word sense disambiguation (WSD) is the problem of
determining in which sense a word having a number of distinct
senses is used in a given sentence. So, WSD is the task of
removing the ambiguity of a word in context.
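One simple way to approach the 'bank' example is to compare the words around the ambiguous word with a short description of each sense and pick the sense with the largest overlap. This is the idea behind the simplified Lesk algorithm, which the slides do not mention by name. The sense inventory, glosses and stop-word list below are hand-written assumptions for illustration.

```python
# Hypothetical sense inventory with hand-written glosses.
SENSES = {
    "bank": {
        "river_bank": "sloping land beside a river or lake water",
        "money_bank": "financial institution that accepts deposits money loans account",
        "blood_bank": "place where donated blood is stored hospital",
    }
}

STOPWORDS = {"the", "a", "an", "to", "of", "or", "that", "is", "where", "his", "in", "at"}

def content_words(text):
    """Lower-case the words, strip punctuation and drop stop words."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence)

    def overlap(sense):
        return len(context & content_words(SENSES[word][sense]))

    return max(SENSES[word], key=overlap)

print(simplified_lesk("bank", "The man went to the bank to deposit money in his account"))
print(simplified_lesk("bank", "The man sat on the bank of the river"))
```

When no context word matches any gloss, this sketch falls back to the first sense listed, so the result is only as good as the glosses and the available context.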
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 

Semelhante a 7 probability and statistics an introduction

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsshrey bhate
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002IJARTES
 
Genetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech TaggingGenetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech Taggingkevig
 
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGGENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGijnlc
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Editor IJARCET
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIESA REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIESIJCSES Journal
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESLinda Garcia
 
Langauage model
Langauage modelLangauage model
Langauage modelc sharada
 
Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...NALESVPMEngg
 
A Word Stemming Algorithm for Hausa Language
A Word Stemming Algorithm for Hausa LanguageA Word Stemming Algorithm for Hausa Language
A Word Stemming Algorithm for Hausa Languageiosrjce
 

Semelhante a 7 probability and statistics an introduction (20)

Language Modeling.docx
Language Modeling.docxLanguage Modeling.docx
Language Modeling.docx
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
Genetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech TaggingGenetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech Tagging
 
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGGENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329Ijarcet vol-2-issue-2-323-329
Ijarcet vol-2-issue-2-323-329
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIESA REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
A REVIEW ON PARTS-OF-SPEECH TECHNOLOGIES
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
 
Langauage model
Langauage modelLangauage model
Langauage model
 
Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...Stemming is one of several text normalization techniques that converts raw te...
Stemming is one of several text normalization techniques that converts raw te...
 
A Word Stemming Algorithm for Hausa Language
A Word Stemming Algorithm for Hausa LanguageA Word Stemming Algorithm for Hausa Language
A Word Stemming Algorithm for Hausa Language
 
D017362531
D017362531D017362531
D017362531
 

Mais de ThennarasuSakkan

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)ThennarasuSakkan
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)ThennarasuSakkan
 
6 shallow parsing introduction
6 shallow parsing introduction6 shallow parsing introduction
6 shallow parsing introductionThennarasuSakkan
 
5 relevance of annotated corpus
5 relevance of annotated corpus5 relevance of annotated corpus
5 relevance of annotated corpusThennarasuSakkan
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpusThennarasuSakkan
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introductionThennarasuSakkan
 

Mais de ThennarasuSakkan (7)

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
 
11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)11 terms in Corpus Linguistics1 (2)
11 terms in Corpus Linguistics1 (2)
 
6 shallow parsing introduction
6 shallow parsing introduction6 shallow parsing introduction
6 shallow parsing introduction
 
5 relevance of annotated corpus
5 relevance of annotated corpus5 relevance of annotated corpus
5 relevance of annotated corpus
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
 
2 why python for nlp
2 why python for nlp2 why python for nlp
2 why python for nlp
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introduction
 

Último

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Último (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 

7 probability and statistics an introduction

  • 1. Basic concepts of Probability and Statistics Thennarasu Sakkan Department of Linguistics Central University of Kerala
  • 2. A probability provides a quantitative description of the chances or likelihoods associated with various outcomes. Probability is the tool that statistical methods use to make inferences about the characteristics of a population given a random sample of data. Understanding probability is therefore key to understanding statistics.
  • 3. The probability of an event A: P(A) = N_A / N, where N is the number of possible (equally likely) outcomes of the random experiment and N_A is the number of outcomes favourable to the event A. For example, a 6-sided die has 6 outcomes, 3 of which are even, and thus P(even) = 3/6 = 1/2.
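The counting definition above can be checked with a few lines of Python. This is a minimal sketch that assumes all outcomes are equally likely; the function name `classical_probability` is illustrative and not part of the slides.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = N_A / N, assuming every outcome is equally likely."""
    outcomes = set(sample_space)
    favourable = outcomes & set(event)  # outcomes that belong to the event
    return Fraction(len(favourable), len(outcomes))

die = range(1, 7)   # the six faces of the die
even = [2, 4, 6]    # the event "the face is even"
p_even = classical_probability(even, die)
print(p_even)       # 1/2
```

Using `Fraction` keeps the result exact, so 3/6 is reduced to 1/2 rather than shown as a rounded decimal.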
  • 4. Probability theory is a formal way of representing probabilistic concepts and describing uncertain events. Probability is a mapping from the set of events or sample space into the interval [0, 1]. Naturally, the probability of a particular event or set of events is the fraction of the time that the particular event or set of events occurs. Thus, a probability mapping goes from the set of all possible events to their respective probabilities of occurring.
  • 5. Probability’s empirical counterparts are proportions (between 0 and 1) and percentages (between 0 and 100). Since something must always occur, probabilities always add up to 1 (as long as all possible, mutually exclusive events are included in the sum). Since no one event can happen less than 0% of the time or more than 100% of the time, an individual probability must be between 0 and 1.
  • 6. LANGUAGE MODEL: Language modelling refers to the task of modelling a language using probabilities. The language model is one of the important components of statistical machine translation. This component takes care of the fluency of the output in the given language, i.e. it quantifies how probable a given sentence is, assigning high probability to plausible sentences.
  • 7. A language model does not give any guarantee about the syntax or semantics of the language being modelled. An n-gram is a contiguous sequence of n items from a given sequence of text. Let us start with word prediction using simple n-grams. Our goal is to calculate the probability of a word w given some history h, or mathematically Pr(w|h). The n-gram model is a widely used language modelling tool, crucial in applications such as speech recognition (SR), spelling correction, word prediction, POS tagging, natural language generation and word similarity.
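One simple way to estimate Pr(w|h) is by relative frequency. The sketch below assumes a history of exactly one word (a bigram model) and a tiny made-up corpus; the estimator count(h, w) / count(h) and the name `bigram_mle` are illustrative choices rather than something the slides prescribe.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return a function estimating Pr(w | h) = count(h, w) / count(h)."""
    history_counts = Counter(tokens[:-1])             # words that have a successor
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs

    def prob(w, h):
        if history_counts[h] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        return bigram_counts[(h, w)] / history_counts[h]

    return prob

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
pr = bigram_mle(corpus)
print(pr("cat", "the"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

Unseen pairs receive probability 0 here; practical language models usually add smoothing, which this sketch leaves out.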
  • 8. An n-gram model: An n-gram model is a type of probabilistic model for predicting the next item in a text sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis. The model uses the previous n-1 words in a sequence to predict the next word. The items in question can be phonemes, syllables, letters, words or base pairs, according to the application.
  • 9. N-gram models can be imagined as placing a small window over a sentence or a text, in which only n words are visible at the same time. The simplest n-gram model is therefore the so-called unigram model, in which we only look at one word at a time. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
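The sliding-window picture can be written down directly. This is a small sketch with whitespace tokenization assumed for simplicity; the helper name `ngrams` is illustrative.

```python
def ngrams(tokens, n):
    """Slide a window of width n over the tokens and collect each view."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "words do not combine in random order".split()
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams[:2])  # [('words', 'do'), ('do', 'not')]
```

A sentence of k tokens yields k - n + 1 n-grams, and none at all when n exceeds k.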
  • 14. Collocations: The notion of collocation was used in lexicography in the 19th century. What is a collocation? A collocation is a pair or group of words that are often used together. These combinations sound natural to native speakers, but learners of the language have to make a special effort to learn them because they are often difficult to guess.
  • 15. A straightforward application of bigrams is the identification of so-called collocations. Recall that bigram language models exploit the observation that words do not simply combine in any random order; that is, word order is constrained by grammatical structure (e.g. phrases). However, some combinations of words are subject to an additional constraint.
  • 16. Such combinations are commonly known as collocations. Examples of collocations are: United States; vice president; chief executive, chief officer, etc. Corpus linguists study such collocations to answer interesting questions about the combinatory properties of words. Collocations are a feature of natural languages that is not well addressed by current language teaching or by current models used for NLP.
  • 17. According to Benson et al., there are two types of collocations: i) lexical collocations, such as noun + noun and adjective + noun; and ii) grammatical collocations, such as noun + suffixes.
  • 27. See collocations of panam ‘money’ in Tamil
  • 29. How to generate collocation out of a corpus text?…. To take a list of modern collocations….
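One possible way to extract collocation candidates from a corpus is to count adjacent word pairs and rank them by an association score. The sketch below uses pointwise mutual information (PMI), a measure the slides do not name and which is chosen here only for illustration. The corpus, the `min_count` threshold and the function name are likewise assumptions.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = n_words - 1
    scored = []
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI compares the observed pair probability with the one
        # expected if the two words occurred independently.
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy corpus, for illustration only.
text = ("the vice president met the chief executive and "
        "the vice president thanked the chief executive").split()
ranked = collocations(text)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On this toy text the pairs "vice president" and "chief executive" score higher than "the vice" and "the chief", because "the" also occurs frequently in other contexts.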
  • 30. POS tagging and approaches: Part-of-speech (POS) tagging is the process of assigning a part-of-speech category to each word in a text. POS tagging is considered an important process in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation. An automatic part-of-speech tagger can help in building automatic word-sense disambiguation algorithms.
  • 31. Parts of speech are very often used for shallow parsing of texts, or for finding noun phrases and other phrases for information extraction applications. Corpora that have been marked for part of speech are very useful for linguistic research, for example to find the frequencies of particular words or sentence constructions in large corpora. Apart from these, many Natural Language Processing (NLP) activities such as summarization, Natural Language Understanding (NLU) and Question Answering (QA) systems depend on part-of-speech tagging.
  • 32. Approaches to POS Tagging: POS taggers are broadly classified into three categories: rule-based, empirical and hybrid. In the rule-based approach, hand-written rules are used to resolve tag ambiguity. Empirical POS taggers are further classified into example-based and stochastic taggers.
  • 33. Stochastic taggers are either HMM-based, choosing the tag sequence that maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features. Stochastic taggers are further classified into supervised and unsupervised taggers. Each of these is in turn categorized into different groups, as below:
  • 34. Maximum Entropy Part of Speech Tagger by Stanford University
  • 35. [Figure: Classification of POS tagging models. POS tagging splits into supervised and unsupervised branches, each subdivided into rule-based (Brill), stochastic and neural taggers. Further labels in the diagram: N-gram based, Maximum Likelihood, Hidden Markov Model, Baum-Welch algorithm, Viterbi algorithm.]
  • 36. Rule-based taggers generally involve a large database of hand-written disambiguation rules, for example a rule stating that an ambiguous word is a noun rather than a verb if it follows a determiner. Among rule-based part-of-speech taggers, the one built by Brill has the advantage of learning tagging rules automatically. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
  • 37. Supervised POS tagging: Supervised POS tagging models require pre-tagged corpora, which are used in training to learn rule sets, information about the tagset, word-tag frequencies etc. The learning tool generates trained models along with the statistical information. The performance of the models generally increases with the size of the pre-tagged corpus.
  • 38. Unsupervised POS tagging: Unlike the supervised models, unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods such as the Baum-Welch algorithm to automatically induce tagsets, transformation rules etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.
  • 39. Rule based POS tagging: Rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words in a sentence. These rules are often known as context frame rules. For example, a context frame rule might say something like: “If an ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as an Adjective.” Transformation-based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training.
  • 40. Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture [Harris Z. S, 1962]. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to narrow this list down to a single part of speech for each word.
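The two-stage architecture can be illustrated with a toy tagger. Everything in this sketch is hypothetical: the lexicon, the tag names, the fallback rules and the function names are invented for illustration. Real systems use far larger dictionaries and rule lists.

```python
# Stage 1 resource: a toy lexicon listing the possible tags of each word
# (hypothetical data, for illustration only).
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "can": ["NOUN", "VERB"],
    "hit": ["NOUN", "VERB"],
    "rusts": ["VERB"],
    "they": ["PRON"],
}

def lookup(words):
    """Stage 1: assign each word its list of candidate tags."""
    # Unknown words default to NOUN, an assumption made for this sketch.
    return [LEXICON.get(word.lower(), ["NOUN"]) for word in words]

def disambiguate(candidates):
    """Stage 2: reduce each candidate list to one tag with hand-written rules."""
    tags = []
    for cands in candidates:
        if len(cands) == 1:
            tags.append(cands[0])
        elif tags and tags[-1] == "DET" and "NOUN" in cands:
            # Rule: an ambiguous word that follows a determiner is a noun.
            tags.append("NOUN")
        elif "VERB" in cands:
            # Hypothetical fallback rule: otherwise prefer the verb reading.
            tags.append("VERB")
        else:
            tags.append(cands[0])
    return tags

def tag(sentence):
    words = sentence.split()
    return list(zip(words, disambiguate(lookup(words))))

tagged = tag("The can rusts")
print(tagged)  # [('The', 'DET'), ('can', 'NOUN'), ('rusts', 'VERB')]
```

Keeping lookup and disambiguation as separate functions mirrors the two stages: the lexicon can grow without touching the rules, and the rules can be refined without touching the lexicon.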
  • 41. The ENGTWOL [Voutilainen Atro, 1995] tagger is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms. The ENGTWOL lexicon is based on two-level morphology. It has about 56,000 entries for English word stems, counting a word with multiple parts of speech (e.g. nominal and verbal senses of hit) as separate entries, and of course not counting inflected and many derived forms. Each entry is annotated with a set of morphological and syntactic features. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts of speech are returned.
  • 43. Stochastic POS tagging: A stochastic approach relies on frequency, probability or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. The problem with this approach is that it can produce sequences of tags for sentences that are not acceptable according to the grammar rules of the language.
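The most-frequent-tag approach, and the weakness just described, can be seen in a short sketch. The training pairs, the tag names and the NOUN default for unseen words are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_words(words, model, default="NOUN"):
    """Tag each word independently of its neighbours."""
    return [(word, model.get(word.lower(), default)) for word in words]

# Hypothetical annotated training data, for illustration only.
training = [
    ("the", "DET"), ("race", "NOUN"), ("is", "VERB"), ("on", "ADP"),
    ("they", "PRON"), ("race", "VERB"), ("the", "DET"), ("race", "NOUN"),
]
model = train_most_frequent_tag(training)
result = tag_words("they race".split(), model)
print(result)  # [('they', 'PRON'), ('race', 'NOUN')]
```

Because "race" is a noun in two of its three training occurrences, it is tagged NOUN even after "they", where a verb is expected. The tagger never looks at the neighbouring tags, which is how ungrammatical tag sequences arise.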
  • 44. An alternative to the word frequency approach is the n-gram approach, which calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes; these are known as the unigram, bigram and trigram models. The most common algorithm for implementing an n-gram approach for tagging a new text is the Viterbi algorithm, a search algorithm that avoids the exponential expansion of a breadth-first search by trimming the search tree at each level using the best m Maximum Likelihood Estimates (MLE), where m represents the number of tags of the following word.
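As an illustration, here is a sketch of the Viterbi algorithm in its standard dynamic-programming form for a bigram tag model (a hidden Markov model). The probability tables are made-up toy values, not estimates from a real corpus, and the tag set and function name are assumptions.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i - 1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Keep only the best way of reaching tag t; all other paths
            # into t are discarded, which is what prunes the search.
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Toy model with made-up probabilities, for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"can": 0.7, "rusts": 0.3},
    "VERB": {"can": 0.4, "rusts": 0.6},
}
best_tags = viterbi(["the", "can", "rusts"], tags, start_p, trans_p, emit_p)
print(best_tags)  # ['DET', 'NOUN', 'VERB']
```

With T tags and a sentence of k words, this keeps T partial paths per position and does on the order of k·T² work, instead of enumerating all T^k tag sequences.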
  • 45. Advantages of Statistical Approach: • Very robust, can process any input string • Training is automatic and very fast • Can be retrained for different corpora/tagsets without much effort • Language independent • Minimizes human effort and human error. http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s1_pg1.html
  • 46. Apart from these, quite a few other approaches to tagging have been developed. Support Vector Machines: a powerful machine learning method used for various applications in NLP and in other areas such as bioinformatics and data mining. Neural Networks: potential candidates for the classification task, since they learn abstractions from examples [Schmid H, 1994]. Decision Trees: a decision tree is a decision support tool that uses a tree-like graph; it is one way to display an algorithm.
  • 47. Decision trees are classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging [Schmid H, 1994]. The software “Weka” can be used for classifying ambiguous words.
  • 48. Maximum Entropy Models: These avoid certain problems of statistical interdependence and have proven successful for tasks such as parsing and POS tagging. Example-Based Techniques: These techniques find the training instance that is most similar to the current problem instance and assume the same class for the new problem instance as for the similar one.
  • 49. Freely downloadable Part of Speech Taggers for English and other languages. Stanford POS tagger: loglinear tagger in Java (by Kristina Toutanova). hunpos: an HMM tagger with models available for English and Hungarian; a reimplementation of TnT (see below) in OCaml, with pre-compiled models; runs on Linux, Mac OS X, and Windows. MBT: Memory-based Tagger, based on TiMBL. TreeTagger (described on the next slide). See http://nlp.stanford.edu/links/statnlp.html
  • 50. • TreeTagger: A decision tree based tagger from the University of Stuttgart. It is language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) The page has links to sites where one can run it online.
  • 51. • SVMTool: POS tagger based on SVMs (uses SVMlight). LGPL. • ACOPOST (formerly ICOPOST): Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license. • MXPOST: Adwait Ratnaparkhi's maximum entropy part-of-speech tagger, written in Java. A sentence boundary detector (MXTERMINATOR) is also included. The original version was only for JDK1.1; a later version worked with JDK1.3+. Class files, not source.
  • 52. • fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models. • mu-TBL: An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things, by Torbjörn Lager. Web demo also available. • YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won the CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
  • 53. • QTAG Part of speech tagger: An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.] • The TOSCA/LOB tagger: Currently available for MS-DOS only, but the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set. • Brill's Transformation-based learning Tagger: A symbolic tagger, written in C. It is no longer available from a canonical location, but one may find a version from the Wikipedia page or try a reimplementation such as fnTBL.
  • 54. • Original Xerox Tagger: A Common Lisp HMM tagger available by ftp. • Lingua-EN-Tagger: Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
  • 55. Development of POS Annotated Corpora Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data. Text corpora are used in a number of different ways. Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description. Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems. Corpora have been used for studying child language development. Speech corpora play a vital role in the specification, design and implementation of telephonic communication and for the broadcast media.
  • 56. There is a long tradition of corpus linguistic studies in Europe. The need for a corpus for a language is multifarious. From the preparation of a dictionary or lexicon to machine translation, the corpus has become an indispensable resource for the technological development of languages. A corpus is a large body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writings, literary writings, and so on. A corpus should represent all the styles of a language. It must be very large, as it is going to be used for many language applications such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on.
  • 57. Corpora can be distinguished as tagged corpus, parallel corpus and aligned corpus. A tagged corpus is one that is tagged for Part-of-Speech, morphology, lemma, phrases, etc. A parallel corpus contains texts and their translations in each of the languages involved. It allows wider scope for double-checking the translation equivalents. An aligned corpus is a kind of bilingual corpus where text samples of one language and their translations into another language are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.
  • 58. Applications of POS tagged corpus The POS tagged corpus is used in the following tasks: – Chunking – Parsing – Information extraction and retrieval – Tree bank creation – Document classification – Question answering
  • 59. Applications of POS tagged corpus cont… – Automatic dialogue systems – Speech processing – Summarization – Statistical training of language models – Machine translation using multilingual corpora – Text checkers for evaluating spelling and grammar – Computer lexicography – Educational applications like Computer Assisted Language Learning
  • 60. Complexity in Dravidian POS tagging As Dravidian languages are agglutinative, nouns are inflected for number and case. Verbs are inflected with tense, person, number and gender suffixes. Verbs are adjectivalized and adverbialized. Verbs and adjectives are also nominalized by means of certain nominalizers. Adjectives and adverbs do not inflect. Many postpositions in Tamil [Arden 1942; Rajendran S, 2007] are from nominal and verbal sources. So one often has to depend on the syntactic function or context to decide whether a word is a noun, adjective, adverb or postposition.
  • 61. This leads to the complexity of POS tagging in Tamil. Root ambiguity The root word can be ambiguous: it can have more than one sense, and sometimes a root belongs to more than one POS category. Though the POS can be disambiguated using contextual information such as co-occurring morphemes, this is not always possible. These issues should be taken care of when POS taggers are built for Tamil. For example, Tamil root words like adi, padi, isai, mudi and kudi can be either noun or verb, which leads to the root ambiguity problem in POS tagging.
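Root ambiguity can be pictured as a lexicon in which one root maps to several categories. The sketch below is a minimal illustration: the five roots are the ones named above, while the coarse labels `"N"` and `"V"` and the function names are hypothetical and are not part of any tagset discussed here.

```python
# Roots named in the text as ambiguous between noun and verb.
# "N" and "V" are hypothetical coarse labels used only for this sketch.
ROOT_CATEGORIES = {
    "adi": {"N", "V"},
    "padi": {"N", "V"},
    "isai": {"N", "V"},
    "mudi": {"N", "V"},
    "kudi": {"N", "V"},
}


def candidate_categories(root):
    """Return every category listed for a root (empty set if unknown)."""
    return ROOT_CATEGORIES.get(root, set())


def is_ambiguous(root):
    """A root is ambiguous when the lexicon lists more than one category."""
    return len(candidate_categories(root)) > 1


print(is_ambiguous("padi"))  # prints True
```

A tagger would then need context, for instance the co-occurring morphemes, to choose among the candidates returned for an ambiguous root.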
  • 62. Noun complexity Nouns are words which denote a person, place, thing, time, etc. In Tamil, nouns are inflected for number and case at the morphological level. Morphological level inflection: Noun (+ number) (+ case). Example: pUk-kaL-ai <NN> (flower + plural + accusative case suffix). Noun (+ number) (+ oblique) (+ euphonic) (+ case). Example: pUk-kaL-in-Al <NN> (flower + plural + euphonic suffix + instrumental case suffix). Nouns further need to be annotated as common noun, compound noun, proper noun, compound proper noun, pronoun, cardinal and ordinal.
  • 63. Pronouns need to be further annotated as personal pronouns. There is ambiguity between common noun and compound noun, and also between proper noun and compound proper noun. A common noun can also occur as part of a compound noun, for example UrAdci <NNC> thalaivar <NNC>. When UrAdci and thalaivar occur together they can form a compound noun (<NNC>), but when UrAdci and thalaivar occur separately in a sentence each should be tagged as a common noun (<NN>). The same ambiguity occurs with the proper noun (<NNP>) and compound proper noun (<NNPC>). Moreover, at the syntactic level there is ambiguity between noun and adverb, and between pronoun and emphasis.
  • 64. Verb complexity The verbal forms are complex in Tamil. A finite verb shows the following morphological structure: Verb stem + Tense + Person-Number + Gender. Example: nada + nth + En <VF> ‘I walked’. A number of non-finite forms are possible: adverbial forms, adjectival forms, infinitive forms, and conditional. Verb stem + adverbial participle. Example: cey + thu = ceythu <VNAV> ‘having done’
  • 65. Verb stem + relative participle. Example: cey + tha = ceytha <VNAJ> ‘who did’. Verb stem + infinitive suffix. Example: azu + a = aza <VINT> ‘to weep’. Verb stem + conditional suffix. Example: kEL + d + Al = kEddAl <CVB> ‘if asked’. A distinction needs to be made between a main verb followed by a main verb and a main verb followed by an auxiliary verb. A main verb followed by an auxiliary verb needs to be interpreted together with it, whereas a main verb followed by a main verb needs to be interpreted separately. This leads to functional ambiguity.
  • 66. Developing Part-of-Speech taggers for Indian languages For Bengali, Sandipan et al. (2007) developed a corpus based semi-supervised learning algorithm for POS tagging based on HMMs. Their system uses a small tagged corpus (500 sentences) and a large unannotated corpus along with a Bengali morphological analyzer. When tested on a corpus of 100 sentences (1003 words), their system obtained an accuracy of 95%.
  • 67. Smriti Singh et al. (2006) proposed a tagger for Hindi that uses the affix information stored in a word and assigns a POS tag using no contextual information. By considering the previous and the next word in the Verb Group (VG), it correctly identifies the main verb and the auxiliaries. Lexicon lookup was used for identifying the other POS categories. In the NLPAI ML contest, Dalal et al. (2006) achieved accuracies of 82.22% and 82.4% for Hindi POS tagging and chunking respectively using maximum entropy models. Karthik et al. (2006) got 81.59% accuracy for Telugu POS tagging using HMMs. Sivaji et al. (2006) came up with a rule based chunker for Bengali which gave an accuracy of 81.64%. The training data for all three languages contained approximately 20,000 words and the testing data had approximately 5000 words.
  • 68. For Telugu, three POS taggers have been proposed using different POS tagging approaches, viz., (1) a rule-based approach, (2) the transformation based learning (TBL) approach of Eric Brill, and (3) a Maximum Entropy Model, a machine learning technique [Ramasree, R.J and Kusuma Kumari, P, 2007]. A Hidden Markov Model (HMM) based tagger for Hindi was proposed by Manish Shrivastava and Pushpak Bhattacharyya (2008). The authors attempted to utilize the morphological richness of the language without resorting to complex and expensive analysis. The core idea of their approach was to explode the input in order to increase the length of the input and to reduce the number of unique types encountered during learning. This in turn increases the probability score of the correct choice while simultaneously decreasing the ambiguity of the choices at each stage.
  • 69. A stochastic Hidden Markov Model (HMM) based part-of-speech tagger has been proposed for Malayalam. To perform part-of-speech tagging using a stochastic approach, an annotated corpus is needed. Due to the non-availability of an annotated corpus, a morphological analyzer was also developed to generate a tagged corpus from the training set [Manju K et al., 2009]. Various methodologies have been developed for POS tagging for Tamil. A rule-based POS tagger for Tamil was developed and tested [Arulmozhi et al., 2004]. This system gives only the major tags, and the sub-tags are overlooked during evaluation. A hybrid POS tagger for Tamil using an HMM technique and a rule based system was also developed [Arulmozhi P and Sobha L, 2006].
  • 70. Lakshmana Pandian S and Geetha T V (2008) developed a morpheme based language model for Tamil part-of-speech tagging. The language model categorizes a word's part of speech based on its stem type, its last morpheme, and the morpheme preceding the last one. For estimating the contribution factors of the model, they followed the generalized iterative scaling technique. Dhanalakshmi et al. (2008) proposed an SVM based tagger using linear programming and developed their own POS tagset for Tamil, which has 32 tags. They used this tagset to annotate their corpus, then trained their model and reported an accuracy of 95.63%. Dhanalakshmi et al. (2009) also proposed another tagger in which machine learning techniques were used to extract linguistic information, which was then used to train an SVM based tagger. They used their own 32-tag tagset for annotating the corpus and reported an accuracy of 95.64%.
  • 71. Considerable effort has also been put into developing POS taggers for other Indian languages. For Malayalam, an HMM based tagger was proposed by Manju et al.; since they did not have an annotated corpus, they used a morphological analyzer to generate the corpus, which was then used for training the HMM. Another tagger for Malayalam was developed by Anthony et al. [2009], who used Support Vector Machines (SVM). They used SVMTool, developed by Giménez and Màrquez, for tagging. For developing this tagger, Anthony et al. first proposed a tagset which they claim is suitable for Malayalam and then created an annotated corpus using this tagset. Their tagger reported 94% accuracy with their tagset.
  • 72. Word Sense Disambiguation • Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is considered an AI-complete problem, that is, a task whose solution is at least as hard as the most difficult problems in artificial intelligence. A striking feature of natural language is that many words and sentences have more than one meaning (i.e. are semantically ambiguous), and which meaning is correct depends on the context. This problem arises at several levels.
  • 73. There are problems at the level of individual words. Consider this example: The man went to the (old ladies hostel)/bank. What kind of 'bank'? A river bank, a source of money, or a blood bank? Here we have three distinct English words with the same spelling and pronunciation. Word sense disambiguation (WSD) is the problem of determining in which sense a word having a number of distinct senses is used in a given sentence. So WSD is the task of removing the ambiguity of a word in context.
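One simple way to pick a sense from context is to count the words shared between the sentence and a short description of each sense (a simplified Lesk-style overlap). The sketch below is only an illustration of that idea, not a method described in these slides. The glosses in `SENSES` and the `STOPWORDS` list are hypothetical and written by hand for the three senses of 'bank' mentioned above.

```python
# Hypothetical hand-written glosses for the three senses of "bank".
SENSES = {
    "bank": {
        "river bank": "sloping land beside a river or stream water",
        "financial bank": "institution that accepts deposits and lends money",
        "blood bank": "place where donated blood is stored for transfusion",
    }
}

# Hypothetical minimal stopword list, ignored when counting overlap.
STOPWORDS = {"the", "a", "an", "to", "of", "on", "in", "at", "and", "or", "for", "is"}


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence.

    When no gloss overlaps with the context, the first listed sense wins.
    """
    context = set(sentence.lower().split()) - STOPWORDS

    def score(item):
        _, gloss = item
        return len(context & (set(gloss.split()) - STOPWORDS))

    sense, _ = max(SENSES[word].items(), key=score)
    return sense


print(disambiguate("bank", "the man went to the bank to deposit money"))
# prints financial bank
print(disambiguate("bank", "the man sat on the bank of the river"))
# prints river bank
```

The overlap count is deliberately crude: it does no stemming, so 'deposit' and 'deposits' do not match, and a practical system would draw its glosses from a dictionary or a sense-tagged corpus.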