The document summarizes Rudolf Eremyan's work as a machine learning software engineer, including several natural language processing (NLP) projects. It describes a chatbot Eremyan built for TBC Bank in Georgia that gathered over 35,000 likes and facilitated over 100,000 conversations, mentions sentiment analysis on Facebook comments, and introduces NLP, covering its history and applications such as text classification, machine translation, and question answering. It also outlines Eremyan's theoretical NLP project: building a machine learning pipeline for text classification trained on a labeled dataset.
DataFest 2017. Introduction to Natural Language Processing by Rudolf Eremyan
1. RUDOLF EREMYAN
MACHINE LEARNING SOFTWARE ENGINEER
INTRODUCTION TO NATURAL LANGUAGE
PROCESSING
CONTACTS: EREMYAN.RUDOLF@GMAIL.COM HTTPS://WWW.LINKEDIN.COM/IN/RUDOLFEREMYAN/
2. CHATBOT FRAMEWORK FOR GEORGIAN
LANGUAGE
TI BOT FOR TBC
BANK
• 35K LIKES
• 100K CONVERSATIONS
• 8K ACTIVE USERS PER MONTH
• 41.5K USERS ASKED ABOUT THE WEATHER
• 1K P2P TRANSACTIONS IN
AUGUST
4. NATURAL LANGUAGE PROCESSING
https://en.wikipedia.org/wiki/Natural_language_processing
NATURAL LANGUAGE PROCESSING (NLP) IS A FIELD
OF COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE AND
COMPUTATIONAL LINGUISTICS CONCERNED WITH THE
INTERACTIONS BETWEEN COMPUTERS AND HUMAN
(NATURAL) LANGUAGES, AND, IN PARTICULAR,
CONCERNED WITH PROGRAMMING COMPUTERS TO
FRUITFULLY PROCESS LARGE NATURAL LANGUAGE
CORPORA.
5. THE HISTORY OF NLP
1950 - ALAN TURING PUBLISHED
AN ARTICLE TITLED "COMPUTING
MACHINERY AND
INTELLIGENCE" WHICH
PROPOSED WHAT IS NOW
CALLED THE TURING TEST AS A
CRITERION OF INTELLIGENCE.
6. THE HISTORY OF NLP
1954 - THE GEORGETOWN
EXPERIMENT INVOLVED FULLY
AUTOMATIC TRANSLATION OF
MORE THAN SIXTY RUSSIAN
SENTENCES INTO ENGLISH. THE
AUTHORS CLAIMED THAT WITHIN
THREE OR FIVE YEARS, MACHINE
TRANSLATION WOULD BE A SOLVED
PROBLEM.
7. THE HISTORY OF NLP
1970 - MANY PROGRAMMERS BEGAN TO WRITE "CONCEPTUAL ONTOLOGIES", WHICH STRUCTURED REAL-
WORLD INFORMATION INTO COMPUTER-UNDERSTANDABLE DATA. EXAMPLES ARE QUALM (LEHNERT, 1977),
POLITICS (CARBONELL, 1979), AND PLOT UNITS (LEHNERT 1981). DURING THIS TIME, MANY CHATTERBOTS
WERE WRITTEN, INCLUDING PARRY AND RACTER.
• WORDNET
• EUROWORDNET
• SENTIWORDNET
8. THE HISTORY OF NLP
1980 - THERE WAS A REVOLUTION IN NLP WITH
THE INTRODUCTION OF MACHINE LEARNING
ALGORITHMS FOR LANGUAGE PROCESSING. PART-
OF-SPEECH TAGGING INTRODUCED THE USE OF
HIDDEN MARKOV MODELS TO NLP, AND
INCREASINGLY, RESEARCH HAS FOCUSED ON
STATISTICAL MODELS, WHICH MAKE SOFT,
PROBABILISTIC DECISIONS BASED ON ATTACHING
REAL-VALUED WEIGHTS TO THE FEATURES MAKING
UP THE INPUT DATA.
9. THE HISTORY OF NLP
IN RECENT YEARS, THERE HAS BEEN A FLURRY OF RESULTS SHOWING DEEP
LEARNING TECHNIQUES ACHIEVING STATE-OF-THE-ART RESULTS IN MANY
NATURAL LANGUAGE TASKS, FOR EXAMPLE IN LANGUAGE MODELING,
PARSING AND MANY OTHERS.
12. NLP APPLICATIONS
TEXT CLASSIFICATION
TEXT CLUSTERING
TEXT SUMMARISATION
MACHINE TRANSLATION
SEMANTIC SEARCH
SENTIMENT ANALYSIS
QUESTION ANSWERING
INFORMATION EXTRACTION
13. NLP. TEXT CLASSIFICATION
Document classification or
document categorization is a
problem in library science,
information science and computer
science. The task is to assign a
document to one or more classes or
categories. This may be done
"manually" or algorithmically.
Popular algorithms:
1. Multinomial Naive Bayes
2. SVM
3. Neural Networks
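The first listed algorithm can be sketched in a few lines. This is a minimal, illustrative example assuming scikit-learn is installed; the tiny "sports"/"tech" training set and the pipeline shape are invented for demonstration, not taken from the talk.

```python
# Toy text classification: TF-IDF features + Multinomial Naive Bayes.
# The tiny labeled training set below is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the team won the football match",
    "a great goal in the last minute of the game",
    "new smartphone released with a faster processor",
    "the laptop ships with more memory and storage",
]
train_labels = ["sports", "sports", "tech", "tech"]

# Pipeline: raw text -> TF-IDF vectors -> Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the match ended with one goal"]))
```

With realistic data, the same two-step pipeline is trained on thousands of labeled documents; only the dataset changes.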
14. NLP. TEXT CLUSTERING
Document clustering (or text
clustering) is the application of
cluster analysis to textual
documents. It has applications in
automatic document organization,
topic extraction and fast information
retrieval or filtering.
Popular algorithms:
1. k-Means
2. DBSCAN
3. Deep Learning
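A minimal sketch of the first listed approach, assuming scikit-learn is available: documents become TF-IDF vectors and k-Means groups them. The four toy documents are invented for illustration.

```python
# Toy document clustering: TF-IDF vectors grouped with k-Means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat chased the dog",
    "a dog and a cat played in the garden",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]

# Vectorize (dropping English stop words), then cluster into 2 groups.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # pet documents share one label, market documents the other
```

Unlike classification, no labels are supplied; the grouping emerges from term overlap alone.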
15. NLP. TEXT SUMMARISATION
Automatic summarization is the
process of shortening a text
document with software, in order
to create a summary with the
major points of the original
document. Technologies that
can make a coherent summary
take into account variables such
as length, writing style and
syntax.
Popular algorithms:
1. LDA
2. Deep Learning
16. NLP. MACHINE TRANSLATION
MT performs simple substitution of words in
one language for words in another, but that
alone usually cannot produce a good
translation of a text because recognition of
whole phrases and their closest counterparts
in the target language is needed. Solving this
problem with corpus statistical, and neural
techniques is a rapidly growing field that is
leading to better translations, handling
differences in linguistic typology, translation of
idioms, and the isolation of anomalies
Algorithms:
1. Rule based
2. Statistical methods
3. Encoder-Decoder
17. NLP. SEMANTIC SEARCH
Semantic search seeks to
improve search accuracy by
understanding searcher intent
and the contextual meaning of
terms as they appear in the
searchable dataspace, whether
on the Web or within a closed
system, to generate more
relevant results.
Approaches:
1. Entity Recognition
2. User context
18. NLP. SENTIMENT ANALYSIS
Sentiment Analysis is the
process of determining whether a
piece of writing is positive,
negative or neutral. It's also
known as opinion mining,
deriving the opinion or attitude of
a speaker.
Algorithms:
1. Lexicon-based
2. Machine Learning (SVM)
3. Deep Learning (RNN, LSTM)
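The lexicon-based approach listed first can be illustrated with a toy scorer: count positive and negative words and compare. The five-word lexicons below are invented for the example; real systems use large curated lexicons such as SentiWordNet (mentioned earlier in the deck).

```python
# Minimal lexicon-based sentiment: positive word count minus negative count.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("what a great and excellent movie"))  # positive
print(sentiment("i hate this terrible phone"))        # negative
```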
19. NLP. QUESTION ANSWERING
Question answering (QA) is a
computer science discipline within
the fields of information retrieval and
natural language processing (NLP),
which is concerned with building
systems that automatically answer
questions posed by humans in a
natural language.
Algorithms:
1. Rule based
2. Machine Learning
3. Deep Learning
20. NLP. INFORMATION EXTRACTION
Information extraction is the task of automatically
extracting structured information from unstructured
and/or semi-structured machine-readable documents.
22. NLP. STEMMER
Stemmers remove morphological affixes from words, leaving only the word stem.
bananas -> banana
flies -> fli
cats -> cat
dogs -> dog
How about “flies” -> fly?
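The slide's outputs come from blind suffix stripping, which the toy rules below reproduce. This is a deliberately crude sketch to show *why* "flies" becomes "fli": with no dictionary, the stemmer cannot know the base form is "fly". Real stemmers such as NLTK's Porter stemmer use a much larger, ordered rule set.

```python
# Toy suffix-stripping stemmer reproducing the slide's examples.
def stem(word: str) -> str:
    if word.endswith("ies"):
        return word[:-3] + "i"   # flies -> fli (no dictionary, so not "fly")
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # bananas -> banana, cats -> cat, dogs -> dog
    return word

for w in ["bananas", "flies", "cats", "dogs"]:
    print(w, "->", stem(w))
```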
23. NLP. MORPHOLOGICAL ANALYZER
Lemmatization usually refers to doing things properly
with the use of a vocabulary and morphological
analysis of words, normally aiming to remove
inflectional endings only and to return the base or
dictionary form of a word, which is known as the
lemma .
flies -> fly
went -> go
am, are, is -> be
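In contrast to the stemmer above, a lemmatizer needs a vocabulary. The toy lookup table below covers only the slide's examples and is invented for illustration; real lemmatizers (e.g. NLTK's WordNetLemmatizer) consult full morphological lexicons.

```python
# Toy dictionary-lookup lemmatizer: map inflected forms to dictionary forms.
LEMMAS = {"flies": "fly", "went": "go", "am": "be", "are": "be", "is": "be"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)  # unknown words fall back to themselves

for w in ["flies", "went", "am", "are", "is"]:
    print(w, "->", lemmatize(w))
```

The vocabulary is what lets lemmatization map "flies" to "fly" where the stemmer could only produce "fli".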
25. NLP. POS TAGGER
A Part-Of-Speech Tagger (POS Tagger) is a piece of
software that reads text in some language and
assigns parts of speech to each word (and other
token), such as noun, verb, adjective, etc., although
generally computational applications use more fine-
grained POS tags like 'noun-plural'.
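A minimal sketch of the idea: look each token up in a small lexicon, fall back to crude suffix rules, and default to NOUN. Both the lexicon and the rules here are invented for illustration; production taggers (HMM- or perceptron-based, as on the 1980s slide) are trained on annotated corpora.

```python
# Toy lexicon-plus-suffix POS tagger.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "dogs": "NOUN",
           "runs": "VERB", "quickly": "ADV"}

def tag(tokens):
    tagged = []
    for t in tokens:
        if t in LEXICON:
            tagged.append((t, LEXICON[t]))   # known word: look it up
        elif t.endswith("ly"):
            tagged.append((t, "ADV"))        # crude suffix rule
        elif t.endswith("ing"):
            tagged.append((t, "VERB"))       # crude suffix rule
        else:
            tagged.append((t, "NOUN"))       # default guess
    return tagged

print(tag("the dog runs quickly".split()))
```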
27. NLP. PARSER
A natural language parser is a program that works out the grammatical structure of
sentences, for instance, which groups of words go together (as "phrases") and which
words are the subject or object of a verb.
[Figure: dependency tree and constituency tree examples]

28. NLP. NAMED ENTITY RECOGNIZER
Named-entity recognition (NER) (also known as entity identification, entity chunking and
entity extraction) is a subtask of information extraction that seeks to locate and classify
named entities in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
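The simplest form of NER is gazetteer matching: scan the text for names from a known-entity list. The gazetteer below is invented for illustration (it reuses names from this deck); real NER systems use sequence models trained on annotated data.

```python
# Toy gazetteer-based named entity recognizer.
GAZETTEER = {
    "Tbilisi": "LOCATION",
    "Georgia": "LOCATION",
    "TBC Bank": "ORGANIZATION",
    "Alan Turing": "PERSON",
}

def find_entities(text: str):
    # Return (name, label) for every gazetteer entry found in the text.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(find_entities("Alan Turing never visited TBC Bank in Tbilisi"))
```

Gazetteers fail on unseen names and ambiguous strings, which is why statistical sequence models dominate in practice.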
29. PROJECT. THEORETICAL PART
THERE IS A DATASET OF LABELED TEXTS;
OUR TASK IS TO CREATE A MACHINE LEARNING
PIPELINE FOR TEXT CLASSIFICATION,
TRAINED ON THE GIVEN DATA
33. PROJECT. FEATURE EXTRACTION.TF-IDF
“TF-IDF is a weighting scheme that assigns each term in a
document a weight based on its term frequency (tf) and inverse
document frequency (idf). The terms with higher weight scores
are considered to be more important. It’s one of the most popular
weighting schemes in Information Retrieval”
34. PROJECT. FEATURE EXTRACTION.TF-IDF
Term Frequency (TF)
“Term Frequency, which measures how frequently a term occurs in a document.
Since every document is different in length, it is possible that a term would appear
much more times in long documents than shorter ones. Thus, the term frequency is
often divided by the document length as a way of normalization”
TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document)
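The TF formula above translates directly into code. A pure-Python sketch, with a toy six-word document invented for the example:

```python
# TF(t) = (occurrences of term t in the document) / (total terms in the document)
doc = "the cat sat on the mat".split()

def tf(term: str, document: list) -> float:
    return document.count(term) / len(document)

print(tf("the", doc))  # "the" occurs 2 times out of 6 terms: 2/6
```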
35. PROJECT. FEATURE EXTRACTION.TF-IDF
Inverse Document Frequency(IDF)
“IDF: Inverse Document Frequency, which measures how important a term is. While
computing TF, all terms are considered equally important. However it is known that
certain terms, such as "is", "of", and "that", may appear a lot of times but have
little importance. Thus we need to weigh down the frequent terms while scaling up
the rare ones, by computing the following:”
IDF(t) = log_e(Total number of documents / Number of documents with term t in
it)
Base-10 logarithms work just as well, although the resulting values are considerably smaller.
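The IDF formula, and the combined TF-IDF weight, can likewise be computed directly. A pure-Python sketch over a toy three-document corpus (invented for the example), using the natural log as in the formula above:

```python
# IDF(t) = log_e(total documents / documents containing term t)
import math

corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat and the dog".split(),
]

def idf(term: str, docs: list) -> float:
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term: str, doc: list, docs: list) -> float:
    # Combined weight: term frequency in this document times corpus-level IDF.
    return doc.count(term) / len(doc) * idf(term, docs)

print(idf("the", corpus))  # log(3/3) = 0.0 -> "the" carries no weight
print(idf("cat", corpus))  # log(3/2) > 0 -> rarer terms weigh more
```

Note how the ubiquitous "the" gets weight zero: exactly the "weigh down the frequent terms" behaviour the quote describes.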
49. PROJECT. TEXT CLASSIFICATION EVALUATION
“If you cannot measure it, you cannot improve it”
Lord Kelvin
Main metrics for Text Classification:
Precision and Recall
Precision and recall are the measures used in the information
retrieval domain to measure how well an information retrieval
system retrieves the relevant documents requested by a user.
The measures are defined as follows:
Precision = Total number of documents retrieved that are
relevant/Total number of documents that are retrieved.
Recall = Total number of documents retrieved that are
relevant/Total number of relevant documents in the database.
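The two definitions above, computed directly. The document IDs are invented for the example: the system retrieved four documents, three were truly relevant, and two of those were retrieved.

```python
# Precision and recall from set overlap between retrieved and relevant docs.
retrieved = {"d1", "d2", "d3", "d4"}   # documents the system returned
relevant = {"d1", "d2", "d5"}          # documents that are truly relevant

true_positives = len(retrieved & relevant)   # d1, d2

precision = true_positives / len(retrieved)  # 2/4 = 0.5
recall = true_positives / len(relevant)      # 2/3 ~ 0.667

print(precision, recall)
```

The tension is visible even in this toy case: retrieving more documents can only raise recall, but tends to lower precision.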