Natural Language Processing using Java
SangVenkatraman
April 21, 2015
Agenda
• Text Retrieval and Search
• Implementing Search
• Evaluating Search Results
• NLP - Document Level Analysis
• Parsing and Part of Speech Tagging
• Entity Extraction
• Word Sense Disambiguation
• Concept Extraction
• Concept Polarity
• NLP - Sentence Level Analysis
• Document Summarization
• Dependency Analysis and Coreference
• Example Question Parsing System
• Sentiment Analysis
• Final Thoughts/Questions
Text Retrieval and Search
• A collection of text documents exists in a system. This is called
the corpus.
• The documents are preprocessed and indexed before query time.
• The user performs a query. The query defines one or more concepts
that the user is interested in, for example “Thai restaurant in Atlanta”.
• The search engine is expected to retrieve the most relevant
documents based on a ranking function.
• The search engine can also apply heuristics based on user
feedback (such as always ignoring a specific document) to further
prune the results.
Search - Vector Space Model
• Term: A word or a set of words (n-grams)
• Each term defines one dimension
• Query Vector: q = (X1,…,Xn)
• Document Vector: d = (Y1,…,Yn)
• relevance(q,d) ~ similarity(q,d)
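One common choice for similarity(q,d) is the cosine of the angle between the two term vectors. The sketch below is a minimal illustration under assumptions: the class name `VectorSpaceSketch`, the whitespace tokenization and the raw term counts as vector weights are all illustrative choices, not something the slides prescribe.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative helper (not from any library): bag-of-words vectors and cosine similarity.
class VectorSpaceSketch {

    // Builds a sparse term -> count vector from whitespace-separated, lower-cased text.
    static Map<String, Integer> toVector(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Cosine of the angle between two sparse term vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> q, Map<String, Integer> d) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0);
        }
        double normQ = norm(q);
        double normD = norm(d);
        return (normQ == 0.0 || normD == 0.0) ? 0.0 : dot / (normQ * normD);
    }

    // Euclidean length of a sparse vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = toVector("thai restaurant atlanta");
        System.out.println(cosine(query, toVector("best thai restaurant in atlanta")));
        System.out.println(cosine(query, toVector("red shoe")));
    }
}
```

Terms missing from a vector count as zero, so the query and every document implicitly share the same n dimensions.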
Preparing Text for Search
• Tokenization: We split each document into paragraphs, paragraphs into
sentences and sentences into words.
• Word Normalization:
• Index text and query terms must have the same form, e.g. match U.S.A and USA
• Usually lower cased
• Stop word Removal: An optional step where a predefined list of stop words is
removed. More important for small corpora
• Stemming - Reduce terms to their stems
• Language dependent - in English, many words can be split into a stem and an affix
• automate(s), automatic, automation => automat, plural forms like cats => cat
• The “stem” may not be an actual word, e.g. consolidating => consolid
http://snowball.tartarus.org/algorithms/english/stemmer.html
The inverted index part of the image is taken from http://butchiso.com/assets/posts/mysql-full-text-search-p3/inverted_index.png
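The tokenization, normalization and stop word steps above can be sketched in a few lines. This is only an illustration: the class name `TextPreparationSketch` and the seven-word stop list are assumptions, and stemming is deliberately left out because it is better delegated to a library such as Snowball.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextPreparationSketch {

    // Tiny illustrative stop word list; real systems use a much larger one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "in", "of", "and"));

    // Splits on whitespace, normalizes each token and drops stop words.
    static List<String> prepare(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // Normalization: lower-case and keep only letters and digits,
            // so "U.S.A" and "USA" both become "usa".
            String term = raw.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(prepare("The rose is red."));
        System.out.println(prepare("Made in the U.S.A"));
    }
}
```

The same `prepare` step has to run over both the indexed documents and the query, otherwise the terms will not match.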
Search Example
• For any given term in the query:
• Term Frequency (TF) - The number of times a term occurs in a document, normalized by the
total number of terms in the document.
• Document Frequency (DF) - The number of documents that the term occurs in
• Inverse Document Frequency (IDF) - The inverse of DF (commonly log-scaled). It is high for less
frequent terms and low for more frequent terms.
• Simple ranking of documents for a query
• For all the terms in the query, sum up the product of TF and IDF. This score is used to rank
the results, with the documents with the highest tf-idf on top.
• Example:
• Document 1 = “The rose is red”
• Document 2 = “Red shoe”
• Query 1 = “Red” => both Document 1 and Document 2 match and rank equally, because after
removing stop words each document has two terms, so the TF of “red” is 1/2 in both
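The ranking rule above can be checked on the two example documents. The sketch assumes the documents are already tokenized with stop words removed, and it uses a smoothed IDF of log(1 + N/DF), which is an assumption made here so that a term occurring in every document still gets a non-zero weight; the class name `TfIdfSketch` is illustrative.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Smoothed inverse document frequency (an assumption): log(1 + N / DF),
    // or 0 when the term occurs in no document.
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return df == 0 ? 0.0 : Math.log(1.0 + (double) corpus.size() / df);
    }

    // Sum of TF * IDF over all query terms.
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double total = 0.0;
        for (String term : query) {
            total += tf(term, doc) * idf(term, corpus);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("rose", "red");   // "The rose is red"
        List<String> doc2 = Arrays.asList("red", "shoe");   // "Red shoe"
        List<List<String>> corpus = Arrays.asList(doc1, doc2);
        List<String> query = Arrays.asList("red");
        System.out.println(score(query, doc1, corpus));
        System.out.println(score(query, doc2, corpus));
    }
}
```

Both documents get the same score for “red”, which matches the tie described in the example.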
Evaluating Search Results
• Search results can be evaluated with two metrics that encourage two kinds of algorithm
behavior:
• High Precision - Very few false positives. Critical for systems that cannot make a
wrong recommendation.
• High Recall - Very few misses. Critical for systems where missed opportunities
need to be minimized and a false positive has a low cost.
• FMeasure - The harmonic mean of precision and recall. It balances the
exploratory nature of search against the precision of the results.
• With a = relevant documents retrieved, b = relevant documents not retrieved and
c = non-relevant documents retrieved:
• precision = a/(a + c)
• recall = a/(a + b)
• fMeasure = 2 * precision * recall/(precision + recall)
• Example:
• retrieved documents = 5, relevant documents = 10
• relevant documents within the 5 retrieved results = 4
• precision = 4/5 = 0.8, recall = 4/10 = 0.4, FMeasure = 0.53
The definitions of a, b and c follow the table from https://www.coursera.org/course/textretrieval
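The three formulas translate directly into code. A minimal sketch, with the class and method names (`RetrievalMetricsSketch`, `precision`, `recall`, `fMeasure`) chosen here for illustration, reproduces the worked example:

```java
class RetrievalMetricsSketch {

    // Fraction of retrieved documents that are relevant.
    static double precision(int relevantRetrieved, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) relevantRetrieved / retrieved;
    }

    // Fraction of relevant documents that were retrieved.
    static double recall(int relevantRetrieved, int relevant) {
        return relevant == 0 ? 0.0 : (double) relevantRetrieved / relevant;
    }

    // Harmonic mean of precision and recall; 0 when both are 0.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double p = precision(4, 5);   // 4 relevant among 5 retrieved
        double r = recall(4, 10);     // 4 found out of 10 relevant
        System.out.printf("precision=%.2f recall=%.2f f=%.2f%n", p, r, fMeasure(p, r));
    }
}
```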
Section Summary
• In this section, we applied NLP techniques across an entire
corpus. This is where frameworks like MapReduce play an
important role.
• The NLP techniques by themselves were shallow but were able to
implicitly handle compound words and stop words.
• Introduced a simple formula for ranking and retrieving search
results. Real-world systems use more complex probabilistic models
like BM25 that follow the same principles.
• Reviewed some techniques for evaluating search algorithms.
These simple approaches can also be used for other NLP and
machine learning problems.
http://en.wikipedia.org/wiki/Okapi_BM25
Big Data is for Losers.
I’m into Small Data now.
Extracting Concepts From Text
• We apply various NLP techniques to analyze the contents of a document. Some examples are:
• Mentions of people, places, locations etc.
• Central themes or concepts in the document
• This is different from search
• Search follows a pull model where the users take the initiative in querying the system for
relevant documents.
• In concept extraction, we can infer abstract concepts from text and push them to interested
users. We may also be able to infer the concepts a user is interested in based on the
content they consume.
Concept Extraction - Motivation
Sentence Segmentation
• Periods are ambiguous - Abbreviations, decimals etc.
• !, ? - Less ambiguous
• Classifier - rules (using case, punctuation rules etc.), ML etc.
• StanfordNLP sentence detection and tokenizer
• Trained on the Penn Treebank dataset and hence better suited to
formal English.
• OpenNLP has a sentence detector and tokenizer as well.
• Both libraries perform well for English and there is not
much to choose between them. They can also be retrained.
http://nlp.stanford.edu/software/tokenizer.shtml
https://opennlp.apache.org
https://github.com/dpdearing/nlp
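To see rule-based sentence detection without downloading a model, the JDK's own `java.text.BreakIterator` can stand in. Note that this is a swapped-in technique for illustration: it is neither the StanfordNLP nor the OpenNLP detector, it uses fixed rules rather than a trained model, and the class name `SentenceSplitSketch` is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {

    // Splits text into sentences using the JDK's rule-based sentence boundaries.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("Hello world. How are you? Fine!")) {
            System.out.println(sentence);
        }
    }
}
```

Fixed rules like these tend to stumble on abbreviations followed by a capitalized word, which is where the trained detectors above earn their keep.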
Part of Speech Tagging using StanfordNLP
• The StanfordNLP tagger is quite accurate (~90%) and is based on
a maximum entropy model.
TAG          POS            TAG   POS
DT           Determiner     PRP   Pronoun
JJ,JJR,JJS   Adjective      VB    Verbs
NN,NNS       Noun           IN    Preposition
NNP,NNPS     Proper Noun    CC    Conjunction
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Named Entity Recognition
• Named Entity Recognition is the NLP task of recognizing proper nouns in a
document.
• Named Entity Recognition consists of three steps:
• Spotting: Statistical models pre-trained on well-known corpus data help us
“spot” entities in the text.
• Disambiguation: Once spots are found, we may need to disambiguate them
(for example, when there are multiple entities with the same name and the correct URL
needs to be retrieved)
• Filtering: Remove named entities whose types we are not interested in or
entities that have very few links pointing to them.
• At the end of NER, we get back a set of URLs of the resources that were referenced
in the text.
Spotting is the process of identifying and assigning classes to named
entities.
1. STANFORDNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, which is located in <LOCATION>California</LOCATION>.
   OPENNLP:     I go to school at <ORGANIZATION>Stanford University</ORGANIZATION> which is located in <LOCATION>California</LOCATION>
2. STANFORDNLP: Schooled in the <LOCATION>Philippines</LOCATION>
   OPENNLP:     Schooled in the <LOCATION>Philippines</LOCATION>
3. STANFORDNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?
   OPENNLP:     Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?
4. STANFORDNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?
   OPENNLP:     What does <ORGANIZATION>GM</ORGANIZATION> produce?
5. STANFORDNLP: Is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.
   OPENNLP:     is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.
6. STANFORDNLP: I work at <ORGANIZATION>Chevy</ORGANIZATION>.
   OPENNLP:     I work at Chevy.
7. STANFORDNLP: I work at <ORGANIZATION>chevy</ORGANIZATION>.
   OPENNLP:     I work at chevy.
8. STANFORDNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car
   OPENNLP:     I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car
9. STANFORDNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
   OPENNLP:     You told me I was like the <LOCATION>Dead Sea</LOCATION>
Dbpedia Spotlight
• Dbpedia Spotlight is an API that can be used to perform all 3 steps of NER
• Spots - It identifies spots using a statistically backed model.
• Spots are disambiguated based on other references in the document
• URIs are retrieved for each of the identified named entities. These are usually
dbpedia URLs with references to freebase and other ontologies.
• Provides API calls to perform the steps of NER separately as well
• Spotting - Identifies only the spots
• Disambiguate - Performs disambiguation based on the options provided
• Annotate - Performs all 3 steps of NER and provides results
• Candidates - Provides a ranked list of candidates for each spot
https://github.com/dbpedia-spotlight/dbpedia-spotlight
Dbpedia Spotlight Results
ID  SONG                                 PRECISION  RECALL  FMEASURE
1   Here We Stand (Talking Heads)        1.0        0.33    0.5
    Expected: http://dbpedia.org/resource/Pizza_Hut
              http://dbpedia.org/resource/7-Eleven
              http://dbpedia.org/resource/Dairy_Queen
    Actual:   http://dbpedia.org/resource/7-Eleven
2   Kodachrome (Paul Simon)              1.0        0.5     0.66
    Expected: http://dbpedia.org/resource/Kodachrome
              http://dbpedia.org/resource/Nikon
    Actual:   http://dbpedia.org/resource/Nikon
3   Brand New Cadillac (The Clash)       1.0        1.0     1.0
    Expected: http://dbpedia.org/resource/Cadillac
    Actual:   http://dbpedia.org/resource/Cadillac
4   A Certain Romance (Arctic Monkeys)   1.0        1.0     1.0
    Expected: http://dbpedia.org/resource/Reebok
              http://dbpedia.org/resource/Converse_(shoe_company)
    Actual:   http://dbpedia.org/resource/Reebok
              http://dbpedia.org/resource/Converse_(shoe_company)
5   My Humps (Black Eyed Peas)           1.0        0.33    0.5
    Expected: http://dbpedia.org/resource/Prada
              http://dbpedia.org/resource/Gucci
              http://dbpedia.org/resource/Fendi
              http://dbpedia.org/resource/Dolce_&_Gabbana
              http://dbpedia.org/resource/True_Religion
    Actual:   http://dbpedia.org/resource/Prada
              http://dbpedia.org/resource/Gucci
    Mean                                 1.0        0.63    0.73
Querying the Semantic Web
• SPARQL is a query language for interacting with the semantic web.
• SPARQL is the equivalent of SQL for RDF stores.
• Ontologies provide knowledge about different entities, usually
in the form of subject-predicate-object triples.
• The English version of dbpedia contains 4.58 million things with 584
million facts.
PREFIX dbprop: <http://dbpedia.org/property/>
SELECT ?industry WHERE {
  <http://dbpedia.org/resource/Fendi> dbprop:industry ?industry
}
http://dbpedia.org/sparql
http://wiki.dbpedia.org/Datasets#h434-9
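What a single-pattern query like the one above does can be mimicked with a tiny in-memory triple list. This is a sketch for intuition only: the class `TripleStoreSketch`, the `ex:` identifiers and the three triples are made-up placeholders, not DBpedia data, and a real RDF store indexes triples rather than scanning them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TripleStoreSketch {

    // One subject-predicate-object statement.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Returns the objects of all triples with the given subject and predicate,
    // mirroring: SELECT ?o WHERE { <subject> <predicate> ?o }
    static List<String> objects(List<Triple> store, String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : store) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder triples, invented for this example.
        List<Triple> store = Arrays.asList(
                new Triple("ex:CompanyA", "ex:industry", "ex:Fashion"),
                new Triple("ex:CompanyA", "ex:foundedIn", "ex:CityX"),
                new Triple("ex:CompanyB", "ex:industry", "ex:Automotive"));
        System.out.println(objects(store, "ex:CompanyA", "ex:industry"));
    }
}
```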
Named Entity Recognition Demo
• http://dbpedia-spotlight.github.io/demo/
Extracting Concepts using Word Senses
http://www.picgifs.com/clip-art/activities/sweating/clip-art-sweating-328953.jpg
Word Sense Disambiguation
• For many words, multiple senses exist depending on the context. For
example, there are multiple senses for the word “bank” (even within the same part of
speech).
• Extremely difficult for computers. A combination of context and common sense
information makes this quite easy for humans.
• Word Sense Disambiguation can be useful for
• Machine translation between languages (surface form loses value during
translation because the only thing that matters is the sense of the word)
• Information Retrieval - Correct interpretation of the query. However, this can
be overcome by providing enough terms to only retrieve relevant documents.
• Automatic annotation of text
• Measuring semantic relatedness between documents.
http://babelnet.org/
https://code.google.com/p/dkpro-wsd/wiki/LSRs
• Solving the Word Sense Disambiguation Problem
• Need an inventory of knowledge that can be used to disambiguate words. Usually a graph
structure. Some examples are:
• WordNet
• Wikipedia
• Yago
• Freebase
• ConceptNet
• Algorithms to traverse the inventory and retrieve the most likely disambiguation of a word.
These are usually graph algorithms that work on a measure of centrality, such as degree
centrality.
• Assumptions:
• The document has enough context to disambiguate the word correctly. If not, we
default to the most frequent sense of the word.
• Single sense per discourse
WordNet
• WordNet is a hierarchically organized lexical database widely used in NLP applications. Started at
Princeton in 1985.
• Contains nouns, verbs, adjectives and adverbs
• Words are separated into senses and are represented as synsets.
• The noun “bank” can have multiple senses based on the context (e.g. bank of a river, financial
institution etc.)
• Synsets are connected by well-defined semantic relationships
• The majority of WordNet relations connect words from the same part of speech.
• Can be accessed in Java using the extJWNL library
PART OF SPEECH   UNIQUE STRINGS
Noun             117,798
Verb             11,529
Adjective        22,479
Adverb           4,481
http://extjwnl.sourceforge.net/
WordNet Synsets
Synset format => baseform#pos#index
bank#n#1 -> river bank
bank#n#2 -> Financial institution
bank#v#3 -> bank with a financial institution
http://wordnetweb.princeton.edu/perl/webwn
WordNet Relationships
• Hypernym - Defines a superordinate relationship.
• Motor vehicle is a hypernym of car
• Hyponym - Subordinate relationship
• Mango is a hyponym of fruit
• The root node of nouns is “entity”
• Other relationships: InstanceOf, Synonyms/Antonyms, Meronym (PartOf) etc.
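Walking hypernym links upwards is how a hierarchy like this is usually navigated. The sketch below uses a hand-written, heavily collapsed map as a toy stand-in for WordNet (the real hierarchy has many intermediate levels between these nodes); the class name `HypernymSketch` and its methods are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HypernymSketch {

    // Hand-written child -> hypernym links; a toy stand-in, not WordNet's real hierarchy.
    static final Map<String, String> HYPERNYM = new HashMap<>();
    static {
        HYPERNYM.put("car", "motor vehicle");
        HYPERNYM.put("motor vehicle", "entity");
        HYPERNYM.put("mango", "fruit");
        HYPERNYM.put("fruit", "entity");
    }

    // Follows hypernym links from a word up to the root, starting with the word itself.
    static List<String> pathToRoot(String word) {
        List<String> path = new ArrayList<>();
        String current = word;
        while (current != null && !path.contains(current)) { // contains() guards against cycles
            path.add(current);
            current = HYPERNYM.get(current);
        }
        return path;
    }

    // True when 'ancestor' lies strictly above 'word' on its path to the root.
    static boolean isHyponymOf(String word, String ancestor) {
        return pathToRoot(word).indexOf(ancestor) > 0;
    }

    public static void main(String[] args) {
        System.out.println(pathToRoot("car"));
        System.out.println(isHyponymOf("mango", "fruit"));
    }
}
```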
http://www.ling.helsinki.fi/kit/2008s/clt231/nltk-0.9.5/doc/images/wordnet-hierarchy.png
Accessing WordNet using extJWNL
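• Download the WordNet 3.0 dataset
• Use the properties file to point to the location of WordNet
on the file system or in a database
• Lemmatization - Needed to get the base form of a word
(different from stemming) using the WordNet dictionary.
• cat and cats have the same lemma
import java.io.FileInputStream
import net.sf.extjwnl.data.POS
import net.sf.extjwnl.dictionary.Dictionary
// Load the dictionary from the properties file that points at the WordNet data.
val dictionary = Dictionary.getInstance(new FileInputStream("data/file_properties.xml"))
// lookupBaseForm returns an IndexWord (or null); fall back to the lower-cased input when nothing is found.
def getBaseForm(pos: POS, word: String): String = {
  val indexWord = dictionary.getMorphologicalProcessor.lookupBaseForm(pos, word.toLowerCase)
  if (indexWord != null) indexWord.getLemma else word.toLowerCase
}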
http://extjwnl.sourceforge.net/
WSD using WordNet
• Example 1 - “I am going to the bank”
• “bank” by itself usually just defaults to bank#n#1
• Example 2 - “What is the difference between a bank
and a credit union?”
• Credit Union only has one sense - credit_union#n#1
• Because credit union is present, “bank” is
disambiguated to “bank#n#2”
https://code.google.com/p/dkpro-wsd/wiki/LSRs
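The two examples above can be reproduced with a toy degree-based disambiguator: each candidate sense is scored by how many senses from the context it is linked to, and ties fall back to the most frequent sense. The relatedness edges below are hypothetical and hand-written for this sketch (a real system would derive them from WordNet or another inventory), and `WsdSketch` is an illustrative name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WsdSketch {

    // Hypothetical relatedness edges between senses, invented for this example.
    static final Map<String, Set<String>> RELATED = new HashMap<>();
    static {
        RELATED.put("bank#n#1", new HashSet<>(Arrays.asList("river#n#1")));
        RELATED.put("bank#n#2", new HashSet<>(Arrays.asList("credit_union#n#1", "money#n#1")));
    }

    // Picks the candidate sense linked to the most context senses.
    // Candidates are ordered most frequent first, so ties (including zero links)
    // resolve to the most frequent sense.
    static String disambiguate(List<String> candidates, Set<String> contextSenses) {
        String best = candidates.get(0);
        int bestDegree = -1;
        for (String candidate : candidates) {
            int degree = 0;
            for (String neighbour : RELATED.getOrDefault(candidate, Collections.emptySet())) {
                if (contextSenses.contains(neighbour)) {
                    degree++;
                }
            }
            if (degree > bestDegree) {
                best = candidate;
                bestDegree = degree;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> bank = Arrays.asList("bank#n#1", "bank#n#2");
        // "I am going to the bank": no helpful context.
        System.out.println(disambiguate(bank, new HashSet<>()));
        // "... a bank and a credit union": credit_union#n#1 is in the context.
        System.out.println(disambiguate(bank, new HashSet<>(Arrays.asList("credit_union#n#1"))));
    }
}
```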
Concept Graph
• WordNet does not capture any common sense information. For example,
bank (financial institution) and money do not have a close relationship
in WordNet.
• It is possible to use other resources like ConceptNet that map common
sense knowledge to WordNet (and ontologies like dbpedia). For example, we
can download mappings for concepts like Money, Love, Sports, Family
etc.
• Another option is to deploy a custom concept graph:
• Deploy WordNet onto a graph database. That forms the base graph.
• Deploy custom concept mappings to the WordNet synsets.
• Add mappings for relevant wikipedia (dbpedia) categories
http://conceptnet5.media.mit.edu/data/5.3/c/en/family?limit=1000
Concept Extraction Architecture
Concept Analysis of over 500K songs
Concept Polarity
• SentiWordNet is a lexical resource for opinion mining and
sentiment analysis
• SentiWordNet provides sentiment values for the different WordNet
synsets. For each synset in WordNet, SentiWordNet assigns
scores on 3 dimensions - positivity, negativity and objectivity.
• Once the central concepts are found, we can extract the polarity of
the concepts.
• Example:
• “They are really happy to be here” => happy#a#1 has a very
positive polarity.
http://sentiwordnet.isti.cnr.it/
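A polarity lookup can be sketched as a map from synset to scores. In SentiWordNet the three scores of a synset sum to 1, so objectivity can be derived from the other two. Everything else here is an assumption for illustration: the class `PolaritySketch`, the idea of reporting polarity as positivity minus negativity, and above all the score values, which are made up and not taken from the SentiWordNet data file.

```java
import java.util.HashMap;
import java.util.Map;

class PolaritySketch {

    // Positivity and negativity of one synset; objectivity is what remains of 1.0.
    static final class Scores {
        final double positive;
        final double negative;

        Scores(double positive, double negative) {
            this.positive = positive;
            this.negative = negative;
        }

        double objective() {
            return 1.0 - positive - negative;
        }

        // Signed polarity in [-1, 1]: positive minus negative.
        double polarity() {
            return positive - negative;
        }
    }

    // Made-up scores for illustration; real values come from the SentiWordNet file.
    static final Map<String, Scores> LEXICON = new HashMap<>();
    static {
        LEXICON.put("happy#a#1", new Scores(0.875, 0.0));
        LEXICON.put("sad#a#1", new Scores(0.125, 0.75));
    }

    // Polarity of a synset, or 0.0 (neutral) when the synset is not in the lexicon.
    static double polarity(String synset) {
        Scores scores = LEXICON.get(synset);
        return scores == null ? 0.0 : scores.polarity();
    }

    public static void main(String[] args) {
        System.out.println(polarity("happy#a#1"));
        System.out.println(polarity("sad#a#1"));
    }
}
```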
Section Summary
• Went beyond surface forms and analyzed the concepts
contained in documents.
• The approach was still mostly bag of words, meaning that
the structure of the individual sentences did not matter.
• These approaches, in tandem with common sense
knowledge sources, help in extracting concepts from
documents.
• They also allow documents to be compared based on
semantic similarity measures.
http://www.smosh.com/smosh-pit/photos/funny-smartass-siri-responses
Document Summarization
• Objective - Reduce the document in order to create a summary that retains the most
important points of the original document.
• Two Approaches:
• Extractive: Extract the sentences that are most representative of the content of the
document.
• Generative: Generate a summary of the text using words that may not be part of the
original text. This is a difficult task and is often not attempted.
• Evaluating summarization techniques:
• Somewhat subjective because humans sometimes cannot agree on the best summary
• Extractive Approaches
• Based on term frequency
• Based on sentence similarity
http://en.wikipedia.org/wiki/Apache_Cassandra
ID  EXPECTED SCORE  SENTENCE
1   High    Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
2   High    Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
3   Low     Cassandra also places a high value on performance.
4   Low     In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments.
5   High    Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies."
6   Medium  Cassandra's data model is a partitioned row store with tunable consistency.
7   Medium  Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key.
8   Low     Other columns may be indexed separately from the primary key.
9   Low     Tables may be created, dropped, and altered at runtime without blocking updates and queries.
10  Medium  Cassandra does not support joins or subqueries, except for batch analysis via Hadoop.
11  Medium  Rather, Cassandra emphasizes denormalization through features like collections.
TextRank
• A graph approach where each vertex is a sentence and each edge has a weight
corresponding to the similarity between the two sentences. Every vertex is
connected to every other vertex.
• For every sentence:
• Calculate its similarity to every other sentence. The similarity measure can be
simple, e.g. the normalized number of common terms between the
2 sentences
• Sum the similarity of the sentence to every other sentence (that is, sum each
row of the similarity matrix). That is the score of the sentence.
• Sort the vertices based on the sum of the weights of their edges and return the
top k sentences.
http://lit.csci.unt.edu/index.php/Graph-based_NLP
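The scoring steps above can be sketched as follows. To be plain about what this is: it implements the simplified row-sum scoring described on this slide, not the iterative PageRank-style update of the original TextRank algorithm, and it uses Jaccard overlap as the "normalized number of common terms", which is an assumption. The class name `SentenceRankSketch` is illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentenceRankSketch {

    // Lower-cased set of terms in a sentence, with punctuation stripped.
    static Set<String> terms(String sentence) {
        String cleaned = sentence.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}\\s]", "").trim();
        Set<String> set = new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
        set.remove("");
        return set;
    }

    // Jaccard overlap (an assumed similarity): shared terms / all distinct terms.
    static double similarity(String a, String b) {
        Set<String> common = terms(a);
        common.retainAll(terms(b));
        Set<String> union = terms(a);
        union.addAll(terms(b));
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    // Score of sentence i = sum of its similarity to every other sentence.
    static double[] scores(List<String> sentences) {
        double[] result = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            for (int j = 0; j < sentences.size(); j++) {
                if (i != j) {
                    result[i] += similarity(sentences.get(i), sentences.get(j));
                }
            }
        }
        return result;
    }

    // Returns the k highest-scoring sentences, best first.
    static List<String> topK(List<String> sentences, int k) {
        double[] s = scores(sentences);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(s[y], s[x]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            top.add(sentences.get(order.get(i)));
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Cats chase mice.", "Cats chase dogs.", "Dogs bark.");
        System.out.println(topK(sentences, 2));
    }
}
```

A plain term-overlap measure favors sentences that share common vocabulary with many others, so swapping in a different `similarity` is the natural place to experiment.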
SCORE  TOP SENTENCES
1.6    Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
1.125  Cassandra also places a high value on performance.
0.999  Other columns may be indexed separately from the primary key.
• Can the similarity metric be improved?
Dependency Analysis in Sentences
• StanfordNLP can be used to analyze the grammatical
structure of sentences and provide a dependency graph
between the different elements of the sentence.
• LexicalizedParser can provide a graph where the vertices
are the words and the edges are the grammatical
relationships in a sentence.
http://nlp.stanford.edu/software/lex-parser.shtml
TAG        MEANING                  TAG   MEANING
advmod     Adverbial Modifier       dobj  Direct Object, e.g. (gave,raise) in “She gave me a raise”
neg        Negation Modifier        iobj  Indirect Object, e.g. (gave,me)
nsubj      Nominal Subject          amod  Adjective Modifier
nsubjpass  Passive Nominal Subject  prep  Preposition
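Once a parser has produced typed dependencies, picking out the subject and objects is a simple lookup. The sketch below does not call StanfordNLP: the dependencies for “She gave me a raise” are written by hand in relation(governor, dependent) form, and the `DependencySketch` class and its `find` helper are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

class DependencySketch {

    // One typed dependency, read as relation(governor, dependent).
    static final class Dependency {
        final String relation;
        final String governor;
        final String dependent;

        Dependency(String relation, String governor, String dependent) {
            this.relation = relation;
            this.governor = governor;
            this.dependent = dependent;
        }
    }

    // Returns the dependent of the first dependency with the given relation, or null if none.
    static String find(List<Dependency> dependencies, String relation) {
        for (Dependency d : dependencies) {
            if (d.relation.equals(relation)) {
                return d.dependent;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-written dependencies for "She gave me a raise".
        List<Dependency> deps = Arrays.asList(
                new Dependency("nsubj", "gave", "She"),
                new Dependency("iobj", "gave", "me"),
                new Dependency("dobj", "gave", "raise"),
                new Dependency("det", "raise", "a"));
        System.out.println("subject=" + find(deps, "nsubj")
                + " indirect=" + find(deps, "iobj")
                + " direct=" + find(deps, "dobj"));
    }
}
```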
Question Parsing
Dependency Analysis
• Works well for short sentences. It loses accuracy when
the scope is increased to a document.
• May aid in text simplification by using the relationships
between the entities.
• By analyzing the subject and the object, we can clearly
establish a point of view (e.g. direct address vs. first
person vs. second person etc.).
• Could potentially help in story extrapolation but does
not generalize well, so this remains a topic of research.
Sentiment Analysis
• StanfordNLP has a deep learning model for sentiment
analysis.
• Takes a deep parsing approach to sentiment analysis - the
structure of the sentence is constructed prior to the analysis.
• Was trained on movie review data and obtained an
accuracy about 5% higher than the closest model.
• Uses an annotated dataset called the Stanford Sentiment
Treebank. Users are encouraged to add labels to improve
the model further.
Sentiment Analysis Examples
• Taxonomy
• Very Negative
• Negative
• Neutral
• Positive
• Very Positive
Sentiment Analysis Demo
• http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
StanfordNLP Sentiment Analysis
• Provides relatively good results for short sentences.
• Sentences that are similar to the training data (movie
reviews) perform much better than other sentences.
• No good way to aggregate sentiments across a
document. Future work would probably involve
document level dependency parsing and sentiment
analysis.
• Only provides overall sentiment. Does not provide an
indication of the object of the sentiment.
Final Thoughts
• Shallow NLP is employed in text retrieval and search and
provides good results for general search use cases.
• Deeper NLP involves semantic parsing, common sense
interpolation (both local and global knowledge bases) and
tends to be harder.
• Deeper NLP is more practical after picking a specific
domain, e.g. medical records, legal documents etc.
• 2 cents on Intelligence - Memory based systems
• http://watson-um-demo.mybluemix.net/
http://en.wikipedia.org/wiki/On_Intelligence
Resources
• StanfordNLP Github: https://github.com/stanfordnlp/CoreNLP
• Own repository: https://github.com/sangv/swsd
• Dbpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight
• Opennlp repo: https://github.com/apache/opennlp
• ConceptNet: conceptnet5.media.mit.edu
• On Intelligence book: http://en.wikipedia.org/wiki/On_Intelligence
Thank You
@sang_v

More Related Content

What's hot

Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...Preetha Chatterjee
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets🧑‍💻 Manuel Coppotelli
 
Extracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related ChatsExtracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related ChatsPreetha Chatterjee
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
Sentiment analysis in Twitter on Big Data
Sentiment analysis in Twitter on Big DataSentiment analysis in Twitter on Big Data
Sentiment analysis in Twitter on Big DataIswarya M
 
Sentiment Analysis Using Twitter
Sentiment Analysis Using TwitterSentiment Analysis Using Twitter
Sentiment Analysis Using Twitterpiya chauhan
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...Alexander Panchenko
 
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...Preetha Chatterjee
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrTrey Grainger
 
A review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigA review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigNurfadhlina Mohd Sharef
 
Sentiment Analysis
Sentiment Analysis Sentiment Analysis
Sentiment Analysis prnk08
 
Sentiment analysis of Twitter data using python
Sentiment analysis of Twitter data using pythonSentiment analysis of Twitter data using python
Sentiment analysis of Twitter data using pythonHetu Bhavsar
 
Aspect Opinion Mining From User Reviews on the web
Aspect Opinion Mining From User Reviews on the webAspect Opinion Mining From User Reviews on the web
Aspect Opinion Mining From User Reviews on the webKarishma chaudhary
 
New sentiment analysis of tweets using python by Ravi kumar
New sentiment analysis of tweets using python by Ravi kumarNew sentiment analysis of tweets using python by Ravi kumar
New sentiment analysis of tweets using python by Ravi kumarRavi Kumar
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media Ravindra Chaudhary
 
Mining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software ArtifactsMining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software ArtifactsPreetha Chatterjee
 
Sentiment mining- The Design and Implementation of an Internet Public Opinion...
Sentiment mining- The Design and Implementation of an Internet PublicOpinion...Sentiment mining- The Design and Implementation of an Internet PublicOpinion...
Sentiment mining- The Design and Implementation of an Internet Public Opinion...Prateek Singh
 

What's hot (20)

Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
 
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and TweetsSentiCheNews - Sentiment Analysis on Newspapers and Tweets
SentiCheNews - Sentiment Analysis on Newspapers and Tweets
 
Extracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related ChatsExtracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related Chats
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Sentiment analysis in Twitter on Big Data
Sentiment analysis in Twitter on Big DataSentiment analysis in Twitter on Big Data
Sentiment analysis in Twitter on Big Data
 
Sentiment Analysis Using Twitter
Sentiment Analysis Using TwitterSentiment Analysis Using Twitter
Sentiment Analysis Using Twitter
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
 
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
 
Semantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/SolrSemantic & Multilingual Strategies in Lucene/Solr
Semantic & Multilingual Strategies in Lucene/Solr
 
sentiment analysis
sentiment analysis sentiment analysis
sentiment analysis
 
A review of sentiment analysis approaches in big
A review of sentiment analysis approaches in bigA review of sentiment analysis approaches in big
A review of sentiment analysis approaches in big
 
Sentiment Analysis
Sentiment Analysis Sentiment Analysis
Sentiment Analysis
 
Sentiment analysis of Twitter data using python
Sentiment analysis of Twitter data using pythonSentiment analysis of Twitter data using python
Sentiment analysis of Twitter data using python
 
Aspect Opinion Mining From User Reviews on the web
Aspect Opinion Mining From User Reviews on the webAspect Opinion Mining From User Reviews on the web
Aspect Opinion Mining From User Reviews on the web
 
Project report
Project reportProject report
Project report
 
New sentiment analysis of tweets using python by Ravi kumar
New sentiment analysis of tweets using python by Ravi kumarNew sentiment analysis of tweets using python by Ravi kumar
New sentiment analysis of tweets using python by Ravi kumar
 
sentiment analysis text extraction from social media
sentiment  analysis text extraction from social media sentiment  analysis text extraction from social media
sentiment analysis text extraction from social media
 
Mining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software ArtifactsMining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software Artifacts
 
Sentiment mining- The Design and Implementation of an Internet Public Opinion...
Sentiment mining- The Design and Implementation of an Internet PublicOpinion...Sentiment mining- The Design and Implementation of an Internet PublicOpinion...
Sentiment mining- The Design and Implementation of an Internet Public Opinion...
 

Natural Language Processing using Java

  • 1. Natural Language Processing using Java SangVenkatraman April 21, 2015
  • 2. Agenda • Text Retrieval and Search • Implementing Search • Evaluating Search Results • NLP - Document Level Analysis • Parsing and Part of Speech Tagging • Entity Extraction • Word Sense Disambiguation • Concept Extraction • Concept Polarity • NLP - Sentence Level Analysis • Document Summarization • Dependency Analysis and Coreference • Example Question Parsing System • Sentiment Analysis • Final Thoughts/Questions
  • 3. Text Retrieval and Search • A collection of text documents exists in a system. This is called the corpus. • The documents are preprocessed and indexed before query time. • The user performs a query that defines one or more concepts of interest, e.g. “Thai restaurant in Atlanta”. • The search engine is expected to retrieve the most relevant documents based on a ranking function. • The search engine can also apply heuristics based on user feedback (such as always ignoring a specific document) to further prune the results.
  • 4. Search - Vector Space Model • Term: a word or set of words (n-grams) • Each term defines one dimension, so for a vocabulary of n terms the query and document vectors both have n components • Query Vector: q = (X1,…,Xn) • Document Vector: d = (Y1,…,Yn) • relevance(q,d) ~ similarity(q,d)
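One common choice for similarity(q,d) in the vector space model is the cosine of the angle between the two vectors. The slide does not name a specific similarity function, so the sketch below is only an illustration: the class name `CosineSimilarity`, the whitespace tokenizer and the use of raw term counts as vector components are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSimilarity {

    /** Builds a sparse term-count vector from lower-cased, whitespace-separated text. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity; terms missing from a vector count as 0 in that dimension. */
    static double cosine(Map<String, Integer> q, Map<String, Integer> d) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += (double) e.getValue() * d.getOrDefault(e.getKey(), 0);
        }
        double normQ = norm(q);
        double normD = norm(d);
        return (normQ == 0.0 || normD == 0.0) ? 0.0 : dot / (normQ * normD);
    }

    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int c : v.values()) {
            sumOfSquares += (double) c * c;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termCounts("Thai restaurant Atlanta");
        Map<String, Integer> doc = termCounts("thai restaurant");
        System.out.println(cosine(query, doc));
    }
}
```

Sparse maps are used instead of dense arrays because most documents contain only a small fraction of the vocabulary.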
  • 5. Preparing Text for Search • Tokenization: Each document is split into paragraphs, paragraphs into sentences, and sentences into words. • Word Normalization: • Index text and query terms must have the same form, e.g. match U.S.A and USA • Usually lower-cased • Stop Word Removal: An optional step where a predefined list of stop words is removed. More important for small corpora • Stemming: Reduce terms to their stems • Language dependent - in English, a word is treated as two parts, the stem and the affix • automate(s), automatic, automation => automat; plural forms like cats => cat • The “stem” may not be an actual word, e.g. consolidating => consolid • http://snowball.tartarus.org/algorithms/english/stemmer.html
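The first three preparation steps (tokenization, normalization, stop word removal) can be sketched with the Java standard library alone. This is a minimal sketch under stated assumptions: the class name `TextPreprocessor` and the tiny stop word list are hypothetical, tokens are split on whitespace only, and stemming is left out because it would normally be delegated to a stemmer such as Snowball.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TextPreprocessor {

    // Hypothetical, deliberately tiny stop word list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "in", "of", "and"));

    /** Tokenizes on whitespace, strips punctuation, lower-cases, and drops stop words. */
    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Removing every non-letter, non-digit character maps "U.S.A" and "USA" to the same term.
            String term = raw.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The rose is red"));
        System.out.println(preprocess("Made in the U.S.A"));
    }
}
```

The same `preprocess` call would be applied to both the indexed documents and the incoming query, so that the two sides end up in the same form.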
  • 6. [Figure] The inverted index part of the image is taken from http://butchiso.com/assets/posts/mysql-full-text-search-p3/inverted_index.png
  • 7. Search Example • For any given term in the query: • Term Frequency (TF) - The number of times a term occurs in a document, normalized by the total number of terms in the document. • Document Frequency (DF) - The number of documents that the term occurs in. • Inverse Document Frequency (IDF) - The inverse of the document frequency, so it is high for rare terms and low for frequent terms. • Simple ranking of documents for a query • For all the terms in the query, sum up the product of TF and IDF. The documents with the highest tf-idf score are ranked on top. • Example: • Document 1 = “The rose is red” • Document 2 = “Red shoe” • Query 1 = “Red” => Document 1 and Document 2 are both returned with the same score, because the term occurs once in each and both documents have the same number of terms (two, assuming “the” and “is” are stop words) after removing stop words
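The ranking rule above can be sketched in a few lines. The slide only says that IDF is the inverse of DF, so the smoothed form `ln(1 + N/df)` used here is an assumption, as are the class name `TfIdfRanker` and the representation of a document as a list of already preprocessed terms.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfRanker {

    /** Term frequency: occurrences of the term divided by the document length. */
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Smoothed inverse document frequency (assumed form): ln(1 + N / df). */
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log(1.0 + (double) corpus.size() / df);
    }

    /** Sum of TF * IDF over all query terms. */
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double total = 0.0;
        for (String term : query) {
            total += tf(term, doc) * idf(term, corpus);
        }
        return total;
    }

    public static void main(String[] args) {
        // Documents from the slide after lower-casing and stop word removal.
        List<String> doc1 = Arrays.asList("rose", "red");
        List<String> doc2 = Arrays.asList("red", "shoe");
        List<List<String>> corpus = Arrays.asList(doc1, doc2);
        List<String> query = Arrays.asList("red");
        System.out.println(score(query, doc1, corpus));
        System.out.println(score(query, doc2, corpus));
    }
}
```

For the query “red” both documents receive the same score, which matches the tie described in the example.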
  • 8. Evaluating Search Results • Search results can be evaluated by two metrics that encourage two kinds of algorithm behavior: • High Precision - Very few false positives. Critical for systems that cannot make a wrong recommendation. • High Recall - Very few misses. Critical for systems where missed opportunities must be minimized and a false positive has a low cost. • FMeasure - The harmonic mean of precision and recall. It tries to balance the explorative nature of search with the precision of the results.
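  • 9. [Table: retrieved vs. relevant documents. As implied by the formulas, a = relevant and retrieved, b = relevant but not retrieved, c = retrieved but not relevant] • precision = a/(a + c) • recall = a/(a + b) • fMeasure = 2 * precision * recall/(precision + recall) • Example: • retrieved documents = 5, relevant documents = 10 • relevant documents within the 5 retrieved results = 4 • precision = 4/5 = 0.8, recall = 4/10 = 0.4, fMeasure = 2 * 0.8 * 0.4/(0.8 + 0.4) = 0.53 • Table from https://www.coursera.org/course/textretrieval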
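The three formulas translate directly into code. In this small sketch the class name `SearchMetrics` and the parameter names are illustrative, and returning 0 for an empty denominator is a convention chosen here rather than something the slides specify.

```java
final class SearchMetrics {

    /** precision = a / (a + c): relevant retrieved documents over all retrieved documents. */
    static double precision(int relevantRetrieved, int totalRetrieved) {
        return totalRetrieved == 0 ? 0.0 : (double) relevantRetrieved / totalRetrieved;
    }

    /** recall = a / (a + b): relevant retrieved documents over all relevant documents. */
    static double recall(int relevantRetrieved, int totalRelevant) {
        return totalRelevant == 0 ? 0.0 : (double) relevantRetrieved / totalRelevant;
    }

    /** Harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Values from the slide: 5 retrieved, 10 relevant, 4 relevant among the retrieved.
        double p = precision(4, 5);
        double r = recall(4, 10);
        System.out.printf("precision=%.2f recall=%.2f fMeasure=%.2f%n", p, r, fMeasure(p, r));
    }
}
```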
  • 10. Section Summary • In this section, we applied NLP techniques across an entire corpus. This is where frameworks like MapReduce play an important role. • The NLP techniques by themselves were shallow but were able to implicitly handle compound words and stop words. • We introduced a simple formula for ranking and retrieving search results. Real-world systems involve more complex probabilistic models like BM25 that follow the same principles. • We reviewed some techniques for evaluating search algorithms. These simple approaches can also be used for other NLP and machine learning problems. • http://en.wikipedia.org/wiki/Okapi_BM25
  • 11. Big Data is for Losers. I’m into Small Data now.
  • 12. Extracting Concepts From Text • We apply various NLP techniques to analyze the contents of a document. Some examples are: • Mentions of people, places, locations etc. • Central themes or concepts in the document • This is different from search • Search follows a pull model where the users take the initiative in querying the system for relevant documents. • In concept extraction, we can infer abstract concepts from text and push them to interested users. We may also be able to infer the concepts a user is interested in based on the content they consume.
  • 13. Concept Extraction - Motivation
  • 14. Sentence Segmentation • Periods are ambiguous - abbreviations, decimals etc. • !, ? - Less ambiguous • Classifier - rules (using case, punctuation rules etc.), ML etc. • StanfordNLP sentence detector and tokenizer • Trained on the Penn Treebank dataset and hence suited to more formal English. • OpenNLP has a sentence detector and tokenizer as well. • Both libraries perform well for English and there is not much to choose between them. They can also be retrained. • http://nlp.stanford.edu/software/tokenizer.shtml https://opennlp.apache.org https://github.com/dpdearing/nlp
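As a point of comparison for the rule-based approach, the JDK ships a rule-based sentence boundary detector in `java.text.BreakIterator`. The sketch below uses it in place of the StanfordNLP and OpenNLP detectors named on the slide, so it is a stand-in and not how either library works; the class name `SentenceSplitter` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {

    /** Splits text into sentences using the JDK's rule-based boundary detection. */
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Is it raining? Yes! Let us stay inside.")) {
            System.out.println(s);
        }
    }
}
```

Rules of this kind can still be confused by the ambiguous periods mentioned above, such as those in abbreviations, which is one reason to prefer the trained detectors for real text.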
  • 15. Part of Speech Tagging using StanfordNLP • StanfordNLP is quite accurate (~90%) and uses a maximum entropy tagger. • Common Penn Treebank tags (TAG: POS): • DT: Determiner • JJ, JJR, JJS: Adjective • NN, NNS: Noun • NNP, NNPS: Proper Noun • PRP: Pronoun • VB: Verb • IN: Preposition • CC: Conjunction • https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  • 16. Named Entity Recognition • Named Entity Recognition (NER) is the NLP task of recognizing proper nouns in a document. • Named Entity Recognition consists of three steps: • Spotting: A statistical model pre-trained on well-known corpus data helps us “spot” entities in the text. • Disambiguation: Once spots are found, we may need to disambiguate them (e.g. there are multiple entities with the same name and the correct URL needs to be retrieved). • Filtering: Remove named entities whose types we are not interested in, or entities that have very few links pointing to them. • At the end of NER, we get back a set of URLs of the resources that were referenced in the text.
  • 17. Spotting is the process of identifying and assigning classes to named entities. Output of StanfordNLP and OpenNLP on the same sentences:
      StanfordNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, which is located in <LOCATION>California</LOCATION>.
      OpenNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION> which is located in <LOCATION>California</LOCATION>

      StanfordNLP: Schooled in the <LOCATION>Philippines</LOCATION>
      OpenNLP: Schooled in the <LOCATION>Philippines</LOCATION>

      StanfordNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?
      OpenNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?

      StanfordNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?
      OpenNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?

      StanfordNLP: Is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.
      OpenNLP: is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.

      StanfordNLP: I work at <ORGANIZATION>Chevy</ORGANIZATION>.
      OpenNLP: I work at Chevy.

      StanfordNLP: I work at <ORGANIZATION>chevy</ORGANIZATION>.
      OpenNLP: I work at chevy.

      StanfordNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car
      OpenNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car

      StanfordNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
      OpenNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
  • 18. Dbpedia Spotlight • Dbpedia Spotlight is an API that can be used to perform all 3 steps of NER • Spots - It identifies spots using a statistical model. • Spots are disambiguated based on other references in the document. • URIs are retrieved for each of the identified named entities. These are usually dbpedia URLs with references to Freebase and other ontologies. • Provides APIs to perform the steps of NER separately as well • Spotting - Identifies only the spots • Disambiguate - Performs disambiguation based on the options provided • Annotate - Performs all 3 steps of NER and provides the results • Candidates - Provides a ranked list of candidates for each spot • https://github.com/dbpedia-spotlight/dbpedia-spotlight
  • 19. Dbpedia Spotlight Results (EXPECTED and ACTUAL are resources under http://dbpedia.org/resource/)
      ID 1 • Song: Here We Stand (Talking Heads) • Expected: Pizza_Hut, 7-Eleven, Dairy_Queen • Actual: 7-Eleven • Precision 1.0 • Recall 0.33 • FMeasure 0.5
      ID 2 • Song: Kodachrome (Paul Simon) • Expected: Kodachrome, Nikon • Actual: Nikon • Precision 1.0 • Recall 0.5 • FMeasure 0.66
      ID 3 • Song: Brand New Cadillac (The Clash) • Expected: Cadillac • Actual: Cadillac • Precision 1.0 • Recall 1.0 • FMeasure 1.0
      ID 4 • Song: A Certain Romance (Arctic Monkeys) • Expected: Reebok, Converse_(shoe_company) • Actual: Reebok, Converse_(shoe_company) • Precision 1.0 • Recall 1.0 • FMeasure 1.0
      ID 5 • Song: My Humps (Black Eyed Peas) • Expected: Prada, Gucci, Fendi, Dolce_&_Gabbana, True_Religion • Actual: Prada, Gucci • Precision 1.0 • Recall 0.33 • FMeasure 0.5
      Mean • Precision 1.0 • Recall 0.63 • FMeasure 0.73
  • 20. Querying the Semantic Web • SPARQL is a query language for interacting with the semantic web. • SPARQL is the equivalent of SQL for RDF stores. • Ontologies provide knowledge about different entities, usually in the form of subject-predicate-object triples. • The English version of dbpedia contains 4.58 million things with 584 million facts. • Example query: SELECT ?industry WHERE { <http://dbpedia.org/resource/Fendi> dbprop:industry ?industry } • Endpoint: http://dbpedia.org/sparql • Dataset statistics: http://wiki.dbpedia.org/Datasets#h434-9
  • 21. Named Entity Recognition Demo • http://dbpedia-spotlight.github.io/demo/
  • 22. Extracting Concepts using Word Senses • Image from http://www.picgifs.com/clip-art/activities/sweating/clip-art-sweating-328953.jpg
  • 23. Word Sense Disambiguation • For many words, multiple senses exist depending on the context, e.g. there are multiple senses for the word “bank” (even within the same part of speech). • Extremely difficult for computers. A combination of context and common-sense information makes this quite easy for humans. • Word Sense Disambiguation can be useful for • Machine translation between languages (the surface form loses value during translation because only the sense of the word matters) • Information Retrieval - Correct interpretation of the query. However, this can be overcome by providing enough terms to retrieve only relevant documents. • Automatic annotation of text • Measuring semantic relatedness between documents • http://babelnet.org/ https://code.google.com/p/dkpro-wsd/wiki/LSRs
  • 24. Solving the Word Sense Disambiguation Problem • Need an inventory of knowledge that can be used to disambiguate words. Usually a graph structure. Some examples are: • WordNet • Wikipedia • Yago • Freebase • ConceptNet • Algorithms to traverse the inventory and retrieve the most likely disambiguation of a word. These are usually graph algorithms that work on a measure of centrality such as degree centrality. • Assumptions: • The document has enough context to disambiguate the word correctly. If not, we default to the most frequent sense of the word. • Single sense per discourse
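A degree-based disambiguation step with the most-frequent-sense fallback can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of any particular library: the class `DegreeWsdSketch`, the toy graph and its edges are all hypothetical, and the candidate senses are assumed to arrive ordered with the most frequent sense first.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: picks the candidate sense with the most edges into the document context.
final class DegreeWsdSketch {
    // candidates: senses of the target word, most frequent first (assumption).
    // contextSenses: senses already found in the same document.
    // graph: adjacency sets of a hypothetical sense inventory.
    static String disambiguate(List<String> candidates,
                               Set<String> contextSenses,
                               Map<String, Set<String>> graph) {
        String best = candidates.get(0); // fallback: most frequent sense
        int bestDegree = 0;
        for (String candidate : candidates) {
            int degree = 0;
            for (String neighbour : graph.getOrDefault(candidate, Set.of())) {
                if (contextSenses.contains(neighbour)) {
                    degree++;
                }
            }
            if (degree > bestDegree) {
                best = candidate;
                bestDegree = degree;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical edges, chosen only to illustrate the idea.
        Map<String, Set<String>> graph = Map.of(
                "bank#n#1", Set.of("river#n#1"),
                "bank#n#2", Set.of("credit_union#n#1"));
        List<String> candidates = List.of("bank#n#1", "bank#n#2");
        System.out.println(disambiguate(candidates, Set.of("credit_union#n#1"), graph));
        System.out.println(disambiguate(candidates, Set.of(), graph));
    }
}
```

With a credit union sense in the context the sketch selects bank#n#2; with no usable context it falls back to the first candidate, bank#n#1.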
  • 25. WordNet • WordNet is a hierarchically organized lexical database widely used in NLP applications. Started at Princeton in 1985. • Contains nouns, verbs, adjectives and adverbs • Words are separated into senses and are represented as synsets. • The noun “bank” can have multiple senses based on the context (for e.g. bank of a river, financial institution etc.) • Synsets are connected by well defined semantic relationships • The majority of WordNet relations connect words from the same part of speech. • Can be accessed in Java using the extJWNL library • http://extjwnl.sourceforge.net/
    PART OF SPEECH | UNIQUE STRINGS
    Noun | 117,798
    Verb | 11,529
    Adjective | 21,479
    Adverb | 4,481
  • 26. WordNet Synsets • Synset format => baseform#pos#index
    bank#n#1 -> river bank
    bank#n#2 -> financial institution
    bank#v#3 -> bank with a financial institution
    http://wordnetweb.princeton.edu/perl/webwn
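The baseform#pos#index notation is easy to split into its three parts. A minimal sketch, assuming the base form itself never contains a `#`; the class `SenseKey` is hypothetical and not part of extJWNL.

```java
// Sketch only: parses keys such as "bank#n#2" into base form, part of speech and index.
final class SenseKey {
    final String baseForm;
    final char pos;   // e.g. 'n' for noun, 'v' for verb
    final int index;  // sense number within that part of speech

    private SenseKey(String baseForm, char pos, int index) {
        this.baseForm = baseForm;
        this.pos = pos;
        this.index = index;
    }

    static SenseKey parse(String key) {
        String[] parts = key.split("#");
        if (parts.length != 3 || parts[1].length() != 1) {
            throw new IllegalArgumentException("Expected baseform#pos#index but got: " + key);
        }
        return new SenseKey(parts[0], parts[1].charAt(0), Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        SenseKey key = SenseKey.parse("bank#n#2");
        System.out.println(key.baseForm + " / " + key.pos + " / " + key.index);
    }
}
```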
  • 27. WordNet Relationships • Hypernym - Defines a superordinate relationship. • Motor vehicle is a hypernym of car • Hyponym - Defines a subordinate relationship. • Mango is a hyponym of fruit • The root node of nouns is “entity” • Other relationships: InstanceOf, Synonyms/Antonyms, Meronym (PartOf) etc.
  • 29. Accessing WordNet using extJWNL • Download the WordNet 3.0 dataset • Use the properties file to point to the location of WordNet on the file system or in a database • Lemmatization - Needed to get the base form of a word (different from stemming) using the WordNet dictionary. • cat and cats have the same lemma • http://extjwnl.sourceforge.net/
    import java.io.FileInputStream
    import net.sf.extjwnl.data.POS
    import net.sf.extjwnl.dictionary.Dictionary

    val dictionary = Dictionary.getInstance(new FileInputStream("data/file_properties.xml"))

    // lookupBaseForm returns an IndexWord (null when nothing matches), so return its lemma
    def getBaseForm(pos: POS, word: String): String =
      Option(dictionary.getMorphologicalProcessor.lookupBaseForm(pos, word.toLowerCase))
        .map(_.getLemma).getOrElse(word)
  • 30. WSD using WordNet • Example 1 - “I am going to the bank” • “bank” by itself usually just defaults to bank#n#1 • Example 2 - “What is the difference between a bank and a credit union?” • Credit union only has one sense - credit_union#n#1 • Because credit union is present, “bank” is disambiguated to bank#n#2 • https://code.google.com/p/dkpro-wsd/wiki/LSRs
  • 31. Concept Graph • WordNet does not capture any common sense information. For e.g. bank (financial institution) and money do not have a close relationship in WordNet. • It is possible to use other resources like ConceptNet that map common sense knowledge to WordNet (and ontologies like dbpedia). For e.g. we can download mappings for concepts like Money, Love, Sports, Family etc. • Another option is to deploy a custom concept graph: • Deploy WordNet onto a graph database. That forms the base graph. • Deploy custom concept mappings to the WordNet synsets. • Add mappings for relevant Wikipedia (dbpedia) categories • http://conceptnet5.media.mit.edu/data/5.3/c/en/family?limit=1000
  • 33. Concept Analysis of over 500K songs
  • 34. Concept Polarity • SentiWordNet is a lexical resource for opinion mining and sentiment analysis • SentiWordNet provides sentiment values for the different WordNet synsets. For each synset in WordNet, SentiWordNet assigns scores on 3 dimensions - positivity, negativity and objectivity. • Once the central concepts are found, we can extract the polarity of the concepts. • Example: • “They are really happy to be here” => happy#a#1 has a very positive polarity. • http://sentiwordnet.isti.cnr.it/
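In SentiWordNet the three scores of a synset sum to 1, so objectivity can be derived from the positive and negative scores. The sketch below models that relationship. The class `SynsetPolarity` is hypothetical, the net polarity (positive minus negative) is one illustrative convention rather than something SentiWordNet defines, and the numbers in `main` are made-up example values.

```java
// Sketch only: holds the positive and negative scores of one synset.
final class SynsetPolarity {
    final double positive;
    final double negative;

    SynsetPolarity(double positive, double negative) {
        if (positive < 0 || negative < 0 || positive + negative > 1.0) {
            throw new IllegalArgumentException("Scores must be non-negative and sum to at most 1");
        }
        this.positive = positive;
        this.negative = negative;
    }

    // Objectivity is whatever remains after positivity and negativity.
    double objective() {
        return 1.0 - (positive + negative);
    }

    // Assumption: a single signed polarity value, positive minus negative.
    double polarity() {
        return positive - negative;
    }

    public static void main(String[] args) {
        // Hypothetical scores for a strongly positive adjective synset.
        SynsetPolarity happy = new SynsetPolarity(0.875, 0.0);
        System.out.println(happy.objective() + " " + happy.polarity());
    }
}
```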
  • 35. Section Summary • Went beyond surface forms and analyzed the concepts contained in documents. • The approach was still mostly bag of words, meaning that the structure of the individual sentences did not matter. • These approaches, in tandem with common sense knowledge sources, help in extracting concepts from documents. • They also allow documents to be compared based on semantic similarity measures.
  • 37. Document Summarization • Objective - Reduce the document in order to create a summary that retains the most important points of the original document. • Two Approaches: • Extractive: Extract the sentences that are most representative of the content of the document. • Generative: Generate a summary of the text using words that may not be part of the original text. This is a difficult task and is often not attempted. • Evaluating summarization techniques: • Somewhat subjective because humans sometimes cannot agree on the best summary • Extractive Approaches • Based on term frequency • Based on sentence similarity
  • 38. Example text: http://en.wikipedia.org/wiki/Apache_Cassandra
    ID | SENTENCE | EXPECTED SCORE
    1 | Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. | High
    2 | Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. | High
    3 | Cassandra also places a high value on performance. | Low
    4 | In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. | Low
    5 | Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." | High
    6 | Cassandra's data model is a partitioned row store with tunable consistency. | Medium
    7 | Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. | Medium
    8 | Other columns may be indexed separately from the primary key. | Low
    9 | Tables may be created, dropped, and altered at runtime without blocking updates and queries. | Low
    10 | Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. | Medium
    11 | Rather, Cassandra emphasizes denormalization through features like collections. | Medium
  • 39. TextRank • A graph approach where each vertex is a sentence and each edge has a weight corresponding to the similarity between the two sentences. Every vertex is connected to every other vertex. • For every sentence: • Calculate its similarity to every other sentence. The similarity measure can be simple, for e.g. a normalized count of the terms the 2 sentences have in common • Sum the similarities of the sentence to every other sentence (that is, sum its row of the similarity matrix). That is the score of the sentence. • Sort the vertices based on the sum of the weights of their edges and return the top k sentences. • http://lit.csci.unt.edu/index.php/Graph-based_NLP
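The scoring steps above can be sketched in a few lines of Java. This follows the simplified row-sum scoring described on the slide, not the iterative PageRank-style update of the published TextRank algorithm. The class `SentenceRankSketch` is hypothetical, and normalizing the common-term count by the size of the combined term set (Jaccard similarity) is an assumption, since the slide does not say which normalization was used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: scores each sentence by its summed similarity to all other sentences.
final class SentenceRankSketch {
    // Lower-cases the sentence and splits it into a set of alphanumeric terms.
    static Set<String> terms(String sentence) {
        Set<String> out = new HashSet<>();
        for (String term : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    // Assumption: common terms divided by all distinct terms (Jaccard similarity).
    static double similarity(String a, String b) {
        Set<String> termsA = terms(a);
        Set<String> union = new HashSet<>(termsA);
        union.addAll(terms(b));
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(termsA);
        common.retainAll(terms(b));
        return (double) common.size() / union.size();
    }

    // Row sums of the similarity matrix, skipping the diagonal.
    static double[] scores(List<String> sentences) {
        int n = sentences.size();
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    score[i] += similarity(sentences.get(i), sentences.get(j));
                }
            }
        }
        return score;
    }

    // Returns the k highest scoring sentences; ties keep document order.
    static List<String> topK(List<String> sentences, int k) {
        double[] score = scores(sentences);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> -score[i]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            top.add(sentences.get(order.get(i)));
        }
        return top;
    }
}
```

Because every pair of sentences is compared, the cost grows quadratically with the number of sentences, which is acceptable for single documents. Stop word removal or stemming before calling `terms` would be a natural refinement of the similarity measure.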
  • 40. TextRank results
    TOP SENTENCES | SCORE
    Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. | 1.6
    Cassandra also places a high value on performance. | 1.125
    Other columns may be indexed separately from the primary key. | 0.999
    • Can the similarity metric be improved?
  • 41. Dependency Analysis in Sentences • StanfordNLP can be used to analyze the grammatical structure of sentences and provide a dependency graph between the different elements of the sentence. • LexicalizedParser can provide a graph where the vertices are the words and the edges are the grammatical relationships in a sentence. • http://nlp.stanford.edu/software/lex-parser.shtml
  • 42. Dependency tags
    TAG | MEANING
    advmod | Adverbial Modifier
    neg | Negation Modifier
    nsubj | Nominal Subject
    nsubjpass | Passive Nominal Subject
    dobj | Direct Object (she,gave)
    iobj | Indirect Object (gave,me)
    amod | Adjective Modifier
    prep | Preposition
  • 44. Dependency Analysis • Works well for short sentences. It loses accuracy when the scope is increased to a document. • May aid in text simplification by using the relationships between the entities. • By analyzing the subject and the object, we can clearly establish a point of view (for e.g. direct address vs. first person vs. second person etc.). • Could potentially help in story extrapolation but does not generalize well, so this remains a topic of research.
  • 45. Sentiment Analysis • StanfordNLP has a deep learning model for sentiment analysis. • Takes a deep parsing approach to sentiment analysis - the structure of the sentence is constructed prior to the analysis. • Was trained on movie review data and obtained an accuracy about 5% higher than the closest model. • Uses an annotated dataset called the Stanford Sentiment Treebank. Users are encouraged to add labels to improve the model further.
  • 46. Sentiment Analysis Examples • Taxonomy • Very Negative • Negative • Neutral • Positive • Very Positive
  • 47. Sentiment Analysis Demo • http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
  • 48. StanfordNLP Sentiment Analysis • Provides relatively good results for short sentences. • Sentences that are similar to the training data (movie reviews) perform much better than other sentences. • No good way to aggregate sentiments across a document. Future work would probably involve document level dependency parsing and sentiment analysis. • Only provides the overall sentiment. Does not indicate the object of the sentiment.
  • 49. Final Thoughts • Shallow NLP is employed in text retrieval and search and provides good results for general search use cases. • Deeper NLP involves semantic parsing and common sense interpolation (using both local and global knowledge bases) and tends to be harder. • Deeper NLP is more practical after picking a specific domain, for e.g. medical records, legal documents etc. • 2 cents on Intelligence - Memory based systems • http://watson-um-demo.mybluemix.net/ • http://en.wikipedia.org/wiki/On_Intelligence
  • 50. Resources • StanfordNLP Github: https://github.com/stanfordnlp/CoreNLP • Own repository: https://github.com/sangv/swsd • Dbpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight • Opennlp repo: https://github.com/apache/opennlp • ConceptNet: conceptnet5.media.mit.edu • On Intelligence book: http://en.wikipedia.org/wiki/On_Intelligence