Knowledgent Big Data-palooza:
Aspects of Semantic Processing
Na’im R. Tyson, PhD
February 6, 2014
Discussion Topics

• Semantic Processing
– What is Semantics?
– What is Pragmatics?
• Lexical Semantics
– Computing Semantic Similarity
∗ WordNet
∗ Vector Space Modeling
• Ontology Basics
• Text Mining: Basics
Semantic Processing

• What is Semantics?
– Study of literal meanings of words and sentences
∗ Lexical Semantics - word meanings & word relations
– Sometimes stated formally using some logical form
∗ Example: ∀x ∃y loves(x, y)
• What is Pragmatics?
– Study of language use and its situational contexts (discourse, deixis,
presupposition, etc.)
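
To make the logical form ∀x ∃y loves(x, y) concrete, the following is a minimal sketch that checks it over a small finite domain. The names `people` and `loves` and the facts in them are hypothetical and chosen only for illustration; swapping the quantifiers gives a different, stronger claim.

```python
# Toy model: a finite domain and a 'loves' relation given as a set of pairs.
# The individuals and facts below are made up for illustration.
people = {"ann", "bob", "cam"}
loves = {("ann", "bob"), ("bob", "bob"), ("cam", "ann")}

def everyone_loves_someone(domain, relation):
    """True if for every x in domain there is some y with (x, y) in relation."""
    return all(any((x, y) in relation for y in domain) for x in domain)

def someone_loved_by_everyone(domain, relation):
    """True if some single y is loved by every x (quantifiers swapped)."""
    return any(all((x, y) in relation for x in domain) for y in domain)

print(everyone_loves_someone(people, loves))     # True
print(someone_loved_by_everyone(people, loves))  # False
```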

Lexical Semantics
WordNet: Description
• Word relation database
• Created by George Miller & Christiane Fellbaum (Miller, 1995; Fellbaum, 1998)
@ Princeton University
• Types of Relationships
Synonymy - word pair similarity
Antonymy - word pair dissimilarity
Meronymy - part-of relation
– Example: ’engine’ and ’car’
Hyponymy - subordinate relation between words (i.e., a type-of relation)
– Example: ’red’ is a hyponym of ’color’ (’red’ is a type of color)
Hypernymy - superordinate relation between words
– Example: ’color’ is a hypernym of ’red’
Question: What’s the relationship between a hyponym and a hypernym?
• 150K words w/ 115k synsets and approx. 200k word-sense pairs
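
The question above has a short answer: hyponymy and hypernymy are inverse relations. The sketch below shows this with a toy table rather than WordNet itself; the dictionary `hypernym_of` and its entries are assumptions made for illustration.

```python
# Toy hypernym table: each word maps to its direct hypernym ("type-of" parent).
# The entries are illustrative and are not read from WordNet.
hypernym_of = {"red": "color", "blue": "color", "crimson": "red"}

def hypernyms(word):
    """Return the direct hypernyms of word (empty list if none are recorded)."""
    return [hypernym_of[word]] if word in hypernym_of else []

def hyponyms(word):
    """Return the direct hyponyms of word by inverting the hypernym table."""
    return sorted(w for w, parent in hypernym_of.items() if parent == word)

print(hypernyms("red"))   # ['color']
print(hyponyms("color"))  # ['blue', 'red']
```

If A is a hyponym of B, then B is a hypernym of A, so one table is enough to answer both kinds of lookup.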

Lexical Semantics

• Adapted from Python Text Processing with NLTK 2.0 Cookbook (Perkins,
2010)
>>> from nltk.corpus import wordnet as wn
>>> word_synset = wn.synsets('cookbook')[0]
>>> word_synset.name()  # name() and definition() are methods in NLTK 3
'cookbook.n.01'
>>> word_synset.definition()
'a book of recipes and cooking directions'

Lexical Semantics

• Antonymy:
>>> ga1 = wn.synset('good.a.01')
>>> ga1.definition()
'having desirable or positive qualities especially those suitable for a thing specified'
>>> bad = ga1.lemmas()[0].antonyms()[0]  # antonyms are defined on lemmas, not synsets
>>> bad.name()
'bad'
>>> bad.synset().definition()
'having undesirable or negative qualities'

Lexical Semantics

• Hyponymy & Hypernymy:
>>> word_synset.hyponyms()
>>> word_synset.hypernyms()

Computing Similarity by WordNet

• Similarity by Path Length (see Perkins, 2010, p. 19)
>>> from nltk.corpus import wordnet as wn
>>> cb = wn.synset('cookbook.n.01')
>>> ib = wn.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)  # Wu-Palmer similarity
0.9166666666666666
• For path similarity explanations, see Jaganadhg (2010)
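
As a rough illustration of what a Wu-Palmer score measures, the sketch below applies the formula 2 * depth(LCS) / (depth(a) + depth(b)) to a toy taxonomy, where LCS is the deepest ancestor shared by both nodes. The `parent` table is hypothetical and much shallower than WordNet, and depth is counted in nodes from the root (root = 1), which is an assumption of this sketch; NLTK's implementation also handles multiple hypernym paths, which this does not.

```python
# Toy taxonomy: child -> parent. Not taken from WordNet.
parent = {
    "object": "entity",
    "book": "object",
    "car": "object",
    "cookbook": "book",
    "manual": "book",
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is deepest
        if node in ancestors_of_a:
            return node
    return None

def wup_similarity(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("cookbook", "manual"))  # 0.75
print(wup_similarity("cookbook", "car"))     # lower: the shared ancestor is shallower
```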

Advantages & Disadvantages

• Advantages
Quality: developed and maintained by researchers
Practice: applications can use WordNet
Software: SenseRelate (Perl) - http://senserelate.sourceforge.net
• Disadvantages
Coverage: technical terms may be missing
Irregularity: path lengths can be irregular across hierarchies
Relatedness: related terms may not be in the same hierarchies
Example: Tennis Problem
– ’player’, ’racquet’, ’ball’ and ’net’

Computing Word Similarity by Vector Space Modeling

• Computing Similarity from a Document Corpus
Goal: determine distributional properties of a word
Steps: In general...
– Create vector of size n for each word of interest
– Think of them as points in some n-dimensional space
– Use a similarity metric to compute distance
Algorithm: Brown et al. (1992)
– C(x) - vector with properties of x (context of ’x’)
– C(w) = ⟨#(w1), #(w2), ..., #(wk)⟩, where #(wi) is the number of times
wi followed w in a corpus
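
The context vector C(w) can be sketched in a few lines of Python. The tiny `corpus` below is invented for illustration, and whitespace tokenization is an assumption; a real corpus would need proper tokenization and a much larger vocabulary.

```python
from collections import Counter, defaultdict

# Hypothetical one-sentence corpus, tokenized on whitespace.
corpus = "the cosmonaut went spacewalking and the astronaut went spacewalking too".split()

def following_counts(tokens):
    """Map each word w to a Counter of the words that immediately follow w."""
    counts = defaultdict(Counter)
    for w, nxt in zip(tokens, tokens[1:]):
        counts[w][nxt] += 1
    return counts

def context_vector(word, counts, vocabulary):
    """Return C(word) as a list of counts, one entry per vocabulary word."""
    return [counts[word][v] for v in vocabulary]

counts = following_counts(corpus)
vocabulary = sorted(set(corpus))

print(vocabulary)
print(context_vector("the", counts, vocabulary))   # [0, 1, 1, 0, 0, 0, 0]
print(context_vector("went", counts, vocabulary))  # [0, 0, 0, 2, 0, 0, 0]
```

Each word becomes a point in k-dimensional space, where k is the vocabulary size, and any of the similarity measures that follow can be applied to pairs of such vectors.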

Similarity Measure: Cosine
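Cosine: for vectors $\vec{x} = (x_1, \ldots, x_n)$ and $\vec{y} = (y_1, \ldots, y_n)$,

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0        0     1     1
American          0          1        0     1     1
spacewalking      1          1        0     0     0
red               0          0        0     1     1
full              0          0        1     0     0
old               0          0        0     1     1

$$\cos(\text{cosm}, \text{astr}) = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2 + 0^2}\,\sqrt{0^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2}}$$

Figure 1: Cosine Similarity Comparison from Collins (2007)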
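
The computation in Figure 1 can be reproduced with the standard library alone. This is a sketch: the `vectors` dictionary transcribes the columns of the table, with components ordered as (Soviet, American, spacewalking, red, full, old), and the function name `cosine_similarity` is our own.

```python
import math

# One vector per word, read column-wise from the table in Figure 1.
vectors = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 1, 0, 0, 0],
    "moon":      [0, 0, 0, 0, 1, 0],
    "car":       [1, 1, 0, 1, 0, 1],
    "truck":     [1, 1, 0, 1, 0, 1],
}

def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_sq_x = sum(xi * xi for xi in x)
    norm_sq_y = sum(yi * yi for yi in y)
    return dot / math.sqrt(norm_sq_x * norm_sq_y)

print(cosine_similarity(vectors["cosmonaut"], vectors["astronaut"]))  # 0.5
print(cosine_similarity(vectors["car"], vectors["truck"]))            # 1.0
print(cosine_similarity(vectors["cosmonaut"], vectors["moon"]))       # 0.0
```

A value of 1.0 means the two words occur in identical contexts in this table, and 0.0 means they share none.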

Similarity Measure: Euclidean
Euclidean:

$$|\vec{x}, \vec{y}| = |\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0        0     1     1
American          0          1        0     1     1
spacewalking      1          1        0     0     0
red               0          0        0     1     1
full              0          0        1     0     0
old               0          0        0     1     1

$$\text{euclidean}(\text{cosm}, \text{astr}) = \sqrt{(1 - 0)^2 + (0 - 1)^2 + (1 - 1)^2 + (0 - 0)^2 + (0 - 0)^2 + (0 - 0)^2}$$

Figure 2: Euclidean Similarity Comparison from Collins (2007)
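
The Euclidean computation in Figure 2 can be sketched the same way. Unlike cosine, this is a distance, so smaller values mean more similar words. The vectors are transcribed from the table with components ordered as (Soviet, American, spacewalking, red, full, old); the helper `nearest` is a hypothetical convenience, not part of the original slides.

```python
import math

# One vector per word, read column-wise from the table in Figure 2.
word_vectors = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 1, 0, 0, 0],
    "moon":      [0, 0, 0, 0, 1, 0],
    "car":       [1, 1, 0, 1, 0, 1],
}

def euclidean(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest(word, table):
    """Return the other word in table whose vector is closest to word's vector."""
    others = (w for w in table if w != word)
    return min(others, key=lambda w: euclidean(table[word], table[w]))

print(euclidean(word_vectors["cosmonaut"], word_vectors["astronaut"]))  # 1.4142135623730951
print(nearest("cosmonaut", word_vectors))                               # astronaut
```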

Cosine & Euclidean Similarity in Python

>>> import numpy as np
>>> from scipy.spatial import distance as dist
>>> cosm = np.array([1, 0, 1, 0, 0, 0])
>>> astr = np.array([0, 1, 1, 0, 0, 0])
>>> float(dist.cosine(cosm, astr))  # cosine distance = 1 - cosine similarity
0.5
>>> float(dist.euclidean(cosm, astr))  # sqrt(2)
1.4142135623730951

Computing Word Similarity by Vector Space Modeling

• Advantages & Disadvantages
– Advantage: requires no database lookups
– Disadvantage: semantic similarity does not imply a specific relation (synonymy,
antonymy, meronymy, hyponymy, hypernymy, etc.)

Ontology Basics

• Semantic Web Technologies
– Data Models
– Ontology Language
– Distributed Query Language
– Applications
∗ Large knowledge bases
∗ Business Intelligence

Ontology Basics

Figure 3: Cambridge Semantics’ simplified view of Semantic Web solutions.

Ontology Basics
• W3C Semantic Web
– RDF - Resource Description Framework
∗ Data model w/ identifiers and named relations b/t resource pairs
∗ Represented as directed graphs b/t resources and literal values
· Done w/ collections of triples
· triple: subject, predicate and object
1. Na’im Tyson born in 197x
2. Na’im Tyson works for Knowledgent
3. Knowledgent headquartered in Warren
– SPARQL - SPARQL Protocol And RDF Query Language
∗ Query language of Semantic Web
∗ Queries RDF stores over HTTP
∗ Very similar to SQL
– Capturing Relationships
RDF Schema: Vocabulary (term definitions), Schema (class definitions) and
Taxonomies (defining hierarchies)
OWL: Expressive relation definitions (symmetry, transitivity, etc.)
RIF: Rule Interchange Format - representation for exchanging sets of logical
and business rules
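
To show how triples and pattern queries fit together, here is a minimal sketch in plain Python. It is not an RDF library and does not speak SPARQL: the predicate names (`bornIn`, `worksFor`, `headquarteredIn`) and the helper functions are hypothetical, and real RDF would identify resources with IRIs rather than bare strings.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple,
# taken from the three example statements in the RDF bullet.
triples = [
    ("Na'im Tyson", "bornIn", "197x"),
    ("Na'im Tyson", "worksFor", "Knowledgent"),
    ("Knowledgent", "headquarteredIn", "Warren"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def employer_locations(store, person):
    """Join two patterns: where is the organization that employs person based?"""
    return [
        city
        for (_, _, org) in match(store, subject=person, predicate="worksFor")
        for (_, _, city) in match(store, subject=org, predicate="headquarteredIn")
    ]

print(employer_locations(triples, "Na'im Tyson"))  # ['Warren']
```

In SPARQL the same join would be expressed as two triple patterns that share a variable, which is where the resemblance to SQL joins comes from.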

Text Mining Basics

• What do people think text mining is?
– Automated discovery of new, previously unknown information by
automatically extracting information from a usually large amount of different
unstructured textual resources (Wasilewska, 2014)

Text Mining Basics
• What is text mining really?

[Figure: Venn diagram relating Text Mining to Data Mining, Information Retrieval, Statistics, Web Mining, and Computational Linguistics & Natural Language Processing]

Figure 4: Venn Diagram of Text Mining (Wasilewska, 2014).

Text Mining Basics
• A General Approach (ignore the cloud!)

[Figure: Text Mining Process, from input to output: Text → Text Preprocessing → Text Transformation (Attribute Generation) → Attribute Selection → Data Mining / Pattern Discovery → Interpretation / Evaluation. Side labels in the figure: Document Clustering, Text Characteristics]

Figure 5: General Approaches to Text Mining Process (Wasilewska, 2014).

Text Mining Basics

• Application - Document Clustering
Goal: Group large amounts of textual data
Techniques: High Level
– k-means - top down
∗ cluster documents into k groups using vectors and distance metric
– agglomerative hierarchical clustering - bottom up
∗ Start with each document being a single cluster
∗ Eventually all documents belong to the same cluster
∗ Documents represented as a hierarchy (dendrogram)
Reference: Taming Text (see Ingersoll et al., 2013, chap. 6)
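
The bottom-up procedure can be sketched in a few lines. This is an illustrative toy, not the approach from Taming Text: the document vectors in `docs` are invented, and single linkage with Euclidean distance is one assumed choice among several possible linkage rules and metrics.

```python
import math

# Hypothetical term-count vectors for four documents.
docs = {
    "d1": [2, 0, 1],
    "d2": [3, 0, 0],
    "d3": [0, 2, 3],
    "d4": [0, 3, 3],
}

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cluster_distance(a, b, vectors):
    """Single linkage: distance between the closest pair of members."""
    return min(euclidean(vectors[i], vectors[j]) for i in a for j in b)

def agglomerate(vectors):
    """Merge the two closest clusters until one remains; return the merge order."""
    clusters = [(name,) for name in sorted(vectors)]  # each document starts alone
    merges = []
    while len(clusters) > 1:
        pairs = [
            (cluster_distance(a, b, vectors), a, b)
            for i, a in enumerate(clusters)
            for b in clusters[i + 1:]
        ]
        _, a, b = min(pairs)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append(merged)
    return merges

for step in agglomerate(docs):
    print(step)
# ('d3', 'd4')
# ('d1', 'd2')
# ('d1', 'd2', 'd3', 'd4')
```

The merge order is the information a dendrogram draws: the earliest merges join the most similar documents, and the last merge joins everything into one cluster.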
• Final Remarks
THANK YOU!!

References
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and
Jenifer C. Lai. Class-based n-gram models of natural language. Computational
Linguistics, 18:467–479, 1992.
Michael Collins. Lexical Semantics: Similarity Measures and Clustering,
November 2007. URL
http://www.cs.columbia.edu/~mcollins/6864/slides/wordsim.4up.pdf.
Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris. Taming Text: How
to Find, Organize, and Manipulate It. Manning Publications Co., January 2013.
Jaganadhg. Wordnet sense similarity with nltk: some basics, October 2010. URL
http://jaganadhg.freeflux.net/blog/archive/tag/WSD/.
George A. Miller. WordNet: A lexical database for English. Communications of the
ACM, 38(11):39–41, 1995.
Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt
Publishing, 2010.
Anita Wasilewska. CSE 634 - Data Mining: Text Mining, January 2014. URL
http://www.cs.sunysb.edu/ cse634/presentations/TextMining.pdf.

27

Mais conteúdo relacionado

Mais procurados

Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector spaceAbdullah Khan Zehady
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaTraian Rebedea
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for LexicographyLeiden University
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 

Mais procurados (20)

Word2Vec
Word2VecWord2Vec
Word2Vec
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large Corpora
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for Lexicography
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
 
Tutorial on word2vec
Tutorial on word2vecTutorial on word2vec
Tutorial on word2vec
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 

Semelhante a Big Data Palooza Talk: Aspects of Semantic Processing

Domain Modeling for Personalized Learning
Domain Modeling for Personalized LearningDomain Modeling for Personalized Learning
Domain Modeling for Personalized LearningPeter Brusilovsky
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Sergey Sosnovsky
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...ssuser4b1f48
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
The Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for HindiThe Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for Hindisinghg77
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Innovation Quotient Pvt Ltd
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Framester and WFD
Framester and WFD Framester and WFD
Framester and WFD Aldo Gangemi
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
MS-Word.doc
MS-Word.docMS-Word.doc
MS-Word.docbutest
 
Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...paper_reader
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceCS, NcState
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataJian Wu
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 

Semelhante a Big Data Palooza Talk: Aspects of Semantic Processing (20)

Domain Modeling for Personalized Learning
Domain Modeling for Personalized LearningDomain Modeling for Personalized Learning
Domain Modeling for Personalized Learning
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
 
E43022023
E43022023E43022023
E43022023
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
The Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for HindiThe Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for Hindi
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Framester and WFD
Framester and WFD Framester and WFD
Framester and WFD
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
MS-Word.doc
MS-Word.docMS-Word.doc
MS-Word.doc
 
Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big Data
 
G04124041046
G04124041046G04124041046
G04124041046
 

Último

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Último (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Big Data Palooza Talk: Aspects of Semantic Processing

  • 1. Knowledgent Big Data-palooza: Aspects of Semantic Processing Na’im R. Tyson, PhD February 6, 2014
  • 2. Discussion Topics • Semantic Processing – What is Semantics? – What is Pragmatics? • Lexical Semantics – Computing Semantic Similarity ∗ WordNet ∗ Vector Space Modeling • Ontology Basics • Text Mining: Basics 1
  • 3. Semantic Processing
      • What is Semantics?
          – Study of literal meanings of words and sentences
              ∗ Lexical Semantics - word meanings & word relations
          – Sometimes stated formally using some logical form
              ∗ Example: ∀x ∃y loves(x, y)
      • What is Pragmatics?
          – Study of language use and its situational contexts (discourse, deixis, presupposition, etc.)
  • 4. Lexical Semantics
      WordNet: Description
      • Word relation database
      • Created by George Miller & Christiane Fellbaum (Miller, 1995; Fellbaum, 1998) at Princeton University
      • Types of Relationships
          Synonymy - word pair similarity
          Antonymy - word pair dissimilarity
          Meronymy - part-of relation
              – Example: ’engine’ and ’car’
          Hyponymy - subordinate relation between words (i.e., a type-of relation)
              – Example: ’red’ is a hyponym of ’color’ (’red’ is a type of color)
          Hypernymy - superordinate relation between words
  • 5. (Lexical Semantics, continued)
              – Example: ’color’ is a hypernym of ’red’
          Question: What’s the relationship between a hyponym and a hypernym?
      • 150K words w/ 115k synsets and approx. 200k word-sense pairs
  • 6. Lexical Semantics
      • Adapted from Python Text Processing with NLTK 2.0 Cookbook (Perkins, 2010)
        >>> from nltk.corpus import wordnet as wn
        >>> word_synset = wn.synsets('cookbook')[0]
        >>> # name() and definition() are methods in NLTK 3.x (plain attributes in NLTK 2.0)
        >>> word_synset.name()
        'cookbook.n.01'
        >>> word_synset.definition()
        'a book of recipes and cooking directions'
  • 7. Lexical Semantics
      • Antonymy:
        >>> from nltk.corpus import wordnet as wn
        >>> ga1 = wn.synset('good.a.01')
        >>> ga1.definition()
        'having desirable or positive qualities especially those suitable for a thing specified'
        >>> # NLTK 3.x call syntax; in NLTK 2.0 lemmas, name, synset and definition were attributes
        >>> bad = ga1.lemmas()[0].antonyms()[0]
        >>> bad.name()
        'bad'
        >>> bad.synset().definition()
        'having undesirable or negative qualities'
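  • 8. Lexical Semantics
      • Hyponymy & Hypernymy:
        >>> # word_synset is the 'cookbook.n.01' synset from slide 6
        >>> word_synset.hyponyms()   # more specific synsets
        >>> word_synset.hypernyms()  # more general synsets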
  • 9. Computing Similarity by WordNet
      • Similarity by Path Length (see Perkins, 2010, p. 19)
        >>> from nltk.corpus import wordnet as wn
        >>> cb = wn.synset('cookbook.n.01')
        >>> ib = wn.synset('instruction_book.n.01')
        >>> cb.wup_similarity(ib)  # Wu-Palmer Similarity
        0.9166666666666666
      • For path similarity explanations, see Jaganadhg (2010)
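The Wu-Palmer score can be illustrated without WordNet. The sketch below is a minimal pure-Python version over a made-up taxonomy; the node names and the `PARENT` table are assumptions for illustration, not WordNet data, and depth is counted in nodes so that the root has depth 1. It computes 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor shared by both nodes.

```python
# Toy taxonomy: child -> parent. Illustrative nodes only, not WordNet data.
PARENT = {
    "publication": None,  # root
    "book": "publication",
    "reference_book": "book",
    "novel": "book",
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes share no ancestor")


def wup_similarity(a, b):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("cookbook", "instruction_book"))  # 0.75
print(round(wup_similarity("cookbook", "novel"), 4))   # 0.5714
```

Siblings under a deep shared parent score higher than nodes whose only shared ancestor sits near the root, which is the behaviour the measure is designed to capture.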
  • 10. Advantages & Disadvantages
      • Advantages
          Quality: developed and maintained by researchers
          Practice: applications can use WordNet
          Software: SenseRelate (Perl) - http://senserelate.sourceforge.net
      • Disadvantages
          Coverage: technical terms may be missing
          Irregularity: path lengths can be irregular across hierarchies
          Relatedness: related terms may not be in the same hierarchies
          Example: Tennis Problem – ’player’, ’racquet’, ’ball’ and ’net’
  • 11. Computing Word Similarity by Vector Space Modeling
      • Computing Similarity from a Document Corpus
          Goal: determine distributional properties of a word
          Steps: In general...
              – Create vector of size n for each word of interest
              – Think of them as points in some n-dimensional space
              – Use a similarity metric to compute distance
          Algorithm: Brown et al. (1992)
              – C(x) - vector with properties of x (context of ’x’)
              – C(w) = ⟨#(w1), #(w2), ..., #(wk)⟩, where #(wi) is the number of times wi followed w in a corpus
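The context vector C(w) described above can be sketched in a few lines. This is a minimal illustration only: the toy corpus and the helper names `following_word_counts` and `as_vector` are assumptions, and it counts just the immediately following word, as in the definition of #(wi).

```python
from collections import Counter, defaultdict


def following_word_counts(tokens):
    """Map each word w to a Counter of the words that immediately follow w."""
    contexts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        contexts[current][nxt] += 1
    return contexts


def as_vector(counter, vocabulary):
    """Lay a Counter out as a fixed-order count vector over `vocabulary`."""
    return [counter[word] for word in vocabulary]


# Hypothetical toy corpus, already lower-cased and tokenized.
corpus = "the cat sat on the mat and the cat ate the fish".split()
contexts = following_word_counts(corpus)
vocabulary = sorted(set(corpus))

print(vocabulary)
# ['and', 'ate', 'cat', 'fish', 'mat', 'on', 'sat', 'the']
print(as_vector(contexts["the"], vocabulary))
# [0, 0, 2, 1, 1, 0, 0, 0]
```

Each word now corresponds to a point in k-dimensional space (k is the vocabulary size), so any vector similarity metric can be applied to pairs of words.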
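  • 13. Similarity Measure: Cosine
      Cosine:
        $$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}, \qquad \vec{x} = (x_1, \ldots, x_n)$$

                        cosmonaut  astronaut  moon  car  truck
        Soviet              1          0       0     1     1
        American            0          1       0     1     1
        spacewalking        1          1       0     0     0
        red                 0          0       0     1     1
        full                0          0       1     0     0
        old                 0          0       0     1     1

        $$\cos(\mathit{cosm}, \mathit{astr}) = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2 + 0^2}\,\sqrt{0^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2}}$$
      Figure 1: Cosine Similarity Comparison from Collins (2007)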
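As a rough check of the cosine formula, the sketch below applies it to every column of the table above using only the standard library. The `VECTORS` dictionary simply transcribes the table; the function name `cosine_similarity` is an illustrative choice.

```python
import math

# Columns of the table above, one count per context word in the order
# Soviet, American, spacewalking, red, full, old.
VECTORS = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 1, 0, 0, 0],
    "moon":      [0, 0, 0, 0, 1, 0],
    "car":       [1, 1, 0, 1, 0, 1],
    "truck":     [1, 1, 0, 1, 0, 1],
}


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


pairs = [
    ("cosmonaut", "astronaut"),
    ("cosmonaut", "moon"),
    ("cosmonaut", "car"),
    ("car", "truck"),
]
for first, second in pairs:
    score = cosine_similarity(VECTORS[first], VECTORS[second])
    print(first, second, round(score, 3))
# cosmonaut astronaut 0.5
# cosmonaut moon 0.0
# cosmonaut car 0.354
# car truck 1.0
```

Words with identical context counts (car, truck) score 1, and words with no shared contexts (cosmonaut, moon) score 0.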
  • 15. Similarity Measure: Euclidean
      Euclidean:
        $$|\vec{x}, \vec{y}| = |\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

                        cosmonaut  astronaut  moon  car  truck
        Soviet              1          0       0     1     1
        American            0          1       0     1     1
        spacewalking        1          1       0     0     0
        red                 0          0       0     1     1
        full                0          0       1     0     0
        old                 0          0       0     1     1

        $$\mathrm{euclidean}(\mathit{cosm}, \mathit{astr}) = \sqrt{(1-0)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2}$$
      Figure 2: Euclidean Similarity Comparison from Collins (2007)
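Evaluating the Euclidean expression for the cosmonaut and astronaut columns, term by term, gives:

```latex
\begin{aligned}
\mathrm{euclidean}(\mathit{cosm}, \mathit{astr})
  &= \sqrt{(1-0)^2 + (0-1)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2} \\
  &= \sqrt{1 + 1 + 0 + 0 + 0 + 0} \\
  &= \sqrt{2} \approx 1.414
\end{aligned}
```

Unlike cosine, this is a distance: smaller means more similar, and two identical columns such as car and truck are at distance 0.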
  • 16. Cosine & Euclidean Similarity in Python
        >>> import numpy as np
        >>> from scipy.spatial import distance as dist
        >>> cosm = np.array([1,0,1,0,0,0])
        >>> astr = np.array([0,1,1,0,0,0])
        >>> # dist.cosine returns the cosine distance, i.e. 1 - cosine similarity
        >>> round(float(dist.cosine(cosm, astr)), 4)
        0.5
        >>> round(float(dist.euclidean(cosm, astr)), 4)
        1.4142
  • 17. Computing Word Similarity by Vector Space Modeling
      • Advantages & Disadvantages
          – Advantage: requires no database lookups
          – Disadvantage: semantic similarity doesn’t imply synonymy, antonymy, meronymy, hyponymy, hypernymy, etc.
  • 18. Ontology Basics
      • Semantic Web Technologies
          – Data Models
          – Ontology Language
          – Distributed Query Language
          – Applications
              ∗ Large knowledge bases
              ∗ Business Intelligence
  • 19. Ontology Basics
      Figure 3: Cambridge Semantics’ simplified view of Semantic Web solutions.
  • 20. Ontology Basics
      • W3C Semantic Web
          – RDF - Resource Description Framework
              ∗ Data model w/ identifiers and named relations b/t resource pairs
              ∗ Represented as directed graphs b/t resources and literal values
                  · Done w/ collections of triples
                  · triple: subject, predicate and object
                      1. Na’im Tyson born in 197x
                      2. Na’im Tyson works for Knowledgent
                      3. Knowledgent headquartered in Warren
          – SPARQL - SPARQL Protocol And RDF Query Language
              ∗ Query language of Semantic Web
              ∗ Queries RDF stores over HTTP
              ∗ Very similar to SQL
          – Capturing Relationships
              RDF Schema: Vocabulary (term definitions), Schema (class definitions) and Taxonomies (defining hierarchies)
  • 21. (Ontology Basics, continued)
              OWL: Expressive relation definitions (symmetry, transitivity, etc.)
              RIF: Rule Interchange Format - representation for exchanging sets of logical and business rules
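To make the triple model concrete, here is a minimal pure-Python sketch. It is not RDF or SPARQL: the predicate spellings (`born_in`, `works_for`, `headquartered_in`) and the `match` helper are assumptions for illustration, and a real store would use IRIs and a library or endpoint. It only shows how a collection of (subject, predicate, object) triples can be queried by pattern, with `None` acting as a wildcard in the way a SPARQL variable would.

```python
# Triples from the slide as (subject, predicate, object) tuples.
# Predicate names are illustrative; real RDF would identify them with IRIs.
TRIPLES = [
    ("Na'im Tyson", "born_in", "197x"),
    ("Na'im Tyson", "works_for", "Knowledgent"),
    ("Knowledgent", "headquartered_in", "Warren"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def employer_headquarters(triples, person):
    """Two-hop lookup: person -works_for-> org -headquartered_in-> place."""
    places = []
    for _, _, org in match(triples, subject=person, predicate="works_for"):
        for _, _, place in match(triples, subject=org, predicate="headquartered_in"):
            places.append(place)
    return places


print(match(TRIPLES, subject="Na'im Tyson"))
print(employer_headquarters(TRIPLES, "Na'im Tyson"))  # ['Warren']
```

The two-hop lookup is the kind of graph traversal that a query language expresses declaratively by joining triple patterns on a shared variable.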
  • 22. Text Mining Basics
      • What do people think text mining is?
          – Automated discovery of new, previously unknown information by automatically extracting information from a usually large amount of different unstructured textual resources (Wasilewska, 2014)
  • 23. Text Mining Basics
      • What is text mining, really?
      Figure 4: Venn Diagram of Text Mining (Wasilewska, 2014). Fields shown: Data Mining, Information Retrieval, Text Mining, Statistics, Web Mining, Computational Linguistics & Natural Language Processing.
  • 24. Text Mining Basics
      • A General Approach (ignore the cloud!)
      • Document Clustering
      • Text Characteristics
      Figure 5: General Approaches to Text Mining Process (Wasilewska, 2014). Stages shown in the "Text Mining Process" diagram: Text, Text Preprocessing, Text Transformation (Attribute Generation), Attribute Selection, Data Mining / Pattern Discovery, Interpretation / Evaluation.
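The first stages of that process (preprocessing, attribute generation, attribute selection) can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny made-up sample, attributes are plain word counts, and selection uses a simple document-frequency threshold. The slide names the stages but does not prescribe these particular choices.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def preprocess(text):
    """Text preprocessing: lower-case, tokenize on letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def generate_attributes(documents):
    """Text transformation: one bag-of-words Counter per document."""
    return [Counter(preprocess(doc)) for doc in documents]


def select_attributes(bags, min_doc_freq=2):
    """Attribute selection: keep terms found in at least `min_doc_freq` documents."""
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


def to_matrix(bags, attributes):
    """Document-term matrix restricted to the selected attributes."""
    return [[bag[term] for term in attributes] for bag in bags]


documents = [
    "The cat sat on the mat.",
    "A cat and a dog.",
    "The dog chased the cat in the yard.",
]
bags = generate_attributes(documents)
attributes = select_attributes(bags)

print(attributes)                   # ['cat', 'dog']
print(to_matrix(bags, attributes))  # [[1, 0], [1, 1], [1, 1]]
```

The resulting matrix is the input that a data mining or pattern discovery step, such as clustering, would consume.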
  • 25. Text Mining Basics
      • Application - Document Clustering
          Goal: Group large amounts of textual data
          Techniques: High Level
              – k-means - top down
                  ∗ cluster documents into k groups using vectors and distance metric
              – agglomerative hierarchical clustering - bottom up
                  ∗ Start with each document being a single cluster
                  ∗ Eventually all documents belong to the same cluster
                  ∗ Documents represented as a hierarchy (dendrogram)
          Reference: Taming Text (see Ingersoll et al., 2013, chap. 6)
      • Final Remarks
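The bottom-up procedure can be sketched in pure Python. The version below is one possible reading, not the method from Taming Text: it assumes single linkage (cluster distance is the distance between the closest pair of members) and cosine distance over word counts, neither of which the slide specifies, and it assumes every document is non-empty. The returned merge history is the information a dendrogram draws.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two non-empty bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(bags):
    """Single-linkage agglomerative clustering; returns the merge history."""
    clusters = [(i,) for i in range(len(bags))]  # each document starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    cosine_distance(bags[p], bags[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], round(d, 3)))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy documents: two about markets, two about football.
docs = [
    "stock market shares fall",
    "stock market shares rise",
    "team wins football match",
    "football team loses match",
]
bags = [Counter(doc.split()) for doc in docs]
for left, right, dist in agglomerate(bags):
    print(left, "+", right, "at distance", dist)
# (0,) + (1,) at distance 0.25
# (2,) + (3,) at distance 0.25
# (0, 1) + (2, 3) at distance 1.0
```

Cutting the merge history before the final, high-distance merge leaves the two topical groups, which is how a dendrogram is typically turned into flat clusters.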
  • 27. References
      Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.
      Michael Collins. Lexical Semantics: Similarity Measures and Clustering, November 2007. URL http://www.cs.columbia.edu/~mcollins/6864/slides/wordsim.4up.pdf.
      Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
      Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. Manning Publications Co., January 2013.
      Jaganadhg. Wordnet sense similarity with nltk: some basics, October 2010. URL http://jaganadhg.freeflux.net/blog/archive/tag/WSD/.
  • 28. References (continued)
      George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
      Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.
      Anita Wasilewska. CSE 634 - Data Mining: Text Mining, January 2014. URL http://www.cs.sunysb.edu/~cse634/presentations/TextMining.pdf.