Deep Misconceptions and the Myth
of Data-Driven Language Understanding
On Putting Logical Semantics Back to Work
IMMANUEL KANT
Every thing in nature, in the inanimate
as well as in the animate world, happens
according to some rules, though we do
not always know them
I reject the contention that an important
theoretical difference exists between formal
and natural languages
RICHARD MONTAGUE
One can assume a theory of the world
that is isomorphic to the way we talk
about it… in this case, semantics
becomes very nearly trivial
JERRY HOBBS
Early efforts to find theoretically elegant formal models for various linguistic phenomena did not
result in any noticeable progress, despite nearly three decades of intensive research (late 1950s
through the late 1980s). As the various formal (and in most cases mere symbol-manipulation)
systems seemed to reach a deadlock, disillusionment in the brittle logical approach to language
processing grew, and a number of researchers and practitioners in natural language
processing (NLP) started to abandon theoretical elegance in favor of attaining some quick results
using empirical (data-driven) approaches.
All seemed natural and expected. In the absence of theoretically elegant models that can explain
a number of NL phenomena, it was quite reasonable to find researchers shifting their efforts to
finding practical solutions for urgent problems using empirical methods. By the mid-1990s, a
data-driven statistical revolution that was already brewing took the field of NLP by storm,
putting aside all efforts that were rooted in over 200 years of work in logic, metaphysics,
grammars and formal semantics.
We believe, however, that this trend has overstepped the noble cause of using empirical
methods to find reasonably working solutions for practical problems. In fact, the data-driven
approach to NLP is now believed by many to be a plausible approach to building systems that
can truly understand ordinary spoken language. This is not only a misguided trend, but is a very
damaging development that will hinder significant progress in the field. In this regard, we hope
this study will help start a sane, and an overdue, semantic (counter) revolution.
Copyright © 2017 WALID S. SABA
a spectre is haunting NLP
February 7, 2017
about the resurgence
of and the currently dominant
paradigm in ‘AI’ …
the availability of huge amounts of data, coupled with
advances in computer hardware and distributed
computing, resulted in some advances in certain types
of (data-centric) problems (image, speech, fraud
detection, text categorization, etc.)
But …
many problems in AI require
understanding that is beyond
discovering patterns in data
Identifying an adult female in an image is a data-centric
problem that might be suitable for data-driven image
recognition systems
However, inferring which of the two is a photo of a
teacher and which is a mother requires information that
is not (always!) in the data
which picture would a
data-driven image
recognition system pick out
for a query like
'musical band'?
Musician?
a person who plays a
musical instrument?
So what
is at issue
here?
The issue here is that, ontologically, there
are no musicians, teachers, lawyers, or
even mothers! What exists, ontologically
(metaphysically), are humans, and a
concept such as ‘musician’ is a logical
concept that might be true of a certain
human
Quantitative/data-driven approaches can
only reason with (detect, infer, recognize)
objects that are of an ontological type, but
they cannot detect logical concepts, which
form the majority of the objects of (human)
thought
ONTOLOGICAL CONCEPTS
human
...

LOGICAL CONCEPTS
lawyer
dancer
teacher
mother
...
failure to distinguish between logical and
ontological concepts is not only a flaw in
data-driven approaches
logical/formal semantics also failed to
provide adequate models for natural
language and for exactly the same reason
Notwithstanding achievements in data-centric tasks (e.g., image and speech recognition, or
numerically specifiable and finite-space problems, such as the game Go), statistical and other
data-driven models (e.g., neural networks) cannot model human language comprehension
because these models cannot explain, model or account for very important phenomena in
ordinary spoken language, such as:
• Non-Observable (thus Non-Learnable) Information
• Intensionality and Compositionality
• Inferential Capacity
Criticisms of the statistical data-driven approach to language understanding are very often
automatically associated with the Chomskyan school of linguistics. At best, this is a misinformed
judgement (although in many cases, it is ill-informed). There is a long history of work in logical
semantics (a tradition that forms the background to the proposals we will make here) that has
very little to do (if anything at all) with Chomskyan linguistics.
Notwithstanding Chomsky's (in our opinion valid) Poverty of the Stimulus (POS) argument (an
argument that clearly supports the claim of some kind of innate linguistic abilities), we believe
that Chomskyans put too much emphasis on syntax and grammar (which ironically made their
theory vulnerable to criticism from the statistical and data-driven school). Instead, we think that
syntax and grammar are just the external artifacts used to express internal, logically coherent,
semantic, and compositionally and productively (i.e., recursively) constructed thoughts,
something that is perhaps analogous to Jerry Fodor's Language of Thought (LOT).
Here we should also mention that we agree somewhat with M. C. Corballis ('The Recursive
Mind') that it is thought that brought about the external tool we call language, and not the other
way around.
what this study is not about
Another association that criticism of the statistical and data-driven approaches to NLU often
conjures up is that of building large knowledge bases with brittle rule-based inference engines.
This is perhaps the biggest misunderstanding, held not only by many in the statistical and data-
driven camp, but also by previously over-enthusiastic knowledge engineers who mistakenly
believed at one point that all that was required to crack the NLU problem was to keep adding
more knowledge and more rules. We also do not subscribe to such theories.
In fact, regarding the above, we agree with an observation once made by the late John McCarthy
(at IJCAI 1995) that building ad-hoc systems by simply adding more knowledge and more rules
will result in building systems that we don't even understand. Ockham's Razor, as well as
observing the linguistic skills of 5-year-olds, should both tell us that the conceptual structures that
might be needed in language understanding should not, in principle, require all that
complexity.
As will become apparent later in this study, the conceptual structures that speakers of ordinary
spoken language have access to are not as massive and overwhelming as is commonly believed.
Instead, it will be shown that the key is in the nature of that conceptual structure and the
computational processes involved.
what this study is not about
FINALLY, our concern here is in introducing a plausible model for natural language understanding
(NLU). If your concern is natural language processing (NLP), as it is used, for example, in
applications such as these
word-sense disambiguation (WSD);
entity extraction/named-entity recognition (NER);
spam filtering, categorization, classification;
semantic/topic-based search;
word co-occurrence/concept clustering;
sentiment analysis;
topic identification;
automated tagging;
document clustering;
summarization;
etc.
then it is best if we part ways at this point, since this is not at all our concern here. There are many
NLP and text processing systems that already do a reasonable job on such data-level tasks. In fact, I
am part of a team that developed a semantic technology that does an excellent job on almost all of
the above, but that system (and similar systems) are light years away from doing anything remotely
related to what can be called natural language understanding (NLU), which is our concern here.
what this study is not about
1. WE WILL ARGUE THAT purely data-driven extensional models that ignore
intensionality, compositionality and inferential capacities in natural language are
inappropriate, even when the relevant data is available, since higher-level reasoning (the
kind that's needed in NLU) requires intensional reasoning beyond simple data values.

2. WE WILL ARGUE THAT many language phenomena are not learnable from data because
(i) in most situations what is to be learned is not even observable in the data (or is not
explicitly stated but is implicitly assumed as 'shared knowledge' by a language
community); or (ii) in many situations there is no statistical significance in the data, as the
relevant probabilities are all equal.

3. WE WILL ARGUE THAT the most plausible explanation for a number of
phenomena in natural language is rooted in logical semantics, ontology, and the
computational notions of polymorphism, type unification, and type casting; and we
will do this by proposing solutions to a number of challenging and well-known
problems in language understanding.
what this study is about
We will propose a plausible model rooted in logical semantics, ontology, and the computational notions
of polymorphism, type casting and type unification. Our proposal provides a plausible framework for
modelling various phenomena in natural language, specifically phenomena that require reasoning
beyond the surface structure (external data). To give a hint of the kind of reasoning we have in mind,
consider the following sentences:
(1) a. Jon enjoyed the movie
b. Jon enjoyed watching the movie
(2) a. A small leather suitcase was found unattended
b. A leather small suitcase was found unattended
(3) a. The ham sandwich wants another beer
b. The person eating the ham sandwich wants another beer
(4) a. Dr. Spok told Jon he should soon be done with writing the thesis
b. Dr. Spok told Jon he should soon be done with reading the thesis
Our model will explain why (1a) is understood by all speakers
of ordinary language as (1b); why speakers of multiple languages find (2a) more natural to say than (2b);
why we all understand (3a) as (3b); and why we effortlessly resolve 'he' in (4a) with Jon and 'he' in (4b)
with Dr. Spok. Before we do so, however, we will discuss some serious flaws in proposing a statistical
and data-driven approach to NLU.
more specifically ...
understanding language by
analyzing data?
what if the relevant information
is not even in the data?
Challenges in the computational comprehension of ordinary text are often due to quite a bit of missing text,
text which is not explicitly stated but is often assumed as shared knowledge among a community of
language users. Consider for example the sentences in (1):
(1) a. Don’t worry, Simon is a rock.
b. The truck in front of us is annoying me.
c. Carlos likes to play bridge.
d. Mary enjoyed the apple pie.
e. Jon owns a house on every street in the village.
Clearly, speakers of ordinary English understand the above as
(2) a. Don’t worry, Simon is [as solid as] a rock.
b. The [person driving the] truck in front of us is annoying me.
c. Carlos likes to play [the game] bridge.
d. Mary enjoyed [eating] the apple pie.
e. Jon owns a [different] house on every street in the village.
Since such sentences are quite common and are not at all exotic, farfetched, or contrived, any model for
NLU must clearly somehow 'uncover' this [missing text] for a proper understanding of what is being said.
What is certain here is that data-driven approaches are helpless in this regard, since a crucial part of
understanding NL text is not only interpreting the data, but 'discovering' what is missing from the data.
analyzing missing text?
it's not even in the data
Again, let us consider the sentences below, where there is some [missing text] that is not explicitly
stated in everyday discourse, but is often implicitly assumed:
a. Don’t worry, Simon is [as solid as] a rock.
b. The [person driving the] truck in front of us is annoying me.
c. Carlos likes to play [the game] bridge.
d. Mary enjoyed [eating] the apple pie.
e. Jon owns a [different] house on every street in the village.
Although the above seem to have a common denominator, namely some missing text that is often
implicitly assumed, it is somewhat surprising that in looking at the literature one finds that the
missing-text phenomenon has been studied quite independently and under different labels such
as:
metaphor (a),
metonymy (b),
lexical ambiguity (c),
ellipsis (d),
quantifier scope ambiguity (e)
it's not even in the data
analyzing missing text?
In ordinary spoken
language there’s more
than missing (and
implicitly assumed)
text …
When surface data
probabilities are all equally
likely, we often resort to our
shared (commonsense)
knowledge in resolving
certain types of ambiguities
(e.g., in reference resolution)
One of the most obvious challenges to statistical and data-driven NLU arises in situations where there does
not seem to be any statistical significance in the observed data that can help in making the right
inferences. As an example, consider the sentences in (1) and (2).
(1) The trophy did not fit in the brown suitcase because it was too
a. big
b. small
(2) Dr. Spok told Jon that he should soon be done
a. writing his thesis
b. reading his thesis
For a speaker of ordinary language, the decision as to what 'it' in (1) and 'he' in (2) refer to is
immediately obvious, even for a 5-year-old. On the other hand, a statistical data-driven approach
would be helpless in making such decisions, since the only differences between the sentence pairs in (1)
and (2) are words that co-occur with equal probabilities (this is so because antonyms, or opposites, such
as big/small, night/day, hot/cold, read/write, open/close, etc., have been shown to co-occur in text with
equal frequency). Clearly, then, references such as those in (1) and (2) must be resolved using information
that is not (directly) in the data.
probabilities are all equal
it's not even in the data
In the absence of any statistical significance in the data, we have suggested above that references such
as those in sentences (1) and (2) are resolved by relying on other information that is not (directly) in
the data.
It might still be suggested, however, that a learning algorithm can create statistical significance
between (1a) and (1b), for example, if probabilities of some composites in the sentence (as opposed to
the atomic units) are considered. What this would essentially require is creating a composite feature
for every possible relation. In (1), we would need at least the following:
trophy-fit-in-suitcase-small
trophy-fit-in-suitcase-big
trophy-not-fit-in-suitcase-small
trophy-not-fit-in-suitcase-big
Note here that since data-driven approaches also do not admit the existence of a type hierarchy (or any
knowledge structure, for that matter), i.e., there is nothing that says that a Trophy and a Radio are both
subtypes of an Artifact, and that Purse and Suitcase are both subtypes of some Container, where
the 'fit' relation applies similarly to both, other features (e.g., radio-fit-in-purse-small) would also be
needed to learn how to resolve the reference 'it' in (1).
probabilities are all equal
it's not even in the data
Again, in the absence of a type-hierarchy (or some other source of information) statistical
significance can only be salvaged if composite features are constructed for every possible relation in
a meaningful sentence. Such a story leads us to something like this:
trophy-fit-in-suitcase-small
trophy-fit-in-suitcase-big
trophy-not-fit-in-suitcase-small
trophy-not-fit-in-suitcase-big
radio-fit-in-purse-small
radio-fit-in-purse-big
radio-not-fit-in-purse-small
radio-not-fit-in-purse-big
etc.
Although the point can be made with the above, the story in reality is much worse, as there are
more 'nodes' that must be combined in these features to capture statistical significance. For
example, if 'because' were changed to 'although' in (1b) then 'it' would suddenly refer to the trophy.
Nevertheless, the question now is how many such features would eventually be needed, if every
meaningful sentence requires a handful of composite features to capture all statistical correlations?
Fodor and Pylyshyn (1988) hint that the number is on the order of the number of seconds in
the history of the universe, citing an experiment conducted by the psycholinguist George Miller.
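The multiplicative blow-up can be made concrete with a small sketch. The feature names below are illustrative (they are not from any real learner); the point is only that, without a type hierarchy, one composite feature is needed per combination of object, container, polarity, size adjective, and connective, so the feature count is a product, not a sum:

```python
from itertools import product

# Hypothetical vocabulary for Winograd-style sentences such as
# "The trophy did not fit in the suitcase because it was too big."
objects = ["trophy", "radio", "laptop"]
containers = ["suitcase", "purse", "box"]
polarity = ["fit", "not-fit"]
sizes = ["big", "small"]
connectives = ["because", "although"]  # "although" flips the referent

# One composite feature per combination: the counts multiply.
features = ["-".join(combo)
            for combo in product(objects, polarity, containers, sizes, connectives)]

print(len(features))  # 3 * 2 * 3 * 2 * 2 = 72, for just three nouns of each kind
print(features[0])    # trophy-fit-suitcase-big-because
```

With a type hierarchy, by contrast, a single constraint at the level of Artifact and Container would cover every object/container pair at once.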
probabilities are all equal
it's not even in the data
Incidentally, in the absence of any external knowledge structures, the combinatorially implausible
explosion in the number of features needed by a statistical data-driven (i.e., bottom-up) learner
would also be needed by a top-down learner, one that learns by being told (or by instruction).
Specifically, a top-down learner would ask for some number n of clarifications for every sentence, requiring
therefore a total of n^m clarifications for a paragraph with m sentences. The reader can now easily
work out how many clarifications would be required for a top-down learner to understand just a
small paragraph1.
The point here is that whether the learner tries to discover what is missing bottom-up (from the
data) or top-down (by being told), the infinity lurking
in language (due to the recursive productivity of thoughts) makes learning various language
phenomena from data alone computationally implausible.
a top-down explanation
1The reason a top-down learner would need (n × n), as opposed to (n + n), clarifications for two consecutive
sentences where each requires n is that the preferred reading of one sentence is subject to revision in the context of
the previous and/or the following sentence. This is so because, linguistically, it is paragraphs, not sentences,
that are the smallest linguistic units that can be fully interpreted on their own and should not (in theory) require
any additional text to be fully understood. See The Semantics of Paragraphs (Zadrozny & Jenssen, 19xx) for an
excellent treatment of the subject.
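A toy calculation makes the footnote's point plain. The numbers below are purely illustrative: n clarifications per sentence, m sentences per paragraph. If sentences could be interpreted independently the costs would add; because interpretations must be revised jointly across the paragraph, they multiply:

```python
n = 3  # clarifications needed per sentence (illustrative)
m = 5  # sentences in the paragraph (illustrative)

additive = n * m         # cost if sentences were interpretable independently
multiplicative = n ** m  # cost when readings are revised jointly across the paragraph

print(additive)        # 15
print(multiplicative)  # 243
```

Even for these tiny numbers the joint cost is an order of magnitude larger, and it grows exponentially with paragraph length.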
it's not even in the data
Our argument against statistical data-driven approaches to NLU is not meant to dismiss the role of
statistical/probabilistic reasoning in language understanding. That would, of course, be unwise. Our
argument is about which probabilities are relevant in language understanding. Consider, for example,
the following:
(1) The town councilors refused to give the demonstrators a
permit because they advocated violence and anarchy.
(2) A young teenager fired several shots at a policeman.
Eyewitnesses say he immediately fled away.
While the most likely reading for (1) has 'they' referring to the demonstrators, one can imagine a
scenario where a group of anarchist town councilors refused to give the demonstrators a permit
specifically to incite violence and anarchy. Similarly, while the most likely reading for (2) is the one
where 'he' refers to the young teenager, one can imagine a scenario where a slightly wounded
policeman fled away to escape further injuries.
Obviously such occurrences are rare, and thus, in the absence of other information, the pragmatic
probability of the usual reading wins out with speakers of ordinary language. What is important to
note here is that the likelihoods we are speaking of are a function of pragmatics and have nothing to
do with anything observed in the data.
pragmatic probabilities
it's not even in the data
To summarize this argument, consider the table below. At the data level references can be resolved
during syntactic analysis using simple NUMBER or GENDER data. At the information level,
the resolution would require semantic (type) information, for example that corporations, and not
lawsuits, settle a case out of court. Note also that at this level the possibilities are not all available, once
the type constraints are applied. It is exactly at the pragmatic level where probabilistic/statistical
reasoning factors in, since at this level the referents are all possible, yet some are more probable than
others (e.g., it is more likely that the one who fell down is the one who was shot, etc.)
pragmatic probabilities
REFERENCES RESOLVED BY SYNTAX (data level)
John informed Mary that he passed the exam.
John told Steve and Diane that they were invited to the party.

REFERENCES RESOLVED BY SEMANTICS (information level)
There are a number of lawsuits between Apple and Samsung, and
a. both say they are more about values than patents and money.
b. both say they are ready to settle out of court.

REFERENCES RESOLVED BY PRAGMATICS (knowledge level)
A young teenager fired several shots at a policeman.
Eyewitnesses say he immediately fled away.

REFERENCES CANNOT BE RESOLVED (intentional level: intention not clear)
John told Bill that he has been nominated to head the committee.
it's not even in the data
Perhaps chief among the "it's not even in the data" phenomena is that of Adjective-Ordering Restrictions
(AORs), a phenomenon that can be explained by the examples below:
(1) a. Carlos is a polite young man
b. #Carlos is a young polite man
(2) a. A small brown suitcase was found unattended
b. #A brown small suitcase was found unattended
The readings in (1a) and (2a) are clearly preferred by speakers of ordinary spoken language over the
readings in (1b) and (2b), although there are no rules that speakers of ordinary language seem to be
following. What makes the AOR phenomenon even more intriguing is the fact that these preferences
are also consistently made across multiple languages.
First of all, this phenomenon presents a paradigmatic challenge to the statistical and data-driven story
about language learning, as it does not seem that speakers come to have these preferences by observing
and analyzing data. Furthermore, there does not seem to be a pattern in the observed data suggesting
which adjectives should precede or follow other adjectives. For example, while it
is preferred that 'small' precede 'brown' in (2), in (3) 'small' is
no longer preferred as the first adjective:
(3) A beautiful small suitcase was found unattended
innate preferences?
it's not even in the data
The most crucial challenge to data-driven NLU as it relates to adjective-ordering restrictions is to
explain how beautiful in (4a) could be describing Olga's dancing as well as Olga as a person, while this
reading is not available in (4b):
(4) a. Olga is a tall beautiful dancer
b. Olga is a beautiful tall dancer
We will see later why beautiful in (4b) can no longer modify Olga's dancing (an abstract entity of
type Activity) after it was polymorphically cast into describing a physical object. For now we want to
note, however, that while various investigations on large corpora have not yielded any plausible
explanation as to what seems to govern these adjective-ordering restrictions, we argue that even if
some patterns were to be discovered, the more important question is 'what is behind this phenomenon,
i.e., what is it that makes us have these ordering preferences, and across multiple languages'?
In our opinion, what is behind this phenomenon must be much deeper than the outside (observable)
data of any language. In fact, we believe that a plausible account for this phenomenon must shed some
light on the conceptual structures and the processes that are operating in language. As stated above, a
plausible explanation for this puzzle, one that is rooted in ontology, polymorphism, type unification
and type casting, will be suggested later in this study.
innate preferences?
it's not even in the data
We have thus far argued that in the absence of some process or other source of information, a number of
phenomena in natural language understanding cannot be observed, captured, or learned by simply
analyzing the external linguistic data alone. From adjective-ordering restrictions, which seem to be
not only data-independent but even language-independent, to the missing (not explicitly stated) text that
must somehow be discovered and interpreted, to situations where probabilities in the data are statistically
insignificant, it is clear that data-driven approaches to NLU are inappropriate.
Before we get into our proposals, however, we will next have a small discussion about intensions and
how data alone, even if available, is not enough in high-level reasoning, the kind that is needed in NLU.
no matter how big,
data is (in the end) just data

extensions and
intensions
What do we mean when we write an equality like this?

(1) (A ∧ (B ∨ C)) = (A ∧ B) ∨ (A ∧ C)

Clearly, as objects (e.g., as logical circuits) the expressions in (1) are not the same. For example, a logical
circuit corresponding to the expression on the left-hand side has only two gates, while a circuit for the
other expression would have three, as shown below.
It would seem then that at some level, equality in data only is not enough and saying two objects are the
same is different from saying they are equal (in their data value). In some contexts, as will be seen
shortly, these differences are crucial. What is crucial to our discussion here is that data-driven approaches
deal with data only, that is, equality in that paradigm is equality of one attribute, namely the final value.
Thus, if it does turn out that equality of data alone is not enough in high-level reasoning (e.g., in NLU),
then data-driven approaches to NLU would also (or, again) clearly be inappropriate.
Let us therefore take a closer look at the equality most of us know, and the related notions of intensions
and extensions, notions that some of the most penetrating minds in mathematical logic have studied for
nearly two centuries.
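The two-gates-versus-three-gates point can be sketched in a few lines. This is an illustrative sketch, not from the study: the two Boolean expressions agree on every input (they are extensionally equal), yet their syntax trees, encoded here as nested tuples, contain different numbers of operators:

```python
from itertools import product

def lhs(a, b, c):
    # A AND (B OR C): a circuit with two gates
    return a and (b or c)

def rhs(a, b, c):
    # (A AND B) OR (A AND C): a circuit with three gates
    return (a and b) or (a and c)

# Extensional equality: identical input-output behaviour on all 8 inputs
assert all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))

# Intensional difference: the syntax trees have different gate counts
lhs_tree = ("and", "A", ("or", "B", "C"))
rhs_tree = ("or", ("and", "A", "B"), ("and", "A", "C"))

def gates(tree):
    """Count operators in a nested-tuple expression tree."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(gates(child) for child in tree[1:])

print(gates(lhs_tree), gates(rhs_tree))  # 2 3
```

Equality of the final value, in other words, is equality in one attribute only; the objects themselves remain distinct.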
data and intensions
Our grade school teachers once
told us that

√256 = 16

Can we always equate and replace the data value
16 by the data value √256? Let's see …
data and intensions
here's a snapshot of
some reality:

Mary taught
her little brother
that 7 + 9 = 16

Now if we blindly follow what our grade school teachers told us, namely that √256 = 16, we
should be able to replace 16 by √256 without any problem. But if we do that we would then be
able to alter reality and come up with

Mary taught her little brother that 7 + 9 = √256

What happened? Were we taught the wrong thing when we were told that √256 = 16? Not
exactly, but we were also not told the whole story. I guess our grade school teachers did not
know we would end up working in AI and NLU. If they did, they would have told us that
extensional (data-only) equality is not sufficient in high-level reasoning, and if equated with
sameness at that level it can easily lead to false conclusions.
data and intensions
The four objects below are in fact equal, including √256 and 16, but in regard to one attribute only,
namely their data value. As objects, however, they are not the same, as they differ in many other
attributes, for example in the number of operators and the number of operands. Note, however, that
the attributes value, no-of-operators, and no-of-operands are still not enough to establish true intensional
equality between these objects, as demonstrated by the objects (a) and (b). At a minimum, true
(intensional) equality between these objects would require the equality of at least four attributes:
value, no-of-operators, no-of-operands, and syntax-tree.
equality and sameness
(a) (b)
In many domains where the only relevant attribute is the data (value), working with extensional
(data) equality only might be enough. In tasks that require high-level reasoning, such as NLU,
however, this will lead to contradictions and false conclusions, as the example of Mary and her
little brother clearly demonstrates.
data and intensions
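The four attributes can be sketched directly as a small record type. This is a hypothetical illustration (the attribute names follow the slide; the objects "7 + 9" and "16" are the ones from the Mary example): two expressions agree on value, so they are extensionally equal, yet comparing all attributes shows they are not the same object:

```python
from dataclasses import dataclass

@dataclass
class Expr:
    """An expression object with the four attributes named on the slide."""
    value: int           # extensional attribute: the data value
    n_operators: int
    n_operands: int
    syntax_tree: tuple

a = Expr(16, 1, 2, ("+", 7, 9))  # the object "7 + 9"
b = Expr(16, 0, 1, (16,))        # the object "16"

print(a.value == b.value)  # True: extensionally equal (same data value)
print(a == b)              # False: not the same object intensionally
```

The dataclass-generated `==` compares all four attributes at once, which is exactly the stronger, intensional notion of equality the slide calls for.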
As an aside …
Reducing equality of objects to equality of one extensional attribute, namely the
data value, is what is behind the so-called adversarial examples in deep neural
networks, where small perturbations in the image (of a kind that would not lead
the human eye to make a different classification) will cause the network to classify
the image in a completely different category. The same is true in the converse case,
where a completely meaningless image (a blob of pixels) is classified with high
certainty as a real-life object. That is, behind both of these phenomena is something
similar to the fact that √256 is not always (and in all contexts) equal to 9 + 7,
although certain calculations involving these data values might produce the same
output value (bottom line: extensional, data-only equality is
not enough in high-level reasoning)
data and intensions
Beyond grade school, we were told in high school that two functions, f and g, are equal (are the same) if
for every input they produce the same output. In notation, this was expressed as

f = g iff (∀x) f(x) = g(x)

But this is not entirely true, or, our high school teachers also did not tell us the whole truth: if two
functions are equal whenever they agree on their input-output pairings, then MergeSort and
InsertionSort would be the same objects, since for any sequence

MergeSort(sequence) = InsertionSort(sequence)

But computer scientists know that although their external values are always the same (that is, they are
extensionally equal), MergeSort and InsertionSort are not the same objects, as they differ in many other (and
very important) attributes, for example in their space and time complexity.
yet another example
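The sorting example can be run directly. The implementations below are illustrative sketches instrumented with a comparison counter: on every input they return identical output (extensional equality), yet the amount of work they do, one of their intensional attributes, differs sharply:

```python
def insertion_sort(seq, counter):
    """Insertion sort; counter[0] accumulates element comparisons."""
    out = []
    for x in seq:
        i = 0
        while i < len(out):
            counter[0] += 1
            if out[i] > x:
                break
            i += 1
        out.insert(i, x)
    return out

def merge_sort(seq, counter):
    """Merge sort; counter[0] accumulates element comparisons."""
    if len(seq) <= 1:
        return list(seq)
    mid = len(seq) // 2
    left = merge_sort(seq[:mid], counter)
    right = merge_sort(seq[mid:], counter)
    out = []
    while left and right:
        counter[0] += 1
        out.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return out + left + right

data = list(range(1, 51))  # already sorted: worst case for this insertion variant
c1, c2 = [0], [0]
r1, r2 = insertion_sort(data, c1), merge_sort(data, c2)

print(r1 == r2 == sorted(data))  # True: extensionally equal on this input
print(c1[0], c2[0])              # insertion sort performs far more comparisons
```

Equal input-output behaviour, in other words, does not make the two algorithms the same object.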
data and intensions
data and reasoning
Here we consider an example where working with extensions (data values) only and ignoring
intensions can easily lead to absurd conclusions. Consider the facts shown in the table below.
Now, according to the above, the teacher of Alexander the Great = Aristotle. Notice now that if we simply replace 'the teacher of Alexander the Great' with a value that is only extensionally equal to it, we can get an absurdity from a very meaningful sentence, as shown below
Let us now consider examples illustrating why intensionality cannot be ignored in natural language
understanding. Suppose we have a question-answering system that was to return the names of:
(1) all the tall presidents of the United States?
(2) all the former presidents of the United States?
A simple method for answering (1) would be to get two sets, the set of names of all tall people, and
the set of names of all presidents of the United States, and simply return the intersection as the
result.
What about the query in (2), however? Clearly we cannot do the same, because we cannot, as in the case of tall, represent former by a set (an extension) of all former things. If we did, then Ronald Reagan, for example, would have been a 'former president' even while serving his term as president, because he would have been in both sets: the set of presidents, and the set of 'former things', as he was also a former actor.
The point here is that, unlike tall, which is an extensional adjective that can semantically be represented by a set (the set of all tall things), former is an intensional adjective that logically operates on a concept, returning a subset of that concept as a result.
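A small Python sketch makes the contrast concrete. The sets and names are illustrative, and splitting the concept into 'ever' and 'now' sets is our own simplification of the temporal condition: tall composes by plain intersection, while former must be an operator applied to the concept.

```python
# Toy data: names are illustrative only.
tall_things = {"Lincoln", "Jefferson", "LBJ", "Shaquille O'Neal"}
us_presidents_now = {"Reagan"}   # presidents at the time of evaluation
us_presidents_ever = {"Lincoln", "Jefferson", "LBJ", "Reagan", "Carter"}

# (1) 'tall president': an extensional adjective, so set intersection works.
tall_presidents = tall_things & us_presidents_ever

# (2) 'former president': an operator applied to the concept 'president',
# not an intersection with a set of 'former things'.
def former(concept_ever, concept_now):
    """x was, at some point, an instance of the concept, but is not now."""
    return concept_ever - concept_now

former_presidents = former(us_presidents_ever, us_presidents_now)
assert "Reagan" not in former_presidents   # not 'former' while serving
```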
data, intensions and reasoning
Let us elaborate on this subject some more. The following is a plausible meaning for (1) and (2) above:
(1) tall presidents of the United States ⇒ { x | is-president-of-the-us(x) ∧ is-tall(x) }
(2) former presidents of the United States ⇒ { x | is-president-of-the-us(x) ∧ F(x, president) }
What the above says is: (1) 'tall presidents of the United States' refers to any x that is in the set of presidents and also in the set of tall things; and (2) 'former presidents of the United States' refers to any x that is in the set of presidents and of which some F is also true. Clearly, what F does with an x is something to the effect of making sure that x was, at some point in time, but is not now, a president. The point here is that unlike is-tall(x), F is not a set, and has no extensional value, but is a logical expression that takes a concept and applies some condition, returning a subset of the original concept.
None of this is available in data-driven NLU, where both 'tall' and 'former' are adjectives that equally modify nouns, which, as we have seen, can result in contradictions when executed on real data.
One misguided attempt at salvaging the data-only solution would be to maintain a set for the compound former presidents.
This escape attempt is doomed, however, since composite sets for previous president, former
senator, former governor, previous governor, etc. would also then have to be added and maintained.
In fact, insisting on a data-only solution for intensional adjectives would essentially mean maintaining a set for every construction of the form [Adj1 Adj2 Noun], [Adj1 Adj2 Noun1 Noun2], … where any adjective Adji is an intensional adjective.
This is exactly the same situation we encountered previously (pages 12-15), where composite features for every possible relation were needed to resolve references in a data-driven model. In both cases, such alternatives are neither computationally nor psychologically plausible.
Another major problem with data-driven/statistical approaches to NLU is their complete denial of
compositionality in computing the meaning of larger linguistic units as a function of the meaning of
their constituents. To illustrate, consider the sentences below.
(1) Jon bought an old copy of Das Kapital.
(2) Jon read an old copy of Das Kapital.
Although (1) and (2) refer to the same object, namely to a book entitled 'Das Kapital', the reference in (1) is to a physical object that can be bought (and thus sold, burned, etc.), while in (2) the reference is to the content and ideas in that book. Thus, 'Das Kapital' may refer to different features or properties of the book, depending on the context, where the context could extend over several sentences. For example, consider (3):
(3) Jon read Das Kapital. He then burned it because he did not
agree with anything it espouses.
In (3), we are (at the same time) using 'Das Kapital' to refer to an abstract object (namely the content of Das Kapital) when Jon read it and then disagreed with its content, and to a physical object that can be burned. We will see later on how a strongly-typed system will discover the existence of all the potential types of objects that 'Das Kapital' can refer to (a physical object that can be burned, an abstract object that can be read and disagreed with, etc.)
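One way to picture what such a strongly-typed treatment must eventually deliver is to model the book as an object with two aspects, where each verb selects the aspect it needs. The Python sketch below is our own toy encoding (the class and function names are illustrative), not the machinery proposed later in these slides:

```python
from dataclasses import dataclass

@dataclass
class PhysicalObject:
    title: str

@dataclass
class InfoContent:
    title: str

@dataclass
class Book:
    physical: PhysicalObject   # the copy that can be bought, sold, burned
    content: InfoContent       # the ideas that can be read and disagreed with

def buy(b: Book) -> PhysicalObject:
    return b.physical          # buying selects the physical aspect

def read(b: Book) -> InfoContent:
    return b.content           # reading selects the informational aspect

def burn(b: Book) -> PhysicalObject:
    return b.physical

das_kapital = Book(PhysicalObject("Das Kapital"), InfoContent("Das Kapital"))
# 'Jon read Das Kapital. He then burned it': one noun phrase, two aspects.
assert isinstance(read(das_kapital), InfoContent)
assert isinstance(burn(das_kapital), PhysicalObject)
```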
compositionality
In natural language we can speak of anything we can conceive or imagine, existent or non-existent.
We can thus speak of and refer to an event that did not exist, as in
(1) John cancelled the trip. It was planned for next Saturday.
In (1), we are speaking about and referring to an event (a trip), that did not actually happen, thus a trip
that never existed. We can also refer to or speak of objects that do not exist, as in
(2) John painted a yellow bear.
In (2) what is 'yellow' is not an actual bear, but a depiction of some object, namely a bear. Reference to abstract and nonexistent objects can be quite involved, especially in mixed contexts where the initial reference is to an object that does not necessarily exist, but whose existence is implied by subsequent context. For example, consider the following:
(3) John’s book proposal was not well received.
But it later became a bestseller when it was published.
In (3), the reference was initially to a book proposal, which does not imply the existence of the book,
although subsequent context implies the concrete existence of a book. Such inferences cannot
be made with a simple analysis of the external data.
yellow bears?
Data-driven approaches typically ignore functional words (prepositions, quantifiers, etc.), and for a
good reason: the probabilities of these words are equal in all contexts! But such words cannot be
ignored as these words are what logically glues the various components of a sentence into a coherent
whole. Consider for example the determiner „a‟, the smallest word in English, in the following
sentences:
(1) A paper on genetics was published by every student of Dr. Miller
(2) A paper on genetics was referenced by every student of Dr. Miller
While 'a paper on genetics' may refer to a single and specific paper in (2), this is not likely in (1), where 'a' is most likely under the scope of 'every'. That is, the most likely meaning of (1) is the one implied by
(3) Every student of Dr. Miller published a paper on genetics
Resolving such quantifier scope ambiguities is clearly beyond data-driven approaches and is a function of pragmatic world knowledge (e.g., while it is possible for several students to refer to a single paper, it is not likely that all of Dr. Miller's students published the same paper…)
We shall later on see how a strongly-typed ontology of commonsense concepts can be used to make
such inferences.
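The two readings can be spelled out over a toy model (a Python sketch with hypothetical students and papers): the wide-scope reading of 'a' requires one paper common to all students, while the narrow-scope reading only requires some paper per student.

```python
# Toy model: names are hypothetical.
students = {"ann", "bob", "cal"}
papers = {"p1", "p2", "p3"}
published = {("ann", "p1"), ("bob", "p2"), ("cal", "p3")}  # each their own paper

def wide_scope_a(rel):
    # ∃p ∀s : one specific paper stands in the relation to every student
    return any(all((s, p) in rel for s in students) for p in papers)

def narrow_scope_a(rel):
    # ∀s ∃p : each student has some (possibly different) paper
    return all(any((s, p) in rel for p in papers) for s in students)

assert not wide_scope_a(published)   # no single paper published by everyone
assert narrow_scope_a(published)     # but every student published some paper
```

With 'referenced', world knowledge admits the wide-scope relation as well (all students may cite one paper); with 'published', only the narrow-scope reading survives.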
functional words
We (hopefully) have demonstrated that purely quantitative (statistical data-driven) approaches are
not plausible models for natural language understanding, and for two main reasons:
1. The relevant information is often not even present in the data, or in many cases there is no statistical significance in the data to make the proper inferences. Attempts to remedy this lead to a combinatorial explosion in the number of features that would have to be assumed, which renders these attempts computationally implausible.
2. It was shown that even when the data is available, reasoning with data only and ignoring
intensions and logical definitions can easily lead to absurdities and contradictions.
While statistical and data-driven models may not be appropriate for high-level reasoning tasks in language understanding, we believe that these models have a lot to offer in some linguistic and data-centric tasks. Chief among these are part-of-speech (POS) tagging, statistical parsing, and collecting and analyzing corpus linguistic data to 'enable' and automate some of the tasks needed in building a system that can truly understand ordinary spoken languages.
We are now in a position to start describing our proposal.
data-driven NLU?
so where are we now?
ontological vs. logical concepts
We will start with our proposal by first introducing the general framework, and we will do so gradually. The material presented from hereon assumes some exposure to logic, although we will try to simplify our presentation as much as can possibly be done.
One of the major features in our framework is the crucial idea of distinguishing between what can be
called ontological concepts, or first-intension concepts, as Cocchiarella (19xx) calls them, and logical
concepts (or, second intension concepts). The difference between these two types of concepts can be
illustrated by the following examples:
(1) R2 : heavy(x :: physical)
R3 : hungry(x :: animal)
R4 : articulate(x :: human)
R5 : make(x :: human, y :: artifact)
R6 : imminent(x :: event)
R7 : beautiful(x :: entity)
What the above says is: heavy is a property that can be said of any object x that is of type physical;
that we say hungry of objects that are of type animal; that articulate applies to objects that are of
type human; that we can speak of the make relation between an object of type human and an object of
type artifact; that we can say imminent of objects that are of type event; and, finally, that we can say
beautiful of any entity.
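A minimal Python sketch of such type-constrained (polymorphic) predicates, assuming an illustrative fragment of the subsumption hierarchy (not the full ontology discussed later):

```python
# Immediate supertype of each type: an illustrative fragment only.
parent = {
    "physical": "entity", "abstract": "entity",
    "living": "physical", "artifact": "physical",
    "animal": "living", "human": "animal",
    "event": "abstract",
}

def subsumes(t, s):
    """t ⊑ s : is t the same as, or a descendant of, s?"""
    while t is not None:
        if t == s:
            return True
        t = parent.get(t)
    return False

def applicable(pred_type, obj_type):
    """A predicate typed pred_type applies to obj_type and all its subtypes."""
    return subsumes(obj_type, pred_type)

assert applicable("physical", "human")   # heavy(x :: physical) of a human: sensible
assert applicable("animal", "human")     # hungry(x :: animal) of a human: sensible
assert applicable("entity", "human")     # beautiful(x :: entity) of anything
assert not applicable("event", "human")  # imminent(x :: event) of a human: nonsense
```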
the framework
It is also assumed that the types associated with predicates in
(1), e.g. artifact, event, human, entity, etc. exist in a
subsumption hierarchy as shown in the fragment hierarchy
below, and where the dotted arrows indicate the existence of
intermediate types.
The fact that an object of type human is ultimately an object of type entity is expressed as human ⊑ entity. Furthermore, a property such as heavy can be said of objects of type human and of objects of type artifact, since human ⊑ physical, artifact ⊑ physical, and heavy(x :: physical).
As mentioned earlier, a strongly-typed ontology is assumed throughout this study. Usually this conjures up thoughts of massive amounts of knowledge that have to be hand-coded and engineered by experts. This is not at all what we are assuming here. In fact, the ontological structure we are assuming (and will discuss later on) is not massive at all, since most everyday concepts are actually just instances of the basic ontological types.
For example, there's nothing meaningful (i.e., sensible, regardless of whether it is true or false) that we can say in language about a 'racing car' that we cannot say about a car. Thus, as far as language understanding is concerned, the ontological type car belongs to the ontology, and 'racing car' is just an instance concept. With such an analysis, most everyday concepts are just instances of basic ontological types. This issue is related to a comment that J. Fodor once made, something to the effect that "to be a concept is to be locked to a word in the language". This is also in line with Fred Sommers' idea of applicability in his proposal about The Tree of Language. Gottlob Frege's idea of how a word gets its meaning, namely from all the different ways it can be used in language, is also consistent with the ontological structure we assume, which was discovered by reverse-engineering language itself. That is, what we can say about concepts tells us what structure lies behind.
We will discuss the details of the ontology later on, for now, we will simply assume that this
ontological structure exists.
about the ontological structure
According to the above, in our framework we assume a Platonic universe that includes everything
we can talk about in ordinary discourse, including abstract objects such as events, states, properties,
etc. These ontological concepts exist as types in a strongly-typed ontology, and the logical concepts
are all the properties of, or the relations that can hold between, these ontological concepts. In addition
to logical and ontological concepts there are proper nouns, which are the names of objects; objects
that can be of any type. We use the notation

(∃₁Sheba :: thing)

to state that there is a unique object named Sheba, an object that is of type thing. With this basic machinery, let's consider the interpretation of the simple sentence 'Sheba is a thief', where 〚s〛 stands for 'the meaning of s', ⇒ is used to mean 'is interpreted as', and thief(x :: human) states that the property thief applies to objects that must be of type human:

(2) 〚Sheba is a thief〛
⇒ (∃₁Sheba :: thing)(thief(Sheba :: human))

Thus 'Sheba is a thief' is interpreted as follows: there is some unique object named Sheba, an object that is initially assumed to be a thing, such that the property thief is true of Sheba.
Note that in our interpretation (repeated below) Sheba is now associated with more than one type in a single scope.

(2) 〚Sheba is a thief〛 ⇒ (∃₁Sheba :: thing)(thief(Sheba :: human))

Initially unknown, and thus assumed to be an object of type thing, Sheba was later assigned the type human, when described by the property (or when in the context of being a) thief. In these situations a type unification must occur, and this is done as follows,

(Sheba :: (thing • human)) → (Sheba :: human)

where (s • t) denotes a type unification between the types s and t, and where → stands for 'unifies to'. Note that the unification of thing and human resulted in human since human ⊑ thing; that is, since an object that is of type human is ultimately an object of type thing. The final interpretation of 'Sheba is a thief' is now the following:

(2) 〚Sheba is a thief〛 ⇒ (∃₁Sheba :: human)(thief(Sheba))

In the final analysis 'Sheba is a thief' is simply interpreted as: there is a unique object named Sheba, an object that (we now know) must be of type human, and that object is a thief.
type unification – the basics
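A minimal sketch of the unification operation itself, assuming an illustrative toy fragment of the subsumption hierarchy: unification succeeds when one type subsumes the other, returning the more specific of the two, and fails (⊥) otherwise.

```python
# Immediate supertype of each type: an illustrative fragment only.
parent = {"human": "animal", "animal": "living", "living": "physical",
          "cat": "animal", "physical": "entity", "entity": "thing",
          "activity": "abstract", "abstract": "thing"}

def subsumes(t, s):
    """True if s is t itself or an ancestor of t (t ⊑ s)."""
    while t is not None:
        if t == s:
            return True
        t = parent.get(t)
    return False

BOTTOM = "⊥"

def unify(s, t):
    """(s • t): the more specific of two comparable types, else failure."""
    if subsumes(s, t):       # s ⊑ t : s is the more specific type
        return s
    if subsumes(t, s):       # t ⊑ s
        return t
    return BOTTOM            # incomparable types: unification fails

assert unify("thing", "human") == "human"    # (thing • human) → human
assert unify("human", "activity") == BOTTOM  # incomparable: fails
```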
Although we have interpreted a very simple sentence, we have already seen the power of embedding
ontological types (that exist in some strongly-typed hierarchy) into the powerful machinery
of logical semantics. Specifically, it was the type constraint on the property thief(x :: human), namely
that it applies to objects that must be of type human, that allowed us to discover the fact that Sheba must
be a human. Admittedly, this is a very trivial 'discovery', and in a very simple context. However, the power of type unification and the hidden information it will uncover will be more appreciated as we move on to more involved contexts.
Suppose black(x :: physical) and own(x :: human, y :: entity). That is, we are assuming that black can be said of all objects of type physical, and that objects of type human can own any object of type entity. With that, let us consider now the following:

(3) 〚Sara owns a black cat〛
⇒ (∃₁Sara :: thing)(∃c :: cat)(black(c :: physical)
∧ own(Sara :: human, c :: entity))

Thus 'Sara owns a black cat' is interpreted as follows: there is a unique thing named Sara, and some object c of type cat, such that c is black (and thus here it must be of type physical), and Sara owns c, where in this context Sara must be an object of type human and c an object of type entity.
Our interpretation for 'Sara owns a black cat' is repeated below.

(3) 〚Sara owns a black cat〛 ⇒ (∃₁Sara :: thing)(∃c :: cat)(black(c :: physical)
∧ own(Sara :: human, c :: entity))

Note now that, depending on the contexts they are mentioned in, Sara is assigned two types, and the object c is assigned three types. The type unifications that must occur in this situation are the following:

(Sara :: (thing • human)) → (Sara :: human)
(c :: ((physical • entity) • cat))
→ (c :: (physical • cat))
→ (c :: cat)

Note that the type unification ((physical • entity) • cat) is associative, so the order in which the two type unifications are done does not matter. The final interpretation of 'Sara owns a black cat' is therefore given by:

(3) 〚Sara owns a black cat〛 ⇒ (∃₁Sara :: human)(∃c :: cat)(black(c) ∧ own(Sara, c))

That is, there is a unique object named Sara, which is of type human, and some cat c, and Sara owns c.
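The associativity claim can be checked directly with a fold over the three types, reusing a toy unification sketch (the hierarchy fragment is illustrative):

```python
from functools import reduce

# Illustrative fragment of the subsumption hierarchy.
parent = {"cat": "animal", "animal": "living", "living": "physical",
          "physical": "entity", "entity": "thing"}

def subsumes(t, s):
    while t is not None:
        if t == s:
            return True
        t = parent.get(t)
    return False

def unify(s, t):
    """(s • t): the more specific of two comparable types, else failure."""
    if subsumes(s, t):
        return s
    if subsumes(t, s):
        return t
    return "⊥"

# ((physical • entity) • cat) → cat, regardless of grouping:
assert reduce(unify, ["physical", "entity", "cat"]) == "cat"
assert unify("physical", unify("entity", "cat")) == "cat"
```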
As mentioned in our introduction, in our framework ontological concepts include abstract objects such
as states, processes, events, properties, etc. Let us now consider one of these categories, namely
activities. In our framework a concept such as dancer(x) is true of some x according to the following:
(∀x :: human)(dancer(x)
≡ (∃d :: activity)(dancing(d) ∧ agent(d, x)))
That is, any object x of type human is a dancer iff there is some object d of type activity such that d is a
dancing activity, and x is the agent of d. Note that according to the above, there are at least two objects
that are part of the meaning of „dancer‟, and in particular, some object x of type human, and some
dancing activity, d. Thus, in saying „beautiful dancer‟, for example, one could be using „beautiful‟ to
describe the dancer, or the dancing activity itself. Consider now the interpretation below, assuming that
beautiful(x :: entity); that is, assuming beautiful is a property that can be said of any entity:
(4) 〚Sara is a beautiful dancer〛
⇒ (∃₁Sara :: thing)(∃a :: activity)
(dancing(a) ∧ agent(a :: activity, Sara :: human)
∧ (beautiful(a :: entity) ∨ beautiful(Sara :: entity)))
abstract objects
〚Sara is a beautiful dancer〛
⇒ (∃₁Sara :: thing)(∃a :: activity)
(dancing(a) ∧ agent(a :: activity, Sara :: human)
∧ (beautiful(a :: entity) ∨ beautiful(Sara :: entity)))
Thus 'Sara is a beautiful dancer' is interpreted as follows: there's a unique object named Sara, and some activity a, such that a is a dancing activity, and Sara is the agent of a (and as such must be an object of type human), and either the dancing is beautiful, or Sara is (or, of course, both). Note now that there are a number of type unifications that must occur:
(Sara :: ((thing • human) • entity)) → (Sara :: (human • entity)) → (Sara :: human)
(a :: (activity • entity)) → (a :: activity)
After all is said and done, the interpretation of (4) is the following:
(4) 〚Sara is a beautiful dancer〛
⇒ (∃₁Sara :: human)(∃a :: activity)
(dancing(a) ∧ agent(a, Sara)
∧ (beautiful(a) ∨ beautiful(Sara)))
Note that the ambiguity of what beautiful is describing is still represented in our final interpretation.
Thus far our type unifications have always succeeded. In some cases, however, a type unification between two types s and t could fail, and we write this as

(s • t) → ⊥

Let us see where this might occur and what this would result in. Consider the interpretation of 'Sara is a blonde dancer', where we assume blonde(x :: human); that is, we are assuming that blonde is a property that applies to objects that must be of type human.

(5) 〚Sara is a blonde dancer〛
⇒ (∃₁Sara :: thing)(∃a :: activity)
(dancing(a) ∧ agent(a :: activity, Sara :: human)
∧ (blonde(a :: human) ∨ blonde(Sara :: human)))

The type unifications needed for Sara are quite simple:

(Sara :: ((thing • human) • human)) → (Sara :: (human • human)) → (Sara :: human)

The type unification needed for the activity a, however, is not as straightforward. Before we continue, let us plug in the type unification of Sara to see where we're at.
failed type unifications
a brief detour
Before we continue with our proposal, we would like to illustrate the utility of separating concepts into
logical and ontological concepts. We will do this here by proposing a solution to the so-called Paradox
of the Ravens. Introduced in the 1940's by the logician (and one-time assistant of Rudolf Carnap) Carl Gustav Hempel, the Paradox of the Ravens (or Hempel's Paradox, or the Paradox of Confirmation) has continued to occupy logicians, statisticians, and philosophers of science to this day. The paradox arises when one considers what constitutes evidence for a statement (or hypothesis). To illustrate the Paradox of the Ravens, consider the following:
(H1) All ravens are black
(H2) All non-black things are not ravens
That is, we have the hypothesis H1 that 'All ravens are black'. This hypothesis, however, is logically equivalent to the hypothesis H2 that 'All non-black things are not ravens', as shown below:

(∀x)(raven(x) → black(x)) ≡ (∀x)(¬black(x) → ¬raven(x))

H1 and H2 are logically equivalent, thus any evidence/observation that confirms H1 must also confirm H2, and vice versa. While it sounds reasonable that observing black ravens should confirm H1, observing a white ball, or a red sofa, which do confirm H2, also confirms the logically equivalent hypothesis that all ravens are black, which does not sound plausible.
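The logical equivalence of the two hypotheses is easy to verify exhaustively; the Python sketch below (our own sanity check) evaluates both universal statements under every assignment of raven and black over a small domain:

```python
from itertools import product

def h1(domain, raven, black):
    """All ravens are black."""
    return all(black[x] for x in domain if raven[x])

def h2(domain, raven, black):
    """All non-black things are not ravens (the contrapositive)."""
    return all(not raven[x] for x in domain if not black[x])

# Exhaustively check equivalence over every model on a 4-element domain.
domain = range(4)
for bits in product([False, True], repeat=2 * len(domain)):
    raven = dict(zip(domain, bits[:len(domain)]))
    black = dict(zip(domain, bits[len(domain):]))
    assert h1(domain, raven, black) == h2(domain, raven, black)
```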
what paradox of the ravens?
a temporary diversion
Observing black ravens confirms hypothesis H1, namely that 'All ravens are black' – the case in (a). Observing non-black objects that are not ravens, as in (b), however, confirms hypothesis H2 (that all non-black things are not ravens). But H2 is logically equivalent to H1, leaving us with the unpleasant conclusion that observing red apples, blue suede shoes, or brown briefcases confirms the hypothesis that 'All ravens are black'.
Many solutions have been proposed to the Paradox of the Ravens that range from accepting the
paradox (that observing red apples and other non-black non-ravens does confirm the hypothesis „All
ravens are black‟) to proposals in the Bayesian tradition that try to measure the „degree‟ of confirmation.
The Bayesian proposals essentially amount to proposing that observing a red apple does confirm the
hypothesis „All ravens are black‟ but it does so very minimally, and certainly much less than the
observation of a black raven confirms „All ravens are black‟. Clearly, this is not a satisfactory solution
since observing a red flower should not contribute at all to the confirmation of „All ravens are black‟.
Worse, in the Bayesian analysis, the observation of black but non-raven objects actually negatively
confirms (or disconfirms) the hypothesis that „All ravens are black‟.
One logician who stands out in suggesting an explanation for the Paradox of the Ravens is W. V. Quine, who suggested (in 'Natural Kinds') that there is no paradox in the first place, since universal statements of the form All Fs are Gs can only be confirmed on what he called natural kinds, and 'non-black things' and 'non-ravens' are not natural kinds. Basically, for Quine, members of a natural kind must share most of their properties, and there's hardly anything similar between all 'non-black things', or all non-ravens. While statistical/Bayesian and other logical proposals have still not suggested a reasonable explanation for the Ravens Paradox, we believe that the line of thought Quine was pursuing is the most appropriate. However, Quine's natural kinds were not well-defined. In fact, what Quine was alluding to, probably, was that there is a difference between what we have called here logical concepts and ontological concepts.
The so-called Paradox of the Ravens exists simply because of mistakenly representing both ontological
and logical concepts by predicates, although, ontologically, these two types of concepts are quite
different. First, let us discuss some predicates and how we usually represent them in first-order logic.
Consider the following:

(1) black(x)
(2) imminent(x)
(3) sympathetic(x)
(4) hungry(x)
(5) dog(x)
(6) guitar(x)
Suppose now that we would like to add types to our variables. That is, we would like our logical expressions to be, in computer programming terminology, strongly-typed. Suppose, further, that we would also like our predicates to be polymorphic; that is, they apply to objects of a certain type and to all of their subtypes: if a predicate applies to objects of type vehicle, then it applies to all subtypes of vehicle (e.g., car, truck, bus, …). Given this, what are the appropriate types that one might associate with the variables of the predicates above? Here are some possible type assignments:

(1) black(x :: physical)
(2) imminent(x :: event)
(3) sympathetic(x :: human)
(4) hungry(x :: animal)
What the above suggests is that, ignoring metaphor for the moment, the predicate black applies to
objects that are of type physical. In other words, black is meaningless (or nonsensical) when applied
to (or said of) objects that are not of type physical. Similarly, the above says that imminent is said of
objects that are of type event (and, of course, all its subtypes, so we can say 'an imminent trip', 'an imminent meeting', 'an imminent election', etc.). In the same vein, the above says that sympathetic is
said of objects that must be of type human, and that hungry applies to objects of type animal. But
how about the predicates in (5) and (6)? What are the most appropriate types that can be associated
with the variables in the predicates dog(x) and guitar(x), or of what types of objects can these
predicates be meaningful? The only plausible answer seems to be the following:

(5) dog(x :: dog)
(6) guitar(x :: guitar)

But (5) and (6) are obvious tautologies, since, for example, the predicate dog applied to an object of type dog is always true. Clearly, then, (5) and (6) are quite different from the predicates in (1) through (4): while the predicates in (1) through (4) are logical concepts, dog and guitar are not predicates/logical concepts, but ontological concepts that correspond to types in a strongly-typed ontology. With this background, let us now go back to the so-called Paradox of the Ravens.
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
the framework
failed
type
unifications
Copyright © 2017 WALID S. SABA
salient properties/relations
ontological semantics: contents
the road ahead
the proposal
word-sense disambiguation
Let us now look at situations where lexical ambiguities translate into ambiguities in both logical and ontological concepts. Consider the sentences in (10) and (11):

(10) Melinda ran for twenty minutes.
(11) The program ran for twenty minutes.

First of all, there is a clear ambiguity in the meaning of 'program', as it could refer to a computer program (i.e., a process), or to a program of some event, among other meanings. Second, it is clear that the running of Melinda in (10) is different from the running of the program in (11). Let us consider the simpler of these two cases, namely the ambiguity in (10), assuming that there are (at least) two kinds of running activities, one whose agent is a (legged) animal, and one whose agent is a process:
What the above says is the following: there's a unique object named Melinda, some twenty minutes that Melinda ran, and either a running activity of some human, or the running of some process.
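A toy Python sketch of how such type constraints can drive sense selection (the sense names, hierarchy fragment, and type assignments are illustrative, not the notation used in these slides):

```python
# Illustrative hierarchy fragment and lexicon.
parent = {"human": "animal", "animal": "living", "living": "physical",
          "process": "abstract"}

def subsumes(t, s):
    """True if s is t itself or an ancestor of t (t ⊑ s)."""
    while t is not None:
        if t == s:
            return True
        t = parent.get(t)
    return False

# Each sense of 'run' constrains the type of its agent.
run_senses = {"run/locomotion": "animal", "run/execution": "process"}
entity_types = {"Melinda": "human", "the program": "process"}

def admissible_senses(subject):
    """Keep only the senses whose agent-type constraint the subject satisfies."""
    t = entity_types[subject]
    return {sense for sense, req in run_senses.items() if subsumes(t, req)}

assert admissible_senses("Melinda") == {"run/locomotion"}
assert admissible_senses("the program") == {"run/execution"}
```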
fragment of the ontology
The corner table
wants another beer
Tables have ‘wants’, and they drink beer?!
To be continued ...

  • 1. Deep Misconceptions and the Myth of Data-Driven Language Understanding: On Putting Logical Semantics Back to Work
  • 2. IMMANUEL KANT: "Everything in nature, in the inanimate as well as in the animate world, happens according to some rules, though we do not always know them." RICHARD MONTAGUE: "I reject the contention that an important theoretical difference exists between formal and natural languages." JERRY HOBBS: "One can assume a theory of the world that is isomorphic to the way we talk about it… in this case, semantics becomes very nearly trivial."
  • 3. Early efforts to find theoretically elegant formal models for various linguistic phenomena did not result in any noticeable progress, despite nearly three decades of intensive research (from the late 1950s through the late 1980s). As the various formal (and in most cases mere symbol-manipulation) systems seemed to reach a deadlock, disillusionment with the brittle logical approach to language processing grew, and a number of researchers and practitioners in natural language processing (NLP) started to abandon theoretical elegance in favor of attaining some quick results using empirical (data-driven) approaches. All of this seemed natural and expected. In the absence of theoretically elegant models that could explain a number of NL phenomena, it was quite reasonable to find researchers shifting their efforts to finding practical solutions for urgent problems using empirical methods. By the mid-1990s, a data-driven statistical revolution that was already brewing took the field of NLP by storm, putting aside all efforts that were rooted in over 200 years of work in logic, metaphysics, grammars and formal semantics. We believe, however, that this trend has overstepped the noble cause of using empirical methods to find reasonably working solutions for practical problems. In fact, the data-driven approach to NLP is now believed by many to be a plausible approach to building systems that can truly understand ordinary spoken language. This is not only a misguided trend, but a very damaging development that will hinder significant progress in the field. In this regard, we hope this study will help start a sane, and an overdue, semantic (counter) revolution. (a spectre is haunting NLP, February 7, 2017)
  • 5. About the resurgence of, and the currently dominant paradigm in, ‘AI’ …
  • 6. The availability of huge amounts of data, coupled with advances in computer hardware and distributed computing, resulted in some advances in certain types of (data-centric) problems (image, speech, fraud detection, text categorization, etc.)
  • 7. But … many problems in AI require understanding that is beyond discovering patterns in data
  • 8. Identifying an adult female in an image is a data-centric problem that might be suitable for data-driven image recognition systems. However, inferring which of the two is a photo of a teacher and which is a mother requires information that is not (always!) in the data.
  • 9. Which picture would a data-driven image recognition system pick out for a query like ‘musical band’?
  • 10. And which picture would a data-driven image recognition system pick out for a query like ‘musician’, i.e., a person who plays a musical instrument?
  • 11. So what is at issue here? The issue is that, ontologically, there are no musicians, teachers, lawyers, or even mothers! What exists, ontologically (metaphysically), are humans; a concept such as ‘musician’ is a logical concept that might be true of a certain human. Quantitative/data-driven approaches can only reason with (detect, infer, recognize) objects that are of an ontological type; they cannot detect logical concepts, which form the majority of the objects of (human) thought.
  • 12. ONTOLOGICAL CONCEPTS: human, ... LOGICAL CONCEPTS: lawyer, dancer, teacher, mother, ...
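The distinction drawn on this slide can be sketched in code (a minimal illustration under our own assumptions; the class and predicate names are ours, not the study's formalism): an ontological type such as 'human' is what exists, while 'musician' or 'teacher' are logical concepts, i.e., predicates that may or may not hold of a particular human.

```python
# Minimal sketch: an ontological type vs. logical concepts (predicates).
# All names here are illustrative assumptions, not the study's formalism.

from dataclasses import dataclass, field

@dataclass
class Human:                          # ontological type: what exists
    name: str
    roles: set = field(default_factory=set)

def musician(h: Human) -> bool:       # logical concept: a predicate over humans
    return "plays_instrument" in h.roles

def teacher(h: Human) -> bool:        # another logical concept
    return "teaches" in h.roles

sara = Human("Sara", {"plays_instrument", "teaches"})
jon = Human("Jon", {"plays_instrument"})

# Both are humans (an ontological fact); which logical concepts hold of
# each differs (logical facts about objects of the same ontological type).
print(musician(sara), teacher(sara), musician(jon), teacher(jon))
```

Note that nothing of ontological type ‘musician’ exists in this sketch: being a musician is a property that happens to be true of some humans, which is exactly why a purely extensional, data-driven detector of kinds has nothing to latch onto.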
  • 13. Failure to distinguish between logical and ontological concepts is not only a flaw of data-driven approaches; logical/formal semantics also failed to provide adequate models for natural language, and for exactly the same reason.
  • 14. Notwithstanding achievements in data-centric tasks (e.g., image and speech recognition, or numerically specifiable and finite-space problems, such as the game of Go), statistical and other data-driven models (e.g., neural networks) cannot model human language comprehension, because these models cannot explain, model or account for very important phenomena in ordinary spoken language, such as: • Non-Observable (thus Non-Learnable) Information • Intensionality and Compositionality • Inferential Capacity
  • 15. What this study is not about: Criticisms of the statistical data-driven approach to language understanding are very often automatically associated with the Chomskyan school of linguistics. At best, this is a misinformed judgement (although in many cases, it is ill-informed). There is a long history of work in logical semantics (a tradition that forms the background to the proposals we will make here) that has very little to do (if anything at all) with Chomskyan linguistics. Notwithstanding Chomsky's (in our opinion valid) Poverty of the Stimulus (POS) argument, an argument that clearly supports the claim of some kind of innate linguistic abilities, we believe that Chomskyans put too much emphasis on syntax and grammar (which ironically made their theory vulnerable to criticism from the statistical and data-driven school). Instead, we think that syntax and grammar are just the external artifacts used to express internal, logically coherent, semantic, and compositionally and productively (i.e., recursively) constructed thoughts, something that is perhaps analogous to Jerry Fodor's Language of Thought (LOT). Here we should also mention that we agree somewhat with M. C. Corballis ('The Recursive Mind') that it is thought that brought about the external tool we call language, and not the other way around.
  • 16. What this study is not about: Another association that criticism of the statistical and data-driven approaches to NLU often conjures up is that of building large knowledge bases with brittle rule-based inference engines. This is perhaps the biggest misunderstanding, held not only by many in the statistical and data-driven camp, but also by previously over-enthused knowledge engineers who mistakenly believed at one point that all that is required to crack the NLU problem is to keep adding more knowledge and more rules. We do not subscribe to such theories either. In fact, regarding the above, we agree with an observation once made by the late John McCarthy (at IJCAI 1995) that building ad-hoc systems by simply adding more knowledge and more rules will result in building systems that we don't even understand. Ockham's Razor, as well as observing the linguistic skills of 5-year-olds, should both tell us that the conceptual structures that might be needed in language understanding should not, in principle, require all that complexity. As will become apparent later in this study, the conceptual structures that speakers of ordinary spoken language have access to are not as massive and overwhelming as is commonly believed. Instead, it will be shown that the key is in the nature of that conceptual structure and the computational processes involved.
  • 17. What this study is not about: FINALLY, our concern here is with introducing a plausible model for natural language understanding (NLU). If your concern is natural language processing (NLP), as it is used, for example, in applications such as: word-sense disambiguation (WSD); entity extraction/named-entity recognition (NER); spam filtering, categorization, classification; semantic/topic-based search; word co-occurrence/concept clustering; sentiment analysis; topic identification; automated tagging; document clustering; summarization; etc., then it is best if we part ways at this point, since this is not at all our concern here. There are many NLP and text processing systems that already do a reasonable job on such data-level tasks. In fact, I am part of a team that developed a semantic technology that does an excellent job on almost all of the above, but that system (and similar systems) are light years away from doing anything remotely related to what can be called natural language understanding (NLU), which is our concern here.
  • 18. What this study is about: 1. WE WILL ARGUE THAT purely data-driven extensional models that ignore intensionality, compositionality and inferential capacities in natural language are inappropriate, even when the relevant data is available, since higher-level reasoning (the kind that is needed in NLU) requires intensional reasoning beyond simple data values. 2. WE WILL ARGUE THAT many language phenomena are not learnable from data because (i) in most situations what is to be learned is not even observable in the data (or is not explicitly stated but is implicitly assumed as 'shared knowledge' by a language community); or (ii) in many situations there is no statistical significance in the data, as the relevant probabilities are all equal. 3. WE WILL ARGUE THAT the most plausible explanation for a number of phenomena in natural language is rooted in logical semantics, ontology, and the computational notions of polymorphism, type unification, and type casting; and we will do this by proposing solutions to a number of challenging and well-known problems in language understanding.
  • 19. More specifically: We will propose a plausible model rooted in logical semantics, ontology, and the computational notions of polymorphism, type casting and type unification. Our proposal provides a plausible framework for modelling various phenomena in natural language, and specifically phenomena that require reasoning beyond the surface structure (external data). To give a hint of the kind of reasoning we have in mind, consider the following sentences: (1) a. Jon enjoyed the movie / b. Jon enjoyed watching the movie; (2) a. A small leather suitcase was found unattended / b. A leather small suitcase was found unattended; (3) a. The ham sandwich wants another beer / b. The person eating the ham sandwich wants another beer; (4) a. Dr. Spok told Jon he should soon be done with writing the thesis / b. Dr. Spok told Jon he should soon be done with reading the thesis. Our model will explain why (1a) is understood by all speakers of ordinary language as (1b); why speakers of multiple languages find (2a) more natural to say than (2b); why we all understand (3a) as (3b); and why we effortlessly resolve 'he' in (4a) to Jon and 'he' in (4b) to Dr. Spok. Before we do so, however, we will discuss some serious flaws in proposing a statistical and data-driven approach to NLU.
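To give a flavor of the type-casting reasoning hinted at by example (1), here is a toy sketch (entirely our own construction; the type labels and the lookup table are illustrative assumptions, not the study's actual formalism). The idea: 'enjoy' expects an activity as its object, so when it receives a plain entity such as a movie, a coercion step casts the entity into the salient activity associated with it, recovering the implicit 'watching'.

```python
# Toy sketch of type casting: "Jon enjoyed the movie" is understood as
# "Jon enjoyed [watching] the movie". The SALIENT_ACTIVITY table and the
# type labels are illustrative assumptions, not the study's proposal.

SALIENT_ACTIVITY = {       # entity -> activity commonly done with it
    "movie": "watching",
    "apple pie": "eating",
    "bridge": "playing",
}

def unify(expected_type: str, obj: str, obj_type: str) -> str:
    """If a verb expects an Activity but receives an Entity, cast the
    entity into its salient associated activity (type coercion)."""
    if expected_type == "Activity" and obj_type == "Entity":
        activity = SALIENT_ACTIVITY.get(obj)
        if activity is not None:
            return f"{activity} the {obj}"   # recovered [missing text]
    return f"the {obj}"

print("Jon enjoyed " + unify("Activity", "movie", "Entity"))
```

A real account would of course derive the salient activity from an ontology rather than a hard-coded table; the sketch only shows where in the composition the type mismatch is detected and repaired.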
  • 20. Understanding language by analyzing data? What if the relevant information is not even in the data?
  • 21. It's not even in the data: analyzing missing text? Challenges in the computational comprehension of ordinary text are often due to quite a bit of missing text: text which is not explicitly stated but is often assumed as shared knowledge among a community of language users. Consider for example the sentences in (1): (1) a. Don't worry, Simon is a rock. b. The truck in front of us is annoying me. c. Carlos likes to play bridge. d. Mary enjoyed the apple pie. e. Jon owns a house on every street in the village. Clearly, speakers of ordinary English understand the above as: (2) a. Don't worry, Simon is [as solid as] a rock. b. The [person driving the] truck in front of us is annoying me. c. Carlos likes to play [the game] bridge. d. Mary enjoyed [eating] the apple pie. e. Jon owns a [different] house on every street in the village. Since such sentences are quite common, and are not at all exotic, far-fetched, or contrived, any model for NLU must clearly somehow 'uncover' this [missing text] for a proper understanding of what is being said. What is certain here is that data-driven approaches are helpless in this regard, since a crucial part of understanding NL text is not only interpreting the data, but 'discovering' what is missing from the data.
  • 22. Again, let us consider the sentences below, where there is some [missing text] that is not explicitly stated in everyday discourse, but is often implicitly assumed: a. Don't worry, Simon is [as solid as] a rock. b. The [person driving the] truck in front of us is annoying me. c. Carlos likes to play [the game] bridge. d. Mary enjoyed [eating] the apple pie. e. Jon owns a [different] house on every street in the village. Although the above seem to have a common denominator, namely some missing text that is often implicitly assumed, it is somewhat surprising that in the literature the missing-text phenomenon has been studied quite independently and under different labels: metaphor (a), metonymy (b), lexical ambiguity (c), ellipsis (d), and quantifier scope ambiguity (e).
  • 23. In ordinary spoken language there is more than missing (and implicitly assumed) text … When surface data probabilities are all equally likely, we often resort to our shared (commonsense) knowledge to resolve certain types of ambiguities (e.g., in reference resolution)
  • 24. One of the most obvious challenges to statistical and data-driven NLU is the set of situations where there does not seem to be any statistical significance in the observed data that can help in making the right inferences. As an example, consider the sentences in (1) and (2). (1) The trophy did not fit in the brown suitcase because it was too a. big b. small (2) Dr. Spok told Jon that he should soon be done a. writing his thesis b. reading his thesis For a speaker of ordinary language, the decisions as to what 'it' in (1) and 'he' in (2) refer to are immediately obvious, even to a 5-year-old. On the other hand, a statistical data-driven approach would be helpless in making such decisions, since the only differences between the sentence pairs in (1) and (2) are words that co-occur with equal probabilities (this is so because antonyms or opposites, such as big/small, night/day, hot/cold, read/write, open/close, etc., have been shown to co-occur in text with equal frequency). Clearly, then, references such as those in (1) and (2) must be resolved using information that is not (directly) in the data. probabilities are all equal
  • 25. In the absence of any statistical significance in the data, we have suggested above that references such as those in sentences (1) and (2) are resolved by relying on other information that is not (directly) in the data. It might still be suggested, however, that a learning algorithm could create statistical significance between (1a) and (1b), for example, if probabilities of some composites in the sentence (as opposed to the atomic units) are considered. What this would essentially require is creating a composite feature for every possible relation. In (1), we would need at least the following: trophy-fit-in-suitcase-small trophy-fit-in-suitcase-big trophy-not-fit-in-suitcase-small trophy-not-fit-in-suitcase-big Note here that since data-driven approaches also do not admit the existence of a type hierarchy (or any knowledge structure, for that matter) – i.e., there is nothing that says that a Trophy and a Radio are both subtypes of an Artifact, and that a Purse and a Suitcase are both subtypes of some Container, where the 'fit' relation applies similarly to both – other features (e.g., radio-fit-in-purse-small) would also be needed to learn how to resolve the reference 'it' in (1).
  • 26. Again, in the absence of a type hierarchy (or some other source of information), statistical significance can only be salvaged if composite features are constructed for every possible relation in a meaningful sentence. Such a story leads us to something like this: trophy-fit-in-suitcase-small trophy-fit-in-suitcase-big trophy-not-fit-in-suitcase-small trophy-not-fit-in-suitcase-big radio-fit-in-purse-small radio-fit-in-purse-big radio-not-fit-in-purse-small radio-not-fit-in-purse-big etc. Although the point can be made with the above, the story in reality is much worse, as there are more 'nodes' that must be combined in these features to capture statistical significance. For example, if 'because' were changed to 'although' in (1b), then 'it' would suddenly refer to the trophy. The question now is: how many such features would eventually be needed, if every meaningful sentence requires a handful of composite features to capture all statistical correlations? Fodor and Pylyshyn (1988) suggest that that number is on the order of the number of seconds in the history of the universe, citing an experiment conducted by the psycholinguist George Miller.
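The multiplicative blow-up described above can be sketched in a few lines. The vocabulary below is a tiny hypothetical toy; a real lexicon has thousands of entries per slot, but the counting argument is identical:

```python
from itertools import product

# Hypothetical toy vocabulary; real language has thousands of entries per slot.
objects = ["trophy", "radio", "book"]
polarity = ["fit", "not-fit"]
containers = ["suitcase", "purse", "box"]
adjectives = ["big", "small"]
connectives = ["because", "although"]

# One composite feature per combination, e.g. 'trophy-not-fit-in-suitcase-big'
features = ["-".join(c)
            for c in product(objects, polarity, ["in"], containers, adjectives)]
print(len(features))  # 3 * 2 * 1 * 3 * 2 = 36

# The connective must be folded in too ('because' vs. 'although' flips the
# referent), multiplying the count yet again:
features = ["-".join(c)
            for c in product(objects, polarity, ["in"], containers,
                             adjectives, connectives)]
print(len(features))  # 72
```

Every additional slot multiplies, rather than adds to, the feature count, which is the combinatorial explosion the argument turns on.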
  • 27. Incidentally, in the absence of any external knowledge structures, the combinatorially implausible explosion in the number of features needed by a statistical data-driven (i.e., bottom-up) learner would also be needed by a top-down learner, one that learns by being told (or by instruction). Specifically, a top-down learner would ask for some number n of clarifications in every sentence, requiring therefore a total of n^m clarifications for a paragraph with m sentences. The reader can now easily work out how many clarifications would be required for a top-down learner to understand just a small paragraph¹. The point here is that whether the learner tries to discover what is missing bottom-up (from the data) or top-down (by being told), the infinity lurking in language (due to the recursive productivity of thoughts) makes learning the various language phenomena from data alone a computationally implausible theory. a top-down explanation ¹ The reason a top-down learner would need (n × n), as opposed to (n + n), clarifications for two consecutive sentences where each requires n is that the preferences of one sentence are subject to revision in the context of the previous and/or the following sentence. This is so because, linguistically, it is paragraphs, not sentences, that are the smallest linguistic units that can be fully interpreted on their own and should not (in theory) require any additional text to be fully understood. See The Semantics of Paragraphs (Zadrozny & Jenssen, 19xx) for an excellent treatment of the subject.
  • 28. Our argument against statistical data-driven approaches in NLU is not meant to dismiss the role of statistical/probabilistic reasoning in language understanding. That would, of course, be unwise. Our argument is about which probabilities are relevant in language understanding. Consider, for example, the following: (1) The town councilors refused to give the demonstrators a permit because they advocated violence and anarchy. (2) A young teenager fired several shots at a policeman. Eyewitnesses say he immediately fled away. While the most likely reading of (1) has 'they' referring to the demonstrators, one can imagine a scenario where a group of anarchist town councilors refused to give the demonstrators a permit specifically to incite violence and anarchy. Similarly, while the most likely reading of (2) is the one where 'he' refers to the young teenager, one can imagine a scenario where a slightly wounded policeman fled to escape further injuries. Obviously such occurrences are rare, and thus, in the absence of other information, the pragmatic probability of the usual reading wins out for speakers of ordinary language. What is important to note here is that the likelihoods we are speaking of are a function of pragmatics and have nothing to do with anything observed in the data. pragmatic probabilities
  • 29. To summarize this argument, consider the examples below. At the data level, references can be resolved during syntactic analysis using simple NUMBER or GENDER data. At the information level, the resolution requires semantic (type) information, for example that corporations, and not lawsuits, settle a case out of court; note that at this level not all referents remain possible once the type constraints are applied. It is exactly at the pragmatic level that probabilistic/statistical reasoning factors in, since at this level the referents are all possible, yet some are more probable than others (e.g., it is more likely that the one who fell down is the one who was shot, etc.) pragmatic probabilities
REFERENCES RESOLVED BY SYNTAX (data level): John informed Mary that he passed the exam. / John told Steve and Diane that they were invited to the party.
REFERENCES RESOLVED BY SEMANTICS (information level): There are a number of lawsuits between Apple and Samsung, and a. both say they are more about values than patents and money. b. both say they are ready to settle out of court.
REFERENCES RESOLVED BY PRAGMATICS (knowledge level): A young teenager fired several shots at a policeman. Eyewitnesses say he immediately fled away.
REFERENCES CANNOT BE RESOLVED (intentional level – intention not clear): John told Bill that he has been nominated to head the committee.
it's not even in the data
  • 30. Perhaps chief among the "it's not even in the data" phenomena is that of Adjective-Ordering Restrictions (AORs), a phenomenon illustrated by the examples below: (1) a. Carlos is a polite young man b. #Carlos is a young polite man (2) a. A small brown suitcase was found unattended b. #A brown small suitcase was found unattended The readings in (1a) and (2a) are clearly preferred by speakers of ordinary spoken language over the readings in (1b) and (2b), although there are no rules that speakers of ordinary language seem to be following. What makes the AOR phenomenon even more intriguing is the fact that these preferences are consistently made across multiple languages. First of all, this phenomenon presents a paradigmatic challenge to the statistical, data-driven story about language learning, as it does not seem that speakers come to have these preferences by observing and analyzing data. Furthermore, there does not seem to be a pattern in the observed data suggesting which adjectives should precede or follow other adjectives. For example, while it is preferred that 'small' precede 'brown' in (2), in (3) 'small' is no longer preferred as the first adjective: (3) A beautiful small suitcase was found unattended innate preferences?
  • 31. The most crucial challenge that adjective-ordering restrictions pose to data-driven NLU is to explain how beautiful in (4a) could be describing Olga's dancing as well as Olga as a person, while this reading is not available in (4b): (4) a. Olga is a tall beautiful dancer b. Olga is a beautiful tall dancer We will see later why beautiful in (4b) can no longer modify Olga's dancing (an abstract entity of type Activity) after it has been polymorphically cast into describing a physical object. For now we want to note that while various investigations on large corpora have not yielded any plausible explanation of what governs these adjective-ordering restrictions, we argue that even if some patterns were to be discovered, the more important question is: what is behind this phenomenon – i.e., what is it that makes us have these ordering preferences, and across multiple languages? In our opinion, what is behind this phenomenon must be much deeper than the outside (observable) data of any language. In fact, we believe that a plausible account of this phenomenon must shed some light on the conceptual structures and the processes that are operating in language. As stated above, a plausible explanation for this puzzle, one that is rooted in ontology, polymorphism, type unification and type casting, will be suggested later in this study.
  • 32. We have thus far argued that, in the absence of some process or other source of information, a number of phenomena in natural language understanding cannot be observed, captured, or learned by simply analyzing the external linguistic data alone. From adjective-ordering restrictions, which seem to be not only data-independent but even language-independent, to the missing (not explicitly stated) text that must somehow be discovered and interpreted, to situations where the probabilities in the data are statistically insignificant, it is clear that data-driven approaches to NLU are inappropriate. Before we get into our proposals, however, we will next have a brief discussion of intensions, and of how data alone, even when available, is not enough for high-level reasoning, the kind that is needed in NLU.
  • 33. data is (in the end) just data, no matter how big: extensions and intensions
  • 34. What do we mean when we write an equality like this? (1) (A ∧ (B ∨ C)) = (A ∧ B) ∨ (A ∧ C) Clearly, as objects (e.g., as logical circuits) the expressions in (1) are not the same. For example, a logical circuit corresponding to the expression on the left-hand side has only two gates, while a circuit for the expression on the right-hand side would have three. It would seem, then, that at some level equality in data only is not enough, and saying two objects are the same is different from saying they are equal (in their data value). In some contexts, as will be seen shortly, these differences are crucial. What is crucial to our discussion here is that data-driven approaches deal with data only; that is, equality in that paradigm is equality of one attribute, namely the final value. Thus, if it does turn out that equality of data alone is not enough in high-level reasoning (e.g., in NLU), then data-driven approaches to NLU would, again, clearly be inappropriate. Let us therefore take a closer look at the equality most of us know, and the related notions of intensions and extensions, notions that some of the most penetrating minds in mathematical logic have studied for nearly two centuries. data and intensions
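The distinction can be made concrete with a small sketch: two Boolean expression trees (a hypothetical toy encoding, not any particular circuit library) that agree on every input, yet differ as objects in their gate count:

```python
from itertools import product

class Node:
    """A tiny Boolean expression tree: 'var' leaves, 'and'/'or' gates."""
    def __init__(self, op, *kids):
        self.op, self.kids = op, kids

    def eval(self, env):
        if self.op == "var":
            return env[self.kids[0]]
        vals = [k.eval(env) for k in self.kids]
        return all(vals) if self.op == "and" else any(vals)

    def gates(self):
        # number of logical gates (non-leaf nodes) in the circuit
        if self.op == "var":
            return 0
        return 1 + sum(k.gates() for k in self.kids)

V = lambda name: Node("var", name)
lhs = Node("and", V("A"), Node("or", V("B"), V("C")))                       # A ∧ (B ∨ C)
rhs = Node("or", Node("and", V("A"), V("B")), Node("and", V("A"), V("C")))  # (A ∧ B) ∨ (A ∧ C)

# Extensionally equal: same truth table over all 8 assignments
same_value = all(
    lhs.eval(dict(zip("ABC", bits))) == rhs.eval(dict(zip("ABC", bits)))
    for bits in product([False, True], repeat=3)
)
print(same_value, lhs.gates(), rhs.gates())  # True 2 3
```

Both sides always compute the same value, yet as objects one has two gates and the other three: equal, but not the same.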
  • 35. Our grade school teachers once told us that √256 = 16 Can we always equate the two, and replace the data value 16 by the data value √256? Let's see …
  • 36. Mary taught her little brother that 7 + 9 = 16 Now if we blindly follow what our grade school teachers told us, namely that √256 = 16, we should be able to replace 16 by √256 without any problem. But if we do that we would then be able to alter reality and come up with: Mary taught her little brother that 7 + 9 = √256. What happened? Were we taught the wrong thing when we were told that √256 = 16? Not exactly, but we were also not told the whole story. Our grade school teachers did not know we would end up working in AI and NLU. If they did, they would have told us that extensional (data-only) equality is not sufficient in high-level reasoning, and that if it is equated with sameness at that level it can easily lead to false conclusions. here's a snapshot of some reality
  • 37. Objects such as 7 + 9, √256, and 16 are in fact equal, but in regard to one attribute only, namely their data value. As objects, however, they are not the same, as they differ in many other attributes, for example in the number of operators and the number of operands. Note, however, that the attributes value, no-of-operators, and no-of-operands are still not enough to establish true intensional equality between such objects. At a minimum, true (intensional) equality between these objects would require the equality of at least four attributes: value, no-of-operators, no-of-operands, and syntax-tree. equality and sameness In many domains where the only relevant attribute is the data (value), working with extensional (data) equality only might be enough. In tasks that require high-level reasoning, such as NLU, however, this will lead to contradictions and false conclusions, as the example of Mary and her little brother clearly demonstrates.
  • 38. As an aside … Reducing equality of objects to equality of one extensional attribute, namely the data value, is what is behind the so-called adversarial examples in deep neural networks, where small perturbations in the image (of a kind that would not lead the human eye to a different classification) cause the network to classify the image in a completely different category. The same is true in the converse case, where a completely meaningless image (a blob of pixels) is classified with high certainty as a real-life object. That is, behind both of these phenomena is something similar to the fact that √256 is not always (and in all contexts) interchangeable with 7 + 9, although both evaluate to the same data value. (Bottom line: extensional, data-only equality is not enough in high-level reasoning.)
  • 39. Beyond grade school, we were told in high school that two functions, f and g, are equal (are the same) if for every input they produce the same output. In notation: f = g iff f(x) = g(x) for every input x. But our high school teachers did not tell us the whole truth either: if two functions are equal whenever they agree on their input-output pairings, then MergeSort and InsertionSort would be the same object, since for any sequence, MergeSort(sequence) = InsertionSort(sequence). But computer scientists know that although their output values are always the same (that is, they are extensionally equal), MergeSort and InsertionSort are not the same objects, as they differ in many other (and very important) attributes – for example in their space and time complexity. yet another example
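A minimal sketch of this point, with hypothetical comparison-counting versions of the two algorithms: both always produce the same output (extensional equality), but at visibly different cost, an intensional attribute invisible in the output:

```python
def insertion_sort(xs):
    """Return a sorted copy of xs along with the number of comparisons made."""
    xs, comps = list(xs), 0
    for i in range(1, len(xs)):
        j = i
        while j > 0:
            comps += 1
            if xs[j - 1] <= xs[j]:
                break
            xs[j - 1], xs[j] = xs[j], xs[j - 1]
            j -= 1
    return xs, comps

def merge_sort(xs):
    """Return a sorted copy of xs along with the number of comparisons made."""
    xs = list(xs)
    if len(xs) <= 1:
        return xs, 0
    mid = len(xs) // 2
    left, cl = merge_sort(xs[:mid])
    right, cr = merge_sort(xs[mid:])
    merged, comps, i, j = [], cl + cr, 0, 0
    while i < len(left) and j < len(right):
        comps += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    return merged, comps

data = list(range(50, 0, -1))   # reversed input: worst case for insertion sort
m_out, m_comps = merge_sort(data)
i_out, i_comps = insertion_sort(data)
print(m_out == i_out)           # True: extensionally equal
print(m_comps < i_comps)        # True: intensionally different (cost)
```

On this input insertion sort performs the full quadratic 1225 comparisons while merge sort stays near n log n, even though no input can ever tell their outputs apart.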
  • 40. data and reasoning Here we consider an example where working with extensions (data values) only, and ignoring intensions, can easily lead to absurd conclusions. Consider the fact that the teacher of Alexander the Great = Aristotle. Notice now that if we simply replace 'the teacher of Alexander the Great' with a value that is only extensionally equal to it, we can turn a very meaningful sentence into an absurdity.
  • 41. Let us now consider examples illustrating why intensionality cannot be ignored in natural language understanding. Suppose we have a question-answering system that is to return the names of: (1) all the tall presidents of the United States? (2) all the former presidents of the United States? A simple method for answering (1) would be to get two sets, the set of names of all tall people and the set of names of all presidents of the United States, and simply return their intersection as the result. What about the query in (2), however? Clearly we cannot do the same, because we cannot, as in the case of tall, represent former by a set (an extension) of all former things. If we did, then Ronald Reagan, for example, would have been a 'former president' even while serving his term as president, because he would have been in both sets: the set of presidents, and the set of 'former things', as he was also a former actor. The point here is that unlike tall, which is an extensional adjective that can semantically be represented by a set (the set of all tall things), former is an intensional adjective that logically operates on a concept, returning a subset of that concept as a result. data, intensions and reasoning
  • 42. Let us elaborate on this subject some more. The following is a plausible meaning for (1) and (2) above: (1) tall presidents of the United States ⇒ { x | is-president-of-the-us(x) ∧ is-tall(x) } (2) former presidents of the United States ⇒ { x | is-president-of-the-us(x) ∧ F(x, president) } What the above says is: (1) 'tall presidents of the United States' refers to any x that is in the set of presidents and also in the set of tall things; and (2) 'former presidents of the United States' refers to any x that is in the set of presidents and of which some F is also true. Clearly, what F does with an x is something to the effect of making sure that x was, at some point in time, but is not now, a president. The point here is that unlike is-tall(x), F is not a set and has no extensional value, but is a logical expression that takes a concept and applies some condition, returning a subset of the original concept. All of this is not available in data-driven NLU, where both 'tall' and 'former' are adjectives that equally modify nouns, which, as we have seen, can result in contradictions when executed on real data.
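The contrast can be sketched in a few lines. The individuals and 'history' records below are hypothetical illustrations: 'tall' is just a set intersected with another set, while 'former' is a function applied to the concept's record, not to any set of 'former things':

```python
# Extensional adjective: 'tall' is representable as a set of individuals.
tall_things = {"lincoln", "obama", "eiffel tower"}

# The concept 'president' with its history: who held the office, and
# whether they hold it now (all entries hypothetical).
president_history = {
    "lincoln": {"held": True, "holds_now": False},
    "obama":   {"held": True, "holds_now": False},
    "biden":   {"held": True, "holds_now": True},
}

presidents = {x for x, r in president_history.items() if r["held"]}

# (1) extensional: plain set intersection
tall_presidents = presidents & tall_things

# (2) intensional: 'former' operates on the concept, returning a subset of it
def former(history):
    return {x for x, r in history.items() if r["held"] and not r["holds_now"]}

former_presidents = former(president_history)
print(sorted(tall_presidents))        # ['lincoln', 'obama']
print(sorted(former_presidents))      # ['lincoln', 'obama']
print("biden" in former_presidents)   # False: still in office
```

There is no global set of 'former things' anywhere in this sketch; `former` only makes sense relative to the concept it is applied to, which is exactly the point.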
  • 43. One misguided attempt at salvaging the data-only solution would be to maintain a set for the compound 'former presidents'. This escape attempt is doomed, however, since composite sets for 'previous president', 'former senator', 'former governor', 'previous governor', etc. would then also have to be added and maintained. In fact, insisting on a data-only solution for intensional adjectives would essentially mean maintaining a set for every construction of the form [Adj1 Adj2 Noun], [Adj1 Adj2 Noun1 Noun2], … where any adjective Adji is an intensional adjective. This is exactly the same situation we encountered previously (pages 12-15), where composite features for every possible relation were needed to resolve references in a data-driven model. In both cases, such alternatives are neither computationally nor psychologically plausible.
  • 44. Another major problem with data-driven/statistical approaches to NLU is their complete denial of compositionality – computing the meaning of larger linguistic units as a function of the meaning of their constituents. To illustrate, consider the sentences below. (1) Jon bought an old copy of Das Kapital. (2) Jon read an old copy of Das Kapital. Although (1) and (2) refer to the same object, namely a book entitled 'Das Kapital', the reference in (1) is to a physical object that can be bought (and thus sold, burned, etc.), while in (2) the reference is to the content and ideas in that book. Thus, 'Das Kapital' may refer to different features or properties of the book, depending on the context, where the context could extend over several sentences. For example, consider (3): (3) Jon read Das Kapital. He then burned it because he did not agree with anything it espouses. In (3), we are (at the same time) using 'Das Kapital' to refer to an abstract object (namely the content of Das Kapital), when Jon read it and then disagreed with its content, and to a physical object that can be burned. We will see later on how a strongly-typed system discovers the existence of all the potential types of objects that 'Das Kapital' can refer to (a physical object that can be burned, an abstract object that can be read and disagreed with, etc.) compositionality
  • 45. In natural language we can speak of anything we can conceive or imagine, existent or non-existent. We can thus speak of and refer to an event that did not exist, as in (1) John cancelled the trip. It was planned for next Saturday. In (1), we are speaking about and referring to an event (a trip) that did not actually happen, and thus a trip that never existed. We can also refer to or speak of objects that do not exist, as in (2) John painted a yellow bear. In (2) what is 'yellow' is not an actual bear, but a depiction of some object, namely a bear. Reference to abstract and nonexistent objects can be quite involved, especially in mixed contexts where the initial reference is to an object that does not necessarily exist, but whose existence is implied by subsequent context. For example, consider the following: (3) John's book proposal was not well received. But it later became a bestseller when it was published. In (3), the reference was initially to a book proposal, which does not imply the existence of the book, although subsequent context implies the concrete existence of a book. Such inferences cannot be made with a simple analysis of the external data. yellow bears?
  • 46. Data-driven approaches typically ignore functional words (prepositions, quantifiers, etc.), and for a good reason: the probabilities of these words are equal in all contexts! But such words cannot be ignored, as they are what logically glues the various components of a sentence into a coherent whole. Consider for example the determiner 'a', the smallest word in English, in the following sentences: (1) A paper on genetics was published by every student of Dr. Miller (2) A paper on genetics was referenced by every student of Dr. Miller While 'a paper on genetics' may refer to a single, specific paper in (2), this is not likely in (1), where 'a' is most likely under the scope of 'every'. That is, the most likely meaning of (1) is the one implied by (3) Every student of Dr. Miller published a paper on genetics Resolving such quantifier scope ambiguities is clearly beyond data-driven approaches and is a function of pragmatic world knowledge (e.g., while it is possible for several students to refer to a single paper, it is not likely that all of Dr. Miller's students published the same paper…) We shall later on see how a strongly-typed ontology of commonsense concepts can be used to make such inferences. functional words
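The two scope readings can be written out explicitly; the predicate names below are illustrative shorthand:

```latex
% Preferred reading of (1): 'every' outscopes 'a' (a possibly different paper per student)
\forall x\,\bigl(\mathrm{StudentOfMiller}(x) \rightarrow
    \exists y\,(\mathrm{PaperOnGenetics}(y) \wedge \mathrm{Publish}(x,y))\bigr)

% Available reading of (2): 'a' outscopes 'every' (one specific paper)
\exists y\,\bigl(\mathrm{PaperOnGenetics}(y) \wedge
    \forall x\,(\mathrm{StudentOfMiller}(x) \rightarrow \mathrm{Reference}(x,y))\bigr)
```

The surface strings differ only in the verb, yet pragmatic knowledge selects the wide-scope universal for (1) and leaves the wide-scope existential available for (2).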
  • 47. We have (hopefully) demonstrated that purely quantitative (statistical data-driven) approaches are not plausible models for natural language understanding, for two main reasons: 1. The relevant information is often not even present in the data, or in many cases there is no statistical significance in the data from which to make the proper inferences. Attempts to remedy this lead to a combinatorial explosion in the number of features that would have to be assumed, which renders these attempts computationally implausible. 2. It was shown that even when the data is available, reasoning with data only, ignoring intensions and logical definitions, can easily lead to absurdities and contradictions. While statistical and data-driven models may not be appropriate for high-level reasoning tasks in language understanding, we believe that these models have a lot to offer in some linguistic and data-centric tasks. Chief among these are part-of-speech (POS) tagging, statistical parsing, and collecting and analyzing corpus linguistic data to 'enable' and automate some of the tasks needed in building a system that can truly understand ordinary spoken languages. We are now in a position to start describing our proposal. data-driven NLU? so where are we now?
  • 48. ontological vs. logical concepts
  • 49. We will start by introducing the general framework, and we will do so gradually. The material presented from here on assumes some exposure to logic, although we will try to simplify our presentation as much as possible. One of the major features of our framework is the crucial distinction between what can be called ontological concepts, or first-intension concepts, as Cocchiarella (19xx) calls them, and logical concepts (or second-intension concepts). The difference between these two types of concepts can be illustrated by the following examples: (1) R2: heavy(x :: physical) R3: hungry(x :: animal) R4: articulate(x :: human) R5: make(x :: human, y :: artifact) R6: imminent(x :: event) R7: beautiful(x :: entity) What the above says is: heavy is a property that can be said of any object x that is of type physical; hungry is said of objects of type animal; articulate applies to objects of type human; the make relation can hold between an object of type human and an object of type artifact; imminent is said of objects of type event; and, finally, beautiful can be said of any entity. the framework ontological vs. logical concepts
  • 50. the framework ontological vs. logical concepts It is also assumed that the types associated with the predicates in (1), e.g. artifact, event, human, entity, etc., exist in a subsumption hierarchy (only a fragment of which need be given, with intermediate types left implicit). The fact that an object of type human is ultimately an object of type entity is expressed as human ⊑ entity. Furthermore, a property such as heavy can be said of objects of type human and of objects of type artifact, since human ⊑ physical, artifact ⊑ physical, and heavy(x :: physical).
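A minimal sketch of how such a fragment hierarchy interacts with the type signatures in (1); the parent links below are assumptions for illustration, with intermediate types collapsed:

```python
# Hypothetical fragment of the subsumption hierarchy: child -> parent.
parent = {
    "human":    "animal",
    "animal":   "physical",
    "artifact": "physical",
    "physical": "entity",
    "event":    "entity",
}

def subsumed_by(t, s):
    """True if type t is (ultimately) a subtype of s, i.e. t ⊑ s."""
    while t is not None:
        if t == s:
            return True
        t = parent.get(t)
    return False

# A property applies to objects of its stated type or any of its subtypes.
signatures = {"heavy": "physical", "hungry": "animal", "articulate": "human"}

def applies(prop, t):
    return subsumed_by(t, signatures[prop])

print(applies("heavy", "human"))      # True:  human ⊑ physical
print(applies("heavy", "artifact"))   # True:  artifact ⊑ physical
print(applies("hungry", "artifact"))  # False: artifact is not an animal
```

The same walk up the parent links verifies human ⊑ entity, so the one mechanism covers both type checking and the applicability of properties.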
  • 51. As mentioned earlier, a strongly-typed ontology is assumed throughout this study. Usually this conjures up thoughts of a massive amount of knowledge that has to be hand-coded and engineered by experts. This is not at all what we are assuming here. In fact, the ontological structure we are assuming (and will discuss later on) is not massive at all, since most everyday concepts are actually just instances of the basic ontological types. For example, there is nothing meaningful (i.e., sensible, regardless of whether it is true or false) that we can say about a 'racing car' that we cannot say about a car. Thus, as far as language understanding goes, the ontological type car belongs to the ontology, and 'racing car' is just an instance concept. With such an analysis, most everyday concepts are just instances of basic ontological types. This issue is related to a comment that J. Fodor once made, something to the effect that "to be a concept is to be locked to a word in the language". This is also in line with Fred Sommers' idea of applicability in his proposal about The Tree of Language. Gottlob Frege's idea of how a word gets its meaning, namely from all the different ways it can be used in language, is also consistent with the ontological structure we assume, which was discovered by reverse-engineering language itself. That is, what we can say about concepts tells us what structure lies behind. We will discuss the details of the ontology later on; for now, we will simply assume that this ontological structure exists. about the ontological structure the framework
  • 52. the framework: ontological vs. logical concepts
According to the above, in our framework we assume a Platonic universe that includes everything we can talk about in ordinary discourse, including abstract objects such as events, states, properties, etc. These ontological concepts exist as types in a strongly-typed ontology, and the logical concepts are all the properties of, or the relations that can hold between, these ontological concepts. In addition to logical and ontological concepts there are proper nouns, which are the names of objects; objects that can be of any type. We use the notation (∃!Sheba :: thing) to state that there is a unique object named Sheba, an object that is of type thing. With this basic machinery, let us consider the interpretation of the simple sentence 'Sheba is a thief', where 〚s〛 stands for 'the meaning of s', ⇒ is used to mean 'is interpreted as', and thief(x :: human) states that the property thief applies to objects that must be of type human:
(2) 〚Sheba is a thief〛 ⇒ (∃!Sheba :: thing)(thief(Sheba :: human))
Thus 'Sheba is a thief' is interpreted as follows: there is some unique object named Sheba, an object that is initially assumed to be a thing, such that the property thief is true of Sheba.
  • 53. the framework: type unification – the basics
Note that in our interpretation (repeated below) Sheba is now associated with more than one type in a single scope.
(2) 〚Sheba is a thief〛 ⇒ (∃!Sheba :: thing)(thief(Sheba :: human))
Initially unknown, and thus assumed to be an object of type thing, Sheba was later assigned the type human when described by the property (or when in the context of being a) thief. In these situations a type unification must occur, and this is done as follows:
(Sheba :: (thing • human)) → (Sheba :: human)
where (s • t) denotes a type unification between the types s and t, and → stands for 'unifies to'. Note that the unification of thing and human resulted in human since human ⊑ thing; that is, since an object that is of type human is ultimately an object of type thing. The final interpretation of 'Sheba is a thief' is now the following:
(2) 〚Sheba is a thief〛 ⇒ (∃!Sheba :: human)(thief(Sheba))
In the final analysis, 'Sheba is a thief' is simply interpreted as: there is a unique object named Sheba, an object that (we now know) must be of type human, and that object is a thief.
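The type unification step can be sketched in a few lines. This is a toy illustration under an assumed fragment hierarchy, not the paper's implementation: when one type subsumes the other, unification keeps the more specific of the two.

```python
# Assumed fragment hierarchy for illustration (child -> parent).
PARENT = {"human": "physical", "physical": "entity", "entity": "thing"}

def subsumes(super_type, sub_type):
    """True iff sub_type is (transitively) a super_type."""
    t = sub_type
    while t is not None:
        if t == super_type:
            return True
        t = PARENT.get(t)
    return False

def unify(s, t):
    """Type unification of s and t: the more specific of the two types
    when one subsumes the other; None when neither does."""
    if subsumes(s, t):
        return t  # t is below s, so t carries more information
    if subsumes(t, s):
        return s
    return None
```

For example, `unify("thing", "human")` returns `"human"`, mirroring the step in which Sheba, initially a thing, is refined to human.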
  • 54. the framework: type unification – the basics
Although we have interpreted a very simple sentence, we have already seen the power of embedding ontological types (that exist in some strongly-typed hierarchy) into the powerful machinery of logical semantics. Specifically, it was the type constraint on the property thief(x :: human), namely that it applies to objects that must be of type human, that allowed us to discover the fact that Sheba must be a human. Admittedly, this is a very trivial 'discovery', and in a very simple context. However, the power of type unification and the hidden information it can uncover will be more appreciated as we move on to more involved contexts. Suppose black(x :: physical) and own(x :: human, y :: entity); that is, we are assuming that black can be said of all objects of type physical, and that objects of type human can own any object of type entity. With that, let us now consider the following:
(3) 〚Sara owns a black cat〛 ⇒ (∃!Sara :: thing)(∃c :: cat)(black(c :: physical) ∧ own(Sara :: human, c :: entity))
Thus 'Sara owns a black cat' is interpreted as follows: there is a unique thing named Sara, and some object c of type cat, such that c is black (and thus here it must be of type physical), and Sara owns c, where in this context Sara must be an object of type human and c an object of type entity.
  • 55. the framework: type unification – the basics
Our interpretation of 'Sara owns a black cat' is repeated below.
(3) 〚Sara owns a black cat〛 ⇒ (∃!Sara :: thing)(∃c :: cat)(black(c :: physical) ∧ own(Sara :: human, c :: entity))
Note now that, depending on the contexts they are mentioned in, Sara is assigned two types, and the object c is assigned three types. The type unifications that must occur in this situation are the following:
(Sara :: (thing • human)) → (Sara :: human)
(c :: ((physical • entity) • cat)) → (c :: (physical • cat)) → (c :: cat)
Note that the type unification ((physical • entity) • cat) is associative, so the order in which the two type unifications are done does not matter. The final interpretation of 'Sara owns a black cat' is therefore given by:
(3) 〚Sara owns a black cat〛 ⇒ (∃!Sara :: human)(∃c :: cat)(black(c) ∧ own(Sara, c))
That is, there is a unique object named Sara, which is of type human, and some cat c, and Sara owns c.
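Because unification is associative, the constraints accumulated for one variable can simply be folded in any order. A small sketch under an assumed fragment hierarchy (with cat placed under physical for the example):

```python
from functools import reduce

# Assumed fragment hierarchy for illustration (child -> parent).
PARENT = {"human": "physical", "cat": "physical",
          "physical": "entity", "entity": "thing"}

def subsumes(super_type, sub_type):
    """True iff sub_type is (transitively) a super_type."""
    t = sub_type
    while t is not None:
        if t == super_type:
            return True
        t = PARENT.get(t)
    return False

def unify(s, t):
    """Pairwise type unification; None on failure."""
    if subsumes(s, t):
        return t
    if subsumes(t, s):
        return s
    return None

def unify_all(constraints):
    """Fold unification over all type constraints collected for one
    variable; associativity means the grouping does not matter."""
    return reduce(unify, constraints)
```

Folding the three constraints on c yields cat, and the two constraints on Sara yield human, matching the derivation above.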
  • 57. the framework: abstract objects
As mentioned in our introduction, in our framework ontological concepts include abstract objects such as states, processes, events, properties, etc. Let us now consider one of these categories, namely activities. In our framework a concept such as dancer(x) is true of some x according to the following:
(∀x :: human)(dancer(x) ≡ (∃d :: activity)(dancing(d) ∧ agent(d, x)))
That is, any object x of type human is a dancer iff there is some object d of type activity such that d is a dancing activity and x is the agent of d. Note that, according to the above, there are at least two objects that are part of the meaning of 'dancer': in particular, some object x of type human, and some dancing activity d. Thus, in saying 'beautiful dancer', for example, one could be using 'beautiful' to describe the dancer, or the dancing activity itself. Consider now the interpretation below, assuming that beautiful(x :: entity); that is, assuming beautiful is a property that can be said of any entity:
(4) 〚Sara is a beautiful dancer〛 ⇒ (∃!Sara :: thing)(∃a :: activity)(dancing(a) ∧ agent(a :: activity, Sara :: human) ∧ (beautiful(a :: entity) ∨ beautiful(Sara :: entity)))
  • 58. the framework: abstract objects
〚Sara is a beautiful dancer〛 ⇒ (∃!Sara :: thing)(∃a :: activity)(dancing(a) ∧ agent(a :: activity, Sara :: human) ∧ (beautiful(a :: entity) ∨ beautiful(Sara :: entity)))
Thus 'Sara is a beautiful dancer' is interpreted as follows: there is a unique object named Sara and some activity a, such that a is a dancing activity, Sara is the agent of a (and as such must be an object of type human), and either the dancing is beautiful, or Sara is (or, of course, both). Note now that there are a number of type unifications that must occur:
(Sara :: ((thing • human) • entity)) → (Sara :: (human • entity)) → (Sara :: human)
(a :: (activity • entity)) → (a :: activity)
After all is said and done, the interpretation of (4) is the following:
(4) 〚Sara is a beautiful dancer〛 ⇒ (∃!Sara :: human)(∃a :: activity)(dancing(a) ∧ agent(a, Sara) ∧ (beautiful(a) ∨ beautiful(Sara)))
Note that the ambiguity of what beautiful is describing is still represented in our final interpretation.
  • 59. the framework: failed type unifications
Thus far our type unifications have always succeeded. In some cases, however, a type unification between two types s and t could fail, and we write this as (s • t) → ⊥. Let us see where this might occur and what it would result in. Consider the interpretation of 'Sara is a blonde dancer', where we assume blonde(x :: human); that is, we are assuming that blonde is a property that applies to objects that must be of type human.
(5) 〚Sara is a blonde dancer〛 ⇒ (∃!Sara :: thing)(∃a :: activity)(dancing(a) ∧ agent(a :: activity, Sara :: human) ∧ (blonde(a :: human) ∨ blonde(Sara :: human)))
The type unifications needed for Sara are quite simple:
(Sara :: ((thing • human) • human)) → (Sara :: (human • human)) → (Sara :: human)
The type unification needed for the activity a, however, is not as straightforward. Before we continue, let us plug in the type unification of Sara to see where we're at.
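The disambiguating effect of unification failure can be sketched as follows. This is an illustrative toy under an assumed fragment hierarchy: an adjective may attach to the person or to the dancing activity, and a reading survives only if the adjective's type constraint unifies with that candidate's type.

```python
# Assumed fragment hierarchy for illustration (child -> parent).
PARENT = {"human": "physical", "physical": "entity",
          "activity": "entity", "entity": "thing"}

def subsumes(super_type, sub_type):
    """True iff sub_type is (transitively) a super_type."""
    t = sub_type
    while t is not None:
        if t == super_type:
            return True
        t = PARENT.get(t)
    return False

def unify(s, t):
    """Type unification; None plays the role of failure."""
    if subsumes(s, t):
        return t
    if subsumes(t, s):
        return s
    return None

def readings(adjective_type):
    """Which of {Sara, the dancing activity a} can the adjective
    sensibly describe, given the type its predicate requires?"""
    candidates = {"Sara": "human", "a": "activity"}
    return [name for name, t in candidates.items()
            if unify(adjective_type, t) is not None]
```

With beautiful(x :: entity), both readings survive, so the ambiguity is retained; with blonde(x :: human), unification with the activity fails and only the Sara reading remains.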
  • 60. Copyright © 2017 WALID S. SABA a brief detour
  • 61. a temporary diversion: what paradox of the ravens?
Before we continue with our proposal, we would like to illustrate the utility of separating concepts into logical and ontological concepts. We will do this by proposing a solution to the so-called Paradox of the Ravens. Introduced in the 1940s by the logician (and one-time assistant of Rudolf Carnap) Carl Gustav Hempel, the Paradox of the Ravens (or Hempel's Paradox, or the Paradox of Confirmation) has continued to occupy logicians, statisticians, and philosophers of science to this day. The paradox arises when one considers what counts as evidence for a statement (or hypothesis). To illustrate the Paradox of the Ravens, consider the following:
(H1) All ravens are black
(H2) All non-black things are not ravens
That is, we have the hypothesis H1 that 'All ravens are black'. This hypothesis, however, is logically equivalent (by contraposition) to the hypothesis H2 that 'All non-black things are not ravens'. Since H1 and H2 are logically equivalent, any evidence/observation that confirms H1 must also confirm H2, and vice versa. While it sounds reasonable that observing black ravens should confirm H1, observing a white ball or a red sofa, which does confirm H2, also confirms the logically equivalent hypothesis that all ravens are black, which does not sound plausible.
  • 62. what paradox of the ravens?
Observing black ravens confirms hypothesis H1, namely that 'All ravens are black' (the case in (a)). Observing non-black objects that are not ravens, as in (b), however, confirms hypothesis H2 (that all non-black things are not ravens). But H2 is logically equivalent to H1, leaving us with the unpleasant conclusion that observing red apples, blue suede shoes, or brown briefcases confirms the hypothesis that 'All ravens are black'.
(figures (a) and (b) omitted)
  • 63. a temporary diversion: what paradox of the ravens?
Many solutions have been proposed to the Paradox of the Ravens, ranging from accepting the paradox (that observing red apples and other non-black non-ravens does confirm the hypothesis 'All ravens are black') to proposals in the Bayesian tradition that try to measure the 'degree' of confirmation. The Bayesian proposals essentially amount to saying that observing a red apple does confirm the hypothesis 'All ravens are black', but does so very minimally, and certainly much less than the observation of a black raven does. Clearly, this is not a satisfactory solution, since observing a red flower should not contribute at all to the confirmation of 'All ravens are black'. Worse, in the Bayesian analysis the observation of black but non-raven objects actually negatively confirms (or disconfirms) the hypothesis that 'All ravens are black'. One logician who stands out in suggesting an explanation for the Paradox of the Ravens is W. V. Quine, who suggested (in 'Natural Kinds') that there is no paradox in the first place, since universal statements of the form 'All Fs are Gs' can only be confirmed on what he called natural kinds, and 'non-black things' and 'non-ravens' are not natural kinds. Basically, for Quine, members of a natural kind must share most of their properties, and there is hardly anything similar between all 'non-black things', or all non-ravens. While statistical/Bayesian and other logical proposals have still not suggested a reasonable explanation for the Ravens Paradox, we believe that the line of thought Quine was pursuing is the most appropriate. However, Quine's natural kinds were not well defined. In fact, what Quine was probably alluding to is that there is a difference between what we have called here logical concepts and ontological concepts.
  • 64. a temporary diversion: what paradox of the ravens?
The so-called Paradox of the Ravens exists simply because both ontological and logical concepts are mistakenly represented by predicates, although, ontologically, these two types of concepts are quite different. First, let us discuss some predicates and how we usually represent them in first-order logic. Consider the following:
(1) black(x)  (2) imminent(x)  (3) sympathetic(x)  (4) hungry(x)  (5) dog(x)  (6) guitar(x)
Suppose now that we would like to add types to our variables; that is, we would like our logical expressions to be, in computer programming terminology, strongly typed. Suppose, further, that we also want our predicates to be polymorphic; that is, to apply to objects of a certain type and all of its subtypes. Thus, if a predicate applies to objects of type vehicle, then it applies to all subtypes of vehicle (e.g., car, truck, bus, ...). Given this, what are the appropriate types that one might associate with the variables of the predicates above? Here are some possible type assignments:
(1) black(x :: physical)  (2) imminent(x :: event)  (3) sympathetic(x :: human)  (4) hungry(x :: animal)
  • 65. a temporary diversion: what paradox of the ravens?
What the above suggests is that, ignoring metaphor for the moment, the predicate black applies to objects that are of type physical. In other words, black is meaningless (or nonsensical) when applied to (or said of) objects that are not of type physical. Similarly, the above says that imminent is said of objects that are of type event (and, of course, all its subtypes, so we can say 'an imminent trip', 'an imminent meeting', 'an imminent election', etc.). In the same vein, the above says that sympathetic is said of objects that must be of type human, and that hungry applies to objects of type animal. But how about the predicates in (5) and (6)? What are the most appropriate types that can be associated with the variables in the predicates dog(x) and guitar(x); that is, of what types of objects can these predicates be meaningfully said? The only plausible answer seems to be dog(x :: dog) and guitar(x :: guitar). But these are obvious tautologies, since, for example, the predicate dog applied to an object of type dog is always true. Clearly, then, (5) and (6) are quite different from the predicates in (1) through (4): while the predicates in (1) through (4) are logical concepts, dog and guitar are not predicates/logical concepts, but ontological concepts that correspond to types in a strongly-typed ontology. With this background, let us now go back to the so-called Paradox of the Ravens.
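The distinction can be made concrete with a small sketch, under an assumed fragment hierarchy: each logical concept carries the type its argument must have, while an 'ontological' predicate like dog would require type dog and so is trivially true of everything it can sensibly apply to, a sign that it belongs in the type hierarchy rather than among the predicates.

```python
# Assumed fragment hierarchy for illustration (child -> parent).
PARENT = {"human": "animal", "animal": "physical", "dog": "animal",
          "car": "physical", "trip": "event", "event": "entity",
          "physical": "entity", "entity": "thing"}

def subsumes(super_type, sub_type):
    """True iff sub_type is (transitively) a super_type."""
    t = sub_type
    while t is not None:
        if t == super_type:
            return True
        t = PARENT.get(t)
    return False

# Each logical concept paired with the type its argument must have;
# 'dog' is included only to expose the tautology discussed above.
PRED_TYPE = {"black": "physical", "imminent": "event",
             "sympathetic": "human", "hungry": "animal",
             "dog": "dog"}

def sensible(pred, obj_type):
    """Is pred meaningful (true-or-false, rather than nonsense) when
    said of objects of obj_type?"""
    return subsumes(PRED_TYPE[pred], obj_type)
```

Here sensible("black", "car") and sensible("imminent", "trip") hold, sensible("imminent", "car") fails, and sensible("dog", "dog") is vacuous, which is exactly why dog is treated as a type and not as a predicate.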
  • 66–69. a temporary diversion: what paradox of the ravens? (figure-only slides)
  • 70–76. the framework: failed type unifications (figure-only slides)
  • 77. the framework: salient properties/relations (figure-only slide)
  • 78–79. the framework: failed type unifications (figure-only slides)
  • 80. Copyright © 2017 WALID S. SABA ontological semantics: contents the road ahead
  • 82–84. the proposal: word-sense disambiguation (figure-only slides)
  • 85. the proposal: word-sense disambiguation
Let us now look at situations where lexical ambiguities translate into ambiguities in both logical and ontological concepts. Consider the sentences in (10) and (11):
(10) Melinda ran for twenty minutes.
(11) The program ran for twenty minutes.
First of all, there is a clear ambiguity in the meaning of 'program', as it could refer to a computer program (i.e., a process), or to the program of some event, among other meanings. Second, it is clear that the running of Melinda in (10) is different from the running of the program in (11). Let us consider the simpler of these two cases, namely the ambiguity in (10), assuming that there are (at least) two kinds of running activities: one whose agent is a (legged) animal, and one whose agent is a process. What the above says is the following: there is a unique object named Melinda, some twenty minutes during which Melinda ran, and either a running activity of some human, or the running of some process.
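The disambiguation of 'ran' can be sketched in the same style as the earlier examples: each sense of run constrains the type of its agent, and unification failure eliminates the senses that do not fit. The sense names and type placements below are assumptions for the illustration.

```python
# Assumed fragment hierarchy for illustration (child -> parent).
PARENT = {"human": "animal", "animal": "physical", "process": "entity",
          "physical": "entity", "entity": "thing"}

def subsumes(super_type, sub_type):
    """True iff sub_type is (transitively) a super_type."""
    t = sub_type
    while t is not None:
        if t == super_type:
            return True
        t = PARENT.get(t)
    return False

def unify(s, t):
    """Type unification; None plays the role of failure."""
    if subsumes(s, t):
        return t
    if subsumes(t, s):
        return s
    return None

# Two assumed senses of 'run' and the agent type each requires:
RUN_SENSES = {"run(legged-motion)": "animal", "run(execution)": "process"}

def run_readings(agent_type):
    """Senses of 'run' whose required agent type unifies with the
    subject's type."""
    return [sense for sense, required in RUN_SENSES.items()
            if unify(required, agent_type) is not None]
```

With Melinda of type human, only the legged-motion sense survives; with a subject of type process, only the execution sense does.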
  • 86–87. the proposal (figure-only slides)
  • 88. Copyright © 2017 WALID S. SABA fragment of the ontology
  • 89–96. the proposal: word-sense disambiguation (figure-only slides)
  • 97. the proposal
'The corner table wants another beer' – tables have wants, and they drink beer?!
  • 101. Copyright © 2017 WALID S. SABA To be continued ...