Michael Fuchs | How to compute semantic relationships between entities and facts out of natural texts
1. How to compute semantic relationships between entities and facts out of natural texts
Michael Fuchs
Technology Evangelist
ABBYY
fuchs@abbyy.com
2. Agenda
1. How machines read pixels
2. Documents, words, layout & semantics
3. Syntactic & semantic text parsing
4. Live demo
5. Q&A
3. How machines read pixels
Pixel analysis
Find text/image blocks
Separate pixels into characters
4. How machines read pixels
Recognize individual characters
Build proper words as editable text
Requirements to make a machine read text:
-> Linguistics: Alphabets & Morphology Dictionaries
-> Math, AI, Statistics, Experience, and…
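The step "separate pixels into characters" can be illustrated with a vertical projection profile, a classic segmentation idea. This is only a minimal sketch on a made-up bitmap (`BITMAP` and `segment_columns` are hypothetical names); the talk does not say which method the recognition engine uses.

```python
from typing import List, Tuple

def segment_columns(bitmap: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that contain ink, split at blank columns."""
    width = len(bitmap[0])
    # Vertical projection: number of ink pixels ('#') in each column.
    profile = [sum(row[x] == "#" for row in bitmap) for x in range(width)]
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a character candidate begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # a blank column ends the candidate
            start = None
    if start is not None:
        segments.append((start, width))    # candidate touching the right edge
    return segments

# Made-up 3x8 bitmap holding two glyph-like shapes separated by blank columns.
BITMAP = [
    "#.#..###",
    "###...#.",
    "#.#..###",
]

print(segment_columns(BITMAP))
```

Each returned range is a candidate character that a classifier would then recognize; touching or broken characters need more than this simple rule.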
5. What is needed to make a machine understand the meaning of words, sentences, texts?
6. Documents & Words
What is a document?
a) Bag of words?
Statistics can give basic insights
-> No real semantic understanding
b) Words in order?
Layouts generate visual patterns
-> Semantics can be derived from layout
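Why a bag of words gives statistics but no real semantic understanding can be shown with two sentences that share the same words. A minimal sketch; the example sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())

a = "the dog bites the man"
b = "the man bites the dog"

# Same word statistics, although the two sentences state opposite facts.
same_bag = bag_of_words(a) == bag_of_words(b)
# Word order is the only thing that tells the sentences apart.
same_order = a.split() == b.split()

print(same_bag, same_order)
```

The counts are identical, so any purely statistical view of the words cannot tell who bites whom.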
7. Documents, Words and Layout
Document with layout
Text document with “simulated” layout
Text with line breaks
Text only
-> Rules can extract data out of (semi-)structured texts and documents
-> Layout helps to identify the semantic meaning of data
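A rule for (semi-)structured text can be as simple as a "label: value" pattern applied line by line. The sketch below assumes that layout; the invoice text, `FIELD_RULE` and `extract_fields` are made up for illustration and do not come from the talk.

```python
import re

# Made-up semi-structured text: one "label: value" pair per line.
INVOICE_TEXT = """\
Invoice No: 2024-0042
Date: 2024-03-15
Total: 1,250.00 EUR
"""

# Rule: a label made of letters and spaces, a colon, then the value up to the line end.
FIELD_RULE = re.compile(r"^(?P<label>[A-Za-z ]+):\s*(?P<value>.+)$", re.MULTILINE)

def extract_fields(text: str) -> dict:
    """Apply the rule to every line and collect label/value pairs."""
    return {
        match.group("label").strip(): match.group("value").strip()
        for match in FIELD_RULE.finditer(text)
    }

print(extract_fields(INVOICE_TEXT))
```

The rule works only because the line layout marks which text is a label and which is a value; running prose offers no such anchor.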
8. Text and Structure
Is “plain” natural language text unstructured?
-> yes, at least for almost all IT systems
-> not for humans who can read and speak the language
-> Facts and their relations can’t be reliably detected with “simple” rules
9. Text, Structure & Translation
Is a word-by-word translation enough?
-> … well – not really…
-> Semantic understanding of the words and their relationships in sentences is needed!
-> That is true for humans and machines
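A tiny dictionary lookup shows what word-by-word translation produces. This is a sketch with a made-up three-word German-English dictionary (`WORD_DICTIONARY` is hypothetical); the talk does not name a language pair.

```python
# Hypothetical mini dictionary, German -> English, one entry per word.
WORD_DICTIONARY = {"ich": "I", "habe": "have", "hunger": "hunger"}

def translate_word_by_word(sentence: str) -> str:
    """Replace each word by its dictionary entry; unknown words stay unchanged."""
    return " ".join(WORD_DICTIONARY.get(word.lower(), word) for word in sentence.split())

literal = translate_word_by_word("Ich habe Hunger")
print(literal)
```

Every word is translated correctly, yet the result is not the idiomatic English sentence "I am hungry", because the lookup knows nothing about how the words relate to each other.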
10. Text & Structure
Why is natural language text understanding difficult for machines?
-> One word
– different variants, e.g. go, went, gone
– different usage, e.g. as verb, noun, adjective
– different meanings, e.g. run, plant, apple …
-> Different words – the same concept, e.g. to buy/sell something
-> Languages are not strictly logical and are context-dependent
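Two of these difficulties can be sketched with plain lookup tables: several variants map to one base form, and one word maps to several meanings. The tables `LEMMAS` and `SENSES` are hand-written illustrations, not real linguistic resources.

```python
# Hand-written toy tables, for illustration only.
LEMMAS = {"go": "go", "goes": "go", "went": "go", "gone": "go"}
SENSES = {
    "plant": ["factory", "living organism"],
    "apple": ["fruit", "company"],
    "run": ["move fast on foot", "operate a machine"],
}

def lemmatize(token: str) -> str:
    """Map a word variant to its base form; unknown words map to themselves."""
    return LEMMAS.get(token.lower(), token.lower())

def is_ambiguous(word: str) -> bool:
    """A word is ambiguous here if the table lists more than one meaning."""
    return len(SENSES.get(word.lower(), [])) > 1

# All three variants collapse to a single base form.
base_forms = {lemmatize(form) for form in ["go", "went", "gone"]}
print(base_forms, is_ambiguous("plant"), is_ambiguous("went"))
```

A lookup can list the possible meanings, but choosing the right one for a given sentence needs the context of that sentence.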
11. Basic Language Structure
-> Morphology = rules for how word forms are built and inflected
-> Semantics = the meaning and usage of words
-> Semantic relations = reflect/organise the meaning and relations of words and sentences
-> Syntax = rules used to build correct sentences
How to get to the insides of a sentence?
12. Compreno System Architecture
Parser: Morphological analyzer, Syntactic and semantic analysis, Disambiguation, Anaphora resolution
-> Semantic representation of text
Information Extraction Module: Identification rules, Interpretation rules, Extraction rules
-> RDF Graph
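The final output of the architecture is an RDF graph, i.e. facts stored as subject-predicate-object triples. The sketch below serializes hand-written facts in N-Triples syntax; the facts, the `http://example.org/` namespace and `to_ntriples` are assumptions for illustration and say nothing about the actual Compreno output.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)

BASE = "http://example.org/"  # placeholder namespace, not from the talk

def to_ntriples(facts: List[Fact], base: str = BASE) -> str:
    """Serialize facts as N-Triples: one '<s> <p> <o> .' statement per line."""
    lines = []
    for subject, predicate, obj in facts:
        lines.append(f"<{base}{subject}> <{base}{predicate}> <{base}{obj}> .")
    return "\n".join(lines)

# Hand-written facts standing in for the output of an extraction step.
facts = [("Mary", "saw", "students"), ("students", "wore", "masks")]
print(to_ntriples(facts))
```

Because "students" appears in both statements, the two triples form a small connected graph rather than two isolated facts.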
19. Identifying Pronoun Referents (Anaphora)
Mary saw her students. They were wearing masks. She was surprised.
(Mary → her, Mary → she, students → they).
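A naive rule reproduces the three links for this example: each pronoun is attached to the most recent earlier mention that agrees in gender and number. This is a toy sketch with hand-written tables (`MENTIONS`, `PRONOUNS`), not the method used by Compreno, and it fails as soon as several candidates agree.

```python
from typing import List, Optional, Tuple

Features = Tuple[Optional[str], str]  # (gender or None, number)

# Hand-written toy lexicons covering only the example sentence.
MENTIONS = {"mary": ("female", "singular"), "students": (None, "plural")}
PRONOUNS = {
    "her": ("female", "singular"),
    "she": ("female", "singular"),
    "they": (None, "plural"),
}

def resolve_pronouns(text: str) -> List[Tuple[str, str]]:
    """Return (antecedent, pronoun) links using recency plus agreement."""
    seen: List[str] = []   # mentions in order of appearance
    links = []
    for raw in text.split():
        token = raw.strip(".,").lower()
        if token in MENTIONS:
            seen.append(token)
        elif token in PRONOUNS:
            gender, number = PRONOUNS[token]
            # Walk backwards: the closest agreeing mention wins.
            for candidate in reversed(seen):
                cand_gender, cand_number = MENTIONS[candidate]
                if cand_number == number and (gender is None or cand_gender == gender):
                    links.append((candidate, token))
                    break
    return links

TEXT = "Mary saw her students. They were wearing masks. She was surprised."
print(resolve_pronouns(TEXT))
```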
22. Summary: What is ABBYY Compreno?
● … NLP technology featuring a unique model-based approach that employs
universal language models and identifies language structures.
● … combines both syntactic and semantic analysis, as well as machine learning
on untagged text corpora.
● … allows creating a semantic representation of text
● … able to resolve complex language phenomena:
− lexical ambiguity
− recovering omitted words and links (ellipsis)
− identifying pronoun referents (anaphora)
− coreference
− coordination and more
● … support of English, Russian, German in progress