Machine learning for document analysis and understanding
1. 1
Machine learning
for document analysis and
understanding
TC10/TC11 Summer School on Document Analysis:
Traditional Approaches and New Trends
@La Rochelle, France. 8:30-10:30, 4th July 2018
Seiichi Uchida, Kyushu University, Japan
2. 2
The Nearest Neighbor Method
The simplest ML for pattern recognition;
Everything starts from it!
3. 3
The nearest neighbor method:
Learning = memorizing
[Figure: an input pattern is compared with memorized reference patterns (pork, beef, orange, watermelon, pineapple, fish); which reference pattern is the most similar?]
4. 4
Each pattern is represented
as a feature vector
[Figure: a feature vector space with a color-feature axis and a texture-feature axis.]
Pork = (10, 2.5, 4.3)  *These numbers are just a random example.
Note: In the classical nearest neighbor method, these features are designed by humans.
5. 5
A different pattern becomes a different
feature vector
Pork = (10, 2.5, 4.3), Beef = (8, 2.6, 0.9)  *These numbers are just a random example.
[Figure: the two feature vectors plotted in the color-texture feature space.]
7. 7
An input pattern
in the feature vector space
We want to recognize this input x.
[Figure: the input x plotted among the reference patterns in the color-texture feature space.]
8. 8
Nearest neighbor method in the feature vector space
[Figure: the reference pattern nearest to the input x is an orange, so the input is classified as "orange".]
9. 9
How do you define “the nearest neighbor”?
Distance-based: the smallest distance gives the nearest neighbor
Ex.
• Euclidean distance $\|\mathbf{x}-\mathbf{y}\|$
Similarity-based: the largest similarity gives the nearest neighbor
Ex.
• Inner product $\mathbf{x}^\top\mathbf{y}$
• Cosine similarity $\mathbf{x}^\top\mathbf{y}/(\|\mathbf{x}\|\,\|\mathbf{y}\|)$
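As a minimal sketch (not from the slides), the two ways of picking the nearest neighbor can be written as follows; the reference vectors, labels, and input are made-up example values.

```python
import numpy as np

# Made-up reference patterns (rows) and their labels (illustrative only)
refs = np.array([[10.0, 2.5, 4.3],   # pork
                 [ 8.0, 2.6, 0.9],   # beef
                 [ 6.0, 9.1, 3.2]])  # orange
labels = ["pork", "beef", "orange"]
x = np.array([6.2, 8.8, 3.0])        # input pattern to recognize

# Distance-based: smallest Euclidean distance wins
dists = np.linalg.norm(refs - x, axis=1)
print("distance-based:", labels[np.argmin(dists)])

# Similarity-based: largest cosine similarity wins
cos = refs @ x / (np.linalg.norm(refs, axis=1) * np.linalg.norm(x))
print("similarity-based:", labels[np.argmax(cos)])
```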
10. 10
Do you remember an important property of the “inner product”?
If $\mathbf{x}$ and $\mathbf{y}$ point in similar directions, their inner product $\mathbf{x}^\top\mathbf{y}$ becomes larger.
The inner product evaluates the similarity between $\mathbf{x}$ and $\mathbf{y}$.
11. 11
Well, there are two different types of features
(Note: important for understanding deep learning)
Features defined by the pattern itself:
• Orange pixels → many
• Blue pixels → rare
• Roundness → high
• Symmetry → high
• Texture → fine
• …
Features defined by the similarity to others:
• Similarity to “car” → low
• Similarity to “apple” → high
• Similarity to “monkey” → low
• Similarity to “Kaki” (persimmon) → very high
• …
12. 12
The nearest neighbor method with
similarity-based feature vectors
[Figure: the feature space whose axes are “similarity to Kaki” and “similarity to car”.]
Important note: Similarity is used not only for feature extraction but also for classification.
13. 13
A shallow explanation of
neural networks
Don’t think of it as a black box.
If you know the “inner product”, it becomes easy to understand.
15. 15
From reality to computational model
https://commons.wikimedia.org/
[Figure: a biological neuron and its computational model, with inputs $x_1,\dots,x_d$, weights $w_1,\dots,w_d$, a non-linear function $f$, and output $g(\mathbf{x})$.]
16. 16
The neuron by computer
[Figure: the computational neuron: inputs $x_1,\dots,x_d$ and a constant input 1 are weighted by $w_1,\dots,w_d$ and $b$, summed, and passed through $f$.]
$$g(\mathbf{x}) = f(\mathbf{w}^\top\mathbf{x} + b) = f\Big(\sum_{j=1}^{d} w_j x_j + b\Big)$$
$f$: non-linear function; input $\mathbf{x}$, output $g(\mathbf{x})$.
17. 17
The neuron by computer
[Figure: the same neuron as above.]
$$g(\mathbf{x}) = f(\mathbf{w}^\top\mathbf{x} + b) = f\Big(\sum_{j=1}^{d} w_j x_j + b\Big)$$
Let’s forget the non-linear function $f$ for a moment.
18. 18
The neuron by computer
[Figure: the neuron without the non-linearity.]
$$g(\mathbf{x}) = \sum_{j=1}^{d} w_j x_j + b$$
Let’s also forget the bias $b$.
19. 19
The neuron by computer
[Figure: the neuron reduced to a weighted sum.]
$$g(\mathbf{x}) = \sum_{j=1}^{d} w_j x_j = \mathbf{w}^\top\mathbf{x}$$
…just the “inner product” of the two vectors $\mathbf{w}$ and $\mathbf{x}$.
20. 20
So, a neuron calculates…
$$\mathbf{w}^\top\mathbf{x}$$
…a similarity between $\mathbf{w}$ and $\mathbf{x}$:
• ≈ 0.9 if they are similar
• ≈ 0.02 if they are dissimilar
21. 21
So, if we have K neurons, we have a K-dimensional similarity-based feature vector
$$\big(\mathbf{w}_1^\top\mathbf{x},\;\mathbf{w}_2^\top\mathbf{x},\;\dots,\;\mathbf{w}_K^\top\mathbf{x}\big) = (0.9,\;0.05,\;\dots,\;0.75)\ \ \text{(example values)}$$
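A minimal numpy sketch of this idea (the weights and input are made-up numbers): each row of W is one neuron, and W @ x is the K-dimensional similarity-based feature vector.

```python
import numpy as np

d, K = 3, 4                         # input dimension, number of neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))         # each row w_k is one neuron's weight vector
x = rng.normal(size=d)              # input pattern (feature vector)

# One neuron computes an inner product w_k^T x ...
print("single neuron:", W[0] @ x)

# ... and K neurons together give a K-dimensional similarity-based feature
h = W @ x
print("K-dim similarity feature:", h)
```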
23. 23
Another function of the inner product
Similarity-based classification!
(Yes, the nearest neighbor method!)
[Figure: a neuron whose weight vector is the reference pattern of class k; its inner product with the input x gives the similarity to class k.]
24. 24
Note: Multiple functions are realized by just combining neurons!
Just by layering the neuron elements, we
can have a complete recognition system!
[Figure: a two-layer network. A feature extraction layer with weights $\mathbf{w}_1,\dots,\mathbf{w}_K$ is followed by a classification layer with weights $\mathbf{v}_A, \mathbf{v}_B, \mathbf{v}_C$ that outputs the similarities to classes A, B, and C; the class with the maximum similarity is chosen.]
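The whole two-layer recognizer can be sketched in a few lines of numpy (all weights here are random placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, C = 3, 4, 3                        # input dim, #feature neurons, #classes
W = rng.normal(size=(K, d))              # feature-extraction weights w_1..w_K
V = rng.normal(size=(C, K))              # classification weights v_A, v_B, v_C
x = rng.normal(size=d)                   # input pattern

h = W @ x                                # similarity-based feature vector
scores = V @ h                           # similarity to each class
print("predicted class:", "ABC"[int(np.argmax(scores))])   # choose max
```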
25. 25
Now the time for deep neural networks
[Figure: a deep network: the input $x_1,\dots,x_d$ passes through several feature extraction layers (each neuron followed by the non-linear function $f$) and finally through a classification layer.]
26. 26
An example: AlexNet
The “deep” neural network called AlexNet [Krizhevsky+, NIPS2012]
[Figure: the AlexNet architecture, split into feature extraction layers and classification layers.]
27. 27
Now the time for deep neural networks
[Figure: the same deep network as before.]
Why do we need to repeat feature extraction?
28. 28
Why do we need to repeat feature
extraction?
[Figure: six samples A–F in the original feature space, arranged so that the two classes cannot be separated easily: a difficult classification task.]
29. 29
Why do we need to repeat feature
extraction?
[Figure: the same samples A–F, together with two weight vectors $\mathbf{w}_1$ and $\mathbf{w}_2$.]
30. 30
Why do we need to repeat feature
extraction?
[Figure: the samples A–F in the original space with weight vectors $\mathbf{w}_1$ and $\mathbf{w}_2$, and the same samples re-plotted in the space whose axes are “similarity to $\mathbf{w}_1$” and “similarity to $\mathbf{w}_2$”; samples close to $\mathbf{w}_1$ get a large similarity, the others a small one.]
Note: The lower picture is not very accurate, because it uses a distance-based rather than an inner-product-based space transformation. However, I believe this does not seriously damage the explanation here.
31. 31
Why do we need to repeat feature
extraction?
[Figure: the samples A–F re-plotted in the (similarity to $\mathbf{w}_1$, similarity to $\mathbf{w}_2$) space.]
It becomes more separable, but still not very separable.
32. 32
Why do we need to repeat feature
extraction?
[Figure: in the (similarity to $\mathbf{w}_1$, similarity to $\mathbf{w}_2$) space, two new weight vectors $\mathbf{w}_3$ and $\mathbf{w}_4$ are introduced for the next round of feature extraction.]
33. 33
Why do we need to repeat feature
extraction?
[Figure: the samples are re-plotted once more, now in the (similarity to $\mathbf{w}_3$, similarity to $\mathbf{w}_4$) space; the two classes become easier to separate.]
35. 35
Why do we need to repeat feature
extraction?
[Figure: after repeating the feature extraction (original space → similarity to $\mathbf{w}_1,\mathbf{w}_2$ → similarity to $\mathbf{w}_3,\mathbf{w}_4$), the two classes become totally separable by the final weights $\mathbf{v}_1$ and $\mathbf{v}_2$.]
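A classic toy illustration of the same point (not taken from the slides): a single linear projection cannot separate the XOR pattern, but one extra layer of ReLU features makes it trivially separable. The weights below are hand-picked for the demonstration.

```python
import numpy as np

# XOR-like data: class 1 = {(0,1), (1,0)}, class 0 = {(0,0), (1,1)}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

relu = lambda a: np.maximum(0.0, a)

# First feature extraction: two hand-picked neurons
W1 = np.array([[1.0, 1.0],     # h1 = relu(x1 + x2)
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])     # h2 = relu(x1 + x2 - 1)
H = relu(X @ W1.T + b1)

# Second layer: y = h1 - 2*h2 reproduces XOR exactly
v = np.array([1.0, -2.0])
y = H @ v
print(y)                        # [0. 1. 1. 0.] -> now trivially separable
```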
37. 37
The typical non-linear function:
Rectified linear function (ReLU)
[Figure: the neuron with the non-linear function $f$.]
Rectified linear function: $f(a) = \max(0, a)$
38. 38
How does ReLU affect the similarity-based feature?
Negative elements of the feature vector are forced to be zero.
[Figure: the feature vector $\big(f(\mathbf{w}_1^\top\mathbf{x}),\dots,f(\mathbf{w}_K^\top\mathbf{x})\big)$; positive elements pass through unchanged, negative elements become zero.]
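In numpy this is a one-liner (the similarity values below are made up):

```python
import numpy as np

h = np.array([0.9, -0.4, 0.05, -1.2, 0.75])   # raw similarities w_k^T x
print(np.maximum(0.0, h))                      # ReLU: [0.9, 0., 0.05, 0., 0.75]
```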
39. 39
How to train neural networks:
Super-superficial explanation
40. 40
In order to realize a DNN with
an expected “input-output” relation
[Figure: the two-layer network again, with feature-extraction weights $\mathbf{w}_1,\dots,\mathbf{w}_K$ and classification weights $\mathbf{v}_A, \mathbf{v}_B, \mathbf{v}_C$ producing the similarities to classes A, B, and C.]
Those parameters ($\mathbf{w}_1,\dots,\mathbf{w}_K$ and $\mathbf{v}_A,\mathbf{v}_B,\mathbf{v}_C$) should be tuned.
41. 41
Training a DNN: the goal
[Figure: a DNN as a box with many “knobs”; by turning the knobs we look for a perfect classification boundary between class A and class B.]
Note: The actual number of knobs (= #parameters) is far larger than shown here.
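As a hedged sketch of what “tuning the knobs” means in practice (generic gradient descent on a toy single neuron, not the slides’ actual training setup; data and targets are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))            # toy inputs
t = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # toy targets

w = np.zeros(3)                         # the "knobs" to be tuned
lr = 0.1
for _ in range(200):
    y = 1.0 / (1.0 + np.exp(-(X @ w)))  # a single sigmoid neuron
    grad = X.T @ (y - t) / len(X)       # gradient of the cross-entropy loss
    w -= lr * grad                      # turn the knobs a little

print("training accuracy:", np.mean((y > 0.5) == t))
```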
43. 43
Advanced topic: Why does (SGD-based) back-propagation work?
Much theoretical research has been done [Choromanska+, PMLR2015] [Wu+, arXiv2017].
Under several assumptions, local minima are close to the global minimum.
[Figure: a flat basin of the loss surface.]
44. 44
Knob = weight = a pattern for a similarity-based feature
[Figure: a neuron whose input weights form a pattern; the neuron outputs the similarity between the input and that pattern.]
This pattern is automatically derived through training…
45. 45
Optimal features are extracted automatically through training (representation learning)
Google’s cat
https://googleblog.blogspot.jp/2012/06/
[Figure: weight patterns (e.g., the “Google’s cat” neuron) are determined automatically; the extracted features are similarities to such patterns.]
46. 46
DNN for image classification:
Convolutional neural networks
(CNN)
47. 47
How to deal with images by DNN?
If a whole image is treated as one huge vector $\mathbf{x}$ (e.g., a 400 million-dimensional vector), a neuron computing $\mathbf{w}_k^\top\mathbf{x}$ needs a weight vector of the same size:
① intractable computations
② an enormous number of parameters
48. 48
Convolution = repeating “local inner product” operations = linear filtering
Each output value is $\mathbf{w}_k^\top\mathbf{x}_{i,j}$, where $\mathbf{x}_{i,j}$ is a low-dimensional local patch around position $(i, j)$:
① tractable computations
② a trainable number of parameters
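A minimal sketch of convolution as repeated local inner products (a single filter, no padding or stride; purely illustrative):

```python
import numpy as np

def conv2d(image, w):
    """Slide the filter w over the image; each output is a local inner product."""
    H, W = image.shape
    k = w.shape[0]                       # filter size (k x k)
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]      # local patch x_{i,j}
            out[i, j] = np.sum(w * patch)        # inner product w^T x_{i,j}
    return out

image = np.random.default_rng(3).normal(size=(8, 8))
w = np.array([[1., 0., -1.]] * 3)        # a simple vertical-edge filter
print(conv2d(image, w).shape)            # (6, 6) feature map
```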
53. 53
Application to DAR: detecting a multi-part component in a character image [Iwana+, ICDAR2017]
Q: Can a CNN detect complex components accurately?
56. 56
CNN can be used as a feature extractor
[Figure: a trained CNN whose classification layers are discarded; the remaining feature extraction layers turn an input into a feature vector.]
The extracted features work great with another classifier (e.g., SVM or LSTM), an anomaly detector, or a clustering method.
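A hedged sketch of the idea, using placeholder arrays in place of real CNN outputs (`cnn_features` and `labels` here are assumed stand-ins for features from the truncated network and their ground truth):

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder: assume these are feature vectors from the truncated CNN
rng = np.random.default_rng(4)
cnn_features = rng.normal(size=(100, 128))           # 100 samples, 128-dim features
labels = rng.integers(0, 2, size=100)                 # made-up labels

clf = SVC(kernel="linear").fit(cnn_features, labels)  # another classifier on top
print(clf.predict(cnn_features[:5]))
```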
57. 57
The current CNN does not “understand” characters yet
Adversarial examples [Abe+, unpublished], motivated by [Nguyen+, CVPR2015]
[Figure: generated images with their likelihood values for classes “A” and “B”.]
58. 58
On the other hand, a CNN can learn “math operations” through images
input: images → output: an “image” showing the sum
[Hoshen+, AAAI, 2016]
59. 59
Visualization for deep learning: DeCAF [Donahue+, arXiv2013]
Visualizing the pattern distribution at each layer
[Figure: distributions from a layer near the input to a layer near the output.]
60. 60
Visualization for deep learning: DeepDream and its relatives
Finding an input image that excites a neuron at a certain layer
https://distill.pub/2017/feature-visualization/
61. 61
Visualization for deep learning: Layer-wise Relevance Propagation (LRP)
Finding the pixels that contribute to the final decision by a backward process
http://www.explain-ai.org/
62. 62
Visualization for deep learning: local sensitivity analysis by making a hole
[Ide+, unpublished], motivated by [Zeiler+, arXiv2013]
The likelihood of class “0” degrades a lot when a hole is made around an important pixel.
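A hedged sketch of such an occlusion analysis; `model` is a hypothetical function returning class likelihoods, and the patch size and gray value are arbitrary choices:

```python
import numpy as np

def occlusion_map(image, model, cls, patch=4):
    """Sensitivity of class `cls` likelihood to a gray 'hole' at each location."""
    base = model(image)[cls]                           # likelihood without a hole
    H, W = image.shape
    sens = np.zeros((H - patch + 1, W - patch + 1))
    for i in range(sens.shape[0]):
        for j in range(sens.shape[1]):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.5   # make a hole
            sens[i, j] = base - model(occluded)[cls]   # drop in likelihood
    return sens
```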
63. 63
Visualization for deep learning: Grad-CAM [Selvaraju+, arXiv2016]
Finding the pixels that contribute to the final decision by a backward process
http://gradcam.cloudcv.org/
66. 66
Autoencoder (= non-linear principal component analysis)
Training the network to output the input; the middle layer gives a compact representation of the input
App: denoising by a convolutional autoencoder
[Figure: autoencoder diagram (Wikipedia).]
https://blog.sicara.com/keras-tutorial-content-based-image-retrieval-convolutional-denoising-autoencoder-dc91450cc511
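A minimal dense autoencoder sketch in Keras (the layer sizes and data are placeholders, not the convolutional denoising model from the linked tutorial):

```python
import numpy as np
from tensorflow import keras

x = np.random.default_rng(5).normal(size=(256, 64))   # placeholder data

autoencoder = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(64,)),  # encoder -> compact code
    keras.layers.Dense(64),                                        # decoder -> reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=5, verbose=0)             # target = the input itself
```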
71. 71
Note: Deep Image Prior [Ulyanov+, CVPR2018]
The conv-deconv structure has an inherent characteristic that makes it suitable for image completion and other “low-pass” operations.
[Figure: a conv-deconv net is trained just to generate the left image, but it results in the right image.]
72. 72
Generative Adversarial Networks (GAN): the battle of two neural networks
Generator: generates “fake bills”  VS  Discriminator: discriminates fake from real bills
Through this competition, the fake bills become more and more realistic.
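Formally (a standard formulation, not spelled out on the slide), the two networks play the following minimax game, where $G$ is the generator and $D$ the discriminator:
$$\min_G \max_D \; \mathbb{E}_{\mathbf{x}\sim p_\text{data}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_\mathbf{z}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]$$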
75. 75
Huge variety of GANs: just several examples…
Standard GAN (DCGAN), Conditional GAN (conditioned on a class), StackGAN, CycleGAN
https://www.slideshare.net/YunjeyChoi/generative-adversarial-networks-75916964
79. 79
SSD (Single Shot MultiBox Detector) [Liu+, ECCV2016]
A fully-convolutional net that outputs bounding boxes
80. 80
Application to DAR: EAST, an efficient and accurate scene text detector
[Zhou+, “EAST: An Efficient and Accurate Scene Text Detector”, CVPR2017]
Evaluating the bounding box shape
82. 82
LSTM (Long short-term memory):
A recurrent neural network
[Figure: a recurrent network unrolled over time, with an input vector and an output vector at each step.]
Recurrent structure → information from all the past
Gate structure → active selection of information
Also very effective for solving the vanishing gradient problem in the time direction [Graves+, TPAMI2009]
83. 83
Recurrent NN: recurrent structure → information from all the past
LSTM NN: in addition, a gate structure (input gate, forget gate, output gate) → active selection of information
[Figure: a recurrent cell and an LSTM cell, each with an input and an output.]
[Graves+, TPAMI2009]
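For reference (standard LSTM equations, not written out on the slides), the three gates and the cell update are:
$$
\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{b}_i) && \text{(input gate)}\\
\mathbf{f}_t &= \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{b}_f) && \text{(forget gate)}\\
\mathbf{o}_t &= \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{b}_o) && \text{(output gate)}\\
\mathbf{c}_t &= \mathbf{f}_t\odot\mathbf{c}_{t-1} + \mathbf{i}_t\odot\tanh(\mathbf{W}_c\mathbf{x}_t + \mathbf{U}_c\mathbf{h}_{t-1} + \mathbf{b}_c)\\
\mathbf{h}_t &= \mathbf{o}_t\odot\tanh(\mathbf{c}_t)
\end{aligned}
$$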
87. 87
Application to DAR: Convolutional Recurrent Neural Network (CRNN)
[Shi+, “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition”, TPAMI2017]
99. 99
[Figure: 1-D samples of class A and class B on the x axis, and a linear function $\mathbf{w}^\top\mathbf{x} + b$ crossing the levels $+1$ and $-1$.]
How can we get it? Minimize the slope under constraints.
Constraint: for the samples of one class the function value should be more than $+1$; for the other class it should be less than $-1$.
100. 100
How can we get it? Minimize the slope under constraints.
[Figure: candidate linear functions; those violating the constraints are marked NG, the one satisfying them is marked OK.]
101. 101
How can we get it? Minimize the slope under constraints.
[Figure: the same plot; a constraint is depicted as a “nail” that the function must pass over.]
102. 102
How can we get it? Minimize the slope under constraints.
[Figure: the linear function with the minimum slope that satisfies the constraints.]
103. 103
How can we get it? Minimize the slope under constraints.
It also gives the maximum-margin classification!
[Figure: the minimum-slope function separates classes A and B with the maximum margin.]
104. 104
Support vectors (SVs)
[Figure: the samples lying exactly on the $+1$ and $-1$ levels are the support vectors.]
Only those SVs contribute to determining the discriminant function $\mathbf{w}^\top\mathbf{x} + b$.
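A hedged illustration with scikit-learn (toy 1-D data, not the slides’ example); `support_vectors_` exposes exactly the samples that determine the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D samples: class A on the left, class B on the right
X = np.array([[0.0], [0.5], [1.0], [2.5], [3.0]])
y = np.array(["A", "A", "A", "B", "B"])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard-margin SVM
print(clf.support_vectors_)                     # only these samples matter
print(clf.predict([[1.2], [2.2]]))
```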
109. 109
Mapping the feature vector space to a higher-dimensional space
[Figure: in the original $(x_1, x_2)$ space, the two classes (the XOR-like arrangement of 0s and 1s) are not linearly separable.]
After the mapping $\phi: (x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$, they become linearly separable!
111. 111
What happens in the original space?
In the mapped space $(y_1, y_2, y_3)$, the classification boundary $a y_1 + b y_2 + c y_3 + d = 0$ is a plane in 3-D space.
Rewrite it in terms of the original variables.
112. 112
What happens in the original space?
Reverting to the original variables gives $a x_1 + b x_2 + c x_1 x_2 + d = 0$… What is this?
113. 113
What happens in the original space?
Classification boundary: $a x_1 + b x_2 + c x_1 x_2 + d = 0$, i.e., $x_2 = -\dfrac{a x_1 + d}{b + c x_1}$, a curve in the original $(x_1, x_2)$ space.
Linear classification in the higher-dimensional space corresponds to a non-linear classification in the original space.
115. 115
What happens in the original space?
[Figure: another example; in the mapped space with axes $x_1$ and $x_1^2 + x_2^2$, classes A and B become linearly separable, and the corresponding boundary in the original $(x_1, x_2)$ space is a non-linear (closed) curve.]
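A hedged sketch of the general idea (an explicit feature mapping followed by a linear classifier); the mapping here is the product feature $x_1 x_2$ from the earlier slide, and the data are made up:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original (x1, x2) space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(X):
    """Map (x1, x2) -> (x1, x2, x1*x2), a higher-dimensional space."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

clf = SVC(kernel="linear", C=1e6).fit(phi(X), y)   # linear classifier in the mapped space
print(clf.predict(phi(X)))                          # [0 1 1 0] -> separable after mapping
```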
116. 116
Notes about the φ-machine
• Combination with SVM is popular
• The φ-function leads to a “kernel”: in the SVM dual, the mapped vectors appear only through inner products, which can be replaced by a kernel function $k$:
$$\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j\, \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) = \sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
• Choosing a good mapping φ is not trivial
In the past, the choice was done by trial and error
Recently…
117. 117
Deep neural networks can find a good mapping automatically
The feature extraction layers = a mapping φ
The mapping is specified by the weights
The weights (i.e., φ) are optimized via training
This is the so-called “representation learning”
121. 121
AdaBoost: a set of complementary classifiers
1. Train a weak classifier $g_1$ on the training patterns
2. Compute its reliability (e.g., 0.7)
Final decision: if the weighted sum of the classifiers’ outputs is > 0 then A; else B
122. 122
AdaBoost: a set of complementary classifiers
3. Give a large (small) weight to each training sample that is misrecognized (correctly recognized) by $g_1$
124. 124
AdaBoost: a set of complementary classifiers
[Figure: the ensemble so far, with reliabilities 0.7 and 0.43; the final decision is “if the weighted sum > 0 then A; else B”.]
6. Give a large (small) weight to each sample that is misrecognized (correctly recognized) by the newly trained classifier
Repeat until the training accuracy converges.
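A compact sketch of the AdaBoost loop with decision stumps (a generic textbook-style formulation, not the slides’ exact notation; labels are ±1 here instead of A/B):

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """X: (n, d) features, y: labels in {-1, +1}. Returns stumps and reliabilities."""
    n = len(y)
    w = np.ones(n) / n                       # sample weights
    ensemble = []
    for _ in range(rounds):
        # Pick the decision stump (feature, threshold, sign) with smallest weighted error
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # reliability of this stump
        pred = s * np.sign(X[:, j] - thr + 1e-12)
        w *= np.exp(-alpha * y * pred)       # up-weight misrecognized samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    total = sum(a * s * np.sign(X[:, j] - thr + 1e-12) for a, j, thr, s in ensemble)
    return np.sign(total)                    # if the weighted sum > 0 then +1, else -1
```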
127. 127
Near-human performance has been achieved with big data and neural networks
• Character recognition (machine-printed, handwritten, and designed fonts): accuracies from 95.49% to 99.99% [Uchida+, ICFHR2016]
• Scene text detection: F-value = 0.8 on ICDAR2015 Incidental Scene Text [Zhou+, CVPR2017]
• Scene text recognition: 89.6% word recognition rate on ICDAR2013 with CRNN [Shi+, TPAMI2017]
129. 129
Beyond 100% = the computer can detect, read, and collect all text information perfectly:
texts on notebooks, object labels, digital displays, book pages, signboards, posters / ads
So, what do you want to do with the perfect recognition results?
130. 130
In fact, our real goal should NOT be perfect recognition results.
Poor recognition results → perfect recognition results (only a tentative goal) → the real goals:
• the ultimate application built on perfect recognition results
• scientific discovery by analyzing perfect recognition results
131. 131
What will you do in the world beyond 100%?
Ultimate applications:
• Education
• “Total recall” for perfect information search
• Welfare: alarms, translation, information complement
• “Life-log”-related apps: summary, log compression, captioning, question answering, behavior prediction, reminders
Scientific discovery:
• With social science: interaction between scene text and humans, text statistics
• With design science: font shape and impression, discovering typographic knowledge
• With the humanities: historical knowledge, semiology
132. 132
Another direction: use characters to understand ML
• Simple binary, stroke-structured patterns
• Little background clutter
• Small size (e.g., 32x32)
• Big data (e.g., 80,000 samples / class)
• Predefined classes (e.g., 10 classes for digits)
• ML has achieved near-human performance
→ A very good “testbed” not only for evaluating but also for understanding ML