A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica

S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
Via Orabona, 4 - 70126 Bari – Italy
{ferilli, esposito}@di.uniba.it, {fabio.leuzzi, fulvio.rotella}@uniba.it
L.A.C.A.M. http://lacam.di.uniba.it

Introduction
● Automatic document processing a hot topic
― Layout analysis a fundamental step

● Identification of frames (relevant components in the document)

● Performance can determine quality and feasibility of the whole process

● Two different…

● Kinds of sources: Digitized (scanned) vs. Natively digital documents

● Categories of layouts: Manhattan vs. Non-Manhattan

● Types of algorithms: Top-down vs. Bottom-up

● Run Length Smoothing Algorithm
● Manhattan Layout

● Other works exploit or try to improve the RLSA by setting its parameters

● Many works on Manhattan layout

― Top-down strategies

● Fewer works on non-Manhattan layout

― Bottom-up strategies

● The Manhattan assumption holds for many typeset documents and simplifies
document processing, BUT it cannot be assumed in general

RLSO
Application to scanned images
RLSO (Run Length Smoothing with OR)
1) horizontal smoothing with threshold th, row by row

2) vertical smoothing with threshold tv, column by column
3) logical OR of the images obtained in steps 1 and 2
[Figure: worked example with th = 5 and tv = 4; panel label "(AND)"]
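The smoothing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `smooth_line` and `rlso` are hypothetical, the image is assumed to be a list of rows with 1 = black and 0 = white, a white run is filled when it is strictly shorter than the threshold, and runs touching the image border are assumed to be left untouched.

```python
def smooth_line(line, threshold):
    """Fill white runs (0s) strictly shorter than `threshold` with black (1s).

    Assumption: only runs enclosed by black pixels on both sides are filled.
    """
    out = list(line)
    n = len(line)
    i = 0
    while i < n:
        if line[i] == 0:
            j = i
            while j < n and line[j] == 0:
                j += 1
            # The white run is line[i:j]; it is bounded if it touches no border.
            bounded = i > 0 and j < n
            if bounded and (j - i) < threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def rlso(image, th, tv):
    """Horizontal smoothing, vertical smoothing, then pixel-wise OR."""
    # Step 1: smooth every row with the horizontal threshold.
    horiz = [smooth_line(row, th) for row in image]
    # Step 2: smooth every column of the ORIGINAL image with the vertical threshold.
    cols = [smooth_line(list(col), tv) for col in zip(*image)]
    vert = [list(row) for row in zip(*cols)]
    # Step 3: logical OR of the two smoothed images.
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [1, 0, 0, 1],
    ]
    for row in rlso(page, th=3, tv=2):
        print(row)
```

Because the OR keeps every pixel that either pass turned black, one horizontal and one vertical pass are enough in this sketch.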


RLSO
Application to born-digital documents
● Set horizontal/vertical distance thresholds th/tv
● build a frame for each basic block
● H ={(dh, b’, b’’) | b’ and b’’ are horizontally adjacent basic blocks
and dh is the horizontal distance between them}
● for all (dh, b’, b’’) ∈ H s.t. dh ≤ th, merge the frames to which b’ and b’’
belong

● V = {(dv, b’, b’’) | b’ and b’’ are vertically adjacent basic blocks
and dv is the vertical distance between them}
● for all (dv, b’, b’’) ∈ V s.t. dv ≤ tv, merge the frames to which b’ and b’’ belong

[Figure: a reference block with its adjacent and non-adjacent blocks, showing horizontal and vertical distances]
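The frame-merging procedure for born-digital documents can be sketched as a union-find over basic blocks. This is a hedged illustration only: the names `gap` and `merge_frames` and the `(x0, y0, x1, y1)` block format are hypothetical, and adjacency is simplified here to "the projections on the other axis overlap", without checking whether a third block lies in between, which the slides do not spell out.

```python
def gap(a, b, axis):
    """Distance between blocks a and b along `axis` (0 = horizontal, 1 = vertical).

    Blocks are (x0, y0, x1, y1) tuples. Returns None when the blocks do not
    overlap on the other axis (assumed here to mean "not adjacent").
    """
    other = 1 - axis
    if min(a[other + 2], b[other + 2]) <= max(a[other], b[other]):
        return None
    return max(0, max(a[axis], b[axis]) - min(a[axis + 2], b[axis + 2]))


def merge_frames(blocks, th, tv):
    """Group block indices into frames using thresholds th (horizontal) and tv (vertical)."""
    parent = list(range(len(blocks)))  # one frame per basic block to start with

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            dh = gap(blocks[i], blocks[j], 0)
            dv = gap(blocks[i], blocks[j], 1)
            if (dh is not None and dh <= th) or (dv is not None and dv <= tv):
                parent[find(i)] = find(j)  # merge the two frames

    frames = {}
    for i in range(len(blocks)):
        frames.setdefault(find(i), []).append(i)
    return sorted(frames.values())


if __name__ == "__main__":
    blocks = [(0, 0, 10, 5), (12, 0, 20, 5), (0, 8, 10, 12), (100, 100, 110, 105)]
    print(merge_frames(blocks, th=3, tv=3))
```

The pairwise loop is quadratic in the number of blocks, which is acceptable for a sketch; a real system would likely index blocks spatially.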


RLSO
● Run Length Smoothing algorithms based on thresholds
― Hard to set properly by hand (not a typical human activity)
― Heuristic approaches (ad hoc)
― Undermines the idea of automatic processing
― Fixed thresholds not suitable for documents with several different
spacings

Automatic assessment of RLSO thresholds

RLSO
Automatic threshold assessment
● Study of run length behavior
― Histogram very irregular
● Peaks = most frequent spacings
● Peak clusters = equally spaced components
― Hard to exploit by automatic techniques
― Cumulative histograms more regular
● Bar b = runs larger than or equal to b: $H'(i) = \sum_{j \ge i} H(j)$
● Monotonically decreasing
― Flat zones = lengths for which no runs are present
● Scaled down to 10%
― Reduces variability

Figure 1. A fragment of a scientific paper
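As a rough illustration of the cumulative histogram described above, the following sketch collects white run lengths from a binary line and builds H'(i) as the number of runs of length at least i. The function names are hypothetical, 0 is assumed to mean white, and runs touching the line border are counted like any other run, which is a simplification. The "scaled down to 10%" step is left out because the slides do not say how the scaling is applied.

```python
from collections import Counter


def white_run_lengths(line):
    """Return the lengths of consecutive runs of white pixels (0s) in a line."""
    runs = []
    count = 0
    for px in line:
        if px == 0:
            count += 1
        else:
            if count:
                runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def cumulative_histogram(run_lengths):
    """Build H' with H'[i] = number of runs whose length is >= i.

    The list has one extra trailing entry so that H'[max_length + 1] == 0.
    """
    hist = Counter(run_lengths)          # H(j): how many runs have length j
    size = max(hist, default=0) + 2
    cum = [0] * size
    for i in range(size - 2, -1, -1):    # accumulate from the longest run down
        cum[i] = cum[i + 1] + hist.get(i, 0)
    return cum


if __name__ == "__main__":
    line = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
    runs = white_run_lengths(line)
    print(runs)
    print(cumulative_histogram(runs))
```

By construction the result never increases with i, and two equal neighbouring bars mean that no run has the smaller of the two lengths.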

RLSO
Automatic threshold assessment
● Select threshold on flat zones
― Derivative a good indicator
● Slope = 0
● Discrete approximation on bar b
― Tolerance possible
● Slope = – 30
― Skip starting and trailing flat zones
● Starting zone = missing small run lengths
● Trailing zone = merge whole content
● Iteration of technique on previously smoothed image
― Finds progressively more spaced components

Figures 1-a and 1-b. Successive applications of RLSO with automatic threshold assessment on Figure 1.
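The flat-zone selection can be sketched as follows, under several labeled assumptions: the slope at bar b is approximated by the forward difference H'(b+1) - H'(b), `tolerance` is expressed in histogram counts, and the threshold returned is the first bar of the first flat zone that is neither the starting nor the trailing one. The name `pick_threshold` is hypothetical, and the slides do not state which point of a flat zone the authors actually choose.

```python
def pick_threshold(cum_hist, tolerance=0):
    """Return the first bar of the first interior flat zone, or None.

    cum_hist[i] is assumed to hold the number of runs of length >= i.
    """
    # Forward-difference approximation of the derivative at each bar (assumption).
    slopes = [cum_hist[b + 1] - cum_hist[b] for b in range(len(cum_hist) - 1)]
    flat = [abs(s) <= tolerance for s in slopes]

    # Skip the starting flat zone (no small run lengths present).
    start = 0
    while start < len(flat) and flat[start]:
        start += 1

    # Skip the trailing flat zone (a threshold there would merge everything).
    end = len(flat)
    while end > start and flat[end - 1]:
        end -= 1

    for b in range(start, end):
        if flat[b]:
            return b
    return None


if __name__ == "__main__":
    # Cumulative histogram for run lengths 2, 2, 3, 3, 3, 10, 10, 40.
    cum = [8, 8, 8, 6] + [3] * 7 + [1] * 30 + [0, 0]
    print(pick_threshold(cum))
```

Iterating the technique, as the slides describe, would mean smoothing the image with the selected threshold, recomputing the histogram on the result, and selecting again; no stop criterion is sketched here.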

Conclusions
● RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the
document image and fills them with black pixels whenever they are shorter than a
given threshold
– Both Manhattan and Non-Manhattan Layout
– Version for natively digital documents
● Automatic thresholding effective on documents having
– single character size
– different spacings

● Good baseline towards more complex documents
– different character sizes
– graphics
● Current and future work
– Stop criterion for iteration
– Clustering based on positioning and spacing
