Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A well-known approach proposed in the literature for layout analysis is the Run Length Smoothing Algorithm (RLSA). Here we consider a variant of RLSA, called RLSO (short for “Run Length Smoothing with OR”), that uses the logical OR operator instead of AND and is particularly suited to identifying frames in non-Manhattan layouts. Like RLSA, RLSO relies on thresholds, but these are set according to different criteria than in RLSA. Since setting such thresholds is a hard and unnatural task even for expert users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size.
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation
1. Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica
S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
Via Orabona, 4 - 70126 Bari – Italy
{ferilli, esposito}@di.uniba.it
{fabio.leuzzi, fulvio.rotella}@uniba.it
L.A.C.A.M. - http://lacam.di.uniba.it
2. Introduction
● Automatic document processing a hot topic
― Layout analysis a fundamental step
● Identification of frames (relevant components in the document)
● Performance can determine quality and feasibility of the whole process
● Two different…
● Kinds of sources: Digitized (scanned) vs. Natively digital documents
● Categories of layouts: Manhattan vs. Non-Manhattan
● Types of algorithms: Top-down vs. Bottom-up
● Run Length Smoothing Algorithm
● Manhattan Layout
● Other works exploit or try to improve the RLSA by setting its parameters
● Many works on Manhattan layout
― Top-down strategies
● Fewer works on non-Manhattan layout
― Bottom-up strategies
● The Manhattan assumption holds for many typeset documents and simplifies document processing… BUT it cannot be assumed in general
3. RLSO
Application to scanned images
RLSO (Run Length Smoothing with OR)
1) horizontal smoothing with threshold th, row by row
2) vertical smoothing with threshold tv, column by column
3) logical OR of the images obtained in steps 1 and 2
[Figure: smoothing example with th = 5 and tv = 4; one panel is labeled "(AND)"]
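The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `smooth_line` and `rlso` are hypothetical, the image is assumed to be a list of rows with 1 for black and 0 for white, a white run is filled only when it is strictly shorter than the threshold, and only runs enclosed between two black pixels are considered (border runs are left untouched, which is an assumption).

```python
from typing import List

Image = List[List[int]]  # assumed encoding: 1 = black pixel, 0 = white pixel


def smooth_line(line: List[int], threshold: int) -> List[int]:
    """Fill white runs shorter than `threshold` that lie between two black pixels."""
    out = list(line)
    last_black = None  # index of the most recent black pixel seen so far
    for i, px in enumerate(line):
        if px == 1:
            if last_black is not None:
                gap = i - last_black - 1  # length of the white run in between
                if 0 < gap < threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def rlso(image: Image, th: int, tv: int) -> Image:
    """Horizontal smoothing, vertical smoothing, then pixel-wise OR of the two."""
    # Step 1: horizontal smoothing, row by row.
    horiz = [smooth_line(row, th) for row in image]
    # Step 2: vertical smoothing, column by column (transpose, smooth, transpose back).
    cols = [smooth_line(list(col), tv) for col in zip(*image)]
    vert = [list(row) for row in zip(*cols)]
    # Step 3: logical OR of the two smoothed images.
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
```

Replacing `h | v` with `h & v` in the last line would give the AND combination used by RLSA, which keeps a pixel black only when both smoothing passes agree.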
5. RLSO
Application to born-digital documents
● Set horizontal/vertical distance thresholds th/tv
● build a frame for each basic block
● H = {(d_h, b′, b″) | b′ and b″ are horizontally adjacent basic blocks and d_h is the horizontal distance between them}
● for all (d_h, b′, b″) ∈ H s.t. d_h ≤ th, merge the frames to which b′ and b″ belong
● V = {(d_v, b′, b″) | b′ and b″ are vertically adjacent basic blocks and d_v is the vertical distance between them}
● for all (d_v, b′, b″) ∈ V s.t. d_v ≤ tv, merge the frames to which b′ and b″ belong
[Figure legend: reference block, adjacent blocks, non-adjacent blocks, horizontal distance, vertical distance]
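The merging procedure for born-digital documents can be sketched with a union-find structure over the basic blocks. This is a minimal sketch under stated assumptions, not the authors' code: the function name `merge_frames` and the `(x0, y0, x1, y1)` box encoding are hypothetical, and adjacency is simplified so that two blocks count as horizontally adjacent when their vertical extents overlap (and vertically adjacent when their horizontal extents overlap), without checking whether a third block lies between them.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed encoding: (x0, y0, x1, y1)


def _overlap(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True when the intervals [a0, a1] and [b0, b1] share more than a point."""
    return min(a1, b1) > max(a0, b0)


def merge_frames(blocks: List[Box], th: float, tv: float) -> List[List[int]]:
    """Group block indices into frames by merging blocks closer than th / tv."""
    parent = list(range(len(blocks)))  # initially one frame per basic block

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    for i, a in enumerate(blocks):
        for j in range(i + 1, len(blocks)):
            b = blocks[j]
            # Simplified horizontal adjacency: the vertical extents overlap.
            if _overlap(a[1], a[3], b[1], b[3]):
                dh = max(a[0], b[0]) - min(a[2], b[2])
                if dh <= th:
                    union(i, j)
            # Simplified vertical adjacency: the horizontal extents overlap.
            if _overlap(a[0], a[2], b[0], b[2]):
                dv = max(a[1], b[1]) - min(a[3], b[3])
                if dv <= tv:
                    union(i, j)

    frames = {}
    for i in range(len(blocks)):
        frames.setdefault(find(i), []).append(i)
    return sorted(frames.values())
```

For example, with three blocks arranged in an L shape within the thresholds and a fourth one far to the right, the first three end up in one frame and the fourth stays alone. Because merging is transitive, the resulting frames need not be rectangular, which is what makes the approach usable on non-Manhattan layouts.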
7. RLSO
● Run Length Smoothing algorithms are based on thresholds
― Hard to set properly by hand (not a typical human activity)
― Heuristic approaches (ad hoc)
― Undermines the idea of automatic processing
― Fixed thresholds not suitable for documents with several different spacings
Automatic assessment of RLSO thresholds
8. RLSO
Automatic threshold assessment
● Study of run-length behavior
― Histogram very irregular
● Peaks = most frequent spacings
● Peak clusters = equally spaced components
― Hard to exploit by automatic techniques
― Cumulative histograms more regular
― Bar b = number of runs of length greater than or equal to b: $H'(i) = \sum_{j \geq i} H(j)$
● Monotonically decreasing
― Flat zones = lengths for which no runs are present
● Scaled down to 10%
― Reduces variability
[Figure 1: a fragment of a scientific paper]
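The cumulative histogram described above, where bar b counts the runs of length at least b, can be computed as follows. This is an illustrative sketch: the function names are hypothetical, pixels are assumed to be 1 for black and 0 for white, only white runs enclosed between two black pixels are counted (an assumption), and the 10% scaling step is not reproduced because the slide does not specify how it is applied.

```python
from collections import Counter
from typing import List


def white_run_lengths(line: List[int]) -> List[int]:
    """Lengths of the white runs enclosed between two black pixels in one line."""
    lengths = []
    last_black = None
    for i, px in enumerate(line):
        if px == 1:
            if last_black is not None and i - last_black > 1:
                lengths.append(i - last_black - 1)
            last_black = i
    return lengths


def cumulative_histogram(lengths: List[int]) -> List[int]:
    """Return cum where cum[b] is the number of runs of length >= b."""
    hist = Counter(lengths)  # hist[b] = number of runs of length exactly b
    max_len = max(hist, default=0)
    cum = [0] * (max_len + 1)
    running = 0
    # Accumulate from the longest run down, so cum[b] = sum of hist[j] for j >= b.
    for b in range(max_len, -1, -1):
        running += hist[b]
        cum[b] = running
    return cum
```

With run lengths 1, 1, 2 and 5 the result is `[4, 4, 2, 1, 1, 1]`: the values never increase, and the flat stretch over bars 3 to 5 reflects the absence of runs of length 3 and 4.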
9. RLSO
Automatic threshold assessment
● Select threshold on flat zones
― Derivative a good indicator
● Slope = 0
● Discrete approximation on bar b
― Tolerance possible
● Slope = -30
― Skip starting and trailing flat zones
● Starting zone = missing small run lengths
● Trailing zone = merge whole content
● Iteration of technique on previously smoothed image
― Finds progressively more spaced components
[Figures 1-a, 1-b: successive applications of RLSO with automatic threshold assessment on Figure 1]
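One possible reading of the flat-zone selection is sketched below. Several points are assumptions rather than the authors' method: the function name `pick_threshold` is hypothetical, the derivative at bar b is approximated by the forward difference between bars b and b + 1 (the slide's own formula is not available), the tolerance is expressed as a maximum drop in run count rather than as a slope angle, and the first flat zone after the starting one is chosen.

```python
from typing import List, Optional


def pick_threshold(cum: List[int], tolerance: int = 0) -> Optional[int]:
    """Pick a threshold on the first usable flat zone of a cumulative histogram.

    cum[b] is the number of runs of length >= b (non-increasing in b).
    Returns None when only the starting or trailing flat zones exist.
    """
    n = len(cum)

    def is_flat(b: int) -> bool:
        # Assumed discrete derivative: drop between bar b and bar b + 1.
        return cum[b] - cum[b + 1] <= tolerance

    b = 0
    # Skip the starting flat zone (small run lengths that never occur).
    while b < n - 1 and is_flat(b):
        b += 1
    # Walk down the slope until the next flat bar.
    while b < n - 1 and not is_flat(b):
        b += 1
    # Nothing left, or only the trailing zone where no runs remain.
    if b >= n - 1 or cum[b] == 0:
        return None
    return b
```

With zero tolerance, a flat bar b means that no run has length exactly b, so filling runs shorter than b and filling runs up to b give the same result. On the cumulative histogram `[4, 4, 2, 1, 1, 1]` the function returns 3, which would fill the runs of length 1 and 2 and leave the run of length 5 open for a later iteration.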
11. Conclusions
● RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the document image and fills them with black pixels whenever they are shorter than a given threshold
– Both Manhattan and Non-Manhattan Layout
– Version for natively digital documents
● Automatic thresholding effective on documents having
– single character size
– different spacings
● Good baseline towards more complex documents
– different character sizes
– graphics
● Current and future work
– Stop criterion for iteration
– Clustering based on positioning and spacing