A RUN LENGTH SMOOTHING-BASED ALGORITHM
FOR NON-MANHATTAN DOCUMENT SEGMENTATION

Stefano Ferilli, Fabio Leuzzi, Fulvio Rotella and Floriana Esposito
Dipartimento di Informatica – Università di Bari
Via E. Orabona, 4 70126 Bari
{ferilli, esposito}@di.uniba.it
{fabio.leuzzi,fulvio.rotella}@uniba.it


Abstract        Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A famous approach proposed in the literature for layout analysis is the RLSA. Here we consider a variant of RLSA, called RLSO (short for “Run-Length Smoothing with OR”), that exploits the OR logical operator instead of the AND and is particularly suited to the identification of frames in non-Manhattan layouts. Like RLSA, RLSO is based on thresholds, but these must be set according to different criteria than those that work for RLSA. Since setting such thresholds is a hard and unnatural task for (even expert) users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by the use of a uniform text font size.




1.         Introduction
   Automatic document processing is a hot topic in the current computer sci-
ence landscape [Nagy, 2000], since the huge and ever-increasing amount of
available documents cannot be tackled by manual expert work. Document
layout analysis is a fundamental step in the document processing workflow,
which aims at identifying the relevant components of the document (often called
frames, because they can be identified with rectangular regions in the page)
that deserve further, specialized processing. Thus, the quality of the layout
analysis task outcome can determine the quality and even the feasibility of the
whole document processing activity. Document analysis is carried out on (a
representation of) the source document, which can be provided either in the form
of a digitized document, i.e. as a bitmap, or in the form of a born-digital docu-
ment (such as a word processor file, a PostScript or PDF file, etc.), made up of
blocks. In the former case, the basic components are just pixels or connected
components in a raster image, while in the latter they are higher level items
such as words or images. The differences between the two cases are so marked
that specific techniques are typically needed for each of them.
   The literature usually distinguishes two kinds of arrangements of the docu-
ment components: Manhattan (also called Taxi-Cab) and non-Manhattan lay-
outs. Most documents (e.g., almost all typeset documents) fall in the for-
mer category, in which significant content blocks correspond to rectangles
surrounded by perpendicular background areas. Conversely, components in
non-Manhattan layouts have irregular shapes and are placed in the page in
‘fancier’ positions. Many algorithms assume that the document under pro-
cessing has a Manhattan layout, which significantly simplifies the process-
ing activity. Unfortunately, this is not true in general. Previous work agrees
that bottom-up techniques are the best candidates for handling the non-Manhattan
case, since the grouping of basic components can ignore high-level regularities
when identifying significant aggregates. This work pro-
poses a general bottom-up strategy to tackle the layout analysis of (possibly)
non-Manhattan documents, and two specializations of it to handle the different
cases of bitmaps and PS/PDF sources. We propose a variant of the Run Length
Smoothing Algorithm, called RLSO, that exploits the OR logical operator in-
stead of the AND [Ferilli et al., 2009]. Advantages range from efficiency, to
the ability to successfully handle non-Manhattan layouts, to the possibility
of working on both scanned and born-digital documents with minor changes.
Like the classical technique, however, it requires the user to set thresholds for
the run length smoothing, which is a quite delicate issue. Indeed, the quality
of the overall result depends on setting proper thresholds, but this is not a typi-
cal activity for humans, who are for this reason often unable to provide correct
values. Additionally, there is no single threshold that can fit all documents, but
each case may require a specific value depending on its peculiar layout fea-
tures. Thus, we also propose a technique for automatically assessing the run
length thresholds to be applied in the RLSO algorithm.
The next section introduces the proposed Run Length Smoothing tech-
nique for non-Manhattan document segmentation, relating it to relevant past
work in the literature, and presents the automatic assessment of document-specific
thresholds for it. Then, Section 3 shows the effect of applying this technique on
real-world documents having layouts at different levels of complexity. Lastly,
Section 4 concludes the paper and outlines current and future work on the sub-
ject.
Figure 1. A non-Manhattan fragment of a scanned document (a), the corresponding RLSA output (b), and the corresponding RLSO output (c)


2.        Run Length Smoothing-based Algorithms
   In this section we propose a novel approach to the segmentation of both
scanned and born-digital documents together with the automatic assessment of
the needed thresholds.

2.1        Application to scanned images
   The proposed RLSO approach combines ideas taken from RLSA, CLiDE and DOCSTRUM with other novelties. In the case of digitized documents, it works as follows (a sketch in code follows the list):
      1 horizontal smoothing, carried out row-wise on the image with threshold t_h;

      2 vertical smoothing, carried out column-wise on the image with threshold t_v;

      3 logical OR of the images obtained in steps 1 and 2 (the outcome has a
        black pixel wherever at least one of the input images has one, or a white
        pixel otherwise).
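
   As a minimal illustrative sketch (not the authors' actual implementation), the three steps could be coded as follows, assuming the page is a NumPy array of 0/1 values in which 1 marks a black (ink) pixel; the helper name smooth_runs is ours:

    import numpy as np

    def smooth_runs(line, threshold):
        # Fill with black every white run shorter than `threshold`.
        # A trailing white run never meets a closing black pixel, so the
        # page margin at the end of the line is left untouched.
        out = line.copy()
        run_start = None
        for i, v in enumerate(line):
            if v == 0 and run_start is None:
                run_start = i
            elif v == 1 and run_start is not None:
                if i - run_start < threshold:
                    out[run_start:i] = 1
                run_start = None
        return out

    def rlso(image, th, tv):
        # Steps 1 and 2: independent horizontal and vertical smoothing.
        horizontal = np.array([smooth_runs(row, th) for row in image])
        vertical = np.array([smooth_runs(col, tv) for col in image.T]).T
        # Step 3: logical OR of the two smoothed images.
        return horizontal | vertical

   The further improvement discussed below would simply apply the vertical smoothing to the `horizontal` result itself, skipping the OR.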
    While the RLSA returns rectangular frames (see Figure 1-b), the RLSO al-
gorithm identifies irregular connected components (see Figure 1-c), each of
which is to be considered a frame. The actual content of a frame can be ex-
tracted by applying a logical AND of the connected component with the origi-
nal image. Figure 1 refers to a non-Manhattan fragment of document layout.
    RLSO improves on classical RLSA in both flexibility and efficiency of the
results. A straightforward advantage of using the OR operator is that it
preserves all black pixels from the original smoothed images, and hence no sec-
ond horizontal smoothing is required to restore the layout fragments dropped
by the AND operation. Another improvement in efficiency is due to the fact
that, since the aim is grouping small original connected components (such as
characters) into larger ones (such as frames), shorter thresholds are sufficient
(to fill inter-character, inter-word or inter-line spaces), and hence the algorithm
is faster because it has to fill fewer runs with black pixels. A further improvement
may consist of applying the vertical smoothing of step 2 directly to the image re-
sulting from step 1, which allows step 3 to be skipped entirely. This latter option
should not introduce further connections among components, and could even
improve the quality of the result (e.g., by handling cases in which adjacent
rows have vertically aligned inter-word spacings, which would not be captured
by the pure vertical smoothing).

2.2        Application to born-digital documents
   In the case of scanned images processed by the RLSO algorithm, black pix-
els play the role of basic blocks, and two blocks are merged together whenever
the horizontal or vertical distance in white pixels (i.e., the white run length) be-
tween them is below a given threshold. Thus, RLSO can be cast as a bottom-up
and distance-based technique. The same approach can be, with minor changes,
extended to the case of digital documents, where the basic blocks are rectangu-
lar areas of the page and the distance between adjacent blocks can be defined as
the Euclidean distance between their closest edges, according to the projection-
based approach exploited in CLiDE [Simon et al., 1997]. Thus, assuming
initially a frame for each basic block, more complex frames can be grown
bottom-up by progressively merging two frames whose distance (computed
as the shortest distance between the blocks they contain) is below a given thresh-
old. Such a technique resembles a single-link clustering procedure [Berkhin,
2002], where, however, the number of desired clusters is not fixed in advance
but derives automatically from the number of components whose distance falls
below the given thresholds.
   A problem to be solved in such an extension is how to define adjacency
between blocks. Indeed, in scanned images the pixels are lined up in a grid
such that each element has at most four adjacent elements in the horizontal
and vertical directions. Conversely, rectangular blocks of variable size may
have many adjacent blocks on the same side, according to their projections
(as explained in [Simon et al., 1997]). A similar issue is considered in the
DOCSTRUM, where the basic blocks correspond to text characters, and is
tackled by imposing a limit on the number of neighbors to be considered rather
than on the merging distance. This solution would not fit our case, which is
more general: a block can include a variable number of characters, and
hence the number of nearest neighbors to be considered cannot be as easily
determined as in DOCSTRUM. The resulting algorithm works as follows
(a sketch in code follows the list), where steps 2-3 correspond to horizontal
smoothing and steps 4-5 correspond to vertical smoothing:


      1 build a frame for each basic block, such that the corresponding basic
        block is the only element of the frame;
    2 compute the set H of all possible triples (d_h, b1, b2) where b1 and b2
      are horizontally adjacent basic blocks and d_h is the horizontal distance
      between them;

    3 while H is not empty and the element having the lowest d_h value,
      (d_h, b1, b2), is such that d_h < t_h:

         (a) merge into a single frame the frames to which b1 and b2 belong;
         (b) remove that element from H;

    4 compute the set V of all possible triples (d_v, b1, b2) where b1 and b2 are
      vertically adjacent basic blocks and d_v is the vertical distance between
      them;

    5 while V is not empty and the element having the lowest d_v value,
      (d_v, b1, b2), is such that d_v < t_v:

         (a) merge into a single frame the frames to which b1 and b2 belong;
         (b) remove that element from V.
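
Purely as an illustrative sketch, the merging loop above can be rendered as follows, assuming the triples have already been computed; the union-find structure and the heapq priority queue are our own implementation choices, anticipating the efficiency remarks below:

    import heapq

    def merge_frames(n_blocks, h_triples, v_triples, th, tv):
        # h_triples / v_triples: lists of (distance, b1, b2), one per pair
        # of horizontally / vertically adjacent basic blocks (steps 2 and 4).
        parent = list(range(n_blocks))   # step 1: one frame per basic block

        def find(i):                     # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for triples, threshold in ((h_triples, th), (v_triples, tv)):
            heapq.heapify(triples)       # priority queue on the distance
            # steps 3 and 5: merge nearest pairs while below the threshold
            while triples and triples[0][0] < threshold:
                _, b1, b2 = heapq.heappop(triples)
                parent[find(b1)] = find(b2)

        return [find(i) for i in range(n_blocks)]  # frame label per block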
Figure 2. (a) A non-Manhattan fragment of a (scanned/natively digital) document and the corresponding RLSO output on (b) the scanned document and (c) the natively digital document.

After obtaining the final frames, each of them can be handled separately by
reconstructing the proper top-down, left-to-right ordering of the basic blocks it
includes. Figure 2-c shows the output of this version of the algorithm, with the
discovered frames highlighted, on the born-digital version of the document in
Figure 2-a. The efficiency of the procedure can be improved by treating the H
and V sets as priority queues based on the distance attribute, and by
preliminarily organizing and storing the basic blocks top-down in ‘rows’ and
left-to-right inside these ‘rows’, so that considering each of them in turn, from
top to bottom and from left to right, is sufficient for identifying all pairs of adja-
cent blocks. This method resembles that proposed in [Simon et al., 1997], with
the difference that no graph is built in this case, and more efficiency is obtained
by considering adjacent components only. The “list-of-rows” structure allows
adjacent blocks to be found efficiently, limiting the number of useless compar-
isons. The “nearest first” grouping is mid-way between the minimum spanning
tree technique of [Simon et al., 1997] and a single-link clustering technique.
Compared to the ideas in [O’Gorman, 1993], we exploit the distance between
component borders rather than between component centers. Indeed, the lat-
ter option, adopted in DOCSTRUM, cannot be exploited in our case, since the
starting basic components can range from single characters to (fragments of)
words, and this would cause a lack of regularity that would hinder proper
nearest neighbour identification.
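
As an illustration of the “list-of-rows” organization, and under the assumption that blocks are (left, top, right, bottom) rectangles and that blocks sharing vertical overlap belong to the same row, the horizontal triples could be collected as follows; the grouping criterion and the function horizontal_triples are our own simplification, not the paper's exact procedure:

    def horizontal_triples(blocks):
        # blocks: list of (left, top, right, bottom) rectangles.
        rows = []                        # each row: [top, bottom, [indices]]
        for i in sorted(range(len(blocks)), key=lambda i: blocks[i][1]):
            left, top, right, bottom = blocks[i]
            for row in rows:             # same row if vertical bands overlap
                if top < row[1] and bottom > row[0]:
                    row[0] = min(row[0], top)
                    row[1] = max(row[1], bottom)
                    row[2].append(i)
                    break
            else:
                rows.append([top, bottom, [i]])
        triples = []
        for _, _, members in rows:
            members.sort(key=lambda i: blocks[i][0])  # left-to-right
            for a, b in zip(members, members[1:]):    # consecutive = adjacent
                gap = max(0, blocks[b][0] - blocks[a][2])
                triples.append((gap, a, b))
        return triples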

2.3       Automatic assessment of RLSO thresholds
   A very important issue for the effectiveness of Run Length Smoothing-
based algorithms is the choice of proper horizontal/vertical thresholds. Indeed,
too high thresholds could erroneously merge different content blocks in the
document, while too low ones could return an excessively fragmented layout.
Thus, a specific study of each single document by a human expert would in
principle be needed to obtain effective results, which would undermine the very
idea of a fully automatic document processing system. The threshold
assessment issue becomes even more complicated when documents including
different spacings among components are considered. This typically happens
in document pages that include several different kinds of components, which are
independent of each other and hence exploit different spacing standards (al-
though the spacings are homogeneous within each component). For instance,
a title, usually written in larger font characters, will use an inter-character,
inter-word and inter-line spacing wider than that of normal text. Obviously,
the attempt to merge related wider components would also merge smaller un-
related ones as an undesirable side effect. Thus, a single threshold that is suit-
able for all components cannot be found in these cases, but rather the layout
analysis process should be able to identify homogeneous areas of the page and
to selectively define and apply different thresholds for each of them. Hence,
a strong motivation for the development of methods that can automatically as-
sess proper thresholds, possibly based on the specific document at hand. This
is why some works in the literature have been devoted to trying and finding
automatically values for the th , tv and ta parameters of the RLSA in order to
improve its performance (e.g., [Gatos et al., 2000]). Studies on RLSA sug-
gest setting large thresholds, in order to keep the number of black pixels in the
smoothed images sufficiently large to prevent the AND operator from dropping
again most of them (indeed, this is the reason why an additional horizontal
smoothing is provided for).
   The use of the OR operator in RLSO calls for small threshold values, which
brings several advantages in terms of efficiency; the thresholds, however, must
be assessed carefully, because once two components are merged no later step
can split them again, and this can be a problem when logically distinct com-
ponents lie very close to each other. We carried out a study of the behavior of
run lengths in several documents, searching for regularities that could help in
developing an automatic thresholding technique for RLSO, one able to define
proper thresholds for each document at hand based on its specific distribution
of spacings.

Figure 3. A sample study on run length distribution on a fragment of a scientific paper: (o) the original scanned document, histograms of the (a) horizontal and (b) vertical run lengths, and the corresponding (c, d) cumulative histograms.

Specifically,
we focus on the scanned document case, leaving an analysis of possible ap-
plication to natively digital documents for future work. Consider the sample
document depicted in Figure 3-o, corresponding to a fragment of the first page
in reference [Cesarini et al., 1999], and representing the authors and abstract
of a scientific paper. Figures 3-a and 3-b show the histograms of the white run
length distribution, reporting the number of runs (horizontal and vertical,
respectively) for each possible length (notice that the runs from the borders
to the first black pixels have been ignored). Several irregular peaks are im-
mediately apparent, clustered into groups separated by an absence of runs
over several adjacent lengths. Indeed, the peaks correspond
to the most frequent spacings in the document, and the most prominent peak
clusters can be supposed to represent the homogeneous spacings correspond-
ing to different layout components. More precisely, a significantly high and
wide peak is present for very short runs, corresponding to the most frequent
spaces in the page, i.e. intra-letter, inter-letter and inter-word spaces in the
horizontal case, and intra-letter and inter-line spaces in the vertical case. After
such an initial peak, the behavior is quite different for the horizontal (Figure 3-
a) and vertical (Figure 3-b) cases. In the former the histogram shows another
noticeable peak immediately after the first one, corresponding to the distance
between the first two and the last two author names, and then becomes quite
flat, with a few small peaks in the middle that, looking at the original image,
should correspond to the runs between the outlines of some letters. Conversely,
the vertical histogram has several peaks at various lengths, which are due to the
different inter-line spacings used in the fragment. This suggests, as a first idea,
that each peak corresponds to a group of homogeneous spacings that are likely
to separate blocks within the same frame, and hence should be somehow related to
the various thresholds of interest. However, the histograms are too irregular
to be usefully exploited directly. For this reason, a more regular
(and interpretable) shape can be obtained by considering the cumulative his-
tograms of run lengths (see Figures 3-c and 3-d), where each bar corresponds
to a possible run length and represents the number of runs in the given image
having length greater than or equal to that value. To introduce some tolerance,
the actual histogram bars are scaled down to 10% of their original value. Now,
the shape of the cumulative graphic is clearly monotonically decreasing, and
flat zones identify lengths for which no runs are present in the document (or,
equivalently, if several adjacent run lengths are missing, the slope is 0). Mathe-
matically, the derivative of the shape is always less than or equal to 0, and indi-
cates variations in the number of runs introduced by a given length: 0 means no
runs for that length, while the larger the absolute value, the greater the difference
in the number of runs between adjacent lengths at that point. In our implementation,
the slope at a given bar b of a histogram H(·) is computed according to the
difference between the previous bar b − 1 and the next bar b + 1 as
$$\frac{H(b+1) - H(b-1)}{(b+1) - (b-1)} = \frac{H(b+1) - H(b-1)}{2} \qquad (1)$$

for inner bars; as

$$\frac{H(b+1) - H(b)}{(b+1) - b} = H(b+1) - H(b) \qquad (2)$$

for the first bar, and as

$$\frac{H(b) - H(b-1)}{b - (b-1)} = H(b) - H(b-1) \qquad (3)$$
for the last bar. A different definition could take into account only equation (2)
for all bars (except the last one).¹
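
   A sketch of this threshold assessment, under the assumption that the white run lengths (border runs excluded) have already been collected into a list, might read as follows; the integer 10% scaling makes near-flat zones read as exactly flat, which is the tolerance mentioned above, while the function name and structure are ours:

    def auto_threshold(run_lengths):
        # Plain histogram of white run lengths (indices 1..max_len).
        max_len = max(run_lengths)
        hist = [0] * (max_len + 2)
        for r in run_lengths:
            hist[r] += 1
        # Cumulative histogram: runs of length >= b, scaled down to 10%
        # with integer truncation so that tiny variations flatten out.
        cum = [0] * (max_len + 2)
        for b in range(max_len, 0, -1):
            cum[b] = cum[b + 1] + hist[b]
        cum = [c // 10 for c in cum]

        def slope(b):                    # equations (1)-(3)
            if b == 1:
                return cum[2] - cum[1]
            if b == max_len:
                return cum[b] - cum[b - 1]
            return (cum[b + 1] - cum[b - 1]) / 2

        b = 1
        while b < max_len and slope(b) == 0:   # skip the initial flat
            b += 1                             # region (see footnote 1)
        while b < max_len and slope(b) != 0:   # first zero-slope length
            b += 1
        return b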
   Adopting as RLSO thresholds the first horizontal and vertical length where
the slope becomes 0 yields the result in Figure 4-a1. The result successfully
separates the abstract and the authors in different connected components (i.e.,
frames), as desired. Actually, one can note that two components are found for
the authors, each containing a pair of author names. This happens because in
the author section a different spacing is exploited, so that the space between
the second and third authors is bigger than that between the first-second and
third-fourth pairs.

¹Note that very short white runs, made up of a few pixels, are often unlikely, which would yield an initial flat region in the histogram, having slope 0. Of course this region should be skipped for our purposes.

Figure 4. (a, a1, a2) Successive applications of RLSO with automatic threshold assessment on a fragment of a scientific paper, and the corresponding (b, b1, b2) horizontal and (c, c1, c2) vertical white run histograms used for automatic threshold assessment on the document reported in (a).

Although it is clear that the final target layout should include all authors in a
single frame, the fact that the names were not equally spaced suggests that the
authors of the paper wanted to mark some difference between the first and the
second pair of names, which is satisfactorily caught by the above result. Now,
in case further block merging is desired, one might take for granted the con-
nected components retrieved by the previous step, and wonder how to go on
and find further mergings on this new image. By applying the same reasoning
as before, the new cumulative histograms are computed, as shown in Figures
4-b1 and 4-c1. An initial peak is still present, due to the small white rectangles
inside the connected components. Applying the RLSO again, with thresholds
equal to the first lengths where the slope becomes 0 in these new histograms,
yields the result in Figure 4-a2, which is just what we were looking for. Notice
that, while the former round produced connected components with many inner
white regions, the second round completely eliminated those small spaces;
hence the new cumulative histograms, reported in Figures 4-b2 and 4-c2, reveal
a neat cut (a wide region with slope 0) on the first bars. Thus, horizontally there
is no more zero-slope region except the trailing one, while vertically there
would be one more flat zone, which would merge the authors and abstract
frames together. The above example suggests a progressive technique that
merges connected components at increasingly higher levels, each level identi-
fied according to a significant group of runs having similar length. In other
words, it finds and applies first a threshold that merges smaller components,
and then progressively larger thresholds for merging wider components.

Figure 5. Progressive application of RLSO with automatic thresholding on a non-Manhattan layout portion of the document in Figure 1.

Figure 6. Application of RLSO to a layout portion of a newspaper.

As to
the possible noise in documents, it is worth noting that born-digital documents
are noise free and only scanned documents could be affected by some noise.
However, even in this case, the method can be successfully applied to a scaled-
down version of the image, which acts as a denoised image; after processing,
the resulting output can be taken back to the original scale.
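
   Putting the pieces together, the progressive technique amounts to a simple driver loop; in this hypothetical sketch, white_runs is an assumed helper collecting the white run lengths along one direction, and the fixed iteration count is only a placeholder for the stopping rule discussed in Section 4:

    def progressive_rlso(image, rounds=2):
        # Each round recomputes the document-specific thresholds on the
        # current image and smooths again, merging ever wider components.
        for _ in range(rounds):
            th = auto_threshold(white_runs(image, axis="horizontal"))
            tv = auto_threshold(white_runs(image, axis="vertical"))
            image = rlso(image, th, tv)
        return image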

3.       Sample Applications to Complex Layouts
   The proposed technique has been applied to several scanned images of doc-
uments showing complex layouts. In the following, a selection of fragments
belonging to such images is reported, representing a landscape of features
and problems that may be expected in a general set of documents.
   Figure 5 shows the progressive application of the RLSO algorithm with au-
tomatic thresholding on a scanned portion of the document in Figure 1. Here,
the problems are the non-Manhattan nature of the document layout, and the
use of different font sizes (and spacings) in the various frames: larger in the
central quotation, smaller in the surrounding running text. On the left is the
original fragment, and on the right the results of two successive applications
of the RLSO. It is possible to note how the thresholding technique succeeds
in progressively finding different levels of significant spacings and, as a con-
sequence, different levels of layout components. The first step already correctly
merges the paragraphs in the running text, but only the words in the quotation.
As expected, the merging of components written in larger size/spacing is de-
layed with respect to those using smaller size/spacing, and indeed it takes place
in the second step. Actually, the second step confirms the same surrounding
frames as the first one, and additionally merges all quotation rows, except the
first one, into a single frame. No better result can be found, because a further
iteration would start merging different frames and spoil the result. Figure 6
shows on the left two scanned portions of newspapers, characterized by many
components that differ widely in type of content (text, lines and, in one case,
even pictures), font size and spacing. In this case the first application of the
automatically thresholded RLSO, shown on the right, already returns a satis-
factory set of components. No different components are merged, and text para-
graphs are already retrieved. As to components having larger font sizes, those
of medium size are successfully retrieved as well, while those of significantly
larger size (the titles) are merged at least at the level of whole words. However,
it would be safer to preliminarily identify and remove straight lines, so as not to
affect the histograms on which the automatic threshold computation is based.

4.      Conclusion
   Document layout analysis is a fundamental step in the document processing
workflow. The layout analysis algorithms presented in the literature are gen-
erally divided into two main categories, bottom-up and top-down, according
to their approach to the problem, and have been mainly developed to deal with
digitized documents. However, document analysis has to be carried out on
both digitized and born-digital documents. Furthermore, most existing algo-
rithms are able to process Manhattan layout documents only, and thus show
limitations in dealing with some situations. Among the several layout analysis
algorithms presented in the literature, a very successful approach
has been the application of Run Length Smoothing. We presented a variant
of the classical RLSA approach, called RLSO because it is based on the ap-
plication of the OR operator. RLSO is a general bottom-up strategy to tackle
the layout analysis of non-Manhattan documents. Like RLSA, it is based on
the definition of horizontal and vertical thresholds for the smoothing opera-
tor, but it requires different values than those useful in RLSA, due to
the different approach. We also proposed a technique to automatically de-
fine RLSO thresholds for each single document, based on the distribution of
spacing therein. Application to selected sample documents, aimed at covering
a significant landscape of real cases, revealed that the approach is
satisfactory for documents characterized by the use of a uniform text font size.
More complex cases, in which text, graphics and pictures are present, and sev-
eral different font sizes and spacings are exploited for different components
(such as title, authors, body of an article), obviously require more complex
approaches. However, the proposed technique offers an interesting
starting point, since by iterating its application the basic elements of even
complex layouts can be successfully found.
   Current and future work will be centered on a deeper study of the behavior
of the automatic threshold technique on one side, and of the features of such
complex documents on the other, in order to reach a full technique that is appli-
cable to any kind of layout. First of all, a way must be found to determine when
the iterations should be stopped, so as to obtain a good starting point without
spoiling the result. Then, it must be defined how to exploit this technique as a
component of a larger algorithm that deals with complex documents. We are
currently investigating the possibility of analyzing the distribution of size and
spacings of basic components to obtain a clustering of components based on
features such as their position and spacing. After identifying clusters charac-
terized by uniform parameters, the technique can be applied separately to each
such zone by selectively computing and applying a proper threshold for each.

References
[Berkhin, 2002] Berkhin, Pavel (2002). Survey of clustering data mining techniques. Technical
   report, Accrue Software, San Jose, CA.
[Cesarini et al., 1999] Cesarini, F., Marinai, S., Soda, G., and Gori, M. (1999). Structured docu-
   ment segmentation and representation by the modified X-Y tree. In ICDAR ’99: Proceedings
   of the Fifth International Conference on Document Analysis and Recognition, page 563,
   Washington, DC, USA. IEEE Computer Society.
[Ferilli et al., 2009] Ferilli, Stefano, Biba, Marenglen, Esposito, Floriana, and Basile,
   Teresa M.A. (2009). A distance-based technique for non-manhattan layout analysis. In
   Proceedings of the 10th International Conference on Document Analysis and Recognition
   (ICDAR-2009), volume I, pages 231–235. IEEE Computer Society.
[Gatos et al., 2000] Gatos, B., Mantzaris, S. L., Perantonis, S. J., and Tsigris, A. (2000). Au-
   tomatic page analysis for the creation of a digital library from newspaper archives. Interna-
   tional Journal on Digital Libraries (IJODL), 3:77–84.
[Nagy, 2000] Nagy, George (2000). Twenty years of document image analysis in PAMI. IEEE
   Trans. Pattern Anal. Mach. Intell., 22(1):38–62.
[O’Gorman, 1993] O’Gorman, L. (1993). The document spectrum for page layout analysis.
   IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1162–1173.
[Simon et al., 1997] Simon, Anikó, Pret, Jean-Christophe, and Johnson, A. Peter (1997). A fast
   algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis
   and Machine Intelligence, 19(3):273–277.

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

  • 1. A RUN LENGTH SMOOTHING-BASED ALGORITHM FOR NON-MANHATTAN DOCUMENT SEGMENTATION Stefano Ferilli, Fabio Leuzzi, Fulvio Rotella and Floriana Esposito Dipartimento di Informatica – Universit„ di Bari a Via E. Orabona, 4 70126 Bari {ferilli, esposito}@di.uniba.it {fabio.leuzzi,fulvio.rotella}@uniba.it Abstract Layout analysis is a fundamental step in automatic document processing, be- cause its outcome affects all subsequent processing steps. Many different tech- niques have been proposed to perform this task. In this work, we propose a gen- eral bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A famous approach proposed in the literature for layout analysis was the RLSA. Here we consider a variant of RLSA, called RLSO (short for “Run- Lengh Smoothing with OR”), that exploits the OR logical operator instead of the AND and is particularly indicated for the identification of frames in non- Manhattan layouts. Like RLSA, RLSO is based on thresholds, but based on different criteria than those that work in RLSA. Since setting such thresholds is a hard and unnatural task for (even expert) users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Appli- cation on selected sample documents, that cover a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by the use of a uniform text font size. 1. Introduction Automatic document processing is a hot topic in the current computer sci- ence landscape [Nagy, 2000], since the huge and ever-increasing amount of available documents cannot be tackled by manual expert work. Document layout analysis is a fundamental step in the document processing workflow, that aims at identifying the relevant components in the document (often called frames, because they can be identified with rectangular regions in the page), that deserve further and specialized processing. Thus, the quality of the layout analysis task outcome can determine the quality and even the feasibility of the
non-Manhattan layouts have irregular shapes and are placed in the page in 'fancier' positions. Many algorithms assume that the document under processing has a Manhattan layout, which significantly simplifies the processing activity. Unfortunately, this is not true in general. Previous work has agreed on identifying in bottom-up techniques the best candidates for handling the non-Manhattan case, since the grouping of basic components can ignore high-level regularities when identifying significant aggregates.

This work proposes a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle the different cases of bitmaps and PS/PDF sources. We propose a variant of the Run Length Smoothing Algorithm, called RLSO, that exploits the OR logical operator instead of the AND [Ferilli et al., 2009]. Advantages range from efficiency, to the ability to successfully handle non-Manhattan layouts, to the possibility of working on both scanned and born-digital documents with minor changes. As for the classical technique, however, it requires the user to set thresholds for the run length smoothing, which is a quite delicate issue. Indeed, the quality of the overall result depends on setting proper thresholds, but this is not a typical activity for humans, who are therefore often unable to provide correct values. Additionally, there is no single threshold that can fit all documents; each case may require a specific value depending on its peculiar layout features. Thus, we also propose a technique for automatically assessing the run length thresholds to be applied in the RLSO algorithm.

The next section introduces the proposed Run Length Smoothing technique for non-Manhattan document segmentation, relating it to relevant past work in the literature, and the automatic assessment of document-specific thresholds for it. Then, Section 3 shows the effect of applying such a technique to real-world documents having layouts at different levels of complexity. Lastly, Section 4 concludes the paper and outlines current and future work on the subject.
Figure 1. A non-Manhattan fragment of a scanned document (a), the corresponding RLSA output (b), and the corresponding RLSO output (c).

2. Run Length Smoothing-based Algorithms

In this section we propose a novel approach to the segmentation of both scanned and born-digital documents, together with the automatic assessment of the needed thresholds.

2.1 Application to scanned images

The proposed RLSO approach borrows some ideas from the RLSA, CLiDE and the DOCSTRUM, plus other novelties. In the case of digitized documents, it works as follows (a code sketch is given after the list):

1. horizontal smoothing, carried out row-wise on the image with threshold t_h;

2. vertical smoothing, carried out column-wise on the image with threshold t_v;

3. logical OR of the images obtained in steps 1 and 2 (the outcome has a black pixel wherever at least one of the input images has one, and a white pixel otherwise).

While the RLSA returns rectangular frames (see Figure 1-b), the RLSO algorithm identifies irregular connected components (see Figure 1-c), each of which is to be considered a frame. The actual content of a frame can be extracted by applying a logical AND of the connected component with the original image. Figure 1 refers to a non-Manhattan fragment of document layout.

RLSO improves the flexibility and efficiency of the results with respect to the classical RLSA. A straightforward advantage of using the OR operator is that it preserves all black pixels from the original smoothed images, and hence no second horizontal smoothing is required to restore the layout fragments dropped by the AND operation. A further efficiency improvement comes from the fact that, since the aim is grouping small original connected components (such as characters) into larger ones (such as frames), shorter thresholds are sufficient (to fill inter-character, inter-word or inter-line spaces), and hence the algorithm is faster because it has fewer runs to fill with black pixels.
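The following is a minimal sketch of the three steps above, assuming a NumPy binary image where 1 encodes black (ink) and 0 white; the names smooth_runs and rlso are illustrative, not taken from the paper.

```python
import numpy as np

def smooth_runs(img, threshold, axis):
    """Fill inner white runs shorter than `threshold` with black pixels,
    row-wise (axis=0) or column-wise (axis=1). Border runs are left as-is."""
    out = img.copy()
    lines = out if axis == 0 else out.T   # .T yields column views of `out`
    for line in lines:
        run_start = None
        for x, v in enumerate(line):
            if v == 0 and run_start is None:
                run_start = x                      # a white run begins
            elif v == 1 and run_start is not None:
                if run_start > 0 and x - run_start < threshold:
                    line[run_start:x] = 1          # fill a short inner run
                run_start = None
    return out

def rlso(img, t_h, t_v):
    """RLSO: horizontal and vertical smoothing combined by a logical OR."""
    horizontal = smooth_runs(img, t_h, axis=0)
    vertical = smooth_runs(img, t_v, axis=1)
    return np.logical_or(horizontal, vertical).astype(np.uint8)
```

The connected components of `rlso(img, t_h, t_v)` are then the candidate frames; here the thresholds are supplied by hand, pending the automatic assessment described in Section 2.3.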
A further improvement may consist of applying the vertical smoothing of step 2 directly to the image resulting from step 1, which allows step 3 to be skipped entirely. This latter option should not introduce further connections among components, and could even improve the quality of the result (e.g., by handling cases in which adjacent rows have vertically aligned inter-word spacings, which would not be captured by the pure vertical smoothing).

2.2 Application to born-digital documents

In the case of scanned images processed by the RLSO algorithm, black pixels play the role of basic blocks, and two blocks are merged together whenever the horizontal or vertical distance in white pixels (i.e., the white run length) between them is below a given threshold. Thus, RLSO can be cast as a bottom-up and distance-based technique. The same approach can be extended, with minor changes, to the case of digital documents, where the basic blocks are rectangular areas of the page and the distance among adjacent blocks can be defined as the Euclidean distance between their closest edges, according to the projection-based approach exploited in CLiDE [Simon et al., 1997]. Thus, assuming initially a frame for each basic block, more complex frames can be grown bottom-up by progressively merging any two frames whose distance (computed as the shortest distance of the blocks they contain) is below a given threshold. Such a technique resembles a single-link clustering procedure [Berkhin, 2002], where however the number of desired clusters is not fixed in advance, but derives automatically from the number of components whose distance falls below the given thresholds.

A problem to be solved in such an extension is how to define adjacency between blocks. Indeed, in scanned images the pixels are lined up in a grid, such that each element has at most four adjacent elements in the horizontal and vertical directions. Conversely, rectangular blocks of variable size may have many adjacent blocks on the same side, according to their projections (as explained in [Simon et al., 1997]). A similar issue is considered in the DOCSTRUM, where the basic blocks correspond to text characters, and is tackled by imposing a limit on the number of neighbors to be considered, rather than on the merging distance. This solution would not fit our case, which is more general: a block can include a variable number of characters, and hence the number of nearest neighbors to be considered could not be determined as easily as in the DOCSTRUM.

The resulting algorithm works as follows, where steps 2-3 correspond to horizontal smoothing and steps 4-5 correspond to vertical smoothing (a code sketch follows the list):

1. build a frame for each basic block, such that the corresponding basic block is the only element of the frame;
2. compute the set H of all possible triples (d_h, b1, b2), where b1 and b2 are horizontally adjacent basic blocks and d_h is the horizontal distance between them;

3. while H is not empty and the element having the lowest d_h value, (d_h, b1, b2), is such that d_h < t_h:
(a) merge into a single frame the frames to which b1 and b2 belong;
(b) remove that element from H;

4. compute the list V of all possible triples (d_v, b1, b2), where b1 and b2 are vertically adjacent basic blocks and d_v is the vertical distance between them;

5. while V is not empty and the element having the lowest d_v value, (d_v, b1, b2), is such that d_v < t_v:
(a) merge into a single frame the frames to which b1 and b2 belong;
(b) remove that element from V.

Figure 2. (a) A non-Manhattan fragment of a (scanned/natively digital) document and the corresponding RLSO output on (b) the scanned document and (c) the natively digital document.

After obtaining the final frames, each of them can be handled separately by reconstructing the proper top-down, left-to-right ordering of the basic blocks it includes. Figure 2-c shows the output of this version of the algorithm, with the discovered frames highlighted, on the born-digital version of the document in Figure 2-a. The efficiency of the procedure can be improved by considering the H and V sets as priority queues based on the distance attribute, and by preliminarily organizing and storing the basic blocks top-down in 'rows' and left-to-right inside these 'rows', so that considering each of them in turn, from top to bottom and from left to right, is sufficient for identifying all pairs of adjacent blocks.
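A minimal sketch of steps 1-5, assuming a priority queue (heapq) over precomputed distance triples and a union-find structure to track frame membership; all names are illustrative, and the computation of the adjacency triples themselves is left out.

```python
import heapq

class UnionFind:
    """Tracks the frame each basic block currently belongs to."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def merge_frames(n_blocks, h_triples, v_triples, t_h, t_v):
    """Nearest-first merging of adjacent blocks into frames.
    `h_triples`/`v_triples` hold (distance, i, j) for horizontally/
    vertically adjacent block indices i, j."""
    uf = UnionFind(n_blocks)
    for triples, threshold in ((h_triples, t_h), (v_triples, t_v)):
        queue = list(triples)
        heapq.heapify(queue)                   # priority queue on distance
        while queue and queue[0][0] < threshold:
            _, i, j = heapq.heappop(queue)     # closest adjacent pair first
            uf.union(i, j)                     # merge their frames
    frames = {}
    for i in range(n_blocks):                  # group blocks by frame
        frames.setdefault(uf.find(i), []).append(i)
    return list(frames.values())
```

Note that, since every adjacent pair below the threshold ends up merged, the final partition coincides with threshold-based single-link clustering; the heap merely preserves the nearest-first order of the algorithm.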
This method resembles that proposed in [Simon et al., 1997], with the difference that no graph is built in this case, and more efficiency is obtained by considering adjacent components only. The 'list-of-rows' structure allows adjacent blocks to be found efficiently, by limiting the number of useless comparisons. The 'nearest first' grouping is mid-way between the minimum spanning tree technique of [Simon et al., 1997] and a single-link clustering technique. Compared to the ideas in [O'Gorman, 1993], we exploit the distance between component borders rather than between component centers. Indeed, the latter option, adopted in the DOCSTRUM, cannot be exploited in our case, since the starting basic components can range from single characters to (fragments of) words, and this would cause a lack of regularity that would affect proper nearest-neighbour identification.

2.3 Automatic assessment of RLSO thresholds

A very important issue for the effectiveness of Run Length Smoothing-based algorithms is the choice of proper horizontal/vertical thresholds. Indeed, too high thresholds could erroneously merge different content blocks in the document, while too low ones could return an excessively fragmented layout. Thus, in principle a specific study of each single document by a human expert would be needed to obtain effective results, which would undermine the very idea of a fully automatic document processing system. The threshold assessment issue becomes even more complicated when documents including different spacings among components are considered. This typically happens in document pages that include several different kinds of components, independent of each other and hence exploiting different spacing standards (although the spacings are homogeneous within each component). For instance, a title, usually written in larger font characters, will use inter-character, inter-word and inter-line spacings wider than those of normal text. Obviously, the attempt to merge related wider components would also merge smaller unrelated ones as an undesirable side effect. Thus, a single threshold that is suitable for all components cannot be found in these cases; rather, the layout analysis process should be able to identify homogeneous areas of the page and to selectively define and apply a different threshold for each of them. Hence there is a strong motivation for the development of methods that can automatically assess proper thresholds, possibly based on the specific document at hand. This is why some works in the literature have been devoted to automatically finding values for the t_h, t_v and t_a parameters of the RLSA in order to improve its performance (e.g., [Gatos et al., 2000]). Studies on RLSA suggest setting large thresholds, in order to keep the number of black pixels in the smoothed images sufficiently large to prevent the AND operator from dropping most of them again (indeed, this is the reason why an additional horizontal smoothing is provided for).

The use of the OR operator in RLSO needs small values, which brings several advantages in terms of efficiency, but the threshold assessment must be carefully evaluated: as soon as two components are merged, there is no step that can break them apart again, and this can be a problem in cases in which logically different components are very close to each other and might be merged together. We carried out a study of the behavior of run lengths in several documents, to search for regularities that can help in developing an automatic thresholding technique for RLSO, able to define proper thresholds for each document at hand based on its specific distribution of spacings.
Specifically, we focus on the scanned document case, leaving an analysis of the possible application to natively digital documents for future work.

Figure 3. A sample study of the run length distribution on a fragment of a scientific paper: (o) the original scanned document, histograms of the (a) horizontal and (b) vertical run lengths, and the corresponding (c-d) cumulative histograms.

Consider the sample document depicted in Figure 3-o, corresponding to a fragment of the first page of reference [Cesarini et al., 1999] and representing the authors and abstract of a scientific paper. Figures 3-a and 3-b show the histograms of the white run length distribution, reporting the number of runs (horizontal and vertical, respectively) for each possible length (notice that the runs from the borders to the first black pixels have been ignored). Several irregular peaks are immediately apparent, clustered into groups separated by an absence of runs for several adjacent lengths. Indeed, the peaks correspond to the most frequent spacings in the document, and the most prominent peak clusters can be supposed to represent the homogeneous spacings corresponding to different layout components. More precisely, a significantly high and wide peak is present for very short runs, corresponding to the most frequent spaces in the page, i.e. intra-letter, inter-letter and inter-word spaces in the horizontal case, and intra-letter and inter-line spaces in the vertical case. After such an initial peak, the behavior is quite different for the horizontal (Figure 3-a) and vertical (Figure 3-b) cases. In the former, the histogram shows another noticeable peak immediately after the first one, corresponding to the distance between the first two and the last two author names, and then becomes quite flat, with a few small peaks in the middle that, looking at the original image, should correspond to the runs between the outlines of some letters. Conversely, the vertical histogram has several peaks at various lengths, due to the different inter-line spacings used in the fragment.
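For reference, white run length histograms of this kind might be collected as in the following sketch, which reuses the binary-image convention of the earlier code (1 = black) and, like the paper, ignores the border runs; white_run_histogram is an illustrative name.

```python
import numpy as np

def white_run_histogram(img, axis=0):
    """Map each white run length to its number of occurrences, measured
    along rows (axis=0) or columns (axis=1). Border runs are skipped."""
    counts = {}
    lines = img if axis == 0 else img.T
    for line in lines:
        black = np.flatnonzero(line)          # positions of black pixels
        if black.size < 2:
            continue                          # no inner white run here
        gaps = np.diff(black) - 1             # white pixels between them
        for g in gaps[gaps > 0]:
            counts[int(g)] = counts.get(int(g), 0) + 1
    return counts
```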
This suggests, as a first idea, that each peak corresponds to a group of homogeneous spacings that are likely to separate blocks within a same frame, and hence should be somehow related to the various thresholds of interest. However, the histograms are too irregular to be usefully exploited by an automatic technique. For this reason, a more regular (and interpretable) shape can be obtained by considering the cumulative histograms of run lengths (see Figures 3-c and 3-d), where each bar corresponds to a possible run length and represents the number of runs in the given image having length larger than or equal to it. To introduce some tolerance, the actual histogram bars are scaled down to 10% of their original value. Now the shape of the cumulative graph is clearly monotonically decreasing, and flat zones identify lengths for which no runs are present in the document (or, equivalently, if several adjacent run lengths are missing, the slope is 0). Mathematically, the derivative of the shape is always less than or equal to 0, and indicates the variation in the number of runs introduced by a given length: 0 means no runs for that length, while the larger the absolute value, the greater the difference in the number of runs between adjacent lengths at that point. In our implementation, the slope at a given bar b of a histogram H(·) is computed from the difference between the previous bar b-1 and the next bar b+1 as

\frac{H(b+1) - H(b-1)}{(b+1) - (b-1)} = \frac{H(b+1) - H(b-1)}{2} \qquad (1)

for inner bars; as

\frac{H(b+1) - H(b)}{(b+1) - b} = H(b+1) - H(b) \qquad (2)

for the first bar; and as

\frac{H(b) - H(b-1)}{b - (b-1)} = H(b) - H(b-1) \qquad (3)

for the last bar. A different definition could take into account only equation (2) for all bars (except the last one).¹

¹ Note that very short white runs, made up of a few pixels, are often unlikely, which would yield an initial flat region in the histogram, having slope 0. Of course, this region should be skipped for our purposes.
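Under the assumptions of the previous sketches, the threshold assessment just described might look as follows: build the cumulative histogram, scale it to 10%, and return the first length, after the initial flat region noted in footnote 1, at which the slope becomes 0. The name auto_threshold is illustrative.

```python
def auto_threshold(run_hist, scale=0.10):
    """Document-specific RLSO threshold from a run length histogram
    (a mapping length -> count, as produced by white_run_histogram)."""
    max_len = max(run_hist)
    # H[b] = number of runs of length >= b (cumulative histogram) ...
    H = [0.0] * (max_len + 2)
    for b in range(max_len, 0, -1):
        H[b] = H[b + 1] + run_hist.get(b, 0)
    # ... scaled down to 10%; truncation introduces the tolerance
    H = [int(h * scale) for h in H]

    def slope(b):
        if b == 1:
            return H[b + 1] - H[b]                # equation (2), first bar
        if b == max_len:
            return H[b] - H[b - 1]                # equation (3), last bar
        return (H[b + 1] - H[b - 1]) / 2          # equation (1), inner bars

    b = 1
    while b < max_len and slope(b) == 0:          # skip initial flat region
        b += 1
    while b < max_len and slope(b) != 0:          # first zero-slope length
        b += 1
    return b
```

For instance, `rlso(img, auto_threshold(white_run_histogram(img, axis=0)), auto_threshold(white_run_histogram(img, axis=1)))` would perform one automatically thresholded smoothing round.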
Figure 4. (a, a1, a2) Successive applications of RLSO with automatic threshold assessment on a fragment of a scientific paper, and the corresponding (b, b1, b2) horizontal and (c, c1, c2) vertical white run histograms used for automatic threshold assessment on the document reported in (a).

Adopting as RLSO thresholds the first horizontal and vertical lengths where the slope becomes 0 yields the result in Figure 4-a1. The result successfully separates the abstract and the authors into different connected components (i.e., frames), as desired. Actually, one can note that two components are found for the authors, each containing a pair of author names. This happens because a different spacing is exploited in the author section, so that the space between the second and third authors is bigger than that between the first-second and third-fourth pairs. Although it is clear that the final target layout should include all the authors in a single frame, the fact that the names were not equally spaced suggests that the authors of the paper wanted to mark some difference between the first and the second pair of names, which is satisfactorily caught by the above result. Now, in case a further block merge is desired, one might take for granted the connected components retrieved by the previous step, and wonder how to go on and find further mergings on this new image. By applying the same reasoning as before, the new cumulative histograms are computed, as shown in Figures 4-b1 and 4-c1. An initial peak is still present, due to the small white rectangles inside the connected components. Applying the RLSO again, with thresholds equal to the first lengths where the slope becomes 0 in these new histograms, yields the result in Figure 4-a2, which is just what we were looking for. Notice that, while the former round produced connected components with many inner white regions, the second round completely eliminated those small spaces; hence, computing the new cumulative histograms, reported in Figures 4-b2 and 4-c2, reveals a neat cut (a wide region with slope 0) on the first bars. Thus, horizontally there is no more slope-0 region except the trailing one, while vertically there would be one more flat zone, which would merge the authors and abstract frames together. The above example suggests a progressive technique that merges connected components at higher and higher levels, identified according to significant groups of runs having similar lengths. In other words, it finds and applies first a threshold that merges smaller components, and then progressively larger thresholds for merging wider components.
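Tying the earlier sketches together, the progressive technique might be iterated as below; the number of rounds is left as a parameter, since a principled stopping criterion is still an open issue (see Section 4).

```python
def progressive_rlso(img, rounds=2):
    """Repeated RLSO, recomputing document-specific thresholds each round
    (reuses rlso, white_run_histogram and auto_threshold defined above)."""
    for _ in range(rounds):
        t_h = auto_threshold(white_run_histogram(img, axis=0))
        t_v = auto_threshold(white_run_histogram(img, axis=1))
        img = rlso(img, t_h, t_v)
    return img
```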
As to the possible noise in documents, it is worth noting that born-digital documents are noise free, and only scanned documents could be affected by some noise. However, even in this case the method can be successfully applied: a scale reduction of the image can be used to obtain a denoised image and, after the process execution, the resulting output can be taken back to the original scale.

Figure 5. Progressive application of RLSO with automatic thresholding on a non-Manhattan layout portion of the document in Figure 1.

Figure 6. Application of RLSO to a layout portion of a newspaper.

3. Sample Applications to Complex Layouts

The proposed technique has been applied to several scanned images of documents showing complex layouts. In the following, a selection of fragments belonging to such images is reported, representing a landscape of features and problems that may be expected in a general set of documents.

Figure 5 shows the progressive application of the RLSO algorithm with automatic thresholding on a scanned portion of the document in Figure 1. Here, the problems are the non-Manhattan nature of the document layout, and the use of different font sizes (and spacings) in the various frames: larger in the central quotation, smaller in the surrounding running text. On the left is the original fragment, and on the right the results of two successive applications of the RLSO. It is possible to note how the thresholding technique succeeds in progressively finding different levels of significant spacings and, as a consequence, different levels of layout components. The first step already merges correctly the paragraphs in the running text, but just the words in the quotation.
As expected, the merging of components written with larger sizes/spacings is delayed with respect to those using smaller ones, and indeed it takes place in the second step. Actually, the second step confirms the same surrounding frames as the first one, and additionally merges all the quotation rows, except the first one, into a single frame. No better result can be found, because a further iteration would start merging different frames and spoil the result. Figure 6 shows on the left two scanned portions of newspapers, characterized by many components that differ widely in type of content (text, lines and, in one case, even pictures), font size and spacing. In this case, the first application of the automatically thresholded RLSO, shown on the right, already returns a satisfactory set of components. No different components are merged, and text paragraphs are already retrieved. As to components having larger font sizes, those of medium size are successfully retrieved as well, while those of significantly larger size (the titles) are merged at least at the level of whole words. However, it would be safer to preliminarily identify and remove straight lines, so as not to affect the histograms on which the automatic threshold computation is based.

4. Conclusion

Document layout analysis is a fundamental step in the document processing workflow. The layout analysis algorithms presented in the literature are generally divided into two main categories, bottom-up and top-down, according to their approach to the problem, and have been mainly developed to deal with digitized documents. However, document analysis has to be carried out on both digitized and born-digital documents. Furthermore, most of these algorithms are able to process Manhattan layout documents only. Thus, existing algorithms show limitations in dealing with some situations. Among the several layout analysis algorithms presented in the literature, a very successful approach has been the application of Run Length Smoothing. We presented a variant of the classical RLSA approach, called RLSO because it is based on the application of the OR operator. RLSO is a general bottom-up strategy to tackle the layout analysis of non-Manhattan documents. Like RLSA, it is based on the definition of horizontal and vertical thresholds for the smoothing operator, but it requires different values than those that are useful in RLSA, due to the different approach. We also proposed a technique to automatically define RLSO thresholds for each single document, based on the distribution of spacing therein. Application to selected samples of documents, aimed at covering a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by the use of a uniform text font size. More complex cases, in which text, graphics and pictures are present, and several different font sizes and spacings are exploited for different components (such as the title, authors and body of an article), obviously require more complex approaches.
However, the proposed technique presents itself as an interesting starting point, since by iterating its application the basic elements of even complex layouts can be successfully found.

Current and future work will be centered on a deeper study of the behavior of the automatic threshold technique on one side, and of the features of such complex documents on the other, in order to reach a full technique that is applicable to any kind of layout. First of all, a way to determine when the iterations are to be stopped, in order to obtain a good starting point without spoiling the result, must be found. Then, how to exploit this technique as a component of a larger algorithm that deals with complex documents must be defined. We are currently investigating the possibility of analyzing the distribution of sizes and spacings of basic components, to obtain a clustering of components based on features such as their position and spacing. After identifying clusters characterized by uniform parameters, the technique can be applied separately to each such zone, by selectively computing and applying a proper threshold for each.

References

[Berkhin, 2002] Berkhin, Pavel (2002). Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA.

[Cesarini et al., 1999] Cesarini, F., Marinai, S., Soda, G., and Gori, M. (1999). Structured document segmentation and representation by the modified X-Y tree. In ICDAR '99: Proceedings of the Fifth International Conference on Document Analysis and Recognition, page 563, Washington, DC, USA. IEEE Computer Society.

[Ferilli et al., 2009] Ferilli, Stefano, Biba, Marenglen, Esposito, Floriana, and Basile, Teresa M.A. (2009). A distance-based technique for non-Manhattan layout analysis. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR-2009), volume I, pages 231–235. IEEE Computer Society.

[Gatos et al., 2000] Gatos, B., Mantzaris, S. L., Perantonis, S. J., and Tsigris, A. (2000). Automatic page analysis for the creation of a digital library from newspaper archives. International Journal on Digital Libraries (IJODL), 3:77–84.

[Nagy, 2000] Nagy, George (2000). Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell., 22(1):38–62.

[O'Gorman, 1993] O'Gorman, L. (1993). The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1162–1173.

[Simon et al., 1997] Simon, Anikó, Pret, Jean-Christophe, and Johnson, A. Peter (1997). A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3):273–277.