1. Contributions for building a
Corpora-Flow system
André Santos
andrefs@cpan.org
Informatics Engineering MSc
University of Minho
December 2011
2. Concepts
Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.
Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.
This presentation and the underlying master's thesis describe the implementation of several tools to be used in typical corpus building activities.
4. Context
The work developed for this master's thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho that aims to build large parallel corpora between Portuguese and six other languages.
5. Corpora building challenges
file format and format conversion
finding duplicated files
text encoding format
structural residues
section delimiters
unpaired sections (parallel corpora)
...
6. Corpora building challenges
Severe problems which often lead to bad results
Many (most?) of them are hard/impossible to solve completely
Find the problem and report it when it is not solvable automatically
Provide intelligent ways of describing what was found and done
7. 5 key issues
Book cleaning
Duplicates and candidate pairs detection
Book synchronization
Alignment evaluation
Corpora-flow system
8. Book processing problems – Motivation
(...) d <92>’ entrée, donnant accès dans la salle commune.
Une légère véranda, qui en proté-
M
<96>- 86 <96>-
^L geait la partie antérieure contre l <92>’ action
des rayons solaires, reposait sur de sveltes bambous. (...)
La Jangada, Jules Verne
<92>’ : right single quot. mark (CP1252)
<96>- : en dash (CP1252)
^L : page break (0xC)
proté-(...)geait : transpagination
After cleaning:
(...) d ’ entrée, donnant accès dans la salle commune.
Une légère véranda, qui en protégeait _pb1_
la partie antérieure contre l ’ action
des rayons solaires, reposait sur de sveltes bambous. (...)
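The cleaning steps illustrated above can be sketched in a few lines. This is a minimal illustration, not the Text::Perfide::BookCleaner implementation: it assumes the input is CP1252-encoded bytes, that pages are separated by a form feed (0x0C), that page numbers appear on their own line between dashes, and that any line ending in a hyphen just before a page break is a split word. The function name `clean_page_breaks` and the regular expression are hypothetical.

```python
import re

# Assumed shape of a page-number line, e.g. "– 86 –" (en dashes or hyphens).
PAGE_NUMBER = re.compile(r"^\s*[–-]+\s*\d+\s*[–-]+\s*$")


def clean_page_breaks(raw: bytes) -> str:
    """Decode CP1252 text, drop page-number lines, rejoin words split across pages."""
    text = raw.decode("cp1252")      # 0x92 becomes ’ and 0x96 becomes –
    pages = text.split("\f")         # form feed (0x0C) marks a page break
    out = pages[0].rstrip("\n").split("\n")
    for n, page in enumerate(pages[1:], start=1):
        # Drop page-number lines and blank lines left at the end of the previous page.
        while out and (PAGE_NUMBER.match(out[-1]) or not out[-1].strip()):
            out.pop()
        mark = f"_pb{n}_"            # page-break mark, numbered as on the slide
        lines = page.strip("\n").split("\n")
        first = lines[0].lstrip()
        if out and out[-1].endswith("-") and first:
            # Transpagination: glue both halves of the word, then place the mark.
            head, _, rest = first.partition(" ")
            out[-1] = out[-1][:-1] + head + " " + mark
            lines[0] = rest
        else:
            out.append(mark)
        out.extend(lines)
    return "\n".join(out)
```

Applied to the Jules Verne fragment, the split word "proté-" / "geait" is rejoined and followed by `_pb1_`, as in the cleaned version shown on the slide. A real cleaner would need a dictionary or frequency check before removing a hyphen, since some line-final hyphens are legitimate.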
11. Book cleaning
Subdivided in several steps:
12. Sections ontology
contains common section types
used to automatically generate the code to recognize section delimiters
allows discussion/cooperation with people with no programming knowledge
code becomes simpler and cleaner

Ontology excerpt:
chap
  PT capítulo, cap, capitulo
  FR chapitre, chap
  EN chapter, chap
  NT sec
end
  PT fim
  FR fin
  EN the_end
  BT _alone
scene
  PT cena
  FR scène
  EN scene
  RU глава
  BT act
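Generating recognizer code from the ontology can be sketched as follows. This is a minimal illustration that assumes the ontology is available as a dictionary of section types with per-language terms. The names `ONTOLOGY`, `build_delimiter_regex` and `find_delimiters` are hypothetical, as is the rule that a delimiter is a term alone on a line, optionally followed by a number. None of this is taken from Text::Perfide::BookCleaner.

```python
import re

# Hypothetical in-memory form of part of the ontology excerpt.
ONTOLOGY = {
    "chap": {"PT": ["capítulo", "cap", "capitulo"],
             "FR": ["chapitre", "chap"],
             "EN": ["chapter", "chap"]},
    "end": {"PT": ["fim"], "FR": ["fin"], "EN": ["the_end"]},
    "scene": {"PT": ["cena"], "FR": ["scène"], "EN": ["scene"]},
}


def build_delimiter_regex(ontology, lang):
    """Build one regex that recognizes section delimiter lines for a language."""
    alternatives = []
    for section_type, terms_by_lang in ontology.items():
        terms = terms_by_lang.get(lang, [])
        if not terms:
            continue
        # "_" in a term stands for whitespace; try longer terms first.
        words = sorted((re.escape(t).replace("_", r"\s+") for t in terms),
                       key=len, reverse=True)
        alternatives.append(f"(?P<{section_type}>{'|'.join(words)})")
    pattern = (r"^[ \t]*(?:" + "|".join(alternatives) + r")\b\.?"
               r"[ \t]*(?P<number>\d+|[IVXLCDM]+)?[ \t]*$")
    return re.compile(pattern, re.IGNORECASE | re.MULTILINE)


def find_delimiters(text, lang):
    """Return (section_type, number) for every delimiter line found in text."""
    regex = build_delimiter_regex(ONTOLOGY, lang)
    found = []
    for match in regex.finditer(text):
        groups = match.groupdict()
        section_type = next(k for k in ONTOLOGY if groups.get(k))
        found.append((section_type, match.group("number")))
    return found
```

Because the regular expression is derived from the data, adding a language or a synonym only requires editing the ontology, which is what lets non-programmers contribute.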
13. Duplicates and pairs detection
Motivation
Duplicates can result in a biased corpus
Finding candidate pairs for alignment
Language independent elements (LIEs)
terms which are usually kept untranslated
year references – “1973”
proper names – “Hamlet”
Measuring similarity
$\mathrm{similarity}(A, B) = \frac{|A_{LIEs} \cap B_{LIEs}|}{|A_{LIEs} \cup B_{LIEs}|}$
Thresholds
< 0.2: unrelated
> 0.4: pair
> 0.9: duplicates
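The similarity measure and thresholds can be sketched as below. The extraction of LIEs is an assumption made for illustration only: four-digit numbers count as year references, and capitalized words that follow another word count as proper names. The function names and the "undecided" label for scores between 0.2 and 0.4 (a range the slide does not cover) are also hypothetical; the real Text::Perfide::BookPairs logic may differ.

```python
import re


def extract_lies(text):
    """Approximate language independent elements: years and mid-sentence names."""
    years = set(re.findall(r"\b\d{4}\b", text))
    # A capitalized word preceded by a word character (or , ;) and a space,
    # which skips most sentence-initial words.
    names = set(re.findall(r"(?<=[\w,;] )[A-Z]\w+", text))
    return years | names


def similarity(text_a, text_b):
    """Size of the intersection of both LIE sets divided by the size of their union."""
    lies_a, lies_b = extract_lies(text_a), extract_lies(text_b)
    union = lies_a | lies_b
    if not union:
        return 0.0
    return len(lies_a & lies_b) / len(union)


def classify(score):
    """Apply the thresholds from the slide."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # assumption: the slide gives no label for 0.2 to 0.4
```

For example, "In 1973 the actor played Hamlet in London." and "Em 1973 o ator interpretou Hamlet em Lisboa." share two of four distinct elements, giving a similarity of 0.5 and the label "pair".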
14. Book synchronization
Definition
Structural alignment at section level, based on
previously added section delimiting marks.
Motivation
Some aligners cannot handle large documents
Section delimiters can act as anchor points
Unpaired sections can be discarded
Implementation
match similar section delimiters
synchronization points
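A sketch of how section delimiters could be matched into synchronization points, under simplifying assumptions: each book has already been reduced to an ordered list of delimiter marks such as `("chap", 1)`, "similar" is taken to mean "equal", and the matching is delegated to Python's `difflib.SequenceMatcher`. The function names are hypothetical and this is not the algorithm used by Text::Perfide::BookSync.

```python
from difflib import SequenceMatcher


def sync_points(marks_a, marks_b):
    """Pair up delimiters that occur in the same relative order in both books.

    Returns a list of (index_in_a, index_in_b) anchor points.
    """
    matcher = SequenceMatcher(None, marks_a, marks_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points


def unpaired(marks, paired_indexes):
    """Delimiters with no counterpart; their sections can be discarded."""
    paired = set(paired_indexes)
    return [mark for i, mark in enumerate(marks) if i not in paired]


book_a = [("chap", 1), ("chap", 2), ("chap", 3), ("end", None)]
book_b = [("chap", 1), ("chap", 3), ("chap", 4), ("end", None)]
points = sync_points(book_a, book_b)
only_in_a = unpaired(book_a, [a for a, _ in points])
only_in_b = unpaired(book_b, [b for _, b in points])
```

Each anchor point splits both files at the same logical place, so an aligner can then be run on the smaller chunks between consecutive points.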
15. Output
pair of files with synchronization marks
pair of files divided into smaller pairs of chunks
text report
synchronization matrix
17. Alignment evaluation
Motivation
compare alignments of the same documents (performed by different tools, with different options, ...)
determine if an alignment was successful
Comparing alignments
parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2
evaluate the other tools developed
compare the performance of the available
alignment tools
19. Alignment evaluation
Determine if an alignment was successful
Summarize a TMX by sampling. Sampling can
be performed based on:
number of samples desired
explicit sampling points
translation units which match a given regular
expression
Output is a (much?) smaller TMX file
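The three sampling modes can be sketched with Python's standard XML library. This assumes a TMX document with the usual `tmx/body/tu/tuv/seg` structure; the function name `sample_tmx`, its parameters and the fixed random seed are hypothetical and do not reflect the interface of Text::Perfide::TMX::Utils.

```python
import random
import re
import xml.etree.ElementTree as ET


def sample_tmx(tmx_xml, count=None, points=None, pattern=None, seed=0):
    """Return a smaller TMX document (as a string) with only the selected units.

    count   -- number of randomly chosen translation units to keep
    points  -- explicit zero-based indexes of translation units to keep
    pattern -- keep units in which any segment matches this regular expression
    """
    root = ET.fromstring(tmx_xml)
    body = root.find("body")
    units = list(body.findall("tu"))
    keep = set()
    if points is not None:
        keep.update(i for i in points if 0 <= i < len(units))
    if pattern is not None:
        regex = re.compile(pattern)
        for i, unit in enumerate(units):
            if any(regex.search("".join(seg.itertext())) for seg in unit.iter("seg")):
                keep.add(i)
    if count is not None:
        rng = random.Random(seed)  # fixed seed keeps samples reproducible
        keep.update(rng.sample(range(len(units)), min(count, len(units))))
    for i, unit in enumerate(units):
        if i not in keep:
            body.remove(unit)
    return ET.tostring(root, encoding="unicode")
```

Sampling by regular expression is useful for spot checks on a known name: keeping only the units that mention a character shows at a glance whether the name appears on both sides of each unit.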
20. Alignment evaluation
The Name of the Rose, Umberto Eco
Adson (DE) = Адсо (RU)
24. Distribution
All the tools are implemented as Perl modules:
Text::Perfide::BookCleaner
Text::Perfide::BookPairs
Text::Perfide::BookSync
Text::Perfide::TMX::Utils
publicly available on CPAN
including tests and documentation
additional effort required to make code
installable and usable by other people
25. Corpora-flow
Motivation
building a corpus is a complex task
linear pipeline is not powerful enough
Workflow
states
actions
conditions
context
Makefiles
file-oriented
timestamps and dependencies
fail-fast and resumable execution
parallelization
26. Corpora-flow
workflow + Makefiles = corpora-flow
DSL (→ Slay::Makefile)
workflow: rule*
rule: pre-condition* action post-condition*
action: targets dependencies function
condition: filename function
target: pattern*
dependencies: pattern*
function: Perl code
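To make the grammar above more concrete, here is a small sketch of how such rules might be represented and executed. It is an illustration only: plain file names stand in for patterns, Python callables stand in for the Perl functions, and the `Rule` class, `out_of_date` and `run` are hypothetical names. It does not use or reproduce the Slay::Makefile API.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# condition: filename function
Condition = Tuple[str, Callable[[str], bool]]


@dataclass
class Rule:
    """rule: pre-condition* action post-condition*"""
    targets: List[str]
    dependencies: List[str]
    function: Callable[[List[str], List[str]], None]
    pre: List[Condition] = field(default_factory=list)
    post: List[Condition] = field(default_factory=list)


def out_of_date(rule):
    """Make-style check: a target is missing or older than some dependency."""
    if not all(os.path.exists(t) for t in rule.targets):
        return True
    oldest_target = min(os.path.getmtime(t) for t in rule.targets)
    return any(os.path.exists(d) and os.path.getmtime(d) > oldest_target
               for d in rule.dependencies)


def run(workflow):
    """Run rules in order, stopping at the first failed condition (fail-fast)."""
    executed = []
    for rule in workflow:
        if not out_of_date(rule):
            continue  # resumable: up-to-date targets are skipped
        for filename, check in rule.pre:
            if not check(filename):
                raise RuntimeError(f"pre-condition failed for {filename}")
        rule.function(rule.targets, rule.dependencies)
        for filename, check in rule.post:
            if not check(filename):
                raise RuntimeError(f"post-condition failed for {filename}")
        executed.append(rule.targets)
    return executed
```

The conditions are what distinguish this from a plain Makefile: a post-condition such as "the cleaned file is not empty" lets the flow stop and report a problem instead of passing a bad file to the aligner.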
27. Conclusions
Evaluation of the tools has shown that they do help to solve problems
Most of the methods devised can be applied in other contexts
Working within a larger project:
provides requirements and resources
specific needs and priorities
Making code available to other people:
requires additional effort
gives meaning to the work
external contributions
Higher-level objects help to organize and discuss
37. Future work
Document cleaners
other types of documents (e.g. scientific articles)
algorithm for finding section delimiters with a notion of hierarchy
create ebooks/bilingual books
Duplicates and pair detection
list of correspondences (e.g. Adson → Адсо, London → Londres)
calculate best threshold values in real time
38. Future work
Document synchronization
interactive mode
improvements on synchronization matrix and metrics
hierarchical sections
other section alignment algorithms
Corpora-flow
finish specification and implementation
implement a corpora-flow for Project Per-fide