Zanettin

Intercultural Faultlines

Stubbs, Michael (1995a) 'Collocations and Semantic Profiles: On the Cause of the Trouble
with Quantitative Studies', Functions of Language 2(1): 23-55.
------ (1995b) 'Collocations and Cultural Connotations of Common Words', Linguistics
and Education 7: 379-390.
Toury, Gideon (1995) Descriptive Translation Studies - and Beyond, Amsterdam and
Philadelphia: John Benjamins.
Vanderauwera, Ria (1985) Dutch Novels Translated into English: The Transformation
of a "Minority" Literature, Amsterdam: Rodopi.
Wodin, Natascha (1983) Die gläserne Stadt, Leipzig: Reclam Verlag.
Woolls, David (1997) Multiconcord, Version 1.5, Birmingham: CFL Software Development.
Zürn, Unica (1977) Der Mann im Jasmin, Frankfurt am Main and Berlin: Verlag Ullstein;
translated by Malcolm Green as The House of Illnesses, 1993, and The Man of Jasmine
& Other Texts, 1994, London: Atlas Press.

8 Parallel Corpora in Translation Studies: Issues in Corpus Design and Analysis

FEDERICO ZANETTIN

Abstract: This chapter deals with the design and analysis of 'translation-driven'
corpora, i.e. principled collections of electronic texts compiled with the aim of
studying translation products and processes, with special reference to parallel
corpora. The comparison and contrast of paired translation units is, of course, not
new to translation research, but the possibility of retrieving on a computer screen
hundreds of similar contexts and their translations, and the relative ease of combining
this with statistical analysis and data manipulation, allow hypotheses to be tested on
a larger scale as well as tentative generalizations to be made. Corpus linguistics is
seen as a methodology which can be applied to the study of translations in the same way
as it is applied to the study of texts in monolingual corpora, as well as other [...]
to complement other types of investigation of printed texts in translation studies. It
is suggested in this chapter that issues of corpus design (which types of texts are
included, what languages are involved, which criteria are used for sampling, what the
research aims and applications of the projects are), and corpus encoding (how the
'translation' from printed texts to electronic corpus comes about) deserve careful
consideration insofar as these issues are likely to affect findings based on corpora.
It is also argued that, in order to enhance and maximize the advantage which can be
derived from research based on electronic texts in translation studies, there is a
need for greater standardization and interchange of corpus resources.

1. Introduction
In the past ten years or so there has been a growing interest in the application of
computer-assisted methods of investigation to the study of translation and translated
texts. In particular, corpus linguistics has served as a methodological framework
for the creation and use of what I would like to call 'translation-driven corpora'.
The corpora developed and used within this approach have quite different
characteristics, depending on the aims and research interests upon which the various research
projects are based, but they generally involve a comparison between two different
corpus components.
A first type of 'translation-driven corpus' is the monolingual comparable corpus,
consisting of a set of translations and a comparable set of texts spontaneously
created in the same language and selected according to similar design criteria.
Corpora belonging to this first type include the English Comparable Corpus (ECC),
developed at UMIST, Manchester under the direction of Mona Baker, and the
Finnish Comparable Corpus at the University of Joensuu (see Jantunen 2000;
Kemppanen 2000; Tirkkonen-Condit 2000). These monolingual comparable corpora
are 'translation-driven' or 'translation dependent' in that the "non-translational
component is modelled on the composition of the translational set" (Laviosa 1997:
293). The motivations behind the construction of a corpus with these characteristics
are mostly theoretical; the comparison between the two components of the corpus
should enable investigation of the linguistic features of translated texts as opposed
to those found in spontaneous text production. Since translation is "a communicative
event which is shaped by its own goals, pressures and context of production"
(Baker 1996: 175), texts produced as a result of this activity should show a
distinctive linguistic make-up, which can be [...] comparable corpora.
A second type of 'translation-driven corpus' is the bilingual comparable corpus.
Neither of the two components includes translated texts; what is compared are texts
spontaneously produced in two languages under similar circumstances and within
the same domains. Corpora of this kind have been created and used mainly with
[...] in mind, either for the extraction of bilingual terminology (Laffling
1992) or for the training of translators (Zanettin 1994, 1998, forthcoming; Gavioli
and Zanettin 1997; Peters and Picchi 1998). A bilingual comparable corpus is
translation-driven in that the ultimate aim of its creation is to develop a tool and a
resource for trainees and practitioners in the translation profession, and its
composition is dictated by the provision of translating texts belonging to a specific
genre. This type of corpus, when in printed rather than in electronic format, is also
referred to in translation studies as 'parallel texts' (e.g. Piotrowska 1997;
Schäffner 1998).
A third type of 'translation-driven corpus', and the one on which this chapter
will focus, is the parallel corpus, comprising a set of translations in one language
and their respective source texts in another language. Parallel corpora can be used
for studying translated texts with two different goals in mind; they can be used by
researchers to describe what translators actually do with texts and how they
transform them in the process of translation, and they can help practitioners to make
informed choices based on translation traditions and norms while translating or
learning to translate. A parallel corpus can also be created to contain both directions
of translations, thereby forming a bi-directional parallel (Aston 1999) or reciprocal
(Teubert 1996) corpus, appearing to encompass all the different varieties described
above. The design of such a corpus is not devoid of theoretical problems, as will be
argued in the following section. However, an examination of these problems may
also provide a means of improving the design of the different kinds of
translation-driven corpora. In this chapter I will briefly exemplify some of the issues
which arise in the design of bi-directional parallel corpora with reference to English
and Italian, focusing on some of those corpus design issues which are related to the
specific translation-driven nature of a corpus and which may be different from
concerns which surface in the design of general-language corpora. These issues are
aspects of representativeness, diversification and encoding.

2. Representativeness in bi-directional parallel corpora

One of the major issues in corpus design is that of representativeness; what
distinguishes a corpus from a collection of electronic texts (or a text archive) is that a
corpus is put together in a principled way so as to be representative of a larger
textual population, in order to make it possible to generalize findings concerning
that population. Thus, the most appropriate design for a corpus depends on what it
is meant to represent (Biber 1993; Halverson 1998; Kennedy 1998; Biber et al.
1998). This should be remembered before making any general statements about
language, texts, or translations based on corpus analysis; what is found in a corpus
will only apply to what that particular corpus represents. The more general, large
and varied the textual population to be represented, the more variables must be
taken into account in the selection of the texts to be included in the corpus. To
ensure representativeness of a certain genre, decisions must be taken as to
sampling criteria:

• Should translations be chosen at random from within the total population to
  be represented, or should a motivated choice be made based on criteria such
  as text status and reception (e.g. prestige, readership)? Even when making a
  random selection, we still need to define the boundaries and internal
  categories of the total population, so all selections are bound to be subjective
  to a certain extent.
• Should the corpus be composed of complete texts or samples, and if the
  latter, what size and type should the samples be? There seems to be general
  agreement among translation scholars that the basic unit in translation is a
  full text, but practical limitations may still lead corpus designers to opt for
  samples.
• In any case, a compromise has to be reached between what is desirable and
  what is feasible on practical grounds; constraints include availability of texts,
  copyright restrictions and project funding.

One of the best known parallel corpora is the English Norwegian Parallel Corpus
(ENPC), developed at the University of Oslo and documented in a number of
publications (e.g. Johansson et al. 1996; Ebeling 1998). The ENPC has been taken
as a model in a number of projects for other language pairs (Johansson 1998: 9-10)
and as a starting point for the English Italian Translational Corpus (CEXI) project
at the School for Translators and Interpreters (SSLMIT) of the University of
Bologna at Forlì, which is currently at the stage of corpus design.

Figure 1, adapted from Johansson and Hofland (1994: 26), shows the composition
of a bi-directional parallel corpus based on the design of the ENPC. Each corpus
component (CC) has a number from 1 to 4.

        Language A                          Language B

  CC1   Translations in               CC2   Translations in
        Language A                          Language B

  CC3   Source Texts for              CC4   Source Texts for
        Translations in                     Translations in
        Language B                          Language A

Figure 1: A bi-directional parallel corpus

In principle, such a composition should not only allow comparison of translations
and source texts (CC1:CC4, CC2:CC3) but also of translations and non-translations
in the same language (CC1:CC3, CC2:CC4), as one would do using a monolingual
comparable corpus. Furthermore, it would also seem possible to compare
non-translated texts in the two different languages (CC3:CC4), as one would using a
bilingual comparable corpus.1 All three types of translation-driven corpora described
in Section 1 are then ideally represented in a bi-directional parallel corpus and, in
addition, for two different languages. A closer examination, however, reveals that
the different components of a bi-directional parallel corpus are hardly comparable
as they stand.

1 Yet another possibility would be to compare translated texts in two different languages
(CC1:CC2), i.e. a comparable bilingual corpus of translations (Johansson 1998: 8). This type
of comparison has not yet attracted the attention of scholars however.

Like the ENPC, the CEXI is a bi-directional parallel corpus, comprising both
English translations and Italian source texts and Italian translations and English
source texts. Like most translation-driven corpora, the Italian-English translational
corpus only aims at representing written, published translations, and within these
only the universe of translated books (to the exclusion of journals or magazines),
rather than 'translation' in its entirety. Translated books may represent only a small
percentage of all translated texts, but they constitute data which are more readily
available. In Italy, one in four published books is a translation, and half of these are
translated from English (Vigini 1999: 87). On the other hand, only about 2% - or
one in fifty - of the books published in English-speaking countries such as the UK
and the USA are translations (Venuti 1995: 12), of which only about 4% are from
Italian (Index Translationum 1998). The difference is not only one of relative
quantity, but also one of quality; that is to say, if we compare translated narrative
fiction in the two languages, we find that, while most genres are translated into
Italian, and within popular fiction (detective, romance and science-fiction novels)
translations may even constitute the majority of all published books, the vast majority
of prose fiction translated from Italian into English is what may be called
high-quality, 'literary' fiction, by which I mean authors like Eco, Calvino or Tabucchi.
This means that in a bi-directional parallel corpus of English and Italian narrative
fiction, if the two translational components of the corpus are representative of
the respective populations of translated books, strictly speaking they will not be
comparable with the non-translational components, i.e. the source texts for the
translations in the two languages respectively. In short, for Italian, mostly translated
'popular' fiction would be compared with original 'literary' fiction, while the
reverse would be true for English, in which translated 'literary' fiction would be
compared with mostly original 'popular fiction'. Since the flow of translations and
the policies regarding them differ in different cultures and for different languages,
it is difficult to envisage the same design for parallel corpora in different directions
of translation and in different language pairs.

3. Diversification in translation-driven corpora

Another key issue in the design of translation-driven corpora concerns the need to
use different types of corpora according to researchers' aims and objectives. Full
reciprocity in a bi-directional parallel corpus would, in fact, appear impossible, as
the concern with representativeness clashes with the concern with comparability.
Is there a way out of this apparent dead end? I believe this can only be found through
diversification, i.e. by using the same translated texts in corpora of different types
created ad hoc according to specific design criteria. The comparison of translated
texts with their source texts should be complemented with the comparison between
these translated texts and a reference corpus in the same language. The ideal
translational corpus is then not a pre-formed set of texts but an open-ended corpus
comprising different components which allow different types of comparison to
be made.
As an example, let us consider a simple quantitative analysis involving type/token
ratio for a quite straightforward single source author parallel corpus, comprising
five novels and a short story by Salman Rushdie and their Italian published
translations. Type/token ratio is taken to be an indicator of lexical variety in texts
(Baker 1995 and 1996; Laviosa 1998a); the more word forms are used with respect
to the total number of words in a text, the wider the range of vocabulary used in the
text, and more effort may be required on the part of the reader to process it than
texts with less varied vocabulary. Type/token ratio is also a function of the total
length of a text, so that to have comparable figures for texts of different length we

need a standard measure. Wordsmith Tools (Scott 1996) allows the standardized
type/token ratio to be calculated, and the base length chosen for this investigation
was the suggested 1000 words. Table 1 shows the standardized type/token ratio for
the Rushdie parallel corpus:

Texts                            Tokens      Types      T/T ratio   T/T ratio
                                 (running    (word      (whole      (std 1000)
                                 words)      forms)     texts)

I figli della mezzanotte         224,398     23,999     10.69       52.69
I versi satanici                 196,646     23,536     11.97       53.41
L'ultimo sospiro del Moro        167,630     21,843     13.03       53.63
La vergogna                      107,946     15,884     14.71       53.57
Harun e il mar delle storie       44,624      7,487     16.78       49.86
Chekov e Zulu                      5,025      1,946     38.73       55.54
Italian translations (total)     746,269     47,762      6.40       53.09
[English source texts: one row per text and a total row, figures illegible]

Table 1: Type/token ratio for Rushdie parallel corpus

The higher the figure obtained by dividing types by tokens, the higher the lexical
variety. As can be seen from the last column, the standardized type/token ratio for
the translations is higher than that of the source texts, both as a whole and with
regard to individual text pairs. This may appear to indicate that the vocabulary used
in the translations is more varied than that of the source texts, and it runs counter to
the suggestion that translations are lexically less varied than originals as a result of
the translation process (Baker 1996; Kohn 1996).
The data, however, are not directly comparable, as different values could be due
to structural differences between the two languages rather than to the process of
translation (Munday 1998: 545). Therefore the type/token ratio for the texts in the
two languages has to be related to reference corpora for English and Italian.

Corpus                                   Tokens     Types     T/T ratio   T/T ratio
                                                                          (std 1000)

Italian Reference Corpus                 856,001    58,864    6.88        49.99
Rushdie corpus (Italian translations)    [figures illegible]
Rushdie corpus (English source texts)    [figures illegible]
English Reference Corpus                 843,629    27,468    3.26        44.44

Table 2: Type/token ratio: Rushdie parallel corpus vs. comparable reference corpora

The reference corpus for English was made up of 20 novels randomly selected from
the written imaginative component of the British National Corpus (1995). Six of
these novels are full texts, while the remaining fourteen are extracts of about 40,000
words each. The reference corpus for Italian comprises seventeen full text novels
by contemporary Italian writers downloaded from the Internet. Thus, the two
reference corpora used in this experiment are only loosely comparable, as the main
criterion for the inclusion of texts in the corpus was their availability in electronic
format. They should however provide useful indicative reference values.
The overall standardized type/token ratio for the Italian reference corpus (49.99) is
5.55 points higher than that of the English reference corpus (44.44). We can also see
that the ratio for the two components of the Rushdie corpus (translations and source
texts) is higher than the average for both Italian and English. However, the difference
between the source texts by Rushdie and the reference corpus for English is higher
(5.2) than that between the translations and the reference corpus for Italian (3.1), as
can be seen in Figure 2.
A further check is necessary at this point to make sure that the results are
statistically significant. What must be computed is the standard deviation for the two
reference corpora, that is to say, the range outside which variation between a
reference corpus and the texts in the corresponding component of the Rushdie parallel
corpus cannot be ascribed to chance. The standard deviation is calculated through a
formula which correlates the average standardized type/token ratio for a sample to
the size of that sample.2

2 The formula used to calculate the standard deviation was the following:
[formula illegible]

Table 3 shows the standardized type/token ratio for each
text in the parallel corpus (column 2), followed by the average standardized type/
token ratio for the reference corpora (Column 3: 44.44 for English and 49.99 for
Italian) and the standard deviation (Column 4: ± 0.62 for English, ± 0.88 for
Italian). The fifth column shows the gap between each text and the average, which is,
in all but one case, significantly higher than the standard deviation for the reference
corpora.

Figure 2: Type/token ratio (std. 1,000): Rushdie parallel corpus vs. comparable
reference corpora

Table 3: Lexical variation in a parallel corpus

As we can see, the difference between the English source texts and the average for
the English reference corpus is almost double that between the Italian translations
and the average for the Italian reference corpus, except for the first and last parallel
texts. Significantly, the first text, Chekov and Zulu, is a short story with a rather
high type/token ratio, while the last one, Haroun and the Sea of Stories, is, at least
on one level of reading, a story for children, with a rather low type/token ratio, and
the difference from the average in the Italian translation is not statistically
significant, i.e. the gap is within the range of the standard deviation. This could also be
taken to suggest that above and below a certain threshold, there is no significant
variation in type/token ratio between translations and source texts.
Summing up, we can say that, for at least four novels by Rushdie, while both
translations and source texts have a higher type/token ratio than the average for the
respective languages, the translations are much closer to this average than source
texts, leading to the tentative conclusion that they are lexically less varied than the
source texts as a result of the process of translation. It is true that type/token ratio is
not by itself conclusive evidence of lexical variety. It should be considered alongside
other indexes such as lexical density, i.e. the ratio of lexical to grammatical
words, or the ratio of hapaxes (i.e. those words occurring only once in a text) to the
total number of words (Baker 1996; Laviosa 1998b). However, this example has
been used for methodological purposes to demonstrate that parallel corpora need to
be used in conjunction with other corpus resources. In this case, the use of a
reference corpus enabled us to go beyond a provisional finding based on the figures for
a parallel corpus alone.
As much as analyses based on parallel corpora can profit from the comparison
with data taken from monolingual reference corpora, other types of translation-driven
corpora could benefit from diversification of analysis. For instance, if we were to
compare the corpus of the Italian translations of Rushdie's novels with the Italian
reference corpus alone, we would also have to hypothesize that (these) translations
are more lexically varied than (the reference corpus of) texts spontaneously
produced in the same language.
The reference corpora used in this experiment were composed of non-translated
texts only. However, other kinds of reference corpora are also possible, depending
on the purpose of the investigation. To be representative of a particular genre and
thus reflect its norms and conventions, a reference corpus will have to be created by
looking at the actual textual population for that genre, regardless of the status of the
texts with respect to translation. If, for instance, what one wants to see is the
expectancy norm for the language of popular narrative fiction in Italian, a reference
corpus might well include translations rather than containing only texts spontaneously
produced in Italian, since expectations for this kind of texts are created partly by
translations themselves (Chesterman 1997: 67).

4. Corpus encoding

After texts have been selected for inclusion in a corpus, they need to be acquired in
electronic format, and the process of 'translating' texts from paper into digital
format, as in all kinds of intersemiotic translation, may involve losses as well as gains.

What may be lost is useful or even essential information concerning the context of
reception of the printed texts, such as paratextual and visual information. This is
why the investigations of electronic corpora should, if possible, be coupled with the
analysis of the printed source texts, for which the electronic versions are not a
substitute but a complement. On the other hand, an electronic text can be 'enriched'
with linguistic and extralinguistic information, providing a means of carrying out
analyses which would not be possible without using texts in electronic format.
At the linguistic level, corpora can be annotated, adding to each running word
part-of-speech tagging as well as syntactic and semantic information. The parsing
and tagging of the corpus can be automated to a large degree (Wilson and McEnery
1996; Kennedy 1998), and computer applications have been developed to facilitate
the insertion of annotation of discourse features, such as referring expressions (Biber
et al. 1998). A linguistically annotated corpus makes possible more refined analyses;
for instance, type/token ratio statistics carried out on a lemmatized corpus could
probably provide a more accurate picture of lexical variety than figures based on
mere word-form counts. A corpus tagged for parts of speech could be searched for
just one use of a word form, for example to obtain a concordance of 'set' as a noun
rather than as an adjective or a verb.
As regards extralinguistic information, all the variables considered in the phase
of corpus design can be encoded in each electronic text so that they can be retrieved
and used as selection criteria for inclusion in a different corpus. In a corpus of
translated books, for example, coding could include bibliographical information
about the printed text, information about the translator and the process of
translation, and about the source text for the translation. We may, for example, want to
select translations not only according to when they were published, but also
according to the amount of time that has elapsed between their publication and that of
the source texts.3

3 The Translational English Corpus (TEC) header (Baker 1999) includes many of these
specifications.

In order to be able to exchange and re-use corpus resources and in order to
replicate and compare findings based on different corpora, the adoption of a common
standard for the encoding of translated texts seems advisable. The Text Encoding
Initiative (Sperberg-McQueen and Burnard 1994) provides useful guidelines for
the encoding of electronic texts in accordance with the international SGML
standard (ISO 8879: 1986).4 A Corpus Encoding Standard (CES) compliant with these
guidelines has been developed and is already available in XML format (Ide and
Bonhomme 2000), enabling the encoding of corpora in one or more languages, and
also of parallel corpora. XML is a simplified subset of SGML designed for use on the
web; corpora encoded in this format could be accessed through the Internet,
effectively constituting a large textual database. For instance, the same translated text
could be part of a comparable monolingual corpus or of one or more bilingual
parallel corpora, with different overall designs. The benefits to be derived from the
encoding of texts, and specifically of translations, in accordance with international
standards, making possible the exchange of primary data (the electronic texts) as
well as of secondary data (findings based on such data) among researchers, may
well compensate for the time and effort required to create corpus resources in
accordance with international standards.

4 International Organization for Standardization, ISO 8879: Information processing - Text
and office systems - Standard Generalized Markup Language (SGML), ([Geneva]: ISO, 1986).

Parallel corpora allow not only quantitative analyses of the kind illustrated
through the Rushdie example to be carried out. They also facilitate qualitative
analyses, based on the examination of parallel concordances. A search in an aligned
parallel corpus could go in both directions, from source to target texts or vice versa.
In the first case, a concordance of a source language item or pattern would
highlight recurrent translation choices made by the translators; in the second case, a
concordance of target language items or patterns would yield as a result the range
of source language features from which those items or patterns originated.
In any case, in order to carry out parallel concordances, parallel corpora need to
be aligned. Alignment procedures can, to a large extent, be automated, and may be
performed on the basis of statistical elaboration, taking into account the number of
sentences, words or even characters in the pairs of texts to be aligned (Church and
Gale 1991), or using bilingual lexicons (Peters and Picchi 1998) or 'anchor lists'
(Hofland and Johansson 1998). However, while advancements in automatic corpus
alignment techniques will certainly help enhance not only machine translation
research but also descriptive studies, a certain degree of interaction between machines
and humans during the alignment process will not only continue to be necessary but
may also provide a way of examining in some detail how translations map onto
source texts. Manual proofreading of aligned texts may help the researcher to
observe any regularities which may characterize a certain body of translations as
regards the segmentation, completeness and distribution of translated segments, in
order to investigate what Toury (1995: 58-59) calls "matricial norms".
The output of the alignment procedure can be either bilingual texts with
alternating source text segments and their translations, or monolingual texts in which
segments which correspond are numbered, paired, and retrieved on the basis of an
index. In order to allow different types of comparison to be made, it should be
possible to take individual electronic texts from translation-driven corpora,
including parallel corpora, and re-use them in different projects. The Corpus Encoding
Standard mentioned above makes possible the creation of parallel corpora in which
translations and source texts are encoded as individual entities and from which the
retrieval of parallel concordances is carried out on the basis of an external index
generated during the alignment process. The same source text could thus be used in
a monolingual comparable corpus, aligned with translations in different languages
or with different translations in the same language.
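The retrieval of parallel concordances through an external alignment index, as described in Section 4, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the mechanism of the Corpus Encoding Standard or of any particular aligner: the sentences, the segment identifiers and the function name parallel_concordance are all invented for the example, and matching is a plain substring test.

```python
# Toy data, invented for illustration: each text is stored on its own,
# as a mapping from segment id to segment text.
english_source = {
    "en1": "The house was empty.",
    "en2": "She left the house at dawn and never came back.",
    "en3": "He bought a new home.",
}
italian_translation = {
    "it1": "La casa era vuota.",
    "it2": "Lasciò la casa all'alba.",
    "it3": "Non tornò più.",
    "it4": "Comprò una nuova casa.",
}

# External alignment index (hypothetical format): each entry links source
# segment ids to target segment ids, so one source sentence may map onto
# two target sentences.
alignment_index = [
    (("en1",), ("it1",)),
    (("en2",), ("it2", "it3")),
    (("en3",), ("it4",)),
]


def parallel_concordance(query, source, target, index, search_side="source"):
    """Return aligned (source, target) pairs whose searched side contains `query`.

    Matching is a case-insensitive substring test, which is cruder than the
    word or pattern searches offered by real concordancers.
    """
    if search_side not in ("source", "target"):
        raise ValueError("search_side must be 'source' or 'target'")
    needle = query.lower()
    hits = []
    for source_ids, target_ids in index:
        source_segment = " ".join(source[i] for i in source_ids)
        target_segment = " ".join(target[i] for i in target_ids)
        searched = source_segment if search_side == "source" else target_segment
        if needle in searched.lower():
            hits.append((source_segment, target_segment))
    return hits


if __name__ == "__main__":
    # Source-to-target search: how is "house" rendered?
    for src, tgt in parallel_concordance(
        "house", english_source, italian_translation, alignment_index
    ):
        print(f"{src}  |  {tgt}")
    # Target-to-source search: which source items lie behind "casa"?
    for src, tgt in parallel_concordance(
        "casa", english_source, italian_translation, alignment_index,
        search_side="target",
    ):
        print(f"{tgt}  |  {src}")
```

Because the two texts are kept apart and only the index joins them, the same source text could in principle be paired with a second index pointing to another translation, which is the kind of re-use the section argues for.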

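The quantitative check used for the Rushdie example in Section 3 can be retraced with a short Python sketch. It shows a plain type/token ratio, a standardized ratio averaged over consecutive 1000-word stretches, and the comparison of a text's ratio with a reference value and its margin. Several points are assumptions made for the illustration rather than a description of Wordsmith Tools: the tokenizer is deliberately naive, a trailing stretch shorter than the base length is simply dropped, and the function names are hypothetical. The figures in the demonstration are the Italian values reported in Section 3 (reference corpus 49.99, margin 0.88).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Distinct word forms divided by running words, as a percentage."""
    if not tokens:
        raise ValueError("empty token list")
    return 100.0 * len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, base=1000):
    """Mean type/token ratio over consecutive stretches of `base` tokens.

    Assumption: a trailing stretch shorter than `base` is ignored.
    """
    chunks = [tokens[i:i + base] for i in range(0, len(tokens) - base + 1, base)]
    if not chunks:
        raise ValueError("text is shorter than the base length")
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


def gap_from_reference(text_ratio, reference_ratio, margin):
    """Return the gap to the reference value and whether it exceeds the margin."""
    gap = text_ratio - reference_ratio
    return gap, abs(gap) > margin


if __name__ == "__main__":
    # Standardized ratios of the Italian translations, as reported in Table 1.
    italian_std_ratios = {
        "I figli della mezzanotte": 52.69,
        "I versi satanici": 53.41,
        "L'ultimo sospiro del Moro": 53.63,
        "La vergogna": 53.57,
        "Harun e il mar delle storie": 49.86,
        "Chekov e Zulu": 55.54,
    }
    for title, ratio in italian_std_ratios.items():
        gap, outside = gap_from_reference(ratio, 49.99, 0.88)
        print(f"{title}: gap {gap:+.2f}, outside margin: {outside}")
```

With these inputs only Harun e il mar delle storie stays inside the margin, which agrees with the observation in Section 3 that its difference from the Italian average is not statistically significant.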
I
Inte rcuhural F aultline s Zanettin: Issues in Corpus Design and Analysis 117
116

5. Conclusion
Gavioli, Laura and Federico zanetin (1997) 'comparable corpora and rranslation: A
Pedagogic Perspective', http://www.sslmit.unibo.itlcultpaps/
(last checked on 15 sep-
tember 2000).
Computerized parallel corpora together with other types of translation-driven corpora can be an invaluable resource in both descriptive and applied translation studies, in that they allow the investigation of linguistic and extralinguistic features of translated texts on a much larger scale than can be achieved by manual analysis of printed texts. However, as more and more translation-driven corpora are created, there is a need for the criteria adopted in the design of these corpora to be carefully considered and made transparent, so that research can be replicated and findings can be compared and evaluated. Compiling corpora can involve a lot of time and effort, and in order to avoid ending up with strands of isolated experiments, we should make sure that the maximum advantage can be derived from the enterprise. The criteria used in selecting the texts to be included in a corpus, and those adopted in preparing them for use with corpus software, e.g. the procedures for alignment of a parallel corpus, have implications for the findings based on the corpora themselves. I would like to suggest that corpus design and encoding should not be seen as merely preliminary to the actual analysis of a corpus, but as important moments in themselves in the study of translation and translated texts.
