8. There are rules to follow
When all rules are abided by, the XML is well-formed
9. XML well-formedness rules
(not exhaustive)
• There must be a root element
• Elements must follow naming rules
• All elements must be closed
• Element names are case sensitive
• Elements must be properly nested
• Attributes must be quoted
• Attributes can only appear once in the same start tag
• Some characters cannot be used as such
• Entities must be declared
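The rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the sample documents and the helper name `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One hypothetical sample per rule; only the first one is well-formed.
samples = {
    "well-formed": "<doc><p id='1'>text</p></doc>",
    "two root elements": "<a/><b/>",
    "element not closed": "<doc><p>text</doc>",
    "case mismatch": "<Doc></doc>",
    "bad nesting": "<doc><b><i>text</b></i></doc>",
    "unquoted attribute": "<doc id=1/>",
    "duplicate attribute": "<doc id='1' id='2'/>",
    "raw ampersand": "<doc>R&D</doc>",
    "undeclared entity": "<doc>&company;</doc>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

A parser stops at the first violation it finds, so a file with several problems usually has to be fixed one error at a time.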
11. Elements must follow naming rules
Names can only start with
• A letter (in any language, including accented letters)
• A colon
• An underscore
Example of a valid name: 筆者
12. Elements must follow naming rules
Names cannot contain
• White spaces
• Most punctuation characters except colon, underscore,
hyphen, dot, middle dot
• Symbol characters
Example of an invalid name (contains a space): 筆 者
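A quick way to try out the naming rules is to let a parser decide. This is a sketch only; the helper `accepts_name` and the candidate names other than 筆者 are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def accepts_name(name):
    """Return True if the parser accepts `name` as an element name."""
    try:
        ET.fromstring(f"<{name}>x</{name}>")
        return True
    except ET.ParseError:
        return False


# Hypothetical candidates: the first three follow the rules, the rest do not.
candidates = ["筆者", "_note", "file-name.v2", "筆 者", "2nd", "-dash", "a!b"]

for name in candidates:
    print(repr(name), accepts_name(name))
```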
17. Attention to those darn quotes
If the attribute value is delimited by double quotes, you cannot use
double quotes inside the value. The same applies to single quotes.
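When attribute values are generated by a script, the quoting can be left to a helper. The sketch below uses `quoteattr` from Python's standard library, which picks a delimiter that does not clash with the value; the element name `note` and the sample values are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical attribute values containing double quotes, single quotes, or both.
values = ['He said "hi"', "it's fine", 'both " and \'']

for value in values:
    attr = quoteattr(value)  # returns the value including its delimiters
    element = ET.fromstring(f"<note text={attr}/>")
    # The parsed attribute should match the original value.
    print(attr, "->", element.get("text") == value)
```

When the value contains both kinds of quotes, one of them has to be written as an entity.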
19. Some characters cannot be used
• < and & need to be escaped into entities: &lt; and &amp;
• Most control characters
(for example backspace; tab, line feed and carriage return are allowed)
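A small sketch of both points, using Python's standard library. The sample strings and the helper `parses` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escape the characters that cannot appear as such in element content.
raw = "if a < b && c"
safe = escape(raw)
print(safe)

# The parser turns the entities back into the original characters.
element = ET.fromstring(f"<code>{safe}</code>")
print(element.text == raw)


def parses(xml_text):
    """Return True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses("<a>tab\there</a>"))       # a tab is accepted
print(parses("<a>back\x08space</a>"))   # a backspace character is rejected
```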
20. A word about entities
Entities are used to represent characters or a sequence of
characters that needs to be repeated throughout a document
Syntax: an ampersand, the entity name, then a semicolon (&name;)
21. Predefined XML entities
5 predefined character entities; only 2 are obligatory (&lt; and &amp;)
&lt;     <   less than
&gt;     >   greater than
&amp;    &   ampersand
&apos;   '   apostrophe
&quot;   "   quotation mark
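To see the five predefined entities at work, a parser can be asked to resolve them. This is a minimal sketch with a made-up element name `s`; it also shows that >, ' and " are accepted unescaped in element content.

```python
import xml.etree.ElementTree as ET

# All five predefined entities are resolved without any declaration.
resolved = ET.fromstring("<s>&lt; &gt; &amp; &apos; &quot;</s>").text
print(resolved)

# Greater-than, apostrophe and quotation mark may also appear literally in content.
literal = ET.fromstring("<s>> ' \"</s>").text
print(literal)
```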
22. Entities must be declared
Except for predefined entities all entities must be declared in
the Document Type Definition
[Example: a DTD containing an entity declaration, and the entity used in the document]
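As an illustration, here is a sketch of a document that declares an entity in its internal DTD and uses it twice, next to one that uses the same entity without declaring it. The entity name `company` and its replacement text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal DTD subset, then used in the content.
declared = """<!DOCTYPE doc [
  <!ENTITY company "Example Corp">
]>
<doc>Welcome to &company;. &company; thanks you.</doc>"""

# Same entity used without a declaration.
undeclared = "<doc>Welcome to &company;.</doc>"

expanded = ET.fromstring(declared).text
print(expanded)

try:
    ET.fromstring(undeclared)
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("rejected:", error)
```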
23. Other constructs
• XML declaration
• Stylesheet declaration
• Document Type declaration
• Comments
• CDATA
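The constructs listed above can be seen together in one small document. The sketch below parses it with Python's `minidom` and prints the kind of each top-level node; the file name `preview.xsl` and the content are assumptions. The XML declaration is consumed by the parser and does not show up as a node.

```python
from xml.dom import minidom, Node

document = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="preview.xsl"?>\n'
    b'<!DOCTYPE doc>\n'
    b'<!-- a comment -->\n'
    b'<doc><![CDATA[a < b & c]]></doc>'
)

labels = {
    Node.PROCESSING_INSTRUCTION_NODE: "stylesheet declaration (processing instruction)",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "root element",
}

kinds = [node.nodeType for node in document.childNodes]
for kind in kinds:
    print(labels.get(kind, "other"))

# Inside CDATA, < and & are taken literally and need no escaping.
cdata_text = "".join(child.data for child in document.documentElement.childNodes)
print(cdata_text)
```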
28. DTDs in the localization world
Don't be scared, but XML really is everywhere
• TMX
• TBX
• XLIFF
• TTX
• SRX
• Qt Linguist TS
• DITA
• ...
29. Encoding
All XML parsers must support at least UTF-8 and UTF-16.
Default encoding is UTF-8.
Always a good idea to specify the encoding
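A short sketch of why the declaration matters, using made-up content. Without a declaration the parser assumes UTF-8, so the same text saved in another encoding only parses when that encoding is declared.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes without a declaration: parsed with the default encoding.
utf8_bytes = "<t>café 東京</t>".encode("utf-8")

# ISO-8859-1 bytes, once with and once without an encoding declaration.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><t>café</t>'.encode("iso-8859-1")
undeclared = "<t>café</t>".encode("iso-8859-1")

print(ET.fromstring(utf8_bytes).text)
print(ET.fromstring(declared).text)

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError as error:
    undeclared_ok = False
    print("rejected:", error)
```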
30. Byte Order Mark
A character to indicate the byte order of an XML document
In UTF-8 it's optional and not even recommended
In UTF-16 it's used to indicate endianness:
little-endian or big-endian
If you see these at the start of a file, something's wrong:
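The characters originally shown here are not preserved in this copy. As an assumption about what such stray characters typically look like, the sketch below prints the byte order marks and how they appear when a tool wrongly displays them as Windows-1252 text.

```python
import codecs

boms = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 little-endian": codecs.BOM_UTF16_LE,
    "UTF-16 big-endian": codecs.BOM_UTF16_BE,
}

for name, bom in boms.items():
    # Show the raw bytes and what they look like when misread as Windows-1252.
    print(name, bom.hex(), bom.decode("cp1252"))

# Decoding with "utf-8-sig" drops a leading UTF-8 BOM if there is one.
with_bom = codecs.BOM_UTF8 + b"<a/>"
print(with_bom.decode("utf-8-sig"))
```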
34. How to apply an XSLT
Declare the stylesheet in the XML file itself
Use an application like XMLSpy or xmlstarlet
35. XSLT localization examples
• Convert a TTX to a two-column HTML or CSV
• Convert a TMX to a TBX
• Convert a TMX to a TXT (for spell-check in MS Word)
• Convert multilingual XML to TMX/TBX
• Generate HTML preview for XML in SDL Trados Studio
• Prepare XML files for translation
36. XPath
It's a query language to select nodes from an XML document
It's used in XSLT
An expression such as //name[@attr='value'] will select all name
elements that have an attribute called attr and whose value is value
And also in SDL Trados Studio file types
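XPath can also be tried outside XSLT. The sketch below uses the limited XPath subset supported by Python's `ElementTree`; the element name `string` and the attributes `lang` and `translate` are hypothetical and not taken from any particular file format.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<strings>
  <string lang="en" translate="yes">Hello</string>
  <string lang="ja" translate="yes">こんにちは</string>
  <string lang="en" translate="no">OK</string>
</strings>""")

# Select every <string> element whose translate attribute equals "yes".
matches = doc.findall(".//string[@translate='yes']")
texts = [match.text for match in matches]
print(texts)
```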
37. Is XML good for localization?
Yes, but not always
38. XML is great for localization
• Unicode supported by default
• Metadata gives more information about content
• Separates content from formatting (to some extent)
• Human readable
• Easily transformable using XSLT
• Excellent for single-sourcing
39. But bad XML is bad
• Translatable content in attributes
• No metadata to distinguish between content
e.g. mixed languages, translatable vs not translatable
• CDATA is just plain cheating
• Bad implementations of standards (XLIFF)
40. And also
• Multilingual XML can be challenging (XSLT can help)
東京
• Big files and one-liners can cause processing problems
(pretty-printing can help)
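Pretty-printing a one-liner can be sketched with Python's `minidom`. The sample content is made up, and the added whitespace may matter in elements with mixed content, so this is best done on a copy of the file.

```python
from xml.dom import minidom

# A hypothetical file where everything sits on a single line.
one_liner = "<tmx><body><tu><tuv><seg>Hello</seg></tuv></tu></body></tmx>"

# Re-serialize with one element per line and two-space indentation.
pretty = minidom.parseString(one_liner).toprettyxml(indent="  ")
print(pretty)
```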
41. Tools, tools, tools
• Altova XMLSpy: all-round XML editor
• Altova DiffDog: compare XML files
• xmlstarlet: command line XML toolkit
• EditPad Pro for all encoding/BOM matters