Language Tags
and Locale Identifiers




                   A Status Report




                                     1
Presenter and Agenda

             Addison Phillips

             Internationalization Architect, Yahoo!
             Co-Editor, Language Tag Registry Update (LTRU)
             Working Group (RFC 3066bis, draft-matching)

             Language tags
             Locale identifiers




Addison Phillips is the co-editor of the recent Language Tag registry RFC and its
associated matching draft. This presentation details the history of language tags
and locale identifiers on the Internet, with a focus on the recent changes and
updates to RFC 3066 and efforts to create standardized locales and locale
identifiers for the Internet.




                                                                                    2
Languages? Locales?

           What’s a language tag?
           What the #@&%$ is a
            locale?
           Why do identifiers matter?




If the Internet is anything, it is a means of communication. While there are many
forms of communication, language and textual information in particular loom large
in computer systems.
The identification of human “natural language”, as a result, is important, since
users expect their computer systems to interact with textual data in useful ways
(be it searching, relaying, checking, formatting, or otherwise processing it).
Alas, defining what a language is and what constitutes the difference between
various forms of language is a complex problem. And, for computer systems,
there is another kind of beast: the “locale”, which is even more difficult to grasp.
What are these things? How do we identify them? Why do language and locale
identifiers matter?




                                                                                       3
Language Tags

                Enable presentation, selection, and
                negotiation of content
                Defined by BCP 47
                 –   Widely used! XML, HTML, RSS, MIME, SOAP,
                     SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP,
                     perl……….
                 –   Well understood (?)




Natural language and especially written (that is, textual) information are a key and fundamental
part of most computer systems. When computer systems were mostly isolated and not
interconnected, they mostly dealt with a single language at a time and could be tuned to deal with
the particular idiosyncrasies of that language. But the Internet (and other networking technologies)
have changed that. Now textual data may be stored, processed, or viewed in many different
contexts and many different languages simultaneously. And increasingly the boundary between
“computer” and the world at large is becoming blurred: your “computer” today might equally be
your TV, your telephone, your game player, your music player, your PDA, or your automobile. The
digital content delivered to your “computer” is more important than the form factor the computer
itself takes. As text, speech, and other content associated with language become pervasive and
networked together, the selection, identification, and correct processing of the language become
critical.
Most people seem to believe that they have a relatively good grasp of languages and, thus, of
language identification. If you ask your mother-in-law what language the folks in Germany or
France speak, for example, she probably will have a ready answer. But the more one delves into
languages and language identification, the more complex the problem seems to become.
The standard for language identification on the Internet is something called “BCP 47”. It is widely
used: the list above is a small fraction of the formats and technologies that implement it. What,
never heard of “BCP 47”? BCP 47 is the official designation for the language tagging specification
of the IETF. BCP stands for “best current practice”. The most recent document to be BCP 47 is (or,
by the time you read this, “was”) RFC 3066, which was preceded by RFC 1766. You’re probably
more familiar with the RFC numbers than the BCP number.




                                                                                                       4
Locale Identifiers

              Different ideas:
               –   Accept-Locale vs. Accept-Language
               –   URIs/URNs, etc.
               –   CLDR/LDML
              And Requirements:
               –   Operating environments and harmonization
               –   App Servers
               –   Web Services
              New Solution? Cost of Adoption:
               –   UTF-8 to the browser: 8 long years




Locale identifiers, by contrast, are somewhat more difficult to grasp. Your mother-
in-law (unless she’s a software engineer) probably has no idea what a locale is.
One definition of a locale is:
       “a data structure or concept used by programmers to identify a particular 
       collection of cultural, regional, or linguistic preferences.”
Locales are tied to specific programming languages or operating environments.
What they do and how they are identified are unique and usually proprietary.
There is a relationship of sorts between language and locale: most locale
identifiers include a language identifier. So if locale identifiers need to be
exchanged on the Internet, as in Web services or between different application
servers, how would these identifiers be defined?
There are different ideas for how this might happen. One question is cost of
adoption: new headers, identifiers, or data structures might take a long time to
reach “critical mass” and become useful, while adaptation or co-option of existing
structures might introduce problems for existing applications.




                                                                                      5
In the Beginning

             Received Wisdom from the Dark Ages
              Locales:
                 –   japanese, french, german, C
                 –   ENU, FRA, JPN
                 –   ja_JP.PCK
                 –   AMERICAN_AMERICA.WE8ISO8859P1
               Languages…
                 … looked a lot like locales (and vice
                  versa)




In the beginning, there was very little difference between language and locale in computer
systems. Locale identifiers (some historical examples are shown above) usually included some
kind of language identification.
When the Internet became accessible to mere mortals in the early 1990s, language identification
became an immediate concern. The Internet made content easy to exchange across boundaries
and borders in ways that closed networks like CompuServe never could master. Identifying
languages was necessary for applications such as email and HTTP, so Harald Alvestrand worked to
create the first version of BCP 47, known as RFC 1766, to address the problem.
These language tags became widely adopted, as we’ve noted. Locale identifiers were not created
for the Internet, though, because of a lack of distributed applications.
“Now, hold on!” you might say. “I’ve used distributed applications for years now: I’ve got client-
server and I’ve bought books from Amazon or stocks from my broker or airline tickets on-line.
What do you mean ‘there’s a lack of distributed applications’?!?”
It is true that there are client-server architectures and Web applications are now quite
commonplace. However, these are not truly distributed applications. In a Web application, for
example, there is a host where all the logic is stored. This host and its associated programming
language or operating environment completely encapsulate the overall locale model. Client-
server architectures are similar: the client and server each have specific technology choices
associated with them and the business logic lives in one or the other (and usually in the server).
Truly distributed applications are the province of integration (EAI, B2B), Web services, and the
idea of Service Oriented Architectures (SOA). You only need a shared concept of locale when
your logic is being hosted in discrete chunks on multiple systems and when you cannot count on
the systems using the same technology!
Web apps are usually hosted in a single container or are written by people who have chosen a
particular technology. The locale model associated with that technology becomes the locale
model of the Web application. The whole point of Web services, by contrast, is to hide this
technology decision.




                                                                                                     6
Locales and Language Tags meet

             Conversations in
             Prague…
              – Language tags are being
                locale identifiers anyway…
              – Not going to need a big
                new thing…
              – Just a few things to fix…
              … we can do this really fast




In 2002, Mark Davis and I attended the Internationalization and Unicode
Conference in Prague (so you can see that it pays to attend these events!),
where I had a paper about locale identifiers. The basic problem was that
language tags were widely distributed, and, since they looked an awful lot like
POSIX locale identifiers, most Web application platforms were actually using
them as locale identifiers already by mapping language tags to their local
equivalent. Mark was working on the CLDR project and was concerned about
problems involving script identification (especially for compatibility with
Microsoft’s .NET Culture identifiers). It seemed that a few small fixes to BCP 47
(to allow some script subtags) and some documentation (“how to get a locale out
of a language tag”) might solve several problems all at once.




                                                                                    7
BCP 47 Basic Structure

             Alphanumeric (ASCII only) subtags
             Up to eight characters long
             Separated by hyphens
             Case not important (i.e. zh = ZH = zH = Zh)


                  1*8alphanum *( "-" 1*8alphanum )




The basic structure of language tags has been remarkably stable. Language tags
are ASCII strings consisting of subtags separated by hyphens (and not
underscores). The subtags may consist of either (ASCII) letters or digits.
There exist suggested capitalization rules for some of the underlying standards
used by language tags, but these do not apply to language tags and have no
meaning in a language tag context. Language tags are case insensitive.
At the bottom of the slide is the original “ABNF” which describes the language tag
grammar.
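
The grammar above can be sketched as a regular expression. This is a minimal illustration of the basic shape only (ASCII letters and digits, one to eight characters per subtag, hyphen separators, case ignored); the names `is_well_formed` and `same_tag` are hypothetical and not part of any specification.

```python
import re

# One subtag: 1 to 8 ASCII letters or digits (the slide's "1*8alphanum").
SUBTAG = r"[A-Za-z0-9]{1,8}"

# A tag: one subtag followed by zero or more "-subtag" groups.
TAG_PATTERN = re.compile(rf"{SUBTAG}(?:-{SUBTAG})*")


def is_well_formed(tag: str) -> bool:
    """Return True if the tag fits the basic hyphen-separated shape."""
    return TAG_PATTERN.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case."""
    return a.lower() == b.lower()
```

For example, `is_well_formed("zh-TW")` holds, while `is_well_formed("zh_TW")` does not, because the underscore is not a legal separator.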




                                                                                     8
RFC 1766


                 zh-TW
                   ISO 639-1 (alpha2)




                                        ISO 3166 (alpha2)


                                                            i-klingon
                                                                Registered value




RFC 1766 defined language tags in two distinct ways.
All language tags took the form of a sequence of subtags composed of the ASCII
letters and digits separated by the hyphen character. The subtags could be, at
most, eight characters long. RFC 1766 said that:
•If the first subtag consisted of two letters, it was a language code from the ISO
639-1 standard.
•If there was a second subtag (additional subtags were optional) and it consisted of
two letters, it was a region code from the ISO 3166 standard.
Otherwise, the interpretation of the tag (and its subtags) was defined by a registry
maintained by IANA. If users needed a specific language tag, they could write to
a mailing list (ietf-languages@iana.org) and request a registration be created.
Here is one such tag, for the Klingon language.
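
The two rules above can be sketched in a few lines. This is only an illustration of the interpretation described in these notes, not a validating parser: the function name `interpret_rfc1766` and the `needs_registry` flag are hypothetical, and the sketch assumes that anything the two rules do not cover has to be looked up in the IANA registry.

```python
def _is_alpha2(subtag: str) -> bool:
    """True for exactly two ASCII letters."""
    return len(subtag) == 2 and subtag.isascii() and subtag.isalpha()


def interpret_rfc1766(tag: str) -> dict:
    """Apply the two RFC 1766 rules described above to a tag."""
    subtags = tag.split("-")
    info = {"language": None, "region": None, "needs_registry": False}

    # Rule 1: a two-letter first subtag is an ISO 639 language code.
    if not _is_alpha2(subtags[0]):
        info["needs_registry"] = True
        return info
    info["language"] = subtags[0].lower()

    # Rule 2: a two-letter second subtag is an ISO 3166 region code.
    if len(subtags) >= 2:
        if _is_alpha2(subtags[1]):
            info["region"] = subtags[1].upper()
        else:
            info["needs_registry"] = True

    # Assumption: any further subtags are only meaningful if registered.
    if len(subtags) > 2:
        info["needs_registry"] = True
    return info
```

With this sketch, "zh-TW" yields a language and a region, while "i-klingon" is flagged as needing a registry lookup.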




                                                                                       9
RFC 3066


                    sco-GB
                      ISO 639-2 (alpha 3 codes)

                    But use…
                      alpha 2 codes when they exist

                    eng-GB   X




RFC 3066 expanded on RFC 1766, making a few minor additions and cleaning
up a few problems that arose.
The main change was the addition of ISO 639-2 codes for languages. The ISO
639-1 codes are two-letters long and there are, necessarily, a limited number of
these (about 650 total, given that some letters are reserved). Since there are at
least several thousand languages that exist in modern times, this isn’t sufficient
to encode the world’s languages. ISO 639-2 assigns three-letter codes, which
allows for many more potential codes. This allows all of the languages to be
represented by one code or another.
RFC 3066 also mandated that if an ISO 639-1 code exists for a language, then
that code must be used (and not the ISO 639-2 code). This prevents languages
from being encoded using different tags. Thus the tag “eng-GB” is not legal, even
though “eng” is a valid ISO 639-2 code: tags must use the “en” code for English.
The IANA language tag registry remained the same as during the RFC 1766 era:
a collection of isolated registrations.
(‘sco’ is the code for ‘Scots’)
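
The “use the two-letter code when one exists” rule can be sketched as a simple check. The table below is a tiny hand-picked excerpt of ISO 639-2 codes that have ISO 639-1 equivalents, included purely for illustration; the function name `preferred_language_subtag` is hypothetical.

```python
# Illustrative excerpt only: a few three-letter codes with two-letter equivalents.
ALPHA3_WITH_ALPHA2 = {
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
}


def preferred_language_subtag(tag: str):
    """Return the two-letter code that should replace the tag's first
    subtag, or None if the first subtag is not in the excerpt above."""
    primary = tag.split("-")[0].lower()
    return ALPHA3_WITH_ALPHA2.get(primary)
```

Under this sketch “eng-GB” is flagged (it should be “en-GB”), while “sco-GB” is left alone because Scots has no two-letter code.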




                                                                                                       10
Problems

                 Script Variation:
                  –   zh-Hant/zh-Hans
                  –   (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)
                 Obsolescence of registrations:
                  –   art-lojban (now jbo), i-klingon (now tlh)
                 Instability in underlying standards:
                  –   sr-CS (CS used to be Czechoslovakia…)




A variety of problems were associated with language tags, despite their success. The one Mark
and I were primarily interested in was the problem of script variation. Most languages are
customarily written in a single script. They may be transcribed in another script, but most native
speakers and most content in that language use a single script.
A few languages are written equally—or at least “commonly”—in more than one script. Some of
the languages are undergoing transitions (Cyrillic script was imposed on several languages during
the Soviet era, for example), while others are just naturally written in more than one script. For
example, Serbian can be written in either Cyrillic or Latin script. Both traditions are historical to the
language, not artificially imposed.
The most notable example of script variation is in Chinese, where the traditional form of the script
is used in some Chinese speaking regions (Taiwan, Hong Kong) while the simplified form of the
script is used in others (the PRC, Singapore). These variations do not follow spoken variation in
the language (Hong Kong, for example, speaks Cantonese while Taiwan speaks Mandarin)…
which leads to vocabulary and other variations with the writing systems in question. And
identifying “Traditional Chinese” using a region has other cultural sensitivity problems…
Another problem was the relative ease of registration for language tags compared to the action of
the various ISO maintenance and registration bodies. Many of the registered tags were later
deprecated due to standards action.
A last problem I’ll mention here was instability in ISO 3166 (the region codes). Codes in ISO 3166
are changing all the time, which is not a surprise, given that countries are changing name,
boundaries, and organization with some regularity. Alas, ISO 3166 doesn’t just remove old codes:
they sometimes give them to a new country or region. So the tag that today means “Serbian for
Serbia and Montenegro” (sr-CS) would have meant “Serbian for Czechoslovakia” a couple of decades ago.




                                                                                                            11
And More Problems

            Lack of scripts
            Little support for registered values in software
            Reassignment of values by ISO 3166
            Lack of consistent tag formation (Chinese dialects?)
            Standards not readily available, bad references
            Bad implementation assumptions
             –   1*8alphanum *( "-" 1*8alphanum )
             –   2*3ALPHA [ "-" 2ALPHA ]
            Many registrations to cover small variations
             –   8 German registrations to cover two variations




There were a few other problems, which I’ve listed here…




                                                                   12
LTRU and “draft-registry”

             Defines a generative syntax
              –   machine readable
              –   future proof, extensible
             Defines a single source
              –   Stable subtags, no conflicts
              –   Machine readable
             Defines when to use subtags
              –   (sometimes)




So Mark and I started writing Internet-Drafts. Eventually, a Working Group was
formed at the IETF called the Language Tag Registry Update or LTRU working
group.
Out of this working group comes a new RFC, which is the new BCP 47. As I write
this the RFC has not yet been assigned a number, so it is called RFC 3066bis
informally. It changes language tags in a number of interesting ways, while
maintaining full compatibility with all existing tags.




                                                                                 13
RFC 3066bis and LTRU

                       sl-Latn-IT-rozaj-x-mine

                         sl       ISO 639-1/2 (alpha2/3)
                         Latn     ISO 15924 script codes (alpha 4)
                         IT       ISO 3166 (alpha2) or UN M49
                         rozaj    Registered variants (any number)
                         x-mine   Private Use and Extension




Here is an illustration of a new-style language tag.




                                                                                  14
More Examples

                 es-419 (Spanish for Americas)
                 en-US (English for USA)
                 de-CH-1996 (Old tags are all valid)
                 sl-rozaj-nedis (Multiple variants)
                 zh-t-wadegile (Extensions)




Here are some more examples of language tags showing some of the interesting
variations.
es-419 makes use of the UN M.49 region codes to describe a language for a
larger area than a country.
de-CH-1996 was registered in the old IANA Language Tag Registry. It is still a
valid tag.
sl-rozaj-nedis is probably not a good tag choice, but illustrates that you can have
more than a single variant in a well-formed tag. In this case, both –rozaj and
–nedis are dialects of Slovenian (sl), but –nedis doesn’t include sl-rozaj in its
registered list of prefixes, so this tag is probably meaningless.
zh-t-wadegile is a hypothetical tag: if there were an extension for transliterations
and if it were assigned the letter ‘t’, then one valid subtag might be ‘wadegile’.*




* Several well-informed people have cast doubt on the idea of a transliteration extension, not to mention the
“wadegile” example shown.




                                                                                                                15
Benefits

             Subtag registry in one place: one source.
             Subtags identified by length/content
             Extensible
             Compatible with RFC 3066 tags
             Stable: subtags are forever




There are several benefits to switching over to RFC 3066bis.
For the first time there is a single, authoritative source for subtags. It contains
date versioning information, as well as information on the formation of useful tags.
Instead of having to hunt through various versions of ISO 639, ISO 3166, ISO
15924, UN M.49 and the IANA registry, there is one source.
It is machine readable and the entries are dated. There is even a mechanism for
canonicalizing tags as they evolve.
Inside a language tag, the subtags can be identified by length and content.
Parsers do not have to have a copy of the registry to extract most of the
information in a tag.
There are several extension mechanisms. In particular, private use subtags can
be used in otherwise public tags.
The tags are all backwards compatible with RFC 3066. Any new tag would have
been valid to register under previous versions of BCP 47. And all of the old tags
are forwards compatible (although a few are only compatible via fiat).
Finally: tags and subtags are stable. Forever.
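
To illustrate what “identified by length and content” means, here is a rough sketch that labels each subtag without consulting the registry. The length rules are my reading of the syntax that was eventually published (four letters for a script, two letters or three digits for a region, five to eight characters or a digit-led four for a variant, a single character to open an extension or private-use run). The function name `classify_subtags` is hypothetical, and the sketch assumes a well-formed ASCII tag, ignores grandfathered tags and extended language subtags, and does not check subtag order.

```python
def classify_subtags(tag: str) -> list:
    """Label each subtag of a well-formed tag by its length and content."""
    parts = tag.split("-")
    labeled = [(parts[0], "language")]
    mode = None  # set once a single-character subtag has been seen

    for part in parts[1:]:
        if mode == "private use":
            # Everything after "x" belongs to the private-use run.
            labeled.append((part, mode))
        elif len(part) == 1:
            # A singleton opens a private-use ("x") or extension run.
            mode = "private use" if part.lower() == "x" else "extension"
            labeled.append((part, mode))
        elif mode == "extension":
            labeled.append((part, mode))
        elif len(part) == 4 and part.isalpha():
            labeled.append((part, "script"))
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            labeled.append((part, "region"))
        elif 5 <= len(part) <= 8 or (len(part) == 4 and part[0].isdigit()):
            labeled.append((part, "variant"))
        else:
            labeled.append((part, "unknown"))
    return labeled
```

Applied to “sl-Latn-IT-rozaj-x-mine”, the sketch labels the subtags as language, script, region, variant, and a two-subtag private-use run, in that order.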




                                                                                       16
Problems

             Matching
              –   Does “en-US” match “en-Latn-US”?
             Tag Choices
              –   Users have more to choose from.
             Implementations
              –   More to do, more to think about
              –   (easier to parse, process, support the good stuff)




The creation of the new format does create a few problems for users and
implementers, though.
In particular, there are now more choices for how to form the generative language
tags.
Matching of tags is a particular issue we’ll cover in more depth in a second.
Users have more choices available, so implementations and guidelines are going
to be necessary to help people decide what’s best for them.
Software implementations will have to do several things. Of course, they’ll have
to be modified to be either well-formed or validating processors. The good news
here is that the tag syntax is more deterministic and thus more amenable to
parsing. And there is a data source that can easily be incorporated into code. The
bad news is that some badly-written implementations are going to break and that
developers need to go back and evaluate their software.




                                                                                     17
Tag Matching

             Uses “Language Ranges” to select sets of
             content according to the language tag
             Four Schemes
              –   Basic Filtering
              –   Extended Filtering
              –   Scored Filtering
              –   Lookup




The remaining work of LTRU relates to matching and selecting content based on
language tags. This has some impact on implementations, which need to guide
users in selection of the most appropriate tags.
Tag matching depends on language ranges, which are identifiers that people use
to specify what they are looking for or wish to match. Ranges select sets of tags.
The current version of the Internet-Draft on matching describes four types of
matching in two categories (filtering and lookup).




                                                                                     18
Filtering

              Ranges specify the least specific item
               –   “en” matches “en”, “en-US”, “en-Brai”, “en-boont”
              Basic matching uses plain prefixes
              Extended matching can match “inside bits”
               –   “en-*-US”




Filtering is one type of matching. In filtering, the range specifies the least specific
item that constitutes a match. For example, if I specify a range of “de-CH”, all
content in the matching set must include the language “de” (German) and the
region “CH” (Switzerland) in its tags.
•“Basic filtering” is strict prefix matching. That is range “de-CH” matches tags “de-
CH” and “de-CH-1996” but not “de-Brai-CH”, “de”, or “de-Latn-CH-1996”
•In “extended filtering”, ranges can match missing elements. Thus “de-*-CH”
would match all of the foregoing examples except “de”.
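
Both kinds of filtering can be sketched briefly. Basic filtering is a case-insensitive prefix test on whole subtags. For extended filtering, the loop below follows the subtag-by-subtag procedure as I understand it from the matching document as later published; the draft discussed in this talk may differ in detail, and in the published procedure a range without the wildcard, such as “de-CH”, can also skip intervening subtags. The function names are hypothetical.

```python
def basic_filter(lang_range: str, tag: str) -> bool:
    """Prefix match: the tag equals the range or continues it after a hyphen."""
    r, t = lang_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def extended_filter(lang_range: str, tag: str) -> bool:
    """Match a range against a tag, letting the range skip intervening subtags."""
    r = lang_range.lower().split("-")
    t = tag.lower().split("-")

    # The first subtags must agree unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False

    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1              # wildcard: nothing to consume in the tag
        elif ti >= len(t):
            return False         # range wants more than the tag has
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1              # subtags agree: advance both
        elif len(t[ti]) == 1:
            return False         # never skip past a singleton
        else:
            ti += 1              # skip an intervening tag subtag
    return True
```

Running both functions over the five tags from the notes reproduces the results described: `basic_filter("de-CH", ...)` accepts only “de-CH” and “de-CH-1996”, while `extended_filter("de-*-CH", ...)` accepts everything except “de”.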




                                                                                          19
Scored Filtering

             Assigns a “weight” or “score” to each match
             Result set is ordered by match quality




             Postulated by John Cowan




Scored filtering, which was first postulated by John Cowan, assigns a weight or
score to each potential range-to-tag match. Unlike the other two forms of filtering,
scored filtering results in an ordered set of matching tags. This might be useful
with search results, for example.




                                                                                       20
Lookup

            Range specifies the most specific tag in a
            match.
             –   “en-US” matches “en” and “en-US” but not
                 “en-US-boont”
            Mirrors the locale fallback mechanism and
            many language negotiation schemes.




The other form of matching is called lookup. In lookup, the user specifies the
most specific tag that represents a match. The lookup algorithm is for use in
cases where the user wants exactly one item returned for each request. Loading
localized software resources is a typical example of this kind of matching.
(Demo of all matching types)
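
Lookup can be sketched as progressive truncation: try the range as given, then drop subtags from the right until something available matches. This is a minimal illustration, not the normative algorithm; the function name `lookup` and its `default` parameter are hypothetical, and dropping a dangling single-character subtag along with the subtag after it follows my understanding of the matching document as later published.

```python
def lookup(lang_range: str, available, default=None):
    """Return the single best available tag for a range, or the default."""
    # Index the available tags by lowercase form for case-insensitive matching.
    by_lower = {tag.lower(): tag for tag in available}
    parts = lang_range.lower().split("-")

    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        parts.pop()  # fall back to a less specific range
        # Do not leave a singleton (such as "x") dangling at the end.
        if parts and len(parts[-1]) == 1:
            parts.pop()
    return default
```

For example, a request for “en-US” falls back to “en” when only “en” and “fr” are available, and it never selects “en-US-boont”, which is more specific than the range.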




                                                                                  21
What Do I Do (Content Author)?

 Not much.
  –   Existing tags are all still valid: tagging is mostly
      unchanged.
  –   Resist temptation to (ab)use the private use
      subtags.
 Unless your language has script variations:
  –   Tag content with the appropriate script subtag(s)
         Script subtags only apply to a small number of
         languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small
         number of others.




                                                                     22
What Do I Do (Programmer)?

 Check code for compliance with 3066bis
 –   Decide on well-formed or validating
 –   Implement suppress-script
 –   Change to using the registry
 –   Bother infrastructure folks (Java, MS, Mozilla, etc)
     to implement the standard




                                                            23
What Do I Do (End-User)?

 Check and update your language ranges.
 Tag content wisely.




                                          24
LTRU Milestone Dates

 (Done) RFC 3066bis
 –   Registry went live in December 2005
 Produce “Matching” RFC
 –   Draft-04 available
 (Anticipated) Produce RFC 3066ter
 –   This includes ISO 639-3 support, extended
     language subtags, and possibly ISO 639-6




                                                 25
Things to Read

 Registry Draft
 http://www.inter-locale.com
 http://www.ietf.org/internet-drafts/draft-ietf-ltru-
   registry-12.txt
 Matching Draft
 http://www.inter-locale.com
 LTRU Mailing List
 https://www1.ietf.org/mailman/listinfo/ltru




                                                        26
Things to Do (languages)

 Get involved in LTRU
 Get involved in W3C I18N Core WG!
 Write implementations
 Work on adoption of 3066bis: understand the
 impact

 Then get involved with Locale identifiers…




                                               27
Back to Locales…

             IUC 20 Round Table
             Suzanne Topping’s
             Multilingual Article
             Tex Texin and the Locales
             list…




So we’ve done a deep dive into Language Tags, whereas my point of entry was
locale identifiers. What’s going on with locales?
Back at IUC20 (see, it pays to go to these events!) there was a round-table in
which there was a discussion of problems confronting the Web. Language tags
and locale identifiers were one of the key topics discussed at this round table,
apparently. I say “apparently”, because I left the conference before the round
table. I read about the results on the W3C website and in an article by Suzanne
Topping in Multilingual magazine. What I read there surprised and dismayed me.
A few weeks later, I found that others in the community were working on locales
or, rather, on rubbishing locales. Tex Texin started a list (now defunct) for
discussing the problem.
I got involved in thinking about the problem.




                                                                                  28
Locale Identifiers and Web Services




Fundamentally, my interest stemmed from the fact that I was working on Web services. Web
services are supposed to define a platform-agnostic way to expose logic or functionality in a
distributed fashion. By using XML and HTTP, it was hoped that Web services could provide a
standards oriented way to accomplish what CORBA or EAI vendors had been providing in a
proprietary fashion previously.
The problem I was grappling with was: “how do you internationalize a Web service?”
Web services have all the same requirements any distributed system has: they have messages,
data, text, and potentially cultural, regional, or other issues in them. In our programming
environments we have a ready solution for addressing these problems. These often hinge on the
locale. And the locale hinges on the user’s preferences in the matter.
We have standard language identifiers. We don’t have standard locale anything. What to do?
There were (and are) three schools of thought.
On the one hand are the identifier folks (such as myself) who think that if we had a general locale-
and/or-international-preferences-ID-mechanism, each vendor would implement it in a manner
consistent with their existing language/platform and everything would work pretty well.
On the other hand are the locale definition folks (such as Mark Davis) who think that if we all
agreed to use the same locale data and locale data structures, then we could exchange identifiers
and get the same results because everything is the same.
On the left foot are the folks who think locales are just a bad idea and ought to be placed in the
nearest landfill or entombed in concrete, Chernobyl-style.




                                                                                                       29
W3C and Unicode

             W3C
              –   Identifiers and cross-over with language tags
              –   Web services
              –   XML, HTML
             Unicode Consortium
              –   LDML
              –   CLDR
              –   Standards for content




Two standards organizations that are working in the area of locales and locale
identifiers are the W3C (Internationalization Core Working Group) and the
Unicode Consortium (the Common Locale Data Repository project).
The W3C is, of course, directly concerned with the use and implementation of
language tags in document formats and technologies. In addition, the need for
locale identification for Web services is a specific work item for the I18n working
group.
The Unicode folks are working to build a standardized, comprehensive set of
locale data.




                                                                                      30
“Language Tags and Locale
          Identifiers” SPEC

             First Working Draft coming soon
              –   URIs?
              –   Simple tags?




The W3C is currently working on a pair of specifications (W3C-ese for “standards
track documents”). The first is called “Language Tags and Locale Identifiers”,
which, as its name says, has to do with actually creating locale identifiers, as
well as providing implementation guidelines for RFC 3066bis and draft-matching.
There are questions about how a locale identifier should be structured. Several
ideas are currently floating around. For example, URIs might be used. Or 3066bis
tags might be “extended” in some way.




                                                                                    31
WS-I18N SPEC

             First Working Draft now available:
              –   http://www.w3.org/TR/ws-i18n




The second spec that the W3C is working on is the WS-I18N spec, or “Web
Services Internationalization”. This spec relies on the preceding document for
locale identifiers and describes how to use locales with Web services
technologies. Previous work by the W3C I18N WG in this area includes
requirements and usage scenarios.




                                                                                 32
Ideas?




         33

02 c a306-phillips_langtags

  • 1. Language Tags and Locale Identifiers A Status Report 1
  • 2. Presenter and Agenda Addison Phillips Internationalization Architect, Yahoo! Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 3066bis, draft-matching) Language tags Locale identifiers Addison Phillips is the co-editor to the recent Language Tag registry RFC and its associated matching draft. This presentation details the history of language tags and locale identifiers on the Internet, with a focus on the recent changes and updates to RFC 3066 and efforts to create standardized locales and locale identifiers for the Internet. 2
  • 3. Languages? Locales? What’s a language tag? What the #@&%$ is a locale? Why do identifiers matter? If the Internet is anything, it is a means of communication. While there are many forms of communication, language and textual information in particular loom large in computer systems. The identification of human “natural language”, as a result, is important, since users expect their computer systems to interact with textual data in useful ways (be it searching, relaying, checking, formatting, or otherwise processing it). Alas, defining what a language is and what constitutes the difference between various forms of language is a complex problem. And, for computer systems, there is another kind of beast: the “locale”, which is even more difficult to grasp. What are these things? How do we identify them? Why do language and locale identifiers matter? 3
  • 4. Language Tags Enable presentation, selection, and negotiation of content Defined by BCP 47 – Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl………. – Well understood (?) Natural language and especially written (that is, textual) information are a key and fundamental part of most computer systems. When computer systems were mostly isolated and not interconnected, they mostly dealt with a single language at a time and could be tuned to deal with the particular idiosyncrasies of that language. But the Internet (and other networking technologies) have changed that. Now textual data may be stored, processed, or viewed in many different contexts and many different languages simultaneously. And increasingly the boundaries between “computer” and the world at large is becoming blurred: your “computer” today might equally be your TV, your telephone, your game player, your music player, your PDA, or your automobile. The digital content delivered to your “computer” is more important than the form factor the computer itself takes. As text, speech, and other content associated with language become pervasive and networked together, the selection, identification, and correct processing of the language become critical. Most people seem to believe that they have a relatively good grasp of languages and, thus, of language identification. If you ask your mother-in-law what language the folks in Germany or France speak, for example, she probably will have a ready answer. But the more one delves into languages and language identification, the more complex the problem seems to become. The standard for language identification on the Internet is something called “BCP 47”. It is widely used: the list above is a small fraction of the formats and technologies that implement it. What, never heard of “BCP 47”? BCP 47 is the official designation for the language tagging specification of the IETF. BCP stands for “best current practice”. 
The most recent document to be BCP 47 is (or, by the time you read this, “was”) RFC 3066, which was preceded by RFC 1766. You’re probably more familiar with the RFC numbers than the BCP number. 4
  • 5. Locale Identifiers Different ideas: – Accept-Locale vs. Accept-Language – URIs/URNs, etc. – CLDR/LDML And Requirements: – Operating environments and harmonization – App Servers – Web Services New Solution? Cost of Adoption: – UTF-8 to the browser: 8 long years Locale identifiers, by contrast, are somewhat more difficult to grasp. Your mother- in-law (unless she’s a software engineer) probably has no idea what a locale is. One definition of a locale is: “a data structure or concept used by programmers to identify a particular  collection of cultural, regional, or linguistic preferences.” Locales are tied to specific programming languages or operating environments. What they do and how they are identified are unique and usually proprietary. There is a relationship of sorts between language and locale: most locale identifiers include a language identifier. So if locale identifiers need to be exchanged on the Internet, as in Web services or between different application servers, how would these identifiers be defined? There are different ideas for how this might happen. One question is cost of adoption: new headers, identifiers, or data structures might take a long time to reach “critical mass” and become useful, while adaptation or cooption of existing structures might introduce problems for existing applications. 5
  • 6. In the Beginning Received Wisdom from the Dark Ages Locales: – japanese, french, german, C – ENU, FRA, JPN – ja_JP.PCK – AMERICAN_AMERICA.WE8ISO8859P1 Languages… … looked a lot like locales (and vice versa) In the beginning, there was very little difference between language and locale in computer systems. Locale identifiers (some historical examples are shown above) usually included some kind of language identification. When the Internet became accessible to mere mortals in the early 1990’s, language identification became an immediate concern. The Internet made content easy to exchange across boundaries and borders in ways that closed networks like CompuServe never could master. Identifying languages was necessary for applications such as email and http, so Harald Alvestrand worked to create the first version of BCP 47, which was known as RFC 1766 to address the problem. These language tags became widely adopted, as we’ve noted. Locale identifiers were not created for the Internet, though, because of a lack of distributed applications. “Now, hold on!” you might say. “I’ve used distributed applications for years now: I’ve got client- server and I’ve bought books from Amazon or stocks from my broker or airline tickets on-line. What do you mean ‘there’s a lack of distributed applications’?!?” It is true that there are client-server architectures and Web applications are now quite commonplace. However, these are not truly distributed applications. In a Web application, for example, there is a host where all the logic is stored. This host and its associated programming language or operating environment completely encapsulates the overall locale model. Client- server architectures are similar: the client and server each have specific technology choices associated with them and the business logic lives in one or the other (and usually in the server). 
Truly distributed applications are the province of integration (EAI, B2B), Web services, and the idea of Service Oriented Architectures (SOA). You only need a shared concept of locale when your logic is being hosted in discrete chunks on multiple systems and when you cannot count on the systems using the same technology! Web apps are usually hosted in a single container or are written by people who have chosen a particular technology. The locale model associated with that technology becomes the locale model of the Web application. The whole point of Web services, by contrast, is to hide this technology decision. 6
  • 7. Locales and Language Tags meet Conversations in Prague… – Language tags are being locale identifiers anyway… – Not going to need a big new thing… – Just a few things to fix… … we can do this really fast In 2002, Mark Davis and I attended the Internationalization and Unicode Conference in Prague (so you can see that it pays to attend these events!), where I had a paper about locale identifiers. The basic problem was that language tags were widely distributed, and, since they looked an awful lot like POSIX locale identifiers, most Web application platforms were actually using them as locale identifiers already by mapping language tags to their local equivalent. Mark was working on the CLDR project and was concerned about problems involving script identification (especially for compatibility with Microsoft’s .NET Culture identifiers). It seemed that a few small fixes to BCP 47 (to allow some script subtags) and some documentation (“how to get a locale out of a language tag”) might solve several problems all at once. 7
  • 8. BCP 47 Basic Structure Alphanumeric (ASCII only) subtags Up to eight characters long Separated by hyphens Case not important (i.e. zh = ZH = zH = Zh) 1*8alphanum * [ “-” 1*8 alphanum ] The basic structure of language tags has been remarkably stable. Language tags are ASCII strings consisting of subtags separated by hyphens (and not underscores). The subtags may consist of either (ASCII) letters or digits. There exist suggested capitalization rules for some of the underlying standards used by language tags, but these do not apply to language tags and have no meaning in a language tag context. Language tags are case insensitive. At the bottom of the slide is the original “ABNF” which describes the language tag grammar. 8
  • 9. RFC 1766 zh-TW ISO 639-1 (alpha2) ISO 3166 (alpha2) i-klingon Registered value RFC 1766 defined language tags in two distinct ways. All language tags took the form of a sequence of subtags composed of the ASCII letters and digits separated by the hyphen character. The subtags could be, at most, eight characters long. RFC 1766 said that: •If the first subtag consisted of two letters, it was a language code from the ISO 639-1 standard. •If there is a second subtag (additional subtags are optional) and it consisted of two letters, it was a region code from the ISO 3166 standard. Otherwise, the interpretation of the tag (and its subtags) was defined by a registry maintained by IANA. If users needed a specific language tag, they could write to a mailing list (ietf-languages@iana.org) and request a registration be created. Here is one such tag, for the Klingon language. 9
  • 10. RFC 3066 sco-GBISO 639-2 (alpha 3 codes) Bu tu se … eng-GB X alpha 2 codes when they exist RFC 3066 expanded on RFC 1766, making a few minor additions and cleaning up a few problems that arose. The main change was the addition of ISO 639-2 codes for languages. The ISO 639-1 codes are two-letters long and there are, necessarily, a limited number of these (about 650 total, given that some letters are reserved). Since there are at least several thousand languages that exist in modern times, this isn’t sufficient to encode the world’s languages. ISO 639-2 assigns three-letter codes, which allows for many more potential codes. This allows all of the languages to be represented by one code or another. RFC 3066 also mandated that if an ISO 639-1 code exists for a language, then that code must be used (and not the ISO 639-2 code). This prevents languages from being encoded using different tags. Thus the tag “eng-UK” is not legal, even though “eng” is a valid ISO 639-2 code: tags must use the “en” code for English. The IANA language tag registry remained the same as during the RFC 1766 era: a collection of isolated registrations. (‘sco’ is the code for ‘Scots’) 10
  • 11. Problems Script Variation: – zh-Hant/zh-Hans – (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.) Obsolence of registrations: – art-lojban (now jbo), i-klingon (now tlh) Instability in underlying standards: – sr-CS (CS used to be Czechoslovakia… A variety of problems were associated with language tags, despite their success. The one Mark and I were primarily interested in was the problem of script variation. Most languages are customarily written in a single script. They may be transcribed in another script, but most native speakers and most content in that language use a single script. A few languages are written equally—or at least “commonly”—in more than one script. Some of the languages are undergoing transitions (Cyrillic script was imposed on several languages during the Soviet era, for example), while others are just naturally written in more than one script. For example, Serbian can be written in either Cyrillic or Latin script. Both traditions are historical to the language, not artificially imposed. The most notable example of script variation is in Chinese, where the traditional form of the script is used in some Chinese speaking regions (Taiwan, Hong Kong) while the simplified form of the script is used in others (the PRC, Singapore). These variations do not follow spoken variation in the language (Hong Kong, for example, speaks Cantonese while Taiwan speaks Mandarin)… which leads to vocabulary and other variations with the writing systems in question. And identifying “Traditional Chinese” using a region has other cultural sensitivity problems… Another problem was the relative ease of registration for language tags compared to the action of the various ISO maintenance and registration bodies. Many of the registered tags were later deprecated due to standards action. A last problem I’ll mention here was instability in ISO 3166 (the region codes). 
Codes in ISO 3166 are changing all the time, which is not a surprise, given that countries are changing name, boundaries, and organization with some regularity. Alas, ISO 3166 doesn’t just remove old codes: they sometimes give them to a new country or region. So the language code today for “Serbian for Serbia and Montenegro” would have been “Serbian for Czechoslovakia” a couple decades ago. 11
  • 12. And More Problems Lack of scripts Little support for registered values in software Reassignment of values by ISO 3166 Lack of consistent tag formation (Chinese dialects?) Standards not readily available, bad references Bad implementation assumptions – 1*8 alphanum *[ “-” 1*8 alphanum] – 2*3 ALPHA [ “-” 2ALPHA ] Many registrations to cover small variations – 8 German registrations to cover two variations There were a few other problems, which I’ve listed here… 12
  • 13. LTRU and “draft-registry” Defines a generative syntax – machine readable – future proof, extensible Defines a single source – Stable subtags, no conflicts – Machine readable Defines when to use subtags – (sometimes) So Mark and I started writing Internet-Drafts. Eventually, a Working Group was formed at the IETF called the Language Tag Registry Update or LTRU working group. Out of this working group comes a new RFC, which is the new BCP 47. As I write this the RFC has not yet been assigned a number, so it is called RFC 3066bis informally. It changes language tags in a number of interesting ways, while maintaining full compatibility with all existing tags. 13
  • 14. 14 sl-Latn-IT-rozaj-x-mine Private Use and Extension RFC 3066bis and LTRU Here is an illustration of a new-style language tag. Registered variants (any number) ISO 3166 (alpha2) or UN M49 ISO 15924 script codes (alpha 4) ISO 639-1/2 (alpha2/3)
  • 15. More Examples es-419 (Spanish for Americas) en-US (English for USA) de-CH-1996 (Old tags are all valid) sl-rozaj-nedis (Multiple variants) zh-t-wadegile (Extensions) Here are some more examples of language tags showing some of the interesting variations. es-419 makes use of the UN M.49 region codes to describe a language for a larger area than a country. de-CH-1996 was registered in the old IANA Language Tag Registry. It is still a valid tag. sl-rozaj-nedis is probably not a good tag choice, but illustrates that you can have more than a single variant in a well-formed tag. In this case, both –rozaj and – nedis are dialects of Slovenian (sl), but –nedis doesn’t include sl-rozaj in its registered list of prefixes, so this tag is probably meaningless. zh-t-wadegile is a hypothetical tag: if there were an extension for transliterations and it if it were assigned the letter ‘t’, than one valid subtag might be ‘wadegile’.* * Several well-informed people have cast doubt on the idea of a transliteration extension, not to mention the “wadegile” example shown. 15
  • 16. Benefits Subtag registry in one place: one source. Subtags identified by length/content Extensible Compatible with RFC 3066 tags Stable: subtags are forever There are several benefits to switching over to RFC 3066bis. For the first time there is a single, authoritative source for subtags. It contains date versioning information, as well as information on the formation of useful tags. Instead of having to hunt through various versions of ISO 639, ISO 3166, ISO 15924, UN M.49 and the IANA registry, there is one source. It is machine readable and the entries are dated. There is even a mechanism for canonicalizing tags as they evolve. Inside a language tag, the subtags can be identified by length and content. Parsers do not have to have a copy of the registry to extract most of the information in a tag. There are several extension mechanisms. In particular, private use subtags can be used in otherwise public tags. The tags are all backwards compatible with RFC 3066. Any new tag would have been valid to register under pervious versions of BCP 47. And all of the old tags are forwards compatible (although a few are only compatible via fiat). Finally: tags and subtags are stable. Forever. 16
  • 17. Problems Matching – Does “en-US” match “en-Latn-US”? Tag Choices – Users have more to choose from. Implementations – More to do, more to think about – (easier to parse, process, support the good stuff) The creation of the new format does create a few problems for users and implementers, though. In particular, there are now more choices for how to form the generative language tags. Matching of tags is a particular issue we’ll cover in more depth in a second. Users have more choices available, so implementations and guidelines are going to be necessary to help people decide what’s best for them. Software implementations will have to do several things. Of course, they’ll have to be modified to be either well-formed or validating processors. The good news here is that the tag syntax is more deterministic and thus more amenable to parsing. And there is a data source that can easily be incorporated into code. The bad news is that some badly-written implementations are going to break and that developers need to go back and evaluate their software. 17
  • 18. Tag Matching Uses “Language Ranges” to select sets of content according to the language tag Four Schemes – Basic Filtering – Extended Filtering – Scored Filtering – Lookup The remaining work of LTRU relates to matching and selecting content based on language tags. This has some impact on implementations, which need to guide users in selection of the most appropriate tags. Tag matching depends on language ranges, which are identifiers that people use to specify what they are looking for or wish to match. Ranges select sets of tags. The current version of the Internet-Draft on matching describes four types of matching in two categories (filtering and lookup). 18
  • 19. Filtering Ranges specify the least specific item – “en” matches “en”, “en-US”, “en-Brai”, “en-boont” Basic matching uses plain prefixes Extended matching can match “inside bits” – “en-*-US” Filtering is one type of matching. In filtering, the range specifies the least specific item that constitutes a match. For example, if I specify a range of “de-CH”, all content in the matching set must include the language “de” (German) and the region “CH” (Switzerland) in its tags. • “Basic filtering” is strict prefix matching. That is, range “de-CH” matches tags “de-CH” and “de-CH-1996” but not “de-Brai-CH”, “de”, or “de-Latn-CH-1996”. • In “extended filtering”, ranges can match missing elements. Thus “de-*-CH” would match all of the foregoing examples except “de”. 19
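The two filtering schemes can be sketched in a few lines of Python. This is only one reading of the behaviour described on the slide, in which a "*" in a range stands in for zero or more subtags; the matching draft's actual algorithm may differ in detail. The function names `basic_filter_match` and `extended_filter_match` are hypothetical.

```python
def basic_filter_match(language_range, tag):
    """Basic filtering: the range must equal the tag or be a subtag-wise prefix of it."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def _match_parts(range_parts, tag_parts):
    # Range exhausted: any remaining tag subtags are allowed (prefix semantics).
    if not range_parts:
        return True
    head, rest = range_parts[0], range_parts[1:]
    if head == "*":
        # Assumption: the wildcard may absorb zero or more tag subtags.
        return any(_match_parts(rest, tag_parts[i:])
                   for i in range(len(tag_parts) + 1))
    if not tag_parts or tag_parts[0] != head:
        return False
    return _match_parts(rest, tag_parts[1:])


def extended_filter_match(language_range, tag):
    """Extended filtering: like basic filtering, but '*' can match "inside bits"."""
    return _match_parts(language_range.lower().split("-"),
                        tag.lower().split("-"))


tags = ["de-CH", "de-CH-1996", "de-Brai-CH", "de", "de-Latn-CH-1996"]
print([t for t in tags if basic_filter_match("de-CH", t)])
print([t for t in tags if extended_filter_match("de-*-CH", t)])
```

Comparison is done on lowercased strings because language tags are case-insensitive; the original casing of each tag is preserved in the result lists.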
  • 20. Scored Filtering Assigns a “weight” or “score” to each match Result set is ordered by match quality Postulated by John Cowan Scored filtering, which was first postulated by John Cowan, assigns a weight or score to each potential range-to-tag match. Unlike the other two forms of filtering, scored filtering results in an ordered set of matching tags. This might be useful with search results, for example. 20
  • 21. Lookup Range specifies the most specific tag in a match. – “en-US” matches “en” and “en-US” but not “en-US-boont” Mirrors the locale fallback mechanism and many language negotiation schemes. The other form of matching is called lookup. In lookup, the user specifies the most specific tag that represents a match. The lookup algorithm is for use in cases where the user wants exactly one item returned for each request. Loading software resources is one example of this kind of matching. (Demo of all matching types) 21
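Lookup can be sketched as progressive truncation of the range, much like locale fallback. The Python below is a simplified illustration, not the draft's normative algorithm: the function name `lookup` and the `default` parameter are assumptions, and special handling of single-character subtags during truncation is left out.

```python
def lookup(language_range, available_tags, default=None):
    """Return the single best available tag for a range, or `default` if none fits."""
    # Compare case-insensitively but return the tag as it was supplied.
    by_lower = {tag.lower(): tag for tag in available_tags}
    parts = language_range.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        # No exact hit: drop the last subtag and try the less specific range.
        parts.pop()
    return default


print(lookup("en-US", ["en", "en-US-boont"]))
print(lookup("fr-CA", ["en", "de"], default="en"))
```

Note the contrast with filtering: here "en-US-boont" is never returned for the range "en-US", because the range is the most specific thing the user will accept.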
  • 22. What Do I Do (Content Author)? Not much. – Existing tags are all still valid: tagging is mostly unchanged. – Resist temptation to (ab)use the private use subtags. Unless your language has script variations: – Tag content with the appropriate script subtag(s) Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others. 22
  • 23. What Do I Do (Programmer)? Check code for compliance with 3066bis – Decide on well-formed or validating – Implement suppress-script – Change to using the registry – Bother infrastructure folks (Java, MS, Mozilla, etc) to implement the standard 23
  • 24. What Do I Do (End-User)? Check and update your language ranges. Tag content wisely. 24
  • 25. LTRU Milestone Dates (Done) RFC 3066bis – Registry went live in December 2005 Produce “Matching” RFC – Draft-04 available (Anticipated) Produce RFC 3066ter – This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6 25
  • 26. Things to Read Registry Draft http://www.inter-locale.com http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-12.txt Matching Draft http://www.inter-locale.com LTRU Mailing List https://www1.ietf.org/mailman/listinfo/ltru 26
  • 27. Things to Do (languages) Get involved in LTRU Get involved in W3C I18N Core WG! Write implementations Work on adoption of 3066bis: understand the impact Then get involved with Locale identifiers… 27
  • 28. Back to Locales… IUC 20 Round Table Suzanne Topping’s Multilingual Article Tex Texin and the Locales list… So we’ve done a deep dive into Language Tags, whereas my point of entry was locale identifiers. What’s going on with locales? Back at IUC20 (see, it pays to go to these events!) there was a round-table in which there was a discussion of problems confronting the Web. Language tags and locale identifiers were one of the key topics discussed at this round table, apparently. I say “apparently”, because I left the conference before the round table. I read about the results on the W3C website and in an article by Suzanne Topping in Multilingual magazine. What I read there surprised and dismayed me. A few weeks later, I found that others in the community were working on locales or, rather, on rubbishing locales. Tex Texin started a list (now defunct) for discussing the problem. I got involved in thinking about the problem. 28
  • 29. Locale Identifiers and Web Services Fundamentally, my interest stemmed from the fact that I was working on Web services. Web services are supposed to define a platform-agnostic way to expose logic or functionality in a distributed fashion. By using XML and HTTP, it was hoped that Web services could provide a standards oriented way to accomplish what CORBA or EAI vendors had been providing in a proprietary fashion previously. The problem I was grappling with was: “how do you internationalize a Web service?” Web services have all the same requirements any distributed system has: they have messages, data, text, and potentially cultural, regional, or other issues in them. In our programming environments we have a ready solution for addressing these problems. These often hinge on the locale. And the locale hinges on the user’s preferences in the matter. We have standard language identifiers. We don’t have standard locale anything. What to do? There were (and are) three schools of thought. On the one hand are the identifier folks (such as myself) who think that if we had a general locale- and/or-international-preferences-ID-mechanism, each vendor would implement it in a manner consistent with their existing language/platform and everything would work pretty well. On the other hand are the locale definition folks (such as Mark Davis) who think that if we all agreed to use the same locale data and locale data structures, then we could exchange identifiers and get the same results because everything is the same. On the left foot are the folks who think locales are just a bad idea and ought to be placed in the nearest landfill or entombed in concrete, Chernobyl-style. 29
  • 30. W3C and Unicode W3C – Identifiers and cross-over with language tags – Web services – XML, HTML Unicode Consortium – LDML – CLDR – Standards for content Two standards organizations that are working in the area of locales and locale identifiers are the W3C (Internationalization Core Working Group) and the Unicode Consortium (the Common Locale Data Repository project). The W3C is, of course, directly concerned with the use and implementation of language tags in document formats and technologies. In addition, the need for locale identification for Web services is a specific work item for the I18n working group. The Unicode folks are working to build a standardized, comprehensive set of locale data. 30
  • 31. “Language Tags and Locale Identifiers” SPEC First Working Draft coming soon – URIs? – Simple tags? The W3C is currently working on a pair of specifications (W3C-ese for “standards track documents”). The first is called “Language Tags and Locale Identifiers”, which, as its name says, has to do with actually creating locale identifiers, as well as providing implementation guidelines for RFC 3066bis and draft-matching. There are questions about how a locale identifier should be structured. Several ideas are currently floating around. For example, URIs might be used. Or 3066bis tags might be “extended” in some way. 31
  • 32. WS-I18N SPEC First Working Draft now available: – http://www.w3.org/TR/ws-i18n The second spec that the W3C is working on is the WS-I18N spec, or “Web Services Internationalization”. This spec relies on the preceding document for locale identifiers and describes how to use locales with Web services technologies. Previous work by the W3C I18N WG in this area includes requirements and usage scenarios. 32
  • 33. Ideas? 33