Building Blocks for Accessing
Multilingual Data: CLDR
Steven R. Loomis, IBM GFTT
Access available handouts at ala.15.ala.org/sessions/handouts.
About Me
• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC
Agenda
• About CLDR
• Focus Areas:
• Language Identification
• Transliteration
• Searching and Sorting
• Keyboards/Entry
• Q&A
What is CLDR?
• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), XML/JSON format
• Community input, carefully curated
Who is CLDR?
• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium
• Active participation by industry, academia, open source projects, national standards bodies, and individuals
Who uses CLDR?
• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via ICU C/C++/Java library
Locale Data
• Data required for respecting the linguistic, cultural, and geopolitical requirements of specific users
• Example: "What day is it?"
XML / JSON
• XML: “es-US”
• <month type="6">Junio</month>
• JSON: “es-US”
• { … "6": "Junio", … }
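The same month entry can be read from either form with the Python standard library. This is a minimal sketch: the wrapper element `months` and the JSON key `months` are assumptions made for illustration, since the slide elides the surrounding structure and real CLDR files nest these entries more deeply.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, heavily trimmed fragments. Only the <month> element and the
# "6": "Junio" pair come from the slide; the wrappers are assumptions.
XML_FRAGMENT = '<months><month type="6">Junio</month></months>'
JSON_FRAGMENT = '{"months": {"6": "Junio"}}'


def month_from_xml(fragment: str, number: int) -> Optional[str]:
    """Return the text of the <month> element whose type matches number."""
    root = ET.fromstring(fragment)
    for month in root.iter("month"):
        if month.get("type") == str(number):
            return month.text
    return None


def month_from_json(fragment: str, number: int) -> Optional[str]:
    """Return the month name stored under the key for number."""
    return json.loads(fragment)["months"].get(str(number))


print(month_from_xml(XML_FRAGMENT, 6))
print(month_from_json(JSON_FRAGMENT, 6))
```

Both lookups key on the month number as a string, which mirrors how the slide's XML attribute and JSON key are written.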
CLDR Coverage
• Coverage vs. number of languages
CLDR site and SurveyTool (DEMO)
• DEMO:
• http://unicode.org/cldr
• http://st.unicode.org/cldr-apps
Locale Identifiers — BCP47
• Example: sr-Latn-RS
• sr : ISO 639 "Serbian"
• Latn : ISO 15924 "Latin Script" (vs Cyrillic)
• RS : ISO 3166 / UN M.49 "Serbia"
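The breakdown of sr-Latn-RS can be sketched as a small parser. This is a simplified illustration, not a full BCP 47 implementation: the names `LocaleId` and `parse_locale` are hypothetical, and variants, extensions, and private-use subtags are deliberately rejected rather than handled.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_locale(tag: str) -> LocaleId:
    """Split a simple language[-Script][-REGION] tag into its parts."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    script = None
    region = None
    for subtag in subtags[1:]:
        is_script = len(subtag) == 4 and subtag.isalpha()
        is_region = (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        )
        if is_script and script is None and region is None:
            script = subtag.title()   # four letters: ISO 15924 script code
        elif is_region and region is None:
            region = subtag.upper()   # two letters (ISO 3166) or three digits (UN M.49)
        else:
            raise ValueError(f"unsupported subtag: {subtag!r}")
    return LocaleId(language, script, region)


print(parse_locale("sr-Latn-RS"))
```

The script subtag is optional, so a tag such as sr-RS parses with `script` left as `None`.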
Language/Territory/Script info
Facts:
• “The Cyrillic Script can be used to write Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino, Switzerland…”
Language Identification: Exemplars
• English (Latin): a b c d e f g h i j k l m n o p q r s t u v w x y z
• Serbian (Latin): a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž
• Serbian (Cyrillic): а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш
• Russian (Cyrillic): а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
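One way to use exemplar sets for language identification is to ask which sets cover every letter in a piece of text. The sketch below uses only the four sets listed on this slide; the function name `candidate_languages` is hypothetical, and digraphs such as "lj" are treated as their individual letters, which is a simplification.

```python
import unicodedata
from typing import List

# Exemplar sets copied from the slide. Digraphs are broken into single
# letters because this check works one character at a time.
EXEMPLARS = {
    "English (Latin)": "abcdefghijklmnopqrstuvwxyz",
    "Serbian (Latin)": "abcćčdđdžefghijklljmnnjoprsštuvzž",
    "Serbian (Cyrillic)": "абвгдђежзијклљмнњопрстћуфхцчџш",
    "Russian (Cyrillic)": "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
}


def candidate_languages(text: str) -> List[str]:
    """Return every language whose exemplar set covers all letters in text."""
    normalized = unicodedata.normalize("NFC", text.lower())
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return [
        name
        for name, exemplar in EXEMPLARS.items()
        if letters <= set(unicodedata.normalize("NFC", exemplar))
    ]


print(candidate_languages("queue"))
print(candidate_languages("map"))
```

Short strings can match more than one language (the letters of "map" appear in both Latin sets), so exemplars narrow the candidates rather than decide on their own.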
Transliteration
• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.
Transliteration Rule Example: Greek
• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
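To illustrate how bidirectional rules like these could drive a rule-based engine, here is a toy interpreter for the three rules above. It is an assumption-laden sketch, not the ICU engine: it handles only single-character rules of the form "source ↔ target ;", and the names `compile_rules` and `transliterate` are hypothetical.

```python
# The three rules from the slide, written with escapes so the Greek
# letters cannot be confused with their Latin look-alikes.
RULES = [
    "\u03a3 ↔ S ;",  # Greek capital sigma <-> Latin S
    "\u03c4 ↔ t ;",  # Greek small tau     <-> Latin t
    "\u03a4 ↔ T ;",  # Greek capital tau   <-> Latin T
]


def compile_rules(rules):
    """Build forward and reverse lookup tables from 'source ↔ target ;' rules."""
    forward, reverse = {}, {}
    for rule in rules:
        source, target = (part.strip() for part in rule.rstrip("; ").split("↔"))
        forward[source] = target
        reverse[target] = source
    return forward, reverse


def transliterate(text, table):
    """Replace each character found in table; pass everything else through."""
    return "".join(table.get(ch, ch) for ch in text)


forward, reverse = compile_rules(RULES)
print(transliterate("\u03a4\u03c4", forward))  # Greek to Latin
print(transliterate("St", reverse))            # Latin back to Greek
```

Because each rule is bidirectional, one rule set yields both the forward and the reverse table. Characters without a rule pass through unchanged.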
Demo: ICU transliterator demo
• http://demo.icu-project.org/icu-bin/translit
Searching and Sorting
• The Unicode Collation Algorithm (UCA) provides the base ordering
• CLDR “tailors” it per language: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages (comparison levels) and options:
• blackbird vs black-bird vs BlackBird
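The idea of comparing at a loose first level can be sketched with a toy key function. This is not the UCA, which works from weight tables rather than string edits. The name `primary_key` is hypothetical, `GERMAN_EXPANSIONS` is a hand-written stand-in for a tailoring that would make the slide's German example hold, and dropping punctuation corresponds to choosing an option that ignores it.

```python
import unicodedata

# Hypothetical stand-in for a language tailoring: expand umlauts so that
# "Müller" compares equal to "Mueller", as in the slide's German example.
GERMAN_EXPANSIONS = {
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
}


def primary_key(text, expansions=None):
    """Toy first-level key: ignores case, accents, and punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    for source, target in (expansions or {}).items():
        text = text.replace(source, target)
    # Decompose so accents become separate combining marks, then keep
    # only letters and digits (marks, hyphens, and spaces are dropped).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch.isalnum())


words = ["blackbird", "black-bird", "BlackBird"]
print({primary_key(word) for word in words})

names = ["Mueller", "M\u00fcller", "MUELLER"]
print({primary_key(name, GERMAN_EXPANSIONS) for name in names})
print({primary_key(name) for name in names})
```

With the expansions applied, the three spellings of the German name share one key. Without them, the umlaut is merely stripped and "Müller" no longer matches "Mueller", which is why the tailoring matters. Distinguishing blackbird from BlackBird would be the job of a later, stricter level that this sketch does not model.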
Demo: Collator
• http://demo.icu-project.org/icu-bin/collation.html
Keyboards / Entry
• Standardized identifier for keyboard tables
• Allows comparison between keyboard providers
Demo: MARC processor
CLDR data:
• Script: Armn (Armenian)
• Exemplar text matches hy “Armenian”
• Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
• Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus
Uses: CLDR, ICU4J, MARC4J
Thank You / Q&A
• srloomis@us.ibm.com
• @srl295 ( Twitter, GitHub, Freenode )
• ibm.biz/srloomis