Processing multi-lingual business data

Multi-lingual data processing
The CIS and Georgia
Olga Rink, director general

Content
Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
• Business environment
• Main stages of processing multi-lingual business data
o Naming convention
o Transliteration
o Matching
• Seeding and verifying objects in media coverage

Official languages, population (mn) and Russian as a second language (est.)

Multi-lingual environment
Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population, est.
Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100%
Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%
Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%
Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100%
Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%
Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100%
Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100%
Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90%
Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%
Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%
Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100%
Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%
• The Constitution of Dagestan defines "Russian and the languages of
the peoples of Dagestan" as the state languages
• The bulk of newly registered businesses is available in Cyrillic or Latin script

• For Slavic languages we use the ISO 9:1995 standard with one exception: a combination of Latin characters replaces each Latin character with a diacritic. Example: Ч becomes Ch (without diacritic) instead of Č (with diacritic)
• ISO 9985 is used for Armenian
• ISO 9984 is used for Georgian
• Edge case: ООО «Ъ» (trade style: OOO TVERDY ZNAK). Its transliterated name is OOO “”, so the company cannot be found by its original name
• Minor transliteration changes, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates
• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» short company name from the charter is used as the primary name; the indication of the legal form (required by law in the name) is placed at the end, after a comma
• The second name is the transliterated full legal name
• The trade style contains the official name in English/Latin or trade marks
• We use rule-based and machine learning approaches in areas including collecting data, identifying objects, developing credit scoring, and digesting media coverage
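The transliteration exception and the primary-name rule above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Interfax implementation: the character table is deliberately partial, only the Ч to Ch mapping comes from the slide, and the names `TRANSLIT`, `LEGAL_FORMS`, `transliterate` and `primary_name`, as well as the sample company name, are hypothetical.

```python
import re

# Partial Cyrillic-to-Latin table (illustration only).
# "Ч" -> "Ch" is the digraph exception named on the slide;
# the single-letter mappings are the usual ones for these letters.
TRANSLIT = {
    "А": "A", "Д": "D", "Е": "E", "К": "K", "Н": "N",
    "О": "O", "Р": "R", "С": "S", "Т": "T",
    "Ч": "Ch",  # digraph instead of the diacritic form "Č"
}

# Hypothetical table of legal forms and their Latin renderings.
LEGAL_FORMS = {"ООО": "OOO"}


def transliterate(text: str) -> str:
    """Transliterate letter by letter; unknown characters pass through."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.upper())
        if latin is None:
            out.append(ch)
        else:
            out.append(latin if ch.isupper() else latin.lower())
    return "".join(out)


def primary_name(charter_short_name: str) -> str:
    """Build 'NAME, LEGAL_FORM' from a short charter name such as 'ООО «Чердак»'."""
    match = re.match(r'^\s*(\S+)\s+[«"](.+?)[»"]\s*$', charter_short_name)
    if not match:
        raise ValueError(f"unrecognized name format: {charter_short_name!r}")
    legal_form, name = match.groups()
    # Normalization is assumed here to mean upper-casing and collapsing spaces.
    latin_name = " ".join(transliterate(name).upper().split())
    latin_form = LEGAL_FORMS.get(legal_form.upper(), transliterate(legal_form).upper())
    return f"{latin_name}, {latin_form}"


print(primary_name("ООО «Чердак»"))
```

Keeping the legal form in a separate lookup table means the same name-building step could serve other jurisdictions by swapping the table, although the slides do not say how this is actually organized.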

Natural Language Processing and Machine Learning
The SCAN engine leverages vast amounts of text data to enable the next generation of Interfax data products
Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train, and deploy credit and reputation risk models with minimal effort
• Tagging documents and classifying them by text type (media release, forecast, feature, etc.)
Detecting and Disambiguating Named Entities
A Support Vector Machine (SVM) or a Bayes classifier is used, depending on configuration
• SVM represents a text as a vector to compare with a pattern (prototype); the closeness defines the type
• Bayes' rule is applicable when you rely on predetermined assumptions (a range of known “symptoms”) when calculating probabilities
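To make the Bayes option concrete, here is a minimal sketch of a multinomial naive Bayes classifier for text types, written with the standard library only. It is not the SCAN engine's code: the training sentences are invented toy data, the function names `train` and `classify` are hypothetical, and add-one smoothing is an assumed design choice. The known words of each text type play the role of the “symptoms”.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (text, label) pairs. Returns simple count tables."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior P(label)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs non-zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy corpus, for illustration only.
SAMPLES = [
    ("company announces new product launch", "media release"),
    ("company announces quarterly results", "media release"),
    ("analysts expect growth next year", "forecast"),
    ("analysts expect prices to rise next quarter", "forecast"),
]

model = train(SAMPLES)
print(classify(model, "analysts expect growth"))
```

Working in log-probabilities avoids numeric underflow on long documents, which matters once the texts are full articles rather than short phrases.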
Rule-based fact extraction and sentiment analysis
At the initial phase, for seeding named persons:
• Mostly a rule-based approach
• Context analysis and statistics for entity disambiguation
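The slides do not say how context statistics are applied, so the following is only one simple stand-in: choose the candidate entity whose known context words overlap most with the words around the mention. The candidate profiles, the entity identifiers and the function name `disambiguate` are all hypothetical.

```python
def disambiguate(mention_context, candidates):
    """Pick the candidate whose word profile overlaps most with the context.

    mention_context: text surrounding the ambiguous name.
    candidates: dict mapping an entity id to a set of typical context words.
    """
    context_words = set(mention_context.lower().split())

    def overlap(item):
        _, profile = item
        return len(context_words & profile)

    # On a tie, max() keeps the first candidate in dict order.
    best_id, _ = max(candidates.items(), key=overlap)
    return best_id


# Two invented persons sharing one surname.
CANDIDATES = {
    "ivanov_banker": {"bank", "loan", "credit", "rates"},
    "ivanov_footballer": {"match", "goal", "club", "season"},
}

print(disambiguate("Ivanov said the bank will cut loan rates", CANDIDATES))
```

In practice the profiles would presumably be built from statistics over already verified mentions rather than written by hand.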
Refinement of Named Entity Detection by learning from a semi-automatically labelled corpus
• Support Vector Machine (SVM)
• A neural network built on the existing rule-based structure is being considered for the future

An intellectual WOW effect, or what only SCAN can do: moving on to “verifying” media coverage
Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn have been identified with Spark
2 mn persons were generated (seeded); 75 thousand of them have been verified
300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
13 thousand trade marks (“Trade style”)
24 thousand sources in Russian

Thank you
Interfax – Dun & Bradstreet
www.dnb.ru
