SALES RELAUNCH F&Q SESSION
Multi-lingual data processing
The CIS and Georgia
Olga Rink, director general
Content
Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
• Business environment
• Main stages of processing multi-lingual business data
o Naming convention
o Transliteration
o Matching
• Seeding and verifying objects in media coverage
Official languages, population (mn) and Russian as a
second language (est.)
Multi-lingual environment
Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population (est.)
Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100%
Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%
Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%
Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100%
Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%
Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100%
Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100%
Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90%
Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%
Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%
Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100%
Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%
• The Constitution of Dagestan defines "Russian and the languages of the peoples of Dagestan" as the state languages
• The bulk of newly registered businesses is available in Cyrillic or Latin
• For Slavic languages we use the ISO 9:1995 standard with one exception: a combination of Latin characters replaces a Latin character with a diacritic. Example: Ч is written as Ch (no diacritic) instead of Č (with diacritic)
• ISO 9985 is used for Armenian
• ISO 9984 is used for Georgian
• ООО «Ъ» (Trade style: OOO TVERDY ZNAK): the transliterated name is OOO “”, so there is no way to find the company by its original name
• Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates
• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the Charter is used as the primary name; the indication of the legal form in the name (required by law) is placed at the end, after a comma
• The second name is the transliterated full legal name
• The trade style contains the official name in English/Latin or trade marks
• We use rule-based and machine learning approaches in areas including collecting data, identifying objects, developing credit scorings and digesting media coverage
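The transliteration and naming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Interfax - Dun & Bradstreet implementation: the base table follows ISO 9:1995 for the Russian alphabet, the Ch replacement comes from the slide, while the Zh and Sh replacements, the upper-casing of normalized names, the list of legal forms and the company name «Чайка» are hypothetical.

```python
# ISO 9:1995 transliteration of the Russian alphabet (lower case).
ISO9_RU = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "ŝ",
    "ъ": "\u02ba",  # hard sign becomes a modifier double prime
    "ы": "y",
    "ь": "\u02b9",  # soft sign becomes a modifier prime
    "э": "è", "ю": "û", "я": "â",
}

# Diacritic letters replaced by plain Latin combinations.
# "č" -> "ch" is the example from the slide; the other two are assumptions.
DIGRAPHS = {"č": "ch", "ž": "zh", "š": "sh"}

# Hypothetical list of legal-form prefixes that may open a brief name.
LEGAL_FORMS = ("ООО", "ЗАО", "ОАО", "ПАО", "АО")


def transliterate(text: str) -> str:
    """Transliterate Cyrillic text; upper-case output is an assumption."""
    out = []
    for ch in text.lower():
        latin = ISO9_RU.get(ch, ch)        # non-Cyrillic characters pass through
        out.append(DIGRAPHS.get(latin, latin))
    return "".join(out).upper()


def primary_name(brief_name: str) -> str:
    """Build the primary name: transliterated name, then the legal form after a comma."""
    words = brief_name.replace("«", "").replace("»", "").split()
    if words and words[0].upper() in LEGAL_FORMS:
        form, rest = words[0], " ".join(words[1:])
        return f"{transliterate(rest)}, {transliterate(form)}"
    return transliterate(" ".join(words))


if __name__ == "__main__":
    print(primary_name("ООО «Чайка»"))  # hypothetical company name
    print(primary_name("ООО «Ъ»"))      # the hard-sign case from the slide
```

Running the second call shows why ООО «Ъ» is hard to search for: the whole name collapses to a single modifier character followed by the legal form.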
Natural Language Processing and Machine Learning
The SCAN engine is leveraging vast amounts of text data to enable the next generation of Interfax data products
Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train,
and deploy credit and reputation risk models with minimal effort
• Tagging documents
• Classifying by text type (media release, forecast, feature, etc.)
Detecting and Disambiguating Named Entities
A Support Vector Machine (SVM) or a Bayes classifier is used, depending on configuration
• SVM represents a text as a vector and compares it with a pattern (prototype); the closeness defines the type
• Bayes' rule is applicable when you rely on predetermined assumptions (a range of known “symptoms”) while calculating probabilities
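To make the Bayes option concrete, here is a minimal sketch of a multinomial Naive Bayes text-type classifier with add-one smoothing. It is a generic textbook version, not the SCAN engine; the class name, the whitespace tokenizer and the six training sentences are invented for illustration, and only the three labels come from the slide.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very rough tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, labelled_docs):
        for text, label in labelled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of log likelihoods of the known words.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples for the three text types named on the slide.
TRAINING = [
    ("company announces new product launch today", "media-release"),
    ("company announces appointment of new director", "media-release"),
    ("analysts expect revenue to grow next year", "forecast"),
    ("analysts expect prices to fall next quarter", "forecast"),
    ("a long look at the history of the factory", "feature"),
    ("a profile of the family behind the factory", "feature"),
]

classifier = NaiveBayesTextClassifier()
classifier.fit(TRAINING)

if __name__ == "__main__":
    print(classifier.predict("analysts expect profit to grow"))
```

The known "symptoms" here are simply the words seen in training; a production system would presumably use richer features and far more labelled text.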
Rule-based fact extraction and sentiment analysis
At the initial phase, for seeding named persons:
• Mostly a rule-based approach
• Context analysis and statistics for entity disambiguation
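As a rough illustration of rule-based seeding, the sketch below uses one hand-written rule: a capitalized multi-word name followed by a comma and a role cue. The rule, the list of role cues, the sample sentences and the name Ivan Petrov are all hypothetical; the slides do not describe the actual SCAN rules.

```python
import re
from collections import Counter

# Hypothetical role cues that signal a person mention.
ROLE_CUES = ("director general", "chief executive", "chairman")

# Two or more capitalized words in a row, e.g. "Olga Rink".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"

# Rule: "<Name>, <role cue>".
RULE = re.compile(r"\b(%s), (%s)\b" % (NAME, "|".join(ROLE_CUES)))


def seed_persons(text):
    """Return (name, role) pairs found by the rule."""
    return RULE.findall(text)


# Invented sample sentences.
docs = [
    "Olga Rink, director general, presented the results.",
    "The session was opened by Olga Rink, director general.",
    "Ivan Petrov, chairman, declined to comment.",
]

# Simple mention statistics per seeded name.
seeds = Counter(name for doc in docs for name, _role in seed_persons(doc))

if __name__ == "__main__":
    for name, mentions in seeds.most_common():
        print(name, mentions)
```

A rule this simple would also capture a capitalized word that happens to precede the name at the start of a sentence, which is one reason context analysis and statistics are needed on top of the rules.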
Refining named entity detection by learning from a semi-automatically labelled corpus:
• Support Vector Machine (SVM)
• A neural network built on the existing rule-based structure is being considered for the future
An intellectual WOW effect, or what only SCAN can do: moving on to “verifying” media coverage
Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn have been identified with Spark
2 mn persons were generated (seeded); 75 thousand of them have been verified
300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
13 thousand trade marks (“Trade style”)
24 thousand sources in Russian
Thank you
Interfax – Dun & Bradstreet
www.dnb.ru
