Linguistic component: Lemmatizer for
the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info
This document describes the technical details of the lemmatizer for the Russian language.
It is assumed that, before this component is used, the input text has been preprocessed with the Tokenizer
component (see the corresponding Technical Description).
The demo package, sent upon request, contains the following:
• a Java library of the lemmatizer in binary form
• the run_lemmatizer.sh script for quickly checking the functionality of the module
• the messages_to_lemmatize.txt file, containing examples of generic text and tweets for lemmatization
using the run_lemmatizer.sh script
The algorithm is based on a combination of the following:
• dictionary search
• an algorithm that calculates the morphological properties of unknown words
• a compound word analyzer
• a number analyzer
• a rule-based analyzer
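The document does not say how these parts are combined. One common arrangement, shown here purely as an illustrative sketch, is a chain in which each analyzer is tried in turn and the first one that produces a result wins. Everything in the block is an assumption: the class AnalyzerChainSketch, its method names, the toy dictionary and the order of the analyzers are hypothetical and are not taken from the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: NOT the library's implementation.
final class AnalyzerChainSketch {
    // Toy dictionary: word form -> lemma (illustrative data only).
    static final Map<String, String> DICTIONARY = new HashMap<>();
    // Analyzers are tried in this order (the order is an assumption).
    static final List<Function<String, Optional<String>>> CHAIN = new ArrayList<>();

    static {
        DICTIONARY.put("русского", "русский");
        DICTIONARY.put("набережной", "набережная");
        CHAIN.add(AnalyzerChainSketch::dictionarySearch);
        CHAIN.add(AnalyzerChainSketch::numberAnalyzer);
    }

    // Dictionary search: case-insensitive lookup of a known word form.
    static Optional<String> dictionarySearch(String token) {
        return Optional.ofNullable(DICTIONARY.get(token.toLowerCase(Locale.ROOT)));
    }

    // Number analyzer: a run of digits is treated as its own lemma.
    static Optional<String> numberAnalyzer(String token) {
        return token.matches("\\d+") ? Optional.of(token) : Optional.empty();
    }

    // Returns the result of the first analyzer that recognizes the token.
    static Optional<String> lemmatize(String token) {
        for (Function<String, Optional<String>> analyzer : CHAIN) {
            Optional<String> lemma = analyzer.apply(token);
            if (lemma.isPresent()) {
                return lemma;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("Русского"));
        System.out.println(lemmatize("2013"));
        System.out.println(lemmatize("маккафе"));
    }
}
```

A token that no analyzer recognizes yields an empty result in this sketch.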
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
5037 characters/ms
880 tokens/ms
Tests were conducted in a single thread.
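As a rough aid for capacity planning, the two figures above can be turned into a processing-time estimate. The sketch below only does arithmetic on the published numbers. The class and method names are hypothetical, and the estimate assumes hardware and text comparable to the test setup.

```java
// Hypothetical helper: arithmetic on the throughput figures quoted above.
final class ThroughputEstimate {
    static final double CHARS_PER_MS = 5037.0;  // from the measurement above
    static final double TOKENS_PER_MS = 880.0;  // from the measurement above

    // Estimated single-threaded processing time, in seconds, for a text of the given size.
    static double secondsForChars(long chars) {
        return chars / CHARS_PER_MS / 1000.0;
    }

    // Average number of characters consumed per token, implied by the two figures.
    static double avgCharsPerToken() {
        return CHARS_PER_MS / TOKENS_PER_MS;
    }

    public static void main(String[] args) {
        System.out.printf("1 GB of characters: about %.0f s%n", secondsForChars(1_000_000_000L));
        System.out.printf("Average characters per token: %.2f%n", avgCharsPerToken());
    }
}
```

At 5037 characters/ms the module handles about 5 million characters per second in one thread, and the two figures imply roughly 5.7 characters per token.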
Format of the messages_to_lemmatize.txt file
This file describes the input data for the lemmatizer module for demo purposes.
Format:
Text\tText type
Text contains textual data in Russian for lemmatization
\t – tab symbol
Text type: supported values are GENERAL_TEXT and TWITTER.
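A line in this format can be split as in the minimal sketch below. The class InputLineParser and its members are hypothetical names introduced for illustration and are not part of the demo package. Splitting on the last tab is an assumption made so that a tab inside the text itself would not break parsing.

```java
// Hypothetical reader for one line of messages_to_lemmatize.txt.
final class InputLineParser {
    // The two text types named in the format description.
    enum TextType { GENERAL_TEXT, TWITTER }

    static final class Message {
        final String text;
        final TextType type;

        Message(String text, TextType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Splits "<text>\t<text type>" on the last tab character.
    static Message parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected <text>\\t<text type>: " + line);
        }
        String text = line.substring(0, tab);
        // valueOf throws IllegalArgumentException for an unsupported text type.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new Message(text, type);
    }

    public static void main(String[] args) {
        Message m = parse("Прекрасный вечер))) прогулка по Набережной\tTWITTER");
        System.out.println(m.type + ": " + m.text);
    }
}
```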
Examples of lemmatization
The run_lemmatizer.sh script generates the file messages_to_lemmatize.out.
For the following input file messages_to_lemmatize.txt:
Прекрасный вечер))) прогулка по Набережной - самое то;) только маккафе подпортило настроение(	TWITTER
The following output is generated:
Прекрасный, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[а, ая, ее, ей, о, ого, ое, ой, ом, ому, ою, ую, ы, ые, ым, ыми, ых],lemma=прекрасный,pos=ADJECTIVE,weight=14317,stem=прекрасн]
вечер, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[а, ам, ами, ах, е, ов, ом, у],lemma=вечер,pos=NOUN,weight=39101,stem=вечер]
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
прогулка, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=а,endings=[ам, ами, ах, е, и, ой, ою, у],lemma=прогулка,pos=NOUN,weight=3054,stem=прогулк]
по, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=по,pos=PREPOSITION,weight=573564,stem=по]
Набережной, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми, ых],lemma=набережная,pos=NOUN,weight=2908,stem=набережн]
-, type: PUNCT
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=-,pos=NUMERAL,weight=0,stem=-]
самое, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[ая, ого, ое, ой, ом, ому, ою, ую, ые, ым, ыми, ых],lemma=самый,pos=ADJECTIVE,weight=0,stem=сам]
то, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=CONJUNCTION,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=ADVERB,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=от,endings=[а, е, ем, еми, ех, о, ого, ой, ом, ому, ою, у],lemma=тот,pos=PRONOUN_ADJECTIVE,weight=1139844,stem=т]
MorphDesc[removeNum=0,lemmaEnding=о,endings=[е, ем, еми, ех, ого, ом, ому],lemma=то,pos=NOUN,weight=0,stem=т]
emopostkn, type: ALPHANUM
только, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=PARTICLE,weight=0,stem=только]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=ADVERB,weight=0,stem=только]
маккафе, type: ALPHANUM
подпортило, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ть,endings=[в, вший, вшего, вшему, вшим, вшем, вшая, вшей, вшую, вшею, вшее, вшие, вших, вшими, вши, л, ла, ли, ло],lemma=подпортить,pos=VERB,weight=190,stem=подпорти]
настроение, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, ю, я],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, й, ю, я, ям, ями, ях],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
emonegtkn, type: ALPHANUM
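In every analysis shown above, the lemma equals stem followed by lemmaEnding, and the analyzed token (ignoring case) equals stem followed by either lemmaEnding or one of the listed endings. The sketch below reads one printed MorphDesc[...] record with a regular expression and rebuilds the candidate word forms from those fields. This relationship is inferred from the sample output only. The class MorphDescText is a hypothetical helper that parses the printed text, not the library's own MorphDesc type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the printed text form of a MorphDesc record.
final class MorphDescText {
    private static final Pattern FORMAT = Pattern.compile(
            "MorphDesc\\[removeNum=(\\d+),lemmaEnding=(.*?),endings=\\[(.*?)\\],"
            + "lemma=(.*?),pos=(\\w+),weight=(\\d+),stem=(.*?)\\]");

    final String lemmaEnding;
    final List<String> endings;
    final String lemma;
    final String pos;
    final long weight;
    final String stem;

    private MorphDescText(String lemmaEnding, List<String> endings, String lemma,
                          String pos, long weight, String stem) {
        this.lemmaEnding = lemmaEnding;
        this.endings = endings;
        this.lemma = lemma;
        this.pos = pos;
        this.weight = weight;
        this.stem = stem;
    }

    // Parses a single, unwrapped "MorphDesc[...]" line.
    static MorphDescText parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a MorphDesc record: " + line);
        }
        List<String> endings = new ArrayList<>();
        if (!m.group(3).isEmpty()) {
            endings.addAll(Arrays.asList(m.group(3).split(",\\s*")));
        }
        return new MorphDescText(m.group(2), endings, m.group(4), m.group(5),
                Long.parseLong(m.group(6)), m.group(7));
    }

    // Candidate word forms: the lemma form first, then stem + each listed ending.
    List<String> wordForms() {
        List<String> forms = new ArrayList<>();
        forms.add(stem + lemmaEnding);
        for (String ending : endings) {
            forms.add(stem + ending);
        }
        return forms;
    }

    public static void main(String[] args) {
        MorphDescText d = parse("MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми, ых],"
                + "lemma=набережная,pos=NOUN,weight=2908,stem=набережн]");
        System.out.println(d.lemma + " (" + d.pos + "): " + d.wordForms());
    }
}
```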
Example of using the library from Java code
MorphAnalyzer morphAnalyzer = MorphAnalyzerLoader.loadDefault();
System.out.println(morphAnalyzer.analyzeBest("русского"));
Output:
MorphDesc[removeNum=0,lemmaEnding=ий,endings=[ая, ие, им, ими, их, ого, ое, ой, ом, ому, ою, ую],lemma=русский,pos=ADJECTIVE,weight=36739,stem=русск]

Mais conteúdo relacionado

Mais procurados

Batch file programming
Batch file programmingBatch file programming
Batch file programmingswapnil kapate
 
Linux commd
Linux commdLinux commd
Linux commdragav03
 
Linux commd
Linux commdLinux commd
Linux commdragav03
 
In just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaIn just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaJason Brown
 
Command line for the beginner - Using the command line in developing for the...
Command line for the beginner -  Using the command line in developing for the...Command line for the beginner -  Using the command line in developing for the...
Command line for the beginner - Using the command line in developing for the...Jim Birch
 
Automating with ansible (Part c)
Automating with ansible (Part c) Automating with ansible (Part c)
Automating with ansible (Part c) iman darabi
 

Mais procurados (11)

Batch file programming
Batch file programmingBatch file programming
Batch file programming
 
Linux commd
Linux commdLinux commd
Linux commd
 
Linux commd
Linux commdLinux commd
Linux commd
 
Sahul
SahulSahul
Sahul
 
In just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninjaIn just one hour i will make you a power shell ninja
In just one hour i will make you a power shell ninja
 
Command line for the beginner - Using the command line in developing for the...
Command line for the beginner -  Using the command line in developing for the...Command line for the beginner -  Using the command line in developing for the...
Command line for the beginner - Using the command line in developing for the...
 
Linux introduction Class 03
Linux introduction Class 03Linux introduction Class 03
Linux introduction Class 03
 
Linux
LinuxLinux
Linux
 
Automating with ansible (Part c)
Automating with ansible (Part c) Automating with ansible (Part c)
Automating with ansible (Part c)
 
Apache
ApacheApache
Apache
 
Intro_Unix_Ppt
Intro_Unix_PptIntro_Unix_Ppt
Intro_Unix_Ppt
 

Destaque

Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageDmitry Kan
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for EnglishDmitry Kan
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupDmitry Kan
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupDmitry Kan
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsDmitry Kan
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1Dmitry Kan
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation systemDmitry Kan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)Dmitry Kan
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryDmitry Kan
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationDmitry Kan
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine TranslationDmitry Kan
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopDmitry Kan
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Dmitry Kan
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Dmitry Kan
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesDmitry Kan
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageDmitry Kan
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesDmitry Kan
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source stateDmitry Kan
 
SAS University Edition - Getting Started
SAS University Edition - Getting StartedSAS University Edition - Getting Started
SAS University Edition - Getting StartedCraig Trim
 

Destaque (19)

Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer Group
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwords
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation system
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational Dictionary
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine Translation
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source state
 
SAS University Edition - Getting Started
SAS University Edition - Getting StartedSAS University Edition - Getting Started
SAS University Edition - Getting Started
 

Semelhante a Linguistic component Lemmatizer for the Russian language

Chapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsChapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsMeenalJabde
 
Introduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingIntroduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingRahul P
 
55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines55 best linux tips, tricks and command lines
55 best linux tips, tricks and command linesArif Wahyudi
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Webfanqstefan
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysisSudhaa Ravi
 
Introduction to Compilers
Introduction to CompilersIntroduction to Compilers
Introduction to Compilersvijaya603274
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaSrikanth Vanama
 
Ruby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionRuby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionPrabu D
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functionsmatlab Content
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in MatlabDataminingTools Inc
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonTristan Penman
 
phases of compiler-analysis phase
phases of compiler-analysis phasephases of compiler-analysis phase
phases of compiler-analysis phaseSuyash Srivastava
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeProf. Wim Van Criekinge
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysismengistu23
 

Semelhante a Linguistic component Lemmatizer for the Russian language (20)

Chapter 3 Using Unix Commands
Chapter 3 Using Unix CommandsChapter 3 Using Unix Commands
Chapter 3 Using Unix Commands
 
Introduction to Assembly Language Programming
Introduction to Assembly Language ProgrammingIntroduction to Assembly Language Programming
Introduction to Assembly Language Programming
 
55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines55 best linux tips, tricks and command lines
55 best linux tips, tricks and command lines
 
Developing web apps using Erlang-Web
Developing web apps using Erlang-WebDeveloping web apps using Erlang-Web
Developing web apps using Erlang-Web
 
role of lexical anaysis
role of lexical anaysisrole of lexical anaysis
role of lexical anaysis
 
Introduction to Compilers
Introduction to CompilersIntroduction to Compilers
Introduction to Compilers
 
Parsing
ParsingParsing
Parsing
 
Compiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_VanamaCompiler_Project_Srikanth_Vanama
Compiler_Project_Srikanth_Vanama
 
Ruby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic IntroductionRuby 1.9.3 Basic Introduction
Ruby 1.9.3 Basic Introduction
 
AD102 - Break out of the Box
AD102 - Break out of the BoxAD102 - Break out of the Box
AD102 - Break out of the Box
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functions
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in Matlab
 
COMPILER DESIGN- Introduction & Lexical Analysis:
COMPILER DESIGN- Introduction & Lexical Analysis: COMPILER DESIGN- Introduction & Lexical Analysis:
COMPILER DESIGN- Introduction & Lexical Analysis:
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
 
phases of compiler-analysis phase
phases of compiler-analysis phasephases of compiler-analysis phase
phases of compiler-analysis phase
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekinge
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
 
Linux commands
Linux commandsLinux commands
Linux commands
 
Compiler Design
Compiler DesignCompiler Design
Compiler Design
 

Mais de Dmitry Kan

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesDmitry Kan
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural searchDmitry Kan
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Dmitry Kan
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaDmitry Kan
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_crDmitry Kan
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine TranslationDmitry Kan
 

Mais de Dmitry Kan (6)

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use cases
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_cr
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine Translation
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Linguistic component: Lemmatizer for the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info

This document describes the technical details of the lemmatizer for the Russian language.
It is assumed that, before this component is used, the input text has been preprocessed with the Tokenizer component (see the corresponding Technical Description).

The demo package, sent upon request, contains the following:
  • Java library of tokenizer in binary form
  • run_lemmatizer.sh script for quickly checking the functionality of the module
  • messages_to_lemmatize.txt file containing examples of generic text and tweets for processing with the run_lemmatizer.sh script

The algorithm is based on a combination of the following:
  • dictionary search
  • an algorithm that calculates morphological properties of unknown words
  • compound word analyzer
  • analyzer of numbers
  • rule-based analyzer

Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
5037 characters/ms
880 tokens/ms
Tests were conducted in a single thread.

Format of the messages_to_lemmatize.txt file
This file describes input data for the lemmatizer module for demo purposes.
Format:
Text\tText type
Text: textual data in Russian for lemmatization
\t: tab character
Text type: supported values are GENERAL_TEXT and TWITTER.

Examples of lemmatization
The run_lemmatizer.sh script generates the following file: messages_to_lemmatize.out.
For the following input file messages_to_lemmatize.txt:
Прекрасный вечер))) прогулка по Набережной - самое то;) только маккафе подпортило настроение(	TWITTER
this output is generated:
Прекрасный, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[а, ая, ее, ей, о, ого, ое, ой, ом, ому, ою, ую, ы, ые, ым, ыми, ых],lemma=прекрасный,pos=ADJECTIVE,weight=14317,stem=прекрасн]
вечер, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[а, ам, ами, ах, е, ов, ом, у],lemma=вечер,pos=NOUN,weight=39101,stem=вечер]
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
прогулка, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=а,endings=[ам, ами, ах, е, и, ой, ою, у],lemma=прогулка,pos=NOUN,weight=3054,stem=прогулк]
по, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=по,pos=PREPOSITION,weight=573564,stem=по]
Набережной, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми, ых],lemma=набережная,pos=NOUN,weight=2908,stem=набережн]
-, type: PUNCT
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=- ,pos=NUMERAL,weight=0,stem=-]
самое, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[ая, ого, ое, ой, ом, ому, ою, ую, ые, ым, ыми, ых],lemma=самый,pos=ADJECTIVE,weight=0,stem=сам]
то, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=CONJUNCTION,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=ADVERB,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=от,endings=[а, е, ем, еми, ех, о, ого, ой, ом, ому, ою, у],lemma=тот,pos=PRONOUN_ADJECTIVE,weight=1139844,stem=т]
MorphDesc[removeNum=0,lemmaEnding=о,endings=[е, ем, еми, ех, ого, ом, ому],lemma=то,pos=NOUN,weight=0,stem=т]
emopostkn, type: ALPHANUM
только, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=PARTICLE,weight=0,stem=только]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=ADVERB,weight=0,stem=только]
маккафе, type: ALPHANUM
подпортило, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ть,endings=[в, вший, вшего, вшему, вшим, вшем, вшая, вшей, вшую, вшею, вшее, вшие, вших, вшими, вши, л, ла, ли, ло],lemma=подпортить,pos=VERB,weight=190,stem=подпорти]
настроение, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, ю, я],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, й, ю, я, ям, ями, ях],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
emonegtkn, type: ALPHANUM

Examples of using the library from Java code
MorphAnalyzer morphAnalyzer = MorphAnalyzerLoader.loadDefault();
System.out.println(morphAnalyzer.analyzeBest("русского"));
output:
MorphDesc[removeNum=0,lemmaEnding=ий,endings=[ая, ие, им, ими, их, ого, ое, ой, ом, ому, ою, ую],lemma=русский,pos=ADJECTIVE,weight=36739,stem=русск]
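
The line format of messages_to_lemmatize.txt described earlier (text, a tab character, then the text type) can be read with a few lines of plain Java. The following is only a minimal sketch: the DemoInputLine class, its TextType enum and the parse method are hypothetical helpers written for illustration and are not part of the lemmatizer library. It also assumes that the last tab on a line is the separator, so that tabs inside the text itself are tolerated.

```java
/**
 * Hypothetical helper (not part of the lemmatizer library): one line of
 * messages_to_lemmatize.txt in the form "<text>\t<text type>".
 */
final class DemoInputLine {

    /** The two text types the description lists as supported. */
    enum TextType { GENERAL_TEXT, TWITTER }

    final String text;
    final TextType type;

    DemoInputLine(String text, TextType type) {
        this.text = text;
        this.type = type;
    }

    /**
     * Splits a line at its last tab (an assumption of this sketch).
     * Throws IllegalArgumentException if there is no tab or if the
     * text type is not one of the supported values.
     */
    static DemoInputLine parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected <text>\\t<text type>: " + line);
        }
        String text = line.substring(0, tab).trim();
        String typeName = line.substring(tab + 1).trim();
        // Enum.valueOf rejects unknown names with IllegalArgumentException.
        return new DemoInputLine(text, TextType.valueOf(typeName));
    }

    public static void main(String[] args) {
        // The Russian word "vecher" (evening) from the sample tweet, written
        // with Unicode escapes so that this source file stays ASCII-only.
        DemoInputLine parsed = parse("\u0432\u0435\u0447\u0435\u0440)))\tTWITTER");
        System.out.println(parsed.type + ": " + parsed.text);
    }
}
```

Rejecting unknown text types early keeps malformed demo lines from reaching the analyzer; whether the real run_lemmatizer.sh script is this strict is not stated in the description.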