Japanese linguistics
in Apache Lucene™ and Apache Solr™

             May 9th, 2012

             Christian Moen
          christian@atilika.com
About me
•   MSc. in computer science, University of Oslo, Norway
•   Worked with search at FAST (now Microsoft) for 10 years
     •   5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
     •   5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
•   Founded アティリカ株式会社 (Atilika Inc.) in 2009
     •   We help companies innovate using search technologies and good ideas
     •   We know information retrieval, natural language processing and big data
     •   We are based in Tokyo, but we have clients everywhere
•   Newbie Lucene & Solr Committer
     •   Mostly been working on Japanese language support (Kuromoji) so far
•   Please write to me at christian@atilika.com or cm@apache.org
Today’s topics

•   Japanese 101 - ordering beer and toasting


•   Japanese language processing


•   Japanese features in Lucene/Solr
Japanese 101
ビールください
 bi-ru kudasai

A beer, please
ありがとうございます!
 arigatō gozaimasu!

Thank you very much!
乾杯!
kanpai!

Cheers!
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  Shall we go for a beer near JR Shinjuku station?
JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字 (here: JR)
・Latin characters (26+)
・Used for proper nouns, etc.

Katakana - カタカナ (here: ビール)
・Phonetic script (~50)
・Typically used for loan words

Kanji - 漢字 (here: 新宿駅, 近, 飲, 行)
・Chinese characters (50,000+)
・Used for stems & proper nouns

Hiragana - ひらがな (here: の, くに, を, みに, こうか)
・Phonetic script (~50)
・Used for inflections & particles
JR新宿駅の近くにビールを飲みに行こうか?
? What are the words in this sentence?
! Words are implicit in Japanese - there
  is no white space that separates them
JR新宿駅の近くにビールを飲みに行こうか?
? How do we index this for search, then?
! We need to segment text into tokens first
! Two major approaches for segmentation

          1. n-gramming
          2. morphological analysis
            (statistical approach)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

Sliding a two-character window (n=2) over the text, one character at a time, gives:
JR R新 新宿 宿駅 駅の の近 近く ...
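The sliding window above can be sketched in a few lines of Python. This is a minimal illustration of character bigramming as drawn on the slide, not the code Lucene uses; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# First part of the example sentence from the slide.
print(char_ngrams("JR新宿駅の近く"))
```

For an input of k characters this produces k - n + 1 terms, which is why n-gramming tends to generate many terms per document.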
Problems with n-gramming
  JR ●   R新 ×   新宿 ●   宿駅 ×   駅の ×   の近 ×   近く ●   ...
  宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’

•   Does not preserve meaning well and often changes semantics
     •   Impacts ranking and search precision (many false positives)
•   Also generates many terms per document or query
     •   Impacts index size and performance
•   Still sometimes appropriate for certain search applications
     •   Compliance, e-commerce with special product names, ...
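The false-positive problem can be reproduced with a toy matcher. The sketch below assumes a deliberately naive model in which a document matches when all bigrams of the query occur among its bigrams; `char_bigrams` and `matches` are hypothetical helpers, not Lucene APIs.

```python
def char_bigrams(text: str) -> list[str]:
    """Overlapping character bigrams, as on the n-gramming slides."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query: str, document: str) -> bool:
    """Naive match: every bigram of the query occurs in the document."""
    doc_terms = set(char_bigrams(document))
    return all(term in doc_terms for term in char_bigrams(query))


doc = "JR新宿駅の近く"  # "near JR Shinjuku station"

print(matches("新宿", doc))  # intended hit: Shinjuku
print(matches("宿駅", doc))  # unintended hit: 'post town'
print(matches("空港", doc))  # no hit: airport
print(len(char_bigrams(doc)), "bigram terms for", len(doc), "characters")
```

The query 宿駅 matches a document that is about Shinjuku station, even though the word never occurs in it as a word.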
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
● Every token is a meaningful unit

  •   Tokens reflect what a Japanese speaker considers to be words
  •   Machine-learned statistical approach
       •   Conditional Random Fields (CRFs) decoded using Viterbi
       •   Also does part-of-speech tagging, extracts readings for kanji, etc.
  •   Several statistical models available with high accuracy (F > 0.97)
       •   Models/dictionaries are available as IPADIC, UniDic, ...
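To give a feel for the Viterbi decoding step, here is a heavily simplified sketch: a dynamic program that picks the cheapest segmentation from a toy lexicon. The lexicon and its costs are invented for illustration; real models such as IPADIC also carry CRF-trained connection costs between adjacent parts of speech, which this sketch leaves out.

```python
# Toy word costs (lower = more likely). All values are made up for this sketch.
LEXICON = {
    "新宿": 2, "駅": 2, "の": 1, "近く": 2,
    "新": 4, "宿": 4, "宿駅": 5, "近": 4, "く": 3,
}


def segment(text: str, lexicon: dict[str, int], unknown_cost: int = 10) -> list[str]:
    """Return the minimum-cost segmentation of text (unigram Viterbi)."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to segment text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in that path
    best[0] = 0
    max_len = max(len(word) for word in lexicon)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost  # fallback: unknown single character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


print(segment("新宿駅の近く", LEXICON))
```

With these costs the path 新宿 / 駅 / の / 近く is cheaper than any path through 宿駅, so the misleading 'post town' reading is never emitted.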
How does this actually work?
Demo
Japanese support in
  Lucene and Solr
Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6

! Available out-of-the-box

! Easy to use with reasonable defaults

! Provides sophisticated Japanese linguistics

! Customisable
How do we use it?

      ! Use JapaneseAnalyzer



      ! Use field type “text_ja”
        in example schema.xml
Demo
Feature summary / text_ja analyzer chain

JapaneseTokenizer
    Segments Japanese text into tokens with very high accuracy
    •   Token attributes for part-of-speech, base form, readings, etc.
    •   Compound segmentation with compound synonyms
    •   Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
    Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
    Stop-words removal based on part-of-speech tags
    See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
    Character width normalisation (fast Unicode NFKC subset)

StopFilter
    Stop-words removal
    See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
    Normalises common katakana spelling variations

LowerCaseFilter
    Lowercases
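The chain idea, a tokenizer followed by filters that each transform the token stream, can be sketched as follows. This is a conceptual model only: the `Token` type, the filter functions and the part-of-speech tags attached to the tokens are hand-written assumptions, and only two of the filters above are modelled.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    surface: str
    pos: str  # part-of-speech tag (illustrative, IPADIC-style)


# Tags to drop; an illustrative stand-in for a stop-tags file.
STOP_TAGS = {"助詞-格助詞-一般", "助詞-連体化"}


def pos_stop_filter(tokens: Iterable[Token], stop_tags: set[str]) -> Iterator[Token]:
    """Drop tokens whose part-of-speech tag is in stop_tags."""
    return (t for t in tokens if t.pos not in stop_tags)


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Lowercase the surface form of every token."""
    return (t._replace(surface=t.surface.lower()) for t in tokens)


def analyze(tokens: Iterable[Token]) -> list[str]:
    """Run the (pre-tokenized) stream through the filters in order."""
    return [t.surface for t in lowercase_filter(pos_stop_filter(tokens, STOP_TAGS))]


tokens = [
    Token("JR", "名詞-固有名詞-組織"),
    Token("新宿", "名詞-固有名詞-地域-一般"),
    Token("駅", "名詞-接尾-地域"),
    Token("の", "助詞-連体化"),
    Token("近く", "名詞-副詞可能"),
    Token("に", "助詞-格助詞-一般"),
]
print(analyze(tokens))
```

Because each filter only consumes and produces a token stream, filters can be reordered or swapped without touching the tokenizer.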
Feature details
Compound nouns
? How do we deal with compound nouns?
       Japanese                  English
    関西国際空港              Kansai International Airport
シニアソフトウェアエンジニア           Senior Software Engineer


! These are one word in Japanese, so
  searching for 空港 (airport) doesn’t match

! We need to segment the compounds, too
Compound segmentation

関西国際空港 (Kansai International Airport)
    → 関西 (Kansai)   国際 (International)   空港 (Airport)
シニアソフトウェアエンジニア (Senior Software Engineer)
    → シニア (Senior)   ソフトウェア (Software)   エンジニア (Engineer)

 ! We are using a heuristic to implement this
Compound synonym tokens
    Position 1        Position 2    Position 3
    関西              国際          空港
    関西国際空港

•   Segment compounds into their parts
    •   Good for recall - we can also search and match 空港 (airport)
•   We keep the compound itself as a synonym
    •   Good for precision with an exact hit because of IDF
•   Approach benefits both precision and recall for overall good ranking
    •   JapaneseTokenizer actually returns a graph of tokens
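One way to picture the token graph is as a stream of (term, position increment, position length) tuples, where an increment of 0 stacks a token on the previous position. The sketch below is an assumption-laden model of that idea: the tuple layout, the emission order and the helper names are illustrative and are not taken from the JapaneseTokenizer source.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def compound_token_stream(compound: str, parts: list[str]) -> Iterator[tuple[str, int, int]]:
    """Yield (term, position_increment, position_length) for a compound and its parts."""
    # The compound starts a new position and spans all of its parts.
    yield (compound, 1, len(parts))
    for i, part in enumerate(parts):
        # The first part shares the compound's position (increment 0).
        yield (part, 0 if i == 0 else 1, 1)


def terms_by_position(stream: Iterable[tuple[str, int, int]]) -> dict[int, list[str]]:
    """Group terms by absolute position (0-based) using the position increments."""
    index = defaultdict(list)
    position = -1
    for term, increment, _length in stream:
        position += increment
        index[position].append(term)
    return dict(index)


stream = compound_token_stream("関西国際空港", ["関西", "国際", "空港"])
print(terms_by_position(stream))
```

Both the full compound and 空港 end up as indexed terms, so an exact query and a partial query can each find the document.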
Character width normalisation
? How do we deal with character widths?
    Half-width・半角    Full-width・全角
    Lucene              Ｌｕｃｅｎｅ
    ｶﾀｶﾅ                カタカナ
    123                 １２３

! Use CJKWidthFilter to normalise them
  (Unicode NFKC subset)

    Input text        Ｌｕｃｅｎｅ      ｶﾀｶﾅ            １２３
    CJKWidthFilter    Lucene          カタカナ        123
                      (half-width)    (full-width)    (half-width)
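The effect of width normalisation can be tried with Python's standard library. Note the assumption: this sketch applies full Unicode NFKC, which is a superset of what the slide describes for CJKWidthFilter, so it also folds other compatibility characters (circled digits, ligatures, and so on) that the filter may leave alone.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width Latin/digits to half-width and half-width katakana to full-width."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
```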
Katakana end-vowel stemming
? A common spelling variation in
  katakana is an end long-vowel sound
    English    Japanese spelling variations
    manager    マネージャー    マネージャ    マネジャー

! We use JapaneseKatakanaStemFilter to
  normalise/stem the end vowel for long terms

    Input text                    コピー    マネージャー    マネージャ    マネジャー
    JapaneseKatakanaStemFilter    コピー    マネージャ      マネージャ    マネジャ
                                  copy      manager         manager       “manager”
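The stemming rule in the table can be approximated as: if a term is all katakana, is long enough, and ends in the prolonged sound mark ー, drop that final mark. The sketch below follows that reading; the minimum length of 4 is an assumption chosen so that the three-character コピー is left alone, as in the slide.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block (U+30A0..U+30FF)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = 4) -> str:
    """Strip a trailing prolonged sound mark from sufficiently long katakana terms."""
    if (
        len(term) >= minimum_length
        and term.endswith(PROLONGED_SOUND_MARK)
        and is_katakana(term)
    ):
        return term[:-1]
    return term


for word in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(word, "->", stem_katakana(word))
```

Note that マネジャ and マネージャ still differ after stemming; the rule only handles the end vowel, not the mid-word variation.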
Lemmatisation
? Japanese adjectives and verbs are highly
  inflected, how do we deal with that?

    Dictionary form: 買う (kau, to buy)

    Inflected forms (not exhaustive)
    買いなさい          買いませんでしたら    買える              買わせられる
    買いなさるな        買いませんでしたり    買おう              買わせる
    買いましたら        買いませんなら        買った              買わない
    買いましたり        買うだろう            買ったら            買わないだろう
    買いまして          買うでしょう          買ったり            買わないで
    買いましょう        買うな                買って              買わないでしょう
    買います            買うまい              買わせない          買わなかった
    買いますまい        買え                  買わせます          買わなかったら
    買いませば          買えない              買わせません        買わなかったり
    買いません          買えば                買わせられない      買わなければ
    買いませんで        買えます              買わせられます      買われない
    買いませんでした    買えません            買わせられません    買われます

 ! Use JapaneseBaseFormFilter to normalise
   inflected adjectives and verbs to dictionary form
   (lemmatisation by reduction)
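Lemmatisation by reduction means the filter does not compute the dictionary form itself; it swaps in the base form that the tokenizer's dictionary already attached to each token. A minimal sketch of that idea follows. The `Token` type and the hand-written segmentation of 買わなかった are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    surface: str
    base_form: Optional[str]  # dictionary form supplied by the tokenizer, if any


def base_form_filter(tokens: list[Token]) -> list[str]:
    """Replace each surface form with its base form when one is available."""
    return [t.base_form or t.surface for t in tokens]


# Hypothetical tokenizer output for 買わなかった ("did not buy").
tokens = [Token("買わ", "買う"), Token("なかっ", "ない"), Token("た", None)]
print(base_form_filter(tokens))
```

Indexing and querying through the same filter lets 買わなかった and 買う meet on the shared term 買う.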
User dictionaries
•   Custom dictionaries can be used for ad hoc
    segmentation, i.e. to override the default model
•   File format is simple and there’s no need to
    assign weights, etc. before using them
•   Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
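The four comma-separated fields in the example read as: surface form, space-separated segmentation, space-separated readings, and a part-of-speech name. The parser below is a sketch built on that reading of the example alone; `UserEntry` and `parse_user_dictionary` are hypothetical names, not part of Lucene.

```python
import csv
from typing import NamedTuple


class UserEntry(NamedTuple):
    surface: str
    segments: list[str]
    readings: list[str]
    pos: str


def parse_user_dictionary(text: str) -> list[UserEntry]:
    """Parse user dictionary lines, skipping blank lines and '#' comments."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    entries = []
    for row in csv.reader(lines):
        surface, segmentation, readings, pos = (field.strip() for field in row)
        entries.append(UserEntry(surface, segmentation.split(), readings.split(), pos))
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(SAMPLE):
    print(entry.surface, "->", entry.segments, entry.readings, entry.pos)
```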
Japanese focus in 4.0
•   Improvements in JapaneseTokenizer
     •   Improved search mode for katakana compounds
     •   Improved unknown word segmentation
     •   Some performance improvements
•   CharFilters for various character normalisations
     •   Dates and numbers
     •   Repetition marks (odoriji)
•   Japanese spell-checker
     •   Robert and Koji almost got this into 3.6, but it got
         postponed because of API changes being necessary
Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene,
for always reviewing my patches quickly, and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues
Q&A
ありがとうございました!
 arigatō gozaimashita!

Thank you very much!

 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 

Último (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Japanese Linguistics in Lucene and Solr

  • 1. Japanese linguistics in Apache Lucene™ and Apache Solr™ May 9th, 2012 Christian Moen christian@atilika.com
  • 2. About me • MSc. in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 in 2009 • We help companies innovate using search technologies and good ideas • We know information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • Newbie Lucene & Solr Committer • Mostly been working on Japanese language support (Kuromoji) so far • Please write to me at christian@atilika.com or cm@apache.org
  • 4. Today’s topics • Japanese 101 - ordering beer and toasting • Japanese language processing • Japanese features in Lucene/Solr
  • 15. JR新宿駅の近くにビールを飲みに行こうか? JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka? Shall we go for a beer near JR Shinjuku station?
  • 17. Romaji - ローマ字 ・Latin characters (26+) ・Used for proper nouns, etc. JR新宿駅の近くにビールを飲みに行こうか?
  • 18. Katakana - カタカナ ・Phonetic script (~50) ・Typically used for loan words JR新宿駅の近くにビールを飲みに行こうか?
  • 20. JR新宿駅の近くにビールを飲みに行こうか? Hiragana - ひらがな ・Phonetic script (~50) ・Used for inflections & particles
  • 21. JR新宿駅の近くにビールを飲みに行こうか? Romaji - ローマ字 ・Latin characters (26+) ・Used for proper nouns, etc. Katakana - カタカナ ・Phonetic script (~50) ・Typically used for loan words Kanji - 漢字 ・Chinese characters (50,000+) ・Used for stems & proper nouns Hiragana - ひらがな ・Phonetic script (~50) ・Used for inflections & particles
  • 24. JR新宿駅の近くにビールを飲みに行こうか? What are the words in this sentence? Words are implicit in Japanese: no white space separates them
  • 26. JR新宿駅の近くにビールを飲みに行こうか? How do we index this for search, then? We need to segment the text into tokens first
  • 27. Two major approaches to segmentation: 1. n-gramming 2. morphological analysis (statistical approach)
  • 35. n-gramming (n=2) JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? Bigrams (n=2), sliding one character at a time: JR R新 新宿 宿駅 駅の の近 近く ...
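The sliding-window step on the slide can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene implements its n-gram tokenizers; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return all overlapping character n-grams of text, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The first part of the example sentence from the slide.
bigrams = char_ngrams("JR新宿駅の近く")
print(bigrams)
```

Every character boundary produces a term, whether or not the two characters belong to the same word.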
  • 47. Problems with n-gramming: JR (●) R新 (×) 新宿 (●) 宿駅 (×, change of semantics! 宿駅 means ‘post town’, ‘relay station’ or ‘stage’) 駅の (×) の近 (×) 近く (●) ... • Does not preserve meaning well and often changes semantics • Hurts ranking and search precision (many false positives) • Also generates many terms per document or query • Increases index size and hurts performance • Still sometimes appropriate for certain search applications • Compliance, e-commerce with special product names, ...
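The false-positive problem can be reproduced with a small Python sketch. It simplifies matching to set containment of bigrams (a real engine would also check term positions), and the helper names `char_bigrams` and `bigram_match` are hypothetical.

```python
def char_bigrams(text):
    """Set of overlapping character bigrams in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_match(query, document):
    """True if every bigram of the query occurs among the document's bigrams."""
    return char_bigrams(query) <= char_bigrams(document)


doc = "JR新宿駅の近くにビールを飲みに行こうか"

# 宿駅 ('post town') is not a word in this sentence, but it straddles
# the boundary between 新宿 and 駅, so a bigram index still matches it.
false_positive = bigram_match("宿駅", doc)

# Number of distinct bigram terms generated for this one short sentence.
term_count = len(char_bigrams(doc))
print(false_positive, term_count)
```

The query 宿駅 matches even though the sentence is about Shinjuku station, which is the change of semantics flagged on the slide.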
  • 53. Morphological analysis JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? Tokens: JR | 新宿 | 駅 | の | 近く | に | ビール | を | 飲み | に | 行こ | う | か | ? (all 14 tokens marked ●) • Tokens reflect what a Japanese speaker considers to be words • Machine-learned statistical approach • Conditional Random Fields (CRFs) decoded using Viterbi • Also does part-of-speech tagging, extracts readings for kanji, etc. • Several statistical models available with high accuracy (F > 0.97) • Models/dictionaries are available as IPADIC, UniDic, ...
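To give a feel for the Viterbi decoding mentioned on the slide, here is a toy Python sketch that picks the lowest-cost segmentation by dynamic programming. It is a deliberate simplification: it uses only per-word costs, whereas CRF-trained models such as IPADIC also use connection costs between adjacent parts of speech. The dictionary entries, their costs and `UNKNOWN_COST` are invented for illustration.

```python
import math

# Hypothetical toy dictionary: surface form -> word cost (lower is more likely).
TOY_DICT = {
    "新宿": 2.0, "新": 4.0, "宿": 4.0, "宿駅": 6.0, "駅": 2.0,
    "の": 1.0, "近く": 2.0, "近": 4.0, "く": 3.0,
}
UNKNOWN_COST = 10.0  # assumed penalty for a single out-of-dictionary character


def segment(text):
    """Lowest-total-cost segmentation of text via Viterbi-style dynamic programming."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last token on that path
    best[0] = 0.0
    max_len = max(len(word) for word in TOY_DICT)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in TOY_DICT:
                cost = TOY_DICT[word]
            elif end - start == 1:
                cost = UNKNOWN_COST  # fall back to a single unknown character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


print(segment("新宿駅の近く"))
```

With these made-up costs, 新宿 + 駅 is cheaper than 新 + 宿駅, so the path through the misleading 宿駅 is rejected.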
  • 54. How does this actually work?
  • 55. Demo
  • 56. Japanese support in Lucene and Solr
  • 62. Japanese in Lucene/Solr • New feature in Lucene/Solr 3.6 • Available out-of-the-box • Easy to use with reasonable defaults • Provides sophisticated Japanese linguistics • Customisable
  • 65. How do we use it? • Use JapaneseAnalyzer • Use field type “text_ja” in example schema.xml
  • 66. Demo
  • 73. Feature summary / text_ja analyzer chain • JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy; token attributes for part-of-speech, base form, readings, etc.; compound segmentation with compound synonyms; segmentation is customisable using user dictionaries • JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction) • JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags (see example/solr/conf/lang/stoptags_ja.txt) • CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset) • StopFilter: Stop-words removal (see example/solr/conf/lang/stopwords_ja.txt) • JapaneseKatakanaStemFilter: Normalises common katakana spelling variations • LowerCaseFilter: Lowercases
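The later stages of this chain can be approximated in plain Python to show what each one does to a token. This is only a sketch, not the Lucene code: full NFKC stands in for CJKWidthFilter's faster subset, the katakana rule (drop a trailing prolonged sound mark from tokens of at least four characters) is an assumption about the default behaviour, and the dictionary-dependent stages (base form, part-of-speech stop tags) are omitted. All function names are made up.

```python
import unicodedata

PROLONGED_SOUND_MARK = "ー"  # U+30FC


def normalize_width(token):
    # NFKC folds half-width katakana to full-width and full-width Latin to ASCII.
    return unicodedata.normalize("NFKC", token)


def is_katakana(token):
    # True if every character lies in the Unicode Katakana block.
    return all("\u30a0" <= ch <= "\u30ff" for ch in token)


def stem_katakana(token, min_length=4):
    # Assumed rule: strip one trailing prolonged sound mark from longer katakana words.
    if len(token) >= min_length and is_katakana(token) and token.endswith(PROLONGED_SOUND_MARK):
        return token[:-1]
    return token


def analyze(tokens, stopwords=frozenset()):
    """Apply width normalisation, stop-word removal, katakana stemming, lowercasing."""
    out = []
    for tok in tokens:
        tok = normalize_width(tok)
        if tok in stopwords:
            continue
        tok = stem_katakana(tok)
        out.append(tok.lower())
    return out


print(analyze(["ＪＲ", "ｻｰﾊﾞｰ", "の", "コピー"], stopwords={"の"}))
```

The order mirrors the chain on the slide: width is normalised before the stop list is consulted, so the stop list only needs one spelling of each word.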
  • 78. Compound nouns: How do we deal with compound nouns? Japanese / English: 関西国際空港 / Kansai International Airport; シニアソフトウェアエンジニア / Senior Software Engineer • These are one word in Japanese, so searching for 空港 (airport) doesn’t match • We need to segment the compounds, too
  • 82. Compound segmentation
      • 関西国際空港 (Kansai International Airport) → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
      • シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)
      • We are using a heuristic to implement this
  • 86. Compound synonym tokens
      • Position 1: 関西, position 2: 国際, position 3: 空港, with the compound 関西国際空港 kept alongside them
      • Segment each compound into its parts
          • Good for recall: we can also search for and match 空港 (airport)
      • We keep the compound itself as a synonym
          • Good for precision: an exact hit ranks well because of IDF
      • The approach benefits both precision and recall, for good overall ranking
      • JapaneseTokenizer actually returns a graph of tokens
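The token graph on this slide can be modelled with a position increment and a position length per token. The following is a minimal Python sketch, not the JapaneseTokenizer implementation: the `GraphToken` type, the `token_spans` helper, the token order and the increment/length values are all assumptions chosen to reproduce the positions shown on the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GraphToken:
    """Hypothetical token type for illustration only."""
    term: str
    position_increment: int  # 0 = starts at the same position as the previous token
    position_length: int     # number of positions the token spans


def token_spans(tokens: List[GraphToken]) -> List[Tuple[str, int, int]]:
    """Return (term, first position, last position) for each token, 1-based."""
    spans = []
    position = 0
    for token in tokens:
        position += token.position_increment
        spans.append((token.term, position, position + token.position_length - 1))
    return spans


# Assumed output for 関西国際空港: the parts plus the compound as a synonym.
COMPOUND = [
    GraphToken("関西", 1, 1),
    GraphToken("関西国際空港", 0, 3),  # synonym spanning all three positions
    GraphToken("国際", 1, 1),
    GraphToken("空港", 1, 1),
]

if __name__ == "__main__":
    for term, first, last in token_spans(COMPOUND):
        print(f"{term}: positions {first}-{last}")
```

Because both the parts and the compound are indexed, a query for 空港 and a query for 関西国際空港 can each find a matching term in this sketch.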
  • 88. Character width normalisation
      • How do we deal with character widths?
          • Half-width (半角): Lucene, ｶﾀｶﾅ, 123
          • Full-width (全角): Ｌｕｃｅｎｅ, カタカナ, １２３
      • Use CJKWidthFilter to normalise them (Unicode NFKC subset)
          • Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
          • After CJKWidthFilter: Lucene (half-width) カタカナ (full-width) 123 (half-width)
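The effect of width normalisation can be tried out with Python's standard `unicodedata` module. This is only a sketch: it applies full NFKC, which does more than the subset the slide attributes to CJKWidthFilter, and the `normalise_width` name is made up for illustration.

```python
import unicodedata


def normalise_width(text: str) -> str:
    """Fold full-width Latin letters and digits to half-width, and
    half-width katakana to full-width, using full Unicode NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Full-width Latin, half-width katakana, full-width digits.
    print(normalise_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
```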
  • 90. Katakana end-vowel stemming
      • A common spelling variation in katakana is a long-vowel sound (ー) at the end of the word
          • manager: マネージャー, マネージャ, マネジャー
      • We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
          • コピー → コピー (copy)
          • マネージャー → マネージャ (manager)
          • マネージャ → マネージャ (manager)
          • マネジャー → マネジャ (“manager”)
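The stemming rule on this slide can be sketched in a few lines of Python. This is an illustration, not the filter's source: the function names are hypothetical, and the minimum length of 4 is an assumption chosen so that the short term コピー is left alone, as on the slide.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー
MIN_LENGTH = 4  # assumption: only terms at least this long are stemmed


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, min_length: int = MIN_LENGTH) -> str:
    """Drop one trailing long-vowel mark from a sufficiently long katakana term."""
    if (
        len(term) >= min_length
        and term.endswith(PROLONGED_SOUND_MARK)
        and is_katakana(term)
    ):
        return term[:-1]
    return term


if __name__ == "__main__":
    for term in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
        print(term, "->", stem_katakana(term))
```

Note that マネジャ and マネージャ still differ after stemming; the rule only removes the final long-vowel mark.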
  • 94. Lemmatisation
      • Japanese adjectives and verbs are highly inflected, how do we deal with that?
      • Dictionary form: 買う (kau, to buy)
      • Inflected forms (not exhaustive):
          • 買いなさい 買いませんでしたら 買える 買わせられる 買いなさるな 買いましたら 買いませんでしたり 買いませんなら
          • 買おう 買った 買わせる 買わない 買いましたり 買うだろう 買ったら 買わないだろう
          • 買いまして 買いましょう 買うでしょう 買うな 買ったり 買って 買わないで 買わないでしょう
          • 買わせない 買います 買うまい 買わなかった 買いますまい 買え 買わせます 買わなかったら
          • 買いませば 買えない 買わせません 買わなかったり 買いません 買えば 買わせられない 買わなければ
          • 買いませんで 買えます 買わせられます 買われない 買いませんでした 買えません 買わせられません 買われます
      • Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
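Lemmatisation by reduction can be pictured as replacing each token's surface form with the base form the tokenizer attached to it. The Python sketch below assumes a hypothetical `AnalysedToken` type and hand-written sample tokens; it is not real tokenizer output and not the filter's implementation.

```python
from typing import List, NamedTuple, Optional


class AnalysedToken(NamedTuple):
    """Hypothetical token carrying the base-form attribute from the dictionary."""
    surface: str
    base_form: Optional[str]  # None when no separate base form is recorded


def reduce_to_base_form(tokens: List[AnalysedToken]) -> List[str]:
    """Index the base form when one is available, otherwise the surface form."""
    return [token.base_form or token.surface for token in tokens]


# Hand-written sample, assumed for illustration.
SAMPLE = [
    AnalysedToken("買わ", "買う"),
    AnalysedToken("ない", None),
]

if __name__ == "__main__":
    print(reduce_to_base_form(SAMPLE))
```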
  • 95. User dictionaries
      • Your own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
      • The file format is simple and there’s no need to assign weights, etc. before using them
      • Example custom dictionary:
          # Custom segmentation and POS entry for long entries
          関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
          # Custom reading and POS for former sumo wrestler Asashoryu
          朝青龍,朝青龍,アサショウリュウ,カスタム人名
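The example dictionary above has four comma-separated fields per entry: surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag. A minimal Python sketch of reading that layout follows. The `UserEntry` type and `parse_user_dictionary` function are hypothetical, and the check that segments and readings have equal counts is an assumption rather than documented behaviour.

```python
from typing import List, NamedTuple


class UserEntry(NamedTuple):
    surface: str
    segments: List[str]
    readings: List[str]
    pos: str


def parse_user_dictionary(text: str) -> List[UserEntry]:
    """Parse entries, skipping blank lines and '#' comment lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, segmentation, readings, pos = line.split(",")
        entry = UserEntry(surface, segmentation.split(), readings.split(), pos)
        if len(entry.segments) != len(entry.readings):
            raise ValueError(f"segment/reading count mismatch: {line}")
        entries.append(entry)
    return entries


SAMPLE_DICTIONARY = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

if __name__ == "__main__":
    for entry in parse_user_dictionary(SAMPLE_DICTIONARY):
        print(entry.surface, entry.segments, entry.readings, entry.pos)
```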
  • 96. Japanese focus in 4.0
      • Improvements in JapaneseTokenizer
          • Improved search mode for katakana compounds
          • Improved unknown word segmentation
          • Some performance improvements
      • CharFilters for various character normalisations
          • Dates and numbers
          • Repetition marks (odoriji)
      • Japanese spell-checker
          • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  • 97. Acknowledgements
      • Robert Muir: thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
      • Michael McCandless: thanks for streaming Viterbi and synonym compounds!
      • Uwe Schindler: thanks for performance improvements + being the policeman
      • Simon Willnauer: thanks for doing the Kuromoji code donation process so well
      • Gaute Lambertsen & Gerry Hocks: thanks for presentation feedback and being great colleagues
  • 98. Q&A