©2013 CMScom info@cmscom.jp
Fuzzy Search on Plone and Search for East Asian Language
CMS communications Inc,
Manabu TERADA terada@cmscom.jp
http://www.cmscom.jp
4 / Oct / 2013
Plone Conference 2013 in Brasilia
Who am I? (お前だれよ?)
•Manabu TERADA (寺田 学) @terapyon
•Advisory Board Member of Plone Foundation
•Chair of PyCon APAC 2013 in Japan
•Owner of CMS communications Inc.
•Member of Plone Users Group Japan
•Authors
Contents
•About Japanese Language and other Languages
•Fuzzy Search on Plone
•About the product
•Basic technology
•Dependencies
•Demo
•Structure of the product
•Future plans
Language Questions
ありがとう: Japanese (日本語)
Thank you: English
Obrigado: Portuguese
Gracias: Spanish
谢谢: Chinese
감사 합니다: Korean
ขอบคุณ: Thai
Спасибо: Russian
شكرا: Arabic
•Double bytes
•Left to Right (LTR) or Right to Left (RTL)
•No white space?
Japanese
•Can you read this Japanese?
•私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。
•I am Manabu TERADA. I came from Tokyo, Japan. I have come to Brazil for the first time.
•私 は 寺田 学 です。日本 の 東京 から 来ました。ブラジル に 来た のは 初めて です。
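To see why this matters for search, here is a minimal sketch in plain Python (not code from c2.search.fuzzy) of what a whitespace tokenizer does with an English sentence and with an unsegmented Japanese one. The variable names are only illustrative.

```python
# Whitespace tokenization works for English but not for unsegmented Japanese.
english = "I came from Tokyo, Japan."
japanese = "日本の東京から来ました。"

english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # five separate tokens
print(japanese_tokens)  # a single token: the whole sentence
```

A word-level index built this way would store the whole Japanese sentence as one "word", which is why a separate splitting step is needed.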
•Japanese doesn't use white space to separate words.
•Japanese has 3 different scripts:
•Hiragana, Katakana, Kanji
•Hiragana and Katakana each have about 50 characters
•Kanji has over 2000 characters
•Japanese has homonyms: different characters can share the same
reading, and the same character can have different readings.
•These all have the same meaning.
•Kyoto ← Romaji
•京都 ← Kanji
•きょうと ← Hiragana
•キョウト ← Katakana
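The Hiragana and Katakana spellings above can be folded into one form mechanically, because the two Unicode blocks are laid out in parallel. The sketch below assumes that shifting code points is enough for illustration; the function name katakana_to_hiragana is hypothetical, and Kanji such as 京都 still need a dictionary-based tool to obtain a reading.

```python
# Assumption: folding Katakana to Hiragana by code point is enough for this
# illustration; Kanji such as 京都 are left untouched.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
OFFSET = 0x60            # distance between the Katakana and Hiragana blocks


def katakana_to_hiragana(text):
    """Return text with Katakana letters replaced by their Hiragana twins."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


print(katakana_to_hiragana("キョウト"))  # きょうと
```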
•Can you read these?
•橋 (bridge) → ハシ → Hashi
•端 (edge) → ハシ → Hashi
•箸 (chopsticks) → ハシ → Hashi
•They have different meanings.
•We can tell them apart by context.
Japanese and other Languages
•We have a lot of languages.
•We have a lot of rules.
•We have a lot of issues.
•I want to have some solutions in Plone.
Fuzzy Search on Plone
•Name: c2.search.fuzzy
•1.0a5 (alpha release)
https://pypi.python.org/pypi/c2.search.fuzzy
https://bitbucket.org/cmscom/c2.search.fuzzy
About
•We want search suggestions like Google's.
•On an intranet, we can NOT use Google.
•We do NOT use Solr. I know Solr works well,
•but it's difficult to install, configure, and implement.
•And I want to build my own system.
Basic technology
•This system is not difficult.
•Keywords
•Levenshtein Distance
•Sorted list
•Automata system
The Levenshtein distance is a string metric for
measuring the difference between two sequences.
Informally, the Levenshtein distance between two
words is the minimum number of single-character
edits (insertion, deletion, substitution) required to
change one word into the other. The phrase edit
distance is often used to refer specifically to
Levenshtein distance. It is named after Vladimir
Levenshtein, who considered this distance in 1965.
It is closely related to pairwise string alignments.
From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
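The definition above can be sketched with the textbook dynamic-programming table, kept one row at a time. This is a generic illustration, not the code used inside c2.search.fuzzy; the function name levenshtein is an assumption.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("plone", "phone"))     # 1
print(levenshtein("plone", "polne"))     # 2
print(levenshtein("kitten", "sitting"))  # 3
```

Note that a swap of two neighbouring letters ("polne") costs two edits under this definition, since transposition is not one of the three allowed operations.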
Levenshtein Distance
•Base word: plone
•Distance zero (ignoring case)
•PLONE, Plone, pLone
•Distance one
•Phone, plene, plne, lone, ploneg, ...
•Distance two
•one, plo, polne, ...
Sorted list
•Ordered container (List) or 
•We can get the order of words
Sorted in Unicode order (alphabetical), for example the G20 members:
['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
'European Union', 'France', 'Germany', 'India', 'Indonesia',
'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
'United States']
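A sorted list is useful because binary search can jump straight to the first stored word at or after any string. The sketch below uses the standard bisect module on a shortened version of the list; next_word is a hypothetical helper name, and whether the product uses bisect internally is an assumption.

```python
import bisect

# Sorted in code point order, as on the slide (shortened here).
words = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'France']


def next_word(sorted_words, probe):
    """Return the first word >= probe, or None when probe sorts after all."""
    pos = bisect.bisect_left(sorted_words, probe)
    return sorted_words[pos] if pos < len(sorted_words) else None


bisect.insort(words, 'Germany')    # insertion keeps the list sorted
print(next_word(words, 'Brezil'))  # Canada
print(next_word(words, 'Br'))      # Brazil
```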
From @hiratara's slides: http://www.slideshare.net/hiratara/levenshtein-automata
Levenshtein Automata
•I found a good blog entry:
•Damn Cool Algorithms: Levenshtein Automata
•http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
•https://gist.github.com/Arachnid/491973
•It uses only Python!!
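As a rough idea of what such an automaton does, the sketch below simulates one lazily: each state is a row of the edit-distance table, and feeding a character moves to the next row. This is a simplified stand-in, not the NFA-to-DFA construction from the blog entry or the gist; the class and method names are assumptions made for illustration.

```python
class LevenshteinAutomaton:
    """Accepts every string within max_edits of word (illustrative sketch)."""

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        # State = one row of the edit-distance table, for the empty input.
        return tuple(range(len(self.word) + 1))

    def step(self, state, char):
        # Consume one input character and return the next state.
        row = [state[0] + 1]
        for i, wc in enumerate(self.word):
            cost = 0 if wc == char else 1
            row.append(min(row[i] + 1, state[i] + cost, state[i + 1] + 1))
        return tuple(row)

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # False means no continuation can ever be accepted: a dead state.
        return min(state) <= self.max_edits


def accepts(automaton, candidate):
    state = automaton.start()
    for char in candidate:
        state = automaton.step(state, char)
        if not automaton.can_match(state):
            return False  # stop early
    return automaton.is_match(state)


plone = LevenshteinAutomaton("plone", 1)
print(accepts(plone, "phone"))  # True
print(accepts(plone, "polne"))  # False
```

The speed-up comes from can_match: once a prefix is already too far away, the rest of the candidate never has to be read.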
Index
•It creates its own index, like a sorted list, whenever Plone
content is created or modified.
Search
•It searches this index when we type into the search box.
•Correctly spelled words within a small distance of the input
are taken from the index and shown.
•Because the index is built from Plone content, every suggestion
is a word that actually appears in the site.
•For example,
•We want to show words within distance one (the default).
•From the G20 countries list:
•Brezil → Brazil
•Japon → Japan
•And it uses the automata system for increased speed.
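The example above can be reproduced with a short sketch that walks the sorted list and reuses work between neighbouring words that share a prefix. This is one possible way to exploit the sorted order, chosen here for brevity; it is an assumption, not necessarily how c2.search.fuzzy or the referenced blog entry does it, and fuzzy_matches is a hypothetical name.

```python
def fuzzy_matches(sorted_words, query, max_edits=1):
    """Return words from sorted_words within max_edits of query."""
    rows = [list(range(len(query) + 1))]  # rows[n]: table row after n characters
    previous = ""
    matches = []
    for word in sorted_words:
        # Neighbours in a sorted list share prefixes: keep those rows.
        shared = 0
        limit = min(len(word), len(previous), len(rows) - 1)
        while shared < limit and word[shared] == previous[shared]:
            shared += 1
        del rows[shared + 1:]
        for char in word[shared:]:
            last = rows[-1]
            row = [last[0] + 1]
            for i, qc in enumerate(query):
                row.append(min(row[i] + 1, last[i] + (qc != char), last[i + 1] + 1))
            rows.append(row)
            if min(row) > max_edits:
                break  # dead state: no need to read the rest of this word
        else:
            if rows[-1][-1] <= max_edits:
                matches.append(word)
        previous = word
    return matches


g20 = ['Argentina', 'Australia', 'Brazil', 'Canada', 'China',
       'European Union', 'France', 'Germany', 'India', 'Indonesia',
       'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia',
       'South Africa', 'South Korea', 'Turkey', 'United Kingdom',
       'United States']

print(fuzzy_matches(g20, 'Brezil'))  # ['Brazil']
print(fuzzy_matches(g20, 'Japon'))   # ['Japan']
```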
Dependencies
We need only Python.
•We use MeCab for Japanese support.
•Japanese doesn't use white space to separate words.
•(same as Chinese and Korean)
•Supported languages
•English and other European languages
•MAYBE: Arabic
•Chinese and Korean
•They need a word-splitting system to work
•I don't know of one.
Demo
•View the video on YouTube
http://youtu.be/e5DHsF7Gi70
Structure of the product
•Index data is stored in the ZODB as a List object.
•When content is created or modified, the List is updated and
kept sorted.
•Each List item is a Dict: the key is the phonetic reading (or the
lower-cased word in English), and the value is the list of original words.
Example index data:
[{'argentina': ['Argentina', 'argentina', 'ARGENTINA']},
 {'australia': ['Australia']},
 {'brazil': ['Brazil']},
 {'きょうと': ['京都', 'キョウト']}]
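A minimal in-memory sketch of how such an index could be maintained follows. It is hypothetical: FuzzyIndex and its methods are illustrative names, it uses one dict plus a sorted key list instead of a list of one-item dicts, and it uses plain Python objects rather than ZODB persistent ones.

```python
import bisect


class FuzzyIndex:
    """Hypothetical in-memory stand-in for the index described above."""

    def __init__(self):
        self.keys = []       # sorted list of normalized keys
        self.originals = {}  # key -> original spellings, in first-seen order

    def add(self, word, key=None):
        # Assumption: lower-casing is the key for English; a phonetic
        # reading (for example from MeCab) would be passed in for Japanese.
        key = word.lower() if key is None else key
        if key not in self.originals:
            bisect.insort(self.keys, key)
            self.originals[key] = []
        if word not in self.originals[key]:
            self.originals[key].append(word)


index = FuzzyIndex()
for word in ['Brazil', 'Argentina', 'argentina', 'ARGENTINA', 'Australia']:
    index.add(word)
index.add('京都', key='きょうと')
index.add('キョウト', key='きょうと')

print(index.keys)
print(index.originals['きょうと'])  # ['京都', 'キョウト']
```

Matching is done against the keys, and the stored original spellings are what would be shown to the user.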
•Search
•The List is checked for words within a small distance of the
input word, using the automata system.
•The original words from the matching Dict values are shown
under the search box by JavaScript.
For Japanese
•I'm using MeCab for splitting words and getting their phonetic readings.
•Both the phonetic reading and the original word are stored,
•because Japanese has homonyms written with different
characters.
Future plans
•Now, I'm using ZODB to store the index.
•I want to have an option to store it in an RDBMS. I'm trying
to develop it.
•I want to support more languages.
•Please help me support more languages.
Thanks
•Japanese & East Asian languages
•We still have some problems in Plone.
•I think Plone works well with multiple languages.
•I hope Plone will continue to work well.
•All developers, please never forget other languages.
•Fuzzy search
•I want to get bug reports.
•Please try the product.
Special thanks
• Supported by
• ike @rokujyouhitoma
• @hiratara
• Reference website
• http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
Contact me
• Twitter: @terapyon
• Facebook: https://www.facebook.com/terapyon
