Build a Scalable Search Engine With
Amazon CloudSearch
Agenda
•  Introduction to Search
•  Amazon CloudSearch
•  Building with CloudSearch
Introduction to Search
Search Engines Connect Us To Data
Documents
Representation of a Document
Field         Value
id            tt0371746
title         Iron Man
description   When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
director      Jon Favreau
actors        Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
rating        7.9
release_date  2008-05-02T00:00:00Z
Data Types
Doubles
Dates
Signed Integers
Text
Literal
Geo
•  Latlon data type
•  Region search
•  Distance sort
•  Supports mobile
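To make region search and distance sort concrete, here is a minimal sketch in Python. It is only an illustration of the two ideas, not how CloudSearch implements them: the haversine formula, the bounding-box check, and the sample places and coordinates are all assumptions made for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def in_region(place, top_left, bottom_right):
    """Region search: keep places inside a lat/lon bounding box."""
    _, lat, lon = place
    return (bottom_right[0] <= lat <= top_left[0]
            and top_left[1] <= lon <= bottom_right[1])

def sort_by_distance(places, origin):
    """Distance sort: nearest place first."""
    return sorted(places,
                  key=lambda p: haversine_km(origin[0], origin[1], p[1], p[2]))

# Hypothetical documents, each with a (name, latitude, longitude) latlon value.
PLACES = [
    ("San Francisco", 37.7749, -122.4194),
    ("Portland", 45.5152, -122.6784),
    ("Seattle", 47.6062, -122.3321),
]

nearest_first = sort_by_distance(PLACES, origin=(47.6, -122.3))
in_box = [p for p in PLACES if in_region(p, (49.0, -125.0), (45.0, -120.0))]
```

A mobile client would typically send its current position as the origin and ask for results sorted by distance.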
Text Processing (Normalization)
•  Tokenization (parsing)
•  Downcasing
•  Stemming
•  Stopword removal
•  Synonym Addition
When wealthy industrialist Tony Stark is forced to
build an armored suit after a life-threatening
incident, he ultimately decides to use its
technology to fight against evil.
when wealth industrial tony stark force build
armor suit after life threaten incident ultimate
decide use technology fight against evil
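The normalization steps listed above can be sketched in a few lines of Python. This is a toy pipeline under stated assumptions: the stopword set and the suffix-stripping stemmer are made up for illustration and are far cruder than a real analyzer, and synonym addition is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "is", "to", "he", "its", "the"}

# Naive suffix stripping; a real stemmer uses language-specific rules.
SUFFIXES = ("ing", "ed", "ly", "s")

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def normalize(text):
    """Downcase, tokenize on letters and digits, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

terms = normalize("Tony Stark decides to use its technology")
# terms == ['tony', 'stark', 'decide', 'use', 'technology']
```

The same function must be applied to both documents and queries, so that a search for "decides" and a document containing "decide" meet at the same term.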
Indexing
Term   Documents (Posting List)
Iron   The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady; ...
Man    Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man; ...
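A minimal sketch of how such a term-to-documents index might be built, assuming whitespace tokenization and integer document IDs (both simplifications chosen for this example):

```python
from collections import defaultdict

def build_index(titles):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, title in enumerate(titles):
        # set() so a term repeated within one title is posted only once
        for term in set(title.lower().split()):
            index[term].append(doc_id)
    return index

TITLES = ["Iron Man", "Iron Man 2", "Rain Man", "The Iron Giant"]
index = build_index(TITLES)
# index["iron"] == [0, 1, 3]
# index["man"]  == [0, 1, 2]
```

Because documents are visited in ID order, every posting list comes out sorted, which is what makes fast list intersection possible at query time.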
Matching
Documents containing "Iron":
The Man in the Iron Mask
Iron Man 2
Iron Man
The Iron Giant
The Iron Lady

Documents containing "Man":
Rain Man
The Man in the Moon
Iron Man 2
The Lawnmower Man
The Third Man
Iron Man

Documents matching "Iron Man":
Iron Man 2
Iron Man
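Matching a multi-term query then comes down to intersecting posting lists. The sketch below assumes AND semantics and sorted lists of integer document IDs; the small index and title table are hypothetical stand-ins for the movie data.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def match(index, query):
    """Return IDs of documents that contain every term of the query."""
    postings = [index.get(term, []) for term in query.lower().split()]
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result

# Hypothetical document IDs and a hand-built index over their titles.
TITLES = {0: "Iron Man", 1: "Iron Man 2", 2: "Rain Man",
          3: "The Iron Giant", 4: "The Iron Lady", 5: "The Third Man"}
INDEX = {"iron": [0, 1, 3, 4], "man": [0, 1, 2, 5], "giant": [3]}

hits = [TITLES[d] for d in match(INDEX, "Iron Man")]
# hits == ['Iron Man', 'Iron Man 2']
```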
Ranking and Relevance
•  The meat of the search engine
•  TF-IDF – uniqueness and presence
•  Additional Criteria
–  Measures of document value (e.g. rating)
–  Observed user behavior
–  Freshness
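As a rough illustration of TF-IDF, the sketch below scores each document by how often a query term appears in it (presence), weighted by how rare that term is across the collection (uniqueness). The exact formula, with its smoothed logarithm, is one common variant chosen for this example and is not claimed to be the one CloudSearch uses.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # document frequency: how many documents contain the term
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)      # presence within this document
            idf = math.log(n_docs / df) + 1.0    # rarity across the collection
            score += tf * idf
        scores.append(score)
    return scores

DOCS = ["iron man", "the iron giant", "rain man", "the third man"]
scores = tf_idf_scores("iron", DOCS)
ranked = sorted(zip(scores, DOCS), reverse=True)
```

Additional criteria such as rating or freshness would typically be blended into this text score as extra weighted terms.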
Summary
•  Search makes data accessible
•  Search documents gather information about one search target
•  Inverted (reverse) indices provide the basis of text matching
•  Relevance brings the best matches
Amazon CloudSearch
Building a Search service
•  Build your own
–  Extend datastores and build custom relevance engine
•  Open Source
–  Apache Solr, ElasticSearch
•  Enterprise Search
–  FAST, Autonomy, Endeca
Challenges with building a Search service
•  COMPLEX: Requires extensive search expertise
•  COSTLY: High upfront expenditure
•  SLOW: Long time to market. Slows innovation
•  UNDIFFERENTIATED: Operational overhead that doesn’t add value to
core product
Where CloudSearch fits in the picture
Amazon CloudSearch is a fully managed search service in the cloud that
makes it easy to set up, operate, and scale a search solution for your
website or application
Similar benefits as other AWS Managed Services
•  Easy to set up and operate (Console, SDK, CLI)
•  Pay as you go
•  No need to guess capacity
•  Experiment fast with low risk
•  Go Global in minutes
Reference Architecture
Automatic Scaling
[Diagram: a grid of search instances, each holding one copy of one index partition (Index Partition 1 to n by Copy 1 to n). DATA (document quantity and size) drives the number of index partitions; TRAFFIC (search request volume and complexity) drives the number of copies.]
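The two scaling dimensions can be expressed as simple arithmetic: the instance count is the number of partitions times the number of copies. The sketch below is only a back-of-the-envelope model; the per-partition capacity and per-copy throughput figures are hypothetical inputs, not published CloudSearch limits, and the service makes these decisions automatically.

```python
import math

def instances_needed(index_gb, queries_per_sec, gb_per_partition, qps_per_copy):
    """Return (partitions, copies, total instances) for a given load.

    gb_per_partition and qps_per_copy are assumed capacities supplied by
    the caller for illustration only.
    """
    partitions = max(1, math.ceil(index_gb / gb_per_partition))   # driven by DATA
    copies = max(1, math.ceil(queries_per_sec / qps_per_copy))    # driven by TRAFFIC
    return partitions, copies, partitions * copies

plan = instances_needed(index_gb=25, queries_per_sec=120,
                        gb_per_partition=10, qps_per_copy=50)
# plan == (3, 3, 9)
```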
Building With CloudSearch
Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
Create a Domain
Upload Data
March 2014 CloudSearch Launch
•  Support for 33 languages
Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish
Loading Data into CloudSearch (Console, CSV)
The generated SDF-format file can also be downloaded
Japanese Text Processing
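•  Morphological analysis (形態素解析)
–  The task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each
(http://ja.wikipedia.org/wiki/形態素解析)
•  Unlike languages such as English, where words are separated by spaces,
Japanese needs Japanese-specific parsing
–  Example: 彼はエンジニアだ ("He is an engineer")
•  彼 (noun, pronoun) / は (particle, binding particle) / エンジニア (noun, general) / だ (auxiliary verb)
•  Extracting "エンジニア" and building the index from it
makes a fast response possible when a user searches for "エンジニア"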
Japanese Text Processing
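•  Normalization (正規化)
–  A search for ｴﾝｼﾞﾆｱ (half-width kana) and a search for エンジニア (full-width kana) should both return hits
–  This is supported by CloudSearch
–  For more thorough normalization, depending on your requirements it may be preferable to implement one of the following yourself
•  NFD (Canonical Decomposition): Normalization Form D
•  NFC (Canonical Composition): Normalization Form C
•  NFKD (Compatibility Decomposition): Normalization Form KD
•  NFKC (Compatibility Composition): Normalization Form KC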
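If you do implement normalization yourself, Python's standard `unicodedata` module exposes all four forms. The sketch below assumes NFKC is the form you want for the half-width versus full-width kana case; which form fits depends on your requirements, and the helper name is made up for this example.

```python
import unicodedata

def normalize_query(text):
    """Fold compatibility variants (e.g. half-width kana) into canonical form."""
    return unicodedata.normalize("NFKC", text)

# "engineer" typed in half-width katakana, written as escapes for clarity.
half_width = "\uff74\uff9d\uff7c\uff9e\uff86\uff71"
full_width = "エンジニア"

same = normalize_query(half_width) == full_width   # True

# NFD splits ジ into シ plus a combining voiced sound mark; NFC puts it back.
decomposed = unicodedata.normalize("NFD", "ジ")
recomposed = unicodedata.normalize("NFC", decomposed)
```

Applying the same function to documents before upload and to queries before search keeps both sides of the match in one form.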
Japanese Text Processing
•  Stemming
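–  飲んだ ("drank") → 飲ん (verb, independent, baseForm: 飲む) / だ (auxiliary verb) → 飲む ("to drink")
–  Entries can be added to the stemming dictionary (also possible via the API/SDK)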
Japanese Text Processing
•  Stopword Removal
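–  Removes words that carry no meaning on their own, such as 「の」, 「は」, and 「か」
–  As with stemming, entries can be added to the stopword dictionary (also possible via the API/SDK)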
Japanese Text Processing
•  Synonym Addition
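–  Synonym = a word with the same meaning (同義語)
•  「ベニス」「ベネチア」「ヴェネチア」 (all spellings of "Venice")
•  「昨年」「去年」 (both mean "last year")
–  They mean the same thing, so a search for one should hit the others
–  Can be added in the same way as stopwords and stemming entries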
Japanese Text Processing
•  Synonym Addition
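–  Entries can be added to the synonym dictionary (also possible via the API/SDK)
•  Alias
–  A search for pupil hits documents containing student
–  A search for student does not hit documents containing pupil
•  Group
–  A search for any of 1st, first, or one
–  hits all documents containing 1st, first, or one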
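The difference between an alias (one-way) and a group (every member matches every other) can be sketched as term expansion. This models only the behavior described on the slide; the dictionaries and the `expand` helper are hypothetical, and where CloudSearch actually applies synonyms is not shown here.

```python
# One-way: searching the key also matches the listed terms, not the reverse.
ALIASES = {"pupil": ["student"]}

# Two-way: every member of a group matches every other member.
GROUPS = [["1st", "first", "one"]]

def expand(term):
    """Return the set of terms a search for `term` should match."""
    terms = {term}
    terms.update(ALIASES.get(term, []))
    for group in GROUPS:
        if term in group:
            terms.update(group)
    return terms

pupil_terms = expand("pupil")      # {'pupil', 'student'}
student_terms = expand("student")  # {'student'}
first_terms = expand("first")      # {'1st', 'first', 'one'}
```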
Document Upload
http(s)://<document service endpoint>/2013-01-01/documents/batch

Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

[ { "type" : "add", "id" : "tt0371746", "fields" : {
    "directors" : [ "Jon Favreau" ],
    "release_date" : "2008-04-14T00:00:00Z", "rating" : 7.9,
    "genres" : [ "Action", "Adventure", "Sci-Fi" ],
    "image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
    "plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
    "title" : "Iron Man", "rank" : 171, "running_time_secs" : 7560,
    "actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
    "year" : 2008 } },
  { "type" : "delete", "id" : "tt0434409" } ]
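A batch like the one above can be assembled with Python's standard library. This is a minimal sketch assuming the 2013-01-01 batch format, in which each operation carries `type`, `id`, and (for adds) `fields`; the helper names and the example endpoint are hypothetical, and the request is only constructed, never sent.

```python
import json
import urllib.request

def build_batch(adds, delete_ids):
    """Serialize add and delete operations into one JSON batch body."""
    batch = [{"type": "add", "id": doc_id, "fields": fields}
             for doc_id, fields in adds.items()]
    batch += [{"type": "delete", "id": doc_id} for doc_id in delete_ids]
    return json.dumps(batch)

def build_request(doc_endpoint, body):
    """Prepare (but do not send) the POST to the document service endpoint."""
    url = f"https://{doc_endpoint}/2013-01-01/documents/batch"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )

body = build_batch(
    adds={"tt0371746": {"title": "Iron Man", "year": 2008, "rating": 7.9}},
    delete_ids=["tt0434409"],
)
# "doc-example.us-east-1.cloudsearch.amazonaws.com" is a placeholder host.
request = build_request("doc-example.us-east-1.cloudsearch.amazonaws.com", body)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; batching many documents per request keeps the number of uploads low.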
Simple Queries
Simple Queries
http(s)://<search endpoint>/2013-01-01/search?q=iron+man

{"status": {"rid": "oei6zt8oAgq5QOc=",
            "time-ms": 4},
 "hits": {"found": 9, "start": 0,
          "hit": [
            {"id": "tt1228705"},
            {"id": "tt0120744"},
            {"id": "tt0371746"},
            {"id": "tt1866249"},
            {"id": "tt0119558"},
            {"id": "tt0402894"},
            {"id": "tt1258972"},
            {"id": "tt1300854"},
            {"id": "tt0462465"} ] } }
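On the client side, a simple query is just a URL with an encoded `q` parameter, and the response is JSON with the matching IDs under `hits.hit`. The sketch below builds the URL and parses a response of the shape shown above; the function names and the endpoint string are placeholders, and no network call is made.

```python
import json
from urllib.parse import urlencode

def search_url(search_endpoint, query):
    """Build the search URL; urlencode turns spaces into '+'."""
    return f"https://{search_endpoint}/2013-01-01/search?{urlencode({'q': query})}"

def hit_ids(response_text):
    """Extract the matching document IDs from a search response body."""
    body = json.loads(response_text)
    return [hit["id"] for hit in body["hits"]["hit"]]

url = search_url("search-example.us-east-1.cloudsearch.amazonaws.com", "iron man")

# Abbreviated response in the same shape as the example above.
sample = ('{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4}, '
          '"hits": {"found": 2, "start": 0, '
          '"hit": [{"id": "tt1228705"}, {"id": "tt0371746"}]}}')
ids = hit_ids(sample)
# ids == ['tt1228705', 'tt0371746']
```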
Complex Queries
Faceting
Drilldown
Adjustable Ranking
Highlighting
Availability Options
Scaling Options
IAM Integration
Configuration API Only
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
    { "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
  ]
}
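The policy above allows every CloudSearch action on the domain except deleting it, because in IAM an explicit Deny overrides any Allow. The sketch below is a heavily simplified evaluator written only to illustrate that rule; real IAM evaluation also handles conditions, resource wildcards, and multiple policies, and the `is_allowed` helper is hypothetical.

```python
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["cloudsearch:*"], "Resource": DOMAIN_ARN},
        {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"],
         "Resource": DOMAIN_ARN},
    ],
}

def is_allowed(policy, action, resource):
    """Simplified check: explicit Deny wins, otherwise any Allow, else denied."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue
        if not any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            continue
        if statement["Effect"] == "Deny":
            return False          # explicit deny overrides everything
        if statement["Effect"] == "Allow":
            allowed = True
    return allowed                # default is implicit deny

can_index = is_allowed(POLICY, "cloudsearch:IndexDocuments", DOMAIN_ARN)   # True
can_delete = is_allowed(POLICY, "cloudsearch:DeleteDomain", DOMAIN_ARN)    # False
```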
Closing Thoughts
•  Content Discovery goes hand in hand with Content. Search is
everywhere!
•  Amazon CloudSearch is a fully managed, easy-to-use, cost-effective
search service: easy to build, easy to scale
•  Get the powerful search features found in open source engines
(Apache Solr) combined with value-added AWS features (easy setup,
on-demand pricing, auto scaling, Multi-AZ, global availability)
Questions?
Jon Handler (handler@amazon.com)
Pravin Muthukumar (pravinm@amazon.com)

Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

  • 1. Build a Scalable Search Engine With Amazon CloudSearch
  • 2. Agenda •  Introduction to Search •  Amazon CloudSearch •  Building with CloudSearch
  • 6. Representation of a Document
      Field: Value
      id: tt0371746
      title: Iron Man
      description: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
      director: Jon Favreau
      actors: Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
      rating: 7.9
      release_date: 2008-05-02T00:00:00Z
  • 8. Geo •  Latlon data type •  Region search •  Distance sort •  Supports mobile
  • 9. Text Processing (Normalization) •  Tokenization (parsing) •  Downcasing •  Stemming •  Stopword removal •  Synonym addition
      Before: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
      After: when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
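The normalization steps on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the stopword set and the suffix rules are assumptions made up for the example, not the dictionaries or stemmer that CloudSearch uses, so its output will not match the slide's "after" text word for word.

```python
import re

# Tiny illustrative stopword list and suffix rules. Both are assumptions,
# not the dictionaries CloudSearch actually ships with.
STOPWORDS = {"a", "an", "he", "is", "its", "the", "to"}
SUFFIXES = ("ing", "ed", "ly", "s")


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # tokenize + downcase
    kept = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [stem(t) for t in kept]                      # naive stemming


print(normalize("Tony Stark decides to build an armored suit"))
```

A real stemmer handles many more cases (for example mapping "forced" to "force" rather than "forc"), which is why production engines rely on language-specific analyzers.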
  • 10. Indexing
      Term: Documents (Posting List)
      Iron: The Man in the Iron Mask, Iron Man 2, Iron Man, The Iron Giant, The Iron Lady, ...
      Man: Rain Man, The Man in the Moon, Iron Man 2, The Lawnmower Man, The Third Man, Iron Man, ...
  • 11. Matching
      Iron: The Man in the Iron Mask, Iron Man 2, Iron Man, The Iron Giant, The Iron Lady
      Man: Rain Man, The Man in the Moon, Iron Man 2, The Lawnmower Man, The Third Man, Iron Man
      In both lists: Iron Man 2, Iron Man
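The indexing and matching slides can be illustrated with a small inverted index. The sketch below is an assumption-laden toy: it uses a handful of the titles from the slides with made-up integer ids, splits on whitespace only, and treats a multi-word query as an AND over posting lists.

```python
from collections import defaultdict

# Hypothetical document ids mapped to a few titles from the slides.
TITLES = {
    1: "Iron Man",
    2: "Iron Man 2",
    3: "The Iron Giant",
    4: "Rain Man",
    5: "The Third Man",
}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.lower().split():
            index[term].add(doc_id)
    return index


def match_all(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


index = build_index(TITLES)
print(sorted(TITLES[i] for i in match_all(index, "iron man")))
```

Intersecting posting lists is what keeps matching fast: the engine only touches documents that already contain at least one query term.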
  • 12. Ranking and Relevance •  The meat of the search engine •  TF-IDF – uniqueness and presence •  Additional Criteria –  Measures of document value (e.g. rating) –  Observed user behavior –  Freshness
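As a rough illustration of the TF-IDF idea named on this slide, the sketch below scores documents with a textbook variant (term frequency times the log of inverse document frequency). The slides do not give CloudSearch's actual relevance formula, so the function name, the weighting, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(docs: dict[str, list[str]], query: list[str]) -> dict[str, float]:
    """Score each document as the sum of tf * idf over the query terms."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term]:
                score += (tf[term] / len(tokens)) * math.log(n / df[term])
        scores[doc_id] = score
    return scores


docs = {
    "a": ["iron", "man"],
    "b": ["rain", "man"],
    "c": ["iron", "giant", "iron"],
}
print(tf_idf_scores(docs, ["iron"]))
```

The additional criteria on the slide (rating, user behavior, freshness) would typically be blended into this base score as extra weighted terms.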
  • 13. Summary •  Search makes data accessible •  Search documents gather information about one search target •  Reverse indices provide the basis of text-text matching •  Relevance brings the best matches
  • 15. Building a Search service •  Build your own –  Extend datastores and build custom relevance engine •  Open Source –  Apache Solr, ElasticSearch •  Enterprise Search –  FAST, Autonomy, Endeca
  • 16. Challenges with building a Search service •  COMPLEX: Requires extensive search expertise •  COSTLY: High upfront expenditure •  SLOW: Long time to market. Slows innovation •  UNDIFFERENTIATED: Operational overhead that doesn’t add value to core product
  • 17. Where CloudSearch fits in the picture Amazon CloudSearch is a fully managed search service in the cloud that makes it easy to set up, operate, and scale a search solution for your website or application. Similar benefits to other AWS managed services: •  Easy to set up and operate (Console, SDK, CLT) •  Pay as you go •  No need to guess capacity •  Experiment fast with low risk •  Go global in minutes
  • 19. Automatic Scaling [Diagram: a grid of search instances, each holding one copy of one index partition (Index Partition 1..n by Copy 1..n). Axes: DATA (document quantity and size) across index partitions; TRAFFIC (search request volume and complexity) across copies.]
  • 21. Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
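  • 24. March 2014 CloudSearch Launch •  Support for 33 languages: Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish
  • 26. Japanese Text Processing •  Morphological analysis (形態素解析) –  The task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each (http://ja.wikipedia.org/wiki/形態素解析) •  Unlike languages such as English, where words are separated by spaces, Japanese needs Japanese-specific parsing –  Example: 彼はエンジニアだ ("He is an engineer") •  彼 (noun, pronoun) / は (particle, binding particle) / エンジニア (noun, general) / だ (auxiliary verb) •  Extracting "エンジニア" and building the index from it makes a fast response possible when a user searches for "エンジニア"
  • 27. Japanese Text Processing •  Normalization –  A search for エンジニア typed in half-width kana and a search for エンジニア typed in full-width kana should both produce hits –  CloudSearch supports this –  For deeper normalization, depending on requirements, it may be preferable to implement one of the following yourself •  NFD (Canonical Decomposition): Normalization Form D •  NFC (Canonical Composition): Normalization Form C •  NFKD (Compatibility Decomposition): Normalization Form KD •  NFKC (Compatibility Composition): Normalization Form KC
  • 28. Japanese Text Processing •  Stemming –  飲んだ → 飲ん (verb, independent; baseForm: 飲む) / だ (auxiliary verb) → 飲む –  Entries can be added to the stemming dictionary (also via the API/SDK)
  • 29. Japanese Text Processing •  Stopword Removal –  Removes words that carry no meaning on their own, such as 「の」, 「は」, and 「か」 –  As with stemming, entries can be added to the stopword dictionary (also via the API/SDK)
  • 30. Japanese Text Processing •  Synonym Addition –  Synonym (同義語): words with the same meaning •  「ベニス」「ベネチア」「ヴェネチア」 (all "Venice") •  「昨年」「去年」 (both "last year") –  Because they mean the same thing, a search for one should also hit the others –  Can be added in the same way as stopwords and stemming entries
  • 31. Japanese Text Processing •  Synonym Addition –  Entries can be added to the synonym dictionary (also via the API/SDK) •  Alias –  Searching for pupil hits documents containing student –  Searching for student does not hit documents containing pupil •  Group –  Searching for any of 1st, first, or one –  hits all documents containing 1st, first, or one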
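For the case where you do the deeper normalization yourself, Python's standard `unicodedata` module implements all four forms listed on this slide. The sketch below assumes NFKC is the form you want for queries, since it folds half-width katakana into full-width; the helper name `normalize_query` is hypothetical.

```python
import unicodedata


def normalize_query(text: str) -> str:
    """Fold compatibility variants (e.g. half-width kana) into canonical forms."""
    return unicodedata.normalize("NFKC", text)


half_width = "ｴﾝｼﾞﾆｱ"    # half-width katakana, voiced mark as a separate character
full_width = "エンジニア"  # full-width katakana

# Both spellings normalize to the same string, so they can share index terms.
print(normalize_query(half_width) == normalize_query(full_width))
```

Applying the same normalization to documents before upload and to queries before search keeps the two sides consistent.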
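The difference between an alias and a group can be made concrete with a small sketch. The `groups`/`aliases` dictionary layout below mirrors the JSON shape that CloudSearch synonym options use as far as I know, but treat it as an assumption; the `expand` helper is purely hypothetical and only models the behavior described on the slide, not how CloudSearch applies synonyms internally.

```python
# Synonym dictionary using the slide's own examples.
SYNONYMS = {
    "groups": [["1st", "first", "one"]],   # every member matches every other
    "aliases": {"pupil": ["student"]},     # one-directional: pupil -> student
}


def expand(term: str, synonyms: dict) -> set[str]:
    """Return the set of terms a search for `term` should match."""
    expanded = {term}
    for group in synonyms["groups"]:
        if term in group:
            expanded.update(group)
    expanded.update(synonyms["aliases"].get(term, []))
    return expanded


print(sorted(expand("pupil", SYNONYMS)))
print(sorted(expand("student", SYNONYMS)))
print(sorted(expand("first", SYNONYMS)))
```

Aliases suit cases where one term is a narrower or preferred form of another, while groups suit true equivalents.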
  • 32. Document Upload
      http(s)://<document service endpoint>/2013-01-01/documents/batch

      Accept: application/json
      Content-Length: 1176
      Content-Type: application/json
      Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

      [
        { "type" : "add",
          "id" : "tt0371746",
          "fields" : {
            "directors" : [ "Jon Favreau" ],
            "release_date" : "2008-04-14T00:00:00Z",
            "rating" : 7.9,
            "genres" : [ "Action", "Adventure", "Sci-Fi" ],
            "image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
            "plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
            "title" : "Iron Man",
            "rank" : 171,
            "running_time_secs" : 7560,
            "actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
            "year" : 2008 } },
        { "type" : "delete",
          "id" : "tt0434409" }
      ]
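A document batch like the one on this slide can be assembled with the Python standard library. This is a sketch under assumptions: the endpoint host is a placeholder, the `build_batch_request` helper is hypothetical, the `type`/`id`/`fields` keys follow the 2013-01-01 batch format as I understand it, and a real domain may also require signed requests depending on its access policy. The request is built but not sent, so the example runs offline.

```python
import json
import urllib.request

API_VERSION = "2013-01-01"


def build_batch_request(doc_endpoint: str, batch: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a JSON document batch."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{doc_endpoint}/{API_VERSION}/documents/batch",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


batch = [
    # Add (or replace) one document, identified by its id.
    {"type": "add", "id": "tt0371746", "fields": {"title": "Iron Man", "year": 2008}},
    # Remove another document from the index.
    {"type": "delete", "id": "tt0434409"},
]

# Placeholder host; substitute your domain's document service endpoint.
request = build_batch_request("doc.example-domain.us-east-1.cloudsearch.amazonaws.com", batch)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted here on purpose.
```

Grouping many add and delete operations into one batch keeps the number of HTTP round trips low.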
  • 33. Simple Queries Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 34. Simple Queries
      http(s)://<search endpoint>/2013-01-01/search?q=iron+man

      {"status": {"rid": "oei6zt8oAgq5QOc=",
                  "time-ms": 4},
       "hits": {"found": 9, "start": 0,
                "hit": [
                  {"id": "tt1228705"},
                  {"id": "tt0120744"},
                  {"id": "tt0371746"},
                  {"id": "tt1866249"},
                  {"id": "tt0119558"},
                  {"id": "tt0402894"},
                  {"id": "tt1258972"},
                  {"id": "tt1300854"},
                  {"id": "tt0462465"} ] } }
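The request and response on this slide map onto a few lines of Python. In this sketch the helper names `search_url` and `hit_ids` are hypothetical, the endpoint is a placeholder, and the response is a shortened copy of the one on the slide rather than live output.

```python
import json
from urllib.parse import urlencode


def search_url(search_endpoint: str, query: str, **params: str) -> str:
    """Build a search URL; urlencode turns spaces into '+' as on the slide."""
    query_string = urlencode({"q": query, **params})
    return f"https://{search_endpoint}/2013-01-01/search?{query_string}"


def hit_ids(response_body: str) -> list[str]:
    """Pull the matching document ids out of a search response."""
    return [hit["id"] for hit in json.loads(response_body)["hits"]["hit"]]


# Shortened sample response in the shape shown on the slide.
sample = """
{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},
 "hits": {"found": 9, "start": 0,
          "hit": [{"id": "tt1228705"}, {"id": "tt0120744"}, {"id": "tt0371746"}]}}
"""

print(search_url("search.example-domain.us-east-1.cloudsearch.amazonaws.com", "iron man"))
print(hit_ids(sample))
```

Note that `found` counts all matches while `hit` holds only the current page, so the two can differ.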
  • 35. Complex Queries Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 36. Faceting Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 37. Drilldown Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
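The breadcrumb shown on the Complex Queries, Faceting, and Drilldown slides (Sci-Fi, 2008 to 2010, Downey, "Iron") could be expressed as a text query plus a structured filter and a facet request. The sketch below is a guess at how that might look: it assumes `genres`, `actors`, and `year` are index fields configured for filtering and faceting (as the upload slide's document suggests), it uses the `fq` and `facet.FIELD` request parameters of the 2013-01-01 search API as I understand them, and the helper names are hypothetical. Values are not escaped, so it is not safe for arbitrary user input.

```python
from urllib.parse import urlencode


def build_filter(genre: str, actor: str, year_from: int, year_to: int) -> str:
    """Compose a structured filter: every clause must match (and ...)."""
    return f"(and genres:'{genre}' actors:'{actor}' year:[{year_from},{year_to}])"


def drilldown_url(search_endpoint: str, text: str, filter_query: str) -> str:
    params = {
        "q": text,               # free-text part, e.g. "iron"
        "fq": filter_query,      # narrows results without affecting scores
        "facet.genres": "{}",    # ask for per-genre counts alongside the hits
    }
    return f"https://{search_endpoint}/2013-01-01/search?{urlencode(params)}"


fq = build_filter("Sci-Fi", "downey", 2008, 2010)
print(fq)
print(drilldown_url("search.example-domain.us-east-1.cloudsearch.amazonaws.com", "iron", fq))
```

Each breadcrumb step adds one clause to the filter, and the facet counts returned with each response tell the UI which further refinements are available.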
  • 38. Adjustable Ranking Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 39. Highlighting Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 40. Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  • 43. IAM Integration (Configuration API only)
      {
        "Version": "2012-10-17",
        "Statement": [
          { "Effect": "Allow",
            "Action": ["cloudsearch:*"],
            "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
          { "Effect": "Deny",
            "Action": ["cloudsearch:DeleteDomain"],
            "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
        ]
      }
  • 44. Closing Thoughts •  Content discovery goes hand in hand with content. Search is everywhere! •  Amazon CloudSearch is a fully managed, easy-to-use, cost-effective search service: easy to build, easy to scale •  Get the powerful search features found in open source engines (Apache Solr) combined with value-add AWS features (easy setup, on-demand pricing, auto scaling, Multi-AZ, global availability)
  • 45. Questions? Jon Handler (handler@amazon.com) Pravin Muthukumar (pravinm@amazon.com)