Searching Wikipedia with Amazon CloudSearch
Agenda
• Project Background
• High-level Architecture
• Summary & Observations
Project Background
• Amazon contracted with Search Technologies to help with beta-testing prior to the launch of Amazon CloudSearch
• Wikipedia was chosen as a convenient data set for testing purposes
High-level Architecture
Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform:
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
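The processing steps listed above can be sketched in a few lines of Python. This is not Aspire and not the project's actual pipeline; it is a minimal illustration, assuming a simplified dump layout (`page`, `title`, `id`, `timestamp`, `text`) and a batch layout (`batch` / `add` / `field`) modeled on the XML document format CloudSearch accepted at launch. The field names `title`, `updated` and `body`, and the `wiki_` id prefix, are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny stand-in for a Wikipedia dump file (real dumps are far larger and namespaced).
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Alan Turing</title>
    <id>1208</id>
    <revision>
      <timestamp>2012-03-01T12:34:56Z</timestamp>
      <text>'''Alan Turing''' was a mathematician.</text>
    </revision>
  </page>
  <page>
    <title>Search engine</title>
    <id>4059023</id>
    <revision>
      <timestamp>2012-02-15T08:00:00Z</timestamp>
      <text>A '''search engine''' retrieves documents.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop an XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def first_text(elem, name):
    """Return the text of the first descendant called `name`, in document order."""
    for node in elem.iter():
        if local_name(node.tag) == name:
            return (node.text or "").strip()
    return ""


def normalize_date(timestamp):
    """Date normalization: ISO 8601 UTC timestamp -> integer epoch seconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def clean_text(text):
    """Cleanup: strip wiki bold/italic quote markers (a deliberately small subset)."""
    return re.sub(r"'{2,}", "", text).strip()


def iter_pages(stream):
    """Splitting: yield one document per <page> without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {
            "id": "wiki_" + first_text(elem, "id"),  # hypothetical id scheme
            "title": first_text(elem, "title"),
            "updated": normalize_date(first_text(elem, "timestamp")),
            "body": clean_text(first_text(elem, "text")),
        }
        elem.clear()  # release the parsed subtree before moving on
        yield doc


def to_batch_xml(docs):
    """Metadata mapping: serialize documents into one XML batch (assumed layout)."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(batch, "add", id=doc["id"], version="1", lang="en")
        for name in ("title", "updated", "body"):
            field = ET.SubElement(add, "field", name=name)
            field.text = str(doc[name])
    return ET.tostring(batch, encoding="unicode")


if __name__ == "__main__":
    print(to_batch_xml(iter_pages(io.BytesIO(SAMPLE_DUMP))))
```

Streaming with `iterparse` matters here because the dump files are large; each page is converted and released before the next one is read.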
Aspire in Brief
• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch and GSA
• Tested with Elasticsearch and SharePoint 2013
XML Input
Indexing
• Streaming Wikipedia dump files directly into CloudSearch
• 500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
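To put the 500 docs/second figure in perspective, the small calculation below estimates wall-clock indexing time for a given corpus size. The corpus size used in the example (4,000,000 documents) is a hypothetical round number, not a figure from the project.

```python
def indexing_hours(doc_count, docs_per_second=500):
    """Estimated hours to index `doc_count` documents at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second / 3600


if __name__ == "__main__":
    # Hypothetical corpus size; 4,000,000 / 500 = 8,000 seconds.
    print(f"{indexing_hours(4_000_000):.2f} hours")
```

At the stated rate, a corpus of a few million documents would index in a matter of hours, assuming the rate is sustained throughout.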
Searching
• Amazon CloudSearch provides a RESTful/XML interface for search purposes
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
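As an illustration of what a RESTful search request looks like, the sketch below builds a query URL in Python. It does not reproduce the Java API mentioned above. The endpoint host is a made-up placeholder, and the API version string and parameter names (`q`, `size`, `start`, `return-fields`, `facet`, `results-type`) are assumptions modeled on the search API CloudSearch offered at launch; check them against the current service documentation before use.

```python
from urllib.parse import urlencode

API_VERSION = "2011-02-01"  # assumption: launch-era API version string


def build_search_url(endpoint, query, size=10, start=0,
                     return_fields=("title",), facets=()):
    """Build a search URL; no request is sent."""
    params = [
        ("q", query),
        ("size", size),
        ("start", start),
        ("results-type", "xml"),  # assumed switch for XML responses
    ]
    if return_fields:
        params.append(("return-fields", ",".join(return_fields)))
    if facets:
        params.append(("facet", ",".join(facets)))
    return f"http://{endpoint}/{API_VERSION}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and field names.
    print(build_search_url(
        "search-wikipedia-example.us-east-1.cloudsearch.amazonaws.com",
        "alan turing",
        size=20,
        return_fields=("title", "updated"),
        facets=("category",),
    ))
```

Keeping URL construction in one function makes it easy for a UI layer to change paging or facet settings without touching the HTTP code.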
Searching
• Supports navigators and relevancy customization
• E.g. a “PageRank” style link analysis was performed
• Limits set high: e.g. retrieve 500,000 results in a single list, delivered in just a few seconds
• Very useful for analysis applications
• So, what does it look like?
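For readers unfamiliar with “PageRank” style link analysis, here is a minimal sketch of the textbook power-iteration form. The slides do not say how the project's analysis was implemented, so this is the generic algorithm, not the project's code; the tiny link graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Invented four-page link graph.
    graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["C"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

A score like this could be stored as a numeric field on each document and blended into ranking; how the original project did so is not stated in the slides.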
[Screenshots of the search UI at wikipedia.searchtechnologies.com]
Summary & Observations
• A capable and scalable “raw” engine
• XML in, RESTful/XML out
• Easy to set up, much the same as an EC2 instance
• Elastic scalability
Summary & Observations
• Cost effective
• From $75 per month, including management / maintenance
• Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility
Editor's Notes

  1. For further information about Aspire, see http://www.searchtechnologies.com/aspire.html
  2. The Java API for Amazon CloudSearch can be downloaded from http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html