SlideShare uma empresa Scribd logo
1 de 25
Lessons from building a search
engine with Amazon Web Services
            Chirayu Patel
     chirayu@snappyfingers.com
Chirayu Patel
                       Developer
                     SnappyFingers
           Question and Answer Search Engine
                   100% in the cloud



5-Apr-09                CloudCamp - Bangalore
My experiences
                 What worked?
                 What didn’t?
           What could be done better?
                What did I miss?




5-Apr-09           CloudCamp - Bangalore
Recap - AWS?
      •    EC2 – Elastic Compute Cloud
      •    S3 – Simple Storage Service
      •    SQS – Simple Queue Service
      •    SDB – SimpleDB




5-Apr-09                   CloudCamp - Bangalore
Why AWS?
      •    Computing Power requirements unknown
      •    Cheap
      •    Availability of multiple services
      •    Easy to implement SnappyFingers architecture
           using AWS services




5-Apr-09                   CloudCamp - Bangalore
SnappyFingers
      • Information Retrieval System (IRS)
      • FrontEnd
           – Nothing unique here




5-Apr-09                    CloudCamp - Bangalore
Three motivations (behind my decisions)
      • Reluctance to learn
      • Cost Conscious
      • I write buggy code




5-Apr-09                  CloudCamp - Bangalore
Architectural Requirements
      •    Loose Coupled
      •    Scalable
      •    Fault Tolerant
      •    Budget dependent




5-Apr-09                 CloudCamp - Bangalore
IRS Architecture
           Pipeline
                                                                                            SQS
                      Pipe Crawler           Pipe Parser            Pipe Indexer




                                                                                           EC2
                 Crawler                        Parser                     Indexer




                                                EC2 + S3                             SDB
                               Data Store                               Errors



5-Apr-09                                    CloudCamp - Bangalore
Pipes and Pipelines




5-Apr-09         CloudCamp - Bangalore
Pipes and Pipelines
      • Pipes contain jobs
      • Pipeline is a group of pipe
      • Easy to create pipelines and add pipes




5-Apr-09                 CloudCamp - Bangalore
Job ORM
                                                                SQS API
      class CrawlerJob (JobBase):
                                                                SDB API
          class SDBInterfaceConfig:
              domain_name = settings.CRAWLER_JOB_DOMAIN

           class SQSInterfaceConfig:
               queue_name = settings.CRAWLER_JOB_QUEUE
               timeout = settings.CRAWLER_JOB_TIMEOUT

           class AWSMetaData:
               action = CharField (...)
               url = CharField (...)
               ...
               ...
      Default attributes of each Job:
      • Pipeline Name
      • Status
      • Start Time
      • End Time
      • Id

5-Apr-09                                CloudCamp - Bangalore
Job Processing
      for i in range (num_of_jobs):
         try:
             job = cls.jobclass.sqs_get() # process job
             ...
         except Exception, e:
             job.job_processing_complete(…)
             fsdebug.mail_admins (..)
             end_transaction(rollback = True)
             job.sdb_save() # save in error store
          finally:
             job.sqs_del() # delete the job



5-Apr-09                   CloudCamp - Bangalore
The Good
      • Architecture easy to extend
      • ORM approach is a big time saver
      • Simple to add new services




5-Apr-09                CloudCamp - Bangalore
The Bad
      • Messages may be lost
           – Service Failure
           – SQS deletes messages after 4 days.


      Imp: System should be able to recreate jobs




5-Apr-09                     CloudCamp - Bangalore
Storage




5-Apr-09   CloudCamp - Bangalore
What do we store?
      • Crawler Data – Web Pages
      • Extracted Content – Questions/Answers
      • Backups




5-Apr-09                CloudCamp - Bangalore
Storage Structure

                        Meta Data                        Key + Value


           Postgres                                 S3




5-Apr-09                    CloudCamp - Bangalore
ORM
      • Extended Django ORM to support S3

      class S3WebPage (S3Model):
          _allowed_attrs = [quot;urlquot;, quot;contentquot;, ..]
          _name = quot;S3WebPage“
          ...
          ...




5-Apr-09               CloudCamp - Bangalore
The Good
      • Extremely scalable
      • Possible to store Python objects in S3
      • Latency issues can be solved by using a
        caching layer
      • No need to backup S3 data
      • Storage is cheap



5-Apr-09                 CloudCamp - Bangalore
The Bad
      • Postgres + S3 is not an elegant solution
           – Periodic syncing of Postgres and S3 required
      • High transaction costs
           – $.01 per 1000 PUT,COPY,POST or LIST Requests
           – $.01 per 10000 GET Requests




5-Apr-09                      CloudCamp - Bangalore
Computing




5-Apr-09    CloudCamp - Bangalore
EC2 – The Good
      • Computing needs are not constant
      • Data transfer to other AWS services is free
      • AMI’s per node type




5-Apr-09                 CloudCamp - Bangalore
The bad
      • Missed having a nerve center
           – Budget
           – Job Load
           – CPU load
      • Low cost 64bit severs are not available




5-Apr-09                 CloudCamp - Bangalore
Thank You

Mais conteúdo relacionado

Mais procurados

AWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWSAWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
Amazon Web Services
 

Mais procurados (17)

Infrastructure as a code: a cloud approach
Infrastructure as a code: a cloud approachInfrastructure as a code: a cloud approach
Infrastructure as a code: a cloud approach
 
AWS September Webinar Series - Visual Effects Rendering in the AWS Cloud with...
AWS September Webinar Series - Visual Effects Rendering in the AWS Cloud with...AWS September Webinar Series - Visual Effects Rendering in the AWS Cloud with...
AWS September Webinar Series - Visual Effects Rendering in the AWS Cloud with...
 
Best practices to use aws in countryside.
Best practices to use aws in countryside.Best practices to use aws in countryside.
Best practices to use aws in countryside.
 
AWS Public Sector Symposium 2014 Canberra | Managing Seasonal Workloads on AWS
AWS Public Sector Symposium 2014 Canberra | Managing Seasonal Workloads on AWS AWS Public Sector Symposium 2014 Canberra | Managing Seasonal Workloads on AWS
AWS Public Sector Symposium 2014 Canberra | Managing Seasonal Workloads on AWS
 
ShadowReader - Serverless load tests for replaying production traffic
ShadowReader - Serverless load tests for replaying production trafficShadowReader - Serverless load tests for replaying production traffic
ShadowReader - Serverless load tests for replaying production traffic
 
Gpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on awsGpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on aws
 
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
 
Mvp skill saturday ep09 _06072019_azure updates - july 2019
Mvp skill saturday ep09 _06072019_azure updates - july 2019Mvp skill saturday ep09 _06072019_azure updates - july 2019
Mvp skill saturday ep09 _06072019_azure updates - july 2019
 
Siebel CRM in Production - What Now?
Siebel CRM in Production - What  Now?Siebel CRM in Production - What  Now?
Siebel CRM in Production - What Now?
 
"Production-ready Serverless Java Applications in 3 weeks" at AWS Community D...
"Production-ready Serverless Java Applications in 3 weeks" at AWS Community D..."Production-ready Serverless Java Applications in 3 weeks" at AWS Community D...
"Production-ready Serverless Java Applications in 3 weeks" at AWS Community D...
 
Communication tool & Environment for Remote Worker
Communication tool & Environment for Remote WorkerCommunication tool & Environment for Remote Worker
Communication tool & Environment for Remote Worker
 
How to improve lambda cold starts
How to improve lambda cold startsHow to improve lambda cold starts
How to improve lambda cold starts
 
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWSAWS Cloud Kata | Bangkok - Getting to Scale on AWS
AWS Cloud Kata | Bangkok - Getting to Scale on AWS
 
Empowering Amazon EC2 with GigaSpaces XAP
Empowering Amazon EC2 with GigaSpaces XAPEmpowering Amazon EC2 with GigaSpaces XAP
Empowering Amazon EC2 with GigaSpaces XAP
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less ops
 
SpringOne Tour St. Louis - Serverless Spring
SpringOne Tour St. Louis - Serverless SpringSpringOne Tour St. Louis - Serverless Spring
SpringOne Tour St. Louis - Serverless Spring
 
Experiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlExperiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and Postgresql
 

Destaque

The power of abstraction
The power of abstractionThe power of abstraction
The power of abstraction
ACMBangalore
 

Destaque (9)

ACM Bangalore Distinguished Speaker Program
ACM Bangalore Distinguished Speaker ProgramACM Bangalore Distinguished Speaker Program
ACM Bangalore Distinguished Speaker Program
 
Opening Remarks - Cloud Symposium
Opening Remarks - Cloud SymposiumOpening Remarks - Cloud Symposium
Opening Remarks - Cloud Symposium
 
Overview of FreeBSD PMC Tools
Overview of FreeBSD PMC ToolsOverview of FreeBSD PMC Tools
Overview of FreeBSD PMC Tools
 
The power of abstraction
The power of abstractionThe power of abstraction
The power of abstraction
 
Securing Wireless Cellular Systems
Securing Wireless Cellular SystemsSecuring Wireless Cellular Systems
Securing Wireless Cellular Systems
 
In Home Safety Tips
In Home Safety TipsIn Home Safety Tips
In Home Safety Tips
 
cloud - internet rengineering
cloud - internet rengineeringcloud - internet rengineering
cloud - internet rengineering
 
Automated Design of Digital Microfluids Lab-on-Chip
Automated Design of Digital Microfluids Lab-on-ChipAutomated Design of Digital Microfluids Lab-on-Chip
Automated Design of Digital Microfluids Lab-on-Chip
 
Social Network Analysis (SNA) and its implications for knowledge discovery in...
Social Network Analysis (SNA) and its implications for knowledge discovery in...Social Network Analysis (SNA) and its implications for knowledge discovery in...
Social Network Analysis (SNA) and its implications for knowledge discovery in...
 

Semelhante a Lesson from Building a Search Engine using the cloud

Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
SQUADEX
 
Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
George Ang
 

Semelhante a Lesson from Building a Search Engine using the cloud (20)

Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)
 
How to Train Your Classifier: Create a Serverless Machine Learning System wit...
How to Train Your Classifier: Create a Serverless Machine Learning System wit...How to Train Your Classifier: Create a Serverless Machine Learning System wit...
How to Train Your Classifier: Create a Serverless Machine Learning System wit...
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
 
Cloud Computing in Practice
Cloud Computing in PracticeCloud Computing in Practice
Cloud Computing in Practice
 
2019 StartIT - Boosting your performance with Blackfire
2019 StartIT - Boosting your performance with Blackfire2019 StartIT - Boosting your performance with Blackfire
2019 StartIT - Boosting your performance with Blackfire
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Scaling Mapufacture on Amazon Web Services
Scaling Mapufacture on Amazon Web ServicesScaling Mapufacture on Amazon Web Services
Scaling Mapufacture on Amazon Web Services
 
Serverless Data Lake on AWS
Serverless Data Lake on AWSServerless Data Lake on AWS
Serverless Data Lake on AWS
 
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
 
Serverless in production, an experience report (LNUG)
Serverless in production, an experience report (LNUG)Serverless in production, an experience report (LNUG)
Serverless in production, an experience report (LNUG)
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
 
Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduce
 
Serverless in Production, an experience report (cloudXchange)
Serverless in Production, an experience report (cloudXchange)Serverless in Production, an experience report (cloudXchange)
Serverless in Production, an experience report (cloudXchange)
 
Azure and/or AWS: How to Choose the best cloud platform for your project
Azure and/or AWS: How to Choose the best cloud platform for your projectAzure and/or AWS: How to Choose the best cloud platform for your project
Azure and/or AWS: How to Choose the best cloud platform for your project
 
Navigate Data Service using AWS
Navigate Data Service using AWSNavigate Data Service using AWS
Navigate Data Service using AWS
 
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
 
AWS Lambda from the trenches
AWS Lambda from the trenchesAWS Lambda from the trenches
AWS Lambda from the trenches
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
 
Aws Introduction, technology and $ sense
Aws Introduction, technology and $ senseAws Introduction, technology and $ sense
Aws Introduction, technology and $ sense
 
Big Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the CloudBig Data on EC2: Mashing Technology in the Cloud
Big Data on EC2: Mashing Technology in the Cloud
 

Mais de ACMBangalore

Mais de ACMBangalore (9)

Clouds in emerging markets
Clouds in emerging marketsClouds in emerging markets
Clouds in emerging markets
 
Opportunites and Challenges in Cloud COmputing
Opportunites and Challenges in Cloud COmputingOpportunites and Challenges in Cloud COmputing
Opportunites and Challenges in Cloud COmputing
 
Perspectives on Cloud COmputing - Google
Perspectives on Cloud COmputing - GooglePerspectives on Cloud COmputing - Google
Perspectives on Cloud COmputing - Google
 
Making of a Successful Cloud Business
Making of a Successful Cloud BusinessMaking of a Successful Cloud Business
Making of a Successful Cloud Business
 
Web Business Platforms on the Cloud
Web Business Platforms on the CloudWeb Business Platforms on the Cloud
Web Business Platforms on the Cloud
 
Badrinath Ramamurthy Cloud Infrastructure
Badrinath Ramamurthy   Cloud InfrastructureBadrinath Ramamurthy   Cloud Infrastructure
Badrinath Ramamurthy Cloud Infrastructure
 
market oriented cloud
market oriented cloudmarket oriented cloud
market oriented cloud
 
Case study - SaaS Abs Experience Jan07 09
Case study - SaaS Abs Experience Jan07 09Case study - SaaS Abs Experience Jan07 09
Case study - SaaS Abs Experience Jan07 09
 
virtualization tutorial at ACM bangalore Compute 2009
virtualization tutorial at ACM bangalore Compute 2009virtualization tutorial at ACM bangalore Compute 2009
virtualization tutorial at ACM bangalore Compute 2009
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Lesson from Building a Search Engine using the cloud

  • 1. Lessons from building a search engine with Amazon Web Services Chirayu Patel chirayu@snappyfingers.com
  • 2. Chirayu Patel Developer SnappyFingers Question and Answer Search Engine 100% in the cloud 5-Apr-09 CloudCamp - Bangalore
  • 3. My experiences What worked? What didn’t? What could be done better? What did I miss? 5-Apr-09 CloudCamp - Bangalore
  • 4. Recap - AWS? • EC2 – Elastic Compute Cloud • S3 – Simple Storage Service • SQS – Simple Queue Service • SDB – SimpleDB 5-Apr-09 CloudCamp - Bangalore
  • 5. Why AWS? • Computing Power requirements unknown • Cheap • Availability of multiple services • Easy to implement SnappyFingers architecture using AWS services 5-Apr-09 CloudCamp - Bangalore
  • 6. SnappyFingers • Information Retrieval System (IRS) • FrontEnd – Nothing unique here 5-Apr-09 CloudCamp - Bangalore
  • 7. Three motivations (behind my decisions) • Reluctance to learn • Cost Conscious • I write buggy code 5-Apr-09 CloudCamp - Bangalore
  • 8. Architectural Requirements • Loose Coupled • Scalable • Fault Tolerant • Budget dependent 5-Apr-09 CloudCamp - Bangalore
  • 9. IRS Architecture Pipeline SQS Pipe Crawler Pipe Parser Pipe Indexer EC2 Crawler Parser Indexer EC2 + S3 SDB Data Store Errors 5-Apr-09 CloudCamp - Bangalore
  • 10. Pipes and Pipelines 5-Apr-09 CloudCamp - Bangalore
  • 11. Pipes and Pipelines • Pipes contain jobs • Pipeline is a group of pipe • Easy to create pipelines and add pipes 5-Apr-09 CloudCamp - Bangalore
  • 12. Job ORM SQS API class CrawlerJob (JobBase): SDB API class SDBInterfaceConfig: domain_name = settings.CRAWLER_JOB_DOMAIN class SQSInterfaceConfig: queue_name = settings.CRAWLER_JOB_QUEUE timeout = settings.CRAWLER_JOB_TIMEOUT class AWSMetaData: action = CharField (...) url = CharField (...) ... ... Default attributes of each Job: • Pipeline Name • Status • Start Time • End Time • Id 5-Apr-09 CloudCamp - Bangalore
  • 13. Job Processing for i in range (num_of_jobs): try: job = cls.jobclass.sqs_get() # process job ... except Exception, e: job.job_processing_complete(…) fsdebug.mail_admins (..) end_transaction(rollback = True) job.sdb_save() # save in error store finally: job.sqs_del() # delete the job 5-Apr-09 CloudCamp - Bangalore
  • 14. The Good • Architecture easy to extend • ORM approach is a big time saver • Simple to add new services 5-Apr-09 CloudCamp - Bangalore
  • 15. The Bad • Messages may be lost – Service Failure – SQS deletes messages after 4 days. Imp: System should be able to recreate jobs 5-Apr-09 CloudCamp - Bangalore
  • 16. Storage 5-Apr-09 CloudCamp - Bangalore
  • 17. What do we store? • Crawler Data – Web Pages • Extracted Content – Questions/Answers • Backups 5-Apr-09 CloudCamp - Bangalore
  • 18. Storage Structure Meta Data Key + Value Postgres S3 5-Apr-09 CloudCamp - Bangalore
  • 19. ORM • Extended Django ORM to support S3 class S3WebPage (S3Model): _allowed_attrs = [quot;urlquot;, quot;contentquot;, ..] _name = quot;S3WebPage“ ... ... 5-Apr-09 CloudCamp - Bangalore
  • 20. The Good • Extremely scalable • Possible to store Python objects in S3 • Latency issues can be solved by using a caching layer • No need to backup S3 data • Storage is cheap 5-Apr-09 CloudCamp - Bangalore
  • 21. The Bad • Postgres + S3 is not an elegant solution – Periodic syncing of Postgres and S3 required • High transaction costs – $.01 per 1000 PUT,COPY,POST or LIST Requests – $.01 per 10000 GET Requests 5-Apr-09 CloudCamp - Bangalore
  • 22. Computing 5-Apr-09 CloudCamp - Bangalore
  • 23. EC2 – The Good • Computing needs are not constant • Data transfer to other AWS services is free • AMI’s per node type 5-Apr-09 CloudCamp - Bangalore
  • 24. The bad • Missed having a nerve center – Budget – Job Load – CPU load • Low cost 64bit severs are not available 5-Apr-09 CloudCamp - Bangalore