Find Me a Roof !
project for “Gestione dell’informazione sul Web” class
                    AA 2009-2010
Alessandro Manfredi & Marco Bontempi & Marco Giannone
     {a.n0on3,bontempi,marco.giannone}@gmail.com
Goals
✓ Build a search engine for the vertical domain of real-estate
  advertisements.

✓ Index and link information from multiple sources.
✓ Design it so that adding sources is easy.
✓ Enrich incomplete information by integrating web
  services.

✓ Provide a user-friendly interface for efficient, localized
  searches that can be restricted to specific domain fields.

✓ “Did you mean ... ?” and search suggestions.
✓ Deploy on Amazon EC2/S3.
Preview
Preview ( autocomplete )
Preview ( results )
Preview ( did you mean ... ? )
What we used
Back End Overview
[Diagram] Components shown:
  • roof bots
  • url repository
  • Download & Dispatch
  • Extractor 1, Extractor 2, ..., Extractor n
  • DB
  • LUCENE Indexes: Main, SpellChecker, AutoCompleter
• Why the DB ? This will be explained later ...
Crawling
• Collecting information from
  • www.trova-casa.net
  • www.immobiliare.it
• First attempt on trova-casa.net :
  • multithreaded brute force on
    same-structured URLs: after 75 k ...
 • ... we got banned :-)
Crawling

• WebSphinx ( Carnegie Mellon University )
   • http://www-2.cs.cmu.edu/~rcm/websphinx/

• Timeout: 1s
• Scope limited to Rome and
   surroundings

   • Regex on the URLs to visit and to save
   • Coordinate filtering
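
The URL regex step can be sketched as follows. This is a minimal illustration in Python, not the WebSphinx configuration used in the project: the host `www.example.com`, the path layout, and the names `VISIT_PATTERN`, `SAVE_PATTERN` and `classify` are all assumptions, since the slides do not show the real patterns.

```python
import re

# Hypothetical patterns: the real sites' URL layouts are not given in the slides.
# Pages under the Rome area are worth following ...
VISIT_PATTERN = re.compile(r"^https?://www\.example\.com/(roma|lazio)/")
# ... but only single-ad pages are stored for extraction.
SAVE_PATTERN = re.compile(
    r"^https?://www\.example\.com/(roma|lazio)/annuncio-\d+\.html$"
)


def classify(url: str) -> str:
    """Return 'save' for ad pages, 'visit' for pages to follow, else 'skip'."""
    if SAVE_PATTERN.match(url):
        return "save"
    if VISIT_PATTERN.match(url):
        return "visit"
    return "skip"
```

Keeping one pattern for pages to follow and a stricter one for pages to store keeps the crawl scoped without saving index pages.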
Crawling
• For some reason, WebSphinx stopped before reaching
  all of the real-estate ads...

• We wrote a simple PHP roofbot:
  • Start from the sitemaps
  • Reach the index pages
  • Collect URLs along given navigation paths
• This way we reached all of the ~87k ads
  available in Rome and surroundings.
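
The sitemap-driven approach can be sketched like this. The original roofbot was written in PHP and its code is not shown, so this Python version is only an illustration: the sample sitemap, the `example.com` URLs and the function names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def select(urls: list[str], pattern: str) -> list[str]:
    """Keep only the URLs that match the given navigation-path regex."""
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]


# Hypothetical sitemap content, used only to exercise the functions.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/roma/index-1.html</loc></url>
  <url><loc>http://www.example.com/roma/index-2.html</loc></url>
  <url><loc>http://www.example.com/milano/index-1.html</loc></url>
</urlset>"""

index_pages = select(urls_from_sitemap(SAMPLE), r"/roma/index-\d+\.html$")
```

Starting from the sitemap lets the bot enumerate index pages directly instead of discovering them by following links.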
Data Extraction
• HtmlUnit + Neko
• JTidy + XPath
    ( even though #562127 (JTidy) forced us to skip a few fields )

• Information collected :
     • Data ( realty type, contract type, address,
          surface, price, coordinates, contacts )
     • Text ( description )
• Data has been cleaned with regexes
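
A regex cleaning step of this kind might look like the sketch below. The raw formats ("€ 250.000", "85 mq") are assumptions about how the ads present prices and surfaces, and `clean_price` and `clean_surface` are hypothetical helper names.

```python
import re


def clean_price(raw: str):
    """Parse an Italian-style price such as '€ 250.000,00' into an integer.

    Dots are treated as thousands separators; decimals after a comma are dropped.
    Returns None when no number is found.
    """
    m = re.search(r"\d{1,3}(?:\.\d{3})+|\d+", raw)
    return int(m.group(0).replace(".", "")) if m else None


def clean_surface(raw: str):
    """Parse a surface such as '85 mq' or '120 m²' into square metres, or None."""
    m = re.search(r"(\d+)\s*(?:mq|m2|m²)", raw, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None
```

Returning `None` for unparsable values lets later stages, such as the average-price query, skip incomplete ads explicitly.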
Data Enrichment
• Using the Google Maps API and web services
   • Adding coordinates from the address
       • Geocoding WS with CSV output :
         http://maps.google.com/maps/geo?output=csv&sensor=false&q=...

   • Adding the address from coordinates
       • Geocoding API WS, max 2,500 requests / day :
         http://maps.google.com/maps/api/geocode/xml?sensor=false&latlng=...

• This works for 83% of the performed requests.
   • E.g. it fails when street numbers are unknown to Google
       or when street names are misspelled.
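
The forward-geocoding call can be sketched as below, without performing any network request. The endpoint and query parameters are the ones quoted on the slide; the CSV reply layout (status, accuracy, latitude, longitude, with status 200 meaning success) is an assumption about that legacy service, which may no longer be available. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as quoted on the slide (legacy service).
GEO_ENDPOINT = "http://maps.google.com/maps/geo"


def geocode_url(address: str) -> str:
    """Build the request URL for an address, with the slide's query parameters."""
    params = {"output": "csv", "sensor": "false", "q": address}
    return GEO_ENDPOINT + "?" + urlencode(params)


def parse_csv_reply(line: str):
    """Parse one CSV reply line into (lat, lng), or None on failure.

    Assumed layout: status,accuracy,latitude,longitude with status 200 on success.
    """
    parts = line.strip().split(",")
    if len(parts) != 4 or parts[0] != "200":
        return None
    return float(parts[2]), float(parts[3])
```

Treating every non-success reply as `None` makes it easy to count failed lookups, as in the 83% success rate reported above.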
Text search
• While the user is typing, the AutoCompleter
  index is queried via JavaScript to give
  suggestions.

• The Main index is used for search
  • If fewer results than a threshold are
    returned, or if the highest score is too
    low, the SpellChecker index is invoked to
    guess possible spelling errors, and results
    for the corrected query are also
    displayed.
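
The fallback logic can be sketched as follows. This is not the project's Lucene code: `difflib.get_close_matches` from the Python standard library stands in for the SpellChecker index, and the thresholds `MIN_RESULTS` and `MIN_TOP_SCORE`, the `search` callable and the vocabulary are all assumptions.

```python
import difflib

MIN_RESULTS = 3      # assumed threshold on the number of hits
MIN_TOP_SCORE = 0.5  # assumed threshold on the best score


def search_with_fallback(query, search, vocabulary):
    """Run the main search; if results are weak, also search a corrected query.

    `search` maps a query string to a list of (doc_id, score) pairs.
    Returns (hits, suggestion, extra_hits).
    """
    hits = search(query)
    top = max((score for _, score in hits), default=0.0)
    suggestion, extra = None, []
    if len(hits) < MIN_RESULTS or top < MIN_TOP_SCORE:
        # Stand-in for the SpellChecker index: closest known word per term.
        corrected = " ".join(
            (difflib.get_close_matches(w, vocabulary, n=1, cutoff=0.7) or [w])[0]
            for w in query.split()
        )
        if corrected != query:
            suggestion, extra = corrected, search(corrected)
    return hits, suggestion, extra
```

A caller would show `hits` as usual and, when `suggestion` is set, add a "Did you mean ... ?" line followed by `extra`.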
Suggestions

• In practice, since the AutoCompleter index often
  returned results for negligible words and
  does not support phrase queries,
  we return suggestions by searching a
  list of common locations and keywords.

• In production, this list could be fed with
  the most common searches.
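
Suggesting from a curated list can be sketched as below. The phrase list and the `suggest` function are hypothetical; the list is assumed to be ordered by popularity, so that the most common entries come first.

```python
def suggest(prefix: str, phrases: list[str], limit: int = 5) -> list[str]:
    """Return up to `limit` phrases starting with `prefix`, case-insensitively.

    Whole phrases are matched, so multi-word locations work as well.
    """
    p = prefix.strip().lower()
    if not p:
        return []
    return [ph for ph in phrases if ph.lower().startswith(p)][:limit]


# Hypothetical list of common locations, most popular first.
PHRASES = ["San Giovanni", "San Lorenzo", "Trastevere", "Monteverde"]
```

Because whole phrases are matched, a multi-word prefix such as "san l" narrows the list to "San Lorenzo", which a word-level index would not do.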
Why use a DB ?
        • To take advantage of indexes for
          efficient range searches for data
          analysis.
        • E.g. provide the average price per unit of
          surface in a location, over a selectable range.
        • Chance to delegate filtering to the

[Diagram: QUERY, LUCENE Main Index, DB, ID-based Merge, Results]
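
An ID-based merge can be sketched as below, assuming that the Lucene search returns ranked (id, score) pairs and that the DB returns the IDs of the ads passing the range filters. Both the data shapes and the name `merge_by_id` are assumptions.

```python
def merge_by_id(lucene_hits, db_ids):
    """Keep the Lucene hits whose ID also passed the DB filter.

    The Lucene ranking order is preserved.
    """
    allowed = set(db_ids)
    return [(doc_id, score) for doc_id, score in lucene_hits if doc_id in allowed]
```

Using a set for the DB IDs keeps the merge linear in the number of Lucene hits.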
An Example
-- X, Y, Z, W are the bounds of the selected area
SELECT avg("Prezzo"/"Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
AND "Latitudine" < X AND "Latitudine" > Y
AND "Longitudine" > Z AND "Longitudine" < W
AND "Superficie" != 0 AND "Prezzo" != 0 ;
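
The query above can be tried on an in-memory SQLite database, as in this sketch. The table layout, the column types and the sample rows are assumptions made for illustration; the slides do not say which database engine the project used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the columns used by the query, stored as REAL
# so that the division is not an integer division.
conn.execute(
    'CREATE TABLE "Annunci" ("Contratto" TEXT, "Latitudine" REAL, '
    '"Longitudine" REAL, "Superficie" REAL, "Prezzo" REAL)'
)
conn.executemany(
    'INSERT INTO "Annunci" VALUES (?, ?, ?, ?, ?)',
    [
        ("Vendita", 41.90, 12.50, 100, 300000),  # 3000 per unit
        ("Vendita", 41.91, 12.51, 50, 200000),   # 4000 per unit
        ("Affitto", 41.90, 12.50, 80, 1000),     # wrong contract type
        ("Vendita", 45.46, 9.19, 100, 500000),   # outside the area
        ("Vendita", 41.90, 12.50, 0, 100000),    # missing surface
    ],
)

QUERY = """
SELECT avg("Prezzo" / "Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
  AND "Latitudine" < :lat_max AND "Latitudine" > :lat_min
  AND "Longitudine" > :lon_min AND "Longitudine" < :lon_max
  AND "Superficie" != 0 AND "Prezzo" != 0
"""


def average_price_per_unit(conn, lat_min, lat_max, lon_min, lon_max):
    """Average sale price per unit of surface inside the bounding box."""
    params = {"lat_min": lat_min, "lat_max": lat_max,
              "lon_min": lon_min, "lon_max": lon_max}
    return conn.execute(QUERY, params).fetchone()[0]


local_avg = average_price_per_unit(conn, 41.8, 42.0, 12.4, 12.6)
```

Named parameters replace the X, Y, Z, W placeholders, which avoids building the SQL string by hand.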
The current implementation
 • Filtering is performed at application level
   over the Lucene main index results
 • The database is used for data analysis

[Diagram, top to bottom: QUERY, LUCENE Main Index, Data Analysis (from the DB), Merge, Results]
Data Analysis
• Right now, limited to comparing an ad's price
  with the local price per unit of surface.
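
The comparison can be sketched as a relative difference, as below. The function name and the choice of a relative (rather than absolute) difference are assumptions; the slides only state that the ad is compared with the local price per unit of surface.

```python
def price_vs_local(price: float, surface: float, local_avg_per_unit: float):
    """Relative difference between an ad's unit price and the local average.

    0.10 means 10% above the local average; None when data is missing.
    """
    if surface <= 0 or local_avg_per_unit <= 0:
        return None
    unit_price = price / surface
    return (unit_price - local_avg_per_unit) / local_avg_per_unit
```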
Geolocation




• Users can navigate the map to select their
  area of interest and filter out ads
  located outside it, even if they match the
  query.
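
The map-based filter can be sketched as a bounding-box test applied to the search results at application level. The `Ad` type and the function names are hypothetical, and the box is assumed not to cross the 180° meridian, which holds for an area around Rome.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    id: str
    lat: float
    lon: float


def inside(ad: Ad, south: float, west: float, north: float, east: float) -> bool:
    """True when the ad lies within the visible map rectangle."""
    return south <= ad.lat <= north and west <= ad.lon <= east


def filter_by_viewport(ads, south, west, north, east):
    """Drop ads outside the rectangle, keeping the original ranking order."""
    return [ad for ad in ads if inside(ad, south, west, north, east)]
```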
Deploy on AWS

• Launch and configure an EC2 instance, starting
  from a community-provided “Debian” Linux AMI
  ( Amazon Machine Image )

• Save the instance to S3 to preserve the
  filesystem:
  •   ec2-bundle-vol -k <KEY> -c <CERT> -u <USER-ID> --destination /mnt --exclude /mnt

  •   ec2-upload-bundle -b <S3-bucket-name> -m /mnt/image.manifest.xml -a <ACCESS-KEY> -s <SECRET-KEY>

  •   ec2-register <S3-bucket-name>/image.manifest.xml -n <AMI-NAME> -K <KEY> -C <CERT>
Find Me a Roof !
( we won't let you live under a bridge )

Thanks

