Presentation for the "Find Me a Roof" project, a search engine for the vertical domain of real-estate advertisements.
Developed as a case study for the "Web Information Management" class.
1. Find Me a Roof !
project for “Gestione dell’informazione sul Web” class
Academic Year 2009-2010
Alessandro Manfredi & Marco Bontempi & Marco Giannone
{a.n0on3,bontempi,marco.giannone}@gmail.com
2. Goals
✓ Build a search engine for the vertical domain of real-estate
advertisements.
✓ Index and link information from multiple sources.
✓ Design so that adding new sources is easy.
✓ Enrich sparse information through web-service
integration.
✓ Provide a user-friendly interface for localized,
domain-field-selective, efficient searches.
✓ “Did you mean ... ?” and search suggestions.
✓ Deploy on Amazon EC2/S3.
8. Back End Overview
[ Diagram: roof bots fill a URL repository; a Download & Dispatch
component feeds pages to Extractor 1 ... Extractor n, coordinated by a
Main Extractor; output goes to the Lucene indexes ( Main, SpellChecker,
AutoCompleter ) and, through a DB Extractor, to the DB. ]
9. Back End Overview
[ Same back-end diagram as the previous slide ]
Why the DB ?
will be explained later ...
11. Crawling
• Collecting information from
• www.trova-casa.net
• www.immobiliare.it
• First attempt on trova-casa.net:
• multithreaded brute force on same-structured
URLs: after ~75k requests ...
• ... we got banned :-)
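The brute-force attempt can be sketched roughly as below: ad pages share a same-structured URL with a numeric ID, so IDs are simply enumerated across a thread pool. The URL template, ID range, and pool size are illustrative assumptions, not the project's actual values.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the multithreaded brute-force crawl. URL_TEMPLATE and the
// pool size are assumptions made for illustration.
public class BruteForceCrawler {
    static final String URL_TEMPLATE = "http://www.trova-casa.net/annuncio/%d";

    // Build the candidate URL for a given ad ID.
    static String urlFor(int id) {
        return String.format(URL_TEMPLATE, id);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int id = 1; id <= 100; id++) {
            String url = urlFor(id);
            // A real crawler would fetch and store the page here.
            pool.submit(() -> System.out.println("would fetch " + url));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

Hammering a server this way with no delay between requests is exactly what triggered the ban on the next slide.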
12. Crawling
• WebSphinx ( Carnegie Mellon University )
• http://www-2.cs.cmu.edu/~rcm/websphinx/
• Timeout: 1s
• Scope limited to Rome and its
surroundings
• Regexes on URLs to visit and to save
• Coordinate filtering
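The two URL regexes might look like the sketch below: one pattern selects links worth following, the other selects ad pages worth saving. Both expressions are invented for illustration, not the project's actual ones.

```java
import java.util.regex.Pattern;

// Sketch of the crawl-time URL filtering: VISIT gates which links the
// crawler follows, SAVE gates which pages are kept as ads.
// Both patterns are illustrative assumptions.
public class UrlFilter {
    static final Pattern VISIT = Pattern.compile(".*trova-casa\\.net/(ricerca|elenco).*");
    static final Pattern SAVE  = Pattern.compile(".*trova-casa\\.net/annunci/\\d+.*");

    static boolean shouldVisit(String url) { return VISIT.matcher(url).matches(); }
    static boolean shouldSave(String url)  { return SAVE.matcher(url).matches(); }
}
```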
13. Crawling
• Somehow WebSphinx stopped before reaching
all of the realty ads ...
• We wrote a simple PHP roofbot that:
• starts from the sitemaps
• reaches the indexing pages
• collects URLs along given navigation paths
• This way we reached all of the ~87k ads
available in Rome and its surroundings.
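The roofbot's first step, pulling URLs out of a sitemap, can be sketched as below. The original script was PHP; this sketch is in Java for consistency with the rest of the system, and the regex-based parsing is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of sitemap-driven URL collection: extract every <loc> entry
// from the sitemap XML, then follow the index pages they point to.
public class SitemapReader {
    static final Pattern LOC = Pattern.compile("<loc>\\s*(.+?)\\s*</loc>");

    static List<String> extractLocs(String sitemapXml) {
        List<String> urls = new ArrayList<>();
        Matcher m = LOC.matcher(sitemapXml);
        while (m.find()) urls.add(m.group(1));
        return urls;
    }
}
```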
14. Data Extraction
• HtmlUnit + Neko
• JTidy + XPath
( even though JTidy bug #562127 forced us to skip a few fields )
• Information collected:
• Data ( realty type, contract type, address,
surface, price, coordinates, contacts )
• Text ( description )
• Data has been cleaned with regexes
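The regex cleaning can be sketched as below: scraped strings such as "€ 250.000" or "85 mq" are reduced to plain integers. The exact patterns are assumptions; the project's own cleaning rules may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of regex-based field cleaning for scraped price and surface
// values. Patterns are illustrative assumptions.
public class FieldCleaner {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Strip currency symbols and thousands separators, keep the digits.
    static int parsePrice(String raw) {
        String digits = raw.replaceAll("[^0-9]", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    // Take the first number in strings like "85 mq".
    static int parseSurface(String raw) {
        Matcher m = DIGITS.matcher(raw);
        return m.find() ? Integer.parseInt(m.group()) : 0;
    }
}
```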
15. Data Enrichment
• Using the Google Maps API and web services
• Adding coordinates from the address
• Geocoding WS with CSV output:
• http://maps.google.com/maps/geo?output=csv&sensor=false&q=...
• Adding the address from coordinates
• API Geocoding WS, max 2,500 requests / day:
• http://maps.google.com/maps/api/geocode/xml?sensor=false&latlng=...
• This works for 83% of the performed requests.
• e.g. it fails when street numbers are outside Google's
knowledge or when street names are mistyped.
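A minimal sketch of the address-to-coordinates call, built on the CSV geocoding URL shown above. The assumed response layout, `status,accuracy,lat,lng`, matches the old CSV geocoder but should be treated as an assumption here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of the geocoding enrichment step: build the CSV request URL
// for an address and parse the comma-separated response.
// The response layout (status,accuracy,lat,lng) is an assumption.
public class Geocoder {
    static String buildRequestUrl(String address) {
        return "http://maps.google.com/maps/geo?output=csv&sensor=false&q="
                + URLEncoder.encode(address, StandardCharsets.UTF_8);
    }

    // Parse e.g. "200,6,41.89,12.49" into {lat, lng}; null if status != 200.
    static double[] parseCsv(String csv) {
        String[] parts = csv.split(",");
        if (parts.length != 4 || !parts[0].equals("200")) return null;
        return new double[]{Double.parseDouble(parts[2]), Double.parseDouble(parts[3])};
    }
}
```

A null result here corresponds to the ~17% of requests that fail, e.g. mistyped street names.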
16. Text search
• While the user is typing, the AutoCompleter
index is queried via JavaScript to provide
suggestions.
• The Main index is used for the search.
• If fewer results than a threshold are
returned, or if the highest score is too
low, the SpellChecker index is invoked to
guess possible spelling errors, and results
for the deduced corrected query are also
displayed.
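The fallback condition can be sketched as below; the threshold values are placeholders, not the ones actually tuned for the project.

```java
// Sketch of the spell-check fallback decision: consult the SpellChecker
// index when the Main index returns too few hits or a weak top score.
// MIN_HITS and MIN_TOP_SCORE are assumed values.
public class SearchFallback {
    static final int MIN_HITS = 5;
    static final float MIN_TOP_SCORE = 0.3f;

    static boolean needsSpellCheck(int hitCount, float topScore) {
        return hitCount < MIN_HITS || topScore < MIN_TOP_SCORE;
    }
}
```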
17. Suggestions
• Actually, since the AutoCompleter index often
returned results for negligible words and
doesn't support phrase queries, we instead
return suggestions by searching a list of
common locations and keywords.
• In production, this list may be fed with the
most common searches.
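The list-based replacement can be sketched as a simple prefix match; the sample terms below are illustrative, not the project's curated list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the suggestion strategy that replaced the AutoCompleter
// index: match the typed prefix against a curated list of common
// locations and keywords. COMMON is an assumed sample list.
public class Suggester {
    static final List<String> COMMON = List.of(
            "trastevere", "monteverde", "prati", "appartamento", "attico");

    static List<String> suggest(String prefix) {
        String p = prefix.toLowerCase(Locale.ITALIAN);
        List<String> out = new ArrayList<>();
        for (String term : COMMON)
            if (term.startsWith(p)) out.add(term);
        return out;
    }
}
```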
18. Why use a DB ?
• To take advantage of indexes for
efficient in-range searches in data
analysis.
• E.g. provide the average price per surface
unit in a location, with a pickable range.
• Chance to delegate filtering to the DB.
[ Diagram: the query goes to both the Lucene Main index and the DB;
results are merged by ID. ]
19. An Example
SELECT avg("Prezzo"/"Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
AND "Latitudine" < X AND "Latitudine" > Y
AND "Longitudine" > Z AND "Longitudine" < W
AND "Superficie" != 0 AND "Prezzo" != 0 ;
20. The current implementation
• Filtering is performed at the application level
over the Lucene Main index results
• The database is used for data analysis
[ Diagram: the query hits the Lucene Main index; the DB performs the data
analysis; results are merged. ]
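The application-level filtering can be sketched as below: Lucene returns ranked hits, and range filters (price, surface, ...) are applied in the application layer while preserving the ranking. The `Ad` record is a hypothetical stand-in for the real result object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of application-level filtering over Lucene Main index results.
// Ad is an assumed shape for a search hit.
public class AppLevelFilter {
    record Ad(int id, int price, int surface) {}

    static List<Ad> filter(List<Ad> luceneHits, Predicate<Ad> rangeFilter) {
        List<Ad> out = new ArrayList<>();
        for (Ad ad : luceneHits)
            if (rangeFilter.test(ad)) out.add(ad);   // keep Lucene's ranking order
        return out;
    }
}
```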
21. Data Analysis
• Right now, limited to the comparison
with the local price per surface unit.
22. Geolocation
• Users can navigate the map to select their
location of interest and filter out ads
located outside it, even if they match the
query.
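The map-based filtering amounts to a bounding-box test on each ad's coordinates; the class below is a sketch, and the coordinate ordering of the constructor is an assumption.

```java
// Sketch of the geolocation filter: ads whose coordinates fall outside
// the user's selected map viewport are dropped even if they match the
// text query.
public class BoundingBox {
    final double south, west, north, east;

    BoundingBox(double south, double west, double north, double east) {
        this.south = south; this.west = west; this.north = north; this.east = east;
    }

    boolean contains(double lat, double lng) {
        return lat >= south && lat <= north && lng >= west && lng <= east;
    }
}
```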
23. Deploy on AWS
• Launch and configure an EC2 AMI ( Amazon
Machine Image ) starting from the community-
provided "Debian" Linux AMI
• Save the instance on S3 to preserve the
filesystem:
• ec2-bundle-vol -k <KEY> -c <CERT> -u <USER-ID> --destination /mnt --exclude /mnt
• ec2-upload-bundle -b <S3-bucket-name> -m /mnt/image.manifest.xml -a <ACCESS-KEY> -s <SECRET-KEY>
• ec2-register <S3-bucket-name>/image.manifest.xml -n <AMI-NAME> -K <KEY> -C <CERT>
24. Find Me a Roof !
( we won’t let you live under a bridge )
Thanks
project for “Gestione dell’informazione sul Web” class
Academic Year 2009-2010
Alessandro Manfredi & Marco Bontempi & Marco Giannone
{a.n0on3,bontempi,marco.giannone}@gmail.com