Outline                           WIRE Project                   Web Crawler               Conclusions




            WIRE: an Open Source Web Information
                    Retrieval Environment

                           Carlos Castillo and Ricardo Baeza-Yates
                                            Center for Web Research
                                             http://www.cwr.cl/
                                          CS Dept., University of Chile


                                              OSWIR 2005
                                           Compiegne, France
                                           September 19, 2005

Carlos Castillo and Ricardo Baeza-Yates                                        Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                        http://www.cwr.cl/




          1 WIRE Project



          2 Web Crawler



          3 Conclusions







Motivation


          Study subsets of the Web (1-50 million pages)
            ✓ We want high performance
            ✓ We want to keep as much data as possible
            ✓ We want to study scheduling algorithms
            ✗ wget is not enough
            ✗ Large-scale crawlers were not publicly available







General Architecture

          [Figure: block diagram with the Collection at the center.
           Left: Crawling, Focused Crawling, Importing.
           Right: XML Index (with XML Search), Text Index (with Text Search),
           Statistics, Extracting.
           Bottom: Clustering, Classification.]






Characteristics



          - Roughly 25,000 lines of open-source C/C++ code
          - Asynchronous DNS and HTTP requests; small memory
            and processing requirements (except during the analysis)
          - Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.







Web Crawler
          [Figure: four modules arranged around a central Collection]
          - Manager: page score calculations, long-term scheduling
          - Harvester: short-term scheduling, network transfers
          - Gatherer: parsing, link extraction
          - Seeder: link resolving, robots exclusions





Scheduling


          Future Value - Current Value = Profit

          Page   quality   freshness   visited?   Future Value   Current Value   Profit
          P1     0.4       0.1         1          0.4            0.04            0.36
          P2     0.7       0.9         1          0.7            0.63            0.07
          P3     0.6       -           0          0.6            0               0.6
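The three example pages are consistent with a simple rule. This rule is inferred from the numbers on the slide, not stated there: the future value of a page equals its quality, and its current value equals quality times freshness if the page has been visited, or 0 if it has not. A minimal Python sketch under that assumption follows; the `Page` class and the function names are hypothetical and are not taken from the WIRE code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page was never downloaded
    visited: bool


def future_value(page: Page) -> float:
    # Assumption: a freshly downloaded copy is worth the full quality score.
    return page.quality


def current_value(page: Page) -> float:
    # Assumption: an unvisited page contributes nothing to the collection yet.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def profit(page: Page) -> float:
    return future_value(page) - current_value(page)


pages = [
    Page("P1", quality=0.4, freshness=0.1, visited=True),
    Page("P2", quality=0.7, freshness=0.9, visited=True),
    Page("P3", quality=0.6, freshness=None, visited=False),
]

# Pages with the highest profit are the best candidates for the next download.
ranked = sorted(pages, key=profit, reverse=True)
for page in ranked:
    print(page.name, round(profit(page), 2))
```

Under this rule a high-quality page that is still fresh (P2) ranks below a lower-quality page that is stale (P1) or has never been fetched (P3).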




Downloading pages


          [Figure: the World Wide Web drawn as seven Web sites, S1 to S7,
           each with its own column of Web pages Pi,1, Pi,2, ...;
           the columns hold between two and six pages.]



Storing contents
          [Figure: storing a downloaded document]
          1. Compute hash(document) and check "Content seen?"
          2. Consult the free space list
          3. Write to disk storage
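The storage figure can be read as three steps: a content-seen test on a hash of the document, a lookup in a free space list, and a write to disk. The sketch below illustrates that reading in Python. It is not WIRE's implementation: the class name, the choice of SHA-1, the first-fit allocation policy, and the in-memory `bytearray` standing in for the disk file are all assumptions.

```python
import hashlib


class ContentStore:
    """Toy document store; a bytearray stands in for the on-disk file."""

    def __init__(self):
        self.seen = {}      # content hash -> doc ID that first stored it
        self.free = []      # free space list: (offset, length) holes
        self.index = {}     # doc ID -> (offset, length)
        self.storage = bytearray()

    def store(self, doc_id, content: bytes):
        # Step 1: content-seen test on a hash of the document.
        digest = hashlib.sha1(content).hexdigest()
        if digest in self.seen:
            return self.seen[digest]  # duplicate: point at the original
        # Step 2: find room, reusing a hole from the free space list if any.
        offset = self._allocate(len(content))
        # Step 3: write the bytes to storage.
        self.storage[offset:offset + len(content)] = content
        self.index[doc_id] = (offset, len(content))
        self.seen[digest] = doc_id
        return doc_id

    def _allocate(self, size):
        for i, (offset, length) in enumerate(self.free):
            if length >= size:  # first fit (assumed policy)
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        offset = len(self.storage)  # no hole fits: grow the file
        self.storage.extend(b"\0" * size)
        return offset

    def delete(self, doc_id):
        offset, length = self.index.pop(doc_id)
        self.free.append((offset, length))
        self.seen = {h: d for h, d in self.seen.items() if d != doc_id}

    def read(self, doc_id):
        offset, length = self.index[doc_id]
        return bytes(self.storage[offset:offset + length])
```

Storing the same bytes under a second doc ID returns the ID of the first copy, and a deleted document's space is reused by a later, smaller document.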




URL parsing

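          URL: http://host.domain.com/dir/file.html
          1. Hash the host name: h1('host.domain.com')
          2. Look up the site entry: host.domain.com -> 235
          3. Hash the site ID plus the path: h2('235 dir/file.html')
          4. Look up the document entry: 235 dir/file.html -> 9421
          Result: SITE-ID = 235; DOC-ID = 9421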
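The two-level lookup in the figure can be sketched in Python as follows. This is an illustration only: Python dictionaries stand in for the hash functions h1 and h2 and their tables, IDs are assigned sequentially (so the values 235 and 9421 from the figure will not be reproduced), and the `UrlIndex` class is a hypothetical name.

```python
from urllib.parse import urlsplit


class UrlIndex:
    def __init__(self):
        self.sites = {}  # host name -> site ID (plays the role of h1)
        self.docs = {}   # "<site ID> <path>" -> doc ID (plays the role of h2)

    def resolve(self, url):
        parts = urlsplit(url)
        host = parts.hostname or ""     # urlsplit lowercases the host name
        path = parts.path.lstrip("/")
        # Steps 1-2: map the host name to a site ID, creating one if needed.
        site_id = self.sites.setdefault(host, len(self.sites) + 1)
        # Steps 3-4: map "<site ID> <path>" to a doc ID, creating one if needed.
        key = f"{site_id} {path}"
        doc_id = self.docs.setdefault(key, len(self.docs) + 1)
        return site_id, doc_id


index = UrlIndex()
print(index.resolve("http://host.domain.com/dir/file.html"))
```

Keying the second table on the site ID rather than on the host name keeps document keys short and means the host string is stored only once per site.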




Practical problems


          - The devil is in the details
          - Varying quality of service
          - Wrong DNS records, temporary DNS failures
          - HTTP responses without headers, or with wrong headers
            or dates
          - HTML parsing has to be very tolerant
          - Duplicate pages, session-ids, etc.
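One common way to reduce duplicates caused by session IDs is to strip session parameters from a URL before it is stored or compared. The slides do not say how WIRE handles this, so the sketch below is only an illustration; the list of parameter names and the function name are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of session parameter names; real sites use many variants.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url):
    parts = urlsplit(url)
    # Drop a ";jsessionid=..." suffix embedded in the path.
    path = parts.path
    cut = path.lower().find(";jsessionid=")
    if cut != -1:
        path = path[:cut]
    # Drop session parameters from the query string, keeping the rest in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


print(strip_session_ids("http://example.com/a.php?PHPSESSID=abc123&page=2"))
```

Two URLs that differ only in their session ID then normalize to the same string, so they map to a single document.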







Data analysis


          - Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          - Reports are generated using LaTeX and gnuplot
          - Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          - Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.
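As an illustration of the kind of statistic listed above, the sketch below computes in-degree and out-degree histograms from a link list in CSV form. The two-column `src,dst` layout is an assumption made for this example; the slides do not describe the format of the exported .csv files.

```python
import csv
import io
from collections import Counter


def degree_histograms(rows):
    """Return (in-degree histogram, out-degree histogram) for (src, dst) rows."""
    out_degree, in_degree = Counter(), Counter()
    for src, dst in rows:
        out_degree[src] += 1
        in_degree[dst] += 1
    # Each histogram maps a degree to the number of pages with that degree.
    # Pages with degree zero never appear in the link list and are not counted.
    return Counter(in_degree.values()), Counter(out_degree.values())


# Hypothetical sample data: four links among pages 1 to 4.
sample = io.StringIO("src,dst\n1,2\n1,3\n2,3\n4,3\n")
reader = csv.reader(sample)
next(reader)  # skip the header row
in_hist, out_hist = degree_histograms(reader)
print(sorted(in_hist.items()), sorted(out_hist.items()))
```

A table of this form (degree, count) is what a plotting tool such as gnuplot would take as input for a degree-distribution plot.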






Conclusions


           ✓ A tool for Web characterization studies
           ✓ Can be extended for other purposes
           ✓ Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.





More Related Content

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph Community
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea, Inc.
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Microsoft Azure for Research
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterIan Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaNGDATA
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkMike Taylor
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsE. Murphy
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizerJohannes Keizer
 

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne) (20)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICR
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in Tapio
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Towards a Web of Services
Towards a Web of ServicesTowards a Web of Services
Towards a Web of Services
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation Network
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer
 

WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

  • 1. WIRE: an Open Source Web Information Retrieval Environment. Carlos Castillo and Ricardo Baeza-Yates, Center for Web Research (http://www.cwr.cl/), CS Dept., University of Chile. OSWIR 2005, Compiegne, France, September 19, 2005.
  • 2. Outline: (1) WIRE Project; (2) Web Crawler; (3) Conclusions.
  • 3. Motivation. Study subsets of the Web (1-50 million pages). ✓ We want high performance. ✓ We want to keep as much data as possible. ✓ We want to study scheduling algorithms. ✗ wget is not enough. ✗ Large-scale crawlers were not publicly available.
  • 9. General Architecture [diagram; labeled components: Crawling, Focused Crawling, Importing, Collection, Text Index, Text Search, XML Index, XML Search, Extracting, Statistics, Clustering, Classification].
  • 10. Characteristics. Roughly 25,000 lines of open-source C/C++ code. Asynchronous DNS and HTTP requests; small memory and processing requirements (except during the analysis). Highly configurable: rate of download, parser parameters, scheduling policy, etc.
  • 13. Web Crawler [diagram; four modules around a central Collection: Manager (page score calculations, long-term scheduling), Harvester (short-term scheduling, network transfers), Gatherer (parsing, link extraction), Seeder (link resolving, robots exclusions)].
  • 14. Scheduling. Future value − Current value = Profit. P1 (quality 0.4, freshness 0.1, visited? 1): 0.4 − 0.04 = Profit 0.36. P2 (quality 0.7, freshness 0.9, visited? 1): 0.7 − 0.63 = Profit 0.07. P3 (quality 0.6, freshness -, visited? 0): 0.6 − 0 = Profit 0.6.
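
The scheduling slide gives only numbers, not a formula. A minimal sketch that reproduces those numbers, assuming the current value of a page is quality × freshness (0 if the page was never visited) and its future value after a fresh download is simply its quality. The `Page` class and the function names are hypothetical, not part of WIRE:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page has never been visited
    visited: bool


def current_value(page: Page) -> float:
    # Assumption: an unvisited page contributes nothing to the collection yet.
    if not page.visited:
        return 0.0
    return page.quality * page.freshness


def future_value(page: Page) -> float:
    # Assumption: right after a download the page is fully fresh.
    return page.quality


def profit(page: Page) -> float:
    return future_value(page) - current_value(page)


pages = [
    Page("P1", quality=0.4, freshness=0.1, visited=True),
    Page("P2", quality=0.7, freshness=0.9, visited=True),
    Page("P3", quality=0.6, freshness=None, visited=False),
]

# Download the most profitable pages first.
ranked = sorted(pages, key=profit, reverse=True)
for page in ranked:
    print(page.name, round(profit(page), 2))
```

Under these assumptions a stale low-quality page (P1) can outrank a fresh high-quality one (P2), which matches the ordering implied by the slide.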
  • 15. Downloading pages [diagram: the World Wide Web shown as Web sites S1 to S7, each with its own column of Web pages Pi,j (between two and six pages per site)].
  • 16. Storing contents [diagram with three numbered steps: (1) the Document is hashed, hash(Document), and checked against "Content seen?"; steps (2) and (3) involve the Disk Storage and the Free space list].
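
The storage slide names three ingredients: a content hash for duplicate detection, a disk storage area, and a free space list. The sketch below shows one way these could fit together; it is an illustration under stated assumptions, not WIRE's actual storage code. The `ContentStore` class, the use of SHA-1, the first-fit allocation policy, and the in-memory `bytearray` standing in for the disk file are all assumptions:

```python
import hashlib


class ContentStore:
    def __init__(self):
        self.disk = bytearray()  # stands in for the on-disk storage file
        self.seen = {}           # content hash -> doc_id that first stored it
        self.index = {}          # doc_id -> (offset, length)
        self.free = []           # free space list: (offset, length) holes

    def store(self, doc_id, content: bytes):
        """Store content; return the doc_id that holds it (an older one if duplicate)."""
        digest = hashlib.sha1(content).digest()
        if digest in self.seen:           # "Content seen?" -> do not store again
            return self.seen[digest]
        offset = self._allocate(len(content))
        self.disk[offset:offset + len(content)] = content
        self.index[doc_id] = (offset, len(content))
        self.seen[digest] = doc_id
        return doc_id

    def _allocate(self, size):
        # First fit: reuse the first hole that is large enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No hole fits: grow the storage at the end.
        offset = len(self.disk)
        self.disk.extend(b"\0" * size)
        return offset

    def read(self, doc_id) -> bytes:
        offset, length = self.index[doc_id]
        return bytes(self.disk[offset:offset + length])

    def delete(self, doc_id):
        # Return the document's space to the free space list.
        offset, length = self.index.pop(doc_id)
        del self.seen[hashlib.sha1(self.read_raw(offset, length)).digest()]
        self.free.append((offset, length))

    def read_raw(self, offset, length) -> bytes:
        return bytes(self.disk[offset:offset + length])


store = ContentStore()
first = store.store(1, b"hello")
duplicate = store.store(2, b"hello")   # same content, resolved to document 1
store.store(3, b"world!")
store.delete(1)                        # leaves a 5-byte hole at offset 0
store.store(4, b"abc")                 # reuses part of that hole
```

A real crawler would persist the index and the free space list and would merge adjacent holes; both are omitted here to keep the sketch short.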
  • 17. URL parsing of http://host.domain.com/dir/file.html [diagram]: (1) h1('host.domain.com') looks up the host name; (2) the entry for host.domain.com gives 235; (3) h2('235 dir/file.html') looks up the site ID followed by the path; (4) the entry for '235 dir/file.html' gives 9421. Result: SITE-ID = 235; DOC-ID = 9421.
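
The two-step lookup on the URL parsing slide can be sketched as follows. This is a minimal illustration, assuming Python dictionaries stand in for the hash tables behind h1 and h2 and that IDs are assigned sequentially on first sight; the `UrlIndex` class and its method names are hypothetical:

```python
from urllib.parse import urlsplit


class UrlIndex:
    def __init__(self):
        self.site_ids = {}  # host name -> SITE-ID (plays the role of h1)
        self.doc_ids = {}   # "SITE-ID path" -> DOC-ID (plays the role of h2)

    def resolve(self, url):
        """Return (site_id, doc_id) for a URL, assigning new IDs when needed."""
        parts = urlsplit(url)
        host = parts.hostname          # urlsplit lowercases the host name
        path = parts.path.lstrip("/")
        # Steps 1-2: host name -> SITE-ID.
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)
        # Steps 3-4: SITE-ID plus path -> DOC-ID.
        key = f"{site_id} {path}"
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id


index = UrlIndex()
a = index.resolve("http://host.domain.com/dir/file.html")
b = index.resolve("http://HOST.domain.com/dir/file.html")   # same page, same IDs
c = index.resolve("http://host.domain.com/dir/other.html")  # same site, new document
d = index.resolve("http://other.example.org/dir/file.html") # new site, new document
```

Keying the second table on the site ID rather than on the host name keeps the path keys short and lets pages with the same path on different sites receive different document IDs.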
  • 18. Practical problems. The devil is in the details: varying quality of service; wrong DNS records, temporary DNS failures; HTTP responses without headers, with wrong headers, dates; HTML parsing has to be very tolerant; duplicate pages, session-ids, etc.
  • 24. Data analysis. Includes link analysis and extraction of statistics (data is exported as .csv files). Reports are generated using LaTeX and gnuplot. Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
  • 28. Conclusions. ✓ A tool for Web characterization studies. ✓ Can be extended for other purposes. ✓ Code and documentation are available (http://www.cwr.cl/projects/). Thank you.