SlideShare uma empresa Scribd logo
1 de 26
Baixar para ler offline
Crawling the web, Nutch with Scala

    Vikas Hazrati @
about


  CTO at Knoldus Software

  Co-Founder at MyCellWasStolen.com

  Community Editor at InfoQ.com

  Dabbling with Scala – last 40 months

  Enterprise grade implementations on Scala – 18 months



                                                          2
nutch
Web search
                    crawler   link-graph   parsing
 software




             solr


 lucene


                                                         3
nutch – but we have google!


             transparent




            understanding




             extensible




                              4
nutch – basic architecture




crawler                 searcher




                                       5
nutch - architecture



          Recursive                segments

crawler




                                                     links




                                      web database   pages
                      fetchlists      Crawl db
                                                             6
nutch – crawl cycle
                                             generate – fetch – update cycle
Create crawldb

    Inject root URLs
        In crawldb
                                                 Update segments
        Generate fetchlist

                                                 Index fetched pages
          Fetch content      repeat until
                             depth reached          deduplication
         Update crawldb
                                                  Merge indexes for
                                                     searching




 bin/nutch crawl urls -dir crawl -depth 3 -topN 5
                                                                               7
nutch - plugins
                               generate – fetch – update cycle



Create crawldb               parser


    Inject root URLs
        In crawldb                     HTMLParserFilter

        Generate fetchlist


          Fetch content        URL Filter


         Update crawldb
                                      scoring filter




                                                                 8
nutch – extension points

plugin.xml            // tells Nutch about the plugin




               build.xml        // build the plugin




   ivy.xml           // plugin dependencies




                      // plugin source
         src

                                                        9
nutch - example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
version="1.0.0" provider-name="nutch.org">
    <runtime>
        <library name="kdaggregator.jar">
            <export name="*" />
        </library>
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints" />
    </requires>
    <extension id="org.apache.nutch.parse.headings" name="Nutch
Headings Parse Filter"
point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="KDParseFilter"
class="com.knoldus.aggregator.server.plugins.DetailParserFilter
"></implementation>
    </extension>
</plugin>
                                                             10
public ParseResult filter(Content content, ParseResult
parseResult, HTMLMetaTags metaTags, DocumentFragment
doc) {

      LOG.debug("Parsing URL: " + content.getUrl());

      }
      Parse parse = parseResult.get(content.getUrl());
      Metadata metadata = parse.getData().getParseMeta();
      for (String tag : tags) {
        metadata.add(TAG_KEY, tag);
      }
      return parseResult;

  }
                                                            11
scala
                  I have Java !

concurrency       verbose




        popular             Strongly typed

                                             jvm

          OO                library


                                                           12
scala
Java:
class Person {
   private String firstName;
   private String lastName;
   private int age;

    public Person(String firstName, String lastName, int age) {
      this.firstName = firstName;
      this.lastName = lastName;
      this.age = age;
    }

    public   void   setFirstName(String firstName) { this.firstName = firstName; }
    public   void   String getFirstName() { return this.firstName; }
    public   void   setLastName(String lastName) { this.lastName = lastName; }
    public   void   String getLastName() { return this.lastName; }
    public   void   setAge(int age) { this.age = age; }
    public   void   int getAge() { return this.age; }
}


Scala:
class Person(var firstName: String, var lastName: String, var age: Int)




Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i      13
scala
Java – everything is an object unless it is primitive

Scala – everything is an object. period.



Java – has operators (+, -, < ..) and methods

Scala – operators are methods



Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inferencing
val thing = new Thing
                                                                14
evolution




            15
scala and concurrency



Fine grained        coarse grained




                         Actors


                                       16
actors




         17
18
problem context


Aggregator




             UGC



                                     19
solution

             Supplier 1
Aggregator

             Supplier 2



              Supplier 3




                                      20
Create crawldb

    Inject root URLs
        In crawldb           Supplier URLs

        Generate fetchlist


          Fetch content


         Update crawldb




                             plugins written in Scala

                                                        21
logic


Crawl the supplier


                                                     Parse
                            Is URL interesting




                                                 Pass extraction to
                                                       actor

                       seed
                     database



                                                                      22
plugin - scala
class DetailParserFilter extends HtmlParseFilter {

 def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags, doc:
 DocumentFragment): ParseResult = {
   if (isDetailURL(content.getUrl)) {
     val rawHtml = content.getContent
     if (rawHtml.length > 0) processContent(rawHtml)
   }
   parseResult
 }

 private def isDetailURL(url: String): Boolean = {
   val result = url.matches(AggregatorConfiguration.regexEventDetailPages)
   result
 }

 private def processContent(rawHtml: Array[Byte]) = {
   (new DetailProcessor).start ! rawHtml
 }                                                                                     23
result

5 suppliers crawled

Crawl cycles run continuously for few days

> 500K seed data collected




All with Nutch and 823 lines of Scala code


                                                      24
demo




in action ….




                      25
resources

         http://blog.knoldus.com


http://wiki.apache.org/nutch/NutchTutorial


       http://www.scala-lang.org/


          vikas@knoldus.com



                                                 26

Mais conteúdo relacionado

Mais procurados

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm CrawlerJulien Nioche
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSSaumitra Srivastav
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in ScalaAlex Payne
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Infrastructure Monitoring with Postgres
Infrastructure Monitoring with PostgresInfrastructure Monitoring with Postgres
Infrastructure Monitoring with PostgresSteven Simpson
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 

Mais procurados (20)

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Infrastructure Monitoring with Postgres
Infrastructure Monitoring with PostgresInfrastructure Monitoring with Postgres
Infrastructure Monitoring with Postgres
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Semelhante a Harnessing the power of Nutch with Scala

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...Maarten Balliauw
 
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and SearchNDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and SearchMaarten Balliauw
 
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...Maarten Balliauw
 
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se....NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...Maarten Balliauw
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Using React with Grails 3
Using React with Grails 3Using React with Grails 3
Using React with Grails 3Zachary Klein
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Leejaxconf
 
Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)Trisha Gee
 
Java 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx FranceJava 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx FranceTrisha Gee
 
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010Arun Gupta
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLMorgan Dedmon
 
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek PiotrowskiJDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek PiotrowskiPROIDEA
 

Semelhante a Harnessing the power of Nutch with Scala (20)

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Gradle
GradleGradle
Gradle
 
Java SE 8 & EE 7 Launch
Java SE 8 & EE 7 LaunchJava SE 8 & EE 7 Launch
Java SE 8 & EE 7 Launch
 
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
 
Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016
 
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and SearchNDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
 
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
 
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se....NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Using React with Grails 3
Using React with Grails 3Using React with Grails 3
Using React with Grails 3
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
ExSchema
ExSchemaExSchema
ExSchema
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Lee
 
Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)
 
Java 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx FranceJava 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx France
 
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQL
 
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek PiotrowskiJDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
 
Spock
SpockSpock
Spock
 

Mais de Knoldus Inc.

Mastering Distributed Performance Testing
Mastering Distributed Performance TestingMastering Distributed Performance Testing
Mastering Distributed Performance TestingKnoldus Inc.
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxKnoldus Inc.
 
Introduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationIntroduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationKnoldus Inc.
 
CQRS with dot net services presentation.
CQRS with dot net services presentation.CQRS with dot net services presentation.
CQRS with dot net services presentation.Knoldus Inc.
 
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...Knoldus Inc.
 
Introduction to Buildpacks.io Presentation
Introduction to Buildpacks.io PresentationIntroduction to Buildpacks.io Presentation
Introduction to Buildpacks.io PresentationKnoldus Inc.
 
Introduction to Falco presentation.pptxx
Introduction to Falco presentation.pptxxIntroduction to Falco presentation.pptxx
Introduction to Falco presentation.pptxxKnoldus Inc.
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersKnoldus Inc.
 
Understanding System Design and Architecture Blueprints of Efficiency
Understanding System Design and Architecture Blueprints of EfficiencyUnderstanding System Design and Architecture Blueprints of Efficiency
Understanding System Design and Architecture Blueprints of EfficiencyKnoldus Inc.
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Getting Started With React Native Presntation
Getting Started With React Native PresntationGetting Started With React Native Presntation
Getting Started With React Native PresntationKnoldus Inc.
 
Elastic Search Capability Presentation.pptx
Elastic Search Capability Presentation.pptxElastic Search Capability Presentation.pptx
Elastic Search Capability Presentation.pptxKnoldus Inc.
 
Kotlin With JetPack Compose Presentation
Kotlin With JetPack Compose PresentationKotlin With JetPack Compose Presentation
Kotlin With JetPack Compose PresentationKnoldus Inc.
 
Angular AG grid and its features with Pagination
Angular AG grid and its features with PaginationAngular AG grid and its features with Pagination
Angular AG grid and its features with PaginationKnoldus Inc.
 
Grafana Loki (Monitoring Tool) Presentation
Grafana Loki (Monitoring Tool) PresentationGrafana Loki (Monitoring Tool) Presentation
Grafana Loki (Monitoring Tool) PresentationKnoldus Inc.
 
Components in Ionic Presentation (FrontEnd)
Components in Ionic Presentation (FrontEnd)Components in Ionic Presentation (FrontEnd)
Components in Ionic Presentation (FrontEnd)Knoldus Inc.
 
Testing Harmony Design Patterns & Anti-Patterns Unveiled
Testing Harmony Design Patterns & Anti-Patterns UnveiledTesting Harmony Design Patterns & Anti-Patterns Unveiled
Testing Harmony Design Patterns & Anti-Patterns UnveiledKnoldus Inc.
 
Introduction to AWS CloudWatch Presentation
Introduction to AWS CloudWatch PresentationIntroduction to AWS CloudWatch Presentation
Introduction to AWS CloudWatch PresentationKnoldus Inc.
 
Benefit of scrum ceremonies presentation
Benefit of scrum ceremonies presentationBenefit of scrum ceremonies presentation
Benefit of scrum ceremonies presentationKnoldus Inc.
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxKnoldus Inc.
 

Mais de Knoldus Inc. (20)

Mastering Distributed Performance Testing
Mastering Distributed Performance TestingMastering Distributed Performance Testing
Mastering Distributed Performance Testing
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptx
 
Introduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationIntroduction to Ansible Tower Presentation
Introduction to Ansible Tower Presentation
 
CQRS with dot net services presentation.
CQRS with dot net services presentation.CQRS with dot net services presentation.
CQRS with dot net services presentation.
 
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...
Building Resilient Software A Deep Dive into Self-Healing Test Automation Fra...
 
Introduction to Buildpacks.io Presentation
Introduction to Buildpacks.io PresentationIntroduction to Buildpacks.io Presentation
Introduction to Buildpacks.io Presentation
 
Introduction to Falco presentation.pptxx
Introduction to Falco presentation.pptxxIntroduction to Falco presentation.pptxx
Introduction to Falco presentation.pptxx
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All Developers
 
Understanding System Design and Architecture Blueprints of Efficiency
Understanding System Design and Architecture Blueprints of EfficiencyUnderstanding System Design and Architecture Blueprints of Efficiency
Understanding System Design and Architecture Blueprints of Efficiency
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Getting Started With React Native Presntation
Getting Started With React Native PresntationGetting Started With React Native Presntation
Getting Started With React Native Presntation
 
Elastic Search Capability Presentation.pptx
Elastic Search Capability Presentation.pptxElastic Search Capability Presentation.pptx
Elastic Search Capability Presentation.pptx
 
Kotlin With JetPack Compose Presentation
Kotlin With JetPack Compose PresentationKotlin With JetPack Compose Presentation
Kotlin With JetPack Compose Presentation
 
Angular AG grid and its features with Pagination
Angular AG grid and its features with PaginationAngular AG grid and its features with Pagination
Angular AG grid and its features with Pagination
 
Grafana Loki (Monitoring Tool) Presentation
Grafana Loki (Monitoring Tool) PresentationGrafana Loki (Monitoring Tool) Presentation
Grafana Loki (Monitoring Tool) Presentation
 
Components in Ionic Presentation (FrontEnd)
Components in Ionic Presentation (FrontEnd)Components in Ionic Presentation (FrontEnd)
Components in Ionic Presentation (FrontEnd)
 
Testing Harmony Design Patterns & Anti-Patterns Unveiled
Testing Harmony Design Patterns & Anti-Patterns UnveiledTesting Harmony Design Patterns & Anti-Patterns Unveiled
Testing Harmony Design Patterns & Anti-Patterns Unveiled
 
Introduction to AWS CloudWatch Presentation
Introduction to AWS CloudWatch PresentationIntroduction to AWS CloudWatch Presentation
Introduction to AWS CloudWatch Presentation
 
Benefit of scrum ceremonies presentation
Benefit of scrum ceremonies presentationBenefit of scrum ceremonies presentation
Benefit of scrum ceremonies presentation
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
 

Último

activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 

Último (20)

activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 

Harnessing the power of Nutch with Scala

  • 1. Crawling the web, Nutch with Scala Vikas Hazrati @
  • 2. about CTO at Knoldus Software Co-Founder at MyCellWasStolen.com Community Editor at InfoQ.com Dabbling with Scala – last 40 months Enterprise grade implementations on Scala – 18 months 2
  • 3. nutch Web search crawler link-graph parsing software solr lucene 3
  • 4. nutch – but we have google! transparent understanding extensible 4
  • 5. nutch – basic architecture crawler searcher 5
  • 6. nutch - architecture Recursive segments crawler links web database pages fetchlists Crawl db 6
  • 7. nutch – crawl cycle generate – fetch – update cycle Create crawldb Inject root URLs In crawldb Update segments Generate fetchlist Index fetched pages Fetch content repeat until depth reached deduplication Update crawldb Merge indexes for searching bin/nutch crawl urls -dir crawl -depth 3 -topN 5 7
  • 8. nutch - plugins generate – fetch – update cycle Create crawldb parser Inject root URLs In crawldb HTMLParserFilter Generate fetchlist Fetch content URL Filter Update crawldb scoring filter 8
  • 9. nutch – extension points plugin.xml // tells Nutch about the plugin build.xml // build the plugin ivy.xml // plugin dependencies // plugin source src 9
  • 10. nutch - example <plugin id="KnoldusAggregator" name="Knoldus Parse Filter" version="1.0.0" provider-name="nutch.org"> <runtime> <library name="kdaggregator.jar"> <export name="*" /> </library> </runtime> <requires> <import plugin="nutch-extensionpoints" /> </requires> <extension id="org.apache.nutch.parse.headings" name="Nutch Headings Parse Filter" point="org.apache.nutch.parse.HtmlParseFilter"> <implementation id="KDParseFilter" class="com.knoldus.aggregator.server.plugins.DetailParserFilter "></implementation> </extension> </plugin> 10
  • 11. public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { LOG.debug("Parsing URL: " + content.getUrl()); } Parse parse = parseResult.get(content.getUrl()); Metadata metadata = parse.getData().getParseMeta(); for (String tag : tags) { metadata.add(TAG_KEY, tag); } return parseResult; } 11
  • 12. scala I have Java ! concurrency verbose popular Strongly typed jvm OO library 12
  • 13. scala Java: class Person { private String firstName; private String lastName; private int age; public Person(String firstName, String lastName, int age) { this.firstName = firstName; this.lastName = lastName; this.age = age; } public void setFirstName(String firstName) { this.firstName = firstName; } public void String getFirstName() { return this.firstName; } public void setLastName(String lastName) { this.lastName = lastName; } public void String getLastName() { return this.lastName; } public void setAge(int age) { this.age = age; } public void int getAge() { return this.age; } } Scala: class Person(var firstName: String, var lastName: String, var age: Int) Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i 13
  • 14. scala Java – everything is an object unless it is primitive Scala – everything is an object. period. Java – has operators (+, -, < ..) and methods Scala – operators are methods Java – statically typed – Thing thing = new Thing() Scala – statically typed but uses type inferencing val thing = new Thing 14
  • 15. evolution 15
  • 16. scala and concurrency Fine grained coarse grained Actors 16
  • 17. actors 17
  • 18. 18
  • 20. solution Supplier 1 Aggregator Supplier 2 Supplier 3 20
  • 21. Create crawldb Inject root URLs In crawldb Supplier URLs Generate fetchlist Fetch content Update crawldb plugins written in Scala 21
  • 22. logic Crawl the supplier Parse Is URL interesting Pass extraction to actor seed database 22
  • 23. plugin - scala class DetailParserFilter extends HtmlParseFilter { def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = { if (isDetailURL(content.getUrl)) { val rawHtml = content.getContent if (rawHtml.length > 0) processContent(rawHtml) } parseResult } private def isDetailURL(url: String): Boolean = { val result = url.matches(AggregatorConfiguration.regexEventDetailPages) result } private def processContent(rawHtml: Array[Byte]) = { (new DetailProcessor).start ! rawHtml } 23
  • 24. result 5 suppliers crawled Crawl cycles run continuously for few days > 500K seed data collected All with Nutch and 823 lines of Scala code 24
  • 26. resources http://blog.knoldus.com http://wiki.apache.org/nutch/NutchTutorial http://www.scala-lang.org/ vikas@knoldus.com 26