Web Page Classification: Features and Algorithms
Xiaoguang Qi and Brian D. Davison
Department of Computer Science & Engineering
Lehigh University, June 2007

Presented by
Mr. Mumtaz Khan (MS 2nd Semester)
Department of Computer Science
University of Peshawar, September 2011
   Webpage classification significance
   Introduction
   Background
   Applications of web classification
   Features
   Algorithms
   Blog Classification
   Conclusion
   Let’s go back in history about 10 years.
     The Evolution of Websites: How 5 popular
     Websites have changed
   What’s different between past and present?
    What changed?
     Flash animation
     JavaScript
     Video clips, embedded objects
     Advertising: Google AdSense, Yahoo!
   Webpage classification, or webpage
    categorization, is the process of assigning a
    webpage to one or more category labels, e.g.
    “News”, “Sport”, “Business”.
   GOAL: The authors survey existing web
    classification techniques to find new areas
    for research, including web-specific
    features and algorithms that have been
    found to be useful for webpage
    classification.
   What will you learn?
     A Detailed review of useful features for web
      classification
     The algorithms used
     The future research directions
 Webpage classification can help improve the
  quality of web search.
 Knowing how it works can help you improve your SEO
  skills.
 Each search engine keeps its techniques
  secret.
   The general problem of webpage
    classification can be divided into
     Subject classification: the subject or topic of a
      webpage, e.g. “Adult”, “Sport”, “Business”.
     Function classification: the role that the
      webpage plays, e.g. “Personal homepage”, “Course
      page”, “Admission page”.
   Based on the number of classes in the problem,
    classification can be divided into
     binary classification
     multi-class classification
   Based on the number of classes that can be
    assigned to an instance, classification can be
    divided into single-label classification and
    multi-label classification.
   Constructing and expanding web
    directories (web hierarchies)
     Yahoo!
     ODP or “Open Directory Project”
      ▪ http://www.dmoz.org
   How are they doing?
     By human effort
      ▪ In July 2006, it was reported that there were 73,354 editors in the
        dmoz ODP.
   As the web changes and continues to grow,
    “Automatic creation of classifiers from web
    corpora based on user-defined hierarchies”
    was introduced by Huang et al. in 2004
   The starting point of this presentation!
   Improving quality of search results
     Categories view
     Ranking view
     In 1998, Page and Brin developed the link-based
     ranking algorithm called PageRank
      ▪ Ranks pages by their hyperlinks without considering the topic
        of each page
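To make the topic-blind nature of link-based ranking concrete, here is a minimal power-iteration sketch of PageRank. The damping factor 0.85, the iteration count, the handling of pages without out-links, and the toy graph are all assumptions for illustration and are not taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page without out-links: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page web; note that no page content is used at all.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

The scores depend only on who links to whom, which is why topical classification can add information that such a ranking lacks.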
   Helping question answering systems
     Yang and Chua 2004
      ▪ Suggest finding answers to list questions, e.g. “name all the
        countries in Europe”
     How did it work?
      ▪ Formulated the queries and sent them to search engines.
      ▪ Classified the results into four categories
        ▪   Collection pages (contain a list of items)
        ▪   Topic pages (represent an answer instance)
        ▪   Relevant pages (support an answer instance)
        ▪   Irrelevant pages
      ▪ After that, topic pages are clustered, from which answers are
        extracted.
     Question answering systems could benefit from web
      classification in both accuracy and efficiency
   Other applications
     Web content filtering
     Assisted web browsing
     Knowledge base construction
   In this section, we review the types of features
    that are useful in webpage classification research.
     The most important characteristic that makes
      webpage classification different from
      plain-text classification is the HYPERLINK <a>…</a>
   We classify features into
     On-page features: located directly on the page
     Neighbor features: found on the pages related to the
      page to be classified.
   Textual content and tags
     N-gram features
      ▪ Imagine two different documents. One contains the
        phrase “New York”. The other contains the terms “New”
        and “York” separately. A 2-gram feature tells them apart.
      ▪ At Yahoo!, 5-gram features were used.
     HTML tags or DOM
      ▪ Title, headings, metadata and main text
        ▪ Each of them is assigned an arbitrary weight.
        ▪ Nowadays most websites use nested lists (<ul><li>), which
          really helps in web page classification.
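The “New York” example can be checked with a small word n-gram extractor. This is only an illustrative sketch: the function name, the whitespace tokenization and the two sample sentences are assumptions, not part of the surveyed work.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as strings) in a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc_a = "I love New York"
doc_b = "York has a new mayor"

# Both documents contain the single words "new" and "york" ...
unigrams_a = set(word_ngrams(doc_a, 1))
unigrams_b = set(word_ngrams(doc_b, 1))

# ... but only doc_a contains the 2-gram "new york".
bigrams_a = set(word_ngrams(doc_a, 2))
bigrams_b = set(word_ngrams(doc_b, 2))
```

With unigrams alone the two documents look alike on these terms; the 2-gram feature separates them.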
   Textual content and tags
     URL
      ▪ Kan and Thi 2004
        ▪ Demonstrated that a webpage can be classified based on its URL
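One plausible first step for URL-based classification is to break the URL into tokens that a text classifier can consume. The slides do not describe the exact features Kan and Thi used, so the splitting rule and the example URL below are assumptions.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]


# Hypothetical URL of a course page.
tokens = url_tokens("http://www.example.edu/courses/cs101/index.html")
```

Tokens such as "edu" and "courses" already hint at a “Course page” without fetching the page itself.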
   Visual analysis
     Each webpage has two representations
       1. The text, represented in HTML
       2. The visual representation rendered by a web browser
     Most approaches focus on the text while ignoring the
      visual information, which is useful as well
     Kovacevic et al. 2004
      ▪ Each webpage is represented as a hierarchical “visual
        adjacency multigraph”.
      ▪ In the graph, each node represents an HTML object and each
        edge represents a spatial relation in the visual
        representation.
   Motivation
     The useful on-page features discussed previously may be
     missing or unrecognizable on a
     particular page
   Underlying assumptions
     When exploring the features of neighbors, some
      assumptions are implicitly made in existing work.
     The presence of many “Sport” pages in the neighborhood
      of P-a increases the probability of P-a being in “Sport”.
     Chakrabarti et al. 2002 and Menczer 2005 showed that
      linked pages were more likely to have terms in common.
   Neighbor selection
     Existing research mainly focuses on pages within two steps
      of the page to be classified, i.e. at a distance no greater
      than two.
     There are six types of neighboring pages: parent, child,
      sibling, spouse, grandparent and grandchild.
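The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (siblings share a parent with the target, spouses share a child with it); the function name and the toy graph are hypothetical.

```python
def neighbors(links, page):
    """links: dict {page: set of pages it links to}. Returns the six neighbor sets."""
    def children(p):
        return set(links.get(p, ()))

    def parents(p):
        return {q for q, outs in links.items() if p in outs}

    def expand(pages, step):
        # One more hop from a set of pages, excluding the target itself.
        return {r for q in pages for r in step(q)} - {page}

    par, chi = parents(page), children(page)
    return {
        "parent": par,
        "child": chi,
        "sibling": expand(par, children),      # other children of my parents
        "spouse": expand(chi, parents),        # other parents of my children
        "grandparent": expand(par, parents),
        "grandchild": expand(chi, children),
    }


# Toy graph: g -> p -> {t, s}, t -> c, o -> c, c -> gc
toy_links = {"g": {"p"}, "p": {"t", "s"}, "t": {"c"}, "o": {"c"}, "c": {"gc"}}
around_t = neighbors(toy_links, "t")
```

All six sets lie within two link steps of the target page, matching the distance limit mentioned above.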
   Neighbor selection cont.
     Furnkranz 1999
      ▪ The text on the parent pages surrounding the link is used to
        train a classifier instead of the text on the target page.
      ▪ A target page will be assigned multiple labels. These labels are
        then combined by some voting scheme to form the final
        prediction of the target page’s class
     Sun et al. 2002
      ▪ In addition to the text on the target page, using page titles and anchor
        text from parent pages can improve classification compared to a
        pure text classifier.
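The slides do not say which voting scheme is used to combine the labels predicted from each parent page. Simple majority voting is one possibility, sketched here with hypothetical labels.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted most often (first seen wins on a tie)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels predicted from the link context on three parent pages.
final_label = majority_vote(["Sport", "News", "Sport"])
```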
   Neighbor selection cont.
     Summary
      ▪ Parent, child, sibling and spouse pages are all
        useful in classification; siblings are found to be the best
        source.
      ▪ Using information from neighboring pages may
        introduce extra noise, so it should be used carefully.
   Features
     Labels: assigned by an editor or keyworder
     Partial content: anchor text, the text surrounding
      the anchor text, titles, headers
     Full content
      ▪ Among the three types of features, using the full
        content of neighboring pages is the most expensive;
        however, it generates better accuracy.
   Utilizing artificial links (implicit links)
     Hyperlinks are not the only choice.
   What is an implicit link?
     A connection between pages that appear in the
     results of the same query and are both clicked by
     users.
   Implicit links can help webpage classification
    as well as hyperlinks do.
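Following the definition above, implicit links can be derived from a click log. This is a minimal sketch; the log format of (query, clicked URL) pairs and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """click_log: iterable of (query, clicked_url) pairs.

    Returns a set of unordered URL pairs that were both clicked
    for the same query.
    """
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links


# Hypothetical click log.
log = [
    ("jaguar", "a.com"),
    ("jaguar", "b.com"),
    ("python", "c.com"),
    ("jaguar", "a.com"),
]
pairs = implicit_links(log)
```

The resulting pairs can then be treated like hyperlinks when collecting neighbor features.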
 However, since the results of different
  approaches are based on different
  implementations and different datasets,
  it is difficult to compare their performance.
 Sibling pages are even more useful than parents
  and children.
     The reason may lie in the process of hyperlink
      creation.
     A page often acts as a bridge to connect its
      outgoing links, which are likely to have a common
      topic.
Algorithms
 • Dimension reduction
 • Relational learning
 • Modifications to traditional algorithms
 • Hierarchical classification
 • Combining information from multiple sources
   Feature weighting
    o Plays another important role in webpage
      classification
    o A way of boosting classification by
      emphasizing the features with better
      discriminative power
    o Special case of weighting: “Feature
      Selection”
   A special case of “feature weighting”
   ‘Zero weight’ is assigned to the eliminated
    features
   The role:
     Reduce the dimensionality of the feature space
     Computational complexity reduction
     Classification can be more accurate in the reduced space
   Simple approaches
     First fragment of each document
     First fragment applied to web documents in
     hierarchical classification
   Text categorization approaches
     Information gain
     Mutual information
     Etc.
   Using the first fragment of each document
     Assumption: a summary is at the beginning of the
      document
     Fast and accurate classification for news articles
     Not satisfactory for other types of documents
   First fragment applied to hierarchical
    classification of web pages
     Useful for web documents
   Using expected mutual information and mutual
    information
     Two well-known metrics, used in an approach based on a variation of the
      k-Nearest Neighbor algorithm
     Terms are weighted according to the HTML tags in which they appear
     Terms within different tags carry different importance
   Using information gain
     Another well-known metric
     It is still not clear which metric is superior
      for web classification
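As a reminder of what such a selection metric computes, here is a small sketch of information gain for a single term: the reduction in label entropy after splitting the documents by whether they contain the term. The toy documents (as sets of terms) and labels are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """docs: list of term sets; labels: parallel list of class labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without)
        if part
    )
    return entropy(labels) - remainder


toy_docs = [
    {"goal", "match"},
    {"match", "team"},
    {"stock", "market"},
    {"market", "team"},
]
toy_labels = ["Sport", "Sport", "Business", "Business"]

ig_match = information_gain(toy_docs, toy_labels, "match")  # separates the classes
ig_team = information_gain(toy_docs, toy_labels, "team")    # says nothing about the class
```

Feature selection would keep terms like "match" and assign zero weight to terms like "team".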
   Improving the performance of SVM classifiers
     By aggressive feature selection
     Developed a measure with the ability to predict the
      selection effectiveness without training and testing
      classifiers
   Latent Semantic Indexing (LSI), a popular approach
     In text documents:
      ▪ Docs are reinterpreted into a smaller, transformed, but less intuitive
        space
      ▪ Cons: high computational complexity makes it inefficient to scale
     In web classification:
      ▪ Experiments are based on small datasets (to avoid the above ‘cons’)
      ▪ Some work has improved it to make it applicable to larger datasets,
        which still needs further study
   Webpages: instances connected by the HYPERLINK relation
   Hence, relational learning algorithms are used for webpage
    classification
   Webpage classification: a relational learning problem
   Relaxation Labeling Algorithms
     Original proposal:
      ▪ Image analysis
     Current usage:
      ▪   Image and vision analysis
      ▪   Artificial Intelligence
      ▪   pattern recognition
      ▪   web-mining
   Link-based Classification Algorithms
     Utilizing 2 popular link-based algorithms
      ▪ Loopy belief propagation
      ▪ Iterative classification
• Flow of the algorithm
   1. A text classifier gives the nodes their assigned class
      probabilities
   2. Nodes are considered in turn
   3. Each node’s probabilities are reevaluated taking into
      account the latest estimates of its neighbors’
   4. The same process is applied to each node’s neighbors
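The flow above can be sketched in a few lines. This is a deliberately simplified illustration, not the published relaxation labeling algorithm: the rule that mixes a node's own text-based probabilities with the average of its neighbors' current estimates, the mixing weight, the number of rounds, and the toy graph are all assumptions.

```python
def relaxation_labeling(text_probs, neighbors, rounds=10, weight=0.5):
    """text_probs: {node: {class: prob}} from a text classifier.
    neighbors: {node: [linked nodes]}.
    Returns updated {node: {class: prob}}.
    """
    probs = {n: dict(p) for n, p in text_probs.items()}
    for _ in range(rounds):
        for node in probs:                      # nodes considered in turn
            nbrs = neighbors.get(node, [])
            if not nbrs:
                continue
            updated = {}
            for c, own in text_probs[node].items():
                # Latest estimates of the neighbors for class c.
                nbr_avg = sum(probs[m][c] for m in nbrs) / len(nbrs)
                updated[c] = (1 - weight) * own + weight * nbr_avg
            total = sum(updated.values())
            probs[node] = {c: v / total for c, v in updated.items()}
    return probs


# Page "c" looks slightly like "News" from its text alone,
# but both of its neighbors are confidently "Sport".
toy_text_probs = {
    "a": {"Sport": 0.9, "News": 0.1},
    "b": {"Sport": 0.8, "News": 0.2},
    "c": {"Sport": 0.45, "News": 0.55},
}
toy_neighbors = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
result = relaxation_labeling(toy_text_probs, toy_neighbors)
```

In this toy run the neighborhood evidence flips page "c" from “News” to “Sport”.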
   Using a combined logistic classifier
     based on content and link information
      ▪ Shows improvement over a textual classifier
      ▪ Outperforms a single flat classifier based on both
        content and link features
   Selecting only the proper neighbors
     Not all neighbors are qualified
     The criterion for chosen neighbors:
      ▪ Similar enough in content
   Two popular link-based algorithms:
     Loopy belief propagation
     Iterative classification
 Better performance on a web collection than
  textual classifiers
 In the course of this research, a toolkit was
  implemented
     Toolkit features
      ▪ Classifies networked data
        ▪ Utilizes a relational classifier and a collective inference procedure
        ▪ Demonstrated strong performance on several datasets including
          web collections
   Traditional algorithms adjusted to the
    context of webpage classification
     k-Nearest Neighbors (kNN)
       ▪ Quantifies the distance between the test document
          and each training document using a dissimilarity
          measure
       ▪ Cosine similarity or inner product is what is used by
          most existing kNN classifiers
     Support Vector Machine (SVM)
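For reference, a plain kNN text classifier with cosine similarity over raw term counts might look like the sketch below. The tokenization, the value of k, and the toy training set are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(test_doc, training, k=3):
    """training: list of (text, label) pairs. Majority label of the k most similar."""
    query = Counter(test_doc.lower().split())
    scored = sorted(
        ((cosine(query, Counter(text.lower().split())), label) for text, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


toy_training = [
    ("football match goal", "Sport"),
    ("team wins match", "Sport"),
    ("stock market falls", "Business"),
    ("market shares rise", "Business"),
]
predicted = knn_classify("match goal team", toy_training)
```

The modifications listed on the following slides replace or extend the similarity measure and the voting step of this baseline.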
   Varieties of modifications:
     Using term co-occurrence in documents
     Using probability computation
     Using “co-training”
   Using term co-occurrence in documents
     An improved similarity measure
     The more co-occurring terms two documents have in
      common, the stronger the relationship between them
     Better performance than normal kNN (cosine similarity
      and inner product measures)
   Using probability computation
     Condition:
      ▪ The probability of a document d being in class c is determined by the
        distance between d and its neighbors and by its neighbors’ probability of
        being in c
      ▪ Simple equation
        ▪ Prob. of d in c = (distance between d and neighbors) × (neighbors’ prob. of being in c)
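One way to read the equation above is as a weighted average of the neighbors' class probabilities. The sketch below assumes the weights are similarities (larger means closer) and normalizes by their sum; both choices are assumptions, since the slide only gives the equation in words.

```python
def class_probability(similarities, neighbor_probs):
    """Estimate P(c | d) from d's neighbors.

    similarities[i]: similarity between d and neighbor i (assumed weight).
    neighbor_probs[i]: probability that neighbor i is in class c.
    """
    total = sum(similarities)
    if total == 0:
        return 0.0
    return sum(s * p for s, p in zip(similarities, neighbor_probs)) / total


# A close neighbor that is surely in c, and a distant one that is surely not.
p_c = class_probability([0.8, 0.2], [1.0, 0.0])
```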
   Using “co-training”
     Makes use of labeled and unlabeled data
     Aims to achieve better accuracy
     Scenario: binary classification
      ▪ Classifying the unlabeled instances
        ▪ Two classifiers are trained on different sets of features
        ▪ The predictions of each one are used to train the other
      ▪ Compared with using only labeled instances
        ▪ Co-training can cut the error rate by half
     When generalized to multi-class problems
      ▪ When the number of categories is large
        ▪ Co-training is not satisfactory
        ▪ On the other hand, combining error-correcting output coding
          (more than enough classifiers in use) with co-training can boost performance
   In classification, both positive and negative
    examples are required
   SVM-Based aim:
     To eliminate the need for manual collection of
     negative examples while still retaining similar
     classification accuracy
   1st: Identify the most important positive features
     • Positive data given
     • Unlabeled data given
   2nd: Positive feature filtering
     • Filtering out possible positive examples from unlabeled data
     • Leaving only negative examples (filter negative samples)
   3rd: Training SVM classifier
     • Trained on the labeled positive examples
     • Trained on the filtered negative examples
   Not much research, since most web
    classification work focuses on same-level (flat)
    approaches
   Approaches:
       Based on “divide and conquer”
       Error minimization
       Topical hierarchy
       Hierarchical SVMs
       Using the degree of misclassification
       Hierarchical text categorization
   The use of hierarchical classification based on
    “divide and conquer”
     Classification problems are split into sub-problems
      hierarchically
      ▪ More efficient and accurate than the non-hierarchical way
   Error minimization
     When the lower-level category is uncertain,
      ▪ Minimize error by shifting the assignment to the higher-level one
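The two ideas above (divide and conquer, and falling back to the higher-level category when uncertain) can be combined in a short top-down sketch. The confidence threshold, the `scores_at` callback, and the toy score table are hypothetical stand-ins for per-node classifiers.

```python
def classify_top_down(scores_at, root, threshold=0.6):
    """Walk down a category tree, one sub-problem per level.

    scores_at(node) -> {child_category: confidence}; empty dict at a leaf.
    Stops at the current node when the best child is below the threshold.
    """
    node = root
    while True:
        scores = scores_at(node)
        if not scores:
            return node                      # reached a leaf
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            return node                      # uncertain: keep the higher-level category
        node = best


# Hypothetical classifier outputs for one page.
toy_scores = {
    "Root": {"Sport": 0.9, "Business": 0.1},
    "Sport": {"Football": 0.55, "Tennis": 0.45},
}
label = classify_top_down(lambda n: toy_scores.get(n, {}), "Root")
confident_label = classify_top_down(lambda n: toy_scores.get(n, {}), "Root", threshold=0.5)
```

With the default threshold the page stays at “Sport” because neither subcategory is confident enough; lowering the threshold pushes it down to “Football”.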
   Topical Hierarchy
     Classify a web page into a topical hierarchy
     Update the category information as the hierarchy
      expands
   Hierarchical SVMs
     Observations:
      ▪ Hierarchical SVMs are more efficient than flat SVMs
      ▪ Neither is satisfactory in effectiveness for large taxonomies
      ▪ Hierarchical settings do more harm than good to kNN and naive Bayes
        classifiers
   Hierarchical classification by the degree of
    misclassification
     As opposed to measuring “correctness”
     Distance is measured between the classifier-assigned class and
      the true class.
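One natural distance between two classes in a hierarchy is the number of edges on the path between them in the category tree. The slides do not specify the distance used, so this path-length measure and the toy taxonomy are assumptions.

```python
def path_to_root(parent, node):
    """List of nodes from `node` up to the root, following the parent map."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(parent, a, b):
    """Number of edges between categories a and b in the tree."""
    path_a, path_b = path_to_root(parent, a), path_to_root(parent, b)
    depth_in_b = {n: i for i, n in enumerate(path_b)}
    for i, n in enumerate(path_a):
        if n in depth_in_b:                  # first common ancestor
            return i + depth_in_b[n]
    raise ValueError("nodes are not in the same tree")


# Hypothetical taxonomy given as child -> parent.
toy_parent = {
    "Football": "Sport",
    "Tennis": "Sport",
    "Sport": "Root",
    "Business": "Root",
}
near_miss = tree_distance(toy_parent, "Football", "Tennis")
far_miss = tree_distance(toy_parent, "Football", "Business")
```

Under this measure, predicting “Tennis” for a “Football” page is penalized less than predicting “Business”.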
   Hierarchical text categorization
     A detailed review was provided in 2005
   Different sources are utilized
   Combining link and content information is quite
    popular
   Common way of combining:
     Treat information from different sources as different
      (usually disjoint) feature sets on which multiple
      classifiers are trained
     Then the final decision is made by combining
      the classifiers
   The combination has the potential to have better
    knowledge than any single method
   Voting and stacking
     Well-developed methods in machine learning
   Co-training
     Effective in combining multiple sources
      ▪ Since here, different classifiers are trained on disjoint
        feature sets
   Please note that:
     The need for additional resources is sometimes a
      disadvantage
     The combination of two is NOT always BETTER
      than each separately
   A web site is a collection of web pages
   One branch of research focuses only on web
    site contents.
   Another branch of research focuses on
    utilizing the structural properties of web sites
   There is also research that utilizes both
    structural and content information.
   Classification of web pages is helpful for
    classifying a web site.
   Pierre 2001
     Proposed an approach to the classification of web sites
      into industry categories using HTML tags
     Accuracy around 90%
 Amitay et al. (2003) used structural information of a web
  site to determine its functionality (such as search
  engines, web directories, corporate sites)
 Ester et al. (2002)
     Investigated three different approaches to determining the
      topical category of a web site based on different web site
      representations
                  By a single virtual page
                  By a vector of topic frequencies
                  By a tree of its pages with topics
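The second representation, a vector of topic frequencies, is easy to illustrate. The sketch assumes that each page of the site has already been assigned a topic by some page classifier; the topic list and page labels are hypothetical.

```python
from collections import Counter


def topic_frequency_vector(page_topics, topics):
    """Count how many pages of a site fall under each topic, in a fixed topic order."""
    counts = Counter(page_topics)
    return [counts[t] for t in topics]


# Hypothetical per-page topics for a three-page site.
site_vector = topic_frequency_vector(
    ["Sport", "Sport", "News"],
    ["News", "Sport", "Business"],
)
```

The resulting vector can be fed to any standard classifier to label the site as a whole.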
   The word “blog” was originally a short form of
    “web log”
   As blogging has gained in popularity in recent
    years, an increasing amount of research about
    blogs has also been conducted.
   Broken into three types
     Blog identification (to determine whether a web
      document is a blog)
     Mood classification, or sentiment of blogs
     Genre classification
   Elgersma and de Rijke 2006
     Common classification algorithms applied to blog identification using
      a number of human-selected features, e.g. “Comments” and
      “Archives”
     Accuracy around 90%
 Mihalcea and Liu 2006 classify blogs into two polarities of
  mood, happiness and sadness (mood classification)
 Nowson 2006 discussed the distinction between three types of
  blogs (genre classification)
     News
     Commentary
     Journal
   Qu et al. 2006
     Automatic classification of blogs into four genres
      ▪ Personal diary
      ▪ News
      ▪ Political
      ▪ Sports
     Using unigram tf-idf document representation and
      naive Bayes classification.
     Qu et al.’s approach can achieve an accuracy of
      84%.
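To show how a unigram tf-idf representation and naive Bayes fit together, here is a small sketch of a multinomial naive Bayes classifier whose term counts are replaced by tf-idf weights. The slides do not describe Qu et al.'s exact model, so the idf formula, the Laplace smoothing, and the toy blog snippets are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_tfidf_nb(docs, labels):
    """Fit a tf-idf weighted multinomial naive Bayes model on unigram tokens."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    weights = defaultdict(Counter)           # class -> term -> summed tf-idf
    for toks, y in zip(tokenized, labels):
        for t, tf in Counter(toks).items():
            weights[y][t] += tf * idf[t]
    return {"idf": idf, "weights": weights,
            "class_docs": Counter(labels), "n_docs": n_docs}


def predict(model, doc):
    """Return the class with the highest log posterior score."""
    vocab = len(model["idf"])
    tf = Counter(t for t in doc.lower().split() if t in model["idf"])
    best, best_score = None, -math.inf
    for y, n_y in model["class_docs"].items():
        total = sum(model["weights"][y].values())
        score = math.log(n_y / model["n_docs"])          # class prior
        for t, f in tf.items():
            # Laplace-smoothed term likelihood, weighted by the query's tf-idf.
            likelihood = (model["weights"][y][t] + 1.0) / (total + vocab)
            score += f * model["idf"][t] * math.log(likelihood)
        if score > best_score:
            best, best_score = y, score
    return best


toy_blogs = ["election vote senate", "match goal team",
             "dear diary today", "senate vote bill"]
toy_genres = ["Political", "Sports", "Personal diary", "Political"]
model = train_tfidf_nb(toy_blogs, toy_genres)
genre = predict(model, "vote on the bill")
```

Words unseen in training (here "on" and "the") are simply ignored at prediction time.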
   Webpage classification is a type of supervised
    learning problem that aims to categorize
    webpages into a set of predefined categories
    based on labeled training data.
   The authors expect that future web classification
    efforts will certainly combine content and link
    information in some form.
   Future work would be well-advised to
     Emphasize text and labels from siblings over
      other types of neighbors.
     Incorporate anchor text from parents.
     Utilize other sources of (implicit or explicit) human
      knowledge, such as query logs and click-through
      behavior, in addition to existing labels to guide
      classifier creation.
Presented by
  Mr.Mumtaz Khan (MS 2nd Semester )
      Department of Computer Science
University of Peshawar, September 2011
Presented by
  Mr.Mumtaz Khan (MS 2nd Semester )
      Department of Computer Science
University of Peshawar, September 2011

Mais conteúdo relacionado

Mais procurados

Linked data MLA 2015
Linked data MLA 2015Linked data MLA 2015
Linked data MLA 2015Cason Snow
 
Linked Data MLA 2015
Linked Data MLA 2015Linked Data MLA 2015
Linked Data MLA 2015Cason Snow
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
UW Forward - CUWL 2011
UW Forward - CUWL 2011UW Forward - CUWL 2011
UW Forward - CUWL 2011Eric Larson
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Webis20090
 
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGTOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGcsandit
 
User centred design and students' library search behaviours
User centred design and students' library search behavioursUser centred design and students' library search behaviours
User centred design and students' library search behavioursVernon Fowler
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic Webostephens
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage MiningDaminda Herath
 
Preprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningPreprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningAmir Masoud Sefidian
 
Search engine user behaviour: How can users be guided to quality content?
Search engine user behaviour: How can users be guided to quality content?Search engine user behaviour: How can users be guided to quality content?
Search engine user behaviour: How can users be guided to quality content?Dirk Lewandowski
 
User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)amytaylor
 

Mais procurados (19)

Linked data MLA 2015
Linked data MLA 2015Linked data MLA 2015
Linked data MLA 2015
 
Linked Data MLA 2015
Linked Data MLA 2015Linked Data MLA 2015
Linked Data MLA 2015
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Web mining
Web miningWeb mining
Web mining
 
UW Forward - CUWL 2011
UW Forward - CUWL 2011UW Forward - CUWL 2011
UW Forward - CUWL 2011
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Web
 
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGTOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKING
 
User centred design and students' library search behaviours
User centred design and students' library search behavioursUser centred design and students' library search behaviours
User centred design and students' library search behaviours
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic Web
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
 
confernece paper
confernece paperconfernece paper
confernece paper
 
Hybrid Approaches to Taxonomy & Folksonmy
Hybrid Approaches to Taxonomy & FolksonmyHybrid Approaches to Taxonomy & Folksonmy
Hybrid Approaches to Taxonomy & Folksonmy
 
Preprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage MiningPreprocessing of Web Log Data for Web Usage Mining
Preprocessing of Web Log Data for Web Usage Mining
 
Search engine user behaviour: How can users be guided to quality content?
Search engine user behaviour: How can users be guided to quality content?Search engine user behaviour: How can users be guided to quality content?
Search engine user behaviour: How can users be guided to quality content?
 
Web data mining
Web data miningWeb data mining
Web data mining
 
EDS across the pond
EDS across the pondEDS across the pond
EDS across the pond
 
Web mining
Web miningWeb mining
Web mining
 
User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)
 
Semantic web
Semantic web Semantic web
Semantic web
 

Semelhante a Webpage classification and Features

Web Page Classification
Web Page ClassificationWeb Page Classification
Web Page ClassificationPacharaStudio
 
web page classification and algorithmn.pdf
web page classification and algorithmn.pdfweb page classification and algorithmn.pdf
web page classification and algorithmn.pdfMdAnik19
 
Aggregate rank bringing order to web sites
Aggregate rank  bringing order to web sitesAggregate rank  bringing order to web sites
Aggregate rank bringing order to web sitesOUM SAOKOSAL
 
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET-Multi -Stage Smart Deep Web Crawling Systems: A Review
IRJET-Multi -Stage Smart Deep Web Crawling Systems: A ReviewIRJET Journal
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
PageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportPageRank algorithm and its variations: A Survey report
PageRank algorithm and its variations: A Survey reportIOSR Journals
 
Efficient focused web crawling approach
Efficient focused web crawling approachEfficient focused web crawling approach
Efficient focused web crawling approachSyed Islam
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimizationBookStoreLib
 
SE@M 2010: Automatic Keywords Extraction - a Basis for Content Recommendation
SE@M 2010: Automatic Keywords Extraction - a Basis for Content RecommendationSE@M 2010: Automatic Keywords Extraction - a Basis for Content Recommendation
SE@M 2010: Automatic Keywords Extraction - a Basis for Content RecommendationIvana Bosnic
 
Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvestingpaperpublications3
 
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLPA NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLPijnlc
 
Web Authoring Principles for Focused and Effective Content
Web Authoring Principles for Focused and Effective ContentWeb Authoring Principles for Focused and Effective Content
Web Authoring Principles for Focused and Effective ContentEric Hodgson
 
WEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdfWEBMINING_SOWMYAJYOTHI.pdf
WEBMINING_SOWMYAJYOTHI.pdfSowmyaJyothi3
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areasinventionjournals
 
Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Mining web-logs-to-improve-website-organization1
Mining web-logs-to-improve-website-organization1Ijcem Journal
 
Comparative study of different ranking algorithms adopted by search engine
Comparative study of  different ranking algorithms adopted by search engineComparative study of  different ranking algorithms adopted by search engine
Comparative study of different ranking algorithms adopted by search engineEchelon Institute of Technology
 

Webpage classification and Features

  • 1. Features and Algorithms. Xiaoguang Qi and Brian D. Davison, Department of Computer Science & Engineering, Lehigh University, June 2007. Presented by Mr. Mumtaz Khan (MS 2nd Semester), Department of Computer Science, University of Peshawar, September 2011.
  • 2. Webpage classification significance; Introduction; Background; Applications of web classification; Features; Algorithms; Blog classification; Conclusion
  • 4. Let’s go back about 10 years in history.
      ▪ The Evolution of Websites: how 5 popular websites have changed
  • 16. What is different between past and present? What changed?
      ▪ Flash animation
      ▪ JavaScript
      ▪ Video clips, embedded objects
      ▪ Advertising: Google AdSense, Yahoo!
  • 18. Webpage classification, or webpage categorization, is the process of assigning a webpage to one or more category labels, e.g. “News”, “Sport”, “Business”.
      ▪ GOAL: The authors survey existing web classification techniques to find new areas for research, including web-specific features and algorithms that have been found useful for webpage classification.
  • 19. What will you learn?
      ▪ A detailed review of useful features for web classification
      ▪ The algorithms used
      ▪ Future research directions
      ▪ Webpage classification can help improve the quality of web search.
      ▪ Knowing this helps you improve your SEO skills.
      ▪ Each search engine keeps its techniques secret.
  • 21. The general problem of webpage classification can be divided into:
      ▪ Subject classification: the subject or topic of a webpage, e.g. “Adult”, “Sport”, “Business”.
      ▪ Function classification: the role that the webpage plays, e.g. “Personal homepage”, “Course page”, “Admission page”.
  • 22. Based on the number of classes in the problem, webpage classification can be divided into:
      ▪ Binary classification
      ▪ Multi-class classification
      Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
  • 25. Constructing and expanding web directories (web hierarchies)
      ▪ Yahoo!
      ▪ ODP, the “Open Directory Project”
          ▪ http://www.dmoz.org
      ▪ How are they doing it?
  • 26. How are they doing it?
      ▪ By human effort
          ▪ In July 2006, it was reported that there were 73,354 editors in the dmoz ODP.
      ▪ As the web changes and continues to grow, “automatic creation of classifiers from web corpora based on user-defined hierarchies” was introduced by Huang et al. in 2004.
      ▪ The starting point of this presentation!
  • 29. Improving the quality of search results
      ▪ Categories view
      ▪ Ranking view
      ▪ In 1998, Page and Brin developed the link-based ranking algorithm called PageRank.
          ▪ It scores pages from the hyperlink structure without considering the topic of each page.
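To make the "links only, no topic" point concrete, here is a minimal sketch of a PageRank-style power iteration. The toy graph, the damping value of 0.85, the fixed iteration count, and the handling of pages without outgoing links are illustrative assumptions, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (toy input)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: only link structure is used, never page text.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

Page "c" ends up with the highest score purely because most links point to it, whatever its topic is.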
  • 31. Helping question answering systems
      ▪ Yang and Chua 2004 suggest finding answers to list questions, e.g. “name all the countries in Europe”.
      ▪ How did it work?
          ▪ Queries were formulated and sent to search engines.
          ▪ The results were classified into four categories:
              ▪ Collection pages (contain lists of items)
              ▪ Topic pages (represent an answer instance)
              ▪ Relevant pages (support an answer instance)
              ▪ Irrelevant pages
          ▪ After that, topic pages are clustered, and answers are extracted from the clusters.
      ▪ Question answering systems could benefit from web classification in both accuracy and efficiency.
  • 32. Other applications
      ▪ Web content filtering
      ▪ Assisted web browsing
      ▪ Knowledge base construction
  • 34. In this section, we review the types of features that are useful in webpage classification research.
      ▪ The most important characteristic that makes webpage classification different from plain-text classification is the HYPERLINK <a>…</a>.
      ▪ We classify features into:
          ▪ On-page features: located directly on the page
          ▪ Neighbor features: found on pages related to the page to be classified
  • 36. Textual content and tags
      ▪ N-gram features
          ▪ Imagine two different documents. One contains the phrase “New York”; the other contains the terms “New” and “York” separately. A 2-gram feature tells them apart.
          ▪ At Yahoo!, 5-gram features were used.
      ▪ HTML tags or DOM
          ▪ Title, headings, metadata and main text
          ▪ Each of them is assigned an arbitrary weight.
          ▪ Nowadays most websites use nested lists (<ul><li>), which really helps in web page classification.
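The “New York” example can be sketched in a few lines. This is only an illustration: the whitespace tokenizer, the function name `word_ngrams`, and the two sample sentences are assumptions made here.

```python
def word_ngrams(text, n):
    """Return the word n-grams of a text, using a naive lowercase whitespace tokenizer."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical documents: both contain "new" and "york", only one contains the phrase.
doc_a = "I moved to New York last year"
doc_b = "a new bridge opened in York"

unigrams_b = word_ngrams(doc_b, 1)   # contains "new" and "york" as separate terms
bigrams_a = word_ngrams(doc_a, 2)    # contains the phrase "new york"
bigrams_b = word_ngrams(doc_b, 2)    # does not contain "new york"
```

With unigrams alone the two documents look similar on these terms; the 2-gram "new york" appears only in the first.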
  • 37. Textual content and tags
      ▪ URL
          ▪ Kan and Thi 2004 demonstrated that a webpage can be classified based on its URL.
  • 38. Visual analysis
      ▪ Each webpage has two representations:
          1. The text, represented in HTML
          2. The visual representation rendered by a web browser
      ▪ Most approaches focus on the text while ignoring the visual information, which is useful as well.
      ▪ Kovacevic et al. 2004
          ▪ Each webpage is represented as a hierarchical “visual adjacency multigraph”.
          ▪ In the graph, each node represents an HTML object and each edge represents a spatial relation in the visual representation.
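As a rough illustration of URL-based features, the sketch below splits a URL into lowercase tokens that a classifier could consume. It is not the segmentation method of the cited work; the function name `url_tokens` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    # Break on anything that is not a letter or digit (dots, slashes, hyphens, ...).
    return [tok for tok in re.split(r"[^a-z0-9]+", raw.lower()) if tok]


# Hypothetical URL: tokens such as "edu" and "courses" hint at a course page.
tokens = url_tokens("http://www.example.edu/courses/cs101/index.html")
```

The resulting tokens can be fed to any text classifier in place of, or in addition to, the page content.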
  • 41. Motivation
      ▪ The useful on-page features discussed previously may be missing or unrecognizable on a particular page.
  • 43. Underlying assumptions
      ▪ When exploring the features of neighbors, some assumptions are implicitly made in existing work.
      ▪ The presence of many “Sport” pages in the neighborhood of P-a increases the probability of P-a being in “Sport”.
      ▪ Chakrabarti et al. 2002 and Menczer 2005 showed that linked pages were more likely to have terms in common.
      Neighbor selection
      ▪ Existing research mainly focuses on pages within two steps of the page to be classified, i.e. at a distance no greater than two.
      ▪ There are six types of neighboring pages: parent, child, sibling, spouse, grandparent and grandchild.
  • 45. Neighbor selection (cont.)
      ▪ Fürnkranz 1999
          ▪ The text on the parent pages surrounding the link is used to train a classifier, instead of the text on the target page.
          ▪ A target page will be assigned multiple labels. These labels are then combined by some voting scheme to form the final prediction of the target page’s class.
      ▪ Sun et al. 2002
          ▪ In addition to the text on the target page, using the page title and anchor text from parent pages can improve classification compared with a pure text classifier.
  • 46. Neighbor selection (cont.)
      ▪ Summary
          ▪ Parent, child, sibling and spouse pages are all useful in classification; siblings are found to be the best source.
          ▪ Using information from neighboring pages may introduce extra noise, so it should be used carefully.
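The "combine by some voting scheme" step can be sketched as a plain majority vote. The slides do not say which voting scheme was used, so simple majority voting, the function name `vote`, and the sample labels are assumptions for illustration.

```python
from collections import Counter


def vote(labels):
    """Return the most frequent label among the per-parent predictions, or None if empty."""
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels predicted from the link context on three parent pages.
predictions_from_parents = ["Sport", "Sport", "News"]
final_label = vote(predictions_from_parents)
```

A weighted vote (for example by classifier confidence) would follow the same shape, with counts replaced by summed weights.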
  • 48. Features
      ▪ Label: assigned by an editor or keyworder
      ▪ Partial content: anchor text, the text surrounding the anchor text, titles, headers
      ▪ Full content
          ▪ Among the three types of features, using the full content of neighboring pages is the most expensive; however, it generates better accuracy.
  • 49. Utilizing artificial links (implicit links)
      ▪ Hyperlinks are not the only choice.
      ▪ What is an implicit link?
          ▪ A connection between pages that appear in the results of the same query and are both clicked by users.
      ▪ Implicit links can help webpage classification, just as hyperlinks do.
  • 51.
      ▪ However, since the results of different approaches are based on different implementations and different datasets, it is difficult to compare their performance.
      ▪ Sibling pages are even more useful than parents and children.
      ▪ The reason may lie in the process of hyperlink creation.
      ▪ A page often acts as a bridge connecting its outgoing links, which are likely to share a common topic.
  • 54. Algorithms
      ▪ Dimension reduction
      ▪ Relational learning
      ▪ Modifications to traditional algorithms
      ▪ Hierarchical classification
      ▪ Combining information from multiple sources
  • 55. Feature weighting
      ▪ Plays another important role in webpage classification
      ▪ A way of boosting classification by emphasizing the features with better discriminative power
      ▪ A special case of weighting: “feature selection”
  • 56. A special case of “feature weighting”
      ▪ A weight of zero is assigned to the eliminated features.
      ▪ The role:
          ▪ Reduces the dimensionality of the feature space
          ▪ Classification can be more accurate in the reduced space
          ▪ Reduces computational complexity
  • 57. Simple approaches
      ▪ First fragment of each document
      ▪ First fragment applied to web documents in hierarchical classification
      Text categorization approaches
      ▪ Information gain
      ▪ Mutual information
      ▪ Etc.
  • 58. Using the first fragment of each document
      ▪ Assumption: a summary is at the beginning of the document
      ▪ Fast and accurate classification for news articles
      ▪ Not satisfactory for other types of documents
      First fragment applied to hierarchical classification of web pages
      ▪ Useful for web documents
  • 59. Using expected mutual information and mutual information
      ▪ Two well-known metrics, used in an approach based on a variation of the k-Nearest Neighbor algorithm
      ▪ Terms are weighted according to the HTML tags in which they appear
      ▪ Terms within different tags carry different importance
      Using information gain
      ▪ Another well-known metric
      ▪ It is still not clear which metric is superior for web classification
  • 60. Improving the performance of SVM classifiers
      ▪ By aggressive feature selection
      ▪ A measure was developed that can predict selection effectiveness without training and testing classifiers
      The popular Latent Semantic Indexing (LSI)
      ▪ In text documents:
          ▪ Documents are reinterpreted in a smaller, transformed, but less intuitive space
          ▪ Cons: high computational complexity makes it inefficient to scale
      ▪ In web classification:
          ▪ Experiments were based on small datasets (to avoid the above cons)
          ▪ Some work has improved it to make it applicable to larger datasets, which still needs further study
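As an illustration of one of these metrics, the sketch below computes the information gain of a term from the class entropy before and after splitting documents on the term's presence. The toy documents (sets of terms), the labels, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder


# Hypothetical corpus: each document is a set of terms.
docs = [{"goal", "match"}, {"match", "team"}, {"stock", "market"}, {"market", "team"}]
labels = ["Sport", "Sport", "Business", "Business"]

ig_match = information_gain(docs, labels, "match")  # separates the classes perfectly
ig_team = information_gain(docs, labels, "team")    # appears equally in both classes
```

Feature selection would then keep the terms with the highest scores and assign zero weight to the rest.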
  • 62. Webpages are instances connected by the HYPERLINK relation. Hence, relational learning algorithms are used for webpage classification. Webpage classification: a relational learning problem.
  • 63. Relaxation labeling algorithms
      ▪ Original proposal:
          ▪ Image analysis
      ▪ Current usage:
          ▪ Image and vision analysis
          ▪ Artificial intelligence
          ▪ Pattern recognition
          ▪ Web mining
      Link-based classification algorithms
      ▪ Utilizing two popular link-based algorithms:
          ▪ Loopy belief propagation
          ▪ Iterative classification
  • 64. Flow of the algorithm:
      ▪ Nodes start with the class probabilities assigned by their text classifier.
      ▪ Nodes are considered in turn.
      ▪ Each node’s probabilities are reevaluated, taking into account the latest estimates of its neighbors’ probabilities.
      ▪ The same process is applied to each node’s neighbors.
  • 65. Using a combined logistic classifier
      ▪ Based on content and link information
          ▪ Shows improvement over a textual classifier
          ▪ Outperforms a single flat classifier based on both content and link features
      Selecting ONLY the proper neighbors
      ▪ Not all neighbors are qualified
      ▪ The chosen neighbors must be:
          ▪ Similar enough in content
  • 66. Two popular link-based algorithms:
      ▪ Loopy belief propagation
      ▪ Iterative classification
      ▪ Better performance on a web collection than textual classifiers
      ▪ During the researchers’ study, a toolkit was implemented
      ▪ Toolkit features
          ▪ Classifies networked data
          ▪ Utilizes a relational classifier and a collective inference procedure
          ▪ Demonstrated strong performance on several datasets, including web collections
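The flow above can be sketched as a simple iterative update. This is a simplified illustration, not the update rule of any cited work: the equal mixing of a node's own text-classifier estimate with the average of its neighbors' latest estimates (`self_weight`), the fixed number of rounds, and the toy graph are all assumptions.

```python
def relax(initial, neighbors, rounds=10, self_weight=0.5):
    """Iteratively re-estimate class probabilities from neighbors' latest estimates.

    initial:   node -> {class: probability} from a text classifier
    neighbors: node -> list of neighboring nodes
    """
    probs = {node: dict(p) for node, p in initial.items()}
    for _ in range(rounds):
        for node, own in initial.items():      # nodes are considered in turn
            nbrs = neighbors.get(node, [])
            if not nbrs:
                continue                       # isolated node keeps its estimate
            mixed = {}
            for c in own:
                nbr_avg = sum(probs[n][c] for n in nbrs) / len(nbrs)
                mixed[c] = self_weight * own[c] + (1 - self_weight) * nbr_avg
            total = sum(mixed.values())
            # Update in place so later nodes see the latest estimate.
            probs[node] = {c: v / total for c, v in mixed.items()}
    return probs


# Hypothetical graph: page "p" is undecided but links to two likely "Sport" pages.
initial = {
    "p": {"Sport": 0.5, "News": 0.5},
    "q": {"Sport": 0.9, "News": 0.1},
    "r": {"Sport": 0.8, "News": 0.2},
}
neighbors = {"p": ["q", "r"], "q": ["p"], "r": ["p"]}
result = relax(initial, neighbors)
```

After a few rounds the undecided page "p" leans toward "Sport", pulled by its neighbors.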
  • 68. Traditional algorithms adjusted to the context of webpage classification
      ▪ k-Nearest Neighbors (kNN)
          ▪ Quantifies the distance between the test document and each training document using a dissimilarity measure
          ▪ Cosine similarity or inner product is used by most existing kNN classifiers
      ▪ Support Vector Machine (SVM)
  • 69. Varieties of modifications:
      ▪ Using term co-occurrence in documents
      ▪ Using probability computation
      ▪ Using “co-training”
  • 70. Using term co-occurrence in documents
      ▪ An improved similarity measure
      ▪ The more co-occurring terms two documents have in common, the stronger the relationship between them
      ▪ Better performance than normal kNN (cosine similarity and inner product measures)
      Using probability computation
      ▪ Condition:
          ▪ The probability of a document d being in class c is determined by the distance between d and its neighbors and by its neighbors’ probability of being in c
      ▪ Simple equation:
          ▪ P(d in c) = (distance between d and its neighbors) × (neighbors’ P(in c))
  • 71. Using “co-training”
      ▪ Makes use of labeled and unlabeled data
      ▪ Aims to achieve better accuracy
      ▪ Scenario: binary classification
          ▪ Classifying the unlabeled instances
              ▪ Two classifiers are trained on different sets of features
              ▪ The predictions of each one are used to train the other
          ▪ Compared with classifying using only labeled instances
              ▪ Co-training can cut the error rate by half
      ▪ When generalized to multi-class problems
          ▪ When the number of categories is large, co-training is not satisfactory
          ▪ On the other hand, combining error-correcting output coding (with more than enough classifiers in use) with co-training can boost performance
  • 72. In classification, both positive and negative examples are required
      ▪ The SVM-based aim:
          ▪ To eliminate the need for manual collection of negative examples while still retaining similar classification accuracy
  • 73. The SVM-based approach in three steps:
      ▪ 1st: Identify the most important positive features
          ▪ Positive data given
          ▪ Unlabeled data given
      ▪ 2nd: Positive feature filtering
          ▪ Filter out possible positive examples from the unlabeled data
          ▪ Leaving only negative examples (filtered negative samples)
      ▪ 3rd: Training the SVM classifier
          ▪ Trained on the labeled positive examples
          ▪ Trained on the filtered negative examples
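A minimal sketch of a plain kNN text classifier with cosine similarity over raw term counts, before any of the web-specific modifications. The whitespace tokenizer, the value k=3, the function names, and the tiny training set are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_classify(test_doc, training, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    test_vec = Counter(test_doc.lower().split())
    scored = sorted(
        ((cosine(test_vec, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical labeled documents.
training = [
    ("football match goal team", "Sport"),
    ("team wins the match", "Sport"),
    ("stock market shares fall", "Business"),
    ("market prices and shares", "Business"),
    ("goal scored in final match", "Sport"),
]
sport_guess = knn_classify("the team scored a late goal", training)
business_guess = knn_classify("shares fall on the market", training)
```

The modifications listed on the following slides replace the similarity measure or the voting step of this baseline.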
  • 75. Not much research exists, since most web classification work focuses on same-level (flat) approaches
      ▪ Approaches:
          ▪ Based on “divide and conquer”
          ▪ Error minimization
          ▪ Topical hierarchy
          ▪ Hierarchical SVMs
          ▪ Using the degree of misclassification
          ▪ Hierarchical text categorization
  • 76. The use of hierarchical classification based on “divide and conquer”
      ▪ Classification problems are split into sub-problems hierarchically
          ▪ More efficient and accurate than the non-hierarchical way
      Error minimization
      ▪ When the lower-level category is uncertain:
          ▪ Minimize error by shifting the assignment to the higher-level category
      Topical hierarchy
      ▪ Classify a web page into a topical hierarchy
      ▪ Update the category information as the hierarchy expands
  • 77. Hierarchical SVMs
      ▪ Observations:
          ▪ Hierarchical SVMs are more efficient than flat SVMs
          ▪ Neither is satisfactory in effectiveness for large taxonomies
          ▪ Hierarchical settings do more harm than good to kNN and naive Bayes classifiers
      Hierarchical classification by the degree of misclassification
      ▪ As opposed to measuring “correctness”
      ▪ Distance is measured between the classifier-assigned class and the true class.
      Hierarchical text categorization
      ▪ A detailed review was provided in 2005
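One way to picture "degree of misclassification" is the number of edges between the assigned class and the true class in the category tree. The slides do not specify the distance used, so edge-count distance, the function names, and the toy taxonomy below are assumptions.

```python
def path_to_root(node, parent):
    """Return the list of categories from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges between categories a and b in the hierarchy."""
    path_a = path_to_root(a, parent)
    depth_in_a = {n: i for i, n in enumerate(path_a)}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_in_a:
            # First shared ancestor: steps up from a plus steps up from b.
            return depth_in_a[n] + j
    raise ValueError("categories are not in the same tree")


# Hypothetical taxonomy, stored as child -> parent.
parent = {
    "Sport": "Top", "Business": "Top",
    "Football": "Sport", "Tennis": "Sport",
    "Finance": "Business",
}
near_miss = tree_distance("Football", "Tennis", parent)    # sibling category
far_miss = tree_distance("Football", "Finance", parent)    # different branch
```

Under this measure, predicting "Tennis" for a "Football" page is a smaller error than predicting "Finance", whereas a correctness measure would count both simply as wrong.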
  • 79. Different sources are utilized
      ▪ Combining link and content information is quite popular
      ▪ A common way of combining:
          ▪ Treat information from different sources as different (usually disjoint) feature sets, on which multiple classifiers are trained
          ▪ The FINAL decision is then generated by combining these classifiers
          ▪ The combination usually has the potential to do better than any single method
  • 80. Voting and stacking
      ▪ Well-developed methods in machine learning
      Co-training
      ▪ Effective in combining multiple sources
          ▪ Since here, different classifiers are trained on disjoint feature sets
  • 81. Please note that:
      ▪ The need for additional resources is sometimes a disadvantage
      ▪ The combination of two sources is NOT always better than each one separately
  • 83. A web site is a collection of web pages
      ▪ One branch of research focuses only on web site contents.
      ▪ Another branch of research focuses on utilizing the structural properties of web sites.
      ▪ There is also research that utilizes both structural and content information.
      ▪ Classification of web pages is helpful for classifying a web site.
  • 84. Pierre 2001
      ▪ Proposed an approach to the classification of web sites into industry categories using HTML tags
      ▪ Accuracy around 90%
      Amitay et al. (2003) used structural information of a web site to determine its functionality (such as search engines, web directories, corporate sites)
      Ester et al. (2002)
      ▪ Investigated three different approaches to determining the topical category of a web site, based on different web site representations:
          ▪ By a single virtual page
          ▪ By a vector of topic frequencies
          ▪ By a tree of its pages with topics
  • 86. The word “blog” was originally a short form of “web log”
      ▪ As blogging has gained in popularity in recent years, an increasing amount of research about blogs has also been conducted.
      ▪ Broken into three types:
          ▪ Blog identification (determining whether a web document is a blog)
          ▪ Mood classification, or sentiment of blogs
          ▪ Genre classification
  • 87. Elgersma and Rijke 2006
      ▪ Applied common classification algorithms to blog identification using a number of human-selected features, e.g. “Comments” and “Archives”
      ▪ Accuracy around 90%
      Mihalcea and Liu 2006 classify blogs into two polarities of mood, happiness and sadness (mood classification)
      Nowson 2006 discussed the distinction between three types of blogs (genre classification):
      ▪ News
      ▪ Commentary
      ▪ Journal
  • 88. Qu et al. 2006
      ▪ Automatic classification of blogs into four genres:
          ▪ Personal diary
          ▪ News
          ▪ Political
          ▪ Sports
      ▪ Using a unigram tf-idf document representation and naive Bayes classification.
      ▪ Qu et al.’s approach can achieve an accuracy of 84%.
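To show the general shape of a unigram naive Bayes genre classifier, here is a small sketch. It is not Qu et al.'s system: it uses raw unigram counts with add-one smoothing instead of tf-idf weighting, and the function names and training snippets are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in text.lower().split():
            term_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-probability under multinomial naive Bayes."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)        # class prior
        denom = sum(term_counts[label].values()) + len(vocab)   # add-one smoothing
        for tok in text.lower().split():
            if tok in vocab:                                    # ignore unseen words
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical blog snippets for three of the four genres.
examples = [
    ("today i felt tired and stayed home", "Personal diary"),
    ("my day at home with family", "Personal diary"),
    ("election vote parliament debate", "Political"),
    ("senator wins vote in election", "Political"),
    ("team wins the match", "Sports"),
    ("match final goal", "Sports"),
]
model = train_nb(examples)
political_guess = predict_nb(model, "the senator lost the election debate")
sports_guess = predict_nb(model, "a great goal in the final match")
```

Swapping the raw counts for tf-idf weights would bring the sketch closer to the representation named on the slide.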
  • 90. Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data.
      ▪ The authors expect that future web classification efforts will certainly combine content and link information in some form.
  • 91. Future work would be well-advised to:
      ▪ Emphasize text and labels from siblings over other types of neighbors.
      ▪ Incorporate anchor text from parents.
      ▪ Utilize other sources of (implicit or explicit) human knowledge, such as query logs and click-through behavior, in addition to existing labels, to guide classifier creation.