web page classification
with naïve bayes classifiers

nabeelah ali
27 november 2013
outline
• what is web page classification
• motivation
• literature review
• project design
• experiments
• evaluation
description &
motivation
what is classification?
web page classification
web page classification can
be seen as a type of
document classification
documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs,
meta-information

• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)
why?
web directories
why?
improving search results
why?
• user profile mining
• information filtering
• creation of domain-specific search engines
literature
review
bag of words
text is represented as an unordered
list of words
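
a minimal sketch of the bag-of-words idea, assuming a simple regex tokeniser; the function name and the sample text are illustrative and not taken from the project:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for text, ignoring word order."""
    # assumed tokeniser: lower-case, then keep runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("Professor of CS, room 220; CS department")
print(bag["cs"])         # 2
print(bag["professor"])  # 1
```

because only counts are kept, two texts with the same words in a different order get the same representation.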
n-gram representation
• document is represented by a vector of
features

• concepts expressed by phrases can be
captured (e.g. “New York” vs “new” and
“york”)
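
a small sketch of word n-gram extraction, assuming the same kind of regex tokeniser as a bag-of-words model would use; the helper name is hypothetical:

```python
import re

def ngrams(text, n=2):
    """Return the word n-grams of text as space-joined strings."""
    # assumed tokeniser: lower-case, then keep runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("New York city"))  # ['new york', 'york city']
```

with n=2 the phrase "new york" survives as a single feature instead of being split into two unrelated words.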
using html structure
• assign a weight to each HTML element, and
make each feature a linear combination of
its weighted occurrences in those elements

• e.g. headings would have a greater weight

• four main elements are considered: title,
headings, metadata and main text

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
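
a rough sketch of tag-weighted word features using Python's standard html.parser. the weight values, class name and sample page are assumptions for illustration; neither the slides nor this sketch reproduce the weights used in the cited paper:

```python
import re
from collections import Counter
from html.parser import HTMLParser

# illustrative weights only; no values are given in the slides
WEIGHTS = {"title": 4.0, "heading": 3.0, "meta": 2.0, "text": 1.0}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class WeightedWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.stack = []  # currently open title/heading tags

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADINGS:
            self.stack.append(tag)
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") in ("description", "keywords"):
                self._add(attrs.get("content") or "", "meta")

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            element = "text"
        elif self.stack[-1] == "title":
            element = "title"
        else:
            element = "heading"
        self._add(data, element)

    def _add(self, text, element):
        # score(word) = sum over elements of weight * occurrences
        for word in tokenise(text):
            self.scores[word] += WEIGHTS[element]

def weighted_features(html):
    parser = WeightedWords()
    parser.feed(html)
    parser.close()
    return parser.scores

page = ("<html><head><title>CS course</title>"
        "<meta name='keywords' content='course'></head>"
        "<body><h1>Course outline</h1><p>course room</p></body></html>")
scores = weighted_features(page)
print(scores["course"])  # 10.0 (title 4 + meta 2 + heading 3 + text 1)
print(scores["room"])    # 1.0
```

this sketch also counts script and style contents as main text; a real pipeline would probably skip them.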
visual analysis
• the visual representation rendered by the
web browser is important

• each web page is visualised as an adjacency
multigraph, with each section representing
a different kind of content

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel
approach for a Web page classification." Proceedings of
SAWM04 workshop, ECML2004. 2004.
URL features
• pages do not need to be fetched or
analysed

• fast!
• derives tokens from the URL and uses
these tokens as features

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification
using URL features." Proceedings of the 14th ACM international
conference on Information and knowledge management. ACM, 2005.
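
a simplified sketch of deriving tokens from a URL without fetching the page. it only splits the host and path on punctuation, which is a stand-in and not the segmentation method of the cited paper; the example URL is made up:

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split the host and path of a URL into lower-case tokens."""
    parts = urlparse(url)
    text = (parts.netloc + parts.path).lower()
    # split on anything that is not a letter or digit, dropping empty pieces
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]

print(url_tokens("http://www.cs.example.edu/courses/cs100/index.html"))
# ['www', 'cs', 'example', 'edu', 'courses', 'cs100', 'index', 'html']
```

the resulting tokens can then be counted like any other word features.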
web page classification
project design
dataset
• 4 universities dataset (cornell, texas,
washington, wisconsin)

• each page must be classified into a
category: course, department, faculty,
project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
document classification
single label classification: one and only one
class label is assigned to each instance
hard classification: an instance can either be
or not be in a particular class, with no
intermediate state
multi-class classification: instances are
divided among more than two categories
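
a compact sketch of a multinomial naïve Bayes classifier that fits these three properties: it returns exactly one label (single label), by arg-max with no intermediate state (hard), over any number of classes (multi-class). add-one smoothing, the class name and the toy training data are assumptions for illustration, not details from the project:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one class label per doc."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # add-one (Laplace) smoothing over the vocabulary
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# toy data, invented for the example
docs = [["lecture", "homework", "syllabus"],
        ["professor", "research", "publications"],
        ["homework", "exam"]]
labels = ["course", "faculty", "course"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["homework", "lecture"]))  # course
print(model.predict(["research"]))             # faculty
```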
details of the dataset
experiment #1
bag of words
use the words, unweighted, as features
[figure: example word features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
experiment #2

HTML tag weighting

use words weighted by the HTML tags (e.g.
words in <h1> tags will be weighted more
heavily than those in <p> tags)
[figure: example word features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
experiment #3
n-gram
use phrases instead of single words as features
[figure: example phrase features, including: assistant, research, contact information, program description, course outline]
evaluation

k-fold cross validation

From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
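
a minimal sketch of how k-fold splits can be generated, assuming a shuffled split into k nearly equal folds; the function name, the seed parameter and the choice of k are illustrative, since the slides do not state them:

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # 8 2 on every fold
```

each item is used for testing exactly once, and the k test scores are typically averaged.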
evaluation
confusion matrix

http://en.wikipedia.org/wiki/Confusion_matrix
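
a short sketch of building a confusion matrix from actual and predicted labels, with rows as actual classes and columns as predicted classes; the labels below are invented for the example and are not results from the project:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student"]
predicted = ["course", "faculty", "faculty", "student", "course"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
# course [1, 1, 0]
# faculty [0, 1, 0]
# student [1, 0, 1]

# accuracy is the diagonal (correct predictions) over all instances
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(actual)
print(accuracy)  # 0.6
```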
bibliography
B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005)
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and
algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL
features." Proceedings of the 14th ACM international conference on Information
and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web
page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
questions?
