Document Classification
In PHP


@ianbarber - ian@ibuildings.com
http://phpir.com
Document Classification


Defining The Task
Document Pre-processing
Term Selection
Algorithms
What is
Document Classification?
Uses



 Ian Barber / @ianbarber / ian@ibuildings.com
 Filter          Organise           Metadata
Filtering -
Binary Classification
Organising -
Single Label Classification
Metadata -
Multiple Label Classification
Manual -
Rules Written By Domain Experts
Machine Learning -
Automatically Extract Rules
Classes




Training Documents        Test Documents
Evaluation

                   actual spam       actual ham

predicted spam     true positive     false positive

predicted ham      false negative    true negative
Measures

$accuracy    =
($tp + $tn) / ($tp + $tn + $fp + $fn);

$precision   = $tp / ($tp + $fp);

$recall      = $tp / ($tp + $fn);
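
The three measures can be wrapped in one helper. This is a minimal sketch: the function name `evaluate()` and the counts in the example are made up for illustration, and the zero-division guards are an added assumption rather than part of the slide.

```php
<?php
// Hypothetical helper: derive the three measures from confusion-matrix counts.
function evaluate(int $tp, int $tn, int $fp, int $fn): array
{
    $total = $tp + $tn + $fp + $fn;
    return [
        'accuracy'  => $total > 0 ? ($tp + $tn) / $total : 0.0,
        // Guard against division by zero when nothing was predicted as spam.
        'precision' => ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0,
        // Guard against division by zero when no document is actually spam.
        'recall'    => ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0,
    ];
}

// Made-up counts: 40 spam caught, 45 ham passed, 5 ham flagged, 10 spam missed.
$measures = evaluate(40, 45, 5, 10);
```

With these counts accuracy is 85/100, precision is 40/45 and recall is 40/50, which shows how a filter can look accurate while still missing a fifth of the spam.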
Vector Space Model -
Bag Of Words
$doc   = strtolower(strip_tags($doc));

$regex = '/\w+/';
preg_match_all($regex, $doc, $matches);

$words = $matches[0];




Extract Tokens
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew



       i   really like eggs cabbage and donʼt stew

 A    1     1     1    1      0     0    0     0

 B    1     0     1    0      1     1    1     1
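
The table of ones and zeros can be built in a few lines. The sketch below is an assumption-laden illustration: `tokenize()` is a hypothetical helper, and its pattern allows an ASCII apostrophe inside a word so that "don't" stays one term, which a plain `\w+` pattern would split.

```php
<?php
// Hypothetical helper: lower-case a document and split it into word tokens.
function tokenize(string $doc): array
{
    // Allow apostrophes inside words so "don't" stays a single token.
    preg_match_all("/[\\w']+/u", strtolower($doc), $matches);
    return $matches[0];
}

$docs = [
    'A' => "I really like eggs",
    'B' => "I don't like cabbage, and don't like stew",
];

// Vocabulary: every distinct term, in order of first appearance.
$vocabulary = [];
foreach ($docs as $text) {
    foreach (tokenize($text) as $term) {
        $vocabulary[$term] = true;
    }
}
$vocabulary = array_keys($vocabulary);

// One binary vector per document: 1 if the term occurs, 0 otherwise.
$vectors = [];
foreach ($docs as $name => $text) {
    $present = array_flip(tokenize($text));
    foreach ($vocabulary as $term) {
        $vectors[$name][$term] = isset($present[$term]) ? 1 : 0;
    }
}
```

The column order follows first appearance in the input, so it may differ from the table, but each row carries the same ones and zeros.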
[Figure: documents plotted as points in term space, with "really" on the horizontal axis and "i" on the vertical axis]
$tf    = $termCount;
$idf   = log($totalDocs / $docsWithTerm, 2);
$tfidf = $tf * $idf;

Term Weighting
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew


      i   really   like   eggs   cabbage   and    donʼt   stew
 A    0   0.58     0      1.58   0         0      0       0
 B    0   0        0      0      1.58      1.58   3.16    0.58
 C    0   1.17     0      0      0         0      0       0.58
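
The weights in this table can be reproduced with the tf and idf definitions from the Term Weighting slide. The sketch below assumes the documents are already tokenised; the variable names `$docs` and `$weights` are illustrative, not from the talk.

```php
<?php
// Documents as already-tokenised term lists (the three example sentences).
$docs = [
    'A' => ['i', 'really', 'like', 'eggs'],
    'B' => ['i', "don't", 'like', 'cabbage', 'and', "don't", 'like', 'stew'],
    'C' => ['i', 'really', 'really', 'like', 'stew'],
];

// Document frequency: in how many documents does each term appear?
$docsWithTerm = [];
foreach ($docs as $terms) {
    foreach (array_unique($terms) as $term) {
        $docsWithTerm[$term] = ($docsWithTerm[$term] ?? 0) + 1;
    }
}

$totalDocs = count($docs);
$weights = [];
foreach ($docs as $name => $terms) {
    // array_count_values gives the raw term frequency within one document.
    foreach (array_count_values($terms) as $term => $tf) {
        $idf = log($totalDocs / $docsWithTerm[$term], 2);
        $weights[$name][$term] = $tf * $idf;
    }
}
```

Terms that occur in every document ("i", "like") get an idf of log2(3/3) = 0, so they drop out no matter how often they are repeated.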
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew


      i   really   like   eggs   cabbage   and    donʼt   stew
 A    0   0.35     0      0.94   0         0      0       0
 B    0   0        0      0      0.31      0.31   0.63    0.11
 C    0   0.89     0      0      0         0      0       0.44
Dimensionality Reduction
Stop Words
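
Stop-word removal is a set difference over the token list. The stop list below is a tiny made-up sample; real lists are much longer and language-specific.

```php
<?php
// A tiny, made-up stop list for illustration only.
$stopWords = ['i', 'and', 'the', 'a', 'of', 'to'];

$tokens = ['i', 'really', 'like', 'eggs', 'and', 'stew'];

// array_diff keeps the tokens that are not in the stop list;
// array_values re-indexes the result from zero.
$filtered = array_values(array_diff($tokens, $stopWords));
```

Every stop word removed is one fewer dimension in the vector space, which is why this step sits under dimensionality reduction.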
happening - happen
happens - happen
happened - happen
http://tartarus.org/~martin/PorterStemmer
http://snowball.tartarus.org/algorithms/dutchtml



Stemming
            spam   ham
term        $a     $b
not term    $c     $d

Chi-Square
$a = $termSpam; $b = $termHam;
$c = $restSpam; $d = $restHam;

$total = $a + $b + $c + $d;
$diff = ($a * $d) - ($c * $b);

$chisquare = ($total * pow($diff, 2)) /
  (($a + $c) * ($b + $d) *
   ($a + $b) * ($c + $d));

Chi-Square 1DF

p        chi2
0.1       2.71
0.05      3.84
0.01      6.63
0.005     7.88
0.001    10.83

p - Value
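
Putting the statistic and the critical values together gives a simple term filter. This is a sketch under assumptions: `chiSquare()` is a hypothetical wrapper around the formula, the counts are made up, and the empty-margin guard is an addition.

```php
<?php
// Chi-square statistic for a 2x2 term/class contingency table.
// $a: spam docs with the term, $b: ham docs with the term,
// $c: spam docs without it,    $d: ham docs without it.
function chiSquare(int $a, int $b, int $c, int $d): float
{
    $total = $a + $b + $c + $d;
    $denominator = ($a + $c) * ($b + $d) * ($a + $b) * ($c + $d);
    if ($denominator === 0) {
        return 0.0; // a row or column is empty, so the term tells us nothing
    }
    $diff = ($a * $d) - ($c * $b);
    return ($total * $diff ** 2) / $denominator;
}

// Made-up counts: the term appears in 20 of 50 spam and 5 of 50 ham documents.
$score = chiSquare(20, 5, 30, 45);

// 3.84 is the critical value for p = 0.05 with one degree of freedom.
$keepTerm = $score > 3.84;
```

Here the statistic works out to 100 * 750^2 / (50 * 50 * 25 * 75) = 12, which is above even the 10.83 threshold for p = 0.001, so the term would be kept.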
Decision Tree - ID3

              ?

        ✔             ?

              ✖           ✔
Entropy

$entropy =
   -( ($spam/$total)
       * log($spam/$total, 2))
   -( ($ham/$total)
       * log($ham/$total, 2));
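
The expression above evaluates log(0) when one class is empty, which is exactly the case at a pure leaf. A minimal sketch of a guarded version follows; the function name `entropy()` is an assumption for illustration.

```php
<?php
// Two-class entropy in bits. A zero count contributes nothing,
// which avoids evaluating log(0).
function entropy(int $spam, int $ham): float
{
    $total = $spam + $ham;
    $entropy = 0.0;
    foreach ([$spam, $ham] as $count) {
        if ($count > 0) {
            $p = $count / $total;
            $entropy -= $p * log($p, 2);
        }
    }
    return $entropy;
}

$mixed = entropy(50, 50); // evenly split set
$pure  = entropy(10, 0);  // all spam
```

An even split gives the maximum of one bit and a pure set gives zero, which is the property the tree builder relies on when choosing splits.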
[Figure: entropy (vertical axis, 0 to 1.00) plotted against spam/total (horizontal axis, 0 to 1.0)]
Information Gain

    $gain =
      $baseEntropy
      - (($withCount / $total) * $withEntropy)
      - (($woutCount / $total) * $woutEntropy);
          Split   Entropy   Proportion   E*P

Base      50/50   1         1            1

With      20/5    0.722     0.25         0.1805

Without   30/45   0.97      0.75         0.7275

1 - With - Without = 0.092
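
The worked example can be checked in code. The sketch below assumes two hypothetical helpers, `entropy()` and `informationGain()`, and takes the spam/ham counts from the table.

```php
<?php
// Two-class entropy in bits; zero counts are skipped to avoid log(0).
function entropy(int $spam, int $ham): float
{
    $total = $spam + $ham;
    $entropy = 0.0;
    foreach ([$spam, $ham] as $count) {
        if ($count > 0) {
            $p = $count / $total;
            $entropy -= $p * log($p, 2);
        }
    }
    return $entropy;
}

// Information gain of splitting on a term.
// $with and $wout are [spam, ham] counts for documents with / without the term.
function informationGain(array $with, array $wout): float
{
    $withCount = array_sum($with);
    $woutCount = array_sum($wout);
    $total = $withCount + $woutCount;
    $baseEntropy = entropy($with[0] + $wout[0], $with[1] + $wout[1]);

    return $baseEntropy
        - ($withCount / $total) * entropy($with[0], $with[1])
        - ($woutCount / $total) * entropy($wout[0], $wout[1]);
}

$gain = informationGain([20, 5], [30, 45]);
```

Without intermediate rounding the gain comes out at roughly 0.091; the table arrives at 0.092 because it rounds the "without" entropy to 0.97 before multiplying.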
function build($tree) {
  if(!$tree->count('spam')) {
     $tree->setLeaf('ham');
  } else if(!$tree->count('ham')) {
    $tree->setLeaf('spam');
  } else {
    $term = $tree->findMaxGain();
    $tree->addSubtree($term,
         build($tree->getWith()),
         build($tree->getWout())
    );
  }
  return $tree;
}
term


✔          term



     ✖            term



           ✔             ✖
Classification
function classify($doc, $tree) {
  if($tree->isLeaf()) {
    return $tree->class;
  }
  $term = $tree->getSplitTerm();
  if(in_array($term, $doc)) {
    return classify($doc, $tree->getWith());
  } else {
    return classify($doc, $tree->getWout());
  }
}
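
The `$tree` object above is not defined in the slides, so here is a self-contained sketch of the same walk using nested arrays. The array layout, the function name `classifyDoc()` and the split terms are all assumptions made for illustration.

```php
<?php
// A tree is either a leaf ['class' => ...] or a node
// ['term' => ..., 'with' => subtree, 'wout' => subtree].
function classifyDoc(array $doc, array $tree): string
{
    // Walk down until a leaf is reached.
    while (!isset($tree['class'])) {
        $branch = in_array($tree['term'], $doc, true) ? 'with' : 'wout';
        $tree = $tree[$branch];
    }
    return $tree['class'];
}

// Hand-built tree with made-up split terms.
$tree = [
    'term' => 'register',
    'with' => ['class' => 'spam'],
    'wout' => [
        'term' => 'sent',
        'with' => ['class' => 'ham'],
        'wout' => ['class' => 'spam'],
    ],
];

$labelA = classifyDoc(['please', 'register', 'now'], $tree);
$labelB = classifyDoc(['message', 'sent', 'yesterday'], $tree);
```

The loop does the same work as the recursive version; a loop is used here only because the tree is plain data rather than an object with methods.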
Overfitting:
Pruning or Stop Conditions
K Nearest Neighbour
[Figures: spam and ham documents plotted against Term X and Term Y (three slides)]
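Cosine Similarity


// Dot product; equals the cosine when both vectors are unit length.
$similarity = 0;
foreach($doca as $term => $tfidf) {
  if(isset($docb[$term])) {
    $similarity +=
      floatval($tfidf) *
      floatval($docb[$term]);
  }
}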
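
When the vectors have not been normalised beforehand, the dot product has to be divided by the two vector lengths. A minimal sketch follows, assuming sparse vectors stored as term => weight arrays and PHP 7.4 or later for arrow functions; `cosineSimilarity()` is a hypothetical name.

```php
<?php
// Cosine similarity between two sparse vectors (term => weight).
function cosineSimilarity(array $docA, array $docB): float
{
    $dot = 0.0;
    foreach ($docA as $term => $weight) {
        if (isset($docB[$term])) {
            $dot += $weight * $docB[$term];
        }
    }

    // Euclidean length of each vector.
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $docA)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $docB)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty document is similar to nothing
    }
    return $dot / ($normA * $normB);
}

$a = ['really' => 0.58, 'eggs' => 1.58];
$c = ['really' => 1.17, 'stew' => 0.58];

$same      = cosineSimilarity($a, $a);
$different = cosineSimilarity($a, ['stew' => 0.58]);
$partial   = cosineSimilarity($a, $c);
```

Identical documents score 1, documents with no shared terms score 0, and everything else falls in between, regardless of document length.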
Zend_Search_Lucene
$index = Zend_Search_Lucene::create($db);
$doc = new Zend_Search_Lucene_Document();

$doc->addField(
  Zend_Search_Lucene_Field::Text(
    'class', $class));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored(
    'contents', $content));
$index->addDocument($doc);
Zend_Search_Lucene::setResultSetLimit(25);

$analyser =
  Zend_Search_Lucene_Analysis_Analyzer::getDefault();
$tokens = $analyser->tokenize($content);

$filtered = array();
foreach($tokens as $key => $token) {
  $tok = $token->getTermText();
  if(strlen($tok) > 4) {
    $filtered[$tok] = isset($filtered[$tok])
      ? $filtered[$tok] + 1 : 1;
  }
}
arsort($filtered);

Classifying with ZSL
$q = new Zend_Search_Lucene_Search_Query_MultiTerm();

$tc = 0;
foreach($filtered as $t => $tf) {
  $q->addTerm(
    new Zend_Search_Lucene_Index_Term($t));
  if(++$tc > 49) { break;}
}

$results = $index->find($q);
$classes = array();
foreach($results as $result) {
  if(!isset($classes[$result->class])) {
    $classes[$result->class] = 0;
  }
  $classes[$result->class] += 1;
}

arsort($classes);
$class = key($classes);
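
The voting step does not depend on the search engine. As a sketch, assuming the neighbours arrive as a list of class and similarity score pairs (the `knnVote()` name and the scores are made up, and PHP 7.4 or later is assumed):

```php
<?php
// Majority vote over the k most similar training documents.
// $neighbours: list of ['class' => string, 'score' => float].
function knnVote(array $neighbours, int $k): ?string
{
    // Highest similarity first.
    usort($neighbours, fn($x, $y) => $y['score'] <=> $x['score']);

    $votes = [];
    foreach (array_slice($neighbours, 0, $k) as $neighbour) {
        $class = $neighbour['class'];
        $votes[$class] = ($votes[$class] ?? 0) + 1;
    }
    if ($votes === []) {
        return null; // nothing to vote on
    }

    arsort($votes);
    return array_key_first($votes);
}

// Made-up search results.
$neighbours = [
    ['class' => 'ham',  'score' => 0.8],
    ['class' => 'spam', 'score' => 0.9],
    ['class' => 'ham',  'score' => 0.1],
    ['class' => 'spam', 'score' => 0.7],
    ['class' => 'ham',  'score' => 0.2],
];

$withThree = knnVote($neighbours, 3);
$withFive  = knnVote($neighbours, 5);
```

With k = 3 the two closest spam documents win; with k = 5 the weaker ham matches outvote them, which shows how sensitive the result is to the choice of k.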
Flax/Xapian Search Service
http://www.flax.co.uk
$flax = new FlaxSearchService('ip:8080');

$db = $flax->createDatabase('test');
$db->addField('class', array(
  'store'      => true,
  'exacttext' => true));
$db->addField('contents', array(
  'store'      => false,
  'freetext' => array('language'=>'en')));
$db->commit();

$db->addDocument(array(
  'class'    => $class,
  'contents' => $document));
$db->commit();
$db->addDocument(
        array('contents' => $doc), 'foo');
$db->commit();

$results = $db->searchSimilar('foo',0,25);
$db->deleteDocument('foo');
$db->commit();

$classes = array();
foreach($results['results'] as $r) {
  if($r['docid'] != 'foo') {
    $c = $r['data']['class'][0];
    $classes[$c] = isset($classes[$c]) ? $classes[$c] + 1 : 1;
  }
}

arsort($classes);
$class = key($classes);
[Figure: spam and ham documents plotted against Term X and Term Y]
Prototypes For Rocchio

$prototype = array();
$mul = 1 / count($classDocs);

foreach($classDocs as $doc) {
  foreach($doc as $tid => $tfidf) {
    if(!isset($prototype[$tid])) {
      $prototype[$tid] = 0;
    }
    $prototype[$tid] += $mul * $tfidf;
  }
}
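
Once each class has a prototype, a new document goes to the class whose prototype it is most similar to. The sketch below is a simplified nearest-prototype classifier, not the full Rocchio formula; the helper names and training vectors are made up, and the dot product is used as the similarity on the assumption that the vectors are already normalised.

```php
<?php
// Average the tf-idf vectors of one class into a prototype (centroid).
function buildPrototype(array $classDocs): array
{
    $prototype = [];
    $mul = 1 / count($classDocs);
    foreach ($classDocs as $doc) {
        foreach ($doc as $tid => $tfidf) {
            $prototype[$tid] = ($prototype[$tid] ?? 0.0) + $mul * $tfidf;
        }
    }
    return $prototype;
}

// Dot product of two sparse vectors.
function dot(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $tid => $w) {
        $sum += $w * ($b[$tid] ?? 0.0);
    }
    return $sum;
}

// Made-up training vectors, keyed by term.
$training = [
    'spam' => [['register' => 0.9, 'free' => 0.4], ['register' => 0.7, 'win' => 0.7]],
    'ham'  => [['sent' => 0.8, 'meeting' => 0.6], ['sent' => 1.0]],
];

// array_map keeps the class names as keys when given a single array.
$prototypes = array_map('buildPrototype', $training);

$doc = ['register' => 0.6, 'sent' => 0.1, 'win' => 0.8];
$scores = [];
foreach ($prototypes as $class => $prototype) {
    $scores[$class] = dot($doc, $prototype);
}
arsort($scores);
$class = array_key_first($scores);
```

Classification then costs one comparison per class instead of one per training document, which is the main attraction over K Nearest Neighbour.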
Naive Bayes -
Probability Based Classifier
Bayes Theorem
  Pr(Class | Doc) = Pr(Doc | Class) * Pr(Class) / Pr(Doc)

  Pr(Doc) is the same for every class, so it can be dropped when ranking classes:

  Pr(Class | Doc) ∝ Pr(Doc | Class) * Pr(Class)
Likelihood Of Term Occurring
Given Class

  word      spam freq   pr(word|spam)   ham freq   pr(word|ham)

 register     1757          0.11          246          0.02

  sent        487           0.03         4600          0.36
Estimating Likelihood
// Assumes term is a column of document_terms; $class must be a trusted value.
$this->db->query("
   INSERT INTO class_terms
       (class, term, likelihood)
   SELECT d.class, dt.term,
       count(*) / " . $classCount . "
   FROM documents AS d
   JOIN document_terms AS dt USING (did)
   WHERE d.class = '" . $class . "'
   GROUP BY d.class, dt.term"
);
Classifying A Document
foreach($classes as $class) {
  $prob[$class] = 0.5; // assume prior

    foreach($document as $term) {
      $prob[$class] *=
            $likely[$term][$class];
    }
}

arsort($prob);
$class = key($prob);
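
Multiplying many small likelihoods quickly underflows to zero, and a single term missing from `$likely` wipes out the whole product. A common workaround is to add logarithms instead. The sketch below assumes a hypothetical `classifyBayes()` function and a small floor value for unseen terms; the floor is an illustrative assumption, not a tuned smoothing scheme.

```php
<?php
// Score each class by log prior plus summed log likelihoods.
// $likely[$term][$class] holds Pr(term | class); $priors[$class] holds Pr(class).
function classifyBayes(array $document, array $likely, array $priors, float $floor = 1e-6): string
{
    $scores = [];
    foreach ($priors as $class => $prior) {
        $scores[$class] = log($prior);
        foreach ($document as $term) {
            // Unseen term/class pairs fall back to a small floor instead of zero.
            $scores[$class] += log($likely[$term][$class] ?? $floor);
        }
    }
    arsort($scores);
    return array_key_first($scores);
}

// Likelihoods from the register/sent table; equal priors as in the loop above.
$likely = [
    'register' => ['spam' => 0.11, 'ham' => 0.02],
    'sent'     => ['spam' => 0.03, 'ham' => 0.36],
];
$priors = ['spam' => 0.5, 'ham' => 0.5];

$first  = classifyBayes(['register'], $likely, $priors);
$second = classifyBayes(['register', 'sent'], $likely, $priors);
```

"register" alone points to spam, but adding "sent" tips the document to ham, because 0.02 * 0.36 is larger than 0.11 * 0.03.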
Document Classification


Defining The Problem
Document Processing
Term Selection
Algorithm
Image Credits
Title          http://www.flickr.com/photos/themacinator/3499579760/
What is...     http://www.flickr.com/photos/austinevan/1225274637/
Filter         http://www.flickr.com/photos/benimoto/2913950616/
Organise       http://www.flickr.com/photos/ellasdad/425813314/
Metadata       http://www.flickr.com/photos/banky177/2282734063/
Manual         http://www.flickr.com/photos/foundphotoslj/1134150364/
Automatic      http://www.flickr.com/photos/29278394@N00/59538978/
Vector Space   http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/
Reduction      http://www.flickr.com/photos/wili/157220657/sizes/l/
Stemming       http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/
Stop words     http://www.flickr.com/photos/afroswede/22237769/
Chi-Squared    http://www.flickr.com/photos/kdkd/2837565850/sizes/o/
ID3            http://www.flickr.com/photos/tonythemisfit/2414239471
Overfitting     http://www.flickr.com/photos/akirkley/3222128726/sizes/l/
Bayes          http://www.flickr.com/photos/darwinbell/440080655/sizes/l/
Conclusion     http://www.flickr.com/photos/mukluk/241256203
Credits        http://www.flickr.com/photos/librarianavengers/413762956/
Questions?



@ianbarber - ian@ibuildings.com
http://phpir.com

Document Classification In PHP - Slight Return

  • 1. Document Classification In PHP @ianbarber - ian@ibuildings.com http://phpir.com
  • 2. Document Classification Defining The Task Document Pre-processing Term Selection Algorithms
  • 4. Uses Ian Barber / @ianbarber / ian@ibuildings.com Filter Organise Metadata
  • 6. Organising - Single Label Classification
  • 7. Metadata - Multiple Label Classification
  • 10. Classes Training Test Documents Documents
  • 11. Evaluation spam ham true false spam positive positive false true ham negative negative
  • 12. Measures $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn); $precision = $tp / ($tp + $fp); $recall = $tp / ($tp + $fn);
  • 13. Vector Space Model - Bag Of Words
  • 14. $doc = strtolower(strip_tags($doc)); $regex = '/\w+/'; preg_match_all($regex, $doc, $matches); $words = $matches[0]; Extract Tokens
  • 15. A: I really like eggs B: I donʼt like cabbage, and donʼt like stew i really like eggs cabbage and donʼt stew A 1 1 1 1 0 0 0 0 B 1 0 1 0 1 1 1 1
  • 16. 2.00 1.00 i 0 -1.00 0 0.50 1.00 1.50 2.00 really
  • 17. $tf = $termCount; $idf = log($totalDocs / $docsWithTerm, 2); $tfidf = $tf * $idf; Term Weighting
  • 18. A: I really like eggs B: I donʼt like cabbage, and donʼt like stew C: I really, really like stew i really like eggs cabbage and donʼt stew A 0 0.58 0 1.58 0 0 0 0 B 0 0 0 0 1.58 1.58 3.16 0.58 C 0 1.17 0 0 0 0 0 0.58
  • 19. A: I really like eggs B: I donʼt like cabbage, and donʼt like stew C: I really, really like stew i really like eggs cabbage and donʼt stew A 0 0.35 0 0.94 0 0 0 0 B 0 0 0 0 0.40 0.40 0.81 0.15 C 0 0.89 0 0 0 0 0 0.44
  • 22. happening - happen, happens - happen, happened - happen http://tartarus.org/~martin/PorterStemmer http://snowball.tartarus.org/algorithms/dutchtml.. Stemming
  • 23. spam ham term $a $b not term $c $d Chi-Square
  • 24. $a = $termSpam; $b = $termHam; $c = $restSpam; $d = $restHam; $total = $a + $b + $c + $d; $diff = ($a * $d) - ($c * $b); $chisquare = ($total * pow($diff, 2)) / (($a+$c) * ($b+$d) * ($a+$b) * ($c+$d)); Chi-Square 1DF
  • 25. p chi2: 0.1 2.71; 0.05 3.84; 0.01 6.63; 0.005 7.88; 0.001 10.83 p-Value
  • 26. Decision Tree - ID3 ? ✔ ? ✖ ✔
  • 27. Entropy $entropy = -( ($spam/$total) * log($spam/$total, 2)) -( ($ham/$total) * log($ham/$total, 2));
  • 28. 1.00 0.75 entropy 0.50 0.25 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 spam/total
  • 29. Information Gain $gain = $baseEntropy - (($withCount/$total) * $withEntropy) - (($woutCount/$total) * $woutEntropy);
  • 30. Split Entropy Proportion E*P Base 50/50 1 1 1 With 20/5 0.722 0.25 0.1805 Without 30/45 0.97 0.75 0.7275 1 - With - Without = 0.092.
  • 31. function build($tree) { if(!$tree->count('spam')) { $tree->setLeaf('ham'); } else if(!$tree->count('ham')) { $tree->setLeaf('spam'); } else { $term = $tree->findMaxGain(); $tree->addSubtree($term, build($tree->getWith()), build($tree->getWout())); } return $tree; }
  • 32. term ✔ term ✖ term ✔ ✖
  • 33. Classification function classify($doc, $tree) { if($tree->isLeaf()) { return $tree->class; } $term = $tree->getSplitTerm(); if(in_array($term, $doc)) { return classify($doc, $tree->getWith()); } else { return classify($doc, $tree->getWout()); } }
  • 36. Spam Term X Ham Term Y
  • 37. Term X Term Y
  • 38. Term X Term Y
  • 39. Cosine Similarity $similarity = 0; foreach($doca as $term => $tfidf) { if(isset($docb[$term])) { $similarity += floatval($tfidf) * floatval($docb[$term]); } }
  • 40. Zend_Search_Lucene $index = Zend_Search_Lucene::create($db); $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'class', $class)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'contents', $content)); $index->addDocument($doc);
  • 41. Zend_Search_Lucene::setResultSetLimit(25); $analyser = Zend_Search_Lucene_Analysis_Analyzer::getDefault(); $tokens = $analyser->tokenize($content); foreach($tokens as $key => $token) { $tok = $token->getTermText(); if(strlen($tok) > 4) $filtered[$tok]++; } arsort($filtered); Classifying with ZSL
  • 42. $q = new Zend_Search_Lucene_Search_Query_MultiTerm(); $tc = 0; foreach($filtered as $t => $tf) { $q->addTerm( new Zend_Search_Lucene_Index_Term($t)); if(++$tc > 49) { break;} } $results = $index->find($q); foreach($results as $result) { $classes[$result->class] += 1; } arsort($classes); $class = key($classes);
  • 44. $flax = new FlaxSearchService('ip:8080'); $db = $flax->createDatabase('test'); $db->addField('class', array( 'store' => true, 'exacttext' => true)); $db->addField('contents', array( 'store' => false, 'freetext' => array('language'=>'en'))); $db->commit(); $db->addDocument(array( 'class' => $class, 'contents' => $document)); $db->commit();
  • 45. $db->addDocument( array('contents' => $doc), 'foo'); $db->commit(); $results = $db->searchSimilar('foo',0,25); $db->deleteDocument('foo'); $db->commit(); foreach($results['results'] as $r) { if($r['docid'] != 'foo') { $classes[$r['data']['class'][0]] += 1; } } arsort($classes); $class = key($classes);
  • 46. Spam Term X Ham Term Y
  • 47. Prototypes For Rocchio $mul = 1 / count($classDocs); foreach($classDocs as $doc) { foreach($doc as $tid => $tfidf) { $prototype[$tid] += $mul * $tfidf; } }
  • 48. Naive Bayes - Probability Based Classifier
  • 49. Bayes Theorem Pr(Class | Doc) = Pr(Doc | Class) * Pr(Class) / Pr(Doc) Pr(Class | Doc) ∝ Pr(Doc | Class) * Pr(Class)
  • 50. Likelihood Of Term Occurring Given Class word spam freq pr(word|spam) ham freq pr(word|ham) register 1757 0.11 246 0.02 sent 487 0.03 4600 0.36
  • 51. Estimating Likelihood $this->db->query(" INSERT INTO class_terms (class, term, likelihood) SELECT d.class, dt.term, count(*) / " . $classCount . " FROM documents AS d JOIN document_terms AS dt USING (did) WHERE d.class = '" . $class . "' GROUP BY d.class, dt.term" );
  • 52. Classifying A Document foreach($classes as $class) { $prob[$class] = 0.5; // assume prior foreach($document as $term) { $prob[$class] *= $likely[$term][$class]; } } arsort($prob); $class = key($prob);
  • 53. Document Classification Defining The Problem Document Processing Term Selection Algorithm
  • 54. Image Credits Title http://www.flickr.com/photos/themacinator/3499579760/ What is... http://www.flickr.com/photos/austinevan/1225274637/ Filter http://www.flickr.com/photos/benimoto/2913950616/ Organise http://www.flickr.com/photos/ellasdad/425813314/ Metadata http://www.flickr.com/photos/banky177/2282734063/ Manual http://www.flickr.com/photos/foundphotoslj/1134150364/ Automatic http://www.flickr.com/photos/29278394@N00/59538978/ Vector Space http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/ Reduction http://www.flickr.com/photos/wili/157220657/sizes/l/ Stemming http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/ Stop words http://www.flickr.com/photos/afroswede/22237769/ Chi-Squared http://www.flickr.com/photos/kdkd/2837565850/sizes/o/ ID3 http://www.flickr.com/photos/tonythemisfit/2414239471 Overfitting http://www.flickr.com/photos/akirkley/3222128726/sizes/l/ Bayes http://www.flickr.com/photos/darwinbell/440080655/sizes/l/ Conclusion http://www.flickr.com/photos/mukluk/241256203 Credits http://www.flickr.com/photos/librarianavengers/413762956/
  • 55. Questions? @ianbarber - ian@ibuildings.com http://phpir.com
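The vector space steps on slides 14 to 19 and the cosine similarity on slide 39 can be strung together roughly as follows. This is a minimal sketch, assuming PHP 7.4 or later; the function names `tokenise`, `tfidfVectors` and `cosine` are made up for illustration and do not come from the talk. Because `\w+` splits on apostrophes, the example writes "dont" without one.

```php
<?php
// Sketch only: tokenise(), tfidfVectors() and cosine() are illustrative names, not from the slides.

/** Lower-case a document, strip tags, and return its word tokens (bag of words). */
function tokenise(string $doc): array
{
    preg_match_all('/\w+/', strtolower(strip_tags($doc)), $matches);
    return $matches[0];
}

/** Build unit-length tf-idf vectors (term => weight), one per document. */
function tfidfVectors(array $docs): array
{
    $counts = [];       // term counts per document
    $docsWithTerm = []; // number of documents containing each term
    foreach ($docs as $id => $doc) {
        $counts[$id] = array_count_values(tokenise($doc));
        foreach (array_keys($counts[$id]) as $term) {
            $docsWithTerm[$term] = ($docsWithTerm[$term] ?? 0) + 1;
        }
    }

    $totalDocs = count($docs);
    $vectors = [];
    foreach ($counts as $id => $termCounts) {
        $vector = [];
        foreach ($termCounts as $term => $tf) {
            $weight = $tf * log($totalDocs / $docsWithTerm[$term], 2);
            if ($weight > 0) {
                $vector[$term] = $weight; // terms found in every document weigh 0 and are dropped
            }
        }
        // Normalise: divide each weight by the vector's Euclidean length.
        $length = sqrt(array_sum(array_map(fn($w) => $w * $w, $vector)));
        foreach ($vector as $term => $weight) {
            $vector[$term] = $weight / $length;
        }
        $vectors[$id] = $vector;
    }
    return $vectors;
}

/** Cosine similarity of two unit-length vectors: the dot product over shared terms. */
function cosine(array $a, array $b): float
{
    $similarity = 0.0;
    foreach ($a as $term => $weight) {
        $similarity += $weight * ($b[$term] ?? 0.0);
    }
    return $similarity;
}

$vectors = tfidfVectors([
    'A' => 'I really like eggs',
    'B' => 'I dont like cabbage, and dont like stew',
    'C' => 'I really, really like stew',
]);
printf("A vs C: %.2f\n", cosine($vectors['A'], $vectors['C']));
```

Storing only the non-zero weights keeps each vector sparse, which matters once the dictionary grows to thousands of terms.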

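The term selection and tree splitting measures from slides 24 to 30 can be checked against the worked table on slide 30. The sketch below is one possible arrangement, not the speaker's code: the helper names are hypothetical, and the guard for empty classes is an addition, since the slide formula cannot evaluate log(0).

```php
<?php
// Sketch only: helper names are illustrative; the counts come from the worked table on slide 30.

/** Entropy in bits of a two-class set, given the document count in each class. */
function entropy(int $spam, int $ham): float
{
    $total = $spam + $ham;
    $bits = 0.0;
    foreach ([$spam, $ham] as $count) {
        if ($count > 0) { // an empty class contributes nothing (0 * log 0 taken as 0)
            $p = $count / $total;
            $bits -= $p * log($p, 2);
        }
    }
    return $bits;
}

/** Information gain of a split, given [spam, ham] counts with and without the term. */
function informationGain(array $with, array $without): float
{
    $withCount = array_sum($with);
    $woutCount = array_sum($without);
    $total = $withCount + $woutCount;
    $baseEntropy = entropy($with[0] + $without[0], $with[1] + $without[1]);

    return $baseEntropy
        - ($withCount / $total) * entropy($with[0], $with[1])
        - ($woutCount / $total) * entropy($without[0], $without[1]);
}

/** Chi-square with one degree of freedom for a 2x2 term/class table (no empty row or column). */
function chiSquare(int $a, int $b, int $c, int $d): float
{
    $total = $a + $b + $c + $d;
    $diff = ($a * $d) - ($c * $b);
    return ($total * $diff ** 2) / (($a + $c) * ($b + $d) * ($a + $b) * ($c + $d));
}

// Slide 30: the term appears in 20 spam and 5 ham docs, and is absent from 30 spam and 45 ham docs.
$gain = informationGain([20, 5], [30, 45]);
$chi2 = chiSquare(20, 5, 30, 45);
printf("gain = %.3f, chi-square = %.2f\n", $gain, $chi2);
```

With unrounded entropies the gain comes out at about 0.091, slightly below the 0.092 on the slide, which subtracts rounded intermediate values. The same counts give a chi-square of 12, above the 10.83 threshold in the p-value table.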
Editor's Notes

  1. Hello! PSC @ Ibuildings Twitter Email Blog - related posts
  2. This is what you need to do to implement a classifier And also our table of contents a note on PHP - Qs at end, but ask syntax qs straight away First, talk about what and why?
  3. What is it - Assign documents to classes from a predefined set Classes can be any label - e.g. topic words, categories Documents in this case is text, web pages, emails, books But it can be really anything as long as you can extract features from it Algos not hard, applicable in all langs. Python/Java have good library versions So - Why do in PHP? Integrate into web apps - WP, Drupal, MediaWiki
  4. Classification is really organising of information - every day Lots of uses - can group into common tasks of filter, organise, add metadata Might do all three with uploading photos to flickr or facebook Filter, get rid of bad ones. Organise, upload to album or set Tag photos with people in them etc.
  5. Filtering is binary - Class OR Not Class - often hide or remove one lot Can often break other types down into a series of these binary choices BUT: simple, not easy. In flickr example, what is good? - photographer, composition, light etc. - regular person, contains their friends etc. - SUBJECTIVE
  6. Organising is putting document in one place - one label chosen from a set of many possible Single label only (often EXACTLY 1, 0 not allowed) Folders, albums, libraries, handwriting recognition
  7. Tagging, can have multiple Often 0 to many labels Often for tagging topics in content E.g. a news story on us-china embargo talk might be filed under: US, China, Trade
  8. In 80s people would come up with rules - computers would apply IF this term AND this term THEN this category Took a lot of time - Needed domain expert -Needed knowledge engineer to get knowledge out of expert Hard to scale, need more experts for new cats - Subjective - experts disagree Usually result was 60%-90% accurate
  9. Machine Learning - ‘look at examples’ - Supervised Learning Work out rules based on manually classified data People don’t need to explain their thinking - just organise - easier Scales better, is cheaper, and about as accurate! In the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
  10. So what do we need? 1. the classes to classify to 2. A set of manually classified documents to train the classifier on 3. A set of manually classified docs to test on In some cases may have a third set of manually classified docs for validation How do we use these? We train a learner on training data to create a model
  11. Then use the model to classify each test document Compare manual to automatic judgements Here we’ve got a binary classification, for a spam checker Top is the manual judgement, side is classifier judgement Boxes will just be counts of judgements With that we can calculate some stats
  12. Accuracy is just correct percentage - BUT big biased sets make ‘no to all’ look accurate. Or we desire bias e.g. FN over FP with spam Precision measures what percentage of predicted positives are true positives Recall measures what percentage of the available positives we capture Can have one w/out other: high threshold for precision, all positive for recall Researchers quote breakeven or fbeta
  13. To compare classifiers, researchers often quote breakeven point This is just where recall and precision are equal F-Beta allows weighting precision more than recall, or vice versa. Beta = 1 is balanced If beta = 0.5, recall is half as important as precision, such as spam checker Before classifying, we need to extract features. How do we represent text
  14. All this work is classic Information Retrieval Bag of Words is so called because we discard the structure, and just note down appearances of words Throw away the ordering, any structure at all from web pages etc. See why called vector space in a couple of slides
  15. First we have to extract words Simplest version: Take continuous sequences of word character Ignore all punctuation including apostrophes etc. Each new token we find in each document will be added to a dictionary Each document has a vector - there is a dimension for each dictionary word Value is 1 if the document contained that token, 0 if it did not
  16. Here is the collection of these two phrases as a vector. 1 if the word is in the document, 0 if not Note both vectors have the same dimensions In a real document collection there are lots of dimensions!
  17. We can plot the documents on a graph - using 2 terms ‘i’ and ‘really’ Here the green circle is document A the red triangle B The documents on the last page are in 8 dimensional space - 8 terms But we want more information - how important a term is to a document Need to capture a position in that dimension other than 0 & 1 A weight
  18. TFIDF is a classic and very common weighting - there are a lot of variations though TF is just count of instances of that term in the doc IDF is log of (number of docs divided by number with term) Gives less common terms a higher weight So best is uncommon term that appears a lot Let’s look at a similar example to before, with some term weights added
  19. The idf means that the ‘i’ and ‘like’ actually disappear here In all docs - no distinguishing power - no value to doc ‘Don’t’ gets weighted higher in B, where it appears twice Then normalise to unit length
  20. Normalise is just each value divided by total length (sqrt of the sqrd values) I and Like still 0 though Waste of time processing Maybe there are others that are a waste of time?
  21. DR or term space reduction is removing terms that don’t contribute much This can often be by a factor of 10 or 100 Speeds up execution
  22. May have heard of stop words - Common in search engines Words like ‘of’ ‘the’ ‘an’ - or ‘het’, ‘de’ in dutch Little to no semantic value to us Can use a stoplist of words, or infer it from low idf scores Collection stop words Pokemon in english, not a stop word. Pokemon on pokemon forum: stop word.
  23. Try to come up with ‘root’ word Maps lots of different variations onto one term, reducing dimensions Result is usually not a real word, it’s just repeatable
  24. Kai-Square, greek not chinese - Helps choose indicative terms for each class Statistical technique - Calculates how related a term is to a class Take 4 Counts from data. How many spam docs contain term etc. We look for difference between expected and actual counts For a given cell Expected is the row sum * col sum / total Square the difference, divided by expected value, and add all them up
  25. Plug the numbers into this formula: a one step way of doing the same thing Comes out with a number - not interesting absolutely But is interesting relatively Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution 1DF because there is one variable and one dependent (term) (class)
  26. P is the chance that variables are independent For > 10.83 we are 99.9% certain the variables change together Can work out the probability number from a chi-square distribution But for DR, can just use a threshold and remove terms below OK, so we’ve got a good set of data, now we need a learner
  27. Tree of ‘has term?’ questions - ends in class decision Easy to classify, and recursive building algorithm pretty easy Algo is: If all collection is class, then leaf of class Else, choose the best term - Split into 2 collections, WITH and WITHOUT term Recurse on each half But how does it determine best?
  28. First, calculate entropy Take counts for how many total docs, how many spam, how many ham minus section could be repeated for multiple classes Represents num bits needed to encode the class of a random choice of document from this set How much new information we get - Easier to see on graph
  29. Percentage of spam on horizontal entropy on vertical If all spam or no spam no entropy - we know what will come out If 50/50 entropy is 1 - we can’t guess ahead of time We want to reduce entropy - so that the sets are more consistent
  30. We’re using the entropy to calculate the maximum information gain This is the overall reduction in entropy The original entropy minus the new entropy New is weighted by the proportion of docs in each group withCount is the number of docs that have the feature woutCount is the number without, total is the total
  31. The split is how many of each class are in the group The entropy is calculated with the formula before The proportion is just the percentage of the total documents Final col is just entropy times proportion Note that the with class is very biased with a low entropy BUT - only a small proportion, so the final information gain is low
  32. Easy to implement recursive builder If ‘spam’ or ‘ham’ are empty - we say the tree is a leaf node. If not, we find the term with the highest info gain And build a subtree based on the sets of documents with and without the term Just need to traverse to classify
  33. A completely made up example of an output tree.
  34. Millions of ways to do this, of course! Simple function to return leaf node Assumes document is an array of words
  35. Problem: Tree gets too specific to training data - Need to generalise Stop condition - min info gain or other Pruning - test by trimming off bottom parts of tree Use validation set to test effectiveness of measures DTs generate human interpretable rules - very handy BUT expensive to train, need small N dimension, and often require rebuilding
  36. KNN is much cheaper at training time - as there is no training Recall we can regard documents as vectors in a N-dimensional space Where N is the size of the dictionary
  37. Let’s consider only 2 terms Docs with weights for terms X and Y Documents of class triangle and class circle They seem to have a spatial cluster This is also true in higher dimension for real documents
  38. Class of new doc = class of its K nearest neighbours The K is how many we look for
  39. In this case K is three, and the nearest three are all green circles. Choosing K is kind of hard, you might try a few different values but it’s usually in the 11-30 doc range - uneven to num classes Only real challenge is comparing documents Here we are looking at just the X and Y distance, this is the euclidean distance
  40. Very easy. Simply looking at the difference between one and the other Can actually do the whole thing in the database ! But, has some problems, so more common...
  41. Similarity measure, goes to 1 for identical, 0 for orthogonal Easy to do with normalised vectors - just take dot product Multiply each dimension in Doc A with it in Doc B, and sum Provides better matching than Euclidean We could just loop over documents, find K most similar But search engines do a very similar job - why not use one?
  42. Two options when classifying: count most common or add similarities Second helps, e.g. if 5 good matches in class A, 10 poor matches class B For multiple class tagging: use thresholds BUT: Have to compare all documents Search engines do a very similar job, use similar scoring. Why not use one?
  43. We can use the Zend Framework native PHP implementation of Lucene We add a stored ‘class’ field, and our contents as an unstored field We would loop over our training data this way, adding documents
  44. Then, we construct a query. Use the same analyser to tokenise documents the same as training data And take a count of how often each word appears We don’t have IDF, so we’re just filtering short words
  45. Construct a query with the top 50 words by term frequency Results: take the most common class Works OK, not great. Java Lucene, can get a term vector - includes the true weights We aren’t limited to using pure PHP search engines though
  46. Flax is based on the open source Xapian engine, kind of like their Solr Has a similarity search that makes KNN ridiculously easy and very effective It works around the same lines as before, but extracting a set of relevant terms from the document or documents in question Weighting scheme is BM25 - more advanced
  47. This code creates a database, adds two fields to it, and indexes a document Uses a restful web service - available from any language
  48. Very similar to lucene loop Except we add then remove a document to use searchSimilar feature Gets good accuracy and is really fast. However, if we want to use this kind of technique and don’t have a flax handy, there is another related technique
  49. Instead of taking each value and comparing it We take the average of all the documents in each class And compare against that Very easy This works surprisingly well!
  50. Here we compute the centroid or average of all the class By summing the weight * 1/count. You might do this in the database, pretty straightforward op. Called a rohk-key-oh classifier because it’s based on a relevance feedback technique by Rocchio Classify by doing similarity against each - taking closest
  51. Quick and easy probability based classifier Very commonly used in spam checking, very trendy a couple of years back Naive assumption is that words are independent One word does not influence chances of seeing another - not true! BUT: Means that we don’t need an example for each combination of attributes Bayes is good at very high dimensionality because of this
  52. This is the Bayes theorem. Read the pipe as ‘given’, pr as probability of Pr(Doc) is constant, can be dropped for ranking Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary Have to work out Pr(Doc|Class) We calculate that by looking at the probability of the features in the docs
  53. We can look at the data itself to calculate the term likelihoods Conditional probability: Docs with term in class / Docs in class We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11. Register is more spam than ham, sent is more ham than spam
  54. Can calculate in SQL directly ClassCount is the number of docs in that class - from earlier query Divide: Number of docs in class containing term / Number of docs in class The stored value is the likelihood of seeing that term in a doc of that class Would call once for each class
  55. Independence assumption lets us treat probability of doc as product of probabilities of word for the given class Loop over the terms and multiply likelihood for each class Assumed prior of 0.5 Multi-Bernoulli model - multinomial is term count in class over overall term count
  56. To sum up, these are the steps for a wide range of problems Step 1: Recognising that something is a classification problem - context spelling, author ident, intrusion detection, find genes in DNA Then extract features from the docs Apply a learner to generate a model for classifying Something for your mental toolbox!
  57. Thanks to the people who put their photos on flickr under Creative Commons
  58. Any questions?
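Notes 12 and 13 mention F-beta without showing it. A minimal sketch of the standard F-beta formula, written in the style of the measures on slide 12, might look like this. The function name `fBeta` and the example counts are hypothetical, and returning 0 when there are no true positives is a convention assumed here to avoid dividing by zero.

```php
<?php
// Sketch only: fBeta() and the counts below are illustrative, not from the slides.

/** F-beta from confusion-matrix counts; beta < 1 favours precision, beta > 1 favours recall. */
function fBeta(int $tp, int $fp, int $fn, float $beta = 1.0): float
{
    if ($tp === 0) {
        return 0.0; // assumed convention: no true positives means a score of 0
    }
    $precision = $tp / ($tp + $fp);
    $recall = $tp / ($tp + $fn);
    $betaSquared = $beta * $beta;

    return (1 + $betaSquared) * $precision * $recall
        / ($betaSquared * $precision + $recall);
}

// Hypothetical spam checker: 40 spam caught, 10 ham wrongly flagged, 20 spam missed.
$f1 = fBeta(40, 10, 20);         // balanced
$fHalf = fBeta(40, 10, 20, 0.5); // recall counts half as much as precision
printf("F1 = %.3f, F0.5 = %.3f\n", $f1, $fHalf);
```

Here precision (0.8) is higher than recall (about 0.67), so the precision-weighted F0.5 comes out above F1, which is the behaviour a spam checker that prefers false negatives over false positives would want to reward.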