SlideShare uma empresa Scribd logo
1 de 30
Premise




 You can determine the personality profile of
 a person based on their browsing habits
Technology was the Solution!
Then I Read This…



               Us & Them
               The Science of Identity
               By David Berreby
The Different States of Knowledge


 What a person knows


 What a person knows they don’t know


 What a person doesn’t know they don’t know
Technology was NOT the Solution



   Identity and discovery are
   NOT a technology solution
Our Subject Material
Our Subject Material

        HTML content is unstructured


        You can’t trust that anything
        semantically valid will be present


        There are some pretty bad web
        practices on the interwebz
How We’ll Capture This Data




             Start with base linguistics

             Extend with available extras
The Basic Pieces




  Page Data   Keywords      Weighting
   Scrapey    Without all   Word diets
   Scrapey     the fluff      FTW
Capture Raw Page Data


             Semantic data on the web
             is sucktastic

             Assume 5 year olds built
             the sites
             Language is the key
Extract Keywords



              We now have a big jumble
              of words. Let’s extract

              Why is “and” a top word?
              Stop words = sad panda
Weight Keywords


             All content is not created
             equal

             Meta and headers and
             semantics oh my!

             This is where we leech
             off the work of others
Questions to Keep in Mind

   Should I use regex to parse web
   content?

    How do users interact with page
    content?

   What key identifiers can be monitored
   to detect interest?
Fetching the Data: The Request

The Simple Way

  $html = file_get_contents('URL');


The Controlled Way

  $c = curl_init('URL');
Fetching the Data: cURL
 $req = curl_init($url);

 $options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
 );

 curl_setopt_array($req, $options);
//list of findable / replaceable string characters
$find = array('/r/', '/n/', '/ss+/'); $replace = array(' ', ' ', ' ');

//perform page content modification
$mod_content = preg_replace('#<script(.*?)>(.*?)</
      script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</
      style>#is', '', $mod_content);

$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);

natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();

//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
   $word = trim($word);
   if(strlen($word) > 2 && !in_array($word, $common_words)){
      $searched_words[$word]++;
   }
}

arsort($searched_words, SORT_NUMERIC);
Scraping Site Meta Data



 //load scraped page data as a valid DOM document
 $dom = new DOMDocument();
 @$dom->loadHTML($page_content);

 //scrape title
 $title = $dom->getElementsByTagName("title");
 $title = $title->item(0)->nodeValue;
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");

for ($i = 0; $i < $metas->length; $i++){
  $meta = $metas->item($i);
  if($meta->getAttribute("property")){
    if ($meta->getAttribute("property") == "og:description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    }
  } else {
    if($meta->getAttribute("name") == "description"){
      $dataReturn["description"] = $meta->getAttribute("content");
    } else if($meta->getAttribute("name") == "keywords”){
      $dataReturn[”keywords"] = $meta->getAttribute("content");
    }
  }
}
Weighting Important Data


              Tags you should care
              about: meta (include
              OG), title, description, h1+
              , header

              Bonus points for adding in
              content location modifiers
Weighting Important Tags


//our keyword weights
$weights = array("keywords"   => "3.0",
                 "meta"       => "2.0",
                 "header1"    => "1.5",
                 "header2"    => "1.2");

//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
   $searched_words[$word]++;
}
Expanding to Phrases


            2-3 adjacent words, making
            up a direct relevant callout

            Seems easy right? Just like
            single words

            Language gets wonky
            without stop words
Working with Unknown Users



            The majority of users won’t
            be immediately targetable

            Use HTML5 LocalStorage &
            Cookie backup
Adding in Time Interactions

             Interaction with a site does
             not necessarily mean
             interest in it

             Time needs to also include
             an interaction component

             Gift buying seasons see
             interest variations
Grouping Using Commonality




                  Common
                  Interests
      Interests               Interests
      User A                    User B
Building an Identity Extraction Engine

Mais conteúdo relacionado

Mais de Jonathan LeBlanc

Box Platform Developer Workshop
Box Platform Developer WorkshopBox Platform Developer Workshop
Box Platform Developer WorkshopJonathan LeBlanc
 
Modern Cloud Data Security Practices
Modern Cloud Data Security PracticesModern Cloud Data Security Practices
Modern Cloud Data Security PracticesJonathan LeBlanc
 
Understanding Box UI Elements
Understanding Box UI ElementsUnderstanding Box UI Elements
Understanding Box UI ElementsJonathan LeBlanc
 
Understanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingUnderstanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingJonathan LeBlanc
 
The Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyThe Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyJonathan LeBlanc
 
Modern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensModern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensJonathan LeBlanc
 
Creating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchCreating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchJonathan LeBlanc
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaJonathan LeBlanc
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsJonathan LeBlanc
 
Node.js Authentication and Data Security
Node.js Authentication and Data SecurityNode.js Authentication and Data Security
Node.js Authentication and Data SecurityJonathan LeBlanc
 
PHP Identity and Data Security
PHP Identity and Data SecurityPHP Identity and Data Security
PHP Identity and Data SecurityJonathan LeBlanc
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaJonathan LeBlanc
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsJonathan LeBlanc
 
Future of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityFuture of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityJonathan LeBlanc
 
Building a Mobile Location Aware System with Beacons
Building a Mobile Location Aware System with BeaconsBuilding a Mobile Location Aware System with Beacons
Building a Mobile Location Aware System with BeaconsJonathan LeBlanc
 
Identity in the Future of Embeddables & Wearables
Identity in the Future of Embeddables & WearablesIdentity in the Future of Embeddables & Wearables
Identity in the Future of Embeddables & WearablesJonathan LeBlanc
 
Internet Security and Trends
Internet Security and TrendsInternet Security and Trends
Internet Security and TrendsJonathan LeBlanc
 

Mais de Jonathan LeBlanc (20)

Box Platform Developer Workshop
Box Platform Developer WorkshopBox Platform Developer Workshop
Box Platform Developer Workshop
 
Modern Cloud Data Security Practices
Modern Cloud Data Security PracticesModern Cloud Data Security Practices
Modern Cloud Data Security Practices
 
Box Authentication Types
Box Authentication TypesBox Authentication Types
Box Authentication Types
 
Understanding Box UI Elements
Understanding Box UI ElementsUnderstanding Box UI Elements
Understanding Box UI Elements
 
Understanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingUnderstanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scoping
 
The Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyThe Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments Globally
 
Modern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensModern API Security with JSON Web Tokens
Modern API Security with JSON Web Tokens
 
Creating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchCreating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from Scratch
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication Media
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile Payments
 
Node.js Authentication and Data Security
Node.js Authentication and Data SecurityNode.js Authentication and Data Security
Node.js Authentication and Data Security
 
PHP Identity and Data Security
PHP Identity and Data SecurityPHP Identity and Data Security
PHP Identity and Data Security
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication Media
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile Payments
 
Future of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityFuture of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable Security
 
Kill All Passwords
Kill All PasswordsKill All Passwords
Kill All Passwords
 
BattleHack Los Angeles
BattleHack Los Angeles BattleHack Los Angeles
BattleHack Los Angeles
 
Building a Mobile Location Aware System with Beacons
Building a Mobile Location Aware System with BeaconsBuilding a Mobile Location Aware System with Beacons
Building a Mobile Location Aware System with Beacons
 
Identity in the Future of Embeddables & Wearables
Identity in the Future of Embeddables & WearablesIdentity in the Future of Embeddables & Wearables
Identity in the Future of Embeddables & Wearables
 
Internet Security and Trends
Internet Security and TrendsInternet Security and Trends
Internet Security and Trends
 

Último

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Último (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Building an Identity Extraction Engine

  • 1.
  • 2. Premise You can determine the personality profile of a person based on their browsing habits
  • 3. Technology was the Solution!
  • 4. Then I Read This… Us & Them The Science of Identity By David Berreby
  • 5. The Different States of Knowledge What a person knows What a person knows they don’t know What a person doesn’t know they don’t know
  • 6. Technology was NOT the Solution Identity and discovery are NOT a technology solution
  • 8. Our Subject Material HTML content is unstructured You can’t trust that anything semantically valid will be present There are some pretty bad web practices on the interwebz
  • 9. How We’ll Capture This Data Start with base linguistics Extend with available extras
  • 10.
  • 11. The Basic Pieces Page Data Keywords Weighting Scrapey Without all Word diets Scrapey the fluff FTW
  • 12. Capture Raw Page Data Semantic data on the web is sucktastic Assume 5 year olds built the sites Language is the key
  • 13. Extract Keywords We now have a big jumble of words. Let’s extract Why is “and” a top word? Stop words = sad panda
  • 14. Weight Keywords All content is not created equal Meta and headers and semantics oh my! This is where we leech off the work of others
  • 15.
  • 16. Questions to Keep in Mind Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?
  • 17. Fetching the Data: The Request The Simple Way $html = file_get_contents('URL'); The Controlled Way $c = curl_init('URL');
  • 18. Fetching the Data: cURL $req = curl_init($url); $options = array( CURLOPT_URL => $url, CURLOPT_HEADER => $header, CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true, CURLOPT_AUTOREFERER => true, CURLOPT_TIMEOUT => 15, CURLOPT_MAXREDIRS => 10 ); curl_setopt_array($req, $options);
  • 19. //list of findable / replaceable string characters $find = array('/r/', '/n/', '/ss+/'); $replace = array(' ', ' ', ' '); //perform page content modification $mod_content = preg_replace('#<script(.*?)>(.*?)</ script>#is', '', $page_content); $mod_content = preg_replace('#<style(.*?)>(.*?)</ style>#is', '', $mod_content); $mod_content = strip_tags($mod_content); $mod_content = strtolower($mod_content); $mod_content = preg_replace($find, $replace, $mod_content); $mod_content = trim($mod_content); $mod_content = explode(' ', $mod_content); natcasesort($mod_content);
  • 20. //set up list of stop words and the final found stopped list $common_words = array('a', ..., 'zero'); $searched_words = array(); //extract list of keywords with number of occurrences foreach($mod_content as $word) { $word = trim($word); if(strlen($word) > 2 && !in_array($word, $common_words)){ $searched_words[$word]++; } } arsort($searched_words, SORT_NUMERIC);
  • 21. Scraping Site Meta Data //load scraped page data as a valid DOM document $dom = new DOMDocument(); @$dom->loadHTML($page_content); //scrape title $title = $dom->getElementsByTagName("title"); $title = $title->item(0)->nodeValue;
  • 22. //loop through all found meta tags $metas = $dom->getElementsByTagName("meta"); for ($i = 0; $i < $metas->length; $i++){ $meta = $metas->item($i); if($meta->getAttribute("property")){ if ($meta->getAttribute("property") == "og:description"){ $dataReturn["description"] = $meta->getAttribute("content"); } } else { if($meta->getAttribute("name") == "description"){ $dataReturn["description"] = $meta->getAttribute("content"); } else if($meta->getAttribute("name") == "keywords”){ $dataReturn[”keywords"] = $meta->getAttribute("content"); } } }
  • 23.
  • 24. Weighting Important Data Tags you should care about: meta (include OG), title, description, h1+ , header Bonus points for adding in content location modifiers
  • 25. Weighting Important Tags //our keyword weights $weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2"); //add modifier here if(strlen($word) > 2 && !in_array($word, $common_words)){ $searched_words[$word]++; }
  • 26. Expanding to Phrases 2-3 adjacent words, making up a direct relevant callout Seems easy right? Just like single words Language gets wonky without stop words
  • 27. Working with Unknown Users The majority of users won’t be immediately targetable Use HTML5 LocalStorage & Cookie backup
  • 28. Adding in Time Interactions Interaction with a site does not necessarily mean interest in it Time needs to also include an interaction component Gift buying seasons see interest variations
  • 29. Grouping Using Commonality Common Interests Interests Interests User A User B

Notas do Editor

  1. Technology is the solution!
  2. We’ll be looking at unstructured web page data
  3. The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  4. Open graph protocol
  5. Different methods for making the request
  6. This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  7. Stripping irrelevant data
  8. Scraping site keywords