When building customized experiences for your users, the key is understanding who those users are and what they're interested in. The traditional way to do this, a profile system, relies entirely on user-curated content: users can enter whatever they want and be whoever they want. That lets people portray themselves as they wish to the outside world, but it makes for an unreliable identity source because it is based on perceived identity. In this session we will take a practical look at building an identity entity extraction engine in PHP that works from web sources. The result is a highly personalized, automated identity mechanism that can drive customized experiences for users based on their derived personalities. We will explore concepts such as:
- Building a categorization profile of interests for users using web sources that the user interacts with.
- Using weighting mechanisms, like the Open Graph Protocol, to drive higher levels of entity relevance.
- Creating personality overlays between multiple users to surface new content sources.
- Dealing with users who are unknown to you by combining identity data capture with HTML5 storage mechanisms.
8. Our Subject Material
HTML content is unstructured
You can’t trust that anything
semantically valid will be present
There are some pretty bad web
practices on the interwebz
9. How We’ll Capture This Data
Start with base linguistics
Extend with available extras
11. The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
12. Capture Raw Page Data
Semantic data on the web
is sucktastic
Assume 5 year olds built
the sites
Language is the key
13. Extract Keywords
We now have a big jumble
of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
14. Weight Keywords
All content is not created
equal
Meta and headers and
semantics oh my!
This is where we leech
off the work of others
16. Questions to Keep in Mind
Should I use regex to parse web
content?
How do users interact with page
content?
What key identifiers can be monitored
to detect interest?
17. Fetching the Data: The Request
The Simple Way
$html = file_get_contents('URL');
The Controlled Way
$c = curl_init('URL');
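The "controlled way" is only started on the slide. A minimal sketch of how it might be finished, assuming the cURL extension is available, follows. The `fetch_page` name, the timeout values, and the user agent string are placeholders for illustration and do not come from the talk.

```php
<?php
// Hypothetical helper: fetch a page with explicit timeouts and redirect limits.
// Returns the response body, or null on a transport error or non-2xx status.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $c = curl_init($url);
    if ($c === false) {
        return null;
    }
    curl_setopt_array($c, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed limit
        CURLOPT_CONNECTTIMEOUT => 5,     // assumed limit, in seconds
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleEntityBot/0.1', // placeholder value
    ]);
    $body = curl_exec($c);
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

Compared with `file_get_contents`, this gives control over timeouts and redirects, which matters when the target sites are slow or badly behaved.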
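20.
//set up the list of stop words and the array of keyword counts
//(the stop word list is elided in the original)
$common_words = array('a', ..., 'zero');
$searched_words = array();
//count occurrences of each keyword; $mod_content is assumed to be
//the array of words split out of the page content
foreach ($mod_content as $word) {
    $word = trim($word);
    if (strlen($word) > 2 && !in_array($word, $common_words, true)) {
        //start the counter at zero on first sight to avoid an undefined index notice
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }
}
//sort by number of occurrences, highest first
arsort($searched_words, SORT_NUMERIC);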
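The snippet above expects `$mod_content` to already hold the page's words, but the step that produces it is not shown. One possible way to get from raw HTML to keyword counts is sketched here. The function names, the regex-based script and style stripping, and the tokenization rule are all assumptions for illustration, not the talk's method.

```php
<?php
// Hypothetical helper: turn raw HTML into a lowercase word list.
function tokenize_html(string $html): array
{
    // Drop script/style bodies first, since strip_tags() keeps their text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);
    // Split on anything that is not a letter, digit, or apostrophe.
    return preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: count words longer than two characters that are
// not stop words, sorted by count with the highest first.
function count_keywords(array $words, array $stopWords): array
{
    $counts = [];
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stopWords, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}
```

The ASCII-only split pattern is a simplification. Pages in other languages would need a Unicode-aware pattern and a matching stop word list.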
21. Scraping Site Meta Data
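//load scraped page data as a DOM document, tolerating malformed markup
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page_content);
libxml_clear_errors();
//scrape title, falling back to an empty string when the page has none
$titles = $dom->getElementsByTagName("title");
$title = $titles->length > 0 ? $titles->item(0)->nodeValue : "";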
22.
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    if ($meta->getAttribute("property")) {
        //Open Graph tags use the property attribute
        if ($meta->getAttribute("property") == "og:description") {
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard meta tags use the name attribute
        if ($meta->getAttribute("name") == "description") {
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if ($meta->getAttribute("name") == "keywords") {
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
24. Weighting Important Data
Tags you should care
about: meta (including
OG), title, description,
h1+, header
Bonus points for adding in
content location modifiers
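A rough sketch of what location-based weighting could look like: words that also appear in the title, meta tags, or h1 headings get a boost on top of their raw count. The weight values, the `LOCATION_WEIGHTS` constant, and both function names are hypothetical. The talk does not specify numbers, so treat these as tunable assumptions.

```php
<?php
// Hypothetical weights: extra "occurrences" a word earns per location.
const LOCATION_WEIGHTS = ['title' => 5, 'meta' => 4, 'h1' => 3];

// Collect the text found in each weighted location of the page.
function collect_weighted_text(string $html): array
{
    $found = ['title' => '', 'meta' => '', 'h1' => ''];
    if ($html === '') {
        return $found;
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('title') as $node) {
        $found['title'] .= ' ' . $node->textContent;
    }
    foreach ($dom->getElementsByTagName('h1') as $node) {
        $found['h1'] .= ' ' . $node->textContent;
    }
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Open Graph tags use "property"; standard meta tags use "name".
        $key = strtolower($meta->getAttribute('property') ?: $meta->getAttribute('name'));
        if (in_array($key, ['description', 'keywords', 'og:title', 'og:description'], true)) {
            $found['meta'] .= ' ' . $meta->getAttribute('content');
        }
    }
    return $found;
}

// Boost each already-counted keyword once per location it appears in.
function weight_keywords(array $counts, array $locationText): array
{
    foreach ($locationText as $location => $text) {
        $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            if (isset($counts[$word])) {
                $counts[$word] += LOCATION_WEIGHTS[$location];
            }
        }
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}
```

Only words already present in the body counts are boosted here, so a keyword stuffed into a meta tag but absent from the page gains nothing. That is a design choice of this sketch, not something the slides prescribe.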
26. Expanding to Phrases
2-3 adjacent words, making
up a direct relevant callout
Seems easy right? Just like
single words
Language gets wonky
without stop words
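One way to handle the stop word problem for phrases is to keep stop words in the word stream and only reject phrases that begin or end with one. The sketch below does that. The `count_phrases` name and the edge-only filtering rule are assumptions made for illustration.

```php
<?php
// Hypothetical helper: count phrases made of $n adjacent words.
// $words must still contain stop words so that adjacency is preserved.
function count_phrases(array $words, int $n, array $stopWords): array
{
    $counts = [];
    if ($n < 1) {
        return $counts;
    }
    $words = array_values($words);
    $last = count($words) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $gram = array_slice($words, $i, $n);
        // Reject phrases that start or end on a stop word, but allow
        // interior ones so phrases like "state of mind" survive.
        if (in_array($gram[0], $stopWords, true) || in_array($gram[$n - 1], $stopWords, true)) {
            continue;
        }
        $phrase = implode(' ', $gram);
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}
```

Stripping stop words before building phrases would join words that were never adjacent on the page, which is why the filtering happens per phrase rather than per word.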
27. Working with Unknown Users
The majority of users won’t
be immediately targetable
Use HTML5 LocalStorage &
Cookie backup
28. Adding in Time Interactions
Interaction with a site does
not necessarily mean
interest in it
Time needs to also include
an interaction component
Gift buying seasons see
interest variations
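To make the "time plus interaction" idea concrete, here is a purely illustrative scoring sketch. The formula, the 300 second cap, and the `interest_score` name are all invented for this example and are not from the talk. The only point carried over is that time on a page counts for nothing unless the user also interacted with it.

```php
<?php
// Hypothetical score in the range 0..1 combining dwell time and interactions
// (clicks, scrolls, and so on). Assumes $capSeconds is greater than zero.
function interest_score(int $secondsOnPage, int $interactions, int $capSeconds = 300): float
{
    if ($secondsOnPage <= 0 || $interactions <= 0) {
        return 0.0; // time alone, such as an idle background tab, earns nothing
    }
    // Cap the time component so one very long visit cannot dominate.
    $time = min($secondsOnPage, $capSeconds) / $capSeconds;
    // Diminishing returns on interactions: 0.5, 0.67, 0.75, ...
    $engagement = 1 - 1 / (1 + $interactions);
    return round($time * $engagement, 4);
}
```

Seasonal effects such as gift buying are not modeled here. One option would be to store a timestamp with each score and discount older or seasonal entries when building the profile.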