Using Operational Redundancy
Effective Web Data Mining
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc
Premise
The interactions of a user can be used to
personalize their experience
Elements of Mining Redundancy
Website Data Mining
User Emotional State Mining
User Interaction Mining
Our Subject Material
HTML content is poorly structured
There are some pretty bad web
practices on the interwebz
You can’t trust that anything
semantically valid will be present
How We’ll Capture This Data
Start with base linguistics
Extend with available extras
The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
Capture Raw Page Data
Semantic data on the web
is sucktastic
Assume 5 year olds built
the sites
Language is the key
Extract Keywords
We now have a big jumble
of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
Weight Keywords
All content is not created
equal
Meta and headers and
semantics oh my!
This is where we leech
off the work of others
Questions to Keep in Mind
Should I use regex to parse web
content?
How do users interact with page
content?
What key identifiers can be monitored
to detect interest?
Fetching the Data: cURL
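//set to true to include the response headers in the returned content
$header = false;
$req = curl_init($url);
$options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,   //return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
);
curl_setopt_array($req, $options);
//fetch the page; curl_exec returns false on failure
$page_content = curl_exec($req);
curl_close($req);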
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/');
$replace = array(' ', ' ', ' ');
//perform page content modification
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);
natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', /* ... */ 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
    $word = trim($word);
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //initialize the counter the first time a word is seen
        if(!isset($searched_words[$word])){
            $searched_words[$word] = 0;
        }
        $searched_words[$word]++;
    }
}
arsort($searched_words, SORT_NUMERIC);
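The cleanup and counting steps above can be pulled together into one function. This is a minimal sketch, not the talk's own implementation: the function name, the two-word stop list, and the sample markup are all made up for illustration. It also pads tags with a space before stripping them, so that words in neighboring elements do not get glued together.

```php
<?php
// Hypothetical helper: names and the tiny stop-word list are illustrative only.
function extract_keywords(string $html, array $stop_words): array
{
    // Drop script and style blocks before stripping the remaining tags.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $html);
    $text = preg_replace('#<style\b[^>]*>.*?</style>#is', ' ', $text);

    // Pad every tag with a space so adjacent elements do not merge words.
    $text = strtolower(strip_tags(str_replace('<', ' <', $text)));

    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><h1>Data mining</h1><p>Mining the web and the data.</p>'
        . '<script>var mining = 1;</script></body></html>';
$keywords = extract_keywords($sample, array('the', 'and'));
print_r($keywords);
```

Splitting on non-word characters instead of a single space also strips trailing punctuation, so "data." and "data" count as the same keyword.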
Scraping Site Meta Data
//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
$dataReturn = array();
//scrape title (guard against pages with no title tag)
$title = $dom->getElementsByTagName("title");
$title = ($title->length > 0) ? $title->item(0)->nodeValue : "";
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
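To try the meta scraping without fetching a live page, the same logic can be run against an inline HTML string. This is a sketch under assumptions: the function name and sample markup are hypothetical, and it uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet. It needs PHP's DOM extension.

```php
<?php
// Hypothetical helper wrapping the slide's logic; requires PHP's DOM extension.
function scrape_meta(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $data = array('title' => null, 'description' => null, 'keywords' => null);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $property = strtolower($meta->getAttribute('property'));
        $content = $meta->getAttribute('content');

        if ($property === 'og:description' || $name === 'description') {
            // Keep the first description found rather than overwriting it.
            $data['description'] = $data['description'] ?? $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = $content;
        }
    }
    return $data;
}

$sample = '<html><head><title> Example page </title>'
        . '<meta name="description" content="A short summary">'
        . '<meta property="og:description" content="An OG summary">'
        . '<meta name="keywords" content="mining, data">'
        . '</head><body><p>Unclosed paragraph</body></html>';
$meta = scrape_meta($sample);
print_r($meta);
```

Keeping the first description is a choice made for this sketch; the loop on the slide lets a later tag overwrite an earlier one.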
Weighting Important Data
Tags you should care
about: meta (include OG),
title, description, h1+,
header
Bonus points for adding in
content location modifiers
Weighting Important Tags
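//our keyword weights
$weights = array("keywords" => 3.0,
                 "meta" => 2.0,
                 "header1" => 1.5,
                 "header2" => 1.2);
//add modifier here: scale the increment by the weight of the tag the word came from
if(strlen($word) > 2 && !in_array($word, $common_words)){
    if(!isset($searched_words[$word])){
        $searched_words[$word] = 0;
    }
    $searched_words[$word]++;
}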
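One way the "add modifier here" step could look is to add the tag's weight, instead of 1, each time a word is seen. The weights below come from the slide; the function name, the per-tag text map, and the default weight of 1.0 for unlisted sources are assumptions of this sketch.

```php
<?php
// Weights from the slide; the tag-to-text map and function name are hypothetical.
$weights = array('keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2);

function weighted_keywords(array $sources, array $weights, array $stop_words): array
{
    $scores = array();
    foreach ($sources as $source => $text) {
        // Body text, or any source without a listed weight, counts as 1.0.
        $weight = $weights[$source] ?? 1.0;
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$sources = array(
    'keywords' => 'mining',
    'header1'  => 'Web Data Mining',
    'body'     => 'The data and the web',
);
$scores = weighted_keywords($sources, $weights, array('the', 'and'));
print_r($scores);
```

Here "mining" picks up 3.0 from the keywords tag plus 1.5 from the h1, so it outranks words that only appear in the header and body.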
Expanding to Phrases
2-3 adjacent words, making
up a direct relevant callout
Seems easy right? Just like
single words
Language gets wonky
without stop words
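A sketch of phrase extraction that keeps stop words in the word stream, so phrases still read naturally. The function name and the rule of discarding phrases that begin or end with a stop word are assumptions made for this example, not something stated in the slides.

```php
<?php
// Hypothetical helper; the rule of dropping phrases that begin or end with
// a stop word is an assumption of this sketch, not something from the slides.
function extract_phrases(string $text, array $stop_words, int $min = 2, int $max = 3): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $phrases = array();

    for ($size = $min; $size <= $max; $size++) {
        for ($i = 0; $i + $size <= $total; $i++) {
            $slice = array_slice($words, $i, $size);
            // Stop words stay inside a phrase, but not at its edges.
            if (in_array($slice[0], $stop_words, true)
                || in_array($slice[$size - 1], $stop_words, true)) {
                continue;
            }
            $phrase = implode(' ', $slice);
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
        }
    }
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}

$phrases = extract_phrases(
    'War and Peace is long but War and Peace is good',
    array('and', 'is', 'but')
);
print_r($phrases);
```

If "and" were removed before building phrases, the title would collapse to "war peace", which is the wonkiness the slide warns about.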
Adding in Time Interactions
Interaction with a site does
not necessarily mean
interest in it
Time needs to also include
an interaction component
Gift buying seasons see
interest variations
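One way to give time an interaction component is to count only the seconds that closely follow an interaction, rather than the whole visit length. Everything here is an assumption made for illustration: the function name, timestamps expressed as seconds, and the 30-second idle window.

```php
<?php
// All names, the 30-second idle window, and the scoring rule are
// assumptions made for this sketch.
function engaged_seconds(array $event_times, int $visit_start, int $visit_end, int $idle_window = 30): int
{
    sort($event_times);
    $engaged = 0;
    $cursor = $visit_start; // end of the time already counted
    foreach ($event_times as $t) {
        if ($t < $visit_start || $t > $visit_end) {
            continue;
        }
        // Each interaction (scroll, click, key press) keeps the visitor
        // "engaged" for the following $idle_window seconds.
        $from = max($t, $cursor);
        $to = min($t + $idle_window, $visit_end);
        if ($to > $from) {
            $engaged += $to - $from;
            $cursor = $to;
        }
    }
    return $engaged;
}

// A ten-minute visit with three interactions near the start.
$engaged = engaged_seconds(array(5, 20, 40), 0, 600);
echo $engaged, " engaged seconds out of 600\n";
```

A tab left open for ten minutes with a few early clicks then scores about a minute of engagement instead of the full visit.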
Grouping Using Commonality
[Venn diagram: User A Interests, User B Interests, Common Interests]
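With each user's interests stored as a keyword-to-score map, the common set can be sketched as a key intersection. The profiles, the function name, and the choice of keeping the lower of the two scores are hypothetical.

```php
<?php
// Hypothetical interest profiles: keyword => weighted score.
$user_a = array('cycling' => 4.5, 'php' => 3.0, 'coffee' => 1.0);
$user_b = array('php' => 2.0, 'coffee' => 5.0, 'hiking' => 2.5);

function common_interests(array $a, array $b): array
{
    $common = array();
    // Keep only keywords present in both profiles.
    foreach (array_intersect_key($a, $b) as $keyword => $score) {
        // Use the lower score so one user's enthusiasm cannot dominate.
        $common[$keyword] = min($score, $b[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

$common = common_interests($user_a, $user_b);
print_r($common);
```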
Using Color Theory
Products with a feel-good message
Happiness, energy, encouragement
Health care (but not food!)
Relatable, calm, friendly, peace, security
Startups / innovative products
Creativity, imagination
Auction sites (but not sales sites!)
Passion, stimulation, excitement, power
What We’re Talking About
The CSS Service Engine
lesscss.org
sass-lang.com
learnboost.github.com/stylus
http://leafo.net/lessphp/
Design Engine Foundation: LESSPHP
The Basics of a Design Engine
//load the lessphp library and create new LESS object
require 'lessc.inc.php';
$less = new lessc();
//compile LESS code to CSS (recompiles only when the .less file is newer)
$less->checkedCompile(
    '/path/styles.less',
    'path/styles.css');
//return a link to the compiled CSS file
echo "<link rel='stylesheet' href='http://path/styles.css'
    type='text/css' />";
Passing Variables into LESSPHP
//create a new LESS object
$less = new lessc();
//set the variables
$less->setVariables(array(
    'color' => 'red',
    'base' => '960px'
));
//compile LESS into CSS and unset variables
echo $less->compile(".magic { color: @color;
    width: @base - 200; }");
$less->unsetVariable('color');
Implementing Color Functions
Lighten / Darken
Saturate / Desaturate
Adjust Hue
Mix Colors
Managing Irrelevant Content
Remove / hide content
based on user profile
and state
Managing Irrelevant Content
//variables passed into LESS compilation
$less->setVariables(array(
    "percent" => "80%",
));
//LESS template (colors are unquoted so fade() receives color values)
.highlight{
    @bg-color: #464646;
    @font-color: #eee;
    background-color: fade(@bg-color, @percent);
    color: fade(@font-color, @percent);
}
Traits of the Bored
Distraction
Repetition
Tiredness
Reasons for Boredom
Lack of interest
Readiness
Acting on Disinterest / Boredom
Highlighting on Agitated Behavior
Highlight relevant
content to reduce
agitated behavior
Acting Upon User Cues
$less->setVariables(array(
    "percent" => "100%",
    "size-mod" => "2"
));
Variables passed into LESS script
Acting Upon User Cues
.highlight{
    @bg-calm: blue;
    @bg-action: red;
    @base-font: 14px;
    background-color: mix(@bg-calm,
                          @bg-action,
                          @percent);
    font-size: @size-mod + @base-font;
}
LESS script logic for color / size variations
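To see what the template does as the percent variable changes, the blend can be approximated in plain PHP. This is a simplified stand-in for LESS's mix() that ignores alpha, plus a hypothetical mapping from an agitation score to the variables passed into compilation. The score itself, both function names, and the fixed size-mod value are assumptions.

```php
<?php
// Simplified stand-in for LESS's mix() that ignores alpha channels.
// $weight is the share of the first color, from 0.0 to 1.0.
function mix_rgb(array $first, array $second, float $weight): array
{
    $weight = max(0.0, min(1.0, $weight));
    $mixed = array();
    foreach (array(0, 1, 2) as $channel) {
        $mixed[] = (int) round($first[$channel] * $weight + $second[$channel] * (1 - $weight));
    }
    return $mixed;
}

// Hypothetical mapping: an agitation score of 0.0 (calm) to 1.0 (agitated)
// becomes the "percent" variable handed to the LESS template.
function less_variables_for(float $agitation): array
{
    $calm_share = (int) round((1 - max(0.0, min(1.0, $agitation))) * 100);
    return array('percent' => $calm_share . '%', 'size-mod' => '2');
}

$blue = array(0, 0, 255);
$red = array(255, 0, 0);
$vars = less_variables_for(0.25);
$color = mix_rgb($blue, $red, 0.75);
print_r($vars);
print_r($color);
```

With percent at 100% the highlight stays fully on the calm color; lowering it shifts the background toward the action color.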
Interaction and Emotion Plugin
jQuery Behavior Miner
by Cedric Dugas
https://github.com/posabsolute/jquery-behavior-miner
In the End…
What a person is interested in
What a person is doing
What their emotional state is
http://slideshare.com/jcleblanc
Thank You! Questions?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Creating Operational Redundancy for Effective Web Data Mining

  • 1. Using Operational Redundancy: Effective Web Data Mining. Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal). Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc
  • 2. Premise: The interactions of a user can be used to personalize their experience.
  • 3. Elements of Mining Redundancy: Website Data Mining; User Emotional State Mining; User Interaction Mining.
  • 5. Our Subject Material: HTML content is poorly structured. There are some pretty bad web practices on the interwebz. You can’t trust that anything semantically valid will be present.
  • 6. How We’ll Capture This Data: Start with base linguistics. Extend with available extras.
  • 7. The Basic Pieces: Page Data (Scrapey Scrapey); Keywords (Without all the fluff); Weighting (Word diets FTW).
  • 8. Capture Raw Page Data: Semantic data on the web is sucktastic. Assume 5 year olds built the sites. Language is the key.
  • 9. Extract Keywords: We now have a big jumble of words. Let’s extract. Why is “and” a top word? Stop words = sad panda.
  • 10. Weight Keywords: All content is not created equal. Meta and headers and semantics oh my! This is where we leech off the work of others.
  • 12. Questions to Keep in Mind: Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?
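  • 13. Fetching the Data: cURL
      //$url is the page to fetch; $header is a boolean (true to include response headers)
      $req = curl_init($url);
      $options = array(
          CURLOPT_URL => $url,
          CURLOPT_HEADER => $header,
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_FOLLOWLOCATION => true,
          CURLOPT_AUTOREFERER => true,
          CURLOPT_TIMEOUT => 15,
          CURLOPT_MAXREDIRS => 10
      );
      curl_setopt_array($req, $options);
      //run the request (not shown on the slide; $page_content is what slide 14 works on)
      $page_content = curl_exec($req);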
  • 14.
      //list of findable / replaceable string characters
      $find = array('/\r/', '/\n/', '/\s\s+/');
      $replace = array(' ', ' ', ' ');
      //perform page content modification: drop scripts, styles and tags, then lowercase
      $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
      $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
      $mod_content = strip_tags($mod_content);
      $mod_content = strtolower($mod_content);
      //collapse line breaks and runs of whitespace, then split into words
      $mod_content = preg_replace($find, $replace, $mod_content);
      $mod_content = trim($mod_content);
      $mod_content = explode(' ', $mod_content);
      natcasesort($mod_content);
  • 15.
      //set up list of stop words and the final found stopped list
      $common_words = array('a', ..., 'zero');
      $searched_words = array();
      //extract list of keywords with number of occurrences
      foreach($mod_content as $word) {
          $word = trim($word);
          if(strlen($word) > 2 && !in_array($word, $common_words)){
              //start the counter at zero on first sight to avoid an undefined-index warning
              if(!isset($searched_words[$word])){ $searched_words[$word] = 0; }
              $searched_words[$word]++;
          }
      }
      arsort($searched_words, SORT_NUMERIC);
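The cleanup and counting steps on slides 14 and 15 can be pulled together into one function. This is a minimal sketch rather than the talk's own code: the function name `extract_keywords`, the sample HTML, and the four-word stop list are assumptions made for illustration. It also departs from the slides in one place: it pads tags with a space before calling `strip_tags()`, because `strip_tags()` on its own fuses the last word of one element with the first word of the next.

```php
<?php
// Hypothetical helper (not from the slides): counts non-stop-word
// occurrences in a block of HTML, most frequent first.
function extract_keywords(string $html, array $stopWords): array
{
    // Drop script and style blocks entirely.
    $text = preg_replace('#<script\b.*?</script>#is', ' ', $html);
    $text = preg_replace('#<style\b.*?</style>#is', ' ', $text);

    // Pad tags with a space so "</h1><p>" does not glue two words together.
    $text = strtolower(strip_tags(str_replace('<', ' <', $text)));

    // Split on anything that is not a letter, digit, or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stopWords, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Tiny illustrative stop word list; a real one is much longer.
$stopWords = array('the', 'and', 'for', 'with');
$html = '<html><head><style>p{color:red}</style></head>'
      . '<body><h1>Data mining</h1><p>Mining the web for data and more data.</p>'
      . '<script>var x = "mining";</script></body></html>';

$keywords = extract_keywords($html, $stopWords);
print_r($keywords);
```

Splitting with `preg_split` instead of `explode(' ', ...)` also strips punctuation, so "data." and "data" count as the same word.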
  • 16. Scraping Site Meta Data
      //load scraped page data as a valid DOM document
      $dom = new DOMDocument();
      @$dom->loadHTML($page_content);
      //scrape title
      $title = $dom->getElementsByTagName("title");
      $title = $title->item(0)->nodeValue;
  • 17.
      //loop through all found meta tags
      $metas = $dom->getElementsByTagName("meta");
      for ($i = 0; $i < $metas->length; $i++){
          $meta = $metas->item($i);
          if($meta->getAttribute("property")){
              if ($meta->getAttribute("property") == "og:description"){
                  $dataReturn["description"] = $meta->getAttribute("content");
              }
          } else {
              if($meta->getAttribute("name") == "description"){
                  $dataReturn["description"] = $meta->getAttribute("content");
              } else if($meta->getAttribute("name") == "keywords"){
                  $dataReturn["keywords"] = $meta->getAttribute("content");
              }
          }
      }
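The title and meta scraping from slides 16 and 17 can be wrapped so that it survives pages with no title or no meta tags. The sketch below is one possible arrangement, not the talk's code: `scrape_meta` is a hypothetical name, the sample HTML is made up, and letting the first description found win is an assumption (the slide code lets the last one win). It uses `libxml_use_internal_errors()` in place of the `@` operator to quiet parser warnings on malformed markup.

```php
<?php
// Hypothetical helper: returns title, description and keywords from HTML.
function scrape_meta(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $data = array('title' => null, 'description' => null, 'keywords' => null);

    // Guard against pages that have no <title> at all.
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->nodeValue);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $property = strtolower($meta->getAttribute('property'));
        $content = $meta->getAttribute('content');

        if ($property === 'og:description' || $name === 'description') {
            // Assumption: keep the first description encountered.
            $data['description'] = $data['description'] ?? $content;
        } elseif ($name === 'keywords') {
            $data['keywords'] = $content;
        }
    }

    return $data;
}

$html = '<html><head><title> Example page </title>'
      . '<meta name="description" content="Plain description">'
      . '<meta property="og:description" content="OG description">'
      . '<meta name="keywords" content="mining, data"></head><body></body></html>';

$meta = scrape_meta($html);
print_r($meta);
```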
  • 18. Weighting Important Data: Tags you should care about: meta (include OG), title, description, h1+, header. Bonus points for adding in content location modifiers.
  • 19. Weighting Important Tags
      //our keyword weights
      $weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2");
      //add modifier here
      if(strlen($word) > 2 && !in_array($word, $common_words)){
          $searched_words[$word]++;
      }
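Slide 19 leaves the "add modifier here" step open. One way to fill it, sketched under assumptions: each occurrence of a word adds the weight of the place it was found, instead of adding 1. The weight table is the one from the slide (as floats); the function name `weigh_keywords`, the `body` source, and the fallback weight of 1.0 for unlisted sources are all hypothetical.

```php
<?php
// Weights from slide 19, stored as floats.
$weights = array('keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2);

// Hypothetical helper: $sources maps a source name to the text found there.
// Sources without an entry in $weights (such as "body") count as 1.0.
function weigh_keywords(array $sources, array $weights, array $stopWords): array
{
    $scores = array();
    foreach ($sources as $source => $text) {
        $weight = $weights[$source] ?? 1.0;
        $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $stopWords, true)) {
                // Each occurrence adds its source weight rather than 1.
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$sources = array(
    'keywords' => 'mining, data',
    'header1'  => 'Web data mining',
    'body'     => 'Mining the web for data and more data',
);
$scores = weigh_keywords($sources, $weights, array('the', 'and', 'for'));
print_r($scores);
```

With these inputs a word that appears in the meta keywords and an h1 outranks one that only appears in body text, which is the effect the weighting slides describe.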
  • 20. Expanding to Phrases: 2-3 adjacent words, making up a direct relevant callout. Seems easy right? Just like single words. Language gets wonky without stop words.
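Phrase extraction as described on slide 20 amounts to counting runs of 2-3 adjacent words. The sketch below is one possible approach, not the talk's: `extract_phrases` is a hypothetical name, and the rule it applies (stop words may sit inside a phrase but not at either end) is an assumption chosen to keep phrases readable. It ignores sentence boundaries, so a phrase can span a full stop, and it expects `$n` to be at least 1.

```php
<?php
// Hypothetical helper: counts phrases of $n adjacent words. Stop words are
// kept inside a phrase, but a phrase may not start or end with one.
function extract_phrases(string $text, int $n, array $stopWords): array
{
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);

    $counts = array();
    for ($i = 0; $i + $n <= $total; $i++) {
        $slice = array_slice($words, $i, $n);
        if (in_array($slice[0], $stopWords, true)
            || in_array($slice[$n - 1], $stopWords, true)) {
            continue;
        }
        $phrase = implode(' ', $slice);
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    arsort($counts, SORT_NUMERIC);
    return $counts;
}

$text = 'Web data mining is fun. Web data mining of the web is hard.';
$stop = array('is', 'of', 'the');

$bigrams = extract_phrases($text, 2, $stop);
$trigrams = extract_phrases($text, 3, $stop);
print_r($bigrams);
print_r($trigrams);
```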
  • 21. Adding in Time Interactions: Interaction with a site does not necessarily mean interest in it. Time needs to also include an interaction component. Gift buying seasons see interest variations.
  • 22. Grouping Using Commonality: Interests (User A), Interests (User B), Interests (Common).
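The grouping on slide 22 is the overlap between two users' interest profiles. A minimal sketch, assuming each profile is an array of keyword => score like the ones built earlier in the deck: `common_interests` is a hypothetical name, and scoring a shared keyword by the lower of the two scores is an assumption, not something the slides specify.

```php
<?php
// Hypothetical helper: returns the keywords both users share. A shared
// interest is scored by the lower of the two users' scores.
function common_interests(array $userA, array $userB): array
{
    $common = array();
    foreach (array_intersect_key($userA, $userB) as $keyword => $scoreA) {
        $common[$keyword] = min($scoreA, $userB[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Made-up profiles for illustration.
$userA = array('php' => 6.5, 'cycling' => 4.0, 'coffee' => 1.0);
$userB = array('coffee' => 3.0, 'php' => 2.5, 'gardening' => 5.0);

$common = common_interests($userA, $userB);
print_r($common);
```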
  • 24. Using Color Theory: Products with a feel-good message: happiness, energy, encouragement. Health care (but not food!): relatable, calm, friendly, peace, security. Startups / innovative products: creativity, imagination. Auction sites (but not sales sites!): passion, stimulation, excitement, power.
  • 26. The CSS Service Engine: lesscss.org, sass-lang.com, learnboost.github.com/stylus
  • 28. The Basics of a Design Engine
      //create new LESS object
      $less = new lessc();
      //compile LESS code to CSS
      $less->checkedCompile('/path/styles.less', 'path/styles.css');
      //create new CSS file and return new file link
      echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
  • 29. Passing Variables into LESSPHP
      //create a new LESS object
      $less = new lessc();
      //set the variables
      $less->setVariables(array(
          'color' => 'red',
          'base' => '960px'
      ));
      //compile LESS into CSS and unset variables
      echo $less->compile(".magic { color: @color; width: @base - 200; }");
      $less->unsetVariable('color');
  • 30. Implementing Color Functions: Lighten / Darken, Saturate / Desaturate, Adjust Hue, Mix Colors
  • 31. Managing Irrelevant Content: Remove / hide content based on user profile and state
  • 32. Managing Irrelevant Content
      //variables passed into LESS compilation
      $less->setVariables(array(
          "percent" => "80%",
      ));
      //LESS template (colors are unquoted so fade() receives color values, not strings)
      .highlight{
          @bg-color: #464646;
          @font-color: #eee;
          background-color: fade(@bg-color, @percent);
          color: fade(@font-color, @percent);
      }
  • 33. Traits of the Bored: Distraction, Repetition, Tiredness. Reasons for Boredom: Lack of interest, Readiness. Acting on Disinterest / Boredom.
  • 34. Highlighting on Agitated Behavior: Highlight relevant content to reduce agitated behavior
  • 35. Acting Upon User Cues
      Variables passed into LESS script:
      $less->setVariables(array(
          "percent" => "100%",
          "size-mod" => "2"
      ));
  • 36. Acting Upon User Cues
      LESS script logic for color / size variations (values are unquoted so mix() and the addition operate on a color and a length, not strings):
      .highlight{
          @bg-calm: blue;
          @bg-action: red;
          @base-font: 14px;
          background-color: mix(@bg-calm, @bg-action, @percent);
          font-size: @size-mod + @base-font;
      }
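To see what the `mix(@bg-calm, @bg-action, @percent)` call on slide 36 produces for a given percentage, the blend can be reproduced in plain PHP. This is a sketch for opaque colors only, not lessphp's implementation: `hex_to_rgb` and `mix_colors` are hypothetical names, and colors are assumed to be six-digit `#rrggbb` strings. The percentage is the share of the first color, so 100% yields the calm color unchanged and 0% yields the action color.

```php
<?php
// Hypothetical helper: "#rrggbb" -> array(r, g, b) with values 0-255.
function hex_to_rgb(string $hex): array
{
    $hex = ltrim($hex, '#');
    return array(
        hexdec(substr($hex, 0, 2)),
        hexdec(substr($hex, 2, 2)),
        hexdec(substr($hex, 4, 2)),
    );
}

// Hypothetical helper: blends two opaque colors channel by channel.
// $percent is the share of $first in the result (clamped to 0-100).
function mix_colors(string $first, string $second, float $percent): string
{
    $p = max(0.0, min(100.0, $percent)) / 100;
    $a = hex_to_rgb($first);
    $b = hex_to_rgb($second);

    $out = '#';
    for ($i = 0; $i < 3; $i++) {
        $out .= sprintf('%02x', (int) round($a[$i] * $p + $b[$i] * (1 - $p)));
    }
    return $out;
}

$calm = '#0000ff';   // blue
$action = '#ff0000'; // red

$fullyCalm = mix_colors($calm, $action, 100);
$fullyAction = mix_colors($calm, $action, 0);
$halfway = mix_colors($calm, $action, 50);
echo $fullyCalm, ' ', $fullyAction, ' ', $halfway, "\n";
```

Lowering the percentage as agitation signals accumulate shifts the highlight from the calm color toward the action color.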
  • 37. Interaction and Emotion Plugin: jQuery Behavior Miner by Cedric Dugas, https://github.com/posabsolute/jquery-behavior-miner
  • 38. In the End… What a person is interested in. What a person is doing. What their emotional state is.
  • 39. Thank You! Questions? http://slideshare.com/jcleblanc Jonathan LeBlanc, Head of Developer Evangelism N.A. (PayPal). Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc

Editor's Notes

  1. The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  2. Open graph protocol
  3. This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  4. Stripping irrelevant data
  5. Scraping site keywords
  6. You can also play with the fade in / fade out to modify the lightness and highlighting