In the SEO community, we are fortunate to have many excellent off-the-shelf SEO tools that cover the most common use cases and problems. But what happens when you want to truly impress your client or boss, or you face challenges nobody has seen or considered before? Do you simply give up, or do you roll up your sleeves and code novel solutions to save the day? In this talk, Hamlet walks you through some interesting SEO problems he has solved using the Python data science stack. You will get access to Jupyter notebooks with code you can reuse for your own projects.
13. Hamlet Batista | @hamletbatista | #TechSEOBoost
CHALLENGING SEO PROBLEMS
–
THAT NEED PROGRAMMING WORK
14. Hamlet Batista | @hamletbatista | #TechSEOBoost
IBM WebSphere => SAP Hybris
15. Hamlet Batista | @hamletbatista | #TechSEOBoost
IBM WebSphere Site
Category Page (Links to one or more Product Listing Pages)
Product Listing Page (Links to one or more Product Pages)
Product Page (Single SKU)
16. Hamlet Batista | @hamletbatista | #TechSEOBoost
SAP Hybris Site
Category Page (Links to one or more Product Pages)
Product Page (Single SKU)
17. Hamlet Batista | @hamletbatista | #TechSEOBoost
Old Site Product Pages (717)
New Site Product Pages (442)
Product Mapping (3431)
18. Hamlet Batista | @hamletbatista | #TechSEOBoost
Old Site Category Pages (371)
New Site Category Pages (147)
Category Mapping (712)
28. Hamlet Batista | @hamletbatista | #TechSEOBoost
https://github.com/plotly/plotly.py
29. Hamlet Batista | @hamletbatista | #TechSEOBoost
Solution Part 1 – Steps
Step 1:
Pull Google Analytics Data
–
Step 2:
Store Data in Pandas DataFrame
–
Step 3:
Perform Data Preparation and
Perform Basic Set Operations
CHALLENGE: Find Which Pages Lost
SEO Traffic
30. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – Basics
https://pandas.pydata.org/
Python for Data Science Cheat Sheet
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf
31. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – Jupyter
Google Colaboratory
https://colab.research.google.com/notebooks/welcome.ipynb
32. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – Pandas
https://pandas.pydata.org/
Cheat Sheet
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
10 Minutes to pandas
https://pandas.pydata.org/pandas-docs/stable/10min.html
Intro to Pandas for Excel Super Users
https://towardsdatascience.com/intro-to-pandas-for-excel-super-users-dac1b38f12b0
33. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – Requests
WEB SCRAPING REFERENCE:
A Simple Cheat Sheet for Web Scraping with Python
https://blog.hartleybrody.com/web-scraping-cheat-sheet/
http://docs.python-requests.org/en/master/
34. Hamlet Batista | @hamletbatista | #TechSEOBoost
https://ga-dev-tools.appspot.com/query-explorer/
35. Hamlet Batista | @hamletbatista | #TechSEOBoost
Pulling Google Analytics Data
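As a minimal sketch of this step, the snippet below builds the kind of request URL the Query Explorer produces for the legacy Core Reporting API v3. The view ID, dates, and token are placeholder values, and the choice of organic sessions by landing page is an assumption about what the analysis needs, not the query from the talk.

```python
from urllib.parse import urlencode

# Legacy Core Reporting API v3 endpoint (the API the Query Explorer targets).
GA_ENDPOINT = "https://www.googleapis.com/analytics/v3/data/ga"


def build_ga_query(view_id, start_date, end_date, access_token):
    """Return a request URL for organic sessions by landing page."""
    params = {
        "ids": f"ga:{view_id}",
        "start-date": start_date,
        "end-date": end_date,
        "metrics": "ga:sessions",
        "dimensions": "ga:landingPagePath",
        "filters": "ga:medium==organic",  # organic search traffic only
        "max-results": 10000,
        "access_token": access_token,
    }
    return f"{GA_ENDPOINT}?{urlencode(params)}"


# Placeholder values; substitute a real view ID and OAuth token.
url = build_ga_query("12345678", "2018-10-01", "2018-10-31", "TOKEN")
```

Running the same query for a "before" and an "after" date range gives the two datasets that the later steps compare.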
36. Hamlet Batista | @hamletbatista | #TechSEOBoost
Storing Data in a DataFrame
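A rough sketch of loading an API response into pandas follows. It assumes a v3-style JSON payload with `columnHeaders` and `rows` keys; the sample response is made up for illustration.

```python
import pandas as pd


def ga_response_to_dataframe(response):
    """Convert a v3-style response dict into a DataFrame."""
    columns = [header["name"] for header in response["columnHeaders"]]
    df = pd.DataFrame(response.get("rows", []), columns=columns)
    # The API returns every value as a string, so cast the metric column.
    df["ga:sessions"] = df["ga:sessions"].astype(int)
    return df


# Made-up sample payload, trimmed to the two keys the function reads.
sample_response = {
    "columnHeaders": [{"name": "ga:landingPagePath"}, {"name": "ga:sessions"}],
    "rows": [["/shoes/", "120"], ["/shoes/red-sneaker", "45"]],
}
df = ga_response_to_dataframe(sample_response)
```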
37. Hamlet Batista | @hamletbatista | #TechSEOBoost
Transforming Data for Analysis
https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/
Join types: Left Join, Full Outer Join, Left Join (if NULL), Inner Join, Right Join, Right Join (if NULL)
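To illustrate the join types with pandas, here is a small sketch that uses `DataFrame.merge` with `indicator=True` to split pages into "only before", "both", and "only after" sets. The page paths and session counts are invented.

```python
import pandas as pd

# Invented traffic data for two date ranges.
before = pd.DataFrame({"page": ["/a", "/b", "/c"], "sessions": [100, 50, 20]})
after = pd.DataFrame({"page": ["/b", "/c", "/d"], "sessions": [40, 25, 10]})

# Full outer join; the _merge column records which side each row came from.
merged = before.merge(
    after, on="page", how="outer",
    suffixes=("_before", "_after"), indicator=True,
)

# "Left Join (if NULL)": pages that had traffic before but none after.
only_before = merged.loc[merged["_merge"] == "left_only", "page"].tolist()
# Inner join: pages present in both periods.
in_both = merged.loc[merged["_merge"] == "both", "page"].tolist()
# "Right Join (if NULL)": pages that only appear in the later period.
only_after = merged.loc[merged["_merge"] == "right_only", "page"].tolist()
```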
38. Hamlet Batista | @hamletbatista | #TechSEOBoost
Transforming Data for Analysis
39. Hamlet Batista | @hamletbatista | #TechSEOBoost
Pages That Lost SEO Traffic
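One way to surface the losing pages, sketched under the assumption that each period is a DataFrame with `page` and `sessions` columns (hypothetical names), is to outer-merge the two periods, treat missing pages as zero sessions, and keep the rows with a negative change.

```python
import pandas as pd


def pages_that_lost_traffic(before, after):
    """Return pages whose sessions dropped, biggest loss first."""
    merged = before.merge(
        after, on="page", how="outer", suffixes=("_before", "_after")
    )
    # A page missing from one period received no sessions in that period.
    merged = merged.fillna({"sessions_before": 0, "sessions_after": 0})
    merged["change"] = merged["sessions_after"] - merged["sessions_before"]
    losers = merged[merged["change"] < 0]
    return losers.sort_values("change").reset_index(drop=True)


# Invented example data.
before = pd.DataFrame({"page": ["/a", "/b", "/c"], "sessions": [100, 50, 20]})
after = pd.DataFrame({"page": ["/b", "/c", "/d"], "sessions": [40, 25, 10]})
lost = pages_that_lost_traffic(before, after)
```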
40. Hamlet Batista | @hamletbatista | #TechSEOBoost
Solution Part 2 – Steps
Step 1:
We will crawl old pages to follow
redirects
–
Step 2:
We will group pages using regular
expressions
–
Step 3:
Repeat the previous analysis
CHALLENGE: Find Which Page Groups Lost
SEO Traffic (Manually)
41. Hamlet Batista | @hamletbatista | #TechSEOBoost
Regular Expressions for SEOs and Digital Marketers (with Use Cases)
https://netpeaksoftware.com/blog/regular-expressions-for-seos-and-digital-marketers-with-use-cases
Regex101.com
42. Hamlet Batista | @hamletbatista | #TechSEOBoost
Crawling Old Pages
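A minimal sketch of the crawl, assuming the Requests library: send a HEAD request for each old URL, let Requests follow the redirect chain, and record where it ends. The function name and the returned fields are illustrative choices, and the session is injectable so the logic can be exercised without network access.

```python
def resolve_redirects(urls, session=None, timeout=10):
    """Map each URL to its final destination after following redirects."""
    if session is None:
        import requests  # third-party; only needed for real crawling
        session = requests.Session()

    resolved = {}
    for url in urls:
        # HEAD avoids downloading page bodies; switch to GET if a server
        # does not answer HEAD requests correctly.
        response = session.head(url, allow_redirects=True, timeout=timeout)
        resolved[url] = {
            "final_url": response.url,
            "status": response.status_code,
            "hops": len(response.history),  # number of redirects followed
        }
    return resolved
```

Calling `resolve_redirects(old_urls)` with a list of old-site URLs returns a dictionary that can be loaded into a DataFrame and joined against the new-site traffic data.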
43. Hamlet Batista | @hamletbatista | #TechSEOBoost
Grouping with Regexes
Lookahead and Lookbehind Zero-Length Assertions
https://www.regular-expressions.info/lookaround.html
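The grouping step could look roughly like this. The URL patterns (`/p/<id>` for products, `/c/<slug>` for categories) are hypothetical and would need to match the real site's URL scheme; the second function shows a lookbehind assertion pulling out a product ID.

```python
import re

# Hypothetical URL patterns; adjust them to the site's actual structure.
PAGE_GROUP_PATTERNS = [
    ("product", re.compile(r"/p/\d+$")),
    ("category", re.compile(r"/c/[\w-]+/?$")),
    ("home", re.compile(r"^/$")),
]


def classify_url(path):
    """Return the first page group whose pattern matches the path."""
    for group, pattern in PAGE_GROUP_PATTERNS:
        if pattern.search(path):
            return group
    return "other"


def product_id(path):
    """Extract the digits preceded by '/p/' using a lookbehind assertion."""
    match = re.search(r"(?<=/p/)\d+", path)
    return match.group(0) if match else None
```

Because the lookbehind is zero-length, the match contains only the ID and not the `/p/` prefix.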
44. Hamlet Batista | @hamletbatista | #TechSEOBoost
https://github.com/plotly/plotly.py
45. Hamlet Batista | @hamletbatista | #TechSEOBoost
Page Groups That Lost SEO Traffic
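Once every page carries a group label, the comparison can be repeated at the group level. The sketch below uses invented numbers and hypothetical column names to sum sessions per group and flag the groups whose total dropped.

```python
import pandas as pd

# Invented per-page data, already labeled with a page group.
traffic = pd.DataFrame({
    "page_group": ["product", "product", "category", "category"],
    "sessions_before": [100, 50, 200, 80],
    "sessions_after": [60, 40, 210, 90],
})

# Sum both periods per group, then compute the change.
by_group = traffic.groupby("page_group")[["sessions_before", "sessions_after"]].sum()
by_group["change"] = by_group["sessions_after"] - by_group["sessions_before"]

# Groups with a negative change lost traffic overall.
lost_groups = by_group[by_group["change"] < 0].index.tolist()
```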
46. Hamlet Batista | @hamletbatista | #TechSEOBoost
Reverse Engineer Success Too
47. Hamlet Batista | @hamletbatista | #TechSEOBoost
How Do We Generalize This?
48. Hamlet Batista | @hamletbatista | #TechSEOBoost
Using Machine Learning!
56. Hamlet Batista | @hamletbatista | #TechSEOBoost
Solution Part 3 – Steps
Step 1:
Collect training data
–
Step 2:
Prepare the data and split it into
training and testing sets
–
Step 3:
Find best model
CHALLENGE: Find Which Page Groups Lost
SEO Traffic (Automatically)
57. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – BeautifulSoup
BeautifulSoup 4 Cheatsheet
http://akul.me/blog/2016/beautifulsoup-cheatsheet/
https://www.crummy.com/software/BeautifulSoup/bs4/download/
An SEO’s guide to XPath
https://builtvisible.com/seo-guide-to-xpath/
59. Hamlet Batista | @hamletbatista | #TechSEOBoost
Data Scientist Bottom Up Solution
Inside the BloomReach Algorithm - Using Machine Learning to Understand Page Templates
https://www.bloomreach.com/en/blog/2018/07/using-machine-learning-to-learn-page-templates.html
60. Hamlet Batista | @hamletbatista | #TechSEOBoost
Hamlet’s Observation and Simpler Solution
For most ecommerce sites, the dimensions and quantity of images and input form elements change by page template.
Let’s use that as the feature vector.
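A sketch of turning that observation into numbers: count the images and form inputs on a page and total the declared image area. To stay dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, and the exact choice of three features is an assumption made for illustration.

```python
from html.parser import HTMLParser


class TemplateFeatureParser(HTMLParser):
    """Collect simple layout features from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.image_area = 0
        self.input_count = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img":
            self.image_count += 1
            # Only images that declare numeric width and height add area.
            width = attributes.get("width") or ""
            height = attributes.get("height") or ""
            if width.isdigit() and height.isdigit():
                self.image_area += int(width) * int(height)
        elif tag in ("input", "select", "textarea"):
            self.input_count += 1


def extract_features(html):
    """Return [image_count, image_area, input_count] for one page."""
    parser = TemplateFeatureParser()
    parser.feed(html)
    return [parser.image_count, parser.image_area, parser.input_count]
```

Images sized through CSS will not contribute to the area feature here; measuring rendered dimensions would require a headless browser.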
61. Hamlet Batista | @hamletbatista | #TechSEOBoost
Hamlet’s Observation and Simpler Solution
62. Hamlet Batista | @hamletbatista | #TechSEOBoost
Hamlet’s Observation and Simpler Solution
63. Hamlet Batista | @hamletbatista | #TechSEOBoost
Collecting Training Data
64. Hamlet Batista | @hamletbatista | #TechSEOBoost
What is One Hot Encoding? Why and when do you have to use it?
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
Prepare and Split Data
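As an illustration of preparing and splitting the data, the sketch below one-hot encodes a categorical feature with `pandas.get_dummies` and holds out 20% of the rows for testing. The `largest_image_shape` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical training data: two numeric features, one categorical
# feature, and the page group label to predict.
pages = pd.DataFrame({
    "image_count": [12, 3, 14, 2, 11, 4, 13, 3, 12, 2],
    "input_count": [1, 5, 1, 6, 2, 5, 1, 4, 1, 6],
    "largest_image_shape": ["wide", "square"] * 5,
    "page_group": ["category", "product"] * 5,
})

# One-hot encoding replaces the text column with one indicator column
# per distinct value, which numeric models can consume.
encoded = pd.get_dummies(pages, columns=["largest_image_shape"])

# Hold out 20% of the rows; random_state makes the split repeatable.
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)
```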
65. Hamlet Batista | @hamletbatista | #TechSEOBoost
Cross Validation and Grid Search For Model Selection in Python
https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
Find Best Model
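A minimal sketch of model selection with scikit-learn's `GridSearchCV`, which cross-validates every parameter combination and keeps the best one. The toy features, the logistic regression estimator, and the parameter grid are assumptions for illustration, not the models compared in the talk.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy feature vectors [image_count, input_count] with page group labels.
X = [[12, 1], [3, 5], [14, 1], [2, 6], [11, 2], [4, 5], [13, 1], [3, 4],
     [12, 1], [2, 6], [15, 2], [3, 5], [10, 1], [4, 6], [13, 2], [2, 4]]
y = ["category", "product"] * 8

# Keep a quarter of the data aside for the final accuracy check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Try each regularization strength with 3-fold cross validation.
param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
accuracy = search.score(X_test, y_test)  # accuracy on the held-out rows
```

The same pattern extends to comparing several estimators by running one search per estimator and comparing `best_score_`.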
66. Hamlet Batista | @hamletbatista | #TechSEOBoost
https://github.com/plotly/plotly.py
67. Hamlet Batista | @hamletbatista | #TechSEOBoost
Simple guide to confusion matrix terminology
https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Confusion Matrix
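For readers new to the term, here is a small dependency-free sketch of how a confusion matrix is tallied: rows are the actual page groups, columns are the predicted ones. The labels and predictions are invented.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return rows of counts: matrix[i][j] = actual labels[i], predicted labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Invented labels and model predictions.
actual = ["product", "product", "category", "category", "product"]
predicted = ["product", "category", "category", "category", "product"]

matrix = confusion_matrix(actual, predicted, ["category", "product"])
# matrix == [[2, 0], [1, 2]]
```

The off-diagonal 1 is a product page the model mistook for a category page; the diagonal holds the correct predictions.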
68. Hamlet Batista | @hamletbatista | #TechSEOBoost
But wait… We can do Better
69. Hamlet Batista | @hamletbatista | #TechSEOBoost
Using Deep Learning!
70. Hamlet Batista | @hamletbatista | #TechSEOBoost
Solution Part 4 – Steps
Step 1:
Label a few thousand web page
screenshots with the visual features
you care about
–
Step 2:
Train a computer vision model to
predict more granular page groups
–
Step 3: Find best model
CHALLENGE: Learn More Granular Page
Groups that Lost SEO Traffic (Automatically)
71. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – TensorFlow & Keras
https://www.tensorflow.org/
Keras Cheat Sheet
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf
TensorFlow Tutorial For Beginners
https://www.datacamp.com/community/tutorials/tensorflow-tutorial
72. Hamlet Batista | @hamletbatista | #TechSEOBoost
Bottleneck
The “Information Bottleneck” Theory
https://www.quantamagazine.org/new-theory-cracks-open-the-black-box-of-deep-learning-20170921/
73. Hamlet Batista | @hamletbatista | #TechSEOBoost
AUTOENCODER
Input Image -> Encoder -> Bottleneck (Latent Space Representation) -> Decoder -> Reconstructed Image
74. Hamlet Batista | @hamletbatista | #TechSEOBoost
Caption Generator
1. Input Image
2. Convolutional Feature Extraction (14 x 14 Feature Map)
3. RNN (LSTM) with attention over the image
4. Word by word generation
Encoder -> Bottleneck (Latent Space Representation) -> Decoder
75. Hamlet Batista | @hamletbatista | #TechSEOBoost
Python – Tensorflow Object Detection API
https://github.com/tensorflow/models/tree/master/research/object_detection
76. Hamlet Batista | @hamletbatista | #TechSEOBoost
AutoML Vision API Tutorial
https://cloud.google.com/vision/automl/docs/tutorial
Google AutoML
77. Hamlet Batista | @hamletbatista | #TechSEOBoost
Visually Labeling Screenshots
78. Hamlet Batista | @hamletbatista | #TechSEOBoost
Don't Take Security Advice from SEO Experts or Psychics
https://www.troyhunt.com/dont-take-security-advice-from-seo-experts-or-psychics-neil-patel/
79. Hamlet Batista | @hamletbatista | #TechSEOBoost
Launch Jupyter Notebook in Google Colaboratory
https://colab.research.google.com/github/ranksense/open-source/blob/master/Presentations/TechSEOBoost/2018/PythonforSEOTechSEOBoost2018_Hamlet_Batista.ipynb
81. Hamlet Batista | @hamletbatista | #TechSEOBoost
Summary
Practical applications of Python >= 3.6 for:
Data extraction
–
Preparation
–
Analysis
–
Machine learning
–
Deep learning
82. Hamlet Batista | @hamletbatista | #TechSEOBoost
Free Realtime SEO Monitor
–
Ongoing monitoring with no active crawls
–
Receive alerts about critical SEO issues
–
Apply quick, temporary fixes in Cloudflare
–
Create developer tickets for permanent solutions
ABOUT RANKSENSE
– Apply for Beta Access
www.ranksense.com