This document discusses how machine learning is shaping Google and technical SEO. It argues that TF-IDF is not the best information retrieval algorithm and that BM25 and machine-learned ranking take additional factors into account. Wikimedia Research has released its machine-learned ranking work on GitHub. The document also discusses how Google may use click-through rate as a ranking factor alongside other signals processed by machine learning algorithms, and argues that technical SEO should focus on query disambiguation, semantic relevance analysis, content deduplication, and click satisfaction.
TechSEO Boost 2017: Fun with Machine Learning: How Machine Learning is Shaping Google and Technical SEO
1. JR Oakes | @jroakes #TechSEOBoost
Fun with Machines. How Machine
Learning is Shaping Google and
Technical SEO
2. JR Oakes | @jroakes #TechSEOBoost
About Me
• Studied Industrial Design at NCSU
• Worked as an architectural glass
artist for 10 years.
• Was Lead Developer and then
Director of Strategy for medium-
sized agency with 100+ clients
worldwide.
• Work as Director, Technical SEO
for Adapt.
3. JR Oakes | @jroakes #TechSEOBoost
I have a problem with tf-idf
4. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
TF-IDF is very hand-wavy and
sounds very fancy, but is not the
magic elixir to DOMINATING ON
GOOGLE.
5. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
It is actually not even the best information retrieval (IR)
algorithm.
BM25, in its various iterations, takes document length into
account in addition to other factors.
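To make the document-length point concrete, here is a minimal sketch comparing a plain TF-IDF score with a common Okapi BM25 formulation. The toy corpus, the function names, and the parameter defaults (k1=1.5, b=0.75) are assumptions chosen for illustration. This is not how any particular search engine scores pages.

```python
import math
from collections import Counter

def tf_idf_score(query, doc, docs):
    """Raw term count times a smoothed inverse document frequency."""
    n = len(docs)
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log((n + 1) / (df + 1)) + 1
        score += counts[term] * idf
    return score

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25: term frequency saturates and is normalized by document length."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = counts[term]
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

# Hypothetical corpus: a short focused page, a long page that mentions
# "tea" twice among 40 unrelated words, and an unrelated page.
SHORT = "green tea benefits".split()
LONG = ["tea", "tea"] + ["filler"] * 40
OTHER = "coffee brewing guide".split()
DOCS = [SHORT, LONG, OTHER]
QUERY = ["tea"]

tfidf_short = tf_idf_score(QUERY, SHORT, DOCS)
tfidf_long = tf_idf_score(QUERY, LONG, DOCS)
bm25_short = bm25_score(QUERY, SHORT, DOCS)
bm25_long = bm25_score(QUERY, LONG, DOCS)
```

With this toy data, raw TF-IDF prefers the long page because it repeats the term, while BM25 prefers the short, focused page once length is taken into account.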
6. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
https://wikimedia-research.github.io/Discovery-Search-Test-BM25/
https://wikimedia-research.github.io/Discovery-Search-Test-InterleavedLTR/
7. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
The Search Platform Team has been working on improving search on
Wikimedia projects with machine learning. Machine learned-ranking (MLR)
enables us to rank relevance of pages using a model trained on implicit and
explicit judgements. In the first test of the learning-to-rank (LTR) project, we
evaluated the performance of a click-based model on users searching English
Wikipedia. We found that users were slightly more likely to engage with MLR-
provided results than with BM25 results (assessed via the clickthrough rate
and a preference statistic). We also found that users with machine learning-
ranked results were statistically significantly more likely to click on the first
search result first than users with BM25-ranked results, which indicates that
we are onto something. The next step for us is to evaluate the model’s
performance on Wikipedia in other languages.
9. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
Wikimedia Research released their first model on GitHub last month.
MjoLniR – our Python and Spark-based library for handling the
backend data processing for Machine Learned Ranking at
Wikimedia.
https://github.com/wikimedia/search-MjoLniR/tree/master/mjolnir
10. JR Oakes | @jroakes #TechSEOBoost
About TF-IDF
We are WAY beyond TF-IDF. TF-IDF seems to work because it causes you to
look for related phrases, but it is not a very good relevance metric. It is a
keyword frequency metric.
11. JR Oakes | @jroakes #TechSEOBoost
How is Google Using Machine
Learning?
12. JR Oakes | @jroakes #TechSEOBoost
Was Larry Kim right?
13. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
15. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
Potentially:
• Clicks - For our click model we use a generalization of the Position-Based
Model (PBM) [9], at the core of which lies an examination hypothesis,
stating that in order to be clicked a document has to be examined and
attractive.
• Attention – If users get the information that they need directly from
the SERP (answer boxes) without a click, how do we know they were
satisfied?
• Satisfaction – “While looking at the reasons specified by the raters we
found out that 42% of the raters who said that they would click through on
a SERP, indicated that their goal was ‘to confirm information already
present in the summary.’” So additional clicks don’t necessarily mean a
poor initial result.
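The examination hypothesis in the Clicks bullet is usually written as a product of two probabilities. The form below is the standard Position-Based Model, not the generalization used in the quoted paper. The symbols are assumptions for illustration: C is the click event, E_r is examination at rank r, and A_{q,d} is the attractiveness of document d for query q.

```latex
P(C = 1 \mid q, d, r) \;=\; P(E_r = 1) \cdot P(A_{q,d} = 1)
```

In words: a result at a low position can be highly attractive and still receive few clicks, simply because few users examine that position.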
16. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
17. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
Machine Learning in its simplest form takes:
1. Input features
2. An algorithm that processes the features (most often) in a linear, non-
linear, or tree-based way to make a prediction.
3. And an evaluation metric that compares the prediction to your “ground
truth” data.
It is technically possible that CTR and / or Quality Rater data provide the
ground truth.
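The three ingredients above can be sketched in a few lines. This is a toy logistic regression written from scratch. The feature names, the data, and the labels are all hypothetical, and nothing here reflects Google's actual features or ground truth.

```python
import math

# 1. Input features (hypothetical): [title_match, body_match] per document,
#    paired with a ground-truth label (1 = relevant, 0 = not relevant).
DATA = [
    ([1.0, 0.9], 1),
    ([0.9, 0.7], 1),
    ([0.8, 0.8], 1),
    ([0.2, 0.3], 0),
    ([0.1, 0.1], 0),
    ([0.3, 0.2], 0),
]

def predict(weights, bias, features):
    """2. The algorithm: a linear combination squashed into a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=500, lr=0.5):
    """Fit the weights with plain stochastic gradient descent on log loss."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def accuracy(weights, bias, data):
    """3. The evaluation metric: share of predictions matching the ground truth."""
    hits = sum(
        1 for features, label in data
        if (predict(weights, bias, features) >= 0.5) == bool(label)
    )
    return hits / len(data)

weights, bias = train(DATA)
train_accuracy = accuracy(weights, bias, DATA)
```

Swapping the labels for click data or rater judgements changes what the model learns without changing any of the code, which is why the choice of ground truth matters so much.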
18. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
The problem is:
We don’t have the ground truth, we don’t know the features, and we sure as
hell have no idea what is in here:
19. JR Oakes | @jroakes #TechSEOBoost
CTR As A Ranking Factor
We know that it probably depends on:
• Click-through-rate
• Context models
• Ground-truth quality (Quality Raters’ Guidelines)
• And other standard factors.
20. JR Oakes | @jroakes #TechSEOBoost
Storytelling
22. JR Oakes | @jroakes #TechSEOBoost
Storytelling
Using Generative Adversarial Networks to train machines how to see the
storylines in news events.
https://www.ijcai.org/proceedings/2017/0554.pdf
24. JR Oakes | @jroakes #TechSEOBoost
LSTMs
We would also guess that LSTMs (with attention) play some role in RankBrain,
based on their state-of-the-art ability to pick up referential information in
texts well beyond traditional bag-of-words (BOW) models.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
25. JR Oakes | @jroakes #TechSEOBoost
What should we focus on?
26. JR Oakes | @jroakes #TechSEOBoost
Query Disambiguation
27. JR Oakes | @jroakes #TechSEOBoost
Query Disambiguation
Very little information in the query and
a lot of information in the possible
results.
28. JR Oakes | @jroakes #TechSEOBoost
Query Disambiguation
Google tries to give us a nudge.
29. JR Oakes | @jroakes #TechSEOBoost
Query Disambiguation
What a strong hint to
consider when thinking about
what needs to be included
on a page discussing:
Lipton Tea
Also a very strong hint at
potential navigation.
30. JR Oakes | @jroakes #TechSEOBoost
Query Disambiguation
AT&T does an amazing job
at this.
31. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
32. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
Bill Slawski (as always) is spot on.
33. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
Going back to the patent from Google
in 2014 (Integrated external related
phrase information into a phrase-based
indexing information retrieval
system), we see that there is a
marked gain in the significance of
phrases in a page based on
additional semantically related
qualifying phrases.
34. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
There are many ways to handle this on a
page level.
35. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
But this really starts much sooner:
first discover the content / intent
categories that your site is relevant
for, and only then start building out
relevant content categories for your
visitors.
https://anaconda.org/jroakes/cluster-share/notebook
36. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
The prior notebook ingests your
keywords, models them to vector
space, and then runs k-means to
group the keywords into relevance
clusters.
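The pipeline described above (keywords, then vectors, then k-means) can be sketched with the standard library alone. This is not the notebook's code. Simple bag-of-words vectors stand in for whatever vector model the notebook uses, the keyword list is made up, and the seeds are fixed by hand so the run is deterministic (real k-means usually picks random or k-means++ seeds).

```python
from collections import Counter

# Hypothetical keyword list for illustration.
KEYWORDS = [
    "green tea benefits",
    "green tea caffeine",
    "black tea caffeine",
    "iphone unlimited plan",
    "iphone family plan",
    "prepaid phone plan",
]

def vectorize(keywords):
    """Map each keyword to a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({word for k in keywords for word in k.split()})
    vectors = []
    for k in keywords:
        counts = Counter(k.split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(vectors, seeds, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Seed one centroid on the first tea keyword and one on the first plan keyword.
labels = kmeans(vectorize(KEYWORDS), seeds=[0, 3])
clusters = {}
for keyword, label in zip(KEYWORDS, labels):
    clusters.setdefault(label, []).append(keyword)
```

On this toy list the tea keywords and the phone-plan keywords fall into separate clusters, each of which is a candidate content category.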
37. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
38. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
Note this goes well beyond term-
frequency.
39. JR Oakes | @jroakes #TechSEOBoost
Semantic Relevance
Skip-gram models capture the
probability of co-occurrence across
large corpora, which is much closer
to what Google does than simple
tf-idf.
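As a rough illustration of the co-occurrence idea, the sketch below generates skip-gram (center, context) pairs and turns them into empirical co-occurrence probabilities. A real skip-gram model such as word2vec trains a neural network on pairs like these to learn dense vectors. That training step is left out here, and the sentences are made up.

```python
from collections import Counter, defaultdict

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cooccurrence_probs(sentences, window=2):
    """Estimate P(context | center) from raw skip-gram pair counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for center, context in skipgram_pairs(sentence.lower().split(), window):
            counts[center][context] += 1
    probs = {}
    for center, contexts in counts.items():
        total = sum(contexts.values())
        probs[center] = {word: n / total for word, n in contexts.items()}
    return probs

# Hypothetical mini-corpus.
SENTENCES = [
    "lipton iced tea bags",
    "lipton green tea bags",
    "cell phone plans",
]
probs = cooccurrence_probs(SENTENCES)
```

Here "tea" co-occurs with "lipton" and "bags" but never with "phone", which is context information that a per-document term count does not capture.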
40. JR Oakes | @jroakes #TechSEOBoost
We should also care about click
satisfaction.
41. JR Oakes | @jroakes #TechSEOBoost
Click Satisfaction
42. JR Oakes | @jroakes #TechSEOBoost
Click Satisfaction
Work hard to ensure that your pages get the clicks. H/T to @fighto for the
excellent article here:
https://searchengineland.com/alert-abnormal-organic-ctr-detected-automatic-detection-poorly-performing-meta-data-280290
https://anaconda.org/jroakes/ctr_anamolies_share/notebook
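One simple way to spot underperforming snippets is to compare each page's CTR with what is typical at its ranking position. The sketch below is not the method from the article or the notebook above. The rows, the column layout, and the thresholds (`ratio`, `min_impressions`) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (url, average position, impressions, clicks).
ROWS = [
    ("/tea/green", 1, 1000, 300),
    ("/tea/black", 1, 1200, 372),
    ("/tea/iced", 1, 900, 90),
    ("/tea/herbal", 1, 1100, 319),
    ("/plans/family", 3, 800, 80),
    ("/plans/prepaid", 3, 700, 77),
    ("/plans/unlimited", 3, 1000, 95),
]

def flag_low_ctr(rows, ratio=0.5, min_impressions=100):
    """Flag pages whose CTR is below `ratio` times the median CTR at their position."""
    by_position = defaultdict(list)
    for url, position, impressions, clicks in rows:
        if impressions >= min_impressions:  # skip rows too small to trust
            by_position[round(position)].append((url, clicks / impressions))
    flagged = []
    for position, pages in by_position.items():
        expected = median(ctr for _, ctr in pages)
        for url, ctr in pages:
            if ctr < ratio * expected:
                flagged.append((url, position, ctr, expected))
    return flagged

flagged = flag_low_ctr(ROWS)
flagged_urls = [row[0] for row in flagged]
```

Pages flagged this way are candidates for a title and meta description review, since they rank well but attract far fewer clicks than their peers.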
43. JR Oakes | @jroakes #TechSEOBoost
We should also care about content
deduplication.
44. JR Oakes | @jroakes #TechSEOBoost
Content Deduplication
https://anaconda.org/jroakes/duplicate_detection_with_shingling_share/notebook
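Shingling compares pages by their overlapping word n-grams. The sketch below is a minimal version of the idea, not the notebook's code. The page texts, the shingle size, and the 0.5 threshold are assumptions, and production systems typically add MinHash or similar to avoid comparing every pair.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(pages, threshold=0.5, size=3):
    """Return (url, url, score) for every page pair at or above the threshold."""
    sets = {url: shingles(text, size) for url, text in pages.items()}
    urls = sorted(sets)
    pairs = []
    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs

# Hypothetical page texts.
PAGES = {
    "/a": "our green tea is sourced from small farms in japan",
    "/b": "our green tea is sourced from small farms in china",
    "/c": "compare unlimited phone plans for your whole family",
}
pairs = near_duplicates(PAGES)
```

The first two pages differ by a single word and share 7 of their 9 distinct shingles, so they are reported as near duplicates. The third page shares none.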
45. JR Oakes | @jroakes #TechSEOBoost
Wrapping Up
It is very difficult to gain intuition into how Google works based solely on external
data. The reality is that context, machine learning, and click data allow for the
building of models that humans cannot understand easily.
We wanted to move the conversation away from simplistic keyword mechanisms
and towards an understanding that semantics and context are much more
valuable to ranking.