Challenges of building a search engine like web rendering service

SMX Advanced Europe, June 2021 - With the advent of new technologies and the massive use of JavaScript on the web, search engines have started using Web Rendering Services (WRS) to better understand the content of pages. What are the difficulties in building a WRS? Do the tools we use every day replicate what search engines do? In this session, Giacomo takes you on a journey through some of the technical implementation details of building a search-engine-like web rendering service, covering edge cases such as infinite scrolling, iframes, web components, and Shadow DOM, and how to approach them.

  1. 1. Challenges of building a search engine like web rendering service Giacomo Zecchini | Verve Search @giacomozecchini
  2. 2. Hi, I’m Giacomo. Technical Director at Verve Search. Technical background and previous experience in development. Love: understanding how things work and Web Performance. @giacomozecchini
  3. 3. @giacomozecchini As always happens in all the worst stories, everything started with a website migration.
  4. 4. One of the largest brands in the UK, with millions of pounds on the line. @giacomozecchini
  5. 5. The client was migrating platforms, moving from a server-side rendered website to a client-side rendered one. @giacomozecchini
  6. 6. ...and of course SSR was going to be put in place. @giacomozecchini
  7. 7. But... two days before the migration, the client told us “SSR is not working”. @giacomozecchini
  8. 8. We were already at code freeze. @giacomozecchini
  9. 9. We implemented a short-term solution*: statically rendering the site and using the user-agent to serve the right content. *A sort of customized Rendertron script. @giacomozecchini
  10. 10. After the migration we helped the client to move to a medium-long term solution (Prerender.io). @giacomozecchini
  11. 11. Implementation is always harder than it seems. @giacomozecchini
  12. 12. My curiosity made me start to research web rendering services. @giacomozecchini
  13. 13. In the past, the HTML was the most important thing to download in order to access the content of a page. * Icons made by Freepik from www.flaticon.com @giacomozecchini
  14. 14. Today, JavaScript is a big part of the web and makes everything more complex * Icons made by Freepik from www.flaticon.com @giacomozecchini
  15. 15. In the past we had a crawling-indexing process. [Diagram labels: URLs, Crawl Queue, Crawler, URL, HTML, Processing, Index] @giacomozecchini
  16. 16. Now, we’ve moved to a crawling-rendering-indexing process. [Diagram labels: URLs, Crawl Queue, Crawler, URL, HTML, Processing, Render Queue, Renderer, Index] https://developers.google.com/search/docs/guides/javascript-seo-basics @giacomozecchini
  17. 17. Google calls the rendering element WRS. [Diagram labels: as on the previous slide, plus WRS] @giacomozecchini
  18. 18. Martin Splitt’s TechSEO Boost 2019 talk https://www.youtube.com/watch?v=Qxd_d9m9vzo In his presentation, Martin covered a lot of interesting implementation details. If you’re interested in Google’s WRS, this is the presentation to watch. @giacomozecchini
  19. 19. These are three of the most important things you can get from a Web Rendering Service: DOM Tree, Render Tree + Layout, Rendered HTML. @giacomozecchini
  20. 20. DOM Tree & Render Tree https://developers.google.com/web/fundamentals/performance/critical-rendering-path/render-tree-construction @giacomozecchini
  21. 21. Layout information https://youtu.be/WjMSfTK1_SY?t=239 The layout information helps to understand where elements are positioned on a page, their dimensions, and their importance. @giacomozecchini
  22. 22. The layout information is useful when it comes to: understanding the semantics of a page, checking whether a page is mobile friendly, finding intrusive interstitials, and understanding above-the-fold content. https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
  23. 23. My toy web rendering service implementation. [Diagram labels: Crawler, Renderer, Render queue, Fetch Server, Fetchers, Cache Server, Chrome instances, Robots.txt Server, DNS Server] @giacomozecchini
  24. 24. It uses a first-in, first-out queue. [Same diagram] @giacomozecchini
  25. 25. ...utilises Chrome DevTools Protocol. [Same diagram] https://chromedevtools.github.io/devtools-protocol/ @giacomozecchini
  26. 26. I’m currently working on the fetch server. [Same diagram] @giacomozecchini
  27. 27. It uses the cache when possible. [Same diagram] @giacomozecchini
  28. 28. If the URL is not cached, the crawler fetches it. [Same diagram] @giacomozecchini
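The slides do not show the toy implementation itself, so the following is only a minimal sketch of the flow they describe: a first-in, first-out render queue in front of a fetch server that answers from the cache when possible and fetches otherwise. The names FetchServer, drain, and fake_fetch are hypothetical, and the network call is injected as a plain function so the example runs without a browser. Robots.txt and DNS handling are left out.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class FetchServer:
    """Hypothetical fetch server: serves from the cache, fetches on a miss."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                # injected network function (assumption)
        self._cache: Dict[str, str] = {}   # URL -> response body
        self.network_fetches = 0           # counts real fetches, for illustration

    def get(self, url: str) -> str:
        if url not in self._cache:         # cache miss: go to the network once
            self._cache[url] = self._fetch(url)
            self.network_fetches += 1
        return self._cache[url]


def drain(render_queue: Deque[str], server: FetchServer) -> List[Tuple[str, str]]:
    """Process URLs in first-in, first-out order and return (url, body) pairs."""
    rendered = []
    while render_queue:
        url = render_queue.popleft()       # FIFO: oldest URL first
        rendered.append((url, server.get(url)))
    return rendered


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP fetch."""
    return f"<html>{url}</html>"


queue = deque(["https://a.example/", "https://b.example/", "https://a.example/"])
server = FetchServer(fake_fetch)
results = drain(queue, server)
print(len(results), server.network_fetches)  # 3 pages served, 2 network fetches
```

In a real service the queue would feed Chrome instances rather than return bodies directly. The point of the sketch is only the ordering and the cache-first lookup.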
  29. 29. The real problems start when you have to make choices about the actual rendering of pages! @giacomozecchini
  30. 30. What about the viewport? Do you want to limit the number of fetches? Are you going to render a page multiple times? @giacomozecchini
  31. 31. Developing software is a matter of choices and context. @giacomozecchini
  32. 32. The same group of people in a different context may end up developing the same project in a totally different way. @giacomozecchini
  33. 33. You can’t replicate Google’s WRS without having the same data they have. @giacomozecchini
  34. 34. This is where I began to realise the case for building your own rather than relying on tools. @giacomozecchini
  35. 35. You can’t replicate Google’s WRS but you can learn from it. @giacomozecchini
  36. 36. Understanding Google's WRS behaviour @giacomozecchini
  37. 37. Understanding Google's WRS behaviour @giacomozecchini
  38. 38. If you need JavaScript console message data, you can use the Mobile-Friendly Test or the Search Console Live Test, but be careful! @giacomozecchini
  39. 39. Mobile-Friendly Test, Search Console Live Test, AMP Test, and Rich Results Test use the WRS infrastructure, but they bypass the cache, use shorter timeouts, and have a few other differences. @giacomozecchini
  40. 40. Hic sunt dracones / Here be Dragons. What follows is based on my own tests and assumptions; results may be false positives. Google can change implementation details at any time without notice, explanation, or justification. @giacomozecchini
  41. 41. How we’ll approach the edge cases: 1. Define the edge case. 2. Understand Google's WRS support and behaviour (personal assumption). 3. Check for tools support and behaviour. 4. Propose a solution. @giacomozecchini
  42. 42. Some of the tested tools. I did multiple tests for each edge case. @giacomozecchini
  43. 43. This is not an evaluation of those tools, but just a comparison between their results and those of Google’s WRS. @giacomozecchini
  44. 44. Edge Case #1 HTTPS/HTTP mixed content @giacomozecchini
  45. 45. Mixed content occurs when initial HTML is loaded over a secure HTTPS connection, but other resources are loaded over an insecure HTTP connection. https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
  46. 46. Website: https://www.example.com CSS: http://www.example.com/style.css @giacomozecchini
  47. 47. Chrome will automatically upgrade mixed content from HTTP to HTTPS. If the fetch fails that asset won’t be loaded. @giacomozecchini
  48. 48. Google’s WRS seems to behave like Chrome*. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. * This was not the case until recently. @giacomozecchini
  49. 49. Tools support 10% of tests showed a different result @giacomozecchini
  50. 50. Solution: When visiting an HTTPS website, upgrade the URLs of assets from HTTP to HTTPS. If you use a Chromium-based browser, you should already have the right solution in place. @giacomozecchini
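For a renderer that is not Chromium-based, the upgrade step can be sketched as a small URL rewrite. This is only an illustration: upgrade_asset_url is a hypothetical helper, and it does not handle explicit ports (an asset on port 80 would need its port changed or dropped) or the fallback when the HTTPS fetch fails.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_asset_url(page_url: str, asset_url: str) -> str:
    """Rewrite an http:// asset URL to https:// when the page itself is HTTPS."""
    page = urlsplit(page_url)
    asset = urlsplit(asset_url)
    if page.scheme == "https" and asset.scheme == "http":
        # Only the scheme changes; host, path, query, and fragment are kept.
        return urlunsplit(asset._replace(scheme="https"))
    return asset_url


# The example from the slides: an HTTPS page referencing an HTTP stylesheet.
print(upgrade_asset_url("https://www.example.com", "http://www.example.com/style.css"))
# prints https://www.example.com/style.css
```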
  51. 51. Edge Case #2 Infinite scrolling / Lazy loading @giacomozecchini
  52. 52. @giacomozecchini SCROLL
  53. 53. Google starts the rendering using a fixed viewport. Mobile: 412 x 732. Desktop: 1024 x 1024. [Diagram: VIEWPORT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  54. 54. Then it calculates the new viewport: Viewport = Page Height + pixels. The number of additional pixels depends on the page; it could be thousands of pixels. [Diagram: VIEWPORT, PAGE HEIGHT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  55. 55. A bigger viewport triggers infinite loading or lazy loading events. [Diagram: VIEWPORT, PAGE HEIGHT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  56. 56. 10,000,000 px: this seems to be the maximum viewport height. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  57. 57. Tools support: 95% of tests showed a different result*. * For very tall pages. @giacomozecchini
  58. 58. Solution: Start with a fixed viewport. @giacomozecchini
  59. 59. Solution: Wait for an event: onload or DOMContentLoaded. If you’re using Puppeteer: networkidle0 or networkidle2. @giacomozecchini
  60. 60. Solution: Check the page height and compare it to the initial viewport. [Diagram: VIEWPORT, PAGE HEIGHT] https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
  61. 61. Solution: If the viewport is shorter than the page, set Viewport = Page Height + pixels. [Diagram: VIEWPORT, PAGE HEIGHT] @giacomozecchini
  62. 62. Solution: The simplest solution is then to wait for X seconds and stop rendering, or to check the viewport and page height again*. [Diagram: VIEWPORT, PAGE HEIGHT] * For more complex solutions you can look at ongoing requests or an event-based approach. @giacomozecchini
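The viewport steps above can be sketched as a small loop. This is a sketch under stated assumptions, not Google's algorithm: extra_pixels=2000 and max_rounds=5 are arbitrary illustrative values, and measure_page_height is a hypothetical callback. In a real renderer it would resize the viewport (for example with the Chrome DevTools Protocol method Emulation.setDeviceMetricsOverride), wait, and read the content height from Page.getLayoutMetrics. Here a fake page stands in so the example runs on its own.

```python
from typing import Callable

MAX_VIEWPORT_HEIGHT = 10_000_000  # px; the apparent maximum reported in the slides


def next_viewport_height(viewport_height: int, page_height: int, extra_pixels: int) -> int:
    """Viewport = Page Height + pixels, but only when the page is taller than the viewport."""
    if page_height <= viewport_height:
        return viewport_height
    return min(page_height + extra_pixels, MAX_VIEWPORT_HEIGHT)


def expand_viewport(
    initial_height: int,
    measure_page_height: Callable[[int], int],
    extra_pixels: int = 2000,   # assumption: "could be thousands of pixels"
    max_rounds: int = 5,        # assumption: stop after a few re-checks
) -> int:
    """Grow the viewport until the page stops getting taller or max_rounds is reached."""
    viewport = initial_height
    for _ in range(max_rounds):
        page_height = measure_page_height(viewport)
        new_viewport = next_viewport_height(viewport, page_height, extra_pixels)
        if new_viewport == viewport:   # page fits: nothing more to trigger
            break
        viewport = new_viewport
    return viewport


def fake_infinite_scroll_page(viewport_height: int) -> int:
    """Fake page: loads 1500 px more content per resize, up to 9000 px in total."""
    return min(viewport_height + 1500, 9000)


print(expand_viewport(732, fake_infinite_scroll_page))  # prints 11000
```

With the fake page, the viewport grows 732, 4232, 7732, 11000 and then stops because the 9000 px page finally fits.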
  63. 63. Edge Case #3 Content-visibility @giacomozecchini
  64. 64. content-visibility is a CSS property that enables the browser to skip an element's rendering. https://web.dev/content-visibility/ @giacomozecchini
  65. 65. content-visibility is used together with contain-intrinsic-size, a CSS property that allows you to specify the natural size of an element if the element is affected by size containment. https://web.dev/content-visibility/ @giacomozecchini
  66. 66. Google starts the rendering using a fixed viewport. Mobile: 412 x 732. Desktop: 1024 x 1024. [Diagram: VIEWPORT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  67. 67. Viewport = Page Height + pixels. When the browser starts rendering, the page height is calculated using contain-intrinsic-size. [Diagram: VIEWPORT, PAGE HEIGHT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  68. 68. A bigger viewport makes the browser render the elements affected by size containment. [Diagram: VIEWPORT, PAGE HEIGHT] N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  69. 69. Tools support: 97% of tests showed a different result*. * For very tall pages. @giacomozecchini
  70. 70. Solution: Start with a fixed viewport. @giacomozecchini
  71. 71. Solution: Wait for an event: onload or DOMContentLoaded. If you’re using Puppeteer: networkidle0 or networkidle2. @giacomozecchini
  72. 72. Solution: Check the page height and compare it to the initial viewport. [Diagram: VIEWPORT, PAGE HEIGHT] https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
  73. 73. Solution: If the viewport is shorter than the page, set Viewport = Page Height + pixels. [Diagram: VIEWPORT, PAGE HEIGHT] @giacomozecchini
  74. 74. Solution: The simplest solution is then to wait for X seconds and stop rendering, or to check the viewport and page height again. [Diagram: VIEWPORT, PAGE HEIGHT] @giacomozecchini
  75. 75. Edge Case #4 Shadow DOM @giacomozecchini
  76. 76. Shadow DOM https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM @giacomozecchini
  77. 77. Google is able to render and use Shadow DOM content. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  78. 78. Tools support ? 93% of tests showed a different result @giacomozecchini
  79. 79. Solution: document.documentElement.outerHTML returns a DOMString containing an HTML serialization of the element and its descendants, but not the Shadow DOM*. *Puppeteer’s page.content() returns the outerHTML. @giacomozecchini
  80. 80. Solution: The solution is to get the DOM tree, traverse it, and serialize it into HTML. https://www.w3schools.com/js/js_htmldom_navigation.asp @giacomozecchini
  81. 81. Solution: [Diagram of a DOM tree with nodes: Document, HTML, HEAD, BODY, DIV, P, P, Text, Text] @giacomozecchini
  82. 82. Solution: dom2html library (https://github.com/GoogleChromeLabs/dom2html); Chrome DevTools Protocol methods: DOM.getDocument, DOMSnapshot.getSnapshot, DOMSnapshot.captureSnapshot (https://chromedevtools.github.io/devtools-protocol/). * If interested in Shadow DOM, have a look at: https://web.dev/declarative-shadow-dom/ @giacomozecchini
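A minimal sketch of the traverse-and-serialize idea, assuming the tree has already been fetched and has the shape of the nodes returned by DOM.getDocument (nodeType, nodeName, nodeValue, a flat attributes list, children, shadowRoots). Emitting each shadow root as a declarative `<template shadowrootmode="...">` is a choice made for this example, not something the slides prescribe. The sketch skips comments and doctypes, does not special-case script or style text, and ignores user-agent shadow roots.

```python
from html import escape

ELEMENT_NODE, TEXT_NODE, DOCUMENT_NODE, DOCUMENT_FRAGMENT_NODE = 1, 3, 9, 11
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


def serialize(node: dict) -> str:
    """Serialize a DOM.Node-shaped dict to HTML, including shadow roots."""
    node_type = node.get("nodeType")
    if node_type == TEXT_NODE:
        return escape(node.get("nodeValue", ""), quote=False)
    children = "".join(serialize(child) for child in node.get("children", []))
    if node_type in (DOCUMENT_NODE, DOCUMENT_FRAGMENT_NODE):
        return children                      # containers have no tag of their own
    if node_type != ELEMENT_NODE:
        return ""                            # comments, doctype: skipped in this sketch
    tag = node["nodeName"].lower()
    flat = node.get("attributes", [])        # [name1, value1, name2, value2, ...]
    attrs = "".join(f' {name}="{escape(value)}"'
                    for name, value in zip(flat[0::2], flat[1::2]))
    shadow = "".join(
        f'<template shadowrootmode="{root.get("shadowRootType", "open")}">'
        f"{serialize(root)}</template>"
        for root in node.get("shadowRoots", [])
        if root.get("shadowRootType") != "user-agent"
    )
    if tag in VOID_ELEMENTS:
        return f"<{tag}{attrs}>"
    return f"<{tag}{attrs}>{shadow}{children}</{tag}>"


# Hypothetical tree: a custom element whose only content lives in its shadow root.
doc = {"nodeType": 9, "nodeName": "#document", "children": [
    {"nodeType": 1, "nodeName": "HTML", "children": [
        {"nodeType": 1, "nodeName": "BODY", "children": [
            {"nodeType": 1, "nodeName": "MY-CARD", "attributes": ["class", "promo"],
             "children": [],
             "shadowRoots": [{
                 "nodeType": 11, "nodeName": "#document-fragment",
                 "shadowRootType": "open",
                 "children": [{"nodeType": 1, "nodeName": "P", "children": [
                     {"nodeType": 3, "nodeName": "#text",
                      "nodeValue": "Hidden from outerHTML"}]}],
             }]},
        ]},
    ]},
]}

print(serialize(doc))
```

For the sample tree the paragraph inside the shadow root appears in the output, which is exactly the content that outerHTML would leave out.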
  83. 83. Edge Case #5 Iframe @giacomozecchini
  84. 84. IFRAME Page A Page B @giacomozecchini
  85. 85. Google is able to render an iframe, inlining its <body> content in a <div>. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  86. 86. If the page included through the iframe has a noindex, its content is not included in the page. N.B. These are just personal assumptions based on tests. Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
  87. 87. Iframe - Tools support ? 95% of tests showed a different result @giacomozecchini
  88. 88. Solution: Get the DOM tree, traverse it, and serialize it into HTML. N.B. When traversing the DOM you only need the content of <body>; remove other HTML elements and tags such as the <head>. Remember to check for the noindex. @giacomozecchini
  89. 89. Solution: dom2html library (https://github.com/GoogleChromeLabs/dom2html); Chrome DevTools Protocol methods: DOM.getDocument, DOMSnapshot.getSnapshot, DOMSnapshot.captureSnapshot (https://chromedevtools.github.io/devtools-protocol/). @giacomozecchini
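The iframe handling can be sketched as a tree rewrite done before serialization. The sketch assumes nodes shaped like those returned by DOM.getDocument, where an IFRAME node carries the framed page under contentDocument. The helpers el, text, find_first, has_noindex, and inline_iframes are hypothetical names. Dropping a noindexed iframe entirely is one possible reading of the observed behaviour, and only a robots meta tag is checked (not the X-Robots-Tag header or crawler-specific meta names).

```python
from typing import Optional


def text(value: str) -> dict:
    return {"nodeType": 3, "nodeName": "#text", "nodeValue": value}


def el(name: str, children=(), attributes=(), **extra) -> dict:
    return {"nodeType": 1, "nodeName": name, "attributes": list(attributes),
            "children": list(children), **extra}


def find_first(node: dict, name: str) -> Optional[dict]:
    """Depth-first search for the first node with the given nodeName."""
    if node.get("nodeName") == name:
        return node
    for child in node.get("children", []):
        found = find_first(child, name)
        if found is not None:
            return found
    return None


def has_noindex(document: dict) -> bool:
    """True if the document contains <meta name="robots" content="...noindex...">."""
    stack = [document]
    while stack:
        node = stack.pop()
        if node.get("nodeName") == "META":
            flat = node.get("attributes", [])
            attrs = dict(zip(flat[0::2], flat[1::2]))
            if (attrs.get("name", "").lower() == "robots"
                    and "noindex" in attrs.get("content", "").lower()):
                return True
        stack.extend(node.get("children", []))
    return False


def inline_iframes(node: dict) -> dict:
    """Return a copy where each child IFRAME becomes a DIV holding the framed <body> content."""
    children = []
    for child in node.get("children", []):
        if child.get("nodeName") != "IFRAME":
            children.append(inline_iframes(child))
            continue
        framed = child.get("contentDocument")
        if framed is None or has_noindex(framed):
            continue                          # nothing to inline, or noindex: drop it
        body = find_first(framed, "BODY")
        body_children = body.get("children", []) if body else []
        # Only the <body> content is kept; <head> and the rest are discarded.
        children.append(inline_iframes(el("DIV", body_children)))
    return {**node, "children": children}


page_b = {"nodeType": 9, "nodeName": "#document", "children": [
    el("HTML", [el("HEAD"), el("BODY", [el("P", [text("Page B")])])])]}
page_a = el("BODY", [el("P", [text("Page A")]), el("IFRAME", contentDocument=page_b)])

result = inline_iframes(page_a)
print([child["nodeName"] for child in result["children"]])  # prints ['P', 'DIV']
```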
  90. 90. Are those the only problems that exist? @giacomozecchini
  91. 91. Web rendering services
  92. 92. What can we learn from this? @giacomozecchini
  93. 93. Sometimes you should reinvent the wheel. It’s fun and you can learn a lot from that! @giacomozecchini
  94. 94. When you change the way you look at things, the things you look at change. Understanding these limitations should change, in those edge cases, the advice that you provide. @giacomozecchini
  95. 95. Don’t use tools blindly! Tools are great and save us a huge amount of time in all our tasks. The majority of pages on the web are not affected by those edge cases. @giacomozecchini
  96. 96. If your website uses or is affected by one of the mentioned edge cases, you can open a support ticket to check with your tool provider if they are already covering that. @giacomozecchini
  97. 97. Thank You! Got questions? DMs on Twitter are open. @giacomozecchini
