Class 39: ...and the World Wide Web
1. Lecture 39: …and the World Wide Web
cs1120 Fall 2011
David Evans
http://www.cs.virginia.edu/evans
2. Announcements
Exam 2 due 61 seconds ago!
Friday: we will return graded Exam 2, along with
guidance about the Final
Must be present (or email me in advance) to win!
If you want to present your PS8 in class Monday, remember to email me!
3. Plan
The World Wide Web
Building Web Applications
How Google Works
(or, going back to pre-PS5 to make things
really fast again!)
cs1120 recap in one (heavily animated) slide!
7. Overview:
Many of the discussions of the future at CERN and the LHC era end with the question – “Yes, but how will we ever keep track of such a large project?” This proposal provides an answer to such questions. Firstly, it discusses the problem of information access at CERN. Then, it introduces the idea of linked information systems, and compares them with less flexible ways of finding information.
http://www.w3.org/History/1989/proposal-msw.html
10. WorldWideWeb
Established a common language for sharing
information on computers
Lots of previous attempts (Gopher, WAIS,
Archie, Xanadu, etc.) failed
11. Why the World Wide Web?
World Wide Web succeeded because it was simple!
Didn’t attempt to maintain links, just a common
way to name things
Uniform Resource Locators (URL)
http://www.cs.virginia.edu/cs1120/index.html
Service: http (HyperText Transfer Protocol)
Hostname: www.cs.virginia.edu
File Path: /cs1120/index.html
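The three labeled parts of a URL can be pulled apart with Python's standard `urllib.parse` module. This is only a small sketch; the helper name `url_parts` and the dictionary keys are illustrative choices that mirror the slide's labels.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the three parts labeled on the slide."""
    parts = urlsplit(url)
    return {
        "service": parts.scheme,     # e.g. "http"
        "hostname": parts.hostname,  # e.g. "www.cs.virginia.edu"
        "path": parts.path,          # e.g. "/cs1120/index.html"
    }

print(url_parts("http://www.cs.virginia.edu/cs1120/index.html"))
```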
12. HyperText Transfer Protocol
Client (Browser) → Server: GET /cs1120/index.html HTTP/1.0
Server → Client (Browser): Contents of file (HTML)
<html>
<head>
…
HTML: HyperText Markup Language
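The exchange above is plain text, so it can be sketched without touching the network. The helpers below (`build_get_request` and `split_response`, both hypothetical names) format the request line from the slide and separate a response's status line from its HTML body. The `Host` header is an added assumption; HTTP/1.0 does not require it, but most servers today expect it.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Format the text a browser would send for a simple GET."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Request line, one header (an assumption), then a blank line.
    return f"GET {path} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"

def split_response(raw):
    """Separate the status line from the body (the contents of the file)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line = head.split("\r\n")[0]
    return status_line, body

request = build_get_request("http://www.cs.virginia.edu/cs1120/index.html")
print(request.splitlines()[0])
```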
13. HTML: HyperText Markup Language
Language for controlling display of web pages
Uses formatting tags: between < and >
Document ::= <html> Header Body </html>
Header ::= <head> HeadElements </head>
HeadElements ::= ε | HeadElement HeadElements
HeadElement ::= <title> Element </title>
Body ::= <body> Elements </body>
Elements ::= ε | Element Elements
Element ::= <p> Element </p>
Element ::= <center> Element </center>
…
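One way to read the grammar is as a recipe for building documents: each rule becomes a function that wraps its pieces in a matching pair of tags. The sketch below follows that idea. The function names are illustrative, and treating plain text as the base case for Element is an assumption, since the slide elides the remaining rules.

```python
def element(kind, inner):
    """Element ::= <kind> Element </kind>, with plain text as the assumed base case."""
    return f"<{kind}>{inner}</{kind}>"

def document(title, body_elements):
    """Document ::= <html> Header Body </html>."""
    head = element("head", element("title", title))
    body = element("body", "".join(body_elements))
    return element("html", head + body)

print(document("cs1120", [element("p", element("center", "Hello, Web!"))]))
```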
14. Popular Web Site: Strategy 1
Static, Authored Web Site
Drawbacks:
• Have to do all the work yourself
• The world may already have enough Twinkie-experiment websites
Content Producer
http://www.twinkiesproject.com/
15. Popular Web Site: Strategy 2
Dynamic Web Applications
Web Programmer → Seed content and function → Attracts users → Produce more content
eBay in 1997
http://web.archive.org/web/19970614001443/http://www.ebay.com/
16. Popular Web Site: Strategy 2
Dynamic Web Applications
Seed content and function → Attracts users → Produce more content
Advantages:
• Users do most of the work
• If you’re lucky, they might even pay you for the privilege!
Disadvantages:
• Lose control over the content (you might get sued for things your users do)
• Have to know how to program a web application
reddit.com in 2005
reddit.com today
17. Dynamic Web Sites
Programs that run on the web server
Can be written in any language (often in Python or Java), just
need a way to connect the web server to the program
Program generates HTML (often JavaScript also now)
Every useful web site does this
Programs that run on the client’s machine
Java, JavaScript (aka, “Scheme for the Web”), Flash, etc.:
language must be supported by the client’s browser
Responsive interface: limited round-trips to server
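A server-side program in this sense is just code that returns HTML as a string. The sketch below is a minimal illustration, not a real web framework: `render_listing` is a hypothetical function, and escaping user-submitted text with the standard `html.escape` is an added precaution that the slide does not mention.

```python
from html import escape

def render_listing(title, items):
    """Generate an HTML page from data, as a server-side program would."""
    # Escape user-supplied text so it cannot inject its own tags.
    rows = "".join(f"<li>{escape(item)}</li>" for item in items)
    return (
        f"<html><head><title>{escape(title)}</title></head>"
        f"<body><ul>{rows}</ul></body></html>"
    )

print(render_listing("Auctions", ["Twinkie (unopened)", "Laptop <used>"]))
```

A web server would call a function like this once per request and send the returned string back to the browser.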
20. Building a Web Search Engine
Database of web pages
Crawling the web collecting pages and links
Indexing them efficiently
Responding to Searches
Spell checking – edit distance
How to find documents that match a query
How to rank the “best” documents
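Edit distance, mentioned above for spell checking, counts the fewest single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below uses the standard dynamic-programming formulation (Levenshtein distance); the slide does not say which variant a real search engine uses, so treat this as one reasonable choice.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

print(edit_distance("gogle", "google"))
```

A spell checker could suggest the dictionary words with the smallest distance to the query word.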
21. Crawling
activeURLs = ["www.yahoo.com"]
while len(activeURLs) > 0:
    newURLs = []
    for URL in activeURLs:
        page = downloadPage(URL)
        newURLs += extractLinks(page)
    activeURLs = newURLs
Problems:
Will keep revisiting the same pages
Will take very long to get a good view of the web
Will annoy web server admins
downloadPage and extractLinks must be very robust
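Two of these problems can be addressed with small changes: remember which pages have been visited, and skip pages that fail to download. The sketch below crawls a tiny in-memory stand-in for the web (`FAKE_WEB`, with made-up hostnames) instead of making real requests, so `get_links` plays the role of downloadPage plus extractLinks. It does nothing about crawl speed or politeness toward server admins.

```python
from collections import deque

# Hypothetical link graph standing in for real pages.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": ["a.example", "missing.example"],
}

def crawl(seed, get_links, max_pages=1000):
    """Breadth-first crawl that never visits the same URL twice."""
    visited = set()
    frontier = deque([seed])
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            links = get_links(url)
        except KeyError:
            continue  # page could not be fetched; keep going
        order.append(url)
        frontier.extend(link for link in links if link not in visited)
    return order

print(crawl("a.example", FAKE_WEB.__getitem__))
```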
22. Building a Web Search Engine
Database of web pages
Crawling the web collecting pages and links
Indexing them efficiently
Responding to Searches
How to find documents that match a query
How to rank the “best” documents
23. Building an Index
What if we just stored all the pages?
Answering a query would be Θ(size of the database)
(need to look at all characters in database)
Google: about 40 Billion pages (1 Trillion URLs, but number
actually indexed is a closely kept corporate secret)
* 60 KB (average web page size)
= ~2.4 Quadrillion bytes to search!
Linear is not nearly good enough when n is Quadrillions
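The arithmetic can be checked directly. The page count and average page size come from the slide; the scan rate of 100 MB per second is purely an assumed figure for one machine, chosen only to show how long a linear scan would take.

```python
pages = 40 * 10**9           # about 40 billion pages (from the slide)
avg_page_bytes = 60 * 10**3  # 60 KB average page size (from the slide)
total_bytes = pages * avg_page_bytes

scan_rate = 100 * 10**6      # ASSUMPTION: 100 MB/s for a single machine
seconds = total_bytes / scan_rate
days = seconds / 86400

print(f"{total_bytes:.1e} bytes, about {days:.0f} days per query")
```

Under that assumption, a single query would take most of a year, which is why an index is needed.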
24. Hash Table
Index | Key-Value Pairs
0 | { <“Colleen”, ? >, <“virginia”, ? >, … }
1 | { <“Bob”, ? >, … }
2 |
3 |
… |
[about a million bins?]
def lookup(key, table): return searchEntries(table[H(key, len(table))])
Finding a good H is difficult
You can download Google’s from
http://code.google.com/p/google-sparsehash/
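A minimal version of the table above can be written with a list of bins, where each bin is a list of key-value pairs. The hash function `H` here is a simple polynomial string hash picked for illustration; it is an assumption and is not the function used by Google's library. `make_table` and `insert` are helper names added for the sketch.

```python
def H(key, nbins):
    """Map a string key to a bin number (simple illustrative hash)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % nbins
    return h

def make_table(nbins):
    return [[] for _ in range(nbins)]

def insert(key, value, table):
    entries = table[H(key, len(table))]
    for i, (k, _) in enumerate(entries):
        if k == key:
            entries[i] = (key, value)  # replace an existing entry
            return
    entries.append((key, value))

def lookup(key, table):
    """Search only the one bin the key hashes to."""
    for k, v in table[H(key, len(table))]:
        if k == key:
            return v
    return None

table = make_table(8)
insert("Colleen", 1, table)
insert("Bob", 2, table)
print(lookup("Colleen", table), lookup("virginia", table))
```

With a good hash function and enough bins, each bin stays short, so a lookup examines only a few entries regardless of table size.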
25. Google’s Lexicon
1998: 14 million words (billions today?)
Lookup word in H(word, nbins): maps to WordID
Key | Words
0 | [<“aardvark”, 1024235>, ...]
1 | [<“aaa”, 224155>, ..., <“zzz”, 29543>]
... | ...
nbins – 1 | [<“abba”, 25583>, ..., <“zeit”, 50395>]
26. Google’s Reverse Index
(Based on 1998 paper…definitely changed some since then, but now they are secretive!)
WordId | ndocs | pointer
00000000 | 3 | → “Inverted Barrels”
00000001 | 15 | →
... | ... | ...
16777215 | 105 | →
“Inverted Barrels”: 41 GB (1998). Today: many TB?
Lexicon: 293 MB (1998). Today: many GB?
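The core idea of a reverse (inverted) index is a map from each word to the documents that contain it, so a query only touches the entries for its own words. The sketch below is a toy version using a Python dictionary; the function names, the sample documents, and the "all words must match" query rule are illustrative assumptions, not details from the 1998 paper.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "the world wide web", 2: "the web crawler", 3: "hash table"}
index = build_index(docs)
print(search(index, "the web"))
```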
27. Inverted Barrels
docid (27 bits) | nhits (5 bits) | hits (16 bits each)
7630486927 | 23 | ...
plain hit:
  capitalized: 1 bit
  font size: 3 bits
  position: 12 bits (first 4095 chars, everything else)
extra info for anchors, titles (less position bits)
Suggested experiment for winter break:
is the position field still only 12 bits?
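The three plain-hit fields add up to exactly 16 bits (1 + 3 + 12), so a hit fits in one small integer. The sketch below packs and unpacks such a hit with shifts and masks. The order of the fields within the 16 bits is an assumption, and clamping larger positions to 4095 is one reading of "first 4095 chars, everything else".

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < 8:
        raise ValueError("font size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # ASSUMPTION: later positions share the top value
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_plain_hit(True, 3, 100)
print(hit, unpack_plain_hit(hit))
```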
28. Building a Web Search Engine
Database of web pages
Crawling the web collecting pages and links
Indexing them efficiently
Responding to Searches
Spell checking – edit distance
How to find documents that match a query
How to rank the “best” documents
29. Finding the “Best” Documents
Humans rate them
“Jerry and David’s Guide to the World Wide Web”
(became Yahoo!)
Machines rate them
Count number of occurrences of keyword
Easy for sites to rig this
Machine language understanding not good enough
Business Model
Whoever pays you the most is listed first
30. PageRank
If a site is important and interesting, other sites
will link to it.
Don’t ever take <a href=http://www.cs.virginia.edu/cs1120>cs1120</a>!
But…not all links are equal:
if a lot of highly-ranked sites link to this site,
this site should be highly-ranked.
31. PageRank
def pageRank(u):
    rank = 0
    for b in linksToPage(u):
        rank = rank + pageRank(b) / Links(b)
    return rank
Would this work?
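One way to explore the question is to run the recursive definition on the smallest possible cycle: two pages that link to each other. The sketch below does that with a made-up two-page graph; `LINKS`, `links_to_page`, and the snake_case names are illustrative stand-ins for the slide's linksToPage and Links.

```python
# Hypothetical two-page web: a links to b, and b links back to a.
LINKS = {"a": ["b"], "b": ["a"]}

def links_to_page(u):
    """Pages that contain a link to u."""
    return [p for p, outs in LINKS.items() if u in outs]

def page_rank(u):
    rank = 0
    for b in links_to_page(u):
        rank = rank + page_rank(b) / len(LINKS[b])
    return rank

try:
    page_rank("a")
    terminated = True
except RecursionError:
    terminated = False  # the rank of a needs the rank of b, which needs the rank of a

print("terminated:", terminated)
```

On this graph the recursion never bottoms out, and on a graph with no cycles every rank would come out as 0, since the only starting value is 0.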
32. Converging PageRank
Ranks of all pages depend on ranks of all other
pages
Keep recalculating ranks until they converge
def CalculatePageRanks (urls):
initially, every rank is 1
for as many times as necessary
calculate a new rank for each page (using old ranks)
replace the old ranks with the new ranks
How do initial ranks affect results?
How many iterations are necessary?
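The pseudocode above can be sketched directly in Python. This version follows the slide's simple formula, with no damping factor, and assumes every page has at least one outgoing link and links only to pages in the graph. The three-page graph and the fixed count of 50 iterations are illustrative choices.

```python
def calculate_page_ranks(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    ranks = {page: 1.0 for page in links}  # initially, every rank is 1
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for b, outs in links.items():
            for u in outs:
                # b shares its old rank equally among the pages it links to
                new_ranks[u] += ranks[b] / len(outs)
        ranks = new_ranks  # replace the old ranks with the new ranks
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = calculate_page_ranks(links)
print({page: round(rank, 3) for page, rank in ranks.items()})
```

On this graph the ranks settle near a = 1.2, b = 0.6, c = 1.2. Graphs with dead-end pages or pure cycles may not settle under this simple rule.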
33. PageRank: 1998
Crawlable web (1998):
150 million pages, 1.7 Billion links
Database of 322 million links
Converges in about 50 iterations
Initialization matters
All pages = 1: very democratic, models browser
equally likely to start on random page
www.yahoo.com = 1, ..., all others = 0
More like what Google probably uses
34. Do we have a search engine?
Theoretician: Sure!
Ali G: No way! It’ll blow up.
Google’s First Server
35. How do we make our service fast enough to index the whole web and serve billions of requests?
36. Counting Word Occurrences
“When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, …”
→ [<“When”, 1>, <“in”, 1>, <“the”, 2>, …]
“We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the …”
→ [<“We”, 1>, <“in”, 1>, <“the”, 2>, …]
map(countWords, docs)
If we have enough machines, can we do this fast for the whole web?
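Counting words in one document does not depend on any other document, which is what makes the question interesting. The sketch below counts each document independently with `map` and then merges the per-document counts. The function name `count_words`, the shortened sample texts, and the regular expression used to split words are illustrative assumptions.

```python
import re
from collections import Counter
from functools import reduce

def count_words(doc):
    """Count word occurrences in a single document (case-sensitive)."""
    return Counter(re.findall(r"[A-Za-z']+", doc))

docs = [
    "When in the Course of human events, it becomes necessary "
    "for one people to dissolve the political bands",
    "We the People of the United States, in Order to form a more perfect Union",
]

# Each call touches only its own document: no shared state, no mutation.
per_doc = list(map(count_words, docs))

# Merging is a separate step that combines the independent results.
total = reduce(lambda a, b: a + b, per_doc, Counter())
print(total["the"], total["in"])
```

Because `count_words` has no side effects, the built-in `map` could in principle be swapped for a parallel map that hands each document to a different machine.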
39. Key to Massive Parallel Execution
Get rid of state and mutation!
40. Functional Programming (PS 1-4)
(define (count-matches p b)
  (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p)))
Interpreters
def meval(expr, env):
    …
    return evalApplication(expr, env)
Turing Machine: Any Mechanical Computation
... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...
(state diagram: states 1, 2, 3)
Mechanical Logic: Any Discrete Function
(or a b) = (not (and (not a) (not b)))
A B C | R1 R0
0 0 0 | 0 0
0 0 1 | 0 1
… … … | … …
AND, NOT
“Magic” Transistors
42. Objects
SimObject, PhysicalObject, Place, MobileObject (class hierarchy)
State and Mutation
m1: 1 2 3
Functional Programming (PS 1-4)
Interpreters
Turing Machine: Any Mechanical Computation
Mechanical Logic: (or a b)
44. Charge
Now, you know almost everything you need to build the next reddit or google!
Objects
State and Mutation
Functional Programming (PS 1-4)
Interpreters
Any Mechanical Computation
Any Discrete Function
Mechanical Logic
“Magic” Transistors
Recursive Definitions, Universality, Abstraction