This document summarizes a presentation about linking to copies of music videos on YouTube. It discusses how metadata is lost when a YouTube video is removed, using one copy of "Satisfaction" by The Rolling Stones as an example; after that copy was removed, nearly 300 other copies of the song remained on YouTube. It also discusses querying music-related sites whose web addresses (URIs) are transparent, so that artist and song title can be parsed from them. Finally, charts show how the availability of YouTube video URIs decreases over time for three music datasets.
YouTube Music Videos Metadata Lost When Original Removed
1. Music Videos Copies in YouTube
Project Presentation, 12/15/10
Matthias Prellwitz
Advisor: Dr. Nelson
2. Linking to a particular copy
“Rolling Stones - Satisfaction”
ODU CS 895 F10, Videos Copies in YouTube, 12/15/10, Matthias Prellwitz
3. Metadata lost when YouTube video disappears
video title: The Rolling Stones - Satisfaction
url: http://www.youtube.com/watch?v=214szPQBUYc
4. Metadata hard to recover from Search Engines
5. But nearly 300 copies remain in YouTube
7. Popular Music: US Top 40 Singles Charts of 9/25/10
[Figure 1: Set distribution: Genre/Publication Year, Dataset: US Singles Charts Top 40 of 9/25/10, against Discogs. Number of items: 49. Rows: genres; columns: N/A and 1960 to 2010; legends: Percentage, # Songs. Source: www.discogs.com; multiple genre assignments possible.]
3.3.2 Music Blogs
Disregarding the current popularity of a song by its chart ranking leads to the approach of checking sites that frequently link to YouTube URIs. Blogs aim to make it easy to provide new textual content enriched with media and/or links to external sites, and display entries in descending order. Three blogs from Google's blog publishing service Blogger.com were [...]
Finishing with a broader view of popularity in terms of music, "The 500 Greatest Songs of All Time" are observed. Rolling Stone magazine published a list of 500 songs on 12/9/2004 [5]. "The song list was chosen based on votes by 172 musicians, critics, and music-industry figures." [9] The set distribution in figure 3 shows stability of songs of older age between the 1960s and 1980s. Examples [...]
8. Popular Music: Selected Music Blogs
[Figure 2: Set distribution: Genre/Publication Year, Dataset: Music Blogs at blogspot.com, against Discogs. Number of items: 742. Rows: genres; columns: N/A and 1960 to 2010; legends: Percentage, # Songs. Source: www.discogs.com; multiple genre assignments possible.]
9. Popular Music: The 500 Greatest Songs of all Time
[Figure 3: Set distribution: Genre/Publication Year, Dataset: The 500 Greatest Songs of All Time by Rolling Stone, against Discogs. Number of items: 500. Rows: genres; columns: N/A and 1960 to 2010; legends: Percentage, # Songs. Source: www.discogs.com; multiple genre assignments possible.]
10. Total Result Size Range: US Top 40 Singles Charts of 9/25/10
[Figure 4: Total Result Size log(2), Dataset: Top 40 US Single Charts of 9/25/10. x-axis: Set items (0 to 45); y-axis: Result Size (log scale); legend: Median, Median (item touching zero), Min-Max Range. Annotated items: "Lady Gaga - Alejandro" (123,239; 83,298; 43,945) and "Selena Gomez & The Scene - A Year Without Rain" (66; 35; 26).]
[...] we queried. Due to misspellings and [...] terms ('offical', 'HQ', 'lyrics', etc.), [...] against search engines with a site [...] improvement to 73% at the first result [...].last.fm/music via Google search en[...] related services without available APIs [...] Google against www.ilike.com with 38% success [...]
[...] similar high variation of announced total result sizes of 1,000 or more also applies to the second dataset (Figure [...]
11. Total Result Size Range: Selected Music Blogs
[Figure 5: Total Result Size log(2), Dataset: Music Blogs at blogspot.com. x-axis: Set items (0 to 741); y-axis: Result Size (log scale); legend: Median, Median (item touching zero), Min-Max Range. Annotated items: "Lady Gaga - Bad Romance" (264,753; 256,205; 232,936) and "Mariah Carey featuring Juelz Santana & Bone Thugs-n-Harmony - Don't Forget About Us" (0; 0; 0).]
12. Total Result Size Range: The 500 Greatest Songs of all Time
[Figure 6: Total Result Size log(2), Dataset: The Top 500 Songs of All Time. x-axis: Set items (0 to 500); y-axis: Result Size (log scale); legend: Median, Median (item touching zero), Min-Max Range. Annotated items: "Michael Jackson - Billie Jean" (174,088; 162,937; 145,076) and "The Isley Brothers - That Lady (Part 1 and 2)" (0; 0; 0).]
[...] had a median uniqueness that reflects almost half of the maximum possible retrievable results. (Table 4.3)
13. URI Unavailability: Rooted from a selected collection
[Figure 7: Gone URIs from collection. x-axis: Weeks (0 to 10); y-axis ticks from 0.86 to 1.00; datasets: Top 40 US Singles Charts, Music Blogs @ blogspot.com, The 500 Greatest Songs.]
14. URI Unavailability: Expected Half-life
[Figure 8: Predicted Half life of collection. x-axis: Month (0 to 18); y-axis ticks from 0.5 to 1.0, with the half life marked at 0.5; linear regression per dataset: Top 40 US Singles Charts, Music Blogs @ blogspot.com, The 500 Greatest Songs.]
4.5 Publish and Removal
Following these results, an evaluation was undertaken to observe the daily rate of newly published and removed videos during the observation period. According to the higher regression rate of a collection of top current chart songs, figure [...]
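The half-life prediction behind Figure 8 can be illustrated with a minimal sketch: fit a straight line to the fraction of URIs still available per week and extrapolate to the point where it crosses 0.5. The function names and the sample values below are hypothetical and are not the study's measurements; ordinary least squares is assumed as the regression method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_half_life(weeks, available_fraction):
    """Week at which the fitted line reaches 0.5 (half of the URIs gone)."""
    slope, intercept = fit_line(weeks, available_fraction)
    if slope >= 0:
        raise ValueError("availability is not decreasing; no half-life")
    return (0.5 - intercept) / slope


# Hypothetical observations (illustrative only, not the study's data):
# one percent of the collection disappears per week.
weeks = [0, 1, 2, 3, 4]
fraction = [1.00, 0.99, 0.98, 0.97, 0.96]
print(round(predicted_half_life(weeks, fraction), 1))
```

A linear extrapolation is only as good as the assumption that the removal rate stays constant beyond the observation period.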
15. URI Publication and Removal Rate
[Figure 9: New published and removed URIs by dataset. x-axis: Weeks (-9 to 0); y-axis: ticks from -25 to 25, New URIs above zero and Gone URIs below zero; label: Median Absolute Deviation; linear regression per dataset: Top 40 US Singles Charts, Music Blogs @ blogspot.com, The 500 Greatest Songs.]
16. Lifetimes of unavailable videos
[...] most of them have been published this year, and therefore for the gone URIs almost all removed videos had a lifetime within a year (99.6%). For the set 'Top 500 songs of all time', which also contains songs dating back to the 1960s, and considering that YouTube was founded in 2005, the majority (62.4%) of now unavailable videos likewise existed for less than one year.
[Figure 10: YouTube video lifetimes, separated by years. One bar per dataset (US Top 40 Singles Charts, Music Blogs, The 500 Greatest Songs); x-axis: Percent (0 to 100); segments: 0-1, 1-2, 2-3, 3-4, 4-5.]
Breaking the interval of the first year down into weekly periods, and focusing only on the first four weeks (a lifetime of a month or less), brings up the results shown in figure 11. The upper bar displays the set of current chart songs: almost half of the removed URIs (41.3%) had a lifetime of a month or less, and 16.0% existed for only up to one week. In comparison, 25% of the gone URIs from the set 'Top 500 songs of all time' had a lifetime of less than a month and 9.9% of less than a week.
[Figure 11: YouTube video lifetimes, separated by weeks (zoom into first four adding up to lifetimes within one month). One bar per dataset; x-axis: Percent (0 to 100); segments: 0-1, 1-2, 2-3, 3-4.]
4.7 Reasons for gone YouTube videos
Neglecting the specifics of each dataset and grouping the reasons (Table 4.7) why YouTube video watch URIs became unavailable during the observation period shows [...] triggered by claims by copyright holders. As a result, a video was blocked or removed, or the associated account was closed by YouTube afterwards. For 23.3% of the removed videos, [...]
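The lifetime breakdowns in Figures 10 and 11 rest on simple date arithmetic. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, a year is approximated as 365 days, and the list of lifetimes is made up for illustration. Only the example dates (published 2009-06-13, removed 2010-04-09) come from the presentation.

```python
from datetime import date


def lifetime_days(published: date, removed: date) -> int:
    """Number of days a video was online."""
    return (removed - published).days


def year_bucket(days: int) -> str:
    """Label such as '0-1' or '1-2', approximating a year as 365 days."""
    years = days // 365
    return f"{years}-{years + 1}"


def percent_within(lifetimes, max_days):
    """Share (in percent) of lifetimes that are at most max_days long."""
    if not lifetimes:
        return 0.0
    return 100.0 * sum(1 for d in lifetimes if d <= max_days) / len(lifetimes)


# Dates of the 'Satisfaction' example copy from the presentation.
days_online = lifetime_days(date(2009, 6, 13), date(2010, 4, 9))
print(days_online, year_bucket(days_online))

# Hypothetical lifetimes in days (illustrative only).
sample = [3, 20, 300, 400]
print(percent_within(sample, 30), percent_within(sample, 7))
```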
17. Reasons for now unavailable videos
[...] YouTube removed a video or discontinued the user account, e.g. due to violations of one of its policies or its terms of service. For only 13.2%, users themselves took action to remove one of their videos or closed their account. The remaining group summarizes observed crawling errors or status changes of a video, e.g. the uploading user set a video to private, or it is no longer directly available due to its content.
With the focus on popular music videos, it can be concluded from the interpretation of the gone messages that copyright owners (here, music record companies) have the main influence and assertiveness in having their copyrighted material removed when it was uploaded by others without their permission.
4.8 Summary
The evaluation of the dataset showed, even over a restricted monitoring period of at most 2.5 months, that a high fluctuation of music-related videos on YouTube exists. As we solely focused on popular music, apart from a user removing a video or closing their account, the majority of removals of a video copy are due to its copyrighted content: the removal is initiated by the copyright holder of the content, or YouTube itself removes the copy due to a violation of its terms of service or policies. Hence, choosing a particular copy of this class of videos and publishing the given URI carries a high risk of it becoming unavailable.
5. RETRIEVAL AND PRESERVATION APPROACHES
It was discovered that once a video is gone, neither the valuable metadata from the HTML representation of the given video watch URI remains available, nor does its ATOM representation from the Gdata API return the basic video characteristics (2.2).
5.1 Existing web infrastructure
As YouTube is no longer a source for the aboutness of a video once it is gone, the existing public web infrastructure was observed for its likelihood of holding a video's information afterwards.
5.1.1 Search Engine Caches
[...] that its search engine index is updated very quickly. Continuing with the initially mentioned example of a video copy of the song 'Satisfaction' by 'The Rolling Stones', which was removed due to copyright claims and whose disappearance was discovered on 4/9/10: searching Google with its video watch URI http://www.youtube.com/watch?v=214szPQBUYc brought up 34 results on 12/13/10, none of which were affiliated with YouTube. Parsing out exact metadata about the desired artist and song title from the HTML neighborhood is vulnerable to misspellings, abbreviations, and additions, or fails if the surrounding markup does not consider the video content.
5.1.2 Web Archives
Utilizing web archives, preserving websites over time by [...]
Table 4: Video Gone Reasons
Gone Reason | Percent
Third-party triggered | 48.8
  This video contains content from [copyright owner], who has blocked it (in your country) on copyright grounds. | 26.0
  This video is no longer available because the YouTube account associated with this video has been terminated due to multiple third-party notifications of copyright infringement. | 18.3
  This video is no longer available due to a copyright claim by [copyright owner] | 4.5
YouTube triggered | 23.7
  This video is no longer available because the YouTube account associated with this video has been terminated. | 13.5
  This video has been removed as a violation of YouTube's [...] policy | 3.1
  This video has been removed because its content violated YouTube's Terms of Service. | 2.8
  This video contains content from [copyright owner]. It is not available (in your country). | 2.4
  This video is no longer available because the YouTube account associated with this video has been terminated due to repeated copyright infringements. | 1.9
User triggered | 13.2
  This video is no longer available because the uploader has closed their YouTube account. | 12.3
  This video has been deleted. | 0.9
Other/Errors | 14.3
  Authentication required (see 2.2 reasons) | 7.8
  N/A | 3.3
  Verification required (see 2.2 reasons) | 2.3
  Upgrade to Flash Player 10 | 0.7
  The video you have requested is not available | <0.1
  This video is not available in your country | <0.1
  The video is a duplicate copy of a previously uploaded video | <0.1
18. When a YouTube video disappears
‣ video title: The Rolling Stones - Satisfaction
‣ url: http://www.youtube.com/watch?v=214szPQBUYc
‣ published 2009-06-13 13:44, removed 2010-04-09 (300 days online)
HTTP/1.1 303 See Other
Location: http://www.youtube.com/index?ytsession=JzUNcRUYijSVoqkvtLNiXZG...
Content-Type: text/html; charset=utf-8
19. Metadata purged from YouTube Databases
‣ previous example video title: The Rolling Stones – Satisfaction
‣ removal reason: This video is no longer available due to a copyright claim by ABKCO.
‣ Video feed
http://gdata.youtube.com/feeds/api/videos/214szPQBUYc
HTTP/1.1 403 Forbidden
Content-Type: text/html; charset=UTF-8
Private video
‣ Related videos
http://gdata.youtube.com/feeds/api/videos/214szPQBUYc/related
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
Parent Video not found
‣ Video comments
http://gdata.youtube.com/feeds/api/videos/214szPQBUYc/comments
HTTP/1.1 200 OK
Content-Type: application/atom+xml; charset=UTF-8
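The three feed checks above follow one URI pattern, which a minimal sketch can make explicit. The function names are hypothetical, the URIs are the ones shown on the slide, and no request is sent; the status codes are passed in as already observed values, so nothing here claims that the Gdata endpoints still respond.

```python
GDATA_BASE = "http://gdata.youtube.com/feeds/api/videos/"  # as shown on the slide


def gdata_feed_uris(video_id):
    """Build the video, related, and comments feed URIs for one video ID."""
    base = GDATA_BASE + video_id
    return {
        "video": base,
        "related": base + "/related",
        "comments": base + "/comments",
    }


def basic_metadata_served(statuses):
    """True only if the video feed itself answered 200.

    Assumption: the related and comments feeds do not carry the basic
    video characteristics, so they are ignored here.
    """
    return statuses.get("video") == 200


uris = gdata_feed_uris("214szPQBUYc")
print(uris["comments"])

# Status codes observed for the removed example video on the slide.
observed = {"video": 403, "related": 404, "comments": 200}
print(basic_metadata_served(observed))
```

In the slide's example the comments feed still answers 200 OK, while the feed holding the basic video characteristics does not.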
20. Metadata Normalization
dereferencing ASIN via amazon ws:
Artist: Michael Jackson
Title: Billie Jean (Single Version)
21. Availability of music-related metadata
‣ parsed out only the first time a URI showed up in the result list
‣ YouTube crawling restrictions
Table 1: Availability of additional music-related metadata
Dataset | Total URI count | YouTube | Amazon
US Single Charts Top 40 | 85,831 | 36,040 (42.0%) | 35,491 (41.4%)
Music Blogs | 582,068 | 69,068 (11.9%) | 67,483 (11.6%)
The 500 greatest songs of all time | 313,542 | 94,275 (30.0%) | 88,290 (28.2%)
‣ Remaining URIs portion
‣ query video title against music related services via search engines
‣ Google/Yahoo! with site parameter www.last.fm/music
[...] attached music-related information by YouTube itself or via an Amazon affiliate selling link.
For URIs not having one of these high-quality data, other methods have to be considered to get information about artist and song title. In previous smaller-scale tests we explored the quality of querying with the free-form video title against music-related APIs (MusicBrainz, Amazon Product Affiliate API, Last.fm, Rhapsody) as well as via the Google and Yahoo! search engines with site parameters of music-related sites that have transparent URI syntaxes from which artist and title information can be parsed out easily. Among the APIs, Last.fm (65%) and Amazon (43%) returned the best quality, where the first result matched the song of the YouTube [...]
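The search-engine approach described above (free-form video title, site parameter, transparent URI) can be sketched as follows. Everything here is an illustration under assumptions: the helper names are hypothetical, the noise-term list is a small sample, the `site:` prefix is assumed as the query syntax for the site parameter, and the `/music/<artist>/_/<title>` path layout is assumed for the transparent URI syntax of www.last.fm/music.

```python
import re
from urllib.parse import urlparse, unquote_plus

# Sample of title additions that carry no artist/title information (assumed list).
NOISE_TERMS = ("offical", "official", "hq", "lyrics")


def clean_title(title):
    """Drop noise terms, also when wrapped in brackets such as '(HQ)'."""
    words = re.split(r"\s+", title.strip())
    kept = [w for w in words if w.strip("()[]").lower() not in NOISE_TERMS]
    return " ".join(kept)


def site_query(title, site="www.last.fm/music"):
    """Search term restricted to one music-related site."""
    return f"site:{site} {clean_title(title)}"


def parse_transparent_uri(uri):
    """Return (artist, title) from an assumed /music/<artist>/_/<title> path."""
    parts = urlparse(uri).path.split("/")
    if len(parts) >= 5 and parts[1] == "music" and parts[3] == "_":
        return unquote_plus(parts[2]), unquote_plus(parts[4])
    return None


print(site_query("The Rolling Stones - Satisfaction (HQ) lyrics"))
print(parse_transparent_uri("http://www.last.fm/music/Michael+Jackson/_/Billie+Jean"))
```

Parsing the result URI rather than the result snippet avoids the misspellings and additions that free-form titles tend to carry.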
22. Retrieving and preserving a video's metadata
‣ Active preservation attempt once a video copy is available
‣ Parse HTML out for structured music-related metadata
‣ YouTube generated meta data
‣ AmazonMP3 affiliate link
‣ search engines with free-form video title against music-related websites
‣ Preserving metadata into the public web infrastructure
‣ (micro) blogging systems
‣ online bookmark services
25. Pointing to a Resolver service
‣ http://ytresolve.cs.odu.edu/r/http://www.youtube.com/watch?v=214szPQBUYc/
‣ Author-side approach
‣ content creator points directly to a resolver service
‣ Server-side approach
‣ Plugin/Renderer class automatically rewrites YouTube video watch URIs to resolver service
‣ Client-side approach
‣ Web-Browser plugin intercepts click on YouTube video watch URIs and redirects to resolver service
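The server-side approach amounts to prefixing each YouTube video watch URI with the resolver address. The following is a minimal sketch, not the presented plugin: the function names and the regular expression are assumptions, the resolver prefix is the one from the slide, and the trailing slash of the slide's example URI is not reproduced.

```python
import re

RESOLVER_PREFIX = "http://ytresolve.cs.odu.edu/r/"  # resolver address from the slide

# Matches watch URIs that are not already behind the resolver prefix.
WATCH_URI = re.compile(r"(?<!/r/)https?://www\.youtube\.com/watch\?v=[\w-]+")


def to_resolver_uri(watch_uri):
    """Author-side form: point directly at the resolver."""
    return RESOLVER_PREFIX + watch_uri


def rewrite_watch_uris(html):
    """Server-side form: rewrite every watch URI found in rendered markup."""
    return WATCH_URI.sub(lambda m: to_resolver_uri(m.group(0)), html)


page = '<a href="http://www.youtube.com/watch?v=214szPQBUYc">Satisfaction</a>'
print(rewrite_watch_uris(page))
```

The negative lookbehind keeps the rewrite idempotent, so running a renderer twice over the same markup does not stack resolver prefixes.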
26. YouTube Resolver service
http://www.youtube.com/watch?v=214szPQBUYc
http://www.youtube.com/v/214szPQBUYc
http://www.youtu.be/214szPQBUYc
http://www.youtube.com/user/WEASELxLOVER#p/a/u/2/214szPQBUYc
HTTP redirect status
‣ HTTP/1.1 200 OK
‣ HTTP/1.1 303 See Other, Location: http://www.youtube.com/index?ytsession=...
‣ HTTP/1.1 303 See Other *)
*) http://www.youtube.com/verify_controversy...
http://www.youtube.com/verify_age...
https://www.google.com/accounts/ServiceLogin...
http://www.youtube.com/das_captcha..
search for preserved metadata
‣ in list of designated accounts
‣ exact / best available granularity
query YouTube API with those
Provide (and evaluate) alternative copies
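The first two resolver steps, normalizing the incoming URI form and reading the HTTP status with its Location header, can be sketched as below. This is an illustration under assumptions, not the presented service: the function names and the labels "available", "gone", and "restricted" are hypothetical, video IDs are assumed to be 11 characters long, and the responses are passed in as values rather than fetched.

```python
import re
from urllib.parse import urlparse, parse_qs

VIDEO_ID = re.compile(r"[\w-]{11}")  # assumed ID shape

# Redirect targets marked with *) on the slide.
RESTRICTED_PATHS = ("/verify_controversy", "/verify_age",
                    "/accounts/ServiceLogin", "/das_captcha")


def extract_video_id(uri):
    """Pull the video ID out of the four URI forms listed on the slide."""
    parsed = urlparse(uri)
    if parsed.path == "/watch":
        candidate = parse_qs(parsed.query).get("v", [None])[0]
    elif parsed.path.startswith("/user/"):
        candidate = parsed.fragment.rsplit("/", 1)[-1]
    elif parsed.path.startswith("/v/"):
        candidate = parsed.path[len("/v/"):]
    elif parsed.netloc.endswith("youtu.be"):
        candidate = parsed.path.lstrip("/")
    else:
        candidate = None
    if candidate and VIDEO_ID.fullmatch(candidate):
        return candidate
    return None


def classify_response(status, location=None):
    """Map an observed status and Location header to a hypothetical label."""
    if status == 200:
        return "available"
    if status == 303 and location:
        path = urlparse(location).path
        if any(path.startswith(p) for p in RESTRICTED_PATHS):
            return "restricted"
        if path == "/index":
            return "gone"
    return "unknown"


print(extract_video_id("http://www.youtube.com/user/WEASELxLOVER#p/a/u/2/214szPQBUYc"))
print(classify_response(303, "http://www.youtube.com/index?ytsession=..."))
```

Only the "gone" case would trigger the search for preserved metadata and the query for alternative copies.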
27. Future Work
‣ Evaluation of preservation and retrieval quality of chosen services
‣ exchange services
‣ additional automation of preservation process
‣ once YT URI was passed for resolving
‣ Evaluation of retrieved available copies
‣ redirect to best copy instead of returning a list to choose
‣ Consider international requesters
‣ taking requester's location (country) into account
28. Summary
‣ Pointing to a specific YouTube video copy by its URI has a risk of disappearance
‣ alternative copies over time available
‣ YouTube URIs unlikely to be cached once gone
‣ YouTube metadata only reliable for available URIs
‣ active preservation attempt
‣ Introducing a level of indirection: Resolver service
‣ check URI status and location header
‣ search the public web for injected metadata
‣ query for alternative copies