Internet content as research data
1. Internet Content as Research Data
Digital Humanities Australia
March 2012, Canberra
Monica Omodei & Gordon Mohr
2. Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
3. Common Collection Strategies
• Crawl Scope & Focus
1) Thematic/Topical (elections, events, global warming…)
2) Resource-specific (video, pdf, etc.)
3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
4) Exhaustive (end of life, closure crawls, national domains)
5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
4. Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
5. Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
6. Internet Archive’s Web Archive
Negatives
– Because of its size, keyword search is not possible
– Because of its size, it is fully automated – QA is not possible
7. Common Use Cases for IA’s web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
11. Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
• commoncrawl.org/data/accessing-the-data/
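Reading the crawl data directly from S3 uses Amazon's Requester-Pays model: the caller must flag the request so transfer costs bill to their own account, otherwise S3 refuses it. A minimal sketch of what that flag looks like with the boto3 client (the bucket and key names below are placeholders, not real Common Crawl paths):

```python
# Sketch: parameters for a Requester-Pays S3 GetObject call.
# Bucket/key names are illustrative placeholders only.

def requester_pays_request(bucket: str, key: str) -> dict:
    """Return keyword arguments for an S3 GetObject call that bills
    data-transfer costs to the requester, not the bucket owner."""
    return {
        "Bucket": bucket,
        "Key": key,
        "RequestPayer": "requester",  # without this flag, S3 returns 403
    }

# With AWS credentials configured, the actual call would look like:
#   import boto3
#   s3 = boto3.client("s3")
#   obj = s3.get_object(**requester_pays_request("example-bucket",
#                                                "crawl/segment-0/file.warc.gz"))
params = requester_pays_request("example-bucket", "crawl/segment-0/file.warc.gz")
print(params["RequestPayer"])
```

In practice, running the analysis as a MapReduce job inside EC2 (as the slide suggests) avoids most transfer charges, since data stays within Amazon's network.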
12. Common Crawl Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
13. Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with a selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion – pandora.nla.gov.au/registration_form.html
15. Pandora Archive
• Negatives
– Labour intensive, so small
– Significant content missed because permission to copy was refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing the labour costs of checking/fixing crawls
16. Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K
21. .au Domain Annual Snapshots
• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full-text search – tailored to archive search
• Separate .gov crawl publicly accessible soon
22. Australian web domain crawls

Year  Files        Hosts crawled  Size (TB)
2005  185 million  811,523        6.69
2006  596 million  1,046,038      19.04
2007  516 million  1,247,614      18.47
2008  1 billion    3,038,658      34.55
2009  765 million  1,074,645      24.29
2011  660 million  1,346,549      30.71
23. Internet Memory Foundation Archive
• internetmemory.org/en/
• No keyword search yet – only URL
• A number of European partners
25. Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC, the ISO format)
26. Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
… and many more
• WebCite – webcitation.org (citation service archive)
28. Create your own Archive
• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard .warc file format
• Use web citation services that create archive copies as you bookmark pages
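A WARC file is simply a sequence of records, each consisting of a small header block followed by a payload. As a rough, hand-rolled illustration of that record layout (a real archive would be produced by Heritrix or a WARC library, which also write request/response records, digests, and unique record IDs omitted here):

```python
from datetime import datetime, timezone

def make_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Assemble one minimal 'resource'-type WARC record by hand.
    Illustrative only: not a substitute for a conformant WARC writer."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    # A record is: headers, a blank line, the payload, then two blank lines.
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + payload + b"\r\n\r\n"

record = make_warc_record("http://example.org/", b"<html>hello</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Because records are just concatenated, a whole crawl can be streamed into one (usually gzip-compressed) .warc file, which is what makes the format convenient for both crawlers and bulk analysis.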
29. Subscription Services
• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service – cdlib.org/services/uc3/was.html
• OCLC Harvester Service – oclc.org/webharvester/overview/default.htm
31. Install a web archiving system locally
• An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
32. ‘Memento’: adding time to the web
• Protocol and browser add-on (MementoFox)
• Aids discovery and aggregation of page histories
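Memento works by content negotiation in the datetime dimension: the client requests a URL from a "TimeGate" while sending an Accept-Datetime header, and is redirected to an archived snapshot near that time. A sketch of constructing such a request header (the target URL in the comment is a placeholder, not a real TimeGate):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_headers(when: datetime) -> dict:
    """Build headers for a Memento datetime-negotiation request.
    Accept-Datetime must be an RFC 1123 date in GMT."""
    return {
        "Accept-Datetime": format_datetime(when.astimezone(timezone.utc),
                                           usegmt=True)
    }

# An HTTP client would send these headers to a TimeGate, e.g.:
#   GET http://timegate.example.org/http://example.org/
h = memento_headers(datetime(2012, 3, 1, tzinfo=timezone.utc))
print(h["Accept-Datetime"])  # Thu, 01 Mar 2012 00:00:00 GMT
```

The TimeGate answers with a redirect to the closest "Memento" (archived copy) and a Link header pointing at a TimeMap listing all known captures, which is what enables aggregation across archives.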
33. Web Data Mining & Analysis – What is it? Why Do It?
• Innovation is increasingly driven by large-scale data analysis
• Fast iteration is needed to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…
37. File formats and data: CDX
• Index for Wayback Machine: used to browse
WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
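Because a CDX line is just whitespace-separated fields whose order is declared by the file's header line, a tiny parser adapts to any archive's layout. A sketch (the header and capture line below are fabricated for illustration; real CDX headers typically use single-letter field codes rather than the spelled-out names used here for readability):

```python
# Fabricated CDX header and capture line, for illustration only.
SAMPLE_HEADER = "CDX url timestamp digest mimetype statuscode"
SAMPLE_LINE = "example.org/ 20120301000000 ABC123 text/html 200"

def parse_cdx(header: str, line: str) -> dict:
    """Pair a CDX header with one capture line. The field order comes
    from the header, so the same parser works for any layout."""
    names = header.split()
    if names and names[0] == "CDX":   # drop the leading magic token
        names = names[1:]
    return dict(zip(names, line.split()))

rec = parse_cdx(SAMPLE_HEADER, SAMPLE_LINE)
print(rec["statuscode"])  # 200
```

This flat, one-line-per-capture design is what lets the Wayback Machine binary-search a sorted CDX index to answer URL lookups without touching the underlying WARC files.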
38. File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright,
privacy
• Work-in-progress: we want your feedback
39. File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
• Size comparison for the same data:
– CDX: 53 MB
– WAT: 443 MB
– WARC: 8,651 MB
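Since a WAT payload is plain JSON, any language's standard JSON library can walk it. A sketch with a fabricated record (the key names below are illustrative, not the exact WAT schema):

```python
import json

# Fabricated WAT-style payload: nested JSON metadata about one capture.
wat_json = """{
  "envelope": {
    "target_uri": "http://example.org/",
    "payload_metadata": {
      "links": [
        {"url": "http://example.org/a"},
        {"url": "http://example.org/b"}
      ]
    }
  }
}"""

record = json.loads(wat_json)

# A typical analysis task: extract outlinks to build a link graph,
# without ever touching the (much larger) full WARC content.
outlinks = [link["url"]
            for link in record["envelope"]["payload_metadata"]["links"]]
print(len(outlinks))  # 2
```

Working at this metadata level is the point of WAT: the size comparison above shows it at roughly 5% of the full WARC, and it sidesteps the copyright and privacy barriers that come with exchanging page content.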
40. Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) – http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
41. Contacts
• webarchive @ nla.gov.au
• secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei @ nla.gov.au
• gojomo @ xavvy.com