Internet content as research data
1. Internet Content as Research Data
Digital Humanities Australia
March 2012, Canberra
Monica Omodei & Gordon Mohr
2. Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
3. Common Collection Strategies
• Crawl Scope & Focus
1) Thematic/Topical (elections, events, global warming…)
2) Resource-specific (video, pdf, etc.)
3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
4) Exhaustive (end of life, closure crawls, national domains)
5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
4. Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
5. Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
6. Internet Archive’s Web Archive
Negatives
– Because of its size, keyword search is not possible
– Because of its size, it is fully automated – QA is not possible
7. Common Use Cases for IA’s web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
11. Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
• commoncrawl.org/data/accessing-the-data/
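Reading the crawl data directly from S3 uses Amazon's Requester-Pays model: the caller must flag the request so transfer costs bill to their own account, otherwise S3 refuses it. A minimal sketch of what that flag looks like with the boto3 client (the bucket and key names below are placeholders, not real Common Crawl paths):

```python
# Sketch: parameters for a Requester-Pays S3 GetObject call.
# Bucket/key names are illustrative placeholders only.

def requester_pays_request(bucket: str, key: str) -> dict:
    """Return keyword arguments for an S3 GetObject call that bills
    data-transfer costs to the requester, not the bucket owner."""
    return {
        "Bucket": bucket,
        "Key": key,
        "RequestPayer": "requester",  # without this flag, S3 returns 403
    }

# With AWS credentials configured, the actual call would look like:
#   import boto3
#   s3 = boto3.client("s3")
#   obj = s3.get_object(**requester_pays_request("example-bucket",
#                                                "crawl/segment-0/file.warc.gz"))
params = requester_pays_request("example-bucket", "crawl/segment-0/file.warc.gz")
print(params["RequestPayer"])
```

In practice, running the analysis as a MapReduce job inside EC2 (as the slide suggests) avoids most transfer charges, since data stays within Amazon's network.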
12. Common Crawl Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
13. Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with a selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion – pandora.nla.gov.au/registration_form.html
15. Pandora Archive
• Negatives
– Labour intensive, so small
– Significant content missed because permission to copy was refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing the labour costs of checking/fixing crawls
16. Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K
21. .au Domain Annual Snapshots
• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full-text search – tailored to archive search
• Separate .gov crawl publicly accessible soon
22. Australian web domain crawls

Year  Files        Hosts crawled  Size (TB)
2005  185 million  811,523        6.69
2006  596 million  1,046,038      19.04
2007  516 million  1,247,614      18.47
2008  1 billion    3,038,658      34.55
2009  765 million  1,074,645      24.29
2011  660 million  1,346,549      30.71
23. Internet Memory Foundation Archive
• internetmemory.org/en/
• No keyword search yet – only URL
• A number of European partners
25. Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC, the ISO format)
26. Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
… and many more
• WebCite – webcitation.org (citation service archive)
28. Create your own Archive
• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard .warc file format
• Use web citation services that create archive copies as you bookmark pages
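A WARC file is simply a sequence of records, each consisting of a small header block followed by a payload. As a rough, hand-rolled illustration of that record layout (a real archive would be produced by Heritrix or a WARC library, which also write request/response records, digests, and unique record IDs omitted here):

```python
from datetime import datetime, timezone

def make_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Assemble one minimal 'resource'-type WARC record by hand.
    Illustrative only: not a substitute for a conformant WARC writer."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    # A record is: headers, a blank line, the payload, then two blank lines.
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + payload + b"\r\n\r\n"

record = make_warc_record("http://example.org/", b"<html>hello</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Because records are just concatenated, a whole crawl can be streamed into one (usually gzip-compressed) .warc file, which is what makes the format convenient for both crawlers and bulk analysis.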
29. Subscription Services
• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service – cdlib.org/services/uc3/was.html
• OCLC Harvester Service – oclc.org/webharvester/overview/default.htm
31. Install a web archiving system locally
• An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
32. ‘Memento’: adding time to the web
• Protocol and browser add-on (MementoFox)
• Aids discovery and aggregation of page histories
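Memento works by content negotiation in the datetime dimension: the client requests a URL from a "TimeGate" while sending an Accept-Datetime header, and is redirected to an archived snapshot near that time. A sketch of constructing such a request header (the target URL in the comment is a placeholder, not a real TimeGate):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_headers(when: datetime) -> dict:
    """Build headers for a Memento datetime-negotiation request.
    Accept-Datetime must be an RFC 1123 date in GMT."""
    return {
        "Accept-Datetime": format_datetime(when.astimezone(timezone.utc),
                                           usegmt=True)
    }

# An HTTP client would send these headers to a TimeGate, e.g.:
#   GET http://timegate.example.org/http://example.org/
h = memento_headers(datetime(2012, 3, 1, tzinfo=timezone.utc))
print(h["Accept-Datetime"])  # Thu, 01 Mar 2012 00:00:00 GMT
```

The TimeGate answers with a redirect to the closest "Memento" (archived copy) and a Link header pointing at a TimeMap listing all known captures, which is what enables aggregation across archives.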
33. Web Data Mining & Analysis – What is it? Why Do It?
• Innovation is increasingly driven by large-scale data analysis
• Fast iteration is needed to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…
37. File formats and data: CDX
• Index for Wayback Machine: used to browse
WARC-based archive
• Space-delimited text file
• Only essential metadata needed by Wayback
– URL
– Content Digest
– Capture Timestamp
– Content-Type
– HTTP response code
– etc.
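Because a CDX line is just whitespace-separated fields whose order is declared by the file's header line, a tiny parser adapts to any archive's layout. A sketch (the header and capture line below are fabricated for illustration; real CDX headers typically use single-letter field codes rather than the spelled-out names used here for readability):

```python
# Fabricated CDX header and capture line, for illustration only.
SAMPLE_HEADER = "CDX url timestamp digest mimetype statuscode"
SAMPLE_LINE = "example.org/ 20120301000000 ABC123 text/html 200"

def parse_cdx(header: str, line: str) -> dict:
    """Pair a CDX header with one capture line. The field order comes
    from the header, so the same parser works for any layout."""
    names = header.split()
    if names and names[0] == "CDX":   # drop the leading magic token
        names = names[1:]
    return dict(zip(names, line.split()))

rec = parse_cdx(SAMPLE_HEADER, SAMPLE_LINE)
print(rec["statuscode"])  # 200
```

This flat, one-line-per-capture design is what lets the Wayback Machine binary-search a sorted CDX index to answer URL lookups without touching the underlying WARC files.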
38. File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright,
privacy
• Work-in-progress: we want your feedback
39. File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
• Size comparison for the same data:
– CDX: 53 MB
– WAT: 443 MB
– WARC: 8,651 MB
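Since a WAT payload is plain JSON, any language's standard JSON library can walk it. A sketch with a fabricated record (the key names below are illustrative, not the exact WAT schema):

```python
import json

# Fabricated WAT-style payload: nested JSON metadata about one capture.
wat_json = """{
  "envelope": {
    "target_uri": "http://example.org/",
    "payload_metadata": {
      "links": [
        {"url": "http://example.org/a"},
        {"url": "http://example.org/b"}
      ]
    }
  }
}"""

record = json.loads(wat_json)

# A typical analysis task: extract outlinks to build a link graph,
# without ever touching the (much larger) full WARC content.
outlinks = [link["url"]
            for link in record["envelope"]["payload_metadata"]["links"]]
print(len(outlinks))  # 2
```

Working at this metadata level is the point of WAT: the size comparison above shows it at roughly 5% of the full WARC, and it sidesteps the copyright and privacy barriers that come with exchanging page content.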
40. Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) – http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
41. Contacts
• webarchive @ nla.gov.au
• secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei @ nla.gov.au
• gojomo @ xavvy.com