Webarchive CDX summary
WARCNet WG1 - Comparing entire web domains
Aarhus virtual meeting 21.4.2021
yves.maurer@bnl.etat.lu
@yvesmaurer
github.com/ymaurer
Why
• We want to know more about “National Collections” and “National
Webs”
• Many Web Archives are not accessible through the Internet
• Often the only published statistic is “XX PB in archive”
• Not a lot of information on country-code TLDs (ccTLDs)
• Little info on overlap between archives
• WARC & CDX are too big
• Need for “low common denominator data” that is still “rich enough”
Size comparison (wlu)
Format         Files           Size
WARC           ~ 0.5 million   84 TB
CDXJ           ~ 0.5 million   1.6 TB
Summary file   1 file          245 MB
Reduction: WARC to CDXJ 50x, CDXJ to summary file 7000x
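The two reduction factors can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 84 TB is the WARC size, 1.6 TB the CDXJ size and 245 MB the summary file size, and that the units are decimal (1 TB = 10^12 bytes); the variable names are illustrative.

```python
# Sizes quoted on the slide, converted to bytes (decimal units assumed).
TB = 10**12
MB = 10**6

warc_bytes = 84 * TB
cdxj_bytes = 1.6 * TB
summary_bytes = 245 * MB

# How many times smaller each step is than the previous one.
warc_to_cdxj = warc_bytes / cdxj_bytes
cdxj_to_summary = cdxj_bytes / summary_bytes

print(f"WARC -> CDXJ: roughly {warc_to_cdxj:.1f}x smaller")
print(f"CDXJ -> summary file: roughly {cdxj_to_summary:.1f}x smaller")
```

Under these assumptions the ratios come out in the low 50s and around 6500, which the slide rounds to 50x and 7000x.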
Detailed info & Code
https://github.com/ymaurer/cdx-summarize
Pipeline: CDX files → cdx-summarize.py (several runs, each over a set of CDX files) → combine-summary.py → .summary JSON file
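The aggregation step can be sketched as follows. This is not the code of cdx-summarize.py; it is a minimal illustration that assumes CDXJ input lines of the form `<urlkey> <timestamp> <JSON>` with `url`, `mime` and `length` fields. The MIME mapping, the function names and the sample lines are all invented for the example.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def mime_category(mime):
    """Map a MIME type to a coarse category (hypothetical mapping)."""
    mime = (mime or "").lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"
    if mime.startswith("image/"):
        return "image"
    if mime == "application/pdf":
        return "pdf"
    return "other"


def second_level_domain(url):
    """Naively keep the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def summarize(cdxj_lines):
    """Aggregate file counts (n_*) and sizes (s_*) per domain and year."""
    summary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for line in cdxj_lines:
        # Assumed layout: "<urlkey> <14-digit timestamp> <JSON object>"
        _urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
        record = json.loads(payload)
        domain = second_level_domain(record["url"])
        year = timestamp[:4]
        category = mime_category(record.get("mime"))
        summary[domain][year]["n_" + category] += 1
        summary[domain][year]["s_" + category] += int(record.get("length", 0))
    return summary


# Invented sample lines, only to exercise the function.
sample = [
    'lu,bnl)/ 20030105120000 {"url": "http://bnl.lu/", "mime": "text/html", "length": "2048"}',
    'lu,bnl)/a.pdf 20030210080000 {"url": "http://www.bnl.lu/a.pdf", "mime": "application/pdf", "length": "90000"}',
    'lu,bnl)/b 20040101000000 {"url": "http://bnl.lu/b", "mime": "text/html", "length": "1024"}',
]
result = summarize(sample)
for domain, years in result.items():
    print(domain, json.dumps(years, sort_keys=True))
```

Because each run only adds counters, the outputs of several runs could be merged by summing matching fields, which is presumably what the combine step does.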
Possible alternative sources
Or any other search engine index that holds information about the count & size of MIME types per domain
What is the summary file?
• A file with an entry per 2nd level domain and summary info per year
about the number of files and their size:
• HTML
• CSS
• Images
• PDF
• Video
• Audio
• JavaScript
• JSON (JavaScript Object Notation)
• Fonts
• HTTP vs HTTPS (secure Web)
Summary file example
bnl.lu {
"2002":
{"n_html":175,"n_image":0,"n_pdf":0, ...
"s_html":52634,"s_image":0,"s_pdf":0, ...},
"2003":
{"n_html":639,"n_image":44,"n_pdf":30, ... ,
"s_html":1295481,"s_image":295235,"s_pdf":3071214, ...}
}
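An entry like this can be read with the standard library alone. The sketch below assumes that each entry sits on a single line, with the domain name followed by a space and a JSON object (the slide wraps it for display); the fields elided with "..." on the slide are simply left out, and `parse_summary_line` is an illustrative name.

```python
import json


def parse_summary_line(line):
    """Split 'domain {json}' into the domain name and a per-year dict."""
    domain, _, payload = line.strip().partition(" ")
    return domain, json.loads(payload)


# Only the fields visible on the slide; the elided ones are omitted.
line = (
    'bnl.lu {"2002": {"n_html": 175, "n_image": 0, "n_pdf": 0, '
    '"s_html": 52634, "s_image": 0, "s_pdf": 0}, '
    '"2003": {"n_html": 639, "n_image": 44, "n_pdf": 30, '
    '"s_html": 1295481, "s_image": 295235, "s_pdf": 3071214}}'
)

domain, years = parse_summary_line(line)
for year, stats in sorted(years.items()):
    print(domain, year, stats["n_html"], "HTML files,", stats["s_html"], "bytes")
```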
Example:
Average size of files
[Chart: Average size of HTML files (s_html / n_html), Luxembourg Web Archive. X axis: year, 2015 to 2021. Y axis: average size in bytes, 0 to 25000.]
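The ratio s_html / n_html can be computed per year as in the sketch below. It works on a single domain entry, whereas the chart presumably aggregates over the whole archive; the function name is illustrative and the numbers are the bnl.lu values from the summary file example.

```python
def average_html_size(years):
    """Return {year: s_html / n_html}, skipping years without HTML files."""
    return {
        year: stats["s_html"] / stats["n_html"]
        for year, stats in years.items()
        if stats.get("n_html", 0) > 0
    }


# Values for bnl.lu taken from the summary file example slide.
bnl_lu = {
    "2002": {"n_html": 175, "s_html": 52634},
    "2003": {"n_html": 639, "s_html": 1295481},
}

averages = average_html_size(bnl_lu)
for year, avg in sorted(averages.items()):
    print(year, round(avg), "bytes per HTML file on average")
```

For an archive-wide figure, the same division would be applied after summing s_html and n_html over all domains for each year.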
Example:
Using the domain names
[Chart: First letter frequency in IA 2nd-level domains vs French words. X axis: first letter, A to Z. Y axis: 0 to 12. Series: French wordlist, .fr domains.]
[Chart: First letter frequency in IA 2nd-level domains vs French words. X axis: first letter, A to Z. Y axis: -4 to 6.]
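A comparison of this kind needs only the domain names from the summary file and a word list. The sketch below computes the share of each first letter in two lists and their difference; the tiny input lists are invented, and expressing the frequencies as percentages is an assumption.

```python
import string
from collections import Counter


def first_letter_percentages(names):
    """Return the percentage of names starting with each letter A to Z."""
    counts = Counter(
        name[0].upper()
        for name in names
        if name and name[0].upper() in string.ascii_uppercase
    )
    total = sum(counts.values())
    return {
        letter: (100 * counts[letter] / total if total else 0.0)
        for letter in string.ascii_uppercase
    }


# Invented toy inputs; real inputs would be .fr domains and a French word list.
domains = ["abc.fr", "alpha.fr", "bnf.fr", "zoo.fr"]
words = ["arbre", "bateau", "chat", "zone"]

domain_pct = first_letter_percentages(domains)
word_pct = first_letter_percentages(words)
difference = {
    letter: domain_pct[letter] - word_pct[letter]
    for letter in string.ascii_uppercase
}

for letter in "ABCZ":
    print(letter, domain_pct[letter], word_pct[letter], difference[letter])
```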
Data sources used
Luxembourg Web Archive
Data source Luxembourg Web Archive
• Established in 2016
• CDXJ files on disk
• Run programs locally
Data source Internet Archive
• CDX Server API at:
http://web.archive.org/cdx/search/cdx?url=lu
• Download using:
https://github.com/ikreymer/cdx-index-client
• Data downloaded for:
TLD   should have   actually have   missing (%)
lu    41136         41136           0.00%
dk    67843         42037           38.04%
be    71202         71202           0.00%
fr    311813        303282          2.74%
frl   42            42              0.00%
nl    230871        205147          11.14%
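The "missing (%)" figures follow from the other two rows as (should have - actually have) / should have. A minimal sketch that recomputes them from the numbers on the slide, with illustrative names:

```python
# Figures copied from the slide, keyed by TLD.
should_have = {"lu": 41136, "dk": 67843, "be": 71202,
               "fr": 311813, "frl": 42, "nl": 230871}
actually_have = {"lu": 41136, "dk": 42037, "be": 71202,
                 "fr": 303282, "frl": 42, "nl": 205147}


def missing_percent(expected, actual):
    """Share of the expected items that was not downloaded, in percent."""
    return 100 * (expected - actual) / expected


missing = {
    tld: missing_percent(should_have[tld], actually_have[tld])
    for tld in should_have
}
for tld, pct in missing.items():
    print(f"{tld}: {pct:.2f}% missing")
```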
Data source Common Crawl
• Hosted on Amazon S3
• Recipe at:
https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ
• Download CDX / CDXJ and process locally
[Chart: Number of domains per size archived in ccTLD .fr. X axis: size in archive in bytes (logarithmic), 100 bytes to 100 GB. Y axis: number of domains, 0 to 200000. Series: IA, commoncrawl.]
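A distribution like this can be derived by summing the archived bytes per domain and grouping the totals by order of magnitude. The sketch below assumes the totals have already been computed; the bucket definition (powers of ten) and the sample totals are assumptions made for illustration.

```python
from collections import Counter


def size_bucket(total_bytes):
    """Return floor(log10(total_bytes)) for a positive integer byte count.

    Counting decimal digits avoids floating point edge cases at exact
    powers of ten.
    """
    return len(str(total_bytes)) - 1


def domains_per_bucket(domain_totals):
    """Count how many domains fall into each power-of-ten size bucket."""
    return Counter(
        size_bucket(total) for total in domain_totals.values() if total > 0
    )


# Invented totals in bytes, one per 2nd-level domain.
totals = {"a.fr": 950, "b.fr": 1000, "c.fr": 4200, "d.fr": 2500000, "e.fr": 0}

histogram = domains_per_bucket(totals)
for exponent in sorted(histogram):
    print(f"10^{exponent} to 10^{exponent + 1} bytes: {histogram[exponent]} domains")
```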
[Chart: .fr bytes vs number of years of presence of the domain in the Internet Archive. X axis: number of years in IA archive, 0 to 25. Y axis: number of compressed bytes (logarithmic), 1 to 1E+09.]
Further process .summary
• host_year_total.py
• overlap.py
2nd level domain     Year   File Count   Bytes
alvestedetocht.frl   2015   2            3750
alvestedetocht.frl   2016   108          483354679

Year   Common Crawl   Internet Archive   CC & IA
2019   469            620                1180
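A per-domain, per-year total of this shape can be derived from a summary entry by adding up all n_* and s_* fields. This sketch is not host_year_total.py itself; it assumes the n_/s_ prefix convention from the summary file example and uses only the bnl.lu fields visible on that slide, so the totals are partial.

```python
def host_year_totals(domain, years):
    """Yield (domain, year, file_count, total_bytes) rows for one entry."""
    for year in sorted(years):
        stats = years[year]
        file_count = sum(v for k, v in stats.items() if k.startswith("n_"))
        total_bytes = sum(v for k, v in stats.items() if k.startswith("s_"))
        yield domain, year, file_count, total_bytes


# bnl.lu fields shown on the summary file example slide (elided ones omitted).
bnl_lu = {
    "2002": {"n_html": 175, "n_image": 0, "n_pdf": 0,
             "s_html": 52634, "s_image": 0, "s_pdf": 0},
    "2003": {"n_html": 639, "n_image": 44, "n_pdf": 30,
             "s_html": 1295481, "s_image": 295235, "s_pdf": 3071214},
}

rows = list(host_year_totals("bnl.lu", bnl_lu))
for row in rows:
    print(*row)
```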
[Chart: .lu overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts. X axis: year, 1996 to 2020. Y axis: 0 to 80000. Series: webarchive.lu; webarchive.lu AND InternetArchive; webarchive.lu AND InternetArchive AND CommonCrawl; webarchive.lu AND CommonCrawl; InternetArchive; InternetArchive AND CommonCrawl; CommonCrawl.]
[Chart: .fr overlap between Internet Archive, Common Crawl and Luxembourg Web archive in terms of hosts. X axis: year, 1993 to 2020. Y axis: 0 to 2500000. Series: lufr; lufr AND iafr; lufr AND iafr AND ccfr; lufr AND ccfr; iafr; iafr AND ccfr; ccfr.]
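The seven series in these overlap charts correspond to the possible combinations of three archives. One way to count them for a given year is with plain set membership, as sketched below. It assumes the categories are mutually exclusive (a host counted under "A AND B" is not also counted under "A"), which the slides do not state; the host names are invented.

```python
from collections import Counter


def overlap_regions(archives):
    """Count hosts in each exclusive combination of archives.

    `archives` maps an archive name to the set of hosts it holds
    for one year.
    """
    names = sorted(archives)
    all_hosts = set().union(*archives.values())
    regions = Counter()
    for host in all_hosts:
        members = [name for name in names if host in archives[name]]
        regions[" AND ".join(members)] += 1
    return regions


# Invented host sets for a single year.
archives = {
    "webarchive.lu": {"bnl.lu", "a.lu"},
    "InternetArchive": {"bnl.lu", "b.lu"},
    "CommonCrawl": {"bnl.lu", "b.lu", "c.lu"},
}

regions = overlap_regions(archives)
for combination, count in sorted(regions.items()):
    print(f"{combination}: {count}")
```

Running this once per year and stacking the counts would give the kind of series listed in the chart legends.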
Related work
• Internet Archive metadata service
https://github.com/jeffersonbailey/web-archive-apis-workshop
curl "https://web.archive.org/__wb/search/metadata?q=tld:lu"
• Sawood Alam et al.: characterization of web archive holdings for Memento aggregators
https://netpreserve.org/resources/IIPC_project-Archive_profiling-final_report.pdf
https://github.com/oduwsdl/MementoMap
• Shine and SolrWayback can probably answer questions like this for a single archive
Limitations
• Only one developer/tester! Bugs…
• MIME
  • MIME types are reported by the server but are not necessarily correct
  • Some Common Crawl data has no MIME information
  • No canonical way to “simplify” MIME types
  • Maybe missing interesting categories?
• Domains / Hosts
  • Tradeoff between size of summary file and level of detail (e.g. www.ic.ac.uk)
  • IDN (xn--p1ai -> рф -> ru)
• Overlap analysis
  • Very crude
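Two of the domain-related limitations can be reproduced in a few lines. The sketch below uses a deliberately naive rule (keep the last two labels) to show why www.ic.ac.uk is a problem case, and Python's built-in `idna` codec to decode the punycode TLD from the slide. The helper name is illustrative and the rule is not necessarily the one used by the tool.

```python
def naive_second_level(host):
    """Keep only the last two labels of a host name."""
    return ".".join(host.lower().split(".")[-2:])


print(naive_second_level("www.bnl.lu"))    # bnl.lu
print(naive_second_level("www.ic.ac.uk"))  # ac.uk, which lumps together all of ac.uk

# Internationalized TLD from the slide: the punycode form and its
# Unicode form are the same name and should be counted together.
unicode_tld = b"xn--p1ai".decode("idna")
print(len(unicode_tld), "Cyrillic letters")
print(unicode_tld.encode("idna"))  # round trip back to the punycode form
```

Handling cases like ac.uk properly would need a list of public suffixes, which in turn makes the summary file larger or the grouping rule more complex.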

Maurer Presentation - WARCnet Spring Meeting 2021
