Crawl the entire web in 10 minutes...and just 100€

Use Common Crawl to extract data easily with PIG scripts on an AWS EMR cluster

Published in: Technology

1. Crawl the entire web in 10 minutes... and just 100 €
   Using AWS-EMR, AWS-S3, PIG, CommonCrawl
   Copyright © 2015 OnPage.org GmbH
2. About Me
   - Since 2011 in Munich
   - Working at OnPage.org
   - Interested in web crawling and BigData frameworks
   - Building low-cost, scalable BigData solutions
   - Twitter: @danny_munich
   - Facebook: https://www.facebook.com/danny.linden2
   - E-mail: danny@onpage.org
3. Do you want to build your own Search Engine?
   - High hardware / cloud costs
   - Nutch needs ~1 hour for 1 million URLs
   - You want to crawl > 1 billion URLs
4. Solution?
5. Don't Crawl!
   - Use Common Crawl: https://commoncrawl.org
   - A non-profit organization
   - Roughly monthly crawls covering over 2 billion URLs
   - Over 1,000 TB in total since 2009
   - URL seeding list from Blekko: https://blekko.com
6. Don't Crawl! Use Common Crawl!
   - Stored scalably on Amazon AWS S3
   - Hadoop-compatible format powered by Archive.org (Wayback Machine)
   - Partitionable via S3 object prefixes
   - 100 MB-1 GB file sizes (gzip), a good fit for Hadoop
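Because the crawl files sit in a public S3 bucket, you can explore a segment before spinning up any cluster. A minimal Python sketch, assuming boto3; the bucket and prefix follow the crawl-002 path used in the PIG example later in the deck (current crawls live in the s3://commoncrawl bucket instead):

    # List a few crawl files under an S3 prefix and print their sizes.
    # Bucket/prefix are taken from the deck's crawl-002 example; adjust as needed.
    import boto3

    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(
        Bucket='aws-publicdatasets',
        Prefix='common-crawl/crawl-002/2010/09/25/',
        MaxKeys=10,
    )
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])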
7. Nice Data Format
8. Format 1: WARC stores the raw crawl data.
9. Format 2: WAT stores only the meta information, as JSON.
10. Format 3: WET stores only the plain text content.
11. Choose the right format
    - WARC (raw HTML): 1,000 MB
    - WAT (metadata as JSON): 450 MB
    - WET (plain text): 150 MB
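To see what these formats actually contain, you can read a downloaded WET file locally before processing at scale. A sketch assuming the warcio Python library (not part of the deck, which uses PIG loaders instead); in WET files the plain-text entries carry the record type 'conversion':

    # Iterate over WET records and print each URL plus its text length.
    # warcio is an assumption here, not the deck's tooling.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.wet.gz', 'rb') as stream:      # gzip is auto-detected
        for record in ArchiveIterator(stream):
            if record.rec_type == 'conversion':       # WET plain-text records
                url = record.rec_headers.get_header('WARC-Target-URI')
                text = record.content_stream().read().decode('utf-8', 'replace')
                print(url, len(text))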
12. Processing
    - Pure Hadoop with MapReduce
    - Input classes: http://commoncrawl.org/the-data/get-started/ (a Python sketch follows below)
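The input classes linked above are Java; the same MapReduce pattern can be sketched in Python with mrjob, a MapReduce wrapper that also runs on EMR. Using mrjob is my assumption, not the deck's method, and the input format (one URL per line, e.g. pre-extracted from WAT records) is assumed for illustration:

    # Count top-level domains with a tiny MapReduce job (mrjob sketch).
    from urllib.parse import urlparse
    from mrjob.job import MRJob

    class TLDCount(MRJob):
        def mapper(self, _, line):
            # assumption: each input line is a URL
            host = urlparse(line.strip()).netloc
            if host:
                yield host.rsplit('.', 1)[-1], 1    # e.g. 'com', 'de'

        def reducer(self, tld, counts):
            yield tld, sum(counts)

    if __name__ == '__main__':
        TLDCount.run()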
13. Processing
    - High-level ETL layer like PIG: http://pig.apache.org
    - Example projects:
      - https://github.com/norvigaward/warcexamples
      - https://github.com/mortardata/mortar-examples
      - https://github.com/matpalm/common-crawl
14. PIG Example

    REGISTER file:/home/hadoop/lib/pig/piggybank.jar
    DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
    %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
    -- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
    %default OUTPUT_PATH "s3://example-bucket/out";

    -- load (url, html) pairs from the ARC files
    pages = LOAD '$INPUT_PATH' USING FileLoaderClass AS (url, html);
    -- pull the <title> tag out of each page
    meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
    -- keep only pages where a title was found
    filtered = FILTER meta_titles BY meta_title IS NOT NULL;
    -- write tab-separated (url, title) pairs to S3
    STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
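One way to run it (the deck does not show the invocation, so this is hedged): save the script and override the %default values with Pig's parameter substitution, e.g. pig -param OUTPUT_PATH=s3://my-bucket/out script.pig; a %default only applies when no -param value is supplied.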
15. Hadoop & PIG on AWS
    - Supports new Hadoop releases
    - PIG integration
    - Replaces HDFS with S3
    - Easy UI to start quickly
    - Pay per hour to scale as much as possible (see the sketch below)
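Besides the UI, a cluster with Pig preinstalled can be started programmatically. A hedged boto3 sketch; the release label, instance types, and default IAM roles are assumptions for a small 2015-era cluster, not settings from the deck:

    # Launch a 3-node EMR cluster with Pig installed (boto3 sketch).
    # ReleaseLabel, instance types, and roles are assumptions; adjust them.
    import boto3

    emr = boto3.client('emr', region_name='us-east-1')
    resp = emr.run_job_flow(
        Name='commoncrawl-pig-demo',
        ReleaseLabel='emr-4.2.0',
        Applications=[{'Name': 'Pig'}],
        Instances={
            'MasterInstanceType': 'm3.xlarge',
            'SlaveInstanceType': 'm3.xlarge',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
    print(resp['JobFlowId'])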
16. It's Demo Time! Let's cross our fingers now.
17. That's it!
    Contact:
    - Twitter: @danny_munich
    - Facebook: https://www.facebook.com/danny.linden2
    - E-mail: danny@onpage.org
    And: We are hiring! https://de.onpage.org/about/jobs/
