NETWORK SEARCH ENGINE
Data Pipeline
Raw Data
Ingestion and File System
Batch Processing
Data Store
Web Framework
Common Crawl
AWS S3
Wikipedia 2015
160 TB per Month!
Snapshot of Entire Internet
S3
Source of Truth
12 x s3 Medium
$0.80 per hour
~$160 per month
8 X T4 Large
$1.01 per hour
Elasticsearch
Cassandra
8 x T4 Large
$1.01 per hour
Flask
1 X T2 Micro
System Costs: ~$2800 per month
If spot instances used: ~$300 per month
Free
Common Crawl
AWS S3
Wikipedia 2015
CC Blast: Custom Python Module
Query: *.wikipedia.org
808 buckets
15,000 Sub-bucket
~50/2000 Wikipedia URLs
Filetypes:
Text - WET
Hyperlinks - WAT
JSON
Source of Truth
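As a rough illustration of the `*.wikipedia.org` query over WET text files, the sketch below filters WET-style records by hostname. It is not the CC Blast module itself: the function names (`iter_wet_records`, `is_wikipedia`, `wikipedia_documents`) and the sample records are hypothetical, and the parser is simplified (it assumes `\n` line endings and that each record starts with `WARC/1.0` and carries a `WARC-Target-URI` header).

```python
from urllib.parse import urlparse

# Hypothetical two-record sample in a simplified WET-like layout.
SAMPLE_WET = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://en.wikipedia.org/wiki/Data_science
Content-Length: 24

Data science is a field.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/page
Content-Length: 12

Hello world.
"""


def iter_wet_records(text):
    """Yield (url, body) pairs from WET-style text (simplified parser)."""
    for chunk in text.split("WARC/1.0\n"):
        if not chunk.strip():
            continue
        # Headers and extracted text are separated by the first blank line.
        header, _, body = chunk.partition("\n\n")
        url = None
        for line in header.splitlines():
            name, _, value = line.partition(":")
            if name.strip().lower() == "warc-target-uri":
                url = value.strip()
        if url is not None:
            yield url, body.strip()


def is_wikipedia(url):
    """True for wikipedia.org and any of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def wikipedia_documents(text):
    """Build a URL -> document mapping for Wikipedia pages only."""
    return {url: body for url, body in iter_wet_records(text) if is_wikipedia(url)}
```

The resulting mapping has the URL (keys) and documents (values) shape that a key-value store expects; real WET files are gzip-compressed and use `\r\n` line endings, so a production reader would need a proper WARC parser.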
Challenge:
How to optimize database for low latency querying?
URL (keys)
Documents (values)
[Figure: Network Map – Wikipedia Contributor. QUERY: Telecommunications. Node labels: Tyler, Ben Casey, Barb, Dana.]
Hybrid Database - Schema
Date

Elasticsearch: Index by Text
Key: Text            Value: Nodes
Texta (Physics)      a,b,d
Textb (Engineering)  a
Textc (Science)      a,c

Cassandra: Key = URL, Order by Rank
URL (key)         Rank (clustering)   Links (value)
/Data_Science     189                 a-c, a-d, …
/Insight_Data     186                 c-a, c-h, …
/Spark_Streaming  185                 a-b, b-c
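The two-store lookup can be sketched with plain Python dictionaries standing in for Elasticsearch and Cassandra. This is only an in-memory model under stated assumptions: the slides do not say how node ids are joined to URL rows, so matching on link endpoints is a guess, the name `search` is hypothetical, and the link lists contain only the entries visible on the slide (the elided ones are left out).

```python
# Stand-in for the Elasticsearch index: text key -> node ids (values from the slide).
TEXT_INDEX = {
    "physics": ["a", "b", "d"],
    "engineering": ["a"],
    "science": ["a", "c"],
}

# Stand-in for the Cassandra table: (url, rank, links) rows (values from the slide).
PAGE_ROWS = [
    ("/Data_Science", 189, ["a-c", "a-d"]),
    ("/Insight_Data", 186, ["c-a", "c-h"]),
    ("/Spark_Streaming", 185, ["a-b", "b-c"]),
]


def search(term):
    """Return rows whose links touch a node matched by the text index.

    Assumption: a row matches when either endpoint of any of its links
    is one of the nodes returned for the term. Rows come back ordered by
    rank, highest first, mirroring the "Order by Rank" clustering.
    """
    nodes = set(TEXT_INDEX.get(term.lower(), []))
    hits = [
        (url, rank, links)
        for url, rank, links in PAGE_ROWS
        if any(end in nodes for link in links for end in link.split("-"))
    ]
    return sorted(hits, key=lambda row: row[1], reverse=True)
```

Splitting the work this way keeps free-text matching in the search index and leaves the key-value store to return rows that are already ordered, which is one way to keep query latency low.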
Engineering Challenges:
Approximation of PageRank with low latency
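One common way to trade accuracy for latency is to stop the PageRank power iteration after a fixed number of rounds. The sketch below shows that idea on a toy graph; it is not taken from the slides, and the function name `approximate_pagerank`, the damping factor 0.85, and the default iteration count are all assumptions chosen for illustration.

```python
def approximate_pagerank(links, damping=0.85, iterations=10):
    """Approximate PageRank with a fixed number of power-iteration rounds.

    links maps each node to the list of nodes it links to. Fewer
    iterations give a cheaper, rougher estimate.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

A call such as `approximate_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})` returns one score per node, and the scores sum to 1. Raising `iterations` moves the estimate closer to the converged values at the cost of more passes over the edges.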
[Figure: Network Map – Wikipedia Contributor. QUERY: ALL. Node labels: Mary, Tyler, Ben Casey, Barb, Dana.]
[Figure: Network Map – Wikipedia Contributor. QUERY: Data Engineering. Node labels: Tyler, Ben Casey, Barb, Dana.]
Data Science Application
Pearson Correlation
Page Rank Correlation to ‘Data Engineering’
Biology, Biotech, Tech
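For reference, the Pearson correlation between two equally long lists of scores can be computed as below. This is a generic textbook implementation, not the deck's own code; the function name `pearson` and any inputs passed to it (for example, per-contributor page rank scores for two queries) are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Values near 1 mean the two score lists rise and fall together, values near -1 mean they move in opposite directions, and values near 0 mean no linear relationship.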
Matthew Rubashkin
BioE PhD, UC Berkeley
Origin
Write Capacity of Cassandra

Insight_150115_Demo