BIG DATA PROCESSING IN THE CLOUD: A HYDRA/SUFIA EXPERIENCE
Helsinki
June 2014
Collin Brittle
Zhiwu Xie
WHO?
WHAT?
WHY?
SENSORS
SMART INFRASTRUCTURE
DATA SHARING
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
  • modeling of dynamic systems
  • structural health monitoring and damage detection
  • occupancy studies
  • sensor evaluation
  • data fusion
  • energy reduction
  • evacuation management
  • …
CHARACTERIZATION
• Compute intensive
• Storage intensive
• Communication intensive
• On-demand
• Scalability challenge
COMPUTE INTENSIVE
• About 6 GB of raw data per hour
• Must be continuously processed, ingested, and further processed
• User-generated computations
• Must not interfere with data retrieval
STORAGE INTENSIVE
• SEB will accumulate about 60 TB of raw data per year
• To facilitate researchers, we must keep raw data for an extended period of time, e.g., >= 5 years
• VT currently does not have an affordable storage facility to hold this much data
• Within XSEDE, only TACC’s Ranch can allocate this much storage
COMMUNICATION INTENSIVE
• What if hundreds of researchers around the world each tried to download hundreds of TB of our data?
ON DEMAND
• Explorative and multidisciplinary research cannot predict the data usage beforehand
SCALABILITY
• How to deal with these challenges in a scalable manner?
BIG DATA + CLOUD
• Affordable
• Elastic
• Scalable
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
OBJECTS AND DATASTREAMS
[Diagram, shown in two frames: a local repository object holding metadata datastreams and a file datastream; in the second frame the file datastream is replaced by a pointer to the remote copy of the file.]
REMOTE STORAGE
[Diagram: the local repository points at data held in Amazon (EC2, S3, Glacier).]
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
BACKGROUND PROCESSING
[Diagram: clients talk to a public server backed by a database; a Redis queue feeds jobs to several workers.]
FROM QUEUES TO THE CLOUD
[Animation: blocks of data sit in three queues; a worker picks a queue, takes a job, processes it, writes the new metadata back to the database, and repeats.]
QUEUEING
[Screenshots of the demo application: first with empty queues and no workers, then with jobs queued and a single busy worker.]
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
FROM QUEUES TO THE CLOUD
[Animation: a single worker falls behind as jobs pile up in the queues; additional workers are started to drain them.]
DISTRIBUTED PROCESSING
[Diagram: clients reach the public server and its database; a Redis master, replicated to a Redis slave, feeds jobs to workers running on several private servers.]
SCALE UP
[Screenshots: the single worker from before, then many workers as queue wait times fall.]
WE CHOSE SUFIA
WHAT IS SUFIA?
• Ruby on Rails framework…
• Based on Hydra…
• Using Fedora Commons…
• And Resque
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
QUESTIONS?
rotated8 (who works at) vt.edu

Editor's Notes

  1. The work reported here is a collaboration between the University Libraries’ Center for Digital Research and Scholarship and the Smart Infrastructure Laboratory at Virginia Tech.
  2. The project centers around the Virginia Tech Signature Engineering Building, or SEB.
  3. This new, one-hundred-and-sixty-thousand square-foot building will house a portion of Virginia Tech’s College of Engineering. The Smart Infrastructure Laboratory, or VT-SIL, also wants to turn this building into a full-scale living laboratory.
  4. Which is why, during construction, VT-SIL mounted over two hundred and forty vibration-monitoring accelerometers and hundreds of temperature, air flow, and other sensors at one hundred and thirty-six different locations throughout the building. Upon completion, the SEB will be the most instrumented building for vibrations in the world.
  5. VT-SIL will utilize the collected data to improve the design, monitoring, and daily operation of civil and mechanical infrastructure. The data will also be used to investigate how humans interact with the built environment.
  6. Moreover, VT-SIL wants to openly share much of the data with the public. The objective is to encourage exploratory and multidisciplinary research, and to foster an open and inclusive community of researchers and educators. The VT library’s involvement in this project focuses on data sharing and reuse, in particular, how to make the process more effective and efficient. This is a big data problem that presents many distinctive challenges.
  7. Now let’s step back a little bit. Forget the specific nature of the data and instead focus on the more abstract but also more generalizable characteristics of the problem we face. We believe there are at least five distinct characteristics that separate this problem from many other data-related projects done in libraries, and we believe similar characteristics will be seen more and more often as libraries become involved in more data-intensive research.
  8. First, big data problems require intensive computing power. Take SEB data as an example: the SEB generates about six gigabytes of raw data per hour. This may not sound like much, but realize that we may need to do complicated processing to transform the raw data, to ingest it into the repository, and to extract various metadata and features, all while the data keeps pouring in. As the data grows larger, fewer end users will have the resources to process it, and they will naturally expect us to do at least some preliminary processing for them. For example, seismologists researching earthquakes will only be interested in the portion of the data that involves earthquakes. They will want us to identify the earthquake data segments for them instead of downloading many years’ worth of data archives just to figure it out themselves. Such user-generated computations will demand even more processing power. Also, processing new data must not interfere with serving the already ingested data.
  9. Big data also poses a storage challenge. For example, the SEB will accumulate roughly sixty terabytes of raw data each year. In order to facilitate multidisciplinary research to detect, for example, structural deterioration over time, we must keep raw data for an extended period of time, e.g., at least five years. VT does not currently have an affordable storage facility to hold this much data. Even for universities that have already built massive storage systems, sharing data across institutional boundaries is still very problematic. Now let’s take a look at the existing national R&D infrastructure. XSEDE, the consortium that includes all NSF-funded supercomputer centers, publishes a list of storage allocations. From that list we can easily figure out that the Texas Advanced Computing Center’s Ranch is the only storage system that can allocate sufficient long-term storage for the SEB project. But getting the allocation approved isn’t easy.
  10. Of course big data also poses the challenge of big data transfer. Even if we don’t have to pay for the bandwidth, imagine how crowded the network would be if hundreds of researchers around the world each tried to download hundreds of terabytes of data from us. It’s not very practical: it would take weeks, if not months, to move the data sets around. Is it really worth the trouble? A more efficient and effective way to deal with this problem is to help the researchers reduce the data to more manageable sizes before sharing. But this, again, goes back to the first challenge of user-generated computation load.
  11. We also predict that much of the data processing will be on-demand. This is because explorative and multidisciplinary research cannot predict data usage beforehand. New ideas will pop up from time to time that require the data to be manipulated in totally different ways than before, and it will be very hard to predict how much processing power is enough.
  12. All this leads to the fifth challenge: how can this scale?
  13. We believe the cloud is a viable, and for now probably the only feasible, way forward. The cloud is affordable, can cope with on-demand workloads, and scales well without the high initial investment in hardware. Bandwidth cost is the major drawback, which we hope to mitigate by processing the data where it is stored.
  14. Those characteristics became framework requirements. The chosen framework needed to mix local and remote content, support background processing, and be distributable.
  15. Let’s start with mixing local and remote content. This supports the storage intensive characteristic. If we can’t store data remotely, we can’t store all the data.
  16. So, instead of keeping everything locally…
  17. …we keep a pointer to the remote file. In effect, we are keeping a way of getting the remote data.
  18. This is another way of looking at it: the local repository is pointing to the data somewhere in Amazon. (A minimal Ruby sketch of such a pointer appears after these notes.)
  19. Next, the framework needs to be able to process data asynchronously in the background. This helps fulfill the compute intensive characteristic.
  20. Here, the workers on the right are the important bit. They’re going to do all the data processing for us. (A Resque sketch of this job-and-worker cycle appears after these notes.)
  21. Now, I’m going to show a quick demonstration of the workers and the queuing system. Here’s some data we’re going to be working with.
  22. Some of the data is queued up into three queues. Some of the data is in multiple queues, and some is just in one. The queues here represent different kinds of processing that the workers will do.
  23. And here’s our worker.
  24. Here it’s picking up its first job off a queue. Which queue it chooses depends on how the worker was created. It may prefer or avoid certain queues.
  25. Now it has the data, and is ready to work.
  26. So it works, and creates the new metadata, and updates the item in the database.
  27. We’re back to the beginning.
  28. Choose a queue…
  29. … pick up data…
  30. … and process.
  31. Repeat.
  32. These screens are pulled from the demo application I created. Here’s what it looks like with nothing going on. Nothing in the queues (on the side), and no workers running.
  33. Now we’re working! There are plenty of jobs queued up to keep the one worker busy. Unfortunately, trying to do all this data crunching on a single server will bog down everything else the server is trying to do, like serving web pages. Background workers keep the server responsive by letting web pages be served while work is going on, but they still slow the server down, since the hardware has limits. In short, this won’t scale.
  34. But if we can distribute the workload to multiple servers, we can get the work done faster, with less impact to our patrons. This meets the scalability characteristic.
  35. Let’s visit our worker again. It used to be able to keep up with the jobs as they came in.
  36. But now it’s overwhelmed. In our case, six gigabytes of new data every hour will do that.
  37. So we start up new workers on new hardware to help. But we’re not going to buy more hardware! We’re already using Amazon for storage; they can provide our compute hardware too.
  38. The load on our system is going to change, though, and we’re going to want more and more workers to deal with longer and longer queues. Now that they are not on our public server, this is easier to accommodate. And since Amazon still charges us for idle workers, we wind them down when demand tapers off. (A sketch of pointing remote workers at the shared Redis instance appears after these notes.)
  39. In our demo, it looks like this. Here’s the one worker from before.
  40. Now we’ve scaled up, and the average time spent in a queue is falling.
  41. Sufia checks off two of our framework requirements out of the box: Fedora lets us mix local and remote content, and Resque gives us background processing. (A minimal Gemfile sketch of the stack follows.)
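
The following Ruby sketch illustrates the pointer idea described in the notes on remote content: metadata stays in the local repository while only a reference to the remotely stored file is kept. It is a minimal sketch, assuming the ActiveFedora SimpleDatastream modeling API of that era; the class name SensorDataset, the remote_url field, and the fetch-on-demand helper are illustrative assumptions, not the project's actual code.

```ruby
require 'open-uri'
require 'active-fedora'   # part of the Hydra/Sufia stack; autoloaded by Bundler in a Rails app

# Hypothetical model; the real project's classes and field names may differ.
class SensorDataset < ActiveFedora::Base
  # Descriptive metadata is kept locally in the repository object...
  has_metadata 'descMetadata', type: ActiveFedora::SimpleDatastream do |m|
    m.field 'title',      :string
    m.field 'remote_url', :string   # ...while the payload itself sits in Amazon (S3/Glacier).
  end

  # Resolve the pointer only when the bytes are actually needed.
  def remote_content
    open(descMetadata.remote_url.first, &:read)
  end
end
```

The repository record stays small and local; the raw data is only pulled across the network when a user or a background worker actually asks for it.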
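The queue-and-worker cycle walked through in the demo notes maps naturally onto Resque. A hedged sketch follows; the job class, queue names, and identifier are hypothetical, and SensorDataset refers to the illustrative model above.

```ruby
require 'resque'

# Hypothetical job: the queue name and the processing step are illustrative.
class CharacterizeJob
  @queue = :characterize               # one of several queues, e.g. :ingest, :characterize

  def self.perform(dataset_id)
    dataset = SensorDataset.find(dataset_id)
    # ... extract metadata / derive features from the raw data here ...
    dataset.save                       # write the new metadata back to the repository
  end
end

# The public-facing application only enqueues work; it never crunches data itself.
Resque.enqueue(CharacterizeJob, 'vt:12345')   # 'vt:12345' is a made-up identifier

# A worker started with an ordered list of queues prefers the first one and falls
# back to the others when it is empty, which is how queue preference is expressed.
Resque::Worker.new('ingest', 'characterize').work(5)   # poll every 5 seconds
```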
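For the distributed setup, each additional worker machine (for example an EC2 instance) only needs to reach the shared Redis master and start pulling jobs. A sketch, with a placeholder host name rather than any real configuration:

```ruby
require 'resque'

# Point this machine's Resque at the shared Redis master (placeholder host name).
Resque.redis = 'redis-master.example.internal:6379'

# '*' means "work every queue". Start as many of these as the queue depth demands,
# and shut them down again when demand tapers off to avoid paying for idle instances.
Resque::Worker.new('*').work(5)
```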
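Finally, the stack named on the "What is Sufia?" slide boils down to a handful of gems. A minimal Gemfile sketch, with versions omitted since the project's actual Gemfile may pin different ones:

```ruby
# Gemfile (sketch)
source 'https://rubygems.org'

gem 'rails'
gem 'sufia'    # Hydra-based repository front end, built on Fedora Commons via ActiveFedora
gem 'resque'   # Redis-backed background processing
```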