BIG DATA PROCESSING IN THE CLOUD: A HYDRA/SUFIA EXPERIENCE
Helsinki
June 2014
Collin Brittle
Zhiwu Xie
WHO?
WHAT?
WHY?
SENSORS
SMART INFRASTRUCTURE
DATA SHARING
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
  • modeling of dynamic systems
  • structural health monitoring and damage detection
  • occupancy studies
  • sensor evaluation
  • data fusion
  • energy reduction
  • evacuation management
  • …
CHARACTERIZATION
• Compute intensive
• Storage intensive
• Communication intensive
• On-demand
• Scalability challenge
COMPUTE INTENSIVE
• About 6 GB of raw data per hour
• Must be continuously processed, ingested, and further processed
• User-generated computations
• Must not interfere with data retrieval
STORAGE INTENSIVE
• SEB will accumulate about 60 TB of raw data per year
• To facilitate researchers, we must keep raw data for an extended period of time, e.g., >= 5 years
• VT currently does not have an affordable storage facility to hold this much data
• Within XSEDE, only TACC’s Ranch can allocate this much storage
COMMUNICATION INTENSIVE
• What if hundreds of researchers around the world each tried to download hundreds of TB of our data?
ON DEMAND
• Explorative and multidisciplinary research cannot predict the data usage beforehand
SCALABILITY
• How to deal with these challenges in a scalable manner?
BIG DATA + CLOUD
• Affordable
• Elastic
• Scalable
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
OBJECTS AND DATASTREAMS
[Diagram, shown in two frames: a local repository object holding metadata datastreams and a file datastream; in the second frame the file datastream is replaced by a pointer to the remote copy of the file.]
REMOTE STORAGE
[Diagram: the local repository points at data held in Amazon (EC2, S3, Glacier).]
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
BACKGROUND PROCESSING
[Diagram: clients talk to a public server backed by a database; a Redis queue feeds jobs to several workers.]
FROM QUEUES TO THE CLOUD
[Animation: blocks of data sit in three queues; a worker picks a queue, takes a job, processes it, writes the new metadata back to the database, and repeats.]
QUEUEING
[Screenshots of the demo application: first with empty queues and no workers, then with jobs queued and a single busy worker.]
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
FROM QUEUES TO THE CLOUD
[Animation: a single worker falls behind as jobs pile up in the queues; additional workers are started to drain them.]
DISTRIBUTED PROCESSING
[Diagram: clients reach the public server and its database; a Redis master, replicated to a Redis slave, feeds jobs to workers running on several private servers.]
SCALE UP
[Screenshots: the single worker from before, then many workers as queue wait times fall.]
WE CHOSE SUFIA
WHAT IS SUFIA?
• Ruby on Rails framework…
• Based on Hydra…
• Using Fedora Commons…
• And Resque
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
QUESTIONS?
rotated8 (who works at) vt.edu

Editor's Notes

  1. The work reported here is a collaboration between the University Libraries’ Center for Digital Research and Scholarship and the Smart Infrastructure Laboratory at Virginia Tech.
  2. The project centers around the Virginia Tech Signature Engineering Building, or SEB.
  3. This new, one-hundred-and-sixty-thousand square-foot building will house a portion of Virginia Tech’s College of Engineering. The Smart Infrastructure Laboratory, or VT-SIL, also wants to turn this building into a full-scale living laboratory.
  4. Which is why, during construction, VT-SIL mounted over two hundred and forty vibration-monitoring accelerometers and hundreds of temperature, air flow, and other sensors at one hundred and thirty-six different locations throughout the building. Upon completion, the SEB will be the most instrumented building for vibrations in the world.
  5. VT-SIL will utilize the collected data to improve the design, monitoring, and daily operation of civil and mechanical infrastructure. The data will also be used to investigate how humans interact with the built environment.
  6. Moreover, VT-SIL wants to openly share much of the data with the public. The objective is to encourage exploratory and multidisciplinary research, and to foster an open and inclusive community of researchers and educators. The VT library’s involvement in this project focuses on data sharing and reuse, in particular, how to make the process more effective and efficient. This is a big data problem that presents many distinctive challenges.
  7. Now let’s step back a little bit. Forget the specific nature of the data and instead focus on the more abstract but also more generalizable characteristics of the problem we face. We believe there are at least five distinct characteristics that separate this problem from many other data-related projects done in libraries, and we believe similar characteristics will be seen more and more often as libraries become involved in more data-intensive research.
  8. First, big data problems require intensive computing power. Take SEB data as an example: the SEB generates about six gigabytes of raw data per hour. This may not sound like much, but realize that we may need to do complicated processing to transform the raw data, to ingest it into the repository, and to extract various metadata and features, all while the data keeps pouring in. As the data grows larger, fewer end users will have the resources to process it, and they will naturally expect us to do at least some preliminary processing for them. For example, seismologists researching earthquakes will only be interested in the portion of the data that involves earthquakes. They will want us to identify the earthquake data segments for them instead of downloading many years’ worth of data archives just to figure it out themselves. Such user-generated computations will demand even more processing power. Also, processing new data must not interfere with serving the already ingested data.
  9. Big data also poses a storage challenge. For example, the SEB will accumulate roughly sixty terabytes of raw data each year. In order to facilitate multidisciplinary research to detect, for example, structural deterioration over time, we must keep raw data for an extended period of time, e.g., at least five years. VT does not currently have an affordable storage facility to hold this much data. Even for universities that have already built massive storage systems, sharing data across institutional boundaries is still very problematic. Now let’s take a look at the existing national R&D infrastructure. XSEDE, the consortium that includes all NSF-funded supercomputer centers, publishes a list of storage allocations. From that list we can easily figure out that the Texas Advanced Computing Center’s Ranch is the only storage system that can allocate sufficient long-term storage for the SEB project. But getting the allocation approved isn’t easy.
  10. Of course big data also poses the challenge of big data transfer. Even if we don’t have to pay for the bandwidth, imagine how crowded the network would be if hundreds of researchers around the world each tried to download hundreds of terabytes of data from us. It’s not very practical: it would take weeks, if not months, to move the data sets around. Is it really worth the trouble? A more efficient and effective way to deal with this problem is to help the researchers reduce the data to more manageable sizes before sharing. But this, again, goes back to the first challenge of user-generated computation load.
  11. We also predict that much of the data processing will be on-demand. This is because explorative and multidisciplinary research cannot predict data usage beforehand. New ideas will pop up from time to time that require the data to be manipulated in totally different ways than before, and it will be very hard to predict how much processing power is enough.
  12. All this leads to the fifth challenge: how can this scale?
  13. We believe the cloud is a viable, and for now probably the only feasible, way forward. The cloud is affordable, can cope with on-demand workloads, and scales well without the high initial investment in hardware. Bandwidth cost is the major drawback, which we hope to mitigate by processing the data where it is stored.
  14. Those characteristics became framework requirements. The chosen framework needed to mix local and remote content, support background processing, and be distributable.
  15. Let’s start with mixing local and remote content. This supports the storage intensive characteristic. If we can’t store data remotely, we can’t store all the data.
  16. So, instead of keeping everything locally…
  17. …we keep a pointer to the remote file. In effect, we are keeping a way of getting the remote data.
  18. This is another way of looking at it: the local repository is pointing to the data somewhere in Amazon. (A minimal Ruby sketch of such a pointer appears after these notes.)
  19. Next, the framework needs to be able to process data asynchronously in the background. This helps fulfill the compute intensive characteristic.
  20. Here, the workers on the right are the important bit. They’re going to do all the data processing for us. (A Resque sketch of this job-and-worker cycle appears after these notes.)
  21. Now, I’m going to show a quick demonstration of the workers and the queuing system. Here’s some data we’re going to be working with.
  22. Some of the data is queued up into three queues. Some of the data is in multiple queues, and some is just in one. The queues here represent different kinds of processing that the workers will do.
  23. And here’s our worker.
  24. Here it’s picking up its first job off a queue. Which queue it chooses depends on how the worker was created. It may prefer or avoid certain queues.
  25. Now it has the data, and is ready to work.
  26. So it works, and creates the new metadata, and updates the item in the database.
  27. We’re back to the beginning.
  28. Choose a queue…
  29. … pick up data…
  30. … and process.
  31. Repeat.
  32. These screens are pulled from the demo application I created. Here’s what it looks like with nothing going on. Nothing in the queues (on the side), and no workers running.
  33. Now we’re working! There are plenty of jobs queued up to keep the one worker busy. Unfortunately, trying to do all this data crunching on a single server will bog down everything else the server is trying to do, like serving web pages. Background workers keep the server responsive by letting web pages be served while work is going on, but they still slow the server down, since the hardware has limits. In short, this won’t scale.
  34. But if we can distribute the workload to multiple servers, we can get the work done faster, with less impact to our patrons. This meets the scalability characteristic.
  35. Let’s visit our worker again. It used to be able to keep up with the jobs as they came in.
  36. But now it’s overwhelmed. In our case, six gigabytes of new data every hour will do that.
  37. So we start up new workers on new hardware to help. But we’re not going to buy more hardware! We’re already using Amazon for storage; they can provide our compute hardware too.
  38. The load on our system is going to change, though, and we’re going to want more and more workers to deal with longer and longer queues. Now that they are not on our public server, this is easier to accommodate. And since Amazon still charges us for idle workers, we wind them down when demand tapers off. (A sketch of pointing remote workers at the shared Redis instance appears after these notes.)
  39. In our demo, it looks like this. Here’s the one worker from before.
  40. Now we’ve scaled up, and the average time spent in a queue is falling.
  41. Sufia checks off two of our framework requirements out of the box: Fedora lets us mix local and remote content, and Resque gives us background processing. (A minimal Gemfile sketch of the stack follows.)
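
The following Ruby sketch illustrates the pointer idea described in the notes on remote content: metadata stays in the local repository while only a reference to the remotely stored file is kept. It is a minimal sketch, assuming the ActiveFedora SimpleDatastream modeling API of that era; the class name SensorDataset, the remote_url field, and the fetch-on-demand helper are illustrative assumptions, not the project's actual code.

```ruby
require 'open-uri'
require 'active-fedora'   # part of the Hydra/Sufia stack; autoloaded by Bundler in a Rails app

# Hypothetical model; the real project's classes and field names may differ.
class SensorDataset < ActiveFedora::Base
  # Descriptive metadata is kept locally in the repository object...
  has_metadata 'descMetadata', type: ActiveFedora::SimpleDatastream do |m|
    m.field 'title',      :string
    m.field 'remote_url', :string   # ...while the payload itself sits in Amazon (S3/Glacier).
  end

  # Resolve the pointer only when the bytes are actually needed.
  def remote_content
    open(descMetadata.remote_url.first, &:read)
  end
end
```

The repository record stays small and local; the raw data is only pulled across the network when a user or a background worker actually asks for it.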
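The queue-and-worker cycle walked through in the demo notes maps naturally onto Resque. A hedged sketch follows; the job class, queue names, and identifier are hypothetical, and SensorDataset refers to the illustrative model above.

```ruby
require 'resque'

# Hypothetical job: the queue name and the processing step are illustrative.
class CharacterizeJob
  @queue = :characterize               # one of several queues, e.g. :ingest, :characterize

  def self.perform(dataset_id)
    dataset = SensorDataset.find(dataset_id)
    # ... extract metadata / derive features from the raw data here ...
    dataset.save                       # write the new metadata back to the repository
  end
end

# The public-facing application only enqueues work; it never crunches data itself.
Resque.enqueue(CharacterizeJob, 'vt:12345')   # 'vt:12345' is a made-up identifier

# A worker started with an ordered list of queues prefers the first one and falls
# back to the others when it is empty, which is how queue preference is expressed.
Resque::Worker.new('ingest', 'characterize').work(5)   # poll every 5 seconds
```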
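For the distributed setup, each additional worker machine (for example an EC2 instance) only needs to reach the shared Redis master and start pulling jobs. A sketch, with a placeholder host name rather than any real configuration:

```ruby
require 'resque'

# Point this machine's Resque at the shared Redis master (placeholder host name).
Resque.redis = 'redis-master.example.internal:6379'

# '*' means "work every queue". Start as many of these as the queue depth demands,
# and shut them down again when demand tapers off to avoid paying for idle instances.
Resque::Worker.new('*').work(5)
```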
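Finally, the stack named on the "What is Sufia?" slide boils down to a handful of gems. A minimal Gemfile sketch, with versions omitted since the project's actual Gemfile may pin different ones:

```ruby
# Gemfile (sketch)
source 'https://rubygems.org'

gem 'rails'
gem 'sufia'    # Hydra-based repository front end, built on Fedora Commons via ActiveFedora
gem 'resque'   # Redis-backed background processing
```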