Hadoop Summit 2010 Data Management On Grid
1. Data Management on Hadoop @ Yahoo!
Srikanth Sundarrajan
Principal Engineer
2. Why is Data Management important?
• Large datasets are incentives for users to come to the grid
• Volume of data movement
• Cluster access / partitioning (research & production purposes)
• Resource consumption
• SLAs on data availability
• Data retention
• Regulatory compliance
• Data conversion
3. Data volumes
• Steady growth in data volumes (data movement per DAY, into the grid)
[Chart: daily data movement into the grid, y-axis 0–40 TB, showing steady growth]
4. Data Acquisition Service
[Diagram: the Data Acquisition Service pulls from a source warehouse and loads into target clusters (Cluster 1–3, each with a JobTracker and HDFS)]
• Replication & Retention are additional services that handle cross-cluster data movement and data purge respectively
5. Pluggable interfaces
• Different warehouses may use different interfaces to expose data (e.g. HTTP, SCP, FTP or some proprietary mechanism)
• The acquisition service should be generic, with the ability to plug in new interfaces easily to support newer warehouses
6. Data load & conversion
• Heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light
• Data load executed as a map-reduce job
• Data conversion as a map-reduce job (to enable faster data processing post acquisition)
– Field inclusion/removal
– Data filtering
– Data anonymization
– Data format conversion (raw delimited / Hadoop sequence file)
• Cluster-to-cluster copy is a map-reduce job
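The per-record work of a conversion job (field inclusion, filtering, anonymization) can be sketched like this. It is an illustrative stand-in for the real mapper: the schema (`ts`, `user`, `page`) and the choice of a truncated SHA-1 for anonymization are assumptions, not the deck's specifics.

```python
# Hypothetical sketch of what a conversion mapper does per record:
# keep selected fields, drop filtered rows, anonymize sensitive columns.
import hashlib

FIELDS = ["ts", "user", "page"]   # fields to include (assumed schema)
ANONYMIZE = {"user"}              # columns to one-way hash

def convert(record):
    # Data filtering: drop records with no page (assumed rule).
    if record.get("page") is None:
        return None
    # Field inclusion: project onto the configured field list.
    out = {k: record[k] for k in FIELDS}
    # Anonymization: replace sensitive values with a truncated digest.
    for k in ANONYMIZE:
        out[k] = hashlib.sha1(str(out[k]).encode()).hexdigest()[:12]
    return out
```

In the real pipeline this logic runs inside a map-reduce job over the acquired files, with the output written as raw delimited text or a Hadoop sequence file.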
7. Warehouse & Cluster isolation
• Source warehouses have diverse capacity and are often constrained
• Different clusters can run different versions of Hadoop, and cluster performance may not be uniform
• Need for isolation at the warehouse & cluster level, and resource usage limits at the warehouse level
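One simple way to enforce a per-warehouse resource limit is a bounded semaphore per source, capping concurrent pulls against that warehouse regardless of how many clusters want its data. The quotas and names below are hypothetical, offered only to make the isolation idea concrete.

```python
# Hypothetical sketch of per-warehouse isolation: a bounded semaphore
# per source warehouse caps concurrent pulls hitting that warehouse.
import threading

WAREHOUSE_LIMITS = {"ads_logs": 4, "search_logs": 2}   # assumed quotas
_sems = {w: threading.BoundedSemaphore(n) for w, n in WAREHOUSE_LIMITS.items()}

def with_warehouse_slot(warehouse, work):
    """Run `work` only after acquiring a slot for this warehouse."""
    sem = _sems[warehouse]
    with sem:   # blocks when the warehouse is already at its limit
        work()
```

A slow or constrained warehouse then throttles only its own feeds; loads from other warehouses proceed independently.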
8. Job throttling
[Diagram: discovery threads feed a queue per source; job execution threads drain the queues and launch asynchronous map-reduce jobs, post resource negotiation, on Cluster 1 … Cluster N]
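The throttling pipeline on this slide can be sketched as discovery feeding a queue per source, with a separate execution stage draining those queues. This single-threaded toy (names hypothetical) shows the data flow; the real service runs discovery and execution as thread pools and launches asynchronous map-reduce jobs after resource negotiation.

```python
# Sketch of the job-throttling pipeline: discovery enqueues feed
# instances into a queue per source; execution drains the queues, so
# one slow source cannot starve the others.
import queue

sources = {"src_a": queue.Queue(), "src_b": queue.Queue()}
done = []

def discover(source, feeds):
    # Discovery stage: find new feed instances and enqueue them.
    for f in feeds:
        sources[source].put(f)

def execute():
    # Execution stage: drain each per-source queue in turn; each item
    # stands in for launching an async map-reduce job on some cluster.
    for source, q in sources.items():
        while not q.empty():
            done.append((source, q.get()))

discover("src_a", ["feed1", "feed2"])
discover("src_b", ["feed3"])
execute()
```

Keeping one queue per source is what gives the isolation: a backlog on `src_a` never blocks items queued for `src_b`.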
9. Other considerations
• SLA, feed priority & frequency are considered when scheduling data loads
• Retention removes old data (as required for legal compliance and for capacity reasons)
• Interoperability across Hadoop versions
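The retention bullet can be made concrete with a small date-window check: given a feed's configured retention period, select the date partitions old enough to purge. This is an assumed formulation (partition-per-day layout, `retention_days` parameter), not the deck's actual service.

```python
# Hypothetical sketch of retention: drop date partitions older than
# the feed's configured retention window.
from datetime import date, timedelta

def partitions_to_purge(partitions, retention_days, today):
    """Return partition dates strictly older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p < cutoff]
```

The retention service would run such a check per feed and issue HDFS deletes for the selected partitions.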