[Coupang] Journey to the Continuous and Scalable Big Data Platform : 지속적으로 확장 가능한 빅데이터 플랫폼을 향한 여정

Coupang Confidential and Proprietary
This document is confidential and the intellectual property of Coupang
Journey to the Continuous and Scalable Big Data Platform
Matthew (정재화), Coupang

About me
• Software Development Manager of BigData & DW Platform team
• 8+ years Hadoop experience
• Apache Tajo committer and PMC member
• blrunner78@gmail.com
• Blog : https://blrunner.tistory.com
• Author of a Hadoop technical handbook

Agenda
1. On-Premise
2. Cloud 1.0
3. Cloud 2.0
4. Airflow as a Service
5. Zeppelin as a Service

Motivation
"The purpose of a business is to create and keep a customer."
- Peter Drucker

1. On-Premise

Architecture
• Logs
  • Client logs
  • Server logs
• RDBMS
• External Data
• ETL Cluster
  • Aggregations and joins
  • MapReduce
  • Hive/Pig/Spark
  • Oozie
• Read-Only Cluster
  • Ad-hoc query
  • Hive

Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create and maintain company-wide data pipelines
• Troubleshoot and resolve all issues as they arise for users

Areas of Improvements
• Pros
  • A wide variety of workloads
  • Continuous increase in users
• Cons
  • Multiple copies of data
  • Lack of elasticity
  • Operation overhead

2. Cloud 1.0

Architecture : Decouple compute and storage
• Centralized Resources
  • Hive Metastore
  • Cloud Storage
• Batch Cluster (HiveServer2)
  • Batch jobs
  • High throughput
  • Fault tolerant, ETL
• Ad-hoc Cluster (HiveServer2)
  • Ad-hoc queries
  • Low latency
  • Interactive analysis
  • In-memory
• Domain Cluster #1 ... #N (HiveServer2)
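
The decoupling shown in this architecture can be sketched in a few lines of Python. This is a toy model, not Coupang's implementation: the class names `Metastore` and `Cluster`, the table name, and the `cloud://` location are all hypothetical. It only illustrates that table locations live in the centralized metastore and cloud storage, so every cluster resolves the same table, and a cluster can be dropped without losing data.

```python
from dataclasses import dataclass, field


@dataclass
class Metastore:
    """Centralized catalog: maps table names to cloud storage locations."""
    tables: dict = field(default_factory=dict)

    def register(self, table: str, location: str) -> None:
        self.tables[table] = location

    def locate(self, table: str) -> str:
        return self.tables[table]


@dataclass
class Cluster:
    """Compute-only cluster: holds no table data of its own."""
    name: str
    metastore: Metastore

    def plan_scan(self, table: str) -> str:
        # Resolve the storage location through the shared metastore.
        return f"{self.name} scans {self.metastore.locate(table)}"


# Hypothetical table and placeholder location, for illustration only.
metastore = Metastore()
metastore.register("orders", "cloud://example-bucket/warehouse/orders")

batch = Cluster("batch-cluster", metastore)
adhoc = Cluster("adhoc-cluster", metastore)

batch_plan = batch.plan_scan("orders")
adhoc_plan = adhoc.plan_scan("orders")

# Dropping a compute cluster leaves the table definition untouched.
del batch
still_there = metastore.locate("orders")
```

Because the clusters keep no state of their own in this model, adding another domain cluster is just another `Cluster(...)` pointing at the same metastore.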

Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Troubleshoot and resolve all issues as they arise for users
• Implement company-wide data pipelines

Areas of Improvements
• Pros
  • Allows parsing and enriching data for custom needs
  • Independent scaling of CPU and storage capacity
• Cons
  • Learning curve for cloud infrastructure
  • Operation overhead
  • Users want the latest tools and more features

3. Cloud 2.0

High Level Architecture
• Data Platform Portal
• Data Processing Tools
  • Zeppelin
• Scheduler Tools
  • Airflow
• Security
  • LDAP Authentication
  • Apache Ranger ACL & Audit
• Monitoring
• Computing Clusters
• Storage
  • Cloud Storage

Various types of Computing Clusters
• Centralized Resource
  • Hive Metastore
  • Cloud Storage
• Transient Cluster
  • Batch jobs
• Persistent Cluster
  • Interactive queries
• Workload Specific Cluster
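
The mapping from workload to cluster type could be expressed roughly as below. This is a minimal sketch under assumptions: the slide only pairs batch jobs with transient clusters and interactive queries with persistent clusters, so the `needs_custom_stack` flag and the function name `choose_cluster` are hypothetical stand-ins for whatever criteria actually select a workload-specific cluster.

```python
from enum import Enum


class ClusterType(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    WORKLOAD_SPECIFIC = "workload-specific"


def choose_cluster(workload: str, needs_custom_stack: bool = False) -> ClusterType:
    """Pick a cluster type for a workload ("batch" or "interactive")."""
    # Assumption: a custom stack requirement overrides the default mapping.
    if needs_custom_stack:
        return ClusterType.WORKLOAD_SPECIFIC
    if workload == "batch":
        return ClusterType.TRANSIENT
    if workload == "interactive":
        return ClusterType.PERSISTENT
    raise ValueError(f"unknown workload: {workload!r}")
```

For example, `choose_cluster("batch")` selects a transient cluster that can be terminated once the job finishes, while `choose_cluster("interactive")` selects a persistent one.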

Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create data APIs and data services
• Support users using SLA policies
• Maintain security and data privacy
• Provide application knowledge, support artifacts, etc.

Areas of Improvements
• Pros
  • Onboarded many users and a wide variety of jobs
  • Easier management and added features
• Cons
  • Unintended increase in infrastructure costs
  • A wide variety of client tools and development environments
  • Various types of users

Lessons & Learnings
• Distribute traffic instead of concentrating it in one place
• Optimize all types of system resources in clusters
• Enforce the lifecycle of Hadoop clusters
• Monitor clusters and send alarms from an efficiency perspective
• Train users continuously and build a community culture
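
Two of these lessons, enforcing a cluster lifecycle and alarming on efficiency, can be illustrated with a small check. This is a sketch under assumptions, not the team's actual tooling: the `ClusterStatus` fields, the 7-day maximum age, the 20% CPU threshold, and the sample clusters are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ClusterStatus:
    name: str
    launched_at: datetime
    avg_cpu_utilization: float  # 0.0 to 1.0 over the observation window


def review_cluster(status, now, max_age=timedelta(days=7), min_cpu=0.2):
    """Return a list of alarm strings for one cluster (empty if healthy)."""
    alarms = []
    # Lifecycle rule (assumed): clusters older than max_age should be recycled.
    if now - status.launched_at > max_age:
        alarms.append(f"{status.name}: exceeded max age, schedule termination")
    # Efficiency rule (assumed): flag clusters that sit mostly idle.
    if status.avg_cpu_utilization < min_cpu:
        alarms.append(f"{status.name}: CPU utilization below {min_cpu:.0%}")
    return alarms


now = datetime(2019, 1, 10)
clusters = [
    ClusterStatus("etl-old", datetime(2019, 1, 1), 0.65),
    ClusterStatus("adhoc-idle", datetime(2019, 1, 9), 0.05),
    ClusterStatus("batch-ok", datetime(2019, 1, 8), 0.70),
]
alarms = [a for c in clusters for a in review_cluster(c, now)]
```

Here `etl-old` is flagged for age and `adhoc-idle` for low utilization, while `batch-ok` produces no alarm.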

4. Airflow as a Service

Why do we love Airflow?
• Define Workflows as code
• Makes Workflows more maintainable, versionable, and testable
• More flexible execution and workflow generation
• Lots of features
• Sensor
• Workflow Profiling
• SLA alert
• Rich Web Interface
• Scalable Worker Processes
• In-house Airflow
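
The "workflows as code" point can be illustrated without Airflow itself. The sketch below uses only Python's standard library `graphlib` (Python 3.9+), not Airflow's API, and the task names are hypothetical. It shows why a workflow written as code is versionable and testable: the dependency graph is plain data that can be checked in a unit test.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
workflow = {
    "extract_logs": set(),
    "extract_rdbms": set(),
    "join_and_aggregate": {"extract_logs", "extract_rdbms"},
    "publish_report": {"join_and_aggregate"},
}


def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


def is_valid(dag):
    """A workflow is valid when its dependency graph has no cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(workflow)
```

A check such as `is_valid(workflow)` can run in continuous integration before a workflow is deployed, which is much harder to do with workflows defined through a UI.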

Airflow : deployment process
[Figure: Airflow deployment process diagram, including Cloud Storage]

5. Zeppelin as a Service

Why do we love Zeppelin?
• Easy Spark development on a personal computer
• Customized Presto interpreter
  • Run Presto queries easily without complex JDBC configuration
  • Export large data files to a local machine without exceptions
• Persistent storage for notebooks

Zeppelin Architecture

• Users
  • The main page loads all notebooks -> too slow
  • A big notebook can consume most resources -> Zeppelin stays pending
• Platform team
  • Spark interpreter doesn’t support YARN cluster mode
  • No lifecycle management for notebooks
  • Difficult to upgrade and improve existing Zeppelin instances gracefully

Resolution
• Upgrade Zeppelin to 0.8.1
• Main Page Improvements
• YARN Cluster Mode for Spark Interpreter
• Interpreter Lifecycle manager
• Interpreter Recovery
• Containerized Zeppelin on Kubernetes
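
The idea behind an interpreter lifecycle manager can be sketched as an idle-timeout policy. This is an illustration of the general technique, not Zeppelin's implementation: the `Interpreter` class, the function name, the one-hour timeout, and the sample interpreters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interpreter:
    name: str
    last_used: float  # seconds since an arbitrary epoch


def idle_interpreters(interpreters, now, timeout_seconds=3600.0):
    """Return the names of interpreters idle for longer than the timeout."""
    return [i.name for i in interpreters if now - i.last_used > timeout_seconds]


interpreters = [
    Interpreter("spark-user-a", last_used=0.0),
    Interpreter("presto-user-b", last_used=5000.0),
]
# At t=5400s, spark-user-a has been idle 5400s and presto-user-b only 400s.
to_stop = idle_interpreters(interpreters, now=5400.0)
```

Stopping the interpreters in `to_stop` releases their cluster resources, which addresses the problem of a single large notebook holding resources indefinitely.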

Summary
• Understand who the immediate customer is
• Focus on the truly important things
• Detect and solve problems immediately
• Leverage the identity of infrastructure
• A best practice is not always best for you

SELECT question FROM you
https://boards.greenhouse.io/coupang/

Thank you
