Coupang has moved its big data platform from an on-premise architecture to successive cloud architectures. The initial cloud architecture separated compute from storage to improve scalability, but it still carried operational overhead. The current Cloud 2.0 architecture introduced centralized resources, several types of computing clusters, and data APIs and services, which made it possible to onboard more users and a wider variety of jobs while improving manageability. Airflow and Zeppelin were introduced as managed services to provide scheduling and interactive analysis at scale, but both needed improvements to fully meet users' needs.
[Coupang] Journey to the Continuous and Scalable Big Data Platform
1. Coupang Confidential and Proprietary
This document is confidential and the intellectual property of Coupang.
Journey to the Continuous and Scalable Big Data Platform
Matthew (정재화), Coupang
2. About me
• Software Development Manager of the BigData & DW Platform team
• 8+ years of Hadoop experience
• Apache Tajo committer and PMC member
• blrunner78@gmail.com
• Blog: https://blrunner.tistory.com
• Author of a Hadoop technical handbook
3. Agenda
1. On-Premise
2. Cloud 1.0
3. Cloud 2.0
4. Airflow as a Service
5. Zeppelin as a Service
4. Motivation
The purpose of a business is to create and keep a customer.
- Peter Drucker
6. Architecture
Data sources
• Logs
- Client Logs
- Server Logs
• RDBMS
• External Data
ETL Cluster
• Aggregations and Joins
• MapReduce
• Hive/Pig/Spark
• Oozie
Read-Only Cluster
• Ad-hoc Query
• Hive
7. Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create and maintain company-wide data pipelines
• Troubleshoot and resolve all issues as users raise them
8. Areas for Improvement
• Pros
- A wide variety of workloads
- Continuous increase in users
• Cons
- Multiple copies of data
- Lack of elasticity
- Operational overhead
10. Architecture: Decouple compute and storage
Centralized Resources
• Hive Metastore
• Cloud Storage
Clusters
• Batch Cluster (HiveServer2)
• Ad-hoc Cluster (HiveServer2)
• Domain Cluster #1, #2, ... #N (HiveServer2)
Workloads
- Batch jobs: high throughput, fault tolerant, ETL
- Ad-hoc queries: low latency
- Interactive analysis: in-memory
11. Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Troubleshoot and resolve all issues as users raise them
• Implement company-wide data pipelines
12. Areas for Improvement
• Pros
- Allows parsing and enriching data for custom needs
- Independent scaling of CPU and storage capacity
• Cons
- Learning curve for cloud infrastructure
- Operational overhead
- Users want the latest tools and more features
14. High Level Architecture
• Data Platform Portal
• Data Processing Tools: Zeppelin
• Scheduler Tools: Airflow
• Security: LDAP Authentication, Apache Ranger (ACL & Audit)
• Monitoring
• Computing Clusters
• Storage: Cloud Storage
15. Various types of Computing Clusters
Centralized Resources
• Hive Metastore
• Cloud Storage
Clusters
• Transient Cluster
- Batch Jobs
• Persistent Cluster
- Interactive Queries
• Workload Specific Cluster
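The split into transient, persistent, and workload-specific clusters can be read as a routing decision made per job. The following is a minimal sketch of such a policy, not Coupang's actual implementation: the `Job` fields, the `special_workload` attribute, and the `choose_cluster` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClusterType(Enum):
    TRANSIENT = "transient"                  # batch jobs (per the slide)
    PERSISTENT = "persistent"                # interactive queries (per the slide)
    WORKLOAD_SPECIFIC = "workload-specific"  # dedicated to one kind of workload


@dataclass(frozen=True)
class Job:
    name: str
    interactive: bool = False
    # Hypothetical field: set when the job needs a dedicated cluster.
    special_workload: Optional[str] = None


def choose_cluster(job: Job) -> ClusterType:
    """Pick a cluster type for a job (illustrative policy only)."""
    if job.special_workload is not None:
        return ClusterType.WORKLOAD_SPECIFIC
    if job.interactive:
        return ClusterType.PERSISTENT
    return ClusterType.TRANSIENT
```

For example, `choose_cluster(Job("nightly_etl"))` selects a transient cluster, while `choose_cluster(Job("adhoc_sql", interactive=True))` selects a persistent one. Because every cluster type reads the same Hive Metastore and cloud storage, the choice only affects where the compute runs, not which data is visible.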
16. Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create data APIs and data services
• Support users according to SLA policies
• Maintain security and data privacy
• Provide application knowledge, support artifacts, etc.
17. Areas for Improvement
• Pros
- Onboarded many users and a wide variety of jobs
- Easier management and added features
• Cons
- Unintended increase in infrastructure costs
- A wide variety of client tools and dev environments
- Various types of users
18. Lessons & Learnings
• Distribute traffic instead of concentrating it in one place
• Optimize all types of system resources in clusters
• Enforce the lifecycle of Hadoop clusters
• Monitor clusters and send alarms from an efficiency perspective
• Train users continuously and build a community culture
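Monitoring "from an efficiency perspective" can be illustrated with a small utilization check. This is a sketch under stated assumptions: the `ClusterUsage` record, the vcore-based metric, and the 30% threshold are hypothetical choices made for the example, not values from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClusterUsage:
    name: str
    allocated_vcores: int
    used_vcores: float  # average over the observation window (assumed metric)


def utilization(usage: ClusterUsage) -> float:
    """Fraction of allocated vcores actually used; 0.0 if nothing is allocated."""
    if usage.allocated_vcores <= 0:
        return 0.0
    return usage.used_vcores / usage.allocated_vcores


def underutilized(usages: Iterable[ClusterUsage], threshold: float = 0.3) -> List[str]:
    """Return names of clusters whose utilization is below the threshold."""
    return [u.name for u in usages if utilization(u) < threshold]
```

A scheduler could run `underutilized(...)` periodically and raise an alarm for each returned name, so that clusters that are healthy but mostly idle are still surfaced as a cost problem.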
20. Why we love Airflow
• Define workflows as code
- Makes workflows more maintainable, versionable, and testable
- More flexible execution and workflow generation
• Lots of features
- Sensor
- Workflow profiling
- SLA alert
- Rich web interface
• Scalable worker processes
• In-house Airflow
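The "workflows as code" idea can be sketched without Airflow itself. The example below uses only the Python standard library (`graphlib`, Python 3.9+) to declare task dependencies as data and derive an execution order. It deliberately does not use the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each task maps to the set of tasks that must finish before it starts.
# Task names are made up for illustration.
WORKFLOW: Dict[str, Set[str]] = {
    "extract_logs": set(),
    "aggregate": {"extract_logs"},
    "load_to_warehouse": {"aggregate"},
    "send_report": {"load_to_warehouse"},
}


def execution_order(workflow: Dict[str, Set[str]]) -> List[str]:
    """Return a valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(workflow).static_order())


def run(workflow: Dict[str, Set[str]], tasks: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each task's callable in dependency order and return that order."""
    order = execution_order(workflow)
    for name in order:
        tasks[name]()
    return order
```

Because the workflow is an ordinary Python object, it can be code-reviewed, versioned, and unit-tested (for example, asserting that `send_report` always runs after `load_to_warehouse`), which is the maintainability benefit the slide points to.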
23. Why we love Zeppelin
• Easy Spark development on a personal computer
• Customized Presto interpreter
- Run Presto queries easily without complex JDBC configuration
- Export large data files to a local machine without errors
• Persistent storage for notebooks
25. Areas for Improvement
• Users
- All notebooks are loaded on the main page -> too slow
- A big notebook can consume most resources -> Zeppelin jobs stay pending
• Platform team
- Spark interpreter doesn't support YARN cluster mode
- No lifecycle management for notebooks
- Difficult to upgrade and improve existing Zeppelin instances gracefully
26. Resolution
• Upgrade Zeppelin to 0.8.1
- Main page improvements
- YARN cluster mode for the Spark interpreter
- Interpreter lifecycle manager
- Interpreter recovery
• Containerized Zeppelin on Kubernetes
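The idea behind an interpreter lifecycle manager, releasing resources held by idle interpreters, can be sketched as a simple idle-timeout reaper. This is an illustration of the concept only, not Zeppelin's implementation: `InterpreterSession`, `expired_sessions`, `reap`, and the one-hour default timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InterpreterSession:
    notebook_id: str
    last_used_at: float  # seconds since epoch


def expired_sessions(
    sessions: Dict[str, InterpreterSession], now: float, idle_timeout_s: float = 3600.0
) -> List[str]:
    """Return notebook ids whose interpreter has been idle longer than the timeout."""
    return [
        notebook_id
        for notebook_id, session in sessions.items()
        if now - session.last_used_at > idle_timeout_s
    ]


def reap(
    sessions: Dict[str, InterpreterSession], now: float, idle_timeout_s: float = 3600.0
) -> List[str]:
    """Remove idle sessions in place and return the ids that were removed."""
    expired = expired_sessions(sessions, now, idle_timeout_s)
    for notebook_id in expired:
        del sessions[notebook_id]
    return expired
```

Running `reap` on a timer means one large, forgotten notebook no longer holds cluster resources indefinitely, which addresses the "Zeppelin pending" complaint listed among the user-facing issues.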
27. Summary
• Understand who the immediate customer is
• Focus on the truly important things
• Detect and solve problems immediately
• Leverage the identity of infrastructure
• A best practice is not necessarily best for you
28. SELECT question FROM you
https://boards.greenhouse.io/coupang/