Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn

DataWorks Summit

The Data Driven Network
Kapil Surlaker
Director of Engineering
Powering the Data Driven Network
Kapil Surlaker and Shirshanka Das
Hadoop Summit 2015

Requirements
Source diversity
Batch and streaming
Data quality

[Diagram: Source → Work Units → Tasks (Extract → Convert → Quality → Write) → Data Publish]
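
The stages named above (Source, Work Unit, Extract, Convert, Quality, Write, Data Publish) can be sketched in a few lines of Python. This is an illustration only and not Gobblin's API (Gobblin itself is a Java framework): the function names make_work_units, run_task and publish, the sample rows, and the "drop empty names" rule are all hypothetical.

```python
# Hypothetical sketch of the stages named on the slide; not Gobblin's real API.

def make_work_units(records, size):
    """Source: split the input into fixed-size work units."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def run_task(work_unit):
    """Task: extract -> convert -> quality -> write for one work unit."""
    staged = []
    for raw in work_unit:                                         # extract
        record = {"id": raw[0], "name": raw[1].strip().lower()}   # convert
        if record["name"]:                                        # quality: drop empty names
            staged.append(record)                                 # write to staging
    return staged

def publish(task_outputs):
    """Data publish: expose all staged task output in one step."""
    return [record for output in task_outputs for record in output]

raw_rows = [(1, " Alice "), (2, ""), (3, "BOB"), (4, "Carol")]
units = make_work_units(raw_rows, size=2)
published = publish([run_task(unit) for unit in units])
print(published)
```

Each work unit is processed independently, which is what lets the tasks run in parallel; publishing happens only after the tasks have staged their output.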

Taming Source Diversity
REST
SFTP
JDBC
Protocol
Config
Source Extractor
checkpoint
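
One way to read the slide above: the config selects a protocol, a protocol-specific extractor fetches the rows, and a checkpoint records how far the previous run got. The sketch below illustrates that idea under stated assumptions; the registry PROTOCOLS, the function extract, and the in-memory stand-ins for REST and JDBC sources are hypothetical and are not Gobblin classes.

```python
# Hypothetical sketch: one extractor interface, many protocols, one checkpoint.
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[int, str]  # (sequence number, payload)

def rest_rows() -> List[Row]:
    return [(1, "a"), (2, "b"), (3, "c")]   # stand-in for a REST response

def jdbc_rows() -> List[Row]:
    return [(10, "x"), (11, "y")]           # stand-in for a JDBC result set

# The config names a protocol; the registry maps it to fetch logic.
PROTOCOLS: Dict[str, Callable[[], List[Row]]] = {"rest": rest_rows, "jdbc": jdbc_rows}

def extract(config: Dict[str, str], checkpoints: Dict[str, int]) -> Iterator[Row]:
    """Yield rows newer than the saved checkpoint and advance it."""
    name = config["protocol"]
    last_seen = checkpoints.get(name, 0)
    for seq, payload in PROTOCOLS[name]():
        if seq > last_seen:
            checkpoints[name] = seq
            yield seq, payload

checkpoints = {"rest": 2}                     # rows 1 and 2 were pulled earlier
first_run = list(extract({"protocol": "rest"}, checkpoints))
second_run = list(extract({"protocol": "rest"}, checkpoints))
print(first_run, second_run, checkpoints)
```

Adding a new source then means adding one entry to the registry; the checkpoint handling is shared.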

Solving for real-time
Inefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
[Diagram labels: Stream Source, Executor 1, Executor 2, Executor 3, YARN, Helix, HDFS]

Data Quality
Per record, per task, or per job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
[Diagram labels: Record, Writer, Job, Task, Quality Checker, Policy Checker, Fail, Quarantine]
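
A minimal sketch of composable, policy-driven checks at the record level follows. It assumes each checker is a plain function paired with the policy to apply on failure ("fail" aborts, "quarantine" sets the record aside). The checker names, the blocked field "ssn", and the run_quality helper are hypothetical and do not come from Gobblin.

```python
# Hypothetical sketch of composable, policy-driven record checks.

def has_unique_key(record, seen_keys):
    """Unique key check: reject a key that was already seen."""
    if record["id"] in seen_keys:
        return False
    seen_keys.add(record["id"])
    return True

def no_sensitive_fields(record, _seen_keys):
    """Sensitive fields check: reject records carrying a blocked field."""
    return "ssn" not in record

# Each checker is paired with the policy applied when it fails.
CHECKS = [(has_unique_key, "quarantine"), (no_sensitive_fields, "fail")]

def run_quality(records):
    passed, quarantined, seen_keys = [], [], set()
    for record in records:
        for check, policy in CHECKS:
            if not check(record, seen_keys):
                if policy == "fail":
                    raise ValueError(f"{check.__name__} failed for {record}")
                quarantined.append(record)
                break
        else:
            passed.append(record)   # no checker rejected the record
    return passed, quarantined

passed, quarantined = run_quality([{"id": 1}, {"id": 1}, {"id": 2}])
print(len(passed), len(quarantined))
```

Because the checks are just entries in a list, new ones can be composed in without touching the loop, and changing a policy is a one-word edit.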

Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN

Cubert: Converting hours to minutes
http://github.com/linkedin/cubert
Physical language
Block organization
Specialized operators

Where is my data?
How did it get here?
…
WhereHows

WhereHows: Roadmap
Streaming ecosystem integration
Kafka, Samza
Recommendations for Datasets, Metrics
Exploring Open Source

Precompute!
Device   Geo  View
Android  US   1
Android  IN   1
iOS      US   1
Dimension    View
Android      2
iOS          1
US           2
IN           1
Android,US   1
iOS,US       1
Android,IN   1
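
The dimension view above can be derived from the raw rows by summing views for every non-empty combination of dimension values. The sketch below reproduces the slide's numbers; the function name precompute is hypothetical, and keys are built from values alone for brevity (a real system would also need the dimension names to avoid collisions).

```python
from collections import Counter
from itertools import combinations

# Raw rows from the slide: (device, geo, views).
raw = [("Android", "US", 1), ("Android", "IN", 1), ("iOS", "US", 1)]

def precompute(rows):
    """Sum views for every non-empty combination of dimension values."""
    totals = Counter()
    for *dims, views in rows:
        for size in range(1, len(dims) + 1):
            for combo in combinations(dims, size):
                totals[",".join(combo)] += views
    return totals

view = precompute(raw)
for key, total in view.items():
    print(key, total)
```

A query such as "views for Android" then becomes a single lookup, view["Android"], instead of a scan over the raw rows.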

More dimensions!
Device   Geo  Carrier   View
Android  US   ATT       1
Android  IN   Reliance  1
iOS      US   Verizon   1
Dimension    View
Android      2
iOS          1
US           2
IN           1
ATT          1
Reliance     1
Verizon      1
Android,US   1
...          ...
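
The cost of adding dimensions can be made concrete: with d dimensions, every raw row contributes to 2^d - 1 precomputed keys (one per non-empty subset of its dimensions). The short sketch below counts those subsets by enumeration; the function name grouping_count is hypothetical.

```python
from itertools import combinations

def grouping_count(num_dimensions):
    """Number of non-empty dimension subsets a full precompute must cover."""
    return sum(1 for size in range(1, num_dimensions + 1)
               for _ in combinations(range(num_dimensions), size))

counts = {n: grouping_count(n) for n in (2, 3, 10)}
print(counts)  # {2: 3, 3: 7, 10: 1023}
```

Going from two dimensions to three already more than doubles the keys per row, which is why full precomputation stops scaling as dimensions are added.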

Challenges
Horizontally scalable
Low latency
Data freshness
Fault tolerance
OLAP features

Key features
SQL-like interface
Columnar storage and indexing
Real-time data load

(S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y' AND
action = 'stop'

(S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y'
GROUP BY action

(S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1
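
To see the three query shapes above return results, they can be run against a tiny in-memory table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Pinot, so it is not Pinot's query engine or syntax. The table schema and sample rows are made up, the day column is written unquoted, the GROUP BY query selects action explicitly so the counts are labeled, and the last query keeps a single entityId predicate so that it can match a row.

```python
import sqlite3

# Stand-in only: SQLite instead of Pinot, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companyFollowHistoricalEvents "
    "(entityId INTEGER, day INTEGER, paid TEXT, action TEXT, creationTime INTEGER)"
)
conn.executemany(
    "INSERT INTO companyFollowHistoricalEvents VALUES (?, ?, ?, ?, ?)",
    [
        (121011, 15950, "y", "start", 100),
        (121011, 15951, "y", "stop", 200),
        (121011, 15960, "y", "stop", 300),
        (121011, 15970, "y", "stop", 400),   # outside the day range
        (999, 15950, "y", "stop", 500),      # different entity
    ],
)

where = "entityId = 121011 AND day >= 15949 AND day <= 15963 AND paid = 'y'"

# Filter and aggregate.
stops = conn.execute(
    f"SELECT count(*) FROM companyFollowHistoricalEvents WHERE {where} AND action = 'stop'"
).fetchone()[0]

# Group by.
by_action = dict(conn.execute(
    f"SELECT action, count(*) FROM companyFollowHistoricalEvents WHERE {where} GROUP BY action"
).fetchall())

# Order by and limit.
latest = conn.execute(
    "SELECT creationTime FROM companyFollowHistoricalEvents "
    "WHERE entityId = 121011 AND action = 'start' ORDER BY creationTime DESC LIMIT 1"
).fetchone()[0]

print(stops, by_action, latest)
```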

Pinot Architecture
[Diagram labels: Queries, Broker, Helix, Real time, Historical, Kafka, Hadoop, Raw Data, Samza]
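
The broker's role between the real-time and historical sides can be illustrated with a small scatter-gather sketch. The slide shows only the components, so the details here are assumptions: the function names, the sample rows, and the use of a time boundary to decide which side answers for which days are illustrative and are not taken from Pinot's code.

```python
# Hypothetical sketch of a broker merging partial counts; not Pinot's code.

def count_matching(rows, predicate):
    """A server's partial result: how many of its rows match."""
    return sum(1 for row in rows if predicate(row))

def broker_count(historical_rows, realtime_rows, time_boundary, predicate):
    """Scatter to both server types, then gather by summing partial counts.

    Rows up to the time boundary are read from historical data and newer
    rows from real-time data, so an overlapping row is counted once.
    """
    old = count_matching(
        historical_rows, lambda r: r["day"] <= time_boundary and predicate(r))
    new = count_matching(
        realtime_rows, lambda r: r["day"] > time_boundary and predicate(r))
    return old + new

historical = [{"day": 1, "action": "stop"}, {"day": 2, "action": "stop"}]
realtime = [{"day": 2, "action": "stop"}, {"day": 3, "action": "stop"}]
total = broker_count(historical, realtime, time_boundary=2,
                     predicate=lambda r: r["action"] == "stop")
print(total)
```

The day-2 row exists on both sides in this sample, and the boundary ensures it contributes to the total only once.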

Pinot@LinkedIn
Site-facing apps
Reporting dashboards
Monitoring

Breaking the cycle
Form hypothesis
Query
Repeat
OR …

Hmm... what's up with Portuguese- and Spanish-speaking countries?

Pinot Roadmap
Pinot is Open Source!!!
github.com/linkedin/pinot

Thanks!
Kapil Surlaker, @kapilsurlaker
Shirshanka Das, @shirshanka
github.com/linkedin/ (gobblin, cubert, pinot)

Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn

1. The Data Driven Network Kapil Surlaker Director of Engineering Powering the Data Driven Network Kapil Surlaker and Shirshanka Das Hadoop Summit 2015

5. How does PYMK work?

7. Houston, we have a problem

8. Step 1: Central transport pipeline

9. Still have a problem

10. Hadoop Ingest Pipeline Complexity

11. Step 2: Central Ingestion Framework

12. Requirements Source Diversity Batch and Streaming Data Quality

13. Gobblin Architecture

14. Source → Work Units → Tasks (Extract → Convert → Quality → Write) → Data Publish

15. Taming Source Diversity REST SFTP JDBC Protocol Config Source Extractor checkpoint

16. Solving for real-time Inefficiencies in batch YARN based Apache Helix Continuous Auto-scaling YARN Helix Executor 1 Executor 2 Executor 3 HDFS Stream Source

17. Data Quality Per record, per task, or per job Composable quality checkers Schema compatibility Audit check Sensitive fields Unique key Policy driven Record Writer Job Task Quality Checker Fail Quarantine Policy Checker

18. Current Activity Open source @ github.com/linkedin/gobblin In production @ LinkedIn Tens of TB per day Hundreds of datasets ~20 different sources Gobblin on YARN

20. Transformation: No one size fits all

21. Cubert: Converting hours to minutes http://github.com/linkedin/cubert Physical language Block organization Specialized operators

23. Got Diversity?

24. Where is the billings data? How did it get here? What data is used to create inferred skills data? Who owns that flow? When will the latest profile data show up?

26. Where is my data? How did it get here? …

27. Where is my data? How did it get here? … WhereHows

28. WhereHows architecture

34. Lineage

35. WhereHows: Roadmap Streaming ecosystem integration Kafka, Samza Recommendations for Datasets, Metrics Exploring Open Source

37. Real-time. Interactive.

38. Slice and Dice metrics

39. Precompute! Device Geo View Android US 1 Android IN 1 iOS US 1 Dimension View Android 2 iOS 1 US 2 IN 1 Android,US 1 iOS,US 1 Android,IN 1

40. More dimensions! Device Geo Carrier View Android US ATT 1 Android IN Reliance 1 iOS US Verizon 1 Dimension View Android 2 iOS 1 US 2 IN 1 ATT 1 Reliance 1 Verizon 1 Android,US 1 ... ...

41. Challenges Horizontally scalable Low latency Data freshness Fault tolerance OLAP features

42. Introducing Pinot

43. Key features SQL-like interface Columnar storage and indexing Real-time data load

44. (S)QL: Filters and Aggs SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' AND action = 'stop'

45. (S)QL: Group By SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' GROUP BY action

46. (S)QL: ORDER BY and LIMIT SELECT * FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND entityId = 1000 AND action = 'start' ORDER BY creationTime DESC LIMIT 1

47. Columnar Storage

48. Forward Index

49. Broker Helix Real time Historical Kafka Hadoop Pinot Architecture Queries Raw Data Samza

50. Fast but needs a ton of RAM

51. To pre-compute or not?

52. Data aware pre-computation

53. Pinot@LinkedIn Site-facing apps Reporting dashboards Monitoring

54. Breaking the cycle

55. Breaking the cycle

56. Breaking the cycle Form hypothesis Query Repeat

57. Breaking the cycle Form hypothesis Query Repeat OR …

58. Hmm... what's up with Portuguese- and Spanish-speaking countries?