SlideShare uma empresa Scribd logo
1 de 64
The Data Driven Network
Kapil Surlaker
Director of Engineering
Powering the Data Driven Network
Kapil Surlaker and Shirshanka Das
Hadoop Summit 2015
2
How does PYMK work?
5
Houston
we have a problem
Step 1
Central transport pipeline
Still have
a problem
Hadoop Ingest Pipeline
Complexity
Step 2
Central
Ingestion
Framework
11
Requirements
Source
Diversity
Batch
and
Streaming
Data
Quality
Gobblin Architecture
14
Source
Work
Unit
Work
Unit
Work
Unit
Extract
Extract
Extract
Convert
Convert
Convert
Quality
Quality
Quality
Write
Write
Write
Data
Publish
Task
Task
Task
Taming Source Diversity
REST
SFTP
JDBC
Protocol
Config
Source Extractor
checkpoint
Solving for real-time
Inefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
YARN
Helix
Executor 1
Executor 2
Executor 3
HDFS
Stream Source
Data Quality
Per record, per task, or per
job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
Record
WriterJob
Task
Quality
Checker
FailQuarantine
Policy
Checker
Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN
Transformation: No one size fits all
Cubert: Converting hours to minutes
http://github.com/linkedin/cubert
Physical language
Block organization
Specialized operators
Got Diversity?
Where is the billings data?
How did it get here?
What data is used to create inferred
skills data?
Who owns that flow?
When will the latest profile data
show up? 24
25
Where is my data?
How did it get here?
….
26
Where is my data?
How did it get here?
….
WhereHows
26
WhereHows architecture
28
29
31
Lineage
WhereHows: Roadmap
Streaming ecosystem integration
Kafka, Samza
Recommendations for Datasets, Metrics
Exploring Open Source
Real-time. Interactive.
Slice and Dice metrics
Precompute!
Device Geo View
Android US 1
Android IN 1
iOS US 1
Dimension View
Android 2
iOS 1
US 2
IN 1
Android,US 1
iOS,US 1
Android,IN 1
More dimensions!
Device Geo Carrier View
Android US ATT 1
Android IN Reliance 1
iOS US Verizon 1
Dimension View
Android 2
iOS 1
US 2
IN 1
ATT 1
Reliance 1
Verizon 1
Android,US 1
... ...
Challenges
Horizontally scalable
Low latency
Data freshness
Fault tolerance
OLAP features
Introducing Pinot
Key features
SQL-like
interface
Columnar
storage and
indexing
Real-time
data load
(S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y’ AND
action = 'stop'
(S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y’
GROUP BY action
(S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1
Columnar Storage
Forward Index
Broker Helix
Real
time Historical
Kafka Hadoop
Pinot
Architecture
Queries
Raw
Data Samza
Fast but needs a ton of RAM
To pre-compute or not?
Data aware
pre-computation
Pinot@LinkedIn
Site-­‐facing	
  Apps Reporting	
  dashboards Monitoring
Breaking the cycle
Breaking the cycle
Breaking the cycle
Form hypothesis
Query
Repeat
Breaking the cycle
Form hypothesis
Query
Repeat
OR …
Hmm... whats up with portugese and
spanish speaking countries?
Brazil?
56
57
Holidays in Brazil 2015
Pinot Roadmap
Pinot is
Open Source !!!
github.com/linkedin/pinot
59
Kapil Surlaker
@kapilsurlaker
github.com/linkedin/
60
gobblin
cubert
pinot
Shirshanka Das
@shirshanka
Thanks!

Mais conteúdo relacionado

Mais procurados

Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015Yousun Jeong
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...Databricks
 
Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerVu Nguyen Duy
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayMigrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayDatabricks
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkDatabricks
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...Spark Summit
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Databricks
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaDataWorks Summit/Hadoop Summit
 
Make streaming processing towards ANSI SQL
Make streaming processing towards ANSI SQLMake streaming processing towards ANSI SQL
Make streaming processing towards ANSI SQLDataWorks Summit
 

Mais procurados (20)

Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
 
Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and docker
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayMigrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
 
Make streaming processing towards ANSI SQL
Make streaming processing towards ANSI SQLMake streaming processing towards ANSI SQL
Make streaming processing towards ANSI SQL
 

Destaque

Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicDataWorks Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsDataWorks Summit
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
Apache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic DataApache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic DataDataWorks Summit
 
June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2DataWorks Summit
 
large scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache Giraphlarge scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache GiraphDataWorks Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application ResourcesDataWorks Summit
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllDataWorks Summit
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitDataWorks Summit
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
 
Complex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesComplex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesDataWorks Summit
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyDataWorks Summit
 

Destaque (20)

Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Apache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic DataApache Lens: Unified OLAP on Realtime and Historic Data
Apache Lens: Unified OLAP on Realtime and Historic Data
 
June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2June 10 145pm hortonworks_tan & welch_v2
June 10 145pm hortonworks_tan & welch_v2
 
large scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache Giraphlarge scale collaborative filtering using Apache Giraph
large scale collaborative filtering using Apache Giraph
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for All
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Complex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesComplex Analytics using Open Source Technologies
Complex Analytics using Open Source Technologies
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case Study
 

Semelhante a Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn

Flink SQL: The Challenges to Build a Streaming SQL Engine
Flink SQL: The Challenges to Build a Streaming SQL EngineFlink SQL: The Challenges to Build a Streaming SQL Engine
Flink SQL: The Challenges to Build a Streaming SQL EngineHostedbyConfluent
 
DevOps Powered by Splunk
DevOps Powered by SplunkDevOps Powered by Splunk
DevOps Powered by SplunkSplunk
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopYinan Li
 
Software Defined Storage - Open Framework and Intel® Architecture Technologies
Software Defined Storage - Open Framework and Intel® Architecture TechnologiesSoftware Defined Storage - Open Framework and Intel® Architecture Technologies
Software Defined Storage - Open Framework and Intel® Architecture TechnologiesOdinot Stanislas
 
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...NETWAYS
 
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...NETWAYS
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Dataplumbee
 
Postgres Conf Keynote: What got you here WILL get you there
Postgres Conf Keynote: What got you here WILL get you therePostgres Conf Keynote: What got you here WILL get you there
Postgres Conf Keynote: What got you here WILL get you thereAnant Jhingran
 
What's New in 6.3 + Data On-Boarding
What's New in 6.3 + Data On-BoardingWhat's New in 6.3 + Data On-Boarding
What's New in 6.3 + Data On-BoardingSplunk
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup Ververica
 
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwareSomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwarealistair_hann
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine TourLuke Han
 
oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021ssuser8ccb5a
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxcamyla81
 
Event Sourcing - what could possibly go wrong?
Event Sourcing - what could possibly go wrong?Event Sourcing - what could possibly go wrong?
Event Sourcing - what could possibly go wrong?Andrzej Ludwikowski
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & GeodePivotalOpenSourceHub
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsVMware Tanzu
 

Semelhante a Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn (20)

Flink SQL: The Challenges to Build a Streaming SQL Engine
Flink SQL: The Challenges to Build a Streaming SQL EngineFlink SQL: The Challenges to Build a Streaming SQL Engine
Flink SQL: The Challenges to Build a Streaming SQL Engine
 
DevOps Powered by Splunk
DevOps Powered by SplunkDevOps Powered by Splunk
DevOps Powered by Splunk
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
 
Software Defined Storage - Open Framework and Intel® Architecture Technologies
Software Defined Storage - Open Framework and Intel® Architecture TechnologiesSoftware Defined Storage - Open Framework and Intel® Architecture Technologies
Software Defined Storage - Open Framework and Intel® Architecture Technologies
 
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...
OSMC 2015: Monitor Open stack environments from the bottom up and front to ba...
 
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...
OSMC 2015 | Monitor OpenStack environments from the bottom up and front to ba...
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Data
 
Postgres Conf Keynote: What got you here WILL get you there
Postgres Conf Keynote: What got you here WILL get you therePostgres Conf Keynote: What got you here WILL get you there
Postgres Conf Keynote: What got you here WILL get you there
 
What's New in 6.3 + Data On-Boarding
What's New in 6.3 + Data On-BoardingWhat's New in 6.3 + Data On-Boarding
What's New in 6.3 + Data On-Boarding
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
A look at Flink 1.2
A look at Flink 1.2A look at Flink 1.2
A look at Flink 1.2
 
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
 
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwareSomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
 
Event Sourcing - what could possibly go wrong?
Event Sourcing - what could possibly go wrong?Event Sourcing - what could possibly go wrong?
Event Sourcing - what could possibly go wrong?
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn