Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

•

39 gostaram•11,289 visualizações

This talk was given by Joel Koshy (Senior Software Engineer at LinkedIn) at the Hadoop Summit (June 2013).

Tecnologia Negócios

Building a Real-Time Data Pipeline:
Apache Kafka at Linkedin
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved

HADOOP SUMMIT 2013
Network update stream

LinkedIn Corporation ©2013 All Rights Reserved
We have a lot of data.
We want to leverage this data to build products.
Data pipeline

HADOOP SUMMIT 2013
People you may know

HADOOP SUMMIT 2013
System and application metrics/logging
LinkedIn Corporation ©2013 All Rights Reserved 5

How do we integrate this variety of data
and make it available to all these systems?
LinkedIn Confidential ©2013 All Rights Reserved

HADOOP SUMMIT 2013
Point-to-point pipelines

HADOOP SUMMIT 2013
LinkedIn’s user activity data pipeline (circa 2010)

HADOOP SUMMIT 2013
Point-to-point pipelines

HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 10

HADOOP SUMMIT 2013
Central data pipeline

First attempt: don’t re-invent the wheel
LinkedIn Confidential ©2013 All Rights Reserved

HADOOP SUMMIT 2013

Second attempt: re-invent the wheel!
LinkedIn Confidential ©2013 All Rights Reserved

Use a central commit log
LinkedIn Confidential ©2013 All Rights Reserved

HADOOP SUMMIT 2013
What is a commit log?

HADOOP SUMMIT 2013
The log as a messaging system
LinkedIn Corporation ©2013 All Rights Reserved 17

HADOOP SUMMIT 2013
Apache Kafka
LinkedIn Corporation ©2013 All Rights Reserved 18

HADOOP SUMMIT 2013
Usage at LinkedIn
 16 brokers in each cluster
 28 billion messages/day
 Peak rates
– Writes: 460,000 messages/second
– Reads: 2,300,000 messages/second
 ~ 700 topics
 40-50 live services consuming user-activity data
 Many ad hoc consumers
 Every production service is a producer (for metrics)
 10k connections/colo
LinkedIn Corporation ©2013 All Rights Reserved 19

HADOOP SUMMIT 2013
Usage at LinkedIn
LinkedIn Corporation ©2013 All Rights Reserved 20

HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 21

HADOOP SUMMIT 2013
Standardize on Avro in data pipeline
LinkedIn Corporation ©2013 All Rights Reserved 22
{
"type": "record",
"name": "URIValidationRequestEvent",
"namespace": "com.linkedin.event.usv",
"fields": [
{
"name": "header",
"type": {
"type": "record",
"name": ”TrackingEventHeader",
"namespace": "com.linkedin.event",
"fields": [
{
"name": "memberId",
"type": "int",
"doc": "The member id of the user initiating the action"
},
{
"name": ”timeMs",
"type": "long",
"doc": "The time of the event"
},
{
"name": ”host",
"type": "string",
...
...

HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 23

HADOOP SUMMIT 2013
Hadoop data load (Camus)
 Open sourced:
– https://github.com/linkedin/camus
 One job loads all events
 ~10 minute ETA on average from producer to HDFS
 Hive registration done automatically
 Schema evolution handled transparently

HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 25

Does it work?
“All published messages must be delivered to all consumers (quickly)”
LinkedIn Confidential ©2013 All Rights Reserved

HADOOP SUMMIT 2013
Audit Trail

HADOOP SUMMIT 2013
Kafka replication (0.8)
 Intra-cluster replication feature
– Facilitates high availability and durability
 Beta release available
https://dist.apache.org/repos/dist/release/kafka/
 Rolled out in production at LinkedIn last week
LinkedIn Corporation ©2013 All Rights Reserved 28

HADOOP SUMMIT 2013
Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7.30pm to 9.30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
 RichRelevance
 Netflix
 Square
 LinkedIn
LinkedIn Corporation ©2013 All Rights Reserved 29

HADOOP SUMMIT 2013LinkedIn Corporation ©2013 All Rights Reserved 30

Mais conteúdo relacionado

Mais procurados

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

Intro to Time Series

Intro to Time Series

Intro to Time Series

Hive + Tez: A Performance Deep Dive

Hive + Tez: A Performance Deep Dive

Hive + Tez: A Performance Deep Dive

DataWorks Summit

LLAP: Sub-Second Analytical Queries in Hive

LLAP: Sub-Second Analytical Queries in Hive

LLAP: Sub-Second Analytical Queries in Hive

DataWorks Summit/Hadoop Summit

Salvatore Sanfilippo – How Redis Cluster works, and why In this talk the algorithmic details of Redis Cluster will be exposed in order to show what were the design tensions in the clustered version of an high performance database supporting complex data type, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution.Other non-chosen solutions in the design space will be illustrated for completeness.

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Screw DevOps, Let's Talk DataOps

Screw DevOps, Let's Talk DataOps

Screw DevOps, Let's Talk DataOps

Kellyn Pot'Vin-Gorman

Airflow presentation

Airflow presentation

Airflow presentation

• Understand the issues with commercial database pricing and licensing. • Learn about the benefits of Amazon Aurora for improving performance and decreasing costs. • See how AWS Database Migration Service helps with your migration. • See how AWS Schema Conversion Tool makes conversions simple and quick. If you’re looking to improve application performance and availability and decrease database costs, it’s time to replace your expensive Oracle databases with an open-source compatible solution. Amazon Aurora is a MySQL-compatible relational database that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. You'll learn how to use the AWS Database Migration Service to migrate your data with minimal downtime, and how the AWS Schema Conversion Tool converts your Oracle schemas and procedural code into Amazon Aurora. We’ll follow with a quick demo of the entire process.

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Amazon Web Services

Data in Hadoop is getting bigger every day, consumers of the data are growing, organizations are now looking at making their Hadoop cluster compliant to federal regulations and commercial demands. Apache Ranger simplifies the management of security policies across all components in Hadoop. Ranger provides granular access controls to data. The deck describes what security tools are available in Hadoop and their purpose then it moves on to discuss in detail Apache Ranger.

Apache Ranger

Hudi architecture, fundamentals and capabilities

Hudi architecture, fundamentals and capabilities

Hudi architecture, fundamentals and capabilities

Nishith Agarwal

Streaming sql and druid

Streaming sql and druid

Streaming sql and druid

Introduction to Redis

Introduction to Redis

Introduction to Redis

Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast [1], will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a baseline FAISS. Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/ YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast

Vector databases and neural search

Vector databases and neural search

Vector databases and neural search

Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.

Performance Optimizations in Apache Impala

Performance Optimizations in Apache Impala

Performance Optimizations in Apache Impala

We use Kafka as the data backbone to build Milvus, an open-source vector database system that has been adopted by thousands of organizations worldwide for vector similarity search. In this presentation, we will share how Milvus uses Kafka to enable both real-time processing and batch processing on vector data at scale. We will walk through the challenges of unified streaming and batching in vector data processing, as well as the design choices and the Kafka-based data architecture.

How Kafka Powers the World's Most Popular Vector Database System with Charles...

How Kafka Powers the World's Most Popular Vector Database System with Charles...

How Kafka Powers the World's Most Popular Vector Database System with Charles...

HostedbyConfluent

Apache Sentry for Hadoop security

Apache Sentry for Hadoop security

Apache Sentry for Hadoop security

bigdatagurus_meetup

Giraph at Hadoop Summit 2014

Giraph at Hadoop Summit 2014

Giraph at Hadoop Summit 2014

Claudio Martella

Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself. Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducability for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model. We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

Introduction to memcached

Introduction to memcached

Introduction to memcached

Jurriaan Persyn

Introduction to MongoDB

Introduction to MongoDB

Introduction to MongoDB

Mais procurados (20)

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...

Intro to Time Series

Intro to Time Series

Intro to Time Series

Hive + Tez: A Performance Deep Dive

Hive + Tez: A Performance Deep Dive

Hive + Tez: A Performance Deep Dive

LLAP: Sub-Second Analytical Queries in Hive

LLAP: Sub-Second Analytical Queries in Hive

LLAP: Sub-Second Analytical Queries in Hive

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...

Screw DevOps, Let's Talk DataOps

Screw DevOps, Let's Talk DataOps

Screw DevOps, Let's Talk DataOps

Airflow presentation

Airflow presentation

Airflow presentation

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...

Apache Ranger

Hudi architecture, fundamentals and capabilities

Hudi architecture, fundamentals and capabilities

Hudi architecture, fundamentals and capabilities

Streaming sql and druid

Streaming sql and druid

Streaming sql and druid

Introduction to Redis

Introduction to Redis

Introduction to Redis

Vector databases and neural search

Vector databases and neural search

Vector databases and neural search

Performance Optimizations in Apache Impala

Performance Optimizations in Apache Impala

Performance Optimizations in Apache Impala

How Kafka Powers the World's Most Popular Vector Database System with Charles...

How Kafka Powers the World's Most Popular Vector Database System with Charles...

How Kafka Powers the World's Most Popular Vector Database System with Charles...

Apache Sentry for Hadoop security

Apache Sentry for Hadoop security

Apache Sentry for Hadoop security

Giraph at Hadoop Summit 2014

Giraph at Hadoop Summit 2014

Giraph at Hadoop Summit 2014

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

Introduction to memcached

Introduction to memcached

Introduction to memcached

Introduction to MongoDB

Introduction to MongoDB

Introduction to MongoDB

Destaque

Architecture of a Kafka camus infrastructure

Architecture of a Kafka camus infrastructure

Architecture of a Kafka camus infrastructure

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Netflix Data Pipeline With Kafka

Netflix Data Pipeline With Kafka

Netflix Data Pipeline With Kafka

Allen (Xiaozhong) Wang

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

Yahoo Developer Network

LinkedIn Segmentation & Targeting Platform: A Big Data Application

LinkedIn Segmentation & Targeting Platform: A Big Data Application

LinkedIn Segmentation & Targeting Platform: A Big Data Application

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Introduction to Apache Kafka

Introduction to Apache Kafka

Introduction to Apache Kafka

LinkedIn Communication Architecture

LinkedIn Communication Architecture

LinkedIn Communication Architecture

Introduction to Databus

Introduction to Databus

Introduction to Databus

Building Distributed Systems Using Helix

Building Distributed Systems Using Helix

Building Distributed Systems Using Helix

Building a Data Pipeline from Scratch - Joe Crobak

Building a Data Pipeline from Scratch - Joe Crobak

Building a Data Pipeline from Scratch - Joe Crobak

What is a distributed data science pipeline. how with apache spark and friends.

What is a distributed data science pipeline. how with apache spark and friends.

What is a distributed data science pipeline. how with apache spark and friends.

Rakuten LeoFs - distributed file system

Rakuten LeoFs - distributed file system

Rakuten LeoFs - distributed file system

Rakuten Group, Inc.

Introduction to apache kafka

Introduction to apache kafka

Introduction to apache kafka

Apache Kafka

Realtime streaming architecture in INFINARIO

Realtime streaming architecture in INFINARIO

Realtime streaming architecture in INFINARIO

Non-interactive big-data analysis prohibits experimentation and can interrupt the analyst’s train of thoughts but analyzing and drawing insights in real time is no easy task with jobs often taking minutes/hours to complete. What if you want to put a interactive interface in front of that data that allows iterative insights? What if you need that interactive experience to be sub second? Traditional SQL and most MPP/NoSQL databases cannot run complex calculations over large data in a performant manner. Popular distributed systems such as Hadoop or Spark can execute jobs but their job overhead prohibits sub second response times. Learn how an in-memory computing framework enabled us to perform complex analysis jobs on massive data points with sub second response times — allowing us to plug it into a simple, drag-and-drop web 2.0 interface.

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

In-Memory Computing Summit

Intro to SnappyData Webinar

Intro to SnappyData Webinar

Intro to SnappyData Webinar

Destaque (20)

Architecture of a Kafka camus infrastructure

Architecture of a Kafka camus infrastructure

Architecture of a Kafka camus infrastructure

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Netflix Data Pipeline With Kafka

Netflix Data Pipeline With Kafka

Netflix Data Pipeline With Kafka

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

Data Applications and Infrastructure at LinkedIn__HadoopSummit2010

LinkedIn Segmentation & Targeting Platform: A Big Data Application

LinkedIn Segmentation & Targeting Platform: A Big Data Application

LinkedIn Segmentation & Targeting Platform: A Big Data Application

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Introduction to Apache Kafka

Introduction to Apache Kafka

Introduction to Apache Kafka

LinkedIn Communication Architecture

LinkedIn Communication Architecture

LinkedIn Communication Architecture

Introduction to Databus

Introduction to Databus

Introduction to Databus

Building Distributed Systems Using Helix

Building Distributed Systems Using Helix

Building Distributed Systems Using Helix

Building a Data Pipeline from Scratch - Joe Crobak

Building a Data Pipeline from Scratch - Joe Crobak

Building a Data Pipeline from Scratch - Joe Crobak

What is a distributed data science pipeline. how with apache spark and friends.

What is a distributed data science pipeline. how with apache spark and friends.

What is a distributed data science pipeline. how with apache spark and friends.

Rakuten LeoFs - distributed file system

Rakuten LeoFs - distributed file system

Rakuten LeoFs - distributed file system

Introduction to apache kafka

Introduction to apache kafka

Introduction to apache kafka

Apache Kafka

Realtime streaming architecture in INFINARIO

Realtime streaming architecture in INFINARIO

Realtime streaming architecture in INFINARIO

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...

Intro to SnappyData Webinar

Intro to SnappyData Webinar

Intro to SnappyData Webinar

Semelhante a Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

It is clear that all employees must have access to data wherever they are to make decisions. However, tools that allow to share data just as easily as the best collaborative tools such as a google doc or an office 365 should be used. Open source driven by the big data ecosystem and a number of large companies have provided solutions to allow organizations to federate data systems and secure their access. After a quick overview of existing open source solutions, and how such projects can be organized, it will be necessary to detail Dremio implementation, a unique and centralized interface on all your data. Some real feedbacks will conclude the presentation.

All data accessible to all my organization - Presentation at OW2con'19, June...

All data accessible to all my organization - Presentation at OW2con'19, June...

All data accessible to all my organization - Presentation at OW2con'19, June...

So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it. As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop. Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047 Description: So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it. As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop. Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

For organisations to successfully adopt data mesh, setting up and maintaining infrastructure needs to be easy. We believe the best way to achieve this is to leverage the learnings from building a ‘central nervous system‘, commonly used in modern data-streaming ecosystems. This approach formalises and automates of the manual parts of building a data mesh. This presentation introduces SpecMesh; a methodology and supporting developer toolkit to enable business to build the foundations of their data mesh.

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience. H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast¬TM Scoring Engine. Learn more by going to http://www.h2o.ai and contact us for more information. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Sparkling Water Webinar October 29th, 2014

Sparkling Water Webinar October 29th, 2014

Sparkling Water Webinar October 29th, 2014

Watch: https://bit.ly/2DYsUhD Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spent most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way. Attend this webinar and learn: - How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice - How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo - How you can use the Denodo Platform with large data volumes in an efficient way - How Prologis accelerated their use of Machine Learning with data virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

DataWorks Summit/Hadoop Summit

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

The Open Data Protocol (OData) is an open protocol for sharing data. It provides a way to break down data silos and increase the shared value of data by creating an ecosystem in which data consumers can interoperate with data producers in a way that is far more powerful than currently possible, enabling more applications to make sense of a broader set of data. Every producer and consumer of data that participates in this ecosystem increases its overall value. OData is consistent with the way the Web works – it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools.

Breaking down data silos with OData

Breaking down data silos with OData

Breaking down data silos with OData

Woodruff Solutions LLC

The LOD Gateway: Open Source Infrastructure for Linked Data

The LOD Gateway: Open Source Infrastructure for Linked Data

The LOD Gateway: Open Source Infrastructure for Linked Data

The oecd delta project – providing easier access to data through api's

The oecd delta project – providing easier access to data through api's

The oecd delta project – providing easier access to data through api's

Jonathan Challener

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Tech Community

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Tech Community

Big Data, Bigger Brains

Big Data, Bigger Brains

Big Data, Bigger Brains

Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects. Speaker Jesus Camancho Rodriquez, Hortonworks

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

DataWorks Summit

Watch full webinar here: https://bit.ly/32c6TnG Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spent most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way. Attend this webinar and learn: - How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice - How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo - How you can use the Denodo Platform with large data volumes in an efficient way - About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Opensocial Haifa Seminar - 2008.04.08

Opensocial Haifa Seminar - 2008.04.08

Opensocial Haifa Seminar - 2008.04.08

Better integrations through open interfaces

Better integrations through open interfaces

Better integrations through open interfaces

Test trend analysis: Towards robust reliable and timely tests

Test trend analysis: Towards robust reliable and timely tests

Test trend analysis: Towards robust reliable and timely tests

Hugh McCamphill

Semelhante a Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn (20)

All data accessible to all my organization - Presentation at OW2con'19, June...

All data accessible to all my organization - Presentation at OW2con'19, June...

All data accessible to all my organization - Presentation at OW2con'19, June...

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh

Sparkling Water Webinar October 29th, 2014

Sparkling Water Webinar October 29th, 2014

Sparkling Water Webinar October 29th, 2014

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

Bg linkedin bigdata_martinschultz_symposium_yale_oct2012

Breaking down data silos with OData

Breaking down data silos with OData

Breaking down data silos with OData

The LOD Gateway: Open Source Infrastructure for Linked Data

The LOD Gateway: Open Source Infrastructure for Linked Data

The LOD Gateway: Open Source Infrastructure for Linked Data

The oecd delta project – providing easier access to data through api's

The oecd delta project – providing easier access to data through api's

The oecd delta project – providing easier access to data through api's

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Microsoft Graph: Connect to essential data every app needs

Big Data, Bigger Brains

Big Data, Bigger Brains

Big Data, Bigger Brains

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Advanced Analytics and Machine Learning with Data Virtualization

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...

Opensocial Haifa Seminar - 2008.04.08

Opensocial Haifa Seminar - 2008.04.08

Opensocial Haifa Seminar - 2008.04.08

Better integrations through open interfaces

Better integrations through open interfaces

Better integrations through open interfaces

Test trend analysis: Towards robust reliable and timely tests

Test trend analysis: Towards robust reliable and timely tests

Test trend analysis: Towards robust reliable and timely tests

Mais de Amy W. Tang

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

LinkedIn Graph Presentation

LinkedIn Graph Presentation

LinkedIn Graph Presentation

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Voldemort on Solid State Drives

Voldemort on Solid State Drives

Voldemort on Solid State Drives

Untangling Cluster Management with Helix

Untangling Cluster Management with Helix

Untangling Cluster Management with Helix

All Aboard the Databus

All Aboard the Databus

All Aboard the Databus

Mais de Amy W. Tang (6)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)

LinkedIn Graph Presentation

LinkedIn Graph Presentation

LinkedIn Graph Presentation

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Voldemort on Solid State Drives

Voldemort on Solid State Drives

Voldemort on Solid State Drives

Untangling Cluster Management with Helix

Untangling Cluster Management with Helix

Untangling Cluster Management with Helix

All Aboard the Databus

All Aboard the Databus

All Aboard the Databus

Último

Modernizing Securities Finance: The cloud-native prime brokerage platform transforming capital markets. Madhu Subbu, Managing Director, Head of Securities Finance Engineering Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

💉💊+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI}}+971581248768 +971581248768 Mtp-Kit (500MG) Prices » Dubai [(+971581248768**)] Abortion Pills For Sale In Dubai, UAE, Mifepristone and Misoprostol Tablets Available In Dubai, UAE CONTACT DR.Maya Whatsapp +971581248768 We Have Abortion Pills / Cytotec Tablets /Mifegest Kit Available in Dubai, Sharjah, Abudhabi, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE, Buy cytotec in Dubai +971581248768''''Abortion Pills near me DUBAI | ABU DHABI|UAE. Price of Misoprostol, Cytotec” +971581248768' Dr.DEEM ''BUY ABORTION PILLS MIFEGEST KIT, MISOPROTONE, CYTOTEC PILLS IN DUBAI, ABU DHABI,UAE'' Contact me now via What's App…… abortion Pills Cytotec also available Oman Qatar Doha Saudi Arabia Bahrain Above all, Cytotec Abortion Pills are Available In Dubai / UAE, you will be very happy to do abortion in Dubai we are providing cytotec 200mg abortion pill in Dubai, UAE. Medication abortion offers an alternative to Surgical Abortion for women in the early weeks of pregnancy. We only offer abortion pills from 1 week-6 Months. We then advise you to use surgery if its beyond 6 months. Our Abu Dhabi, Ajman, Al Ain, Dubai, Fujairah, Ras Al Khaimah (RAK), Sharjah, Umm Al Quwain (UAQ) United Arab Emirates Abortion Clinic provides the safest and most advanced techniques for providing non-surgical, medical and surgical abortion methods for early through late second trimester, including the Abortion By Pill Procedure (RU 486, Mifeprex, Mifepristone, early options French Abortion Pill), Tamoxifen, Methotrexate and Cytotec (Misoprostol). The Abu Dhabi, United Arab Emirates Abortion Clinic performs Same Day Abortion Procedure using medications that are taken on the first day of the office visit and will cause the abortion to occur generally within 4 to 6 hours (as early as 30 minutes) for patients who are 3 to 12 weeks pregnant. When Mifepristone and Misoprostol are used, 50% of patients complete in 4 to 6 hours; 75% to 80% in 12 hours; and 90% in 24 hours. We use a regimen that allows for completion without the need for surgery 99% of the time. All advanced second trimester and late term pregnancies at our Tampa clinic (17 to 24 weeks or greater) can be completed within 24 hours or less 99% of the time without the need surgery. The procedure is completed with minimal to no complications. Our Women's Health Center located in Abu Dhabi, United Arab Emirates, uses the latest medications for medical abortions (RU-486, Mifeprex, Mifegyne, Mifepristone, early options French abortion pill), Methotrexate and Cytotec (Misoprostol). The safety standards of our Abu Dhabi, United Arab Emirates Abortion Doctors remain unparalleled. They consistently maintain the lowest complication rates throughout the nation. Our Physicians and staff are always available to answer questions and care for women in one of the most difficult times in their lives. The decision to have an abortion at the Abortion Cl

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@

FWD Group - Insurer Innovation Award 2024

FWD Group - Insurer Innovation Award 2024

FWD Group - Insurer Innovation Award 2024

The Digital Insurer

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on the deployment of external web forms using Jotform for Bonterra Impact Management. This solution can be customized to your organization’s needs and deployed to support the common use cases below: - Intake and consent - Assessments - Surveys - Applications - Program registration Interested in deploying web form automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Jeffrey Haguewood

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Anna Loughnan Colquhoun

Whatsapp Number Escorts Call girls 8617370543 Available 24x7 Navi Mumbai Call Girls Service Offer Genuine VIP Model Escorts Call Girls in Your Budget. Navi Mumbai Call Girls Service Provide Real Call Girls Number. Make Your Sexual Pleasure Memorable with Our Navi Mumbai Call Girls at Affordable Price. Top VIP Escorts Call Girls, High Profile Independent Escorts Call Girls, Housewife Women Escorts Call Girl, College Girls Escorts Call Girls, Russian Escorts Call girls Service in Your Budget.

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

Andrey Devyatkin

AXA XL - Insurer Innovation Award Americas 2024

AXA XL - Insurer Innovation Award Americas 2024

AXA XL - Insurer Innovation Award Americas 2024

The Digital Insurer

Created by Mozilla Research in 2012 and now part of Linux Foundation Europe, the Servo project is an experimental rendering engine written in Rust. It combines memory safety and concurrency to create an independent, modular, and embeddable rendering engine that adheres to web standards. Stewardship of Servo moved from Mozilla Research to the Linux Foundation in 2020, where its mission remains unchanged. After some slow years, in 2023 there has been renewed activity on the project, with a roadmap now focused on improving the engine’s CSS 2 conformance, exploring Android support, and making Servo a practical embeddable rendering engine. In this presentation, Rakhi Sharma reviews the status of the project, our recent developments in 2023, our collaboration with Tauri to make Servo an easy-to-use embeddable rendering engine, and our plans for the future to make Servo an alternative web rendering engine for the embedded devices industry. (c) Embedded Open Source Summit 2024 April 16-18, 2024 Seattle, Washington (US) https://events.linuxfoundation.org/embedded-open-source-summit/ https://ossna2024.sched.com/event/1aBNF/a-year-of-servo-reboot-where-are-we-now-rakhi-sharma-igalia

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Juan lago vázquez

MySQL Webinar, presented on the 25th of April, 2024. Summary: MySQL solutions enable the deployment of diverse Database Architectures tailored to specific needs, including High Availability, Disaster Recovery, and Read Scale-Out. With MySQL Shell's AdminAPI, administrators can seamlessly set up, manage, and monitor these solutions, ensuring efficiency and ease of use in their administration. MySQL Router, on the other hand, provides transparent routing from the application traffic to the backend servers in the architectures, requiring minimal configuration. Completely built in-house and supported by Oracle, these solutions have been adopted by enterprises of all sizes for their business-critical applications. In this presentation, we'll delve into various database architecture solutions to help you choose the right one based on your business requirements. Focusing on technical details and the latest features to maximize the potential of these solutions.

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

GenAI Risks & Security Meetup 01052024.pdf

GenAI Risks & Security Meetup 01052024.pdf

GenAI Risks & Security Meetup 01052024.pdf

The Good, the Bad and the Governed - Why is governance a dirty word? David O'Neill, Chief Operating Officer - APIContext Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Khushali Kathiriya

Accelerating FinTech Innovation: Unleashing API Economy and GenAI Vasa Krishnan, Chief Technology Officer - FinResults Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Architecting Cloud Native Applications

Architecting Cloud Native Applications

Architecting Cloud Native Applications

Abhishek Deb(1), Mr Abdul Kalam(2) M. Des (UX) , School of Design, DIT University , Dehradun. This paper explores the future potential of AI-enabled smartphone processors, aiming to investigate the advancements, capabilities, and implications of integrating artificial intelligence (AI) into smartphone technology. The research study goals consist of evaluating the development of AI in mobile phone processors, analyzing the existing state as well as abilities of AI-enabled cpus determining future patterns as well as chances together with reviewing obstacles as well as factors to consider for more growth.

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

As privacy and data protection regulations evolve rapidly, organizations operating in multiple jurisdictions face mounting challenges to ensure compliance and safeguard customer data. With state-specific privacy laws coming up in multiple states this year, it is essential to understand what their unique data protection regulations will require clearly. How will data privacy evolve in the US in 2024? How to stay compliant? Our panellists will guide you through the intricacies of these states' specific data privacy laws, clarifying complex legal frameworks and compliance requirements. This webinar will review: - The essential aspects of each state's privacy landscape and the latest updates - Common compliance challenges faced by organizations operating in multiple states and best practices to achieve regulatory adherence - Valuable insights into potential changes to existing regulations and prepare your organization for the evolving landscape

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Último (20)

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

FWD Group - Insurer Innovation Award 2024

FWD Group - Insurer Innovation Award 2024

FWD Group - Insurer Innovation Award 2024

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Data Cloud, More than a CDP by Matt Robison

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

AWS Community Day CPH - Three problems of Terraform

AXA XL - Insurer Innovation Award Americas 2024

AXA XL - Insurer Innovation Award Americas 2024

AXA XL - Insurer Innovation Award Americas 2024

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

A Year of the Servo Reboot: Where Are We Now?

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

GenAI Risks & Security Meetup 01052024.pdf

GenAI Risks & Security Meetup 01052024.pdf

GenAI Risks & Security Meetup 01052024.pdf

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Architecting Cloud Native Applications

Architecting Cloud Native Applications

Architecting Cloud Native Applications

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

1. Building a Real-Time Data Pipeline: Apache Kafka at Linkedin Hadoop Summit 2013 Joel Koshy June 2013 LinkedIn Corporation ©2013 All Rights Reserved

2. HADOOP SUMMIT 2013 Network update stream

3. LinkedIn Corporation ©2013 All Rights Reserved We have a lot of data. We want to leverage this data to build products. Data pipeline

4. HADOOP SUMMIT 2013 People you may know

5. HADOOP SUMMIT 2013 System and application metrics/logging LinkedIn Corporation ©2013 All Rights Reserved 5

6. How do we integrate this variety of data and make it available to all these systems? LinkedIn Confidential ©2013 All Rights Reserved

7. HADOOP SUMMIT 2013 Point-to-point pipelines

8. HADOOP SUMMIT 2013 LinkedIn’s user activity data pipeline (circa 2010)

9. HADOOP SUMMIT 2013 Point-to-point pipelines

10. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 10

11. HADOOP SUMMIT 2013 Central data pipeline

12. First attempt: don’t re-invent the wheel LinkedIn Confidential ©2013 All Rights Reserved

13. HADOOP SUMMIT 2013

14. Second attempt: re-invent the wheel! LinkedIn Confidential ©2013 All Rights Reserved

15. Use a central commit log LinkedIn Confidential ©2013 All Rights Reserved

16. HADOOP SUMMIT 2013 What is a commit log?

17. HADOOP SUMMIT 2013 The log as a messaging system LinkedIn Corporation ©2013 All Rights Reserved 17

18. HADOOP SUMMIT 2013 Apache Kafka LinkedIn Corporation ©2013 All Rights Reserved 18

19. HADOOP SUMMIT 2013 Usage at LinkedIn  16 brokers in each cluster  28 billion messages/day  Peak rates – Writes: 460,000 messages/second – Reads: 2,300,000 messages/second  ~ 700 topics  40-50 live services consuming user-activity data  Many ad hoc consumers  Every production service is a producer (for metrics)  10k connections/colo LinkedIn Corporation ©2013 All Rights Reserved 19

20. HADOOP SUMMIT 2013 Usage at LinkedIn LinkedIn Corporation ©2013 All Rights Reserved 20

21. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 21

22. HADOOP SUMMIT 2013 Standardize on Avro in data pipeline LinkedIn Corporation ©2013 All Rights Reserved 22 { "type": "record", "name": "URIValidationRequestEvent", "namespace": "com.linkedin.event.usv", "fields": [ { "name": "header", "type": { "type": "record", "name": ”TrackingEventHeader", "namespace": "com.linkedin.event", "fields": [ { "name": "memberId", "type": "int", "doc": "The member id of the user initiating the action" }, { "name": ”timeMs", "type": "long", "doc": "The time of the event" }, { "name": ”host", "type": "string", ... ...

23. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 23

24. HADOOP SUMMIT 2013 Hadoop data load (Camus)  Open sourced: – https://github.com/linkedin/camus  One job loads all events  ~10 minute ETA on average from producer to HDFS  Hive registration done automatically  Schema evolution handled transparently

25. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 25

26. Does it work? “All published messages must be delivered to all consumers (quickly)” LinkedIn Confidential ©2013 All Rights Reserved

27. HADOOP SUMMIT 2013 Audit Trail

28. HADOOP SUMMIT 2013 Kafka replication (0.8)  Intra-cluster replication feature – Facilitates high availability and durability  Beta release available https://dist.apache.org/repos/dist/release/kafka/  Rolled out in production at LinkedIn last week LinkedIn Corporation ©2013 All Rights Reserved 28

29. HADOOP SUMMIT 2013 Join us at our user-group meeting tonight @ LinkedIn! – Thursday, June 27, 7.30pm to 9.30pm – 2025 Stierlin Ct., Mountain View, CA – http://www.meetup.com/http-kafka-apache-org/events/125887332/ – Presentations (replication overview and use-case studies) from:  RichRelevance  Netflix  Square  LinkedIn LinkedIn Corporation ©2013 All Rights Reserved 29

30. HADOOP SUMMIT 2013LinkedIn Corporation ©2013 All Rights Reserved 30