SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
Data Engineering
April 2019 - STL Big Data User Group
Discussion Topics
Introductions
Background - Chris
Distributed systems - Chris
Day in the life - Tim
Demo - Kit
Why Data Engineering?
● It's not just about analytics.
● Treating data as an asset.
● Data science is built upon the work of good data engineering.
● The data we collect is changing. The way we store and access it must as well.
● Automate, automate, automate.
Distributed Systems
What does it mean?
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate
their actions by passing messages to one another.
Distributed Compute
1. Overview
2. Scaling
3. What does it mean for you?
Data Architecture - Lambda & Kappa
Lambda: Batch and Speed Layer
● Stores raw data
● Computes tables in batch
● Has streaming component
● Performs data modeling
● Exposes end tables to users
Kappa: Speed Layer
● All data as a stream
● In stream processing
● Real-time views
A Day In The Life
What’s not modern about this architecture?
● Cron job runs once a day to check for presence of a file on an FTP server
● CSV file is downloaded and loaded into an RDBMS star schema
Message Queues
● Publishers
● Exchanges
● Queues
● Consumers
● Messages are removed from
queues once a consumer reads
them
Scaling Up Message Queues
What if a single consumer can’t keep up with the volume?
Scaling Them Up Even Further...
What happens if each message needs to go to more than one set of consumers?
Distributed Logs
● Topic - a named stream of messages
● Partition - an ordered list of message. A topic is made up of a set of partitions
● Producer - a process that publishes messages to a topic
● Consumer - a process that consumes messages from a topic
● Consumer group - a named group of consumers who together consume
messages from a set of topics
Topics can be consumed by multiple consumer groups independently.
Each consumer group is responsible for tracking its own position in message
processing
Messages stay in the log for the defined retention period (are not removed once all
consumer groups have processed them)
Distributed Logs
Data Formats
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
●
● {
● "type" : "record",
● "namespace" : "STL Big Data",
● "name" : "Inventory",
● "fields" : [
● { "name" : "date" , "type" : "date" },
● { "name" : "price" , "type" : "int" },
● { "name" : "size" , "type" : "int" }
● ]
● }
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
●
● {
● "type" : "record",
● "namespace" : "STL Big Data",
● "name" : "Inventory",
● "fields" : [
● { "name" : "date" , "type" : "date" },
● { "name" : "price" , "type" : "int" },
● { "name" : "size" , "type" : "int" },
● { "name" : "qty", "type" : [ "null", "int" ], "default" : null }
● ]
● }
Column Oriented Storage
● Instead of storing all columns for a row together on disk, store each column
together on disk
● Enables higher compression ratios
● What kind of queries will you be executing against the data?
Apache ORC
● Column oriented
● Enables fast reading
● Compresses extremely well
● Commonly used with Hive and Presto
● Column oriented
● Enables fast reading
● Compresses well
● Commonly used with with Impala
Apache Parquet
Compression Comparison
Format Size
CSV 2.5 GB
Avro 1.2 GB
Parquet 817 MB
ORC 796 MB
Source: Netflix Prize Data https://www.kaggle.com/netflix-inc/netflix-prize-data
Demo
https://github.com/kitmenke/spark-hello-world
AWS Kinesis
API
AWS Comprehend
Appendix
Best in class modern architecture
… At 1904labs, we have a passion for open source
technologies, strive to build cloud first applications, and
are motivated by our desire to transform businesses into
data-driven enterprises.
Infrastructure
Bare Metal / Servers On Site
Build & Maintain
Bundled Vendors
Cloud
These open source tools
are already used broadly
across industries today
Open-Source Adoption in the Industry
1904labs’ primary focus is to
help our clients design, build
and operate world class and
modern data, development
operations and decision science
capabilities...
...Utilizing best of breed open
source and/or commercial open
source tools and platforms.

Mais conteúdo relacionado

Mais procurados

NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social WebBogdan Gaza
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB InternalsInfluxData
 
What’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementWhat’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementAlluxio, Inc.
 
MongoDB for the SQL Server
MongoDB for the SQL ServerMongoDB for the SQL Server
MongoDB for the SQL ServerPaulo Fagundes
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and HadoopSalil Navgire
 
Big data in the cloud
Big data in the cloudBig data in the cloud
Big data in the cloudBen Sullins
 
Cassandra
CassandraCassandra
Cassandrarobjk
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architecturesRegunath B
 
Small intro to Big Data - Old version
Small intro to Big Data - Old versionSmall intro to Big Data - Old version
Small intro to Big Data - Old versionSoftwareMill
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantageRegunath B
 
Introducción a NoSQL
Introducción a NoSQLIntroducción a NoSQL
Introducción a NoSQLMongoDB
 
Intoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime DatabaseIntoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime DatabaseSahil Maiyani
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevAltinity Ltd
 

Mais procurados (20)

NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social Web
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB Internals
 
What’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementWhat’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data management
 
MongoDB
MongoDBMongoDB
MongoDB
 
MongoDB for the SQL Server
MongoDB for the SQL ServerMongoDB for the SQL Server
MongoDB for the SQL Server
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Big data in the cloud
Big data in the cloudBig data in the cloud
Big data in the cloud
 
NoSQL
NoSQLNoSQL
NoSQL
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
Mongodb lab
Mongodb labMongodb lab
Mongodb lab
 
Cassandra
CassandraCassandra
Cassandra
 
Practical Use of a NoSQL
Practical Use of a NoSQLPractical Use of a NoSQL
Practical Use of a NoSQL
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architectures
 
Small intro to Big Data - Old version
Small intro to Big Data - Old versionSmall intro to Big Data - Old version
Small intro to Big Data - Old version
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Introducción a NoSQL
Introducción a NoSQLIntroducción a NoSQL
Introducción a NoSQL
 
Intoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime DatabaseIntoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime Database
 
MongoDB SF Ruby
MongoDB SF RubyMongoDB SF Ruby
MongoDB SF Ruby
 
NoSQL
NoSQLNoSQL
NoSQL
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
 

Semelhante a Data engineering Stl Big Data IDEA user group

Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...HostedbyConfluent
 
(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the ConferenceTimothy Spann
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsClaudiu Coman
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and howPetr Zapletal
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 

Semelhante a Data engineering Stl Big Data IDEA user group (20)

Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
Py tables
Py tablesPy tables
Py tables
 
PyTables
PyTablesPyTables
PyTables
 
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
 
(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
PyTables
PyTablesPyTables
PyTables
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Mais de Adam Doyle

Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 

Mais de Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 

Último

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 

Último (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Data engineering Stl Big Data IDEA user group

  • 1. Data Engineering April 2019 - STL Big Data User Group
  • 2. Discussion Topics Introductions Background - Chris Distributed systems - Chris Day in the life - Tim Demo - Kit
  • 3. Why Data Engineering? ● It's not just about analytics. ● Treating data as an asset. ● Data science is built upon the work of good data engineering. ● The data we collect is changing. The way we store and access it must as well. ● Automate, automate, automate.
  • 4.
  • 6. What does it mean? A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
  • 7.
  • 8. Distributed Compute 1. Overview 2. Scaling 3. What does it mean for you?
  • 9. Data Architecture - Lambda & Kappa Lambda: Batch and Speed Layer ● Stores raw data ● Computes tables in batch ● Has streaming component ● Performs data modeling ● Exposes end tables to users Kappa: Speed Layer ● All data as a stream ● In stream processing ● Real-time views
  • 10. A Day In The Life
  • 11. What’s not modern about this architecture? ● Cron job runs once a day to check for presence of a file on an FTP server ● CSV file is downloaded and loaded into an RDBMS star schema
  • 12. Message Queues ● Publishers ● Exchanges ● Queues ● Consumers ● Messages are removed from queues once a consumer reads them
  • 13. Scaling Up Message Queues What if a single consumer can’t keep up with the volume?
  • 14. Scaling Them Up Even Further... What happens if each message needs to go to more than one set of consumers?
  • 15. Distributed Logs ● Topic - a named stream of messages ● Partition - an ordered list of message. A topic is made up of a set of partitions ● Producer - a process that publishes messages to a topic ● Consumer - a process that consumes messages from a topic ● Consumer group - a named group of consumers who together consume messages from a set of topics Topics can be consumed by multiple consumer groups independently. Each consumer group is responsible for tracking its own position in message processing Messages stay in the log for the defined retention period (are not removed once all consumer groups have processed them)
  • 18. Apache Avro ● Row oriented ● Fast at writes ● Rich support for schema evolution ● ● { ● "type" : "record", ● "namespace" : "STL Big Data", ● "name" : "Inventory", ● "fields" : [ ● { "name" : "date" , "type" : "date" }, ● { "name" : "price" , "type" : "int" }, ● { "name" : "size" , "type" : "int" } ● ] ● }
  • 19. Apache Avro ● Row oriented ● Fast at writes ● Rich support for schema evolution ● ● { ● "type" : "record", ● "namespace" : "STL Big Data", ● "name" : "Inventory", ● "fields" : [ ● { "name" : "date" , "type" : "date" }, ● { "name" : "price" , "type" : "int" }, ● { "name" : "size" , "type" : "int" }, ● { "name" : "qty", "type" : [ "null", "int" ], "default" : null } ● ] ● }
  • 20. Column Oriented Storage ● Instead of storing all columns for a row together on disk, store each column together on disk ● Enables higher compression ratios ● What kind of queries will you be executing against the data?
  • 21. Apache ORC ● Column oriented ● Enables fast reading ● Compresses extremely well ● Commonly used with Hive and Presto ● Column oriented ● Enables fast reading ● Compresses well ● Commonly used with with Impala Apache Parquet
  • 22. Compression Comparison Format Size CSV 2.5 GB Avro 1.2 GB Parquet 817 MB ORC 796 MB Source: Netflix Prize Data https://www.kaggle.com/netflix-inc/netflix-prize-data
  • 26.
  • 27. Best in class modern architecture
  • 28. … At 1904labs, we have a passion for open source technologies, strive to build cloud first applications, and are motivated by our desire to transform businesses into data-driven enterprises.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. Infrastructure Bare Metal / Servers On Site Build & Maintain Bundled Vendors Cloud
  • 34.
  • 35. These open source tools are already used broadly across industries today Open-Source Adoption in the Industry 1904labs’ primary focus is to help our clients design, build and operate world class and modern data, development operations and decision science capabilities... ...Utilizing best of breed open source and/or commercial open source tools and platforms.