SlideShare uma empresa Scribd logo
1 de 67
Big Data in 200KM/h
Omid Vahdaty, Big Data Ninja
Disclaimer
● I am not the best, I simply love
what I do VERY much.
● You are more than welcome to
challenge me or anything I have
to say as I could be wrong.
● This Lecture has evolves over
time, this is the 4th iteration.
● Feel Free to send me comments
A little bit about Me
In the Past(web,api, ops db, data warehouse)
5
Then came Big Data...
6
Then came the cloud...
7
Then came the invoice ...
It keeps
growing….
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
MY AWS BIG DATA
APPLICATION STACK
With some “Myst”...
Jargon
Big data?
Architecture?
Considerations?
Challenges?
How to get started?
Big Jargon & Basics concepts u should know
https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs-
and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/
● What is Big Data?
● scale out / up ?
● structured/semi structured/unstructured data ?
● ACID?
● OLAP VS OLTP? == Analytics VS operational
● DIY = Do it Yourself
● PaaS = Platform as a service
Big Data =
When your data outgrows
your infrastructure ability to process
● Volume (x TB processing per day)
● Velocity ( x GB/s)
● Variety (JSON, CSV, events etc)
● Veracity (how much of the data is accurate?)
Challenges creating big data architecture?
● What is the business use case ? How fast do u need the insights?
○ 15 min - 24 hours delay and above → use batch
○ Less than 15 min?
■ Might be batch - depends data source is files or events?.
■ Streaming?
● Sub seconds delay?
● Sub minute delay?
■ Streaming with in flight analytics ?
○ How complex is the compute jobs? Aggregations? joins?
Challenges creating big data architecture?
● What is the Velocity?
○ Under 100K events per second? Not a problem
○ Over 1M events per second? Costly. But doable.
○ Over 1B events per seconds? Not trivial at all.
● Volume ?
○ ~1TB a day ? Not a problem
○ Over ? it depends.
○ Over a petabyte? Well…. It depends.
● Veracity (how are you going to handle different data sources?)
○ Structured (CSV)
○ Semi structured (JSON,XML)
○ Unstructured (pictures, movies etc)
Challenges creating big data architecture?
● Performance targets?
● Costs targets?
● Security restrictions?
● Regulation restriction? privacy?
● Which technology to choose?
● Datacenter or cloud?
● Latency?
● Throughput?
● Concurrency?
● Security Access patterns?
● Pass? Max 7 technologies
● Iaas? Max 4 technologies
Cloud Architecture rules of thumb...
● Decoupling
● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
○ Do use more b/c: maintenance
○ Training time
○ complexity/simplicity
How to get started on big data architecture?
Get answers to the below:
1. What is the business use case?
a. Volume? velocity? Variety? vercity/
b. Did you map all of data sources?
2. Where should we build a data platform?
a. What is the product? → requirements.
b. Cloud? datacenter?
3. Architecture?
a. DIY? Paas? , Pay as you go? Or fixed? Decoupled?
b. Fast? Cheap? simple?
4. Did you Communicate you plans?
5. Did you map all known challenges?
Where to build the data platform?
Big Data Architecture:
● Data Ingestion
○ Online
■ messaging
■ Streaming
○ Offline
■ Batch
■ Performance aspects
● Data Transformation (Hive)
○ JSON, CSV, TXT, PARQUET, Binary
● Data Modeling - (R, ML, AI, DEEP, SPARK)
● Data Visualization (choose your poison)
● PII regulation + GPDR regulation
● And: Performance... Cost… Security… Simple... Cloud best practices...
Big Data Generic Architecture
Data Ingestion
(file based ETL from remote DC)
Data Transformation ( row to colunar + cleansing)
Data Modeling ( joins/agg/ML/R)
Data Presentation
Text,
RAW
Data Ingestion
A layer in your big data architecture
designed to do one thing : ingest
data via Batch or Streaming, I.e
move (only) data from point A to
point B. from source data to the
next layer in the architecture
(decoupled).
Big Data Generic Architecture | Data Ingestion
Data Ingestion
Data Transformation
Data Modeling
Data Presentation
S3 Considerations
● Security
○ at rest: server side S3-Managed Keys (SSE-S3)
○ at transit: SSL / VPN
○ Hardening: user, IP ACL, write permission only.
● Upload
○ AWS s3 cli
○ Multi part upload
○ Aborting Incomplete Multipart Uploads Using a
Bucket Lifecycle Policy
○ Consider S3 CLI Sync command instead of CP
Sqoop - ETL
● Open source , part of EMR
● HDFS to RDMS and back. Via JDBC.
● E.g BiDirectional ETL from RDS to HDFS
● Unlikely use case: ETL from customer source operational DB.
Flume & Kafka
● Opens source project for streaming & messaging
● Popular
● Generic
● Good practice for many use cases. (a meetup by it self)
● Highly durable, scalable, extension etc.
● Downside : DIY, Non trivial to get started
Data Transfer Options
● Direct Connect (4GB/s?)
● For all other use case
○ S3 multipart upload
○ Compression
○ Security
■ Data at motion
■ Data at rest
Kinesis family of products
● Kinesis Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to : s3,redshift, Dynamo, ...
○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second.
○ Not managed!
● Kinesis Analytics - in flight analytics.
● Kin. Firehose - Park you data @ destination.
Kinesis Firehose - for Data parking
● Not for fast lane - no in flight analytics
● Ingest , transform and load to:
○ Kinesis
○ S3
○ Redshift
○ elastic search
● Managed Service
Comparison of Kinesis products
● Streams
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (redshift, s3, search, etc)
○ Buffering 60 sec minimum.
○ For larger “events”
Data
Transformation
A layer in your big data architecture
designed to : Transform and
Cleanse data (row data to columnar
data and convert data types, Fix
bugs in data)
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Quick word about hadoop Ecosystem
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper (hbase)
● zeppelin
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage: HDFS)
● Task nodes - (extends compute)
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
EMR lesson learned...
● Bigger instance type is good architecture
● Use spot instances - for the tasks only.
● Don't always use TEZ (MR? Spark?)
● Make sure your choose instance with network optimized
● Resize cluster is not recommended
● Bootstrap to automate cluster upon provisioning
● Use Steps to automate steps on running cluster
● Use Glue to share Hive MetaStore
● Good Cost reduction article on EMR
So use EMR for ...
● Most dominant
○ Hive
○ Spark
○ Presto
● And many more….
● Good for:
○ Data transformation
○ Data modeling
○ Batch
○ Machine learning
Hive
● SQL over hadoop.
● Engine: spark, tez, MR
● JDBC / ODBC
● Not good when need to shuffle.
● Not peta scale.
● SerDe json, parquet,regex,text etc.
● Dynamic partitions
● Insert overwrite
● Data Transformation
● Convert to Columnar
Presto
● SQL over hadoop
● Not good always for join on 2 large tables.
● Limited by memory
● Not fault tolerant like hive.
● Optimized for ad hoc queries
● No insert overwrite
● No dynamic partitions.
● Has some connectors : redshift and more
● https://amazon-aws-big-data-
demystified.ninja/2018/07/02/aws-emr-
presto-demystified-everything-you-
wanted-to-know-about-presto/
Pig
● Distributed Shell scripting
● Generating SQL like operations.
● Engine: MR, Tez
● S3, DynamoDB access
● Use Case: for data science who don't know
SQL, for system people, for those who want
to avoid java/scala
● Fair fight compared to hive in term of
performance only
● Good for unstructured files ETL : file to file ,
and use sqoop.
Hue
● Hadoop user experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate metascore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, oozier editor, Pig Editor
Orchestration
● EMR Oozie
○ Opens source workflow
■ Workflow: graph of action
■ Coordinator: scheduler jobs
○ Support: hive, sqoop , spark etc.
● Other options: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Visualization
Data Modeling
A layer in your big data architecture
designed to Model data: Joins,
Aggregations, nightly jobs, Machine
learning
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Spark
● In memory
● X10 to X100 times faster from hive
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark Streaming
● Near real time (1 sec latency)
● like batch of 1sec windows
● Streaming jobs with API
● DIY = Not relevant to us...
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark flavours
● Standalone
● With yarn
● With mesos
Spark SQL
● Same syntax as hive
● Optional JDBC via thrift
● Non trivial learning curve
● Upto X10 faster than hive.
● Works well with Zeppelin (out of the box)
● Does not replaces Hive
● Spark not always faster than hive
● insert overwrite -
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● works on top of Hive/ Spark SQL
● Inside EMR.
● Uses in the background:
○ Shiro
○ Livy
Redshift
● OLAP, not OLTP→ analytics , not transaction
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create slow queue for queries
○ which are long lasting.
● DO NOT USE FOR transformation.
● Good for : DW, Complex Joins.
EMR vs Redshift
● How much data loaded and unloaded?
● Which operations need to performed?
● Recycling data? → EMR
● History to be analyzed again and again ? → emr
● What the data needs to end up? BI?
● Use spectrum in some use cases. (aggregations)?
● Raw data? S3.
● When to use emr and when redshift?
Hive VS. Redshift
● Amount of concurrency ? low → hive, high → redshift
● Access to customers? Redshift? athena?
● Transformation, Unstructured , batch, ETL → hive.
● Peta scale ? redshift
● Complex joins → Redshift
Sage Maker
● Web notebook (jupiter
based) for data science
● Connects to all your data
sources (s3,athena etc)
● Help you manage the
entire lifecycle machine
learning
● Managed Service
● Used to create a ML to
predict cookie gender
AWS Glue
Shared meta store
Helps with some data transformation (managed service)
Automatic Schema discovery
AWD RDS (aurora, postgres, mysql)
● We used RDS aurora as Operational DB
● We did not use it for big data analytics although it supports upto 64Tb
● It is row based.
● The syntax is missing analytical functions
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Data
Presentation Used ONLY for presenting data for
operational applications or BI, Use
managed service to ensure HA.
Big Data Generic Architecture | Presentation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Athena
● Presto SQL
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
● Now supports CTAS (No inserts are supported)
● Good for:
○ Read only SQL,
○ Ad hoc query,
○ low cost,
○ Managed
● Good cost reduction article on athena
Summery Take Away Message &
Action Items
Big Data Generic Architecture | Summary
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Big Data Architecture ?
Faster! Cheaper! Simpler!
How to get started | Call for Action
Lectures: AWS Big Data Demystified lectures #1 until #4
AWS Big Data Demystified Meetup Big Data Demystified meetup
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://big-data-demystified.ninja/
● Join our meetups subscribe to youtube channels
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.meetup.com/Big-Data-Demystified/
○ Big Data Demystified YouTube
○ AWS Big Data Demystified YouTube

Mais conteúdo relacionado

Mais procurados

Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceHBaseCon
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...MongoDB
 
Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...DataWorks Summit
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsRendy Bambang Junior
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data PipelineManish Kumar
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon
 

Mais procurados (20)

Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
 
Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and Go
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analytics
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgent
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 

Semelhante a Big Data in 200 km/h | AWS Big Data Demystified #1.3

Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"Rob Winters
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 

Semelhante a Big Data in 200 km/h | AWS Big Data Demystified #1.3 (20)

Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 

Mais de Omid Vahdaty

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedOmid Vahdaty
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystifiedOmid Vahdaty
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo dbOmid Vahdaty
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSqlOmid Vahdaty
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process Omid Vahdaty
 

Mais de Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo db
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 

Último

Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Último (20)

Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Big Data in 200 km/h | AWS Big Data Demystified #1.3

  • 1. Big Data in 200KM/h Omid Vahdaty, Big Data Ninja
  • 2. Disclaimer ● I am not the best, I simply love what I do VERY much. ● You are more than welcome to challenge me or anything I have to say as I could be wrong. ● This Lecture has evolves over time, this is the 4th iteration. ● Feel Free to send me comments
  • 3. A little bit about Me
  • 4.
  • 5. In the Past(web,api, ops db, data warehouse) 5
  • 6. Then came Big Data... 6
  • 7. Then came the cloud... 7
  • 8. Then came the invoice ...
  • 10. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 11. MY AWS BIG DATA APPLICATION STACK With some “Myst”...
  • 13. Big Jargon & Basics concepts u should know https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs- and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/ ● What is Big Data? ● scale out / up ? ● structured/semi structured/unstructured data ? ● ACID? ● OLAP VS OLTP? == Analytics VS operational ● DIY = Do it Yourself ● PaaS = Platform as a service
  • 14. Big Data = When your data outgrows your infrastructure ability to process ● Volume (x TB processing per day) ● Velocity ( x GB/s) ● Variety (JSON, CSV, events etc) ● Veracity (how much of the data is accurate?)
  • 15. Challenges creating big data architecture? ● What is the business use case ? How fast do u need the insights? ○ 15 min - 24 hours delay and above → use batch ○ Less than 15 min? ■ Might be batch - depends data source is files or events?. ■ Streaming? ● Sub seconds delay? ● Sub minute delay? ■ Streaming with in flight analytics ? ○ How complex is the compute jobs? Aggregations? joins?
  • 16. Challenges creating big data architecture? ● What is the Velocity? ○ Under 100K events per second? Not a problem ○ Over 1M events per second? Costly. But doable. ○ Over 1B events per seconds? Not trivial at all. ● Volume ? ○ ~1TB a day ? Not a problem ○ Over ? it depends. ○ Over a petabyte? Well…. It depends. ● Veracity (how are you going to handle different data sources?) ○ Structured (CSV) ○ Semi structured (JSON,XML) ○ Unstructured (pictures, movies etc)
  • 17. Challenges creating big data architecture? ● Performance targets? ● Costs targets? ● Security restrictions? ● Regulation restriction? privacy? ● Which technology to choose? ● Datacenter or cloud? ● Latency? ● Throughput? ● Concurrency? ● Security Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 18. Cloud Architecture rules of thumb... ● Decoupling ● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud ○ Do use more b/c: maintenance ○ Training time ○ complexity/simplicity
  • 19. How to get started on big data architecture? Get answers to the below: 1. What is the business use case? a. Volume? velocity? Variety? vercity/ b. Did you map all of data sources? 2. Where should we build a data platform? a. What is the product? → requirements. b. Cloud? datacenter? 3. Architecture? a. DIY? Paas? , Pay as you go? Or fixed? Decoupled? b. Fast? Cheap? simple? 4. Did you Communicate you plans? 5. Did you map all known challenges?
  • 20. Where to build the data platform?
  • 21. Big Data Architecture: ● Data Ingestion ○ Online ■ messaging ■ Streaming ○ Offline ■ Batch ■ Performance aspects ● Data Transformation (Hive) ○ JSON, CSV, TXT, PARQUET, Binary ● Data Modeling - (R, ML, AI, DEEP, SPARK) ● Data Visualization (choose your poison) ● PII regulation + GPDR regulation ● And: Performance... Cost… Security… Simple... Cloud best practices...
  • 22. Big Data Generic Architecture Data Ingestion (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Presentation Text, RAW
  • 23. Data Ingestion A layer in your big data architecture designed to do one thing : ingest data via Batch or Streaming, I.e move (only) data from point A to point B. from source data to the next layer in the architecture (decoupled).
  • 24. Big Data Generic Architecture | Data Ingestion Data Ingestion Data Transformation Data Modeling Data Presentation
  • 25. S3 Considerations ● Security ○ at rest: server side S3-Managed Keys (SSE-S3) ○ at transit: SSL / VPN ○ Hardening: user, IP ACL, write permission only. ● Upload ○ AWS s3 cli ○ Multi part upload ○ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy ○ Consider S3 CLI Sync command instead of CP
  • 26. Sqoop - ETL ● Open source , part of EMR ● HDFS to RDMS and back. Via JDBC. ● E.g BiDirectional ETL from RDS to HDFS ● Unlikely use case: ETL from customer source operational DB.
  • 27. Flume & Kafka ● Opens source project for streaming & messaging ● Popular ● Generic ● Good practice for many use cases. (a meetup by it self) ● Highly durable, scalable, extension etc. ● Downside : DIY, Non trivial to get started
  • 28. Data Transfer Options ● Direct Connect (4GB/s?) ● For all other use case ○ S3 multipart upload ○ Compression ○ Security ■ Data at motion ■ Data at rest
  • 29. Kinesis family of products ● Kinesis Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to : s3,redshift, Dynamo, ... ○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. ○ Not managed! ● Kinesis Analytics - in flight analytics. ● Kin. Firehose - Park you data @ destination.
  • 30. Kinesis Firehose - for Data parking ● Not for fast lane - no in flight analytics ● Ingest , transform and load to: ○ Kinesis ○ S3 ○ Redshift ○ elastic search ● Managed Service
  • 31. Comparison of Kinesis products ● Streams ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Buffering 60 sec minimum. ○ For larger “events”
  • 32. Data Transformation A layer in your big data architecture designed to : Transform and Cleanse data (row data to columnar data and convert data types, Fix bugs in data)
  • 33. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 34. Quick word about hadoop Ecosystem
  • 35. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin
  • 36. EMR Architecture ● Master node ● Core nodes - like data nodes (with storage: HDFS) ● Task nodes - (extends compute) ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 37. EMR lesson learned... ● Bigger instance type is good architecture ● Use spot instances - for the tasks only. ● Don't always use TEZ (MR? Spark?) ● Make sure your choose instance with network optimized ● Resize cluster is not recommended ● Bootstrap to automate cluster upon provisioning ● Use Steps to automate steps on running cluster ● Use Glue to share Hive MetaStore ● Good Cost reduction article on EMR
  • 38. So use EMR for ... ● Most dominant ○ Hive ○ Spark ○ Presto ● And many more…. ● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning
  • 39. Hive ● SQL over hadoop. ● Engine: spark, tez, MR ● JDBC / ODBC ● Not good when need to shuffle. ● Not peta scale. ● SerDe json, parquet,regex,text etc. ● Dynamic partitions ● Insert overwrite ● Data Transformation ● Convert to Columnar
  • 40. Presto ● SQL over hadoop ● Not good always for join on 2 large tables. ● Limited by memory ● Not fault tolerant like hive. ● Optimized for ad hoc queries ● No insert overwrite ● No dynamic partitions. ● Has some connectors : redshift and more ● https://amazon-aws-big-data- demystified.ninja/2018/07/02/aws-emr- presto-demystified-everything-you- wanted-to-know-about-presto/
  • 41. Pig ● Distributed Shell scripting ● Generating SQL like operations. ● Engine: MR, Tez ● S3, DynamoDB access ● Use Case: for data science who don't know SQL, for system people, for those who want to avoid java/scala ● Fair fight compared to hive in term of performance only ● Good for unstructured files ETL : file to file , and use sqoop.
  • 42. Hue ● Hadoop user experience ● Logs in real time and failures. ● Multiple users ● Native access to S3. ● File browser to HDFS. ● Manipulate metascore ● Job Browser ● Query editor ● Hbase browser ● Sqoop editor, oozier editor, Pig Editor
  • 43. Orchestration ● EMR Oozie ○ Opens source workflow ■ Workflow: graph of action ■ Coordinator: scheduler jobs ○ Support: hive, sqoop , spark etc. ● Other options: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
  • 44. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Visualization
  • 45. Data Modeling A layer in your big data architecture designed to Model data: Joins, Aggregations, nightly jobs, Machine learning
  • 46. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 47. Spark ● In memory ● X10 to X100 times faster from hive ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (ML lib) ● Spark GraphX (DB graphs) ● SparkR
  • 48. Spark Streaming ● Near real time (1 sec latency) ● like batch of 1sec windows ● Streaming jobs with API ● DIY = Not relevant to us...
  • 49. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: java, scala, python, sparkR
  • 50. Spark flavours ● Standalone ● With yarn ● With mesos
  • 51. Spark SQL ● Same syntax as hive ● Optional JDBC via thrift ● Non trivial learning curve ● Upto X10 faster than hive. ● Works well with Zeppelin (out of the box) ● Does not replaces Hive ● Spark not always faster than hive ● insert overwrite -
  • 52. Apache Zeppelin ● Notebook - visualizer ● Built in spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● works on top of Hive/ Spark SQL ● Inside EMR. ● Uses in the background: ○ Shiro ○ Livy
  • 53. Redshift ● OLAP, not OLTP→ analytics , not transaction ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte Scale ● MPP ● Can create slow queue for queries ○ which are long lasting. ● DO NOT USE FOR transformation. ● Good for : DW, Complex Joins.
  • 54. EMR vs Redshift ● How much data loaded and unloaded? ● Which operations need to performed? ● Recycling data? → EMR ● History to be analyzed again and again ? → emr ● What the data needs to end up? BI? ● Use spectrum in some use cases. (aggregations)? ● Raw data? S3. ● When to use emr and when redshift?
  • 55. Hive VS. Redshift ● Amount of concurrency ? low → hive, high → redshift ● Access to customers? Redshift? athena? ● Transformation, Unstructured , batch, ETL → hive. ● Peta scale ? redshift ● Complex joins → Redshift
  • 56. Sage Maker ● Web notebook (jupiter based) for data science ● Connects to all your data sources (s3,athena etc) ● Help you manage the entire lifecycle machine learning ● Managed Service ● Used to create a ML to predict cookie gender
  • 57. AWS Glue Shared meta store Helps with some data transformation (managed service) Automatic Schema discovery
  • 58. AWD RDS (aurora, postgres, mysql) ● We used RDS aurora as Operational DB ● We did not use it for big data analytics although it supports upto 64Tb ● It is row based. ● The syntax is missing analytical functions
  • 59. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 60. Data Presentation Used ONLY for presenting data for operational applications or BI, Use managed service to ensure HA.
  • 61. Big Data Generic Architecture | Presentation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 62. Athena ● Presto SQL ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions ● Now supports CTAS (No inserts are supported) ● Good for: ○ Read only SQL, ○ Ad hoc query, ○ low cost, ○ Managed ● Good cost reduction article on athena
  • 63. Summery Take Away Message & Action Items
  • 64. Big Data Generic Architecture | Summary Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 65. Big Data Architecture ? Faster! Cheaper! Simpler!
  • 66. How to get started | Call for Action Lectures: AWS Big Data Demystified lectures #1 until #4 AWS Big Data Demystified Meetup Big Data Demystified meetup
  • 67. Stay in touch... ● Omid Vahdaty ● +972-54-2384178 ● https://big-data-demystified.ninja/ ● Join our meetups subscribe to youtube channels ○ https://www.meetup.com/AWS-Big-Data-Demystified/ ○ https://www.meetup.com/Big-Data-Demystified/ ○ Big Data Demystified YouTube ○ AWS Big Data Demystified YouTube