SlideShare uma empresa Scribd logo
1 de 14
Baixar para ler offline
Cassandra & MR
and how we used it in
Agenda
- Why do we need Cassandra & MapReduce
- 3 notes about Cassandra
- 3 notes Cassandra + MapReduce
Real Time Bidding (RTB)
RTB
- How often?
10 - 30M auctions per day for mobile devices in Russia
e.g. 10-30Gb of data /day
- What do we need?
Be effective in showing Ads
- They call it "big data & data mining"
We decided to use Cassandra&Hadoop for that
1. Cassandra tokens
∆ = [T0, T4] - token range
ex. ∆ = [-2^63, +2^63]
Every time one writes (K,V) into Cassandra:
- ex. token(K) in [T2, T3]
- (K,V) will be put into node 3 (if replica 1)
2. Cassandra Load Balancing
Partitioner generates tokens for your keys
E.g. it creates token(K)
Cassandra offers the following partitioners:
● Murmur3Partitioner (default): Uniformly distributes data
across the cluster based on MurmurHash hash values.
● RandomPartitioner: Uniformly distributes data across the
cluster based on MD5 hash values.
● ByteOrderedPartitioner: Keeps an ordered distribution of data
lexically by key bytes
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the
right choice for new clusters in almost all cases.
http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
3. Cassandra indexes and knows
- Cassandra support common data formats
E.g. byte, string, long, double
- Cassandra support secondary indexes
E.g. you can select your data not bulky
- Cassandra knows how much data (records) in every
token range
Cassandra & Map-Reduce
Google says:
1. Cassandra is integrated with Map-Reduce
http://wiki.apache.org/cassandra/HadoopSupport
2. It is outdated
3. It is used for Hadoop 1.0.3 or whatever version
This means: Please install hadoop+mr cluster yourself
Cassandra & Map-Reduce (we want)
1. Cloudera Hadoop Distribution (CDH4)
Cloudera manager installs your cluster in couple of clicks
2. Up to date (Cassandra 1.1.x - 1.2.x)
Solution:
A) Take Cassandra sources from
http://cassandra.apache.org/download/
B) Take package org.apache.cassandra.hadoop and recompile it, having
dependencies from CDH4&Cassadnra[1.x]
And Jar is ready to go for your map-reduce jobs
1. Allocate your cluster
DataStax says:
To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes.
This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.
Why?
Because this:
works 100 times faster than this:
2. Number of map tasks
Job control parameter: InputSplitSize (default 65536)
Estimates how much data one mapper will receive
Every map task has it's own token range to read data from: [-2^63, +2^63] / number of map tasks
3. How job reads the data
JobControlParameter: RangeBatchSize (default: 4096)
Bulk volume to read including your filters (primary & secondary indexes)
Cassandra does filtering job on server side
( [-2^63, +2^63] / number of map tasks )
Pros:
1. Easy to manage (Cassandra cluster & cloudera manager is
2. Easy to index
3. Supports query language & data types support
Cons:
1. Scalable extremely expensive (every node should run cassandra + hadoop)
2. Reading is very slow
3. Reading big amount is impossible
Note: Netflix reading using cassandra to manage the data.
But their map-reduce jobs are reading sstable-files directly, avoiding Cassandra!
http://www.datastax.com/dev/blog/2012-in-review-performance
Conclusion

Mais conteúdo relacionado

Mais procurados

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedVed Mulkalwar
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedVed Mulkalwar
 
Gnocchi v4 - past and present
Gnocchi v4 - past and presentGnocchi v4 - past and present
Gnocchi v4 - past and presentGordon Chung
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergyniallmilton
 
spark stream - kafka - the right way
spark stream - kafka - the right way spark stream - kafka - the right way
spark stream - kafka - the right way Dori Waldman
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2AAKASH S
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2AAKASH S
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkmeDori Waldman
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over CassandraNoam Barkai
 
Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Jéferson Machado
 
Cloud burst 소개
Cloud burst 소개Cloud burst 소개
Cloud burst 소개주영 송
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbagGordon Chung
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)Gordon Chung
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learningbutest
 

Mais procurados (19)

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Gnocchi v3
Gnocchi v3Gnocchi v3
Gnocchi v3
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics ved
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics ved
 
Gnocchi v4 - past and present
Gnocchi v4 - past and presentGnocchi v4 - past and present
Gnocchi v4 - past and present
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
 
spark stream - kafka - the right way
spark stream - kafka - the right way spark stream - kafka - the right way
spark stream - kafka - the right way
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkme
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over Cassandra
 
Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Apache Spark Internals - Part 2
Apache Spark Internals - Part 2
 
Cloud burst 소개
Cloud burst 소개Cloud burst 소개
Cloud burst 소개
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbag
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
 

Destaque

Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timeAerospike, Inc.
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)bigdatagurus_meetup
 
How We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemHow We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemMongoDB
 
Algorithmic marketplace
Algorithmic marketplaceAlgorithmic marketplace
Algorithmic marketplacereducedata
 
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryMonitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryNader Ganayem
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysAerospike, Inc.
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?Aerospike, Inc.
 
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...Aerospike, Inc.
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of EngagementAerospike, Inc.
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesAerospike, Inc.
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Helena Edelson
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Helena Edelson
 
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...Karunakar Ravirala
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeEdward Capriolo
 

Destaque (19)

Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-time
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)
 
How We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemHow We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising System
 
Algorithmic marketplace
Algorithmic marketplaceAlgorithmic marketplace
Algorithmic marketplace
 
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryMonitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California Highways
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?
 
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
 
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
 
M6d cassandra summit
M6d cassandra summitM6d cassandra summit
M6d cassandra summit
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL store
 

Semelhante a Cassandra&map reduce

Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSDataStax Academy
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystemAlex Thompson
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinChristian Johannsen
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийGeeksLab Odessa
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 

Semelhante a Cassandra&map reduce (20)

Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Unit 2
Unit 2Unit 2
Unit 2
 
ClusterAnalysis
ClusterAnalysisClusterAnalysis
ClusterAnalysis
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский Дмитрий
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Scala+data
Scala+dataScala+data
Scala+data
 
Hadoop
HadoopHadoop
Hadoop
 

Último

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 

Último (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 

Cassandra&map reduce

  • 1. Cassandra & MR and how we used it in
  • 2. Agenda - Why do we need Cassandra & MapReduce - 3 notes about Cassandra - 3 notes Cassandra + MapReduce
  • 4. RTB - How often? 10 - 30M auctions per day for mobile devices in Russia e.g. 10-30Gb of data /day - What do we need? Be effective in showing Ads - They call it "big data & data mining" We decided to use Cassandra&Hadoop for that
  • 5. 1. Cassandra tokens ∆ = [T0, T4] - token range ex. ∆ = [-2^63, +2^63] Every time one writes (K,V) into Cassandra: - ex. token(K) in [T2, T3] - (K,V) will be put into node 3 (if replica 1)
  • 6. 2. Cassandra Load Balancing Partitioner generates tokens for your keys E.g. it creates token(K) Cassandra offers the following partitioners: ● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values. ● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values. ● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases. http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
  • 7. 3. Cassandra indexes and knows - Cassandra support common data formats E.g. byte, string, long, double - Cassandra support secondary indexes E.g. you can select your data not bulky - Cassandra knows how much data (records) in every token range
  • 8. Cassandra & Map-Reduce Google says: 1. Cassandra is integrated with Map-Reduce http://wiki.apache.org/cassandra/HadoopSupport 2. It is outdated 3. It is used for Hadoop 1.0.3 or whatever version This means: Please install hadoop+mr cluster yourself
  • 9. Cassandra & Map-Reduce (we want) 1. Cloudera Hadoop Distribution (CDH4) Cloudera manager installs your cluster in couple of clicks 2. Up to date (Cassandra 1.1.x - 1.2.x) Solution: A) Take Cassandra sources from http://cassandra.apache.org/download/ B) Take package org.apache.cassandra.hadoop and recompile it, having dependencies from CDH4&Cassadnra[1.x] And Jar is ready to go for your map-reduce jobs
  • 10. 1. Allocate your cluster DataStax says: To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node. Why?
  • 11. Because this: works 100 times faster than this:
  • 12. 2. Number of map tasks Job control parameter: InputSplitSize (default 65536) Estimates how much data one mapper will receive Every map task has it's own token range to read data from: [-2^63, +2^63] / number of map tasks
  • 13. 3. How job reads the data JobControlParameter: RangeBatchSize (default: 4096) Bulk volume to read including your filters (primary & secondary indexes) Cassandra does filtering job on server side ( [-2^63, +2^63] / number of map tasks )
  • 14. Pros: 1. Easy to manage (Cassandra cluster & cloudera manager is 2. Easy to index 3. Supports query language & data types support Cons: 1. Scalable extremely expensive (every node should run cassandra + hadoop) 2. Reading is very slow 3. Reading big amount is impossible Note: Netflix reading using cassandra to manage the data. But their map-reduce jobs are reading sstable-files directly, avoiding Cassandra! http://www.datastax.com/dev/blog/2012-in-review-performance Conclusion