Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017


Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data. https://www.bigdataspain.org/2017/talk/tuning-java-driver-for-apache-cassandra Big Data Spain 2017, November 16th - 17th, Kinépolis Madrid


Tuning Java Driver for
Apache Cassandra
November 2017
Nenad Bozic
@NenadBozicNs
nenad.bozic@smartcat.io
SmartCat
www.smartcat.io

Agenda
• intro to Apache Cassandra
• tuning options in driver
• use cases
• takeaways and Q&A

Cassandra Overview
• partitioned data with tunable consistency
• replication factor: how many replicas store each partition
• masterless architecture
• native multi-datacenter support

Architecture
[Diagram: a client request and its response on a cluster with consistency level 1 (ONE) and replication factor 3]

Data Modeling
• query based modeling
• data is denormalized
• data is duplicated

Use Cases
Good fit:
• when high availability is crucial and eventual consistency is tolerable
• event sourcing
• logging continuous streams of data
• deep visitor analytics
Poor fit:
• early prototyping with significant query changes
• referential integrity required
• dynamic access patterns on data

Load balancing
https://www.slideshare.net/planetcassandra/apache-cassandra-and-drivers

Data Center Aware Load Balancing
https://www.slideshare.net/planetcassandra/apache-cassandra-and-drivers

Token Aware Load Balancing
https://www.slideshare.net/planetcassandra/apache-cassandra-and-drivers
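The idea behind token-aware routing can be sketched in plain Java: hash the partition key to a token, walk the ring clockwise, and send the request to a node that already holds the data. This is a simplified illustration, not the driver's implementation. The class TokenRingSketch and its methods are hypothetical, each node owns a single token (no virtual nodes), and CRC32 is only a stand-in for Cassandra's Murmur3-based partitioner.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Toy token ring: maps a partition key to the nodes that hold its replicas. */
final class TokenRingSketch {
    // token -> node that owns the range ending at that token (inclusive)
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Registers a node as the owner of the given token (one token per node). */
    void addNode(long token, String node) {
        ring.put(token, node);
    }

    /** Stand-in hash; Cassandra's default partitioner uses Murmur3, not CRC32. */
    static long tokenFor(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Walks the ring clockwise from the token and returns the first RF nodes. */
    List<String> replicasForToken(long token, int replicationFactor) {
        List<String> ordered = new ArrayList<>(ring.tailMap(token, true).values());
        ordered.addAll(ring.headMap(token, false).values()); // wrap around the ring
        int count = Math.min(replicationFactor, ordered.size());
        return new ArrayList<>(ordered.subList(0, count));
    }

    /** Replicas for a partition key; a token-aware client would contact one of these. */
    List<String> replicasFor(String partitionKey, int replicationFactor) {
        return replicasForToken(tokenFor(partitionKey), replicationFactor);
    }
}
```

Picking the coordinator from this replica list saves a network hop, because the coordinator can serve the data itself instead of forwarding the request.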

Pooling options
• the driver communicates with the cluster through a pool of connections per host
• defaults changed between protocol V2 and V3 (core connections lowered to 1)
• allowing more requests per connection can put more load on the cluster
• monitor in-flight queries on the driver side and tune for your use case
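As a rough illustration of what a per-connection request cap and an in-flight gauge do, here is a minimal plain-Java sketch. The class InFlightLimiter is hypothetical and is not part of the driver; in the 3.x Java driver the corresponding limits are configured through PoolingOptions.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Caps and counts the requests that are in flight on one connection. */
final class InFlightLimiter {
    private final int maxRequests;
    private final Semaphore permits;
    private final AtomicInteger rejected = new AtomicInteger();

    InFlightLimiter(int maxRequestsPerConnection) {
        this.maxRequests = maxRequestsPerConnection;
        this.permits = new Semaphore(maxRequestsPerConnection);
    }

    /** Returns true if the request may be sent, false if the connection is saturated. */
    boolean tryStart() {
        boolean accepted = permits.tryAcquire();
        if (!accepted) {
            rejected.incrementAndGet();
        }
        return accepted;
    }

    /** Must be called exactly once for every successful tryStart(). */
    void finish() {
        permits.release();
    }

    /** Gauge to export to monitoring: requests sent but not yet answered. */
    int inFlight() {
        return maxRequests - permits.availablePermits();
    }

    /** Counter to export to monitoring: requests refused because the cap was reached. */
    int rejectedCount() {
        return rejected.get();
    }
}
```

If the in-flight gauge sits near the cap while the cluster is healthy, raising the cap is reasonable; if the cluster is already struggling, a higher cap only moves the queue from the client to the server.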

Speculative executions
• send the same query to additional nodes if no response arrives within a configured time
http://docs.datastax.com/en/developer/java-driver/3.1/manual/speculative_execution/

Speculative executions
• constant speculative execution policy
• percentile speculative execution policy
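The constant-delay variant can be modeled with CompletableFuture: the query goes to the first node, and each time the delay passes without an answer it is also sent to the next node, with the first successful response winning. This is a minimal sketch under stated assumptions. SpeculativeExecutionSketch and its parameters are hypothetical, the query function stands in for a real asynchronous call, and the percentile-based policy is not modeled.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SpeculativeExecutionSketch {
    /**
     * Queries nodes.get(0) immediately and nodes.get(i) after i * delayMillis,
     * skipping attempts once a response has arrived. Fails only if every attempt fails.
     */
    static <T> CompletableFuture<T> execute(
            List<String> nodes,
            Function<String, CompletableFuture<T>> query,
            long delayMillis,
            ScheduledExecutorService scheduler) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        CompletableFuture<T> result = new CompletableFuture<>();
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < nodes.size(); i++) {
            String node = nodes.get(i);
            Runnable attempt = () -> {
                if (result.isDone()) {
                    return; // an earlier attempt already answered
                }
                query.apply(node).whenComplete((value, error) -> {
                    if (error == null) {
                        result.complete(value); // first success wins, later ones are ignored
                    } else if (failures.incrementAndGet() == nodes.size()) {
                        result.completeExceptionally(error);
                    }
                });
            };
            if (i == 0) {
                attempt.run();
            } else {
                scheduler.schedule(attempt, i * delayMillis, TimeUnit.MILLISECONDS);
            }
        }
        return result;
    }
}
```

Every speculative attempt is extra work for the cluster, and the same statement may run more than once, so this approach suits idempotent queries on clusters with spare capacity.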

Timeouts
• driver read timeout vs. server read timeout
• driver settings apply to all queries, or can be set per query
• setReadTimeoutMillis and setConnectTimeoutMillis (SocketOptions)

Retry policies
• fail early and retry
• add retry policy or speculative execution
• downgrading retry policy if getting possibly inconsistent data is better than getting no data
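The "fail early and retry" idea combines a short per-attempt timeout with a bounded number of attempts. The following plain-Java sketch shows the control flow only. FailEarlyRetrySketch is a hypothetical helper, not the driver's RetryPolicy mechanism, and the supplier stands in for an asynchronous query.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class FailEarlyRetrySketch {
    /**
     * Runs the call, waiting at most timeoutMillis per attempt, and retries
     * on timeout or failure until maxAttempts is exhausted.
     */
    static <T> T callWithRetry(
            Supplier<CompletableFuture<T>> call, long timeoutMillis, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> future = call.get();
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // give up on this attempt and try again
                last = new RuntimeException("attempt " + attempt + " timed out", e);
            } catch (ExecutionException e) {
                last = new RuntimeException("attempt " + attempt + " failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("interrupted while waiting", e);
            }
        }
        throw last;
    }
}
```

A timed-out write may still have been applied on the server, so blind retries like this are only safe for idempotent statements.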

Click stream and IoT measurements
• visualize measurements from many devices
• fast access with tolerable inconsistencies
• DC-aware and token-aware policies to land on a local node that holds the data
• lower consistency level (ONE) or use downgrading retry policy
• use speculative executions to query more nodes if the cluster can handle the extra load

Mission critical data with tolerable performance
• stock data in warehouse used to compare with ERP system
• high consistency (read consistency + write consistency > replication factor)
• retry and reconnection policies are a must
• choose a lower max-requests-per-connection value so as not to overload the cluster
• set a lower read timeout to fail early and retry
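The consistency rule on this slide is simple arithmetic: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps the latest write on at least one replica. A small sketch follows; the class and method names are hypothetical.

```java
final class ConsistencySketch {
    /** Replicas required for QUORUM: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when read and write replica sets must overlap: R + W > RF. */
    static boolean readsSeeLatestWrite(
            int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With replication factor 3, QUORUM reads and QUORUM writes give 2 + 2 > 3, so the rule holds, while ONE and ONE give 1 + 1 = 2, which does not.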

Write heavy low latency read use case
• ad serving (store user analytics and serve ads fast)
• separate read and write paths so each can be tuned differently
• latency-aware policy on reads to always prefer the fastest-responding nodes
• lower the read timeout on both driver and server to fail early
• increase maximum requests per connection
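One way to picture a latency-aware choice is a moving average of response times per node, with reads routed to the node whose average is lowest. The sketch below uses a plain exponentially weighted moving average. LatencyTrackerSketch is hypothetical and much simpler than the driver's LatencyAwarePolicy, and it is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Tracks a moving average of latency per node and picks the fastest one. */
final class LatencyTrackerSketch {
    private final double alpha; // weight of the newest sample, between 0 and 1
    private final Map<String, Double> averageMillis = new HashMap<>();

    LatencyTrackerSketch(double alpha) {
        this.alpha = alpha;
    }

    /** Folds a new latency sample into the node's moving average. */
    void record(String node, double latencyMillis) {
        averageMillis.merge(node, latencyMillis,
                (old, sample) -> alpha * sample + (1 - alpha) * old);
    }

    /** Node with the lowest average latency, or empty if nothing was recorded. */
    Optional<String> fastestNode() {
        return averageMillis.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

A higher alpha reacts faster to a node that slows down, at the cost of more jitter in the routing decision.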

Conclusion and takeaways
• know your use case and know your database
• each tuning option requires good monitoring
• iterate: TEST, MEASURE, ADJUST

Links
• SmartCat Blog post - Tuning Java driver for Apache Cassandra - part 1
• SmartCat Blog post - Tuning Java driver for Apache Cassandra - part 2
• Use case example - Tuning for heavy write and low latency read scenario

Thank you
Nenad Bozic
@NenadBozicNs
SmartCat
www.smartcat.io
