2. Cassandra @walmartlabs
• Cassandra adoption at Walmart
– Using the DataStax distribution http://www.datastax.com/
• Introduction to the talks
• Hiring @labs
Walmart eCommerce
3. Cassandra @walmartlabs
• Introduction to the talks
– Walmartlabs
• @labs – Using Cassandra for real-time stream processing
• @services – Using Cassandra for product and items
– DataStax
• Data modeling with Cassandra
6. Data-stream computation
• “Big” data: MapReduce (Hadoop)
– Map and Reduce steps
– Batch process large input (e.g., from HDFS)
– Hadoop distributes computation
• Fast data: MapUpdate (Muppet)
– Map and Update steps
– Continuously process streaming input
– Muppet maintains computation
– Muppet manages memory/storage
2012 Cassandra for Real-Time Stream Processing @WalmartLabs
7. The MapReduce framework (Hadoop)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Reduce
– A function that combines all input for a particular key
• Application
– Map -> Reduce
8. The MapUpdate framework (Muppet)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Update
– A function that updates a slate using incoming events
• Application
– A directed graph of Mappers and Updaters
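The Map and Update roles above can be sketched in a few lines. This is a minimal illustration in Python, not the actual Muppet API; the function names and the in-memory slate store are hypothetical:

```python
# Minimal sketch of the MapUpdate pattern: a stateless map step emits
# keyed events, and an update step folds each event into a persistent
# per-key "slate". All names here are illustrative, not the Muppet API.

slates = {}  # key -> slate dict, standing in for the slate store

def map_event(event):
    """Stateless: turn one incoming event into zero or more keyed events."""
    timeslot = (event["created"] // 900) * 900  # 15-minute bucket
    yield (f"{event['retailer']}.{timeslot}", {"timeslot": timeslot})

def update(key, mapped):
    """Stateful: fold the mapped event into the slate for this key."""
    slate = slates.setdefault(key, {"count": 0})
    slate["timeslot"] = mapped["timeslot"]
    slate["count"] += 1

def process(stream):
    for event in stream:
        for key, mapped in map_event(event):
            update(key, mapped)

process([{"retailer": "Walmart", "created": 1000},
         {"retailer": "Walmart", "created": 1100}])
# both events fall into timeslot 900, so one slate ends up with count == 2
```

Running `process` on two events in the same 15-minute window coalesces them into a single slate, which is exactly the per-key aggregation an Updater provides.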
10. The Map (Foursquare::CheckinMapper)
sub map {
    my $self  = shift;
    my $event = shift;

    my $checkin = $event->{checkin};

    # Bucket the check-in time into a 15-minute (900-second) timeslot
    my $timeslot = int($checkin->{created} / 900) * 900;
    $event->{kosmix}->{timeslot} = $timeslot;
    $event->{kosmix}->{interval} = 900;

    # Match the venue name against the retailers we track
    my $venue_name = $checkin->{venue}->{name};
    my $retailer = 0;
    $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
    $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
    $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

    # Publish only retailer check-ins, keyed by retailer and timeslot
    if ($retailer) {
        $event->{kosmix}->{retailer} = $retailer;
        $self->publish("FoursquareRetailerCheckin", $event,
                       $retailer . "." . $timeslot);
    }
}
2012 ISD YBM Tech Fair - Big Fast Data @WalmartLabs
11. The Update (Foursquare::RetailerUpdater)
package Foursquare::RetailerUpdater;
use strict;
use Muppet::Updater;
our @ISA = qw( Muppet::Updater );

sub update {
    my $self   = shift;
    my $event  = shift;
    my $slate  = shift;
    my $config = shift;
    my $key    = shift;

    # Copy the event's timeslot metadata onto the slate and bump the count
    $slate->{timeslot} = $event->{kosmix}->{timeslot};
    $slate->{interval} = $event->{kosmix}->{interval};
    $slate->{retailer} = $event->{kosmix}->{retailer};
    $slate->{count}   += 1;
}
13. Muppet Processing
• Slates are 1-100 KB in size
• Local cache on Muppet Node
– 85% reads from cache
– Write-through, delayed-flush cache
– ~750K slates in cache per node
• Remote slates read through Muppet API
• Cassandra is the permanent datastore
• Slates tend to be updated and read in batches
– 10-50 at a time
15. Datastore Requirements
• Consistent, low response time
– 10ms or less for slate reads on average
• 1+ billion keys, with future expansion to perhaps 5-10 billion
• The value is in the whole data set
– Losing slates in small amounts is OK
• The datastore sees almost entirely “cold” reads
– The Muppet cache already absorbs ~85% of reads
– The datastore cannot rely on its own cache for performance
16. Why Cassandra?
• Timeframe: Early 2010
– Low latency: a rare feature among NoSQL stores
– Most NoSQL stores favor throughput over response time
– New “Best NoSQL evur!!” every 2 months
• Cassandra:
– Open source, active community, clustering as a core feature
• Simple is good
– Peer networking, Data file format, key distribution
• QUORUM consistency good middle ground
– AP focus in CAP aligns well with our needs
17. Why Cassandra – the Challenges
• Seeks are going to be difficult
– Overwrites mean nightly compactions
– Compactions blow up seek performance
– 90%+ cold reads means lots of seeks
– Head and body reads can produce a lot of seeks
• Slates as an atomic unit means no bulk column slice reads
• Likely to have unfavorable read:write ratio
– Early estimates: 1:3, or even worse
• Oh yeah, spinning disks hate seeks. Uh oh!
18. Frequent Row Overwrites in Cassandra
[Diagram: SSTable data files grow during the day. Reads hitting the tail incur few seeks, the body some seeks, and the head many seeks; a full compaction merges the files back together.]
19. Solution
• Cassandra + SSDs !!
• Expensive in terms of space, cheap in terms of IOPS
• Random seeks “free”
• Good performance during nightly compactions
20. Compaction Effect on System
21. How did Cassandra do?
• Average latency below 10ms, often 5-8ms
• Read:write ratio: 1:2
– Today, 1:1
• Compacting 500GB every night in under 4 hours
• Individual C* nodes handled over 1,500 reads and writes per second
• SSD cost: well worth it
22. Helping Cassandra out
• Muppet absorbs writes in local cache
– Flush on number of updates or on staleness
– Reduces write counts in Cassandra
– More efficient
• Compress all slates on Muppet nodes
– Easier to scale than C* nodes doing compression
– Less disk IO, less network
– CPU on Muppet nodes cheap
• Expire data via TTL
– Muppet apps decide data-keep length
• Java GC tuning flattened out CPU usage and GC pauses
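The “absorb writes in local cache” idea can be sketched as a delayed write-through cache that flushes a slate only after N updates or once the entry has aged past a staleness threshold. A minimal Python sketch; the class, parameters, and thresholds are illustrative assumptions, not the Muppet code:

```python
import time

class DelayedWriteCache:
    """Coalesce per-key updates in memory; flush to the datastore only
    after max_updates mutations or max_age_s seconds of staleness."""

    def __init__(self, flush_fn, max_updates=10, max_age_s=30.0):
        self.flush_fn = flush_fn      # called with (key, slate) on flush
        self.max_updates = max_updates
        self.max_age_s = max_age_s
        self.entries = {}             # key -> (slate, update_count, first_ts)

    def update(self, key, mutate):
        slate, updates, first = self.entries.get(key, ({}, 0, time.time()))
        mutate(slate)
        updates += 1
        if updates >= self.max_updates or time.time() - first >= self.max_age_s:
            self.flush_fn(key, slate)  # one datastore write for N updates
            self.entries.pop(key, None)
        else:
            self.entries[key] = (slate, updates, first)

flushed = []
cache = DelayedWriteCache(lambda k, s: flushed.append((k, s)), max_updates=3)
for _ in range(3):
    cache.update("Walmart.900",
                 lambda s: s.__setitem__("count", s.get("count", 0) + 1))
# three updates are coalesced into a single flush of {"count": 3}
```

This is the shape of the trade-off described above: fewer, larger writes reach Cassandra at the cost of a bounded window of unflushed data.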
23. Recent and Future
• Cassandra 0.8.x
– Faster compaction
– Stability
– Performance
• Cassandra 1.0.x
– Close to deployment @WML
– Leveled compaction (LevelDB-inspired) is very, very interesting
– Cache memory changes make large caches feasible!
– Row[Column] latest-only: very nice
– SSDs no longer needed? Possibly!
• Depends on cold seek requirements
24. Lessons
• Simple is usually faster and cheaper
– Add complexity only where needed
• Best solution can usually be made to work
• Proactive monitoring very important
– Trend graph everything relevant!
• Failing fast is better than succeeding late
• No substitute for understanding your platform
• Spend money when it will save you time and complexity
30. Flexible Categorization & Attribution
• The right kind of categorization and attribution is crucial to making sense of the enormous volume of product data
• Ultimate shopping experience
• Fine-grained analytics & planning
• Standards exist, but are severely limiting
• The product landscape changes dramatically every day
31. Other excerpts from the “shopping list”
• Look up and potentially match products and offerings by any combination of attributes and other dimensional criteria
• Item-item relationships & collections
• Hierarchical
• Graph
• Low latency, high throughput, highly available
• A scalable but unified system of record for all product and offering data
32. Translating to Cassandra
• Modeling options
1. Product as a “wide row” encompassing all offerings
2. Product assembled from several offering “fragment” rows
• Multiple Column Families
• Product fragments
• Custom consistency enabler
• Custom row caching at column family level
• Single keyspace to hold all core data fragments
• Tighter control of replication factor, strategy
• Additional keyspaces only for supporting data
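The two modeling options can be sketched with plain data structures. This is a hedged illustration in Python with made-up row keys; real code would go through a Cassandra client:

```python
# Option 1: one wide row per product, with every offering's attributes
# stored as columns of that single row (column names are illustrative).
wide_row = {
    "global:title":      "Widget",
    "US:Website:price":  "99.00",
    "US:Store:price":    "97.00",
}

# Option 2: per-dimension "fragment" rows, keyed by product id plus a
# dimension intersection, merged into a product view by the application.
fragments = {
    "widget-1#global":     {"title": "Widget"},
    "widget-1#US:Website": {"price": "99.00"},
    "widget-1#US:Store":   {"price": "97.00"},
}

def assemble(product_id, *intersections):
    """Merge the global fragment with the requested intersections,
    later fragments overriding earlier ones."""
    view = dict(fragments.get(f"{product_id}#global", {}))
    for ix in intersections:
        view.update(fragments.get(f"{product_id}#{ix}", {}))
    return view

offer = assemble("widget-1", "US:Website")
# returns {"title": "Widget", "price": "99.00"}
```

The fragment approach trades an application-side merge for smaller, independently writable rows, which is what makes per-fragment consistency and caching controls possible.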
33. Translating to Cassandra (contd.)
• Flexible, selective denormalization
• Secondary indexes for faster attribute-level queries
• Dynamic composites
• define flexible comparators for different column key levels
• capture 1-n levels of dimension intersections
• Column slicing to retrieve the right offerings
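Dynamic composites and column slicing can be illustrated by emulating composite column keys with sorted tuples. A Python sketch; the data and the helper are hypothetical, and a real application would issue slice queries through a client such as Hector or pycassa:

```python
from bisect import bisect_left, bisect_right

# One "row": columns keyed by (geo, channel, attribute) composite keys,
# capturing dimension intersections for a single product.
columns = {
    ("US", "Website", "price"): "99.00",
    ("US", "Website", "title"): "Widget",
    ("US", "Store",   "price"): "97.00",
    ("UK", "Website", "price"): "79.00",
}

def slice_columns(row, prefix):
    """Return columns whose composite key starts with `prefix`, in
    comparator (sorted) order, like a Cassandra column slice query."""
    keys = sorted(row)
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + (chr(0x10FFFF),))  # past-the-end sentinel
    return {k: row[k] for k in keys[lo:hi]}

us_web = slice_columns(columns, ("US", "Website"))
# returns the price and title columns for the US/Website intersection
```

Because the comparator keeps composite keys sorted, retrieving one dimension intersection is a contiguous slice rather than a scan, which is the property the slides rely on.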
34. The “Supporting Cast”
• Solr for additional indexing querying capabilities
• Mainly for attribute values
• Pattern matching
• Non-standard type comparisons
• Range checks
Products are inherently multi-dimensional and mostly multi-variant.
• Dimensions include:
– Business Unit (Walmart, Sam’s Club, ASDA, etc.)
– Geography (US, Canada, UK, etc.)
– Language (en_US, fr_CA, en_UK, etc.)
– Supply Chain (Owned Inventory, Direct Ship, Marketplace, etc.)
– Channel (Website, Retail/Store, Mobile, Facebook, etc.)
• Variants include:
– Size (S, M, L, XL, etc.)
– Color (Red, Green, Blue, etc.)
– Capacity (8 GB, 16 GB, etc.)
A true global product’s content is typically agnostic of any specific dimensions or variants. Items as we know and see them are actually product offerings, representing the content and behavior changes captured at every dimension and variant intersection. What you shop for is different from what you order, which is different from what you actually get!
Notice the need for the concept of dimensions and variants to capture and maintain data at each level. We ingest external catalogs even if we do not plan to sell the items right away, on a scale of hundreds of millions of unique SKUs. Base-variant and pre-configured bundles create order-of-magnitude increases in these estimates.
How do we give our customers access to the largest assortment in the world? As the digital arm of the world’s largest retailer, we need to not only give existing customers access to an endless shelf, but also offer a broad assortment that expands into the consideration set of non-Walmart shoppers; this means millions and millions of items. And we do so in a manner that is scalable and gives the consumer the right product information to make an informed decision about whether or not the product will meet their needs.
• Ultimate shopping experience: the customer finds everything he or she needs intuitively and in the right place, whether browsing or searching.
• Fine-grained analytics & planning: fine-grained analytics helps us put the right kinds of products on our shelves (physical or virtual) at the right levels of availability (inventory) and pricing.
• Standards exist, but are severely limiting: e.g., the GPC hierarchical classification and attribution structure.
• The product landscape changes dramatically every day: e.g., when tablets, a radically new form factor, unleash themselves on the market, we want to be able to adopt and sell them ASAP, not wait on a cumbersome change-control process caused by inflexible categorization and attribution.
• Ability to look up and potentially match products and offerings/items by any combination of attributes and other dimensional criteria.
• Item-item relationships & collections:
– Hierarchical: base-variants (e.g., iPhone 4S 16/32/64 GB)
– Graph: bundles (hard, fixed, inflexible, or configurable); components & ingredients; accessories & replacements; case packs & vendor packs
• Low latency, high throughput, highly available:
– Sellers typically update 40-50% of their offerings at some level each day
– Based on global projections, this may be comparable to the scale of social-media feeds
– Accept, process, search, retrieve, and analyze large volumes of data 24x7
• Multiple column families:
– Product fragments
– Custom consistency enabler: separate the “data” from the “index” or “event log”; use it to separate “work in progress” from the golden copy; implicit versioning and potential archiving/purging requirements; tunable consistency levels per API call (read/write)
– Custom row caching at the column-family level: optimize read-intensive vs. write-intensive column families differently
• Single keyspace to hold all data fragments:
– Tighter control of replication factor (DC + 3 or 5) and strategy (NetworkTopologyStrategy, formerly known as DatacenterShardStrategy)
• Additional keyspaces only for supporting data:
– Lower priority, loosely coupled or completely decoupled (e.g., purgeable audit & history logs)
• Flexible, selective denormalization:
– Bi-directional relationships: capture more than just foreign keys
– Indices
– Merge records to create the product offering in the application/DaaS layer
– Strike the right balance between optimizing the retrieval algorithm and space
• Secondary indexes for faster attribute-level queries, but simple queries only:
– Complex queries may need to be supplemented with other tools, as we will see later
• Dynamic composites:
– Capture 1-n levels of dimension intersections
– Define flexible comparators for different column-key levels
• Column slicing to retrieve the right offerings (i.e., intersections):
– No need to use the order-preserving partitioner
– Categorization and structure are handled completely outside the data store; Cassandra is only used to capture attribute values
• Solr for additional indexing and querying capabilities:
– Mainly at the attribute-value level: pattern matching, non-standard comparisons, and range checks
• HDFS/Hadoop for “extreme” bulk/batch operations:
– Large file/content streaming and parallel processing
– Corresponding response aggregation
– Hadoop “append”