Real-World Cassandra at ShareThis

Use Cases, Data Modeling, and Hector

ShareThis + Our Customers: Keys to Unlocking Social

1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES)

2. TAKE YOUR SOCIAL INVENTORY TO MARKET

3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY

Largest Ecosystem For Sharing and Engagement Across The Web

120 SOCIAL CHANNELS

SHARETHIS ECOSYSTEM
211 MILLION PEOPLE
(95.1% of the web)

2.4 MILLION PUBLISHERS

Source: ComScore U.S. January 2013; internal numbers, January 2013

Data Modeling and Why it Matters (Keep it even, Keep it slice-able)

A New Product: SnapSets

3 - x1.large

Use Case: SnapSets, A New Product

Use Case: SnapSets, A New Product (Continued)
CF: Users (userId)
    meta:first_name=Ronald
    meta:last_name=Melencio
    meta:username=ronsharethis
    scrapbook:timestamp:scrapbookId:name=Scrapbook 1
    scrapbook:timestamp:scrapbookId:date_created=Jan 10
    url1:sid:clipID={LOCATION DATA}
    url1:sid:456={LOCATION DATA}
CF: Scrapbooks (scrapbookId)
    clip:timestamp:clipId:url=sharethis.com
    clip:timestamp:clipId:title=Clip 1
    clip:timestamp:clipId:likes=10
CF: Clip (clipId)
    comment:timestamp:commentId={"name":"Ronald","timestamp":"jan 10","comment":"hi"}
CF: Stats (user:userId,application,publisher:pubId)
    meta:total_scrapbooks=1
    meta:total_clips=100
    meta:total_scrapbook_comments=100
    scrapbook:timestamp:scrapbookId:total_comments=10
    scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:likes=10
    scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:dislikes=10
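
The colon-separated column names above are what make a row slice-able: columns sort by name, so a shared prefix can be read as one contiguous range. The sketch below simulates a single Users row with a sorted in-memory map rather than a real cluster. The class name, the timestamp value and the id "sb1" are made up for illustration, and the sort order only matches a byte-ordered comparator for ASCII names.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Simulates one wide row with a sorted map (hypothetical helper, not part of Hector).
class CompositeColumnSketch {

    // Joins name parts with ':' in the same style as the column names on the slide.
    static String columnName(String... parts) {
        return String.join(":", parts);
    }

    // Returns every column whose name starts with prefix + ":".
    static SortedMap<String, String> slice(TreeMap<String, String> row, String prefix) {
        String start = prefix + ":";
        String end = prefix + ";"; // ';' is the character right after ':' in ASCII
        return row.subMap(start, end); // start inclusive, end exclusive
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put(columnName("meta", "first_name"), "Ronald");
        row.put(columnName("meta", "username"), "ronsharethis");
        // Timestamp and scrapbook id are illustrative values only.
        row.put(columnName("scrapbook", "1357776000", "sb1", "name"), "Scrapbook 1");
        row.put(columnName("scrapbook", "1357776000", "sb1", "date_created"), "Jan 10");

        System.out.println(slice(row, "meta").keySet());
        System.out.println(slice(row, "scrapbook:1357776000:sb1").keySet());
    }
}
```

Timestamps embedded in names sort as text, so in a layout like this they would need a fixed width (for example zero-padded) to keep chronological order.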

High Velocity Reads and Writes: Count Service

9 – hi1.4xlarge
9 – x1.large

Use Case: Count Service for URLs
● 1 Billion Pageviews per day ≈ 12k pageviews per second
● 60 Million Social Referrals per day ≈ 700 social referrals per second
● 1 Million Shares per day ≈ 12 shares per second
● No expiration of Data* (3bn rows)
● Requires the minimum latency possible
● Multiple read requests per page on blogs
● Normalize and Hash the URL for a row key
● Each social channel is a column
● Retrieve the whole row for counts
● Fix it by cheating ^_^ *
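
A minimal sketch of "normalize and hash the URL for a row key", using only the JDK. The normalization rules here (lower-case scheme and host, drop port, query string and fragment, drop a trailing slash) and the choice of MD5 are assumptions for illustration; the slides do not say which rules or hash the count service actually used.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: turns a page URL into a fixed-length row key.
class UrlRowKeySketch {

    // Assumed normalization; real rules depend on which URLs should share a counter.
    static String normalize(String url) {
        URI uri = URI.create(url.trim());
        String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return scheme + "://" + host + path;
    }

    // Hex-encoded MD5 of the normalized URL: 32 characters, evenly spread across the ring.
    static String rowKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(url).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://ShareThis.com/Blog/?utm=1#top"));
        System.out.println(rowKey("HTTP://ShareThis.com/Blog/?utm=1#top"));
    }
}
```

With one row per hashed URL and one column per social channel, a single whole-row read returns every count for a page.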

Insights that Matter – Your Social Analytics Dashboard
Timely Social Analytics
Benchmark your social engagement with SQI
Identify popular articles
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.
Uncover which social channels are driving the most social traffic

12 - x1.large

Use Case: Loading Processed Batch Data
● Backend Hadoop stack for processing analytics
● 58 JSON schemas map tabular data to key/value storage for slicing
● MongoDB* did not scale for frequent row-level writes on the same table
● Needed to maintain read throughput during write spikes when analytics jobs finished
● No TTL* - Works daily, doesn't work hourly
● Switching from Astyanax to Hector
● Using a Hector client through Java APIs

Use Case: Loading Processed Batch Data (continued)
{
  "schema":
  [
    {"column_name":"publisher","column_type":"UTF8Type","column_level":"common","column_master":""},
    {"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""},
    {"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""},
    {"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""}
  ],
  "row_key_format": "publisher:domain:month",
  "column_family_name": "sqi_table"
}

CF -> Data Type
Row -> Publisher:domain:timestamp
Columns -> master:slave = value (topics, categories, urls, timestamps, etc)
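
One way to read the schema above, sketched in plain Java: the row key is built by substituting record values into row_key_format, and each composite_slave column is stored under a name made from its master's value plus the slave's own name. That reading of "master:slave = value" is an assumption, and the class name and sample record are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapper from one tabular record to a row key and composite columns.
class SchemaMappingSketch {

    // Expands a format such as "publisher:domain:month" using values from the record.
    static String rowKey(String rowKeyFormat, Map<String, String> record) {
        StringBuilder key = new StringBuilder();
        String[] fields = rowKeyFormat.split(":");
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                key.append(':');
            }
            key.append(record.get(fields[i]));
        }
        return key.toString();
    }

    // Builds "masterValue:slaveName" -> slaveValue for each composite_slave field.
    static Map<String, String> compositeColumns(
            String masterField, String[] slaveFields, Map<String, String> record) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String slave : slaveFields) {
            columns.put(record.get(masterField) + ":" + slave, record.get(slave));
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("publisher", "pub1");          // sample values, not real data
        record.put("domain", "example.com");
        record.put("month", "2013-01");
        record.put("category", "sports");
        record.put("sqi", "0.8");
        record.put("percenta", "0.25");

        System.out.println(rowKey("publisher:domain:month", record));
        System.out.println(compositeColumns("category", new String[] {"sqi", "percenta"}, record));
    }
}
```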

Real Time Social Analytics
Benchmark your social engagement with SQI
Identify trending articles in real-time
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.

12 cc1.4xlarge


Insights that Matter – And aren't accessible

● Too many columns – unbounded url / channel sets
● Cascading failure
● Solutions:
    – Bigger Boxes – meh...
    – Split up the columns – split the rowkeys
        ● Hash Urls and keep stats separate
    – Split up the columns – split the CF
        ● Move URLs to their own space
    – Split up the columns – split the Keyspace
        ● Keyspace is a timestamp
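
As an illustration of "split the rowkeys", the sketch below spreads an unbounded set of URL columns over a fixed number of rows by appending a hash bucket to the base row key. This is one common way to bound row width, not necessarily what the team did; the class name, key layout and bucket count are assumptions.

```java
// Hypothetical helper: bounds row width by sharding one logical row into N physical rows.
class RowKeyBucketSketch {

    // Maps a URL to a stable bucket in [0, buckets).
    static int bucket(String url, int buckets) {
        return Math.floorMod(url.hashCode(), buckets);
    }

    // Row key for the physical row that holds this URL's columns.
    static String bucketedRowKey(String baseKey, String url, int buckets) {
        return baseKey + ":" + bucket(url, buckets);
    }

    public static void main(String[] args) {
        String baseKey = "pub1:example.com:2013-01"; // sample key, not real data
        System.out.println(bucketedRowKey(baseKey, "http://example.com/a", 16));
        System.out.println(bucketedRowKey(baseKey, "http://example.com/b", 16));
    }
}
```

The trade-off is on the read side: fetching everything for one base key means reading all the buckets and merging the results.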

Ask Good Data Modeling Questions

● How many rows will there be?
● How many columns per row will you need?
● How will you slice your data?
● What is the maximum number of rows?
● What is the maximum number of columns?
● Is your data relational?
● How long will your data live?

Hector

https://github.com/hector-client/hector/wiki/User-Guide

Hector Imports
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import me.prettyprint.cassandra.model.BasicColumnFamilyDefinition;
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ColumnSliceIterator;
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.cassandra.service.template.ColumnFamilyResult;
import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate;
import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.exceptions.HectorException;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.CounterQuery;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceCounterQuery;
import me.prettyprint.hector.api.query.SliceQuery;

Hector: Add a keyspace
public static Cluster getCluster(String name, String hosts) {
    return HFactory.getOrCreateCluster(name, hosts);
}

public static KeyspaceDefinition createKeyspaceDefinition(String keyspaceName, int replication) {
    return HFactory.createKeyspaceDefinition(
        keyspaceName,
        ThriftKsDef.DEF_STRATEGY_CLASS, // "org.apache.cassandra.locator.SimpleStrategy"
        replication,
        null // ArrayList of CF definitions
    );
}

public static void addKeyspace(Cluster cluster, KeyspaceDefinition ksDef) {
    KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(ksDef.getName());
    if (keyspaceDef == null) {
        cluster.addKeyspace(ksDef, true);
        System.out.println("Created keyspace: " + ksDef.getName());
    } else {
        System.err.println("Keyspace already exists");
    }
}

Hector: Define a CF

public static ColumnFamilyDefinition createGenericColumnFamilyDefinition(
        String ksName, String cfName, ComparatorType ctName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    columnFamilyDefinition.setDefaultValidationClass(ctName.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}

public static ColumnFamilyDefinition createCounterColumnFamilyDefinition(String ksName, String cfName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    columnFamilyDefinition.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}

Hector: Add a CF
// Obtain a Keyspace handle first:
// Keyspace k = HFactory.createKeyspace(nameString, cluster);

public static void addColumnFamily(Cluster cluster, Keyspace keyspace, ColumnFamilyDefinition cfDef) {
    KeyspaceDefinition ksDef = cluster.describeKeyspace(keyspace.getKeyspaceName());
    if (ksDef != null) {
        List<ColumnFamilyDefinition> list = ksDef.getCfDefs();
        String cfName = cfDef.getName();
        boolean exists = false;
        for (ColumnFamilyDefinition myCfDef : list) {
            if (myCfDef.getName().equals(cfName)) {
                exists = true;
                System.err.println("Found Column Family: " + cfName + ". Did not insert.");
            }
        }
        if (!exists) {
            cluster.addColumnFamily(cfDef, true);
            System.out.println("Created column family: " + cfDef.getName());
        }
    } else {
        System.err.println("Keyspace definition is null");
    }
}

Hector: Insert Column
public static void insertColumn(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String columnName, String columnValue) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    // Equivalent: HFactory.createColumn(columnName, columnValue, StringSerializer.get(), StringSerializer.get())
    HColumn<String, String> hCol = HFactory.createStringColumn(columnName, columnValue);
    mutator.insert(rowKey, cfName, hCol);
    mutator.execute();
}

public static void incrementCounter(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String counterColumnName) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    // Adds 1 to the named counter column
    mutator.insertCounter(
        rowKey, cfName, HFactory.createCounterColumn(counterColumnName, 1, StringSerializer.get()));
    mutator.execute();
}

Hector: Read Column

public static String getColumn(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String columnName) {
    ColumnQuery<String, String, String> query =
        HFactory.createColumnQuery(
            keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily(cfName).setKey(rowKey).setName(columnName);
    HColumn<String, String> value = query.execute().get();
    if (value != null) {
        return value.getValue();
    }
    return ""; // column not found
}

Hector: Read Counter

public static long getCounter(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String counterColumnName) {
    CounterQuery<String, String> query =
        HFactory.createCounterColumnQuery(keyspace, StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily(cfName).setKey(rowKey).setName(counterColumnName);
    HCounterColumn<String> counter = query.execute().get();
    if (counter != null) {
        return counter.getValue();
    }
    return 0; // counter not found
}

Hector: Read A Slice

public static Map<String, String> getSlice(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String start, String end,
        boolean reversed, int count) {
    SliceQuery<String, String, String> query =
        HFactory.createSliceQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    // for counters use HFactory.createCounterSliceQuery
    query.setColumnFamily(cfName);
    query.setKey(rowKey);
    query.setRange(start, end, reversed, count);
    Iterator<HColumn<String, String>> iter = query.execute().get().getColumns().iterator();
    Map<String, String> answer = new HashMap<String, String>();
    while (iter.hasNext()) {
        HColumn<String, String> temp = iter.next();
        answer.put(temp.getName(), temp.getValue());
    }
    return answer;
}

Hector: Read All Columns

public static Map<String, String> getAllValues(
        Cluster cluster, String keyspace,
        String cf, String rowkey) {
    HashMap<String, String> values = new HashMap<String, String>();
    Keyspace keyspaceObject = HFactory.createKeyspace(keyspace, cluster);
    SliceQuery<String, String, String> query =
        HFactory.createSliceQuery(
            keyspaceObject, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    // Empty start and end select the whole row; the count caps the result at 10000 columns
    query.setColumnFamily(cf).setKey(rowkey).setRange("", "", true, 10000);
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    Iterator<HColumn<String, String>> iter = result.get().getColumns().iterator();
    while (iter.hasNext()) {
        HColumn<String, String> current = iter.next();
        values.put(current.getName(), current.getValue());
    }
    return values;
}
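
The read above stops at 10000 columns, so a wider row needs paging: read a fixed-size slice, then start the next slice at the last column name seen. The sketch below shows that loop against a sorted in-memory map standing in for the row, so it runs without a cluster. The class and method names are hypothetical and this is not Hector's own paging code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Simulated paging over one wide row; a TreeMap stands in for the sorted columns.
class PagedSliceSketch {

    // Returns up to pageSize columns starting at 'start' (inclusive), like a slice with a count.
    static List<Map.Entry<String, String>> slice(
            NavigableMap<String, String> row, String start, int pageSize) {
        List<Map.Entry<String, String>> page = new ArrayList<>();
        for (Map.Entry<String, String> e : row.tailMap(start, true).entrySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(e);
        }
        return page;
    }

    // Reads every column name page by page; each page restarts at the last column seen.
    static List<String> readAllNames(NavigableMap<String, String> row, int pageSize) {
        if (pageSize < 2) {
            throw new IllegalArgumentException("pageSize must be at least 2 to make progress");
        }
        List<String> names = new ArrayList<>();
        String start = "";
        while (true) {
            List<Map.Entry<String, String>> page = slice(row, start, pageSize);
            for (Map.Entry<String, String> e : page) {
                // The first column of each later page repeats the previous page's last column.
                if (names.isEmpty() || !e.getKey().equals(names.get(names.size() - 1))) {
                    names.add(e.getKey());
                }
            }
            if (page.size() < pageSize) {
                return names; // a short page means the row is exhausted
            }
            start = page.get(page.size() - 1).getKey();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (String name : new String[] {"a", "b", "c", "d", "e"}) {
            row.put(name, "value-" + name);
        }
        System.out.println(readAllNames(row, 2));
    }
}
```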

Hector: DANGER

private static void dropAllKeyspaces(Cluster cluster) {
    for (KeyspaceDefinition ksDef : cluster.describeKeyspaces()) {
        if (!(ksDef.getName().equals("system") || ksDef.getName().equals("OpsCenter"))) {
            cluster.dropKeyspace(ksDef.getName(), true);
            System.out.println("Dropped keyspace: " + ksDef.getName());
        }
    }
}

private static void dropKeyspace(Cluster cluster, String keyspace) {
    KeyspaceDefinition ksDef = createKeyspaceDefinition(keyspace, Hector.replication);
    cluster.dropKeyspace(ksDef.getName(), true);
    System.out.println("Dropped keyspace: " + ksDef.getName());
}

private static void dropColumnFamily(Cluster cluster, String keyspace, String cf) {
    cluster.dropColumnFamily(keyspace, cf);
    System.out.println("Dropped Column Family: " + cf);
}

Conclusions

● Data Modeling is Important
● Use Cassandra for write throughput
● Keep your ring even and your data slice-able
● Wrap your libraries and switch when you need to

We're hiring: http://www.sharethis.com/about/careers

● Work with REAL big data, billions of requests per day
● Work on products that millions of people see and interact with on a daily basis
● Work with a real-time pipeline, machine learning, complex user models
● #1 fastest growing company in San Francisco
● free lunches
● ... and of course work with a bunch of fun, smart people and PhDs
