Real-World Cassandra at ShareThis
Use Cases, Data Modeling, and Hector

ShareThis + Our Customers: Keys to Unlocking Social

1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES)

2. TAKE YOUR SOCIAL INVENTORY TO MARKET

3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY

Largest Ecosystem For Sharing and Engagement Across The Web

120 SOCIAL CHANNELS

SHARETHIS ECOSYSTEM
211 MILLION PEOPLE
(95.1% of the web)

2.4 MILLION PUBLISHERS

Source: ComScore U.S. January 2013; internal numbers, January 2013

Data Modeling and Why it Matters (Keep it even, Keep it slice-able)
Use Cases

A New Product: SnapSets

3 - x1.large
Use Case: SnapSets, A New Product
Use Case: SnapSets, A New Product (Continued)
CF: Users (userId)
meta:first_name=Ronald
meta:last_name=Melencio
meta:username=ronsharethis
scrapbook:timestamp:scrapbookId:name=Scrapbook 1
scrapbook:timestamp:scrapbookId:date_created=Jan 10
url1:sid:clipID={LOCATION DATA}
url1:sid:456={LOCATION DATA}
CF: Scrapbooks (scrapbookId)
clip:timestamp:clipId:url=sharethis.com
clip:timestamp:clipId:title=Clip 1
clip:timestamp:clipId:likes=10
CF: Clip (clipId)
comment:timestamp:commentId={"name":"Ronald","timestamp":"jan 10","comment":"hi"}
CF: Stats (user:userId,application,publisher:pubId)
meta:total_scrapbooks=1
meta:total_clips=100
meta:total_scrapbook_comments=100
scrapbook:timestamp:scrapbookId:total_comments=10
scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:likes=10
scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:dislikes=10
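The column names above are composite strings, so one row can hold several kinds of data and still be sliced by prefix. A minimal sketch of that idea, using a sorted in-memory map as a stand-in for a wide row: the class name, helper names, and sample values are hypothetical, and it assumes column names are compared as plain strings (so timestamps would need a fixed width to sort in time order).

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class WideRowSketch {

    /** Joins components into a composite column name, e.g. scrapbook:<ts>:<id>:name. */
    static String columnName(String... parts) {
        return String.join(":", parts);
    }

    /** Returns every column whose name starts with the given prefix (a "slice"). */
    static SortedMap<String, String> slice(TreeMap<String, String> row, String prefix) {
        // Character.MAX_VALUE sorts after any character used in these names,
        // so [prefix, prefix + MAX_VALUE) covers exactly the names with that prefix.
        return row.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        // Hypothetical contents of one row in the Users column family.
        TreeMap<String, String> userRow = new TreeMap<>();
        userRow.put(columnName("meta", "first_name"), "Ronald");
        userRow.put(columnName("meta", "username"), "ronsharethis");
        userRow.put(columnName("scrapbook", "1357776000", "42", "name"), "Scrapbook 1");
        userRow.put(columnName("scrapbook", "1357776000", "42", "date_created"), "Jan 10");

        // Only the scrapbook columns come back; the meta columns are skipped.
        for (Map.Entry<String, String> e : slice(userRow, "scrapbook:").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

The same prefix range is what the start and end arguments of a slice query would express against the real column family.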
Use Cases

High Velocity Reads and Writes: Count Service

9 – hi1.4xlarge
9 – x1.large
Use Case: Count Service for URLs
● 1 billion pageviews per day ≈ 12k pageviews per second
● 60 million social referrals per day ≈ 700 social referrals per second
● 1 million shares per day ≈ 12 shares per second
● No expiration of data* (3bn rows)
● Requires the minimum possible latency
● Multiple read requests per page on blogs
● Normalize and hash the URL for a row key
● Each social channel is a column
● Retrieve the whole row for counts
● Fix it by cheating ^_^ *
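The "normalize and hash the URL" step can be sketched as follows. The slides do not say which normalization rules or hash function were used, so everything here is an assumption: the class and method names are hypothetical, the rules (lower-case scheme and host, drop port, query string and fragment, trim a trailing slash) are only one plausible choice, and MD5 is picked simply because it gives fixed-length, evenly spread keys. The input is assumed to be an absolute URL with a host.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class UrlRowKey {

    /** Assumed rules: lower-case scheme and host, drop port/query/fragment, trim trailing slash. */
    static String normalize(String url) {
        URI uri = URI.create(url.trim());
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        } else if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return scheme + "://" + host + path;
    }

    /** Hex-encoded MD5 of the normalized URL, usable as a fixed-length row key. */
    static String rowKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(url).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest is unavailable", e);
        }
    }

    public static void main(String[] args) {
        String url = "HTTP://ShareThis.com/Blog/?utm=1#top";
        System.out.println(normalize(url));
        System.out.println(rowKey(url));
    }
}
```

Because variants of the same page collapse to one key, every channel counter for that page lands in a single row, which is what makes "retrieve the whole row for counts" a single read.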
Use Cases

Insights that Matter – Your Social Analytics Dashboard
Timely Social Analytics
Benchmark your social engagement with SQI
Identify popular articles
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.
Uncover which social channels are driving the most social traffic

12 - x1.large

Use Case: Loading Processed Batch Data
● Backend Hadoop stack for processing analytics
● 58 JSON schemas map tabular data to key/value storage for slicing
● MongoDB* did not scale for frequent row-level writes on the same table
● Needed to maintain read throughput during write spikes when analytics jobs finished
● No TTL* - works daily, doesn't work hourly
● Switching from Astyanax to Hector
● Using a Hector client through Java APIs
Use Case: Loading Processed Batch Data (continued)
{
  "schema":
  [
    {
      "column_name":"publisher",
      "column_type":"UTF8Type",
      "column_level":"common",
      "column_master":""
    },
    {"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""},
    {"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""},
    {"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""}
  ],
  "row_key_format": "publisher:domain:month",
  "column_family_name": "sqi_table"
}

CF -> Data Type
Row -> Publisher:domain:timestamp
Columns -> master:slave = value (topics, categories, urls, timestamps, etc)
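A rough sketch of how one tabular record might be mapped with such a schema. The loader itself is not shown in the slides, so this is an interpretation: it assumes the row key is built by substituting field values into row_key_format, and that each composite_slave column is stored under the name <master value>:<slave name>. The class name, method names, and sample record are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaMappingSketch {

    /** Replaces each field named in the format (e.g. publisher:domain:month) with its value. */
    static String rowKey(String rowKeyFormat, Map<String, String> record) {
        String[] fields = rowKeyFormat.split(":");
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                key.append(':');
            }
            key.append(record.get(fields[i]));
        }
        return key.toString();
    }

    /** Builds "<master value>:<slave name>" -> slave value for every slave column. */
    static Map<String, String> compositeColumns(
            Map<String, String> record, String master, String... slaves) {
        Map<String, String> columns = new LinkedHashMap<>();
        String masterValue = record.get(master);
        for (String slave : slaves) {
            columns.put(masterValue + ":" + slave, record.get(slave));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical record produced by the batch job.
        Map<String, String> record = new HashMap<>();
        record.put("publisher", "pub1");
        record.put("domain", "example.com");
        record.put("month", "2013-01");
        record.put("category", "sports");
        record.put("percenta", "0.25");
        record.put("percentb", "0.75");
        record.put("sqi", "1.5");

        System.out.println(rowKey("publisher:domain:month", record));
        System.out.println(compositeColumns(record, "category", "percenta", "percentb", "sqi"));
    }
}
```

With this layout, all values for one category share a column-name prefix, so a dashboard query for a single category becomes one column slice on one row.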
Use Cases

Insights that Matter – Your Social Analytics Dashboard
Real Time Social Analytics
Benchmark your social engagement with SQI
Identify trending articles in real-time
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.
Uncover which social channels are driving the most social traffic

12 cc1.4xlarge

Insights that Matter – And aren't accessible

● Too many columns – unbounded url / channel sets
● Cascading failure
● Solutions:
  – Bigger Boxes – meh...
  – Split up the columns – split the rowkeys
    ● Hash URLs and keep stats separate
  – Split up the columns – split the CF
    ● Move URLs to their own space
  – Split up the columns – split the Keyspace
    ● Keyspace is a timestamp
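One generic way to "split the rowkeys" when a row's column set is unbounded is to spread its columns over a fixed number of bucketed row keys. The slides do not describe the exact scheme ShareThis used, so the sketch below is only an illustration: the class, the bucket count, and the key format are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class RowKeyBuckets {

    /** Stable bucket in [0, buckets) for a column name. */
    static int bucketFor(String columnName, int buckets) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(columnName.hashCode(), buckets);
    }

    /** Row key that a given column should be written to, e.g. "pub1:2013-01:3". */
    static String bucketedRowKey(String baseKey, String columnName, int buckets) {
        return baseKey + ":" + bucketFor(columnName, buckets);
    }

    /** All row keys that must be read to reassemble the logical row. */
    static List<String> allRowKeys(String baseKey, int buckets) {
        List<String> keys = new ArrayList<>(buckets);
        for (int i = 0; i < buckets; i++) {
            keys.add(baseKey + ":" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        int buckets = 8; // hypothetical; chosen so that no single row grows without bound
        String baseKey = "pub1:2013-01";
        System.out.println(bucketedRowKey(baseKey, "http://example.com/article-1", buckets));
        System.out.println(allRowKeys(baseKey, buckets));
    }
}
```

The trade-off is that reading the whole logical row now takes one read per bucket, in exchange for bounded row width and a more even spread of data around the ring.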
Ask Good Data Modeling Questions
● How many rows will there be?
● How many columns per row will you need?
● How will you slice your data?
● What is the maximum number of rows?
● What is the maximum number of columns?
● Is your data relational?
● How long will your data live?

Hector

https://github.com/hector-client/hector/wiki/User-Guide

Hector Imports
import me.prettyprint.cassandra.model.BasicColumnFamilyDefinition;
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ColumnSliceIterator;
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.cassandra.service.template.ColumnFamilyResult;
import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate;
import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.exceptions.HectorException;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.CounterQuery;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceCounterQuery;
import me.prettyprint.hector.api.query.SliceQuery;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
Hector: Add a keyspace
public static Cluster getCluster(String name, String hosts) {
    return HFactory.getOrCreateCluster(name, hosts);
}

public static KeyspaceDefinition createKeyspaceDefinition(String keyspaceName, int replication) {
    return HFactory.createKeyspaceDefinition(
        keyspaceName,
        ThriftKsDef.DEF_STRATEGY_CLASS, // "org.apache.cassandra.locator.SimpleStrategy"
        replication,
        null // ArrayList of CF definitions
    );
}

public static void addKeyspace(Cluster cluster, KeyspaceDefinition ksDef) {
    // Only create the keyspace if the cluster does not already know it.
    KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(ksDef.getName());
    if (keyspaceDef == null) {
        cluster.addKeyspace(ksDef, true);
        System.out.println("Created keyspace: " + ksDef.getName());
    } else {
        System.err.println("Keyspace already exists");
    }
}
Hector: Define a CF

public static ColumnFamilyDefinition createGenericColumnFamilyDefinition(
        String ksName, String cfName, ComparatorType ctName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    columnFamilyDefinition.setDefaultValidationClass(ctName.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}

public static ColumnFamilyDefinition createCounterColumnFamilyDefinition(String ksName, String cfName) {
    BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
    columnFamilyDefinition.setKeyspaceName(ksName);
    columnFamilyDefinition.setName(cfName);
    // Counter column families must validate values as CounterColumnType.
    columnFamilyDefinition.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    columnFamilyDefinition.setReplicateOnWrite(true);
    return new ThriftCfDef(columnFamilyDefinition);
}
Hector: Add a CF
// Obtain a Keyspace handle for an existing keyspace:
Keyspace k = HFactory.createKeyspace(nameString, cluster);

public static void addColumnFamily(Cluster cluster, Keyspace keyspace, ColumnFamilyDefinition cfDef) {
    KeyspaceDefinition ksDef = cluster.describeKeyspace(keyspace.getKeyspaceName());
    if (ksDef != null) {
        List<ColumnFamilyDefinition> list = ksDef.getCfDefs();
        String cfName = cfDef.getName();
        boolean exists = false;
        for (ColumnFamilyDefinition myCfDef : list) {
            if (myCfDef.getName().equals(cfName)) {
                exists = true;
                System.err.println("Found Column Family: " + cfName + ". Did not insert.");
            }
        }
        if (!exists) {
            cluster.addColumnFamily(cfDef, true);
            System.out.println("Created column family: " + cfDef.getName());
        }
    } else {
        System.err.println("Keyspace definition is null");
    }
}
Hector: Insert Column
public static void insertColumn(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String columnName, String columnValue) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    // HFactory.createColumn(columnName, columnValue, StringSerializer.get(), StringSerializer.get())
    HColumn<String, String> hCol = HFactory.createStringColumn(columnName, columnValue);
    mutator.insert(rowKey, cfName, hCol);
    mutator.execute();
}

public static void incrementCounter(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String counterColumnName) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    // Adds 1 to the named counter column.
    mutator.insertCounter(
        rowKey, cfName, HFactory.createCounterColumn(counterColumnName, 1, StringSerializer.get()));
    mutator.execute();
}
Hector: Read Column

public static String getColumn(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String columnName) {
    ColumnQuery<String, String, String> query =
        HFactory.createColumnQuery(
            keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily(cfName).setKey(rowKey).setName(columnName);
    HColumn<String, String> value = query.execute().get();
    // A missing column comes back as null; return an empty string instead.
    if (value != null) {
        return value.getValue();
    }
    return "";
}
Hector: Read Column

public static long getCounter(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String counterColumnName) {
    CounterQuery<String, String> query =
        HFactory.createCounterColumnQuery(keyspace, StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily(cfName).setKey(rowKey).setName(counterColumnName);
    HCounterColumn<String> counter = query.execute().get();
    // A counter that was never incremented comes back as null; treat it as 0.
    if (counter != null) {
        return counter.getValue();
    }
    return 0;
}
Hector: Read A Slice

public static Map<String, String> getSlice(
        Cluster cluster, Keyspace keyspace,
        String cfName, String rowKey,
        String start, String end,
        boolean reversed, int count) {
    SliceQuery<String, String, String> query =
        HFactory.createSliceQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    // for counters use HFactory.createCounterSliceQuery
    query.setColumnFamily(cfName);
    query.setKey(rowKey);
    query.setRange(start, end, reversed, count);
    Iterator<HColumn<String, String>> iter = query.execute().get().getColumns().iterator();
    Map<String, String> answer = new HashMap<String, String>();
    while (iter.hasNext()) {
        HColumn<String, String> temp = iter.next();
        answer.put(temp.getName(), temp.getValue());
    }
    return answer;
}
Hector: Read All Columns

public static Map<String, String> getAllValues(
        Cluster cluster, String keyspace,
        String cf, String rowkey) {
    HashMap<String, String> values = new HashMap<String, String>();
    Keyspace keyspaceObject = HFactory.createKeyspace(keyspace, cluster);
    SliceQuery<String, String, String> query =
        HFactory.createSliceQuery(
            keyspaceObject, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    // Empty start and end select the whole row, capped at 10000 columns.
    query.setColumnFamily(cf).setKey(rowkey).setRange("", "", true, 10000);
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    Iterator<HColumn<String, String>> iter = result.get().getColumns().iterator();
    while (iter.hasNext()) {
        HColumn<String, String> current = iter.next();
        values.put(current.getName(), current.getValue());
    }
    return values;
}
Hector: DANGER

private static void dropAllKeyspaces(Cluster cluster) {
    // Drops every keyspace except the "system" and "OpsCenter" keyspaces.
    for (KeyspaceDefinition ksDef : cluster.describeKeyspaces()) {
        if (!(ksDef.getName().equals("system") || ksDef.getName().equals("OpsCenter"))) {
            cluster.dropKeyspace(ksDef.getName(), true);
            System.out.println("Dropped keyspace: " + ksDef.getName());
        }
    }
}

private static void dropKeyspace(Cluster cluster, String keyspace) {
    KeyspaceDefinition ksDef = createKeyspaceDefinition(keyspace, Hector.replication);
    cluster.dropKeyspace(ksDef.getName(), true);
    System.out.println("Dropped keyspace: " + ksDef.getName());
}

private static void dropColumnFamily(Cluster cluster, String keyspace, String cf) {
    cluster.dropColumnFamily(keyspace, cf);
    System.out.println("Dropped Column Family: " + cf);
}
Conclusions

● Data Modeling is Important
● Use Cassandra for write throughput
● Keep your ring even and your data slice-able
● Wrap your libraries and switch when you need to
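"Wrap your libraries" can be read as: keep application code behind a small interface so that a client swap (such as the Astyanax-to-Hector switch mentioned earlier) touches only one class. A minimal sketch under that reading; the interface, its two methods, and the in-memory implementation are hypothetical stand-ins, not code from the talk.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ClientWrapperSketch {

    /** The only storage API the rest of the application is allowed to see. */
    interface ColumnStore {
        void insert(String cf, String rowKey, String column, String value);

        /** Returns the column value, or "" when the column does not exist. */
        String get(String cf, String rowKey, String column);
    }

    /** Stand-in implementation; a Hector-backed class would implement the same interface. */
    static final class InMemoryColumnStore implements ColumnStore {
        private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

        @Override
        public void insert(String cf, String rowKey, String column, String value) {
            rows.computeIfAbsent(cf + "/" + rowKey, k -> new ConcurrentHashMap<>())
                .put(column, value);
        }

        @Override
        public String get(String cf, String rowKey, String column) {
            Map<String, String> row = rows.get(cf + "/" + rowKey);
            if (row == null) {
                return "";
            }
            return row.getOrDefault(column, "");
        }
    }

    public static void main(String[] args) {
        // Callers depend on ColumnStore only, so the implementation can be replaced freely.
        ColumnStore store = new InMemoryColumnStore();
        store.insert("Clip", "clip-1", "likes", "10");
        System.out.println(store.get("Clip", "clip-1", "likes"));
    }
}
```

An in-memory implementation like this also doubles as a test double, so code that uses the store can be unit-tested without a running cluster.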
We're hiring: http://www.sharethis.com/about/careers
● Work with REAL big data, billions of requests per day
● Work on products that millions of people see and interact with on a daily basis
● Work with a real-time pipeline, machine learning, complex user models
● #1 fastest growing company in San Francisco
● free lunches
● ... and of course work with a bunch of fun, smart people and PhDs
Thank You!

38

More Related Content

Similar to Real-World Cassandra at ShareThis

Cascading introduction
Cascading introductionCascading introduction
Cascading introductionAlex Su
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineMyung Ho Yun
 
Architectural Best Practices to Master + Pitfalls to Avoid (P)
Architectural Best Practices to Master + Pitfalls to Avoid (P) Architectural Best Practices to Master + Pitfalls to Avoid (P)
Architectural Best Practices to Master + Pitfalls to Avoid (P) Elasticsearch
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsthelabdude
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormDataStax
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
 
Efficient Context-sensitive Output Escaping for Javascript Template Engines
Efficient Context-sensitive Output Escaping for Javascript Template EnginesEfficient Context-sensitive Output Escaping for Javascript Template Engines
Efficient Context-sensitive Output Escaping for Javascript Template Enginesadonatwork
 
Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)zznate
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming AppsWSO2
 
Node.js for enterprise - JS Conference
Node.js for enterprise - JS ConferenceNode.js for enterprise - JS Conference
Node.js for enterprise - JS ConferenceTimur Shemsedinov
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 
Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Thomas Bailet
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkJen Aman
 

Similar to Real-World Cassandra at ShareThis (20)

Cascading introduction
Cascading introductionCascading introduction
Cascading introduction
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
 
Architectural Best Practices to Master + Pitfalls to Avoid (P)
Architectural Best Practices to Master + Pitfalls to Avoid (P) Architectural Best Practices to Master + Pitfalls to Avoid (P)
Architectural Best Practices to Master + Pitfalls to Avoid (P)
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
 
Efficient Context-sensitive Output Escaping for Javascript Template Engines
Efficient Context-sensitive Output Escaping for Javascript Template EnginesEfficient Context-sensitive Output Escaping for Javascript Template Engines
Efficient Context-sensitive Output Escaping for Javascript Template Engines
 
Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)Introduciton to Apache Cassandra for Java Developers (JavaOne)
Introduciton to Apache Cassandra for Java Developers (JavaOne)
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
Node.js for enterprise - JS Conference
Node.js for enterprise - JS ConferenceNode.js for enterprise - JS Conference
Node.js for enterprise - JS Conference
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
 
Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Logisland "Event Mining at scale"
Logisland "Event Mining at scale"
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Into The Box 2018 - CBT
Into The Box 2018 - CBTInto The Box 2018 - CBT
Into The Box 2018 - CBT
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With Spark
 

Recently uploaded

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Corporate and higher education May webinar.pptx

Real-World Cassandra at ShareThis

  • 1. Real-World Cassandra at ShareThis: Use Cases, Data Modeling, and Hector
  • 2. ShareThis + Our Customers: Keys to Unlocking Social 1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES) 2. TAKE YOUR SOCIAL INVENTORY TO MARKET 3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY
  • 3. Largest Ecosystem For Sharing and Engagement Across The Web 120 SOCIAL CHANNELS SHARETHIS ECOSYSTEM 211 MILLION PEOPLE (95.1% of the web) 2.4 MILLION PUBLISHERS Source: ComScore U.S. January 2013; internal numbers, January 2013
  • 4. Data Modeling and Why it Matters (Keep it even, Keep it slice-able)
  • 6. A New Product: SnapSets 3 - x1.large
  • 7. Use Case: SnapSets, A New Product
  • 8. Use Case: SnapSets, A New Product (Continued) CF: Users (userId) meta:first_name=Ronald meta:last_name=Melencio meta:username=ronsharethis scrapbook:timestamp:scrapbookId:name=Scrapbook 1 scrapbook:timestamp:scrapbookId:date_created=Jan 10 url1:sid:clipID={LOCATION DATA} url1:sid:456={LOCATION DATA} CF: Scrapbooks (scrapbookId) clip:timestamp:clipId:url=sharethis.com clip:timestamp:clipId:title=Clip 1 clip:timestamp:clipId:likes=10 CF: Clip (clipId) comment:timestamp:commentId={"name":"Ronald","timestamp":'"jan 10","comment":"hi"} CF: Stats (user:userId,application,publisher:pubId) meta:total_scrapbooks=1 meta:total_clips=100 meta:total_scrapbook_comments=100 scrapbook:timestamp:scrapbookId:total_comments=10 scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:likes=10 scrapbook:timestamp:scrapbookId:clip:timestamp:clipId:dislikes=10
  • 10. High Velocity Reads and Writes: Count Service 9 – hi1.4xlarge 9 – x1.large
  • 11. Use Case: Count Service for URLs ● 1 Billion Pageviews per day = 12k pageviews per second ● 60 Million Social Referrals per day = 720 social referrals per second ● 1 Million Shares per day = 12 shares per second ● No expiration of Data* (3bn rows) ● Requires the lowest possible latency ● Multiple read requests per page on blogs ● Normalize and Hash the URL for a row key ● Each social channel is a column ● Retrieve the whole row for counts ● Fix it by cheating ^_^ *
  • 13. Insights that Matter – Your Social Analytics Dashboard Timely Social Analytics Benchmark your social engagement with SQI Identify popular articles Dive deeper into your most social content Measure social activity on an hourly, daily, weekly & monthly basis. Uncover which social channels are driving the most social traffic 12 - x1.large
  • 14. Use Case: Loading Processed Batch Data ● Backend Hadoop stack for processing analytics ● 58 JSON schemas map tabular data to key/value storage for slicing ● MongoDB* did not scale for frequent row-level writes on the same table ● Needed to maintain read throughput during write spikes when analytics jobs finished ● No TTL* - Works daily, doesn't work hourly ● Switching from Astyanax to Hector ● Using a Hector Client through Java APIs
  • 15. Use Case: Loading Processed Batch Data (continued) { "schema": [ {"column_name":"publisher","column_type":"UTF8Type","column_level":"common","column_master":""}, {"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""}, {"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"}, {"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"}, {"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"}, {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""}, {"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""} ], "row_key_format": "publisher:domain:month", "column_family_name": "sqi_table" } CF -> Data Type; Row -> Publisher:domain:timestamp; Columns -> master:slave = value (topics, categories, urls, timestamps, etc.)
  • 17. Insights that Matter – Your Social Analytics Dashboard Real Time Social Analytics Benchmark your social engagement with SQI Identify trending articles in real-time Dive deeper into your most social content Measure social activity on an hourly, daily, weekly & monthly basis. Uncover which social channels are driving the most social traffic 12 cc1.4xlarge
  • 19. Insights that Matter – And aren't accessible
  • 21. Insights that Matter – And aren't accessible ● Too many columns – unbounded URL / channel sets ● Cascading failure ● Solutions: – Bigger boxes – meh... – Split up the columns – split the row keys ● Hash URLs and keep stats separate – Split up the columns – split the CF ● Move URLs to their own space – Split up the columns – split the Keyspace ● Keyspace is a timestamp
  • 23. ● How many rows will there be? ● How many columns per row will you need? ● How will you slice your data? ● What is the maximum number of rows? ● What is the maximum number of columns? ● Is your data relational? ● How long will your data live?
  • 25. Hector Imports import java.util.HashMap; import java.util.Iterator; import java.util.List; import java.util.Map; import me.prettyprint.cassandra.model.BasicColumnFamilyDefinition; import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel; import me.prettyprint.cassandra.serializers.LongSerializer; import me.prettyprint.cassandra.serializers.StringSerializer; import me.prettyprint.cassandra.service.ColumnSliceIterator; import me.prettyprint.cassandra.service.ThriftCfDef; import me.prettyprint.cassandra.service.ThriftKsDef; import me.prettyprint.cassandra.service.template.ColumnFamilyResult; import me.prettyprint.cassandra.service.template.ColumnFamilyTemplate; import me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate; import me.prettyprint.hector.api.Cluster; import me.prettyprint.hector.api.Keyspace; import me.prettyprint.hector.api.beans.ColumnSlice; import me.prettyprint.hector.api.beans.HColumn; import me.prettyprint.hector.api.beans.HCounterColumn; import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition; import me.prettyprint.hector.api.ddl.ComparatorType; import me.prettyprint.hector.api.ddl.KeyspaceDefinition; import me.prettyprint.hector.api.exceptions.HectorException; import me.prettyprint.hector.api.factory.HFactory; import me.prettyprint.hector.api.mutation.Mutator; import me.prettyprint.hector.api.query.ColumnQuery; import me.prettyprint.hector.api.query.CounterQuery; import me.prettyprint.hector.api.query.QueryResult; import me.prettyprint.hector.api.query.SliceCounterQuery; import me.prettyprint.hector.api.query.SliceQuery;
  • 26. Hector: Add a keyspace public static Cluster getCluster(String name, String hosts) { return HFactory.getOrCreateCluster(name, hosts); } public static KeyspaceDefinition createKeyspaceDefinition(String keyspaceName, int replication) { return HFactory.createKeyspaceDefinition( keyspaceName, ThriftKsDef.DEF_STRATEGY_CLASS, /* "org.apache.cassandra.locator.SimpleStrategy" */ replication, null /* ArrayList of CF definitions */ ); } public static void addKeyspace(Cluster cluster, KeyspaceDefinition ksDef) { KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(ksDef.getName()); if (keyspaceDef == null) { cluster.addKeyspace(ksDef, true); System.out.println("Created keyspace: " + ksDef.getName()); } else { System.err.println("Keyspace already exists"); } }
  • 27. Hector: Define a CF public static ColumnFamilyDefinition createGenericColumnFamilyDefinition( String ksName, String cfName, ComparatorType ctName) { BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition(); columnFamilyDefinition.setKeyspaceName(ksName); columnFamilyDefinition.setName(cfName); columnFamilyDefinition.setDefaultValidationClass(ctName.getClassName()); columnFamilyDefinition.setReplicateOnWrite(true); return new ThriftCfDef(columnFamilyDefinition); } public static ColumnFamilyDefinition createCounterColumnFamilyDefinition(String ksName, String cfName) { BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition(); columnFamilyDefinition.setKeyspaceName(ksName); columnFamilyDefinition.setName(cfName); columnFamilyDefinition.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName()); columnFamilyDefinition.setReplicateOnWrite(true); return new ThriftCfDef(columnFamilyDefinition); }
  • 28. Hector: Add a CF Keyspace k = HFactory.createKeyspace(nameString, cluster); public static void addColumnFamily(Cluster cluster, Keyspace keyspace, ColumnFamilyDefinition cfDef) { KeyspaceDefinition ksDef = cluster.describeKeyspace(keyspace.getKeyspaceName()); if (ksDef != null) { List<ColumnFamilyDefinition> list = ksDef.getCfDefs(); String cfName = cfDef.getName(); boolean exists = false; for (ColumnFamilyDefinition myCfDef : list) { if (myCfDef.getName().equals(cfName)) { exists = true; System.err.println("Found Column Family: " + cfName + ". Did not insert."); } } if (!exists) { cluster.addColumnFamily(cfDef, true); System.out.println("Created column family: " + cfDef.getName()); } } else { System.err.println("Keyspace definition is null"); } }
  • 29. Hector: Insert Column public static void insertColumn( Cluster cluster, Keyspace keyspace, String cfName, String rowKey, String columnName, String columnValue) { Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get()); /* equivalent: HFactory.createColumn(columnName, columnValue, StringSerializer.get(), StringSerializer.get()) */ HColumn<String, String> hCol = HFactory.createStringColumn(columnName, columnValue); mutator.insert(rowKey, cfName, hCol); mutator.execute(); } public static void incrementCounter( Cluster cluster, Keyspace keyspace, String cfName, String rowKey, String counterColumnName) { Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get()); mutator.insertCounter( rowKey, cfName, HFactory.createCounterColumn(counterColumnName, 1, StringSerializer.get())); mutator.execute(); }
  • 30. Hector: Read Column public static String getColumn( Cluster cluster, Keyspace keyspace, String cfName, String rowKey, String columnName) { ColumnQuery<String, String, String> query = HFactory.createColumnQuery( keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get()); query.setColumnFamily(cfName).setKey(rowKey).setName(columnName); HColumn<String, String> value = query.execute().get(); if (value != null) { return value.getValue(); } return ""; }
  • 32. Hector: Read Column public static long getCounter( Cluster cluster, Keyspace keyspace, String cfName, String rowKey, String counterColumnName) { CounterQuery<String, String> query = HFactory.createCounterColumnQuery(keyspace, StringSerializer.get(), StringSerializer.get()); query.setColumnFamily(cfName).setKey(rowKey).setName(counterColumnName); HCounterColumn<String> counter = query.execute().get(); if (counter != null) { return counter.getValue(); } return 0; }
  • 33. Hector: Read A Slice public static Map<String, String> getSlice( Cluster cluster, Keyspace keyspace, String cfName, String rowKey, String start, String end, boolean reversed, int count) { SliceQuery<String, String, String> query = HFactory.createSliceQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get()); /* for counters use HFactory.createCounterSliceQuery */ query.setColumnFamily(cfName); query.setKey(rowKey); query.setRange(start, end, reversed, count); Iterator<HColumn<String, String>> iter = query.execute().get().getColumns().iterator(); Map<String, String> answer = new HashMap<String, String>(); while (iter.hasNext()) { HColumn<String, String> temp = iter.next(); answer.put(temp.getName(), temp.getValue()); } return answer; }
  • 34. Hector: Read All Columns public static Map<String, String> getAllValues( Cluster cluster, String keyspace, String cf, String rowkey) { HashMap<String, String> values = new HashMap<String, String>(); Keyspace keyspaceObject = HFactory.createKeyspace(keyspace, cluster); SliceQuery<String,String,String> query = HFactory.createSliceQuery( keyspaceObject, StringSerializer.get(), StringSerializer.get(), StringSerializer.get()); query.setColumnFamily(cf).setKey(rowkey).setRange("", "", true, 10000); QueryResult<ColumnSlice<String,String>> result = query.execute(); Iterator<HColumn<String, String>> iter = result.get().getColumns().iterator(); while (iter.hasNext()) { HColumn<String, String> current = iter.next(); values.put(current.getName(), current.getValue()); } return values; }
  • 35. Hector: DANGER private static void dropAllKeyspaces(Cluster cluster) { for (KeyspaceDefinition ksDef: cluster.describeKeyspaces()) { if (!(ksDef.getName().equals("system") || ksDef.getName().equals("OpsCenter"))) { cluster.dropKeyspace(ksDef.getName(), true); System.out.println("Dropped keyspace: " + ksDef.getName()); } } } private static void dropKeyspace(Cluster cluster, String keyspace) { KeyspaceDefinition ksDef = createKeyspaceDefinition(keyspace, Hector.replication); cluster.dropKeyspace(ksDef.getName(), true); System.out.println("Dropped keyspace: " + ksDef.getName()); } private static void dropColumnFamily(Cluster cluster, String keyspace, String cf) { cluster.dropColumnFamily(keyspace, cf); System.out.println("Dropped Column Family: " + cf ); }
  • 36. Conclusions ● Data Modeling is Important ● Use Cassandra for write throughput ● Keep your ring even and your data slice-able ● Wrap your libraries and switch when you need to
  • 37. We're hiring: http://www.sharethis.com/about/careers ● Work with REAL big data, billions of requests per day ● Work on products that millions of people see and interact with on a daily basis ● Work with a real-time pipeline, machine learning, complex user models ● #1 fastest growing company in San Francisco ● free lunches ● ... and of course work with a bunch of fun, smart people and PhDs
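
Slide 11 says the count service normalizes and hashes each URL to form the row key, with one column per social channel. The slides do not show the normalization rules or the hash function, so what follows is only a minimal sketch under stated assumptions: the names `UrlRowKey`, `normalize`, and `rowKey` are hypothetical, the normalization steps (lowercase scheme and host, drop the fragment and a trailing slash) are illustrative, and MD5 is a stand-in for whatever hash ShareThis actually used.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

public class UrlRowKey {

    // Assumed normalization (not from the slides): lowercase the scheme and
    // host, keep an explicit port, drop a trailing slash and the fragment,
    // keep the query string as-is.
    public static String normalize(String url) {
        URI uri = URI.create(url.trim());
        String scheme = uri.getScheme() == null
                ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null
                ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + port + path + query;
    }

    // Hash the normalized URL into a fixed-width hex string usable as a row key.
    // MD5 is an assumption here, chosen only because every JVM ships it.
    public static String rowKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(url).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Two spellings of the same page should map to the same row.
        System.out.println(rowKey("HTTP://Example.com/a/#frag"));
        System.out.println(rowKey("http://example.com/a"));
    }
}
```

A fixed-width hashed key spreads rows evenly around the ring ("keep it even"), and with one column per channel a single row read returns every count for the page.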

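
The conclusions recommend wrapping your client library so you can switch when you need to, and slide 14 mentions a switch between Astyanax and Hector. One way to sketch that idea, under stated assumptions: `ColumnStore` and `InMemoryColumnStore` are hypothetical names that belong to neither Hector nor Astyanax, and the in-memory class exists only so the example runs without a cluster. A Hector-backed class would implement the same interface using calls like those on slides 29 to 34.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class WrapperSketch {

    /** Hypothetical storage-agnostic interface that application code depends on. */
    public interface ColumnStore {
        void insertColumn(String cfName, String rowKey, String columnName, String columnValue);
        String getColumn(String cfName, String rowKey, String columnName);
        Map<String, String> getAllValues(String cfName, String rowKey);
    }

    /** In-memory stand-in (not thread-safe); columns stay sorted by name within a row. */
    public static class InMemoryColumnStore implements ColumnStore {
        private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

        private static String id(String cfName, String rowKey) {
            return cfName + "/" + rowKey;
        }

        @Override
        public void insertColumn(String cfName, String rowKey,
                                 String columnName, String columnValue) {
            rows.computeIfAbsent(id(cfName, rowKey), k -> new TreeMap<>())
                .put(columnName, columnValue);
        }

        @Override
        public String getColumn(String cfName, String rowKey, String columnName) {
            TreeMap<String, String> row = rows.get(id(cfName, rowKey));
            if (row == null) {
                return "";
            }
            String value = row.get(columnName);
            // Mirrors the slides' getColumn, which returns "" for a missing column.
            return value == null ? "" : value;
        }

        @Override
        public Map<String, String> getAllValues(String cfName, String rowKey) {
            TreeMap<String, String> row = rows.get(id(cfName, rowKey));
            return row == null ? new TreeMap<String, String>() : new TreeMap<>(row);
        }
    }

    public static void main(String[] args) {
        ColumnStore store = new InMemoryColumnStore();
        store.insertColumn("Users", "user1", "meta:first_name", "Ronald");
        System.out.println(store.getColumn("Users", "user1", "meta:first_name"));
    }
}
```

Because callers only see `ColumnStore`, swapping the client library means writing one new implementation rather than touching every call site.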
Editor's Notes

  1. We can change the look of the slide (and featured publishers), but I feel the ecosystem is a cool concept and graphic for getting a quick overview of who we are. The text below can be worked in somehow too, with the new look of this slide. Maybe the text can be cut down too. ShareThis empowers publishers with solutions to improve and drive value from the social engagement of their site. People share content that's most relevant to them, with people who they believe will also enjoy the content. More than 2.5 million publishers increase eyeballs, engagement, and advertising revenue through the ShareThis sharing platform.