3. Cassandra
Distributed and decentralized data store
Very efficient for fast writes and reads (we ourselves run a website that reads/writes to Cassandra in real time)
But what about analytics?
4. Hadoop over Cassandra
Useful for -
Built-in support for Hadoop since 0.6
Can use any language without having to understand the Thrift API
Distributed analysis - massively reduces processing time
Possible to use Pig/Hive
What is supported -
Read from Cassandra since 0.6
Write to Cassandra since 0.7
Support for Hadoop Streaming since 0.7 (only output streaming supported as of now)
5. Cluster Configuration
Ideal configuration -
Overlay a Hadoop cluster over the Cassandra nodes
Separate server for namenode/jobtracker
Tasktracker on each Cassandra node
At least one node needs to be a data node for house-keeping purposes
What this achieves -
Data locality
Analytics engine scales with data
6. Ideal is not always ideal enough
A certain level of tuning is always required
Tune cassandra.range.batch.size. Usually you would want to reduce it.
Tune rpc_timeout_in_ms in cassandra.yaml (storage-conf.xml on 0.6) to avoid time-outs.
Use NetworkTopologyStrategy and custom snitches to separate the analytics nodes as a virtual data-center.
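As a sketch, the timeout setting lives in cassandra.yaml; the values below are illustrative only, not recommendations:

```yaml
# cassandra.yaml (0.7) -- illustrative values; tune for your workload
rpc_timeout_in_ms: 30000      # raise if mapper range scans hit time-outs
# a snitch that lets you place analytics nodes in their own "data-center"
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
```

Note that cassandra.range.batch.size is not a cassandra.yaml setting; it is set on the Hadoop job's Configuration (ConfigHelper.setRangeBatchSize does this).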
7. Sample cluster topology
Real-time random access -
All-in-one topology
Separate analytics nodes
8. Classes that make all this possible
ColumnFamilyRecordReader and ColumnFamilyRecordWriter
Read/write rows from/to Cassandra
ColumnFamilySplit
Creates splits over the Cassandra data
ConfigHelper
Helper to configure Cassandra-specific information
ColumnFamilyInputFormat and ColumnFamilyOutputFormat
Inherit Hadoop classes so that Hadoop jobs can interact with data (read/write)
AvroOutputReader
Streams output to Cassandra
10. public class Lookalike extends Configured implements Tool {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Lookalike(), args);
        System.exit(0);
      }

      @Override
      public int run(String[] arg0) throws Exception {
        Job job = new Job(getConf(), "Lookalike Report");
        Configuration conf = job.getConfiguration();
        job.setJarByClass(Lookalike.class);
        job.setMapperClass(LookalikeMapper.class);
        job.setReducerClass(LookalikeReducer.class);
        job.setOutputKeyClass(TextPair.class); // 1
        job.setOutputValueClass(TextPair.class);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));
        job.setPartitionerClass(KeyPartitioner.class); // 2
        job.setGroupingComparatorClass(TextPair.GroupingComparator.class); // 2
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        ConfigHelper.setThriftContact(conf, host, 9160);
        ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
        ConfigHelper.setRangeBatchSize(conf, batchSize);

        List<byte[]> columnNames = Arrays.asList("properties".getBytes(),
            "personality".getBytes());
        SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
        ConfigHelper.setSlicePredicate(conf, predicate);

        job.waitForCompletion(true);
        return 0;
      }
    }
1 - See this for more on TextPair - http://bit.ly/fCtaZA
2 - See this for more on Secondary Sort - http://bit.ly/eNWbN8
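The secondary-sort pattern behind KeyPartitioner and TextPair.GroupingComparator can be illustrated with plain JDK comparators. This is a stand-alone sketch of the idea only, not the Hadoop classes: records are ordered by (target user ascending, score descending), so a consumer sees each user's best-scoring lookalikes first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of secondary sort with plain Java: group by field 0 (target user),
// order within a group by field 2 (score) descending.
public class SecondarySortSketch {

    // each record = {targetUser, lookalikeUser, score}
    public static List<String[]> sortRecords(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(Comparator
                .comparing((String[] r) -> r[0])                 // grouping key
                .thenComparing(r -> Double.parseDouble(r[2]),
                               Comparator.reverseOrder()));      // score, descending
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[]{"u1", "a", "0.4"},
                new String[]{"u2", "b", "0.9"},
                new String[]{"u1", "c", "0.7"});
        for (String[] r : sortRecords(records))
            System.out.println(r[0] + "\t" + r[1] + "\t" + r[2]);
    }
}
```

In Hadoop the same effect is split across three pieces: the partitioner routes all records for one target user to one reducer, the sort comparator orders by the composite key, and the grouping comparator makes one reduce() call per target user.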
11. public static class LookalikeMapper extends Mapper<String, SortedMap<byte[], IColumn>,
        TextPair, TextPair>
    {
      private HashMap<String, String> targetUserMap;

      @Override
      protected void setup(Context context)
      {
        targetUserMap = loadTargetUserMap();
      }

      @Override
      public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
          throws IOException, InterruptedException
      {
        // Read the properties and personality columns
        IColumn propertiesColumn = columns.get("properties".getBytes());
        if (propertiesColumn == null)
          return;
        String propertiesValue = new String(propertiesColumn.value()); // JSONObject
        IColumn personalityColumn = columns.get("personality".getBytes());
        if (personalityColumn == null)
          return;
        String personalityValue = new String(personalityColumn.value()); // JSONObject

        for (Map.Entry<String, String> targetUser : targetUserMap.entrySet())
        {
          double score = scoreLookAlike(targetUser.getValue(), personalityValue);
          if (score >= FILTER_SCORE)
          {
            context.write(new TextPair(propertiesValue, String.valueOf(score)),
                new TextPair(targetUser.getKey(), String.valueOf(score)));
          }
        }
      }
    }
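The per-row scoring loop above can be sketched stand-alone. scoreLookAlike and FILTER_SCORE are not defined in the deck, so both are assumptions here: the placeholder score is a crude token-overlap (Jaccard) ratio and the threshold is an arbitrary 0.25, standing in for whatever the real implementation uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Stand-alone sketch of the mapper's scoring loop, outside Hadoop.
public class LookalikeSketch {
    static final double FILTER_SCORE = 0.25; // assumed threshold

    // Hypothetical placeholder score: Jaccard overlap of whitespace tokens.
    static double scoreLookAlike(String target, String candidate) {
        Set<String> a = new HashSet<>(Arrays.asList(target.split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(candidate.split("\\s+")));
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    // Mirrors the loop over targetUserMap: emit (targetUserId, score)
    // for every target user whose score passes the threshold.
    static Map<String, Double> matches(Map<String, String> targetUsers, String personality) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> targetUser : targetUsers.entrySet()) {
            double score = scoreLookAlike(targetUser.getValue(), personality);
            if (score >= FILTER_SCORE)
                out.put(targetUser.getKey(), score);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("u1", "likes hiking and jazz");
        targets.put("u2", "enjoys cricket");
        System.out.println(matches(targets, "likes hiking and films"));
    }
}
```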
12. public class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text>
    {
      @Override
      public void reduce(TextPair key, Iterable<TextPair> values, Context context)
          throws IOException, InterruptedException
      {
        int counter = 1;
        for (TextPair value : values)
        {
          if (counter > USER_COUNT) // USER_COUNT = 100
          {
            break;
          }
          context.write(key.getFirst(),
              new Text(value.getFirst() + "\t" + value.getSecond()));
          counter++;
        }
      }
    }
    // Sample Output
    // TargetUser                             LookalikeUser                          Score
    // 7f55fdd8-76dc-102e-b2e6-001ec9d506ae   de6fbeac-7205-ff9c-d74d-2ec57841fd0b   0.2602739
    // It is also possible to write this output to Cassandra (we don't do this currently).
    // It is quite straightforward. See the word_count example in the Cassandra contrib folder.
13. Some stats
Cassandra cluster of 16 nodes
Hadoop cluster of 5 nodes
Over 120 million rows
Over 600 GB of data
Over 20 trillion computations
Hadoop - just over 4 hours
Serial PHP script - crossed 48 hours and was still chugging along
14. Links
Cassandra : The Definitive Guide
Hadoop MapReduce in Cassandra cluster (DataStax)
Cassandra and Hadoop MapReduce (DataStax)
Cassandra Wiki - Hadoop Support
Cassandra/Hadoop Integration (Jeremy Hanna)
Hadoop : The Definitive Guide