2. Cassandra (basic overview)
- BigTable + Dynamo
- Semi-structured data model
- Decentralized: no special roles, no single point of failure (SPOF)
- Horizontally scalable
- Ridiculously fast writes, fast reads
- Tunably consistent
- Cross-datacenter capable
3. Querying with Cassandra
- Design your data model based on your query model
- Real-time ad-hoc queries aren't viable
- Secondary indexes help
- What about analytics?
4. Enter Hadoop
- Hadoop brings analytics: MapReduce
- Pig/Hive and other tools built on top of MapReduce
- Configurable data sources/destinations
- Many are already familiar with it
- Active community
5. Cluster Configuration
- Basic recipe: overlay Hadoop on top of Cassandra
- Separate server for the name node and job tracker
- Co-locate task trackers with the Cassandra nodes
- Data nodes for the distributed cache
- Voilà: data locality, and the analytics engine scales with the data
6. Cluster Tuning
- Always tune Cassandra to taste
- For Hadoop workloads you might:
  - Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
  - Raise rpc_timeout_in_ms in cassandra.yaml
  - Tune cassandra.range.batch.size
- See org.apache.cassandra.hadoop.ConfigHelper
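As a sketch of where those two knobs live (the values here are illustrative, not recommendations): the RPC timeout is a server-side setting in cassandra.yaml, while the range batch size is a Hadoop job property, settable via ConfigHelper (check the class in your Cassandra version for the exact setter).

```yaml
# cassandra.yaml -- raise the Thrift RPC timeout so that long-running
# Hadoop range scans don't time out (illustrative value; the default is lower)
rpc_timeout_in_ms: 30000
```

A smaller cassandra.range.batch.size (rows fetched per underlying call) trades scan throughput for less memory pressure and fewer timeouts on wide rows.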
8. Separate Analytics Configuration
- Separate nodes for analytics
- Other nodes for real-time random access
- A single Cassandra cluster with different virtual data centers
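A sketch of the virtual-datacenter idea in 0.7-era cassandra-cli syntax (the keyspace name, DC names, and replica counts are made up, and the exact syntax varies by version; a snitch such as PropertyFileSnitch must also map nodes to these DC names):

```
create keyspace Keyspace1
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = [{DC1:2, Analytics:1}];
```

With task trackers co-located only with the nodes in the Analytics DC, MapReduce scans hit replicas there and don't compete with real-time reads in DC1.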
9. MapReduce: InputFormat
- Cassandra-specific InputFormat: ColumnFamilyInputFormat
- Configuration via ConfigHelper and Hadoop variables
- InputSplits over the data (tunable)
- Example usage in contrib/word_count
10. MapReduce: OutputFormat
- ColumnFamilyOutputFormat
- Configuration via ConfigHelper and Hadoop variables
- Batches output (tunable)
- You don't have to use the Cassandra API directly
- Some optimizations (e.g. ConsistencyLevel.ONE)
- Uses Avro for output serialization (enables streaming)
- Example usage in contrib/word_count
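To make the data flow concrete: with ColumnFamilyInputFormat each map() call receives a row key plus that row's columns, and the reduce phase aggregates, as in contrib/word_count. A pure-Python simulation of that flow (no Cassandra or Hadoop required; the sample rows are made up):

```python
from collections import defaultdict

# Simulated input: what ColumnFamilyInputFormat hands to each map() call --
# a row key plus that row's columns (name -> value). Sample data is made up.
rows = {
    "doc1": {"text": "word count with cassandra"},
    "doc2": {"text": "cassandra and hadoop word count"},
}

def map_phase(rows):
    """Emit (word, 1) for every word in every column value."""
    for key, columns in rows.items():
        for value in columns.values():
            for word in value.split():
                yield word, 1

def reduce_phase(pairs):
    """Sum the counts per word, like the word_count reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(rows))
print(counts["cassandra"])  # 2
```

In the real job, ColumnFamilyOutputFormat would then write `counts` back as columns in a Cassandra row instead of printing it.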
12. Hadoop Streaming
- What about languages outside of Java? Build on what Hadoop uses: Streaming
- Output streaming as of 0.7.0
- Example in contrib/hadoop_streaming_output
- Input streaming in progress, hoped for in 0.7.2
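Hadoop Streaming runs any executable that reads records on stdin and writes tab-separated key/value pairs on stdout. A minimal word-count mapper/reducer pair in Python, showing that contract (this is the generic streaming shape, not the Cassandra-specific wiring from contrib/hadoop_streaming_output):

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' per word (stdin -> stdout)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key; sum counts per word."""
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

# Local demo: in a real job each phase reads sys.stdin, and Hadoop's
# shuffle does the sorting that sorted() stands in for here.
sample = ["cassandra hadoop", "cassandra pig"]
for out in reducer(sorted(mapper(sample))):
    print(out)
```

On a cluster the two functions would live in separate scripts passed as `-mapper` and `-reducer` to the streaming jar.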
13. Pig
- Developed at Yahoo!
- Pig Latin / Grunt shell
- Powerful scripting language for analytics
- Configuration via Hadoop/environment variables
- Requires Pig 0.7+
- Example usage in contrib/pig
14. Pig example (word count, top ten)

rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
       AS (key: chararray, cols: bag{col: tuple(name: bytearray, value: bytearray)});
cols = FOREACH rows GENERATE flatten(cols) AS (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) AS count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
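For readers who don't know Pig Latin, the same dataflow in plain Python (the sample rows stand in for what would be LOADed from cassandra://Keyspace1/Standard1; the data is made up):

```python
from collections import Counter

# Stand-in for the loaded relation: each row is
# (key, bag of (column name, column value)) tuples. Made-up data.
rows = [
    ("row1", [("body", "hadoop and cassandra"), ("title", "cassandra")]),
    ("row2", [("body", "pig on cassandra")]),
]

# FOREACH/flatten + TOKENIZE + GROUP + COUNT + ORDER + LIMIT, in one pass:
words = Counter(
    word
    for _key, cols in rows
    for _name, value in cols
    for word in value.split()
)
topten = words.most_common(10)
print(topten[0])  # ('cassandra', 3)
```

Pig generates the equivalent MapReduce jobs from the script, which is why it obviates hand-written multi-language MR for queries like this.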
16. Users of Cassandra + Hadoop
- Raptr.com: home-grown solution -> Cassandra + Hadoop; query time went from hours to minutes; Pig obviated their need for multi-lingual MapReduce
- Speed and ease are enabling: Imagini/Visual DNA, The Dachis Group
- US Government (Digital Reasoning): see http://github.com/digitalreasoning/PyStratus
17. Future
- Hive support in progress (HIVE-1434)
- Hadoop input streaming (hoped for in 0.7.2; CASSANDRA-1497)
- Pig StoreFunc (CASSANDRA-1828)
- Row predicates (pending CASSANDRA-1600)
- MapReduce et al. over secondary indexes (CASSANDRA-1600)
- Performance improvements (though already good)
18. Conclusion
- Performant OLTP + powerful OLAP
- Less need to shuttle data between storage systems
- Data locality for processing
- Scales with the cluster
- Analytics load can be separated into its own virtual DC
19. Learn More
- About Cassandra:
  - http://www.datastax.com/docs
  - http://wiki.apache.org/cassandra
  - Search and subscribe to the user mailing list (very active)
  - #cassandra on freenode (IRC), ~150-200+ users from around the world
  - Cassandra: The Definitive Guide
- About Hadoop support in Cassandra:
  - Check out the various <source>/contrib modules: README/code
  - http://wiki.apache.org/cassandra/HadoopSupport