Apache Cassandra is a leading open-source distributed database capable of amazing feats of scale, but its data model requires a bit of planning for it to perform well. Of course, the nature of ad-hoc data exploration and analysis requires that we be able to ask questions we hadn’t planned on asking—and get an answer fast. Enter Apache Spark.
Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages. It’s exactly what a Cassandra cluster needs to deliver real-time, ad-hoc querying of operational data at scale.
In this talk, we’ll explore Spark and see how it works together with Cassandra to deliver a powerful open-source big data analytics solution.
12. Data Model

card_number  xctn_time  amount  country  merchant
5444…8389    13:58      $12.42  US       101
4532…1191    13:59      $50.38  US       101
5444…6027    13:59      $3.02   US       102
4532…2189    14:02      $12.42  US       198
5444…6027    14:03      €23.78  ES       352
5444…8389    14:03      $98.03  US       158
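The rows above map naturally onto a simple record type. A minimal sketch in Scala (the `Transaction` class and its field names are illustrative, not from the deck; the elided card digits are kept as shown on the slide):

```scala
// Illustrative record type matching the columns of the table above.
case class Transaction(
  cardNumber: String, // card_number: the partition key in the Cassandra model
  xctnTime: String,   // xctn_time: the clustering column, shown as HH:MM
  amount: BigDecimal,
  country: String,
  merchant: Int
)

object TransactionDemo {
  // Two sample rows, values copied from the slide's table.
  val sample = Seq(
    Transaction("5444…8389", "13:58", BigDecimal("12.42"), "US", 101),
    Transaction("4532…1191", "13:59", BigDecimal("50.38"), "US", 101)
  )
}
```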
13. [Same transactions table as slide 12, with the card_number column highlighted]
Partition Key
14. [Same table, with the xctn_time column highlighted]
Clustering Column
15. [Same table, with the card_number and xctn_time columns highlighted together]
Primary Key
16. [Same table, with one card_number's rows grouped into a partition]
Partition
17. [Same table, highlighting another partition]
Partition
18. [Same table, highlighting another partition]
Partition
19. [Same table, highlighting another partition]
Partition
23. [Same table, with all rows grouped into partitions by card_number]
Partition
Partitioning
24. • Clumps of related data
• Always on one server
• Accessed by consistent hashing
Partitions
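The "accessed by consistent hashing" idea can be sketched in a few lines: hash the partition key onto a token ring and walk clockwise to the node that owns that token range. This is a toy stand-in, not Cassandra's actual Murmur3 partitioner or a real 64-bit ring:

```scala
object TokenRing {
  // Toy hash onto a 0-99 ring; Cassandra really uses Murmur3 over 64 bits.
  def token(partitionKey: String): Int =
    math.abs(partitionKey.hashCode % 100)

  // Each node owns the ring positions up to and including its token.
  val nodeTokens: Seq[(Int, String)] =
    Seq((25, "nodeA"), (50, "nodeB"), (75, "nodeC"), (99, "nodeD"))

  // Walk clockwise from the key's token to the first node token >= it.
  def ownerOf(partitionKey: String): String = {
    val t = token(partitionKey)
    nodeTokens.find { case (tok, _) => tok >= t }
      .map(_._2)
      .getOrElse(nodeTokens.head._2) // wrap around the ring
  }
}
```

Because every row with the same partition key hashes to the same token, the whole partition lands on one owning replica — the "always on one server" property above.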
25. • Design for low-latency, high-throughput reads and writes
• Inflexible ad-hoc queries
• Very little aggregation
• Very little secondary indexing
Consequences
56. import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

object ScalaProgram {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
    val rdd = sc.textFile("/actual/path/raven.txt")
  }
}
From a Text File
57. CREATE TABLE credit_card_transactions (
  card_number VARCHAR,
  transaction_time TIMESTAMP,
  amount DECIMAL,
  merchant_id VARCHAR,
  country VARCHAR,
  PRIMARY KEY (card_number, transaction_time))
WITH CLUSTERING ORDER BY (transaction_time DESC);
Data Model
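The effect of this primary key can be mimicked with plain collections: rows group by card_number (the partition key) and sort by transaction_time descending within each group. A toy sketch with in-memory data, not connector code:

```scala
object ClusteringDemo {
  // (card_number, transaction_time) pairs from the sample table.
  val rows = Seq(
    ("5444…8389", "13:58"),
    ("5444…6027", "13:59"),
    ("5444…6027", "14:03"),
    ("5444…8389", "14:03")
  )

  // Group into "partitions" by card, newest transaction first, mirroring
  // PRIMARY KEY (card_number, transaction_time) with DESC clustering order.
  val partitions: Map[String, Seq[String]] =
    rows.groupBy(_._1)
      .map { case (card, txns) => card -> txns.map(_._2).sorted.reverse }
}
```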
58. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val xctns = sc.cassandraTable("payments", "credit_card_transactions")
From a Table
59. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val xctns = sc.cassandraTable("payments", "credit_card_transactions")
  .select("card_number", "country")
Projection
60. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val countries = xctns.map(x => (x.getString("card_number"),
  x.getString("country")))
Making k / v pairs
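Outside of Spark, this is an ordinary map from rows to pairs. A stand-in sketch where each "row" is a column-name-to-value map playing the role of CassandraRow (the `rows` data is illustrative):

```scala
object PairDemo {
  // Each "row" as a column-name -> value map, standing in for CassandraRow.
  val rows = Seq(
    Map("card_number" -> "5444…8389", "country" -> "US"),
    Map("card_number" -> "5444…6027", "country" -> "ES")
  )

  // Mirrors xctns.map(x => (x.getString("card_number"), x.getString("country"))):
  // key = card_number, value = country.
  val countries: Seq[(String, String)] =
    rows.map(r => (r("card_number"), r("country")))
}
```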
61. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val cCount = countries.reduceByKey((card, country) =>
  // cool analysis redacted!
)
Count up Countries
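One shape the redacted analysis could take (an assumption on my part, not the speaker's code): map each country to `(country, 1)` and sum per key, the classic `reduceByKey(_ + _)` pattern. Shown here with plain collections standing in for RDDs:

```scala
object CountryCount {
  // The country column from the sample table: five US rows and one ES row.
  val countries = Seq("US", "US", "US", "US", "ES", "US")

  // Plain-collections equivalent of
  //   countries.map(c => (c, 1)).reduceByKey(_ + _)
  val counts: Map[String, Int] =
    countries.map(c => (c, 1))
      .groupBy(_._1)
      .map { case (c, pairs) => c -> pairs.map(_._2).sum }
}
```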
62. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val countries = xctns.map(x => (x._1, x._2))
Make k / v pairs
63. val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
cCount.saveToCassandra("payments", "frequent_countries")
Persisting to a Table