Building a Flexible, Real-time
Big Data Applications Platform
on Cassandra with Kiji
Cassandra Day Silicon Valley
07 April 2014
Clint Kelly
Member of Technical Staff
WibiData
Overview
• The Kiji Project
• The Kiji data model and KijiSchema
• Mapping Kiji to Cassandra
• Status and future work
• Try it now!
The Kiji Project
[Image slides contrasting “Have this...” with “Want to build this...”]
Open source components
• Batch processing
– Extract, transform, load
– Train machine learning models
• Scalable storage
– Time-series data
• Serialization
– Complex data types
Hadoop, C*, HBase, Avro
KijiSchema
KijiMR KijiREST
KijiHive KijiScoring
KijiExpress
KijiSchema
• Schemas and data serialization
• Complex, atomic data types
record UserLog {
    long timestamp;
    int user_id;
    string url;
    long session_id;
}
• Schema evolution
• Table metadata
Kiji batch components
• Scala DSL ➔ describe
MapReduce computations
• Machine learning library
• Hive adapter
Kiji real-time components
• REST server
• Scoring server
Kiji Summary
• Bridge between open-source technologies
and real-time, big data applications
• Users are building real systems with Kiji now!
– Personalized recommendation systems for retail
– Energy usage and analytics reporting
The Kiji data model and
KijiSchema
[Diagram: a row]
Tables are composed of rows.
[Diagram: a row with an entity ID and data]
We call row keys “entity IDs.”
[Diagram: a row with entity ID 0xfa “bob” and data]
We support composite entity IDs (with
hashed and unhashed components).
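
A minimal Python sketch of a composite entity ID with one hashed and one unhashed component. The names `EntityId` and `make_entity_id`, the use of SHA-256, and the one-byte hash are assumptions for illustration only; the slides do not say how Kiji derives the hashed component.

```python
import hashlib
from typing import NamedTuple


class EntityId(NamedTuple):
    """Hypothetical composite entity ID (not the Kiji API)."""
    hashed: int    # hash-derived component, used to spread rows evenly
    unhashed: str  # stored verbatim, so it stays readable and sortable


def make_entity_id(username: str) -> EntityId:
    # Assumption: use the first byte of a SHA-256 digest as the hashed component.
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return EntityId(hashed=digest[0], unhashed=username)


bob = make_entity_id("bob")
print(hex(bob.hashed), bob.unhashed)
```

The hashed part is deterministic, so the same user always maps to the same row, while the unhashed part keeps the original value available for ordering and display.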
[Diagram: row 0xfa “bob” with column families info and songs]
Data in rows is organized into “column
families.”
[Diagram: row 0xfa “bob” with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter]
Column families contain columns,
named as “family:qualifier.”
[Diagram: the same row, with several timestamped versions of songs:let it be (e.g., timestamp 1396560123)]
Individual columns can have many
different timestamped versions.
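
The data model described so far (row, column family, qualifier, timestamped versions) can be sketched as a small in-memory structure. This is only an illustration in Python; the `Row` class and its methods are hypothetical and are not the KijiSchema API, and the sample values are made up.

```python
from collections import defaultdict


class Row:
    """Hypothetical in-memory model of one Kiji-style row."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        # (family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        self._cells[(family, qualifier)][timestamp] = value

    def versions(self, family, qualifier):
        """All versions of one column as (timestamp, value), newest first."""
        cells = self._cells.get((family, qualifier), {})
        return sorted(cells.items(), reverse=True)

    def latest(self, family, qualifier):
        versions = self.versions(family, qualifier)
        return versions[0][1] if versions else None


row = Row((0xFA, "bob"))
row.put("info", "email", 1396500000, "bob@gmail.com")
row.put("songs", "let it be", 1396540000, "play-1")
row.put("songs", "let it be", 1396560123, "play-3")
row.put("songs", "let it be", 1396550000, "play-2")
print(row.latest("songs", "let it be"))
```

Keeping versions keyed by timestamp is what makes "most recent value" and "all values in a time range" natural reads in this model.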
Data values can be complex records
record SongPlay {
    long song_id;
    int user_rating;
    long session_id;
    device_type device;
}
Locality groups
Separate logical organization of data
(column families) from physical
attributes (caching, compression, etc.)
[Diagram: a row with an entity ID and column families info, songs_today, songs_prev_year]
Need this data ASAP for real-time scoring:
“real_time” (in-memory, uncompressed, TTL = 1 day)
Use this data only for batch jobs:
“batch” (compressed, TTL = 12mo)
Always refer to columns by logical name
(“family:qualifier”).
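
A minimal sketch of the idea in Python: callers name a column logically, and a lookup resolves the physical attributes of its locality group. The assignment of families to groups, the `in_memory=False` setting for “batch”, and treating 12 months as 365 days are assumptions; the `LocalityGroup` type and `locality_group_for` function are hypothetical.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass(frozen=True)
class LocalityGroup:
    name: str
    in_memory: bool
    compressed: bool
    ttl_seconds: int


REAL_TIME = LocalityGroup("real_time", in_memory=True, compressed=False, ttl_seconds=DAY)
# Assumption: "12mo" approximated as 365 days; in_memory=False is a guess.
BATCH = LocalityGroup("batch", in_memory=False, compressed=True, ttl_seconds=365 * DAY)

# Assumption: which column family is stored in which locality group.
FAMILY_TO_GROUP = {
    "info": REAL_TIME,
    "songs_today": REAL_TIME,
    "songs_prev_year": BATCH,
}


def locality_group_for(column: str) -> LocalityGroup:
    """Resolve a logical 'family:qualifier' name to its locality group."""
    family, _, _qualifier = column.partition(":")
    return FAMILY_TO_GROUP[family]


print(locality_group_for("songs_today:help").name)
```

Because readers and writers only ever use the logical name, a family could be moved to a different locality group without changing application code.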
KijiSchema summary
• Data model similar to Cassandra, HBase,
BigTable
• Contains time dimension (not present in C*)
• Logical and physical organization separate
• Complex schemas with Avro
Mapping Kiji to Cassandra
Implementation notes
• Built for Cassandra 2.0.6+
• Native protocol / Java driver (no Thrift)
• Asynchronous API
• Assume users have Hadoop, ZooKeeper
Mapping a Kiji table ➔ Cassandra
• Locality group ➔ Table
• Entity ID ➔ Primary key
– Hashed components ➔ partition key
– Unhashed components ➔ clustering columns
• Family, qualifier, timestamp ➔ clustering columns
• Cell values ➔ blobs
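
The mapping rules above can be sketched as a small statement generator. This is a hedged Python illustration, not Kiji's actual code: `locality_group_cql` and its parameters are hypothetical, and the table and column names are example values.

```python
def locality_group_cql(table, hashed, unhashed):
    """Build a CREATE TABLE statement for one locality group.

    hashed / unhashed: lists of (column_name, cql_type) for the
    entity ID components.
    """
    cell_columns = [("family", "text"), ("qualifier", "text"), ("timestamp", "bigint")]
    columns = hashed + unhashed + cell_columns + [("value", "blob")]
    column_defs = ",\n".join(f"  {name} {cql_type}" for name, cql_type in columns)

    # Hashed components form the partition key; everything else clusters.
    partition_key = ", ".join(name for name, _ in hashed)
    clustering = [name for name, _ in unhashed + cell_columns]
    primary_key = f"({partition_key}), " + ", ".join(clustering)
    # Newest versions first, so "latest cell" is the first row of a scan.
    order = ", ".join(
        f"{name} {'DESC' if name == 'timestamp' else 'ASC'}" for name in clustering
    )
    return (
        f"CREATE TABLE {table} (\n{column_defs},\n"
        f"  PRIMARY KEY ({primary_key})\n"
        f") WITH CLUSTERING ORDER BY ({order});"
    )


cql = locality_group_cql(
    "users_locality_group_fast",
    hashed=[("userid", "bigint")],
    unhashed=[("username", "text")],
)
print(cql)
```

The sketch writes the partition key with explicit inner parentheses, which CQL accepts for single-column and composite partition keys alike.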
CQL for Kiji locality group
CREATE TABLE users_locality_group_fast (
    userid bigint,
    username text,
    family text,
    qualifier text,
    timestamp bigint,
    value blob,
    PRIMARY KEY (userid, username, family, qualifier, timestamp)
) WITH CLUSTERING ORDER BY (
    username ASC, family ASC, qualifier ASC, timestamp DESC);
cqlsh:kiji_music> SELECT * FROM kiji_table_users;
userid | username | family | qualifier | timestamp | value
--------+----------+--------+----------------+-----------+---------------
0xfa | bob | info | email | 139653249 | 1243970104327
0xfa | bob | songs | abbey road | 139656012 | 0981274331032
0xfa | bob | songs | help | 139625013 | 9074132704129
0xfa | bob | songs | help | 139621359 | 1923079210370
0xfa | bob | songs | help | 139625013 | 4745018223497
0xfa | bob | songs | helter skelter | 139621324 | 7710423974234
Physical organization of data on disk
0xfa:bob:info:email:t0:bob@gmail.com
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:help:t2:...
0xfa:bob:songs:helter skelter:t1:...
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:...
0xfa:bob:songs:let it be:t2:...
Efficient queries =
continuous scans!
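
A small Python sketch of why this ordering helps: if cells are kept sorted by (username, family, qualifier, timestamp descending), every query on a key prefix reads one contiguous slice. The helper names and the integer timestamps standing in for t0..t5 are assumptions for illustration; this models the sort order only, not Cassandra's storage engine.

```python
import bisect


def sort_key(username, family, qualifier, timestamp):
    # Negate the timestamp so ascending tuple order yields newest first.
    return (username, family, qualifier, -timestamp)


cells = sorted([
    (sort_key("bob", "songs", "let it be", 5), "..."),
    (sort_key("bob", "info", "email", 0), "bob@gmail.com"),
    (sort_key("bob", "songs", "help", 2), "..."),
    (sort_key("bob", "songs", "let it be", 2), "..."),
    (sort_key("bob", "info", "payment", 1), "AMEX1234..."),
    (sort_key("bob", "songs", "helter skelter", 1), "..."),
    (sort_key("bob", "songs", "let it be", 4), "..."),
])


def prefix_scan(cells, *prefix):
    """Return the contiguous slice of cells whose key starts with prefix."""
    keys = [key for key, _ in cells]
    lo = bisect.bisect_left(keys, prefix)
    hi = lo
    while hi < len(keys) and keys[hi][:len(prefix)] == prefix:
        hi += 1
    return cells[lo:hi]


songs = prefix_scan(cells, "bob", "songs")
latest_let_it_be = prefix_scan(cells, "bob", "songs", "let it be")[0]
print([key[2] for key, _ in songs], -latest_let_it_be[0][3])
```

Reading the first element of a (family, qualifier) slice corresponds to a query with LIMIT 1 under a descending timestamp order.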
Kiji queries ➔ CQL queries
All data in “info” column family for “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa
AND username='bob'
AND family='info';
Kiji queries ➔ CQL queries
Data in “info:email” and last play of “help” for “bob” ➔
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='info' AND qualifier='email';
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='songs' AND qualifier='help' LIMIT 1;
Kiji queries ➔ CQL queries
All songs played by “bob” on April 2nd ➔
SELECT qualifier, value FROM music WHERE
userid=0xfa AND username='bob' AND family='songs'
AND timestamp >= 1396396800
AND timestamp <= 1396483200
ALLOW FILTERING; 😱😱
Kiji queries ➔ CQL queries
Bad Request: PRIMARY KEY
part timestamp cannot be
restricted (preceding part
qualifier is either not
restricted or by a non-EQ
relation)
Queries that do not map well to CQL
• Break up into multiple CQL queries
– Hooray for Session#executeAsync!
• Filter on the client
– Potentially very expensive, but functional
– Provide warning to user
• Educate users about table layout
– Layout in previous example is terrible for that query
• Most issues related to “time” dimension
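
The first strategy (break up into multiple queries) can be sketched in Python. Here `query_one_qualifier` is a stand-in for a single CQL query that restricts family and qualifier by equality and timestamp by range; the thread pool plays the role that asynchronous execution plays in the Java driver. The sample rows, function names, and the assumption that the list of qualifiers is already known are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up sample cells for one entity: (family, qualifier, timestamp, value).
ROWS = [
    ("songs", "help", 1396400000, "play-a"),
    ("songs", "help", 1396300000, "play-b"),
    ("songs", "let it be", 1396450000, "play-c"),
    ("songs", "let it be", 1396500000, "play-d"),
]


def query_one_qualifier(family, qualifier, start, end):
    """Stand-in for one query with equality on family and qualifier
    and a range restriction on timestamp."""
    return [
        row for row in ROWS
        if row[0] == family and row[1] == qualifier and start <= row[2] <= end
    ]


def cells_in_range(family, qualifiers, start, end):
    """Issue one query per qualifier concurrently and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(query_one_qualifier, family, qualifier, start, end)
            for qualifier in qualifiers
        ]
        return [row for future in futures for row in future.result()]


april_2nd = cells_in_range("songs", ["help", "let it be"], 1396396800, 1396483200)
print(april_2nd)
```

Each sub-query fixes every clustering column before the timestamp, so none of them needs filtering; the cost moves to the number of round trips, which concurrency helps to hide.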
MapReduce
• Wrote new InputFormat, OutputFormat
• Hadoop 2.x
• Multiple C* queries per RecordReader
• Does not use Thrift
Project status and next steps
Initial release in ~2 weeks
• Cassandra as part of the Bento Box
• Cassandra working in KijiSchema, KijiMR
Support in the coming months
• Cassandra integration with KijiREST,
KijiScoring, KijiExpress, etc.
• Expose Cassandra-specific features to users
– Variable consistency levels
– Load-balancing policies
– Diagnostics (e.g., route tracing)
• Kiji support in CQLSH
– Decode Avro values
Thanks to Cassandra community
• Great help on mailing lists for users, dev, Java
driver
• Webinars, meetups, C* Summit all available
online
• Free training from DataStax
• Very easy to get up to speed
Try it now -- Kiji Bento Box
• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks
www.kiji.org/getstarted
KijiSchema
• Schemas and data serialization
• Complex data types (e.g.,
nested maps)
• Schema evolution
• Metadata
• Composite row keys
• Transparent paging
• Data-definition language, REPL
Schema support
Support for complex schemas with Avro
record UserLog {
    long timestamp;
    int user_id;
    string url;
}
KijiSchema allows schema versioning
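
A simplified Python sketch of what schema versioning buys: a record written with an older UserLog schema can still be read with a newer one that adds a field, provided the new field has a default. This only imitates the spirit of Avro schema resolution; it does not use the Avro library, and `read_with_schema`, the `-1` default, and the sample record are assumptions.

```python
_REQUIRED = object()  # sentinel: the field has no default

# Newer reader schema: adds session_id with a default value.
READER_SCHEMA_V2 = {
    "timestamp": _REQUIRED,
    "user_id": _REQUIRED,
    "url": _REQUIRED,
    "session_id": -1,  # assumption: default for records written before the field existed
}


def read_with_schema(record, reader_schema):
    """Project a stored record onto the reader's schema, filling defaults."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not _REQUIRED:
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return resolved


# A record written with the older three-field schema (made-up values).
old_record = {"timestamp": 1396560123, "user_id": 42, "url": "http://example.com/"}
upgraded = read_with_schema(old_record, READER_SCHEMA_V2)
print(upgraded)
```

Fields the reader does not know about are dropped and fields the writer did not know about take their defaults, which is why old and new data can live in the same column.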
Column name translation
• “family:qualifier” ➔ “A:B”
• Saves disk space
• Improves performance
• User-facing tools translate names
• Possible to turn this off
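
One way to picture the translation is a two-way dictionary that hands out short names on first use. This Python sketch is an assumption about the general approach, not Kiji's implementation: the `NameTranslator` class, the single shared name space for families and qualifiers, and the A, B, ..., Z, AA naming sequence are all hypothetical.

```python
import itertools
import string


class NameTranslator:
    """Hypothetical two-way map between long and short column names."""

    def __init__(self):
        self._short = {}  # long name -> short name
        self._long = {}   # short name -> long name
        # Yields A, B, ..., Z, AA, AB, ...
        self._names = (
            "".join(chars)
            for length in itertools.count(1)
            for chars in itertools.product(string.ascii_uppercase, repeat=length)
        )

    def to_short(self, name):
        if name not in self._short:
            short = next(self._names)
            self._short[name] = short
            self._long[short] = name
        return self._short[name]

    def shorten_column(self, column):
        family, qualifier = column.split(":", 1)
        return f"{self.to_short(family)}:{self.to_short(qualifier)}"

    def expand_column(self, column):
        family, qualifier = column.split(":", 1)
        return f"{self._long[family]}:{self._long[qualifier]}"


translator = NameTranslator()
stored = translator.shorten_column("info:email")
print(stored, translator.expand_column(stored))
```

Since every cell repeats its family and qualifier in the primary key, shorter stored names shrink each row, and user-facing tools can apply the reverse mapping before display.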
Kiji queries ➔ CQL queries
All data in family “songs” for user “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob'
AND family='songs';
