Organisation Structure For digital Transformation Team
Cassandra data modelling best practices
1. Cassandra data modelling best practices:
1. Composite Type use throughAPIclientisnotrecommended.
2. Supercolumnfamilyuse isnotrecommendedasitde serializeall the columnsonusage as
againstdeserializationof singlecolumn.
3. We can create wide rows (huge columnsandseveral rows) andskinnyrows(smallcol andhuge
rows).
4. Valuelesscolumn;if Rowid={City+uid} we wanttowrite/readonlyCitythenuidcanbe emptyor
valuelesscolumn.
5. Can expire columnbasedonTTLset inseconds.
6. Countercolumnsmaintaintostore a numberthatincrementallycountsthe occurrencesof a
particulareventorprocess.For example,youmightuse acountercolumnto count the number
of timesapage isviewed.
7. Keyspace:aclusterhas one keyspace perapplication.
Top level containerforColumnFamilies.
ColumnFamily:A containerforRow KeysandColumnFamilies
Row Key:The unique identifierfordatastoredwithinaColumnFamily
Column:Name-Valuepairwithanadditional field:timestamp
SuperColumn:A Dictionaryof Columns identifiedbyRow Key.
8. RandomPartitioneristhe recommendedpartitioningscheme.Ithasthe followingadvantages
overOrderedPartitioningasinBOP
Randompartitioner:Ituseshashon the Row Keyto determine whichnode inthe clusterwill be
responsible forthe data.The hash value isgeneratedbydoingMD5 on the Row Key.Each node
inthe clusterina data centerisassignedsectionsof thisrange (token) andisresponsible for
storingthe data whose RowKey’shashvalue fallswithinthisrange.
TokenRange = (2^127) ÷ (# of nodesinthe cluster)
If the clusterisspannedacrossmultiple datacenters,the tokensare createdforindividual data
centers. Whichisbetter.
Byte OrderedPartitioner(BOP):Itallowsyoutocalculate yourowntokensandassign to nodes
yourself asopposedtoRandomPartitionerautomaticallydoingthisforyou.
9. Partitioning => Pickingoutone node tostore firstcopy of data on
Replication => Pickingoutadditional nodestostore more copiesof data.
Storage commitlog(durability) flushittomemtables(in-memorystructures) SSTables
whichcompact data usingcompactiontoremove stale dataand tombstones(indicatorthatdata
deleted).
10. Binaryprotocol isfasterthan thrift.
11. Why RP?
1. RP ensuresthatthe data is evenlydistributedacrossall nodesinthe clusterandnotcreate
data hotspotas inBOP.
2. When a newnode isaddedto the cluster,RPcan quicklyassignita new tokenrange and
move minimumamountof datafromothernodesto the new node whichitis now responsible
for.With BOP,thiswill have tobe done manually.
2. 3. Multiple ColumnFamiliesIssue:BOPcancause unevendistributionof dataif youhave
multiple columnfamilies.
4. The onlybenefitthatBOPhasoverRP isthat it allowsyoutodo row slices.You can obtaina
cursor like inRDBMS and move overyourrows.
12. columnfamilyasa map of a map.
SortedMap<RowKey,SortedMap<ColumnKey,ColumnValue>>
A map givesefficientkeylookup,andthe sortednature givesefficientscans.InCassandra,we
can use row keysandcolumnkeysto do efficientlookupsandrange scans.
13. The numberof columnkeysisunbounded.Inotherwords,youcan have wide rows.
A keycan itself holdavalue.Inotherwords,youcan have a valuelesscolumn.
14. You needtopass the timestampwitheachcolumnvalue,forCassandratouse internallyfor
conflictresolution.However,the timestampcanbe safelyignoredduringmodeling.
15. Start withquerypatternsandcreate ER model.Thenstartdeformalizingandduplicating. helps
to identifythe most frequentquerypatternsandisolate the lessfrequent.
Querypattern:
3. Get userby userid
Get itembyitemid
Get all the itemsthata particularuserlikes
Get all the userswholike a particularitem
Option1: Exact replicaof relational model.
Option2: Normalizedentitieswithcustomindexes
Option3: Normalizedentitieswithde-normalizationintocustomindexes
Option4: Partiallyde-normalizedentities
4. Keyspaces: container for column families and a cluster has 1 keyspace per application.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor='2';
Single device per row - Time Series Pattern 1
Partitioning to limit row size - Time Series Pattern 2
The solution is to use a pattern called row partitioning by adding data to the row key to limit the
amount of columns you get per device.
Reverse order timeseries with expiring columns -
Time Series Pattern 3
Data for a dashboard application and we only want to show the last 10 temperature readings. With
TTL time to live for data value it is possible.
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time),
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES
('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20;
5. create table Inbound (
InboundID int not null primary key auto_increment,
ParticipantID int not null,
FromParticipantID int not null,
Occurred date not null,
Subject varchar(50) not null,
Story text not null,
foreign key (ParticipantID) references Participant(ParticipantID),
foreign key (FromParticipantID) references Participant(ParticipantID));
create table Inbound (
ParticipantID int,
Occurred timeuuid,
FromParticipantID int,
Subject text,
Story text,
primary key (ParticipantID, Occurred));
1 Define the User Scenarios This ensures User participation and commitment.
2
Define the Steps in each
Scenario
Clarify the User Interaction.
3 Derive the Data Model.
Use a Modelling Tool, such as Data Architect or ERWin to
generate SQL.
4 Relate Data Entities to each Step. Create Cross-reference matrix to check results.
5
Identify Transactions for each
Entity
Confirm that each Entity has Transactions to load and read
Data
6 Prepare sample Data In collaboration with the Users.
7 Prepare Test Scripts Agree sign-off with the Users.
8 Define a Load Sequence
Reference Data, basics such as Products, any existing Users
or Customers,etc..
9 Run the Test Scripts Get User Sign-off to record progress.