MySQL Cluster replication
Frazer Clement
Senior Software Engineer, Oracle
frazer.clement@oracle.com
messagepassing.blogspot.com
April 2014
Session Agenda
• Intro
• MySQL replication
• MySQL Cluster replication
• Recommendations
• Advanced topics
Disclaimer
THE FOLLOWING IS INTENDED TO OUTLINE OUR GENERAL PRODUCT DIRECTION. IT IS INTENDED FOR INFORMATION PURPOSES ONLY, AND MAY NOT BE INCORPORATED INTO ANY CONTRACT. IT IS NOT A COMMITMENT TO DELIVER ANY MATERIAL, CODE, OR FUNCTIONALITY, AND SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. THE DEVELOPMENT, RELEASE, AND TIMING OF ANY FEATURES OR FUNCTIONALITY DESCRIBED FOR ORACLE'S PRODUCTS REMAINS AT THE SOLE DISCRETION OF ORACLE.
Introduction
Who?
Frazer Clement
Senior Software Engineer, Oracle
Based in Edinburgh, UK.
Joined MySQL AB in 2007, then Sun, then Oracle...
Worked on NdbApi, replication, cluster membership, conflict detection and most areas of Cluster.
Worked with customers for several years to help solve their problems.
Previously worked for Nortel / IBM on HLR, HSS...
Including using MySQL Cluster as an HSS database since ~2005.
Strong telco focus.
Expectations
● The audience is familiar with replication concepts, limitations etc. We will review some of that material, but just as context.
● You will have questions as we proceed – happy to answer as we go, unless it is too time consuming, in which case we defer to the end.
● I don't have material to cover everything, but can use a whiteboard.
● I will not know all the answers :)
● I might ask you questions :)
● You will have great ideas and suggestions!
● Happy to discuss concepts and ideas, but I cannot commit to any future development or fixes.
Structure
● I cover MySQL replication separately from MySQL Cluster replication here.
● This reflects the implementation – much of it is in the generic MySQL Server code base, which means:
● We benefit from all the features implemented at that level.
● We benefit from the testing performed by the huge installed base of users.
● It is implemented by a different team within Oracle – they work on it for us for 'free' :)
● It is designed to be generic across different storage engines (MyISAM, InnoDB, Ndb cluster).
● It is not so easily modified to suit the specific needs of MySQL Cluster.
MySQL replication
MySQL replication topologies
[Diagram: Master–Slave, Master–Master, Circular + Slaves, and Master – multi-slave tree topologies]
Constraint: 1 Master per Slave at a time.
MySQL replication components
[Diagram: Master Server (SQL clients → Server logic → Pluggable Storage Engines / Binlogging → Binlog files → Binlog Dump Threads) feeding Slave servers (Slave IO thread → Relay log files → Slave SQL thread)]
MySQL uses IO buffering, so when producer and consumer are close, data is passed in memory.
MySQL Binlogging 1
[Diagram: SQL clients → Server logic → Pluggable Storage Engines → Binlogging → Binlog Index file and BinlogFile.000003–000006 over time]
Binlog rotation is based on file size, or a manual flush command. Purge can be time based or manual.
During transaction execution, a binlog transaction is cached by client threads. At transaction commit time, the client thread takes the binlog lock and writes the transaction to the binlog.
[In the diagram, light yellow binlog transaction caches are in memory; strong yellow ones have spilled to disk.]
MySQL Binlogging 2
The Binlog gives a storage engine independent operation log.
● For single-server systems, Binlog durability via fsync is important.
● fsync has a large performance impact.
● Binlog file size (--max_binlog_size) trades off the number of files and rotation/open/close cost against the granularity of purge.
● Size-based rotation occurs as part of writing an event to the binlog – i.e. some client session does the work and spends the time.
● Time-based PURGE (--expire_logs_days) is triggered when rotating, and is also performed by some client session.
● The binlog transaction cache (--binlog_cache_size) bounds the largest in-memory cacheable binlog transaction. Larger transactions are spooled to disk prior to commit-to-binlog. See SHOW STATUS LIKE '%binlog_cache%';
The PURGE LOGS and FLUSH LOGS commands can manually invoke purge and rotate actions.
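For example, rotation, purging and cache monitoring can all be driven manually (file names and dates here are illustrative):
mysql> FLUSH LOGS;                                      -- rotate to a new binlog file
mysql> PURGE BINARY LOGS TO 'BinlogFile.000004';        -- remove binlogs before the named file
mysql> PURGE BINARY LOGS BEFORE '2014-04-01 00:00:00';  -- or purge by date
mysql> SHOW STATUS LIKE 'Binlog_cache%';                -- Binlog_cache_disk_use vs Binlog_cache_use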
MySQL Binlogging 3
● The Binlog is a series of variable sized events.
● Events have a type, size and server id.
● Some are Binlog metadata: FORMAT_DESC, ROTATE.
● The original implementation was 'Statement Based Replication' (SBR) – including statements in QUERY event types. It is challenging to ensure determinism at the Slave.
● 'Row Based Replication' (RBR) uses WRITE_ROW, UPDATE_ROW and DELETE_ROW events and can improve slave performance for small transactions.
● In both cases, transactions are demarcated by BEGIN and COMMIT QUERY events.
● Transactions are kept in a single Binlog file – often rotation occurs after a COMMIT event.
MySQL Cluster uses Row Based Replication.
MySQL RBR
Row based replication
● The basic format of a transaction is:
BEGIN, TABLE_MAP*, (WRITE_ROW|UPDATE_ROW|DELETE_ROW)*, COMMIT
● TABLE_MAP is for efficiency – mapping a table name to an id for the scope of the transaction – later *_ROW events use the id.
● Each *_ROW event can contain one or more sets of row images (changed rows), where each set of images:
● Is the same operation type (INSERT/UPDATE/DELETE)
● Is on the same table
● Affects the same columns
● RBR can include some statements: DDL etc...
● As normal, SQL_LOG_BIN=0 can temporarily disable Binlogging.
RBR events have few determinism issues.
RBR Example
FORMAT_DESC                     -- Description event at start of Binlog file
CREATE TABLE HLR.AUTH_INFO(...  -- Create table recorded in event
BEGIN                           -- Binlog transaction start
TABLE_MAP(HLR.CUG, 1)           -- Table map events with transaction scope
TABLE_MAP(HLR.AUTH_INFO, 2)
WRITE_ROW(1, …)                 -- Events containing one or more row image sets
WRITE_ROW(2, …)
UPDATE_ROW(1, …)
WRITE_ROW(1, …)
DELETE_ROW(1, …)
COMMIT                          -- Binlog transaction end
BEGIN                           -- Start of new Binlog transaction
TABLE_MAP(HLR.AUTH_INFO, 1)
DDL statements appear between DML transactions – not interleaved.
mysqlbinlog --verbose is great for analysing Binlogs + Relay logs.
Also SHOW BINARY LOGS, SHOW BINLOG EVENTS.
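For example (file name illustrative; decode-rows renders the row events readably):
shell> mysqlbinlog --verbose --base64-output=decode-rows BinlogFile.000006
mysql> SHOW BINARY LOGS;
mysql> SHOW BINLOG EVENTS IN 'BinlogFile.000006' LIMIT 10;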
MySQL Binlogging
DDL statements are (should be) relatively rare, so the Binlog files should be mostly sequences of back-to-back transactions.
Binlog file positions are denoted using a file {name, position} pair. The position is a byte offset, and many offsets are invalid. Valid offsets are the start of an event.
[Diagram: the active Binlog file as a sequence of events, with byte offsets from 0 and new events appended at the end]
MySQL Binlog dump threads
[Diagram: SQL clients → Server logic → Pluggable Storage Engines → Binlogging → Binlog files → Binlog Dump Threads → Slave servers]
- Slaves connect to a Master in the normal way as a client, authenticate, then issue a BINLOG_DUMP command, which causes their session thread to become what we call a BINLOG DUMP thread.
- BINLOG DUMP threads have a Binlog {File, Position} pair from where they are reading. Each can have a different position.
- Where they are close to the 'head' of the active Binlog, data is passed via memory.
- Generally the Binlog Dump threads read as much data as is available and can be sent over to the Slave – TCP backpressure allows the Slave to control the rate.
MySQL slave threads: IO thread
[Diagram: Master Server → Slave IO thread → Relay log Index file and RelaylogFile.000019–000022 over time → Slave SQL thread → Server logic → Pluggable Storage Engines]
- The Slave IO thread connects to the Master and issues the BINLOG_DUMP command.
- Events received are filtered by server id, then written to relay log files.
- The Slave IO thread can operate entirely separately from the Slave SQL thread, replicating the Binlog files from the Master.
- Relay logs are almost exactly the same as Binlogs.
- FLUSH LOGS can manually rotate the active relay log.
IO and SQL threads can be stopped and started together or separately using START SLAVE and STOP SLAVE.
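A minimal sketch of pointing a slave at a master and starting both threads (host, credentials and position are illustrative):
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='master.example.com',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='secret',
    ->   MASTER_LOG_FILE='BinlogFile.000004',
    ->   MASTER_LOG_POS=4;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G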
MySQL slave threads: SQL thread
[Diagram: as before – the Slave IO thread writes relay logs; the Slave SQL thread reads them and calls into the Server logic and Storage Engines]
- The Slave SQL thread reads relay logs, via memory if possible, and executes the events.
- Event execution can result in Storage Engine (SE) calls, defining transactions, operating on data etc.
- When the Slave SQL thread reaches the end of a relay log file and moves onto the next, the old relay log file is purged automatically.
- Relay logs are mostly 'invisible' in normal operation.
Separate IO and SQL threads decouple k-safety / geo redundancy from replica consistency / slave apply performance limitations or issues.
MySQL slave SQL thread
● Most storage engines (e.g. InnoDB, MyISAM) have no slave-specific code. They are invoked by the Server logic and behave as they would for any client session invocation.
● Normal MySQL replication has a 1:1 mapping between 'user transactions' and 'binlog transactions'.
● User transactions execute in parallel at the Master and are serialised only when writing their binlog cache contents to the Binlog at transaction commit time.
● The Binlog forces a single serial order on concurrent transactions.
● At the Slave, the SQL thread executes transactions in Binlog order. This may be less concurrent than the original execution at the Master.
● At the Slave, the SQL thread must perform all blocking disk I/O, which might have been performed by concurrent threads on the Master. This can be the limit on Slave throughput.
Recent MTS (multi-threaded slave) work alleviates the single-threaded slave SQL thread limitation somewhat.
MySQL slave SQL thread
● The slave SQL thread can encounter permanent or temporary errors.
● Permanent errors generally stop the SQL thread (but not the IO thread).
● Some permanent error types can be ignored: --slave-skip-errors
● Some classes of errors can be ignored with an alias: --slave-skip-errors=ddl_exist_errors
● Temporary errors result in transaction rollback and limited retries.
● --slave_transaction_retries controls the maximum number of retries before a temporary error will cause the SQL thread to stop.
● Temporary errors result in an entry in the Server's error log file.
● Retries are not immediate, but have bounded, growing inter-retry delays to give time for temporary conditions to resolve.
● Problematic events can be manually skipped by setting the sql_slave_skip_counter variable before restarting the Slave SQL thread.
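For example, skipping one problematic event group and resuming (a sketch of the standard procedure):
mysql> STOP SLAVE SQL_THREAD;
mysql> SET GLOBAL sql_slave_skip_counter = 1;
mysql> START SLAVE SQL_THREAD;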
MySQL Cluster replication
MySQL Cluster replication
● Topologies and Components
● Cluster internals
● HA and distribution
● Event ordering
● SUMA buffering and duplicates
● SendBuffer
● NdbApi and SUMA concepts
● NdbApi internals
● NdbApi event buffering
● Events and Blobs
● Binlog Injector
● Ndb slave
MySQL Cluster replication
● Ndb Binlog Injector
● Ndb slave
MySQL Cluster replication
Built on top of MySQL replication:
● Tried and tested stack
● Flexibility – replicate to and from other systems
● Benefit from performance improvements and bug fixing
MySQL Cluster adds:
● Binlogging of changes from other cluster clients, including NdbApi
● HA replication – no SPOF
● Transactional replication position on the slave
● Higher replication performance through batching and parallelism
● Moving parts and complexity
MySQL Cluster replication topologies
[Diagram: Master–Slave, Master–Master, Circular + Slaves, and Master – multi-slave tree, with whole clusters in each role]
All the standard topologies, with whole clusters as M, S or S/M.
MySQL Cluster replication topologies 2
[Diagram: Star/Hub with upstream master and downstream Slave tree; Multi-master]
A Slave Cluster can have multiple Masters.
MySQL Cluster replication components
For HA, each cluster has redundant Master and Slave MySQL Servers. Most commonly two servers, with only one Slave Server active at a time. Both MySQL Servers write Binlog, but commonly only one is serving Binlogs to downstream slaves. A single server can perform both Master and Slave roles.
[Diagram: MySQL Cluster data nodes and other clients, with Master servers receiving NdbApi events and Slave servers applying NdbApi DML/DDL, connected over the MySQL protocol]
MySQL Cluster replication components 2
[Diagram: each Slave server has IO and SQL threads plus relay logs; each Master server has a Binlog Injector (INJ), DUMP threads and Binlogs; all attach to the data nodes via NDB]
A single server can perform both Master and Slave roles simultaneously.
MySQL cluster global checkpoint 1
● MySQL Cluster is designed for high write-rate systems, with low, stable latency.
● Parallelism and concurrency exist at many levels: rows, operations, fragments, data nodes, transactions, clients.
● Operations, transactions and therefore clients generally only interact/contend where they operate on the same data (rows).
● Otherwise, transactions are entirely parallel and unsynchronised.
● Great for throughput with low latency.
● Does not provide a single serial history of transactions.
● Does not provide a notion of consistent points in time.
● Consistent points are identified by a separate mechanism – the Global Checkpoint.
MySQL cluster global checkpoint 2
● Consistent points are used for:
- Creating potential system / backup recovery points
- Defining points in change event streams (replication)
● The set of changes between two consistent points is referred to as an epoch, and is identified by a cluster-unique 64-bit epoch number.
● Epoch numbers have a high 32-bit word called the GCI (Global Checkpoint Index), and a low 32-bit word called the micro-GCI.
● System/backup recovery points are at GCI boundaries, and are created on the period TimeBetweenGlobalCheckpoints – defaults to 2000 millis.
● Event stream consistency points are at micro-GCI boundaries, and are created on the period TimeBetweenEpochs – defaults to 100 millis.
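Since the 64-bit epoch number is (GCI << 32) | micro_GCI, the two halves can be extracted with a shift and a mask – for example against the epoch column of mysql.ndb_apply_status:
mysql> SELECT server_id,
    ->        epoch >> 32 AS gci,
    ->        epoch & 0xFFFFFFFF AS micro_gci
    ->   FROM mysql.ndb_apply_status;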
MySQL cluster global checkpoint 3
● The ratio between these times results in the pattern of epoch numbers seen. The default ratio is 20:1, resulting in micro-GCI values 0..19, then an increment of the GCI value.
● In disk-overloaded systems, sometimes the GCI increment is stalled for longer, and so higher micro-GCI values are seen – this can be a warning of redo disk IO problems.
● Epoch numbers are often logged as <GCI>/<microGCI>, generally more readable than the 64-bit representation.
● Epoch numbers are assigned at transaction commit time, by the transaction's Transaction Coordinator (TC) – a component on the data node.
● To get permission to commit, and an epoch number assigned, a transaction must be fully prepared – e.g. holding all the row locks it needs. This implies that it has no dependencies on any other running transaction.
MySQL cluster global checkpoint 4
● The Global Checkpoint protocol works by broadcasting a prepare-to-increment signal to all TC instances in the cluster, causing them to gate new transaction commits (but continue all other processing). Once all TC instances have acknowledged, a commit-increment signal is broadcast, and all TC instances resume committing.
● The effect is that the parallel streams of committing transactions are divided into before and after, with the following properties:
All transactions in epoch n 'happened before' those in epoch n+1.
Therefore an epoch boundary is a consistent point.
● Note that a parallel system has many equivalent partial event orderings, and epochs are just one of them, selected arbitrarily.
MySQL cluster row change events
● Epoch numbers are assigned when transactions begin to commit.
● Commit of large transactions, especially those involving disk data tables, can take some time.
● Post-commit triggers in the Tuple Manager (TUP) component in the data nodes send row change details to the Subscription Manager component.
● The Subscription Manager (SUMA) manages the forwarding / multicasting of row change details to NdbApi / MySQLD clients.
● Each row change has an associated epoch number.
● When a TUP instance has completed commit processing for all transactions in an epoch, it notifies SUMA.
● When all of the local TUP instances have completed an epoch, SUMA informs its subscribers.
MySQL cluster row change events flow
[Diagram: data nodes (NoOfReplicas=2), each with TC, LDM/TUP and SUMA instances, sending NdbApi events to Subscriber 1 and Subscriber 2 API nodes]
Writes are synchronously replicated within a nodegroup.
All SUMA components in a nodegroup observe all write events.
Events are hashed to buckets independently of the node-local fragment replica role, e.g. write events can be delivered from 'Primary' or 'Backup' fragments.
MySQL cluster row change events – nodegroup
[Diagram: as before]
A single-row write transaction from an API node routes to one TC instance, from where it is forwarded to the LDM instance managing the primary fragment replica for the fragment. Then it is synchronously replicated to the nodegroup peer. Both nodes forward the event to their SUMA instance, but only one SUMA forwards the event to the subscribing API nodes. The other buffers it.
MySQL cluster row change events + epochs
[Diagram: as before]
After an epoch increment, the TUP instances gradually finish committing the transactions from the previous epoch and forwarding their events to the local SUMA instance. When SUMA receives this 'epoch completed' signal it forwards it to its subscribers. This tells the subscribers that they have received all of the events for the given epoch from the source data node.
MySQL cluster row change events + epoch ack
[Diagram: as before]
Once subscribers have received an epoch-completed event, they immediately respond with an acknowledgement back to the data nodes. Once all subscribers have acknowledged reception of an epoch to all data nodes, SUMA can release the epoch's event buffer space for reuse.
MySQL cluster row change events + failure
[Diagram: as before, with one data node failed]
When a node fails unexpectedly, there will likely be events sent to subscribers for as-yet unacknowledged epochs. In this case, there is uncertainty about whether all of the subscribers received all of the unacknowledged events or not. To solve this, a surviving node in the nodegroup will 'take over' the bucket, and use its buffered events to resend unacknowledged epochs.
MySQL cluster row change events + resend
[Diagram: as before]
Depending on when the failure occurred, its nature, buffers etc., the subscribers may have received the original events or not. In any case they are re-sent by a nodegroup peer. Currently this can mean that the API sees duplicate events within an epoch. Once all buffered events are re-sent, the normal epoch-completed protocol is followed.
STRICT mode problem.
MySQL cluster row change event ordering
Observations:
● Row changes are distributed according to the row's primary key distribution hash.
● A single transaction can affect different fragments on the same or different data nodes.
● Even within a data node, the row changes are forwarded or buffered based on bucket membership, which is a function of the primary key hash, but independent of the local fragment replica's current primary or backup role.
● The cluster does not actively maintain transaction ordering information within an epoch.
● Therefore: events arrive at subscribers in a partially sorted order.
MySQL cluster row change event ordering 2
Events arrive at subscribers in a partially sorted order:
● Events in epoch n+1 occurred after events in epoch n.
● Events within an epoch are only ordered w.r.t. individual primary keys. (It is guaranteed that a given primary key value will be in a particular SUMA bucket.)
Implications:
● Inter-row constraints may be temporarily violated when applying the row events in order (unique keys, foreign keys).
● It's not trivial to extract and order the original 'user transactions' from the event stream (this requires a per-event transaction id and a topological sort).
● Consistency is only guaranteed at epoch boundaries.
MySQL cluster row change event duplicates
In node failure cases, unacknowledged epochs are re-sent to subscribers. Subscribers do not currently filter out the re-sends. Therefore it is possible to have duplicated events in an epoch.
This is one reason why Ndb slaves require IDEMPOTENT mode – to allow them to handle cases where sequences of operations to a primary key are partially or fully repeated...
[INSERT, UPDATE] → [INSERT, UPDATE][INSERT, UPDATE]
[INSERT, UPDATE] → [INSERT][INSERT, UPDATE]
[DELETE] → [DELETE][DELETE]
[UPDATE, DELETE] → [UPDATE, DELETE][UPDATE, DELETE]
MySQL cluster row change SUMA balance
The row changes occurring in each nodegroup are divided into a number of buckets (or slices), using a primary key hash. The number of slices is designed to always balance across the available nodes in a nodegroup.
Responsibility for forwarding events in each slice to subscribers is given to one of the nodes in the group, and the others will buffer the events.
As each node in the nodegroup is forwarding and buffering the same number of slices/buckets, and as the slices are based on an MD5 hash, their forwarding IO and buffering capacity should be balanced.
For NoOfReplicas=2, each nodegroup has two slices, with each node responsible for forwarding one and buffering one.
MySQL cluster row change SUMA buffering
● Data nodes buffer changes in SUMA to handle unexpected data node failure without a break in the NdbApi row change event streams.
● SUMA event buffering is in-memory and finite.
● Buffer space is consumed by data change in the local node, and released by acknowledgements from all subscribing Api nodes.
● Buffer space is liable to increase due to: network problems to subscribers, slow subscribers, failed-but-not-yet-detected Api nodes, and cluster write rate spikes.
● To protect the data nodes, SUMA event buffering can be limited in terms of the number of epochs buffered and the number of bytes (cluster config MaxBufferedEpochs and the (new) MaxBufferedEpochBytes).
MySQL cluster row change SUMA buffering
● MaxBufferedEpochBytes limits the amount of memory SUMA can use for buffering events.
● MaxBufferedEpochs limits the number of unacknowledged epochs that SUMA will accept from any subscriber.
● Reaching MaxBufferedEpochBytes is not an immediate problem, as the data nodes can stop buffering but keep forwarding. However, it means that the cluster is no longer resilient to data node failure – in the event of a data node failure, the Api nodes will be informed that there is a gap in the event stream. For replication, this requires a Backup + Restore cycle to resync the Slave. For this reason it should be avoided.
● "Out of event buffer: nodefailure will cause event failures, consider increasing MaxBufferedEpochBytes"
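A config.ini sketch of these two limits (the values are illustrative defaults, not recommendations; MaxBufferedEpochBytes is the newer parameter, so check availability in your version):
[ndbd default]
MaxBufferedEpochs=100           # epochs a lagging subscriber may fall behind
MaxBufferedEpochBytes=26214400  # bytes of SUMA event buffering per node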
MySQL cluster row change SUMA buffering
● When MaxBufferedEpochs is reached, subscribers with that epoch lag will be disconnected, allowing their buffered data to be released.
● This asynchronous disconnect from the data node side of the connection will result in all data nodes disconnecting the subscribing Api node, and will appear to NdbApi like a 'cluster failure'.
● For MySQLD, the Binlog Injector thread will inject a GAP Incident event into the Binlog.
● The Api node is then free to reconnect and attempt to establish new subscriptions...
● "Disconnecting lagging nodes ..."
MaxBufferedEpochs can be used like a 'watchdog' on lagging subscribers – perhaps disconnecting them and allowing them to reconnect can clear problems they may be having.
It is not really necessary as a guard on the data node buffering capacity, as that is limited by MaxBufferedEpochBytes. However, beware when setting it to an ultra-high value that a SUMA-internal pool is allocated based on its setting.
MySQL cluster row change SUMA buffering
Monitoring:
● Bytes buffered in SUMA: cannot be directly monitored.
● Epochs buffered in SUMA: NdbInfo ndb$pools, block is SUMA, resource is "GCP".
● DUMP 8013 puts a summary of the oldest buffered epoch, and which subscribers' acknowledgements are pending, into the cluster log.
Potential improvements: NdbInfo views on subscriptions, subscribers, buckets, epochs, buffered bytes, volumes of data sent to individual subscribers etc...
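For example, the DUMP code can be issued from the management client, and the hidden pool table queried from a MySQLD. This is a sketch: DUMP codes are internal and unsupported, and the ndb$pools column and pool names used below are assumptions to verify against your version:
ndb_mgm> ALL DUMP 8013
mysql> SELECT node_id, used, total
    ->   FROM `ndbinfo`.`ndb$pools`
    ->  WHERE pool_name = 'GCP';  -- assumed filter for the SUMA 'GCP' resource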
MySQL cluster row change SendBuffer
● Most deployments have two or more 'Binlogging MySQLDs', and a set of NdbApi clients and/or non-binlogging MySQLD servers.
● The Binlogging MySQLDs subscribe to row change events on almost every table in the cluster – this means that the data sent from data nodes to Api nodes is noticeably higher for Binlogging MySQLDs than for other Api clients.
● When the rate of change in the Cluster is high, this imbalance can cause problems with the SendBuffer resource.
● SendBuffer is used to decouple the non-blocking data node core from the blocking socket send protocols used to actually send data remotely.
● Binlogging MySQLDs generally need more SendBuffer configured on the data nodes than other Api nodes, to soak up spikes in the change rate.
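Per-link SendBuffer can be raised for the data node to binlogging MySQLD connections in config.ini – a sketch with illustrative node ids and size:
[tcp]
NodeId1=1          # data node
NodeId2=10         # binlogging MySQLD
SendBufferMemory=8M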
MySQL cluster row change SendBuffer
[Diagram: data nodes (NoOfReplicas=2) and many Api nodes, with NdbApi events flowing only to Subscriber 1 and Subscriber 2]
Generally there are many more Api nodes than subscribers.
Binlogging MySQLDs subscribe to all row change events – so they are directly affected by the rate of change in a cluster.
Links to Binlogging MySQLDs must be allowed more SendBuffer than other Api links.
MySQL cluster system restart + row changes
Epoch boundaries are used for system restore capabilities.
A cluster system restart is where all nodes start from on-disk state – LCP + redo logs.
However, only a subset of epoch boundaries (GCI boundaries) are eligible as system restore points.
GCI boundaries are epochs with micro_GCI == 0. They normally occur every TimeBetweenGlobalCheckpoints milliseconds – defaults to 2000 millis.
However, epochs are incremented every TimeBetweenEpochs millis – defaults to 100 millis.
Therefore the completed epochs will almost always contain changes which will not be available after a sudden system restart.
MySQL cluster system restart + row changes
Completed epochs will almost always contain changes which will not be available after a sudden system restart.
● Therefore the Binlogs of subscribing MySQLD nodes can contain changes which are not in the cluster data nodes after a restart.
● Therefore the downstream slave(s) can contain changes which are not stored in the (old) Master cluster after a restart.
● Therefore care must be taken to understand the restoration point of the failed cluster during the restart, so that a decision about how to resync can be made – perhaps local or slave Binlogs can be replayed?
● A failed cluster should not be allowed to come online and serve traffic after a system restart without some analysis of the data lost.
Row change events in NdbApi
NdbApi has the NdbEvent and NdbEventOperation classes.
Clients can create and share NdbEvents by name. Each NdbEvent refers to a single table, and is parameterised by whether only modified columns, or all columns, are included. NdbEvents have a lifecycle independent of the creating NdbApi client, so care must be taken.
Clients create and use NdbEventOperation objects to request that the row change events defined by a particular NdbEvent object be sent to the NdbApi client. A single NdbApi client might use many NdbEventOperations (one per table) to get a view of the row changes occurring in some subset of tables.
SUMA p.o.v.: Event = Subscription, EventOperation = Subscriber.
Row change events in NdbApi
The set of currently defined events in a cluster can be seen by looking at the hidden NDB$EVENTS_0 table...

> ndb_select_all -c<> -dsys NDB$EVENTS_0
NAME EVENT_TYPE TABLEID TABLEVERSION TABLE_NAME ATTRIBUTE_MASK SUBID SUBKEY
"REPL$mysql/ndb_schema" 262143 4 1 "mysql/def/ndb_schema" [511 0 0 0] 1 65537
"REPL$mysql/ndb_apply_status" 65535 6 1 "mysql/def/ndb_apply_status" [31 0 0 0] 3 65539
"NDB$BLOBEVENT_REPL$mysql/ndb_schema_3" 393215 5 1 "mysql/def/NDB$BLOB_4_3" [15 0 0 0] 2 65538
3 rows returned
NDBT_ProgramExit: 0 - OK

MySQL replication created events usually start with REPL$ or REPLF$ for the modified-only or all-columns variants.
The subscription to mysql/ndb_schema is used by all MySQLDs to communicate about schema changes.
Note that the mysql/ndb_schema table (id 4) has a blob column which needs its own event to track those changes.
The attribute_mask here is deprecated and irrelevant.
NdbApi event concepts
NdbApi allows users to define and share Events, and to define EventOperations to subscribe to the row changes defined by the Events.
NdbApi Event == SUMA Subscription: a description of the type of changes that are of interest. Can be shared by many NdbApi clients.
NdbApi EventOperation == SUMA Subscriber: a way to request that the flow of row events from a particular Event/Subscription start flowing to this client. EventOperations are associated with an Ndb object.
NdbApi event concepts
[Diagram: relationships between Event (= Subscription), EventOperation (= Subscriber), Table, Ndb object, EventBuffer, Api node and Cluster – each Event refers to one Table, each EventOperation refers to one Event, each Ndb object has one EventBuffer, and an Api node hosts many Ndb objects]
NdbApi event buffering
● An Ndb object can be used to subscribe to changes on one or more tables by creating NdbEventOperations on them, and then polling for incoming events.
● The SUMA components on the data nodes will start sending the row changes on some epoch boundary.
● As events and epoch boundary signals arrive, thread(s) internal to the NdbApi library receive, acknowledge and buffer them.
● The buffering exists to decouple the data transfer from the data nodes to the Api client from the Api client's consumption of events.
● In cases where the change rate is very high, or the Api client is slow or stalled, the buffer will grow.
● Excessive growth of the NdbApi event buffer can cause stability issues.
NdbApi event buffering
[Diagram: data nodes → receiver thread → event buffer → user code calling Ndb::pollEvents(), all within the Api process]
Within NdbApi, the receiver thread(s) buffer and acknowledge reception of row data for an epoch. This occurs independently of the user code behaviour, and helps with quick release of data node SUMA event buffer space.
User code can retrieve new events by calling the pollEvents() method. This will return the next event from the head of the Ndb object's event buffer. Only events in completed epochs are made available to the pollEvents method.
NdbApi event buffering
NdbApi Event Buffer
implementation
+ Zeroconf
- Unbounded growth
- Can destabilise host
+ Avoid OS malloc/free
- Never return memory to OS
- Often continues to grow despite
having 'free space'
Cases observed where :
- Event buffer growth causes
host slowdown due to paging,
resulting in decrease in both
user code event consumption
rate and receiver thread
performance. Eventually
MaxBufferedEpochs drives
client disconnect and buffer re-
initialisation (good outcome)
.
- Event buffer growth causes
Linux OOM killer to choose an
ndbmtd host to kill to relieve
memory pressure.
- Memory allocated from OS
continues to grow despite
buffer having large % free.
Hard crashing limit on size has
been implemented.
Soft 'GAP insert' limit in-
progress.
NdbApi event buffer monitoring
Each Event Buffer is linked to an Ndb object. Generally you want the minimum number of buffers per process.
Currently the NdbApi event Api only exposes buffering information in terms of epochs:
- Latest epoch (GCI): the most recent epoch completely received by the NdbApi receiver thread (tail of the event buffer).
- Apply epoch (GCI): the epoch of the event currently being consumed by NdbApi user code (head of the event buffer).
These are of limited use as:
- Epoch numbers are sparse; there is no direct indication of the number of epochs buffered.
- Epochs are of different sizes in terms of row changes and change size.
NdbApi event buffer monitoring
NdbApi (and MySQLD) allow a 'GCI slip threshold' to be configured.

  --ndb-report-thresh-binlog-epoch-slip=#
      Threshold on number of epochs to be behind before reporting binlog status. E.g. 3 means that if the difference between what epoch has been received from the storage nodes and what has been applied to the binlog is 3 or more, a status message will be sent to the cluster log.
  --ndb-report-thresh-binlog-mem-usage=#
      Threshold on percentage of free memory before reporting binlog status. E.g. 10 means that if the amount of available memory for receiving binlog data from the storage nodes goes below 10%, a status message will be sent to the cluster log.

These threshold crossings cause cluster log events to be generated...
NdbApi event buffer monitoring
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/21
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/22
…
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/10
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/11
Note how apply_epoch stays fixed while latest_epoch climbs – the consumer is slipping behind.
NdbApi event buffer monitoring
NdbApi also includes some per-Ndb event stream monitoring counters. These are incremented independently of event consumption.

Ndb::getClientStat()
    DataEventsRecvdCount    = 18, /* Number of table data change events received */
    NonDataEventsRecvdCount = 19, /* Number of non-data events received */
    EventBytesRecvdCount    = 20, /* Number of bytes of event data received */

These can be seen from a MySQLD instance using:
 > SHOW STATUS LIKE 'ndb_api%injector';
NdbApi event buffer limiting
A configurable crash-stop limit on event buffer size was implemented recently. The idea is that a process shutdown and restart (with accompanying replication channel failover) is preferable to host OS destabilisation when event buffer growth is excessive.

  --ndb-eventbuffer-max-alloc=#
      Maximum memory that can be allocated for buffering events by the ndb api

Work is in progress to implement a less severe limit – where excessive buffering causes incoming events to be discarded, and the consumer becomes aware of the 'gap' when it reaches it.
NdbApi events for Blobs
● Ndb Blobs are mostly implemented at the NdbApi layer of the stack.
● Data for Blob columns is split into a small header and zero or more 'parts' (e.g. a 256 byte header, plus 0..n parts of 2000 bytes, 4000 bytes...).
● The 'header' data is a normal column in the table, something like a VARBINARY(272).
● The parts are rows in a hidden table, defined just to hold parts for that particular column. These Blob part tables are named NDB$BLOB_<tableid>_<columnnumber>.
● From the point of view of the data nodes, the part tables are normal 'user tables'.
● This allows arbitrary length data to be transactionally stored in Blob (or Text) columns, but adds complexity at the NdbApi layer.
NdbApi events for Blobs
● When a transaction modifying a Blob value (insert, update, delete) occurs, it internally involves operations on the main table as normal, and zero or more operations on rows in the part table(s) involved.
● TUP and SUMA treat the Blob part tables as separate tables.
● NdbApi receives row changes for the Blob part tables separately from the main table row change, with no ordering constraints between different rows, as normal.
● With the merge_events option on, NdbApi correlates the main table and part table events so that the Blob part table row changes are used to create a pseudo main table row change event containing all the Blob changes.
● This is implemented using an in-memory hash by PK in the NdbApi.
Event merge merges all events on a row in an epoch – e.g. separate user transactions are merged together.
NdbApi event sequencing
● NdbApi receives row events and epoch completion signals from the SUMA components of the running data nodes.
● NdbApi internally buffers events for incomplete epochs. These are not made visible to NdbApi users until all data nodes have indicated that the epoch is completed.
● As data nodes complete epochs in order, the NdbApi nodes will release event data for each epoch to the user, in order.
● A user thread serially consuming events from the event Api can be sure that when the first event for some epoch m > n is received, there will be no more events for epoch n.
● This sequencing is a kind of merge-sort by epoch number on the data node event streams.
Empty epochs at the NdbApi layer
● In a quiet or idle cluster, epochs can occur which have no row changes in them.
● The normal global checkpoint mechanisms occur, and all data nodes will send epoch-completed signals, and expect acknowledgements.
● This can be seen in the 'Latest epoch' values, which continue to climb in an idle cluster. This is the only indication of empty epochs occurring at the NdbApi layer.
● In some cases, an epoch may have events which a user does not consider relevant – e.g. slave-applied updates when --log-slave-updates=0. In this case the epoch is not empty at the NdbApi level, but may be considered as empty by the next layer up.
Ndb Binlog Injector
● We have mostly looked at the data nodes and the NdbApi event Api so far.
● The Ndb Binlog Injector (BI) is a component of MySQL Servers that are part of a MySQL Cluster.
● A better name might be 'Ndb event listener', as it is responsible for more than just Binlog generation.
● The BI uses the NdbApi event Api to listen to row changes on an internal schema table (mysql.ndb_schema).
● The BI also uses the NdbApi event Api to listen to row changes on all other tables (unless exceptions are made with mysql.ndb_replication).
● The BI writes Binlog transactions and DDL statements to the Binlog.
● The BI maintains a local mysql.ndb_binlog_index table for mapping epoch numbers to Binlog files and positions.
Ndb Binlog Injector
● The Binlog Injector is always process #1 in the SHOW PROCESSLIST output – this can be used to check its state:

mysql> show processlist;
+----+-------------+------+------+---------+------+-----------------------------------+------+
| Id | User        | Host | db   | Command | Time | State                             | Info |
+----+-------------+------+------+---------+------+-----------------------------------+------+
|  1 | system user |      |      | Daemon  |    0 | Waiting for event from ndbcluster | NULL |

● This can be used for monitoring, or for debugging event buffer growth.
Ndb Binlog Injector + schema distribution
● Schema changes in MySQL Cluster involve cooperation between all attached MySQL Servers, so that they can take any necessary steps, and serve the new schema immediately when it is committed.
● Event subscription to an internal table (mysql.ndb_schema) is used to accomplish this. All MySQL Servers listen for changes on this table using their BI, and modifications to this table are used to communicate (rows as shared memory!).
● Schema changes can generate binlog entries – "DROP INDEX...".
● The volume of change on this table is very low, and related to DDL.
● The BI uses a separate Ndb object (and EventBuffer) for its subscription to mysql.ndb_schema row changes, so occasionally it can be seen in epoch_slip logs. Usually the buffer is very small.
Ndb Binlog Injector main loop
● The BI is an event processing loop, which consumes events from a 'schema event' subscription to the mysql.ndb_schema table, and a 'data event' subscription to all the tables being Binlogged by the server.
● The loop has the following pseudo-code:

while (!(disconnected || error || ...))
    consume all schema events for epoch, taking steps required
    begin binlog transaction
    insert 'fake' ndb_apply_status write row for epoch
    consume all data events for epoch, writing to binlog transaction
    decide whether to commit or rollback binlog transaction
    commit/rollback
    write details of epoch transaction to mysql.ndb_binlog_index table
Ndb Binlog Injector main loop
[Diagram: NdbApi schema and data event streams feed the Ndb Binlog Injector, which drives the Server Binlog code (via the transaction cache) to write binlogs (BEGIN, WRITE_ROW, UPDATE_ROW, DELETE_ROW, ..., COMMIT, DROP INDEX A; BEGIN ...), and the Server MyISAM code to maintain mysql.ndb_binlog_index]
Ndb Binlog Injector
Observations:
● The BI is a bottleneck for all changes to be Binlogged in the cluster. It is not commonly a problem though...
● The BI relies on generic MySQL Binlogging code (Binlog transaction cache, inline Binlog rotate and purge).
● The BI contends for OS locks inside generic MySQL Binlogging code (though there should be low/no contention if deployed as recommended).
● The BI relies on generic MySQL processing and MyISAM table handling for ndb_binlog_index table maintenance (table locking).
● The health and liveness of the Binlog Injector is very important for MySQL Cluster replication.
Ndb Binlog Injector Monitoring
The epoch slip cluster logs mentioned before indicate where the BI is not keeping up with the latest available epochs.
NdbApi counts exposed as status variables indicate the volume of data processed by the data subscriptions of the BI:

mysql> show status like 'ndb_api%injector';
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| Ndb_api_event_data_count_injector    | 0     |
| Ndb_api_event_nondata_count_injector | 2     |
| Ndb_api_event_bytes_count_injector   | 256   |
+--------------------------------------+-------+

The output of SHOW ENGINE NDB STATUS indicates the progress of the BI in terms of epochs.
Ndb Binlog Injector Monitoring
mysql> SHOW ENGINE NDB STATUS;
| ndbcluster | binlog | latest_epoch=10672993730572, latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572, latest_handled_binlog_epoch=10672993730572, latest_applied_binlog_epoch=1022202216475 |

latest_epoch: the latest completed epoch from the NdbApi point of view – the tail of the event buffer.
latest_trans_epoch: the epoch of the most recent transaction committed to Ndb from this server.
latest_received_binlog_epoch: the epoch of the most recently consumed event from the head of the event buffer.
latest_handled_binlog_epoch: the epoch of the most recently completely processed epoch.
latest_applied_binlog_epoch: the epoch of the most recently completely processed epoch which resulted in a Binlog write.
Ndb Binlog Injector Monitoring
From the same SHOW ENGINE NDB STATUS output, observations:
- latest_epoch == latest_received_binlog_epoch == latest_handled_binlog_epoch: the NdbApi event buffer is empty, and the BI is idle.
- latest_trans_epoch < latest_handled_binlog_epoch: every cluster write done by this server has been binlogged.
- latest_handled_binlog_epoch > latest_applied_binlog_epoch: recent epochs have not had any binloggable content (quiet cluster, or slave updates...).
Ndb Binlog Injector Monitoring
Further inferences:
● latest_epoch > the other epochs: there are epochs in the NdbApi Event Buffer.
● latest_received_binlog_epoch > latest_handled_binlog_epoch: the BI is processing an epoch now.
● latest_trans_epoch > latest_handled_binlog_epoch: some transactions committed by this server are not yet in the Binlog.
● Also, with SHOW BINARY LOGS and SHOW MASTER STATUS the progress of Binlog writing can be seen.
Ndb Binlog transactions
The BI produces uniform transactions in the Binlog. Each Binlog transaction describes all of the changes that occurred in a single cluster epoch. For this reason they are sometimes referred to as epoch transactions. Each user transaction occurs in one epoch, so each user transaction's changes are recorded in one epoch transaction, and cannot span epochs.
DDL statements and any other Binlogging activity will occur between these transactions, not within them.
These transactions have the structure:
● BEGIN event: the position of the BEGIN event is the position of the transaction.
● 1+ TABLE_MAP events: at least one, for the mysql.ndb_apply_status table.
● 1 WRITE_ROW event to mysql.ndb_apply_status.
● 1+ other events to other tables (WRITE_ROW, UPDATE_ROW, DELETE_ROW). As normal with RBR, each event can contain changes to multiple rows.
● COMMIT event.
The first WRITE_ROW to ndb_apply_status is a 'fake' event generated by the BI.
Ndb Binlog transactions
[Diagram: multi-row user transactions committing in the same epoch in the Cluster are written as one epoch transaction in the Binlog – BEGIN, table maps, the 'fake' WRITE_ROW, the user changes, COMMIT]
Epoch transactions contain all the changes necessary to move a Slave from one consistent epoch boundary to the next.
Ndb Binlog transactions
Merging all of the user transactions that occurred in an epoch into a single epoch transaction in the Binlog is both a strength and a weakness. It allows higher performance at the Slave, but complicates Binlog positioning.
When looking at the Binlog on a Slave cluster, we can see that the first Master's epochs are considered to be user transactions by the Slave, so they can be merged together into one epoch transaction in the Slave's binlog.
This is a source of efficiency, but can cause problems when performing failover between clusters.
Slave promotion
One problem with a slave cluster merging a master cluster's epochs together is slave promotion.
A common topology is a 'read scaled' setup with 1 Master cluster and n Slave clusters.
When the Master cluster fails, one of the Slaves is selected to become the new Master, and the other Slaves must fail over their replication to the new Master.
The problem with epoch merging here is that the old Master's epoch stream (A1, A2, A3, A4, A5) may have been applied by Slave B as (B1(A1), B2(A2,A3), B3, B4(A4,A5)). If Slave B becomes Master, and Slave C has stopped at old Master epoch A2, which epoch transaction boundary should it begin replicating from in Slave B's Binlog? B2 or B3?
This is another motivation for the Slave defaulting to IDEMPOTENT mode.
Ndb Binlog transaction optimisations
Ndb supports some extra optimisations to minimise the size of the Binlog transactions it produces:
- log_update_as_write
This causes update events on tables to be logged as WRITE_ROW (insert) events. It requires that the downstream slave can idempotently apply a WRITE_ROW event. The optimisation is that the row's 'before image' need not be sent.
- log_updated_only
This causes the NdbApi event (and SUMA subscription) to only send modified columns. The BI then only puts the modified columns in an update or write_row event, saving space and time.
Both of these options need care to ensure behaviour is correct.
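These correspond to the MySQLD options --ndb-log-update-as-write and --ndb-log-updated-only – a my.cnf sketch:
[mysqld]
ndb-log-update-as-write=1  # log updates as WRITE_ROW, no before image
ndb-log-updated-only=1     # only send/log modified columns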
Ndb_replication table
For finer-grained logging control, the mysql.ndb_replication table can be used.
This allows binlogging on-or-off, log-update-as-write and log-updated-only to be controlled per table, per binlogging server.
It now supports wildcards for easier use.
It also supports defining conflict detection / resolution algorithms.
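A sketch of creating the table and disabling binlogging for one table – the schema follows the documentation of this era, and the binlog_type value used (1 = NBT_NO_LOGGING) is worth verifying for your version:
mysql> CREATE TABLE mysql.ndb_replication (
    ->   db VARBINARY(63),
    ->   table_name VARBINARY(63),
    ->   server_id INT UNSIGNED,
    ->   binlog_type INT UNSIGNED,
    ->   conflict_fn VARBINARY(128),
    ->   PRIMARY KEY (db, table_name, server_id)
    -> ) ENGINE=NDB PARTITION BY KEY(db, table_name);
mysql> -- server_id 0 = applies to all servers; binlog_type 1 = no logging
mysql> INSERT INTO mysql.ndb_replication VALUES ('mydb', 'mytable', 0, 1, NULL);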
ndb_binlog_index table
● Used to map epoch numbers to binlog files and positions.
● Append (insert) only, from the binlog injector.
● Read only during failover, from mysql clients.
● Bulk deletes occur during PURGE or RESET MASTER.
● MyISAM, so concurrency is controlled using a single table lock!
● One problem here is that a long-running activity holding the table lock can block the BI, as its thread gets stalled waiting for a table lock to insert into the table. This generally causes epoch slip and event buffer backlogs.
● A known bad case is where a Binlog file is PURGEd, manually or automatically, requiring very many rows to be deleted from mysql.ndb_binlog_index. This can stall the BI for some time.
ndb_binlog_index table content
The schema has evolved over time:

  `Position` bigint(20) unsigned NOT NULL,       -- Basic start position mapping
  `File` varchar(255) NOT NULL,
  `epoch` bigint(20) unsigned NOT NULL,
  `inserts` int(10) unsigned NOT NULL,           -- Epoch statistics
  `updates` int(10) unsigned NOT NULL,
  `deletes` int(10) unsigned NOT NULL,
  `schemaops` int(10) unsigned NOT NULL,
  `orig_server_id` int(10) unsigned NOT NULL,    -- Slave epoch merge info
  `orig_epoch` bigint(20) unsigned NOT NULL,
  `gci` int(10) unsigned NOT NULL,               -- Handy GCI number
  `next_position` bigint(20) unsigned NOT NULL,  -- New: next position mapping
  `next_file` varchar(255) NOT NULL,
  PRIMARY KEY (`epoch`,`orig_server_id`,`orig_epoch`)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.87
The slave epoch merge info is only present with the --ndb-log-orig
server option. If it is not set, those values are set to 0.
Normally the BI will insert one row per epoch transaction into
ndb_binlog_index.
With --ndb-log-orig, it will insert one additional row for every upstream
master epoch transaction that a Slave MySQLD has applied to this
cluster in this epoch.
This gives an indication of how an upstream master's epoch transactions
were merged into a Slave's epoch transactions – useful for cutover.
These upstream epoch rows do not contain epoch statistics values –
those are only produced for the local cluster's row.
ndb_binlog_index table epoch merge info
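A hedged sketch of inspecting the merge info (column names per the
schema shown earlier; this assumes --ndb-log-orig is set on the
binlogging server and that upstream rows carry a non-zero
orig_server_id) :

SELECT epoch, orig_server_id, orig_epoch
  FROM mysql.ndb_binlog_index
 WHERE orig_server_id <> 0
 ORDER BY epoch;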
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.88
The ndb_binlog_index table has always contained a mapping from an
epoch number to an epoch transaction start position.
However, for replication channel cutover, the slave cluster generally has
an already-applied epoch number from the ndb_apply_status table, so
what is needed is the binlog position just after the last applied epoch.
Various tricky and error-prone techniques evolved to derive this, which
fail in awkward cases (e.g. the last applied epoch transaction is the last
epoch transaction in the binlog).
Recently, 'next event position' columns were added to ndb_binlog_index
so that, rather than trying to find some entry representing a next event,
we can directly obtain the correct position.
ndb_binlog_index table next position
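A hedged sketch of the cutover lookup; the two queries run on different
servers (ndb_apply_status on the slave cluster, ndb_binlog_index on the
new master's binlogging MySQLD), so the epoch value is normally
carried between them by the failover script :

-- On the slave cluster : find the last applied epoch
SELECT MAX(epoch) FROM mysql.ndb_apply_status;
-- On the new master : equality lookup of the next position
SELECT SUBSTRING_INDEX(next_file, '/', -1) AS file, next_position AS pos
  FROM mysql.ndb_binlog_index
 WHERE epoch = <last_applied_epoch>;
-- Then : CHANGE MASTER TO MASTER_LOG_FILE = <file>, MASTER_LOG_POS = <pos>;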
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.89
Previously :
● ndb_binlog_index contains only epoch -> start_pos
● Cutover involves an inequality (WHERE epoch > <x>) plus sort and
limit, which requires scanning
● Cutover does not detect that the new Master is missing relevant
events
● Cutover can silently skip over non-epoch-transaction events,
e.g. DDL
Now :
● ndb_binlog_index also contains epoch -> next_pos
● Cutover involves an equality (WHERE epoch = <x>)
● Cutover detects that the new Master is missing relevant events
● Cutover will find non-epoch-transaction events, e.g. DDL, and
can stop
ndb_binlog_index table next position
Recommended!
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.90
Limitations of the next_position cutover
● ndb_apply_status and ndb_binlog_index track only epoch transactions, so
'inter-epoch DDL' application status is not visible.
● Previously, failover could silently skip inter-epoch DDL at a cutover point.
● Now it will find it. This can lead to duplicate application of DDL, causing the
Slave to stop.
● Duplicate DDL can be ignored using --slave-skip-errors=ddl_exist_errors
● ndb_binlog_index only tracks empty epochs if --ndb-log-empty-epochs=1 is
set. This has disk and network bandwidth impacts.
● Backup and Restore can insert an ndb_apply_status entry with the restore
point of the backup as an epoch number, so that replication can be used to
catch up from this position.
● If the restore point epoch was empty, and --ndb-log-empty-epochs=0, then it
won't be in ndb_binlog_index and we revert to trying to find the 'next' position.
ndb_binlog_index table next position
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.91
● Bulk deletes as part of PURGE are of the form :
DELETE FROM mysql.ndb_binlog_index WHERE File=<name>;
● The File column is unindexed, but even with an index, this is a lot of work.
● The worst case is very many epochs per Binlog file (e.g. small
epochs/large files). This can happen on low write-rate clusters.
● Workaround : Split the DELETE into multiple invocations with a LIMIT
clause. This can allow the BI to progress in most cases.
● Better designs : Use InnoDB? Writers don't block readers. Use
partitioning by File? DELETE becomes DROP PARTITION.
● Check BI status with SHOW PROCESSLIST to see if it's blocked on a
table lock.
ndb_binlog_index table and PURGE
Avoid --expire-logs-days, PURGE manually
and consider pre-deleting rows from
ndb_binlog_index in batches
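A hedged sketch of the pre-delete workaround; the file name and batch
size are illustrative (the File column holds the name as recorded in the
index), and the DELETE is repeated until it affects zero rows :

DELETE FROM mysql.ndb_binlog_index
 WHERE File = './binlog.000123' LIMIT 10000;
-- ...repeat until 0 rows affected, then :
PURGE BINARY LOGS TO 'binlog.000123';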
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.92
● Queries on ndb_binlog_index are used for replication channel cutover,
so they can be time-critical.
● In cases where the mysql.ndb_binlog_index file is huge, they can be
slow. Beware a low client timeout here!
● Indexes can be added using normal ALTER TABLE mechanisms to
speed up these queries.
● Low write-rate clusters can have high numbers of epochs per file. Review
whether these clusters are keeping excessive binlog (and therefore have
excessively large mysql.ndb_binlog_index files), and consider rotating
and purging more often.
ndb_binlog_index table and queries
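A hedged sketch of adding such an index; the index name and the choice
of column are illustrative (e.g. File speeds up PURGE-time deletes) :

ALTER TABLE mysql.ndb_binlog_index ADD INDEX ix_file (File);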
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.93
● The BI regularly needs to hold :
● The Server-internal Binlog index lock (unavailable during rotate/purge)
● The Server-internal Binlog lock for binlogging (unavailable if other client
threads are binlogging)
● A table lock on mysql.ndb_binlog_index (unavailable during any other
access, due to MyISAM)
● The BI is 'just another client thread' from the p.o.v. of the generic MySQL
Binlog code, so it can get involved in Binlog rotation itself. If
--expire_logs_days is used then this can involve PURGE!
Binlog Injector problems
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.94
● The Ndb Slave is almost entirely the standard MySQL replication
slave system, until calls into the Ndb storage engine component are
made.
● The IO thread is not modified in any way.
● The SQL thread makes the normal calls into the SE interface, but
IDEMPOTENT mode is hard-coded on.
● Batching is the number one source of Ndb performance
improvements, and this is also the case in the slave.
● The standard RBR events allow limited batching of multiple row
changes within a single event.
● Ndb extends this batching using the --slave-allow-batching server
option.
Ndb Slave Batching, batching, batching
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.95
● When applying an epoch transaction from an upstream master, each
event is applied serially as normal.
● The Ndb handler uses the event application to define NdbApi
operations for the events, but only executes them when either a full
batch is defined, or there is a data dependency.
● The batch size in bytes is specified using the --ndb-batch-size
Server parameter.
● In recent experiments, there appeared to be very little downside
in maximising the configured batch size, so that most epoch
transactions are executed in a single batch.
● Batching effectiveness can be measured – see later.
Ndb Slave batching Increase your batch size!
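A minimal my.cnf sketch for the slave MySQLD; the batch size value is
illustrative and should be tested against your workload :

[mysqld]
slave-allow-batching        # batch row events across an epoch transaction
ndb-batch-size=1048576      # bytes of operations per NdbApi execution batch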
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.96
● For PK insert/update/delete on in-memory or cached data, which is
what RBR should mostly be doing, most of the response time is due to
communication latency between the nodes involved.
● So we spread this latency cost over as many operations in a batch as
possible.
● What's more, the operations in a batch can run in parallel on different
threads of a data node, or on different data nodes.
● Finally, even with disk-data, the operations in a batch get parallel
access to the underlying tablespace.
Ndb Slave batching Increase your batch size!
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.97
● Batching is similar to pipelining in a processor – it can be broken by
data dependencies and system limitations.
● Where an RBR event needs to read some data, there is a data
dependency that must be satisfied before it can execute. This requires
that the current batch is flushed, then the read is performed, then the
update. Many round trips!
● This is one reason why tables without primary keys are inefficient –
they require reads or, even worse, scans to find matching rows for
update and delete (see the sketch below).
● Where Blob/Text columns are being modified, there is an implicit
dependency in the implementation which requires that we lock the main
table row before modifying parts-table rows. This requires a batch flush
to obtain the lock, so breaks up any surrounding batch.
Ndb Slave batch breakup
Avoid writing to PK-less tables
Beware of Blobs
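A hedged sketch of retrofitting a primary key onto a PK-less table, so
the slave can apply row events by key rather than by scan; the table
and column names are hypothetical, and on Ndb this is a copying
ALTER, so schedule it carefully :

ALTER TABLE mydb.events_log
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;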
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.98
Ndb Slave batching
[Diagram : a relay log transaction (BEGIN, table maps, 20 events,
COMMIT) applied to the Slave Cluster via the replication code and the
NDB SE slave code in 3 round trip latencies. Moderate batching, 20:3.]
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.99
Ndb Slave batching
[Diagram : the same 20-event transaction applied in 1 round trip
latency. Max batching, 20:1.]
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.100
Ndb Slave batching
[Diagram : the same 20-event transaction applied in 20 round trip
latencies. Min batching, 20:20.]
The Slave SQL thread CPU capacity is finite, and without good batching
it is limited by waiting for responses.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.101
● The Ndb Slave has Ndb object statistics which can be monitored while
it is running. Beware that these statistics are currently lost when the
Slave thread is stopped.
● mysql> SHOW STATUS LIKE 'ndb_api%slave';
● The meaning of the values is documented in the manual, but some
are of special interest :
● Ndb_api_bytes_sent_count_slave
Can give a rough indication of the apply rate of the slave in bytes/s
● Ndb_api_trans_commit_count_slave
Can give an indication of the apply rate of the slave in epochs/s
● Ndb_api_wait_exec_complete_count_slave
Can give an indication of the round trips performed by the slave
Ndb Slave monitoring
These are monotonic counters, so to get rates you must sample on
some period and take the difference between samples (see the
sketch below).
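A hedged sketch of rate sampling from the mysql client; the 10-second
interval is illustrative :

SHOW GLOBAL STATUS LIKE 'Ndb_api_trans_commit_count_slave';  -- sample t0
SELECT SLEEP(10);
SHOW GLOBAL STATUS LIKE 'Ndb_api_trans_commit_count_slave';  -- sample t1
-- apply rate in epochs/s ~= (t1_value - t0_value) / 10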
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.102
● Ndb_api_pk_op_count_slave
Can give a rough indication of the apply rate of the slave in rows/s
● Ndb_api_trans_abort_count_slave
Can give an indication of slave apply problems – locking or temporary errors
(bad)
● Ndb_api_read_row_count_slave
Can give an indication of whether the slave is performing any reads (bad)
● Ndb_api_table|range_scan_count_slave
Can give an indication of whether the slave is performing any scan reads (bad)
● Ndb_api_wait_nanos_count_slave
Can indicate time spent waiting for the data nodes – with caveats
Ndb Slave monitoring
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.103
Interesting ratios :
● Avg batches/epoch transaction (batching ratio) :
Ndb_api_wait_exec_complete_count_slave /
Ndb_api_trans_commit_count_slave
● Avg bytes/epoch transaction : Ndb_api_bytes_sent_count_slave /
Ndb_api_trans_commit_count_slave
● Avg rows/epoch transaction : Ndb_api_pk_op_count_slave /
Ndb_api_trans_commit_count_slave
Ndb Slave monitoring
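A hedged sketch of computing the batching ratio in SQL via
information_schema.GLOBAL_STATUS (available in this era's servers;
a single sample, so it reflects the whole lifetime of the slave thread) :

SELECT (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
         WHERE VARIABLE_NAME = 'NDB_API_WAIT_EXEC_COMPLETE_COUNT_SLAVE')
       /
       (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
         WHERE VARIABLE_NAME = 'NDB_API_TRANS_COMMIT_COUNT_SLAVE')
       AS avg_batches_per_epoch;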
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.104
The Ndb slave is often unlike all other NdbApi clients : it is entirely serial, but
generates large transactions with mixed operation types, and can be very
intensive.
The most intense period is when a Slave is 'catching up' with a Binlog – it can
apply epoch transactions much faster than they were originally committed on
the Master.
This can cause overload for the Slave cluster – redo logs can get stressed, and
SendBuffers on the data nodes sending to the Slave cluster's binlogging
MySQLDs can be overloaded at commit time.
Current recommendations :
- Experiment with 'worst cases' and check the behaviour and rates measured.
- Monitor rates in production to get notification when they approach their
tested limits.
Ndb Slave notes
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.105
● One observation of the MySQL Cluster replication system is that it is missing
some end-to-end checks. It relies on the correct operation of the lower layers of
generic MySQL replication.
● Partly this is by design – the replication layer treats events and transactions
separately and avoids dependencies between them. Most SEs have no
slave-specific logic.
● However, some cross-checks are simple and effective :
● No jumping back : Received ndb_apply_status epoch numbers should
never decline without a Master position change.
● No repeats : Received ndb_apply_status epoch numbers should never
repeat without a rollback or Master position change.
● No retry failures : Received ndb_apply_status epoch numbers should not
increase without a commit or a Master position change.
Ndb Slave improvements
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.106
● Even better would be a check that epochs are in the expected sequence, but
that requires binlogging changes :
● No gaps : Received ndb_apply_status epoch numbers should follow a
sequence, where each epoch includes its successfully binlogged
predecessor's number.
Ndb Slave improvements
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.107
Insert Picture Here
Recommendations
M S
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.108
● Performance recommendations (not the main focus)
● Robustness recommendations
● Potential cluster improvements
Technical details are in the preceding slides.
Recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.109
Binlog Injector
● Set binlog_cache_size so that there is no spill-to-disk (see the check
below)
Slave
● --slave-allow-batching
● Increase ndb-batch-size (and test)
● Avoid tables without primary keys
● Beware replicating Blobs/Text
● Monitor slave activity using SHOW STATUS LIKE 'ndb_api%slave'
Performance recommendations
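A minimal sketch of checking for binlog cache spill on the binlogging
server :

SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
-- Binlog_cache_disk_use > 0 means transactions spilled to disk;
-- consider raising binlog_cache_size.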
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.110
Data nodes
● Set MaxBufferedEpochs high enough that hitting it indicates a real (hard
to reproduce) issue
● Test SendBuffer configuration, especially from data nodes to
binlogging MySQLDs, to ensure commit of the largest transactions and
heaviest load can be handled (Slave catchup?)
Binlog Injector
● Monitor SHOW STATUS LIKE 'ndb_api%injector' to understand
normal and excess flows
● Monitor SHOW PROCESSLIST to check the BI state
● Monitor SHOW ENGINE NDB STATUS to get an NdbApi buffering
indication
Robustness recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.111
Binlog Injector continued
● Monitor SHOW MASTER STATUS to understand outgoing binlog
rates
● Consider using --ndb-eventbuffer-max-alloc to avoid excessive event
buffer usage destabilising the host
ndb_binlog_index table
● Avoid using --expire-logs-days
● Consider manual purge, potentially with pre-delete of
ndb_binlog_index rows in small batches
● Consider adding indexes if cutover queries are too slow
Robustness recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.112
System restart
● Ensure that after a system restart, the cluster is not brought back online
immediately, as it will need some form of consistency restoration
Replication channel cutover
● Consider using the new replication channel cutover query, alongside
--slave-skip-errors=ddl_exist_errors
Slave
● Test system robustness under a prolonged 'Master catchup' scenario.
Monitor the Slave cluster's redo logs, redo log state, SendBuffer overload,
binlogging MySQLD lag etc.
Robustness recommendations

Mais conteúdo relacionado

Mais procurados

My First 100 days with an Exadata (WP)
My First 100 days with an Exadata  (WP)My First 100 days with an Exadata  (WP)
My First 100 days with an Exadata (WP)Gustavo Rene Antunez
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesGustavo Rene Antunez
 
DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)Gustavo Rene Antunez
 
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube
 
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HAGalera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HALudovico Caldara
 
12 Things about Oracle WebLogic Server 12c
12 Things	 about Oracle WebLogic Server 12c12 Things	 about Oracle WebLogic Server 12c
12 Things about Oracle WebLogic Server 12cGuatemala User Group
 
MySQL Performance Tuning Variables
MySQL Performance Tuning VariablesMySQL Performance Tuning Variables
MySQL Performance Tuning VariablesFromDual GmbH
 
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceMySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceSeveralnines
 
Plugging in oracle database 12c pluggable databases
Plugging in   oracle database 12c pluggable databasesPlugging in   oracle database 12c pluggable databases
Plugging in oracle database 12c pluggable databasesKellyn Pot'Vin-Gorman
 
Oracle 12c Multi Tenant
Oracle 12c Multi TenantOracle 12c Multi Tenant
Oracle 12c Multi TenantRed Stack Tech
 
Exploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your CloudExploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your Clouddyahalom
 
Oracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationOracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationYudi Herdiana
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedGear6
 
2020 pre fosdem mysql clone
2020 pre fosdem   mysql clone2020 pre fosdem   mysql clone
2020 pre fosdem mysql cloneGeorgi Kodinov
 
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Geir Høydalsvik
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)Gustavo Rene Antunez
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerGuatemala User Group
 
12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All EditionsFranck Pachot
 
Online MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupOnline MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupKenny Gryp
 

Mais procurados (20)

My First 100 days with an Exadata (WP)
My First 100 days with an Exadata  (WP)My First 100 days with an Exadata  (WP)
My First 100 days with an Exadata (WP)
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databases
 
DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)
 
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
 
My sql 5.6&MySQL Cluster 7.3
My sql 5.6&MySQL Cluster 7.3My sql 5.6&MySQL Cluster 7.3
My sql 5.6&MySQL Cluster 7.3
 
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HAGalera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
 
12 Things about Oracle WebLogic Server 12c
12 Things	 about Oracle WebLogic Server 12c12 Things	 about Oracle WebLogic Server 12c
12 Things about Oracle WebLogic Server 12c
 
MySQL Performance Tuning Variables
MySQL Performance Tuning VariablesMySQL Performance Tuning Variables
MySQL Performance Tuning Variables
 
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceMySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
 
Plugging in oracle database 12c pluggable databases
Plugging in   oracle database 12c pluggable databasesPlugging in   oracle database 12c pluggable databases
Plugging in oracle database 12c pluggable databases
 
Oracle 12c Multi Tenant
Oracle 12c Multi TenantOracle 12c Multi Tenant
Oracle 12c Multi Tenant
 
Exploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your CloudExploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your Cloud
 
Oracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationOracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for Consolidation
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
 
2020 pre fosdem mysql clone
2020 pre fosdem   mysql clone2020 pre fosdem   mysql clone
2020 pre fosdem mysql clone
 
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with Docker
 
12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions
 
Online MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupOnline MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackup
 

Semelhante a MySQL Cluster Asynchronous replication (2014)

MySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksMySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksDave Stokes
 
2012 replication
2012 replication2012 replication
2012 replicationsqlhjalp
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014Dave Stokes
 
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesOracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesSven Sandberg
 
Using The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamUsing The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamLuís Soares
 
OUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeOUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeGeorgi Kodinov
 
GLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesGLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesBiju Thomas
 
The Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheThe Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheSteven Feuerstein
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices Ted Wennmark
 
My sql fabric webinar v1.1
My sql fabric webinar v1.1My sql fabric webinar v1.1
My sql fabric webinar v1.1Ricky Setyawan
 
Collaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsCollaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsNelson Calero
 
MySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMorgan Tocker
 
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...Dave Stokes
 
2012 scale replication
2012 scale replication2012 scale replication
2012 scale replicationsqlhjalp
 
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015Dave Stokes
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)Frazer Clement
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMorgan Tocker
 

Semelhante a MySQL Cluster Asynchronous replication (2014) (20)

MySQL NoSQL APIs
MySQL NoSQL APIsMySQL NoSQL APIs
MySQL NoSQL APIs
 
MySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksMySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disks
 
MySQL Replication
MySQL ReplicationMySQL Replication
MySQL Replication
 
2012 replication
2012 replication2012 replication
2012 replication
 
MySQL-InnoDB
MySQL-InnoDBMySQL-InnoDB
MySQL-InnoDB
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014
 
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesOracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
 
Using The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamUsing The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change Stream
 
OUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeOUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source Code
 
GLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesGLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New Features
 
The Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheThe Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result Cache
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices
 
My sql fabric webinar v1.1
My sql fabric webinar v1.1My sql fabric webinar v1.1
My sql fabric webinar v1.1
 
Collaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsCollaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAs
 
MySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMySQL Performance Metrics that Matter
MySQL Performance Metrics that Matter
 
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
 
2012 scale replication
2012 scale replication2012 scale replication
2012 scale replication
 
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server Changes
 

Último

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Último (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

MySQL Cluster Asynchronous replication (2014)

  • 1. Copyright © 2012, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 121 Insert Picture HereMySQL Cluster replication Frazer Clement Senior Software Engineer, Oracle frazer.clement@oracle.com messagepassing.blogspot.com April 2014
  • 2. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.2 Session Agenda • Intro • MySQL replication • MySQL Cluster replication • Recommendations • Advanced topics
  • 3. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.3 Disclaimer THE FOLLOWING IS INTENDED TO OUTLINE OUR GENERAL PRODUCT DIRECTION. IT IS INTENDED FOR INFORMATION PURPOSES ONLY, AND MAY NOT BE INCORPORATED INTO ANY CONTRACT. IT IS NOT A COMMITMENT TO DELIVER ANY MATERIAL, CODE, OR FUNCTIONALITY, AND SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISION. THE DEVELOPMENT, RELEASE, AND TIMING OF ANY FEATURES OR FUNCTIONALITY DESCRIBED FOR ORACLE'S PRODUCTS REMAINS AT THE SOLE DISCRETION OF ORACLE
  • 4. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.4 Insert Picture Here Introduction
  • 5. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.5 Frazer Clement Senior Software Engineer Oracle Based in Edinburgh UK. Joined MySQL AB in 2007, then Sun, then Oracle... Worked on NdbApi, Replication, Cluster membership, Conflict detection, most areas of Cluster. Worked with customers for several years to help solve their problems. Previously worked for Nortel / IBM on HLR, HSS... Included using MySQL Cluster as HSS database since ~ 2005. Strong telco focus Who?
  • 6. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.6 ● Audience are familiar with replication concepts, limitations etc. Will review some of that material, but just as context ● You have questions that will come up as we proceed – happy to answer as we go, unless it is too time consuming – then we defer to the end ● I don't have material to cover everything, but can use a white board ● I will not know all the answers :) ● I might ask you questions :) ● You will have great ideas and suggestions! ● Happy to discuss concepts and ideas, but I cannot commit to any future development or fixes. Expectations
  • 7. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.7 ● I cover MySQL replication separately to MySQL Cluster replication here ● This reflects the implementation – much of it is in the generic MySQL Server code base, which means : ● We benefit from all the features implemented at that level ● We benefit from the testing performed by the huge installed base of users. ● It is implemented by a different team within Oracle – they work on it for us for 'free' :) ● It is designed to be generic across different storage engines (MyISAM, InnoDB, Ndb cluster) ● It is not so easily modified to suit the specific needs of MySQL Cluster Structure
  • 8. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.8 Insert Picture Here MySQL replication M S
  • 9. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.9 MySQL replication topologies M S M S S S/M S M/S M/S M/S M/S M/S M/S SS Master Slave Master-Master Circular + Slaves Master – multi slave tree Constraint : 1 Master per Slave @ 1 time
  • 10. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.10 MySQL replication components SQL clients Server logic Pluggable Storage Engines Binlogging Binlog files Binlog Dump Threads Slave servers Master Server Slave IO thread Relay log files Slave SQL thread MySQL uses IO buffering, so when producer and consumer are close, data is passed in memory
  • 11. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.11 MySQL Binlogging 1 SQL clients Server logic Pluggable Storage Engines Binlogging Binlog Index file BinlogFile.000006 BinlogFile.000005 BinlogFile.000004 BinlogFile.000003 Time Binlog rotation based on file size, or manual flush command. Purge can be time based or manual. During transaction execution, a binlog transaction is cached by client threads. At transaction commit time, client thread takes binlog lock + writes transaction to binlog. Light yellow binlog transaction caches are in- memory, strong yellow have spilled to disk
  • 12. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.12 Binlog gives storage engine independent operation log ● For single-server systems, Binlog durability via fsync is important ● fsync has large performance impact ● Binlog file size (--max_binlog_size) trades off number of files, rotation/open/close cost with granularity of purge. ● Size-based rotation occurs as part of writing an event to the binlog – e.g. some client session does the work and spends the time. ● Time-based PURGE (--expire_logs_days) is triggered when rotating, and also performed by some client session. ● Binlog transaction cache is largest in-memory cacheable binlog transaction (--binlog_cache_size). Larger transactions are spooled to disk prior to commit-to-binlog. See SHOW STATUS LIKE '%binlog_cache%'; MySQL Binlogging 2 PURGE LOGS and FLUSH LOGS commands can manually invoke purge and rotate actions
  • 13. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.13 ● Binlog is a series of variable sized events ● Events have a type, size, serverid ● Some are Binlog metadata : FORMAT_DESC, ROTATE ● Original implementation was 'Statement Based Replication' (SBR) – including statements in QUERY event types. Challenging to ensure determinism at the Slave. ● “Row Based Replication' (RBR) – uses WRITE_ROW, UPDATE_ROW, DELETE_ROW events and can improve slave performance for small transactions. ● In both cases transactions are demarcated by BEGIN and COMMIT QUERY events ● Transactions are kept in a single Binlog file – often rotation occurs after a COMMIT event. MySQL Binlogging 3 MySQL Cluster uses Row based replication
  • 14. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.14 Row based replication ● Basic format of transactions is : BEGIN, TABLE_MAP*, (WRITE_ROW|UPDATE_ROW| DELETE_ROW)*, COMMIT ● TABLE_MAP is for efficiency – mapping a table name to an id for the scope of the transaction – later *_ROW events use the id. ● Each *_ROW event can contain one or more sets of row images (changed rows), where each set of images : ● Is the same operation type (INSERT/UPDATE/DELETE) ● Is on the same table ● Affects the same columns ● RBR can include some statements : DDL etc... ● As normal, SQL_LOG_BIN=0 can temporarily disable Binlogging MySQL RBR RBR events have few determinism issues
  • 15. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.15 RBR Example FORMAT_DESC CREATE TABLE HLR.AUTH_INFO(... BEGIN TABLE_MAP(HLR.CUG, 1) TABLE_MAP(HLR.AUTH_INFO, 2) WRITE_ROW(1, …) WRITE_ROW(2, …) UPDATE_ROW(1, …) WRITE_ROW(1, …) DELETE_ROW(1, …) COMMIT BEGIN TABLE_MAP(HLR.AUTH_INFO, 1) MySQL RBR Description event at start of Binlog file Create table recorded in event Binlog transaction start Table map events with transaction scope Events containing one or more row image sets Binlog transaction end Start of new Binlog transaction DDL statements appear between DML transactions – not interleaved mysqlbinlog –verbose great for analysing Binlogs + Relay logs Also SHOW BINARY LOGS, SHOW BINLOG EVENTS
  • 16. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.16 DDL statements are (should be) relatively rare, so the Binlog files should be mostly sequences of back-to-back transactions. Binlog file positions are denoted using a file {name, position} pair. The position is a byte offset, and many offsets are invalid. Valid offsets are the start of an event. MySQL Binlogging New events Active Binlog file Byte offsets 0
  • 17. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.17 MySQL Binlog dump threads SQL clients Server logic Pluggable Storage Engines Binlogging Binlog files Binlog Dump Threads Slave servers - Slaves connect to a Master in the normal way as a client, authenticate, then issue a BINLOG_DUMP command, which causes their session thread to become what we call a BINLOG DUMP thread. - BINLOG DUMP threads have a Binlog {File, Position} pair from where they are reading. Each can have a different position. - Where they are close to the 'head' of the Active Binlog, data is passed via memory. - Generally the Binlog Dump files read as much data as is available and can be sent over to the Slave – TCP backpressure allows the Slave to control the rate.
  • 18. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.18 MySQL slave threads IO thread Server logic Pluggable Storage Engines Master Server Slave IO thread Slave SQL thread Relay log Index file RelaylogFile.000022 RelaylogFile.000021 RelaylogFile.000020 RelaylogFile.000019 Time - Slave IO thread connects to Master and issues Binlog DUMP command. - Events received are filtered by server id, then written to relay log files. - Slave IO thread can operate entirely separately to Slave SQL thread, replicating the Binlog files from the Master. - Relay logs are almost exactly the same as Binlogs. - FLUSH LOGS can manually rotate the active relay log IO and SQL threads can be stopped and started together or separately using START SLAVE and STOP SLAVE.
  • 19. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.19 MySQL slave threads SQL thread Server logic Pluggable Storage Engines Master Server Slave IO thread Slave SQL thread Relay log Index file RelaylogFile.000022 RelaylogFile.000021 RelaylogFile.000020 RelaylogFile.000019 Time - Slave SQL thread reads relay logs, via memory if possible, and executes the events. - Event execution can result in Storage Engine (SE) calls, defining transactions, operating on data etc. - When the Slave SQL thread reaches the end of a relay log file and moves onto the next, the old relay log file is purged automatically. - Relay logs are mostly 'invisible' in normal operation. Separate IO and SQL threads decouples k-safety / geo redundancy from replica consistency / slave apply performance limitations or issues
  • 20. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.20 ● Most storage engines (e.g. InnoDB, MyISAM) have no slave-specific code. They are invoked by the Server logic and behave as they would for any client session invocation. ● Normal MySQL replication has a 1:1 mapping between 'user transactions' and 'binlog transactions'. ● User transactions execute in parallel at the Master and are serialised only when writing their binlog cache contents to the Binlog at transaction commit time. ● The Binlog forces a single serial order on concurrent transactions. ● At the Slave, the SQL thread executes transactions in Binlog order. This may be less concurrent than the causing execution at the Master. ● At the Slave, the SQL thread must perform all blocking disk I/O, which might have been performed by concurrent threads on the Master. This can be the limit on Slave throughput. MySQL slave SQL thread Recent MTS work alleviates single-threaded slave SQL thread limitation somewhat
  • 21. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.21 ● The slave SQL thread can encounter permanent or temporary errors ● Permanent errors generally stop the SQL thread (but not the IO thread) ● Some permanent error types can be ignored : --slave-skip-errors ● Some classes of errors can ignored with an alias : --slave-skip- errors=ddl_exist_errors ● Temporary errors result in transaction rollback and limited retries ● --slave_transaction_retries controls the maximum number of retries before a temporary error will cause the SQL thread to stop. ● Temporary errors result in an entry in the Server's error log file ● Retries are not immediate, but have bounded growing inter-retry delays to give time for temporary conditions to resolve. ● Problematic events can be manually skipped by setting the sql_slave_skip_counter variable before restarting the Slave SQL thread. MySQL slave SQL thread
  • 22. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.22 Insert Picture Here MySQL Cluster replication M S
  • 23. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.23 ● Topologies and Components ● Cluster internals ● HA and distribution ● Event ordering ● SUMA buffering and duplicates ● SendBuffer ● NdbApi and SUMA concepts ● NdbApi internals ● NdbApi event buffering ● Events and Blobs ● Binlog Injector ● Ndb slave MySQL Cluster replication
  • 24. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.24 ● Ndb Binlog Injector ● Ndb slave MySQL Cluster replication
  • 25. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.25 Built on-top of MySQL replication ● Tried and tested stack ● Flexibility – replicate to and from other systems ● Benefit from performance improvements and bug fixing MySQL Cluster adds : ● Binlogging of changes from other cluster clients including NdbApi ● HA replication – no SPOF ● Transactional replication position on the slave ● Higher replication performance through batching and parallelism ● Moving parts and complexity MySQL Cluster replication
  • 26. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.26 MySQL Cluster replication topologies Master Slave Master-Master Circular + Slaves Master – multi slave tree All the standard topologies, with whole clusters as M, S or S/M. M S M/S M/S M/S M/S M/S M/S SS M S/M S S S
  • 27. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.27 MySQL Cluster replication topologies 2 Star/Hub with upstream master and downstream Slave tree A Slave Cluster can have multiple Masters Multi-master M M M S M M/S M/S M/S M/S M/S S S
  • 28. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.28 MySQL Cluster replication components For HA, each cluster has redundant Master and Slave MySQL Servers. Most commonly two servers, with only one Slave Server active at a time. Both MySQL Servers write Binlog, but commonly only one is serving Binlogs to downstream slaves. A single server can perform both Master and Slave roles. S M M MySQL Cluster data nodes, other clients etc. S Slave servers Master servers MySQL protocol MySQL protocol MySQL protocol MySQL protocol NdbApi events NdbApi DML/ DDL
  • 29. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.29 MySQL Cluster replication components 2 MySQL Cluster data nodes, other clients etc. Slave servers MySQL protocol NdbApi DML/ DDL MySQL protocol NdbApi events MySQL protocol MySQL protocol Master servers A single server can perform both Master and Slave roles simultaneously IO SQL IO SQL INJ DUMP INJ DUMP NDB NDB NDB NDB BINLOGS BINLOGS Relay logs Relay logs
  • 30. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.30 ● MySQL Cluster is designed for high write-rate systems, with low stable latency ● Parallelism and concurrency exist at many levels : rows, operations, fragments, data nodes, transactions, clients. ● Operations, transactions and therefore clients generally only interact/contend where the operate on the same data (rows). ● Otherwise transactions are entirely parallel and unsynchronised. ● Great for throughput with low latency ● Does not provide a single serial history of transactions ● Does not provide a notion of consistent points in time ● Consistent points are identified by a separate mechanism – Global checkpoint. MySQL cluster global checkpoint 1
  • 31. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.31 ● Consistent points are used for : - Creating potential system / backup recovery points - Defining points in change event streams (replication) ● The set of changes between two consistent points is referred to as an epoch, and is identified by a cluster unique 64 bit epoch number. ● Epoch numbers have a high 32-bit word called GCI (Global checkpoint index), and a low 32-bit word called micro-GCI ● System/Backup recovery points are at GCI boundaries, and are created on the period TimeBetweenGlobalCheckpoints – defaults to 2000 millis. ● Event stream consistency points are at micro-GCI boundaries, and are created on the period TimeBetweenEpochs – defaults to 100 millis. MySQL cluster global checkpoint 2
  • 32. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.32 ● The ratio between these times results in the pattern of epoch numbers seen. The default ratio is 20:1, resulting in micro-GCI values 0..19, then an increment of the GCI value. ● In disk overloaded systems, sometimes the GCI increment is stalled for longer, and so higher micro-GCI values are seen – this can be a warning of redo disk IO problems. ● Epoch numbers are often logged as <GCI>/<microGCI>, generally more readable than the 64bit representation. ● Epoch numbers are assigned at transaction commit time, by the transaction's Transaction Coordinator (TC) – a component on the data node ● To get permission to commit, and an epoch number assigned, a transaction must be fully prepared – and e.g. be holding all the row locks it needs. This implies that it has no dependencies on any other running transaction. MySQL cluster global checkpoint 3
  • 33. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.33 ● The Global Checkpoint protocol works by broadcasting a prepare-to- increment signal to all TC instances in the cluster, causing them to gate new transaction commits (but continue all other processing). Once all TC instances have acknowledged, a commit-increment signal is broadcast, and all TC instances resume committing. ● The effect here is that the parallel streams of committing transactions are divided into before and after with the following properties : All transactions in epoch n 'happened before' those in epoch n+1 Therefore an epoch boundary is a consistent point ● Note that a parallel system has many equivalent partial event sort orders, and epochs are just one of them, selected arbitrarily. MySQL cluster global checkpoint 4
  • 34. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.34 ● Epoch numbers are assigned when transactions begin to commit. ● Commit of large transactions, especially involving disk data tables, can take some time. ● Post-commit triggers in the Tuple manager (TUP) component in the data nodes send row change details to the Subscription manager component. ● The Subscription manager (SUMA) manages forwarding / multicasting of row change details to NdbApi / MySQLD clients. ● Each row change has an associated epoch number. ● When a TUP instance has completed commit processing for all transactions in an epoch, it notifies SUMA. ● When all of the local TUP instances have completed an epoch, SUMA informs its subscribers. MySQL cluster row change events
  • 35. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.35 MySQL cluster row change events flow [Diagram: data nodes (NoOfReplicas=2), each with TC, LDM/TUP and SUMA instances, delivering NdbApi events to two subscribers.] Writes are synchronously replicated within a nodegroup. All SUMA components in a nodegroup observe all write events. Events are hashed to buckets independently of the node-local fragment replica role, e.g. write events can be delivered from 'Primary' or 'Backup' fragments.
  • 36. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.36 MySQL cluster row change events - nodegroup [Diagram: data nodes (NoOfReplicas=2) with TC, LDM/TUP and SUMA instances and two NdbApi event subscribers.] A single row write transaction from an API node routes to one TC instance, from where it is forwarded to the LDM instance managing the primary fragment replica for the fragment. Then it is synchronously replicated to the nodegroup peer. Both nodes forward the event to their SUMA instance, but only one SUMA forwards the event to the subscribing API nodes. The other buffers it.
  • 37. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.37 MySQL cluster row change events + epochs [Diagram: data nodes (NoOfReplicas=2) delivering NdbApi events to two subscribers.] After an epoch increment, the TUP instances gradually finish committing the transactions from the previous epoch and forwarding their events to the local SUMA instance. When SUMA receives this 'epoch completed' signal it forwards it to its subscribers. This tells the subscribers that they have received all of the events for the given epoch from the source data node.
  • 38. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.38 MySQL cluster row change events + epoch ack [Diagram: subscribers acknowledging epoch completion back to the data nodes (NoOfReplicas=2).] Once subscribers have received an epoch completed event, they immediately respond with an acknowledgement back to the data nodes. Once all subscribers have acknowledged reception of an epoch to all data nodes, SUMA can release the epoch's event buffer space for reuse.
  • 39. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.39 MySQL cluster row change events + failure [Diagram: a data node failing while events for unacknowledged epochs are in flight (NoOfReplicas=2).] When a node fails unexpectedly, there will likely be events sent to subscribers for as-yet unacknowledged epochs. In this case, there is uncertainty about whether all of the subscribers received all of the unacknowledged events or not. To solve this, a surviving node in the nodegroup will 'takeover' the bucket, and use its buffered events to resend unacknowledged epochs.
  • 40. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.40 MySQL cluster row change events + resend [Diagram: a nodegroup peer re-sending buffered events to the subscribers (NoOfReplicas=2).] Depending on when the failure occurred, its nature, buffers etc., the subscribers may have received the original events or not. In any case they are re-sent by a nodegroup peer. Currently this can mean that the API sees duplicate events within an epoch. Once all buffered events are re-sent, the normal epoch-completed protocol is followed. (STRICT mode problem)
  • 41. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.41 Observations ● Row changes are distributed according to the row's primary key distribution hash ● A single transaction can affect different fragments on the same or different data nodes ● Even within a data node, row changes are forwarded or buffered based on bucket membership, which is a function of the primary key hash, but independent of the local fragment replica's current primary or backup role. ● The cluster does not actively maintain transaction ordering information within an epoch ● Therefore : Events arrive at subscribers in a partially sorted order MySQL cluster row change event ordering
  • 42. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.42 Events arrive at subscribers in a partially sorted order ● Events in epoch n+1 occurred after events in epoch n ● Events within an epoch are only ordered w.r.t. individual primary keys. (As it is guaranteed that a given primary key value will be in a particular SUMA bucket) Implications ● Inter-row constraints may be temporarily violated if applying the row events in-order (unique keys, foreign keys) ● It's not trivial to extract and order the original 'user transactions' from the event stream (requires per-event transaction id and a topological sort) ● Consistency is only guaranteed at epoch boundaries MySQL cluster row change event ordering 2
  • 43. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.43 In node failure cases, unacknowledged epochs are re-sent to subscribers. Subscribers do not currently filter out the re-sends. Therefore it is possible to have duplicated events in an epoch. This is one reason why Ndb slaves require IDEMPOTENT mode – to allow them to handle cases where sequences of operations to a primary key are partially/fully repeated... [INSERT, UPDATE] → [INSERT, UPDATE][INSERT, UPDATE] [INSERT, UPDATE] → [INSERT][INSERT, UPDATE] [DELETE] → [DELETE][DELETE] [UPDATE, DELETE] → [UPDATE, DELETE][UPDATE, DELETE] MySQL cluster row change event duplicates
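The idempotent apply behaviour is governed by the slave_exec_mode server variable (the Ndb slave hard-codes this on, as noted later, but the variable can still be inspected and set). A minimal check, assuming a standard MySQL Cluster server:

-- Verify the slave applies row events idempotently
SHOW VARIABLES LIKE 'slave_exec_mode';

-- Explicit setting; MySQL Cluster binaries default to IDEMPOTENT anyway
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';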
  • 44. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.44 The row changes occurring in each nodegroup are divided into a number of buckets (or slices), using a primary key hash. The number of slices is designed to always balance across the available nodes in a nodegroup. Responsibility for forwarding events in each slice to subscribers is given to one of the nodes in the group, and the others will buffer the events. As each node in the nodegroup is forwarding and buffering the same number of slices/buckets, and as the slices are based on an MD5 hash, their forwarding IO and buffering capacity should be balanced. For NoOfReplicas=2, each nodegroup has two slices, with each node responsible for forwarding one and buffering one. MySQL cluster row change SUMA balance
  • 45. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.45 ● Data nodes buffer changes in SUMA to handle unexpected data node failure without a break in the NdbApi row change event streams. ● SUMA event buffering is in-memory and finite. ● Buffer space is consumed by data change in the local node, and released by acknowledgements from all subscribing Api nodes ● Buffer space is liable to increase due to : Network problems to subscribers, Slow subscribers, Failed-but-not-yet-detected Api nodes, Cluster write rate spikes. ● To protect the data nodes, SUMA event buffering can be limited in terms of the number of epochs buffered and the number of bytes (Cluster config MaxBufferedEpochs and (new) MaxBufferedEpochBytes) MySQL cluster row change SUMA buffering
  • 46. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.46 ● MaxBufferedEpochBytes limits the amount of memory SUMA can use for buffering events. ● MaxBufferedEpochs limits the number of unacknowledged epochs that SUMA will accept from any subscriber. ● Reaching MaxBufferedEpochBytes is not an immediate problem, as the data nodes can stop buffering, but keep forwarding. However it means that the cluster is no longer resilient to data node failure - in the event of a data node failure, the Api nodes will be informed that there is a gap in the event stream. For replication, this requires a Backup + Restore cycle to resync the Slave. For this reason it should be avoided. ● “Out of event buffer: nodefailure will cause event failures, consider increasing MaxBufferedEpochBytes” MySQL cluster row change SUMA buffering
  • 47. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.47 ● When MaxBufferedEpochs is reached, subscribers with that epoch lag will be disconnected, allowing their buffered data to be released. ● This asynchronous disconnect from the data node side of the connection will result in all data nodes disconnecting the subscribing Api node, and will appear to NdbApi like a 'cluster failure'. ● For MySQLD, the Binlog Injector thread will inject a GAP Incident event in the Binlog. ● The Api node is then free to reconnect and attempt to establish new subscriptions... ● “Disconnecting lagging nodes ...” MySQL cluster row change SUMA buffering MaxBufferedEpochs can be used like a 'watchdog' on lagging subscribers – perhaps disconnecting them and allowing them to reconnect can clear problems they may be having. It is not really necessary as a guard on the data node buffering capacity, as that is limited by MaxBufferedEpochBytes. However, beware setting it to an ultra-high value: a SUMA-internal pool is allocated based on its setting.
  • 48. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.48 Monitoring : ● Bytes buffered in SUMA : Cannot be directly monitored ● Epochs buffered in SUMA : NdbInfo ndb$pools, block is SUMA, resource is “GCP”. ● DUMP 8013 puts summary of oldest buffered epoch and which subscriber's acknowledgements are pending into the cluster log. MySQL cluster row change SUMA buffering Potential Improvements NdbInfo views on subscriptions, subscribers, buckets, epochs, buffered bytes, volumes of data sent to individual subscribers etc...
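A sketch of the NdbInfo query, assuming the hidden ndbinfo.ndb$pools base table with the pool_name/used/total/high columns present in recent 7.x releases – verify the column names against your version before relying on this:

-- Epochs currently buffered in SUMA, per data node (SUMA block, 'GCP' pool)
SELECT node_id, used, total, high
FROM   `ndbinfo`.`ndb$pools`
WHERE  pool_name = 'GCP';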
  • 49. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.49 ● Most deployments have two or more 'Binlogging MySQLDs', and a set of NdbApi clients and/or non-binlogging MySQLD servers. ● The Binlogging MySQLDs subscribe to row change events on almost every table in the cluster - this means that the data sent from data nodes to Api nodes is noticeably higher for Binlogging MySQLDs than other Api clients. ● When the rate of change in the Cluster is high, this imbalance can cause problems with the SendBuffer resource. ● SendBuffer is used to decouple the non-blocking data node core from the blocking socket send protocols used to actually send data remotely. ● Binlogging MySQLDs generally need more SendBuffer configured on the data nodes than other Api nodes, to soak up spikes in the change rate. MySQL cluster row change SendBuffer
  • 50. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.50 MySQL cluster row change SendBuffer [Diagram: data nodes (NoOfReplicas=2) sending NdbApi events to two subscribers among many Api nodes.] Generally many more Api nodes than subscribers. Binlogging MySQLDs subscribe to all row change events – so are directly affected by the rate of change in a cluster. Links to Binlogging MySQLDs must be allowed more SendBuffer than other Api links.
  • 51. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.51 Epoch boundaries are used for system restore capabilities. A cluster system restart is where all nodes start from on-disk state - LCP + redo logs. However only a subset of epoch boundaries (GCI boundaries) are eligible as system restore points. GCI boundaries are epochs with micro_GCI == 0. They normally occur every TimeBetweenGlobalCheckpoints milliseconds – defaults to 2000 millis. However epochs are incremented every TimeBetweenEpochs millis – defaults to 100 millis. Therefore the completed epochs will almost always contain changes which will not be available after a sudden system restart MySQL cluster system restart + row changes
  • 52. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.52 Completed epochs will almost always contain changes which will not be available after a sudden system restart ● Therefore the Binlogs of subscribing MySQLD nodes can contain changes which are not in the cluster data nodes after a restart. ● Therefore the downstream slave(s) can contain changes which are not stored in the (old) Master cluster after a restart. ● Therefore care must be taken to understand the restoration point of the failed cluster during the restart, so that a decision about how to resync can be made - Perhaps local or slave Binlogs can be replayed? ● A failed cluster should not be allowed to come online and serve traffic after a system restart without some analysis of the data lost. MySQL cluster system restart + row changes
  • 53. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.53 NdbApi has the NdbEvent and NdbEventOperation classes. Clients can create and share NdbEvents by name. Each NdbEvent refers to a single table, and is parameterised by whether only modified columns, or all columns are included. NdbEvents have a lifecycle independent of the creating NdbApi client, so care must be taken. Clients create and use NdbEventOperation objects to request that the row change events defined by a particular NdbEvent object should be sent to the NdbApi client. A single NdbApi client might use many NdbEventOperations (one per table) to get a view of the row changes occurring in some subset of tables. Row change events in NdbApi SUMA p.o.v : Event = Subscription, EventOperation = Subscriber
  • 54. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.54 The set of currently defined events in a cluster can be seen by looking at the hidden NDB$EVENTS_0 table...
> ndb_select_all -c<> -dsys NDB$EVENTS_0
NAME EVENT_TYPE TABLEID TABLEVERSION TABLE_NAME ATTRIBUTE_MASK SUBID SUBKEY
"REPL$mysql/ndb_schema" 262143 4 1 "mysql/def/ndb_schema" [511 0 0 0] 1 65537
"REPL$mysql/ndb_apply_status" 65535 6 1 "mysql/def/ndb_apply_status" [31 0 0 0] 3 65539
"NDB$BLOBEVENT_REPL$mysql/ndb_schema_3" 393215 5 1 "mysql/def/NDB$BLOB_4_3" [15 0 0 0] 2 65538
3 rows returned
NDBT_ProgramExit: 0 - OK
Row change events in NdbApi MySQL replication created events usually start with REPL$ or REPLF$ for modified-only or all-columns variants. The subscription to mysql/ndb_schema is used by all MySQLDs to communicate about schema changes. Note that the mysql/ndb_schema table (id 4) has a blob column which needs its own event to track those changes. The attribute_mask here is deprecated and irrelevant.
  • 55. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.55 NdbApi allows users to define and share Events, and to define EventOperations to subscribe to row changes defined by the events. NdbApi event == SUMA subscription A description of the type of changes that are of interest. Can be shared by many NdbApi clients NdbApi event operation == SUMA subscriber A way to request the flow of row events from a particular Event/Subscription start flowing to this client. EventOperations are associated with an Ndb object. NdbApi event concepts
  • 56. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.56 NdbApi event concepts [Diagram: entity relationships — a Cluster defines n Events (subscriptions); each Event refers to 1 Table and can have n EventOperations (subscribers); each EventOperation belongs to 1 Ndb object; each Ndb object has 1 EventBuffer; an Api node hosts n Ndb objects.]
  • 57. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.57 ● An Ndb object can be used to subscribe to changes on one or more tables by creating NdbEventOperations on them, and then polling for incoming events. ● The SUMA components on the data nodes will start sending the row changes on some epoch boundary. ● As events and epoch boundary signals arrive, thread(s) internal to the NdbApi library receive, acknowledge and buffer them. ● The buffering exists to decouple the data transfer from the data nodes to the Api client from the Api client's consumption of events. ● In cases where the change rate is very high, or the Api client is slow or stalled, the buffer will grow ● Excessive growth of the NdbApi event buffer can cause stability issues. NdbApi event buffering
  • 58. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.58 NdbApi event buffering [Diagram: data nodes → NdbApi receiver thread → Event buffer → User code calling Ndb::pollEvents(), all within the Api process.] Within NdbApi, the receiver thread(s) buffer and acknowledge reception of row data for an epoch. This occurs independently of the User code behaviour, and helps with quick release of data node SUMA event buffer space. User code can retrieve new events by calling the pollEvents() method. This will return the next event from the head of the Ndb object's event buffer. Only events in completed epochs are made available to the pollEvents method.
  • 59. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.59 NdbApi event buffering NdbApi Event Buffer implementation : + Zeroconf + Avoids OS malloc/free - Unbounded growth - Can destabilise the host - Never returns memory to the OS - Often continues to grow despite having 'free space'. Cases observed where : - Event buffer growth causes host slowdown due to paging, resulting in a decrease in both the user code event consumption rate and receiver thread performance. Eventually MaxBufferedEpochs drives client disconnect and buffer re-initialisation (good outcome). - Event buffer growth causes the Linux OOM killer to choose an ndbmtd process to kill to relieve memory pressure. - Memory allocated from the OS continues to grow despite the buffer having a large % free. A hard crashing limit on size has been implemented. A soft 'GAP insert' limit is in progress.
  • 60. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.60 Each Event Buffer is linked to an Ndb object. Generally want minimum buffers per process. Currently the NdbApi event Api only exposes buffering information in terms of epochs : - Latest epoch(GCI): Most recent epoch completely received by the NdbApi receiver thread (tail of event buffer) - Apply epoch(GCI): Epoch of the event currently being consumed by NdbApi user code (head of event buffer). These are of limited use as : - epoch numbers are sparse, there's no direct indication of the number of epochs buffered. - epochs are of different sizes in terms of row changes and change size NdbApi event buffer monitoring
  • 61. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.61 NdbApi (and MySQLD) allow a 'GCI slip threshold' to be configured.
--ndb-report-thresh-binlog-epoch-slip=#
  Threshold on number of epochs to be behind before reporting binlog status. E.g. 3 means that if the difference between what epoch has been received from the storage nodes and what has been applied to the binlog is 3 or more, a status message will be sent to the cluster log.
--ndb-report-thresh-binlog-mem-usage=#
  Threshold on percentage of free memory before reporting binlog status. E.g. 10 means that if amount of available memory for receiving binlog data from the storage nodes goes below 10%, a status message will be sent to the cluster log.
These threshold crossings cause cluster log events to be generated... NdbApi event buffer monitoring
  • 62. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.62
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/21
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/22
...
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/9
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/10
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/11
NdbApi event buffer monitoring
  • 63. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.63 NdbApi also includes some per-Ndb event stream monitoring counters. These are incremented independently of event consumption. Ndb::getClientStat()
  DataEventsRecvdCount    = 18, /* Number of table data change events received */
  NonDataEventsRecvdCount = 19, /* Number of non-data events received */
  EventBytesRecvdCount    = 20, /* Number of bytes of event data received */
These can be seen from a MySQLD instance using :
> SHOW STATUS LIKE 'ndb_api%injector';
NdbApi event buffer monitoring
  • 64. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.64 A configurable crash-stop limit on event buffer size was implemented recently. The idea is that a process shutdown and restart (with accompanying replication channel failover) is preferable to host OS destabilisation when event buffer growth is excessive.
--ndb-eventbuffer-max-alloc=#
  Maximum memory that can be allocated for buffering events by the ndb api
Work is in progress to implement a less severe limit – where excessive buffering causes incoming events to be discarded, and the consumer becomes aware of the 'gap' when it reaches it. NdbApi event buffer limiting
  • 65. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.65 ● Ndb Blobs are mostly implemented at the NdbApi layer of the stack. ● Data for a Blob column is split into a small header, and zero or more 'parts' (e.g. 256 byte header, 0..n parts of 2000 bytes, 4000 bytes...) ● The 'header' data is a normal column in the table, something like a VARBINARY(272). ● The parts are rows in a hidden table, defined just to hold parts for that particular column. These Blob part tables are named NDB$BLOB_<tableid>_<columnnumber> ● From the point of view of the data nodes, the part tables are normal 'user tables'. ● This allows arbitrary length data to be transactionally stored in Blob (or Text) columns, but adds complexity at the NdbApi layer. NdbApi events for Blobs
  • 66. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.66 ● When a transaction modifying a Blob value (insert, update, delete) occurs, it internally involves operations on the main table as normal, and zero or more operations on rows in the parts table(s) involved. ● TUP and SUMA treat the Blob part tables as separate tables. ● NdbApi receives row changes for the Blob part tables separate to the main table row change, with no ordering constraints between different rows as normal. ● With the merge_events option on, NdbApi correlates the main table and part table events so that the Blob part table row changes are used to create a pseudo main table row change event containing all the Blob changes. ● This is implemented using an in-memory hash by PK in the NdbApi. NdbApi events for Blobs Event merge merges all events on a row in an epoch – e.g. separate user transactions are merged together.
  • 67. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.67 ● NdbApi receives row events and epoch completion signals from the SUMA components of the running data nodes ● NdbApi internally buffers events for incomplete epochs. These are not made visible to NdbApi users until all data nodes have indicated that the epoch is completed. ● As data nodes complete epochs in-order, the NdbApi nodes will release event data for each epoch to the user, in-order. ● A user thread serially consuming events from the event Api can be sure that when the first event for some epoch m > n is received, there will be no more events for epoch n. ● This sequencing is a kind of merge-sort by epoch number on the data node event streams. NdbApi event sequencing
  • 68. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.68 ● In a quiet or idle cluster, epochs can occur which have no row changes in them. ● The normal global checkpoint mechanisms occur, and all data nodes will send epoch completed signals, and expect acknowledgements. ● This can be seen in the 'Latest epoch' values, which continue to climb in an idle cluster. This is the only indication of empty epochs occurring at the NdbApi layer. ● In some cases, an epoch may have events which a user does not consider relevant – e.g. slave-applied updates when --log-slave-updates=0. In this case the epoch is not empty at the NdbApi level, but may be considered as empty by the next layer up. Empty epochs at the NdbApi layer
  • 69. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.69 ● Mostly looked at the data nodes and the NdbApi event Api so far. ● The Ndb Binlog Injector (BI) is a component of MySQL Servers when part of a MySQL Cluster ● A better name might be 'Ndb event listener', as it is responsible for more than just Binlog generation. ● The BI uses the NdbApi event Api to listen to row changes on an internal schema table (mysql.ndb_schema). ● The BI also uses the NdbApi event Api to listen to row changes on all other tables (unless exceptions are made with mysql.ndb_replication) ● The BI writes Binlog transactions and DDL statements to the Binlog. ● The BI maintains a local mysql.ndb_binlog_index table for mapping epoch numbers to Binlog files and positions. Ndb Binlog Injector
  • 70. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.70 ● The Binlog Injector is always process #1 in the SHOW PROCESSLIST output – this can be used to check its state :
mysql> show processlist;
+----+-------------+------+------+---------+------+-----------------------------------+------+
| Id | User        | Host | db   | Command | Time | State                             | Info |
+----+-------------+------+------+---------+------+-----------------------------------+------+
|  1 | system user |      |      | Daemon  |    0 | Waiting for event from ndbcluster | NULL |
● This can be used for monitoring, or debugging event buffer growth. Ndb Binlog Injector
  • 71. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.71 ● Schema changes in MySQL Cluster involve cooperation between all attached MySQL Servers, so that they can take any necessary steps, and serve the new schema immediately when it is committed. ● Event subscription to an internal table (mysql.ndb_schema) is used to accomplish this. All MySQL Servers listen for changes on this table using their BI and modifications to this table are used to communicate (rows as shared memory!) ● Schema changes can generate binlog entries - “DROP INDEX...” ● The volume of change on this table is very low, and related to DDL. ● BI uses a separate Ndb object (and EventBuffer) for its subscription to mysql.ndb_schema row changes, so occasionally it can be seen in epoch_slip logs. Usually the buffer is very small. Ndb Binlog Injector + schema distribution
  • 72. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.72 ● The BI is an event processing loop, which consumes events from a 'schema event' subscription to the mysql.ndb_schema table, and a 'data event' subscription to all the tables being Binlogged by the server. ● The loop has the following pseudo-code :
while (!(disconnected || error || ...))
  consume all schema events for epoch, taking steps required
  begin binlog transaction
  insert 'fake' ndb_apply_status write row for epoch
  consume all data events for epoch, writing to binlog transaction
  decide whether to commit or rollback binlog transaction
  commit/rollback
  write details of epoch transaction to mysql.ndb_binlog_index table
Ndb Binlog Injector main loop
  • 73. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.73 Ndb Binlog Injector main loop [Diagram: schema and data event streams flow through NdbApi into the BI, which writes epoch transactions (BEGIN, WRITE_ROW/UPDATE_ROW/DELETE_ROW, COMMIT) and DDL statements such as DROP INDEX via the Server Binlog code and its Transaction cache to the binlogs, and records positions in mysql.ndb_binlog_index via the Server MyISAM code.]
  • 74. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.74 Observations : ● BI is a bottleneck for all changes to be Binlogged in the cluster. Not commonly a problem though... ● BI relies on generic MySQL Binlogging code (Binlog transaction cache, Inline Binlog rotate and purge) ● BI contends for OS locks inside generic MySQL Binlogging code (though should be low/no contention if deployed as recommended) ● BI relies on generic MySQL processing and MyISAM table handling for ndb_binlog_index table maintenance (Table locking) ● Health and liveness of the Binlog Injector is very important for MySQL Cluster replication Ndb Binlog Injector
  • 75. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.75 The epoch slip cluster logs mentioned before indicate where the BI is not keeping up with the latest available epochs. NdbApi counters exposed as status variables indicate the volume of data processed by the data subscriptions of the BI :
mysql> show status like 'ndb_api%injector';
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| Ndb_api_event_data_count_injector    | 0     |
| Ndb_api_event_nondata_count_injector | 2     |
| Ndb_api_event_bytes_count_injector   | 256   |
+--------------------------------------+-------+
The output of SHOW ENGINE NDB STATUS indicates the progress of the BI in terms of epochs Ndb Binlog Injector Monitoring
  • 76. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.76 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | latest_epoch : Latest completed epoch from NdbApi point of view – tail of event buffer latest_trans_epoch : Epoch of most recent transaction committed to Ndb from this server. latest_received_binlog_epoch : Epoch of the most recently consumed event from the head of event buffer. latest_handled_binlog_epoch : Epoch of the most recently completely processed epoch latest_applied_binlog_epoch : Epoch of the most recently completely processed epoch, which resulted in Binlog write. Ndb Binlog Injector Monitoring
  • 77. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.77 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | Observations : - latest_epoch == latest_received_binlog_epoch == latest_handled_binlog_epoch The NdbApi event buffer is empty, and the BI is idle - latest_trans_epoch < latest_handled_binlog_epoch Every cluster write done by this server has been binlogged - latest_handled_binlog_epoch > latest_applied_binlog_epoch Recent epochs have not had any binloggable content (Quiet cluster, or slave updates...) Ndb Binlog Injector Monitoring
  • 78. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.78 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | Inferences : ● latest_epoch > other epochs : There are epochs in the NdbApi Event Buffer ● latest_received_binlog_epoch > latest_handled_binlog_epoch : BI is processing an epoch now. ● latest_trans_epoch > latest_handled_binlog_epoch : Some transactions committed by this server are not yet in the Binlog. ● Also : SHOW BINARY LOGS and SHOW MASTER STATUS The progress of Binlog writing can be seen. Ndb Binlog Injector Monitoring
  • 79. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.79 The BI produces uniform transactions in the Binlog. Each Binlog transaction describes all of the changes that occurred in a single cluster epoch. For this reason they are sometimes referred to as Epoch transactions. Each user transaction occurs in one epoch, so each user transaction's changes are recorded in one epoch transaction, and cannot span epochs. DDL statements and any other Binlogging activity will occur between these transactions, not within them. These transactions have the structure: ● BEGIN event : The position of the BEGIN event is the position of the transaction ● 1+ TABLE_MAP events : At least one for the mysql.ndb_apply_status table ● 1 WRITE_ROW event to mysql.ndb_apply_status ● 1+ other events to other tables (WRITE_ROW, UPDATE_ROW, DELETE_ROW). As normal with RBR, each event can contain changes to multiple rows ● COMMIT event. The first WRITE_ROW to ndb_apply_status is a 'fake' event generated by the BI. Ndb Binlog transactions
  • 80. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.80 Ndb Binlog transactions [Diagram: multi-row user transactions committing in the same epoch in the Cluster map to a single epoch transaction in the Binlog: BEGIN, table maps, 'fake' WRITE_ROW, row events, COMMIT.] Epoch transactions contain all the changes necessary to move a Slave from one consistent epoch boundary to the next.
  • 81. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.81 Merging all of the user transactions that occurred in an epoch into a single epoch transaction in the Binlog is both a strength and a weakness. It allows higher performance at the Slave, but complicates Binlog positioning. When looking at the Binlog on a Slave cluster, we can see that the upstream Master's epochs are considered to be user transactions by the Slave, so they can be merged together into one epoch transaction in the Slave's binlog. This is a source of efficiency, but can cause problems when performing failover between clusters. Ndb Binlog transactions
  • 82. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.82 One problem with a slave cluster merging a master cluster's epochs together is slave promotion. A common topology is a 'read scaled' setup with 1 Master cluster, and n Slave clusters. When the Master cluster fails, one of the Slaves is selected to become the new Master, and the other Slaves must fail over their replication to the new Master. The problem with epoch merging here is that the old Master's epoch stream (A1,A2,A3,A4,A5) may have been applied by Slave B as (B1(A1), B2(A2,A3), B3, B4(A4,A5)). If Slave B becomes Master, and Slave C has stopped at old Master epoch A2, which epoch transaction boundary should it begin replicating from in Slave B's Binlog? B2 or B3? Slave promotion Another motivation for Slave defaulting to IDEMPOTENT mode
  • 83. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.83 Ndb supports some extra optimisations to minimise the size of the Binlog transactions it produces : - log_update_as_write This causes update events on tables to be logged as WRITE_ROW (Insert) events. It requires that the downstream slave can idempotently apply a WRITE_ROW event. The optimisation is that the row's 'before image' need not be sent. - log_updated_only This causes the NdbApi event (and SUMA subscription) to only send modified columns. The BI then only puts the modified columns in an update or write_row event, saving space and time. Both of these options need care to ensure behaviour is correct. Ndb Binlog transaction optimisations
  • 84. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.84 For finer-grained logging control, the mysql.ndb_replication table can be used. This allows binlogging on-or-off, log-update-as-write and log-updated-only to be controlled per table, per binlogging server. It now supports wildcards for easier use. It also supports defining conflict detection / resolution algorithms. Ndb_replication table
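A sketch of creating and populating the table, using the schema documented for recent releases; the binlog_type values shown (7 = full row images with updates logged as updates, 1 = no logging) and the '%' wildcard are assumptions to verify against your version's manual:

CREATE TABLE IF NOT EXISTS mysql.ndb_replication (
  db          VARBINARY(63),
  table_name  VARBINARY(63),
  server_id   INT UNSIGNED,
  binlog_type INT UNSIGNED,
  conflict_fn VARBINARY(128),
  PRIMARY KEY USING HASH (db, table_name, server_id)
) ENGINE=NDB PARTITION BY KEY(db, table_name);

-- Full before/after images, updates as updates, for every table in db1,
-- on all binlogging servers (server_id = 0 means 'any server')
INSERT INTO mysql.ndb_replication VALUES ('db1', '%', 0, 7, NULL);

-- Disable binlogging entirely for one high-churn table
INSERT INTO mysql.ndb_replication VALUES ('db1', 'scratch', 0, 1, NULL);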
  • 85. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.85 ● Used to map epoch number to binlog file and position ● Append(Insert) only from binlog injector ● Read only during failover from mysql clients ● Bulk deletes occur during PURGE or RESET MASTER. ● MyISAM, so concurrency controlled using single table lock! ● One problem here is that a long running activity holding the table lock can block the BI, as its thread gets stalled waiting for a table lock to insert into the table. This generally causes epoch slip and event buffer backlogs. ● Known bad case is where a Binlog file is PURGEd, manually or automatically, requiring many many rows to be deleted from mysql.ndb_binlog_index. This can stall the BI for some time. ndb_binlog_index table
  • 86. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.86 Schema has evolved over time :
  `Position` bigint(20) unsigned NOT NULL,       -- basic start position mapping
  `File` varchar(255) NOT NULL,
  `epoch` bigint(20) unsigned NOT NULL,
  `inserts` int(10) unsigned NOT NULL,           -- epoch statistics
  `updates` int(10) unsigned NOT NULL,
  `deletes` int(10) unsigned NOT NULL,
  `schemaops` int(10) unsigned NOT NULL,
  `orig_server_id` int(10) unsigned NOT NULL,    -- slave epoch merge info
  `orig_epoch` bigint(20) unsigned NOT NULL,
  `gci` int(10) unsigned NOT NULL,               -- handy gci number
  `next_position` bigint(20) unsigned NOT NULL,  -- new : next position mapping
  `next_file` varchar(255) NOT NULL,
  PRIMARY KEY (`epoch`,`orig_server_id`,`orig_epoch`)
ndb_binlog_index table content
  • 87. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.87 The slave epoch merge info is only present with the --ndb-log-orig server option. If it is not set, those values are set to 0. Normally the BI will insert one row per epoch transaction into ndb_binlog_index. With --ndb-log-orig, it will insert one additional row for every upstream master epoch transaction that a Slave MySQLD has applied to this cluster in this epoch. This gives an indication of how an upstream master's epoch transactions were merged into a Slave's epoch transactions – useful for cutover. These upstream epoch rows do not contain epoch statistics values – those are only produced for the local cluster's row. ndb_binlog_index table epoch merge info
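With --ndb-log-orig set, that merge mapping can be read straight out of the table. An illustrative query (not from the deck):

-- How were upstream Master epochs merged into local epoch transactions?
-- Rows with orig_server_id <> 0 are the extra per-upstream entries;
-- rows with orig_server_id = 0 are the local cluster's own rows.
SELECT epoch AS local_epoch, orig_server_id, orig_epoch, File, Position
FROM   mysql.ndb_binlog_index
WHERE  orig_server_id <> 0
ORDER  BY epoch, orig_server_id, orig_epoch;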
  • 88. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.88 The ndb_binlog_index table has always contained a mapping from an epoch number to an epoch transaction start position. However for replication channel cutover, the slave cluster generally has an already-applied epoch number from the ndb_apply_status table, and so what it needs is the binlog content after the last applied epoch. Various tricky and error-prone techniques have evolved to do this, particularly for the cases where it is not easy (e.g. the last applied epoch transaction is the last epoch transaction in the binlog). Recently added 'next event position' columns to ndb_binlog_index mean that, rather than trying to find some entry representing a next event, we can just directly obtain the correct position. ndb_binlog_index table next position
  • 89. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.89 ndb_binlog_index table next position
Previously :
● ndb_binlog_index contains only epoch->start_pos
● Cutover involves inequality in WHERE epoch > <x>, sort + limit (requires scanning)
● Cutover does not detect that the new Master is missing relevant events.
● Cutover can silently skip over non-epoch-transaction events, e.g. DDL.
Now :
● ndb_binlog_index also contains epoch->next_pos
● Cutover involves equality WHERE epoch = <x>
● Cutover detects that the new Master is missing relevant events.
● Cutover will find non-epoch-transaction events e.g. DDL, and can stop
Recommended!
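A sketch of the new-style cutover; the server_id list is a placeholder for the old Master's binlogging MySQLDs, and the @latest value must be carried from the Slave cluster to the new Master by the failover logic. An empty result from the second query means the new Master is missing the relevant binlog:

-- On the new Slave cluster: highest epoch already applied from the old Master
SELECT MAX(epoch) INTO @latest
FROM   mysql.ndb_apply_status
WHERE  server_id IN (1, 2);  -- old Master's binlogging server_ids (placeholder)

-- On the new Master: equality lookup for the position *after* that epoch
SELECT next_file, next_position
FROM   mysql.ndb_binlog_index
WHERE  epoch = @latest;

-- On the new Slave: point the channel at that position
-- CHANGE MASTER TO MASTER_LOG_FILE='<next_file>', MASTER_LOG_POS=<next_position>;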
  • 90. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.90 Limitations of the next_position cutover ● ndb_apply_status and ndb_binlog_index track only epoch transactions so 'inter-epoch DDL' application status is not visible. ● Previously failover could silently skip inter-epoch DDL at a cutover point. ● Now it will find it. This can lead to duplicate application of DDL causing the Slave to stop. ● Duplicate DDL can be ignored using --slave-skip-errors=ddl_exist_errors ● ndb_binlog_index only tracks empty epochs if --ndb-log-empty-epochs=1 is set. This has disk + network bandwidth impacts. ● Backup and Restore can insert an ndb_apply_status entry with the restore point of the backup as an epoch number, so that replication can be used to catch up from this position. ● If the restore point epoch was empty, and --ndb-log-empty-epochs=0, then it won't be in ndb_binlog_index and we revert to trying to find the 'next' position. ndb_binlog_index table next position
  • 91. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.91 ● Bulk deletes as part of PURGE are of the form : DELETE FROM mysql.ndb_binlog_index WHERE File=<name>; ● The File column is unindexed, but even with an index, this is a lot of work. ● The worst case is very many epochs per Binlog file (e.g. small epochs/large files). Can happen on low write-rate clusters. ● Workaround : Split the DELETE into multiple invocations with a LIMIT clause, as sketched below. Can allow the BI to progress in most cases. ● Better designs : Use InnoDB? Writers don't block readers. Use partition by File? DELETE becomes DROP PARTITION. ● Check BI status with SHOW PROCESSLIST to see if it's blocked on a table lock. ndb_binlog_index table and PURGE Avoid --expire-logs-days, PURGE manually and consider pre-deleting rows from ndb_binlog_index in batches
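A sketch of that workaround as a stored procedure (the procedure name, batch size and file names are illustrative); small batches release the MyISAM table lock between invocations so the BI can get in:

DELIMITER //
CREATE PROCEDURE pre_purge_binlog_index(IN f VARCHAR(255))
BEGIN
  REPEAT
    -- The File value must match exactly what the server stored;
    -- check with SELECT DISTINCT File FROM mysql.ndb_binlog_index;
    DELETE FROM mysql.ndb_binlog_index WHERE File = f LIMIT 10000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;

CALL pre_purge_binlog_index('./binlog.000042');  -- placeholder file name
PURGE BINARY LOGS TO 'binlog.000043';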
  • 92. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.92 ● Queries on ndb_binlog_index are used for replication channel cutover, so can be time-critical. ● In cases where the mysql.ndb_binlog_index file is huge, they can be slow. Beware a low client timeout here! ● Indexes can be added using normal ALTER TABLE mechanisms (see below), to speed up performance. ● Low write-rate clusters can have high numbers of epochs per file. Review whether these clusters are keeping excessive binlog (and therefore have excessively large mysql.ndb_binlog_index files), and consider rotating and purging more often. ndb_binlog_index table and queries
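For example (the index name is illustrative; ndb_binlog_index is a local MyISAM table, so a plain ALTER works):

-- Speed up PURGE-time deletes and file-based lookups
ALTER TABLE mysql.ndb_binlog_index ADD INDEX ix_file (File);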
  • 93. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.93 ● The BI regularly needs to hold : ● the Server-internal Binlog index lock (unavailable during rotate/purge) ● the Server-internal Binlog lock for binlogging (unavailable if other client threads are binlogging) ● a table lock on mysql.ndb_binlog_index (unavailable during any other access, due to MyISAM). ● The BI is 'just another client thread' from the p.o.v. of the generic MySQL Binlog code. So it can get involved in Binlog rotation itself. If --expire_logs_days is used then this can involve PURGE! Binlog Injector problems
  • 94. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.94 ● The Ndb Slave is almost entirely the standard MySQL replication slave system until calls into the Ndb storage engine component are made. ● The IO thread is not modified in any way. ● The SQL thread makes the normal calls into the SE interface, but IDEMPOTENT mode is hard-coded on. ● Batching is the number one source of Ndb performance improvements, and this is the case in the slave. ● The standard RBR events allow limited batching of multiple row changes within a single event. ● Ndb extends this batching using the --slave-allow-batching server option. Ndb Slave Batching, batching, batching
  • 95. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.95 ● When applying an epoch transaction from an upstream master, each event is applied serially as normal. ● The Ndb handler uses the event application to define NdbApi operations for the events, but only executes them when either a full batch is defined, or there is a data dependency. ● The batch size in bytes is specified using the --ndb_batch_size Server parameter ● In recent experiments, there appeared to be very little downside in maximising the configured batch size, so that most epoch transactions are executed in a single batch. ● Batching effectiveness can be measured – see later Ndb Slave batching Increase your batch size!
  • 96. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.96 ● For PK insert/update/delete to in-memory or cached data, which is what RBR should mostly be doing, most of the response time is due to communication latency between the nodes involved. ● So we spread this latency cost over as many operations in a batch as possible. ● What's more, the operations in a batch can run in parallel on different threads of a data node, or different data nodes. ● Finally, even with disk-data the operations in a batch get parallel access to the underlying table space Ndb Slave batching Increase your batch size!
  • 97. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.97 ● Batching is similar to pipelining in a processor; it can be broken by data dependencies and system limitations. ● Where an RBR event needs to read some data, there is a data dependency that must be satisfied before it can execute. This requires that the current batch is flushed, then the read is performed, then the update. Many round trips! ● This is one reason why tables without primary keys are inefficient – they require reads or, even worse, scans to find matching rows for update and delete. ● Where Blob/Text columns are being modified there is an implicit dependency in the implementation which requires that we lock the main table row before modifying parts table rows. This requires a batch flush to obtain the lock, so breaks up any surrounding batch. Ndb Slave batch breakup Avoid writing to PK-less tables Beware of Blobs
  • 98. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.98 Ndb Slave batching [Diagram: an epoch transaction (BEGIN, table maps, 20 row events, COMMIT) read from the relay log is applied to the Slave Cluster in 3 round trips – moderate batching, 20:3.]
  • 99. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.99 Ndb Slave batching [Diagram: the same 20-event epoch transaction applied to the Slave Cluster in a single round trip – maximum batching, 20:1.]
  • 100. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.100 Ndb Slave batching [Diagram: the same 20-event epoch transaction applied in 20 round trips – minimum batching, 20:20.] The Slave SQL thread CPU capacity is finite, and without good batching it is limited by waiting for responses.
  • 101. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.101 ● The Ndb Slave has Ndb object statistics which can be monitored while it is running. Beware that these statistics are currently lost when the Slave thread is stopped. ● mysql> SHOW STATUS LIKE 'ndb_api%slave'; ● The meaning of the values is documented in the manual, but some are of special interest : ● Ndb_api_bytes_sent_count_slave Can give rough indication of the apply rate of the slave in bytes/s ● Ndb_api_trans_commit_count_slave Can give indication of the apply rate of the slave in epochs/s ● Ndb_api_wait_exec_complete_count_slave Can give indication of the round trips performed by the slave Ndb Slave monitoring These are monotonic counters, so to get rates, you must sample on some period, and determine the difference between samples. Details online.
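A sketch of turning these monotonic counters into a rate by sampling twice, using INFORMATION_SCHEMA.GLOBAL_STATUS (available in the 5.6-based servers of this era; the 10 second interval is arbitrary):

SELECT VARIABLE_VALUE INTO @commits0
FROM   INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE  VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave';

DO SLEEP(10);

-- Epoch transactions applied per second over the sample window
SELECT (VARIABLE_VALUE - @commits0) / 10 AS epoch_txns_per_sec
FROM   INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE  VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave';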
  • 102. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.102 ● Ndb_api_pk_op_count_slave Can give rough indication of the apply rate of the slave in rows/s ● Ndb_api_trans_abort_count_slave Can give indication of slave apply problems – locking or temporary errors. (bad) ● Ndb_api_read_row_count_slave Can give indication of whether the slave is performing any reads (bad) ● Ndb_api_table|range_scan_count_slave Can give indication of whether the slave is performing any scan reads (bad) ● Ndb_api_wait_nanos_count_slave Can indicate time spent waiting for the data nodes – with caveats. Ndb Slave monitoring
  • 103. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.103 Interesting ratios : ● Avg batches/epoch transaction (batching ratio) : Ndb_api_wait_exec_complete_count_slave / Ndb_api_trans_commit_count_slave ● Avg bytes/epoch transaction : Ndb_api_bytes_sent_count_slave / Ndb_api_trans_commit_count_slave ● Avg rows/epoch transaction : Ndb_api_pk_op_count_slave / Ndb_api_trans_commit_count_slave Ndb Slave monitoring
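The same status variables can be combined directly in SQL; a sketch of the batching ratio (remember these are lifetime counters – for a live view, difference two samples as above):

SELECT
  (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS
   WHERE VARIABLE_NAME = 'Ndb_api_wait_exec_complete_count_slave')
  /
  (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS
   WHERE VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave')
  AS avg_batches_per_epoch_txn;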
  • 104. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.104 The Ndb slave is often unlike all other NdbApi clients, as it is entirely serial, but generates large transactions with mixed operation types and can be very intensive. The most intense period is when a Slave is 'catching up' with a Binlog – it can apply epoch transactions much faster than they were originally committed on the Master. This can cause overload for the Slave cluster – redo logs can get stressed, and SendBuffers on the Slave cluster's data nodes sending to its binlogging MySQLDs can be overloaded at commit time. Current recommendation is : - Experiment with 'worst cases' and check behaviour and rates measured. - Monitor rates in production to get notification when they approach their tested limits. Ndb Slave notes
  • 105. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.105 ● One observation of the MySQL Cluster replication system is that it is missing some end-to-end checks. It relies on the correct operation of the lower layers of generic MySQL replication. ● Partly this is by design – the replication layer treats events and transactions separately and avoids dependencies between them. Most SEs have no slave- specific logic. ● However some cross-checks are simple and effective : ● No jumping back : Received ndb_apply_status epoch numbers should never decline without a Master pos change. ● No repeats : Received ndb_apply_status epoch numbers should never repeat without a rollback or Master pos change. ● No retry failures : Received ndb_apply_status epoch numbers should not increase without a commit, or Master pos change. Ndb Slave improvements
  • 106. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.106 ● Even better would be a check that epochs are in the expected sequence, but that requires binlogging changes : ● No gaps : Received ndb_apply_status epoch numbers should follow a sequence, where each epoch includes its successfully binlogged predecessor's number. Ndb Slave improvements
  • 107. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.107 Recommendations
  • 108. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.108 ● Performance recommendations Not the main focus ● Robustness recommendations ● Potential cluster improvements. Technical details in preceding slides Recommendations
  • 109. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.109 Binlog Injector ● Set binlog_cache_size so that there's no spill-to-disk (see the check below) Slave ● --slave_allow_batching ● Increase ndb_batch_size (+ test) ● Avoid primary-key-less tables ● Beware replicating Blobs/Text ● Monitor slave activity using SHOW STATUS LIKE 'ndb_api%slave' Performance recommendations
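A sketch of checking the binlog_cache_size recommendation above: Binlog_cache_disk_use counts transactions that spilled the in-memory cache to a temporary file, so it should stay at or near zero. The size below is an example value only, and the global setting applies to new sessions, so a config-file change plus restart is the safe route for the BI:

SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
-- Binlog_cache_disk_use should not be climbing

SET GLOBAL binlog_cache_size = 64 * 1024 * 1024;  -- example value only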
  • 110. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.110 Data nodes ● Set MaxBufferedEpochs high enough so that it indicates a real (hard to reproduce) issue ● Test SendBuffer configuration, especially from data nodes to binlogging MySQLDs to ensure commit of the largest transactions and heaviest load can be handled. (Slave catchup?) Binlog Injector ● Monitor SHOW STATUS LIKE 'ndb_api%injector' to understand normal and excess flows. ● Monitor SHOW PROCESSLIST to check BI state ● Monitor SHOW ENGINE NDB STATUS to get NdbApi buffering indication Robustness recommendations
  • 111. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.111 Binlog Injector continued ● Monitor SHOW MASTER STATUS to understand outgoing binlog rates ● Consider using --ndb-eventbuffer-max-alloc to avoid excessive event buffer usage destabilising the host ndb_binlog_index table ● Avoid using --expire-logs-days ● Consider manual purge, potentially with pre-delete of ndb_binlog_index rows in small batches ● Consider adding indexes if cutover queries are too slow Robustness recommendations
  • 112. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.112 System restart ● Ensure that after a system restart, the cluster is not brought back online immediately, as it will need some form of consistency restoration. Replication channel cutover ● Consider using the new replication channel cutover query, alongside --slave-skip-errors=ddl_exist_errors Slave ● Test system robustness under a prolonged 'Master catchup' scenario. Monitor Slave cluster redo logs, redo log state, SendBuffer overload, Binlogging MySQLD lag etc. Robustness recommendations