MySQL Cluster replication
Frazer Clement
Senior Software Engineer, Oracle
frazer.clement@oracle.com
messagepassing.blogspot.com
April 2014
Session Agenda
• Intro
• MySQL replication
• MySQL Cluster replication
• Recommendations
• Advanced topics
Disclaimer
THE FOLLOWING IS INTENDED TO OUTLINE OUR GENERAL PRODUCT DIRECTION. IT IS INTENDED FOR INFORMATION PURPOSES ONLY, AND MAY NOT BE INCORPORATED INTO ANY CONTRACT. IT IS NOT A COMMITMENT TO DELIVER ANY MATERIAL, CODE, OR FUNCTIONALITY, AND SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. THE DEVELOPMENT, RELEASE, AND TIMING OF ANY FEATURES OR FUNCTIONALITY DESCRIBED FOR ORACLE'S PRODUCTS REMAINS AT THE SOLE DISCRETION OF ORACLE.
Introduction
Who?
Frazer Clement
Senior Software Engineer, Oracle
Based in Edinburgh, UK.
Joined MySQL AB in 2007, then Sun, then Oracle...
Worked on NdbApi, replication, cluster membership, conflict detection and most areas of Cluster.
Worked with customers for several years to help solve their problems.
Previously worked for Nortel / IBM on HLR, HSS...
Including using MySQL Cluster as an HSS database since ~2005.
Strong telco focus.
Expectations
● The audience is familiar with replication concepts, limitations etc. We will review some of that material, but just as context.
● You will have questions as we proceed – happy to answer as we go, unless it is too time consuming, in which case we defer to the end.
● I don't have material to cover everything, but can use a whiteboard.
● I will not know all the answers :)
● I might ask you questions :)
● You will have great ideas and suggestions!
● Happy to discuss concepts and ideas, but I cannot commit to any future development or fixes.
Structure
● I cover MySQL replication separately from MySQL Cluster replication here.
● This reflects the implementation – much of it is in the generic MySQL Server code base, which means:
● We benefit from all the features implemented at that level.
● We benefit from the testing performed by the huge installed base of users.
● It is implemented by a different team within Oracle – they work on it for us for 'free' :)
● It is designed to be generic across different storage engines (MyISAM, InnoDB, Ndb cluster).
● It is not so easily modified to suit the specific needs of MySQL Cluster.
MySQL replication
MySQL replication topologies
[Diagram: Master–Slave, Master–Master, Circular + Slaves, and Master – multi-slave tree topologies]
Constraint: 1 Master per Slave at a time.
MySQL replication components
[Diagram: Master Server (SQL clients → Server logic → Pluggable Storage Engines / Binlogging → Binlog files → Binlog Dump Threads) feeding Slave servers (Slave IO thread → Relay log files → Slave SQL thread)]
MySQL uses IO buffering, so when producer and consumer are close, data is passed in memory.
MySQL Binlogging 1
[Diagram: SQL clients → Server logic → Pluggable Storage Engines → Binlogging → Binlog Index file and BinlogFile.000003–000006 over time]
Binlog rotation is based on file size, or a manual flush command. Purge can be time based or manual.
During transaction execution, a binlog transaction is cached by client threads. At transaction commit time, the client thread takes the binlog lock and writes the transaction to the binlog.
[In the diagram, light yellow binlog transaction caches are in memory; strong yellow ones have spilled to disk.]
MySQL Binlogging 2
The Binlog gives a storage engine independent operation log.
● For single-server systems, Binlog durability via fsync is important.
● fsync has a large performance impact.
● Binlog file size (--max_binlog_size) trades off the number of files and rotation/open/close cost against the granularity of purge.
● Size-based rotation occurs as part of writing an event to the binlog – i.e. some client session does the work and spends the time.
● Time-based PURGE (--expire_logs_days) is triggered when rotating, and is also performed by some client session.
● The binlog transaction cache (--binlog_cache_size) bounds the largest in-memory cacheable binlog transaction. Larger transactions are spooled to disk prior to commit-to-binlog. See SHOW STATUS LIKE '%binlog_cache%';
The PURGE LOGS and FLUSH LOGS commands can manually invoke purge and rotate actions.
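For example, rotation, purging and cache monitoring can all be driven manually (file names and dates here are illustrative):
mysql> FLUSH LOGS;                                      -- rotate to a new binlog file
mysql> PURGE BINARY LOGS TO 'BinlogFile.000004';        -- remove binlogs before the named file
mysql> PURGE BINARY LOGS BEFORE '2014-04-01 00:00:00';  -- or purge by date
mysql> SHOW STATUS LIKE 'Binlog_cache%';                -- Binlog_cache_disk_use vs Binlog_cache_use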
MySQL Binlogging 3
● The Binlog is a series of variable sized events.
● Events have a type, size and server id.
● Some are Binlog metadata: FORMAT_DESC, ROTATE.
● The original implementation was 'Statement Based Replication' (SBR) – including statements in QUERY event types. It is challenging to ensure determinism at the Slave.
● 'Row Based Replication' (RBR) uses WRITE_ROW, UPDATE_ROW and DELETE_ROW events and can improve slave performance for small transactions.
● In both cases, transactions are demarcated by BEGIN and COMMIT QUERY events.
● Transactions are kept in a single Binlog file – often rotation occurs after a COMMIT event.
MySQL Cluster uses Row Based Replication.
MySQL RBR
Row based replication
● The basic format of a transaction is:
BEGIN, TABLE_MAP*, (WRITE_ROW|UPDATE_ROW|DELETE_ROW)*, COMMIT
● TABLE_MAP is for efficiency – mapping a table name to an id for the scope of the transaction – later *_ROW events use the id.
● Each *_ROW event can contain one or more sets of row images (changed rows), where each set of images:
● Is the same operation type (INSERT/UPDATE/DELETE)
● Is on the same table
● Affects the same columns
● RBR can include some statements: DDL etc...
● As normal, SQL_LOG_BIN=0 can temporarily disable Binlogging.
RBR events have few determinism issues.
RBR Example
FORMAT_DESC                     -- Description event at start of Binlog file
CREATE TABLE HLR.AUTH_INFO(...  -- Create table recorded in event
BEGIN                           -- Binlog transaction start
TABLE_MAP(HLR.CUG, 1)           -- Table map events with transaction scope
TABLE_MAP(HLR.AUTH_INFO, 2)
WRITE_ROW(1, …)                 -- Events containing one or more row image sets
WRITE_ROW(2, …)
UPDATE_ROW(1, …)
WRITE_ROW(1, …)
DELETE_ROW(1, …)
COMMIT                          -- Binlog transaction end
BEGIN                           -- Start of new Binlog transaction
TABLE_MAP(HLR.AUTH_INFO, 1)
DDL statements appear between DML transactions – not interleaved.
mysqlbinlog --verbose is great for analysing Binlogs + Relay logs.
Also SHOW BINARY LOGS, SHOW BINLOG EVENTS.
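For example (file name illustrative; decode-rows renders the row events readably):
shell> mysqlbinlog --verbose --base64-output=decode-rows BinlogFile.000006
mysql> SHOW BINARY LOGS;
mysql> SHOW BINLOG EVENTS IN 'BinlogFile.000006' LIMIT 10;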
MySQL Binlogging
DDL statements are (should be) relatively rare, so the Binlog files should be mostly sequences of back-to-back transactions.
Binlog file positions are denoted using a file {name, position} pair. The position is a byte offset, and many offsets are invalid. Valid offsets are the start of an event.
[Diagram: the active Binlog file as a sequence of events, with byte offsets from 0 and new events appended at the end]
MySQL Binlog dump threads
[Diagram: SQL clients → Server logic → Pluggable Storage Engines → Binlogging → Binlog files → Binlog Dump Threads → Slave servers]
- Slaves connect to a Master in the normal way as a client, authenticate, then issue a BINLOG_DUMP command, which causes their session thread to become what we call a BINLOG DUMP thread.
- BINLOG DUMP threads have a Binlog {File, Position} pair from where they are reading. Each can have a different position.
- Where they are close to the 'head' of the active Binlog, data is passed via memory.
- Generally the Binlog Dump threads read as much data as is available and can be sent over to the Slave – TCP backpressure allows the Slave to control the rate.
MySQL slave threads: IO thread
[Diagram: Master Server → Slave IO thread → Relay log Index file and RelaylogFile.000019–000022 over time → Slave SQL thread → Server logic → Pluggable Storage Engines]
- The Slave IO thread connects to the Master and issues the BINLOG_DUMP command.
- Events received are filtered by server id, then written to relay log files.
- The Slave IO thread can operate entirely separately from the Slave SQL thread, replicating the Binlog files from the Master.
- Relay logs are almost exactly the same as Binlogs.
- FLUSH LOGS can manually rotate the active relay log.
IO and SQL threads can be stopped and started together or separately using START SLAVE and STOP SLAVE.
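A minimal sketch of pointing a slave at a master and starting both threads (host, credentials and position are illustrative):
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='master.example.com',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='secret',
    ->   MASTER_LOG_FILE='BinlogFile.000004',
    ->   MASTER_LOG_POS=4;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G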
MySQL slave threads: SQL thread
[Diagram: as before – the Slave IO thread writes relay logs; the Slave SQL thread reads them and calls into the Server logic and Storage Engines]
- The Slave SQL thread reads relay logs, via memory if possible, and executes the events.
- Event execution can result in Storage Engine (SE) calls, defining transactions, operating on data etc.
- When the Slave SQL thread reaches the end of a relay log file and moves onto the next, the old relay log file is purged automatically.
- Relay logs are mostly 'invisible' in normal operation.
Separate IO and SQL threads decouple k-safety / geo redundancy from replica consistency / slave apply performance limitations or issues.
MySQL slave SQL thread
● Most storage engines (e.g. InnoDB, MyISAM) have no slave-specific code. They are invoked by the Server logic and behave as they would for any client session invocation.
● Normal MySQL replication has a 1:1 mapping between 'user transactions' and 'binlog transactions'.
● User transactions execute in parallel at the Master and are serialised only when writing their binlog cache contents to the Binlog at transaction commit time.
● The Binlog forces a single serial order on concurrent transactions.
● At the Slave, the SQL thread executes transactions in Binlog order. This may be less concurrent than the original execution at the Master.
● At the Slave, the SQL thread must perform all blocking disk I/O, which might have been performed by concurrent threads on the Master. This can be the limit on Slave throughput.
Recent MTS (multi-threaded slave) work alleviates the single-threaded slave SQL thread limitation somewhat.
MySQL slave SQL thread
● The slave SQL thread can encounter permanent or temporary errors.
● Permanent errors generally stop the SQL thread (but not the IO thread).
● Some permanent error types can be ignored: --slave-skip-errors
● Some classes of errors can be ignored with an alias: --slave-skip-errors=ddl_exist_errors
● Temporary errors result in transaction rollback and limited retries.
● --slave_transaction_retries controls the maximum number of retries before a temporary error will cause the SQL thread to stop.
● Temporary errors result in an entry in the Server's error log file.
● Retries are not immediate, but have bounded, growing inter-retry delays to give time for temporary conditions to resolve.
● Problematic events can be manually skipped by setting the sql_slave_skip_counter variable before restarting the Slave SQL thread.
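For example, skipping one problematic event group and resuming (a sketch of the standard procedure):
mysql> STOP SLAVE SQL_THREAD;
mysql> SET GLOBAL sql_slave_skip_counter = 1;
mysql> START SLAVE SQL_THREAD;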
MySQL Cluster replication
MySQL Cluster replication
● Topologies and Components
● Cluster internals
● HA and distribution
● Event ordering
● SUMA buffering and duplicates
● SendBuffer
● NdbApi and SUMA concepts
● NdbApi internals
● NdbApi event buffering
● Events and Blobs
● Binlog Injector
● Ndb slave
MySQL Cluster replication
● Ndb Binlog Injector
● Ndb slave
MySQL Cluster replication
Built on top of MySQL replication:
● Tried and tested stack
● Flexibility – replicate to and from other systems
● Benefit from performance improvements and bug fixing
MySQL Cluster adds:
● Binlogging of changes from other cluster clients, including NdbApi
● HA replication – no SPOF
● Transactional replication position on the slave
● Higher replication performance through batching and parallelism
● Moving parts and complexity
MySQL Cluster replication topologies
[Diagram: Master–Slave, Master–Master, Circular + Slaves, and Master – multi-slave tree, with whole clusters in each role]
All the standard topologies, with whole clusters as M, S or S/M.
MySQL Cluster replication topologies 2
[Diagram: Star/Hub with upstream master and downstream Slave tree; Multi-master]
A Slave Cluster can have multiple Masters.
MySQL Cluster replication components
For HA, each cluster has redundant Master and Slave MySQL Servers. Most commonly two servers, with only one Slave Server active at a time. Both MySQL Servers write Binlog, but commonly only one is serving Binlogs to downstream slaves. A single server can perform both Master and Slave roles.
[Diagram: MySQL Cluster data nodes and other clients, with Master servers receiving NdbApi events and Slave servers applying NdbApi DML/DDL, connected over the MySQL protocol]
MySQL Cluster replication components 2
[Diagram: each Slave server has IO and SQL threads plus relay logs; each Master server has a Binlog Injector (INJ), DUMP threads and Binlogs; all attach to the data nodes via NDB]
A single server can perform both Master and Slave roles simultaneously.
MySQL cluster global checkpoint 1
● MySQL Cluster is designed for high write-rate systems, with low, stable latency.
● Parallelism and concurrency exist at many levels: rows, operations, fragments, data nodes, transactions, clients.
● Operations, transactions and therefore clients generally only interact/contend where they operate on the same data (rows).
● Otherwise, transactions are entirely parallel and unsynchronised.
● Great for throughput with low latency.
● Does not provide a single serial history of transactions.
● Does not provide a notion of consistent points in time.
● Consistent points are identified by a separate mechanism – the Global Checkpoint.
MySQL cluster global checkpoint 2
● Consistent points are used for:
- Creating potential system / backup recovery points
- Defining points in change event streams (replication)
● The set of changes between two consistent points is referred to as an epoch, and is identified by a cluster-unique 64-bit epoch number.
● Epoch numbers have a high 32-bit word called the GCI (Global Checkpoint Index), and a low 32-bit word called the micro-GCI.
● System/backup recovery points are at GCI boundaries, and are created on the period TimeBetweenGlobalCheckpoints – defaults to 2000 millis.
● Event stream consistency points are at micro-GCI boundaries, and are created on the period TimeBetweenEpochs – defaults to 100 millis.
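Since the 64-bit epoch number is (GCI << 32) | micro_GCI, the two halves can be extracted with a shift and a mask – for example against the epoch column of mysql.ndb_apply_status:
mysql> SELECT server_id,
    ->        epoch >> 32 AS gci,
    ->        epoch & 0xFFFFFFFF AS micro_gci
    ->   FROM mysql.ndb_apply_status;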
MySQL cluster global checkpoint 3
● The ratio between these times results in the pattern of epoch numbers seen. The default ratio is 20:1, resulting in micro-GCI values 0..19, then an increment of the GCI value.
● In disk-overloaded systems, sometimes the GCI increment is stalled for longer, and so higher micro-GCI values are seen – this can be a warning of redo disk IO problems.
● Epoch numbers are often logged as <GCI>/<microGCI>, generally more readable than the 64-bit representation.
● Epoch numbers are assigned at transaction commit time, by the transaction's Transaction Coordinator (TC) – a component on the data node.
● To get permission to commit, and an epoch number assigned, a transaction must be fully prepared – e.g. holding all the row locks it needs. This implies that it has no dependencies on any other running transaction.
MySQL cluster global checkpoint 4
● The Global Checkpoint protocol works by broadcasting a prepare-to-increment signal to all TC instances in the cluster, causing them to gate new transaction commits (but continue all other processing). Once all TC instances have acknowledged, a commit-increment signal is broadcast, and all TC instances resume committing.
● The effect is that the parallel streams of committing transactions are divided into before and after, with the following properties:
All transactions in epoch n 'happened before' those in epoch n+1.
Therefore an epoch boundary is a consistent point.
● Note that a parallel system has many equivalent partial event orderings, and epochs are just one of them, selected arbitrarily.
MySQL cluster row change events
● Epoch numbers are assigned when transactions begin to commit.
● Commit of large transactions, especially those involving disk data tables, can take some time.
● Post-commit triggers in the Tuple Manager (TUP) component in the data nodes send row change details to the Subscription Manager component.
● The Subscription Manager (SUMA) manages the forwarding / multicasting of row change details to NdbApi / MySQLD clients.
● Each row change has an associated epoch number.
● When a TUP instance has completed commit processing for all transactions in an epoch, it notifies SUMA.
● When all of the local TUP instances have completed an epoch, SUMA informs its subscribers.
MySQL cluster row change events flow
[Diagram: data nodes (NoOfReplicas=2), each with TC, LDM/TUP and SUMA instances, sending NdbApi events to Subscriber 1 and Subscriber 2 API nodes]
Writes are synchronously replicated within a nodegroup.
All SUMA components in a nodegroup observe all write events.
Events are hashed to buckets independently of the node-local fragment replica role, e.g. write events can be delivered from 'Primary' or 'Backup' fragments.
MySQL cluster row change events – nodegroup
[Diagram: as before]
A single-row write transaction from an API node routes to one TC instance, from where it is forwarded to the LDM instance managing the primary fragment replica for the fragment. Then it is synchronously replicated to the nodegroup peer. Both nodes forward the event to their SUMA instance, but only one SUMA forwards the event to the subscribing API nodes. The other buffers it.
MySQL cluster row change events + epochs
[Diagram: as before]
After an epoch increment, the TUP instances gradually finish committing the transactions from the previous epoch and forwarding their events to the local SUMA instance. When SUMA receives this 'epoch completed' signal it forwards it to its subscribers. This tells the subscribers that they have received all of the events for the given epoch from the source data node.
MySQL cluster row change events + epoch ack
[Diagram: as before]
Once subscribers have received an epoch-completed event, they immediately respond with an acknowledgement back to the data nodes. Once all subscribers have acknowledged reception of an epoch to all data nodes, SUMA can release the epoch's event buffer space for reuse.
MySQL cluster row change events + failure
[Diagram: as before, with one data node failed]
When a node fails unexpectedly, there will likely be events sent to subscribers for as-yet unacknowledged epochs. In this case, there is uncertainty about whether all of the subscribers received all of the unacknowledged events or not. To solve this, a surviving node in the nodegroup will 'take over' the bucket, and use its buffered events to resend unacknowledged epochs.
MySQL cluster row change events + resend
[Diagram: as before]
Depending on when the failure occurred, its nature, buffers etc., the subscribers may have received the original events or not. In any case they are re-sent by a nodegroup peer. Currently this can mean that the API sees duplicate events within an epoch. Once all buffered events are re-sent, the normal epoch-completed protocol is followed.
STRICT mode problem.
MySQL cluster row change event ordering
Observations:
● Row changes are distributed according to the row's primary key distribution hash.
● A single transaction can affect different fragments on the same or different data nodes.
● Even within a data node, the row changes are forwarded or buffered based on bucket membership, which is a function of the primary key hash, but independent of the local fragment replica's current primary or backup role.
● The cluster does not actively maintain transaction ordering information within an epoch.
● Therefore: events arrive at subscribers in a partially sorted order.
MySQL cluster row change event ordering 2
Events arrive at subscribers in a partially sorted order:
● Events in epoch n+1 occurred after events in epoch n.
● Events within an epoch are only ordered w.r.t. individual primary keys. (It is guaranteed that a given primary key value will be in a particular SUMA bucket.)
Implications:
● Inter-row constraints may be temporarily violated when applying the row events in order (unique keys, foreign keys).
● It's not trivial to extract and order the original 'user transactions' from the event stream (this requires a per-event transaction id and a topological sort).
● Consistency is only guaranteed at epoch boundaries.
MySQL cluster row change event duplicates
In node failure cases, unacknowledged epochs are re-sent to subscribers. Subscribers do not currently filter out the re-sends. Therefore it is possible to have duplicated events in an epoch.
This is one reason why Ndb slaves require IDEMPOTENT mode – to allow them to handle cases where sequences of operations to a primary key are partially or fully repeated...
[INSERT, UPDATE] → [INSERT, UPDATE][INSERT, UPDATE]
[INSERT, UPDATE] → [INSERT][INSERT, UPDATE]
[DELETE] → [DELETE][DELETE]
[UPDATE, DELETE] → [UPDATE, DELETE][UPDATE, DELETE]
MySQL cluster row change SUMA balance
The row changes occurring in each nodegroup are divided into a number of buckets (or slices), using a primary key hash. The number of slices is designed to always balance across the available nodes in a nodegroup.
Responsibility for forwarding events in each slice to subscribers is given to one of the nodes in the group, and the others will buffer the events.
As each node in the nodegroup is forwarding and buffering the same number of slices/buckets, and as the slices are based on an MD5 hash, their forwarding IO and buffering capacity should be balanced.
For NoOfReplicas=2, each nodegroup has two slices, with each node responsible for forwarding one and buffering one.
MySQL cluster row change SUMA buffering
● Data nodes buffer changes in SUMA to handle unexpected data node failure without a break in the NdbApi row change event streams.
● SUMA event buffering is in-memory and finite.
● Buffer space is consumed by data change in the local node, and released by acknowledgements from all subscribing Api nodes.
● Buffer space is liable to increase due to: network problems to subscribers, slow subscribers, failed-but-not-yet-detected Api nodes, and cluster write rate spikes.
● To protect the data nodes, SUMA event buffering can be limited in terms of the number of epochs buffered and the number of bytes (cluster config MaxBufferedEpochs and the (new) MaxBufferedEpochBytes).
MySQL cluster row change SUMA buffering
● MaxBufferedEpochBytes limits the amount of memory SUMA can use for buffering events.
● MaxBufferedEpochs limits the number of unacknowledged epochs that SUMA will accept from any subscriber.
● Reaching MaxBufferedEpochBytes is not an immediate problem, as the data nodes can stop buffering but keep forwarding. However, it means that the cluster is no longer resilient to data node failure – in the event of a data node failure, the Api nodes will be informed that there is a gap in the event stream. For replication, this requires a Backup + Restore cycle to resync the Slave. For this reason it should be avoided.
● "Out of event buffer: nodefailure will cause event failures, consider increasing MaxBufferedEpochBytes"
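A config.ini sketch of these two limits (the values are illustrative defaults, not recommendations; MaxBufferedEpochBytes is the newer parameter, so check availability in your version):
[ndbd default]
MaxBufferedEpochs=100           # epochs a lagging subscriber may fall behind
MaxBufferedEpochBytes=26214400  # bytes of SUMA event buffering per node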
MySQL cluster row change SUMA buffering
● When MaxBufferedEpochs is reached, subscribers with that epoch lag will be disconnected, allowing their buffered data to be released.
● This asynchronous disconnect from the data node side of the connection will result in all data nodes disconnecting the subscribing Api node, and will appear to NdbApi like a 'cluster failure'.
● For MySQLD, the Binlog Injector thread will inject a GAP Incident event into the Binlog.
● The Api node is then free to reconnect and attempt to establish new subscriptions...
● "Disconnecting lagging nodes ..."
MaxBufferedEpochs can be used like a 'watchdog' on lagging subscribers – perhaps disconnecting them and allowing them to reconnect can clear problems they may be having.
It is not really necessary as a guard on the data node buffering capacity, as that is limited by MaxBufferedEpochBytes. However, beware when setting it to an ultra-high value that a SUMA-internal pool is allocated based on its setting.
MySQL cluster row change SUMA buffering
Monitoring:
● Bytes buffered in SUMA: cannot be directly monitored.
● Epochs buffered in SUMA: NdbInfo ndb$pools, block is SUMA, resource is "GCP".
● DUMP 8013 puts a summary of the oldest buffered epoch, and which subscribers' acknowledgements are pending, into the cluster log.
Potential improvements: NdbInfo views on subscriptions, subscribers, buckets, epochs, buffered bytes, volumes of data sent to individual subscribers etc...
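For example, the DUMP code can be issued from the management client, and the hidden pool table queried from a MySQLD. This is a sketch: DUMP codes are internal and unsupported, and the ndb$pools column and pool names used below are assumptions to verify against your version:
ndb_mgm> ALL DUMP 8013
mysql> SELECT node_id, used, total
    ->   FROM `ndbinfo`.`ndb$pools`
    ->  WHERE pool_name = 'GCP';  -- assumed filter for the SUMA 'GCP' resource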
MySQL cluster row change SendBuffer
● Most deployments have two or more 'Binlogging MySQLDs', and a set of NdbApi clients and/or non-binlogging MySQLD servers.
● The Binlogging MySQLDs subscribe to row change events on almost every table in the cluster – this means that the data sent from data nodes to Api nodes is noticeably higher for Binlogging MySQLDs than for other Api clients.
● When the rate of change in the Cluster is high, this imbalance can cause problems with the SendBuffer resource.
● SendBuffer is used to decouple the non-blocking data node core from the blocking socket send protocols used to actually send data remotely.
● Binlogging MySQLDs generally need more SendBuffer configured on the data nodes than other Api nodes, to soak up spikes in the change rate.
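Per-link SendBuffer can be raised for the data node to binlogging MySQLD connections in config.ini – a sketch with illustrative node ids and size:
[tcp]
NodeId1=1          # data node
NodeId2=10         # binlogging MySQLD
SendBufferMemory=8M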
MySQL cluster row change SendBuffer
[Diagram: data nodes (NoOfReplicas=2) and many Api nodes, with NdbApi events flowing only to Subscriber 1 and Subscriber 2]
Generally there are many more Api nodes than subscribers.
Binlogging MySQLDs subscribe to all row change events – so they are directly affected by the rate of change in a cluster.
Links to Binlogging MySQLDs must be allowed more SendBuffer than other Api links.
MySQL cluster system restart + row changes
Epoch boundaries are used for system restore capabilities.
A cluster system restart is where all nodes start from on-disk state – LCP + redo logs.
However, only a subset of epoch boundaries (GCI boundaries) are eligible as system restore points.
GCI boundaries are epochs with micro_GCI == 0. They normally occur every TimeBetweenGlobalCheckpoints milliseconds – defaults to 2000 millis.
However, epochs are incremented every TimeBetweenEpochs millis – defaults to 100 millis.
Therefore the completed epochs will almost always contain changes which will not be available after a sudden system restart.
MySQL cluster system restart + row changes
Completed epochs will almost always contain changes which will not be available after a sudden system restart.
● Therefore the Binlogs of subscribing MySQLD nodes can contain changes which are not in the cluster data nodes after a restart.
● Therefore the downstream slave(s) can contain changes which are not stored in the (old) Master cluster after a restart.
● Therefore care must be taken to understand the restoration point of the failed cluster during the restart, so that a decision about how to resync can be made – perhaps local or slave Binlogs can be replayed?
● A failed cluster should not be allowed to come online and serve traffic after a system restart without some analysis of the data lost.
Row change events in NdbApi
NdbApi has the NdbEvent and NdbEventOperation classes.
Clients can create and share NdbEvents by name. Each NdbEvent refers to a single table, and is parameterised by whether only modified columns, or all columns, are included. NdbEvents have a lifecycle independent of the creating NdbApi client, so care must be taken.
Clients create and use NdbEventOperation objects to request that the row change events defined by a particular NdbEvent object be sent to the NdbApi client. A single NdbApi client might use many NdbEventOperations (one per table) to get a view of the row changes occurring in some subset of tables.
SUMA p.o.v.: Event = Subscription, EventOperation = Subscriber.
Row change events in NdbApi
The set of currently defined events in a cluster can be seen by looking at the hidden NDB$EVENTS_0 table...

> ndb_select_all -c<> -dsys NDB$EVENTS_0
NAME EVENT_TYPE TABLEID TABLEVERSION TABLE_NAME ATTRIBUTE_MASK SUBID SUBKEY
"REPL$mysql/ndb_schema" 262143 4 1 "mysql/def/ndb_schema" [511 0 0 0] 1 65537
"REPL$mysql/ndb_apply_status" 65535 6 1 "mysql/def/ndb_apply_status" [31 0 0 0] 3 65539
"NDB$BLOBEVENT_REPL$mysql/ndb_schema_3" 393215 5 1 "mysql/def/NDB$BLOB_4_3" [15 0 0 0] 2 65538
3 rows returned
NDBT_ProgramExit: 0 - OK

MySQL replication created events usually start with REPL$ or REPLF$ for the modified-only or all-columns variants.
The subscription to mysql/ndb_schema is used by all MySQLDs to communicate about schema changes.
Note that the mysql/ndb_schema table (id 4) has a blob column which needs its own event to track those changes.
The attribute_mask here is deprecated and irrelevant.
NdbApi event concepts
NdbApi allows users to define and share Events, and to define EventOperations to subscribe to the row changes defined by the Events.
NdbApi Event == SUMA Subscription: a description of the type of changes that are of interest. Can be shared by many NdbApi clients.
NdbApi EventOperation == SUMA Subscriber: a way to request that the flow of row events from a particular Event/Subscription start flowing to this client. EventOperations are associated with an Ndb object.
NdbApi event concepts
[Diagram: relationships between Event (= Subscription), EventOperation (= Subscriber), Table, Ndb object, EventBuffer, Api node and Cluster – each Event refers to one Table, each EventOperation refers to one Event, each Ndb object has one EventBuffer, and an Api node hosts many Ndb objects]
NdbApi event buffering
● An Ndb object can be used to subscribe to changes on one or more tables by creating NdbEventOperations on them, and then polling for incoming events.
● The SUMA components on the data nodes will start sending the row changes on some epoch boundary.
● As events and epoch boundary signals arrive, thread(s) internal to the NdbApi library receive, acknowledge and buffer them.
● The buffering exists to decouple the data transfer from the data nodes to the Api client from the Api client's consumption of events.
● In cases where the change rate is very high, or the Api client is slow or stalled, the buffer will grow.
● Excessive growth of the NdbApi event buffer can cause stability issues.
NdbApi event buffering
[Diagram: data nodes → receiver thread → event buffer → user code calling Ndb::pollEvents(), all within the Api process]
Within NdbApi, the receiver thread(s) buffer and acknowledge reception of row data for an epoch. This occurs independently of the user code behaviour, and helps with quick release of data node SUMA event buffer space.
User code can retrieve new events by calling the pollEvents() method. This will return the next event from the head of the Ndb object's event buffer. Only events in completed epochs are made available to the pollEvents method.
NdbApi event buffering
NdbApi Event Buffer
implementation
+ Zeroconf
- Unbounded growth
- Can destabilise host
+ Avoid OS malloc/free
- Never return memory to OS
- Often continues to grow despite
having 'free space'
Cases observed where :
- Event buffer growth causes
host slowdown due to paging,
resulting in decrease in both
user code event consumption
rate and receiver thread
performance. Eventually
MaxBufferedEpochs drives
client disconnect and buffer re-
initialisation (good outcome)
.
- Event buffer growth causes
Linux OOM killer to choose an
ndbmtd host to kill to relieve
memory pressure.
- Memory allocated from OS
continues to grow despite
buffer having large % free.
Hard crashing limit on size has
been implemented.
Soft 'GAP insert' limit in-
progress.
NdbApi event buffer monitoring
Each Event Buffer is linked to an Ndb object. Generally you want the minimum number of buffers per process.
Currently the NdbApi event Api only exposes buffering information in terms of epochs:
- Latest epoch (GCI): the most recent epoch completely received by the NdbApi receiver thread (tail of the event buffer).
- Apply epoch (GCI): the epoch of the event currently being consumed by NdbApi user code (head of the event buffer).
These are of limited use as:
- Epoch numbers are sparse; there is no direct indication of the number of epochs buffered.
- Epochs are of different sizes in terms of row changes and change size.
NdbApi event buffer monitoring
NdbApi (and MySQLD) allow a 'GCI slip threshold' to be configured.

  --ndb-report-thresh-binlog-epoch-slip=#
      Threshold on number of epochs to be behind before reporting binlog status. E.g. 3 means that if the difference between what epoch has been received from the storage nodes and what has been applied to the binlog is 3 or more, a status message will be sent to the cluster log.
  --ndb-report-thresh-binlog-mem-usage=#
      Threshold on percentage of free memory before reporting binlog status. E.g. 10 means that if the amount of available memory for receiving binlog data from the storage nodes goes below 10%, a status message will be sent to the cluster log.

These threshold crossings cause cluster log events to be generated...
NdbApi event buffer monitoring
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/21
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/22
…
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/10
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/11
Note how apply_epoch stays fixed while latest_epoch climbs – the consumer is slipping behind.
NdbApi event buffer monitoring
NdbApi also includes some per-Ndb event stream monitoring counters. These are incremented independently of event consumption.

Ndb::getClientStat()
    DataEventsRecvdCount    = 18, /* Number of table data change events received */
    NonDataEventsRecvdCount = 19, /* Number of non-data events received */
    EventBytesRecvdCount    = 20, /* Number of bytes of event data received */

These can be seen from a MySQLD instance using:
 > SHOW STATUS LIKE 'ndb_api%injector';
NdbApi event buffer limiting
A configurable crash-stop limit on event buffer size was implemented recently. The idea is that a process shutdown and restart (with accompanying replication channel failover) is preferable to host OS destabilisation when event buffer growth is excessive.

  --ndb-eventbuffer-max-alloc=#
      Maximum memory that can be allocated for buffering events by the ndb api

Work is in progress to implement a less severe limit – where excessive buffering causes incoming events to be discarded, and the consumer becomes aware of the 'gap' when it reaches it.
NdbApi events for Blobs
● Ndb Blobs are mostly implemented at the NdbApi layer of the stack.
● Data for Blob columns is split into a small header and zero or more 'parts' (e.g. a 256 byte header, plus 0..n parts of 2000 bytes, 4000 bytes...).
● The 'header' data is a normal column in the table, something like a VARBINARY(272).
● The parts are rows in a hidden table, defined just to hold parts for that particular column. These Blob part tables are named NDB$BLOB_<tableid>_<columnnumber>.
● From the point of view of the data nodes, the part tables are normal 'user tables'.
● This allows arbitrary length data to be transactionally stored in Blob (or Text) columns, but adds complexity at the NdbApi layer.
NdbApi events for Blobs
● When a transaction modifying a Blob value (insert, update, delete) occurs, it internally involves operations on the main table as normal, and zero or more operations on rows in the part table(s) involved.
● TUP and SUMA treat the Blob part tables as separate tables.
● NdbApi receives row changes for the Blob part tables separately from the main table row change, with no ordering constraints between different rows, as normal.
● With the merge_events option on, NdbApi correlates the main table and part table events so that the Blob part table row changes are used to create a pseudo main table row change event containing all the Blob changes.
● This is implemented using an in-memory hash by PK in the NdbApi.
Event merge merges all events on a row in an epoch – e.g. separate user transactions are merged together.
NdbApi event sequencing
● NdbApi receives row events and epoch completion signals from the SUMA components of the running data nodes.
● NdbApi internally buffers events for incomplete epochs. These are not made visible to NdbApi users until all data nodes have indicated that the epoch is completed.
● As data nodes complete epochs in order, the NdbApi nodes will release event data for each epoch to the user, in order.
● A user thread serially consuming events from the event Api can be sure that when the first event for some epoch m > n is received, there will be no more events for epoch n.
● This sequencing is a kind of merge-sort by epoch number on the data node event streams.
Empty epochs at the NdbApi layer
● In a quiet or idle cluster, epochs can occur which have no row changes in them.
● The normal global checkpoint mechanisms occur, and all data nodes will send epoch-completed signals, and expect acknowledgements.
● This can be seen in the 'Latest epoch' values, which continue to climb in an idle cluster. This is the only indication of empty epochs occurring at the NdbApi layer.
● In some cases, an epoch may have events which a user does not consider relevant – e.g. slave-applied updates when --log-slave-updates=0. In this case the epoch is not empty at the NdbApi level, but may be considered as empty by the next layer up.
Ndb Binlog Injector
● We have mostly looked at the data nodes and the NdbApi event Api so far.
● The Ndb Binlog Injector (BI) is a component of MySQL Servers that are part of a MySQL Cluster.
● A better name might be 'Ndb event listener', as it is responsible for more than just Binlog generation.
● The BI uses the NdbApi event Api to listen to row changes on an internal schema table (mysql.ndb_schema).
● The BI also uses the NdbApi event Api to listen to row changes on all other tables (unless exceptions are made with mysql.ndb_replication).
● The BI writes Binlog transactions and DDL statements to the Binlog.
● The BI maintains a local mysql.ndb_binlog_index table for mapping epoch numbers to Binlog files and positions.
Ndb Binlog Injector
● The Binlog Injector is always process #1 in the SHOW PROCESSLIST output – this can be used to check its state:

mysql> show processlist;
+----+-------------+------+------+---------+------+-----------------------------------+------+
| Id | User        | Host | db   | Command | Time | State                             | Info |
+----+-------------+------+------+---------+------+-----------------------------------+------+
|  1 | system user |      |      | Daemon  |    0 | Waiting for event from ndbcluster | NULL |

● This can be used for monitoring, or for debugging event buffer growth.
Ndb Binlog Injector + schema distribution
● Schema changes in MySQL Cluster involve cooperation between all attached MySQL Servers, so that they can take any necessary steps, and serve the new schema immediately when it is committed.
● Event subscription to an internal table (mysql.ndb_schema) is used to accomplish this. All MySQL Servers listen for changes on this table using their BI, and modifications to this table are used to communicate (rows as shared memory!).
● Schema changes can generate binlog entries – "DROP INDEX...".
● The volume of change on this table is very low, and related to DDL.
● The BI uses a separate Ndb object (and EventBuffer) for its subscription to mysql.ndb_schema row changes, so occasionally it can be seen in epoch_slip logs. Usually the buffer is very small.
Ndb Binlog Injector main loop
● The BI is an event processing loop, which consumes events from a 'schema event' subscription to the mysql.ndb_schema table, and a 'data event' subscription to all the tables being Binlogged by the server.
● The loop has the following pseudo-code:

while (!(disconnected || error || ...))
    consume all schema events for epoch, taking steps required
    begin binlog transaction
    insert 'fake' ndb_apply_status write row for epoch
    consume all data events for epoch, writing to binlog transaction
    decide whether to commit or rollback binlog transaction
    commit/rollback
    write details of epoch transaction to mysql.ndb_binlog_index table
Ndb Binlog Injector main loop
[Diagram: NdbApi schema and data event streams feed the Ndb Binlog Injector, which drives the Server Binlog code (via the transaction cache) to write binlogs (BEGIN, WRITE_ROW, UPDATE_ROW, DELETE_ROW, ..., COMMIT, DROP INDEX A; BEGIN ...), and the Server MyISAM code to maintain mysql.ndb_binlog_index]
Ndb Binlog Injector
Observations:
● The BI is a bottleneck for all changes to be Binlogged in the cluster. It is not commonly a problem though...
● The BI relies on generic MySQL Binlogging code (Binlog transaction cache, inline Binlog rotate and purge).
● The BI contends for OS locks inside generic MySQL Binlogging code (though there should be low/no contention if deployed as recommended).
● The BI relies on generic MySQL processing and MyISAM table handling for ndb_binlog_index table maintenance (table locking).
● The health and liveness of the Binlog Injector is very important for MySQL Cluster replication.
Ndb Binlog Injector Monitoring
The epoch slip cluster logs mentioned before indicate where the BI is not keeping up with the latest available epochs.
NdbApi counts exposed as status variables indicate the volume of data processed by the data subscriptions of the BI:

mysql> show status like 'ndb_api%injector';
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| Ndb_api_event_data_count_injector    | 0     |
| Ndb_api_event_nondata_count_injector | 2     |
| Ndb_api_event_bytes_count_injector   | 256   |
+--------------------------------------+-------+

The output of SHOW ENGINE NDB STATUS indicates the progress of the BI in terms of epochs.
Ndb Binlog Injector Monitoring
mysql> SHOW ENGINE NDB STATUS;
| ndbcluster | binlog | latest_epoch=10672993730572, latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572, latest_handled_binlog_epoch=10672993730572, latest_applied_binlog_epoch=1022202216475 |

latest_epoch: the latest completed epoch from the NdbApi point of view – the tail of the event buffer.
latest_trans_epoch: the epoch of the most recent transaction committed to Ndb from this server.
latest_received_binlog_epoch: the epoch of the most recently consumed event from the head of the event buffer.
latest_handled_binlog_epoch: the epoch of the most recently completely processed epoch.
latest_applied_binlog_epoch: the epoch of the most recently completely processed epoch which resulted in a Binlog write.
Ndb Binlog Injector Monitoring
From the same SHOW ENGINE NDB STATUS output, observations:
- latest_epoch == latest_received_binlog_epoch == latest_handled_binlog_epoch: the NdbApi event buffer is empty, and the BI is idle.
- latest_trans_epoch < latest_handled_binlog_epoch: every cluster write done by this server has been binlogged.
- latest_handled_binlog_epoch > latest_applied_binlog_epoch: recent epochs have not had any binloggable content (quiet cluster, or slave updates...).
Ndb Binlog Injector Monitoring
Further inferences:
● latest_epoch > the other epochs: there are epochs in the NdbApi Event Buffer.
● latest_received_binlog_epoch > latest_handled_binlog_epoch: the BI is processing an epoch now.
● latest_trans_epoch > latest_handled_binlog_epoch: some transactions committed by this server are not yet in the Binlog.
● Also, with SHOW BINARY LOGS and SHOW MASTER STATUS the progress of Binlog writing can be seen.
Ndb Binlog transactions
The BI produces uniform transactions in the Binlog. Each Binlog transaction describes all of the changes that occurred in a single cluster epoch. For this reason they are sometimes referred to as epoch transactions. Each user transaction occurs in one epoch, so each user transaction's changes are recorded in one epoch transaction, and cannot span epochs.
DDL statements and any other Binlogging activity will occur between these transactions, not within them.
These transactions have the structure:
● BEGIN event: the position of the BEGIN event is the position of the transaction.
● 1+ TABLE_MAP events: at least one, for the mysql.ndb_apply_status table.
● 1 WRITE_ROW event to mysql.ndb_apply_status.
● 1+ other events to other tables (WRITE_ROW, UPDATE_ROW, DELETE_ROW). As normal with RBR, each event can contain changes to multiple rows.
● COMMIT event.
The first WRITE_ROW to ndb_apply_status is a 'fake' event generated by the BI.
Ndb Binlog transactions
[Diagram: multi-row user transactions committing in the same epoch in the Cluster are written as one epoch transaction in the Binlog – BEGIN, table maps, the 'fake' WRITE_ROW, the user changes, COMMIT]
Epoch transactions contain all the changes necessary to move a Slave from one consistent epoch boundary to the next.
Ndb Binlog transactions
Merging all of the user transactions that occurred in an epoch into a single epoch transaction in the Binlog is both a strength and a weakness. It allows higher performance at the Slave, but complicates Binlog positioning.
When looking at the Binlog on a Slave cluster, we can see that the first Master's epochs are considered to be user transactions by the Slave, so they can be merged together into one epoch transaction in the Slave's binlog.
This is a source of efficiency, but can cause problems when performing failover between clusters.
Slave promotion
One problem with a slave cluster merging a master cluster's epochs together is slave promotion.
A common topology is a 'read scaled' setup with 1 Master cluster and n Slave clusters.
When the Master cluster fails, one of the Slaves is selected to become the new Master, and the other Slaves must fail over their replication to the new Master.
The problem with epoch merging here is that the old Master's epoch stream (A1, A2, A3, A4, A5) may have been applied by Slave B as (B1(A1), B2(A2,A3), B3, B4(A4,A5)). If Slave B becomes Master, and Slave C has stopped at old Master epoch A2, which epoch transaction boundary should it begin replicating from in Slave B's Binlog? B2 or B3?
This is another motivation for the Slave defaulting to IDEMPOTENT mode.
Ndb Binlog transaction optimisations
Ndb supports some extra optimisations to minimise the size of the Binlog transactions it produces:
- log_update_as_write
This causes update events on tables to be logged as WRITE_ROW (insert) events. It requires that the downstream slave can idempotently apply a WRITE_ROW event. The optimisation is that the row's 'before image' need not be sent.
- log_updated_only
This causes the NdbApi event (and SUMA subscription) to only send modified columns. The BI then only puts the modified columns in an update or write_row event, saving space and time.
Both of these options need care to ensure behaviour is correct.
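These correspond to the MySQLD options --ndb-log-update-as-write and --ndb-log-updated-only – a my.cnf sketch:
[mysqld]
ndb-log-update-as-write=1  # log updates as WRITE_ROW, no before image
ndb-log-updated-only=1     # only send/log modified columns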
Ndb_replication table
For finer-grained logging control, the mysql.ndb_replication table can be used.
This allows binlogging on-or-off, log-update-as-write and log-updated-only to be controlled per table, per binlogging server.
It now supports wildcards for easier use.
It also supports defining conflict detection / resolution algorithms.
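A sketch of creating the table and disabling binlogging for one table – the schema follows the documentation of this era, and the binlog_type value used (1 = NBT_NO_LOGGING) is worth verifying for your version:
mysql> CREATE TABLE mysql.ndb_replication (
    ->   db VARBINARY(63),
    ->   table_name VARBINARY(63),
    ->   server_id INT UNSIGNED,
    ->   binlog_type INT UNSIGNED,
    ->   conflict_fn VARBINARY(128),
    ->   PRIMARY KEY (db, table_name, server_id)
    -> ) ENGINE=NDB PARTITION BY KEY(db, table_name);
mysql> -- server_id 0 = applies to all servers; binlog_type 1 = no logging
mysql> INSERT INTO mysql.ndb_replication VALUES ('mydb', 'mytable', 0, 1, NULL);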
ndb_binlog_index table
● Used to map epoch numbers to binlog files and positions.
● Append (insert) only, from the binlog injector.
● Read only during failover, from mysql clients.
● Bulk deletes occur during PURGE or RESET MASTER.
● MyISAM, so concurrency is controlled using a single table lock!
● One problem here is that a long-running activity holding the table lock can block the BI, as its thread gets stalled waiting for a table lock to insert into the table. This generally causes epoch slip and event buffer backlogs.
● A known bad case is where a Binlog file is PURGEd, manually or automatically, requiring very many rows to be deleted from mysql.ndb_binlog_index. This can stall the BI for some time.
ndb_binlog_index table content
The schema has evolved over time:

  `Position` bigint(20) unsigned NOT NULL,       -- Basic start position mapping
  `File` varchar(255) NOT NULL,
  `epoch` bigint(20) unsigned NOT NULL,
  `inserts` int(10) unsigned NOT NULL,           -- Epoch statistics
  `updates` int(10) unsigned NOT NULL,
  `deletes` int(10) unsigned NOT NULL,
  `schemaops` int(10) unsigned NOT NULL,
  `orig_server_id` int(10) unsigned NOT NULL,    -- Slave epoch merge info
  `orig_epoch` bigint(20) unsigned NOT NULL,
  `gci` int(10) unsigned NOT NULL,               -- Handy GCI number
  `next_position` bigint(20) unsigned NOT NULL,  -- New: next position mapping
  `next_file` varchar(255) NOT NULL,
  PRIMARY KEY (`epoch`,`orig_server_id`,`orig_epoch`)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.87
The slave epoch merge info is only present with the --ndb-log-orig
server option. If it is not set, those values are set to 0.
Normally the BI will insert one row per epoch transaction into
ndb_binlog_index.
With --ndb-log-orig, it will insert one additional row for every upstream
master epoch transaction that a Slave MySQLD has applied to this
cluster in this epoch.
This gives an indication of how an upstream master's epoch transactions
were merged into a Slave's epoch transactions – useful for cutover.
These upstream epoch rows do not contain epoch statistics values –
those are only produced for the local cluster's row.
ndb_binlog_index table epoch merge info
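A hedged sketch of inspecting the merge info (column names per the
schema shown earlier; this assumes --ndb-log-orig is set on the
binlogging server and that upstream rows carry a non-zero
orig_server_id) :

SELECT epoch, orig_server_id, orig_epoch
  FROM mysql.ndb_binlog_index
 WHERE orig_server_id <> 0
 ORDER BY epoch;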
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.88
The ndb_binlog_index table has always contained a mapping from an
epoch number to an epoch transaction start position.
However, for replication channel cutover, the slave cluster generally has
an already-applied epoch number from the ndb_apply_status table, so
what is needed is the binlog position just after the last applied epoch.
Various tricky and error-prone techniques evolved to derive this, which
fail in awkward cases (e.g. the last applied epoch transaction is the last
epoch transaction in the binlog).
Recently, 'next event position' columns were added to ndb_binlog_index
so that, rather than trying to find some entry representing a next event,
we can directly obtain the correct position.
ndb_binlog_index table next position
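A hedged sketch of the cutover lookup; the two queries run on different
servers (ndb_apply_status on the slave cluster, ndb_binlog_index on the
new master's binlogging MySQLD), so the epoch value is normally
carried between them by the failover script :

-- On the slave cluster : find the last applied epoch
SELECT MAX(epoch) FROM mysql.ndb_apply_status;
-- On the new master : equality lookup of the next position
SELECT SUBSTRING_INDEX(next_file, '/', -1) AS file, next_position AS pos
  FROM mysql.ndb_binlog_index
 WHERE epoch = <last_applied_epoch>;
-- Then : CHANGE MASTER TO MASTER_LOG_FILE = <file>, MASTER_LOG_POS = <pos>;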
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.89
Previously :
● ndb_binlog_index contains only epoch -> start_pos
● Cutover involves an inequality (WHERE epoch > <x>) plus sort and
limit, which requires scanning
● Cutover does not detect that the new Master is missing relevant
events
● Cutover can silently skip over non-epoch-transaction events,
e.g. DDL
Now :
● ndb_binlog_index also contains epoch -> next_pos
● Cutover involves an equality (WHERE epoch = <x>)
● Cutover detects that the new Master is missing relevant events
● Cutover will find non-epoch-transaction events, e.g. DDL, and
can stop
ndb_binlog_index table next position
Recommended!
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.90
Limitations of the next_position cutover
● ndb_apply_status and ndb_binlog_index track only epoch transactions, so
'inter-epoch DDL' application status is not visible.
● Previously, failover could silently skip inter-epoch DDL at a cutover point.
● Now it will find it. This can lead to duplicate application of DDL, causing the
Slave to stop.
● Duplicate DDL can be ignored using --slave-skip-errors=ddl_exist_errors
● ndb_binlog_index only tracks empty epochs if --ndb-log-empty-epochs=1 is
set. This has disk and network bandwidth impacts.
● Backup and Restore can insert an ndb_apply_status entry with the restore
point of the backup as an epoch number, so that replication can be used to
catch up from this position.
● If the restore point epoch was empty, and --ndb-log-empty-epochs=0, then it
won't be in ndb_binlog_index and we revert to trying to find the 'next' position.
ndb_binlog_index table next position
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.91
● Bulk deletes as part of PURGE are of the form :
DELETE FROM mysql.ndb_binlog_index WHERE File=<name>;
● The File column is unindexed, but even with an index, this is a lot of work.
● The worst case is very many epochs per Binlog file (e.g. small
epochs/large files). This can happen on low write-rate clusters.
● Workaround : Split the DELETE into multiple invocations with a LIMIT
clause. This can allow the BI to progress in most cases.
● Better designs : Use InnoDB? Writers don't block readers. Use
partitioning by File? DELETE becomes DROP PARTITION.
● Check BI status with SHOW PROCESSLIST to see if it's blocked on a
table lock.
ndb_binlog_index table and PURGE
Avoid --expire-logs-days, PURGE manually
and consider pre-deleting rows from
ndb_binlog_index in batches
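A hedged sketch of the pre-delete workaround; the file name and batch
size are illustrative (the File column holds the name as recorded in the
index), and the DELETE is repeated until it affects zero rows :

DELETE FROM mysql.ndb_binlog_index
 WHERE File = './binlog.000123' LIMIT 10000;
-- ...repeat until 0 rows affected, then :
PURGE BINARY LOGS TO 'binlog.000123';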
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.92
● Queries on ndb_binlog_index are used for replication channel cutover,
so they can be time-critical.
● In cases where the mysql.ndb_binlog_index file is huge, they can be
slow. Beware a low client timeout here!
● Indexes can be added using normal ALTER TABLE mechanisms to
speed up these queries.
● Low write-rate clusters can have high numbers of epochs per file. Review
whether these clusters are keeping excessive binlog (and therefore have
excessively large mysql.ndb_binlog_index files), and consider rotating
and purging more often.
ndb_binlog_index table and queries
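A hedged sketch of adding such an index; the index name and the choice
of column are illustrative (e.g. File speeds up PURGE-time deletes) :

ALTER TABLE mysql.ndb_binlog_index ADD INDEX ix_file (File);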
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.93
● The BI regularly needs to hold :
● The Server-internal Binlog index lock (unavailable during rotate/purge)
● The Server-internal Binlog lock for binlogging (unavailable if other client
threads are binlogging)
● A table lock on mysql.ndb_binlog_index (unavailable during any other
access, due to MyISAM)
● The BI is 'just another client thread' from the p.o.v. of the generic MySQL
Binlog code, so it can get involved in Binlog rotation itself. If
--expire_logs_days is used then this can involve PURGE!
Binlog Injector problems
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.94
● The Ndb Slave is almost entirely the standard MySQL replication
slave system, until calls into the Ndb storage engine component are
made.
● The IO thread is not modified in any way.
● The SQL thread makes the normal calls into the SE interface, but
IDEMPOTENT mode is hard-coded on.
● Batching is the number one source of Ndb performance
improvements, and this is also the case in the slave.
● The standard RBR events allow limited batching of multiple row
changes within a single event.
● Ndb extends this batching using the --slave-allow-batching server
option.
Ndb Slave Batching, batching, batching
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.95
● When applying an epoch transaction from an upstream master, each
event is applied serially as normal.
● The Ndb handler uses the event application to define NdbApi
operations for the events, but only executes them when either a full
batch is defined, or there is a data dependency.
● The batch size in bytes is specified using the --ndb-batch-size
Server parameter.
● In recent experiments, there appeared to be very little downside
in maximising the configured batch size, so that most epoch
transactions are executed in a single batch.
● Batching effectiveness can be measured – see later.
Ndb Slave batching Increase your batch size!
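A minimal my.cnf sketch for the slave MySQLD; the batch size value is
illustrative and should be tested against your workload :

[mysqld]
slave-allow-batching        # batch row events across an epoch transaction
ndb-batch-size=1048576      # bytes of operations per NdbApi execution batch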
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.96
● For PK insert/update/delete on in-memory or cached data, which is
what RBR should mostly be doing, most of the response time is due to
communication latency between the nodes involved.
● So we spread this latency cost over as many operations in a batch as
possible.
● What's more, the operations in a batch can run in parallel on different
threads of a data node, or on different data nodes.
● Finally, even with disk-data, the operations in a batch get parallel
access to the underlying tablespace.
Ndb Slave batching Increase your batch size!
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.97
● Batching is similar to pipelining in a processor – it can be broken by
data dependencies and system limitations.
● Where an RBR event needs to read some data, there is a data
dependency that must be satisfied before it can execute. This requires
that the current batch is flushed, then the read is performed, then the
update. Many round trips!
● This is one reason why tables without primary keys are inefficient –
they require reads or, even worse, scans to find matching rows for
update and delete (see the sketch below).
● Where Blob/Text columns are being modified, there is an implicit
dependency in the implementation which requires that we lock the main
table row before modifying parts-table rows. This requires a batch flush
to obtain the lock, so breaks up any surrounding batch.
Ndb Slave batch breakup
Avoid writing to PK-less tables
Beware of Blobs
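A hedged sketch of retrofitting a primary key onto a PK-less table, so
the slave can apply row events by key rather than by scan; the table
and column names are hypothetical, and on Ndb this is a copying
ALTER, so schedule it carefully :

ALTER TABLE mydb.events_log
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;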
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.98
Ndb Slave batching
[Diagram : a relay log transaction (BEGIN, table maps, 20 events,
COMMIT) applied to the Slave Cluster via the replication code and the
NDB SE slave code in 3 round trip latencies. Moderate batching, 20:3.]
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.99
Ndb Slave batching
[Diagram : the same 20-event transaction applied in 1 round trip
latency. Max batching, 20:1.]
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.100
Ndb Slave batching
[Diagram : the same 20-event transaction applied in 20 round trip
latencies. Min batching, 20:20.]
The Slave SQL thread CPU capacity is finite, and without good batching
it is limited by waiting for responses.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.101
● The Ndb Slave has Ndb object statistics which can be monitored while
it is running. Beware that these statistics are currently lost when the
Slave thread is stopped.
● mysql> SHOW STATUS LIKE 'ndb_api%slave';
● The meaning of the values is documented in the manual, but some
are of special interest :
● Ndb_api_bytes_sent_count_slave
Can give a rough indication of the apply rate of the slave in bytes/s
● Ndb_api_trans_commit_count_slave
Can give an indication of the apply rate of the slave in epochs/s
● Ndb_api_wait_exec_complete_count_slave
Can give an indication of the round trips performed by the slave
Ndb Slave monitoring
These are monotonic counters, so to get rates you must sample on
some period and take the difference between samples (see the
sketch below).
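A hedged sketch of rate sampling from the mysql client; the 10-second
interval is illustrative :

SHOW GLOBAL STATUS LIKE 'Ndb_api_trans_commit_count_slave';  -- sample t0
SELECT SLEEP(10);
SHOW GLOBAL STATUS LIKE 'Ndb_api_trans_commit_count_slave';  -- sample t1
-- apply rate in epochs/s ~= (t1_value - t0_value) / 10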
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.102
● Ndb_api_pk_op_count_slave
Can give a rough indication of the apply rate of the slave in rows/s
● Ndb_api_trans_abort_count_slave
Can give an indication of slave apply problems – locking or temporary errors
(bad)
● Ndb_api_read_row_count_slave
Can give an indication of whether the slave is performing any reads (bad)
● Ndb_api_table|range_scan_count_slave
Can give an indication of whether the slave is performing any scan reads (bad)
● Ndb_api_wait_nanos_count_slave
Can indicate time spent waiting for the data nodes – with caveats
Ndb Slave monitoring
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.103
Interesting ratios :
● Avg batches/epoch transaction (batching ratio) :
Ndb_api_wait_exec_complete_count_slave /
Ndb_api_trans_commit_count_slave
● Avg bytes/epoch transaction : Ndb_api_bytes_sent_count_slave /
Ndb_api_trans_commit_count_slave
● Avg rows/epoch transaction : Ndb_api_pk_op_count_slave /
Ndb_api_trans_commit_count_slave
Ndb Slave monitoring
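A hedged sketch of computing the batching ratio in SQL via
information_schema.GLOBAL_STATUS (available in this era's servers;
a single sample, so it reflects the whole lifetime of the slave thread) :

SELECT (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
         WHERE VARIABLE_NAME = 'NDB_API_WAIT_EXEC_COMPLETE_COUNT_SLAVE')
       /
       (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
         WHERE VARIABLE_NAME = 'NDB_API_TRANS_COMMIT_COUNT_SLAVE')
       AS avg_batches_per_epoch;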
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.104
The Ndb slave is often unlike all other NdbApi clients : it is entirely serial, but
generates large transactions with mixed operation types, and can be very
intensive.
The most intense period is when a Slave is 'catching up' with a Binlog – it can
apply epoch transactions much faster than they were originally committed on
the Master.
This can cause overload for the Slave cluster – redo logs can get stressed, and
SendBuffers on the data nodes sending to the Slave cluster's binlogging
MySQLDs can be overloaded at commit time.
Current recommendations :
- Experiment with 'worst cases' and check the behaviour and rates measured.
- Monitor rates in production to get notification when they approach their
tested limits.
Ndb Slave notes
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.105
● One observation of the MySQL Cluster replication system is that it is missing
some end-to-end checks. It relies on the correct operation of the lower layers of
generic MySQL replication.
● Partly this is by design – the replication layer treats events and transactions
separately and avoids dependencies between them. Most SEs have no
slave-specific logic.
● However, some cross-checks are simple and effective :
● No jumping back : Received ndb_apply_status epoch numbers should
never decline without a Master position change.
● No repeats : Received ndb_apply_status epoch numbers should never
repeat without a rollback or Master position change.
● No retry failures : Received ndb_apply_status epoch numbers should not
increase without a commit or a Master position change.
Ndb Slave improvements
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.106
● Even better would be a check that epochs are in the expected sequence, but
that requires binlogging changes :
● No gaps : Received ndb_apply_status epoch numbers should follow a
sequence, where each epoch includes its successfully binlogged
predecessor's number.
Ndb Slave improvements
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.107
Insert Picture Here
Recommendations
M S
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.108
● Performance recommendations (not the main focus)
● Robustness recommendations
● Potential cluster improvements
Technical details are in the preceding slides.
Recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.109
Binlog Injector
● Set binlog_cache_size so that there is no spill-to-disk (see the check
below)
Slave
● --slave-allow-batching
● Increase ndb-batch-size (and test)
● Avoid tables without primary keys
● Beware replicating Blobs/Text
● Monitor slave activity using SHOW STATUS LIKE 'ndb_api%slave'
Performance recommendations
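A minimal sketch of checking for binlog cache spill on the binlogging
server :

SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
-- Binlog_cache_disk_use > 0 means transactions spilled to disk;
-- consider raising binlog_cache_size.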
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.110
Data nodes
● Set MaxBufferedEpochs high enough that hitting it indicates a real (hard
to reproduce) issue
● Test SendBuffer configuration, especially from data nodes to
binlogging MySQLDs, to ensure commit of the largest transactions and
heaviest load can be handled (Slave catchup?)
Binlog Injector
● Monitor SHOW STATUS LIKE 'ndb_api%injector' to understand
normal and excess flows
● Monitor SHOW PROCESSLIST to check the BI state
● Monitor SHOW ENGINE NDB STATUS to get an NdbApi buffering
indication
Robustness recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.111
Binlog Injector continued
● Monitor SHOW MASTER STATUS to understand outgoing binlog
rates
● Consider using --ndb-eventbuffer-max-alloc to avoid excessive event
buffer usage destabilising the host
ndb_binlog_index table
● Avoid using --expire-logs-days
● Consider manual purge, potentially with pre-delete of
ndb_binlog_index rows in small batches
● Consider adding indexes if cutover queries are too slow
Robustness recommendations
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.112
System restart
● Ensure that after a system restart, the cluster is not brought back online
immediately, as it will need some form of consistency restoration
Replication channel cutover
● Consider using the new replication channel cutover query, alongside
--slave-skip-errors=ddl_exist_errors
Slave
● Test system robustness under a prolonged 'Master catchup' scenario.
Monitor the Slave cluster's redo logs, redo log state, SendBuffer overload,
binlogging MySQLD lag etc.
Robustness recommendations

Mais conteúdo relacionado

Mais procurados

My First 100 days with an Exadata (WP)
My First 100 days with an Exadata  (WP)My First 100 days with an Exadata  (WP)
My First 100 days with an Exadata (WP)Gustavo Rene Antunez
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesGustavo Rene Antunez
 
DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)Gustavo Rene Antunez
 
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube
 
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HAGalera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HALudovico Caldara
 
12 Things about Oracle WebLogic Server 12c
12 Things	 about Oracle WebLogic Server 12c12 Things	 about Oracle WebLogic Server 12c
12 Things about Oracle WebLogic Server 12cGuatemala User Group
 
MySQL Performance Tuning Variables
MySQL Performance Tuning VariablesMySQL Performance Tuning Variables
MySQL Performance Tuning VariablesFromDual GmbH
 
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceMySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceSeveralnines
 
Plugging in oracle database 12c pluggable databases
Plugging in   oracle database 12c pluggable databasesPlugging in   oracle database 12c pluggable databases
Plugging in oracle database 12c pluggable databasesKellyn Pot'Vin-Gorman
 
Oracle 12c Multi Tenant
Oracle 12c Multi TenantOracle 12c Multi Tenant
Oracle 12c Multi TenantRed Stack Tech
 
Exploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your CloudExploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your Clouddyahalom
 
Oracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationOracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationYudi Herdiana
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedGear6
 
2020 pre fosdem mysql clone
2020 pre fosdem   mysql clone2020 pre fosdem   mysql clone
2020 pre fosdem mysql cloneGeorgi Kodinov
 
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Geir Høydalsvik
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)Gustavo Rene Antunez
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerGuatemala User Group
 
12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All EditionsFranck Pachot
 
Online MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupOnline MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupKenny Gryp
 

Mais procurados (20)

My First 100 days with an Exadata (WP)
My First 100 days with an Exadata  (WP)My First 100 days with an Exadata  (WP)
My First 100 days with an Exadata (WP)
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databases
 
DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)DBA 101 : Calling all New Database Administrators (WP)
DBA 101 : Calling all New Database Administrators (WP)
 
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
OSSCube MySQL Cluster Tutorial By Sonali At Osspac 09
 
My sql 5.6&MySQL Cluster 7.3
My sql 5.6&MySQL Cluster 7.3My sql 5.6&MySQL Cluster 7.3
My sql 5.6&MySQL Cluster 7.3
 
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HAGalera Cluster: Synchronous Multi-Master Replication for MySQL HA
Galera Cluster: Synchronous Multi-Master Replication for MySQL HA
 
12 Things about Oracle WebLogic Server 12c
12 Things	 about Oracle WebLogic Server 12c12 Things	 about Oracle WebLogic Server 12c
12 Things about Oracle WebLogic Server 12c
 
MySQL Performance Tuning Variables
MySQL Performance Tuning VariablesMySQL Performance Tuning Variables
MySQL Performance Tuning Variables
 
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceMySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
 
Plugging in oracle database 12c pluggable databases
Plugging in   oracle database 12c pluggable databasesPlugging in   oracle database 12c pluggable databases
Plugging in oracle database 12c pluggable databases
 
Oracle 12c Multi Tenant
Oracle 12c Multi TenantOracle 12c Multi Tenant
Oracle 12c Multi Tenant
 
Exploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your CloudExploring Oracle Database 12c Multitenant best practices for your Cloud
Exploring Oracle Database 12c Multitenant best practices for your Cloud
 
Oracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for ConsolidationOracle Database 12c Multitenant for Consolidation
Oracle Database 12c Multitenant for Consolidation
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
 
2020 pre fosdem mysql clone
2020 pre fosdem   mysql clone2020 pre fosdem   mysql clone
2020 pre fosdem mysql clone
 
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
Simplifying MySQL, Pre-FOSDEM MySQL Days, Brussels, January 30, 2020.
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with Docker
 
12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions12cR2 Single-Tenant: Multitenant Features for All Editions
12cR2 Single-Tenant: Multitenant Features for All Editions
 
Online MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupOnline MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackup
 

Semelhante a MySQL Cluster Asynchronous replication (2014)

MySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksMySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksDave Stokes
 
2012 replication
2012 replication2012 replication
2012 replicationsqlhjalp
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014Dave Stokes
 
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesOracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesSven Sandberg
 
Using The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamUsing The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamLuís Soares
 
OUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeOUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeGeorgi Kodinov
 
GLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesGLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesBiju Thomas
 
The Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheThe Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheSteven Feuerstein
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices Ted Wennmark
 
My sql fabric webinar v1.1
My sql fabric webinar v1.1My sql fabric webinar v1.1
My sql fabric webinar v1.1Ricky Setyawan
 
Collaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsCollaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsNelson Calero
 
MySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMorgan Tocker
 
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...Dave Stokes
 
2012 scale replication
2012 scale replication2012 scale replication
2012 scale replicationsqlhjalp
 
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015Dave Stokes
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)Frazer Clement
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMorgan Tocker
 

Semelhante a MySQL Cluster Asynchronous replication (2014) (20)

MySQL NoSQL APIs
MySQL NoSQL APIsMySQL NoSQL APIs
MySQL NoSQL APIs
 
MySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disksMySql's NoSQL -- best of both worlds on the same disks
MySql's NoSQL -- best of both worlds on the same disks
 
MySQL Replication
MySQL ReplicationMySQL Replication
MySQL Replication
 
2012 replication
2012 replication2012 replication
2012 replication
 
MySQL-InnoDB
MySQL-InnoDBMySQL-InnoDB
MySQL-InnoDB
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014
 
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best PracticesOracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
Oracle OpenWorld 2013 - HOL9737 MySQL Replication Best Practices
 
Using The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change StreamUsing The Mysql Binary Log As A Change Stream
Using The Mysql Binary Log As A Change Stream
 
OUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source CodeOUGLS 2016: Guided Tour On The MySQL Source Code
OUGLS 2016: Guided Tour On The MySQL Source Code
 
GLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New FeaturesGLOC 2014 NEOOUG - Oracle Database 12c New Features
GLOC 2014 NEOOUG - Oracle Database 12c New Features
 
The Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result CacheThe Amazing and Elegant PL/SQL Function Result Cache
The Amazing and Elegant PL/SQL Function Result Cache
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices
 
My sql fabric webinar v1.1
My sql fabric webinar v1.1My sql fabric webinar v1.1
My sql fabric webinar v1.1
 
Collaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAsCollaborate 2012 - Administering MySQL for Oracle DBAs
Collaborate 2012 - Administering MySQL for Oracle DBAs
 
MySQL Performance Metrics that Matter
MySQL Performance Metrics that MatterMySQL Performance Metrics that Matter
MySQL Performance Metrics that Matter
 
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
MySQL 5.7 NEW FEATURES, BETTER PERFORMANCE, AND THINGS THAT WILL BREAK -- Mid...
 
2012 scale replication
2012 scale replication2012 scale replication
2012 scale replication
 
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)
 
MySQL 5.7: Core Server Changes
MySQL 5.7: Core Server ChangesMySQL 5.7: Core Server Changes
MySQL 5.7: Core Server Changes
 

Último

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Último (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

MySQL Cluster Asynchronous replication (2014)

  • 1. Copyright © 2012, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 121 Insert Picture HereMySQL Cluster replication Frazer Clement Senior Software Engineer, Oracle frazer.clement@oracle.com messagepassing.blogspot.com April 2014
  • 2. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.2 Session Agenda • Intro • MySQL replication • MySQL Cluster replication • Recommendations • Advanced topics
  • 3. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.3 Disclaimer THE FOLLOWING IS INTENDED TO OUTLINE OUR GENERAL PRODUCT DIRECTION. IT IS INTENDED FOR INFORMATION PURPOSES ONLY, AND MAY NOT BE INCORPORATED INTO ANY CONTRACT. IT IS NOT A COMMITMENT TO DELIVER ANY MATERIAL, CODE, OR FUNCTIONALITY, AND SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISION. THE DEVELOPMENT, RELEASE, AND TIMING OF ANY FEATURES OR FUNCTIONALITY DESCRIBED FOR ORACLE'S PRODUCTS REMAINS AT THE SOLE DISCRETION OF ORACLE
  • 4. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.4 Insert Picture Here Introduction
  • 5. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.5 Frazer Clement Senior Software Engineer Oracle Based in Edinburgh UK. Joined MySQL AB in 2007, then Sun, then Oracle... Worked on NdbApi, Replication, Cluster membership, Conflict detection, most areas of Cluster. Worked with customers for several years to help solve their problems. Previously worked for Nortel / IBM on HLR, HSS... Included using MySQL Cluster as HSS database since ~ 2005. Strong telco focus Who?
  • 6. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.6 ● Audience are familiar with replication concepts, limitations etc. Will review some of that material, but just as context ● You have questions that will come up as we proceed – happy to answer as we go, unless it is too time consuming – then we defer to the end ● I don't have material to cover everything, but can use a white board ● I will not know all the answers :) ● I might ask you questions :) ● You will have great ideas and suggestions! ● Happy to discuss concepts and ideas, but I cannot commit to any future development or fixes. Expectations
  • 7. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.7 ● I cover MySQL replication separately to MySQL Cluster replication here ● This reflects the implementation – much of it is in the generic MySQL Server code base, which means : ● We benefit from all the features implemented at that level ● We benefit from the testing performed by the huge installed base of users. ● It is implemented by a different team within Oracle – they work on it for us for 'free' :) ● It is designed to be generic across different storage engines (MyISAM, InnoDB, Ndb cluster) ● It is not so easily modified to suit the specific needs of MySQL Cluster Structure
  • 8. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.8 Insert Picture Here MySQL replication M S
  • 9. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.9 MySQL replication topologies M S M S S S/M S M/S M/S M/S M/S M/S M/S SS Master Slave Master-Master Circular + Slaves Master – multi slave tree Constraint : 1 Master per Slave @ 1 time
  • 10. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.10 MySQL replication components SQL clients Server logic Pluggable Storage Engines Binlogging Binlog files Binlog Dump Threads Slave servers Master Server Slave IO thread Relay log files Slave SQL thread MySQL uses IO buffering, so when producer and consumer are close, data is passed in memory
  • 11. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.11 MySQL Binlogging 1 SQL clients Server logic Pluggable Storage Engines Binlogging Binlog Index file BinlogFile.000006 BinlogFile.000005 BinlogFile.000004 BinlogFile.000003 Time Binlog rotation based on file size, or manual flush command. Purge can be time based or manual. During transaction execution, a binlog transaction is cached by client threads. At transaction commit time, client thread takes binlog lock + writes transaction to binlog. Light yellow binlog transaction caches are in- memory, strong yellow have spilled to disk
  • 12. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.12 Binlog gives storage engine independent operation log ● For single-server systems, Binlog durability via fsync is important ● fsync has large performance impact ● Binlog file size (--max_binlog_size) trades off number of files, rotation/open/close cost with granularity of purge. ● Size-based rotation occurs as part of writing an event to the binlog – e.g. some client session does the work and spends the time. ● Time-based PURGE (--expire_logs_days) is triggered when rotating, and also performed by some client session. ● Binlog transaction cache is largest in-memory cacheable binlog transaction (--binlog_cache_size). Larger transactions are spooled to disk prior to commit-to-binlog. See SHOW STATUS LIKE '%binlog_cache%'; MySQL Binlogging 2 PURGE LOGS and FLUSH LOGS commands can manually invoke purge and rotate actions
  • 13. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.13 ● Binlog is a series of variable sized events ● Events have a type, size, serverid ● Some are Binlog metadata : FORMAT_DESC, ROTATE ● Original implementation was 'Statement Based Replication' (SBR) – including statements in QUERY event types. Challenging to ensure determinism at the Slave. ● “Row Based Replication' (RBR) – uses WRITE_ROW, UPDATE_ROW, DELETE_ROW events and can improve slave performance for small transactions. ● In both cases transactions are demarcated by BEGIN and COMMIT QUERY events ● Transactions are kept in a single Binlog file – often rotation occurs after a COMMIT event. MySQL Binlogging 3 MySQL Cluster uses Row based replication
  • 14. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.14 Row based replication ● Basic format of transactions is : BEGIN, TABLE_MAP*, (WRITE_ROW|UPDATE_ROW| DELETE_ROW)*, COMMIT ● TABLE_MAP is for efficiency – mapping a table name to an id for the scope of the transaction – later *_ROW events use the id. ● Each *_ROW event can contain one or more sets of row images (changed rows), where each set of images : ● Is the same operation type (INSERT/UPDATE/DELETE) ● Is on the same table ● Affects the same columns ● RBR can include some statements : DDL etc... ● As normal, SQL_LOG_BIN=0 can temporarily disable Binlogging MySQL RBR RBR events have few determinism issues
  • 15. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.15 RBR Example FORMAT_DESC CREATE TABLE HLR.AUTH_INFO(... BEGIN TABLE_MAP(HLR.CUG, 1) TABLE_MAP(HLR.AUTH_INFO, 2) WRITE_ROW(1, …) WRITE_ROW(2, …) UPDATE_ROW(1, …) WRITE_ROW(1, …) DELETE_ROW(1, …) COMMIT BEGIN TABLE_MAP(HLR.AUTH_INFO, 1) MySQL RBR Description event at start of Binlog file Create table recorded in event Binlog transaction start Table map events with transaction scope Events containing one or more row image sets Binlog transaction end Start of new Binlog transaction DDL statements appear between DML transactions – not interleaved mysqlbinlog –verbose great for analysing Binlogs + Relay logs Also SHOW BINARY LOGS, SHOW BINLOG EVENTS
  • 16. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.16 DDL statements are (should be) relatively rare, so the Binlog files should be mostly sequences of back-to-back transactions. Binlog file positions are denoted using a file {name, position} pair. The position is a byte offset, and many offsets are invalid. Valid offsets are the start of an event. MySQL Binlogging New events Active Binlog file Byte offsets 0
  • 17. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.17 MySQL Binlog dump threads SQL clients Server logic Pluggable Storage Engines Binlogging Binlog files Binlog Dump Threads Slave servers - Slaves connect to a Master in the normal way as a client, authenticate, then issue a BINLOG_DUMP command, which causes their session thread to become what we call a BINLOG DUMP thread. - BINLOG DUMP threads have a Binlog {File, Position} pair from where they are reading. Each can have a different position. - Where they are close to the 'head' of the Active Binlog, data is passed via memory. - Generally the Binlog Dump files read as much data as is available and can be sent over to the Slave – TCP backpressure allows the Slave to control the rate.
  • 18. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.18 MySQL slave threads IO thread Server logic Pluggable Storage Engines Master Server Slave IO thread Slave SQL thread Relay log Index file RelaylogFile.000022 RelaylogFile.000021 RelaylogFile.000020 RelaylogFile.000019 Time - Slave IO thread connects to Master and issues Binlog DUMP command. - Events received are filtered by server id, then written to relay log files. - Slave IO thread can operate entirely separately to Slave SQL thread, replicating the Binlog files from the Master. - Relay logs are almost exactly the same as Binlogs. - FLUSH LOGS can manually rotate the active relay log IO and SQL threads can be stopped and started together or separately using START SLAVE and STOP SLAVE.
  • 19. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.19 MySQL slave threads SQL thread Server logic Pluggable Storage Engines Master Server Slave IO thread Slave SQL thread Relay log Index file RelaylogFile.000022 RelaylogFile.000021 RelaylogFile.000020 RelaylogFile.000019 Time - Slave SQL thread reads relay logs, via memory if possible, and executes the events. - Event execution can result in Storage Engine (SE) calls, defining transactions, operating on data etc. - When the Slave SQL thread reaches the end of a relay log file and moves onto the next, the old relay log file is purged automatically. - Relay logs are mostly 'invisible' in normal operation. Separate IO and SQL threads decouples k-safety / geo redundancy from replica consistency / slave apply performance limitations or issues
  • 20. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.20 ● Most storage engines (e.g. InnoDB, MyISAM) have no slave-specific code. They are invoked by the Server logic and behave as they would for any client session invocation. ● Normal MySQL replication has a 1:1 mapping between 'user transactions' and 'binlog transactions'. ● User transactions execute in parallel at the Master and are serialised only when writing their binlog cache contents to the Binlog at transaction commit time. ● The Binlog forces a single serial order on concurrent transactions. ● At the Slave, the SQL thread executes transactions in Binlog order. This may be less concurrent than the causing execution at the Master. ● At the Slave, the SQL thread must perform all blocking disk I/O, which might have been performed by concurrent threads on the Master. This can be the limit on Slave throughput. MySQL slave SQL thread Recent MTS work alleviates single-threaded slave SQL thread limitation somewhat
  • 21. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.21 ● The slave SQL thread can encounter permanent or temporary errors ● Permanent errors generally stop the SQL thread (but not the IO thread) ● Some permanent error types can be ignored : --slave-skip-errors ● Some classes of errors can ignored with an alias : --slave-skip- errors=ddl_exist_errors ● Temporary errors result in transaction rollback and limited retries ● --slave_transaction_retries controls the maximum number of retries before a temporary error will cause the SQL thread to stop. ● Temporary errors result in an entry in the Server's error log file ● Retries are not immediate, but have bounded growing inter-retry delays to give time for temporary conditions to resolve. ● Problematic events can be manually skipped by setting the sql_slave_skip_counter variable before restarting the Slave SQL thread. MySQL slave SQL thread
  • 22. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.22 Insert Picture Here MySQL Cluster replication M S
  • 23. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.23 ● Topologies and Components ● Cluster internals ● HA and distribution ● Event ordering ● SUMA buffering and duplicates ● SendBuffer ● NdbApi and SUMA concepts ● NdbApi internals ● NdbApi event buffering ● Events and Blobs ● Binlog Injector ● Ndb slave MySQL Cluster replication
  • 24. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.24 ● Ndb Binlog Injector ● Ndb slave MySQL Cluster replication
  • 25. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.25 Built on-top of MySQL replication ● Tried and tested stack ● Flexibility – replicate to and from other systems ● Benefit from performance improvements and bug fixing MySQL Cluster adds : ● Binlogging of changes from other cluster clients including NdbApi ● HA replication – no SPOF ● Transactional replication position on the slave ● Higher replication performance through batching and parallelism ● Moving parts and complexity MySQL Cluster replication
  • 26. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.26 MySQL Cluster replication topologies Master Slave Master-Master Circular + Slaves Master – multi slave tree All the standard topologies, with whole clusters as M, S or S/M. M S M/S M/S M/S M/S M/S M/S SS M S/M S S S
  • 27. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.27 MySQL Cluster replication topologies 2 Star/Hub with upstream master and downstream Slave tree A Slave Cluster can have multiple Masters Multi-master M M M S M M/S M/S M/S M/S M/S S S
  • 28. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.28 MySQL Cluster replication components For HA, each cluster has redundant Master and Slave MySQL Servers. Most commonly two servers, with only one Slave Server active at a time. Both MySQL Servers write Binlog, but commonly only one is serving Binlogs to downstream slaves. A single server can perform both Master and Slave roles. S M M MySQL Cluster data nodes, other clients etc. S Slave servers Master servers MySQL protocol MySQL protocol MySQL protocol MySQL protocol NdbApi events NdbApi DML/ DDL
  • 29. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.29 MySQL Cluster replication components 2 MySQL Cluster data nodes, other clients etc. Slave servers MySQL protocol NdbApi DML/ DDL MySQL protocol NdbApi events MySQL protocol MySQL protocol Master servers A single server can perform both Master and Slave roles simultaneously IO SQL IO SQL INJ DUMP INJ DUMP NDB NDB NDB NDB BINLOGS BINLOGS Relay logs Relay logs
  • 30. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.30 ● MySQL Cluster is designed for high write-rate systems, with low stable latency ● Parallelism and concurrency exist at many levels : rows, operations, fragments, data nodes, transactions, clients. ● Operations, transactions and therefore clients generally only interact/contend where the operate on the same data (rows). ● Otherwise transactions are entirely parallel and unsynchronised. ● Great for throughput with low latency ● Does not provide a single serial history of transactions ● Does not provide a notion of consistent points in time ● Consistent points are identified by a separate mechanism – Global checkpoint. MySQL cluster global checkpoint 1
  • 31. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.31 ● Consistent points are used for : - Creating potential system / backup recovery points - Defining points in change event streams (replication) ● The set of changes between two consistent points is referred to as an epoch, and is identified by a cluster unique 64 bit epoch number. ● Epoch numbers have a high 32-bit word called GCI (Global checkpoint index), and a low 32-bit word called micro-GCI ● System/Backup recovery points are at GCI boundaries, and are created on the period TimeBetweenGlobalCheckpoints – defaults to 2000 millis. ● Event stream consistency points are at micro-GCI boundaries, and are created on the period TimeBetweenEpochs – defaults to 100 millis. MySQL cluster global checkpoint 2
  • 32. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.32 ● The ratio between these times results in the pattern of epoch numbers seen. The default ratio is 20:1, resulting in micro-GCI values 0..19, then an increment of the GCI value. ● In disk overloaded systems, sometimes the GCI increment is stalled for longer, and so higher micro-GCI values are seen – this can be a warning of redo disk IO problems. ● Epoch numbers are often logged as <GCI>/<microGCI>, generally more readable than the 64bit representation. ● Epoch numbers are assigned at transaction commit time, by the transaction's Transaction Coordinator (TC) – a component on the data node ● To get permission to commit, and an epoch number assigned, a transaction must be fully prepared – and e.g. be holding all the row locks it needs. This implies that it has no dependencies on any other running transaction. MySQL cluster global checkpoint 3
  • 33. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.33 ● The Global Checkpoint protocol works by broadcasting a prepare-to- increment signal to all TC instances in the cluster, causing them to gate new transaction commits (but continue all other processing). Once all TC instances have acknowledged, a commit-increment signal is broadcast, and all TC instances resume committing. ● The effect here is that the parallel streams of committing transactions are divided into before and after with the following properties : All transactions in epoch n 'happened before' those in epoch n+1 Therefore an epoch boundary is a consistent point ● Note that a parallel system has many equivalent partial event sort orders, and epochs are just one of them, selected arbitrarily. MySQL cluster global checkpoint 4
  • 34. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.34 ● Epoch numbers are assigned when transactions begin to commit. ● Commit of large transactions, especially involving disk data tables, can take some time. ● Post-commit triggers in the Tuple manager (TUP) component in the data nodes send row change details to the Subscription manager component. ● The Subscription manager (SUMA) manages forwarding / multicasting of row change details to NdbApi / MySQLD clients. ● Each row change has an associated epoch number. ● When a TUP instance has completed commit processing for all transactions in an epoch, it notifies SUMA. ● When all of the local TUP instances have completed an epoch, SUMA informs its subscribers. MySQL cluster row change events
  • 35. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.35 MySQL cluster row change events flow [Diagram: data nodes (NoOfReplicas=2), each with TC, LDM/TUP and SUMA instances, delivering NdbApi events to two subscribers.] Writes are synchronously replicated within a nodegroup. All SUMA components in a nodegroup observe all write events. Events are hashed to buckets independently of the node-local fragment replica role, e.g. write events can be delivered from 'Primary' or 'Backup' fragments.
  • 36. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.36 MySQL cluster row change events - nodegroup [Diagram: data nodes (NoOfReplicas=2) with TC, LDM/TUP and SUMA instances and two NdbApi event subscribers.] A single row write transaction from an API node routes to one TC instance, from where it is forwarded to the LDM instance managing the primary fragment replica for the fragment. Then it is synchronously replicated to the nodegroup peer. Both nodes forward the event to their SUMA instance, but only one SUMA forwards the event to the subscribing API nodes. The other buffers it.
  • 37. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.37 MySQL cluster row change events + epochs [Diagram: data nodes (NoOfReplicas=2) delivering NdbApi events to two subscribers.] After an epoch increment, the TUP instances gradually finish committing the transactions from the previous epoch and forwarding their events to the local SUMA instance. When SUMA receives this 'epoch completed' signal it forwards it to its subscribers. This tells the subscribers that they have received all of the events for the given epoch from the source data node.
  • 38. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.38 MySQL cluster row change events + epoch ack [Diagram: subscribers acknowledging epoch completion back to the data nodes (NoOfReplicas=2).] Once subscribers have received an epoch completed event, they immediately respond with an acknowledgement back to the data nodes. Once all subscribers have acknowledged reception of an epoch to all data nodes, SUMA can release the epoch's event buffer space for reuse.
  • 39. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.39 MySQL cluster row change events + failure [Diagram: a data node failing while events for unacknowledged epochs are in flight (NoOfReplicas=2).] When a node fails unexpectedly, there will likely be events sent to subscribers for as-yet unacknowledged epochs. In this case, there is uncertainty about whether all of the subscribers received all of the unacknowledged events or not. To solve this, a surviving node in the nodegroup will 'takeover' the bucket, and use its buffered events to resend unacknowledged epochs.
  • 40. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.40 MySQL cluster row change events + resend [Diagram: a nodegroup peer re-sending buffered events to the subscribers (NoOfReplicas=2).] Depending on when the failure occurred, its nature, buffers etc., the subscribers may have received the original events or not. In any case they are re-sent by a nodegroup peer. Currently this can mean that the API sees duplicate events within an epoch. Once all buffered events are re-sent, the normal epoch-completed protocol is followed. (STRICT mode problem)
  • 41. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.41 Observations ● Row changes are distributed according to the row's primary key distribution hash ● A single transaction can affect different fragments on the same or different data nodes ● Even within a data node, row changes are forwarded or buffered based on bucket membership, which is a function of the primary key hash, but independent of the local fragment replica's current primary or backup role. ● The cluster does not actively maintain transaction ordering information within an epoch ● Therefore : Events arrive at subscribers in a partially sorted order MySQL cluster row change event ordering
  • 42. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.42 Events arrive at subscribers in a partially sorted order ● Events in epoch n+1 occurred after events in epoch n ● Events within an epoch are only ordered w.r.t. individual primary keys. (As it is guaranteed that a given primary key value will be in a particular SUMA bucket) Implications ● Inter-row constraints may be temporarily violated if applying the row events in-order (unique keys, foreign keys) ● It's not trivial to extract and order the original 'user transactions' from the event stream (requires per-event transaction id and a topological sort) ● Consistency is only guaranteed at epoch boundaries MySQL cluster row change event ordering 2
  • 43. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.43 In node failure cases, unacknowledged epochs are re-sent to subscribers. Subscribers do not currently filter out the re-sends. Therefore it is possible to have duplicated events in an epoch. This is one reason why Ndb slaves require IDEMPOTENT mode – to allow them to handle cases where sequences of operations to a primary key are partially/fully repeated... [INSERT, UPDATE] → [INSERT, UPDATE][INSERT, UPDATE] [INSERT, UPDATE] → [INSERT][INSERT, UPDATE] [DELETE] → [DELETE][DELETE] [UPDATE, DELETE] → [UPDATE, DELETE][UPDATE, DELETE] MySQL cluster row change event duplicates
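The idempotent apply behaviour is governed by the slave_exec_mode server variable (the Ndb slave hard-codes this on, as noted later, but the variable can still be inspected and set). A minimal check, assuming a standard MySQL Cluster server:

-- Verify the slave applies row events idempotently
SHOW VARIABLES LIKE 'slave_exec_mode';

-- Explicit setting; MySQL Cluster binaries default to IDEMPOTENT anyway
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';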
  • 44. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.44 The row changes occurring in each nodegroup are divided into a number of buckets (or slices), using a primary key hash. The number of slices is designed to always balance across the available nodes in a nodegroup. Responsibility for forwarding events in each slice to subscribers is given to one of the nodes in the group, and the others will buffer the events. As each node in the nodegroup is forwarding and buffering the same number of slices/buckets, and as the slices are based on an MD5 hash, their forwarding IO and buffering capacity should be balanced. For NoOfReplicas=2, each nodegroup has two slices, with each node responsible for forwarding one and buffering one. MySQL cluster row change SUMA balance
  • 45. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.45 ● Data nodes buffer changes in SUMA to handle unexpected data node failure without a break in the NdbApi row change event streams. ● SUMA event buffering is in-memory and finite. ● Buffer space is consumed by data change in the local node, and released by acknowledgements from all subscribing Api nodes ● Buffer space is liable to increase due to : Network problems to subscribers, Slow subscribers, Failed-but-not-yet-detected Api nodes, Cluster write rate spikes. ● To protect the data nodes, SUMA event buffering can be limited in terms of the number of epochs buffered and the number of bytes (Cluster config MaxBufferedEpochs and (new) MaxBufferedEpochBytes) MySQL cluster row change SUMA buffering
  • 46. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.46 ● MaxBufferedEpochBytes limits the amount of memory SUMA can use for buffering events. ● MaxBufferedEpochs limits the number of unacknowledged epochs that SUMA will accept from any subscriber. ● Reaching MaxBufferedEpochBytes is not an immediate problem, as the data nodes can stop buffering, but keep forwarding. However it means that the cluster is no longer resilient to data node failure - in the event of a data node failure, the Api nodes will be informed that there is a gap in the event stream. For replication, this requires a Backup + Restore cycle to resync the Slave. For this reason it should be avoided. ● “Out of event buffer: nodefailure will cause event failures, consider increasing MaxBufferedEpochBytes” MySQL cluster row change SUMA buffering
  • 47. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.47 ● When MaxBufferedEpochs is reached, subscribers with that epoch lag will be disconnected, allowing their buffered data to be released. ● This asynchronous disconnect from the data node side of the connection will result in all data nodes disconnecting the subscribing Api node, and will appear to NdbApi like a 'cluster failure'. ● For MySQLD, the Binlog Injector thread will inject a GAP Incident event in the Binlog. ● The Api node is then free to reconnect and attempt to establish new subscriptions... ● “Disconnecting lagging nodes ...” MySQL cluster row change SUMA buffering MaxBufferedEpochs can be used like a 'watchdog' on lagging subscribers – perhaps disconnecting them and allowing them to reconnect can clear problems they may be having. It is not really necessary as a guard on the data node buffering capacity, as that is limited by MaxBufferedEpochBytes. However, beware setting it to an ultra-high value: a SUMA-internal pool is allocated based on its setting.
  • 48. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.48 Monitoring : ● Bytes buffered in SUMA : Cannot be directly monitored ● Epochs buffered in SUMA : NdbInfo ndb$pools, block is SUMA, resource is “GCP”. ● DUMP 8013 puts summary of oldest buffered epoch and which subscriber's acknowledgements are pending into the cluster log. MySQL cluster row change SUMA buffering Potential Improvements NdbInfo views on subscriptions, subscribers, buckets, epochs, buffered bytes, volumes of data sent to individual subscribers etc...
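A sketch of the NdbInfo query, assuming the hidden ndbinfo.ndb$pools base table with the pool_name/used/total/high columns present in recent 7.x releases – verify the column names against your version before relying on this:

-- Epochs currently buffered in SUMA, per data node (SUMA block, 'GCP' pool)
SELECT node_id, used, total, high
FROM   `ndbinfo`.`ndb$pools`
WHERE  pool_name = 'GCP';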
  • 49. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.49 ● Most deployments have two or more 'Binlogging MySQLDs', and a set of NdbApi clients and/or non-binlogging MySQLD servers. ● The Binlogging MySQLDs subscribe to row change events on almost every table in the cluster - this means that the data sent from data nodes to Api nodes is noticeably higher for Binlogging MySQLDs than other Api clients. ● When the rate of change in the Cluster is high, this imbalance can cause problems with the SendBuffer resource. ● SendBuffer is used to decouple the non-blocking data node core from the blocking socket send protocols used to actually send data remotely. ● Binlogging MySQLDs generally need more SendBuffer configured on the data nodes than other Api nodes, to soak up spikes in the change rate. MySQL cluster row change SendBuffer
  • 50. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.50 MySQL cluster row change SendBuffer [Diagram: data nodes (NoOfReplicas=2) sending NdbApi events to two subscribers among many Api nodes.] Generally many more Api nodes than subscribers. Binlogging MySQLDs subscribe to all row change events – so are directly affected by the rate of change in a cluster. Links to Binlogging MySQLDs must be allowed more SendBuffer than other Api links.
  • 51. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.51 Epoch boundaries are used for system restore capabilities. A cluster system restart is where all nodes start from on-disk state - LCP + redo logs. However only a subset of epoch boundaries (GCI boundaries) are eligible as system restore points. GCI boundaries are epochs with micro_GCI == 0. They normally occur every TimeBetweenGlobalCheckpoints milliseconds – defaults to 2000 millis. However epochs are incremented every TimeBetweenEpochs millis – defaults to 100 millis. Therefore the completed epochs will almost always contain changes which will not be available after a sudden system restart MySQL cluster system restart + row changes
  • 52. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.52 Completed epochs will almost always contain changes which will not be available after a sudden system restart ● Therefore the Binlogs of subscribing MySQLD nodes can contain changes which are not in the cluster data nodes after a restart. ● Therefore the downstream slave(s) can contain changes which are not stored in the (old) Master cluster after a restart. ● Therefore care must be taken to understand the restoration point of the failed cluster during the restart, so that a decision about how to resync can be made - Perhaps local or slave Binlogs can be replayed? ● A failed cluster should not be allowed to come online and serve traffic after a system restart without some analysis of the data lost. MySQL cluster system restart + row changes
  • 53. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.53 NdbApi has the NdbEvent and NdbEventOperation classes. Clients can create and share NdbEvents by name. Each NdbEvent refers to a single table, and is parameterised by whether only modified columns, or all columns are included. NdbEvents have a lifecycle independent of the creating NdbApi client, so care must be taken. Clients create and use NdbEventOperation objects to request that the row change events defined by a particular NdbEvent object should be sent to the NdbApi client. A single NdbApi client might use many NdbEventOperations (one per table) to get a view of the row changes occurring in some subset of tables. Row change events in NdbApi SUMA p.o.v : Event = Subscription, EventOperation = Subscriber
  • 54. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.54 The set of currently defined events in a cluster can be seen by looking at the hidden NDB$EVENTS_0 table...
> ndb_select_all -c<> -dsys NDB$EVENTS_0
NAME EVENT_TYPE TABLEID TABLEVERSION TABLE_NAME ATTRIBUTE_MASK SUBID SUBKEY
"REPL$mysql/ndb_schema" 262143 4 1 "mysql/def/ndb_schema" [511 0 0 0] 1 65537
"REPL$mysql/ndb_apply_status" 65535 6 1 "mysql/def/ndb_apply_status" [31 0 0 0] 3 65539
"NDB$BLOBEVENT_REPL$mysql/ndb_schema_3" 393215 5 1 "mysql/def/NDB$BLOB_4_3" [15 0 0 0] 2 65538
3 rows returned
NDBT_ProgramExit: 0 - OK
Row change events in NdbApi MySQL replication created events usually start with REPL$ or REPLF$ for modified-only or all-columns variants. The subscription to mysql/ndb_schema is used by all MySQLDs to communicate about schema changes. Note that the mysql/ndb_schema table (id 4) has a blob column which needs its own event to track those changes. The attribute_mask here is deprecated and irrelevant.
  • 55. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.55 NdbApi allows users to define and share Events, and to define EventOperations to subscribe to row changes defined by the events. NdbApi event == SUMA subscription A description of the type of changes that are of interest. Can be shared by many NdbApi clients NdbApi event operation == SUMA subscriber A way to request the flow of row events from a particular Event/Subscription start flowing to this client. EventOperations are associated with an Ndb object. NdbApi event concepts
  • 56. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.56 NdbApi event concepts [Diagram: entity relationships — a Cluster defines n Events (subscriptions); each Event refers to 1 Table and can have n EventOperations (subscribers); each EventOperation belongs to 1 Ndb object; each Ndb object has 1 EventBuffer; an Api node hosts n Ndb objects.]
  • 57. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.57 ● An Ndb object can be used to subscribe to changes on one or more tables by creating NdbEventOperations on them, and then polling for incoming events. ● The SUMA components on the data nodes will start sending the row changes on some epoch boundary. ● As events and epoch boundary signals arrive, thread(s) internal to the NdbApi library receive, acknowledge and buffer them. ● The buffering exists to decouple the data transfer from the data nodes to the Api client from the Api client's consumption of events. ● In cases where the change rate is very high, or the Api client is slow or stalled, the buffer will grow ● Excessive growth of the NdbApi event buffer can cause stability issues. NdbApi event buffering
  • 58. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.58 NdbApi event buffering [Diagram: data nodes → NdbApi receiver thread → Event buffer → User code calling Ndb::pollEvents(), all within the Api process.] Within NdbApi, the receiver thread(s) buffer and acknowledge reception of row data for an epoch. This occurs independently of the User code behaviour, and helps with quick release of data node SUMA event buffer space. User code can retrieve new events by calling the pollEvents() method. This will return the next event from the head of the Ndb object's event buffer. Only events in completed epochs are made available to the pollEvents method.
  • 59. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.59 NdbApi event buffering NdbApi Event Buffer implementation : + Zeroconf + Avoids OS malloc/free - Unbounded growth - Can destabilise the host - Never returns memory to the OS - Often continues to grow despite having 'free space'. Cases observed where : - Event buffer growth causes host slowdown due to paging, resulting in a decrease in both the user code event consumption rate and receiver thread performance. Eventually MaxBufferedEpochs drives client disconnect and buffer re-initialisation (good outcome). - Event buffer growth causes the Linux OOM killer to choose an ndbmtd process to kill to relieve memory pressure. - Memory allocated from the OS continues to grow despite the buffer having a large % free. A hard crashing limit on size has been implemented. A soft 'GAP insert' limit is in progress.
  • 60. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.60 Each Event Buffer is linked to an Ndb object. Generally want minimum buffers per process. Currently the NdbApi event Api only exposes buffering information in terms of epochs : - Latest epoch(GCI): Most recent epoch completely received by the NdbApi receiver thread (tail of event buffer) - Apply epoch(GCI): Epoch of the event currently being consumed by NdbApi user code (head of event buffer). These are of limited use as : - epoch numbers are sparse, there's no direct indication of the number of epochs buffered. - epochs are of different sizes in terms of row changes and change size NdbApi event buffer monitoring
  • 61. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.61 NdbApi (and MySQLD) allow a 'GCI slip threshold' to be configured.
--ndb-report-thresh-binlog-epoch-slip=#
  Threshold on number of epochs to be behind before reporting binlog status. E.g. 3 means that if the difference between what epoch has been received from the storage nodes and what has been applied to the binlog is 3 or more, a status message will be sent to the cluster log.
--ndb-report-thresh-binlog-mem-usage=#
  Threshold on percentage of free memory before reporting binlog status. E.g. 10 means that if amount of available memory for receiving binlog data from the storage nodes goes below 10%, a status message will be sent to the cluster log.
These threshold crossings cause cluster log events to be generated... NdbApi event buffer monitoring
  • 62. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.62
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/21
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=236/22
...
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=21KB(100%) alloc=21KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/9
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/10
[MgmtSrvr] INFO -- Node 4: Event buffer status: used=31KB(100%) alloc=31KB(0%) max=0B apply_epoch=236/18 latest_epoch=237/11
NdbApi event buffer monitoring
  • 63. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.63 NdbApi also includes some per-Ndb event stream monitoring counters. These are incremented independently of event consumption. Ndb::getClientStat()
  DataEventsRecvdCount    = 18, /* Number of table data change events received */
  NonDataEventsRecvdCount = 19, /* Number of non-data events received */
  EventBytesRecvdCount    = 20, /* Number of bytes of event data received */
These can be seen from a MySQLD instance using :
> SHOW STATUS LIKE 'ndb_api%injector';
NdbApi event buffer monitoring
  • 64. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.64 A configurable crash-stop limit on event buffer size was implemented recently. The idea is that a process shutdown and restart (with accompanying replication channel failover) is preferable to host OS destabilisation when event buffer growth is excessive.
--ndb-eventbuffer-max-alloc=#
  Maximum memory that can be allocated for buffering events by the ndb api
Work is in progress to implement a less severe limit – where excessive buffering causes incoming events to be discarded, and the consumer becomes aware of the 'gap' when it reaches it. NdbApi event buffer limiting
  • 65. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.65 ● Ndb Blobs are mostly implemented at the NdbApi layer of the stack. ● Data for a Blob column is split into a small header, and zero or more 'parts' (e.g. 256 byte header, 0..n parts of 2000 bytes, 4000 bytes...) ● The 'header' data is a normal column in the table, something like a VARBINARY(272). ● The parts are rows in a hidden table, defined just to hold parts for that particular column. These Blob part tables are named NDB$BLOB_<tableid>_<columnnumber> ● From the point of view of the data nodes, the part tables are normal 'user tables'. ● This allows arbitrary length data to be transactionally stored in Blob (or Text) columns, but adds complexity at the NdbApi layer. NdbApi events for Blobs
  • 66. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.66 ● When a transaction modifying a Blob value (insert, update, delete) occurs, it internally involves operations on the main table as normal, and zero or more operations on rows in the parts table(s) involved. ● TUP and SUMA treat the Blob part tables as separate tables. ● NdbApi receives row changes for the Blob part tables separate to the main table row change, with no ordering constraints between different rows as normal. ● With the merge_events option on, NdbApi correlates the main table and part table events so that the Blob part table row changes are used to create a pseudo main table row change event containing all the Blob changes. ● This is implemented using an in-memory hash by PK in the NdbApi. NdbApi events for Blobs Event merge merges all events on a row in an epoch – e.g. separate user transactions are merged together.
  • 67. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.67 ● NdbApi receives row events and epoch completion signals from the SUMA components of the running data nodes ● NdbApi internally buffers events for incomplete epochs. These are not made visible to NdbApi users until all data nodes have indicated that the epoch is completed. ● As data nodes complete epochs in-order, the NdbApi nodes will release event data for each epoch to the user, in-order. ● A user thread serially consuming events from the event Api can be sure that when the first event for some epoch m > n is received, there will be no more events for epoch n. ● This sequencing is a kind of merge-sort by epoch number on the data node event streams. NdbApi event sequencing
  • 68. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.68 ● In a quiet or idle cluster, epochs can occur which have no row changes in them. ● The normal global checkpoint mechanisms occur, and all data nodes will send epoch completed signals, and expect acknowledgements. ● This can be seen in the 'Latest epoch' values, which continue to climb in an idle cluster. This is the only indication of empty epochs occurring at the NdbApi layer. ● In some cases, an epoch may have events which a user does not consider relevant – e.g. slave-applied updates when --log-slave-updates=0. In this case the epoch is not empty at the NdbApi level, but may be considered as empty by the next layer up. Empty epochs at the NdbApi layer
  • 69. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.69 ● Mostly looked at the data nodes and the NdbApi event Api so far. ● The Ndb Binlog Injector (BI) is a component of MySQL Servers when part of a MySQL Cluster ● A better name might be 'Ndb event listener', as it is responsible for more than just Binlog generation. ● The BI uses the NdbApi event Api to listen to row changes on an internal schema table (mysql.ndb_schema). ● The BI also uses the NdbApi event Api to listen to row changes on all other tables (unless exceptions are made with mysql.ndb_replication) ● The BI writes Binlog transactions and DDL statements to the Binlog. ● The BI maintains a local mysql.ndb_binlog_index table for mapping epoch numbers to Binlog files and positions. Ndb Binlog Injector
  • 70. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.70 ● The Binlog Injector is always process #1 in the SHOW PROCESSLIST output – this can be used to check its state :
mysql> show processlist;
+----+-------------+------+------+---------+------+-----------------------------------+------+
| Id | User        | Host | db   | Command | Time | State                             | Info |
+----+-------------+------+------+---------+------+-----------------------------------+------+
|  1 | system user |      |      | Daemon  |    0 | Waiting for event from ndbcluster | NULL |
● This can be used for monitoring, or debugging event buffer growth. Ndb Binlog Injector
  • 71. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.71 ● Schema changes in MySQL Cluster involve cooperation between all attached MySQL Servers, so that they can take any necessary steps, and serve the new schema immediately when it is committed. ● Event subscription to an internal table (mysql.ndb_schema) is used to accomplish this. All MySQL Servers listen for changes on this table using their BI and modifications to this table are used to communicate (rows as shared memory!) ● Schema changes can generate binlog entries - “DROP INDEX...” ● The volume of change on this table is very low, and related to DDL. ● BI uses a separate Ndb object (and EventBuffer) for its subscription to mysql.ndb_schema row changes, so occasionally it can be seen in epoch_slip logs. Usually the buffer is very small. Ndb Binlog Injector + schema distribution
  • 72. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.72 ● The BI is an event processing loop, which consumes events from a 'schema event' subscription to the mysql.ndb_schema table, and a 'data event' subscription to all the tables being Binlogged by the server. ● The loop has the following pseudo-code :
while (!(disconnected || error || ...))
  consume all schema events for epoch, taking steps required
  begin binlog transaction
  insert 'fake' ndb_apply_status write row for epoch
  consume all data events for epoch, writing to binlog transaction
  decide whether to commit or rollback binlog transaction
  commit/rollback
  write details of epoch transaction to mysql.ndb_binlog_index table
Ndb Binlog Injector main loop
  • 73. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.73 Ndb Binlog Injector main loop [Diagram: schema and data event streams flow through NdbApi into the BI, which writes epoch transactions (BEGIN, WRITE_ROW/UPDATE_ROW/DELETE_ROW, COMMIT) and DDL statements such as DROP INDEX via the Server Binlog code and its Transaction cache to the binlogs, and records positions in mysql.ndb_binlog_index via the Server MyISAM code.]
  • 74. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.74 Observations : ● BI is a bottleneck for all changes to be Binlogged in the cluster. Not commonly a problem though... ● BI relies on generic MySQL Binlogging code (Binlog transaction cache, Inline Binlog rotate and purge) ● BI contends for OS locks inside generic MySQL Binlogging code (though should be low/no contention if deployed as recommended) ● BI relies on generic MySQL processing and MyISAM table handling for ndb_binlog_index table maintenance (Table locking) ● Health and liveness of the Binlog Injector is very important for MySQL Cluster replication Ndb Binlog Injector
  • 75. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.75 The epoch slip cluster logs mentioned before indicate where the BI is not keeping up with the latest available epochs. NdbApi counters exposed as status variables indicate the volume of data processed by the data subscriptions of the BI :
mysql> show status like 'ndb_api%injector';
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| Ndb_api_event_data_count_injector    | 0     |
| Ndb_api_event_nondata_count_injector | 2     |
| Ndb_api_event_bytes_count_injector   | 256   |
+--------------------------------------+-------+
The output of SHOW ENGINE NDB STATUS indicates the progress of the BI in terms of epochs Ndb Binlog Injector Monitoring
  • 76. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.76 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | latest_epoch : Latest completed epoch from NdbApi point of view – tail of event buffer latest_trans_epoch : Epoch of most recent transaction committed to Ndb from this server. latest_received_binlog_epoch : Epoch of the most recently consumed event from the head of event buffer. latest_handled_binlog_epoch : Epoch of the most recently completely processed epoch latest_applied_binlog_epoch : Epoch of the most recently completely processed epoch, which resulted in Binlog write. Ndb Binlog Injector Monitoring
  • 77. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.77 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | Observations : - latest_epoch == latest_received_binlog_epoch == latest_handled_binlog_epoch The NdbApi event buffer is empty, and the BI is idle - latest_trans_epoch < latest_handled_binlog_epoch Every cluster write done by this server has been binlogged - latest_handled_binlog_epoch > latest_applied_binlog_epoch Recent epochs have not had any binloggable content (Quiet cluster, or slave updates...) Ndb Binlog Injector Monitoring
  • 78. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.78 mysql>SHOW ENGINE NDB STATUS | ndbcluster | binlog                | latest_epoch=10672993730572,  latest_trans_epoch=1022202216475, latest_received_binlog_epoch=10672993730572,  latest_handled_binlog_epoch=10672993730572,  latest_applied_binlog_epoch=1022202216475 | Inferences : ● latest_epoch > other epochs : There are epochs in the NdbApi Event Buffer ● latest_received_binlog_epoch > latest_handled_binlog_epoch : BI is processing an epoch now. ● latest_trans_epoch > latest_handled_binlog_epoch : Some transactions committed by this server are not yet in the Binlog. ● Also : SHOW BINARY LOGS and SHOW MASTER STATUS The progress of Binlog writing can be seen. Ndb Binlog Injector Monitoring
  • 79. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.79 The BI produces uniform transactions in the Binlog. Each Binlog transaction describes all of the changes that occurred in a single cluster epoch. For this reason they are sometimes referred to as Epoch transactions. Each user transaction occurs in one epoch, so each user transaction's changes are recorded in one epoch transaction, and cannot span epochs. DDL statements and any other Binlogging activity will occur between these transactions, not within them. These transactions have the structure: ● BEGIN event : The position of the BEGIN event is the position of the transaction ● 1+ TABLE_MAP events : At least one for the mysql.ndb_apply_status table ● 1 WRITE_ROW event to mysql.ndb_apply_status ● 1+ other events to other tables (WRITE_ROW, UPDATE_ROW, DELETE_ROW). As normal with RBR, each event can contain changes to multiple rows ● COMMIT event. The first WRITE_ROW to ndb_apply_status is a 'fake' event generated by the BI. Ndb Binlog transactions
  • 80. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.80 Ndb Binlog transactions [Diagram: multi-row user transactions committing in the same epoch in the Cluster map to a single epoch transaction in the Binlog: BEGIN, table maps, 'fake' WRITE_ROW, row events, COMMIT.] Epoch transactions contain all the changes necessary to move a Slave from one consistent epoch boundary to the next.
  • 81. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.81 Merging all of the user transactions that occurred in an epoch into a single epoch transaction in the Binlog is both a strength and a weakness. It allows higher performance at the Slave, but complicates Binlog positioning. When looking at the Binlog on a Slave cluster, we can see that the upstream Master's epochs are considered to be user transactions by the Slave, so they can be merged together into one epoch transaction in the Slave's binlog. This is a source of efficiency, but can cause problems when performing failover between clusters. Ndb Binlog transactions
  • 82. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.82 One problem with a slave cluster merging a master cluster's epochs together is slave promotion. A common topology is a 'read scaled' setup with 1 Master cluster, and n Slave clusters. When the Master cluster fails, one of the Slaves is selected to become the new Master, and the other Slaves must fail over their replication to the new Master. The problem with epoch merging here is that the old Master's epoch stream (A1,A2,A3,A4,A5) may have been applied by Slave B as (B1(A1), B2(A2,A3), B3, B4(A4,A5)). If Slave B becomes Master, and Slave C has stopped at old Master epoch A2, which epoch transaction boundary should it begin replicating from in Slave B's Binlog? B2 or B3? Slave promotion Another motivation for Slave defaulting to IDEMPOTENT mode
  • 83. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.83 Ndb supports some extra optimisations to minimise the size of the Binlog transactions it produces : - log_update_as_write This causes update events on tables to be logged as WRITE_ROW (Insert) events. It requires that the downstream slave can idempotently apply a WRITE_ROW event. The optimisation is that the row's 'before image' need not be sent. - log_updated_only This causes the NdbApi event (and SUMA subscription) to only send modified columns. The BI then only puts the modified columns in an update or write_row event, saving space and time. Both of these options need care to ensure behaviour is correct. Ndb Binlog transaction optimisations
  • 84. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.84 For finer-grained logging control, the mysql.ndb_replication table can be used. This allows binlogging on-or-off, log-update-as-write and log-updated-only to be controlled per table, per binlogging server. It now supports wildcards for easier use. It also supports defining conflict detection / resolution algorithms. Ndb_replication table
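A sketch of creating and populating the table, using the schema documented for recent releases; the binlog_type values shown (7 = full row images with updates logged as updates, 1 = no logging) and the '%' wildcard are assumptions to verify against your version's manual:

CREATE TABLE IF NOT EXISTS mysql.ndb_replication (
  db          VARBINARY(63),
  table_name  VARBINARY(63),
  server_id   INT UNSIGNED,
  binlog_type INT UNSIGNED,
  conflict_fn VARBINARY(128),
  PRIMARY KEY USING HASH (db, table_name, server_id)
) ENGINE=NDB PARTITION BY KEY(db, table_name);

-- Full before/after images, updates as updates, for every table in db1,
-- on all binlogging servers (server_id = 0 means 'any server')
INSERT INTO mysql.ndb_replication VALUES ('db1', '%', 0, 7, NULL);

-- Disable binlogging entirely for one high-churn table
INSERT INTO mysql.ndb_replication VALUES ('db1', 'scratch', 0, 1, NULL);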
  • 85. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.85 ● Used to map epoch number to binlog file and position ● Append(Insert) only from binlog injector ● Read only during failover from mysql clients ● Bulk deletes occur during PURGE or RESET MASTER. ● MyISAM, so concurrency controlled using single table lock! ● One problem here is that a long running activity holding the table lock can block the BI, as its thread gets stalled waiting for a table lock to insert into the table. This generally causes epoch slip and event buffer backlogs. ● Known bad case is where a Binlog file is PURGEd, manually or automatically, requiring many many rows to be deleted from mysql.ndb_binlog_index. This can stall the BI for some time. ndb_binlog_index table
  • 86. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.86 Schema has evolved over time :
  `Position` bigint(20) unsigned NOT NULL,       -- basic start position mapping
  `File` varchar(255) NOT NULL,
  `epoch` bigint(20) unsigned NOT NULL,
  `inserts` int(10) unsigned NOT NULL,           -- epoch statistics
  `updates` int(10) unsigned NOT NULL,
  `deletes` int(10) unsigned NOT NULL,
  `schemaops` int(10) unsigned NOT NULL,
  `orig_server_id` int(10) unsigned NOT NULL,    -- slave epoch merge info
  `orig_epoch` bigint(20) unsigned NOT NULL,
  `gci` int(10) unsigned NOT NULL,               -- handy gci number
  `next_position` bigint(20) unsigned NOT NULL,  -- new : next position mapping
  `next_file` varchar(255) NOT NULL,
  PRIMARY KEY (`epoch`,`orig_server_id`,`orig_epoch`)
ndb_binlog_index table content
  • 87. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.87 The slave epoch merge info is only present with the --ndb-log-orig server option. If it is not set, those values are set to 0. Normally the BI will insert one row per epoch transaction into ndb_binlog_index. With --ndb-log-orig, it will insert one additional row for every upstream master epoch transaction that a Slave MySQLD has applied to this cluster in this epoch. This gives an indication of how an upstream master's epoch transactions were merged into a Slave's epoch transactions – useful for cutover. These upstream epoch rows do not contain epoch statistics values – those are only produced for the local cluster's row. ndb_binlog_index table epoch merge info
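With --ndb-log-orig set, that merge mapping can be read straight out of the table. An illustrative query (not from the deck):

-- How were upstream Master epochs merged into local epoch transactions?
-- Rows with orig_server_id <> 0 are the extra per-upstream entries;
-- rows with orig_server_id = 0 are the local cluster's own rows.
SELECT epoch AS local_epoch, orig_server_id, orig_epoch, File, Position
FROM   mysql.ndb_binlog_index
WHERE  orig_server_id <> 0
ORDER  BY epoch, orig_server_id, orig_epoch;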
  • 88. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.88 The ndb_binlog_index table has always contained a mapping from an epoch number to an epoch transaction start position. However for replication channel cutover, the slave cluster generally has an already-applied epoch number from the ndb_apply_status table, and so what it needs is the binlog content after the last applied epoch. Various tricky and error-prone techniques have evolved to do this, particularly for the cases where it is not easy (e.g. the last applied epoch transaction is the last epoch transaction in the binlog). Recently added 'next event position' columns to ndb_binlog_index mean that, rather than trying to find some entry representing a next event, we can just directly obtain the correct position. ndb_binlog_index table next position
  • 89. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.89 ndb_binlog_index table next position
Previously :
● ndb_binlog_index contains only epoch->start_pos
● Cutover involves inequality in WHERE epoch > <x>, sort + limit (requires scanning)
● Cutover does not detect that the new Master is missing relevant events.
● Cutover can silently skip over non-epoch-transaction events, e.g. DDL.
Now :
● ndb_binlog_index also contains epoch->next_pos
● Cutover involves equality WHERE epoch = <x>
● Cutover detects that the new Master is missing relevant events.
● Cutover will find non-epoch-transaction events e.g. DDL, and can stop
Recommended!
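A sketch of the new-style cutover; the server_id list is a placeholder for the old Master's binlogging MySQLDs, and the @latest value must be carried from the Slave cluster to the new Master by the failover logic. An empty result from the second query means the new Master is missing the relevant binlog:

-- On the new Slave cluster: highest epoch already applied from the old Master
SELECT MAX(epoch) INTO @latest
FROM   mysql.ndb_apply_status
WHERE  server_id IN (1, 2);  -- old Master's binlogging server_ids (placeholder)

-- On the new Master: equality lookup for the position *after* that epoch
SELECT next_file, next_position
FROM   mysql.ndb_binlog_index
WHERE  epoch = @latest;

-- On the new Slave: point the channel at that position
-- CHANGE MASTER TO MASTER_LOG_FILE='<next_file>', MASTER_LOG_POS=<next_position>;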
  • 90. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.90 Limitations of the next_position cutover ● ndb_apply_status and ndb_binlog_index track only epoch transactions so 'inter-epoch DDL' application status is not visible. ● Previously failover could silently skip inter-epoch DDL at a cutover point. ● Now it will find it. This can lead to duplicate application of DDL causing the Slave to stop. ● Duplicate DDL can be ignored using --slave-skip-errors=ddl_exist_errors ● ndb_binlog_index only tracks empty epochs if --ndb-log-empty-epochs=1 is set. This has disk + network bandwidth impacts. ● Backup and Restore can insert an ndb_apply_status entry with the restore point of the backup as an epoch number, so that replication can be used to catch up from this position. ● If the restore point epoch was empty, and --ndb-log-empty-epochs=0, then it won't be in ndb_binlog_index and we revert to trying to find the 'next' position. ndb_binlog_index table next position
  • 91. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.91 ● Bulk deletes as part of PURGE are of the form : DELETE FROM mysql.ndb_binlog_index WHERE File=<name>; ● The File column is unindexed, but even with an index, this is a lot of work. ● The worst case is very many epochs per Binlog file (e.g. small epochs/large files). Can happen on low write-rate clusters. ● Workaround : Split the DELETE into multiple invocations with a LIMIT clause, as sketched below. Can allow the BI to progress in most cases. ● Better designs : Use InnoDB? Writers don't block readers. Use partition by File? DELETE becomes DROP PARTITION. ● Check BI status with SHOW PROCESSLIST to see if it's blocked on a table lock. ndb_binlog_index table and PURGE Avoid --expire-logs-days, PURGE manually and consider pre-deleting rows from ndb_binlog_index in batches
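A sketch of that workaround as a stored procedure (the procedure name, batch size and file names are illustrative); small batches release the MyISAM table lock between invocations so the BI can get in:

DELIMITER //
CREATE PROCEDURE pre_purge_binlog_index(IN f VARCHAR(255))
BEGIN
  REPEAT
    -- The File value must match exactly what the server stored;
    -- check with SELECT DISTINCT File FROM mysql.ndb_binlog_index;
    DELETE FROM mysql.ndb_binlog_index WHERE File = f LIMIT 10000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;

CALL pre_purge_binlog_index('./binlog.000042');  -- placeholder file name
PURGE BINARY LOGS TO 'binlog.000043';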
  • 92. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.92 ● Queries on ndb_binlog_index are used for replication channel cutover, so can be time-critical. ● In cases where the mysql.ndb_binlog_index file is huge, they can be slow. Beware a low client timeout here! ● Indexes can be added using normal ALTER TABLE mechanisms (see below), to speed up performance. ● Low write-rate clusters can have high numbers of epochs per file. Review whether these clusters are keeping excessive binlog (and therefore have excessively large mysql.ndb_binlog_index files), and consider rotating and purging more often. ndb_binlog_index table and queries
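For example (the index name is illustrative; ndb_binlog_index is a local MyISAM table, so a plain ALTER works):

-- Speed up PURGE-time deletes and file-based lookups
ALTER TABLE mysql.ndb_binlog_index ADD INDEX ix_file (File);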
  • 93. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.93 ● The BI regularly needs to hold : ● the Server-internal Binlog index lock (unavailable during rotate/purge) ● the Server-internal Binlog lock for binlogging (unavailable if other client threads are binlogging) ● a table lock on mysql.ndb_binlog_index (unavailable during any other access, due to MyISAM). ● The BI is 'just another client thread' from the p.o.v. of the generic MySQL Binlog code. So it can get involved in Binlog rotation itself. If --expire_logs_days is used then this can involve PURGE! Binlog Injector problems
  • 94. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.94 ● The Ndb Slave is almost entirely the standard MySQL replication slave system until calls into the Ndb storage engine component are made. ● The IO thread is not modified in any way. ● The SQL thread makes the normal calls into the SE interface, but IDEMPOTENT mode is hard-coded on. ● Batching is the number one source of Ndb performance improvements, and this is the case in the slave. ● The standard RBR events allow limited batching of multiple row changes within a single event. ● Ndb extends this batching using the --slave-allow-batching server option. Ndb Slave Batching, batching, batching
  • 95. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.95 ● When applying an epoch transaction from an upstream master, each event is applied serially as normal. ● The Ndb handler uses the event application to define NdbApi operations for the events, but only executes them when either a full batch is defined, or there is a data dependency. ● The batch size in bytes is specified using the --ndb_batch_size Server parameter ● In recent experiments, there appeared to be very little downside in maximising the configured batch size, so that most epoch transactions are executed in a single batch. ● Batching effectiveness can be measured – see later Ndb Slave batching Increase your batch size!
  • 96. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.96 ● For PK insert/update/delete to in-memory or cached data, which is what RBR should mostly be doing, most of the response time is due to communication latency between the nodes involved. ● So we spread this latency cost over as many operations in a batch as possible. ● What's more, the operations in a batch can run in parallel on different threads of a data node, or different data nodes. ● Finally, even with disk-data the operations in a batch get parallel access to the underlying table space Ndb Slave batching Increase your batch size!
  • 97. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.97 ● Batching is similar to pipelining in a processor; it can be broken by data dependencies and system limitations. ● Where an RBR event needs to read some data, there is a data dependency that must be satisfied before it can execute. This requires that the current batch is flushed, then the read is performed, then the update. Many round trips! ● This is one reason why tables without primary keys are inefficient – they require reads or, even worse, scans to find matching rows for update and delete. ● Where Blob/Text columns are being modified there is an implicit dependency in the implementation which requires that we lock the main table row before modifying parts table rows. This requires a batch flush to obtain the lock, so breaks up any surrounding batch. Ndb Slave batch breakup Avoid writing to PK-less tables Beware of Blobs
  • 98. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.98 Ndb Slave batching [Diagram: an epoch transaction (BEGIN, table maps, 20 row events, COMMIT) read from the relay log is applied to the Slave Cluster in 3 round trips – moderate batching, 20:3.]
  • 99. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.99 Ndb Slave batching [Diagram: the same 20-event epoch transaction applied to the Slave Cluster in a single round trip – maximum batching, 20:1.]
  • 100. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.100 Ndb Slave batching [Diagram: the same 20-event epoch transaction applied in 20 round trips – minimum batching, 20:20.] The Slave SQL thread CPU capacity is finite, and without good batching it is limited by waiting for responses.
  • 101. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.101 ● The Ndb Slave has Ndb object statistics which can be monitored while it is running. Beware that these statistics are currently lost when the Slave thread is stopped. ● mysql> SHOW STATUS LIKE 'ndb_api%slave'; ● The meaning of the values is documented in the manual, but some are of special interest : ● Ndb_api_bytes_sent_count_slave Can give rough indication of the apply rate of the slave in bytes/s ● Ndb_api_trans_commit_count_slave Can give indication of the apply rate of the slave in epochs/s ● Ndb_api_wait_exec_complete_count_slave Can give indication of the round trips performed by the slave Ndb Slave monitoring These are monotonic counters, so to get rates, you must sample on some period, and determine the difference between samples. Details online.
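A sketch of turning these monotonic counters into a rate by sampling twice, using INFORMATION_SCHEMA.GLOBAL_STATUS (available in the 5.6-based servers of this era; the 10 second interval is arbitrary):

SELECT VARIABLE_VALUE INTO @commits0
FROM   INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE  VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave';

DO SLEEP(10);

-- Epoch transactions applied per second over the sample window
SELECT (VARIABLE_VALUE - @commits0) / 10 AS epoch_txns_per_sec
FROM   INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE  VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave';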
  • 102. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.102 ● Ndb_api_pk_op_count_slave Can give rough indication of the apply rate of the slave in rows/s ● Ndb_api_trans_abort_count_slave Can give indication of slave apply problems – locking or temporary errors. (bad) ● Ndb_api_read_row_count_slave Can give indication of whether the slave is performing any reads (bad) ● Ndb_api_table|range_scan_count_slave Can give indication of whether the slave is performing any scan reads (bad) ● Ndb_api_wait_nanos_count_slave Can indicate time spent waiting for the data nodes – with caveats. Ndb Slave monitoring
  • 103. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.103 Interesting ratios : ● Avg batches/epoch transaction (batching ratio) : Ndb_api_wait_exec_complete_count_slave / Ndb_api_trans_commit_count_slave ● Avg bytes/epoch transaction : Ndb_api_bytes_sent_count_slave / Ndb_api_trans_commit_count_slave ● Avg rows/epoch transaction : Ndb_api_pk_op_count_slave / Ndb_api_trans_commit_count_slave Ndb Slave monitoring
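The same status variables can be combined directly in SQL; a sketch of the batching ratio (remember these are lifetime counters – for a live view, difference two samples as above):

SELECT
  (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS
   WHERE VARIABLE_NAME = 'Ndb_api_wait_exec_complete_count_slave')
  /
  (SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS
   WHERE VARIABLE_NAME = 'Ndb_api_trans_commit_count_slave')
  AS avg_batches_per_epoch_txn;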
  • 104. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.104 The Ndb slave is often unlike all other NdbApi clients, as it is entirely serial, but generates large transactions with mixed operation types and can be very intensive. The most intense period is when a Slave is 'catching up' with a Binlog – it can apply epoch transactions much faster than they were originally committed on the Master. This can cause overload for the Slave cluster – redo logs can get stressed, and SendBuffers on the Slave cluster's data nodes sending to its binlogging MySQLDs can be overloaded at commit time. Current recommendation is : - Experiment with 'worst cases' and check behaviour and rates measured. - Monitor rates in production to get notification when they approach their tested limits. Ndb Slave notes
  • 105. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.105 ● One observation of the MySQL Cluster replication system is that it is missing some end-to-end checks. It relies on the correct operation of the lower layers of generic MySQL replication. ● Partly this is by design – the replication layer treats events and transactions separately and avoids dependencies between them. Most SEs have no slave- specific logic. ● However some cross-checks are simple and effective : ● No jumping back : Received ndb_apply_status epoch numbers should never decline without a Master pos change. ● No repeats : Received ndb_apply_status epoch numbers should never repeat without a rollback or Master pos change. ● No retry failures : Received ndb_apply_status epoch numbers should not increase without a commit, or Master pos change. Ndb Slave improvements
  • 106. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.106 ● Even better would be a check that epochs are in the expected sequence, but that requires binlogging changes : ● No gaps : Received ndb_apply_status epoch numbers should follow a sequence, where each epoch includes its successfully binlogged predecessor's number. Ndb Slave improvements
  • 107. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.107 Recommendations
  • 108. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.108 ● Performance recommendations Not the main focus ● Robustness recommendations ● Potential cluster improvements. Technical details in preceding slides Recommendations
  • 109. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.109 Binlog Injector ● Set binlog_cache_size so that there's no spill-to-disk (see the check below) Slave ● --slave_allow_batching ● Increase ndb_batch_size (+ test) ● Avoid primary-key-less tables ● Beware replicating Blobs/Text ● Monitor slave activity using SHOW STATUS LIKE 'ndb_api%slave' Performance recommendations
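A sketch of checking the binlog_cache_size recommendation above: Binlog_cache_disk_use counts transactions that spilled the in-memory cache to a temporary file, so it should stay at or near zero. The size below is an example value only, and the global setting applies to new sessions, so a config-file change plus restart is the safe route for the BI:

SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
-- Binlog_cache_disk_use should not be climbing

SET GLOBAL binlog_cache_size = 64 * 1024 * 1024;  -- example value only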
  • 110. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.110 Data nodes ● Set MaxBufferedEpochs high enough so that it indicates a real (hard to reproduce) issue ● Test SendBuffer configuration, especially from data nodes to binlogging MySQLDs to ensure commit of the largest transactions and heaviest load can be handled. (Slave catchup?) Binlog Injector ● Monitor SHOW STATUS LIKE 'ndb_api%injector' to understand normal and excess flows. ● Monitor SHOW PROCESSLIST to check BI state ● Monitor SHOW ENGINE NDB STATUS to get NdbApi buffering indication Robustness recommendations
  • 111. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.111 Binlog Injector continued ● Monitor SHOW MASTER STATUS to understand outgoing binlog rates ● Consider using --ndb-eventbuffer-max-alloc to avoid excessive event buffer usage destabilising the host ndb_binlog_index table ● Avoid using --expire-logs-days ● Consider manual purge, potentially with pre-delete of ndb_binlog_index rows in small batches ● Consider adding indexes if cutover queries are too slow Robustness recommendations
  • 112. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.112 System restart ● Ensure that after a system restart, the cluster is not brought back online immediately, as it will need some form of consistency restoration. Replication channel cutover ● Consider using the new replication channel cutover query, alongside --slave-skip-errors=ddl_exist_errors Slave ● Test system robustness under a prolonged 'Master catchup' scenario. Monitor Slave cluster redo logs, redo log state, SendBuffer overload, Binlogging MySQLD lag etc. Robustness recommendations