Tech-Spark: Scaling Databases

1. Introduction: Why Scale?
2. Vertical & Horizontal Partitioning
3. Partitioned Tables
4. Distributed Partitioned Views
5. Database Sharding
6. Stretch Databases (optional)

Ralph: Who am I?
• An Enterprise Architect
• at iGamingCloud, Gaming Innovation Group
• focus on Data Platforms
• A Microsoft Certified Trainer
• deliver MTA, MCSA, MCSE locally
• covering Windows, SQL Server, C#
• I’m here to describe the need for database scalability, describe a
number of possible cross platform solutions, and demonstrate
technologies available in MS SQL Server 2016 and Azure.

1. Introduction
Why do we need to scale databases?
Overview of possible options

Scaling Databases: Why?
• Most application environments are developed as a monolith, a single application running a single
database on a single server.
• In time, the whole application environment starts slowing down:
• increased data volumes
• increased work loads
• The simplest option is to introduce an app/web farm to balance the application across multiple
servers whilst using the same old single database.
• But this might not be enough… we need to scale the database!

Scaling Databases: Optimisations
• Unless the whole application environment is redesigned and redeveloped, one needs to look into
optimising the database layer.
• Large database problems include:
• Queries become slower, possibly giving time-outs under load
• Backups are slower to take, to ship, and to restore
• Index maintenance takes longer and has a greater impact on the running workload
• Common optimisations include:
• Vertical Scaling: scale up current servers to max disk/memory/cpu, or simply migrate to a bigger server
• Read Scaling: scale out by introducing a synchronously or asynchronously replicated server to take read-only queries off the primary
• Database restructuring: improved table designs, introduction of aggregation tables
• Offload data: move old transactional data to archive servers, deletion of log data
• But this might not be enough… we need to partition the database!

Scaling Databases: Data Partitioning
• Even though we can scale vertically by adding more resources,
a single database would need to be scaled within itself:
• Vertical & Horizontal Partitioning
• Partitioned Tables
• When a single database is too big, horizontal scaling is done
using distributed databases:
• Distributed Partitioned Views
• Database Sharding
(Diagram: Scale Up vs. Scale Out)

Scaling Databases: Domain Partitioning
• A different approach is to partition your data by domain.
• This is achieved by splitting the data by domain and moving each domain into its own database.
• This could be fairly easy if tables are already grouped into their own schema by domain.
• However it could be problematic if application queries and reports span multiple schemas
• reports would now need to mesh multiple databases together
• or read from a consolidated data warehouse
• Even though this breaks the database down into smaller databases, each smaller database has the
potential to become a problem on its own.
• Refactoring a monolithic application into microservices adopts this principle, with each
microservice having its own data store.
• Microservices usually use polyglot persistence: the appropriate data store is chosen according to the
required features and partition usage, e.g. a mix of SQL & NoSQL data stores.

2. Partitioning
Benefits
Strategies: Horizontal & Vertical Partitioning
Updatable Views
DEMO

Partitioning Benefits
• Scalability: Scale-up will eventually reach a physical hardware limit.
• Performance: Data access takes place on smaller partitions, in parallel for multiple partitions.
• Availability: Reduces single points of failure; multiple disk drives, multiple databases, multiple servers.
• Security: Separate sensitive and non-sensitive data into different partitions.
• Flexibility: Varied operational management strategies by partition; monitoring, backups, restores,
indexing, etc.
https://docs.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning

Strategy: Vertical Partitioning
Before:
ProductID  Name                   Price  DateCreated  Stock  LastOrdered
AR-5381    Adjustable Race           50  11-Jan-2016      8  17-Nov-2016
AA-8327    Bearing Ball             100  11-Feb-2016     46  21-Nov-2017
BE-2349    BB Ball Bearing          105  11-Mar-2016     52  16-Sep-2017
CE-2908    Headset Ball Bearings     90  11-Jan-2017     13  12-Feb-2017
CL-2036    Blade                     70  11-Feb-2017     28  01-Dec-2017
DA-5965    LL Crankarm              150  11-Mar-2017     30  08-Dec-2017
After (split by columns, both parts keyed on ProductID):
ProductID  Name                   Price  DateCreated
AR-5381    Adjustable Race           50  11-Jan-2016
AA-8327    Bearing Ball             100  11-Feb-2016
BE-2349    BB Ball Bearing          105  11-Mar-2016
CE-2908    Headset Ball Bearings     90  11-Jan-2017
CL-2036    Blade                     70  11-Feb-2017
DA-5965    LL Crankarm              150  11-Mar-2017
ProductID  Stock  LastOrdered
AR-5381        8  17-Nov-2016
AA-8327       46  21-Nov-2017
BE-2349       52  16-Sep-2017
CE-2908       13  12-Feb-2017
CL-2036       28  01-Dec-2017
DA-5965       30  08-Dec-2017
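The split above can be sketched as two tables sharing the same key, re-joined by a view. This is only a minimal illustration: it uses Python's built-in sqlite3 as a stand-in for SQL Server so that it runs anywhere, and the table names ProductCatalogue and ProductInventory are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Descriptive columns (hypothetical table name)
    CREATE TABLE ProductCatalogue (
        ProductID   TEXT PRIMARY KEY,
        Name        TEXT NOT NULL,
        Price       INTEGER NOT NULL,
        DateCreated TEXT NOT NULL
    );
    -- Stock-related columns, keyed by the same ProductID (hypothetical table name)
    CREATE TABLE ProductInventory (
        ProductID   TEXT PRIMARY KEY REFERENCES ProductCatalogue(ProductID),
        Stock       INTEGER NOT NULL,
        LastOrdered TEXT NOT NULL
    );
    -- A view re-joins the two halves into the original row shape
    CREATE VIEW Products AS
    SELECT c.ProductID, c.Name, c.Price, c.DateCreated, i.Stock, i.LastOrdered
    FROM ProductCatalogue AS c
    JOIN ProductInventory AS i ON i.ProductID = c.ProductID;
""")
conn.execute("INSERT INTO ProductCatalogue VALUES ('AR-5381', 'Adjustable Race', 50, '2016-01-11')")
conn.execute("INSERT INTO ProductInventory VALUES ('AR-5381', 8, '2016-11-17')")

# A stock update touches only the narrow inventory table
conn.execute("UPDATE ProductInventory SET Stock = Stock - 1 WHERE ProductID = 'AR-5381'")
row = conn.execute("SELECT * FROM Products").fetchone()
print(row)  # ('AR-5381', 'Adjustable Race', 50, '2016-01-11', 7, '2016-11-17')
```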

Strategy: Horizontal Partitioning
Production.Products
ProductID  Name                   Price  Stock  DateCreated  LastOrdered
AR-5381    Adjustable Race           50      8  11-Jan-2016  17-Nov-2016
AA-8327    Bearing Ball             100     46  11-Feb-2016  21-Nov-2017
BE-2349    BB Ball Bearing          105     52  11-Mar-2016  16-Sep-2017
CE-2908    Headset Ball Bearings     90     13  11-Jan-2017  12-Feb-2017
CL-2036    Blade                     70     28  11-Feb-2017  01-Dec-2017
DA-5965    LL Crankarm              150     30  11-Mar-2017  08-Dec-2017
Production.Products_2016
AR-5381    Adjustable Race           50      8  11-Jan-2016  17-Nov-2016
AA-8327    Bearing Ball             100     46  11-Feb-2016  21-Nov-2017
BE-2349    BB Ball Bearing          105     52  11-Mar-2016  16-Sep-2017
Production.Products_2017
CE-2908    Headset Ball Bearings     90     13  11-Jan-2017  12-Feb-2017
CL-2036    Blade                     70     28  11-Feb-2017  01-Dec-2017
DA-5965    LL Crankarm              150     30  11-Mar-2017  08-Dec-2017

Horizontal Partitioning: Why?
• The idea behind horizontal partitioning is to split a large table into multiple smaller tables.
• Query-wise
• One smaller table is faster to query than a larger table
• However, querying multiple smaller tables is problematic
• Administration-wise, the smaller tables can be placed into different file groups, which can be:
• Placed on different physical disks > parallelism can be faster
• Backed up individually > smaller backup windows
• Set as read-only > protects older data from modifications, back up once and forget

Horizontal Partitioning: Dynamic Queries
-- @FromDate and @ToDate are assumed to be DATETIME variables declared by the caller
DECLARE @SQL AS NVARCHAR(MAX) = CONCAT(N'
SELECT ProductId, Name, Price, Stock, DateCreated, LastOrdered
FROM Production.Products_', dbo.GetPartition('Production.Products', @FromDate), N' WITH(NOLOCK)
WHERE DateCreated >= @FromDate AND DateCreated <= @ToDate
')
-- The dates are passed as parameters; only the table suffix is concatenated
EXECUTE sp_executesql @stmt = @SQL
, @params = N'@FromDate DATETIME, @ToDate DATETIME'
, @FromDate = @FromDate
, @ToDate = @ToDate
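The dynamic query above targets the single table that matches @FromDate, so a date range that crosses a year boundary touches more than one table. A minimal sketch of working out which tables a range needs, assuming one table per calendar year named with a year suffix as on the slides. The helper tables_for_range is hypothetical and is not the author's dbo.GetPartition.

```python
from datetime import date

def tables_for_range(base_table: str, from_date: date, to_date: date) -> list[str]:
    """Return the yearly partition tables that an inclusive date range touches."""
    if from_date > to_date:
        raise ValueError("from_date must not be after to_date")
    return [f"{base_table}_{year}" for year in range(from_date.year, to_date.year + 1)]

tables = tables_for_range("Production.Products", date(2016, 11, 1), date(2017, 2, 28))
print(tables)  # ['Production.Products_2016', 'Production.Products_2017']
```

Each returned table would then need its own SELECT, combined with UNION ALL.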

Horizontal Partitioning: UNIONed Queries
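;WITH products AS
(
SELECT ProductId, Name, Price, Stock, DateCreated, LastOrdered
FROM Production.Products_2016 WITH(NOLOCK)
UNION ALL
SELECT ProductId, Name, Price, Stock, DateCreated, LastOrdered
FROM Production.Products_2017 WITH(NOLOCK)
UNION ALL
...
)
SELECT ProductId, Name, Price, Stock, DateCreated, LastOrdered
FROM products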
Views
• Dynamic queries are a pain! No syntax checking, string concatenation, etc…
• Constantly creating CTEs to union tables is heavy for everyone.
• We usually create VIEWs to provide a unified view
• however, maintaining them by hand could be cumbersome and repetitive, e.g. every month
• thus we dynamically create them using custom code and jobs
• VIEWs help transparently replace an existing table with multiple smaller ones
• no code changes required
• however, not all views are updatable
Updatable Views
• You can modify the data of an underlying base table through a view, as long as the following
conditions are true:
• Any modifications, including UPDATE, INSERT, and DELETE statements, must reference columns from only one base
table.
• The columns being modified in the view must directly reference the underlying data in the table columns.
• The columns being modified are not affected by GROUP BY, HAVING, or DISTINCT clauses.
• TOP is not used anywhere in the select statement of the view together with the WITH CHECK OPTION clause.
• INSTEAD OF triggers can be created on a view to make it updatable. The INSTEAD OF trigger is
executed instead of the data modification statement on which the trigger is defined.
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-view-transact-sql
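A UNION ALL view is not updatable by itself, and an INSTEAD OF trigger is one way to route writes to the right base table. The sketch below illustrates the idea with SQLite through Python's sqlite3 module, purely so it can be run without a server. SQL Server's trigger syntax differs (it exposes the new rows through the inserted pseudo-table rather than NEW), and all object names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products_2016 (ProductId TEXT PRIMARY KEY, Name TEXT, DateCreated TEXT);
    CREATE TABLE Products_2017 (ProductId TEXT PRIMARY KEY, Name TEXT, DateCreated TEXT);

    CREATE VIEW Products AS
    SELECT ProductId, Name, DateCreated FROM Products_2016
    UNION ALL
    SELECT ProductId, Name, DateCreated FROM Products_2017;

    -- Runs in place of the INSERT and routes the row by its creation date
    CREATE TRIGGER Products_Insert INSTEAD OF INSERT ON Products
    BEGIN
        INSERT INTO Products_2016
        SELECT NEW.ProductId, NEW.Name, NEW.DateCreated
        WHERE NEW.DateCreated < '2017-01-01';

        INSERT INTO Products_2017
        SELECT NEW.ProductId, NEW.Name, NEW.DateCreated
        WHERE NEW.DateCreated >= '2017-01-01';
    END;
""")

# The application inserts into the view; the trigger picks the base table
conn.execute("INSERT INTO Products VALUES ('AR-5381', 'Adjustable Race', '2016-01-11')")
conn.execute("INSERT INTO Products VALUES ('CL-2036', 'Blade', '2017-02-11')")

count_2016 = conn.execute("SELECT COUNT(*) FROM Products_2016").fetchone()[0]
count_2017 = conn.execute("SELECT COUNT(*) FROM Products_2017").fetchone()[0]
print(count_2016, count_2017)  # 1 1
```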

3. Partitioned Tables
Defining Partition Functions & Partition Schemes
Tooling: Custom Partition framework
Myths and performance issues
DEMO

Partitioned Tables: Definition…
• Microsoft introduced Partitioned Tables in MS SQL Server 2005
• It supports the use of multiple file groups
• It provides a single table to query from irrespective of partitions
• The example in the diagram partitions a table into:
• A partition per month within the current year
• A partition per year for the last two years
• A partition for all the previous years
(Diagram: partitions Pre-2015 | 2015 | 2016 | Jan 2017 | Feb 2017, plus partitions marked EMPTY)

• A Partition Function
• A Data Type – typically DATE related
• A Range – LEFT or RIGHT
CREATE PARTITION FUNCTION PF_Name (DATETIME2)
AS RANGE RIGHT FOR VALUES ('20170101','20170201','20170301');
• A Partition Scheme – that associates file groups to the partition function
CREATE PARTITION SCHEME PS_Name
AS PARTITION PF_Name
TO (FG000000, FG201701, FG201702, FG201703);

• With a RIGHT Range, the previous partitioned table example requires 4 partitions:
• A partition on the left, containing everything from beginning of time till before Jan 2017 – should be empty
• A partition from Jan 2017 till before Feb 2017
• A partition from Feb 2017 till before Mar 2017
• A partition from Mar 2017 till the end of time – should be empty
(Diagram: EMPTY | Jan 2017 | Feb 2017 | Mar 2017 (EMPTY))
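How boundary values map to partitions can be modelled in a few lines. This is only a sketch of the LEFT/RIGHT semantics using the three boundaries from the partition function above, and partition_number is a hypothetical helper rather than a SQL Server API.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# The three boundary values from PF_Name
BOUNDARIES = [date(2017, 1, 1), date(2017, 2, 1), date(2017, 3, 1)]

def partition_number(value: date, boundaries: list[date], range_right: bool = True) -> int:
    """Return the 1-based partition number for a value.

    RANGE RIGHT: a boundary value belongs to the partition on its right.
    RANGE LEFT:  a boundary value belongs to the partition on its left.
    """
    if range_right:
        return bisect_right(boundaries, value) + 1
    return bisect_left(boundaries, value) + 1

print(partition_number(date(2016, 12, 31), BOUNDARIES))                  # 1: before Jan 2017
print(partition_number(date(2017, 1, 1), BOUNDARIES))                    # 2: Jan 2017
print(partition_number(date(2017, 1, 1), BOUNDARIES, range_right=False)) # 1: boundary stays left
print(partition_number(date(2017, 3, 15), BOUNDARIES))                   # 4: Mar 2017 onwards
```

Three boundaries always yield four partitions, which is why the partition scheme lists four file groups.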

Partitioned Tables: Splitting…
• A partitioned table can be extended by splitting an existing partition
• We first add a new file group to the partition scheme
ALTER PARTITION SCHEME PS_Name
NEXT USED [FG201704]
• We then split the partition function to the right
ALTER PARTITION FUNCTION PF_Name()
SPLIT RANGE ('20170401')
(Diagram, before the split: EMPTY | Jan 2017 | Feb 2017 | Mar 2017 (EMPTY))
(Diagram, after the split: EMPTY | Jan 2017 | Feb 2017 | Mar 2017 | Apr 2017 (EMPTY))
Summary: Required steps
• On setup:
1. Create file group for non-partitioned indexes (if required)
2. Create file group for left hand side (to remain empty)
3. Create Partition Function (partitioning key datatype, range direction)
4. Create Partition Scheme (with empty file group)
• Regularly (e.g. monthly)
1. Create file group
2. Split partition

Tooling: Custom partitioning framework
• We created a number of stored procedures to handle these steps:
• Maintenance.UspCreateFileGroup – used to create files and file groups
• Maintenance.UspCreatePartition – used once to create the partition function and partition scheme
• Maintenance.UspCreatePartitionView – used to create a monthly view per partition by date range
• Maintenance.UspSplitPartition – used monthly to create a new file group, split partition, create view
• Maintenance.UspSplitPartitionAllTables – used monthly to split all partition tables via agent job

Partitioned Tables: Merging…
• A partitioned table can have multiple partitions merged into one
ALTER PARTITION FUNCTION PF_Name()
MERGE RANGE('20170201');
• Note: Merging partitions with data movement across file groups will be slow
(Diagram, before the merge: EMPTY | Jan 2017 | Feb 2017 | Mar 2017 | Apr 2017 (EMPTY))
(Diagram, after the merge: EMPTY | Jan & Feb 2017 | Mar 2017 | Apr 2017 (EMPTY))

Partitioned Tables: Switching…
• Partition switching reduces locks whilst:
• Loading data into a warehouse
• Deleting old data during archival
• Moving data between tiered storage
• Partitions need to be in the same file group
• Re-create the staging indexes to move physical data
ALTER TABLE schema.StgTable
SWITCH PARTITION $PARTITION.PF_Name('20170201')
TO schema.PrdTable PARTITION $PARTITION.PF_Name('20170201')
(Diagram: Staging table and Production table on the same Partition Function, with partitions Jan 2017 | Feb 2017 | Mar 2017 | Apr 2017 (EMPTY) and an Empty partition)

Decreased performance: TOP, MAX or MIN
• Test results show that TOP is 10% slower on partitioned tables
• SET ROWCOUNT can be used instead
https://support.microsoft.com/en-us/help/2965553/decreased-performance-for-sql-server-when-you-run-a-top--max-or-min-ag

Increased performance: SELECT using non-clustered PK
• When using ROWCOUNT, partitioned tables perform better:
• 22% higher throughput
• a 3% improvement in response time when using the non-clustered primary key

Increased performance: SELECT using Partitioning Key
• When using ROWCOUNT, partitioned tables perform better:
• 6% higher throughput
• a 7% improvement in response time when using the clustered partitioning date key

Increased performance: Inserts
• Combined INSERT & SELECT tests found partitioned tables to be faster:
• SELECT – 9% Throughput benefit / 11% improvement in response times
• INSERT – 4% Throughput benefit / 9% improvement in response times

Unique columns
• Traditionally developers create an IDENTITY(1,1) PRIMARY KEY to provide uniqueness
• This cannot be used as-is with partitioned tables: a unique index that is partitioned must include the Partitioning Key
• It should be replaced with a UNIQUEIDENTIFIER generated at application level (also in preparation for distributed
tables…)
• A PRIMARY KEY is by default CLUSTERED and stored with the data
• In partitioned tables, the Partitioning Key has to be CLUSTERED to split the data
• Thus if the PRIMARY KEY does not contain the Partitioning Key, it cannot be CLUSTERED
• An un-partitioned NONCLUSTERED PRIMARY KEY can be used to enforce uniqueness
• However, this prohibits SWITCHING of partitions due to unaligned indexes
https://blogs.msdn.microsoft.com/sqlmeditation/2013/04/02/dealing-with-unique-columns-when-using-table-partitioning
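Generating the identifier at application level might look like the sketch below. The row shape and the new_product_row helper are hypothetical, and a random (version 4) UUID is just one possible choice.

```python
import uuid

def new_product_row(name: str, date_created: str) -> dict:
    """Build a row whose key is generated by the application, not the database."""
    return {"ProductId": str(uuid.uuid4()), "Name": name, "DateCreated": date_created}

a = new_product_row("Blade", "2017-02-11")
b = new_product_row("Blade", "2017-02-11")

# Two rows with identical data still get distinct keys, with no database round trip
print(a["ProductId"] != b["ProductId"])
```

Because no database sequence is involved, the same approach keeps working once the rows live in different partitions or on different servers.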

Myth: Metadata only operations
• Switching partitions in & out
• Requires a schema lock on both the source and destination tables
• The command is usually run with a timeout and retried later if the lock cannot be acquired
• Splitting & merging partitions
• Altering the partition function is an offline operation
• Splitting a partition which contains data requires data movement
• If the range split introduces a different file group, data needs to physically move between files
• This is why we keep an empty partition on the left and right, and we always split the empty partition
https://www.mssqltips.com/sqlservertip/1914/sql-server-database-partitioning-myths-and-truths
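Why splitting the empty partition is cheap can be shown with a simplified model. It assumes RANGE RIGHT semantics and counts only the rows that would fall into the newly created partition; whether SQL Server physically moves them also depends on the file groups involved. Both function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

def partition_number(value, boundaries):
    """1-based partition number under RANGE RIGHT semantics."""
    return bisect_right(boundaries, value) + 1

def rows_in_new_partition(row_dates, boundaries, new_boundary):
    """Count existing rows that fall into the partition created by a split."""
    new_boundaries = sorted(boundaries + [new_boundary])
    # Under RANGE RIGHT the new partition sits to the right of the new boundary
    new_partition = new_boundaries.index(new_boundary) + 2
    return sum(1 for d in row_dates if partition_number(d, new_boundaries) == new_partition)

boundaries = [date(2017, 1, 1), date(2017, 2, 1), date(2017, 3, 1)]
rows = [date(2017, 1, 5), date(2017, 1, 20), date(2017, 1, 25), date(2017, 2, 10)]

# Splitting the empty right-most partition: nothing to move
print(rows_in_new_partition(rows, boundaries, date(2017, 4, 1)))   # 0
# Splitting the populated Jan 2017 partition: two rows land in the new partition
print(rows_in_new_partition(rows, boundaries, date(2017, 1, 15)))  # 2
```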

4. Distributed Partitioned Views
Definition
Requirements… loads!
DEMO

Distributed Partitioned Views: Definition
• Basically a view which unions data from multiple databases hosted on different servers.
• Also referred to as Federated Databases.
• Used when applications are unaware of such partitioning.
• Requires Linked Servers.
• Performance improves with the lazy schema validation option.
• Read-only views work everywhere.
• Updatable views require Enterprise Edition.
• INSTEAD OF triggers can be used to make views updatable on Standard Edition.

https://docs.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-2016
Distributed Partitioned Views

Distributed Partitioned Views: Requirements
• Table Rules
• Member tables cannot be referenced more than one time in the view.
• Member tables cannot have indexes created on any computed columns.
• Member tables must have all PRIMARY KEY constraints on the same number of columns.
• Member tables must have the same ANSI padding setting.
• Column Rules
• All columns in each member table must be included in the same ordinal position in the select list.
• Columns cannot be referenced more than one time in the select list.
• The columns in the select list of each SELECT statement must be of the same type.
• The key ranges of the CHECK constraints in each table cannot overlap with the ranges of any other table.
• Partitioning Column Rules
• The partitioning column cannot be an identity, default, timestamp, or computed column.
• The partitioning column must be in the same ordinal location in the select list of each SELECT statement in the view.
• The partitioning column cannot allow for nulls.
• The partitioning column must be a part of the primary key of the table.
• There must be only one constraint on the partitioning column.
• There are no restrictions on the updatability of the partitioning column.
https://technet.microsoft.com/en-us/library/ms188299(v=sql.105).aspx
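The rule that CHECK constraint key ranges must not overlap can be verified before the member tables are created. A minimal sketch, assuming each member table covers one inclusive integer key range; find_overlaps and the table names are hypothetical.

```python
from itertools import combinations

def find_overlaps(ranges: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    """Return every pair of member tables whose inclusive key ranges overlap."""
    return [
        (a, b)
        for (a, (a_low, a_high)), (b, (b_low, b_high)) in combinations(sorted(ranges.items()), 2)
        if a_low <= b_high and b_low <= a_high
    ]

good = {"Products_2016": (20160101, 20161231), "Products_2017": (20170101, 20171231)}
bad = {"Products_2016": (20160101, 20161231), "Products_2017": (20161201, 20171231)}

print(find_overlaps(good))  # []
print(find_overlaps(bad))   # [('Products_2016', 'Products_2017')]
```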

Distributed Partitioned Views: Updatable
• INSERT Statements
• All columns must be included in the INSERT statement even if the column can be NULL in the base table or has a DEFAULT constraint defined in
the base table.
• The DEFAULT keyword cannot be specified in the VALUES clause of the INSERT statement.
• INSERT statements must supply a value that satisfies the logic of the CHECK constraint defined on the partitioning column for one of the
member tables.
• INSERT statements are not allowed if a member table contains a column with an identity property.
• INSERT statements are not allowed if a member table contains a timestamp column.
• INSERT statements are not allowed if there is a self-join with the same view or any one of the member tables.
• UPDATE Statements
• UPDATE statements cannot specify the DEFAULT keyword as a value in the SET clause even if the column has a DEFAULT value defined in the
corresponding member table
• The value of a column with an identity property cannot be changed; however, the other columns can be updated.
• The value of a PRIMARY KEY cannot be changed if the column contains text, image, or ntext data.
• Updates are not allowed if a base table contains a timestamp column.
• Updates are not allowed if there is a self-join with the same view or any one of the member tables.
• DELETE Statements
• DELETE statements are not allowed when there is a self-join with the same view or any one of the member tables.
https://technet.microsoft.com/en-us/library/ms187067(v=sql.105).aspx

5. Database Sharding
Definition
Sharding Strategies

Database Sharding: Definition
• A form of horizontal partitioning in which partitions are distributed on commodity servers.
• An individual partition is referred to as a shard.
• The application is shard-aware and can route connection requests autonomously without the
need of distributed partitioned views.
• Sharding is used to truly circumvent issues of having a single monolith database or a single entry-
point in terms of Storage space, Computing resources, Network bandwidth, and Geography.
https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding

Database Sharding: Problems
• Queries that JOIN shards together are problematic and would need to be meshed together via the
application.
• Multiple shards can be queried in parallel and merged together either in memory or client-side.
• Referential integrity might be non-existent.
• Shards are usually used with domain-based partitioning and thus referenced tables could be in different databases.
• Un-partitioned reference tables would also be placed outside the shards.
• However, static reference tables could be treated as global tables, thus copied and replicated into all shards.
• Rebalancing sharded data is problematic. This might be required when
• a shard key changes and thus data need to move between shards
• a new shard is added and data needs to be redistributed
https://docs.microsoft.com/en-us/azure/architecture/patterns/sharding
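Querying several shards in parallel and merging the results client-side might look like the following sketch. The shards are hypothetical in-memory lists standing in for separate databases, and the function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

# Hypothetical in-memory stand-ins for three shards; each holds (product_id, price) rows
SHARDS = {
    "shard_a": [("AR-5381", 50), ("CL-2036", 70)],
    "shard_b": [("AA-8327", 100), ("CE-2908", 90)],
    "shard_c": [("BE-2349", 105), ("DA-5965", 150)],
}

def query_shard(rows, min_price):
    """Run the same filter on one shard and return its rows sorted by price."""
    return sorted((row for row in rows if row[1] >= min_price), key=lambda row: row[1])

def query_all_shards(min_price):
    """Fan the query out to every shard in parallel, then merge the sorted results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, min_price), SHARDS.values()))
    # Each partial result is already sorted, so a k-way merge keeps the global order
    return list(merge(*partials, key=lambda row: row[1]))

result = query_all_shards(90)
print(result)
# [('CE-2908', 90), ('AA-8327', 100), ('BE-2349', 105), ('DA-5965', 150)]
```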

Database Sharding: Strategies
• The Lookup strategy
• A map is used to route a request for data to the shard that contains such data using the shard key.
• Multi-tenant applications can store all the data for a tenant together in a shard using the tenant ID as
shard key.
• Multiple tenants can share the same shard, but the data for a single tenant cannot spread across
multiple shards.
• The Range strategy
• Sequential shard keys are ordered and grouped together.
• Useful for applications that frequently retrieve sets of items using range queries.
• The Hash strategy
• This is used to reduce the chance of hotspots (shards that receive a disproportionate amount of load).
• The chosen hashing function should distribute data evenly across the shards, possibly by introducing
some random element into the computation.
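The three strategies can be compared side by side in a short sketch. Shard names, tenant IDs and range bounds are all hypothetical. A SHA-256 digest is used for the hash strategy because Python's built-in hash() for strings changes between processes, which would make routing unstable.

```python
import hashlib
from bisect import bisect_right

SHARDS = ["shard_0", "shard_1", "shard_2"]

# Lookup strategy: an explicit map from tenant ID to shard.
# Several tenants may share a shard, but each tenant maps to exactly one.
TENANT_MAP = {"tenant_a": "shard_0", "tenant_b": "shard_0", "tenant_c": "shard_2"}

def lookup_shard(tenant_id: str) -> str:
    return TENANT_MAP[tenant_id]

# Range strategy: ordered bounds; keys below 1000 go to shard_0,
# keys from 1000 up to 1999 go to shard_1, the rest to shard_2.
RANGE_BOUNDS = [1000, 2000]

def range_shard(order_id: int) -> str:
    return SHARDS[bisect_right(RANGE_BOUNDS, order_id)]

# Hash strategy: a stable hash spreads keys across the shards
def hash_shard(key: str) -> str:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

print(lookup_shard("tenant_b"))  # shard_0
print(range_shard(999), range_shard(1000), range_shard(2500))  # shard_0 shard_1 shard_2
print(hash_shard("AR-5381") == hash_shard("AR-5381"))  # True: routing is repeatable
```

Note that with the plain modulo used here, changing the number of shards remaps most keys, which is the rebalancing problem described earlier.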

6. Stretch Database
Definition
Demo

Stretch Database: Definition
• Stretch Database is a feature of SQL Server 2016.
• This is used to move cold data from on-premises instances
directly into the cloud with only a few clicks.
• Eliminates the need to manually create archiving procedures
that move data out of production db and into archive db.
• Requires an Azure subscription.
• Download “Data Migration Assistant” to identify candidate tables to stretch.
https://docs.microsoft.com/en-us/sql/sql-server/stretch-database/stretch-database

Stretch Database: Limitations
• Limitations for Stretch-enabled tables
• Uniqueness is not enforced for UNIQUE constraints and PRIMARY KEY constraints in the Azure table that contains the
migrated data.
• You can't UPDATE or DELETE rows that have been migrated, or rows that are eligible for migration.
• You can't INSERT rows into a Stretch-enabled table on a linked server.
• You can't create an index for a view that includes Stretch-enabled tables.
• Filters on SQL Server indexes are not propagated to the remote table.
• Limitations that currently prevent you from enabling Stretch for a table
• Tables that have more than 1,023 columns or more than 998 indexes
• FileTables or tables that contain FILESTREAM data
• Tables that are replicated, or that are actively using Change Tracking or Change Data Capture
• Memory-optimized tables
• Data types: text, ntext, image, timestamp, sql_variant, XML, and CLR data types including geometry, geography, hierarchyid
• Computed columns
• Default constraints and check constraints
• Foreign key constraints that reference the table.
• Full text indexes, XML indexes, Spatial indexes, Indexed views
https://docs.microsoft.com/en-us/sql/sql-server/stretch-database/limitations-for-stretch-database

• Today’s event was sponsored by:
Microsoft Malta : location and refreshments
Gaming Innovation Group : Parking vouchers
• The Tech-Spark community requires your help. Sponsor an event by providing a meeting place,
refreshments, and why not, deliver a session! Feel free to contact us should you want to help.

Contact Us
Ralph Attard
raland@raland.net
Tech Spark
http://www.tech-spark.com
https://www.facebook.com/techsparkmalta
