Machine Learning for Capacity Management

2. Dave Page
● EDB (CTO Office)
○ VP & Chief Architect, Database
Infrastructure
● PostgreSQL
○ Core Team
○ pgAdmin Lead Developer
3. © Copyright EnterpriseDB Corporation, 2020. All rights reserved.
What is Capacity Management?
“Capacity management's primary goal is to ensure that information technology resources are
right-sized to meet current and future business requirements in a cost-effective manner.”
https://en.wikipedia.org/wiki/Capacity_management
4.
• When will I run out of disk space?
• How many rows will be in this table in 6 months' time?
• At current growth rate, how many servers will I need to support my users in three years?
“Classic” Capacity Management Tools
Focus on trends
5.
• Calculate a trend from collected metrics
• Extrapolate to a point in time
• Extrapolate to a threshold value
• Plan based on results
“Classic” Capacity Management Tools
Linear Trend Analysis
6.
• Capacity requirements are usually more complex than upward or downward trends:
• Seasonality
• User numbers may be higher in the daytime on weekdays than at night or the weekend
• System load may be higher on the last day of the month due to month-end batch processing
• Noise:
• Metrics are rarely completely stable
• Noise can affect linear trend analysis in ways we might not see
“Classic” Capacity Management Tools
Limitations
7.
• Utility or cloud-based computing can easily scale up or down:
• Scale up in anticipation of load spikes
• Scale down when capacity is not required to save money
• Minimise unnecessary fluctuations in scale (e.g. an unexpected load spike immediately following a scale down)
• Noise can (sometimes) make it hard to see or understand actual trends
• “Cut through” the noise
“Classic” Capacity Management Tools
Limitations - why do we care?
9.
What is TensorFlow?
“TensorFlow is a free and open-source software library for machine learning. It can be used across
a range of tasks but has a particular focus on training and inference of deep neural networks.”
https://en.wikipedia.org/wiki/TensorFlow
10.
TensorFlow Installation
To use from within PostgreSQL:
• Install for the interpreter used by pl/python3
• In this case, the EDB LanguagePack with PostgreSQL 13 on macOS
Assuming Python 3.5+ is installed:
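The install command itself did not survive the transcript; a minimal sketch, assuming a standard pip setup (with the EDB LanguagePack, substitute the full path to its interpreter — the exact path is installation-specific):

```shell
# Install TensorFlow for the Python interpreter that pl/python3 uses.
# With LanguagePack, replace "python3" with the full path to its python3.
python3 -m pip install tensorflow
```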
11.
What will we build?
• Multiple one-dimensional convolutional layers
• Set of filters with trainable parameters
• Increasing dilation to detect seasonal patterns
• Lower layers learn short term patterns
• Higher layers learn long term patterns
• Originally designed for audio generation
• Ideal for time series prediction
A deep neural network in the WaveNet architecture
https://paperswithcode.com/method/wavenet
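The layer stack described above can be sketched in Keras (the filter count, kernel size and dilation rates here are assumptions, not the talk's exact values):

```python
from tensorflow import keras

def build_wavenet(n_filters=32, kernel_size=2, dilations=(1, 2, 4, 8, 16)):
    """A WaveNet-style stack of 1D convolutions with doubling dilation.

    Lower layers (small dilation) learn short-term patterns; higher
    layers (large dilation) learn progressively longer-term, seasonal
    patterns.
    """
    inputs = keras.Input(shape=(None, 1))   # variable-length univariate series
    x = inputs
    for rate in dilations:
        x = keras.layers.Conv1D(filters=n_filters,
                                kernel_size=kernel_size,
                                padding="causal",    # never look into the future
                                activation="relu",
                                dilation_rate=rate)(x)
    # A 1x1 convolution maps each time step back to a single predicted value.
    outputs = keras.layers.Conv1D(filters=1, kernel_size=1)(x)
    return keras.Model(inputs, outputs)
```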
12.
• Thankfully, little preparation is required for time series prediction
• Time series data consists of:
• A series of recorded numeric values or metrics...
• ...spread at equal intervals over a period of time
Input Data
Careful data preparation is usually critical in Machine Learning
13.
• The input data is split into 4 arrays:
• Training timestamps
• Training metrics
• Validation timestamps
• Validation metrics
• The neural network doesn’t need the timestamps; they’re only used when plotting results
• Training data is used to train the network
• Validation data is used to measure how effective the training is
Data Preparation
Training & Validation
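A minimal sketch of that four-way split (the 80/20 ratio, names and toy data are assumptions):

```python
import numpy as np

def split_series(timestamps, metrics, train_fraction=0.8):
    """Split a metric series into the four arrays described above.

    The timestamps are carried along only for plotting; the network
    itself sees just the metric values.
    """
    split = int(len(metrics) * train_fraction)
    return (timestamps[:split], metrics[:split],    # training
            timestamps[split:], metrics[split:])    # validation

t = np.arange(100)                 # e.g. one sample every five minutes
y = np.sin(t / 10) + 0.01 * t      # toy metric with seasonality and trend
train_t, train_y, valid_t, valid_y = split_series(t, y)
```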
15.
• filters: The number of neurons in the layer, which corresponds to the required output dimension.
• kernel_size: The length of the 1D convolutional window, i.e. the number of time steps used to compute each output value.
• strides: The number of time steps the filter moves as it processes the input.
• padding: Determines how data is padded so that inputs at the beginning and end of the time series can be processed (with a kernel size of 2, additional values are needed to process the first and last time steps). Causal padding adds values to the beginning of the time series so that we don't end up predicting future values based on padding added to the future.
• activation: The activation function helps the network handle non-linear and interaction effects (where different variables in the network can be affected by one another). Rectified Linear Unit (ReLU) is a simple function that works well here in many cases.
Model Parameters
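For illustration, one Keras `Conv1D` layer using the parameters above (the values shown are placeholders, not the talk's exact settings):

```python
from tensorflow import keras

# A single convolutional layer with the parameters described above.
layer = keras.layers.Conv1D(filters=32,        # neurons / output dimension
                            kernel_size=2,     # time steps per output value
                            strides=1,         # move one step at a time
                            padding="causal",  # pad at the start only
                            activation="relu")
```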
17.
• Once the model is defined (and compiled), it must be trained
• Training runs through multiple iterations or epochs, using our training data set
• Weights and biases etc. are adjusted with each epoch to tune the network
• The validation data set is used to assess the accuracy of the model
• A loss function (keras.losses.Huber() in this case) is used to measure the accuracy
Training
Supervised Learning
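A self-contained sketch of that training step (the toy data, tiny model, epoch count and one-step-ahead target are assumptions):

```python
import numpy as np
from tensorflow import keras

# Toy series: the network learns to predict each next value.
series = np.sin(np.arange(200) / 10).astype("float32")
x_train = series[:-1].reshape(1, -1, 1)   # inputs: all but the last step
y_train = series[1:].reshape(1, -1, 1)    # targets: shifted one step ahead

inputs = keras.Input(shape=(None, 1))
hidden = keras.layers.Conv1D(16, 2, padding="causal", activation="relu")(inputs)
outputs = keras.layers.Conv1D(1, 1)(hidden)
model = keras.Model(inputs, outputs)

# The Huber loss measures accuracy; weights and biases are adjusted
# with each epoch to tune the network.
model.compile(loss=keras.losses.Huber(), optimizer="adam")
history = model.fit(x_train, y_train, epochs=5, verbose=0)
```

In practice the epoch count is far higher, and `validation_data=(valid_x, valid_y)` is passed to `fit()` so accuracy is measured against held-out data after each epoch.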
19.
• If we train for too many epochs, overfitting is likely to occur
• The network will learn the data we have, and likely fail to work well with new data
• We prevent this with Early Stopping:
• Training is completed when no significant improvement has been made
Early Stopping
Overfitting occurs when the network learns the data rather than the patterns
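Keras provides this behaviour as a built-in callback; a sketch (the patience and delta values are assumptions):

```python
from tensorflow import keras

# Stop training when validation loss has not improved by at least
# min_delta for `patience` consecutive epochs, keeping the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=10,
                                           min_delta=1e-4,
                                           restore_best_weights=True)
# model.fit(..., validation_data=(valid_x, valid_y), callbacks=[early_stop])
```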
20.
• Every epoch we assess the accuracy of the model
• If, and only if, improvement is measured over all previous epochs, we save the model
• Once training is complete, we load the last saved model
Checkpoints
Training doesn’t always improve!
Example of training & validation RMSE vs. training epoch for a regression network
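That save-only-on-improvement logic maps onto Keras' `ModelCheckpoint` callback (the file name is an assumption):

```python
from tensorflow import keras

# Save the model only when validation loss improves on all previous
# epochs; after training, reload the last (i.e. best) saved model.
checkpoint = keras.callbacks.ModelCheckpoint("best_model.h5",
                                             monitor="val_loss",
                                             save_best_only=True)
# model.fit(..., validation_data=(valid_x, valid_y), callbacks=[checkpoint])
# best_model = keras.models.load_model("best_model.h5")
```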
21.
• Increase the number of epochs, if early stopping isn’t kicking in
• Adjust the learning rate
• Try a different optimiser (though Adam is usually best)
• Experiment with the number of layers and filters (neurons) in the model
• Try a different activation function
• Adjust the kernel size and stride
• Try different ratios of training to validation data
Tuning the model
We can experiment with various parameters to fine tune the model. Experience helps!
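For example, the learning rate is adjusted through the optimiser (the value shown is just a common starting point, not a recommendation from the talk):

```python
from tensorflow import keras

# An explicit learning rate on the Adam optimiser: lower values learn
# more slowly but more stably; higher values may overshoot minima.
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
# model.compile(loss=keras.losses.Huber(), optimizer=optimizer)
```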
23.
Generated Data
• A dataset was generated in Python, with:
• Seasonality
• Trend
• Noise
Initial testing based on generated data
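A sketch of generating such a dataset (the amplitudes, period and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(1000)
seasonality = 10 * np.sin(2 * np.pi * t / 50)   # repeating cycle
trend = 0.05 * t                                # steady upward growth
noise = rng.normal(0, 1, len(t))                # random jitter
series = seasonality + trend + noise
```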
24.
• The model is used to predict or forecast future data, and validated against our validation set
• Once we’re happy the network is accurate, we can predict future data
Generated Data
Results
25.
• Workload generated with pgWorkload (https://github.com/EnterpriseDB/pgworkload)
• Number of rows in pg_stat_activity recorded every five minutes
• Workload designed to simulate increasing and decreasing numbers of users over a week
Simulated Workload
Metrics recorded from PostgreSQL
27.
Thought for the future
Monitoring and Alerting
● “Classic” alerting works based on thresholds:
○ “Tell me when the number of active users exceeds 200”
● But “normal” can vary based on time or date:
○ Peak time on Monday: 180 users is normal
○ Midnight on Sunday: >2 users is abnormal
● Solution:
○ Dynamically configure thresholds based on predicted usage patterns
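A sketch of that idea (the three-sigma margin is an assumption): compare each observation against the model's prediction for that moment, rather than against a fixed number.

```python
def dynamic_threshold(predicted, residual_std, k=3):
    """Alert threshold: the predicted value plus k standard deviations
    of the model's historical prediction error."""
    return predicted + k * residual_std

def is_abnormal(observed, predicted, residual_std, k=3):
    """Alert only when the observation exceeds the dynamic threshold."""
    return observed > dynamic_threshold(predicted, residual_std, k)

# At Monday's peak the model might predict ~180 users, so 200 observed
# stays under the threshold; at midnight on Sunday the prediction is
# near zero, so the same observation would trigger an alert.
```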
28. Conclusions
Using Machine Learning techniques we can more accurately predict metrics with seasonality and trend:
● Scale up in anticipation of demand
● Scale down to save money
● Plan for additional resource requirements
● Detect abnormal conditions taking into account time of day/week/month etc.
29.
Useful information
pgWorkload:
https://github.com/EnterpriseDB/pgworkload
Experimental Code:
https://github.com/dpage/ml-experiments/tree/main/time_series
Blogs:
https://www.enterprisedb.com/blogs/dave.page/published-articles
https://pgsnake.blogspot.com/
Contact:
dave.page@enterprisedb.com
Thank You