Machine Learning for Capacity Management
Dave Page
December 2020
Dave Page
● EDB (CTO Office)
○ VP & Chief Architect, Database Infrastructure
● PostgreSQL
○ Core Team
○ pgAdmin Lead Developer
© Copyright EnterpriseDB Corporation, 2020. All rights reserved.
What is Capacity Management?
“Capacity management's primary goal is to ensure that information technology resources are
right-sized to meet current and future business requirements in a cost-effective manner.”
https://en.wikipedia.org/wiki/Capacity_management
“Classic” Capacity Management Tools
Focus on trends
• When will I run out of disk space?
• How many rows will be in this table in 6 months' time?
• At current growth rate, how many servers will I need to support my users in three years?
“Classic” Capacity Management Tools
Linear Trend Analysis
• Calculate a trend from collected metrics
• Extrapolate to a point in time
• Extrapolate to a threshold value
• Plan based on results
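The steps above can be sketched in a few lines of plain Python. This is only an illustration of the classic approach, not code from the talk: the disk usage figures, the 500 GB threshold and the function names are all assumptions, and the threshold helper assumes an upward trend towards a higher limit.

```python
def fit_linear_trend(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope, intercept


def value_at(slope, intercept, time):
    """Extrapolate the trend to a point in time."""
    return slope * time + intercept


def time_to_threshold(slope, intercept, threshold):
    """Extrapolate the trend to a threshold; None if it is never reached.

    Assumes an upward trend towards a threshold above the current level.
    """
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical metric: disk usage in GB, sampled once a day for five days.
days = [0, 1, 2, 3, 4]
disk_gb = [100.0, 102.0, 104.0, 106.0, 108.0]

slope, intercept = fit_linear_trend(days, disk_gb)
usage_day_30 = value_at(slope, intercept, 30)                 # point in time
day_disk_full = time_to_threshold(slope, intercept, 500.0)    # threshold value
```

With this perfectly linear sample the fit is exact; real metrics with seasonality and noise are where the approach starts to struggle.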
“Classic” Capacity Management Tools
Limitations
• Capacity requirements are usually more complex than upward or downward trends:
  • Seasonality:
    • User numbers may be higher in the daytime on weekdays than at night or the weekend
    • System load may be higher on the last day of the month due to month-end batch processing
  • Noise:
    • Metrics are rarely completely stable
    • Noise can affect linear trend analysis in ways we might not see
“Classic” Capacity Management Tools
Limitations - why do we care?
• Utility or cloud-based computing can easily scale up or down:
  • Scale up in anticipation of load spikes
  • Scale down when capacity is not required to save money
  • Minimise unnecessary fluctuations in scale (e.g. an unexpected load spike immediately following a scale down)
• Noise can (sometimes) make it hard to see or understand actual trends:
  • “Cut through” the noise
Machine Learning
● Python
● TensorFlow 2
● pl/python3 (optional)
What is TensorFlow?
“TensorFlow is a free and open-source software library for machine learning. It can be used across
a range of tasks but has a particular focus on training and inference of deep neural networks.”
https://en.wikipedia.org/wiki/TensorFlow
TensorFlow Installation
To use from within PostgreSQL:
• Install for the interpreter used by pl/python3
• In this case, the EDB LanguagePack with PostgreSQL 13 on macOS
Assuming Python 3.5+ is installed:
What will we build?
A deep neural network in the WaveNet architecture
• Multiple one-dimensional convolutional layers
• Set of filters with trainable parameters
• Increasing dilation to detect seasonal patterns
• Lower layers learn short-term patterns
• Higher layers learn long-term patterns
• Originally designed for audio analysis
• Ideal for time series prediction
https://paperswithcode.com/method/wavenet
Input Data
Careful data preparation is usually critical in Machine Learning
• Thankfully, little is required for time series prediction
• Time series data consists of:
  • A series of recorded numeric values or metrics...
  • ...spread equidistantly over a period of time
Data Preparation
Training & Validation
• The input data is split into 4 arrays:
  • Training timestamps
  • Training metrics
  • Validation timestamps
  • Validation metrics
• The neural network doesn’t need the timestamps; they’re only used when plotting results
• Training data is used to train the network
• Validation data is used to measure how effective the training is
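A minimal sketch of the four-way split, assuming a chronological split in which the most recent samples are held back for validation. The 80/20 ratio, the sample data and the function name are assumptions for illustration, not values from the talk.

```python
def split_series(timestamps, metrics, train_fraction=0.8):
    """Split a time series chronologically into training and validation parts."""
    if len(timestamps) != len(metrics):
        raise ValueError("timestamps and metrics must be the same length")
    split = int(len(metrics) * train_fraction)
    return (timestamps[:split], metrics[:split],
            timestamps[split:], metrics[split:])


# Hypothetical data: ten equally spaced samples of one metric.
timestamps = list(range(10))
metrics = [float(10 + t) for t in timestamps]

train_t, train_x, valid_t, valid_x = split_series(timestamps, metrics)
```

The split is not shuffled: validation data comes after the training data in time, which mirrors how the model will be used to forecast.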
Creating the Model
Model Parameters
• filters: The number of filters (neurons) in the layer, which corresponds to the required output dimension.
• kernel_size: The length of the 1D convolutional window, i.e. the number of time steps used to compute each output value.
• strides: The number of time steps the filter moves each time as it processes the input.
• padding: Determines how the data is padded so that inputs at the beginning and end of the time series can be processed. With a kernel size of 2, each output needs two input values, so the first time step needs one additional value before it. Causal padding adds values only to the beginning of the time series, so that we don't end up predicting values based on padding added to the future.
• activation: The activation function helps the network handle non-linear and interaction effects (where different variables in the network can be affected by one another). Rectified Linear Unit, or ReLU, is a simple function that works well in many cases.
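To make these parameters concrete, here is a plain-Python sketch of what a single filter does with causal padding. It illustrates the semantics only and is not how Keras implements Conv1D (which also handles many filters, biases and batches); the dilation argument is included because the architecture relies on increasing dilation, and the kernel weights are hypothetical.

```python
def causal_conv1d(series, kernel, stride=1, dilation=1):
    """Single-filter 1D convolution with causal (left-only) zero padding.

    Each output at time t is a weighted sum of the inputs at
    t, t - dilation, t - 2 * dilation, ... so no future value is used.
    """
    kernel_size = len(kernel)
    pad = (kernel_size - 1) * dilation   # zeros added before the first step
    padded = [0.0] * pad + list(series)
    outputs = []
    for t in range(0, len(series), stride):
        total = 0.0
        for k in range(kernel_size):
            # The last kernel weight applies to the current step;
            # earlier weights look back in multiples of the dilation.
            total += kernel[k] * padded[t + k * dilation]
        outputs.append(total)
    return outputs


def relu(values):
    """Rectified Linear Unit: negative values become zero."""
    return [max(0.0, v) for v in values]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [-1.0, 1.0]   # hypothetical weights: current value minus an earlier one

step_1 = causal_conv1d(series, kernel)               # compares with 1 step back
step_2 = causal_conv1d(series, kernel, dilation=2)   # compares with 2 steps back
```

Raising the dilation lets the same two-weight kernel compare values that are further apart, which is how higher layers can pick up longer-term patterns.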
Model Architecture
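A sketch of how such a model might be defined with the Keras API in TensorFlow 2. It is not taken from the talk's code: the number of filters, the list of dilation rates, the learning rate and the function name are assumptions chosen for illustration, while the layer arguments are the parameters described above.

```python
from tensorflow import keras


def build_wavenet_style_model(n_filters=32, dilation_rates=(1, 2, 4, 8, 16, 32)):
    """Stack causal Conv1D layers with doubling dilation, WaveNet-style."""
    model = keras.models.Sequential()
    # Input: batches of sequences of any length, one metric per time step.
    model.add(keras.Input(shape=(None, 1)))
    for rate in dilation_rates:
        model.add(keras.layers.Conv1D(
            filters=n_filters,
            kernel_size=2,
            strides=1,
            dilation_rate=rate,
            padding="causal",
            activation="relu",
        ))
    # A final 1x1 convolution maps the filters back to one value per step.
    model.add(keras.layers.Conv1D(filters=1, kernel_size=1))
    model.compile(
        loss=keras.losses.Huber(),
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        metrics=["mae"],
    )
    return model


model = build_wavenet_style_model()
```

Because every layer uses causal padding and a stride of 1, the output has the same number of time steps as the input.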
Training
Supervised Learning
• Once the model is defined (and compiled), it must be trained
• Training runs through multiple iterations, or epochs, using our training data set
• Weights and biases are adjusted during each epoch to tune the network
• The validation data set is used to assess the accuracy of the model
• A loss function (keras.losses.Huber() in this case) is used to measure the prediction error
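For reference, the Huber loss can be written out in plain Python. This sketch uses a delta of 1.0, which is the default for keras.losses.Huber(); the sample values are hypothetical. The loss is quadratic for small errors and linear for large ones, so a single noisy spike does not dominate the result.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = abs(actual - predicted)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(y_true)


# Hypothetical metrics; the last actual value is a noise spike.
actual = [10.0, 12.0, 11.0, 50.0]
predicted = [10.5, 12.0, 11.0, 12.0]

loss = huber_loss(actual, predicted)
```

The mean squared error for the same values would be roughly 361, almost all of it from the one spike.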
Training
Early Stopping
Overfitting occurs when the network learns the data rather than the patterns
• If we train for too many epochs, overfitting is likely to occur
• The network will learn the data we have, and likely fail to work well with new data
• We prevent this with Early Stopping:
  • Training is stopped when no significant improvement is being made
Checkpoints
Training doesn’t always improve!
• Every epoch we assess the accuracy of the model
• If, and only if, improvement is measured over all previous epochs, we save the model
• Once training is complete, we load the last saved model
[Figure: Example of training & validation RMSE vs. training epoch for a regression network]
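Early stopping and checkpointing can be sketched together by replaying a list of per-epoch validation losses. In Keras this job is normally done by the EarlyStopping and ModelCheckpoint callbacks; the plain-Python version below only shows the logic, and the patience value and loss figures are assumptions.

```python
def train_with_early_stopping(val_losses, patience=3, min_delta=0.0):
    """Replay per-epoch validation losses through early stopping and checkpointing.

    Returns (best_epoch, stopped_epoch): the epoch whose model was last
    saved, and the epoch at which training ended.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss - min_delta:
            # Improvement over all previous epochs: save a checkpoint.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # early stopping: no recent improvement
    return best_epoch, stopped_epoch


# Hypothetical validation losses: improving, then drifting upwards.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.39]

best_epoch, stopped_epoch = train_with_early_stopping(val_losses, patience=3)
```

Training ends three epochs after the best result, and the model saved at the best epoch is the one that would be loaded afterwards. A larger patience trades longer training for the chance of catching a late improvement.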
Tuning the model
We can experiment with various parameters to fine-tune the model. Experience helps!
• Increase the number of epochs, if early stopping isn’t kicking in
• Adjust the learning rate
• Try a different optimiser (though Adam is usually best)
• Experiment with the number of layers and filters (neurons) in the model
• Try a different activation function
• Adjust the kernel size and stride
• Try different ratios of training to validation data
Testing
● Generated Data
● Generated Workload
Generated Data
Initial testing based on generated data
• A dataset was generated in Python, with:
  • Seasonality
  • Trend
  • Noise
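A dataset of this kind could be generated along the following lines. This is a sketch rather than the generator used for the talk: the sine-wave seasonality, the Gaussian noise and every default value are assumptions.

```python
import math
import random


def generate_series(n_steps, period=24, slope=0.05, amplitude=10.0,
                    baseline=50.0, noise_level=2.0, seed=42):
    """Return n_steps values made of trend + seasonality + Gaussian noise."""
    rng = random.Random(seed)   # seeded so the series is reproducible
    series = []
    for t in range(n_steps):
        trend = baseline + slope * t
        seasonality = amplitude * math.sin(2 * math.pi * t / period)
        noise = rng.gauss(0.0, noise_level)
        series.append(trend + seasonality + noise)
    return series


# Four weeks of hypothetical hourly samples with a daily cycle.
series = generate_series(24 * 28)
```

Setting amplitude and noise_level to zero leaves only the trend, which is a quick way to check each component separately.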
Generated Data
Results
• The model is used to predict or forecast future data, and validated against our validation set
• Once we’re happy the network is accurate, we can predict future data
Simulated Workload
Metrics recorded from PostgreSQL
• Workload generated with pgWorkload (https://github.com/EnterpriseDB/pgworkload)
• Number of rows in pg_stat_activity recorded every five minutes
• Workload designed to simulate increasing and decreasing numbers of users over a week
Simulated Workload
Results
Thought for the future
Monitoring and Alerting
● “Classic” alerting works based on thresholds:
○ “Tell me when the number of active users exceeds 200”
● But “normal” can vary based on time or date:
○ Peak time on Monday: 180 users is normal
○ Midnight on Sunday: >2 users is abnormal
● Solution:
○ Dynamically configure thresholds based on predicted usage patterns
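One way this idea might look in code: derive a threshold for each time step from the predicted value, then compare observations against it. The 25% tolerance, the minimum margin and the user counts are assumptions for illustration only.

```python
def dynamic_thresholds(predicted, tolerance=0.25, minimum_margin=2.0):
    """Build a per-time-step alert threshold from predicted values."""
    return [p + max(p * tolerance, minimum_margin) for p in predicted]


def find_alerts(observed, thresholds):
    """Return the indexes where the observed value exceeds its threshold."""
    return [i for i, (o, t) in enumerate(zip(observed, thresholds)) if o > t]


# Hypothetical forecast of active users: Sunday midnight, then Monday peak.
predicted = [1.0, 180.0]
observed_normal = [2.0, 190.0]
observed_odd = [40.0, 190.0]

thresholds = dynamic_thresholds(predicted)
normal_alerts = find_alerts(observed_normal, thresholds)
odd_alerts = find_alerts(observed_odd, thresholds)
```

A fixed threshold of 200 users would stay silent in both cases, whereas the predicted thresholds flag 40 users at midnight on Sunday as abnormal.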
Conclusions
Using Machine Learning techniques we can more accurately predict metrics with seasonality and trend:
● Scale up in anticipation of demand
● Scale down to save money
● Plan for additional resource requirements
● Detect abnormal conditions taking into account time of day/week/month etc.
Useful information
pgWorkload:
https://github.com/EnterpriseDB/pgworkload
Experimental Code:
https://github.com/dpage/ml-experiments/tree/main/time_series
Blogs:
https://www.enterprisedb.com/blogs/dave.page/published-articles
https://pgsnake.blogspot.com/
Contact:
dave.page@enterprisedb.com
Thank You
