Platform Observability and Infrastructure Closed Loops

Sunku Ranganath
https://www.linkedin.com/in/sunkuranganath/

Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, Intel Resource Director Technology, Intel Run Sure Technology, Intel Node Manager, are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
Copyright © 2017 Intel Corporation. All rights reserved.

Acknowledgements
Timothy Verrall
John Browne
Damien Power
Emma Collins
Jean Christophe Bouche
Krzysztof Kepka

Agenda
Platform Observability
Service Assurance
Closed Loop Automation

Platform Observability & Service Assurance (SA)
• Observability: Ability to expose the state of the platform to ensure Service Level Objectives are met
• Observability Considerations: Logging, Metrics & Tracing
• Communications Service Provider Context:
  • Care about overall Service Assurance
  • Both Monitoring & Observability are important
• Service Assurance:
  • Application of policies to ensure services meet a pre-defined service quality level
  • FCAPS (Fault, Configuration, Accounting, Performance & Security) attributes on existing network infrastructure
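The idea of applying policies so that a service meets a pre-defined quality level can be sketched in a few lines. This is a minimal illustration only: the `ServiceLevelObjective` class, the `violations` helper and the metric names are hypothetical and do not come from any product or standard mentioned in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceLevelObjective:
    """One hypothetical policy: a metric name and the limit it must respect."""
    metric: str
    limit: float
    higher_is_worse: bool = True  # e.g. packet loss; False for e.g. throughput


def violations(slos: List[ServiceLevelObjective], sample: Dict[str, float]) -> List[str]:
    """Return the metrics in `sample` that break their objective."""
    broken = []
    for slo in slos:
        value = sample.get(slo.metric)
        if value is None:
            continue  # metric not reported; this sketch simply skips it
        if slo.higher_is_worse and value > slo.limit:
            broken.append(slo.metric)
        elif not slo.higher_is_worse and value < slo.limit:
            broken.append(slo.metric)
    return broken
```

A monitoring layer would feed each telemetry sample to `violations` and hand any non-empty result to the presentation or provisioning layer.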

Three Key Elements of SA Platform
• Monitoring: Enabling deeper management and tracking of specific service levels
• Presentation: Reporting to enable reaction to service level changes
• Provisioning: Enabling configuration of service levels based on workload or service priority
Figure: Service Assurance elements mapping to ETSI NFV Model

Collectd Monitoring Agent
Collectd: Why & What
• Statistics collection daemon
• Uses read plugins to collect metrics and write plugins to send them to an endpoint
• Open source
• Widely adopted
• Configurable Collection Interval
Various Plugin types:
• Input/Output
• Binding Plugins
• Logging Plugins
• Notification Plugins
• Other: Network plugin that can both send and receive metrics
Figure: Collectd Architecture
https://github.com/collectd/collectd
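The read/write plugin split and the collection interval can be pictured with a toy model. The sketch below is not collectd's API; `MiniCollector` and its methods are hypothetical names chosen only to show how read callbacks produce values and write callbacks forward them once per interval.

```python
import time
from typing import Callable, Dict, List

Metric = Dict[str, float]


class MiniCollector:
    """Toy stand-in for a collection daemon (hypothetical, not collectd itself)."""

    def __init__(self) -> None:
        self._readers: List[Callable[[], Metric]] = []
        self._writers: List[Callable[[Metric], None]] = []

    def register_read(self, fn: Callable[[], Metric]) -> None:
        self._readers.append(fn)

    def register_write(self, fn: Callable[[Metric], None]) -> None:
        self._writers.append(fn)

    def run_once(self) -> None:
        """One collection interval: call every reader, fan out to every writer."""
        for read in self._readers:
            values = read()
            for write in self._writers:
                write(values)

    def run(self, cycles: int, interval_s: float) -> None:
        """Repeat run_once `cycles` times, sleeping `interval_s` between cycles."""
        for i in range(cycles):
            self.run_once()
            if i < cycles - 1:
                time.sleep(interval_s)
```

Registering a reader that returns a dictionary of values and a writer such as `list.append` is enough to see metrics flow from source to sink.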

Platform Telemetry Exposure & Integration
Figure: Platform telemetry exposure and integration across the NFVI (compute, network and storage, their virtualised counterparts, and a hypervisor with RT/SA KVM4NFV extensions) and the OpenStack* VIM.
• Platform sources: PMU^ counters, NIC counters, vSwitch counters, RAS, hypervisor/container counters, Intel® RDT (CMT, CAT, MBM, CDP), power, RAID/NVMe*, Intel® Rapid Storage Technology
• Intel® Run Sure Technology: MCA*, PCIe AER; Resilient System Technology; Resilient Memory Technology (SDDC, DDDC+1, Mirroring)
• Out of band telemetry: Intel® Management Engine, Intel® Node Manager, IPMI, Redfish
• Collection: Collectd plugins for Intel® Infrastructure Management Technologies
• Common / standard open APIs: SNMP API, Enterprise MIB, Perfmon MIB, NFV Platform MIB, SYSLOG, IPFIX, sFlow, VES plugin, Kafka, Prometheus
• OpenStack*: Ceilometer, Aodh, Vitrage, Congress, Gnocchi
• Consumers: monitoring/analytics systems (includes NetFlow collectors), vendor SA middleware, container monitoring solutions (Prometheus, …)
• Fast path: triggers on events or counters; VM stall detection / RT stall detection; local corrective action (e.g. working/protect failover)
• Slow path: periodic pull, 1/15 mins
• Legend: In progress; Done/Integrated; Standard Open APIs; Intel Components; Open Platform Collectors
PMU^: Performance Monitoring Unit
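The difference between the fast path (triggers on events or counters) and the slow path (periodic pull) can be shown with a small sketch. The function name, the threshold and the polling period below are assumptions for illustration; the slide does not specify how either path is implemented.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, float]  # (tick, counter value)


def route_samples(
    samples: Iterable[Sample], threshold: float, poll_every: int
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into fast-path triggers and slow-path periodic reports.

    Fast path: every sample is checked and one at or above `threshold`
    is flagged immediately for local corrective action.
    Slow path: only every `poll_every`-th tick is forwarded to the
    monitoring/analytics system, regardless of its value.
    """
    fast: List[Sample] = []
    slow: List[Sample] = []
    for tick, value in samples:
        if value >= threshold:
            fast.append((tick, value))
        if tick % poll_every == 0:
            slow.append((tick, value))
    return fast, slow
```

In this model a spike that falls between two polls never reaches the slow path; only the fast-path trigger sees it. This illustrates why both paths appear side by side on the slide.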

Multiple Closed Loops
Figure: Telemetry-driven feedback loops spanning the Design, Analyze, Plan & Provision, Optimize, Monitor and Orchestrate stages.
• Offline feedback loop (offline processing). Use cases: capacity planning, peering planning, cache placement, …
• Near-real-time and real-time feedback loops (online processing). Use cases: service assurance, security operations, …; traffic engineering (network optimization), demand placement, workload placement, …
• Real-time and near-real-time loops are automated
Source: https://pndablog.com/2017/06/05/feedback-loops-and-closed-loop-control/

Networking Closed Loops – High Level Architecture
Figure: Independent closed loops for SDN, Cloud & Virtual Management, and Platform.
• Tiers: Application, Control, Infrastructure
• Application: business applications (setting of policy), analytics systems
• Control: SDN/NMS, network services, cloud and virtual management, MANO, EMS, VNFM
• Infrastructure: platform resources, forwarding plane with traffic passing through interfaces, local platform agent
• Platform telemetry goes to distribution or storage or …
• Policy based provisioning closes the control loops

Closed Loops – Networking Stack
Figure: Closed loop reaction time and domain knowledge across the networking stack.
• Tiers: Services, Management & Control, Infrastructure
• Layers, top to bottom: Application Layer; Network Data Analytics; Orchestration, Management, Policy; Cloud & Virtual Management; Network Control; Operating Systems; Data Path; Hardware / Disaggregated Hardware
• Closed loop reaction time: Micro-seconds/Milliseconds, Seconds/Mins, Mins/Hours/Days
• Domain knowledge: from Local to Platform up to End to End
• Loop functions: HW Enabled Loops (e.g. RAS), Enforce DP Loops (HA etc.), Enforce Local Policy, Enforce Network Domain Policy, Map Policies, Deployment Policies, Analyze/Plan Policies
High Speed Control Loops are Close to the Platform

Closed Loops – Business Cases
Figure: Business use cases driving policy based provisioning control loops over platform telemetry.
• Business use cases: Improved Customer Experience, Cloud Optimization & Efficiency, Edge Placement, Service Healing, Differentiated QoS, Service Optimization, Energy Optimization, Capacity Optimization, Cloud Configurations, Security (Threat Detection, Threat Response)
• Business applications and analytics: AI/ML/DL
• Orchestration: NFV Orchestrator (NFVO) [e.g. ONAP/OSM], VNF Manager (VNFM)
• Telemetry interfaces: OpenStack*, Kubernetes*, Bare Metal
• Platform(s): feature exposure, provisioning, telemetry; local policy enforcement agent(s) for local dynamic control
• Platform technologies: Intel® Infrastructure Management Tech, Intel® RDT, Intel® Run Sure Technology, power, monitoring/storage, collectd
• Legend: Actively Contributing

Closed Loop Resiliency Demo
Goal: Maximize Service Availability of Virtual Border Network Gateway (vBNG) in memory error scenario
Figure 1: Service Recovery Timeline
Figure 1 Source: OpenSAF and VMware from the Perspective of High Availability, Ali Nikzad, Ferhat Khendek (Concordia University), Maria Toeroe (Ericsson), SVM’2013, Zurich, October 2013
Figure 2: Closed Loop Resiliency Demo with Kubernetes
More Details on Demo: https://networkbuilders.intel.com/social-hub/video/closed-loop-platform-automation-workload-resiliency-demo
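The slides do not describe the decision logic of the demo, so the following is only a generic sketch of how a resiliency loop might turn memory error telemetry into an action. The `Action` values, the threshold and the number of consecutive intervals are all hypothetical.

```python
from enum import Enum
from typing import Iterable


class Action(Enum):
    NONE = "none"
    MIGRATE = "migrate_workload"  # hypothetical remediation, e.g. move the vBNG


def decide(
    corrected_errors_per_interval: Iterable[int],
    threshold: int = 10,
    consecutive: int = 3,
) -> Action:
    """Return MIGRATE once the error count stays at or above `threshold`
    for `consecutive` intervals in a row; otherwise return NONE.

    Requiring a streak avoids reacting to a single noisy interval.
    """
    streak = 0
    for count in corrected_errors_per_interval:
        streak = streak + 1 if count >= threshold else 0
        if streak >= consecutive:
            return Action.MIGRATE
    return Action.NONE
```

Acting on a sustained rise in corrected errors, before an uncorrectable error takes the service down, is one way such a loop could shorten the recovery timeline.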

Use Cases & Gaps
• 5G Network Slicing
• Demand based Energy Savings
• Workload Resiliency
• Noisy Neighbor Detection & Avoidance
• And many more…
Figure: 5G Network Slicing Architecture
Source: https://www.researchgate.net/figure/5G-network-slicing-architecture_fig1_324175599
Gaps, Ongoing Work
• Telemetry tagging
• Policy delivery & management across VIM to NFVI
• ONAP, OPNFV, ETSI, etc.

Summary
• Platform Observability & Monitoring play a crucial role in ensuring service assurance
• Platform telemetry heavily differentiates the services, alongside application telemetry
• Various levels of closed loops are required for autonomous networks
• Real-time & Near Real-time closed loops require automation
Call To Action
• Collaborate through Open Source Communities
• Figure out use cases of interest
• Leverage relevant infrastructure telemetry

Service Assurance “Phased” Evolution for NFV/SDN
• Strategic Framework for SA “Phase” Evolution
  • Phase 1 - Equivalence (Virtualized + Interworking with existing management systems)
  • Phase 2 - Automated by MANO + SDN Controller
  • Phase 3 - Predict failures and adapt automatically
Phase 1: Platform Service Assurance - Equivalence
• Platform Service Assurance supporting:
  • Intel® Run Sure Technology
  • Cache Config & Monitoring
  • BIOS Config & Reporting
  • Fastpath DPDK Interface Reporting
  • Fastpath DPDK Keep Alive
  • Virtual Switch Health
  • Host Health
Phase 2: Platform Service Assurance (MANO + SDN Controller)
• VIM and above, support:
  • Enable RAS Technologies
  • Enable DPDK and Keep Alive
  • Enable Host Health
  • Policy Based Provisioning
Phase 3: Predictive Platform Service Assurance
• Predict Failures and Adapt Automatically:
  • Automated and adaptive to changes notified in metrics
  • Closed loop and dynamic SA environment
Evolving from Equivalence towards NFV/SDN Automation
Slide labels: Never Stops, Solution of the day, Under Construction
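Phase 3 speaks of predicting failures from changes in metrics. The slides name no prediction method, so the sketch below swaps in the simplest possible one, a least-squares straight line through equally spaced samples, purely to make the idea concrete. The function name and the notion of a single `limit` are assumptions.

```python
from typing import Optional, Sequence


def intervals_until_limit(history: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many more intervals pass before a metric reaches `limit`.

    Fits a least-squares line to `history` (one sample per interval) and
    extrapolates. Returns None when there are fewer than two samples or
    the trend is flat or falling; returns 0.0 if the fitted line has
    already reached the limit.
    """
    n = len(history)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope  # sample index where the line hits limit
    return max(0.0, crossing - (n - 1))
```

A closed loop could compare the returned estimate with the time a remediation needs and act only when the margin becomes too small.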

Platform Plugins Contributed by Intel
Plugin Domain                        Description
Intel® Run Sure Technology / RAS     Mcelog, PCIe AER, logparser: Metrics & notifications pertaining to Intel Run Sure Technology
Intel® RDT                           Intel® Resource Director Technology (Intel® RDT) related metrics
Virt                                 Libvirt related metrics
OVS                                  Ovs_stats, ovs_events: Metrics related to Open vSwitch
DPDK                                 Dpdk_stats, dpdk_events, hugepages: Data Plane Development Kit (DPDK) related metrics
OpenStack*                           Gnocchi, Aodh: Integration into OpenStack projects
Cloud                                Write_Kafka, Write_Prometheus, VES: Integration into various cloud platforms
Storage                              RAID, NVMe*: Storage related metrics
Power/Energy                         CPUFreq, Turbostat: Frequency & power related metrics
Platform                             IPMI, Redfish, PMU: Out of Band metrics & platform counters
Infrastructure Metrics are as Crucial as Application Metrics
Details: https://github.com/collectd/collectd

OPNFV Barometer
Barometer Strategy:
• Ensure platform metrics/events are accessible through open industry standard interfaces
• Demonstrate platform technologies can be monitored, consumed and actioned in real time
One Click Install:
• Easy install/configuration for customers
• One command to install Collectd/InfluxDB/Grafana
Three container approach for Collectd:
• Stable Container: latest stable branch
• Master Container: up to date with master
• Experimental Container: cherry pick features of interest
Source: https://opnfv-barometer.readthedocs.io/en/latest/release/userguide/docker.userguide.html
