The document provides a legal disclaimer for Sunku Ranganath's LinkedIn profile. It states that no intellectual property rights are granted and disclaims all warranties. It also notes that the information provided is subject to change and that customers should contact their Intel representative for the latest specifications. The document lists Intel as a trademark and acknowledges several individuals.
5. Platform Observability & Service Assurance (SA)
• Observability: Ability to expose state of the platform to ensure Service Level Objectives are met
• Observability Considerations: Logging, Metrics & Tracing
• Communications Service Provider Context:
  • Care about overall Service Assurance
  • Both Monitoring & Observability are important
• Service Assurance
  • Application of policies to ensure services meet a pre-defined service quality level
  • FCAPS (Fault, Configuration, Accounting, Performance & Security) attributes on existing network infrastructure
6. Three Key Elements of SA Platform
• Monitoring: Enabling deeper management and tracking of specific service levels
• Presentation: Reporting to enable reaction to service level changes
• Provisioning: Enable configuration of service levels based on workload or service priority
Figure: Service Assurance elements mapping to ETSI NFV Model
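The three elements above can be sketched as three small functions. This is only an illustrative sketch: the `ServiceLevel` type, the `latency_ms` metric, and the priority-to-limit values are all hypothetical and do not come from the slides or from any real SA platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceLevel:
    """A single service-level target (hypothetical shape)."""
    metric: str
    max_value: float


def provision(priority: str) -> ServiceLevel:
    """Provisioning: pick a service level from the workload priority."""
    limits = {"high": 5.0, "low": 50.0}  # assumed example limits
    return ServiceLevel("latency_ms", limits[priority])


def monitor(level: ServiceLevel, samples: List[float]) -> List[float]:
    """Monitoring: return the samples that exceed the service level."""
    return [s for s in samples if s > level.max_value]


def present(level: ServiceLevel, violations: List[float]) -> str:
    """Presentation: summarize violations so someone (or something) can react."""
    if not violations:
        return f"{level.metric}: OK"
    return f"{level.metric}: {len(violations)} violation(s), worst {max(violations)}"


level = provision("high")
violations = monitor(level, [2.0, 7.5, 4.0, 9.0])
report = present(level, violations)
print(report)
```

Keeping the three roles separate mirrors the slide: provisioning decides the target, monitoring measures against it, and presentation only reports.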
7. Collectd Monitoring Agent
Collectd: Why & What
• Statistics collection daemon
• Uses read plugins to collect metrics and write plugins to send them to an endpoint
• Open source
• Widely adopted
• Configurable Collection Interval
Various Plugin types:
• Input/Output
• Binding Plugins
• Logging Plugins
• Notification Plugins
• Other: Network plugin with both send/receive feature
Figure: Collectd Architecture
https://github.com/collectd/collectd
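The read/write plugin idea can be illustrated with a toy collector. This is a sketch of the pattern only, not collectd's actual API: `MiniCollector`, `register_read`, `register_write`, and the metric names are hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple

Metric = Tuple[str, float]  # (metric name, value); hypothetical shape


class MiniCollector:
    """Toy stand-in for a statistics collection daemon (not collectd's API)."""

    def __init__(self, interval_s: float = 10.0) -> None:
        self.interval_s = interval_s  # configurable collection interval
        self.readers: List[Callable[[], List[Metric]]] = []
        self.writers: List[Callable[[List[Metric]], None]] = []

    def register_read(self, fn: Callable[[], List[Metric]]) -> None:
        self.readers.append(fn)

    def register_write(self, fn: Callable[[List[Metric]], None]) -> None:
        self.writers.append(fn)

    def run_once(self) -> List[Metric]:
        """One collection cycle: call every reader, fan out to every writer."""
        batch: List[Metric] = []
        for read in self.readers:
            batch.extend(read())
        for write in self.writers:
            write(batch)
        return batch


def read_load() -> List[Metric]:
    # Hypothetical read plugin returning a fixed sample
    return [("cpu.load", 0.42)]


def read_mem() -> List[Metric]:
    # Hypothetical read plugin returning a fixed sample
    return [("memory.used_bytes", 1024.0)]


sink: List[Metric] = []  # stands in for an endpoint such as a database

collector = MiniCollector(interval_s=5.0)
collector.register_read(read_load)
collector.register_read(read_mem)
collector.register_write(sink.extend)
collector.run_once()
print(sink)
```

A real daemon would call `run_once` every `interval_s` seconds; the loop is left out here so the example stays deterministic.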
8. Platform Telemetry Exposure & Integration
Figure: Platform telemetry exposure and integration across the NFVI (Compute, Network, Storage; Virtualised Compute, Virtualised Network, Virtualised Storage; Hypervisor [RT/SA KVM4NFV extensions])
• Fast Path: Triggers on events or counters; VM Stall Detection / RT Stall Detection; Local Corrective Action (e.g. Working/Protect Failover)
• Slow Path: Periodic Pull 1/15 mins to Monitoring/Analytics Systems
• Telemetry sources: PMU^ counters, NIC counters, vSwitch counters, RAS, Hypervisor/Container Counters, IPFIX, sFlow, SYSLOG, SNMP API, Enterprise MIB, Perfmon MIB, NFV Platform MIB
• Intel® Infrastructure Management Technologies: Intel® Run Sure Technology (MCA*, PCIe AER, Resilient System Technology, Resilient Memory Technology, SDDC, DDDC+1, Mirroring), Intel® RDT (CMT, CAT, MBM, CDP), Power, Intel® Node Manager, Intel® Management Engine, Intel® Rapid Storage Technology (RAID/NVMe*), IPMI, Redfish, Out Of Band Telemetry
• Collectd Plugins feeding Common / Standard Open APIs: OpenStack* VIM (Ceilometer, Aodh, Vitrage, Congress, Gnocchi), VES Plugin, Kafka, Prometheus
• Consumers: Monitoring/Analytics Systems (includes NetFlow Collectors), Vendor SA Middleware, Container Monitoring Solutions (Prometheus ….)
• Legend: In progress; Done/Integrated; Intel Components; Open Platform Collectors; Standard Open APIs
PMU^: Performance Monitoring Unit
10. Networking Closed Loops – High Level Architecture
Figure: High-level architecture of networking closed loops
• Layers: Infrastructure, Control, Application
• Components: Business Applications (Setting of Policy); SDN/NMS (Network Services); Cloud and Virtual Management (MANO, EMS, VNFM); Platform Resources and Forwarding Plane (Interfaces, Traffic); Local Platform Agent; Platform Analytics Systems
• Platform Telemetry: Telemetry distribution or storage or …..
• Policy Based Provisioning Control Loops
Independent Closed Loops: SDN, Cloud & Virtual Mgt, Platform
11. Closed Loops – Networking Stack
Figure: Closed loops across the networking stack
• Stack layers (grouped as Services, Management & Control, Infrastructure): Application Layer; Network Data Analytics; Orchestration, Management, Policy; Cloud & Virtual Management; Network Control; Operating Systems; Data Path; Hardware/Disaggregated Hardware
• Closed Loop Reaction Time: Micro-seconds/Milliseconds; Seconds/Mins; Mins/Hours/Days
• Domain Knowledge: from Local to Platform up to End to End
• Loop and policy roles: HW Enabled Loops (e.g. RAS); Enforce DP Loops (HA etc.); Enforce Local Policy; Enforce Network Domain Policy; Deployment Policies; Map Policies; Analyze/Plan Policies
High Speed Control Loops are Close to the Platform
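The split between fast loops near the platform and slower loops higher in the stack can be sketched as follows. This is a hedged illustration only: `LocalAgent`, `Orchestrator`, the `error_limit` policy, and the "failover" action are hypothetical names, and the tightening rule is an arbitrary example rather than anything prescribed by the slides.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class LocalAgent:
    """Fast loop: reacts to each sample on the platform itself."""
    error_limit: int  # hypothetical local policy
    actions: List[str] = field(default_factory=list)

    def on_sample(self, corrected_errors: int) -> None:
        if corrected_errors > self.error_limit:
            # Immediate local corrective action, no round trip to the orchestrator
            self.actions.append("failover")


@dataclass
class Orchestrator:
    """Slow loop: analyzes a window of samples and plans a new local policy."""
    window: List[int] = field(default_factory=list)

    def collect(self, corrected_errors: int) -> None:
        self.window.append(corrected_errors)

    def plan(self, agent: LocalAgent) -> None:
        # Example rule: tighten the local limit if the average is high
        if self.window and mean(self.window) > agent.error_limit / 2:
            agent.error_limit = max(1, agent.error_limit - 1)
        self.window.clear()


agent = LocalAgent(error_limit=5)
orchestrator = Orchestrator()

for sample in [1, 2, 7, 3]:
    agent.on_sample(sample)       # fast path: per-sample reaction
    orchestrator.collect(sample)  # slow path: just accumulate

orchestrator.plan(agent)          # slow path: periodic analysis
print(agent.actions, agent.error_limit)
```

The fast loop only enforces the policy it was given; the slow loop is the only place where the policy itself changes, which matches the "enforce local policy" versus "analyze/plan policies" roles in the figure.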
12. Closed Loops – Business Cases
Figure: Closed loops mapped to business cases
• Business Use Cases: Improved Customer Experience; Cloud Optimization & Efficiency; Edge Placement; Service Healing; Differentiated QoS; Service Optimization; Energy Optimization; Capacity Optimization; Cloud Configurations; Security (Threat Detection, Threat Response)
• Analytics: AI/ML/DL; Monitoring/Storage
• Management: Business Applications; NFV Orchestrator (NFVO) [e.g. ONAP/OSM]; VNF Manager (VNFM)
• Telemetry I/F: OpenStack*, Kubernetes*, Bare Metal
• Platform(s): Feature Exposure, Provisioning, Telemetry; collectd; Local Policy Enforcement Agent(s) for Local Dynamic Control
• Intel® Infrastructure Management Tech: Intel® RDT, Power, Intel® Run Sure Technology
• Policy Based Provisioning Control Loops
• Legend: Actively Contributing
13. Closed Loop Resiliency Demo
Goal: Maximize service availability of a virtual Broadband Network Gateway (vBNG) in a memory error scenario
Figure 1: Service Recovery Timeline
Figure 1 Source: OpenSAF and VMware from the Perspective of High Availability - Ali Nikzad, Ferhat Khendek (Concordia University), Maria Toeroe (Ericsson), SVM’2013 – Zurich – October 2013
Figure 2: Closed Loop Resiliency Demo with Kubernetes
More Details on Demo: https://networkbuilders.intel.com/social-hub/video/closed-loop-platform-automation-workload-resiliency-demo
14. Use Cases & Gaps
• 5G Network Slicing
• Demand based Energy Savings
• Workload Resiliency
• Noisy Neighbor Detection & Avoidance
• And many more….
Figure: 5G Network Slicing Architecture
Source: https://www.researchgate.net/figure/5G-network-slicing-architecture_fig1_324175599
Gaps, Ongoing Work
• Telemetry tagging
• Policy delivery & management across VIM to NFVI
• ONAP, OPNFV, ETSI, etc.
15. Summary
• Platform observability & monitoring play a crucial role in ensuring service assurance
• Platform telemetry, alongside application telemetry, strongly differentiates services
• Various levels of closed loops are required for autonomous networks
• Real-time & near-real-time closed loops require automation
Call To Action
• Collaborate through open source communities
• Figure out use cases of interest
• Leverage relevant infrastructure telemetry
17. Service Assurance “Phased” Evolution for NFV/SDN
• Strategic Framework for SA “Phase” Evolution
Phase 1 - Equivalence (Virtualized + Interworking with existing management systems)
Phase 2 - Automated by MANO+SDN Controller
Phase 3 - Predict failures and adapt automatically
Platform Service Assurance - Equivalence
• Platform Service Assurance supporting:
  • Intel Run Sure Technologies
  • Cache Config & Monitoring
  • BIOS Config & Reporting
  • Fastpath DPDK Interface Reporting
  • Fastpath DPDK Keep Alive
  • Virtual Switch Health
  • Host Health
Platform Service Assurance (MANO + SDN Controller)
• VIM and above, support:
  • Enable RAS Technologies
  • Enable DPDK and Keep Alive
  • Enable Host Health
  • Policy Based Provisioning
Predictive Platform Service Assurance
• Predict Failures and Adapt Automatically:
  • Automated and adaptive to changes reported in metrics
  • Closed loop and dynamic SA environment
Phase 1 Phase 2 Phase 3
Evolving from Equivalence towards NFV/SDN Automation
Never Stops Solution of the day Under Construction
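One very simple way to picture the Phase 3 idea of predicting failures from metrics is to fit a straight line to recent samples and estimate when it would cross a threshold. The sketch below is an assumption-laden illustration, not the method used by any Intel or NFV component: the function name `predict_breach`, the sample values, and the threshold are all hypothetical, and real predictive systems would use far richer models.

```python
from typing import List, Optional


def predict_breach(samples: List[float], threshold: float) -> Optional[float]:
    """Estimate how many collection intervals remain before a metric crosses
    `threshold`, using an ordinary least-squares line through `samples`.

    Returns None when there are too few samples or the trend is not rising.
    """
    n = len(samples)
    if n < 2:
        return None

    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))

    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling trend: no breach predicted

    intercept = y_mean - slope * x_mean
    x_cross = (threshold - intercept) / slope
    # Intervals remaining after the most recent sample (index n - 1)
    return max(0.0, x_cross - (n - 1))


# Hypothetical metric rising by 10 units per interval
remaining = predict_breach([10.0, 20.0, 30.0, 40.0], threshold=70.0)
print(remaining)
```

A closed loop could use the returned estimate to act before the threshold is reached, instead of reacting only after a violation.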
18. Platform Plugins Contributed by Intel
Plugin Domain: Description
• Intel® Run Sure Technology / RAS: Mcelog, PCIe AER, logparser. Metrics & notifications pertaining to Intel Run Sure Technology
• Intel® RDT: Intel® Resource Director Technology (Intel® RDT) related metrics
• Virt: Libvirt related metrics
• OVS: Ovs_stats, ovs_events. Metrics related to Open vSwitch
• DPDK: Dpdk_stats, dpdk_events, hugepages. Data Plane Development Kit (DPDK) related metrics
• OpenStack*: Gnocchi, Aodh. Integration in OpenStack projects
• Cloud: Write_Kafka, Write_Prometheus, VES. Integration into various cloud platforms
• Storage: RAID, NVMe*. Storage related metrics
• Power/Energy: CPUFreq, Turbostat. Frequency & power related metrics
• Platform: IPMI, RedFish, PMU. Out of Band metrics & platform counters
Infrastructure Metrics are as Crucial as Application Metrics
Details: https://github.com/collectd/collectd
19. OPNFV Barometer
Barometer Strategy:
• Ensure platform metrics/events are accessible through open industry standard interfaces.
• Demonstrate platform technologies can be monitored, consumed and actioned in real time
One Click Install:
• Easy install/configuration for customers
• One command to install Collectd/InfluxDB/Grafana
• Three container approach for Collectd:
  • Stable Container: latest stable branch
  • Master Container: up to date with master
  • Experimental Container: cherry pick features of interest
Source: https://opnfv-barometer.readthedocs.io/en/latest/release/userguide/docker.userguide.html
Editor's Notes
Monitoring:
Platform & Network counters to track usage and performance against KPIs
Provide open standard components and interfaces
Presentation:
Human & Dynamic intervention for threshold violations or failures
Support for the detection of trending against configured parameters and the enabling of capacity plan changes based on those trends
Provisioning:
Includes allocating or partitioning platform resources
Intersects with every layer of the NFV framework
Capabilities have to be built in for easy interoperability and smooth consumption of telemetry data
Open standard interfaces play an important role
Given that existing FCAPS systems are widely deployed and represent significant investments from a business perspective, the approach taken is to introduce a phased approach for Service Assurance, starting with equivalence with existing systems then adding MANO+SDN integration and then evolving to a fully automated and predictive management and orchestration system.