ASML's customer slides from SplunkLive! Utrecht 2017, presented by Richard van der Ven, Architect, Litho Computing Platform, ASML.
Attendees of SplunkLive! Utrecht learnt how more than 14,000 enterprises, government agencies, universities and service providers in over 110 countries use Splunk software to deepen business and customer understanding, mitigate cybersecurity risk, prevent fraud, improve service performance and reduce cost.
1. Richard van der Ven
21-11-2017
Alert & Health Monitoring
A Splunk and ITSI implementation
Public
Functional Cluster Architect Litho Computing Platform
Slide 2
• Who am I?
• Environment
• Alert & Health Monitoring
• Wrap-up
Slide 3
Who am I?
Worked at ASML for 16 years
• 13 years - IT Infrastructure
• DBA, Storage, ITIL processes
• IT Management
• 3 years - Functional Cluster Architect
• Litho Computing Platform
• Alert & Health Monitoring
Richard van der Ven
Slide 4
ASML makes the machines for making chips
• Lithography is the critical tool
for producing chips
• All of the world’s top chip
makers are our customers
• 2016 sales: €6.8 bln
• More than 17,000 employees
(FTE) worldwide
Slide 5
A global presence
Offices in over 60 cities in 16 countries worldwide
[Map: regional headcounts of 9,600, 3,900, and 3,600 employees]
Source: ASML Q1 2017
Slide 6
A tightly integrated set of solutions for scaling and yield
Image
Compute/SW
Measure
Slide 7
Litho Computing Platform
• A cloud infra stack, called the Litho Computing Platform, designed for high availability and scalability
• Virtual machines are abstracted from the hardware: HW may change or break, virtual machines stay up (highly available)
• It centralizes all applications in one place
• It can serve 40 Scanners & 50 YieldStars
• It runs as a dark site at ASML customers
An extendable HW platform that scales with application needs
Slide 8
Availability is key
| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|----------------|-------------------|---------------------|-------------------|
| 90% ("one nine") | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 97% | 10.96 days | 21.6 hours | 5.04 hours |
| 98% | 7.30 days | 14.4 hours | 3.36 hours |
| 99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.60 hours | 50.4 minutes |
| 99.8% | 17.52 hours | 86.23 minutes | 20.16 minutes |
| 99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.56 minutes | 5.04 minutes |
| 99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 0.605 seconds |

*Based on a 30-day month.
NOTE: This is the availability of the functionality that we sell, as perceived by the customer: infra + HW + virtualization layer + application + connectivity.
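Each row in the table is just the availability shortfall applied to the period length; a quick sketch:

```python
def downtime(availability_pct: float, period_hours: float) -> float:
    """Hours of allowed downtime at a given availability over a period."""
    return period_hours * (1.0 - availability_pct / 100.0)

# "Three nines" over a 365-day year: 8.76 hours
hours_per_year = downtime(99.9, 365 * 24)

# ... and over a 30-day month: 43.2 minutes
minutes_per_month = downtime(99.9, 30 * 24) * 60
```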
Slide 9
Some history
2011
• 1st LCP in the field
• Started with monitoring infrastructure components with Nagios
• Supported by PHP development
• Changing knowledge experts on the custom-built setup
End of 2015
• The need for improved monitoring and local analysis came up after some situations where:
• Engineers didn't notice application components failing
• It took a long time to get requested log files via customer approval
• It took several iterations to get the log files needed
Timeline
Slide 10
Alert & Health Monitoring
• Avoid unplanned downtime
• Reduce planned maintenance times
• A smart and robust monitoring solution platform to enable live monitoring
The AHM product will enable CS engineers to
• Identify whether LCP operation is at risk
• Diagnose the root cause of incidents
• Perform proactive maintenance
• Plan capacity
• Verify the configuration state
Why Alert and Health Monitoring?
Slide 11
Alert & Health Monitoring
Alerting
• Alert when a KPI exceeds its threshold
Monitoring / quick troubleshooting
• Health monitoring
HW/SW/FW/environment health, including network infra, databases, and OSes
• Configuration reporting
Exact HW/SW/FW config and changes, including licenses and serial numbers
Analysis / debugging
• Timeline reconstruction
Chronological list of major events and threshold alerts
• Diagnostics deep dive
• Data downloading
Key features
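The "alert when a KPI exceeds its threshold" feature can be pictured as a scheduled Splunk search; this is a sketch only, with index, sourcetype, and field names hypothetical rather than ASML's actual ones:

```
index=lcp sourcetype=ahm:metrics kpi=cpu_load
| stats avg(value) AS avg_value BY host
| where avg_value > 90
```

Saved as an alert, a search like this fires (e.g. sends an email) whenever any host's average value exceeds the threshold over the search window.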
Slide 12
Alert & Health Monitoring
Support flow & organization
[Diagram: AHM connects the customer's LCP (App 1, App 2) with ASML local equipment support and ASML GSC equipment support. AHM sends alert emails and a monitoring status report; GSC troubleshoots via AHM, aligns an action plan, and performs remote intervention over VPN under virtual escort by the customer.]
Slide 13
AHM High-level Architecture
[Diagram: Data onboarding: AHM data collection scripts and forwarders gather data from all layers (hardware, virtualization, operating systems, middleware, Litho apps).
Data processing: a central Alert and Health Monitoring instance (indexers + search head) serves alerting (email), monitoring / quick troubleshooting, and analysis / debugging.
Configuration and metrics for the setup come from the Config Manager and the AHM Configurator.]
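The onboarding path in the diagram (collection scripts, then forwarders, then the central instance) maps onto standard Splunk forwarder configuration; a minimal sketch, in which the paths, sourcetype, index, and host names are illustrative, not ASML's actual ones:

```ini
# inputs.conf on a monitored node: watch the litho application logs
[monitor:///var/log/lithoapp/*.log]
sourcetype = litho:app
index = lcp

# outputs.conf on the same node: ship events to the central AHM instance
[tcpout]
defaultGroup = ahm_central

[tcpout:ahm_central]
server = ahm-central.example:9997
```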
Slide 15
Alert & Health Monitoring
• Lead time
• Importance of log files for monitoring
• What determines application availability
• Changing requirements from stakeholders
• Service model
• ITSI implementation
Challenges
Slide 16
Alert & Health Monitoring
• Service Model
• Not usable out of the box
• Generated with our own tool
• UI: not usable
• ITSI dashboards: not configurable to our needs
• Glass tables: static, where we need flexibility due to variable applications
• Event alerting
• Implementing customer-specific thresholds
Challenges with Splunk core and ITSI
Slide 17
Alert & Health Monitoring
• Service Model
• Generated with our own configuration tool
• 'Manually' regenerated at every change to applications
• Using mind maps for discussions
• UI
• Dashboards built with tables and hyperlinks
• New drill-down feature looks promising
• Event alerting
• Aligning ITSI queries with core Splunk
• Implementing customer-specific thresholds
How did we solve it?
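One way to implement the customer-specific thresholds the slide mentions (the mechanism here is assumed, and the lookup, index, and field names are hypothetical) is to keep the thresholds in a Splunk lookup rather than hard-coding them into each search:

```
index=lcp sourcetype=ahm:metrics
| stats latest(value) AS value BY host kpi
| lookup kpi_thresholds kpi OUTPUT threshold severity
| where value > threshold
```

A per-site kpi_thresholds lookup could then be regenerated by the same configuration tool that builds the service model, keeping thresholds aligned with each customer's use case.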
Slide 18
Splunk and ITSI implementation
Easy and clear drill-down dashboards
Users are non-IT
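Drill-down dashboards built from "tables and hyperlinks", as described earlier, can be expressed in Splunk Simple XML; a minimal sketch, with the target dashboard name, index, and field names hypothetical:

```xml
<dashboard>
  <label>LCP component health (sketch)</label>
  <row>
    <panel>
      <table>
        <search>
          <query>index=lcp sourcetype=ahm:health | stats latest(status) AS status BY component</query>
          <earliest>-15m</earliest>
          <latest>now</latest>
        </search>
        <!-- clicking a row jumps to a per-component detail dashboard -->
        <drilldown>
          <link target="_blank">component_detail?form.component=$row.component$</link>
        </drilldown>
      </table>
    </panel>
  </row>
</dashboard>
```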
Slide 19
Alert & Health Monitoring
• Easier access to log files, metrics and application data
• Less time spent on regular service checks
• Combine application and infra data
• Unforeseen side effects of changes diagnosed in the field and in internal testing
• More confidence in actual system state
• Memory leak issue spotted in field, before impact
Benefits
ASML makes the machines for making chips.
A sophisticated copy machine based on the photography principle: use a negative to create a print, multiple times on one wafer, with multiple layers stacked on top of each other.
Feedback loop: initially once per 3 days, moving to inline measurement.
Structures on chips become more complex, and the process requires a short feedback loop to improve the yield.
Prevent a malfunctioning chip from being detected only at the end of the process.
To support the process, SW was developed to improve the yield via a sophisticated feedback loop.
Dark site: hardly any remote access; customers worried about their IP, working on new products, competitive edge
Not managed by IT
Lack of one single interface
Access required to multiple sources (metrics, log files, HW)
CS engineer = non-IT engineer; knows a lot about the scanners and the SW used
Alert & Health Monitoring SW provides health status of LCP HW/SW and Application SW
Central GSC organization remotely monitors the performance of the LCP IBL using a frequently sent status report (~ 6 – 12 times/hr)
AHM sends out an alert in case of a critical or high prio event
GSC starts remote troubleshooting via the AHM application
GSC aligns the action plan with the local team and customer before doing a remote intervention
GSC executes the action plan under virtual escort by the customer
The customer can follow live and audit what is happening via VPN
Configuration: needed because of dynamic ASML customer configs -> not provided by Splunk
Each customer has a different HW and SW config
Different thresholds due to different use cases at customers (dev vs HVM/prod)
Different management compared to a central IT environment
Standardisation important
Initial plan to go to 1st customer: Jan 2017; actual: March 2017
Customer release: October 2017
The service model mechanism in ITSI is powerful, but it requires a lot of iterative sessions to explain to stakeholders.
The UI of ITSI is not usable by our non-IT users.
It takes time to create traction with stakeholders and end users.
Monitoring is difficult.
Changing the way of working (WoW) is difficult.
Show progress
• Scrum demos
Spread the word
• Arrange training sessions
Involve end users
• Ask them what would help them
Start to work with it
• Don't wait until the product is finished
You need consultancy to implement ITSI