ASML's customer slides from SplunkLive! Utrecht 2017, presented by Richard van der Ven, Architect, Litho Computing Platform, ASML.
Attendees of SplunkLive! Utrecht learnt how more than 14,000 enterprises, government agencies, universities and service providers in over 110 countries use Splunk software to deepen business and customer understanding, mitigate cybersecurity risk, prevent fraud, improve service performance and reduce cost.
1. Richard van der Ven
21-11-2017
Alert & Health Monitoring
A Splunk and ITSI implementation
Public
Functional Cluster Architect Litho Computing Platform
Slide 2
• Who am I?
• Environment
• Alert & Health Monitoring
• Wrap-up
Slide 3
Who am I?
Worked at ASML for 16 years
• 13 years - IT Infrastructure
• DBA, Storage, ITIL processes
• IT Management
• 3 years - Functional Cluster Architect
• Litho Computing Platform
• Alert & Health Monitoring
Richard van der Ven
Slide 4
ASML makes the machines for making chips
• Lithography is the critical tool
for producing chips
• All of the world’s top chip
makers are our customers
• 2016 sales: €6.8 bln
• More than 17,000 employees
(FTE) worldwide
Slide 5
A global presence
Offices in over 60 cities in 16 countries worldwide
[Map: regional headcounts of 9,600, 3,900, and 3,600 employees]
Source: ASML Q1 2017
Slide 6
A tightly integrated set of solutions for scaling and yield
Image
Compute/SW
Measure
Slide 7
Litho Computing Platform
• A cloud infra stack, called the Litho Computing Platform, designed for high availability and scalability
• Virtual machines are abstracted from the hardware: HW may change or break, virtual machines stay up (highly available)
• It centralizes all applications in one place
• It can serve 40 Scanners & 50 YieldStars
• It runs as a dark site at ASML customers
An extendable HW platform that scales with application needs
Slide 8
Availability is key
| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|----------------|-------------------|---------------------|-------------------|
| 90% ("one nine") | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 97% | 10.96 days | 21.6 hours | 5.04 hours |
| 98% | 7.30 days | 14.4 hours | 3.36 hours |
| 99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.60 hours | 50.4 minutes |
| 99.8% | 17.52 hours | 86.23 minutes | 20.16 minutes |
| 99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.56 minutes | 5.04 minutes |
| 99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 0.605 seconds |

*Based on a 30-day month.
NOTE: This is the availability of the functionality that we sell, as perceived by the customer: infra + HW + virtualization layer + application + connectivity.
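Each row in the table is just the availability shortfall applied to the period length; a quick sketch:

```python
def downtime(availability_pct: float, period_hours: float) -> float:
    """Hours of allowed downtime at a given availability over a period."""
    return period_hours * (1.0 - availability_pct / 100.0)

# "Three nines" over a 365-day year: 8.76 hours
hours_per_year = downtime(99.9, 365 * 24)

# ... and over a 30-day month: 43.2 minutes
minutes_per_month = downtime(99.9, 30 * 24) * 60
```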
Slide 9
Some history
2011
• 1st LCP in the field
• Started with monitoring infrastructure components with Nagios
• Supported by PHP development
• Changing knowledge experts on the custom-built setup
End of 2015
• The need for improved monitoring and local analysis came up after some situations where:
• Engineers didn't notice application components failing
• It took a long time to get requested log files via customer approval
• It took several iterations to get the log files needed
Timeline
Slide 10
Alert & Health Monitoring
• Avoid unplanned downtime
• Reduce planned maintenance times
• A smart and robust monitoring solution platform to enable live monitoring
The AHM product will enable CS engineers to
• Identify whether LCP operation is at risk
• Diagnose the root cause of incidents
• Perform proactive maintenance
• Plan capacity
• Verify the configuration state
Why Alert and Health Monitoring?
Slide 11
Alert & Health Monitoring
Alerting
• Alert when a KPI exceeds its threshold
Monitoring / quick troubleshooting
• Health monitoring
HW/SW/FW/environment health, including network infra, databases, and OSes
• Configuration reporting
Exact HW/SW/FW config and changes, including licenses and serial numbers
Analysis / debugging
• Timeline reconstruction
Chronological list of major events and threshold alerts
• Diagnostics deep dive
• Data downloading
Key features
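The "alert when a KPI exceeds its threshold" feature can be pictured as a scheduled Splunk search; this is a sketch only, with index, sourcetype, and field names hypothetical rather than ASML's actual ones:

```
index=lcp sourcetype=ahm:metrics kpi=cpu_load
| stats avg(value) AS avg_value BY host
| where avg_value > 90
```

Saved as an alert, a search like this fires (e.g. sends an email) whenever any host's average value exceeds the threshold over the search window.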
Slide 12
Alert & Health Monitoring
Support flow & organization
[Diagram: AHM connects the customer's LCP (App 1, App 2) with ASML local equipment support and ASML GSC equipment support. AHM sends alert emails and a monitoring status report; GSC troubleshoots via AHM, aligns an action plan, and performs remote intervention over VPN under virtual escort by the customer.]
Slide 13
AHM High-level Architecture
[Diagram: Data onboarding: AHM data collection scripts and forwarders gather data from all layers (hardware, virtualization, operating systems, middleware, Litho apps).
Data processing: a central Alert and Health Monitoring instance (indexers + search head) serves alerting (email), monitoring / quick troubleshooting, and analysis / debugging.
Configuration and metrics for the setup come from the Config Manager and the AHM Configurator.]
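The onboarding path in the diagram (collection scripts, then forwarders, then the central instance) maps onto standard Splunk forwarder configuration; a minimal sketch, in which the paths, sourcetype, index, and host names are illustrative, not ASML's actual ones:

```ini
# inputs.conf on a monitored node: watch the litho application logs
[monitor:///var/log/lithoapp/*.log]
sourcetype = litho:app
index = lcp

# outputs.conf on the same node: ship events to the central AHM instance
[tcpout]
defaultGroup = ahm_central

[tcpout:ahm_central]
server = ahm-central.example:9997
```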
Slide 15
Alert & Health Monitoring
• Lead time
• Importance of log files for monitoring
• What determines application availability
• Changing requirements from stakeholders
• Service model
• ITSI implementation
Challenges
Slide 16
Alert & Health Monitoring
• Service Model
• Not usable out of the box
• Generated with our own tool
• UI: not usable
• ITSI dashboards: not configurable to our needs
• Glass tables: static, where we need flexibility due to variable applications
• Event alerting
• Implementing customer-specific thresholds
Challenges with Splunk core and ITSI
Slide 17
Alert & Health Monitoring
• Service Model
• Generated with our own configuration tool
• 'Manually' regenerated at every change to applications
• Using mind maps for discussions
• UI
• Dashboards built with tables and hyperlinks
• New drill-down feature looks promising
• Event alerting
• Aligning ITSI queries with core Splunk
• Implementing customer-specific thresholds
How did we solve it?
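One way to implement the customer-specific thresholds the slide mentions (the mechanism here is assumed, and the lookup, index, and field names are hypothetical) is to keep the thresholds in a Splunk lookup rather than hard-coding them into each search:

```
index=lcp sourcetype=ahm:metrics
| stats latest(value) AS value BY host kpi
| lookup kpi_thresholds kpi OUTPUT threshold severity
| where value > threshold
```

A per-site kpi_thresholds lookup could then be regenerated by the same configuration tool that builds the service model, keeping thresholds aligned with each customer's use case.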
Slide 18
Splunk and ITSI implementation
Easy and clear drill-down dashboards
Users are non-IT
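Drill-down dashboards built from "tables and hyperlinks", as described earlier, can be expressed in Splunk Simple XML; a minimal sketch, with the target dashboard name, index, and field names hypothetical:

```xml
<dashboard>
  <label>LCP component health (sketch)</label>
  <row>
    <panel>
      <table>
        <search>
          <query>index=lcp sourcetype=ahm:health | stats latest(status) AS status BY component</query>
          <earliest>-15m</earliest>
          <latest>now</latest>
        </search>
        <!-- clicking a row jumps to a per-component detail dashboard -->
        <drilldown>
          <link target="_blank">component_detail?form.component=$row.component$</link>
        </drilldown>
      </table>
    </panel>
  </row>
</dashboard>
```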
Slide 19
Alert & Health Monitoring
• Easier access to log files, metrics and application data
• Less time spent on regular service checks
• Combine application and infra data
• Unforeseen side effects of changes diagnosed in the field and in internal testing
• More confidence in actual system state
• Memory leak issue spotted in field, before impact
Benefits
ASML makes the machines for making chips.
A sophisticated copy machine based on the photography principle: use a negative to create a print, multiple times on one wafer, with multiple layers stacked on top of each other.
Feedback loop: initially once per 3 days, moving to inline measurement.
Structures on chips become more complex, and the process requires a short feedback loop to improve the yield.
Prevent a malfunctioning chip from being detected only at the end of the process.
To support the process, SW was developed to improve the yield via a sophisticated feedback loop.
Dark site: hardly any remote access; customers worried about their IP, working on new products, competitive edge
Not managed by IT
Lack of one single interface
Access required to multiple sources (metrics, log files, HW)
CS engineer = non-IT engineer; knows a lot about the scanners and the SW used
Alert & Health Monitoring SW provides health status of LCP HW/SW and Application SW
Central GSC organization remotely monitors the performance of the LCP IBL using a frequently sent status report (~ 6 – 12 times/hr)
AHM sends out an alert in case of a critical or high prio event
GSC starts remote troubleshooting via the AHM application
GSC aligns the action plan with the local team and customer before doing a remote intervention
GSC executes the action plan under virtual escort by the customer
The customer can follow live and audit what is happening via VPN
Configuration: needed because of dynamic ASML customer configs -> not provided by Splunk
Each customer has a different HW and SW config
Different thresholds due to different use cases at customers (dev vs HVM/prod)
Different management compared to a central IT environment
Standardisation important
Initial plan to go to 1st customer: Jan 2017; actual: March 2017
Customer release: October 2017
The service model mechanism in ITSI is powerful, but it requires a lot of iterative sessions to explain to stakeholders.
The UI of ITSI is not usable by our non-IT users.
It takes time to create traction with stakeholders and end users.
Monitoring is difficult.
Changing the way of working (WoW) is difficult.
Show progress
• Scrum demos
Spread the word
• Arrange training sessions
Involve end users
• Ask them what would help them
Start to work with it
• Don't wait until the product is finished
You need consultancy to implement ITSI