Richard van der Ven
21-11-2017
Alert & Health Monitoring
A Splunk and ITSI implementation
Public
Functional Cluster Architect, Litho Computing Platform
21 Nov 2017
Slide 2
Public
• Who am I?
• Environment
• Alert & Health Monitoring
• Wrap-up
Slide 3
Who am I?
Worked at ASML for 16 years
• 13 years - IT Infrastructure
  • DBA, Storage, ITIL processes
  • IT Management
• 3 years - Functional Cluster Architect
  • Litho Computing Platform
  • Alert & Health Monitoring
Richard van der Ven
Slide 4
ASML makes the machines for making chips
• Lithography is the critical tool for producing chips
• All of the world’s top chip makers are our customers
• 2016 sales: €6.8 billion
• More than 17,000 employees (FTE) worldwide
Slide 5
A global presence
Offices in over 60 cities in 16 countries worldwide
Employee counts shown on the map: 3,900; 9,600; 3,600
Source: ASML Q1 2017
Slide 6
A tightly integrated set of solutions for scaling and yield
Image
Compute/SW
Measure
Slide 7
Litho Computing Platform
• A cloud infra stack, called the Litho Computing Platform, designed for high availability and scalability
• Virtual machines are abstracted from the hardware
  → HW may change or break → virtual machines stay up → highly available
• It’s centralized → all applications in one place
• It can serve 40 scanners & 50 YieldStars
• It runs in a dark site at ASML customers
An extendable HW platform that scales with application needs
Slide 8
Availability is key
Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
95%                       18.25 days          36 hours              8.4 hours
97%                       10.96 days          21.6 hours            5.04 hours
98%                       7.30 days           14.4 hours            3.36 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.5%                     1.83 days           3.60 hours            50.4 minutes
99.8%                     17.52 hours         86.4 minutes          20.16 minutes
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.95%                    4.38 hours          21.6 minutes          5.04 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds
* Monthly figures correspond to a 30-day month.
NOTE: This is the availability of the functionality that we sell, as perceived by the customer,
thus Infra + HW + Virtualization layer + Application + Connectivity
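The arithmetic behind such a table can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year and a 30-day month (which reproduces, for example, the 72 hours per month listed for 90%). The `serial_availability` helper and the layer percentages in the demo are hypothetical: they only show how availability multiplies across layers that must all be up, under the simplifying assumption that layer failures are independent.

```python
# Downtime budget for a given availability target.
# Assumption: 365-day year and 30-day month.

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24
HOURS_PER_WEEK = 7 * 24


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of allowed downtime in a period at the given availability."""
    return (1 - availability_pct / 100) * period_hours


def serial_availability(layer_pcts: list[float]) -> float:
    """End-to-end availability (%) when every layer must be up.

    Assumes layer failures are independent, which is a simplification.
    """
    result = 1.0
    for pct in layer_pcts:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_hours(pct, HOURS_PER_YEAR)
        print(f"{pct}% -> {hours:.2f} hours of downtime per year")

    # Hypothetical figures for infra, HW, virtualization, application, connectivity.
    layers = [99.99, 99.99, 99.95, 99.9, 99.95]
    print(f"end-to-end: {serial_availability(layers):.3f}%")
```

The point of the second helper is that the customer-perceived figure can never be better than the weakest layer, and is usually worse.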
Slide 9
Some history
Timeline
2011
• 1st LCP in the field
• Started by monitoring infrastructure components with Nagios
• Supported by PHP development
• Changing knowledge experts on the custom-built setup
End of 2015
• The need for improved monitoring and local analysis came up after situations where:
  • Engineers didn’t notice application components failing
  • It took a long time to get requested log files via customer approval
  • It required several iterations to get the log files needed
Slide 10
Alert & Health Monitoring
Why Alert and Health Monitoring?
• Avoid unplanned downtime
• Reduce planned maintenance times
• A smart and robust monitoring solution to enable live monitoring
The AHM product will enable CS engineers to
• Identify if LCP operation is at risk
• Diagnose the root cause of incidents
• Perform proactive maintenance
• Plan capacity
• Verify configuration state
Slide 11
Alert & Health Monitoring
Key features
Alerting
• Alert when a KPI is over its threshold
Monitoring / quick troubleshooting
• Health monitoring
  HW/SW/FW/environment health, including network infra, databases, and OSes
• Configuration reporting
  Exact HW/SW/FW config and changes, including licenses and serial numbers
Analysis / debugging
• Timeline reconstruction
  Chronological list of major events and threshold alerts
• Diagnostics deep dive
• Data downloading
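Two of these features, threshold alerting and timeline reconstruction, can be illustrated with a small Python sketch. It is not the AHM implementation, which runs on Splunk and ITSI. The `Event` type, the function names, and the KPI name `cpu_pct` are all hypothetical and exist only to show the idea: turn KPI samples above a threshold into events, then merge them with other major events into one chronological list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    time: datetime
    source: str
    message: str


def threshold_alerts(samples, kpi, threshold):
    """Return an Event for every (time, value) sample above the threshold."""
    return [
        Event(t, kpi, f"{kpi}={value} over threshold {threshold}")
        for t, value in samples
        if value > threshold
    ]


def build_timeline(*event_lists):
    """Merge any number of event lists into one chronological list."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event.time)


if __name__ == "__main__":
    start = datetime(2017, 11, 21, 9, 0)
    # Hypothetical KPI samples, one every five minutes.
    cpu = [(start + timedelta(minutes=5 * i), v) for i, v in enumerate([60, 92, 97, 70])]
    major = [Event(start + timedelta(minutes=7), "vm", "VM restarted")]

    for event in build_timeline(threshold_alerts(cpu, "cpu_pct", 90), major):
        print(event.time.isoformat(), event.source, event.message)
```

Keeping alerts and major events in the same shape is what makes the merge trivial: the timeline is just a sort on the timestamp.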
Slide 12
Alert & Health Monitoring
Support flow & organization
[Diagram: support flow between the customer site and ASML]
Parties and systems: Customer, AHM, App 1, App 2, ASML local equipment support, ASML GSC equipment support
Flow labels: Monitoring Status Report, Alert email, Troubleshooting, VPN, Action Plan, Remote intervention, Under virtual escort by customer
Slide 13
AHM High-level Architecture
[Diagram: Alert and Health Monitoring architecture]
Data sources: Hardware, Virtualization, Operating Systems, Middleware, Litho apps
Components: AHM Data Collection Scripts, Forwarders, Index, Search Head, Config Manager, AHM Configurator
Stages: Data Onboarding, Data Processing
Data types: Configuration, Metrics
Functions: Alerting, Monitoring / Quick troubleshooting, Analysis / Debugging
Also labeled: @ Central Instance
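The architecture names data collection scripts feeding forwarders. As a rough idea of what such a script might look like, here is a minimal Python sketch that samples one metric and writes it as a JSON line, a format a log forwarder can typically pick up from a file or from stdout. The metric name, the field names, and the choice of disk usage are assumptions for illustration, not details from the presentation.

```python
import json
import shutil
import socket
import sys
import time


def collect_disk_metric(path="/"):
    """Sample disk usage for a path and return it as a flat event dict."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "time": time.time(),              # epoch seconds
        "host": socket.gethostname(),
        "metric": "disk_used_pct",        # hypothetical metric name
        "path": path,
        "value": round(used_pct, 2),
    }


def emit(event, stream=sys.stdout):
    """Write one event per line as JSON."""
    stream.write(json.dumps(event, sort_keys=True) + "\n")


if __name__ == "__main__":
    emit(collect_disk_metric())
```

One event per line with an explicit timestamp keeps the script independent of whatever indexes the data downstream.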
Slide 14
Alert & Health Monitoring
Key figures
1x
165-239 KPIs
< 5GB daily
6-10
500GB
~221 sourcetypes
77 hosts
~2125 sources
> 20GB daily
>25TB
> 50x
>3000 hosts
Slide 15
Alert & Health Monitoring
Challenges
• Lead time
• Importance of log files for monitoring
• What determines application availability
• Changing requirements from stakeholders
• Service model
• ITSI implementation
Slide 16
Alert & Health Monitoring
Challenges with Splunk core and ITSI
• Service model
  • Not usable out of the box
  • Generated with our own tool
• UI: not usable
  • ITSI dashboard: not configurable to our needs
  • Glass tables: static, where we need flexibility due to variable applications
• Event alerting
  • Implementing customer-specific thresholds
Slide 17
Alert & Health Monitoring
How did we solve it?
• Service model
  • Generated with our own configuration tool
  • ‘Manual’ regeneration at every change to the applications
  • Using mind maps for discussions
• UI
  • Dashboards built with tables and hyperlinks
  • New drill-down feature looks promising
• Event alerting
  • Aligning ITSI queries and core Splunk
  • Implementing customer-specific thresholds
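The idea of generating a service model from configuration, with customer-specific thresholds layered on top of defaults, might look roughly like the Python sketch below. Everything in it is hypothetical: the dictionary layout, the KPI names, and the default values are invented for illustration and do not reflect the actual AHM Configurator or any ITSI API.

```python
# Default KPI thresholds (hypothetical names and values).
DEFAULT_THRESHOLDS = {"cpu_pct": 90, "disk_used_pct": 85}


def generate_service_model(applications, overrides=None):
    """Build a nested service model: platform -> applications -> KPIs.

    `overrides` holds customer-specific thresholds that replace the defaults.
    """
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return {
        "service": "LCP",
        "children": [
            {
                "service": app,
                "kpis": [
                    {"name": name, "threshold": limit}
                    for name, limit in sorted(thresholds.items())
                ],
            }
            for app in applications
        ],
    }


if __name__ == "__main__":
    # A development site might tolerate a higher CPU load than a production site.
    dev_site = generate_service_model(["App 1", "App 2"], {"cpu_pct": 95})
    for child in dev_site["children"]:
        print(child["service"], child["kpis"])
```

Because the model is a pure function of the application list and the overrides, regenerating it after an application change is a rerun rather than a manual edit.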
Slide 18
Splunk and ITSI implementation
Easy and clear drill-down dashboards
Users are non-IT
Slide 19
Alert & Health Monitoring
Benefits
• Easier access to log files, metrics and application data
• Less time spent on regular service checks
• Application and infra data combined
• Unforeseen side effects of changes diagnosed in the field and during internal testing
• More confidence in the actual system state
• Memory leak issue spotted in the field, before impact
Slide 20
Alert & Health Monitoring
Wrap up
SplunkLive! Utrecht 2017 - ASML Customer Presentation

Editor's Notes

  1. ASML makes the machines for making chips: a sophisticated copy machine based on the photographic principle of using a negative to create a print. This is done multiple times on one wafer, with multiple layers stacked on top of each other.
  2. Feedback loop: initially one measurement per 3 days, moving to inline measurement. Structures on chips become more complex, and the process requires a short feedback loop to improve the yield and to avoid detecting a malfunctioning chip only at the end of the process.
  3. To support the process, SW was developed to improve the yield via a sophisticated feedback loop. Dark site: hardly any remote access; customers are worried about their IP, as they are working on new products and their competitive edge. Not managed by IT.
  4. Lack of one single interface. Access was required to multiple sources (metrics, log files, HW).
  5. CS engineer = non-IT engineer who knows a lot about the scanners and the SW used.
  6. Alert & Health Monitoring SW provides the health status of LCP HW/SW and application SW. The central GSC organization remotely monitors the performance of the LCP IBL using a frequently sent status report (~6-12 times/hr). AHM sends out an alert in case of a critical or high-priority event. GSC starts remote troubleshooting via the AHM application. GSC aligns the action plan with the local team and the customer before doing a remote intervention. GSC executes the action plan under virtual escort by the customer. The customer can follow live and audit what is happening via VPN.
  7. Configuration: needed because of the dynamic ASML customer configuration, which is not provided by Splunk. Each customer setup has a different HW and SW config. Thresholds differ due to different use cases at the customer (dev vs. HVM/prod).
  8. Different management compared to a central IT environment. Standardization is important.
  9. Initial plan to go to the 1st customer: Jan 2017 → March 2017. Customer release → October 2017.
  10. ITSI has a powerful service model mechanism, but it requires a lot of iterative sessions to explain to stakeholders. The UI of ITSI is not usable by our non-IT users.
  11. It takes time to create traction with stakeholders and end users. Monitoring is difficult. Changing the WoW is difficult. Show progress: Scrum demo. Spread the word. Arrange training sessions. Involve end users: ask them what would help them. Start to work with it; don’t wait until the product is finished. You need consultancy to implement ITSI.