2. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
8. Why Do We Care About High Availability?
• Aside from the numerous application functional requirements, one
critical and often assumed non-functional requirement is application
availability
• Increasing impacts from downtime are driving shorter tolerance for
service interruptions
• "One minute of system downtime can cost an organization
anywhere from $2,500 to $10,000 per minute. Using that metric,
even 99.9% data availability can cost a company $5 million a
year" – The Standish Group
4. Introduction
• Specific to a Highly Available Infrastructure
• Have you prepared for failure?
o Hardware Components
o Software Components
o Applications
o Procedures
• Have you tested your infrastructure under failure
conditions?
• If not, this could apply to you…
"Failing to prepare is preparing to fail"
– John Wooden
5. Definitions
• High Availability (HA)
• Ensuring that the system can continue to process work within
one location after routine single component failures
• Usually we assume a single failure
• Usually the goal is very brief disruptions for only some users for
unplanned events
• Continuous Operations
• Ensuring that the system is never unavailable during planned
activities
• E.g., if the application is upgraded to a new version, we do it in
a way that avoids all downtime
6. Definitions
• Continuous Availability (CA)
• High Availability coupled with Continuous Operations
• No tolerance for planned downtime
• As little unplanned downtime as possible
• Very expensive
• Note that while achieving CA almost always requires an
aggressive DR plan, they are not the same thing
• Disaster Recovery (DR)
• Ensuring that the system can be reconstituted and/or activated
at another location and can process work after an unexpected
catastrophic failure at one location
• Often multiple single failures (which are normally handled by
high availability techniques) are considered catastrophic
• There may or may not be significant downtime as part of a
disaster recovery
7. Definitions - Disaster Recovery vs. High Availability
• Both High Availability and Disaster Recovery have a
common goal
• Business continuity
• But under different conditions
• HA: localized failures, e.g., server crash
• DR: loss of entire production system
o Natural disasters – flood, fire, earthquake
o Man-made disasters
8. WAS-ND HA Architecture
• WAS-ND Full Profile and Liberty Profile Cluster
Composed of Multiple Identical Peers
• Each Capable of Performing the Same Work
• Application Servers Independent of Management Runtime and Each Other
• Application Servers Load Configuration From Local File System
• JNDI Lookups
o Each Application Server Has Its Own JNDI Service
• Security
o Each Application Server Has Its Own Security Server
• Transactions
o Each Application Server Logs and Manages Distributed Transactions
• Systems Management
o Each Application Server Has Its Own JMX MBean Server
Above Applies to WAS V5.x, WAS V6.x, WAS V7.x and WAS V8.x
9. Realities
• 100% of requests aren't going to work perfectly 100% of the time
• But WAS makes provisions to ensure that important requests
(transactions and persistent messages) always work (eventually)
• Optimized HA/CA Requires
• Considerable expense and planning to execute properly
• An architecture built to purpose around this level of requirement.
• Alignment of Processes and Procedures with the Architecture and
Operational Requirements
• No Single Checklist Covers It All.
– Consider Engaging Services for Assistance
10. Request Processing Definitions
• Request Distribution
• Incoming Requests are Distributed Across Multiple Identical Server
Instances
• Distribution “Agent” Typically Employs an Algorithm
• Request Redirection
• When an attempt to contact a server to make a request fails, the
request is redirected to another server
• Also Known as Failover
o Assumes That There are Multiple Identical Servers (a cluster)
• Work Load Management
• Balances Client Requests Across Servers
o Active Monitoring
Response Time
Capacity
o Request Distribution Based on Workload
11. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
12. WAS V8.x (and before) HTTP Request Distribution
• HTTP Server Plug-in
– Distributes Requests to Cluster Members
• Round Robin or Random
– Maintains Client Affinity Using HTTP Session
– Detects Application Server Failure
• Connection or I/O Timeout
– Marks Container as Unavailable
– Periodic Retry
– Tries Next Cluster Member
[Diagram: Web clients → HTTP Server with HTTP Server Plug-in → two Application Server web containers in a (static) cluster]
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/crun_srvgrp.html
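The mark-unavailable-and-retry behavior above can be sketched in plain Java. This is a hypothetical model for illustration, not the actual plug-in code; `RETRY_INTERVAL_MS` and the `route` signature are invented for the sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of the plug-in's failover behavior: a member that
// fails a request is marked unavailable and skipped until a retry
// interval elapses; the next cluster member is tried instead.
public class PluginFailover {
    static final long RETRY_INTERVAL_MS = 60_000;   // illustrative retry interval
    final List<String> members;
    final Map<String, Long> downSince = new HashMap<>();
    int next = 0;

    PluginFailover(List<String> members) { this.members = members; }

    // sendRequest returns true on success, false on a connection/IO timeout
    String route(Predicate<String> sendRequest, long now) {
        for (int i = 0; i < members.size(); i++) {
            String m = members.get((next + i) % members.size());
            Long down = downSince.get(m);
            if (down != null && now - down < RETRY_INTERVAL_MS) continue; // still marked down
            if (down != null) downSince.remove(m);   // retry interval elapsed: periodic retry
            if (sendRequest.test(m)) {
                next = (next + i + 1) % members.size();   // round robin advances past the winner
                return m;
            }
            downSince.put(m, now);   // mark container unavailable
        }
        return null;   // no member reachable
    }
}
```

With two members and `s1` failing, requests flow to `s2` and `s1` is not retried until the interval passes.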
13. WAS V8.x (and before) Web Service Requests
• WS Client Direct Connection – No WLM/Failover
• Stateful Client to Web Service in WAS-ND via WAS-ND Proxy or DataPower
• Stateful Client, using HTTP Session, to Web Service in WAS-ND via HTTP Server, WAS-ND Proxy or DataPower
• Employ WS-Addressing for Clustering/Failover
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/cwbs_wsa_eprs.html
14. WAS-ND V8.x (and Before) IIOP Request Distribution
• Java ORB Plug-in
– Weighted Round Robin Request Distribution
– Maintains Client Affinity As Appropriate
– Stateful Requests
– Transactions
– Detects Failure
• Connection or I/O Timeout
– Marks Container as Unavailable
– Periodic Retry
– Tries Next Cluster Member
[Diagram: Java clients with ORB plug-ins routing IIOP requests to Application Server EJB containers in a cluster]
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/crun_srvgrp.html
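The weighted round-robin distribution named above can be illustrated with a minimal selector. This is a sketch of the general scheme, not the ORB plug-in's actual algorithm; the decrement-and-reset strategy is an assumption for illustration:

```java
// Minimal weighted round-robin sketch: each member's remaining weight is
// decremented as it is chosen; when all weights reach zero they reset.
// Over a full cycle, each member receives requests in proportion to its weight.
public class WeightedRoundRobin {
    final String[] members;
    final int[] weights;
    final int[] remaining;

    WeightedRoundRobin(String[] members, int[] weights) {
        this.members = members;
        this.weights = weights.clone();
        this.remaining = weights.clone();
    }

    String next() {
        // reset once every member has exhausted its weight
        boolean allZero = true;
        for (int r : remaining) if (r > 0) { allZero = false; break; }
        if (allZero) System.arraycopy(weights, 0, remaining, 0, weights.length);
        // pick the member with the most remaining weight
        int pick = 0;
        for (int i = 1; i < remaining.length; i++)
            if (remaining[i] > remaining[pick]) pick = i;
        remaining[pick]--;
        return members[pick];
    }
}
```

With weights 2:1, member "a" receives twice as many requests as member "b" over any full cycle.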
15. WAS-ND V8.5 Intelligent Management HTTP ** WLM
[Diagram: HTTP Server with Plug-in → On Demand Router(s) → Application Server web containers in a (dynamic) cluster]
• On Demand Router
– Distributes Requests to Cluster Members
• Weighted Outstanding Requests
– Maintains Client Affinity Using HTTP Session
– Detects Application Server Failure
• Connection or I/O Timeout (as with plugin)
• Change in Server Status (On Demand Configuration)
– Works In Conjunction with:
• Autonomic Request Flow Management
– Controls request flow; may suspend and re-order requests in
order to prevent overload and achieve service policy
• Health Management Controller
– Routes requests to replacement server
• Application Placement Controller
– Adjusts cluster size
** IIOP and JMS Requests are not managed by ODR, but are WLM’d at Application Server
16. V8.5.5 Intelligent Management for Webservers
ODR (On Demand Router) provides features such as automatic discovery,
edition-aware routing and caching, health policies, dynamic clusters,
maintenance mode, conditional trace, etc.
[Diagram: WebServer tier (IHS/Apache with ODRLIB) and ODR tier routing to AppServer tiers]
17. V8.5.5 Intelligent Management for Webservers - Exclusions
• ODR routing rules
• E.g. No load balancing or failover for the same application in multiple cells
• CPU/memory overload protection
• Throttles traffic when the CPU utilization or heap utilization goes above a
configured threshold on an application server host
• Application Lazy Start
• Request prioritization
• No queuing and re-ordering of requests based on service policies
• Highly available deployment manager
• Request Classification based on the user identity in a LTPA token
• Workload and storm-drain health policies
18. Service Policies
• Service policies are used to define application service level goals
• Allow workloads to be classified, prioritized and intelligently routed
• Enable application performance monitoring
• Resource adjustments are made if needed to consistently achieve
service policies
Service policies define the relative importance and response time goals
of application services, defined in terms of the end-user result the
customer wishes to achieve
19. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
20. WAS-ND : HA Architecture – A Brief Review
• Peer Recovery Model with Active Hot Standbys
for persistent services
• Transactions
• Messaging
• If a JVM fails then any Singletons running in
that JVM are restarted on a Peer once the
Failure is detected
• Starting on an already running Peer eliminates
the start up time of a new process which could
take minutes
• Planned failover takes a few seconds
• This low failover time means WAS can tolerate
many failures without exceeding the 5.5-minute
yearly maximum outage dictated by a 99.999%
SLA
[Diagram: WAS-ND JVM – the High Availability Manager, built on Distribution and Consistency Services (DCS) and Reliable Multicast Messaging (RMM), underpins the Transaction Service, Workload Management (WLM), Data Replication Services (DRS), the Messaging Engine, and On-Demand Configuration (ODC)]
21. HA: Example Transaction Log Peer Failover
• Provides Failover of In-flight 2 PC
transactions
• WAS-ND Can Be Configured to
Store Transaction Logs For Each
Server on a Shared File System ***
• Allows All Peers to See All Transaction
Logs
• Automatic HAManager Triggered Failover
• When a WAS-ND cluster Member
Fails, a Peer is Elected to Process
the Transaction Log From the Failed
Server
• In Doubt Transactions From a Failed
Server Are Processed Very Quickly,
Typically In Seconds (or less!)
• Significantly Faster Than Hardware
Clustering Which Can Take Minutes
• Resource Manager Locks Are Released Quickly
*** Database option in V8.0.07 and later
[Diagram: WAS-ND cluster of application servers, each with a transaction manager writing its tran log to a shared file system; resource managers (database, message queue) hold in-doubt locks; when one server crashes, a peer processes its tran log]
22. Service Integration Bus - High Availability
[Diagram: WAS-ND cluster members A, B and C on the SIBus; the messaging engine (ME) fails over from one member to another]
• The SIBus Messaging Engine is Managed by HA Manager
• HA is provided by failing over the ME service to a different cluster member
• Default is "One of N" Core Group Policy
• Options Exist for Multiple/Partitioned Queue
• Options Exist for Multiple MDB Consumers as well as Single Consumer
23. WAS V8.5 ME Enhancements
• Restrict long running Database Locks
• Active ME now holds only short locks on the SIBOWNER table while revalidating its ownership at regular intervals
• Ability for SIBus to detect a hang in the “active” ME and switch over to the “standby” ME
• Adds ME Last Update Time to SIBOWNER Table
• Backup ME Can Safely Take Ownership and avoid Split Brain
• ME can stop gracefully on database failures instead of killing the entire JVM
• Other Applications In the JVM Hosting the ME Continue to Run
• Automatically "re-enable" an ME if it enters a "disabled" state
• In a Large Cluster It Can Be Difficult to Administratively Locate a "disabled" ME
• Configure a new ME to recover data from an orphaned persistence store
• Reads and Updates ME UUID from Persistent Records
• Persist JMS re-delivery count value
• Avoids Reprocessing of Messages That May Cause an Outage
• Utilization of multiple cores for quicker ME start-up when a large number of messages and
destinations are present
24. WAS-ND V8.5 Health Management
Automate “Sick” Application Server Restart
Predefined health policies and custom health
policies can be defined for common server
health conditions
When a health policy's condition is true,
corrective actions execute automatically or
require approval:
• Notify administrator (send email or SNMP
trap)
• Capture diagnostics (generate heap dump,
java core)
• Restart server
Excessive response time means you are
monitoring what matters most: your customer's
experience!
Application server restarts are done in a way
that prevents outages and service policy
violations
Each health policy can be in supervise or
automatic mode. Supervise mode is like
training wheels to allow you to verify that a
health policy does what you want before
making it automatic.
Health Conditions
• Excessive request timeouts: % of timed out requests
• Excessive response time: average response time
• Excessive garbage collection: % of time spent in GCs
• Excessive memory: % of maximum JVM heap size
• Age-based: amount of time server has been running
• Memory leak: JVM heap size after garbage collection
• Storm drain: significant drop in response time
• Workload: total number of requests
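A health policy amounts to a named condition over server metrics plus a corrective action, with a supervise/automatic switch. The sketch below models that shape in plain Java for illustration; the real Intelligent Management policies are configured in the admin console, not coded, and the `Metrics`/`Policy` types here are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of health policy evaluation: each policy pairs a
// condition over server metrics with an action; supervise mode proposes
// the action for approval, automatic mode executes it.
public class HealthPolicySketch {
    record Metrics(double avgResponseMs, double pctTimeInGc, long uptimeHours) {}

    record Policy(String name, Predicate<Metrics> condition, String action, boolean supervise) {}

    // Returns the actions to execute, or to propose when in supervise mode
    static List<String> evaluate(List<Policy> policies, Metrics m) {
        List<String> actions = new ArrayList<>();
        for (Policy p : policies)
            if (p.condition().test(m))
                actions.add((p.supervise() ? "PROPOSE: " : "EXECUTE: ") + p.action());
        return actions;
    }
}
```

Supervise mode maps to the "training wheels" behavior above: the same condition fires, but the action waits for approval.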
25. WAS-ND V8.5 Deployment Manager HA
• Each deployment manager on a separate machine
• Only one is active
• Others are standby
• Shared file system required for dmgrs to share the configuration repository
• File system with recoverable locks required – e.g. SAN FS, NFS v4, GPFS
• JMX traffic proxied through WVE On-Demand Router (ODR)
• SOAP connector only
• Clustered ODRs recommended
• (they're recommended for any HA production component anyway)
• hadmgrAdd command line utility provided to perform configuration
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.wve.doc/ae/rwve_xdhadmgrAdd.html
• Pre-WAS V8.5 options outlined in "The WebSphere Contrarian: Run
time management high availability options, redux" are still applicable
http://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html
27. Liberty Profile in a WAS-ND Cell
• Manage Liberty profiles as integral part of ND Cell !
• Built on Intelligent Management Middleware Server support
– Available v8.5.5.1
• Assisted Lifecycle Management
– Basic console/scripting access to Liberty
– config access (i.e. server.xml)
– lifecycle (start/stop/status)
– log access (messages.log, etc)
– Dynamic Clusters and Health Management
[Diagram: ND cell – dmgr, cell DB, HTTP server/ODR tier, and nodes, each running a node agent, app servers, and Liberty servers]
28. Capabilities Comparison

Capability              | Liberty ND Cell | Liberty Collective
Lightweight             | No              | Yes
Setup speed             | Low             | High
Memory use              | High            | Low
Reconfigurability       | Low             | High
Domain Scalability      | Low             | High
Admin Scalability       | Low             | High
Liberty deploy          | No              | Yes
Admin HA                | Yes             | Yes
Static clusters         | Yes             | Yes
Health Manager          | Yes             | No (for now)
Dynamic clusters        | Yes             | Yes
Extends existing ND env | Yes             | No
29. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
30. Application Resiliency
• Efficient Request Processing
• Avoid Long Running SQL Queries
o Employ setMaxRows(int)
o Employ setFetchSize(int)
• Explicitly catch WAS StaleConnectionException
• Asynchronous Application Architecture
• Limit Work Manager Thread Timeout
o waitForAll(workItems, timeout_ms)
• join(workItems, JOIN_AND, timeout_ms)
o waitForAny(workItems, timeout_ms)
• join(workItems, JOIN_OR, timeout_ms)
o startWork(work, timeout_ms, workListener)
• Java EE 7 Concurrency Thread Timeout
• service.submit(new Timeout()).get(2000, TimeUnit.MILLISECONDS);
• Stateless Application Architecture
• Or Minimize Application State Overhead
• Externalize State for Recovery/Failover
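The thread-timeout bullets above can be illustrated with plain `java.util.concurrent`. In WAS you would look up a managed executor via JNDI; a plain `ExecutorService` is used here so the sketch stays self-contained, and `callWithTimeout` is an invented helper name:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Bound how long a caller waits on a unit of work: a request that would
// otherwise hang is failed fast and cancelled instead of stalling a thread.
public class BoundedWork {
    public static String callWithTimeout(ExecutorService service,
                                         Callable<String> work,
                                         long timeoutMs) {
        Future<String> f = service.submit(work);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);            // interrupt the hung work
            return "TIMED_OUT";
        } catch (InterruptedException | ExecutionException e) {
            return "FAILED";
        }
    }
}
```

Failing the one slow request quickly is exactly the resiliency trade the surrounding slides argue for.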
31. WAS-ND Application Update (Pre V8.5)
• Adminconsole “Rollout Update” & wsadmin “updateAppOnCluster”
• Stops Cluster Member(s) on a Node
• Distributes Update to Node
• Re-starts Cluster Member(s) on Node
• Employs “Application Update” Function for Correct Event Registration and
Synchronization
• While attractive in theory, this doesn’t provide for seamless updates from the end
user’s perspective
o Plug-in detects Server Outage and Can Then Select Another Cluster Member
o Additional Effort May Be Required for Uninterrupted Service (see below)
• Primary Benefit is that it's Superior to Manual & Scripted Approaches Using
Stop Server, Sleep, etc.
• Better Approaches for Minimizing Downtime
• Dual Cells
• Single cell wsadmin script that sets ServerWeight to 0, employs isAppReady and
getDeployStatus, manually syncs each node, then resets ServerWeight
32. WAS-ND V8.5 Application Edition Management
• Interruption-free update of application on existing deployment targets (e.g.
dynamic cluster)
• Workload is quiesced and diverted from each server or cluster as the edition swap
is performed
• Group Rollout
• Old edition is replaced with the new edition one server at a time or in 'batches'
• Atomic Rollout
• Old edition is completely offline before the new edition is available
• Application requests arriving in the window are queued by on-demand router
• Edition back-out
• Ability to undo an edition rollout
• Simply use edition rollout capability to rollout a previous edition
• Validation
• Hosting of new edition in production environment on ‘clone’ of original
deployment targets
• Use routing policy to control edition visibility – e.g. only ‘test’ personnel
33. Application State
• Typically, with a Planned Outage, Application State (requests) Can Be "Drained"
• For Unplanned Outages, is it worth investing in Application State failover?
• Application State/ Session Failover
• Application Code Transparent
• State is Automatically Retrieved From Remote Copy if Not Present in
Local Copy
• Session Distribution Options for Update of Remote/Backup Session Object
• Time Based Write (default of 10 seconds)
– Employ "NoAffinitySwitchBack” custom property when using TBW
• At End of the Servlet Service Method
• Manually (Requires Use of IBM Extension to HttpSession)
• Session Manager is Distribution Client
• No Application Visibility to DB or Replicator/Session Outage
• WAS V6.0.2 and above – Updates Occur in the Local Copy During a Replication/DB Outage
o Messages In Logs and Administration Client During Outage
• Performance Will Degrade As Remote Updates are Attempted
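The "retrieved from the remote copy if not present in the local copy" behavior above is a two-level lookup. A minimal sketch, with in-memory maps standing in for the local copy and the remote replication domain or database (the `SessionStore` class and its write-through `put` are illustrative assumptions, not the WAS session manager):

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Two-level session lookup: serve from the local copy when present,
// otherwise fall back to the remote copy and re-seed the local copy --
// the pattern behind transparent session failover.
public class SessionStore {
    final Map<String, Serializable> local = new ConcurrentHashMap<>();
    final Map<String, Serializable> remote;   // stands in for DRS peer or database

    SessionStore(Map<String, Serializable> remote) { this.remote = remote; }

    Serializable get(String sessionId) {
        Serializable s = local.get(sessionId);
        if (s == null) {                       // failover path: local copy lost
            s = remote.get(sessionId);
            if (s != null) local.put(sessionId, s);
        }
        return s;
    }

    void put(String sessionId, Serializable state) {
        local.put(sessionId, state);
        remote.put(sessionId, state);          // in WAS: time-based or end-of-service write
    }
}
```

Requiring session state to be `Serializable`, and keeping it small, is what makes this failover path cheap.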
34. Session Manager to DB
• We Need Your Help!
• RFE 34283 - HTTP Session manager DB resiliency
• Vote at http://www.ibm.com/developerworks/rfe/
• In the Interim
• Employ StaleConnectionRetry = 0
– APAR PI04871
– Will not suppress the error messages, but will reduce them
– Default is 3 retries
36. WAS-ND HTTP Session Failover – DRS Client Server
[Diagram: DRS client/server replication topology]
37. WAS Full Profile and Liberty Profile Session Failover – Database
[Diagram: HTTP Server with Plug-in routing to clustered application servers, each with a local state copy; a remote state copy is kept in a database]
38. WebSphere eXtreme Scale
• Alternative to DB Replication for
Application State
• Independent of WAS-ND Cell
Infrastructure
• Servlet Filter Replacement for
Session Manager
• Installs in Any J2EE Application
• Replication Zones Allows
Alignment Along Data Center
Boundaries
WXS Session Failover (WAS Full Profile and Liberty Profile)
39. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
40. Typical WebSphere Queuing Network
[Diagram: Web clients → Web Server (100 threads) → Web Container (50 threads) → ORB Pool (25 threads) → JDBC Pool (10 objects) → Database, with a request queue in front of each layer]
• Guiding Principle – Keep the website moving!
• Don’t allow a large request queue to build in the App Server.
• It’s better to prematurely fail a small number of latent/long-running requests than to stall the entire website
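The "fail a few requests rather than stall the site" principle maps directly onto a bounded queue with an abort policy. A sketch using `ThreadPoolExecutor` for illustration; in WAS the equivalent levers are the thread pool sizes and queue-related settings covered on the following slides:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Keep the site moving: a small bounded queue in front of the worker pool,
// with AbortPolicy so excess requests fail fast instead of piling up.
public class FailFastPool {
    public static ThreadPoolExecutor create(int threads, int queueDepth) {
        return new ThreadPoolExecutor(
            threads, threads,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueDepth),    // queue "as close to zero as practical"
            new ThreadPoolExecutor.AbortPolicy());   // reject rather than stall
    }
}
```

With one thread and a one-deep queue, a third concurrent request is rejected immediately rather than queued behind a stalled one.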
41. Sizing Pools, Queues and Timeouts
• Monitor Performance* at Each Layer and Set a Budget
• Pool Size = Req/Sec X Latency + (~20%) Cushion
o E.g. 36 = 30 Req/Sec X 1 Sec + 6
• Queue Size
o As Close to Zero as Practical
• Request Timeout = Latency Timeout + Successful Retry
o E.g. 3.0 seconds = 2.0 Seconds + 1 Second
• Connect Timeout = 99% Average Network Latency
o E.g. 5ms
• 1 Second is WAS Minimum in Many Cases, So You’ll Need to Round Up for
Some Settings
* See Chart in Backup Slides For PMI Suggestions
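The sizing rule above is simple arithmetic; a one-method sketch (the `poolSize` helper is invented for illustration) reproducing the slide's 30 req/sec × 1 sec + 20% = 36 example:

```java
// Sketch of the slide's sizing rule:
// pool size = requests/sec × latency (sec) + ~20% cushion
public class PoolSizing {
    static int poolSize(double reqPerSec, double latencySec, double cushion) {
        return (int) Math.ceil(reqPerSec * latencySec * (1.0 + cushion));
    }
}
```

`poolSize(30, 1.0, 0.20)` yields 36, matching the worked example; re-run the calculation whenever measured latency changes.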
42. A Slight Digression On Tuning for Performance
• The Guidance on the Preceding Slide is Focused on Resiliency and
Failover
• As it Turns Out, This Same General Guidance without the Additional
Attention to Queue Depth and Timeouts Also Optimizes Performance
43. Web Server and Plugin
[Diagram: Web clients → queue → Web Server (100 threads)]
• Web Server
• Threads
• Processes (StartServers in Apache/HIS)
• Web Server Plugin
• MaxConnections
• ConnectTimeout
• ServerIOTimeOut
• PostSizeLimit
• PostBufferSize
• ServerIOTimeoutRetry
– Originally Introduced in WAS V8.5, included in V8.0 and V7.0 service streams
• APAR PM94198 adds URI-specific ServerIOTimeout, ServerIOTimeoutRetry, and Extended Handshake
rules, available in 7.0.0.31, 8.0.0.8, 8.5.5.2
45. Web Container
[Diagram: queue → Web Container (50 threads) → queue]
• Web Container
• Thread Pool (Web Container)
• Read timeout
• Write timeout
• Persistent Connections/Request
• Maximum Open Connections
• listenBackLog
o TCP Channel Custom Property
46. Web Service Requests
• Use WS Policy Sets to configure timeouts, message properties, etc.
• Transport Policy Set
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rxml_wsfphttptransport.html
• Timeout and Message Properties
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rwbs_jaxwstimeouts.html
com.ibm.ws.websvcs.transport.common.TransportConstants.READ_TIMEOUT – default 300 seconds
com.ibm.ws.websvcs.transport.common.TransportConstants.WRITE_TIMEOUT – default 300 seconds
com.ibm.ws.websvcs.transport.common.TransportConstants.CONN_TIMEOUT – default 180 seconds
48. Connection Pool and Database
[Diagram: queue → JDBC Pool (10 objects) → Database]
• Connection Pool
• Maximum Connections (Pool Size)
• Connection Timeout
• PurgePolicy
o EntirePool (Default)
• JDBC Provider
• Read Timeout *
• Login Timeout *
• Database
• Maximum Connections *
* Name varies by vendor
49. WAS V8 Resource Adapter High Availability
• Resource failover and retry logic for relational data sources and JCA connection
factories
• Simplifies application development
o Minimizes the application code required to handle failure of connections to relational
databases and other JCA resources
o Provides a common mechanism for applications to uniformly respond to planned or
unplanned outages
• Typically Employed with Database Replication (e.g. DB2 HADR, Oracle RAC)
• Administrator can tailor data sources and connection factory configuration based
on application needs:
o Alternate/failover resource reference on primary data source
o Optionally
• Number of connection retries
• Pre-population of alternate/failover resource connection pool
• Auto failback
• Full control of functionality available to scripts and programs via a management
MBean
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r0/topic/com.ibm.websphere.nd.doc/info/ae/ae/cdat_dsfailover.html
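The primary/alternate retry idea above can be sketched in plain Java. This illustrates the pattern only: in WAS the failover resource and retry count are data source configuration, not application code, and the `Supplier`-based `FailoverResource` type here is invented:

```java
import java.util.function.Supplier;

// Illustration of the failover idea: try the primary resource a bounded
// number of times, then switch to the configured alternate resource.
public class FailoverResource<T> {
    final Supplier<T> primary, alternate;
    final int retries;

    FailoverResource(Supplier<T> primary, Supplier<T> alternate, int retries) {
        this.primary = primary;
        this.alternate = alternate;
        this.retries = retries;
    }

    T getConnection() {
        for (int i = 0; i < retries; i++) {
            try { return primary.get(); }
            catch (RuntimeException e) { /* connection attempt failed; retry */ }
        }
        return alternate.get();   // fail over to the alternate resource
    }
}
```

The benefit the slide claims, applications "uniformly respond to planned or unplanned outages", comes from the retry/failover living below the application's `getConnection()` call.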
50. Agenda
• Level Set and Definitions
• WebSphere Application Server Request Processing
• WebSphere Application Server High Availability
• Application Availability
• Preparing to Fail
• Final Thoughts
51. High Availability for Non-WAS Components
• Firewall
• IP Sprayer
• WebSphere MQ
• Security Registry
• Database Server
• SOA Appliance (DataPower)
• File System
• Make all HA!
• Via hardware clustering or software clustering
• 99.999% Can Only Be Achieved When All Components Are Engineered
for This Availability Level
• WAS-ND Without an Overall 99.999% Infrastructure Will Not Assure
99.999% Availability
52. “Gold Standard” – Dual WAS Cells or Liberty Collectives in One Data
Center
Sharing HTTP Session Between Cells is NOT Recommended
53. “Gold Standard”
• Two (or More) Cells/Collectives
• Provide Hardware Isolation
• Provide Software Isolation
• Infrastructure for Planned Maintenance without Outage
• Insurance Against Catastrophic Administrative Outage
• Requires More Administrative Effort and Rigor (Scripting)
• Don’t Forget “Rule of 3”
• Discussion Typically Is in Context of HA Clusters of Size 2
• With “Only” 2 of “Everything”
– An Outage (Planned or Unplanned) Reduces Capacity by 50%
– Is No Longer Fault Tolerant
54. Causes of Downtime
Primary causes of downtime
– Hardware and environmental problems: 20%
– People and process problems: 80%
WAS addresses the 20%
Solve the other 80% first
– Dedicated, well-trained system administrators
– Strict change control
– Load testing of new applications
– Carefully-planned automated production deployment
55. Simulating Failure
• Essential for “Preparing to Fail”
• “Pull the Plug and Duck”
• Disconnect Network Cable
• Hang an Application Server (or Other Process)
• Many Monitoring and Diagnostic Tools Assist Here
• Inject Local or Global failure for WAS Messaging Engine
• Local = “auto restart”
• Global = “use intervention”
• http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjt0037_.html
• Hang OS
• Write script to consume CPU
• Insidious, but Very Effective
• See Backup for Sample
56. Learn from Your Mistakes
• Mistakes and failures will occur, learn from them
• What separates mediocre organizations from the good and great isn't so much
perfection as it is the constant striving to get better – to not repeat mistakes
• After every outage perform
• Root cause analysis
o Capture diagnostic information
o Meet as a team including all key players to discuss
o Determine precisely what went wrong
• Wrong doesn't mean “Bob made an error.”
• Find the process flaw that led to the problem
• Determine a corrective action that will prevent this from happening again
o If you can't, determine what diagnostic information is needed next time this happens and
ensure it is collected
• Implement that corrective action
o All too often this last step isn't done
o Verify that action corrected problem
• A senior manager must own this process
60. Notices and Disclaimers (con’t)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document
Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand,
ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,
PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,
urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and
service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on
the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
61. Thank You
Your Feedback is
Important!
Access the InterConnect 2015
Conference CONNECT Attendee
Portal to complete your session
surveys from your smartphone,
laptop or conference kiosk.
63. Shameless Self Promotion
IBM WebSphere Deployment and
Advanced Configuration
By Roland Barcia, Bill Hines,
Tom Alcott and Keys Botzum
ISBN: 0131468626
64. Another Recommended Book
IBM WebSphere v5.0 System Administration
By Leigh Williamson, Lavena Chan, Roger Cundiff,
Shawn Lauzon and Christopher C. Mitchell
ISBN: 0131446045
65. WebSphere Application Server PMI Monitoring Detail (1/2)
Set PMI to custom and enable just the following metrics:
• Connection Pools – JDBC: AllocateCount, ReturnCount, CreateCount, CloseCount,
FreePoolSize, PoolSize, JDBCTime, UseTime, WaitTime, WaitingThreadCount,
PrepStmtCacheDiscardCount
• Connection Pools – JMS (JCA), JMS Queue Connection Factory connection pools:
Pool Size, Percent Maxed, Percent Used, Wait Time
• JVM Runtime: HeapSize, UsedMemory, ProcessCpuUsage;
Optional: % Free after GC, % Time spent in GC
• HTTP (Servlet) Session Manager: ActiveCount, CreateCount, InvalidateCount,
LiveCount, LifeTime, TimeSinceLastActivated, TimeoutInvalidationCount;
Optional: SessionObjectSize **
• ORB/EJB – Thread Pool: ActiveCount, ActiveTime, CreateCount, DestroyCount,
PoolSize, DeclaredThreadHungCount
• ORB/EJB – Requests: WaitTime **, MethodResponseTime **
** Usually in test only
66. WebSphere Application Server PMI Monitoring Detail (2/2)
Set PMI to custom and enable just the following metrics:
• Web – Requests: ResponseTime, ConcurrentRequests, ErrorCount
• Web – Thread Pool: ActiveCount, ActiveTime, CreateCount, DestroyCount,
PoolSize, DeclaredThreadHungCount
• ODR – Proxy Module: ActiveOutboundConnectionCount, RequestCount,
ResponseTime (TTLB)
• ODR – odrStatModule: TotalNumberOfRequests, CurrentOutstandingRequests,
PercentOfErrors
• Messaging Engine (SIB): BufferedReadBytesCount, BufferedWriteBytesCount,
CacheStoredDiscardCount ***, CacheNotStoredDiscardCount ****;
Optional: AvailableMessageCount, LocalMessageWaitTime
**** Not PMI, in System.Out log
67. Example Server/OS Hang Script
#!/usr/bin/ksh
clear
echo 'This script will burn CPU cycles and chew up all available memory'
echo
echo 'Start "nmon" in another window to watch when the memory is gone and it locks up. (the display will stop updating every 2 seconds when it is hung)'
echo
echo -e "Press 'Enter' to continue and crash the system... \c"
read ANS
echo -e ".\c"
( x=0; while true; do ((x=x+1)); done ) &
sleep 5
echo -e ".\c"
( while true; do true; done ) &
sleep 5
echo -e ".\c"
cat /dev/urandom >/dev/null &
sleep 5
echo -e ".\c"
tail /dev/zero &
echo -e "\n\nThe system should crash soon."
#-- Loop to display time
while true
do
  sleep 1
  TM=$(date '+%T')
  echo -e "$TM \b\b\b\b\b\b\b\b\b\c"
done
68. Licensing Servers as Back Up Servers
From IBM Contracts and Practices Database
• The policy is to Charge for HOT, and not for WARM or COLD back ups. The following are definitions of what constitutes
HOT-WARM-COLD backups:
• All programs running in backup mode must be under the customer's control, even if running at another enterprise's
location.
• COLD - a copy of the program may be stored for backup purposes on a machine as long as the program has not been
started.
• There is no charge for this copy.
• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not
doing any work of any kind.
• There is no charge for this copy.
• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However,
this program must be ordered.
• There is a charge for this copy.
• "Doing Work", includes, for example, production, development, program maintenance, and testing. It also could include
other activities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources
(e.g. active linking with another machine, program, data base or other resource, etc.) or any activity or configurability
that would allow an active hot-switch or other synchronized switch-over between programs, data bases, or other
resources to occur
Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information
69. V8.5.5 Intelligent Management for Webservers
Automatic routing
‒ Automatically discovers and recognizes all changes which affect routing: server/cluster
create/start/stop/delete, application install/start/stop/uninstall, virtual host updates, session affinity
configuration changes, dynamic server weight changes, etc.
‒ Lower administrative overhead. Simply connect a cell and go. When new clusters are created in target
cells, no change is made or needed to the plugin-cfg.xml.
Application edition routing
‒ Upgrade applications without interruption to end users
‒ Easy-to-use validation mode allowing new versions of application to be
validated before sending production traffic
‒ Concurrently run multiple editions of a single application, using routing policy to
route users to the appropriate edition
Application edition caching
‒ The plugin's ESI (Edge Side Include) cache is edition-aware, which means that
edition 1 and edition 2 content is stored separately in the cache
Health policy support for ODR-related health policies
‒ Recognize a sick server and automatically take corrective action
‒ “Excessive Response Time” and “Excessive Request Timeout” health policy
support
70. V8.5.5 Intelligent Management for Webservers
• Dynamic clusters
• JVM and VM elasticity
• APC (Application Placement Controller) dynamically starts/stops
servers and calls out to IBM Workload Deployer (IWD) to
provision/de-provision servers in order to meet current demand.
IHS/Apache automatically routes appropriately.
• Node and server maintenance mode
• When a node or server is placed into maintenance mode, application
optimization automatically routes appropriately
• Multi Cell Routing
• WLOR (Weighted Least Outstanding Requests) load balancing
• Evens out response times due to dynamically changing weights
• Quick to send less traffic to slow or hung servers