A presentation I gave in 2004 while managing the UK OGSA Evaluation Project at the UCL Computer Science department, where I was working with Wolfgang Emmerich while on leave from CSIRO, and in which we "believe 6 impossible things before breakfast". The project encountered, and partially solved, many of the problems that Cloud computing finally solved.
Paul Brebner, Oxford University Computing Laboratory invited talk: "Grid middleware is easy to install, configure, debug and manage - across multiple sites (One can't believe impossible things)", 15 October 2004.
The project web site is still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid middleware is easy to install, configure, secure, debug and manage across multiple sites ("One can't believe impossible things")
1. "One can't believe impossible things"
UK OGSA Evaluation Project
(UCL, Imperial, Newcastle, Edinburgh)
(Full list of project members)
Paul Brebner
University College London
P.Brebner@cs.ucl.ac.uk
"Grid middleware is easy to install, configure,
secure, debug and manage - across multiple sites"
6. Grid Simplicity – Start with something simple
• OGSA
– OGSI
• GT3.2 – exemplar of a Grid SOA
• Initially evaluate installation, configuration,
and security
• Then performance and scalability,
deployment, architectural choices, etc.
7. Grid Realism – But realistic test-bed
• Heterogeneous platforms
– Linux, Solaris, Windows
• Cross-organisational
– Four nodes
– Independently administered
– Firewalls and access restrictions
• Security
– UK e-Science CA
8. Grid Confusion – What is Globus?
• How is Globus intended to be used?
– 1: Science as first-order services: Middleware
for building and hosting Grid Applications, by
exposing science code as Grid services.
– 2: Middleware as services: As a set of high
level Grid services, composed to provide new
Grid functionality. Science isn’t first-order
service, but managed by Grid services.
11. Grid Confusion – Science services or Grid services
[Figure: a client and the two usage models, labelled 1 and 2, each shown with the example science codes E = mc2 and D = A + 2B + C2]
12. Grid Confusion – How to evaluate
• Do we evaluate GT3 as middleware for
hosting Grid services, or as a toolkit for
constructing Grid middleware?
• If the first, only need GT3 Core – just the
container. If the second, need “All Services”
(and more – there’s no scheduler).
13. Grid Simplicity – Incremental
• Start with Core Package
• Add Security
• Then try “All Services”
• Simple enough – in theory
14. Grid Steps – single node
[Figure: single-node installation steps: install OS/HW, install GT3]
23. Grid Reality – What we found
• Port number management
• Host access
• Remote visibility of installation, container,
services
• Installation by System Administrators
• Tomcat or Test container
• Compilation issues on Solaris
• Exponential increase in testing complexity as
number of nodes increases.
24. Grid Reality – What we found
• Port number management
– Port number conflicts (with other services)
– What port is the container running on?
25. Grid Reality – What we found
• Host access
– Is the container visible on that port externally?
– From which machines?
– For which users?
– Non-trivial to test/debug if/when something
goes wrong
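A first check for the questions above is whether the container port accepts TCP connections from a given client machine. The following is a minimal Python sketch, assuming only a host name and a port; the function name `port_reachable` is illustrative and not part of Globus. It says nothing about which users are authorised, only whether the port is visible.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False
```

Run from each client machine against each server, for example `port_reachable("gridnode.example.org", 8080)` (hypothetical host and port), this at least separates firewall problems from container problems.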
26. Grid Reality – What we found
• Remote visibility of installation, container,
services
– What infrastructure is installed?
– What packages and versions?
– How is it configured?
– What state is it in?
27. Grid Reality – What we found
• Installation by System Administrators
– Division of roles
– Didn’t meet expectations
– Extra effort to support multiple roles
• System Administrators – install, configure and
secure
• Globus Administrators – test, maintain
• Globus Developers – develop, deploy, test/use Grid
services
28. Grid Reality – What we found
• Tomcat or Test container
– Differences in deployment, configuration, and
management
– With Tomcat, increased potential for centralised
management, and sand-boxing of run-time
environment
29. Grid Reality – What we found
• Compilation issues on Solaris
– Took longer than expected
– Only Linux testing and support can be taken for
granted
30. Grid Reality – What we found
• Exponential increase in testing complexity
as number of nodes increases
– Testing (and maintaining) interoperability
between m client machines, and n servers gets
complicated.
– How well will this scale for 100s, 1000s of
nodes?
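To give a feel for the growth, the sketch below counts the client/server pairs that need an interoperability test, and shows how that count multiplies once each pair is tested under several configurations. The number of pairs grows as m × n, and every independent two-valued option doubles it again. The options (security on or off, Tomcat or test container) are taken from the slides; treating them as independent choices is an assumption made for illustration.

```python
from itertools import product


def enumerate_cases(clients, servers, options):
    """Enumerate every (client, server, configuration) combination to test."""
    # Each option is a (name, values) pair; a configuration picks one value each.
    names = [name for name, _ in options]
    configs = [dict(zip(names, combo))
               for combo in product(*(values for _, values in options))]
    return [(c, s, cfg) for c, s in product(clients, servers) for cfg in configs]


sites = ["UCL", "Imperial", "Newcastle", "Edinburgh"]
options = [("security", [False, True]), ("container", ["test", "tomcat"])]
cases = enumerate_cases(sites, sites, options)
print(len(cases))  # 4 clients x 4 servers x 4 configurations = 64
```

With 100 clients and 100 servers the same two options already give 40,000 combinations.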
31. Grid Reality – Security
• In theory just had to
– obtain (and update) host, client, and CA certificates
– convert
– install
– configure
– generate (and update) proxies.
• However, parts of “All Services” package also
needed.
32. Grid Security - What we found
• Interactions between security for multiple
installations
• Essential to test non-secure interoperability first
• Windows client-side security
• Testing and viewing security configuration
• Debugging secure calls
• Client side security is programmatic
• Security management scalability
– Construction and maintenance of user accounts and
grid-map file entries.
33. Grid Security - What we found
• Interactions between security for multiple
installations
– For testing may want
• multiple versions, or duplicates (with different
configurations) of same versions.
• One container with no security, and another
container with security
– May want test/production environments
34. Grid Security - What we found
• Essential to test non-secure interoperability
first
– Trying to test interoperability and security
simultaneously wasn’t fun
35. Grid Security - What we found
• Windows client-side security
– Still haven’t got it working
– Not obvious exactly what parts of Globus are
needed for client side code with security (no
“client plus security” package).
36. Grid Security - What we found
• Testing and viewing security configuration
– Need to be able to view/edit and check security
configuration for containers and services
– Confusion about hierarchical security settings
• Virtual Organisations, clusters, servers, containers,
factories, services, methods, and instances.
– Remotely
– Validate security deployment before run-time
37. Grid Security - What we found
• Debugging secure calls (or any stateful service)
– Proxy interceptor approach (e.g. TCPMON) won’t
work with stateful services
• As grid handle returned to client contains the port number of
the instance, not the proxy
– But proxies are an important design pattern for SOAs…
– GT4/WS-RF may be different
• Handle resolvers, WS-Addressing and WS-RenewableReferences
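The failure mode can be illustrated with plain URL handling. In this sketch (hypothetical host names, ports, and paths, not the GT3 API) the client makes its first call through an interceptor port, but the handle returned for the new instance names the container's own port. Later calls therefore go direct and are never seen by the interceptor.

```python
from urllib.parse import urlsplit


def seen_by_interceptor(call_url: str, interceptor_port: int) -> bool:
    """True if a call to call_url would pass through the interceptor's port."""
    return urlsplit(call_url).port == interceptor_port


INTERCEPTOR_PORT = 8081  # hypothetical port the TCP monitor listens on
factory_url = "http://node.example.org:8081/ogsa/services/ExampleFactory"
# Hypothetical handle returned by the factory: it carries the container port.
instance_handle = "http://node.example.org:8080/ogsa/services/ExampleFactory/instance-1"

print(seen_by_interceptor(factory_url, INTERCEPTOR_PORT))      # factory call
print(seen_by_interceptor(instance_handle, INTERCEPTOR_PORT))  # instance call
```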
38. Grid Security - What we found
• Client side security is programmatic
– Client side code modifications required to call
services/methods with required protocols
– Should be declarative
– Sensitive to server side security credentials
39. Grid Security - What we found
• Security management scalability
– Construction and maintenance of user accounts and grid-map file
entries.
– For each server, each user needs an account, and an entry in the
container gridmap file (mapping client certificate to account)
– May also need service specific gridmap files
– Not scalable for large numbers of users, servers, services.
• Alternatives?
– Tool support
– Role based authentication
– Shared accounts or certificates
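The scaling problem is easy to quantify. A grid-map file maps a certificate distinguished name (DN) to a local account, one line per user, and every server needs its own copy. The sketch below uses hypothetical DNs, account names, and counts. It renders the entries and counts how many must be kept in step, assuming every per-service map lists every user.

```python
def gridmap_lines(users):
    """Render grid-map entries: a quoted certificate DN, then the local account."""
    return [f'"{dn}" {account}' for dn, account in sorted(users.items())]


def entries_to_maintain(n_users, n_servers, n_service_maps=0):
    """Entries across all servers: one container map plus any per-service maps.

    Assumes (for illustration) that each per-service map lists every user.
    """
    return n_users * n_servers * (1 + n_service_maps)


# Hypothetical users; real DNs are issued by the CA.
users = {
    "/C=UK/O=eScience/OU=Example/CN=alice example": "alice",
    "/C=UK/O=eScience/OU=Example/CN=bob example": "bob",
}
for line in gridmap_lines(users):
    print(line)
print(entries_to_maintain(n_users=100, n_servers=4))  # 100 users x 4 servers
```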
40. Grid Recommendations
• If Globus is middleware, then need:
– Platform independent, automatic, installation.
– Tool support for configuration and deployment
creation, validation, viewing and editing.
– Management console for grid, nodes, globus
packages, containers and services.
– Support for remote, location independent,
cross-organisational, multiple role scenarios.
41. Grid Recommendations (continued)
• If Globus is middleware, then need:
– Remote deployment and management of
services.
– Remote distributed debugging of grid
installations, services, and applications.
– Tool support, and more scalable processes for
security.
42. Grid Alternatives
• Next we plan to evaluate the two architectural
choices in more detail
– Science exposed as services, vs science code managed
by higher level grid services.
• Explore alternative mechanisms for:
– Load balancing and resource management
– Directory services (service and resource discovery)
– Data movement approaches (e.g. SOAP Attachments vs
GridFTP)
43. Grid Performance
• First approach (initial results)
– Scientific benchmark (SciMark2.0) modified to
measure throughput, and invoked as a Stateful Grid
Service
– Metric is Calls Per Minute (CPM) – one unit of work.
– No data movement, just computation and memory load.
– JVM: 512MB Heap and -server (of course :-))
• Good performance and scalability
– Security has minimal overhead
– Problem with client side timeouts as response times
increase
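Throughput and response time are linked. Assume each client thread issues its next call as soon as the previous one returns, in a closed loop with no think time. That is an assumption about the test harness rather than something stated on the slide. Calls per minute then follow from the thread count and the average response time:

```python
def calls_per_minute(threads: int, avg_response_s: float) -> float:
    """Closed-loop throughput: each thread completes 60 / response-time calls a minute."""
    if avg_response_s <= 0:
        raise ValueError("average response time must be positive")
    return threads * 60.0 / avg_response_s


# Illustrative numbers only, not measurements from the test-bed.
print(calls_per_minute(threads=1, avg_response_s=4.0))   # 15.0
print(calls_per_minute(threads=8, avg_response_s=16.0))  # 30.0
```

Under this model, adding threads raises throughput only while response time grows more slowly than the thread count.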
44. Grid Performance
[Figure: ART (s) against number of threads (0 to 70), with series for UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), and All; Tomcat]
Fastest: 3.6s (Edinburgh)
Slowest: 25s (UCL)
45. Grid Performance
[Figure: Throughput (CPM) against number of threads (0 to 80), with series for UCL (4 cpu Sun), Newcastle (2 cpu Intel), Imperial (2 cpu Intel), Edinburgh (4 hyperthread cpu Intel), All (12 cpus), and Theoretical Maximum]
95% of predicted maximum throughput
46. Grid Performance
• Tomcat vs Test container
– No difference on 3 out of 4 nodes
– But 67% faster on one node (Newcastle, slowest Intel
box)
• Attachments will work with GT3 and Tomcat
– But not with security
– Limit of 1GB (DIME)
– Bug in Axis – doesn’t clean up temporary files.
47. Grid Performance
• Stateful instances can be problematic
– Intermittent unreliability
• On some runs, 1 exception in 300 calls (reliability of .9967)
– But non-repeatable, SOAP/network related?
• What is the safe response to exceptions? Can’t just retry.
– Possible to kill container (relies on clients being well
behaved):
• By invoking same instance/method more than once.
• By consuming container resources
– But instances can be passivated/activated in theory
– Could be used to enable fine-grain (per instance) control over
resource usage.
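For scale: one exception in 300 calls is a per-call success rate of about 0.9967. If failures were independent, the chance of a clean run falls quickly with run length. Independence is an assumption; the slide suggests the failures may be network related and so possibly correlated.

```python
def per_call_reliability(failures: int, calls: int) -> float:
    """Fraction of calls that completed without an exception."""
    return 1.0 - failures / calls


def clean_run_probability(reliability: float, calls: int) -> float:
    """Probability that `calls` independent calls all succeed."""
    return reliability ** calls


r = per_call_reliability(failures=1, calls=300)
print(round(r, 4))                              # 0.9967
print(round(clean_run_probability(r, 300), 2))  # 0.37
```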
48. Grid Deployment
• How to install and configure Grid infrastructure
and services - scalably and securely?
• Install GT3 infrastructure and security manually
– MMJFS allows executable code to be staged
automatically (But not services - could provide a
deployment service).
• Install bootstrapping code, and then install and
deploy all other code and security automatically.
– Using SmartFrog (HP) in the lab, and then test-bed.
– Configuring GT3 security remotely is an open issue, as
is “trust” with System Administrators.
49. Grid Dreams - Debugging
• Debugging distributed systems is tricky
– Need better support for cross-cutting non-functional concerns such
as deployment and debugging.
– (One) problem with debugging services is not knowing the context
of errors (to aid diagnosis or cure) – a service is just an interface.
• Deployment aware debugging:
– Starting from functional work-flows, generate deployment-flows,
which are executed prior to, or concurrent with, functional work-
flows.
– If failure in functional work-flow, then corresponding deployment-
flow is examined to determine likely causes, and parts are re-
executed.
50. Grid Dreams - Debugging
• Backtrack through deployment steps (Like peeling
an onion)
– Some steps will need to be reversed
– Track dependencies, and redundant operations.
• This approach may fix an (interesting) sub-class of
problems:
• Those which can be fixed by simply redoing (or replicating) (part
of) the installation, E.g.
– Intermittent failure of container or services
– Resource starvation or overload
• Security problems that can be fixed with reconfiguration or
refresh of certificates/proxies.
– But not:
• network, or all configuration and security/access problems.
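The onion-peeling idea can be sketched as a loop over recorded deployment steps. This is an illustrative toy, not the project's implementation: step names and the `state` dictionary are hypothetical, and it only re-executes steps, leaving out the reversal and dependency tracking mentioned above.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (name, action that re-executes the step)


def peel_and_redo(steps: List[Step], works: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Redo deployment steps, most recent first, until `works()` passes.

    Returns (fixed, names of the steps redone in the final attempt).
    """
    if works():
        return True, []
    for depth in range(1, len(steps) + 1):
        layer = steps[-depth:]   # peel back `depth` layers of the onion
        for _, redo in layer:    # re-apply them in deployment order
            redo()
        if works():
            return True, [name for name, _ in layer]
    return False, [name for name, _ in steps]


# Toy deployment state: the container has died, the proxy is still valid.
state = {"container_up": False, "proxy_valid": True}
steps = [
    ("install GT3", lambda: None),
    ("start container", lambda: state.update(container_up=True)),
    ("generate proxy", lambda: state.update(proxy_valid=True)),
]
print(peel_and_redo(steps, lambda: state["container_up"] and state["proxy_valid"]))
```

A functional check that never passes, such as a network fault, exhausts every layer and returns `False`, which matches the class of problems this approach cannot fix.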
51. UK OGSA Evaluation Project
• Thank you :-)
– Questions/Comments?
• Email: P.Brebner@cs.ucl.ac.uk
– After November: Paul.Brebner@csiro.au
• Not (quite) the End…
58. Postscript – The Secret Life of Grid?
Our experience evaluating Grid technology reminds me of an
Australian book (“The Secret Life of Wombats”) about a school boy
who used to sneak out of his dormitory after everyone was asleep to go
“wombatting”. He spent his nights secretly crawling down Wombat
burrows with a flashlight – a potentially lethal activity (not just from
cave-ins, as wombats are ferocious when cornered!) – and wrote
copious notes resulting in a substantial increase in knowledge of these
“mysterious and often misunderstood creatures”.
UK OGSA Evaluation Project Report 1.0
Evaluation of Globus Toolkit 3.2 (GT3.2)
Installation
http://sse.cs.ucl.ac.uk/UK-OGSA/Report1.doc