I presented this keynote talk at the WorldComp conference in Las Vegas, on July 13, 2009. In it, I summarize what grid is about, focusing on the "integration" function rather than the "outsourcing" function (what people call "cloud" today), using biomedical examples in particular.
Grid Computing July 2009
1. Grid computing Ian Foster Computation Institute Argonne National Lab & University of Chicago
2. “When the network is as fast as the computer’s internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
4. “Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961)
9. We need to function in the zone of complexity (Ralph Stacey, Complexity and Creativity in Organizations, 1996). Axes: agreement about outcomes vs. certainty about outcomes, each running from low to high. Regions: plan and control, chaos, and the zone of complexity between them.
17. The Grid paradigm and information integration. Data sources: radiology, medical records, pathology, genomics, labs, RHIO. Platform services: name resources and move data around; make resources accessible over the network; make resources usable and useful; manage who can do what.
18. The Grid paradigm and information integration. Data sources: radiology, medical records, pathology, genomics, labs, RHIO. Platform services: management, integration, publication, security and policy. Value services: transform data into knowledge; enhance user cognitive processes; incorporate into business processes.
19. The Grid paradigm and information integration. Data sources: radiology, medical records, pathology, genomics, labs, RHIO. Platform services: management, integration, publication, security and policy. Value services: analysis, cognitive support, applications.
22. Policy language abstraction level and expressiveness, from simplest to most expressive:
Identity-based authZ: most simple, not scalable.
Unix access control lists (discretionary access control, DAC): groups, directories, simple administration.
POSIX ACLs / MS ACLs: finer-grained admin policy.
Role-based access control (RBAC): separation of role/group administration from rule administration.
Mandatory access control (MAC): clearance, classification, compartmentalization.
Attribute-based access control (ABAC): generalization to arbitrary attributes.
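To make the ABAC idea at the top of this progression concrete, here is a minimal sketch of attribute-based policy evaluation. It is illustrative only; the roles, attribute names, and rules are invented for this example and are not from the talk.

```python
# Minimal attribute-based access control (ABAC) sketch.
# A policy is a predicate over subject and resource attributes plus an action.
# All role/attribute names below are hypothetical.

def can_access(subject, resource, action):
    """Grant access when the subject's attributes satisfy the policy."""
    if action == "read":
        # Researchers may read de-identified data from any site.
        if subject.get("role") == "researcher" and resource.get("deidentified"):
            return True
        # Clinicians may read records from their own institution.
        if (subject.get("role") == "clinician"
                and subject.get("org") == resource.get("org")):
            return True
    if action == "write":
        # Only the owning institution's data managers may write.
        return (subject.get("role") == "data-manager"
                and subject.get("org") == resource.get("org"))
    return False

alice = {"role": "researcher", "org": "uchicago"}
scan = {"org": "anl", "deidentified": True}
print(can_access(alice, scan, "read"))   # True
print(can_access(alice, scan, "write"))  # False
```

The point of the progression on the slide is that each step generalizes the last: identity lists are a degenerate ABAC policy over one attribute (identity), RBAC a policy over a role attribute, and full ABAC a policy over arbitrary attribute sets.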
29. Children’s Oncology Group Enterprise/Grid Interface service: DICOM protocols on one side, Grid protocols (Web services) on the other; plug-in adapters for DICOM, XDS, HL7, and vendor-specific interfaces; a wide-area service actor.
31. As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical).
36. Health Object Identifier (HOI) naming system. Example: uri:hdl://888.us.npi.1234567890.dicom/8A648C33-A5…4939EBE. The identifier body is a random string: PHI-free and guaranteed unique. 888 is CHI’s top-level naming authority; the National Provider ID is used in the hierarchical identifier namespace; the application context’s namespace is governed by the provider naming authority. The HOI URI schema identifier is based on Handle.
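The structure on the slide can be sketched in a few lines. This is an illustration of the idea (random, PHI-free suffix under a hierarchical Handle-style prefix), not the actual HOI implementation; the function name and use of a UUID for the identifier body are my assumptions.

```python
import uuid

def make_hoi(naming_authority, npi, context):
    """Construct a Handle-style Health Object Identifier (illustrative).
    The prefix encodes the naming authority, the National Provider ID, and
    the application context; the suffix is random, so it carries no
    protected health information (PHI) and is unique with overwhelming
    probability."""
    suffix = str(uuid.uuid4()).upper()  # random, PHI-free identifier body
    return f"uri:hdl://{naming_authority}.us.npi.{npi}.{context}/{suffix}"

hoi = make_hoi("888", "1234567890", "dicom")
print(hoi)
```

The key design point is that the name itself reveals nothing about the patient: ownership and context live in the administered prefix, while the object-specific part is opaque, which directly addresses the PHI-tainted-filename problem discussed later in the notes.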
39. Integration: making information useful. Axes: degree of prior syntactic and semantic agreement (0% to 100%) vs. degree of communication (0% to 100%). Three strategies plotted against them: a rigid standards-based approach, a loosely coupled approach, and an adaptive approach.
41. ECOG 5202 integrated sample management ECOG CC ECOG PCO MD Anderson Web portal OGSA-DQP OGSA-DAI OGSA-DAI OGSA-DAI Mediator
44. Many, many tasks: identifying potential drug targets. 2M+ ligands x protein target(s). (Mike Kubal, Benoit Roux, and others)
45. The docking funnel. PDB protein descriptions (1 protein, ~1 MB) are manually prepped into DOCK6 and FRED receptor files (1 per protein; each defines the pocket to bind to); ZINC supplies 3-D structures for 2M ligands (~6 GB). DOCK6 and FRED dock ligands into the receptor (~4M tasks x 60 s x 1 CPU, ~60K CPU-hours); the best ~5K complexes are selected from each. Amber scoring follows: 1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: generate NAB script (BuildNABScript, from a NAB script template plus parameters defining flexible residues and number of MD steps), 5. RunNABScript (~10K tasks x 20 min x 1 CPU, ~3K CPU-hours); the best ~500 are selected. GCMC then runs (~500 tasks x 10 hr x 100 CPUs, ~500K CPU-hours) and a report ends the workflow. For one target: 4 million tasks, 500,000 CPU-hours (50 CPU-years).
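The CPU-hour figures quoted on the slide can be checked with simple arithmetic. This sketch tallies the three funnel stages using only the task counts and durations quoted above:

```python
# Back-of-the-envelope tally of the docking funnel's cost,
# using the per-stage task counts and durations quoted on the slide.

stages = [
    # (name, tasks, seconds per task, CPUs per task)
    ("DOCK6/FRED docking", 4_000_000, 60, 1),
    ("Amber scoring", 10_000, 20 * 60, 1),
    ("GCMC refinement", 500, 10 * 3600, 100),
]

total = 0.0
for name, tasks, secs, cpus in stages:
    cpu_hours = tasks * secs * cpus / 3600
    total += cpu_hours
    print(f"{name}: ~{cpu_hours:,.0f} CPU-hours")

print(f"Total: ~{total:,.0f} CPU-hours (~{total / (24 * 365):.0f} CPU-years)")
```

Note that the final GCMC stage, though it runs on only ~500 selected complexes, dominates the total: the funnel spends cheap docking cycles broadly and expensive free-energy cycles narrowly.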
47. Scaling POSIX to petascale. A large dataset sits on a global file system; data is staged over torus and tree interconnects to a CN-striped intermediate file system (Chirp for multicast, MosaStore for striping) and to local LFS on compute nodes (local datasets).
48. Efficiency for 4-second tasks and varying data sizes (1KB to 1MB) for CIO and GPFS, up to 32K processors.
53. Functioning in the zone of complexity (Ralph Stacey, Complexity and Creativity in Organizations, 1996). Axes: agreement about outcomes vs. certainty about outcomes, each running from low to high. Regions: plan and control, chaos.
54. The Grid paradigm and information integration. Data sources: radiology, medical records, pathology, genomics, labs, RHIO. Platform services: management, integration, publication, security and policy. Value services: analysis, cognitive support, applications.
55. “The computer revolution hasn’t happened yet.” Alan Kay, 1997
56. Connectivity (on a log scale) vs. time, with curves for science, enterprise, and consumer adoption: Grid, then Cloud, then ???? “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
With high-speed networks, the Internet becomes more than a communications device: it becomes a computing device. We can disintegrate the computer, outsourcing computing and storage, for example. And we can aggregate capabilities (data and software; computing and storage) from many places. The outsourcing/on-demand part is what people have called grid, utility computing, and more recently infrastructure as a service or cloud. It seems to be going mainstream, which is very exciting (and about time!). It’s worth remembering that these ideas are old.
What I want to focus on today is the aggregation part, and in particular on the “virtual organization” concept. Let me remind us of another comment made back in 2001.
Early on, people realized that it didn’t make sense for people to travel to computers: we should be able to compute outside the box. AI pioneer John McCarthy, for example, spoke in these terms in 1961, at the launch of Project MAC (?). Here he is a couple of years ago, as such an industry is just emerging. It takes a while.
We cite [Rouse, Health Care as a CAS: Implications for Design…, NAE 2008] for the right-hand side part. Must support: dynamic composition for a specific purpose; an evolving community, function, and environment; messy data, failure, and incomplete knowledge. Nice, but insufficient: data standards, platform standards, federal policies.
Another perspective on the problem. A few words of explanation: if we are deploying a hospital IT system, we have … Add other regions of agreement. You can’t achieve success via central planning. Quoted in Crossing the Quality Chasm, p. 312.
We could show these things as moving if we wanted to be really clever. Over time, things change and these groups evolve; if we are successful, they merge.
Foster, Kesselman, and Tuecke claimed that grids were all about “virtual organizations.” The way one should interpret that claim, I would assert, is in the context of Gilder’s comments. Things are distributed, for one reason or another: via a deliberate disintegration process, via outsourcing, or because they just started out distributed. Now we need to reassemble them, in a controlled manner. We gave some examples.
The first encompasses what people are tending to call “cloud” today. The fourth, of course, we are quite familiar with! Today, I would use some additional examples, taken from healthcare, a field that I believe will be the “killer app” for VO technologies.
In particular, the organizational behavior and management community, who have studied virtual organizations for many years. Our VOs have a lot in common with theirs, but also differences: we’re not just about people, and maybe not even particularly about people. Fortunately, we were able to speak to a lot of these people a couple of years ago, via some NSF workshops we organized.
The results are online: “a blueprint for advancing the design, development, and evaluation of virtual organizations.” One interesting anecdote: I found that just as computer scientists can resent being brought into collaborative projects to “write code,” so organizational people can resent being brought in to “fix organizations.” One thing I learned was that …
Technology that has been under development for some years. Include Globus logo. caGrid, BIRN, LHC.
Sharing relationships form and devolve dynamically, e.g., temporally. Picture on left?
“Make data usable and useful”: initially, I had “Address syntactic, semantic differences.”
Talk about API vs. protocol. Add “ilities” and function benefits to the stack.
[Create an image here.] For example, DICOM and HL7 combine messaging and data model in the same interoperability standard. People are contextualizing this problem at the data-interoperability level; systems interoperability is often neglected. An area of differentiation: bringing best practice from industry and science into the health care space. Open-source platform. Experience with systems-interoperability standards: IETF, OASIS, W3C.
Attribute authorities emerge as an important system component. Bridge between local and global: an honest broker is an example. Not sure what “policy in the network” means.
List services from
DO SOMETHING INTERESTING ON THE RIGHT. Scaling via automating data adapters. Representations of those things, and semantics of those representations. Talk about how services are published, data modeling, etc. Publish databases; publish services; name published objects.
Why childhood cancer? Rare. Five-year survival rates for all childhood cancers combined increased from 58.1 percent in 1975-77 to 79.6 percent in 1996-2003.
Built using the same mechanisms used to build SOI: PKI, delegation, attribute-based authorization; registries, monitoring. Operating a service is a pain! It would be nice to outsource, but services need to be near the data, which also has privacy concerns. So things become complicated.
Objects are published and need to be named; then they can be moved around without losing track of them. Bulk data movement; fine-grained access for data integration.
GridFTP = high-performance data movement, multiple protocols, credential delegation, restart. RLS = P2P system, soft state, Bloom filters. BUT: the services themselves are operated by the LIGO community, and running persistent, reliable, scalable services is expensive and difficult.
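An aside on why a replica location service would use Bloom filters: each catalog can publish a compact, lossy summary of the logical file names it holds, so peers can cheaply test "might this site have the file?" before querying it. A minimal sketch of the data structure (illustrative only, not the RLS implementation; the example file names are invented):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: a fixed bit array plus k hash positions per item.
    Membership tests can yield false positives but never false negatives,
    which is acceptable for summarizing a replica catalog."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("lfn://ligo/strain-2009-07-13.gwf")
print(bf.might_contain("lfn://ligo/strain-2009-07-13.gwf"))  # True
print(bf.might_contain("lfn://ligo/absent-file.gwf"))  # almost certainly False
```

The "soft state" mentioned above fits the same design: summaries are periodically republished and simply expire, so a failed catalog's stale filter ages out without any explicit cleanup protocol.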
Clinical, administrative, research. Issues are often hidden and escalate. Uniqueness: no guaranteed global uniqueness. Name ownership: no ability to prove that a certain entity issued a name. PHI-tainted names: filenames for some images have the patient ID embedded, so sharing the name alone may constitute a HIPAA violation.
Talk about handle….
TO PUT IN A SLIDE? Loose coupling and encapsulation. Interoperability through integration based on data mediation. Evolutionary in nature. A set of scalable systems and methods. Explicit in the architecture: a data-integration layer. Demonstrated in GSI, GridFTP, MDS, ECOG.
This would be a good place for a graphic, perhaps showing top down vs. bottom up.
No coordinated data systems: an Excel spreadsheet here, a Web service bolted to an application there, an Oracle database elsewhere.
Workflows are becoming a widespread mechanism for coordinating the execution of scientific services and linking scientific resources: analytical and data-processing pipelines. Is this stuff real? EBI saw 3 million+ web service API submissions in 2007. A lot? We want to publish workflows as services. Think of caBIG services as service providers that then invoke grid services to execute services (e.g., via TeraGrid gateways).
"Docking" is the identification of the low-energy binding modes of a small molecule (ligand) within the active site of a macromolecule (receptor) whose structure is known. A compound that interacts strongly with (i.e., binds) a receptor associated with a disease may inhibit its function and thus act as a drug. Typical workload: application size 7MB (static binary); static input data 35MB (binary and ASCII text); dynamic input data 10KB (ASCII text); output data 10KB (ASCII text); expected execution time 5~5000 seconds; parameter space 1 billion tasks.
More precisely, step 3 is “GCMC + hydration.” Mike Kubal says: “This task is a Free Energy Perturbation computation using the Grand Canonical Monte Carlo algorithm for modeling the transition of the ligand (compound) between different potential states and the General Solvent Boundary Partition to explicitly model the water molecules in the volume around the ligand and pocket of the protein. The result is a binding energy just like the task at the top of the funnel; it is just a more rigorous attempt to model the actual interaction of protein and compound. To refer to the task in short hand, you can use "GCMC + hydration". This is a method that Benoit has pioneered.”
Application efficiency was computed between the 16-rack and 32-rack runs. Sustained utilization is the utilization achieved during the part of the experiment when there was enough work to do, 0 to 5300 sec. Overall utilization is the number of CPU-hours used divided by the total number of CPU-hours allocated. The experiment included the caching of the 36 MB (52 MB uncompressed) archive on each node at first access. We use “dd” to move data to and from GPFS. The application itself had some bad I/O patterns in the write, which prevented it from scaling well, so we decided to write to RAM and then dd back to GPFS. For this particular run, we had 464 Falkon services running on 464 I/O nodes, 118K workers (256 per Falkon service), and one client on a login node. The 32-rack job took 15 minutes to start. It took the client 6 minutes to establish a connection and set up the corresponding state with all 464 Falkon services. It took the client 40 seconds to dispatch 118K tasks to 118K CPUs. The rest can be seen from the graph and slide text.
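The two utilization metrics described above differ only in their denominators. A small sketch makes the distinction concrete; the busy-hours and wall-clock figures here are hypothetical, chosen only to illustrate the calculation, though the worker count and 5300-second window come from the notes.

```python
def utilization(busy_cpu_hours, cpus, wall_hours):
    """Fraction of allocated CPU-hours actually spent on tasks."""
    return busy_cpu_hours / (cpus * wall_hours)

cpus = 118_000       # workers in the run described above
busy = 150_000.0     # hypothetical CPU-hours of useful work

# Sustained utilization: denominator covers only the window with
# enough work to keep all workers busy (0 to 5300 s).
sustained = utilization(busy, cpus, wall_hours=5300 / 3600)

# Overall utilization: denominator covers the whole allocation,
# including startup, connection setup, and ramp-down.
overall = utilization(busy, cpus, wall_hours=2.0)  # hypothetical 2-hour allocation

print(f"sustained={sustained:.2f} overall={overall:.2f}")
```

Given the 15-minute job start and 6-minute connection setup reported above, a large gap between the two metrics is expected: overhead that idles 118K workers shows up only in the overall number.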
Because we are still mostly computing inside the box
Why now? The law of unexpected consequences. Like the Web: not just Tim Berners-Lee’s genius, but also disk drive capacity. What will happen when ubiquitous high-speed wireless means we can all reach any service anytime, and powerful tools mean we can author our own services? A fascinating set of challenges: What sort of services? What applications? What does openness mean in this context? How do we address interoperability, portability, composition? Accounting, security, audit?