8. Idea Dates Back to the 1960s. [Diagram: applications running on CMS, CMS, and MVS guests, hosted by IBM VM/370 on an IBM mainframe.] Native (full) virtualization. Example: VMware ESX. Virtualization was first widely deployed with IBM VM/370.
9. What Do You Optimize? Goal: Minimize latency and control heat. Goal: Maximize data (with matching compute) and control cost.
14. What Resource is Managed? Supercomputer Center Model: scarce processors wait for data; manage cycles; wait for an opening in the queue; scatter the data to the processors and gather the results. Data Center Model: persistent data waits for queries; manage data; computation is done locally; results are returned.
15. Part 2. Data Centers as the Unit of Computing. Cloud computing is at the top of the Gartner hype cycle. “Cloud computing has become the center of investment and innovation.” Nicholas Carr, 2009 IDC Directions
18. Transition Taking Place. A handful of players are building multiple data centers a year and improving with each one. This includes Google, Microsoft, Yahoo, … A data center today costs $200 M to $400+ M. The Berkeley RAD Report points out an analogy with the semiconductor industry, where companies stopped building their own fabs and started leasing fabs from others as fabs approached $1B.
19. Which is the Operating System? [Diagram: on a workstation, a hypervisor hosts VM 1 … VM 5; in a data center, a data center operating system hosts VM 1 … VM 50,000.]
21. Some Programming Models for Data Centers. Operations over a data center of disks: MapReduce (“string-based”); User-Defined Functions (UDFs) over the data center; SQL and quasi-SQL over the data center; data analysis / statistics over the data center. Operations over a data center of memory: grep over distributed memory; UDFs over distributed memory; SQL and quasi-SQL over distributed memory; data analysis / statistics over distributed memory.
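As a rough illustration of the "grep over distributed memory" model, the sketch below keeps each partition in a plain Python list standing in for one node's memory and runs the same pattern match against every partition. The names `partitions`, `grep_partition`, and `distributed_grep` are hypothetical, and threads stand in for nodes; a real system would run each scan on the node that holds the data.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def grep_partition(pattern, lines):
    """Return the lines in one in-memory partition that match the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def distributed_grep(pattern, partitions):
    """Scan every partition in parallel and gather matches in partition order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_node = pool.map(lambda part: grep_partition(pattern, part), partitions)
        return [match for matches in per_node for match in matches]

# Three toy partitions, each standing in for the memory of one node (made-up data).
partitions = [
    ["GET /index.html 200", "GET /missing 404"],
    ["POST /login 200", "GET /admin 403"],
    ["GET /gone 404"],
]
matches = distributed_grep(r" 404$", partitions)
print(matches)
```

Only the matching lines travel back to the caller; the data itself stays where it is, which is the point of the data center model.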
23. Open Cloud Consortium (OCC). U.S. 501(c)(3) not-for-profit corporation. Supports the development of standards and interoperability frameworks. Supports reference implementations for cloud computing. Manages testbeds: Open Cloud Testbed, Intercloud Testbed, Open Science Data Cloud. Develops benchmarks. www.opencloudconsortium.org
24. OCC Members. Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo. Universities: CalIT2, Johns Hopkins, Northwestern, University of Illinois at Chicago, University of Chicago. Government agencies: NASA. Organizations: Sector Project.
33. Open Science Data Cloud. Planning to work with 5 international partners (all connected with 10 Gbps networks). [Diagram labels: sky cloud, biocloud.]
34. MalStone (OCC-Developed Benchmark). Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data set spanned 20 nodes, with 500 million 100-byte records per node.
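The data volume implied by these figures can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the node count, record count, and record size stated on the slide, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
nodes = 20
records_per_node = 500_000_000
bytes_per_record = 100

# Per-node and rack-wide totals derived from the stated figures.
bytes_per_node = records_per_node * bytes_per_record
total_records = nodes * records_per_node
total_bytes = nodes * bytes_per_node

print(f"{bytes_per_node / 1e9:.0f} GB per node")
print(f"{total_records / 1e9:.0f} billion records, {total_bytes / 1e12:.0f} TB in total")
```

That works out to 50 GB per node and 1 TB (10 billion records) across the rack.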
35. Some Lessons Learned (So Far). Python over the Hadoop Distributed File System is surprisingly powerful. Tuning Hadoop can be a large (unacknowledged) cost. The performance of a cloud computation can be significantly impacted by just 1 or 2 nodes that are a bit slower. Wide area clouds can be practical in some cases.
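The point about slow nodes follows from a simple model, assumed here for illustration: if the work is split evenly and the job finishes only when every node finishes, the wall-clock time is the maximum of the per-node times, not the average. The node count and speeds below are made-up numbers, not measurements from the testbed.

```python
def job_time(work_per_node, node_speeds):
    """Wall-clock time when every node gets equal work and the job waits for all nodes."""
    return max(work_per_node / speed for speed in node_speeds)

work_per_node = 100.0             # arbitrary units of work per node (assumed)
healthy = [1.0] * 20              # 20 nodes at full speed
one_slow = [1.0] * 19 + [0.8]     # same rack with one node running 20% slower

baseline = job_time(work_per_node, healthy)
degraded = job_time(work_per_node, one_slow)
slowdown = degraded / baseline
print(f"baseline={baseline:.1f}, degraded={degraded:.1f}, slowdown={slowdown:.2f}x")
```

Under this model a single node at 80% speed stretches the whole job by 25%, even though the rack's aggregate capacity dropped by only 1%.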
36. Part 4. Sector. http://sector.sourceforge.net
37. Sector Overview. Sector is fast: as measured by MalStone and Terasort. Sector is easy to program: supports UDFs, MapReduce, and Python over streams. Sector does not require extensive tuning. Sector is secure: a HIPAA-compliant Sector cloud is being set up. Sector is reliable: Sector v1.24 supports multiple master node servers.
38. Google’s Large Data Cloud (Google’s Stack). Applications. Compute Services: Google’s MapReduce. Data Services: Google’s BigTable. Storage Services: Google File System (GFS).
39. Hadoop’s Large Data Cloud (Hadoop’s Stack). Applications. Compute Services: Hadoop’s MapReduce. Data Services. Storage Services: Hadoop Distributed File System (HDFS).
40. Sector’s Large Data Cloud (Sector’s Stack). Applications. Compute Services: Sphere’s UDFs. Data Services. Storage Services: Sector’s Distributed File System (SDFS). Routing & Transport Services: UDP-based Data Transport Protocol (UDT).
41. Generalization: Apply User-Defined Functions (UDFs) to Files in Storage Cloud. [Diagram: the map/shuffle stage and the reduce stage are each replaced by a UDF.]
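A minimal sketch of the idea, not Sphere's actual API: the storage cloud is modeled as a dict from file names to lists of records, and `apply_udf` (a hypothetical helper) runs an arbitrary user-defined function over every file. Word count is then expressed as two UDF passes, one playing the role of map/shuffle and one playing the role of reduce. All names and data here are assumptions made for illustration.

```python
from collections import defaultdict

def apply_udf(udf, files):
    """Run a user-defined function over every file and collect its output records."""
    output = []
    for name in sorted(files):
        output.extend(udf(files[name]))
    return output

def map_udf(records):
    """Map-style UDF: emit a (word, 1) pair for every word in every record."""
    return [(word, 1) for record in records for word in record.split()]

def shuffle(pairs, num_buckets):
    """Group pairs into bucket files by key hash so each key lands in one bucket."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[f"bucket{hash(key) % num_buckets}"].append((key, value))
    return dict(buckets)

def reduce_udf(pairs):
    """Reduce-style UDF: sum the values for each key within one bucket."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Toy "storage cloud": two files of text records (made-up data).
files = {"part0": ["sector sphere", "sector"], "part1": ["sphere udf sector"]}
pairs = apply_udf(map_udf, files)
counts = dict(apply_udf(reduce_udf, shuffle(pairs, num_buckets=2)))
print(counts)
```

Because `apply_udf` accepts any function over a file's records, MapReduce becomes one special case: any other per-file computation can be plugged in the same way.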
42. Hadoop vs. Sector. Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
43. Terasort: Sector vs. Hadoop Performance. Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
44. Sector Applications. Distributing the 15 TB Sloan Digital Sky Survey to astronomers around the world (joint with JHU, 2005). Managing and analyzing high-throughput sequence data (Cistrack, University of Chicago, 2007). Detecting emergent behavior in distributed network data (Angle, won SC 07 Analytics Challenge). Image processing for high-throughput sequencing. Wide area clouds (won SC 09 BWC with 100 Gbps wide area computation). New ensemble-based algorithms for trees. Graph processing.
45. [Diagram: Cistrack components.] Cistrack Web Portal & Widgets; Cistrack Elastic Cloud Services; Cistrack Database; Analysis Pipelines & Re-analysis Services; Cistrack Large Data Cloud Services; Ingestion Services.
46. Thank you. For more information, please see blog.rgrossman.com