SVR17: Data-Intensive Computing on Windows HPC Server with the ...
1. Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
  John Vert, Architect, Microsoft Corporation
  SVR17
2. Moving Parts
  Windows HPC Server 2008 – cluster management, job scheduling
  Dryad – distributed execution engine: failure recovery, distribution, and scalability across very large partitioned datasets
  LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
  PLINQ – multi-core parallelism across LINQ queries
  DryadLINQ – brings LINQ ease of programming to Dryad
3. Software Stack
  [Diagram: layered stack, top to bottom]
  .NET applications: image processing, machine learning, graph analysis, data mining, …
  DryadLINQ
  Dryad
  HPC Job Scheduler
  Windows HPC Server 2008 (repeated across four boxes in the diagram)
4. Dryad
  Provides a general, flexible distributed execution layer
    Dataflow graph as the computation model
    Graph can be modified by runtime optimizations
  Higher language layer supplies graph, vertex code, serialization code, hints for data locality
  Automatically handles distributed execution
    Distributes code, routes data
    Schedules processes on machines near data
    Masks failures in cluster and network
7. LINQ: Language Integrated Query
  Declarative extensions to C# and VB.NET for iterating over collections
    In memory
    Via data providers
  SQL-like
  Broadly adoptable by developers
    Easy to use
    Reduces written code
    Predictable results
    Scalable experience
    Deep tooling support
8. PLINQ: Parallel Language Integrated Query
  Value proposition: enable LINQ developers to take advantage of parallel hardware with only a basic understanding of data parallelism
  Declarative data parallelism (focus on the "what", not the "how")
  Alternative to LINQ-to-Objects
    Same set of query operators, plus some extras
    Default is IEnumerable<T> based
  Preview in Parallel Extensions to .NET Framework 3.5 CTP
  Shipping in .NET Framework 4.0 Beta 2
9. DryadLINQ: LINQ to clusters
  Declarative programming style of LINQ for clusters
  Automatic parallelization
    Parallel query plan exploits multi-node parallelism
    PLINQ underneath exploits multi-core parallelism
  Integration with VS and .NET
    Type safety, automatic serialization
  Query plan optimizations
    Static optimization rules to optimize locality
    Dynamic run-time optimizations
10. DryadLINQ: From LINQ to Dryad
  Automatic query plan generation
  Distributed query execution by Dryad
  [Diagram: LINQ query → query plan → Dryad; the plan reads logs, then applies where, then select]

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
11. A Simple LINQ Query

    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
12. A Simple PLINQ Query

    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies.AsParallel()
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
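The LINQ and PLINQ queries above differ only in the AsParallel() call on the source. The following is a minimal sketch that runs both against in-memory data. The BabyInfo fields and the BabyQueries class are assumptions for illustration, since the slides do not show the real type.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type: the slides do not show the real BabyInfo definition.
public class BabyInfo
{
    public string Name;
    public string State;
    public int Year;
}

public static class BabyQueries
{
    // Sequential LINQ-to-Objects version of the query from the slide.
    public static List<BabyInfo> Sequential(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }

    // PLINQ version: the only change is AsParallel() on the source.
    // The orderby clause keeps the output sorted even though the
    // filtering may run on several cores.
    public static List<BabyInfo> WithPlinq(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies.AsParallel()
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }
}
```

Both methods should return the same rows in the same order, so callers can switch between them without changing any downstream code.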
13. A Simple DryadLINQ Query

    PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
14. PartitionedTable<T>: core data structure for DryadLINQ
  Scale-out, partitioned container for .NET objects
  Derives from IQueryable<T>, IEnumerable<T>
  ToPartitionedTable() extension methods
  DryadLINQ operators consume and produce PartitionedTable<T>
  DryadLINQ generates code to serialize/deserialize your .NET objects
  Underlying storage can be a partitioned file, a partitioned SQL table, or a cluster filesystem
17. A typical data-intensive query

    var logs = PartitionedTable.Get<string>("weblogs.pt");

    // Go through logs and keep only lines that are not comments.
    // Parse each line into a new LogEntry object.
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

    // Go through logentries and keep only entries that are accesses by jvert.
    var user =
        from access in logentries
        where access.user.EndsWith(@"vert")
        select access;

    // Group jvert accesses according to what page they correspond to.
    // For each page, count the occurrences.
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("jvert", pages.Key, pages.Count());

    // Sort the pages jvert has accessed according to access frequency.
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
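Because PartitionedTable<T> is also an IEnumerable<T>, the same query shape can be tried on in-memory data first. The sketch below runs the query with plain LINQ-to-Objects. The LogEntry and UserPageCount definitions, and the "user page" line format, are assumptions made for illustration; the slides use these types without showing them.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical parser: assumes each log line is "<user> <page>".
public class LogEntry
{
    public string user;
    public string page;

    public LogEntry(string line)
    {
        var fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

// Hypothetical result type with the three fields the query constructs.
public class UserPageCount
{
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogDemo
{
    // Same four query stages as the slide, over an in-memory sequence.
    public static List<UserPageCount> HtmAccesses(IEnumerable<string> logs)
    {
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);

        var user = from access in logentries
                   where access.user.EndsWith(@"vert")
                   select access;

        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount("jvert", pages.Key, pages.Count());

        var htmAccesses = from access in accesses
                          where access.page.EndsWith(".htm")
                          orderby access.count descending
                          select access;

        return htmAccesses.ToList();
    }
}
```

Calling LogDemo.HtmAccesses with a handful of test lines is a quick way to check the query logic before pointing it at a partitioned table.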
18. Dryad Parallel DAG execution
  [Diagram: Dryad DAG with stages logs → logentries → user → accesses → htmAccesses → output, one stage per query below]

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

    var user =
        from access in logentries
        where access.user.EndsWith(@"vert")
        select access;

    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("jvert", pages.Key, pages.Count());

    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
19. Query plan generation
  Separation of query from its execution context
    Add all the loaded assemblies as resources
    Eliminate references to local variables by partially evaluating all the expressions in the query
    Distribute objects used by the query
    Detect impure queries when possible
  Automatic code generation
    Object serialization code for Dryad channels
    Managed code for Dryad vertices
  Static query plan optimizations
    Pipelining: composing multiple operators into one vertex
    Minimize unnecessary data repartitions
    Other standard DB optimizations
20. DryadLINQ query plan

    Query 0 Output: file://hpcmetahn01Cutput7e651a4-38b7-490c-8399-f63eaba7f29a.pt
    DryadLinq0.dll was built successfully.
    Input:
      [PartitionedTable: file://weblogs.pt]
    Super__1:
      Where(line => !(line.StartsWith(_)))
      Select(line => new logdemo.LogEntry(line))
      Where(access => access.user.EndsWith(_))
      DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
      DryadHashPartition(e => e.Key,e => e.Key)
    Super__12:
      DryadMerge()
      DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
      Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
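The plan above splits the group-and-count into two stages: Super__1 counts pages within each input partition and hash-partitions the partial counts by key, and Super__12 merges the partials and sums them. The sketch below imitates that shape with LINQ-to-Objects on in-memory "partitions". The TwoPhaseCount class, its method names, and the bucket function are illustrative assumptions, not DryadLINQ APIs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TwoPhaseCount
{
    // Illustrative hash function mapping a key to one of `buckets` partitions.
    private static int Bucket(string key, int buckets)
    {
        return (key.GetHashCode() & 0x7fffffff) % buckets;
    }

    public static Dictionary<string, int> CountPages(
        IEnumerable<IEnumerable<string>> partitions, int buckets)
    {
        // Stage 1 (like Super__1): partial count per key inside each partition.
        var partials = partitions
            .Select(part => part
                .GroupBy(page => page)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .ToList())
            .ToList();

        // Hash partition + merge: route every partial count to the bucket
        // that owns its key, so each key ends up in exactly one bucket.
        var merged = Enumerable.Range(0, buckets)
            .Select(b => partials
                .SelectMany(p => p)
                .Where(kv => Bucket(kv.Key, buckets) == b));

        // Stage 2 (like Super__12): sum the partial counts per key.
        return merged
            .SelectMany(bucket => bucket.GroupBy(
                kv => kv.Key,
                kv => kv.Value,
                (key, counts) => new KeyValuePair<string, int>(key, counts.Sum())))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Counting locally before repartitioning means only one small pair per distinct key and partition has to cross the network, rather than every log record.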
21. XML representation
  Generated by DryadLINQ and passed to Dryad

    <Query>
      <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
      <ClusterName>hpcmetahn01</ClusterName>
      ...
      <Resources>
        <Resource>wrappernativeinfo.dll</Resource>
        <Resource>DryadLinq0.dll</Resource>
        <Resource>System.Threading.dll</Resource>
        <Resource>logdemo.exe</Resource>
        <Resource>LinqToDryad.dll</Resource>
      </Resources>
      <QueryPlan>
        <Vertex>
          <UniqueId>0</UniqueId>
          <Type>InputTable</Type>
          <Name>weblogs.pt</Name>
          ...
        </Vertex>
        <Vertex>
          <UniqueId>1</UniqueId>
          <Type>Super</Type>
          <Name>Super__1</Name>
          ...
          <Children>
            <Child>
              <UniqueId>0</UniqueId>
            </Child>
          </Children>
        </Vertex>
        ...
      </QueryPlan>
    </Query>

  Callouts: the Resources element is the list of files to be shipped to the cluster; the QueryPlan element holds the vertex definitions.
22. DryadLINQ generated code
  Compiled at runtime; the assembly is passed to Dryad to implement vertices

    public sealed class DryadLinq__Vertex
    {
        public static int Super__1(string args)
        {
            < . . . >
            DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
            var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
            var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
            var source__4 = DryadLinqVertex.DryadWhere(dreader__3,
                line => (!(line.StartsWith(@"#"))), true);
            var source__5 = DryadLinqVertex.DryadSelect(source__4,
                line => new logdemo.LogEntry(line), true);
            var source__6 = DryadLinqVertex.DryadWhere(source__5,
                access => access.user.EndsWith(@"vert"), true);
            var source__7 = DryadLinqVertex.DryadGroupBy(source__6,
                access => access.page,
                (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
                null, true, true, false);
            DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
            DryadLinqLog.Add("Vertex Super__1 completed at {0}",
                DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
            return 0;
        }

        public static int Super__12(string args)
        {
            < . . . >
        }
    }
23. DryadLINQ query operators
  Almost all the useful LINQ operators
    Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
  Operators introduced by DryadLINQ
    HashPartition, RangePartition, Merge, Fork
  Dryad Apply
    Operates on sequences rather than items
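24. MapReduce in DryadLINQ

    public static IQueryable<R> MapReduce<T, M, K, R>(
        this IQueryable<T> source,                                  // sequence of Ts
        Expression<Func<T, IEnumerable<M>>> mapper,                 // T -> Ms
        Expression<Func<M, K>> keySelector,                         // M -> K
        Expression<Func<IGrouping<K, M>, IEnumerable<R>>> reducer)  // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                              // sequence of Rs
    }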
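To make the three-operator pattern concrete, here is a minimal sketch of the same MapReduce shape over in-memory IEnumerable<T> sequences, used for a word count. The MapReduceSketch class and the WordCount helper are illustrative names, not part of DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // In-memory analogue of the slide's MapReduce: SelectMany, GroupBy, SelectMany.
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,
        Func<T, IEnumerable<M>> mapper,                 // T -> Ms
        Func<M, K> keySelector,                         // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)  // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.SelectMany(reducer);
    }

    // Word count: map each line to its words, key by the word itself,
    // and reduce each group to a single (word, count) pair.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines
            .MapReduce(
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

The reducer returns a sequence rather than a single value, which is why the final step is SelectMany: a reducer may emit zero, one, or many results per key.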
25. K-means in DryadLINQ

    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
    {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
    {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
    }

    var vectors = PartitionedTable.Get<Vector>("vectors.pt");
    IQueryable<Vector> centers = vectors.Take(100);
    for (int i = 0; i < 10; i++)
    {
        centers = Step(vectors, centers);
    }
    centers.ToPartitionedTable<Vector>("centers.pt");

    public class Vector
    {
        public double[] entries;

        [Associative]
        public static Vector operator +(Vector v1, Vector v2) { … }
        public static Vector operator -(Vector v1, Vector v2) { … }
        public double Norm2() { … }
    }
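The slide elides the Vector operator bodies. The sketch below fills them in with one plausible in-memory version so that a single k-means step can be run with LINQ-to-Objects. Everything here is an assumption for illustration: the constructor, the division operator, and Norm2 taken as the squared Euclidean norm (sufficient for comparing distances). The [Associative] attribute is DryadLINQ-specific and is omitted.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Vector
{
    public double[] entries;

    public Vector(params double[] entries) { this.entries = entries; }

    public static Vector operator +(Vector v1, Vector v2)
    {
        return new Vector(v1.entries.Zip(v2.entries, (a, b) => a + b).ToArray());
    }

    public static Vector operator -(Vector v1, Vector v2)
    {
        return new Vector(v1.entries.Zip(v2.entries, (a, b) => a - b).ToArray());
    }

    // Assumed helper: the slide divides a Vector by group.Count().
    public static Vector operator /(Vector v, int n)
    {
        return new Vector(v.entries.Select(a => a / n).ToArray());
    }

    // Assumed to be the squared Euclidean norm.
    public double Norm2()
    {
        return entries.Sum(a => a * a);
    }
}

public static class KMeansSketch
{
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
    {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    // One iteration: assign each point to its nearest center, then
    // replace each center with the mean of the points assigned to it.
    // Centers are materialized in a list so that grouping by the
    // center object (reference equality) is stable.
    public static List<Vector> Step(IEnumerable<Vector> vectors, IList<Vector> centers)
    {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x, y) => x + y) / group.Count())
                      .ToList();
    }
}
```

A center that attracts no points produces no group, so the returned list can be shorter than the input list of centers; a production version would need to decide how to handle that case.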
26. Putting it all together: it's LINQ all the way down
  Major League Baseball dataset
    Pitch-by-pitch data for every MLB game since 2007
    47,909 pitch XML files (one for each pitcher appearance)
    6,127 player XML files (one for each player)
  Hash partition the input data files to distribute the work
  LINQ to XML to shred the data
  DryadLINQ to analyze dataset
27. Load the dataset and partition
  Define Pitch and Player classes

    void StagePitchData(string[] fileList, string PartitionedFile)
    {
        // partition the list of filenames across
        // 20 nodes of the cluster
        var pitches = fileList.ToPartitionedTable("filelist")
            .HashPartition((x) => (x), 20)
            .SelectMany((f) => XElement.Load(f).Elements("atbat"))
            .SelectMany((a) => a.Elements("pitch")
                .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                         (string)a.Attribute("batter"),
                                         p)));
        pitches.ToPartitionedTable(PartitionedFile);
    }

    void StagePlayerData(string[] fileList, string PartitionedFile)
    {
        var players = fileList.Select((p) => new Player(XElement.Load(p)));
        players.ToPartitionedTable(PartitionedFile);
    }
30. DryadLINQ on HPC Server
  DryadLINQ program runs on client workstation
    Develop, debug, run locally
  When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
  HPC Server allocates resources for the job and schedules a single task, the Dryad Job Manager (JM)
  The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
  When the job completes, the client program picks up the output result and continues
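The point that nothing runs until ToPartitionedTable() is called mirrors LINQ's ordinary deferred execution. The sketch below shows that behavior with plain LINQ-to-Objects: building the query does no work, and the predicate only runs when the result is materialized. The DeferredDemo class and its counter are illustrative; ToList() stands in for the materializing call, and no DryadLINQ API is used.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many times the filter predicate has actually run.
    public static int Evaluations;

    // Builds a query but does not execute it: Where() only records
    // the predicate and returns a lazy sequence.
    public static IEnumerable<int> BuildQuery(IEnumerable<int> source)
    {
        return source.Where(x =>
        {
            Evaluations++;
            return x % 2 == 0;
        });
    }

    // Materializes the query, which forces the predicate to run
    // once per input element.
    public static List<int> Materialize(IEnumerable<int> query)
    {
        return query.ToList();
    }
}
```

After BuildQuery, Evaluations is still zero; it only advances inside Materialize. In DryadLINQ the analogous materialization step is where code generation and job submission happen, as the slide describes.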
31. Examples of DryadLINQ Applications
  Data mining
    Analysis of service logs for network security
    Analysis of Windows Watson/SQM data
    Cluster monitoring and performance analysis
  Graph analysis
    Accelerated Page-Rank computation
    Road network shortest-path preprocessing
  Image processing
    Image indexing
    Decision tree training
    Epitome computation
  Simulation
    Light flow simulations for next-generation display research
    Monte-Carlo simulations for mobile data
  eScience
    Machine learning platform for health solutions
    Astrophysics simulation
32. Ongoing Work
  Advanced query optimizations
    Combination of static analysis and annotations
    Sampling execution of the query plan
    Dynamic query optimization
  Incremental computation
  Real-time event processing
  Global scheduling
    Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
  Scale-out partitioned storage
    Pluggable storage providers
    DryadLINQ on Azure
  Better debugging, performance analysis, visualization, etc.
33. Additional Resources
  Dryad and DryadLINQ
    http://connect.microsoft.com/DryadLINQ
    DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
  PLINQ
    Available in Parallel Extensions to .NET Framework 3.5 CTP
    Available in .NET Framework 4.0 Beta 2
    http://msdn.microsoft.com/en-us/concurrency/default.aspx
    http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
  Windows HPC Server 2008
    http://www.microsoft.com/hpc
    Download it, try it, we want your feedback!