Expressiveness, Simplicity, and Users Craig Chambers Google
A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, Urs Hölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
Self Language [Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
Self v2 [Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered multiple inheritance Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
Self Compiler [Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded polymorphism and “where” clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, “open classes” is well-known Multimethods not widely available
Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: “deceptively simple” Unification != Swiss Army Knife Language papers have had more citations; compiler work has had more practical impact The combination can work well
A Current Project: Flume [Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelines easy to write yet efficient to run
Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: “embarrassingly parallel” analysis of petabytes of data
Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
MapReduce [Dean & Ghemawat, 04]
           purchases               queries
map        item -> co-item         term -> hour+city
shuffle    item -> all co-items    term -> (hour+city)*
reduce     item -> recommend       term -> what’s hot, when
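The "purchases" column can be sketched as three plain Java functions run in memory. This is only an illustration of the map, shuffle, and reduce roles, not the MapReduce library's API: the class name CoPurchaseSketch, the basket data, and the "most frequent co-item" recommendation rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; not the real MapReduce API.
final class CoPurchaseSketch {
    // Map: for each item in a basket, emit (item, co-item) for every other item.
    static List<String[]> map(List<String> basket) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : basket) {
            for (String other : basket) {
                if (!item.equals(other)) {
                    pairs.add(new String[] {item, other});
                }
            }
        }
        return pairs;
    }

    // Shuffle: gather all co-items emitted for the same item.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce: recommend the most frequent co-item (assumed rule; ties go to
    // the alphabetically first co-item because the TreeMap iterates in order).
    static String reduce(List<String> coItems) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String coItem : coItems) {
            counts.merge(coItem, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Runs the three phases in sequence over all baskets.
    static Map<String, String> recommend(List<List<String>> baskets) {
        List<String[]> pairs = new ArrayList<>();
        for (List<String> basket : baskets) {
            pairs.addAll(map(basket));
        }
        Map<String, String> recommendations = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : shuffle(pairs).entrySet()) {
            recommendations.put(e.getKey(), reduce(e.getValue()));
        }
        return recommendations;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
            Arrays.asList("tent", "stove"),
            Arrays.asList("tent", "stove", "lamp"),
            Arrays.asList("stove", "fuel"));
        System.out.println(recommend(baskets));
    }
}
```

In a real deployment the map and reduce calls would run on many machines and the shuffle would move data across the network; here everything runs sequentially in one process.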
MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions, as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline, e.g., via a series of MapReduces Manage lower-level details automatically
Flume Classes and Methods Core data-parallel collection classes: PCollection<T>,  PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(),  top(CompareFn,N), ...
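One way to see how a derived method such as count() can be built from the primitives is the toy sketch below. It uses ordinary in-memory lists and maps as stand-ins for PCollection and PTable, so the class DerivedCountSketch and the signatures of its parallelDo, groupByKey, and combineValues helpers are assumptions for illustration, not the FlumeJava API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical in-memory stand-ins; not the real FlumeJava classes.
final class DerivedCountSketch {
    // Stand-in for parallelDo: apply fn to every element (sequentially here).
    static <A, B> List<B> parallelDo(List<A> in, Function<A, B> fn) {
        List<B> out = new ArrayList<>();
        for (A a : in) {
            out.add(fn.apply(a));
        }
        return out;
    }

    // Stand-in for groupByKey: collect all values that share a key.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Stand-in for combineValues: fold each key's values with an associative combiner.
    static <K extends Comparable<K>, V> Map<K, V> combineValues(
            Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            V acc = null;
            for (V v : e.getValue()) {
                acc = (acc == null) ? v : combiner.apply(acc, v);
            }
            out.put(e.getKey(), acc);
        }
        return out;
    }

    // Derived operation: count() expressed as parallelDo + groupByKey + combineValues.
    static <T extends Comparable<T>> Map<T, Long> count(List<T> in) {
        Function<T, Map.Entry<T, Long>> toOne =
            x -> new AbstractMap.SimpleEntry<T, Long>(x, 1L);
        List<Map.Entry<T, Long>> ones = parallelDo(in, toOne);
        return combineValues(groupByKey(ones), Long::sum);
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to", "be", "or", "not", "to", "be")));
    }
}
```

Because count() only calls the three primitives, a system that knows how to optimize and run those primitives needs no special support for it.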
Example: TopWords
  PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt"));
  PCollection<String> words = lines.parallelDo(new ExtractWordsFn());
  PTable<String, Long> wordCounts = words.count();
  PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000);
  PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn());
  formattedOutput.writeTo(TextIO.sink("cnts.txt"));
  FlumeJava.run();
Example: TopWords
  read(TextIO.source("/gfs/corpus/*.txt"))
    .parallelDo(new ExtractWordsFn())
    .count()
    .top(new OrderCountsFn(), 1000)
    .parallelDo(new FormatCountFn())
    .writeTo(TextIO.sink("cnts.txt"));
  FlumeJava.run();
Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
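The deferred-evaluation idea can be illustrated with a small sketch in which each call only records a graph node, and nothing runs until a result is demanded. The Node class, its run() method, and the log list are hypothetical names invented for this example; FlumeJava's own graph representation is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of lazy graph building; not FlumeJava's implementation.
final class LazyGraphSketch {
    // Records the order in which nodes are actually evaluated.
    static final List<String> log = new ArrayList<>();

    // A graph node: an unevaluated result that knows how to compute itself.
    static final class Node<T> {
        private final String name;
        private final Supplier<List<T>> compute;
        private List<T> result; // stays null until the node is demanded

        Node(String name, Supplier<List<T>> compute) {
            this.name = name;
            this.compute = compute;
        }

        // Lazy primitive: returns a new node pointing at this one; fn is not called yet.
        <U> Node<U> parallelDo(String opName, Function<T, U> fn) {
            Node<T> input = this;
            return new Node<U>(opName, () -> {
                List<U> out = new ArrayList<>();
                for (T element : input.run()) {
                    out.add(fn.apply(element));
                }
                return out;
            });
        }

        // Demand the result: evaluate this node (and, through it, its inputs) once.
        List<T> run() {
            if (result == null) {
                log.add("run " + name);
                result = compute.get();
            }
            return result;
        }
    }

    static <T> Node<T> read(List<T> source) {
        return new Node<T>("read", () -> source);
    }

    public static void main(String[] args) {
        Node<Integer> lengths =
            read(Arrays.asList("to", "be")).parallelDo("length", String::length);
        System.out.println(log);           // nothing has been evaluated at this point
        System.out.println(lengths.run()); // evaluation is demanded here
        System.out.println(log);
    }
}
```

Because the whole graph exists before anything runs, an optimizer has the chance to rewrite it first, which is the point of making the primitives lazy.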
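Execution Graph [figure: the TopWords pipeline as a graph of primitive nodes, in order: read, pDo, pDo, gbk, cv, pDo, gbk, pDo, pDo, write; annotated with the calls read(TextIO.source("/…/*.txt")), parallelDo(new ExtractWordsFn()), count(), top(new OrderCountsFn(), 1000), parallelDo(new FormatCountFn()), writeTo(TextIO.sink("cnts.txt"))]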
Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers (“siblings”) Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo → MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output
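Producer-consumer fusion can be pictured as function composition: two element-wise steps become one, so the collection between them never has to be built. The sketch below assumes one-to-one functions for simplicity (a real DoFn may emit zero or many outputs per input), and FusionSketch, fuse, and the collectionsBuilt counter are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of producer-consumer fusion; not FlumeJava's optimizer.
final class FusionSketch {
    // Counts how many output collections get materialized.
    static int collectionsBuilt = 0;

    // Toy parallelDo: one pass over the input, one new collection per call.
    static <A, B> List<B> parallelDo(List<A> in, Function<? super A, ? extends B> fn) {
        collectionsBuilt++;
        List<B> out = new ArrayList<>(in.size());
        for (A a : in) {
            out.add(fn.apply(a));
        }
        return out;
    }

    // Fusion: compose producer and consumer into a single element-wise function.
    static <A, B, C> Function<A, C> fuse(Function<A, B> producer, Function<B, C> consumer) {
        return producer.andThen(consumer);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Flume", "Java");
        Function<String, String> lower = String::toLowerCase;
        Function<String, Integer> length = String::length;

        // Unfused: two passes and an intermediate collection of lowercased words.
        List<String> lowered = parallelDo(words, lower);
        List<Integer> unfused = parallelDo(lowered, length);
        int unfusedBuilt = collectionsBuilt;

        // Fused: one pass, no intermediate collection.
        collectionsBuilt = 0;
        List<Integer> fused = parallelDo(words, fuse(lower, length));

        System.out.println(unfused.equals(fused));
        System.out.println(unfusedBuilt + " collections unfused, " + collectionsBuilt + " fused");
    }
}
```

Sibling fusion, where several consumers read the same input, follows the same idea but produces several outputs from one pass; it is not shown here.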
Final Pipeline [figure: the TopWords execution graph before and after fusion; 8 operations become 2 operations (two mscr nodes)]
Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
Another Example: SiteData [figure: execution graph of pDo, gbk, cv, and join() nodes; user functions shown: GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, MakeDocTraitsFn]
Another Example: SiteData [figure: the same graph after optimization; 11 ops become 2 ops (two mscr nodes)]
Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process megabytes <=> petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
Performance
Optimizer Impact
Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
Response: FlumeJava Replace custom language with Java + Flume library More verbose syntactically
All standard libraries & coding idioms preserved
Much less risk
Easy to try out, easy to like, easy to adopt

Expressiveness, Simplicity and Users

  • 1. Expressiveness, Simplicity, and Users Craig Chambers Google
  • 2. A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, UrsHölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
  • 3. Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
  • 4. Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
  • 5. Self Language[Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
  • 6. Self v2[Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered mult. inh. Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
  • 7. Self Compiler[Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
  • 8. Cecil Language[Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded poly. and “where” clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, “open classes” is well-known Multimethods not widely available 
  • 9. Vortex Compiler[Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
  • 10. Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
  • 11. Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: “deceptively simple” Unification != Swiss Army Knife Language papers have had more citations;compiler work has had more practical impact The combination can work well
  • 12. A Current Project:Flume[Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelineseasy to write yetefficient to run
  • 13. Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: “embarrassingly parallel” analysis ofpetabytes of data
  • 14. Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
  • 15. MapReduce[Dean & Ghemawat, 04] purchases queries map item -> co-item term -> hour+city shuffle item -> all co-items term-> (hour+city)* reduce item -> recommend term-> what’s hot, when
  • 16. MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
  • 17. Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions,as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline,e.g., via a series of MapReduces Manage lower-level details automatically
  • 18. Flume Classes and Methods Core data-parallel collection classes: PCollection<T>, PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(), top(CompareFn,N), ...
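To make "derive other methods from these primitives" concrete, here is a minimal sketch of how a `count()` could be built from `parallelDo`, `groupByKey`, and `combineValues`. It uses plain in-memory `List` and `Map` values and runs eagerly; the class `ToyPrimitives` and its signatures are assumptions for illustration and are not the FlumeJava classes or API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Toy in-memory stand-ins for the primitives named on the slide; not the FlumeJava API. */
final class ToyPrimitives {
    private ToyPrimitives() {}

    /** parallelDo: apply fn to each element; each call may emit zero or more outputs. */
    static <T, R> List<R> parallelDo(List<T> input, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    /** groupByKey: gather all values that share a key. */
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** combineValues: fold each key's (non-empty) values with an associative combiner. */
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            List<V> values = group.getValue();
            V acc = values.get(0);
            for (int i = 1; i < values.size(); i++) {
                acc = combiner.apply(acc, values.get(i));
            }
            out.put(group.getKey(), acc);
        }
        return out;
    }

    /** Derived operation: count() = parallelDo (emit word -> 1) + groupByKey + combineValues (sum). */
    static Map<String, Long> count(List<String> words) {
        Function<String, List<Map.Entry<String, Long>>> toOne = word -> {
            Map.Entry<String, Long> pair = new AbstractMap.SimpleEntry<>(word, 1L);
            return Collections.singletonList(pair);
        };
        List<Map.Entry<String, Long>> ones = parallelDo(words, toOne);
        Map<String, List<Long>> grouped = groupByKey(ones);
        return combineValues(grouped, Long::sum);
    }
}
```

Because `count` is written purely in terms of the three primitives, a system that knows how to optimize and run those primitives needs no special knowledge of `count` itself.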
  • 19. Example: TopWords PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt")); PCollection<String> words = lines.parallelDo(new ExtractWordsFn()); PTable<String, Long> wordCounts = words.count(); PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000); PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn()); formattedOutput.writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 20. Example: TopWords read(TextIO.source("/gfs/corpus/*.txt")) .parallelDo(new ExtractWordsFn()) .count() .top(new OrderCountsFn(), 1000) .parallelDo(new FormatCountFn()) .writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 21. Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
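The deferred-evaluation idea can be sketched in a few lines: each call only records a node, and work happens when a result is demanded. The class `LazyNode` and its methods are hypothetical names chosen for this sketch; it is single-threaded, supports only a one-to-one `parallelDo`, and does no optimization before executing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy deferred-evaluation graph node. Hypothetical names, not the FlumeJava API. */
final class LazyNode<T> {
    private final String op;                   // operation name, e.g. "read" or "pDo"
    private final LazyNode<?> input;           // node producing this node's input (null for a source)
    private final Supplier<List<T>> compute;   // the deferred work
    private List<T> result;                    // null until demanded: the node acts as a "future"

    private LazyNode(String op, LazyNode<?> input, Supplier<List<T>> compute) {
        this.op = op;
        this.input = input;
        this.compute = compute;
    }

    /** Source node: nothing is copied or read until run() is called. */
    static <E> LazyNode<E> read(List<E> source) {
        return new LazyNode<E>("read", null, () -> new ArrayList<>(source));
    }

    /** Lazy primitive: records a new node pointing at this one and returns immediately. */
    <R> LazyNode<R> parallelDo(Function<T, R> fn) {
        return new LazyNode<R>("pDo", this, () -> {
            List<R> out = new ArrayList<>();
            for (T element : run()) {   // demanding this node demands its input first
                out.add(fn.apply(element));
            }
            return out;
        });
    }

    /** True once this node's result has been computed. */
    boolean isEvaluated() {
        return result != null;
    }

    /** Demand the result; the computation runs at most once per node. */
    List<T> run() {
        if (result == null) {
            result = compute.get();
        }
        return result;
    }

    /** Render the chain of operations leading to this node. */
    String describe() {
        return input == null ? op : input.describe() + " -> " + op;
    }
}
```

Building `LazyNode.read(lines).parallelDo(...)` leaves every node unevaluated, and `describe()` can already report the graph shape (`read -> pDo`). That window between construction and `run()` is where an optimizer gets a chance to rewrite the graph.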
  • 22. Execution Graph [Diagram: the TopWords calls expanded into primitive nodes, in order] read(TextIO.source("/…/*.txt")): read; parallelDo(new ExtractWordsFn()): pDo; count(): pDo, gbk, cv; top(new OrderCountsFn(), 1000): pDo, gbk, pDo; parallelDo(new FormatCountFn()): pDo; writeTo(TextIO.sink("cnts.txt")): write
  • 23. Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers (“siblings”) Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo => MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output
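Producer-consumer fusion amounts to function composition: two adjacent parallelDo steps become one step that feeds the first function's outputs straight into the second, so the intermediate collection is never built. The sketch below shows only that one case on in-memory lists; `FusionSketch`, `fuse`, and the list-returning function shape are assumptions for illustration, not the actual optimizer, and sibling fusion and MSCR formation are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy producer-consumer fusion of two parallelDo steps. Hypothetical names. */
final class FusionSketch {
    private FusionSketch() {}

    /** One parallelDo pass: apply fn to each element, collecting all emitted outputs. */
    static <A, R> List<R> parallelDo(List<A> input, Function<A, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (A element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    /**
     * Fuse producer f and consumer g into a single function, so that
     * parallelDo(parallelDo(in, f), g) can run as parallelDo(in, fuse(f, g))
     * without materializing the intermediate collection.
     */
    static <A, B, C> Function<A, List<C>> fuse(
            Function<A, List<B>> f, Function<B, List<C>> g) {
        return a -> {
            List<C> out = new ArrayList<>();
            for (B b : f.apply(a)) {
                out.addAll(g.apply(b));
            }
            return out;
        };
    }
}
```

For any `f` and `g` without side effects, the fused single pass produces the same elements in the same order as the two separate passes, which is what lets an optimizer apply the rewrite without asking the user.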
  • 24. Final Pipeline [Diagram: the TopWords execution graph before and after fusion; 8 operations become 2 operations (two MSCRs)]
  • 25. Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
  • 26. Another Example: SiteData [Diagram: execution graph of pDo, gbk, and cv nodes, labeled GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, join(), MakeDocTraitsFn]
  • 27. Another Example: SiteData [Diagram: the same graph after fusion; 11 ops become 2 ops (two MSCRs)]
  • 28. Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process anywhere from megabytes to petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
  • 29. How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
  • 32. Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
  • 33. A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
  • 34. Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
  • 35. Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
  • 36. Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
  • 37. Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
  • 39. All standard libraries & coding idioms preserved
  • 41. Easy to try out, easy to like, easy to adopt
  • 42. Dynamic optimizer less constrained than static optimizer
  • 45. Conclusions Simpler ideas easier to adopt By researchers and by users Sophisticated ideas still needed, to support simple interfaces Doing things dynamically instead of statically can be liberating