Cascading: Enterprise Data Workflows based on Functional Programming
1. “Cascading: Enterprise Data Workflows based on Functional Programming”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright ©2013, Concurrent, Inc.
2. Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data
[Diagram labels: Document Collection; Tokenize; Scrub token; HashJoin Left; Stop Word List (RHS); Regex token; GroupBy token; Count; Word Count; phase markers M and R]
3. Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this.
4. Circa 1996: pre-inflection point
[Diagram labels: Stakeholder; Product; Engineering; Customers; BI Analysts; Excel pivot tables, PowerPoint slide decks; strategy; requirements; SQL Query, result sets; optimized code; Web App; transactions; RDBMS]
5. Circa 1996: pre-inflection point
[Same diagram as the previous slide, with the callout:]
“Throw it over the wall”
6. Circa 2001: post- big ecommerce successes
[Diagram labels: Stakeholder; Product; Engineering; Customers; dashboards; UX; Algorithmic Modeling; models; recommenders + classifiers; Web Apps; servlets; Middleware; aggregation; event history; SQL Query, result sets; customer transactions; Logs; DW; ETL; RDBMS]
7. Circa 2001: post- big ecommerce successes
[Same diagram as the previous slide, with the callout:]
“Data products”
8. Circa 2013: clusters everywhere
[Diagram labels: Data Products; Customers; Domain Expert; business process; Workflow; Prod; dashboard metrics; data science; History; Web Apps, Mobile, etc.; s/w dev; services; Data Scientist; Planner; discovery + modeling; social interactions; optimized capacity; transactions, content; Eng; taps; App Dev; Use Cases Across Topologies; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; near time; Cluster Scheduler; introduced capability; existing SDLC; RDBMS]
9. Circa 2013: clusters everywhere
[Same diagram as the previous slide, with the callout:]
“Optimizing topologies”
10. references…
by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
11. references…
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
13. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.
14. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
15. functional programming… in production
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
16. Cascading – definitions
• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
[Diagram labels, example Enterprise architecture: Customers; Web App; logs; Cache; Logs; Support; source tap, trap tap, sink tap; Data Workflow; Modeling; PMML; Analytics Cubes; Customer Prefs; customer profile DBs; Hadoop Cluster; Reporting]
17. Cascading – usage
• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals
18. Cascading – integrations
• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode
19. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
20. Cascading – deployments
[Same bullets as the previous slide, with the callout:]
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
22. The Ubiquitous Word Count
Definition: count how often each word appears
in a collection of text documents
This simple program provides an excellent test case for
parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");
void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
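The map and reduce pseudocode above can be sketched in plain Java, without Hadoop or Cascading. This is only an illustration under stated assumptions: the class name `WordCountSketch`, the in-memory lists standing in for a cluster, and the lower-casing of tokens are all choices made here, not part of the original deck.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the map/reduce word count.
public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every token in one document
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // lower-casing is an assumption made for this sketch
        for (String w : text.toLowerCase().split("[ \\[\\](),.]+")) {
            if (!w.isEmpty()) {
                pairs.add(new SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // "shuffle" and "reduce" phases: group the pairs by word, then sum each group
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // run the map phase over every document, then reduce all emitted pairs
    static Map<String, Integer> wordCount(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs) {
            pairs.addAll(map(doc));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area.",
            "Rain falls (mostly) on the windward side.");
        System.out.println(wordCount(docs));
    }
}
```

In a real cluster the grouping step is performed by the framework between the two phases; here `Map.merge` plays that role.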
23. word count – conceptual flow diagram
[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; phase markers M and R]
1 map
1 reduce
18 lines code
cascading.org/category/impatient
gist.github.com/3900702
24. word count – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps (tab-delimited text with a header row)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
26. word count – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
27. word count – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
28. word count – Scalding / Scala
import com.twitter.scalding._
class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
29. word count – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
30. word count – Scalding / Scala
[Same bullets as the previous slide, with the callout:]
Cascalog and Scalding DSLs leverage the functional aspects
of MapReduce, helping limit complexity in process
32. workflow abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java.
In formal terms, this provides a pattern language.
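One way to picture the “plumbing” metaphor is to treat each pipe stage as a function from one tuple stream to another and to assemble a workflow by composing those functions. The sketch below uses only the Java standard library; it is not the Cascading API. The names `PipeSketch`, `scrub`, `dropEmpty`, and `dropStopWords` are hypothetical, and a tuple is simplified to a single String field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: pipe stages modeled as composable stream functions.
public class PipeSketch {
    // stage 1: trim and lower-case every token
    static Function<Stream<String>, Stream<String>> scrub() {
        return in -> in.map(s -> s.trim().toLowerCase());
    }

    // stage 2: filter out empty tokens
    static Function<Stream<String>, Stream<String>> dropEmpty() {
        return in -> in.filter(s -> !s.isEmpty());
    }

    // stage 3: filter out tokens found in a stop word list
    static Function<Stream<String>, Stream<String>> dropStopWords(Set<String> stop) {
        return in -> in.filter(s -> !stop.contains(s));
    }

    // assemble the stages in order; the source and sink are plain lists here
    static List<String> run(List<String> source, Set<String> stopWords) {
        Function<Stream<String>, Stream<String>> assembly =
            scrub().andThen(dropEmpty()).andThen(dropStopWords(stopWords));
        return assembly.apply(source.stream()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(" Rain", "the", "", "Shadow ");
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(run(tokens, stop));
    }
}
```

Because each stage is a pure function of its input stream, stages can be reordered, reused, or tested in isolation, which is the functional programming aspect the slide refers to.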
33. references…
pattern language: a structured method for solving
large, complex design problems, where the syntax of
the language promotes the use of best practices
amazon.com/dp/0195019199
design patterns: the notion originated in consensus
negotiation for architecture, later applied in OOP
software engineering by “Gang of Four”
amazon.com/dp/0201633612
34. workflow abstraction – pattern language
[Same content as slide 32, with the callout:]
design principles of the pattern language ensure best practices
for robust, parallel data workflows at scale
35. workflow abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
In formal terms, flow diagrams leverage a methodology
called literate programming
Provides intuitive, visual representations for apps –
great for cross-team collaboration
36. references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
37. workflow abstraction – business process
Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
38. references…
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data
39. workflow abstraction – functional relational programming
The combination of functional programming, pattern language,
DSLs, literate programming, business process, etc., traces back
to the original definition of the relational model (Codd, 1970)
prior to SQL.
Cascalog, in particular, implements more of what Codd intended
for a “data sublanguage” and is considered to be close to a full
implementation of the functional relational programming
paradigm defined in:
Moseley & Marks, 2006
“Out of the Tar Pit”
goo.gl/SKspn
40. workflow abstraction – functional relational programming
[Same content as the previous slide, with the callout:]
several theoretical aspects converge into software engineering
practices which minimize the complexity of building and
maintaining Enterprise data workflows
42. Enterprise Data Workflows
Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps
43. Enterprise Data Workflows
Same example… in the back office
Organizations have substantial investments
in people, infrastructure, process
44. Enterprise Data Workflows
Same example… the heavy lifting!
“Main Street” firms are migrating
workflows to Hadoop, for cost
savings and scale-out
45. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
schema and provenance get derived from analysis of the taps
46. Cascading workflows – taps
source and sink taps for TSV data in HDFS, from the word count app (slide 24):
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
47. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
blend flows in different topologies into the same app – for example,
batch (Hadoop) + transactions (IMDG)
48. Cascading workflows – topologies
flow planner for Apache Hadoop topology, from the word count app (slide 24):
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
50. Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
redirect traps in production to Ops, QA, Support, Audit, etc.
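The idea of asserting a regex on a tuple stream and routing failures to a trap can be sketched in plain Java. This is not the Cascading assertion or trap API; `TrapSketch`, `Result`, and `assertMatches` are hypothetical names, and tuples are simplified to raw text lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: a stream assertion that diverts bad tuples to a trap.
public class TrapSketch {
    // holds the tuples that passed (sink) and the "data exceptions" (trap)
    static final class Result {
        final List<String> sink = new ArrayList<>();
        final List<String> trap = new ArrayList<>();
    }

    // assert a regex on every tuple; failures are trapped rather than failing the flow
    static Result assertMatches(List<String> tuples, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Result result = new Result();
        for (String tuple : tuples) {
            if (pattern.matcher(tuple).matches()) {
                result.sink.add(tuple);
            } else {
                result.trap.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "doc01\tA rain shadow is a dry area",
            "a malformed line without a doc id",
            "doc02\tRain falls on the windward side");
        // expected shape (an assumption for this sketch): doc id, tab, text
        Result result = assertMatches(lines, "^doc\\d+\\t.+$");
        System.out.println("sink: " + result.sink.size() + ", trap: " + result.trap.size());
    }
}
```

In this sketch the trapped lines are simply collected in a list; the slide's point is that in production such records can be redirected to whichever team needs to inspect them.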
51. Two Avenues to the App Layer…
[Chart axes: complexity ➞, scale ➞]
Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
53. Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
a language for queries – not a database,
but ANSI SQL as a DSL for workflows
54. Lingual – CSV data in local file system
cascading.org/lingual
57. abstraction layers in queries…
abstraction   | RDBMS                                  | JVM Cluster
parser        | ANSI SQL compliant parser              | ANSI SQL compliant parser
optimizer     | logical plan, optimized based on stats | logical plan, optimized based on stats
planner       | physical plan                          | API “plumbing”
machine data  | query history, table stats             | app history, tuple stats
topology      | b-trees, etc.                          | heterogeneous, distributed: Hadoop, in-memory, etc.
visualization | ERD                                    | flow diagram
schema        | table schema                           | tuple schema
catalog       | relational catalog                     | tap usage DB
provenance    | (manual audit)                         | data set producers/consumers
58. Lingual – JDBC driver
public void run() throws ClassNotFoundException, SQLException {
Class.forName( "cascading.lingual.jdbc.Driver" );
Connection connection =
DriverManager.getConnection(
"jdbc:lingual:local;schemas=src/main/resources/data/example" );
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
"select *\n"
+ "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
+ "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
+ "on e.\"EMPID\" = s.\"CUST_ID\"" );
while( resultSet.next() ) {
int n = resultSet.getMetaData().getColumnCount();
StringBuilder builder = new StringBuilder();
for( int i = 1; i <= n; i++ ) {
builder.append( ( i > 1 ? "; " : "" )
+ resultSet.getMetaData().getColumnLabel( i )
+ "="
+ resultSet.getObject( i ) );
}
System.out.println( builder );
}
resultSet.close();
statement.close();
connection.close();
}
59. Lingual – JDBC result set
$ gradle clean jar
$ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian
Caveat: if you absolutely positively must have sub-second
SQL query response for PB-scale data on a 1000+ node
cluster… Good luck with that! (call the MPP vendors)
This ANSI SQL library is primarily intended for batch
workflows – high throughput, not low-latency –
for many under-represented use cases in Enterprise IT.
In other words, SQL as a DSL.
cascading.org/lingual
60. Lingual – connecting Hadoop and R
# load the JDBC package
library(RJDBC)
# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
# set up a database connection to a local repository
connection <- dbConnect(drv,
"jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
61. Lingual – connecting Hadoop and R
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
cascading.org/lingual
63. Pattern – model scoring
• migrate workloads: SAS, Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL
cascading.org/pattern
64. Pattern – create a model in R
## load the packages used below
library(randomForest)
library(pmml)
library(XML)
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="\t", row.names=FALSE)
## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
66. Pattern – score a model, within an app
public class Main {
public static void main( String[] args ) {
String pmmlPath = args[ 0 ];
String ordersPath = args[ 1 ];
String classifyPath = args[ 2 ];
String trapPath = args[ 3 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );
// define a "Classifier" model from PMML to evaluate the orders
ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( classifyPipe, ordersTap )
.addTrap( classifyPipe, trapTap )
.addSink( classifyPipe, classifyTap );
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
}
67. Pattern – score a model, using pre-defined Cascading app
[Diagram labels: Customer Orders; Classify; PMML Model; Scored Orders; Assert; GroupBy token; Count; Confusion Matrix; Failure Traps; phase markers M and R]
cascading.org/pattern
68. Pattern – score a model, using pre-defined Cascading app
## run an RF classifier at scale
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml
## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml --assert --measure out/measure
## run a predictive model at scale, measure RMSE
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
  --pmml data/iris.lm_p.xml --rmse out/measure
69. PMML – model coverage
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
71. experiments – Random Forest model
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
OOB estimate of error rate: 14%
Confusion matrix:
    0   1 class.error
0  69  16   0.1882353
1  12 103   0.1043478
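The error figures in this output follow directly from the confusion matrix counts. A short Python sketch of the arithmetic, assuming rows are actual labels and columns are predicted labels (the function names are hypothetical):

```python
# rows are actual labels, columns are predicted labels (counts from the output above)
matrix = {0: {0: 69, 1: 16}, 1: {0: 12, 1: 103}}

def class_error(matrix, label):
    """Fraction of cases with this actual label that were predicted as another label."""
    row = matrix[label]
    total = sum(row.values())
    return (total - row[label]) / total

def overall_error(matrix):
    """Fraction of all cases that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return (total - correct) / total

print(class_error(matrix, 0), class_error(matrix, 1), overall_error(matrix))
```

Here 16/85 and 12/115 give the two `class.error` values, and (16 + 12)/200 gives the 14% overall estimate.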
72. experiments – Logistic Regression model
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
var0         -1.3755     0.4355  -3.159  0.00159 **
var2         -3.7742     0.5794  -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
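To show what scoring with these coefficients amounts to, here is a minimal Python sketch of the logistic link applied to the estimates above. The 0.5 threshold and the function names are assumptions for illustration; this is not the code Pattern generates from the PMML.

```python
import math

# coefficients copied from the summary output above
INTERCEPT, B_VAR0, B_VAR2 = 1.8524, -1.3755, -3.7742

def predict_probability(var0, var2):
    """Logistic link: p = 1 / (1 + exp(-(b0 + b1*var0 + b2*var2)))."""
    z = INTERCEPT + B_VAR0 * var0 + B_VAR2 * var2
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(var0, var2, threshold=0.5):
    """Hypothetical decision rule: label 1 when the probability reaches the threshold."""
    return 1 if predict_probability(var0, var2) >= threshold else 0

print(predict_probability(0.0, 0.0), predict_label(1.0, 1.0))
```

Both slopes are negative, so larger values of `var0` or `var2` push the predicted probability down.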
73. experiments – evaluating results
• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
• can extend this to evaluate N models, M labels in an N × M × M matrix
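The cost-model idea can be sketched in a few lines of Python. The error rates are the ones quoted on this slide; the unit costs and the share of positive cases are hypothetical values chosen only to show how the winner depends on them.

```python
# error rates quoted above, as fractions
RATES = {
    "random forest":       {"fn": 0.11, "fp": 0.14},
    "logistic regression": {"fn": 0.05, "fp": 0.52},
}

def expected_cost(rates, cost_fn, cost_fp, positive_share):
    """Expected cost per case: missed positives plus false alarms on negatives."""
    return (positive_share * rates["fn"] * cost_fn
            + (1.0 - positive_share) * rates["fp"] * cost_fp)

def pick_winner(cost_fn, cost_fp, positive_share=0.5):
    """Return the classifier name with the lowest expected cost (all inputs hypothetical)."""
    return min(
        RATES,
        key=lambda name: expected_cost(RATES[name], cost_fn, cost_fp, positive_share),
    )

# expensive chargebacks favor the model with fewer false negatives
print(pick_winner(cost_fn=100.0, cost_fp=10.0))
# cheaper chargebacks shift the balance toward fewer false positives
print(pick_winner(cost_fn=20.0, cost_fp=10.0))
```

Under these assumed costs the choice flips between the two classifiers, which is why the cost model, not the raw error rates, decides the winner.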
74. Cascading: Workflow Abstraction
1. Machine Data
2. Cascading
3. Sample Code
4. A Little Theory…
5. Workflows
6. Lingual
7. Pattern
8. Open Data
(background diagram: word count workflow with Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left/RHS, GroupBy token, Count, Word Count; M and R mark map and reduce)
75. Palo Alto is quite a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk
An example open source project:
github.com/Cascading/CoPA/wiki
76. 1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk
(smartphone GPS logs)
✚
3. some curated metadata
(which surfaces the value)
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto. While on a long conference call.
Sipping a latte or enjoying some fro-yo.”
(background diagram: word count workflow, as on the agenda slide)
77. discovery
The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
79. discovery
Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29
Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage: Appraised Value:
Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872
Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie
Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID:
598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year
Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic
Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width:
40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base
Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15
Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District
Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base
Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface
Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity:
none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse
Extent: 0 Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0
Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none
Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0
Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent
Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols
Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
(unstructured data…)
80. discovery
(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))

(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
(specify what you require,
not how to achieve it…
data prep costs are 80/20)
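For readers less familiar with Cascalog, the same parse-and-trap step can be approximated in plain Python. This is a sketch only: the function names mirror the slide but are otherwise hypothetical, the four-field check is an assumption, and the sample record is an abbreviated version of the tree record from the raw export.

```python
import csv

def parse_gis(line):
    """Parse one exported line into (blurb, misc, geo, kind) using a CSV parser."""
    fields = next(csv.reader([line]))
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return tuple(fields)

def etl_gis(lines):
    """Parse every line; lines that fail to parse are collected in a trap list."""
    parsed, trapped = [], []
    for line in lines:
        try:
            parsed.append(parse_gis(line))
        except (ValueError, StopIteration, csv.Error):
            trapped.append(line)
    return parsed, trapped

# abbreviated tree record, plus a hypothetical malformed line
tree_line = ('"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl",'
             '" Private: -1 Tree ID: 29 Street_Name: ADDISON AV",'
             '"37.4409634615283,-122.15648458861,0.0 ","Point"')
bad_line = "not,a,gis,record,at,all"
parsed, trapped = etl_gis([tree_line, bad_line])
```

A real CSV parser matters here because the quoted fields themselves contain commas, which a naive `split(",")` would break apart.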
81. discovery
(ad-hoc queries get refined
into composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
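One way to picture how such fields get pulled out of the free-text `misc` column is a regular expression that can be reused as a predicate. The Python sketch below is an assumption about the approach, not the CoPA code; the pattern covers only the leading tree fields, and the sample strings are abbreviated from the raw export shown earlier.

```python
import re

# hypothetical pattern covering the leading fields of a tree record
TREE_PATTERN = re.compile(
    r"^\s*Private:\s+(?P<private>\S+)\s+Tree ID:\s+(?P<tree_id>\d+)\s+"
    r"Street_Name:\s+(?P<street>.+?)\s+Situs Number:\s+(?P<situs>\d+)\s+"
    r"Tree Site:\s+(?P<site>\d+)\s+Species:\s+(?P<species>.+?)\s+Source:"
)

def parse_tree(misc):
    """Return a dict of tree fields, or None when the text is not a tree record."""
    match = TREE_PATTERN.match(misc)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tree_id"] = int(fields["tree_id"])
    return fields

def is_species(misc, species):
    """Composable predicate: True when the record is a tree of the given species."""
    tree = parse_tree(misc)
    return tree is not None and tree["species"] == species

tree_misc = (" Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 "
             "Tree Site: 2 Species: Celtis australis Source: davey tree")
road_misc = " Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive"
print(parse_tree(tree_misc))
```

Small predicates like `is_species` can then be combined to narrow an ad-hoc query step by step.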