The document discusses Reducers, a Clojure library that provides parallel versions of reduce, map, and filter functions. It does this by leveraging the Java Fork/Join framework and using "reduction transformers" that build map and filter on top of reduce, avoiding sequential execution and intermediate data structures. This allows collections to be processed in parallel using techniques like work stealing and combining partial results.
Clojure Reducers / clj-syd Aug 2012
1. Reducers
A library and model for collection processing in Clojure
Leonardo Borges
@leonardo_borges
http://www.leonardoborges.com
http://www.thoughtworks.com
Thursday, 30 August 12
2. Reducers
A library and model for collection processing in Clojure
... in 20 mins or less
6. Reducers huh? Here’s the gist
You get parallel versions of reduce, map and filter
Ta-da! I’m done!
and well under my 20 min limit :)
9. How do reducers make parallelism possible?
• JVM’s Fork/Join framework
• Reduction Transformers
10. Before we start - this is bleeding edge stuff
Java requirements
• Fork/Join framework
• Java 7 [1] or
• Java 6 + the JSR166 jar [2]
Clojure requirements
• 1.5.0-* (this is still MASTER on github [3] as of 30/08/2012)
[1] - http://jdk7.java.net/
[2] - http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar
[3] - https://github.com/clojure/clojure
18. The Fork/Join Framework
•Based on divide and conquer
•Work stealing algorithm
•Uses deques - double ended queues
•Progressively divides the workload into tasks, up to a threshold
•Once a worker finishes one task, it pops another one from its deque
•After at least two tasks have finished, results can be combined/joined
•Idle workers can steal tasks from the deques of workers which fall behind
44. Let’s talk about Reducers
Motivations
• Performance
• via less allocation
• via parallelism (leverage Fork/Join)
Thursday, 30 August 12
45. Let’s talk about Reducers
Motivations Issues
• Performance • Lists and Seqs are sequential
• via less allocation • map / filter implies order
• via parallelism (leverage Fork/Join)
Thursday, 30 August 12
53. A closer look at what map does
;; a naive map implementation
(defn map [f coll]
  (if (seq coll)
    (cons (f (first coll)) (map f (rest coll)))
    '()))
This is what mapping means!
• Recursion
• Order
• Laziness (not shown)
• Consumes List
• Builds List
Oh, and it also applies the function to each item before putting the result into the new list
57. Reduction Transformers
• Idea is to build map / filter on top of reduce to break from sequentiality
• map / filter then build nothing and consume nothing
• It changes what reduce means to the collection by transforming the reducing functions
58. What map is really all about
(defn mapping [f]
(fn [f1]
(fn [result input]
(f1 result (f input)))))
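By the same token, filter can be written as a reduction transformer. The `filtering` sketch below follows the same shape as `mapping` above; it is an illustration of the idea, not the library's exact source:

```clojure
;; filter as a reduction transformer: transforms a reducing function f1
;; into one that only passes items satisfying pred through to f1
(defn filtering [pred]
  (fn [f1]
    (fn [result input]
      (if (pred input)
        (f1 result input)
        result))))

(reduce ((filtering even?) conj) [] [1 2 3 4])
;; [2 4]
```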
59. But wait!
If map doesn’t consume the list any longer, who does?
• reduce does!
• Since Clojure 1.4 reduce lets the collection reduce itself
(through the CollReduce / CollFold protocols)
• Think of what this means for tree-like structures such as
vectors
• This is key to leveraging the Fork/Join framework
61. Now we can use mapping to create reducing functions
(reduce ((mapping inc) +) 0 [1 2 3 4])
;; 14
(fn [result input]
  (+ result (inc input)))
64. Now we can use mapping to create reducing functions
(reduce ((mapping inc) conj) [] [1 2 3 4])
;; [2 3 4 5]
(fn [result input]
  (conj result (inc input)))
But it feels awkward to use it in this form
65. What do we have so far?
• Performance has been improved due to fewer allocations
• No intermediate lists need to be built (see Haskell’s stream fusion [4])
• However reduce is still sequential
[4] - http://bit.ly/streamFusion
73. Enter fold
• Takes the sequentiality out of foldl, foldr and reduce
• Potentially parallel (falls back to standard reduce otherwise)
• Reduce/Combine strategy (think Fork/Join Framework)
• Segments the collection
• Runs multiple reduces in parallel
• Uses a combining function to join/reduce results
(defn fold [combinef reducef coll]
  ...)
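The reduce/combine strategy above can be sketched in plain Clojure. `naive-fold` is a hypothetical, single-threaded illustration, not the real r/fold, which segments recursively and runs the segment reduces in parallel via Fork/Join:

```clojure
;; A toy sketch of fold's strategy: split the collection in two,
;; reduce each half from a seed provided by (combinef), then
;; combine the partial results.
(defn naive-fold [combinef reducef coll]
  (let [[left right] (split-at (quot (count coll) 2) coll)]
    (combinef (reduce reducef (combinef) left)
              (reduce reducef (combinef) right))))

(naive-fold + + [1 2 3 4 5 6])
;; 21
```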
74. The combining function is a monoid
• A binary function with an identity element
• All the following functions are equivalent monoids

+
(+ 2 3) ; 5
(+) ; 0

(defn my-+
  ([] 0)
  ([a b] (+ a b)))
(my-+ 2 3) ; 5
(my-+) ; 0

(require '[clojure.core.reducers :as r])
(def my-+
  (r/monoid + (fn [] 0)))
(my-+ 2 3) ; 5
(my-+) ; 0
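Combining functions need not be numeric. For instance, r/monoid can build one for vectors; `vec-concat` is an illustrative name, not part of the library:

```clojure
(require '[clojure.core.reducers :as r])

;; (r/monoid op ctor): op combines two results, (ctor) supplies the seed
(def vec-concat (r/monoid into vector))

(vec-concat)           ;; [] - the identity element
(vec-concat [1 2] [3]) ;; [1 2 3]
```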
78. fold by examples
;; all examples assume the reducers library is available as r
(ns reducers-playground.core
  (:require [clojure.core.reducers :as r]))
87. fold by examples:
increment all even positive integers up to 10 million
and add them all up
;; these were taken from Rich’s reducers talk
(def my-vector (into [] (range 10000000)))
(time (reduce + (map inc (filter even? my-vector))))
;; 500msecs
(time (reduce + (r/map inc (r/filter even? my-vector))))
;; 260msecs
(time (r/fold + (r/map inc (r/filter even? my-vector))))
;; 130msecs
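r/fold also has a four-argument arity that takes an explicit segment size (the default is 512 elements). A hedged sketch of tuning it, reusing my-vector from above; the right value is workload-dependent, so benchmark before committing to one:

```clojure
;; fold's optional first argument is the partition size: segments of
;; roughly n elements are reduced sequentially, then combined
(time (r/fold 8192 + (r/map inc (r/filter even? my-vector))))
```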
89. fold by examples:
standard word count
(def wiki-dump (slurp "subset-wiki-dump50")) ;50 MB
(defn count-words [text]
  (reduce
    (fn [memo word]
      (assoc memo word (inc (get memo word 0))))
    {}
    (map #(.toLowerCase %) (into [] (re-seq #"\w+" text)))))
(time (count-words wiki-dump)) ;; 45 secs
90. fold by examples:
parallel word count
(def wiki-dump (slurp "subset-wiki-dump50")) ;50 MB
(defn p-count-words [text]
  (r/fold
    ;; combining fn: called with no arguments to provide a seed value,
    ;; and called at the leaves to merge the partial computations
    (r/monoid (partial merge-with +) hash-map)
    (fn [memo word]
      (assoc memo word (inc (get memo word 0))))
    (r/map #(.toLowerCase %) (into [] (re-seq #"\w+" text)))))
96. fold by examples:
Load 100k records into PostgreSQL
(def records
(into [] (line-seq
(BufferedReader. (FileReader. "dump.txt")))))
98. fold by examples:
Load 100k records into PostgreSQL
(time (doseq [record records]
  (let [tokens (clojure.string/split record #"\t")]
    (insert users/users
      (values {
        :account-id (nth tokens 0)
        ...
      })))))
;; 90 secs
100. fold by examples:
Load 100k records into PostgreSQL in parallel
(time (r/fold
  +
  (r/map (fn [record]
           (let [tokens (clojure.string/split record #"\t")]
             (do (insert users/users
                   (values {
                     :account-id (nth tokens 0)
                     ...
                   }))
                 1))) records)))
;; 50 secs
106. When to use it
• Exploring decision trees
• Image processing
• As a building block for bigger, distributed systems such as Datomic and Cascalog (maybe around parallel aggregators)
• Basically any list-intensive program
But the tools are available to anyone so be creative!
107. Resources
• The Anatomy of a Reducer - http://bit.ly/anatomyReducers
• Rich’s announcement post on Reducers - http://bit.ly/reducersANN
• Rich Hickey - Reducers - EuroClojure 2012 - http://bit.ly/reducersVideo
(this presentation was heavily inspired by this video)
• The Source on github - http://bit.ly/reducersCore
Leonardo Borges
@leonardo_borges
http://www.leonardoborges.com
http://www.thoughtworks.com