Mikio Braun – Data flow vs. procedural programming

How to put your algorithms into Flink

Published in: Technology

  1. Flink Forward 2015, Berlin
     Data flow vs. procedural programming: How to put your algorithms into Flink
     October 13, 2015
     Mikio L. Braun, Zalando SE, @mikiobraun
  2. Python vs Flink
     ● Coming from Python, what are the differences in programming style I have to know to get started in Flink?
  3. Programming how we're used to
     ● Computing a sum
     ● Tools at our disposal:
       - variables
       - control flow (loops, if)
       - function calls as basic piece of abstraction

         def computeSum(a):
             total = 0  # named total to avoid shadowing the built-in sum()
             for i in range(len(a)):
                 total += a[i]
             return total
  4. Data Analysis Algorithms
     Let's consider centering, that is, subtracting the mean from every data point. This becomes

         def centerPoints(xs):
             sum = xs[0].copy()
             for i in range(1, len(xs)):
                 sum += xs[i]
             mean = sum / len(xs)
             for i in range(len(xs)):
                 xs[i] -= mean
             return xs

     or even just

         xs - xs.mean(axis=0)
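
The two centering variants on this slide can be compared directly. The following is a minimal runnable sketch, assuming the points are stored as rows of a NumPy array; the function names `center_points_loop` and `center_points_vectorized` and the sample data are illustrative and not taken from the slides.

```python
import numpy as np

def center_points_loop(xs):
    """Center the rows of xs with explicit loops, following the slide."""
    xs = np.asarray(xs, dtype=float).copy()  # copy so the caller's data is untouched
    total = xs[0].copy()
    for i in range(1, len(xs)):
        total += xs[i]
    mean = total / len(xs)
    for i in range(len(xs)):
        xs[i] -= mean
    return xs

def center_points_vectorized(xs):
    """Same computation using a column-wise mean and broadcasting."""
    xs = np.asarray(xs, dtype=float)
    return xs - xs.mean(axis=0)

# Illustrative data: three points in two dimensions.
points = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
centered = center_points_vectorized(points)
```

Both functions should agree up to floating-point rounding, and the column means of the centered data should be zero.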
  5. Don't use for-loops
     ● Put your data into a matrix
     ● Don't use for loops
  6. Least Squares Regression
     ● Compute w = (X'X + lam * I)^(-1) X'y
     ● Becomes

         import numpy as np

         def lsr(X, y, lam):
             d = X.shape[1]
             # regularized matrix X'X + lam * I
             C = X.T.dot(X) + lam * np.eye(d)
             # solve C w = X'y instead of inverting C explicitly
             w = np.linalg.solve(C, X.T.dot(y))
             return w

     What you learn is thinking in matrices: breaking down computations in terms of matrix algebra.
  7. Basic tools
     ● Basic procedural programming paradigm
     ● Variables
     ● Ordered arrays and efficient functions on those
     Advantage
       - very familiar
       - close to math
     Disadvantage
       - hard to scale
  8. Parallel Data Flow
     Often you have stuff like

         for i in someSet:
             map x[i] to y[i]

     which is inherently easy to scale.
  9. New Paradigm
     ● Basic building block is an (unordered) set.
     ● Basic operations are inherently parallel.
  10. Computing, Data Flow Style
      Computing a sum

          sum(xs) = xs.reduce((x, y) => x + y)

      Computing a mean

          mean(xs) = xs.map(x => (x, 1))
                       .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
                       .map(xc => xc._1 / xc._2)
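
For readers coming from Python, the same map/reduce formulation can be tried out on an ordinary list. This is a sketch using only the standard library; `dataflow_sum`, `dataflow_mean`, and the sample values are illustrative names, and a plain list stands in for the distributed set.

```python
from functools import reduce

def dataflow_sum(xs):
    """Sum via a pairwise reduce, with no loop variable or accumulator."""
    return reduce(lambda x, y: x + y, xs)

def dataflow_mean(xs):
    """Mean via map -> reduce -> map, mirroring the slide."""
    # map: pair every element with a count of 1
    pairs = map(lambda x: (x, 1), xs)
    # reduce: add up values and counts component-wise
    total, count = reduce(lambda xc, yc: (xc[0] + yc[0], xc[1] + yc[1]), pairs)
    # final map: divide the summed values by the summed counts
    return total / count

values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
total = dataflow_sum(values)
average = dataflow_mean(values)
```

Because the reduce function is associative and commutative, the order in which elements are combined does not matter, which is what makes this formulation easy to parallelize. Note that `reduce` without an initial value raises a `TypeError` on an empty input.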
  11. Apache Flink
      ● Data Flow system
      ● Basic building block is a DataSet[X]
      ● For execution, sets up all computing nodes, streams through data
  12. Apache Flink: Getting Started
      ● Use Scala API
      ● Minimal project with Maven (build tool) or Gradle
      ● Use an IDE like IntelliJ
      ● Always import org.apache.flink.api.scala._
  13. Centering (First Try)

          def computeMean(xs: DataSet[DenseVector]) =
            xs.map(x => (x, 1))
              .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
              .map(xc => xc._1 / xc._2)

          def centerPoints(xs: DataSet[DenseVector]) = {
            val mean = computeMean(xs)
            // Does not work: mean is itself a DataSet and is used inside xs.map
            xs.map(x => x - mean)
          }

      You cannot nest DataSet operations!
  14. Sorry, restrictions apply.
      ● Variables hold (lazy) computations
      ● You can't work with sets within the operations
      ● Even if the result is just a single element, it's a DataSet[Elem].
      ● So what to do?
        - cross joins
        - broadcast variables
  15. Centering (Second Try)

          def computeMean(xs: DataSet[DenseVector]) =
            xs.map(x => (x, 1))
              .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
              .map(xc => xc._1 / xc._2)

          def centerPoints(xs: DataSet[DenseVector]) = {
            val mean = computeMean(xs)
            // pair every point with the single mean element, then subtract
            xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
          }

      Works, but seems excessive because the mean is copied to each data element.
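
The cross-join idea can be imitated in plain Python to see what actually happens to the data. In this sketch a one-element list stands in for the single-element DataSet that holds the mean, and `itertools.product` plays the role of the cross join; the function names and sample points are illustrative, and none of this is Flink API.

```python
from functools import reduce
from itertools import product

import numpy as np

def compute_mean(xs):
    """Return the mean as a one-element list, mimicking a single-element data set."""
    total, count = reduce(
        lambda xc, yc: (xc[0] + yc[0], xc[1] + yc[1]),
        ((x, 1) for x in xs),
    )
    return [total / count]

def center_points(xs):
    """Center points by crossing them with the one-element mean set."""
    mean = compute_mean(xs)
    # The cross product pairs every point with the (single) mean element,
    # so the mean is conceptually copied next to each data element.
    return [x - m for x, m in product(xs, mean)]

points = [np.array([1.0, 2.0]), np.array([3.0, 6.0]), np.array([5.0, 10.0])]
centered = center_points(points)
```

The list of `(x, m)` pairs produced by `product` makes the slide's remark concrete: every data element carries its own reference to the mean.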
  16. Broadcast Variables
      ● Side information sent to all worker nodes
      ● Can be a DataSet
      ● Gets accessed as a Java collection
  17. Broadcast Variables

          import org.apache.flink.api.common.functions.RichMapFunction
          import org.apache.flink.configuration.Configuration

          class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
              extends RichMapFunction[T, O] {
            var broadcastVariable: B = _

            // called once per worker before any element is mapped
            @throws(classOf[Exception])
            override def open(configuration: Configuration): Unit = {
              broadcastVariable = getRuntimeContext
                .getBroadcastVariable[B]("broadcastVariable")
                .get(0)
            }

            // apply fun to each element together with the broadcast value
            override def map(value: T): O = {
              fun(value, broadcastVariable)
            }
          }
  18. Centering (Third Try)

          def computeMean(xs: DataSet[DenseVector]) =
            xs.map(x => (x, 1))
              .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
              .map(xc => xc._1 / xc._2)

          def centerPoints(xs: DataSet[DenseVector]) = {
            val mean = computeMean(xs)
            xs.mapWithBcVar(mean)((x, m) => x - m)
          }
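
The broadcast-variable pattern from the last two slides can be mimicked in plain Python to show the control flow: the side value is handed to the mapper once in `open`, and `map` then sees it for every element. This is only an analogy under stated assumptions; the dictionary of broadcast sets and the `map_with_bc_var` helper are hypothetical stand-ins for the Flink runtime, not its API.

```python
class BroadcastSingleElementMapper:
    """Plain-Python analogue of a mapper that reads one broadcast element."""

    def __init__(self, fun):
        self.fun = fun
        self.broadcast_variable = None

    def open(self, broadcast_sets):
        # Called once before mapping: fetch the first element of the broadcast set.
        self.broadcast_variable = broadcast_sets["broadcastVariable"][0]

    def map(self, value):
        return self.fun(value, self.broadcast_variable)


def map_with_bc_var(data, broadcast, fun):
    """Hypothetical helper: map fun over data with a single-element broadcast set."""
    mapper = BroadcastSingleElementMapper(fun)
    mapper.open({"broadcastVariable": list(broadcast)})
    return [mapper.map(value) for value in data]


# Centering one-dimensional data: the mean is a one-element "data set".
xs = [1.0, 3.0, 5.0]
mean = [sum(xs) / len(xs)]
centered = map_with_bc_var(xs, mean, lambda x, m: x - m)
```

Unlike the cross join, the mean is fetched once per mapper instance rather than paired with every element.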
  19. Intermediate Results pattern
      Data flow:

          val x = someDataSetComputation()
          val y = someOtherDataSetComputation()
          val z = dataSet.mapWithBcVar(x)((d, x) => …)
          val result = anotherDataSet.mapWithBcVar((y,z)) {
            (d, yz) => val (y,z) = yz
            …
          }

      Procedural equivalent:

          x = someComputation()
          y = someOtherComputation()
          z = someComputationOn(dataSet, x)
          result = moreComputationOn(y, z)
  20. Matrix Algebra
      ● No ordered sets per se in Data Flow context.
  21. Vector operations by explicit joins
      ● Encode vector (a1, a2, …, an) with {(1, a1), (2, a2), … (n, an)}
      ● Addition:

          a.join(b).where(0).equalTo(0)
            .map((ab) => (ab._1._1, ab._1._2 + ab._2._2))

        after join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), … }
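
The index-value encoding and the join-based addition can be tried out with ordinary Python sets. This is a sketch of the idea only: `encode`, `join_on_index`, `add`, and `decode` are illustrative names, and the join is a simple in-memory hash join rather than anything Flink does internally.

```python
def encode(vector):
    """Encode (a1, ..., an) as the unordered set {(1, a1), ..., (n, an)}."""
    return {(i, value) for i, value in enumerate(vector, start=1)}

def join_on_index(a, b):
    """Equi-join two encoded vectors on field 0, the index."""
    b_by_index = {}
    for index, value in b:
        b_by_index.setdefault(index, []).append((index, value))
    return [(left, right) for left in a for right in b_by_index.get(left[0], [])]

def add(a, b):
    """Vector addition: join on the index, then add the values of each pair."""
    return {(left[0], left[1] + right[1]) for left, right in join_on_index(a, b)}

def decode(encoded):
    """Restore the ordered vector by sorting on the index."""
    return [value for _, value in sorted(encoded)]

a = encode([1, 2, 3])
b = encode([10, 20, 30])
result = add(a, b)
```

Since the sets are unordered, the explicit index is the only thing that ties the i-th entry of `a` to the i-th entry of `b`; the order is recovered only at the end, in `decode`.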
  22. Back to Least Squares Regression
      Two operations: computing X'X and X'Y

          def lsr(xys: DataSet[(DenseVector, Double)]) = {
            // X'X as a sum of outer products, X'Y as a sum of scaled vectors
            val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
            val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)
            XTX.mapWithBcVar(XTY) { vars =>
              val XTX = vars._1
              val XTY = vars._2
              // the operator between XTX and XTY was lost in the transcript;
              // a linear solve (Breeze-style `\`) is assumed here
              val weight = XTX \ XTY
              weight
            }
          }
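
The decomposition into two reductions can be checked numerically in Python. The sketch below accumulates X'X as a sum of per-row outer products and X'y as a sum of scaled rows, then solves the linear system; `lsr_dataflow`, the synthetic data, and the optional `lam` regularizer (present on the earlier NumPy slide but not on this one) are assumptions made for illustration.

```python
from functools import reduce

import numpy as np

def lsr_dataflow(xys, lam=0.0):
    """Least squares from (x, y) pairs using only map and reduce steps."""
    # X'X = sum over all rows of outer(x, x)
    xtx = reduce(lambda a, b: a + b, map(lambda xy: np.outer(xy[0], xy[0]), xys))
    # X'y = sum over all rows of x * y
    xty = reduce(lambda a, b: a + b, map(lambda xy: xy[0] * xy[1], xys))
    d = xtx.shape[0]
    # Both intermediate results are small (d x d and d), so the solve is local.
    return np.linalg.solve(xtx + lam * np.eye(d), xty)

# Synthetic, noise-free data so the true weights can be recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = lsr_dataflow(list(zip(X, y)))
```

The result should match the matrix formulation `np.linalg.solve(X.T @ X, X.T @ y)` up to rounding, since both compute the same two quantities in a different order.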
  23. Summary and Outlook
      ● Procedural vs. Data Flow
        - basic building blocks: elementwise operations on unordered sets
        - can't be nested
        - combine intermediate results via broadcast vars
      ● Iterations
      ● Beware of TypeInformation implicits.
