Stepping beyond batch ETL, large enterprises are looking for ways to generate more up-to-date insights. As we enter the age of Continuous Applications, this session will explore the increasingly popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases built on it.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
20. ML Pipeline fit()
• Essentially an Action
• Results in a Model
• Sink start() is also an Action
• The Structured Streaming circuit must be completed with Sink start()
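The two "actions" above can be sketched in SparkR. This is a minimal, illustrative example: the data, columns, and socket source are assumptions, not the talk's actual demo.

```r
library(SparkR)
sparkR.session()

# fit() behaves like an Action: it eagerly runs the job and
# returns a fitted Model (SparkR renames iris columns, "." -> "_")
df <- createDataFrame(iris)
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)

# For a streaming DataFrame, nothing executes until the sink is
# started; write.stream() (i.e. start()) completes the circuit
lines <- read.stream("socket", host = "localhost", port = 9999)
query <- write.stream(count(groupBy(lines, "value")),
                      "console", outputMode = "complete")
```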
23. Why Streaming in R
• Single integrated job for everything
1. Ingest
2. ETL
3. Machine Learning
• Use your favorite packages - freedom to choose
• rkafka – last published in 2015
25. SparkR
• DataFrame API like R data.frame, dplyr
– Full Spark optimizations
• SQL, Session, Catalog
• “Spark Packages”
• ML
• R-native UDF
• SS (Structured Streaming)
26. Native R UDF
• User-Defined Functions - custom transformation
• Apply by Partition
• Apply by Group
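In SparkR, these two UDF styles correspond to `dapply` (by partition) and `gapply` (by group). A minimal sketch with made-up toy data:

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(data.frame(group = c("a", "a", "b"),
                                 value = c(1, 2, 3)))

# Apply by Partition: the function receives each partition
# as a plain R data.frame
doubled <- dapply(df,
                  function(pdf) { pdf$value <- pdf$value * 2; pdf },
                  schema(df))

# Apply by Group: the function receives the key and that
# group's rows; output must match the declared schema
sums <- gapply(df, "group",
               function(key, pdf) data.frame(key, total = sum(pdf$value)),
               structType(structField("group", "string"),
                          structField("total", "double")))
head(collect(sums))
```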
33. Demo
• SS – read text stream from Kafka
• R-UDF – a partition with lines of text
– RTextTools – text vector into DTM – scrubbing
– LDA
– terms
• SQL – group by words, count
• SS – write to console
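The demo's pipeline shape (Kafka source, R-UDF over partitions, SQL aggregation, console sink) could be sketched as below. The broker address, topic name, and the trivial tokenizing UDF are placeholders; the real demo applies RTextTools/LDA inside the UDF instead.

```r
library(SparkR)
sparkR.session()

# SS: read a text stream from Kafka (placeholder broker/topic)
raw <- read.stream("kafka",
                   kafka.bootstrap.servers = "localhost:9092",
                   subscribe = "text-topic")
lines <- selectExpr(raw, "CAST(value AS STRING) AS line")

# R-UDF: native R code runs on each partition's lines
# (the real demo would build a DTM and run LDA here)
words <- dapply(lines,
                function(pdf)
                  data.frame(word = unlist(strsplit(pdf$line, " "))),
                structType(structField("word", "string")))

# SQL: group by words, count
counts <- count(groupBy(words, "word"))

# SS: write the running counts to the console
query <- write.stream(counts, "console", outputMode = "complete")
```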
37. Streaming and ML
• Streaming – small batch
• ML – sometimes large data to build model
=> pre-trained model
=> online machine learning
• Adapting to changes in data schema and patterns
• Updating model (when?)
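The pre-trained-model approach above can be sketched with SparkR's model persistence: train offline on large historical data, save, then reload in the streaming job instead of retraining on each small batch. The data and path here are illustrative placeholders.

```r
library(SparkR)
sparkR.session()

# Train offline on (large) historical data, then persist the model
historical <- createDataFrame(iris)
model <- spark.kmeans(historical, ~ Sepal_Length + Sepal_Width, k = 3)
write.ml(model, "/tmp/kmeans-model")

# Later, in the streaming job: reload the pre-trained model
# rather than fitting on a small micro-batch
pretrained <- read.ml("/tmp/kmeans-model")
```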
39. SS Considerations
• Schema of DataFrame from Kafka: key (object), value (object), topic, partition, offset, timestamp, timestampType
• OutputMode requirements
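Since the Kafka source delivers `key` and `value` in a raw (binary) form, they are typically cast before use. A minimal sketch, with placeholder broker and topic:

```r
library(SparkR)
sparkR.session()

raw <- read.stream("kafka",
                   kafka.bootstrap.servers = "localhost:9092",
                   subscribe = "events")

# Cast key/value for downstream use; keep the Kafka metadata columns
events <- selectExpr(raw,
                     "CAST(key AS STRING)", "CAST(value AS STRING)",
                     "topic", "partition", "offset", "timestamp")
```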
40. ML with R-UDF
• Native-code UDFs can break the job
- e.g. ML packages can be sensitive to empty rows
- in real life, add more data checks
• Debugging can be challenging – run the UDF separately first
• UDF must return output that matches the declared schema
• Model as state, distributed to each UDF instance
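Two of these points can be shown together: a locally trained model captured in the UDF's closure (so each UDF instance receives a copy as state) plus a guard for empty partitions. The `lm` model on the built-in `cars` data is purely illustrative.

```r
library(SparkR)
sparkR.session()

# Model as state: train locally, capture in the closure so it is
# shipped to each UDF instance (illustrative lm model)
local_model <- lm(dist ~ speed, data = cars)

df <- createDataFrame(cars)
out_schema <- structType(structField("speed", "double"),
                         structField("dist", "double"),
                         structField("pred", "double"))
scored <- dapply(df,
                 function(pdf) {
                   # guard: some packages break on empty partitions
                   if (nrow(pdf) == 0)
                     return(data.frame(speed = double(), dist = double(),
                                       pred = double()))
                   pdf$pred <- predict(local_model, pdf)
                   pdf  # must match the declared output schema
                 },
                 out_schema)
head(collect(scored))
```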
41. Future – SSR
• Configurable trigger
• Watermark for late data