Spark and machine learning in microservices architecture
1. Spark and Machine Learning in Microservices Architecture
by Stepan Pushkarev
CTO of Hydrosphere.io
2. About
Mission: Accelerate Machine Learning to Production
Open-source Products:
- Mist: Spark Compute as a Service
- ML Lambda: ML Function as a Service
- Sonar: Data and ML Monitoring
Business Model: Subscription services and hands-on consulting
5. Data, reporting and machine learning architectures are different
● Raw SQL / HiveQL / SQL on Hadoop
● Data warehouse / Data Lake centric
● Script-driven: ./bin/spark-submit
● Automated with Cron and/or Workflow Managers
● Hosted Notebooks culture
● Traditionally offline / for internal users
● File system aware (HDFS, S3)
● Defined by all-inclusive Hadoop distributions
6. Agenda
- Data Pipelines on Microservices
- ML Functions as low-latency prediction services
7. Part 1: Data Pipeline Intuition
Need: transform source data into the desired shape
13. Problem: Unmanageable State in a Shared Folder
- Data flow is not managed; DAG scheduling is a separate concern.
- Who is responsible for schema migration: Task 1, Task 2 or the Manager?
- Which folder should Task 1 write to, and which should Task 2 read from?
- How to manage folders/resources between parallel sessions?
- When and how to clean up the shared folder? Another cleanup pipeline?
- How to check that a data batch has arrived and is valid?
- How to unit test it?
- How to handle errors?
14. State-safe Pipelines
1. Get rid of the Workflow Manager!
2. Turn black-box tasks and scripts into microservices.
3. Use Avro data contracts between stages. Data is also an API to be standardized, versioned and validated.
4. Segregate black-box tasks into (read), (process) and (write) services.
5. Keep the state in the shared folder/topic/session managed by the framework rather than by data engineers.
6. Abstract the engineer away from data transport and provide a pure function to work with (see the sketch after this list).
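A minimal sketch of points 3-6, assuming a hypothetical framework that owns reading, writing and Avro schema validation, so the engineer only supplies a pure function over DataFrames; the stage name and column names are illustrative assumptions, not part of the talk.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// Hypothetical pipeline stage: a pure function with no knowledge of paths,
// topics or sessions. The surrounding framework is assumed to fetch the input,
// validate it against a versioned Avro schema, and persist the output.
object TokenizeStage {
  def apply(input: DataFrame): DataFrame =
    input.withColumn("tokens", split(col("text"), "\\s+"))
}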
21. On-demand batch pipelines
Characteristics:
- Cannot be pre-calculated
- On-demand parametrized jobs (see the sketch below)
- Interactive
- Require large-scale processing
Use cases:
- Reporting
- Simulation (pricing, bank stress testing, taxi rides)
- Forecasting (ad campaigns, energy savings, others)
- Ad-hoc analytics tools for business users
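A hedged sketch of what an on-demand, parametrized job can look like when it is exposed as a plain function rather than a spark-submit script; the table path, columns and parameters are assumptions for illustration, not the actual Mist API.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical on-demand reporting job: parameters arrive per request, the job
// runs on an already-initialized SparkSession and returns a result directly,
// instead of writing a file that someone else has to poll for.
object PricingReport {
  def run(spark: SparkSession, region: String, from: String, to: String): DataFrame =
    spark.read.parquet(s"s3://reports-bucket/sales/region=$region") // path is an assumption
      .where(s"date >= '$from' AND date <= '$to'")
      .groupBy("product")
      .agg(sum("amount").as("total"))
}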
22. Bad Practice: Database as API
[Diagram: the web app sets a flag and parameters in the database to request a report and polls for the result; the Spark side polls for new tasks, executes the reporting job, then marks the job as complete and saves the result.]
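To make the anti-pattern concrete, here is a hedged sketch of the worker side: the only contract between the web app and the Spark job is a table that both sides poll. The connection string, table and column names are assumptions for illustration.

import java.sql.DriverManager

// The "API" is a shared table: the web app inserts a row with status = 'NEW',
// the worker busy-polls for it, runs the report and flips the status to 'DONE'.
val conn = DriverManager.getConnection("jdbc:postgresql://db/reports", "user", "pass")
while (true) {
  val rs = conn.createStatement()
    .executeQuery("SELECT id, params FROM report_tasks WHERE status = 'NEW' LIMIT 1")
  if (rs.next()) {
    val id = rs.getLong("id")
    // ... execute the reporting job using rs.getString("params") ...
    conn.createStatement()
      .executeUpdate(s"UPDATE report_tasks SET status = 'DONE' WHERE id = $id")
  }
  Thread.sleep(5000) // both sides wait on the database, not on each other
}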
24. From Vanilla Spark to Spark Compute as a Service
Instead of ./bin/spark-submit:
- Spark Sessions Pool (see the sketch below)
- REST API Framework
- Data API Framework
- Infrastructure Integration (EMR, Hortonworks, etc.)
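A minimal sketch of the sessions-pool idea, assuming a long-lived driver process that hands out SparkSessions to incoming requests instead of launching a new spark-submit per job; the pool size and names are illustrative, and this is not the actual Mist implementation.

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.SparkSession

// Keep one long-lived SparkContext and serve requests from isolated sessions
// created with newSession(), avoiding spark-submit startup cost per job.
object SessionPool {
  private val base = SparkSession.builder().appName("spark-compute-service").getOrCreate()
  private val pool = Vector.fill(4)(base.newSession()) // pool size is an assumption
  private val next = new AtomicInteger(0)

  def borrow(): SparkSession = pool(next.getAndIncrement() % pool.size)
}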
31.
import org.apache.spark.ml.PipelineModel
import spark.implicits._

// Test DataFrame with a single "text" column
val test = Seq("spark hadoop", "hadoop learning").toDF("text")

// Load the saved pipeline model and score the test data
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
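For context, a hedged sketch of one way a model at /tmp/spark-model could have been trained and saved; the stages and toy training data are assumptions for illustration, not the pipeline actually used in the talk.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Toy training set and a simple text-classification pipeline (illustrative only)
val training = Seq(("spark hadoop", 1.0), ("cats dogs", 0.0)).toDF("text", "label")
val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)
))
pipeline.fit(training).write.overwrite().save("/tmp/spark-model")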
46. Thank you
Looking for:
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch:
- @hydrospheredata
- https://github.com/Hydrospheredata
- https://hydrosphere.io/
- spushkarev@hydrosphere.io