We use Apache Spark Structured Streaming on the Databricks Unified Analytics Platform to process live data, and Spark MLlib to train models that predict machine failure. Combined in the ZEISS Measuring Capability App, Structured Streaming and MLlib allow users to stay on top of all relevant machine information and to know at a glance whether a machine can perform reliably. We will demonstrate how Azure Databricks lets us easily schedule and monitor a growing number of Spark jobs as we continuously add new features to our app.
1. Structured Streaming on Azure Databricks for Predictive Maintenance of Coordinate Measuring Machines
Jan-Philipp Simen, Data Scientist, Carl Zeiss AG
Spark + AI Summit Europe, London, 2018-10-03
#SAISEnt1
2. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
4. ZEISS Research & Quality Technology
More Than 20 Nobel Prizes Enabled by ZEISS Microscopes
https://www.zeiss.com/microscopy/int/about-us/nobel-prize-winners.html
5. ZEISS Semiconductor Manufacturing Technology
With Lithography at 13.5 Nanometers Wavelength,
ZEISS Is Enabling the Digital World
More on extreme ultraviolet (EUV) lithography: https://www.youtube.com/watch?v=Hfsp2jIjDpI
6. ZEISS Industrial Metrology
Precision up to 0.3 µm
Quality assurance through metrology with coordinate measuring machines, usually comparing the CAD model to the produced workpiece. Sensor types: tactile, optical, laser, and x-ray.
7. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
8. How Will Predictive Maintenance Benefit Our Customers?
Main goals:
1. Avoid downtime
2. Ensure reliable measurements
9. Delivering Customer Value from the Start
Start small, learn from data, and develop the first value-adding services:
1. Condition monitoring app
2. One-click support request
3. Predictive models for internal use at ZEISS
4. Predictive maintenance notifications in app
10. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
11. Streaming Architecture on Azure
Machines send {JSON} telemetry to Azure IoT Hub, which routes it to Azure Event Hubs. The Azure Event Hubs Connector for Apache Spark feeds the events into Spark Structured Streaming, which writes results to Azure Cosmos DB (via the Azure Cosmos DB Connector for Apache Spark) and to the Data Lake as Parquet files.
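As a rough illustration (not the production code), reading the Event Hubs stream into Structured Streaming and archiving it to the Data Lake might look like this; the connection string, paths, and column selection are assumptions:

import org.apache.spark.eventhubs.EventHubsConf
import org.apache.spark.sql.functions.col

// Sketch only: read telemetry from Event Hubs and archive it as Parquet.
// In a Databricks notebook, `spark` (SparkSession) is predefined.
val ehConf = EventHubsConf("<event-hubs-connection-string>")
val raw = spark.readStream.format("eventhubs").options(ehConf.toMap).load()

val telemetry = raw.select(
  col("body").cast("string").as("jsonString"),   // the {JSON} payload
  col("enqueuedTime").as("dateTime"))

telemetry.writeStream
  .format("parquet")
  .option("path", "/mnt/datalake/telemetry")                       // assumed mount point
  .option("checkpointLocation", "/mnt/datalake/checkpoints/telemetry")
  .start()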
12. In a Streaming Pipeline the Drawbacks of Untyped SQL Operations Are Even More Apparent

def readSensor9Untyped(streamingDF: DataFrame): DataFrame =  // Don't forget to document the schema!
  streamingDF.select(
    $"machineID",
    $"dateTime",
    udfParseDouble("sensor9")($"jsonString"))  // Did I spell this right? What is in there, again?

def udfParseDouble(key: String) = ???
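For illustration, one way udfParseDouble could be implemented; this is a naive sketch assuming the sensor value is a flat numeric JSON field, not the actual parser:

import org.apache.spark.sql.functions.udf
import scala.util.Try

def udfParseDouble(key: String) = udf { json: String =>
  // Naively extract a numeric field from the JSON string; None if missing or malformed.
  val pattern = ("\"" + key + "\"\\s*:\\s*(-?[\\d.eE+-]+)").r
  pattern.findFirstMatchIn(json).flatMap(m => Try(m.group(1).toDouble).toOption)
}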
Our solution: Spark Datasets and Scala Case Classes
13. Spark Datasets + Scala Case Classes = Typesafe Data Processing

case class Event(machineID: String,
                 dateTime: Timestamp,
                 jsonString: String)

case class Record[T](machineID: String,
                     timestamp: Timestamp,
                     value: T)

def jsonToRecord(sensorID: Int)(event: Event): Record[Option[Double]] =
  Record[Option[Double]](event.machineID,
                         event.dateTime,
                         parseSensorValue(sensorID, event.jsonString))

def parseSensorValue = ???

def readSensor9(streamingDS: Dataset[Event]): Dataset[Record[Option[Double]]] =
  streamingDS.map(jsonToRecord(9))
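As a usage sketch, the typed stream can be obtained by selecting the payload columns and converting with .as[Event]; `rawTelemetry` and its column names are assumptions:

import org.apache.spark.sql.Dataset
import spark.implicits._  // `spark` is the notebook's SparkSession

// Sketch: convert the raw streaming DataFrame into a typed Dataset[Event].
val events: Dataset[Event] = rawTelemetry
  .selectExpr("machineID", "dateTime", "jsonString")
  .as[Event]
val sensor9 = readSensor9(events)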
Leveraging the full potential of a good IDE:
• Auto completion
• Compile checks
• Quick fixes
• Refactoring

Spark features wish list:
• Typesafe joins for Datasets
• Import / export of case classes from / to JSON Schema would allow a centralized schema repository
14. Implementing Alert Logic with Stateful Streaming
/** Checks a stream of JSON messages for threshold violations.
*
* @param jsonStream Dataset containing JSON messages as received from IoT Hub
* @param cmms Dataset with thresholds and other specifications for each machine
* @return Dataset with alert events as expected by the Eventhub
*/
def createAlerts(jsonStream: Dataset[JsonMessage],
                 cmms: Dataset[CmmSpecs]): Dataset[AlertEvent] = {
  val joined: Dataset[JsonPlusCmmSpecs] = joinStaticOnStream(jsonStream, cmms)
  // Group by serial number
  val grouped = joined.groupByKey(_.machineID)
  // Evaluate incoming messages in each group (that is, for each individual machine)
  grouped.flatMapGroupsWithState[AlertState, AlertEvent](...)(evaluateTemperatureState)
}

def evaluateTemperatureState = ???
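The deck elides the state-evaluation function; a minimal sketch, assuming hypothetical field names on JsonPlusCmmSpecs and AlertEvent as well as an assumed AlertState shape, could look like this:

import org.apache.spark.sql.streaming.GroupState

case class AlertState(alertActive: Boolean)  // assumed shape of the per-machine state

// Sketch: raise one alert when the temperature first exceeds the
// machine-specific threshold, and re-arm once it drops below again.
def evaluateTemperatureState(machineID: String,
                             events: Iterator[JsonPlusCmmSpecs],
                             state: GroupState[AlertState]): Iterator[AlertEvent] = {
  var active = state.getOption.exists(_.alertActive)
  val alerts = scala.collection.mutable.ArrayBuffer.empty[AlertEvent]
  for (e <- events) {
    if (!active && e.temperature > e.temperatureThreshold) {  // field names are assumptions
      alerts += AlertEvent(machineID, e.timestamp, e.temperature)
      active = true
    } else if (active && e.temperature <= e.temperatureThreshold) {
      active = false  // violation cleared; allow a future alert
    }
  }
  state.update(AlertState(active))
  alerts.iterator
}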
15. Automated Build and Deployment
1. Code Repository: a Scala SBT project (shared code base), Scala notebooks (job-specific logic), and shell scripts (job configurations).
2. Build Agent: builds the jar artifact using sbt-assembly.
3. Azure Databricks Platform: the jar is uploaded to DBFS, the notebooks are uploaded to the Workspace, and "create job" requests are executed against the Databricks REST API to set up Job 1, Job 2, Job 3, …
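A minimal sketch of such a "create job" request from one of the shell scripts; the job name, cluster settings, paths, and environment variables are illustrative assumptions:

# Illustrative sketch: create a scheduled Spark job via the Databricks REST API.
curl -X POST "https://$DATABRICKS_HOST/api/2.0/jobs/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "name": "streaming-alerts",
    "new_cluster": {
      "spark_version": "4.3.x-scala2.11",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 2
    },
    "libraries": [{"jar": "dbfs:/jars/shared-code-assembly.jar"}],
    "notebook_task": {"notebook_path": "/Jobs/streaming-alerts"}
  }'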
16. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
17. Batch ETL Architecture
Service technicians manually upload Winzip files to Azure Blob storage: about 50,000 zip files at 30 to 200 MB per file. Spark processes them with RDDs and Datasets and writes Parquet files to the Data Lake.
18. Unzipping Lessons Learned
In Theory
val allFilesOnBlob: RDD[(String, PortableDataStream)] = sc.binaryFiles(path)
def unzip(zipFile: PortableDataStream): Map[String, String] = easyUnzipFunction(zipFile)
val unzippedFiles: RDD[(String, Map[String, String])] = allFilesOnBlob.mapValues(unzip)
In Practice
• 250 lines of code just for the unzipping (Winzip format)!
• A lot of exception handling during unzipping.
• Many debugging iterations.
• Zip files ranging from 30 MB to 200 MB (compressed) meant reserving a lot of memory overhead for big archives, so we split the RDD into three buckets by file size (see the sketch below).
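A sketch of the bucketing idea: list files with their sizes via the Hadoop FileSystem API and read each size bucket as a separate RDD, so executor memory can be sized per bucket. The path and thresholds are assumptions:

import org.apache.hadoop.fs.{FileSystem, Path}

// Partition file listings by compressed size, then read each bucket separately.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(new Path("/mnt/blob/zips")).filter(_.isFile)
val (small, rest) = files.partition(_.getLen < 50L * 1024 * 1024)
val (medium, big) = rest.partition(_.getLen < 120L * 1024 * 1024)
val buckets = Seq(small, medium, big).map { bucket =>
  sc.binaryFiles(bucket.map(_.getPath.toString).mkString(","))  // comma-separated path list
}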
Next time: use gzip compression at the file level instead of putting varying numbers of files into a Winzip archive.
19. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
20. Combining Batch and Streaming Data Sources and Adding Scheduled Machine Learning Notebooks
Batch side: historic log files land in Azure Blob storage and are processed with RDDs and Datasets into the Data Lake. Streaming side: {JSON} events flow from Azure IoT Hub through Azure Event Hubs into Structured Streaming and on into Azure Cosmos DB. An ESB contributes CRM and service data.
MLlib Training consumes historic logs, streaming events, and service data and produces trained models. MLlib Prediction combines the trained models with recent streaming events and feeds the Prediction Dashboard.
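For illustration, a scheduled training notebook built with MLlib could look like the sketch below; the feature columns, label, paths, and `trainingData` are assumptions, not ZEISS's actual model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Assemble assumed features and fit a failure classifier on `trainingData`,
// a hypothetical DataFrame joined from logs, streaming events, and service data.
val assembler = new VectorAssembler()
  .setInputCols(Array("temperatureMean", "vibrationMax", "serviceEventsLast90d"))
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("failedWithin30d")
  .setFeaturesCol("features")
val model = new Pipeline().setStages(Array(assembler, rf)).fit(trainingData)
model.write.overwrite().save("/mnt/datalake/models/failure-rf")

// A separate scheduled prediction notebook would reload the model and score
// recent streaming events, e.g.:
//   val model = PipelineModel.load("/mnt/datalake/models/failure-rf")
//   model.transform(recentEvents).select("machineID", "prediction", "probability")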
21. Agenda
1 About ZEISS
2 Motivation for Predictive Maintenance
3 Processing Live Data Streams
4 Processing Historical Data Batches
5 Bringing It Together
6 Summary
22. Things We Like
Things we like about Spark
• Datasets: Typesafe data processing
• Stateful Streaming: Powerful transformations of streaming datasets
• MLlib: State-of-the-art distributed algorithms
• All the other great things: Scalability, performance, …

Things we like about Azure Databricks
• Cluster management: Convenient setup, auto-scaling, auto-shutdown
• Job management: Convenient creation and scheduling of Spark jobs
• REST API: Enables automated deployment
23. Wish List
Features we would love to see in the future
• Datasets: Typesafe joins
• Central schema repository with export to different formats (Scala case class, JSON Schema)
• Better monitoring for multiple long-running jobs: integration with external log aggregators or customizable log metrics in Databricks
24. Contact
Dr. Jan-Philipp Simen
Data Scientist
Digital Innovation Partners
Carl Zeiss AG
Kistlerhofstraße 70
81379 Munich
jan-philipp.simen@zeiss.com
https://www.linkedin.com/in/jpsimen