GE Aviation has hundreds of data scientists and engineers developing algorithms. Most of them do not have time to learn Apache Spark and continue to develop on their local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and a software engineer co-present how we approached the problem of building, unifying and scaling these algorithms.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying Legacy Python Algorithms to Apache Spark with Minimum Pain
1. GE Aviation Digital | 17 Oct 2019
Dr Lucas Partridge and Dr Peter Knight
Bridging the Gap Between Data Scientists and Software Engineers
Deploying legacy Python algorithms to Apache Spark with minimum pain
#UnifiedDataAnalytics #SparkAISummit
2. About us
Peter Knight (Data Scientist) - predicts wear on aircraft engines to minimize unplanned downtime.
Lucas Partridge (Software Engineer) - helps the data scientists scale up their algorithms for big data.
3. Outline
• About GE Aviation
• The problem
• Starting point
• Approach taken
• Some code
• Challenges
• Benefits
• Conclusions and recommendations
4. General Electric - Aviation
• 48k employees
• $30.6B revenue (2018)
• >33k commercial engines
“Every two seconds, an aircraft powered by GE technology takes off somewhere in the world”
5. General problem
• GE Aviation has 100s of data scientists and engineers developing Python algorithms.
• But most develop and test their algorithms on their local machines and don’t have the time to learn Spark.
• Spark = a good candidate to make these algorithms scale as the engine fleet grows.
• But how do we deploy these legacy algorithms to Spark as quickly as possible?
6. Specific problem
• Forecasting when aircraft engines should be removed for maintenance.
• So we can predict what engine parts will be needed, where, and when.
• ‘Digital Twin’ model exists for each important part to be tracked.
• Tens of engine lines, tens of tracked parts → 100s of algorithms to scale!
7. Starting point – a typical legacy Python algorithm

def execute(input_data):    # input_data: a Pandas DataFrame!
    # Calculate results from input data
    # …
    return results          # results: also a Pandas DataFrame!
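For illustration only, a minimal hypothetical algorithm of that shape (the column names and the wear calculation are invented, not GE’s):

import pandas as pd

def execute(input_data):
    # Hypothetical legacy algorithm: input and output are both Pandas DataFrames.
    results = input_data.groupby("esn", as_index=False)["wear_rate"].mean()
    results["cycles_to_removal"] = 1.0 / results["wear_rate"]
    return results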
8. The legacy Python algorithms
• Used Pandas DataFrames.
• Were run on laptops. Didn’t exploit Spark.
• Each algorithm was run independently.
• Each fetched its own data and read from, and wrote to, csv files.
• There were some Java Hadoop MapReduce and R ones too - not covered here.
9. The legacy Python algorithms
• Often failed at runtime.
• Typically processed data for more than one asset (engine) at a time; they often tried to process all engines!
• All the data would be read into a single Pandas DataFrame → ran out of memory! Bang!
10. The legacy Python algorithms
• Weren’t consistently written:
  • function arguments vs globals;
  • different names for the same data column.
• Had complex config in JSON – hard to do what-if runs.
• Other considerations:
  • The problem domain suggested the need for a pipeline of algorithms.
  • Few data scientists and software engineers know about Spark, much less about ML Pipelines!
11. Working towards a solution
• Studied representative legacy algorithms:
  • structure;
  • how they process data – columns required, sorting of data rows;
  • are any tests available?! E.g., csv files of inputs and expected outputs.
• Assumed we couldn’t alter the legacy code at all:
  • so decided to wrap rather than port them to PySpark;
  • i.e., the legacy Python algorithm is called in parallel across the cluster.
12. To wrap or to port a legacy algorithm?
Port when…
• Performance is critical.
• The algorithm is small, simple and easy to test on Spark.
• The algorithm’s creator is comfortable working directly with Spark.
• Spark skills are available for the foreseeable future.
Wrap when…
• You wish to retain the ability to run, test and update the algorithm outside Spark (e.g., on a laptop, or in other Big Data frameworks).
• An auto-code generation tool is available for generating all the necessary wrapper code.
13. Initially tried wrapping with RDD.mapPartitions()…
• Call it after repartitioning the input data by engine id (see the sketch below). This worked but…
• Could get unexpected key-skew effects unless you experiment with the way your data is partitioned.
• The data for more than one asset (engine) at a time could be passed into the wrapped algorithm:
  • OK if the algorithm can handle data for more than one asset; otherwise not.
• We really wanted to use @pandas_udf if Spark 2.3+ was available; its ‘grouped map’ usage means that the data for only one asset gets passed to the algorithm.
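A minimal sketch of this first approach, assuming a hypothetical legacy_module.execute() and an ‘esn’ engine-id column (neither is the exact GE code):

import pandas as pd
from pyspark.sql import SparkSession

import legacy_module  # hypothetical module exposing execute(pandas_df)

def run_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    pdf = pd.DataFrame(rows, columns=rows[0].__fields__)
    # NB: a partition may hold several engines' data - the algorithm must cope.
    results = legacy_module.execute(pdf)
    return (tuple(r) for r in results.itertuples(index=False))

spark = SparkSession.builder.getOrCreate()
input_df = spark.table("engine_data").repartition("esn")  # hash partitioning: skew risk
results_rdd = input_df.rdd.mapPartitions(run_partition)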
14. …so then we switched to RDD.groupByKey()
• Where key = asset (engine) id.
• So the data for only one asset gets passed to the algorithm.
• This more closely mirrors the behaviour of @pandas_udf (see the sketch below), so this code should be easier to convert to use @pandas_udf later on.
• And it will work with algorithms that can only cope with the data for one asset at a time.
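And the @pandas_udf ‘grouped map’ equivalent on Spark 2.3+, sketched with an illustrative output schema, reusing input_df from the sketch above (again assuming the hypothetical legacy_module):

from pyspark.sql.functions import pandas_udf, PandasUDFType

import legacy_module  # hypothetical

@pandas_udf("esn string, cycles_to_removal double", PandasUDFType.GROUPED_MAP)
def run_one_asset(pdf):
    # pdf holds the rows for exactly one engine (one group).
    return legacy_module.execute(pdf)

results_df = input_df.groupby("esn").apply(run_one_asset)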
15. Forecasting engine removals – solution components
• Multiple Digital Twin models – each one models the wear on a single engine part.
• Input Data Predictor – asks all the digital twin models what input data they need, and then predicts those values n years into the future.
• Aggregator – compares all the predictions to estimate when a given engine should be removed due to the wear of a particular part.
• → All of these were made into ML Pipeline Transformers…
[Diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results; the whole flow forms one PipelineModel]
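Since every stage is a Transformer (nothing needs fitting), one way to assemble the whole flow, sketched with class names from the class-hierarchy slide below (the construction details are our assumption, not the exact GE code):

from pyspark.ml import PipelineModel

stages = [InputDataPredictor(),                 # predicts future input values
          DigitalTwinEnginePartXEngineTypeP(),
          DigitalTwinEnginePartYEngineTypeP(),
          Aggregator()]                         # estimates removal dates
model = PipelineModel(stages=stages)            # built directly from transformers
output_df = model.transform(historic_df)        # each stage appends its columns
model.save("/models/engine_removal_forecast")   # illustrative path

Note that save() requires the custom transformers to implement ML persistence (e.g., via DefaultParamsReadable/DefaultParamsWritable).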
16. Strategy taken
• Passed data to algorithms rather than have each algorithm fetch its own data:
  • an algorithm shouldn’t have to know where the data comes from.
• Got representative digital twin models working in isolation, using a temporary table of predicted input data as input.
• Prototyped in a notebook environment (Apache Zeppelin).
• Eventually incorporated the pipeline into a spark-submit job using a new hierarchy of classes…
17. Class hierarchy
[Class diagram; (A) marks an abstract class:
• pyspark.ml: Params; Estimator (A); Transformer (A).
• GE code: GeAnalytic (A) → EngineWearModel (A) → GroupByKeyEngineWearModel (A) and HadoopMapReduceEngineWearModel (A); plus AnEstimator under Estimator.
• Code you write: DigitalTwinEnginePartXEngineTypeP and DigitalTwinEnginePartYEngineTypeP, extending GroupByKeyEngineWearModel.
• Sample mixin classes: HasEsnCol (esnCol), HasDatetimeCol (datetimeCol), HasFleetCol (fleetCol).]
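A sketch of what one of those mixins might look like, following pyspark.ml’s shared-param pattern (our reconstruction, not the actual GE code):

from pyspark.ml.param import Param, Params

class HasEsnCol(Params):
    # Mixin adding an 'esnCol' Param (engine serial number / asset id column).
    esnCol = Param(Params._dummy(), "esnCol",
                   "name of the engine serial number (asset id) column")

    def getEsnCol(self):
        return self.getOrDefault(self.esnCol)

    def setEsnCol(self, value):
        return self._set(esnCol=value)

A concrete class such as DigitalTwinEnginePartXEngineTypeP then inherits from GroupByKeyEngineWearModel plus the mixins it needs, and typically has little more to do than name the legacy module to import.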
19. _transform() method
Concrete methods of GroupByKeyEngineWearModel (A): _transform(), _runAnalyticForOneAsset(), _convertRddOfPandasDataFramesToSparkDataFrame().
Note: the _handleMissingData(), _processInputColumns() and _processResultsColumns() methods are implemented in each DigitalTwinXXX class.

def _transform(self, dataset):
    no_nulls_data = self._handleMissingData(dataset)
    data_with_processed_input_columns = self._processInputColumns(no_nulls_data)
    asset_col = self.getEsnCol()
    grouped_input_rdd = (data_with_processed_input_columns.rdd
        .map(lambda row: (row[asset_col], row)).groupByKey().mapValues(list))
    results_rdd = grouped_input_rdd.mapValues(
        self._runAnalyticForOneAsset(self.getFailFast(), asset_col))
    results_df = self._convertRddOfPandasDataFramesToSparkDataFrame(results_rdd)
    processed_results = self._processResultsColumns(results_df)
    output_df = dataset.join(processed_results, asset_col, 'left_outer')
    return output_df
20. Invoking the legacy Pandas-based algorithm

def _runAnalyticForOneAsset(self, failFast, assetCol):
    # Import the named legacy algorithm:
    pandas_based_analytic_module = importlib.import_module(
        self.getExecuteModuleName())  # A param set by each digital twin class.

    def _assetExecute(assetData):
        # Convert the row data for this asset into a Pandas DataFrame:
        rows = list(assetData)
        column_names = rows[0].__fields__
        input_data = pd.DataFrame(rows, columns=column_names)  # a Pandas DataFrame
        try:
            # Call the legacy algorithm; results is also a Pandas DataFrame:
            results = pandas_based_analytic_module.execute(input_data)
        except Exception as e:
            asset_id = input_data[assetCol].iloc[0]
            ex = Exception("Encountered %s whilst processing asset id '%s'"
                           % (e.__class__.__name__, asset_id),
                           e.args[0] if e.args else None)
            if failFast:
                raise ex  # Fail immediately, report the error to the driver node.
            else:
                # Log the error message silently in the Spark executor's logs:
                error_msg = "Silently ignoring this error: %s" % ex
                print(datetime.now().strftime("%y/%m/%d %H:%M:%S : ") + error_msg)
                return error_msg
        return results

    return _assetExecute
22. Converting the results back into a Spark DataFrame

def _convertRddOfPandasDataFramesToSparkDataFrame(self, resultsRdd):
    errors_rdd = resultsRdd.filter(lambda results: not isinstance(results[1], pd.DataFrame))
    if not errors_rdd.isEmpty():
        print("Possible errors: %s" % errors_rdd.collect())
    valid_results_rdd = resultsRdd.filter(lambda results: isinstance(results[1], pd.DataFrame))
    if valid_results_rdd.isEmpty():
        raise RuntimeError("ABORT! No valid results were obtained!")
    # Convert the Pandas DataFrames into lists of rows and flatten into one RDD:
    flattened_results_rdd = valid_results_rdd.flatMapValues(
        lambda pdf: (r.tolist() for r in pdf.to_records(index=False))).values()
    # Create a Spark DataFrame, using a schema inferred from the first Pandas DataFrame:
    spark = SparkSession.builder.getOrCreate()
    first_pdf = valid_results_rdd.first()[1]  # a Pandas DataFrame
    first_pdf_schema = spark.createDataFrame(first_pdf).schema
    return spark.createDataFrame(flattened_results_rdd, first_pdf_schema)
23. Algorithms before and after wrapping
BEFORE = standalone legacy algorithm; AFTER = same algorithm wrapped for a PySpark ML Pipeline.
• Hosting location
  BEFORE: on a single node or laptop.
  AFTER: in a platform that runs spark-submit jobs on a schedule.
• Configuration
  BEFORE: held in a separate JSON config file for each algorithm.
  AFTER: stored in params of the ML Pipeline, which can be saved and loaded from disk for the whole pipeline. Config is part of the pipeline itself.
• Acquisition of input data
  BEFORE: each algorithm fetched its own input data: made a separate Hive query, wrote its input data to csv, then read it into a single in-memory Pandas DataFrame for all applicable engines.
  AFTER: one PySpark spark.sql(“SELECT …”) statement for the data required by all the algorithms in the pipeline, passed as a Spark DataFrame into the transform() method for the whole pipeline.
• All asset (engine) data
  BEFORE: held in memory on a single machine.
  AFTER: spread across the executors of the Spark cluster.
• Writing of results
  BEFORE: each algorithm wrote its output to csv, which was then loaded into Hive as a separate table.
  AFTER: each algorithm appends a column of output to the Spark DataFrame that’s passed from one transform() to the next in the pipeline.
• Programming paradigm
  BEFORE: written as an execute() function which called other functions.
  AFTER: inherits from a specialised pyspark.ml Transformer class.
24. But it wasn’t all a bed of roses! Challenges…
• The pipeline wasn’t really a simple linear pipeline:
  • digital twin models operate independently – so could really be run in parallel;
  • many digital twins need to query data that’s in a different shape to the data that’s passed into the transform() method for the whole ML Pipeline.
• Converting the Pandas DataFrames back into Spark DataFrames without hitting data-type conversion issues at runtime was tricky! (See the mitigation sketch below.)
[Diagram: Historic data → Input Data Predictor → PipelineModel of Digital Twin models → Aggregator → Persist results, with ‘Other data’ feeding the Digital Twin models directly]
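One mitigation we can sketch for the data-type problem (an assumption, not necessarily the fix GE used): declare the results schema explicitly rather than inferring it from the first Pandas DataFrame, so one unrepresentative result can’t skew the inferred types. The names and types below are illustrative:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative explicit schema for the wrapped algorithm's results.
RESULTS_SCHEMA = StructType([
    StructField("esn", StringType(), False),
    StructField("cycles_to_removal", DoubleType(), True),
])

results_df = spark.createDataFrame(flattened_results_rdd, RESULTS_SCHEMA)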
25. More challenges
• Debugging can be tricky! Tips:
  • failFast flag (illustrative usage below):
    – True: stop processing if any asset throws an exception. Useful when debugging.
    – False: silently log an error message for any asset that throws an exception, but continue processing the other assets. Useful in production.
  • run with fewer engines and/or fleets when testing; gradually expand out.
• Even simple things have to be encoded as extra transformers in the pipeline or added as extra params:
  • e.g., persisting data, when required, between different stages in the pipeline.
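Illustrative use of the failFast flag, assuming it’s exposed as an ML Param with a conventional setter (the setter name is our guess):

twin = DigitalTwinEnginePartXEngineTypeP()
twin.setFailFast(True)    # debugging: abort on the first failing asset
# twin.setFailFast(False) # production: log the error, keep processing other assets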
26. Benefits of this approach
• Much more reliable – we don’t run out of memory any more!
• Will scale with the number of assets as the engine fleet grows.
• The whole forecasting scenario runs as a single ML PipelineModel – one per engine type/config.
• Consistent approach (and column names!) across the algorithms.
27. Key benefit
Data scientists who know little or nothing about Spark…
• can still develop and test their algorithm outside Spark on their own laptop, and…
• yet still have it deployed to Spark to scale with Big Data ☺.
You don’t have to rewrite each algorithm in PySpark to use the power of Spark.
28. Potential next steps
• Auto-generate the wrapper code for new Pandas-based algorithms, e.g., from a Data Science Workbench UI. Or, at the very least, create formal templates that encapsulate the lessons learned.
• Allow the same test-data csv files used on a laptop to be used unaltered for testing in the deployed Spark environment (sketch below). We need to verify that the ported algorithms actually work!
• Switch to using @pandas_udf on later versions of Spark.
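A sketch of how that csv reuse might look (the paths, module and model names are hypothetical): run the legacy algorithm on the file with pure Pandas, push the same file through the Spark wrapper, and compare:

import pandas as pd
from pyspark.sql import SparkSession

import legacy_module  # hypothetical

# Laptop path: pure Pandas.
expected = legacy_module.execute(pd.read_csv("tests/input.csv"))

# Cluster path: the same file through the Spark wrapper.
spark = SparkSession.builder.getOrCreate()
input_df = spark.read.csv("tests/input.csv", header=True, inferSchema=True)
actual = wrapped_model.transform(input_df).toPandas()

pd.testing.assert_frame_equal(
    expected.sort_values("esn").reset_index(drop=True),
    actual[expected.columns].sort_values("esn").reset_index(drop=True),
    check_dtype=False)  # tolerate benign dtype drift between the two paths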
29. Potential next steps
• Look to optimize the entire pipeline, e.g., by removing Spark actions where possible, such as persisting intermediate results.
• Many existing ‘algorithms’ – especially the digital twin models – are themselves really codified workflows or pipelines of lower-level algorithms:
  • so you could convert each algorithm into a pipeline of lower-level algorithms;
  • what are different algorithms now would simply become different pipelines, or even the same pipeline of transformers configured for a different engine part.
30. Conclusions and recommendations
• Consider wrapping rather than porting to PySpark, especially if the data scientists want to develop/test outside Spark.
• ML Pipelines offer a useful paradigm for running workflows of algorithms and saving/reloading them.
• If the algorithm can handle more than one asset at a time then RDD.mapPartitions() might suffice. Otherwise use RDD.groupByKey() or @pandas_udf.
• Push reusable code into a class hierarchy so each concrete wrapper class needs very little code.