2. Building a Feature Store around Dataframes and Apache Spark
Jim Dowling, CEO @ Logical Clocks AB
Fabio Buso, Head of Engineering @ Logical Clocks AB
4. When Data Engineers are asked to re-use other teams’ features*
*Hide-the-pain-Harold smiles and says ‘yes’, but inside he’s in a world of pain
6. Feature Store in Banking
▪ Problem: Manage TBs of transactions as ML features and develop models to reduce the cost of fraud.
▪ Solution: Hopsworks provides the platform to train machine learning models that classify transactions as suspected fraud or not. The fraud dataset contains billions of records (40 TB), and the solution uses Deep Learning (GPUs) to detect structural patterns in bank transactions and temporal patterns based on the frequency of bank transactions executed.
▪ Reference: Swedbank Talk at Spark/AI EU Summit 2019
7. Data Teams are moving from Analytics to ML
[Diagram: Raw Data and Event Data → Data Pipelines → Data Lake → SQL → BI Platforms]
8. Data Teams are moving from Analytics to ML
[Diagram: the analytics flow (Raw/Event Data → Data Pipelines → Data Lake → SQL → BI Platforms), extended with Feature Pipelines feeding the Hopsworks Feature Store, which in turn feeds Model Training, Online Model Serving, and Analytical Model Scoring (Batch)]
9. Features are created/updated at different cadences
▪ Click features every 10 secs
▪ CDC data every 30 secs
▪ User profile updates every hour
▪ Featurized weblogs data every day
▪ User-entered features (<2 secs)
[Diagram: Event Data, Real-Time Data, SQL DW, and S3/HDFS sources feed the Feature Store. Low-latency features (<10ms) are served from the Online Feature Store to the Online App; high-latency features (TBs/PBs) are served from the Offline Feature Store to training and batch apps]
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
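The online/offline split can be pictured in plain Python. The sketch below is a toy model, not the Hopsworks API: a dict stands in for the online store (latest feature vector per key, fast lookups) and a list stands in for the offline store (full history for training).

```python
# Toy model of the online/offline feature store split -- illustrative only,
# not the Hopsworks API. Online store: latest row per key; offline: full history.
class ToyFeatureStore:
    def __init__(self):
        self.online = {}    # key -> latest feature vector (low-latency serving)
        self.offline = []   # append-only history (batch training reads)

    def ingest(self, key, features, ts):
        self.online[key] = features                # overwrite: serving wants latest
        self.offline.append((ts, key, features))   # keep everything for training

    def get_online(self, key):
        # what an online app calls at request time
        return self.online[key]

    def get_offline(self):
        # what a training job scans in batch
        return list(self.offline)

store = ToyFeatureStore()
store.ingest("user_1", {"clicks_10s": 3}, ts=1)
store.ingest("user_1", {"clicks_10s": 7}, ts=2)
print(store.get_online("user_1"))   # latest vector only
print(len(store.get_offline()))     # full history: 2 rows
```

The same ingestion pipeline writes to both sides; only the retention and access pattern differ.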
10. FeatureGroup Ingestion in Hopsworks
[Diagram: sources are ingested into the Feature Store as FeatureGroups via the DataFrame API: user clicks (Event Data) → ClickFeatureGroup, DB updates (SQL DW) → TableFeatureGroup, user profile updates (SQL) → UserFeatureGroup, weblogs (S3, HDFS) → LogsFeatureGroup. Real-time features flow from Kafka input through Flink into the RTFeatureGroup, with Kafka output to the Online App. The Feature Store serves both the Online App and Train/Batch Apps]
12. ML Pipelines start and stop at the Feature Store
[Diagram: Raw Data and Event Data → Data Lake → Feature Pipeline → Feature Store → Train/Validate → Model Serving → Monitor]
13. Feature Store Concepts
▪ Features: named columns such as name, Pclass, Sex, Survive, Balance
▪ Feature Groups: e.g. Titanic Passenger List and Bank Account, combined via a join key (Passenger)
▪ Train/Test Datasets: joined features (name, PClass, Sex, Survive, Balance), written in a file format (.tfrecord, .npy, .csv, .hdf5, .petastorm, etc.) to a storage layer (GCS, Amazon S3, HopsFS)
Features, FeatureGroups, and Train/Test Datasets are all versioned
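Versioning can be pictured as a registry keyed by (name, version), where each version is immutable once registered. A toy sketch; the names and the `register` helper are illustrative, not the Hopsworks API:

```python
# Toy versioned registry: a feature group is immutable per (name, version).
# Illustrative only -- not the Hopsworks API.
registry = {}

def register(name, version, schema):
    key = (name, version)
    if key in registry:
        raise ValueError(f"{name} v{version} already exists; bump the version")
    registry[key] = schema

register("titanic_passengers", 1, ["name", "Pclass", "Sex", "Survive"])
# Adding a feature means a new version, not mutating v1:
register("titanic_passengers", 2, ["name", "Pclass", "Sex", "Survive", "Balance"])

# Training jobs pin an exact version for reproducibility
print(registry[("titanic_passengers", 1)])
```

Pinning exact versions is what makes a training run reproducible after the feature group evolves.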
14. Register a FeatureGroup with the Feature Store
from hops import featurestore as fs

df = ...  # Spark or Pandas DataFrame
# Do feature engineering on 'df'

# Register the DataFrame as a FeatureGroup
fs.create_featuregroup(df, "titanic_df", online=True)
15. Hopsworks Feature Store
[Diagram: raw data, structured data, and events are ingested from the Data Lake into the Feature Store. The Online Feature Store is used by online apps; the Offline Feature Store is used by batch apps and to create train/test data]
16. Create Train/Test Datasets using the Feature Store
from hops import featurestore as fs

sample_data = fs.get_features(["name", "Pclass", "Sex",
                               "Balance", "Survived"])
fs.create_training_dataset(sample_data,
                           "titanic_training_dataset",
                           data_format="tfrecords",
                           training_dataset_version=1)
17. Online Feature Store
1. Build a feature vector using the Online Feature Store (JDBC lookups, ~2-20ms)
2. Send the feature vector to a model for prediction
[Diagram: an online application issues JDBC queries against MySQL (NDB) replicas in US-West-1a, US-West-1b, and US-West-1c to build the feature vector, then calls the model; ~5-50ms end to end]
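The two serving steps can be sketched end to end: look up the precomputed features by primary key, then pass the assembled vector to a model. The dict and the linear scorer below are stand-ins for the MySQL (NDB) online store and a deployed model; all names and thresholds are hypothetical.

```python
# Sketch of online serving: 1) build a feature vector from the online store,
# 2) send it to a model for prediction. A dict and a linear scorer stand in
# for the MySQL (NDB) online store and a deployed model -- illustrative only.
online_store = {
    "cust_42": {"balance": 1200.0, "txn_freq_1h": 3.0},
}
weights = {"balance": 0.0005, "txn_freq_1h": 0.2}  # hypothetical model weights

def predict(customer_id):
    feature_vector = online_store[customer_id]   # step 1: single-key lookup
    score = sum(weights[k] * v for k, v in feature_vector.items())  # step 2
    return "fraud" if score > 1.0 else "ok"

print(predict("cust_42"))
```

The point of the pattern is that the request path does no feature engineering: it only reads precomputed values and scores them.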
18. Good Decisions we took in Version 1
▪ General-purpose DataFrame API (a DSL could be added later)
▪ The Feature Store is a cache for materialized features, not a library
▪ Online and Offline Feature Stores to support low latency and scale, respectively
▪ Reuse of features means JOINs; Spark as the join engine
19. Feature Store API v2
▪ Enforce feature-group scope and versioning (as best practice)
▪ Better support for multiple feature stores - join features from
development and production feature stores
▪ Better support for complex joins of features
▪ First class API support for time-travel
▪ More consistent developer experience
20. Connect and Support for Multiple Feature Stores
import hsfs

# Connect to the production feature store
conn = hsfs.connection(host="ea2.aws.hopsworks.ai",
                       project="prod")
prod_fs = conn.get_feature_store()
dev_fs = conn.get_feature_store("dev")
21. Feature Group Operations
# Create feature group metadata
fg = dev_fs.create_feature_group("temperature",
                                 description="Temperature Features",
                                 version=1,
                                 online_enabled=True)
# Schema is inferred from the dataframe
fg.save(dataframe)
# Read the feature group as a dataframe
df = fg.read()
# Append more data to the feature group
fg.insert(dataframe, overwrite=False)
22. Tags
▪ Allow feature groups, features, and training datasets to be discoverable
▪ Tags are searchable from the Hopsworks UI
fg = dev_fs.get_feature_group("temperature", version=1)
fg.add_tag("country", "SE")
fg.add_tags({"country": "SE", "year": 2020})
24. Exploratory Data Analysis
fg = dev_fs.get_feature_group("temperature", version=1)
# Returns a dataframe object
fg.read()
# Show a sample of 10 rows in the feature group
fg.show(10)
fg.select(["date", "location", "avg"]).show(10)
fg.select(["date", "location", "avg"]).read() \
    .filter(col("location") == "Stockholm").show(10)
25. Joins - Pandas Style API
crop_fg = prod_fs.get_feature_group("crop", version=1)
temperature = dev_fs.get_feature_group("temperature", version=1)
rain = dev_fs.get_feature_group("rain", version=1)

joined_features = crop_fg.select(["location", "yield"]) \
    .join(temperature.select(["location", "season_avg"])) \
    .join(rain.select(["location", "avg_mm"]),
          on=["location"],
          join_type="left")
dataframe = joined_features.read()
26. Time-Travel
fs.get_feature_group("temperature", version=1,
                     wallclock_time=None,
                     wallclock_time_start=None,
                     wallclock_time_end=None)
▪ Explore how the feature group looked at a given point in time in the past
▪ List value changes between timestamps
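Both time-travel operations can be pictured as filters over an append-only commit log of (commit_time, key, row) entries: a point-in-time read keeps the latest row per key at or before `wallclock_time`, and a range read lists the changes committed between the start and end timestamps. A plain-Python sketch, not the Hopsworks implementation:

```python
# Toy time-travel over an append-only commit log -- illustrative only.
log = [
    (100, "stockholm", {"avg": -2.0}),
    (200, "stockholm", {"avg": 1.5}),
    (300, "stockholm", {"avg": 4.0}),
]

def as_of(log, wallclock_time):
    """State of the feature group at a point in time (latest row per key)."""
    state = {}
    for ts, key, row in sorted(log):
        if ts <= wallclock_time:
            state[key] = row
    return state

def changes(log, start, end):
    """Value changes committed in the interval (start, end]."""
    return [(ts, key, row) for ts, key, row in log if start < ts <= end]

print(as_of(log, 250))         # the group as it looked at time 250
print(changes(log, 100, 300))  # commits after 100 up to and including 300
```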
27. Create Train/Test Data from Joined Features
connector = fs.get_storage_connector("s3connector", "S3")
td = fs.create_training_dataset(name='crop_model',
description='Dataset to train the crop model',
version=1,
data_format='tfrecords',
connector=connector,
splits={'train': 0.7,'test': 0.2,'validate': 0.1})
td.save(joined_features)
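The `splits` argument partitions the rows by the given fractions before writing each split out. A minimal sketch of that behavior (illustrative; the `split_rows` helper is hypothetical, not the Hopsworks implementation):

```python
import random

def split_rows(rows, splits, seed=42):
    """Randomly partition rows by fraction, e.g. {'train': .7, 'test': .2, 'validate': .1}."""
    rng = random.Random(seed)          # fixed seed: reproducible splits
    shuffled = rows[:]
    rng.shuffle(shuffled)
    out, start = {}, 0
    names = list(splits)
    for i, name in enumerate(names):
        # last split takes the remainder so nothing is lost to rounding
        end = len(shuffled) if i == len(names) - 1 else start + round(splits[name] * len(shuffled))
        out[name] = shuffled[start:end]
        start = end
    return out

parts = split_rows(list(range(100)), {"train": 0.7, "test": 0.2, "validate": 0.1})
print({k: len(v) for k, v in parts.items()})  # {'train': 70, 'test': 20, 'validate': 10}
```

Shuffling before slicing avoids ordering bias (e.g. time-sorted rows all landing in one split), and giving the last split the remainder guarantees the partitions cover every row exactly once.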