- Why organization(s) need FeatureStore to remove complex pipeline jungles and to simplify Machine Learning workflows
- How we developed MetaConfig driven FeatureStore @MakeMyTrip, Architecture
- Prediction Serving infrastructure for online & batch
3. Deliverables of this presentation:
- Why common feature store?
- Productionizing ML via standardization
- Machine Learning Life Cycle
- Prediction Serving + Challenges
- FeatureStore Components
- Architecture
- Tools
- Next Steps
- References
4. Motivation
Developing a unified personalization platform to improve the customer experience of millions of
Indian travellers
Business Goal: Through Hyper Personalization
● Raise Engagement
● Drive Conversions + Boost Revenue
● Migrating Business Rule Engines to ML Models ( across different LOBs @MakeMyTrip)
Tech Goal:
● Machine learning models are only as good as the data they are trained on, so they need good data
management.
● ML systems are trained on a set of features. A feature is an input to a model: it can be a column in a
dataset, a complex computed metric, or the output of another model.
● A Feature Store is a central, common repository of highly curated features that are described through
well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
5. Before Feature Store : state of data platform
● Siloed data sets + serving APIs created per use case / project, leading to complex
data pipelines. Machine learning, if not implemented in the right manner, creates high tech debt.
○ Personalization : Cosmos
○ Customer Segmentation : HYDRA
○ Hotel Ranking / Sequencing + Intendo
○ DP : Dynamic / Differential Pricing : Hotel & Flights
○ Anomaly Detection, Destination trends, Demand Anomalies
● Real-time features require data engineering support for data scientists
● Lack of standardization & discovery: the same feature definition is duplicated
across different data pipelines and computed multiple times, so a change to a
definition means fixing it in every pipeline.
● Features used in training and serving were inconsistent
6. Productionizing ML via Standardization
● MetaConfigs & Feature Catalog : Documentation
● Reusability of features across projects / teams
● Standardized access to features between training & serving | Data governance + data quality
● More self-serve: reduces data scientist time spent on DE tasks
● Reduce Time to get to Production for ML Projects
● Reduce Data Tech-Debt & Improved Feature Quality
[Diagram: Data Store 1..N (raw data), Data Sets 1..N (structured data), Feature Engineering,
Feature Store (online + historical), Model: training + deploy]
7. Machine Learning Life Cycle
ML life cycle (image source: UC Berkeley RISELab)
Addition: feature pipelines
8. Prediction serving
- Ask: latency of 10-30 ms (< 30 ms)
- Challenges: complex models (DNNs)
- Hardware: GPUs / TPUs
- SageMaker provides an abstraction / middle layer between applications and complex
models through Docker containers
- Online: SageMaker Endpoints
- Batch scoring: pre-materialize predictions into a low-latency store (like a Redis
cluster / BoulderDB)
- Problems:
  - Requires substantial computation and space
  - Example: scoring all customers
  - Costly updates: a change means rescoring everything
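The batch-scoring path above can be sketched in a few lines. This is only an illustration under assumptions: a plain dict stands in for the low-latency store (Redis / BoulderDB), the `pred#<id>` key layout and the toy model are hypothetical, and nothing here reflects the actual MakeMyTrip implementation.

```python
from typing import Callable, Dict, Iterable, Optional


def materialize_predictions(
    customer_ids: Iterable[str],
    score: Callable[[str], float],
    store: Dict[str, float],
) -> int:
    """Batch path: score every customer up front and write the results to the store."""
    written = 0
    for customer_id in customer_ids:
        # "pred#<id>" is a hypothetical key layout, not one taken from the talk.
        store[f"pred#{customer_id}"] = score(customer_id)
        written += 1
    return written


def serve_prediction(store: Dict[str, float], customer_id: str) -> Optional[float]:
    """Online path: a plain key lookup, with no model call at request time."""
    return store.get(f"pred#{customer_id}")


def toy_model_v1(customer_id: str) -> float:
    """Toy scorer standing in for a real model."""
    return len(customer_id) / 10.0


store: Dict[str, float] = {}
customers = ["u1", "u22", "u333"]
scored = materialize_predictions(customers, toy_model_v1, store)
```

The listed problems show up directly: the store holds one entry per customer, and shipping a new model means calling `materialize_predictions` again over the full customer list.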
9. FeatureStore Glossary
Feature: a measurable property of a phenomenon under observation, defined in FSConfig.
FSConfig: stores the config / DSL + code to compute features, feature version information, feature analysis data and feature documentation.
FSCompute: computation engine developed over Spark; supports most of the Spark APIs for historical and online (streaming) compute.
FeatureStore: serves as a repository of features that can be used for training and evaluation of machine learning models.
FeatureGroup: internal to the system; groups the compute jobs of related features that share the same entity, input data sources and filter conditions, thereby optimizing the compute process.
FSScheduler: internal service that creates a feature DAG (with dependency resolution) and triggers its execution while handling retries and back pressure.
FS-DSA: Data Science Automation for model training + deployment, integrated with the Feature Store. Enables versioned and reproducible experiments.
FSBrokerAPI: online serving RESTful API endpoint for consumer applications.
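The dependency resolution that FSScheduler performs on the feature DAG can be illustrated with a topological sort. This is a minimal sketch, assuming each job declares the jobs it depends on. The job names are hypothetical, it uses Python's standard `graphlib` rather than anything from the talk, and retries and back pressure are out of scope.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every job runs after the jobs it depends on.

    Raises graphlib.CycleError (a ValueError subclass) if the dependencies
    contain a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical feature jobs; the names are illustrative only.
jobs = {
    "listing_views": set(),
    "listing_bookings": set(),
    "listing_conversion_rate": {"listing_views", "listing_bookings"},
    "listing_conversion_rank": {"listing_conversion_rate"},
}
order = execution_order(jobs)
```

Any order returned places `listing_conversion_rate` after both of its inputs and before `listing_conversion_rank`. The relative order of the two independent input jobs is not fixed.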
10. FeatureStore Components & Data Flow
[Data flow diagram, three stages]
DATA CAPTURE: user funnel activity streams (client-side, server-side); transactional data (Booking Master); master datastore (Product Master, User Master, Device Master); new data streams
COMPUTE + FSConfig: FSConfig (feature catalog, feature definitions); BT-Compute (batch feature compute jobs); RT-Compute (feature compute); Job Scheduler; ML Automation on SageMaker (Train: training + HPO; Deploy: Docker / Batch Transform)
SERVING + STORAGE: feature storage (BoulderDB, Redis); Serving API; Batch bulk API (DataFrame); offline models; online models
11. FSConfig : Feature Definitions & Metadata
Feature Name: <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
  Entity: <UserID>_<profileType>
  Short Name: listing_conversion_rank
  Versioning: v2
  Process: RT/BT
FeatureGroup (system-generated ID): 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
  Entity: ["uuid", "profile_type"]
  Features [Array]
  Time Window (refresh / data time duration, ISO time interval): P1D
Data Source [Array]: [user_master, txn_search]
  Data Store: GLUE/S3
  Database Name: blueshift
  Table Name: [user_master, txn_search]
Data Sink: Serving [Array]
  Data Store: GLUE Catalog/S3/Redis/BoulderDb
  Database Name: rocksDB_<WAL Dir Path>
  Table Name: rocksDB_<columnFamily>
Compute Logic:
  DSL + Spark SQL: metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
  Code (Python/Scala/Java): GIT/Gerrit URI
  Model (sagemaker) / Embedding
Environment: Production
  Workspace: Dev/Staging/Production
  Namespace: <Project Name>
  Apache LIVY + Databricks JOBs API Config
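The five-part feature name format on this slide lends itself to a small parser. The sketch below assumes the parts are always separated by `::` and that none of them is empty. The `FeatureName` type and the function names are hypothetical helpers, not part of FSConfig.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    """<Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>"""

    entity: str
    short_name: str
    data_interval: str
    refresh_frequency: str
    version: str

    def render(self) -> str:
        """Join the five parts back into the '::'-separated feature name."""
        return "::".join(self)


def parse_feature_name(name: str) -> FeatureName:
    """Split a full feature name into its five parts, rejecting malformed names."""
    parts = name.split("::")
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty '::'-separated parts, got: {name!r}")
    return FeatureName(*parts)


parsed = parse_feature_name("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Keeping the version as the last part means a `v2` of the same feature can live next to `v1` under a different full name, which is one way to read the "Versioning" field above.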
12. FS Store | online + historical
Output Schema (internal to the system)
● Historical Feature Data schema on S3 Parquet
|-- entity: string (nullable = false)
|-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
..
..
All features in that feature group
● Online Serving Data Schema on REDIS + BoulderDB
○ Serving at Feature Group level
Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
TimeStamp -> Compute_Processed_Time
○ Serving at Feature Level
Key -> <Entity_id>#<Feature_name>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
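The two online key layouts above can be sketched as simple string builders. This is an illustration only: a dict stands in for Redis / BoulderDB, the `TimeStamp` field name inside the hash is an assumption based on the schema lines, and the entity ID, split value and feature value are made-up placeholders.

```python
from typing import Any, Dict


def feature_group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Key for serving at feature-group level: <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id: str, feature_name: str) -> str:
    """Key for serving at feature level: <Entity_id>#<Feature_name>."""
    return f"{entity_id}#{feature_name}"


def feature_hash(features: Dict[str, Any], processed_time: str) -> Dict[str, Any]:
    """Hash value: one field per feature plus the compute-processed timestamp."""
    value = dict(features)
    value["TimeStamp"] = processed_time  # field name is an assumption
    return value


# In-memory stand-in for the online store; all values below are placeholders.
online_store: Dict[str, Dict[str, Any]] = {}
group_key = feature_group_key("user42_personal", "8fda73d1_2eee_4cfc_a20f_e9afb178fbc3", "0")
online_store[group_key] = feature_hash(
    {"uuid_profileType::listing_conv_rank::P30D::P15M::v1": 7},
    processed_time="2020-01-01T00:00:00Z",
)
single_key = feature_key("user42_personal", "uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

With the group-level layout, one read returns every feature of a feature group for an entity. The feature-level layout trades that for finer-grained reads and writes.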
SERVING Config
- lambda (batch_feature_name linkage for RT features)
- Support for linear QUERY DAGs
- MVEL-based post-processing on any feature per service/model if needed
Feature backfill (back_fill_required, back_fill_duration)
13. FS-BrokerAPI : Online Feature Serving Framework
[Architecture diagram, three layers]
Request: POST <uri>/v1/getFeatures
REQUEST HANDLER (AKKA actors): request validations, feature definition, request handler
Orchestration Layer: orchestration + broker, business logics + MVEL
Data Access Layer: extractors and transport for REDIS and BoulderDB
Query types: FeaturesbyName, FeaturesbyModel, FeaturesbyService
14. BoulderDB : Online Serving Store
- Built on top of RocksDB (an embedded data store developed by Facebook): reduces
the distance to data on the serving layer.
- Post-processing steps added to the compute layer:
  - After processing data through Spark (distributed), the BT-Compute layer writes SST files from the
  various executors into shared object storage (S3).
  - The Spark DataFrame is split into non-overlapping ranges; each split is sorted by key and then
  written into one SST file per partition / executor.
- Cluster coordinator: Consul
- Atomic switching of DB snapshots
- Data is sharded (helps with proximity by namespace) and replicated (RF=2)
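The "non-overlapping ranges, each sorted by key" step can be illustrated without Spark. The sketch below works on a plain list of key/value pairs and assumes unique keys. In the real pipeline the split would be done by Spark across executors, and each resulting chunk would become one SST file. The function name is hypothetical.

```python
from typing import Any, List, Tuple

Row = Tuple[str, Any]


def split_into_sorted_ranges(rows: List[Row], num_splits: int) -> List[List[Row]]:
    """Sort rows by key and cut them into contiguous, non-overlapping key ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    ordered = sorted(rows, key=lambda row: row[0])
    if not ordered:
        return []
    # Ceiling division, so that at most num_splits chunks are produced.
    chunk_size = -(-len(ordered) // num_splits)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


rows = [("c", 3), ("a", 1), ("d", 4), ("b", 2), ("e", 5)]
splits = split_into_sorted_ranges(rows, num_splits=2)
```

Because the whole input is sorted before it is cut, every key in one chunk is smaller than every key in the next, which is the property a store needs in order to ingest pre-sorted files without merging overlapping ranges.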
16. Next Steps
- Feature stats visualization / analytics & monitoring; Feature Catalog
- Seamless integration with the Experimentation Framework
- Per-user databases on top of the feature store for personalization
- Notebook integration: better data science tools for data scientists, with Python libraries
- Perf tools: query optimization & analysis