MetaConfig driven
FeatureStore@MakeMyTrip
~/Piyush
Head Data Platform Engineering
Namasté
About MakeMyTrip
Deliverables of this presentation:
- Why common feature store?
- Productionizing ML via standardization
- Machine Learning Life Cycle
- Prediction Serving + Challenges
- FeatureStore Components
- Architecture
- Tools
- Next Steps
- References
Motivation
Developing a unified personalization platform to improve the customer experience of millions of Indian
travellers
Business Goal: Through Hyper Personalization
● Raise Engagement
● Drive Conversions + Boost Revenue
● Migrating Business Rule Engines to ML Models ( across different LOBs @MakeMyTrip)
Tech Goal:
● Machine Learning models are only as good as the data they are trained on, so they need good data management.
● ML systems are trained on a set of features. A feature is an input to a model: a column in a
dataset, a complex computed metric, or the output of another model.
● Feature Store is a central, common repository of highly curated features that are described through
well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
Before Feature Store : state of data platform
● Siloed data sets + serving APIs created per use case / project, leading to complex
data pipelines | Machine Learning, if not implemented in the right manner, creates high tech debt
○ Personalization : Cosmos
○ Customer Segmentation : HYDRA
○ Hotel Ranking / Sequencing + Intendo
○ DP : Dynamic / Differential Pricing : Hotel & Flights
○ Anomaly Detection, Destination trends, Demand Anomalies
● Real-time features require Data Engineering support for Data Scientists
● Lack of standardization & discovery : the same feature definition is duplicated across
different data pipelines and computed multiple times, so a change to a definition
means fixing several pipelines.
● Features used in training and serving were inconsistent
Productionizing ML via Standardization
● MetaConfigs & Feature Catalog : Documentation
● Reusability of features across projects / teams
● Standardized access of features between Training &
Serving | Data Governance + Data Quality
● More Self-serve : Reduces Data Scientist Time on DE
Tasks
● Reduce Time to get to Production for ML Projects
● Reduce Data Tech-Debt & Improved Feature Quality
[Diagram: Data Store 1, Data Store 2, ... Data Store N (Raw Data); Data Sets 1 ... Data Sets N (Structured Data); Feature Engineering; Feature Store : Online + Historical; MODEL : TRAINING + DEPLOY]
Machine Learning Life Cycle
ML LifeCycle Image source : UCB RISE LABs
Addition : FEATURE PIPELINES
Prediction serving
- Ask : 10-30 ms latency (< 30 ms)
- Challenges : DNN : Complex models
- Hardware : GPUs / TPUs
- SageMaker provides an abstraction / middle layer between applications and complex
models through Docker containers
- Online : SageMaker Endpoints
- Batch : Scoring : Pre-materialize predictions into a low-latency store (like a Redis
cluster / BoulderDB)
- Problems :
  - Requires substantial computation and space
  - Example : scoring all customers
  - Costly update -> rescore everything
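The batch-scoring trade-off above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `materialize` and `lookup` helpers, the toy models, and the in-memory dict standing in for a low-latency store such as a Redis cluster.

```python
from typing import Callable, Dict, Tuple

# In-memory stand-in for a low-latency store (assumption: a real deployment
# would use something like a Redis cluster instead of a dict).
PredictionStore = Dict[str, Tuple[str, float]]


def materialize(
    model: Callable[[Dict[str, float]], float],
    model_version: str,
    customers: Dict[str, Dict[str, float]],
    store: PredictionStore,
) -> int:
    """Score every customer and write (model_version, score) into the store.

    Returns the number of predictions computed, which grows linearly with
    the customer base: the computation and space cost listed above.
    """
    for customer_id, features in customers.items():
        store[customer_id] = (model_version, model(features))
    return len(customers)


def lookup(store: PredictionStore, customer_id: str) -> float:
    """Serving path: a single key lookup, with no model call at request time."""
    return store[customer_id][1]


# Hypothetical toy data and models.
customers = {"c1": {"searches": 4.0}, "c2": {"searches": 10.0}}
store: PredictionStore = {}

scored = materialize(lambda f: f["searches"] * 0.1, "v1", customers, store)
# A model update invalidates every stored score, so everything is rescored.
rescored = materialize(lambda f: f["searches"] * 0.2, "v2", customers, store)
```

Serving stays a cheap lookup, but each model update costs a full pass over all customers.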
FeatureStore Glossary
Feature : a measurable property of a phenomenon
under observation defined in FSConfig
FSConfig: used for storing config/ DSL + code to
compute features, feature version information,
feature analysis data and feature documentation
FSCompute: Computation Engine developed over
SPARK, supports most of the Spark APIs for historical
and online (streaming) computation
FeatureStore : serves as a repository of features that
can be used for training and evaluation of machine
learning models.
FeatureGroup: internal to the system, to group
common compute jobs of related features having the
same entity, input data sources and filter conditions,
thereby optimizing the compute process.
FSScheduler: Internal service that creates a feature
DAG (with dependency resolution) and triggers its
execution while handling retries and back pressure.
FS-DSA : Data Science Automation for Model Training
+ Deployment integrated with Feature Store |
Enables versioned and reproducible experiments.
FSBrokerAPI : Online Serving RESTful API endpoint for
consumer applications
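The FSScheduler entry above mentions a feature DAG with dependency resolution and retries. A minimal sketch of that idea follows, using the standard-library `graphlib` (Python 3.9+). The function names, the retry policy, and the example feature names are assumptions for illustration; this is not the actual FSScheduler code, and back pressure is not modeled.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, TypeVar

T = TypeVar("T")


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return features in an order where every dependency runs first.

    `dependencies` maps a feature to the set of features it depends on.
    graphlib raises CycleError if the definitions are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_with_retries(job: Callable[[], T], max_attempts: int = 3) -> T:
    """Run a compute job, retrying on any exception up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except Exception as exc:  # hypothetical policy: retry every failure
            last_error = exc
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


# Hypothetical feature DAG: a rank that depends on two counts.
feature_dag = {
    "listing_view_cnt": set(),
    "listing_conv_cnt": {"listing_view_cnt"},
    "listing_conv_rank": {"listing_view_cnt", "listing_conv_cnt"},
}
order = execution_order(feature_dag)
```

Each name in `order` could then be passed to `run_with_retries` together with the job that computes it.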
FeatureStore Components & Data Flow
[Architecture diagram with three stages]
DATA CAPTURE : User Funnel Activity Streams (Client-Side, Server-Side); Transactional Data (Booking Master); Master Datastore (Product Master, User Master, Device Master); New Data Streams
COMPUTE + FSConfig : FSConfig : Feature Catalog (Feature Definitions); BT-Compute (BATCH Feature Compute Jobs); RT-Compute (Feature Compute); Job Scheduler; ML Automation on Sagemaker (TRAIN : Training + HPO; Deploy : Docker / Batch Transform)
SERVING + STORAGE : Feature Storage (BoulderDB, REDIS); SERVING API (Offline Models, Online Models); Batch BULK API (DataFrame)
FSConfig : Feature Definitions & Metadata
Feature Name : <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
Entity : <UserID>_<profileType>
Short Name : listing_conversion_rank
Versioning : v2
Process : RT/BT
FeatureGroup (System Generated ID) : 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
Entity : ["uuid", "profile_type"]
Features [Array]
Time Window (Refresh / Data - Time duration) : (ISO Time Interval) P1D
Data Source [Array] : [user_master, txn_search]
  Data Store : GLUE/S3
  Database Name : blueshift
  Table Name : [user_master, txn_search]
Data Sink : Serving [Array]
  Data Store : GLUE Catalog/S3/Redis/BoulderDb
  Database Name : rocksDB_<WAL Dir Path>
  Table Name : rocksDB_<columnFamily>
Compute Logic
  DSL + Spark SQL : metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
  Code (Python/Scala/Java) : GIT/Gerrit URI
  Model (sagemaker) / Embedding
Environment : Production
Workspace : Dev/Staging/Production
Namespace : <Project Name>
Apache LIVY + Databricks JOBs API Config
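The five-part feature naming convention in this config can be handled with a small parser. This is a sketch under stated assumptions: the `FeatureName` class and `parse_feature_name` function are hypothetical helpers, not part of FSConfig, and the interval fields are kept as opaque strings instead of being interpreted.

```python
from dataclasses import dataclass

SEPARATOR = "::"


@dataclass(frozen=True)
class FeatureName:
    """Parts of <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>."""

    entity: str
    short_name: str
    data_interval: str      # kept as the raw string, e.g. "P30D"
    refresh_frequency: str  # kept as the raw string, e.g. "P15M"
    version: str

    def render(self) -> str:
        """Rebuild the full feature name from its parts."""
        return SEPARATOR.join(
            [self.entity, self.short_name, self.data_interval,
             self.refresh_frequency, self.version]
        )


def parse_feature_name(name: str) -> FeatureName:
    """Split a full feature name into its five parts, rejecting malformed names."""
    parts = name.split(SEPARATOR)
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty '::'-separated parts, got {name!r}")
    return FeatureName(*parts)


parsed = parse_feature_name("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Keeping the name structured makes it straightforward to bump the version or select all features of one entity without string slicing.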
FS Store | online + historical
Output Schema (internal to the system)
● Historical Feature Data schema on S3 Parquet
|-- entity: string (nullable = false)
|-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
..
..
All features in that feature group
● Online Serving Data Schema on REDIS + BoulderDB
○ Serving at Feature Group level
    Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
    Value -> Hashes
        key -> Feature_name
        Value -> Feature_value
        TimeStamp -> Compute_Processed_Time
○ Serving at Feature Level
    Key -> <Entity_id>#<Feature_name>
    Value -> Hashes
        key -> Feature_name
        Value -> Feature_value
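The two online key layouts can be sketched as plain string builders. The helper names, the example entity id, the split value, and the timestamp format are all assumptions; a dict stands in for the Redis / BoulderDB store, so no client library is involved.

```python
from typing import Dict


def group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Key for serving at feature-group level: <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id: str, feature_name: str) -> str:
    """Key for serving at feature level: <Entity_id>#<Feature_name>."""
    return f"{entity_id}#{feature_name}"


def group_record(features: Dict[str, object], processed_time: str) -> Dict[str, str]:
    """Hash stored under a group key: one field per feature plus the compute timestamp."""
    record = {name: str(value) for name, value in features.items()}
    record["TimeStamp"] = processed_time
    return record


# Dict standing in for the online store (assumption, not a real client).
online_store: Dict[str, Dict[str, str]] = {}

key = group_key("u123_personal", "8fda73d1_2eee_4cfc_a20f_e9afb178fbc3", "0")
online_store[key] = group_record(
    {"uuid_profileType::listing_conv_rank::P30D::P15M::v1": 7},
    processed_time="2019-08-17T10:00:00Z",  # hypothetical timestamp format
)
```

Group-level keys return every feature of a group in one read, while feature-level keys suit consumers that need a single value.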
SERVING Config
- lambda (batch_feature_name linkage for RT features)
- Support for linear QUERY DAGs
- MVEL based post-processing on any feature per service/model if needed
- Feature backfill (back_fill_required, back_fill_duration)
FS-BrokerAPI : Online Feature Serving Framework
[Framework diagram with three layers]
REQUEST HANDLER : <uri>/v1/getFeatures (POST Request); AKKA (Actors); Request Validations; Feature Definition; Request Handler
Orchestration Layer : Orchestration + Broker; Business Logics + MVEL; FeaturesbyName, FeaturesbyModel, FeaturesbyService
Data Access Layer : Extractors, Transport; REDIS, BoulderDB
BoulderDB : Online Serving Store
- Built on top of RocksDB (embedded data store developed by Facebook) : reduces
the distance to data on the serving layer.
- Steps added to the compute layer as post-processing :
  - After processing data through SPARK (distributed), the BT-Compute layer writes SST files from the
  various executors into shared object storage : S3
  - The Spark DataFrame is split into non-overlapping ranges : each split is sorted by KEY, then
  ingested into one SST file per partition / executor
- Cluster coordinator : Consul
- Atomic switching of DB snapshots
- Data is sharded (helps with proximity by Namespace) and replicated (RF=2)
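The range-splitting and replication steps above can be sketched in plain Python. This is an illustration only: the function names are hypothetical, keys are assumed to be unique, and the round-robin replica placement is an assumption, since the slides state only that data is sharded and replicated with RF=2.

```python
from typing import List, Tuple

Row = Tuple[str, bytes]


def split_sorted_ranges(rows: List[Row], num_splits: int) -> List[List[Row]]:
    """Split rows into contiguous, non-overlapping key ranges, each sorted by key.

    Each returned split corresponds to what one executor would write as a
    single SST file, which requires keys in sorted order.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    ordered = sorted(rows, key=lambda kv: kv[0])
    if not ordered:
        return []
    size = -(-len(ordered) // num_splits)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def replica_nodes(split_index: int, nodes: List[str], replication_factor: int = 2) -> List[str]:
    """Pick the nodes holding a split (assumed round-robin placement)."""
    return [nodes[(split_index + r) % len(nodes)] for r in range(replication_factor)]


rows = [("k3", b"c"), ("k1", b"a"), ("k4", b"d"), ("k2", b"b"), ("k5", b"e")]
splits = split_sorted_ranges(rows, num_splits=2)
placement = [replica_nodes(i, ["n0", "n1", "n2"]) for i in range(len(splits))]
```

Because the ranges do not overlap, each file can be ingested independently, and a lookup for a key touches only the shard that owns its range.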
Tools
Next Steps
- Feature Stats Visualization / Analytics & Monitoring // Feature
Catalog
- Seamless integration with Experimentation Framework
- Per User Databases on top of feature-store for Personalization
- Notebook integration : Better Data Science tools for Data
Scientists, with Python libraries
- Perf Tools : Query Optimization & Analysis
References
- https://www.logicalclocks.com/feature-store/
- https://eng.uber.com/scaling-michelangelo/
- Airbnb : Zipline
- HopsML + Hopsworks
- Go-JEK : FEAST
- The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
- https://medium.com/makemytrip-engineering
Piyush Kumar
E : piyush.kumar@makemytrip.com
W : www.makemytrip.com
T : https://twitter.com/piykumar
Thank you !!

MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 by Piyush Kumar

Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 

MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 by Piyush Kumar

  • 3. Deliverables of this presentation:
      - Why a common feature store?
      - Productionizing ML via standardization
      - Machine Learning Life Cycle
      - Prediction Serving + Challenges
      - FeatureStore Components
      - Architecture
      - Tools
      - Next Steps
      - References
  • 4. Motivation
      Developing a unified personalization platform to improve the customer experience of millions of Indian travellers.
      Business Goal: Through Hyper Personalization
      ● Raise Engagement
      ● Drive Conversions + Boost Revenue
      ● Migrate Business Rule Engines to ML Models (across different LOBs @MakeMyTrip)
      Tech Goal:
      ● Machine Learning models are only as good as the data they are trained on, so they need good data management.
      ● ML systems are trained on a set of features. A feature is an input to a model; it can be a column in a dataset, a complex computed metric, or the output of another model.
      ● The Feature Store is a central, common repository for highly curated features that are described through well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
  • 5. Before Feature Store : state of data platform
      ● Siloed data sets + serving APIs created per use-case / project, leading to complex data pipelines. Machine Learning, if not implemented in the right manner, creates high tech debt.
        ○ Personalization : Cosmos
        ○ Customer Segmentation : HYDRA
        ○ Hotel Ranking / Sequencing + Intendo
        ○ DP : Dynamic / Differential Pricing : Hotel & Flights
        ○ Anomaly Detection, Destination trends, Demand Anomalies
      ● RealTime Features require Data Engineering support from Data Scientists
      ● Lack of standardization & discovery : Feature definitions are duplicated across different data pipelines even when they are the same, so features are computed multiple times and a change to a definition means fixing several pipelines.
      ● Features used in training and serving were inconsistent
  • 6. Productionizing ML via Standardization
      ● MetaConfigs & Feature Catalog : Documentation
      ● Reusability of features across projects / teams
      ● Standardized access of features between Training & Serving | Data Governance + Data Quality
      ● More Self-serve : Reduces Data Scientist Time on DE Tasks
      ● Reduce Time to get to Production for ML Projects
      ● Reduce Data Tech-Debt & Improved Feature Quality
      [Diagram labels: Raw Data (Data Store 1, Data Store 2, Data Store N); Structured Data (Data Sets 1, Data Sets N); Feature Engineering; Feature Store : Online + Historical; MODEL : TRAINING + DEPLOY]
  • 7. Machine Learning Life Cycle
      [Figure: ML LifeCycle. Image source : UCB RISE LABs. Addition : FEATURE PIPELINES]
  • 8. Prediction serving
      - ASK : 10 - 30 ms / < 30 ms
      - Challenges : DNN : Complex models
      - Hardware : GPUs / TPUs
      - SageMaker provides an abstraction / middle layer between applications and complex models through Docker containers
      - Online : SageMaker Endpoints
      - Batch : Scoring : Pre-materialize predictions into a low-latency store (like a Redis cluster / BoulderDB)
      - Problems :
        - Requires substantial computation and space
        - Example : doing the scoring for all customers
        - Costly update -> rescore everything
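The batch option on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `score`, `materialize`, and `lookup` are hypothetical names, the scoring function is a stand-in for a real model, and a plain dict stands in for the low-latency store (Redis / BoulderDB in the slides).

```python
def score(features):
    # Hypothetical stand-in for a real model; integer arithmetic keeps it exact.
    return 2 * features["listing_views"] + features["bookings"]


def materialize(all_features, model_version):
    """Score every customer up front and return the pre-materialized store."""
    return {
        customer_id: {"score": score(f), "model_version": model_version}
        for customer_id, f in all_features.items()
    }


def lookup(store, customer_id):
    """Serving path: a key lookup only, no model call at request time."""
    return store.get(customer_id)


# Hypothetical feature rows keyed by customer id.
all_features = {
    "c1": {"listing_views": 10, "bookings": 2},
    "c2": {"listing_views": 3, "bookings": 0},
}
store_v1 = materialize(all_features, "v1")
# A model update means rescoring every customer, which is the costly update noted above.
store_v2 = materialize(all_features, "v2")
```

Serving becomes a cheap lookup, but the cost moves to the batch job: it touches every customer on every model change, whether or not that customer is ever requested.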
  • 9. FeatureStore Glossary
      Feature : a measurable property of a phenomenon under observation, defined in FSConfig.
      FSConfig : stores the config / DSL + code to compute features, feature version information, feature analysis data and feature documentation.
      FSCompute : computation engine developed over SPARK; supports most of the Spark APIs for historical and online (streaming) computation.
      FeatureStore : serves as a repository of features that can be used for training and evaluation of machine learning models.
      FeatureGroup : internal to the system; groups the compute jobs of related features that share the same entity, input data sources and filter conditions, thereby optimizing the compute process.
      FSScheduler : internal service that creates a feature DAG (with dependency resolution) and triggers its execution while handling retries and back pressure.
      FS-DSA : Data Science Automation for model training + deployment, integrated with the Feature Store | Enables versioned and reproducible experiments.
      FSBrokerAPI : online serving RESTful API endpoint for consumer applications.
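The FSScheduler entry describes a feature DAG with dependency resolution and retries. A minimal sketch of that idea, assuming Python 3.9+ for the standard-library `graphlib`; the feature-group names, `run_with_retries`, and `schedule` are hypothetical, and back pressure is not modelled here.

```python
from graphlib import TopologicalSorter

# Hypothetical feature groups; each key maps to the groups it depends on.
dependencies = {
    "user_base_counts": set(),
    "listing_ranks": {"user_base_counts"},
    "conversion_model_inputs": {"listing_ranks", "user_base_counts"},
}


def run_with_retries(job, name, max_retries=2):
    """Call job(name); on RuntimeError retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return job(name)
        except RuntimeError:
            if attempt == max_retries:
                raise


def schedule(dependencies, job):
    """Run every group after its dependencies and return the execution order."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        run_with_retries(job, name)
        order.append(name)
    return order


# Example job that fails once for one group to exercise the retry path.
calls = {}


def flaky_job(name):
    calls[name] = calls.get(name, 0) + 1
    if name == "listing_ranks" and calls[name] == 1:
        raise RuntimeError("transient failure")


order = schedule(dependencies, flaky_job)
```

`TopologicalSorter` raises `CycleError` when the dependencies contain a cycle, which gives a simple validation step before any job is triggered.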
  • 10. FeatureStore Components & Data Flow
      [Diagram: three stages labelled DATA CAPTURE, COMPUTE + FSConfig, and SERVING + STORAGE. Other labels: User Funnel Activity Streams (Client-Side, Server-Side); Transactional Data; Booking Master; Master Datastore (Product Master, User Master, Device Master); New Data Streams; FSConfig : Feature Catalog; Feature Definitions; Job Scheduler; BT-Compute : BATCH Feature Compute Jobs; RT-Compute : Feature Compute; Feature Storage (BoulderDB, REDIS); SERVING API; Batch BULK API (DataFrame); Offline Models; Online Models; ML Automation; Sagemaker TRAIN : Training + HPO; Deploy : Docker / Batch Transform]
  • 11. FSConfig : Feature Definitions & Metadata
      Feature Name : <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
        Entity : <UserID>_<profileType>
        Short Name : listing_conversion_rank
        Versioning : v2 + Process : RT/BT
      FeatureGroup : (System Generated ID) 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
        Entity : ["uuid", "profile_type"]
        Features [Array]
        Time Window (Refresh / Data - Time duration) : (ISO Time Interval) P1D
      Data Source [Array] : [user_master, txn_search]
        Data Store : GLUE/S3
        Database Name : blueshift
        Table Name : [user_master, txn_search]
      Data Sink : Serving [Array]
        Data Store : GLUE Catalog/S3/Redis/BoulderDb
        Database Name : rocksDB_<WAL Dir Path>
        Table Name : rocksDB_<columnFamily>
      Compute Logic
        DSL + Spark SQL : metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
        Code (Python/Scala/Java) : GIT/Gerrit URI
        Model (sagemaker) / Embedding
      Environment : Production
        Workspace : Dev/Staging/Production
        Namespace : <Project Name>
      Apache LIVY + Databricks JOBs API Config
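The five-part feature name convention lends itself to a small parser. The sketch below is an assumption about how such names could be handled, not the FSConfig implementation; `FeatureName`, `parse_feature_name`, and `format_feature_name` are hypothetical names, and the two interval fields are kept as raw strings rather than interpreted.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    entity: str
    short_name: str
    data_interval: str      # raw string from the name, e.g. "P30D"
    refresh_frequency: str  # raw string from the name, e.g. "P15M"
    version: str


def parse_feature_name(name: str) -> FeatureName:
    """Split '<Entity>::<Feature_shortname>::<Interval>::<Refresh>::<Version>'."""
    parts = name.split("::")
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty '::'-separated parts, got {name!r}")
    return FeatureName(*parts)


def format_feature_name(feature: FeatureName) -> str:
    """Inverse of parse_feature_name."""
    return "::".join(feature)


# Name taken from the output schema shown on slide 12.
parsed = parse_feature_name("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Keeping the version as its own field lets two versions of a feature (for example v1 and v2) coexist in the same store under distinct names.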
  • 12. FS Store | online + historical
      Output Schema (internal to the system)
      ● Historical Feature Data schema on S3 Parquet
          |-- entity: string (nullable = false)
          |-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
          |-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
          |-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
          |    |-- key: string
          |    |-- value: integer (valueContainsNull = true)
          .. .. All features in that feature group
      ● Online Serving Data Schema on REDIS + BoulderDB
        ○ Serving at Feature Group level
            Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
            Value -> Hashes
              key -> Feature_name
              Value -> Feature_value
              TimeStamp -> Compute_Processed_Time
        ○ Serving at Feature Level
            Key -> <Entity_id>#<Feature_name>
            Value -> Hashes
              key -> Feature_name
              Value -> Feature_value
      SERVING Config
        - lambda (batch_feature_name linkage for RT features)
        - Support for linear QUERY DAGs
        - MVEL based post-processing on any feature per service/model if needed
      Feature backfill (back_fill_required, back_fill_duration)
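The online key layout for feature-group serving can be illustrated with an in-memory stand-in for a hash store. Everything here is a sketch under assumptions: `InMemoryHashStore` is a hypothetical substitute for Redis / BoulderDB, the entity id, split value, and `TimeStamp` field name are illustrative, and `<Feature_split>` is treated as an opaque string because the slides do not define it.

```python
def group_key(entity_id, feature_group_id, feature_split):
    """Key for feature-group serving: <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id, feature_name):
    """Key for feature-level serving: <Entity_id>#<Feature_name>."""
    return f"{entity_id}#{feature_name}"


class InMemoryHashStore:
    """Hypothetical stand-in for a store of hashes (key -> field -> value)."""

    def __init__(self):
        self._data = {}

    def hset(self, key, mapping):
        self._data.setdefault(key, {}).update(mapping)

    def hmget(self, key, fields):
        stored = self._data.get(key, {})
        return [stored.get(field) for field in fields]


store = InMemoryHashStore()
key = group_key("u123_personal", "8fda73d1_2eee_4cfc_a20f_e9afb178fbc3", "0")
store.hset(key, {
    "uuid_profileType::listing_conv_rank::P30D::P15M::v1": 4,
    "uuid_profileType::listing_view_rank::P30D::P15M::v1": 9,
    "TimeStamp": "2019-08-17T00:00:00Z",  # illustrative compute-processed time
})
values = store.hmget(key, [
    "uuid_profileType::listing_conv_rank::P30D::P15M::v1",
    "uuid_profileType::listing_view_rank::P30D::P15M::v1",
])
```

With group-level keys, one read returns every feature of a group for an entity, while feature-level keys trade that for finer-grained reads and writes.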
  • 13. FS-BrokerAPI : Online Feature Serving Framework
      [Diagram labels: <uri>/v1/getFeatures (POST Request); REQUEST HANDLER; AKKA (Actors); Request Validations; Feature Definition; Request Handler; Orchestration Layer; Orchestration + Broker; Business Logics + MVEL; Data Access Layer; Extractors; Transport; REDIS; Boulder DB; FeaturesbyName; FeaturesbyModel; FeaturesbyService]
  • 14. BoulderDB : Online Serving Store
      - Built on top of RocksDB (embedded data store developed by Facebook) : reduces the distance to data on the serving layer.
      - Steps added to the compute layer as post-processing :
        - After processing data through SPARK (distributed), the BT-Compute layer writes SST files from the various executors into shared object storage : S3
        - The Spark dataframe is split into non-overlapping ranges : each split is sorted by KEY, then ingested into an SST file per partition / executor
      - Cluster coordinator : Consul
      - Atomic switching of DB snapshots
      - Data is sharded (helps with proximity by Namespace) and replicated (RF=2)
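The split step, which partitions rows into non-overlapping key ranges with each range sorted by key, can be sketched in plain Python. This is an illustration of the partitioning idea only: it uses no Spark or RocksDB APIs, `split_into_sorted_ranges` is a hypothetical name, and keys are assumed to be unique.

```python
def split_into_sorted_ranges(rows, num_splits):
    """Split (key, value) rows into contiguous, non-overlapping, key-sorted ranges.

    Assumes keys are unique; returns at most num_splits lists.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    ordered = sorted(rows, key=lambda row: row[0])
    if not ordered:
        return []
    size = -(-len(ordered) // num_splits)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


rows = [("k3", 3), ("k1", 1), ("k5", 5), ("k2", 2), ("k4", 4)]
splits = split_into_sorted_ranges(rows, 2)
# Each split could then be written to its own file, one per partition / executor.
```

Because the ranges do not overlap, each output file covers a distinct slice of the key space and the files can be produced independently of one another.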
  • 15. Tools
  • 16. Next Steps
      - Feature Stats Visualization / Analytics & Monitoring // Feature Catalog
      - Seamless integration with the Experimentation Framework
      - Per-user databases on top of the feature store for Personalization
      - Notebook integration : better Data Science tools for Data Scientists, with Python libraries
      - Perf Tools : Query Optimization & Analysis
  • 17. References
      - https://www.logicalclocks.com/feature-store/
      - https://eng.uber.com/scaling-michelangelo/
      - Airbnb : Zipline
      - HopsML + Hopsworks
      - Go-JEK : FEAST
      - The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
      - https://medium.com/makemytrip-engineering
  • 18. Piyush Kumar
      E : piyush.kumar@makemytrip.com
      W : www.makemytrip.com
      T : https://twitter.com/piykumar
      Thank you !!

Editor's Notes

  1. Cross-Sell , Recommendation Engine, Personalized Filters
  2. Feature store specific to ONE model; so multiple feature stores and multiple pipelines for multiple models.
  3. Will make sense in the context of an example