1. Big Data for the rest of us
Lawrence Spracklen
SupportLogic
lawrence@supportlogic.io
www.linkedin.com/in/spracklen
2. SupportLogic
• Extract Signals from enterprise CRM systems
• Applied machine learning
• Complete vertical solution
• Go-live in days!
• We are hiring!
3. @Scale 2018
• Sound like your Big Data problems?
• This is Extreme data!
• Do these solutions help or hinder Big Data for the rest of us?
“Exabytes of data…”
“1500 manual labelers…”
“Sub-second global propagation of likes…”
4. End-2-End Planning
• Numerous steps/obstacles to successfully leveraging ML
• Data Acquisition
• Data Cleansing
• Feature Engineering
• Model Selection and Training
• Model Optimization
• Model Deployment
• Model Feedback and Retraining
• Important to consider all steps before deciding on an approach
• Upstream decisions can severely limit downstream options
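One way to keep those steps from drifting apart is to chain them explicitly. A minimal sketch using scikit-learn's Pipeline (the synthetic dataset and step choices here are illustrative assumptions, not a prescribed stack):

```python
# Sketch: cleansing -> feature engineering -> training as one pipeline,
# so the same upstream steps are replayed at scoring/deployment time.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # data cleansing
    ("scale", StandardScaler()),                    # feature engineering
    ("model", LogisticRegression(max_iter=1000)),   # model training
])
pipe.fit(X_train, y_train)
print(round(pipe.score(X_test, y_test), 2))
```

Because the pipeline is a single object, an upstream change (say, a different imputation strategy) automatically propagates to scoring, illustrating why upstream decisions constrain downstream options.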
5. ML Landscape
• How do I build a successful production-grade solution from all these
disparate components that don’t play well together?
6. Data Set Availability
• Is the necessary data available?
• Are there HIPAA, PII, GDPR concerns?
• Is it spread across multiple systems?
• Can the systems communicate?
• Data fusion
• Move the compute to the data…
• Legacy infrastructure decisions can dictate optimal approach
7. Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
• Feature extraction
• Feature selection
• Feature construction
• Feature elimination
• Dimensionality reduction
• Traditionally a laborious manual process
• Automation techniques becoming available
• e.g. TransmogrifAI, Featuretools
• Leverage feature stores!
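TransmogrifAI and Featuretools automate much of this; as a simpler stand-in for the feature-selection step specifically, scikit-learn's univariate selection works in a few lines (dataset and `k` are illustrative):

```python
# Sketch of automated feature selection: keep only the k features
# most statistically associated with the target.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 5)
```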
8. Model Training
• Big differences in the range of algorithms offered by different
frameworks
• Don’t just jump to the most complex!
• Easy to automate selection process
• Just click ‘go’
• Automate hyperparameter optimization
• Beyond the nested for-loop!
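"Beyond the nested for-loop" in practice: scikit-learn's RandomizedSearchCV samples the hyperparameter space instead of exhaustively iterating it. The parameter ranges below are illustrative assumptions:

```python
# Sketch: sample 5 hyperparameter configurations instead of
# grid-searching all 12 combinations with nested loops.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10, None],
    },
    n_iter=5,        # number of sampled configurations
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```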
9. Model Ops
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
• Operationalizing PowerPoint?
• Hand rolled scoring flows?
10. Barriers to Model Ops
• Scoring is often performed on a different data platform than training
• Framework specific persistence formats
• Complex data preprocessing requirements
• Data cleansing and feature engineering
• Batch training versus RT/stream scoring
• How frequently are models updated?
• How is performance monitored?
12. PMML & PFA
• PMML has long been available as a framework-agnostic model
representation
• Frequently requires helper scripts
• PFA is the potential successor….
• Addresses lots of PMML’s shortcomings
• Scoring engines accepting R or Python scripts
• Easy to use AWS Lambda!
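A hedged sketch of the "scoring engine accepting Python scripts" idea: an AWS-Lambda-style handler that scores a JSON request against a model loaded once at cold start. The handler name, payload shape, and model are all illustrative assumptions:

```python
# Sketch of a Lambda-style scoring endpoint. The model is loaded
# outside the handler so it is reused across invocations.
import json
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
MODEL = LogisticRegression(max_iter=1000).fit(X, y)

def handler(event, context=None):
    """Score a single feature vector passed in the request body."""
    features = json.loads(event["body"])["features"]
    pred = MODEL.predict([features])[0]
    return {"statusCode": 200,
            "body": json.dumps({"prediction": int(pred)})}

# Simulated invocation
resp = handler({"body": json.dumps({"features": [float(v) for v in X[0]]})})
print(resp["statusCode"])  # 200
```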
13. Interpreting Models
• A prediction without an explanation limits its value
• Why is this outcome being predicted?
• What action should be taken as a result?
• Avoid ML models that are “black boxes”
• Tools for providing prediction explanations are emerging
• E.g. LIME
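LIME explains individual predictions; as a simpler, related illustration (a global rather than per-prediction explanation, swapped in here for brevity), scikit-learn's permutation importance shows which features a model actually depends on:

```python
# Sketch: permutation importance as a basic model-explanation tool.
# Features whose shuffling hurts accuracy are the ones driving predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```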
15. Prototype in Python
• Explore the space!
• Work through the end-2-end solution
• Don’t prematurely optimize
• Great Python tooling
• e.g. Jupyter Notebooks, Cloudera Data Science Workbench
• Don’t let the data leak to laptops!
16. Python is slow
• Python is simple, flexible, and has a massive amount of available
functionality
• Pure Python typically hundreds of times slower than C
• Many Python implementations leverage C under-the-hood
• Even naive Scala or Java implementations are slow
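The "C under-the-hood" point in miniature: the same dot product in pure Python and in NumPy (whose inner loop runs in compiled code). On typical hardware the NumPy version is orders of magnitude faster at larger sizes, while producing the same answer:

```python
# Pure-Python loop vs. NumPy's compiled implementation of the same math.
import numpy as np

def dot_pure(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

n = 100_000
a = list(range(n))
b = list(range(n))

pure = dot_pure(a, b)
fast = float(np.dot(np.array(a, dtype=float), np.array(b, dtype=float)))
print(pure == fast)  # same result; the interpreted loop is far slower
```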
18. Everything Python
• Python wrappers are available for most packages
• Even momentum in Spark is moving to Python
• Wrappers for C++ libraries like Shogun
19. Spark
• Optimizing for speed, data size or both?
• Increasingly rich set of ML algorithms
• Still missing common algorithms
• E.g. Multiclass GBTs
• Not all OSS implementations are good
• Hard to correctly resource Spark jobs
• Autotuning systems available
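On the missing-algorithm example: Spark MLlib's GBT classifier is binary-only, and the usual in-Spark workaround is a one-vs-rest wrapper. On a single node, scikit-learn handles the multiclass case directly; a sketch (synthetic data, default settings):

```python
# Sketch: multiclass gradient-boosted trees, which scikit-learn
# supports natively.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_classes=3,
                           n_informative=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
print(model.n_classes_)  # 3
```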
20. System Sizing
• Why go multi-node?
• CPU or Memory constraints
• Aggregate data size is very different from the size of the individual data sets
• A data lake can contain petabytes, but each dataset may be only tens of GB
• Is the raw data bigger or smaller than final data being consumed by the model?
• Spark for ETL
• Is the algorithm itself parallel?
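The raw-versus-consumed question can be checked empirically before committing to a multi-node stack. A quick in-memory size comparison with pandas (the columns and the ETL step are illustrative assumptions):

```python
# Sketch: compare the footprint of the raw data against the features
# the model actually consumes after ETL.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "text": ["row %d payload" % i for i in range(10_000)],  # wide raw field
    "value": rng.random(10_000),
})

# After ETL, only the numeric feature survives
features = raw[["value"]]

raw_mb = raw.memory_usage(deep=True).sum() / 1e6
feat_mb = features.memory_usage(deep=True).sum() / 1e6
print(feat_mb < raw_mb)  # the modeled data is far smaller than the raw data
```

If the post-ETL data fits comfortably in single-node memory, Spark may only be needed for the ETL stage.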
21. Single Node ML
• Single-node memory, even on x86 systems, can now measure in the
tens of terabytes
• Likely to expand further with NVDIMMs
• 40vCPU, ~1TB x86 only $4/hour on Google Cloud
• Many high performance single-node ML libraries exist!
22. Hive & Postgres
• On Hadoop, many data scientists are constrained to Hive or
Impala for security reasons
• Can be very limiting for ‘real’ data science
• Hivemall for analytics
• Is a traditional DB a better choice?
• Better performance in many instances
• Apache MADlib for analytics
23. Conclusions
• No one-size fits all!
• Much more to a successful ML project than a cool model
• Not all frameworks play together
• Decisions can limit downstream options
• Need to think about the problem end-2-end
• From data acquisition to model deployment