ML Master Class, CFA Society Poland
1. Machine Learning and AI in Finance
2020 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
sri@quantuniversity.com
www.quantuniversity.com
10/22/2020
CFA Society Poland
2. 2
Speaker bio
• Advisory and Consultancy for Financial
Analytics
• Prior Experience at MathWorks, Citigroup
and Endeca and 25+ financial services and
energy customers.
• Columnist for the Wilmott Magazine
• Author of forthcoming book
“The Model-Driven Enterprise”
• Teaches AI/ML and Fintech Related topics in
the MS and MBA programs at Northeastern
University, Boston
• Reviewer: Journal of Asset Management
Sri Krishnamurthy
Founder and CEO
QuantUniversity
3. 3
QuantUniversity
• Boston-based Data Science, Quant
Finance and Machine Learning
training and consulting advisory
• Trained more than 1000 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Building a platform for AI
and Machine Learning Exploration
and Experimentation
4. Agenda
1. Key trends in AI, Machine Learning & Fintech
2. An intuitive introduction to AI and ML
3. Case studies
4. Slides at: https://academy.qusandbox.com/#/market/5f91612b99aa4a24691da7ef
5. Use code CFAPoland as the registration code
6. 6
The 4th Industrial revolution is Here!
Source: Christoph Roser at AllAboutLean.com
As per Wikipedia*, “The 4th Industrial Revolution ….. marked by emerging technology breakthroughs in a
number of fields, including robotics, artificial intelligence, nanotechnology, quantum computing, biotechnology,
the Internet of Things, the Industrial Internet of Things (IIoT), decentralized consensus, fifth-generation wireless
technologies (5G), additive manufacturing/3D printing and fully autonomous vehicles.”
* https://en.wikipedia.org/wiki/Fourth_Industrial_Revolution
7. 7
Scientists are disrupting the way we live!
Source: https://www.ladn.eu/tech-a-suivre/mobilite-2030-vehicules-volants-open-data/
8. 8
Interest in Machine learning continues to grow
https://www.wipo.int/edocs/pubdocs/en/wipo_pub_1055.pdf
11. 11
• Machine learning is the scientific study of algorithms and statistical
models that computer systems use to effectively perform a specific task
without using explicit instructions, relying on patterns and inference
instead [1]
• Artificial intelligence is intelligence demonstrated by machines, in
contrast to the natural intelligence displayed by humans and animals [1]
Defining Machine Learning and AI
1. https://en.wikipedia.org/wiki/Machine_learning
2. Figure Source: http://www.fsb.org/wp-content/uploads/P011117.pdf
12. 12
Machine Learning & AI in finance: A paradigm shift
Quant: Stochastic Models, Factor Models, Optimization, Risk Factors, P/Q Quants,
Derivative pricing, Trading Strategies, Simulations, Distribution fitting
Data Scientist: Real-time analytics, Predictive analytics, Machine Learning, RPA, NLP,
Deep Learning, Computer Vision, Graph Analytics, Chatbots, Sentiment Analysis,
Alternative Data
14. 14
The rise of Big Data and Data Science
Image Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
15. 15
Smart Algorithms
Distributed Computing Frameworks | Deep Learning Frameworks
1. Our labeled datasets were thousands of times too
small.
2. Our computers were millions of times too slow.
3. We initialized the weights in a stupid way.
4. We used the wrong type of non-linearity.
- Geoff Hinton
“Capital One was able to determine fraudulent credit
card applications in 100 milliseconds”*
* http://go.databricks.com/hubfs/pdfs/Databricks-for-FinTech-170306.pdf
17. 17
“Financial Technologies or ‘Fintech’ is used to describe a variety of
innovative business models and emerging technologies that have the
potential to transform the financial services industry”
Technology drives finance!
https://www.iosco.org/library/pubdocs/pdf/IOSCOPD554.pdf
23. Risk Systems That Read®
• Northfield uses machine learning based analysis of news text
to describe how current conditions in financial markets are
different than usual.
• Typically, over 8000 articles per day containing more than
20,000 “topics” (companies, industries, countries) are
processed.
• The nature and magnitudes of these differences are used to
revise expectations of financial market risks for all global
equities and credit instruments on a daily basis.
24. 24
1. Leveraging large and diverse datasets for
investment decision making at J.P. Morgan [1]
2. Improving quantitative investing at AQR [2]
3. Using sandboxes and labs to further innovation
in fintech at Fidelity [3]
4. Use of AI and ML increasing in asset
management, from idea generation to execution
(Wells Fargo) [4]
Additional Use cases
1. https://www.jpmorgan.com/global/cib/research/investment-decisions-using-machine-learning-ai
2. https://www.aqr.com/Learning-Center/Machine-Learning
3. https://www.fidelitylabs.com/
4. https://www08.wellsfargomedia.com/assets/pdf/personal/investing/investment-institute/IG_Machines_Are_Coming_ADA.pdf
26. 26
• Automation to increase
• Digital transformation and move to the cloud finally happening
• Use of Synthetic data to increase
• Edge cases of AI put to truth test!
• Fintechs feeling the pressure to prove themselves!
• Human-in-the-loop AI to regain focus!
The changes have been drastic and sudden! What’s in
store for the industry is yet to be seen!
What does COVID-19 mean for the adoption of AI and ML in
financial services?
29. 29
Let’s get under the hood
Source: https://www.pikrepo.com/fcsda/yellow-hot-rod-car-with-hood-open
30. Machine Learning Workflow
Stages:
• Data Scraping/Ingestion
• Data Exploration
• Data Cleansing and Processing
• Feature Engineering
• Model Evaluation & Tuning
• Model Selection
• Model Deployment/Inference
Modeling: Supervised, Unsupervised
Roles: Data Engineer, DevOps Engineer; Data Scientist/Quants; Software/Web Engineer;
Analysts & Decision Makers
Supervised: Regression, KNN, Decision Trees, Naive Bayes, Neural Networks, Ensembles
Unsupervised: Clustering, PCA, Autoencoder
Evaluation: RMSE, MAPE, MAE, Confusion Matrix, Precision/Recall, ROC
Tuning: Hyper-parameter tuning, Parameter Grids
Deployment: SW: Web/REST API; HW: GPU, Cloud; Monitoring
Also shown: AutoML, Model Validation, Interpretability
Robotic Process Automation (RPA) (Microservices, Pipelines)
Risk Management/Compliance (all stages)
31. 31
1. Data
2. Goals
3. Machine learning algorithms
4. Process
5. Performance evaluation
Key steps involved
33. 33
Dataset, variable and Observations
Dataset: A rectangular array with Rows as observations and
columns as variables
Variable: A characteristic of members of a population (age, state,
etc.)
Observation: List of Variable values for a member of the
population
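A minimal sketch of these three terms in Python, assuming a made-up toy dataset (the names and values are hypothetical and only illustrate the layout):

```python
# Hypothetical toy dataset: each dict is one observation (row),
# and each key is one variable (column).
dataset = [
    {"age": 34, "state": "MA", "salary": 170_000},
    {"age": 51, "state": "NY", "salary": 95_000},
    {"age": 28, "state": "TX", "salary": 62_000},
]

variables = list(dataset[0].keys())      # the columns
observation = dataset[1]                 # one row: all variable values for one member
ages = [row["age"] for row in dataset]   # one variable across all observations

print(variables)
print(observation)
print(ages)
```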
34. 34
Variables
A variable could be:
▫ Categorical
– Yes/No flags
– AAA,BB ratings for bonds
▫ Numerical
– 35 mpg
– $170K salary
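As a rough illustration, the categorical/numerical split can be checked programmatically. The rule below is a simplifying assumption (real datasets often store categories as numeric codes), and `variable_type` is a hypothetical helper:

```python
def variable_type(values):
    """Classify a column as 'numerical' or 'categorical' (simplified rule)."""
    # Booleans are treated as categorical yes/no flags, not as numbers.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"

ratings = ["AAA", "BB", "AAA"]   # bond ratings
mpg = [35, 28.5, 41]             # miles per gallon
flags = [True, False, True]      # yes/no flags

for name, column in [("ratings", ratings), ("mpg", mpg), ("flags", flags)]:
    print(name, variable_type(column))
```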
38. 38
• Descriptive Statistics
▫ Goal is to describe the data at hand
▫ Backward-looking
▫ Statistical techniques employed here
• Predictive Analytics
▫ Goal is to use historical data to build a model for prediction
▫ Forward-looking
▫ Machine learning & AI techniques employed here
Goal
39. 39
• Given a dataset, build a model that captures the
similarities in different observations and assigns
them to different buckets: Clustering
• Given a set of variables, predict the value of
another variable in a given dataset: Prediction
▫ Predict salaries given work experience, education, etc.
▫ Predict whether a loan would be approved given FICO
score, current loans, employment status, etc.
Predictive Analytics : Cross sectional datasets
44. 44
Supervised Algorithms
▫ Given a set of variables $x_i$, predict the value of another variable $y$ in
a given dataset such that $y = F(x_1, x_2, x_3, \ldots)$
▫ If y is numeric => Prediction
▫ If y is categorical => Classification
▫ Example: Given that a customer’s Debt-to-Income ratio increased 20%, what are
the chances he/she would default in 3 months?
Machine Learning
x1, x2, x3, … → Model F(X) → y
45. 45
Unsupervised Algorithms
▫ Given a dataset with variables $x_i$, build a model that captures the
similarities in different observations and assigns them to different
buckets => Clustering
▫ Example: Given a list of emerging market stocks, can we segment them
into three buckets?
Machine Learning
Obs1, Obs2, Obs3, etc. → Model → Obs1: Class 1, Obs2: Class 2, Obs3: Class 1
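To make the "three buckets" example concrete, here is a minimal one-dimensional k-means sketch. The volatility figures are invented, and k-means is only one possible clustering method; the slides do not say which algorithm is used:

```python
def kmeans_1d(values, k, iterations=20):
    """Very small k-means for one-dimensional data (illustrative only)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each observation goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

# Hypothetical annualized volatilities (%) for nine emerging-market stocks.
vols = [12.0, 13.5, 11.8, 24.9, 26.1, 25.3, 41.0, 39.5, 43.2]
labels, centers = kmeans_1d(vols, k=3)
print(labels)
print(centers)
```

No target variable is supplied: the buckets come only from how close the observations are to each other.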
46. 46
• Parametric models
▫ Assume some functional form
▫ Fit coefficients
• Examples : Linear Regression, Neural Networks
Supervised Learning models - Prediction
$Y = \beta_0 + \beta_1 X_1$
Linear Regression Model Neural network Model
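"Assume a functional form, then fit coefficients" can be sketched for the one-variable linear model using ordinary least squares. The data are made up, and `fit_simple_linear_regression` is a hypothetical helper:

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for Y = beta0 + beta1 * X (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx                  # slope
    beta0 = mean_y - beta1 * mean_x    # intercept
    return beta0, beta1

# Hypothetical data generated exactly from salary = 40 + 5 * years.
years_experience = [1, 2, 3, 4, 5]
salary_k = [45, 50, 55, 60, 65]

beta0, beta1 = fit_simple_linear_regression(years_experience, salary_k)
print(beta0, beta1)
```

The functional form is fixed in advance, so the whole model is summarized by two numbers.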
47. 47
• Non-Parametric models
▫ No functional form assumed
• Examples : K-nearest neighbors, Decision Trees
Supervised Learning models
K-nearest neighbor Model Decision tree Model
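By contrast, a K-nearest-neighbors classifier fits no coefficients and simply keeps the training data. A minimal sketch, assuming two made-up features already scaled to the 0-1 range:

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical loan applicants: (scaled credit score, scaled debt-to-income).
points = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
          (0.20, 0.80), (0.30, 0.90), (0.25, 0.70)]
labels = ["approved", "approved", "approved", "denied", "denied", "denied"]

print(knn_classify(points, labels, (0.75, 0.25)))
print(knn_classify(points, labels, (0.30, 0.75)))
```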
53. 53
• What transformations do I need for the x and y variables?
• Which are the best features to use?
▫ Dimension Reduction – PCA
▫ Best subset selection
– Forward selection
– Backward elimination
– Stepwise regression
Feature Engineering
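One common answer to the transformation question is to standardize each variable so that features on different scales become comparable. This z-score sketch is an illustrative assumption and not necessarily the transformation used in the case studies:

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical credit scores.
scores = [600, 650, 700, 750, 800]
z_scores = standardize(scores)
print(z_scores)
```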
57. 57
• Fit measures in classical regression modeling:
• Adjusted $R^2$ has been adjusted for the number of predictors. It increases
only when the improvement in the model is more than one would expect to see by
chance ($p$ is the total number of explanatory variables)
$$\text{Adjusted } R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$$
• MAE or MAD (mean absolute error/deviation) gives the magnitude of the
average absolute error
$$MAE = \frac{\sum_{i=1}^{n} |e_i|}{n}$$
Prediction Accuracy Measures
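Adjusted R² and MAE can be computed directly from actual and predicted values. This sketch assumes the error is defined as e_i = y_i − ŷ_i and uses invented numbers:

```python
def adjusted_r2(y_true, y_pred, p):
    """Adjusted R^2 for a model with p explanatory variables."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values from a one-predictor model.
y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 11.0, 15.0, 15.0, 18.0]

print(adjusted_r2(y_true, y_pred, p=1))
print(mae(y_true, y_pred))
```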
58. 58
▫ MAPE (mean absolute percentage error) gives a percentage score of
how predictions deviate on average
$$MAPE = \frac{\sum_{i=1}^{n} |e_i / y_i|}{n} \times 100\%$$
• RMSE (root-mean-squared error) is computed on the training and
validation data
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$$
Prediction Accuracy Measures
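A short sketch of MAPE and RMSE, again assuming e_i = y_i − ŷ_i and made-up values. MAPE is undefined when an actual value is zero, so the example avoids that case:

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero actuals)."""
    n = len(y_true)
    return 100.0 * sum(abs((yt - yp) / yt) for yt, yp in zip(y_true, y_pred)) / n

def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    n = len(y_true)
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)

# Hypothetical actual and predicted values.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(mape(y_true, y_pred))
print(rmse(y_true, y_pred))
```

RMSE penalizes the single 20-unit miss more heavily than MAPE does, which is one reason the two can rank models differently.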
59. 59
1. Data
2. Goals
3. Machine learning algorithms
4. Process
5. Performance Evaluation
Recap
60. Machine Learning Workflow
63. 63
Claim:
• Machine learning is better for fraud
detection, looking for arbitrage
opportunities and trade execution
Caution:
• Beware of imbalanced class problems
• A model that gives 99% accuracy may still
not be good enough
1. Machine learning is not a generic solution to all problems
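The 99% accuracy caution is easy to reproduce. In this sketch, with invented counts, a "model" that never flags fraud scores 99% accuracy while catching no fraud at all:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model caught."""
    actual_positives = sum(t == positive for t in y_true)
    caught = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return caught / actual_positives

# Hypothetical data: 10 fraudulent transactions (1) out of 1,000.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000   # always predicts "not fraud"

print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred))
```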
64. 64
Claim:
• Our models work on
datasets we have tested on
Caution:
• Do we have enough data?
• How do we handle bias in
datasets?
• Beware of overfitting
• Historical Analysis is not
Prediction
2. A prototype model is not your production model
65. 65
AI and Machine Learning in Production
https://www.itnews.com.au/news/hsbc-societe-generale-run-into-ais-production-problems-477966
Kirsty Roth from HSBC:
“It’s been somewhat easy - in a funny way - to
get going using sample data, [but] then you hit
the real problems,” Roth said.
“I think our early track record on PoCs or pilots
hides a little bit the underlying issues.”
Matt Davey from Societe Generale:
“We’ve done quite a bit of work with RPA
recently and I have to say we’ve been a bit
disillusioned with that experience,”
“the PoC is the easy bit: it’s how you get that
into production and shift the balance”
66. 66
Claim:
• It works. We don’t know how!
Caution:
• It’s still not a proven science
• Interpretability or “auditability” of
models is important
• Transparency in the codebase is paramount
with the proliferation of open-source
tools
• Skilled data scientists who are
knowledgeable about algorithms and
their appropriate usage are key to
successful adoption
3. We are just getting started!
67. 67
Claim:
• Machine Learning models are
more accurate than
traditional models
Caution:
• Is accuracy the right metric?
• How do we evaluate the
model? RMSE or R²?
• How does the model behave
in different regimes?
4. Choose the right metrics for evaluation
68. 68
Claim:
• Machine Learning and AI will replace
humans in most applications
Caution:
• Beware of the hype!
• Just because it worked sometimes
doesn’t mean that the organization can
be on autopilot
• Will we have true AI or Augmented
Intelligence?
• Model risk and robust risk
management is paramount to the
success of the organization.
• We are just getting started!
5. The Robots are coming!
https://www.bloomberg.com/news/articles/2017-10-20/automation-starts-to-sweep-wall-street-with-tons-of-glitches
71. 71
1. Case Intro
2. Data Exploration of the Credit risk data set
3. Problem Definition and Machine learning
4. Performance Evaluation
5. Deployment
Case study
72. 72
Credit decisions
Credit-scoring models and techniques assess the risk in
lending to customers.
Typical decisions:
• Grant credit/not to new applicants
• Increasing/Decreasing spending limits
• Increasing/Decreasing lending rates
• What new products can be given to existing applicants?
73. 73
How does LendingClub work?
https://www.lendingclub.com/public/how-peer-lending-works.action
74. 74
• How much should I expect as interest?
• Is my borrower creditworthy?
• How much interest would a similar borrower pay?
• What is the repayment and default rate for a similar borrower?
Investor’s big decisions
76. 76
Credit Risk pipeline
Stage 1: Data Ingestion from Lending Club
Stage 2: Pre-Processing
Stage 3: Feature Engineering
Stage 4: Model Development and Tuning
Stage 5: Model Deployment
79. 79
Not all scenarios have
played out
• Stress scenarios
• What-if scenarios
Challenges with real datasets
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
80. 80
Missing values
• Missing at random
• Missing sequences
• Need data to fill frames
Challenges with real datasets
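One simple way to fill a missing sequence in a time series is to carry the last observed value forward. This is a sketch on made-up prices; forward fill is only one imputation choice, and it can hide real gaps in the data:

```python
def forward_fill(series):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

# Hypothetical daily prices with missing sequences.
prices = [101.2, None, None, 103.5, None, 104.1]
print(forward_fill(prices))
```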
81. 81
• Access
▫ Hard to find
▫ Rare class problems
▫ Privacy concerns
making it difficult to
share
Challenges with real datasets
82. 82
Imbalanced
• Need more samples of rare
class
• Need proxies for data points
that were not observed or
recorded
Challenges with real datasets
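A minimal sketch of one way to get "more samples of the rare class": random oversampling with replacement. The loan identifiers and labels are hypothetical, and `oversample_minority` is an illustrative helper rather than a library function:

```python
import random

def oversample_minority(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    if not minority:
        raise ValueError("no rows with the minority label")
    # Draw extra minority rows with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [l for _, l in combined]

# Hypothetical loans: 2 defaults out of 10.
rows = [f"loan_{i}" for i in range(10)]
labels = ["default"] * 2 + ["repaid"] * 8

balanced_rows, balanced_labels = oversample_minority(rows, labels, "default")
print(balanced_labels.count("default"), balanced_labels.count("repaid"))
```

Oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.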
90. 90
1. Case Intro
2. Data Exploration of WIG20 stock data
3. Problem Definition and Machine learning
4. Deployment
Case study
91. 91
Clustering stocks
• Which stocks are like each other?
• Are growth stocks behaving like growth stocks or value
stocks?
• Does the time series of prices & returns reveal which
stocks are close to each other?
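One way to ask whether return series reveal which stocks are close is to use a correlation-based distance. The tickers and prices below are invented placeholders, not WIG20 data, and 1 − correlation is just one possible distance:

```python
import math

def simple_returns(prices):
    """Period-over-period simple returns."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def correlation(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical price histories.
prices = {
    "STOCK_A": [100, 102, 101, 105, 107],
    "STOCK_B": [50, 51, 50.5, 52.5, 53.5],   # moves with STOCK_A
    "STOCK_C": [80, 79, 80, 77, 76],         # tends to move against STOCK_A
}
returns = {name: simple_returns(series) for name, series in prices.items()}

# Distance near 0 means the stocks behave alike; near 2 means they move oppositely.
dist_ab = 1 - correlation(returns["STOCK_A"], returns["STOCK_B"])
dist_ac = 1 - correlation(returns["STOCK_A"], returns["STOCK_C"])
print(dist_ab, dist_ac)
```

A matrix of such pairwise distances can then be fed to a clustering algorithm.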
106. 106
• If computers can understand language, it opens up huge possibilities
▫ Read and summarize
▫ Translate
▫ Describe what’s happening
▫ Understand commands
▫ Answer questions
▫ Respond in plain language
Language allows understanding
107. 107
• Describe rules of grammar
• Describe meanings of words and their
relationships
• …including all the special cases
• ...and idioms
• ...and special cases for the idioms
• ...
• ...understand language!
Traditional language AI
https://en.wikipedia.org/wiki/Formal_language
108. 108
What is NLP?
Jumping NLP Curves
https://ieeexplore.ieee.org/document/6786458/
110. 110
• Ambiguity:
▫ “ground”
▫ “jaguar”
▫ “The car hit the pole while it was moving”
▫ “One morning I shot an elephant in my pajamas. How he got into my
pajamas, I’ll never know.”
▫ “The tank is full of soldiers.”
“The tank is full of nitrogen.”
Language is hard to deal with
112. 112
• Many ways to say the same thing
▫ “the same thing can be said in many ways”
▫ “language is versatile”
▫ “The same words can be arranged in many different ways to express
the same idea”
▫ …
Language is hard to deal with
113. 113
• APIs
• Human Insight
• Expert Knowledge
• Build your own
Options?
117. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.