Existing state-of-the-art supervised methods in Machine Learning require large amounts of annotated data to achieve good performance and generalization. However, manually constructing such a training dataset with sentiment labels is a labor-intensive and time-consuming task. With the proliferation of data acquisition in domains such as images, text and video, the rate at which we acquire data is greater than the rate at which we can label it. Techniques that reduce the amount of labelled data needed to achieve competitive accuracies are of paramount importance for deploying scalable, data-driven, real-world solutions.
At Envestnet | Yodlee, we have deployed several advanced state-of-the-art Machine Learning solutions that process millions of data points on a daily basis under very stringent service-level commitments. A key aspect of our Natural Language Processing solutions is semi-supervised learning (SSL): a family of methods that also make use of unlabelled data for training – typically a small amount of labelled data combined with a large amount of unlabelled data. Purely supervised solutions fail to exploit the rich syntactic structure of the unlabelled data to improve decision boundaries. There is an abundance of published work in the field – but few papers have succeeded in showing significantly better results than state-of-the-art supervised learning, and methods often rest on simplifying assumptions that fail to transfer to real-world scenarios. There is a lack of practical guidelines for deploying effective SSL solutions. We attempt to bridge that gap by sharing our learnings from successful SSL models deployed in production.
Productionizing Semi-Supervised Deep Learning Systems
Learnings from productionizing a semi-supervised deep learning system at the petabyte scale.
Presented by: Samiran Roy
LinkedIn: samiranroy
Slides: http://bit.ly/tech_triveni_samiran
Data analytics at scale
❑ 100 bn+ text datapoints
❑ 1 PB+ size
❑ 20 mn+ unique words
❑ 10+ classification challenges
❑ 3 mn+ classes*
❑ >90% precision/recall*
Production System - Aspects
❑ Acts on real-world data (complete pipeline)
❑ Impacts business decisions
❑ Online inference (prediction on demand)
❑ Shifting data distribution
This presentation
❑ Simple (but weird) ideas
❑ Common sense (in hindsight)
Data Swimming
The first step to training a neural net is to not touch any neural net code at all and instead begin by
thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in
units of hours) scanning through thousands of examples, understanding their distribution and looking for
patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate
examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will
typically also pay attention to my own process for classifying the data, which hints at the kinds of
architectures we’ll eventually explore. As an example - are very local features enough or do we need
global context? How much variation is there and what form does it take? What variation is spurious and
could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much
does detail matter and how far could we afford to downsample the images? How noisy are the labels?
~ Andrej Karpathy, (Senior Director of AI at Tesla)
(http://karpathy.github.io/2019/04/25/recipe/)
Data Swimming Insights
❑ What kind of machine learning model to use
❑ Duplicates in data / data reduction
❑ Unused variables
❑ Corrupted/noisy labels
❑ Imbalances
❑ Simple rules to classify the data
❑ Hints on cleaning the data
❑ Distribution of variables
❑ And much, much more!
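A minimal sketch of this kind of inspection, using only the standard library. The records below are hypothetical stand-ins for a sample of the corpus; the real pipeline would stream a sample of the actual data:

```python
from collections import Counter

# Toy (text, label) records standing in for a sample of the corpus.
data = [
    ("payment to acme corp", "shopping"),
    ("payment to acme corp", "shopping"),   # exact duplicate
    ("salary credit", "income"),
    ("uber trip 4512", "travel"),
    ("uber trip 9981", "travel"),
]

# Duplicates: the same text appearing more than once.
text_counts = Counter(text for text, _ in data)
duplicates = {t: c for t, c in text_counts.items() if c > 1}

# Class imbalance: how skewed are the labels?
label_counts = Counter(label for _, label in data)

print(duplicates)     # {'payment to acme corp': 2}
print(label_counts)   # Counter({'shopping': 2, 'travel': 2, 'income': 1})
```

A few hours of looking at outputs like these (plus reading raw examples) is usually what surfaces the duplicates, corrupted labels and imbalances listed above.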
What is a ‘good’ accuracy?
❑ Stock direction prediction (60%?)
❑ Question answering (80%?)
❑ Movie recommendation engine (90%?)
❑ Fraud detection (99.9%?)
❑ Computer vision tasks in driverless cars?
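The fraud-detection number illustrates why a raw accuracy figure is meaningless without a baseline. A quick sanity check, with illustrative counts (0.1% positive class assumed):

```python
# On a fraud dataset where only 0.1% of transactions are fraudulent,
# a "model" that always predicts "not fraud" already scores 99.9%
# accuracy while catching zero fraud.
n_total = 100_000
n_fraud = 100  # 0.1% positive class

majority_baseline_accuracy = (n_total - n_fraud) / n_total
recall_on_fraud = 0 / n_fraud  # the do-nothing baseline never flags fraud

print(majority_baseline_accuracy)  # 0.999
print(recall_on_fraud)             # 0.0
```

So a 'good' accuracy is always relative to the class balance and the cost of errors for that task, not an absolute number.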
Exploring the long tail
Ref: https://gradientdescent.co/t/tesla-autonomy-day-watch-the-full-event/216/18
Machine learning in sampling
1 bn -> 1 mn (for labelling)
1. Get a good representation
2. Visualize your data
3. Cluster your data
4. Sample from each cluster!
Representation is key
1 bn -> 1 mn (for labelling)
1) Get a good representation
Word embeddings!
1 bn -> 1 mn (for labelling)
1) Get a good representation
Vector Arithmetic!
1 bn -> 1 mn (for labelling)
1) Get a good representation
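The classic example of embedding arithmetic is king - man + woman ≈ queen. A minimal sketch with hand-crafted 3-d toy vectors (real embeddings are learned, 100-300 dimensional; these values are only rigged to mimic the analogy):

```python
import numpy as np

# Hand-crafted toy vectors (dimensions loosely: royalty, gender, person-ness).
emb = {
    "king":  np.array([0.9,  0.9, 1.0]),
    "queen": np.array([0.9, -0.9, 1.0]),
    "man":   np.array([0.1,  0.9, 1.0]),
    "woman": np.array([0.1, -0.9, 1.0]),
    "apple": np.array([0.0,  0.0, 0.2]),
}

def nearest(vec, exclude):
    """Vocabulary word whose embedding is closest (cosine) to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

result = nearest(emb["king"] - emb["man"] + emb["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```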
Word embeddings - Popular approaches
❑ Word2Vec
❑ fastText
❑ GloVe
❑ BERT/ELMo
Machine learning in sampling
1 bn -> 1 mn (for labelling)
1. Get a good representation
2. Visualize your data
3. Cluster your data
4. Sample from each cluster! (Resample if necessary)
Semi-supervised learning - The first approach
Unsupervised word embeddings (1 bn)
+
Deep learning models (1 mn)
train: Sachin Tendulkar played an amazing knock for India
test: Virender Sehwag played an amazing knock for India
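Why this works: the unsupervised embeddings, trained on the full 1 bn corpus, place the two cricketers' names close together, so a model that only ever saw the Tendulkar sentence treats the Sehwag sentence as nearly identical. A toy NumPy sketch (random vectors stand in for learned embeddings, with the two names deliberately placed as near-neighbours, which is exactly what co-occurrence training produces):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy embedding table; in the real pipeline these come from unsupervised
# training on the unlabelled corpus.
dim = 16
vocab = ["played", "an", "amazing", "knock", "for", "india"]
emb = {w: rng.normal(size=dim) for w in vocab}
emb["tendulkar"] = rng.normal(size=dim)
emb["sehwag"] = emb["tendulkar"] + rng.normal(scale=0.05, size=dim)

def sentence_vec(s):
    """Average of word vectors - the simplest sentence representation."""
    return np.mean([emb[w] for w in s.lower().split()], axis=0)

train = sentence_vec("tendulkar played an amazing knock for india")
test = sentence_vec("sehwag played an amazing knock for india")

cosine = float(train @ test / (np.linalg.norm(train) * np.linalg.norm(test)))
print(cosine)  # close to 1.0: the classifier sees "the same" sentence
```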
Avoiding Hidden Technical Debt
❑ Entanglement (CACE) - what if we change the distribution of feature x1 in [x1, x2, x3]?
❑ Correction cascades - what if model M2 uses the output of model M1?
❑ Hidden/unintended feedback loops
❑ Full list: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
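One practical defence against entanglement is to monitor each feature's live distribution against its training-time distribution. A minimal sketch using a two-sample Kolmogorov–Smirnov statistic in plain NumPy (the 0.2 threshold and the synthetic shift are illustrative, not tuned values):

```python
import numpy as np

def ks_statistic(ref, new):
    """Max gap between the two empirical CDFs - a simple drift score."""
    grid = np.sort(np.concatenate([ref, new]))
    cdf_ref = np.searchsorted(np.sort(ref), grid, side="right") / len(ref)
    cdf_new = np.searchsorted(np.sort(new), grid, side="right") / len(new)
    return float(np.max(np.abs(cdf_ref - cdf_new)))

rng = np.random.default_rng(7)
x1_train = rng.normal(0.0, 1.0, size=5_000)  # feature x1 at training time
x1_live = rng.normal(0.8, 1.0, size=5_000)   # shifted distribution in production

drift = ks_statistic(x1_train, x1_live)
print(drift > 0.2)  # True: flag x1 for investigation / retraining
```

Catching the shift in x1 before it silently degrades predictions is much cheaper than diagnosing the entangled failure downstream.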
Model Interpretation
Category: Sports
Alongside Tendulkar and Sehwag, several other retired international cricketers will be seen in action
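An occlusion-style sketch of this kind of interpretation: remove one word at a time and watch the classifier's score for the predicted class. The scorer below is a stand-in keyword-weight model with made-up weights; in production it would be the deployed network:

```python
# Hypothetical per-word weights standing in for a trained classifier.
weights = {"tendulkar": 0.6, "sehwag": 0.5, "cricketers": 0.7, "international": 0.1}

def sports_score(words):
    return sum(weights.get(w, 0.0) for w in words)

sentence = ("alongside tendulkar and sehwag several other retired "
            "international cricketers will be seen in action").split()
base = sports_score(sentence)

# Importance of each word = drop in score when that word is removed.
importance = {
    w: base - sports_score([x for i, x in enumerate(sentence) if i != pos])
    for pos, w in enumerate(sentence)
}
top = max(importance, key=importance.get)
print(top)  # 'cricketers' contributes most to the Sports prediction
```

The highlighted words in the slide are exactly the ones such a procedure surfaces: removing them moves the Sports score the most.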
Automatic Feedback Loops (Ensemble of models)
❑ Monitor the changing distribution of data
❑ New words/sentences/classes not present in training
❑ Generated embedding is far from any cluster / confidence score is low
❑ Random sampling
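The cluster-distance and confidence checks above can be combined into a single routing rule. A toy sketch with 2-D centroids and made-up thresholds (a real system would use the training-time cluster centroids and tune both cut-offs on held-out data):

```python
import numpy as np

# Cluster centroids from the training-time representation (toy values).
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
distance_threshold = 2.0  # hypothetical, tuned offline

def needs_human_review(embedding, confidence, conf_threshold=0.7):
    """Route a datapoint into the feedback loop if it looks
    out-of-distribution (far from every known cluster) or the
    model is unsure."""
    nearest = float(np.min(np.linalg.norm(centroids - embedding, axis=1)))
    return nearest > distance_threshold or confidence < conf_threshold

in_dist = needs_human_review(np.array([0.1, -0.2]), confidence=0.95)   # False
far_away = needs_human_review(np.array([9.0, -4.0]), confidence=0.95)  # True
unsure = needs_human_review(np.array([0.1, -0.2]), confidence=0.40)    # True
print(in_dist, far_away, unsure)
```

Points flagged this way, plus a small random sample, go back for manual labelling, closing the loop as the data distribution shifts.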