Rsqrd AI: ML Tooling at an AI-first Startup

Emad Elwany - CTO, Lexion
Evolution of ML Infrastructure at an AI-First Startup
Rsqrd AI Meetup - May 2020

Agenda
● Lexion Overview
● Document Understanding Pipeline
● Evolution of ML Infrastructure at Lexion
● Deep Dive - Model Versioning

Lexion: Applying NLP to legal agreements
Creating this simple report could take weeks without automation.

It’s a complex NLP problem
● Messy PDFs make OCR non-trivial
● Long, multi-agreement documents
● Domain specific language
● Complex schemas/ontologies
● Mix of non/semi/fully structured data

Sample: Identify Contract Term
Contract term is AUTO RENEW if, e.g.:
“will automatically renew for three year terms”
“shall continue on a month to month basis until terminated”
Contract term is FIXED if, e.g.:
“terminate effective April 1, 2007.”
“will continue until the 1 year anniversary”
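
To make the two labels concrete, here is a naive keyword baseline over the sample phrases above. It only illustrates the task and is not Lexion's method. The name `classify_term` and the patterns are assumptions, and real agreements need far more than regular expressions.

```python
import re

# Hypothetical patterns, derived only from the sample phrases on this slide.
AUTO_RENEW_PATTERNS = [
    r"automatically renew",
    r"month to month basis",
]
FIXED_PATTERNS = [
    r"terminate effective",
    r"continue until the .* anniversary",
]


def classify_term(clause: str) -> str:
    """Return AUTO RENEW, FIXED, or UNKNOWN for a single clause."""
    text = clause.lower()
    if any(re.search(p, text) for p in AUTO_RENEW_PATTERNS):
        return "AUTO RENEW"
    if any(re.search(p, text) for p in FIXED_PATTERNS):
        return "FIXED"
    return "UNKNOWN"
```

A baseline like this breaks as soon as the wording changes, which is part of why the slide above calls this a complex NLP problem.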

Document Understanding Pipeline
[Diagram: document understanding pipeline. Labels: Input, OCR, Text, Layout, Entities, Classes, Relations, BL, Structured Data, Output]
Many many models!
Key Takeaway: Every node in this graph is a “model” (of hundreds), and the remainder of this talk applies to
each and every one of them.

Initial Goals (Pre-MVP)
● Evaluate technical feasibility: Can we build it?
● Evaluate business viability: Will they find it useful?
● Move very quickly: Can we ship it before we run out of money?
Use tools that are easy to
● Understand
● Setup
● Deploy

Steady state Goals (Post-MVP)
● Scale model development
● Scale model deployment
● Keep users happy at all times
Use tools that are easy to
● Integrate
● Configure
● Scale

Typical model lifecycle
Experience with ML in
research, applications,
and platforms:

Data
EARLY
● Finding the data
Scrapers/FOIA
● Cleaning the data
Scripting + Rules
● Annotating the data
Simple annotation tools
LATER
● Managing the data
Data Stores and Caches
● Protecting the data
Encryption and Access control
● Scaling annotation
Weakly/Unsupervised

Training
EARLY
Optimize for Speed of Results
Jupyter, Scripts
Goal: does it work?
LATER
Optimize for speed of Experimentation
Frameworks and metrics
Goal: make it the best!

Packaging
EARLY
Optimize for shipping the models
REST endpoint (online)
Batch script (offline)
LATER
Optimize for operationalizing the
model
Versioning of artefacts
Dependency management
Cost management
More on this a bit later...

Validate Model
EARLY
● Does it work well enough?
Simple high level metrics (F1, P, R etc.)
LATER
● Is it better?
● Why is it better?
● How is it better?
Much more rigor:
● Validation sets
● E2E tests
● More detailed metrics
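
The simple high-level metrics named above (P, R, F1) can be sketched for a single positive label as follows. The function name and the label values used with it are assumptions for illustration.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one label, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators instead of raising ZeroDivisionError.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Answering "is it better?" then means comparing these numbers for two model versions on the same fixed validation set, not on whatever data happens to be at hand.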

Deployment
EARLY
Optimize for Speed of deployment
LATER
Optimize for Scale of deployment
● Inference time
● Priority vs. starvation
● Rapid update deployment
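
One common way to balance priority against starvation is aging: a waiting job slowly gains urgency. The sketch below is an assumption about how this could look, not a description of Lexion's scheduler. `Job`, `AgingQueue` and `aging_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    priority: int     # lower number = more urgent
    enqueued_at: int  # logical clock tick when the job arrived


class AgingQueue:
    """Picks the most urgent job, where waiting time lowers the effective priority."""

    def __init__(self, aging_rate: float = 1.0) -> None:
        self.aging_rate = aging_rate
        self.jobs: List[Job] = []

    def push(self, job: Job) -> None:
        self.jobs.append(job)

    def pop(self, now: int) -> Job:
        if not self.jobs:
            raise IndexError("pop from an empty queue")
        # Effective priority drops by aging_rate for every tick spent waiting.
        best = min(
            self.jobs,
            key=lambda j: j.priority - self.aging_rate * (now - j.enqueued_at),
        )
        self.jobs.remove(best)
        return best
```

With `aging_rate=0` this degrades to a plain priority queue, where a steady stream of urgent requests can starve bulk inference indefinitely.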

Monitor
EARLY
Bare minimum to ensure things are
working:
● High level E2E alert
LATER
Invest in monitoring all aspects of the
models:
● Detailed KPIs
● Model Drift
● User DSAT
Logging, Dashboards, Alerts
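
As one possible drift signal, the distribution of predicted labels in production can be compared against a reference window. The sketch below uses total variation distance. The function names and the 0.2 threshold are assumptions, not values from the talk.

```python
from collections import Counter
from typing import Dict, Sequence


def label_distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each predicted label."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference: Sequence[str], current: Sequence[str]) -> float:
    """0.0 means identical label distributions, 1.0 means disjoint ones."""
    p = label_distribution(reference)
    q = label_distribution(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drift_alert(reference: Sequence[str], current: Sequence[str], threshold: float = 0.2) -> bool:
    """Hypothetical alert rule: fire when the distance exceeds the threshold."""
    return total_variation_distance(reference, current) > threshold
```

A shift in predicted labels can also come from a real change in the incoming documents, so a signal like this is a prompt to investigate rather than proof that the model regressed.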

Real life problems
● “We used to predict the right X on this document - when/why did it break?”
○ Usually accompanied by an alert or even worse: a user complaint.
● “The model we trained 2 months ago was so much better at Y - we can’t seem
to get the same performance. How do we roll back?”
○ Usually accompanied by a frustrated product manager / quality engineering.
● “I swear I got better results over the weekend for the same experiment, I don’t
know what changed!”
○ Usually accompanied by a confused data scientist.
But first: can you reproduce your model results to the 10th decimal place? If not, STOP!
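
A minimal sketch of that check, using a toy stand-in for training. `train_toy_model` and `is_reproducible` are hypothetical names. In a real training stack the same idea has to cover every source of randomness, including framework RNGs, data shuffling and non-deterministic GPU kernels.

```python
import random


def train_toy_model(seed: int, steps: int = 1000) -> float:
    """Stand-in for a training run: a seeded random walk over a single 'weight'."""
    rng = random.Random(seed)  # local RNG, so no hidden global state leaks in
    weight = 0.0
    for _ in range(steps):
        weight += rng.uniform(-0.01, 0.01)
    return weight


def is_reproducible(seed: int, places: int = 10) -> bool:
    """Run the same 'training' twice and compare results to `places` decimals."""
    first = train_toy_model(seed)
    second = train_toy_model(seed)
    return round(first, places) == round(second, places)
```

Running a check like this in CI on a small training job is one way to notice when a dependency upgrade or a code change silently breaks determinism.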

Wait… didn’t we solve this problem a long time ago?
Source control has been used for decades. How is this different?
Versioning ML models shares a lot with code versioning, e.g.:
But it also includes a lot more:
Code (*) Config
Library dependencies Topology
Training Data Training Parameters
Model State (weights, hyperparameters) Hardware
(*) Code covers many things in the context of ML models: data prep, libraries, models, featurizers, etc.

What exactly is Versioning for ML models?
L1: Production/Staging slots.
Allows very short-term rollback/rollforward.
L2: Reproducing Inference.
Once you have a trained model, this kind of versioning allows you to deterministically
reconstruct a model for inference. Allows pinning models for a long time as well as long-term
rollback/rollforward.
L3: Reproducing Training.
At any point in time, you can re-train and obtain the exact same model you had
previously trained. This is a much stronger kind of versioning: it enables reproducibility as
well as dealing with issues such as training data corruption.
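
L1 is the simplest level to picture. The sketch below shows production/staging slots with a one-step rollback. `ModelSlots` and its methods are hypothetical names, and the version strings stand in for whatever identifies a deployed model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSlots:
    production: str
    staging: Optional[str] = None
    previous: Optional[str] = None

    def stage(self, version: str) -> None:
        """Put a candidate version in the staging slot."""
        self.staging = version

    def promote(self) -> None:
        """Move staging into production and remember what it replaced."""
        if self.staging is None:
            raise ValueError("nothing staged")
        self.previous = self.production
        self.production = self.staging
        self.staging = None

    def rollback(self) -> None:
        """Swap production with the previous version; calling it again rolls forward."""
        if self.previous is None:
            raise ValueError("no previous version to roll back to")
        self.production, self.previous = self.previous, self.production
```

Because only one previous version is kept, this covers short-term rollback only. Going further back requires being able to reconstruct an older model, which is what L2 and L3 provide.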

Artefacts that need to be versioned
Columns on the slide: Artefact, Simple example, Inference, Training (the per-row Inference/Training marks are not recoverable)

Artefact                  Simple example
Model Hyper Parameters    Size of Layer N
Featurizer Code           Input feature vector size
Featurizer Data           Vocab
Model Code                NN Architecture
Model Config              Remove Stop Words?
Model State               Model Weights
Library Dependencies      PyTorch Version
Hardware                  V100
Training Config           Early Stopping Criteria
Training Data             Data + Labels
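
One way to tie these artefacts together is a manifest whose hash identifies a model version. This is a sketch under assumptions, not the format used at Lexion. `ModelManifest` and all of its field names are hypothetical, chosen to mirror the rows above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelManifest:
    model_code_version: str
    featurizer_code_version: str
    featurizer_data_sha256: str   # e.g. hash of the vocab file
    model_config: dict            # e.g. {"remove_stop_words": True}
    hyper_parameters: dict        # e.g. {"layer_n_size": 256}
    model_state_sha256: str       # hash of the serialized weights
    library_versions: dict        # e.g. {"torch": "1.5.0"}
    hardware: str                 # e.g. "V100"
    training_config: dict         # e.g. {"early_stopping_patience": 3}
    training_data_sha256: str     # hash of data + labels

    def fingerprint(self) -> str:
        """Stable SHA-256 over a canonical JSON encoding of every field."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two manifests with the same fingerprint describe the same inputs, and any change to a listed artefact, such as a library upgrade, produces a new fingerprint.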

Remember this pipeline?
[Diagram: document understanding pipeline. Labels: Input, OCR, Text, Layout, Entities, Classes, Relations, BL, Structured Data, Output]
Many many models!
You need to version the aforementioned artefacts for every single node in this graph. That’s a lot of things to
version!

Some solutions (that don’t work)
● Let’s snapshot everything in a Docker image and store it forever
> How do you hotfix the model?
● Let’s mark a “stable” production model and not deploy any future “staging”
versions till they have been tested enough.
> How do you make “breaking” changes to the code?
● Let’s always support only “latest” version and never commit a new version
until we’re sure it’s good.
> How do you iterate quickly?

We evaluated some existing solutions
It’s always better to not reinvent the wheel

It’s a lot of work to move infrastructure
The question is when, not if. Early-stage startups need to ship and sell their
product; it is hard to justify infrastructure plumbing until the flywheel turns.
Instead of a full solution, these investments have paid off:
1. Versioning all model state during packaging
2. Versioning all data artefacts in our data store and making them immutable
3. Versioning all code explicitly by keeping stable interfaces and supporting
minor/major version upgrades to model/featurizer code.
4. Pinning major versions of stable dependencies
Remember: we are building a whole user facing application on top of this,
prioritizing when to invest here is critical.
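
Investments 2 and 3 can be pictured with a small sketch: a store that refuses to overwrite a published artefact version, and a major/minor compatibility check. `ImmutableArtefactStore` and `is_compatible` are hypothetical names, the store is in-memory only, and the "major.minor" scheme is an assumption about how such versions might be written.

```python
from typing import Dict, Tuple


class ImmutableArtefactStore:
    """In-memory stand-in for a data store where a published version never changes."""

    def __init__(self) -> None:
        self._blobs: Dict[Tuple[str, str], bytes] = {}

    def put(self, name: str, version: str, blob: bytes) -> None:
        key = (name, version)
        if key in self._blobs:
            raise ValueError(f"{name}@{version} already exists and is immutable")
        self._blobs[key] = blob

    def get(self, name: str, version: str) -> bytes:
        return self._blobs[(name, version)]


def is_compatible(required: str, available: str) -> bool:
    """True when the major version matches and the minor version is equal or newer."""
    req_major, req_minor = (int(part) for part in required.split("."))
    av_major, av_minor = (int(part) for part in available.split("."))
    return av_major == req_major and av_minor >= req_minor
```

Under this scheme a packaged model that was built against featurizer 2.1 keeps working with 2.3, while 3.0 signals a breaking change that needs a new model package.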

BTW, all this ML is in addition to…
● Permissions
● Email alerts
● SSO
● End-user annotations
● Custom reporting
● Full text search
● Task management
● Custom fields
● Doc schemas
● APIs
● Integrations
● Bulk export
● Dashboards
● Pretty charts
● Bulk ingestion
● Security
● Audit trail
… building a complete user facing application!

A note on ML technical debt
● Identify when the cost of carrying the debt > the cost of addressing it
● Incorporate cost of ML infrastructure in your business model
● Pick the right kind of technical debt, with a plan to get out
● Model versioning is one of the areas you might want to invest in early
● Getting a great model is just the first step of a long journey. You have to build
a product customers love!

Questions?
Learn more at https://lexion.ai (we’re hiring!)
