2. Life-Threatening Spacesuit Failure
On July 16, 2013, water filled the helmet of Italian astronaut
Luca Parmitano, forcing NASA to abort his spacewalk for
safety reasons.
3. 100,000 pages/month
• About 100,000 page scans/month in reports and docs
• This volume bogs down the EVA Data Integration (EDI) process
when submitted to the existing OCR for text extraction
• Only the first page of each document could be scanned, to avoid
clogging the sequential processing pipeline :-(
4. Approach: Parallelism, Low Cost
• Exploit parallelism
– Split scanned PDF docs into pages
– Use AWS Lambda to autoscale and OCR pages in parallel
• Leverage AWS events and triggers to “wire” the processing pipeline
• Connect services securely using AWS roles and policies
• Avoid paying for compute when not processing
• Avoid using a database, to minimize cost
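The split / parallel OCR / combine pattern above can be sketched locally. This is only an analogy: a thread pool stands in for autoscaling Lambdas, and `split`, `ocr_page`, `combine`, and `process` are hypothetical stand-ins, not the pipeline's real functions (no actual PDF handling or OCR happens here).

```python
from concurrent.futures import ThreadPoolExecutor


def split(doc):
    """Stand-in for PDF splitting: number each 'page' starting at 1."""
    return list(enumerate(doc['pages'], start=1))


def ocr_page(numbered_page):
    """Stand-in for OCR of one page; a real worker would call an OCR engine."""
    number, image = numbered_page
    return number, image.upper()


def combine(results):
    """Join page texts in page order, whatever order the workers finished in."""
    return '\n'.join(text for _, text in sorted(results))


def process(doc, max_workers=8):
    """Fan pages out to parallel workers, then fan the results back in."""
    pages = split(doc)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(ocr_page, pages))
    return combine(results)
```

Because each page is independent, total time is bounded by the slowest page rather than the sum of all pages, which is the property the Lambda design relies on.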
5. Why use AWS Lambda?
• Drastically reduced cost: pay only for what you use
– An app handling 1,000 2-second requests/day would cost:
• Server: $16.84/month (t2.small, 24x7)
• Lambda: $1.50/month
• Event-driven: events from S3, DynamoDB, etc. invoke functions
• Instant automatic scaling with no effort
– Conventional EC2 auto-scaling takes ~5 minutes
• Security: nothing to attack
– no logins, no open ports, no patching
• Encourages microservice architecture patterns
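The cost comparison above can be checked with a little arithmetic. The slide does not state its inputs, so every price and size below is an assumption, chosen because it reproduces the slide's figures: a 1.5 GB function, $0.0000166667 per GB-second, $0.20 per million requests, $0.023/hour for a t2.small, a 30-day month for Lambda and 732 hours/month for the server. The free tier is ignored.

```python
# Workload from the slide
REQUESTS_PER_DAY = 1000
SECONDS_PER_REQUEST = 2
DAYS_PER_MONTH = 30  # assumption

# Assumed prices and sizing (not stated on the slide)
LAMBDA_MEMORY_GB = 1.5
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
T2_SMALL_PRICE_PER_HOUR = 0.023
HOURS_PER_MONTH = 732


def lambda_monthly_cost():
    """Compute charge (GB-seconds) plus the per-request charge."""
    requests = REQUESTS_PER_DAY * DAYS_PER_MONTH
    gb_seconds = requests * SECONDS_PER_REQUEST * LAMBDA_MEMORY_GB
    compute = gb_seconds * LAMBDA_PRICE_PER_GB_SECOND
    request_fee = requests / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + request_fee


def server_monthly_cost():
    """An always-on instance is billed for every hour, busy or idle."""
    return T2_SMALL_PRICE_PER_HOUR * HOURS_PER_MONTH
```

Under these assumptions the Lambda bill comes to about $1.51 (of which $1.50 is compute) against about $16.84 for the server. The server is busy for only 2,000 of the 86,400 seconds in a day, which is where the saving comes from.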
6. Architecture and EDI Integration
[Diagram: the EDI App places PDF docs in AWS S3 storage. Autoscaling AWS Lambdas Split each PDF doc into PDF pages, OCR the pages in parallel into TXT pages, and Combine them into a TXT doc (Output), which feeds the EDI Search API.]
8. Handle the S3 Upload Event (remember it’s processed)
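import logging
from urllib.parse import unquote_plus  # Python 3; on Python 2: from urllib import unquote_plus

log = logging.getLogger(__name__)

def s3created(event, context):
    """Handle S3 ObjectCreated events, skipping objects already processed."""
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        # S3 URL-encodes object keys in event notifications
        key = unquote_plus(record['s3']['object']['key'])
        etag = record['s3']['object']['eTag']
        # done_key, done_check, done_mark are helpers defined elsewhere
        donekey = done_key(key, etag)
        dt = done_check(bucket_name, donekey)
        if dt:
            log.warning('key={} already done at '
                        'datetime={}'.format(key, dt))
            continue  # skip this record, but still handle any others
        _process_the_file_for_this_event_type(
            bucket_name, key, context)
        done_mark(bucket_name, donekey)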
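The slides do not show `done_key`, `done_check`, or `done_mark`. One way they might work without a database is to keep a small marker per processed object. The sketch below is a guess at that idea: it uses an in-memory dict (`_FAKE_S3`, hypothetical) in place of real S3 marker objects, and the `done/` key layout is also an assumption.

```python
from datetime import datetime, timezone

# In-memory stand-in for S3: {(bucket, key): body}. A real deployment
# would presumably read and write marker objects in the bucket instead.
_FAKE_S3 = {}


def done_key(key, etag):
    """Build the marker key for one version of an object (assumed layout)."""
    return 'done/{}/{}'.format(key, etag)


def done_check(bucket_name, donekey):
    """Return the stored timestamp, or None if not yet processed."""
    return _FAKE_S3.get((bucket_name, donekey))


def done_mark(bucket_name, donekey):
    """Record completion by writing a timestamped marker."""
    stamp = datetime.now(timezone.utc).isoformat()
    _FAKE_S3[(bucket_name, donekey)] = stamp
```

Including the ETag in the marker key means a re-upload with different content gets a new marker and is processed again, while a duplicate delivery of the same event is skipped.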
9. “Will it Lambda?”
● Requires/allows new architectures
● Python 2 and 3 are supported on Lambda
● Lambda auto-scaling is fast and low-stress
● Serverless Framework makes it easy-ish
○ define AWS resources: S3, Lambdas, ...
○ define events: ObjectCreated
○ map events to invoke lambdas
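Once an ObjectCreated event is mapped to a function, the handler receives a notification dict. The sketch below builds a minimal one, with only the fields a handler like the one on slide 8 reads, which is handy for exercising a handler locally. The helper names, bucket name, and key are hypothetical.

```python
from urllib.parse import unquote_plus


def make_s3_created_event(bucket, encoded_key, etag):
    """Build a minimal ObjectCreated notification (a subset of the real fields)."""
    return {'Records': [{
        'eventSource': 'aws:s3',
        'eventName': 'ObjectCreated:Put',
        's3': {'bucket': {'name': bucket},
               'object': {'key': encoded_key, 'eTag': etag}},
    }]}


def objects_in(event):
    """Yield (bucket, decoded key, etag) for each record in the event."""
    for record in event['Records']:
        s3 = record['s3']
        yield (s3['bucket']['name'],
               unquote_plus(s3['object']['key']),  # keys arrive URL-encoded
               s3['object']['eTag'])


# Hypothetical bucket and key; note the '+' standing for a space.
sample = make_s3_created_event('edi-scans', 'pdf-docs/EVA+report.pdf', 'abc123')
```

Passing `sample` (with `None` as the context) to a handler is a quick way to test it before deploying.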