Big data at AWS Chicago User Group
Most of the slides from the Sept 23rd 2014 AWS User Group in Chicago.
Talks:
"AWS Storage Options" - Ben Blair, CTO at MarkITx @stochastic_code
"APIs and Big Data in AWS" - Kin Lane, API Evangelist @kinlane
[coming soon] "Democratizing Data Analysis with Amazon Redshift" - Bill Wanjohi @billwanjohi and Michelangelo D'Agostino @MichelangeloDA, Civis Analytics
Sponsored by Cohesive and Civis Analytics.
2. Have an idea for a meetup? Talk
to me:
Margaret Walker
CohesiveFT
Tweet: @MargieWalker
#AWSChicago
Sponsors & Hosts
3. 6:00 pm Introductions
6:05 pm Short Talks
"AWS Storage Options" Ben Blair, CTO at MarkITx
@stochastic_code
"APIs and Big Data in AWS" - Kin Lane, API Evangelist
@kinlane
"Democratizing Data Analysis with Amazon Redshift" - Bill
Wanjohi @billwanjohi and Michelangelo D'Agostino
@MichelangeloDA, Civis Analytics
6:45 pm Q & A
7:00 pm Networking, drinks and pizza
Agenda
7. TL;DW
• Use IAM roles for access control
• Use DynamoDB for online storage &
transactions
• Use Redshift for offline storage & analysis
• Use S3 to keep *everything*
8. It’s hard to keep a
secret
Use IAM EC2 roles instead
28. ● collaborated on monolithic Vertica analytics
database
● dozens of TB of data
● scaled from 4-20 server blades
● dozens of concurrent users across
departments (hundreds total)
● arbitrary SQL allowed/encouraged
Life before Redshift
29. Our early requirements
● SQL language
● low starting cost
● easy to integrate with OSS, other DBs
● performant on large data sets
● minimal database administration
30. Choosing Redshift
● timing: first full release in Feb 2013
● drastically cheaper to start than other
commercial offerings
● very similar to our previous choice, HP
Vertica
● many fewer administration tasks
31. Basics
● RDBMS
● MPP/Columnar
○ Supports window functions
○ Few enforceable constraints
○ No concept of an index
● Redshift <= ParAccel <= PostgreSQL 8
○ Postgres drivers work
○ ORM requires mocking
● Most data I/O via S3 service
32. Things analytics DBs are good at
● Big aggregates
● Parallel I/O
● Merge joins between tables
33. Things they’re not good at
● Updates
● Retrieval of individual records
● Enforcing data quality
34. How’s it worked out?
Pretty good!
● adequate performance
○ big step up from traditional RDBMS
○ comparable to other analytics DBs
● easy to stand up new clusters
● cheaper clusters now available
● most workflows can live entirely in-database
● S3 is a good broker for what can’t
35. Data Science Workflow
Our custom plumbing syncs tables from dozens
of source databases into Redshift at varying
refresh frequencies.
36. We’ve found that SQL just invites so many
more people to the analytics game.
Analysts and data scientists run exploratory
SQL and build up complex tables for statistical
modeling, using crazy joins, aggregates and
rollup features.
Redshift supports powerful window functions
Data Science Workflow
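As a rough illustration of the window functions mentioned above, here is a minimal sketch of a running total and a per-person row number. SQLite (3.25 or newer, via Python's standard library) stands in for Redshift purely so the example runs locally; the `donations` table, its columns, and the sample rows are hypothetical. The explicit `ROWS BETWEEN` frame is written out because Redshift expects a frame clause on ordered aggregate windows.

```python
import sqlite3

# Hypothetical sample data: (person, day, amount).
rows = [
    ("alice", "2014-09-01", 10.0),
    ("alice", "2014-09-05", 25.0),
    ("bob", "2014-09-02", 7.5),
    ("alice", "2014-09-09", 5.0),
    ("bob", "2014-09-08", 12.5),
]

# Running total and visit counter per person, ordered by day.
QUERY = """
SELECT person,
       day,
       amount,
       SUM(amount) OVER (PARTITION BY person ORDER BY day
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS running_total,
       ROW_NUMBER() OVER (PARTITION BY person ORDER BY day) AS visit_number
FROM donations
ORDER BY person, day
"""


def running_totals(rows):
    """Load rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE donations (person TEXT, day TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO donations VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(rows):
        print(row)
```

Against a real cluster the same query text could be sent through any Postgres driver, since the deck notes that Postgres drivers work with Redshift.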
38. Predictive Modeling
For simple linear models, scoring is done
directly in Redshift via SQL.
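One way in-database scoring of a linear model might look is to turn the fitted intercept and coefficients into a plain SQL expression. The sketch below is an assumption about the approach, not the speakers' actual code; the helper name `linear_score_sql`, the table `people`, and the columns `id`, `age`, and `income` are all hypothetical.

```python
import math
import re

# Only simple, unquoted identifiers are accepted, to keep the generated SQL safe.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def _number(value):
    """Render a finite model parameter as a SQL numeric literal."""
    value = float(value)
    if not math.isfinite(value):
        raise ValueError("model parameters must be finite")
    return repr(value)


def linear_score_sql(table, key_column, intercept, coefficients):
    """Build a SELECT that computes intercept + sum(weight * column) per row."""
    terms = [_number(intercept)]
    terms += [
        f"{_number(weight)} * {_ident(column)}"
        for column, weight in coefficients.items()
    ]
    expression = " + ".join(terms)
    return (
        f"SELECT {_ident(key_column)}, {expression} AS score "
        f"FROM {_ident(table)}"
    )


if __name__ == "__main__":
    print(linear_score_sql("people", "id", 0.5, {"age": 1.25, "income": -0.75}))
```

Because the score is just arithmetic over columns, the whole table can be scored in one pass without moving data out of the database.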
For more complicated models, data is pulled
from Redshift to S3 with an UNLOAD SQL
command, processed in EMR, and loaded back
into Redshift with a COPY command.
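The round trip through S3 can be sketched as two statement builders: Redshift exports query results to S3 with UNLOAD and loads files from S3 with COPY. The bucket, prefix, table name, and role ARN below are placeholders, and authorizing with `IAM_ROLE` is an assumption on my part; the original talk does not say how the commands were authorized.

```python
def _quote(text):
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def unload_sql(query, s3_prefix, iam_role):
    """Build an UNLOAD statement that writes query results under s3_prefix."""
    return (
        f"UNLOAD ({_quote(query)})\n"
        f"TO {_quote(s3_prefix)}\n"
        f"IAM_ROLE {_quote(iam_role)}\n"
        "DELIMITER '|' GZIP ESCAPE"
    )


def copy_sql(table, s3_prefix, iam_role):
    """Build a COPY statement that loads files under s3_prefix into table."""
    return (
        f"COPY {table}\n"
        f"FROM {_quote(s3_prefix)}\n"
        f"IAM_ROLE {_quote(iam_role)}\n"
        "DELIMITER '|' GZIP ESCAPE"
    )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # placeholder
    print(unload_sql(
        "SELECT * FROM model_input WHERE state = 'IL'",
        "s3://example-bucket/model_input/part_",  # placeholder bucket/prefix
        role,
    ))
    print(copy_sql("model_scores", "s3://example-bucket/model_scores/", role))
```

Using the same delimiter, compression, and escape options on both sides keeps the files written by one step readable by the other.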
39. Hurdles we’ve faced along
the way
● inconsistent runtimes
● catalog contention
● bugs (databases are hard)
● resizing
● too easy to end up with uncompressed data
● “missing” PostgreSQL functionality
● complex workload management
40. Setup Recommendations
● at least two nodes
● send 35-day snapshots to other regions
● at-rest encryption
● enforce SSL
● provision with boto or AWS CLI
● cluster isolation to hide objects
● buy 3-year reservations
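Several of the recommendations above map onto request parameters when provisioning with boto. The sketch below only builds the parameter dictionaries, so it runs without AWS access; the keys follow boto3's Redshift client (`create_cluster`, `enable_snapshot_copy`, `modify_cluster_parameter_group`) as I understand it, while the helper names, cluster identifier, node type, and region are hypothetical placeholders.

```python
def cluster_params(identifier, master_username, master_password,
                   node_type="dc2.large", number_of_nodes=2):
    """Arguments for create_cluster: multi-node, encrypted, 35-day snapshots."""
    if number_of_nodes < 2:
        raise ValueError("the recommendation is at least two nodes")
    return {
        "ClusterIdentifier": identifier,
        "ClusterType": "multi-node",
        "NodeType": node_type,                   # placeholder node type
        "NumberOfNodes": number_of_nodes,
        "MasterUsername": master_username,
        "MasterUserPassword": master_password,   # supply from a secret store
        "Encrypted": True,                       # at-rest encryption
        "AutomatedSnapshotRetentionPeriod": 35,  # days
    }


def snapshot_copy_params(identifier, destination_region):
    """Arguments for enable_snapshot_copy: keep copies for 35 days elsewhere."""
    return {
        "ClusterIdentifier": identifier,
        "DestinationRegion": destination_region,
        "RetentionPeriod": 35,
    }


def require_ssl_params(parameter_group_name):
    """Arguments for modify_cluster_parameter_group that enforce SSL."""
    return {
        "ParameterGroupName": parameter_group_name,
        "Parameters": [
            {"ParameterName": "require_ssl", "ParameterValue": "true"},
        ],
    }


# With boto3 installed and credentials available (for example from an IAM
# role), the dictionaries could be passed along roughly like this:
#
#   import boto3
#   redshift = boto3.client("redshift")
#   redshift.create_cluster(**cluster_params("example-cluster", "admin", pw))
#   redshift.enable_snapshot_copy(
#       **snapshot_copy_params("example-cluster", "us-west-2"))
```

Encrypted clusters may need a snapshot copy grant before cross-region copy works; check the current AWS documentation before relying on this sketch.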
41. We’re Hiring!
Through research, experimentation, and iteration, we’re
transforming how organizations do analytics. Our clients
range in scale and focus from local to international, all
empowered by our individual-level, data-driven approach.
civisanalytics.com/apply