Data Science in the Cloud
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
November 2016

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.

Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.

Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.

Data Science in the Cloud @StitchFix

  1. 1. Data Science in the Cloud Stefan Krawczyk @stefkrawczyk linkedin.com/in/skrawczyk November 2016
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/stitchfix-cloud
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Who are Data Scientists?
  5. 5. Means: skills vary wildly
  6. 6. But they’re in demand and expensive
  7. 7. “The Sexiest Job of the 21st Century” - HBR https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  8. 8. How many Data Scientists do you have?
  9. 9. At Stitch Fix we have ~80
  10. 10. ~85% have not done formal CS
  11. 11. But what do they do?
  12. 12. What is Stitch Fix?
  13. 13. Two Data Scientist facts: 1. Has AWS console access*. 2. End to end, they’re responsible.
  14. 14. How do we enable this without ?
  15. 15. Make doing the right thing the easy thing.
  16. 16. Fellow Collaborators Horizontal team focused on Data Scientist Enablement
  17. 17. 1. Eng. Skills 2. Important 3. What they work on
  18. 18. Let’s Start
  19. 19. Will Only Cover 1. Source of truth: S3 & Hive Metastore 2. Docker Enabled DS @ Stitch Fix 3. Scaling DS doing ML in the Cloud
  20. 20. Source of truth: S3 & Hive Metastore
  21. 21. Want Everyone to Have Same View A B
  22. 22. This is Usually Nothing to Worry About ● OS handles correct access ● DB has ACID properties A B
  23. 23. This is Usually Nothing to Worry About ● OS handles correct access ● DB has ACID properties ● But it’s easy to outgrow these options with a big data/team. A B
  24. 24. ● Amazon’s Simple Storage Service ● Infinite* storage ● Can write, read, delete, BUT NOT append. ● Looks like a file system*: ○ URIs: my.bucket/path/to/files/file.txt ● Scales well S3 * For all intents and purposes
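For illustration, a minimal boto3 sketch of this object model (not from the talk; bucket and key names are made up):

import boto3

s3 = boto3.client("s3")

# write: an object is always written whole -- there is no append
s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=b"hello")

# read: fetch the whole object back
body = s3.get_object(Bucket="my.bucket", Key="path/to/files/file.txt")["Body"].read()

# "appending" really means rewriting the whole object (or writing a new key)
s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=body + b" world")

# delete
s3.delete_object(Bucket="my.bucket", Key="path/to/files/file.txt")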
  25. 25. ● Hadoop service, that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition Hive Metastore
  26. 26. ● Hadoop service, that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition Hive Metastore. Example for table sold_items: partition 20161001 → s3://bucket/sold_items/20161001, ..., partition 20161031 → s3://bucket/sold_items/20161031
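A hedged PySpark sketch (assuming a Spark session wired to the metastore, as in the deck's Jupyter/Pyspark image) of how that partition-to-location mapping can be inspected; table and partition names follow the slide:

spark.sql("SHOW PARTITIONS sold_items").show()
# the metastore, not S3, is what maps each partition to its data location
spark.sql("DESCRIBE FORMATTED sold_items PARTITION (date='20161001')").show()
# the 'Location' row points at s3://bucket/sold_items/20161001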
  27. 27. Hive Metastore
  28. 28. ● Replacing data in a partition But if we’re not careful
  29. 29. ● Replacing data in a partition But if we’re not careful
  30. 30. But if we’re not careful B A
  31. 31. But if we’re not careful ● S3 is eventually consistent ● These bugs are hard to track down A B
  32. 32. ● Use Hive Metastore to control partition source of truth ● Principles: ○ Never delete ○ Always write to a new place each time a partition changes ● Stitch Fix solution: ○ Use an inner directory → called Batch ID Hive Metastore to the Rescue
  33. 33. Batch ID Pattern
  34. 34. Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ sold_items
  35. 35. ● Overwriting a partition is just a matter of updating the location Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ s3://bucket/sold_items/ 20161031/20161102234252 sold_items
  36. 36. ● Overwriting a partition is just a matter of updating the location ● To the user this is a hidden inner directory Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ s3://bucket/sold_items/ 20161031/20161102234252 sold_items
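A hedged sketch of those mechanics (illustrative PySpark, not the actual Stitch Fix library; assumes a Spark session `spark` and a DataFrame `df`): write the new version of the partition under a fresh batch-ID prefix, then repoint the Hive partition at it, deleting nothing.

from datetime import datetime

date = "20161031"
batch_id = datetime.utcnow().strftime("%Y%m%d%H%M%S")   # e.g. 20161102234252
location = "s3://bucket/sold_items/{}/{}/".format(date, batch_id)

# 1. always write to a new place
df.write.parquet(location)

# 2. flip the source of truth; readers going through the metastore now see the new batch
spark.sql("ALTER TABLE sold_items PARTITION (date='{}') SET LOCATION '{}'".format(date, location))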
  37. 37. Enforce via API
  38. 38. Enforce via API
  39. 39. Python: store_dataframe(df, dest_db, dest_table, partitions=['2016']) df = load_dataframe(src_db, src_table, partitions=['2016']) R: sf_writer(data = result, namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE))) sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE))) API for Data Scientists
  40. 40. ● Full partition history ○ Can rollback ■ Data Scientists are less afraid of mistakes ○ Can create audit trails more easily ■ What data changed and when ○ Can anchor downstream consumers to a particular batch ID Batch ID Pattern Benefits
  41. 41. Docker Enabled DS @ Stitch Fix
  42. 42. Workstation Env. Mgmt. Scalability Low Low Medium Medium High High Ad hoc Infra: In the Beginning...
  43. 43. Workstation Env. Mgmt. Scalability Low Low Medium Medium High High Ad hoc Infra: Evolution I
  44. 44. Workstation Env. Mgmt. Scalability Low Low Medium Medium High High Ad hoc Infra: Evolution II
  45. 45. Workstation Env. Mgmt. Scalability Low Low Medium Medium Low High Ad hoc Infra: Evolution III
  46. 46. ● Control of environment ○ Data Scientists don’t need to worry about env. ● Isolation ○ can host many docker containers on a single machine. ● Better host management ○ allowing central control of machine types. Why Does Docker Lower Overhead?
  47. 47. Flotilla UI
  48. 48. ● Has: ○ Our internal API libraries ○ Jupyter Notebook: ■ Pyspark ■ IPython ○ Python libs: ■ scikit, numpy, scipy, pandas, etc. ○ RStudio ○ R libs: ■ Dplyr, magrittr, ggplot2, lme4, BOOT, etc. ● Mounts User NFS ● User has terminal access to file system via Jupyter for git, pip, etc. Our Docker Image
  49. 49. Docker Deployment
  50. 50. Docker Deployment
  51. 51. Docker Deployment
  52. 52. ● Docker tightly integrates with the Linux Kernel. ○ Hypothesis: ■ Anything that makes uninterruptible calls to the kernel can: ● Break the ECS agent because the container doesn’t respond. ● Break isolation between containers. ■ E.g. Mounting NFS ● Docker Hub: ○ Switched to Artifactory Our Docker Problems So Far
  53. 53. Scaling DS doing ML in the Cloud
  54. 54. 1. Data Latency 2. To Batch or Not To Batch 3. What’s in a Model?
  55. 55. Data Latency How much time do you spend waiting for data?
  56. 56. *This could be a laptop, a shared system, a batch process, etc.
  57. 57. Use Compression *This could be a laptop, a shared system, a batch process, etc.
  58. 58. Use Compression - The Components: a dense float vector (e.g. [ 1.3234543 0.23443434 … ]), a long mostly-zero binary vector ([ 1 0 0 1 0 0 … 0 1 0 0 … ]), and a sparse {index: value} dict (e.g. { 100: 0.56, … , 110: 0.65, … , 999: 0.43 })
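For intuition, a small sketch (not from the talk) of that sparse representation: keep only the non-zero entries of a mostly-zero vector as an {index: value} dict.

dense = [0.0] * 1000
dense[100], dense[110], dense[999] = 0.56, 0.65, 0.43

# keep only the non-zero entries
sparse = {i: v for i, v in enumerate(dense) if v != 0.0}
# {100: 0.56, 110: 0.65, 999: 0.43} -- far fewer values to serialize and compress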
  59. 59. Use Compression - Python Comparison (serialized sizes for the representations from the previous slide): Pickle: 60MB / Zlib+Pickle: 129KB / JSON: 15MB / Zlib+JSON: 55KB; Pickle: 3.1KB / Zlib+Pickle: 921B / JSON: 2.8KB / Zlib+JSON: 681B; Pickle: 2.6MB / Zlib+Pickle: 600KB / JSON: 769KB / Zlib+JSON: 139KB
  60. 60. ● Naïve scheme of JSON + Zlib works well: Observations import json import zlib ... # compress compressed = zlib.compress(json.dumps(value)) # decompress original = json.loads(zlib.decompress(compressed))
  61. 61. ● Naïve scheme of JSON + Zlib works well: ● Double vs Float: do you really need to store that much precision? Observations import json import zlib ... # compress compressed = zlib.compress(json.dumps(value)) # decompress original = json.loads(zlib.decompress(compressed))
  62. 62. ● Naïve scheme of JSON + Zlib works well: ● Double vs Float: do you really need to store that much precision? ● For more inspiration look to columnar DBs and how they compress columns Observations import json import zlib ... # compress compressed = zlib.compress(json.dumps(value)) # decompress original = json.loads(zlib.decompress(compressed))
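The snippet on the slides is Python 2; a runnable Python 3 version of the same round trip looks like this (zlib works on bytes, so the JSON string needs an encode/decode):

import json
import zlib

value = {"100": 0.56, "110": 0.65, "999": 0.43}   # any JSON-serializable object

# compress
compressed = zlib.compress(json.dumps(value).encode("utf-8"))

# decompress
original = json.loads(zlib.decompress(compressed).decode("utf-8"))
assert original == value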
  63. 63. To Batch or Not To Batch: When is batch inefficient?
  64. 64. ● Online: ○ Computation occurs synchronously when needed. ● Streamed: ○ Computation is triggered by an event(s). Online & Streamed Computation
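As a concrete (hypothetical) contrast: online computation might be a small web service that scores each request synchronously, whereas streamed computation runs when an event arrives (see the Lambda/Kinesis sketch after slide 76). A minimal Flask-style sketch of the online case, with a stand-in for a real model:

from flask import Flask, jsonify, request

app = Flask(__name__)

def score_user(features):
    # stand-in for a real model; the point is that it runs synchronously, per request
    return 0.1 * sum(features)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify({"score": score_user(features)})

if __name__ == "__main__":
    app.run()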
  65. 65. Online & Streamed Computation Very likely you start with a batch system
  66. 66. Online & Streamed Computation ● Do you need to recompute: ○ features for all users? ○ predicted results for all users? Very likely you start with a batch system
  67. 67. Online & Streamed Computation ● Do you need to recompute: ○ features for all users? ○ predicted results for all users? ● Are you heavily dependent on your ETL running every night? Very likely you start with a batch system
  68. 68. Online & Streamed Computation ● Do you need to recompute: ○ features for all users? ○ predicted results for all users? ● Are you heavily dependent on your ETL running every night? ● Online vs Streamed depends on in house factors: ○ Number of models ○ How often they change ○ Cadence of output required ○ In house eng. expertise ○ etc. Very likely you start with a batch system
  69. 69. Online & Streamed Computation ● Do you need to recompute: ○ features for all users? ○ predicted results for all users? ● Are you heavily dependent on your ETL running every night? ● Online vs Streamed depends on in house factors: ○ Number of models ○ How often they change ○ Cadence of output required ○ In house eng. expertise ○ etc. Very likely you start with a batch system We use an online system for recommendations
  70. 70. Streamed Example
  71. 71. Streamed Example
  72. 72. Streamed Example
  73. 73. Streamed Example
  74. 74. ● Dedicated infrastructure → More room on batch infrastructure ○ Hopefully $$$ savings ○ Hopefully less stressed Data Scientists Online/Streaming Thoughts
  75. 75. ● Dedicated infrastructure → More room on batch infrastructure ○ Hopefully $$$ savings ○ Hopefully less stressed Data Scientists ● Requires better software engineering practices ○ Code portability/reuse ○ Designing APIs/Tools Data Scientists will use Online/Streaming Thoughts
  76. 76. ● Dedicated infrastructure → More room on batch infrastructure ○ Hopefully $$$ savings ○ Hopefully less stressed Data Scientists ● Requires better software engineering practices ○ Code portability/reuse ○ Designing APIs/Tools Data Scientists will use ● Prototyping on AWS Lambda & Kinesis was surprisingly quick ○ Need to compile C libs on an amazon linux instance Online/Streaming Thoughts
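A hedged sketch of what such a prototype's entry point could look like; the handler shape follows AWS Lambda's Kinesis event format, while the scoring logic is a made-up stand-in:

import base64
import json

def handler(event, context):
    # Lambda invokes this with a batch of Kinesis records: computation is
    # triggered by events rather than by a nightly batch job
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        score = 0.1 * sum(payload.get("features", []))   # stand-in for a real model
        print({"user_id": payload.get("user_id"), "score": score})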
  77. 77. What’s in a Model? Scaling model knowledge
  78. 78. Ever: ● Had someone leave and then nobody understands how they trained their models?
  79. 79. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself?
  80. 80. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself? ● Had performance dip in models and you have trouble figuring out why?
  81. 81. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself? ● Had performance dip in models and you have trouble figuring out why? ○ Or not known what’s changed between model deployments?
  82. 82. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself? ● Had performance dip in models and you have trouble figuring out why? ○ Or not known what’s changed between model deployments? ● Wanted to compare model performance over time?
  83. 83. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself? ● Had performance dip in models and you have trouble figuring out why? ○ Or not known what’s changed between model deployments? ● Wanted to compare model performance over time? ● Wanted to train a model in R/Python/Spark and then deploy it to a webserver?
  84. 84. Produce Model Artifacts
  85. 85. ● Isn’t that just saving the coefficients/model values? Produce Model Artifacts
  86. 86. ● Isn’t that just saving the coefficients/model values? ○ NO! Produce Model Artifacts
  87. 87. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? Produce Model Artifacts
  88. 88. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? Produce Model Artifacts
  89. 89. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? How do you deal with organizational drift? Produce Model Artifacts
  90. 90. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? How do you deal with organizational drift? Produce Model Artifacts Makes it easy to keep an archive and track changes over time
  91. 91. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? How do you deal with organizational drift? Produce Model Artifacts Helps a lot with model debugging & diagnosis! Makes it easy to keep an archive and track changes over time
  92. 92. ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? How do you deal with organizational drift? Produce Model Artifacts Helps a lot with model debugging & diagnosis! Makes it easy to keep an archive and track changes over time Can more easily use in downstream processes
  93. 93. ● Analogous to software libraries ● Packaging: ○ Zip/Jar file Produce Model Artifacts
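A hedged sketch of that packaging idea (illustrative only, not the Stitch Fix code): the zipped artifact carries the fitted model plus the metadata needed to understand it later, such as who trained it, on which data, and with which library versions.

import json
import pickle
import zipfile

def package_model_artifact(model, metadata, path="model_artifact.zip"):
    # bundle the fitted model together with how/when/on what it was trained,
    # so the artifact stays understandable after its author has moved on
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("model.pkl", pickle.dumps(model))
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
    return path

# example usage with made-up values
package_model_artifact(
    model={"coefficients": [0.2, -1.3, 0.7]},   # stand-in for a real fitted model
    metadata={
        "trained_by": "data_scientist_a",
        "trained_at": "2016-11-01T00:23:34Z",
        "training_data": "s3://bucket/sold_items/20161031/20161101002256/",
        "library_versions": {"scikit-learn": "0.18"},
    },
)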
  94. 94. But all the above seems complex?
  95. 95. We’re building APIs.
  96. 96. Fin; Questions? @stefkrawczyk
  97. 97. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/stitchfix-cloud
