Enabling Your Data Science Team with Modern Data Engineering

James Densmore
Data Liftoff
dataliftoff.com
@jamesdensmore

About Me
Founder & Consultant at Data Liftoff
Experience leading Data Science and Data Engineering Teams
Technical Background (Software Engineering and Data Engineering)
@jamesdensmore
www.dataliftoff.com

What is “Modern” Data Engineering?
● Thanks to highly scalable, columnar databases (usually cloud-based),
we’re now able to store, structure, and query extremely high volumes
of data at a low cost. Really!
● A mix of data lakes and data warehouses
● ELT instead of ETL
● Closer to software engineering than in the past
● No longer a “back office” function. Often aligned with product
development. Sometimes a stand-alone Tech team
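
To make the "ELT instead of ETL" bullet concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for a cloud warehouse, and the table and column names (raw_events, daily_clicks) are hypothetical. The raw records are loaded untouched first, and the transformation runs afterward as SQL inside the database.

```python
import sqlite3

# In-memory SQLite stands in for a columnar cloud warehouse (an assumption
# made only so the sketch runs anywhere).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records as-is, with no cleanup on the way in.
raw_rows = [
    ("2024-01-01", "u1", "click"),
    ("2024-01-01", "u2", "view"),
    ("2024-01-02", "u1", "click"),
    ("2024-01-02", "u1", "click"),
]
conn.execute(
    "CREATE TABLE raw_events (event_date TEXT, user_id TEXT, event_type TEXT)"
)
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_rows)

# Transform: done later, inside the database, with SQL. The raw table stays
# in place for anyone who needs the unprocessed data.
conn.execute(
    """
    CREATE TABLE daily_clicks AS
    SELECT event_date, COUNT(*) AS clicks
    FROM raw_events
    WHERE event_type = 'click'
    GROUP BY event_date
    """
)

daily_clicks = conn.execute(
    "SELECT event_date, clicks FROM daily_clicks ORDER BY event_date"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(daily_clicks, raw_count)
```

Because the raw table is kept, both the aggregated view and the unprocessed events remain queryable after the transform.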

Difference Between Data Science and Data
Engineering - Oversimplified!
Data Engineers build and maintain data infrastructure, including data
warehouses.
Data Scientists use data to make predictions, run analyses and build models
to power products.

Common Data Engineering Tools and Platforms

Common Data Science Tools and Platforms

Don’t Assume The Two Teams Understand Each
Other

What Data Scientists Should Know about Data
Engineers
● They’re software engineers at heart
● They don’t always know how data is generated. Some questions are
better left to the production engineers
● They’re interested in your model, but probably not the math 😆
● They’re thinking about scale and efficiency - sometimes too much so
● You are one of many customers to them

What Data Engineers Should Know about Data
Scientists
● They write code, but they’re usually not software engineers
● They will look into data in more detail than anyone else, including you
● Their work is difficult to put into tickets and sprints
● Scale and performance are not their top priorities
● They understand the “why” of what they’re building - just ask

What Data Science Needs from a Data
Infrastructure
● Access to both transformed and unprocessed data
● Definitions of columns/attributes and how data is generated
● A safe space to experiment and tune models
○ Plenty of storage
○ No impact on production or other users
○ Read permissions on existing datasets, write/create space for
themselves
● A path to production

How This Differs From Other Consumers of Data
● Data warehouses traditionally serve fully transformed and aggregated
data to BI tools, dashboards and data analysts. Data Scientists need
raw data - a lot of it
● The data warehouse was once the “end of the road” for data. Data
Scientists need it in other forms and locations.
● Data products built by the data science team may end up in production.
What’s the path to get there?

Asking More from Data Engineering
● New pipelines to support data science
● Documenting more detail of the raw data and fielding highly specific
questions about it
● Strain on databases from ad hoc queries
● Managing data security and privacy outside of the warehouse
● Model deployment to production

Infrastructure Considerations
Image Credit: Amazon Web Services
● Data Lakes + Databases
● Secure storage for flat files
● VMs for building and testing models in
development
○ Discourage local development
with sensitive data
● Share best practices for accessing data
from scripts - credential management
● Data governance now extends to
development machines, VMs, and flat
file storage
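
One common practice for the credential management bullet is to read secrets from environment variables instead of hard-coding them in scripts. The Python sketch below assumes hypothetical variable names (WAREHOUSE_USER, WAREHOUSE_PASSWORD, WAREHOUSE_HOST) and a hypothetical WarehouseCredentials class. A secrets manager would be another reasonable choice.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class WarehouseCredentials:
    """Hypothetical container for connection details."""

    user: str
    password: str
    host: str

    def __repr__(self) -> str:
        # Keep the password out of logs and notebook output.
        return (
            f"WarehouseCredentials(user={self.user!r}, "
            f"host={self.host!r}, password='***')"
        )


def load_credentials(env=None) -> WarehouseCredentials:
    """Read credentials from environment variables, failing loudly if any are unset."""
    env = os.environ if env is None else env
    names = ("WAREHOUSE_USER", "WAREHOUSE_PASSWORD", "WAREHOUSE_HOST")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return WarehouseCredentials(
        user=env["WAREHOUSE_USER"],
        password=env["WAREHOUSE_PASSWORD"],
        host=env["WAREHOUSE_HOST"],
    )


# Demo with a plain dict standing in for os.environ (placeholder values only).
demo_env = {
    "WAREHOUSE_USER": "analyst",
    "WAREHOUSE_PASSWORD": "not-a-real-secret",
    "WAREHOUSE_HOST": "warehouse.example.com",
}
creds = load_credentials(demo_env)
print(creds)
```

In a script, calling load_credentials() with no argument reads from the real environment, so the same code runs on a laptop, a VM, or in production without the secret ever appearing in source control.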

An Example - Building a Recommender System
● Data to build the model
○ Previous recommendations and clicks, search logs, content metadata, user profiles,
user activity history
○ What they want might not exist!
● Infrastructure to build the model
○ Storage for exports of data
○ VMs to build and run models - needs to securely access input data, and output
results for analysis
● Moving model to production
○ Data engineering + application engineers
● Instrumenting further tracking and data collection in production
○ Build new pipelines and select storage
● Deploy, analyze, iterate and deploy again!
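
As a rough illustration of the "data to build the model" step, here is a minimal Python sketch of one simple approach, item co-occurrence counting over click logs. The talk does not specify a modeling technique, so this is a swapped-in example. The click data and the function names (build_cooccurrence, recommend) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical click log: (user_id, item_id) pairs exported from the warehouse.
clicks = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "b"), ("u3", "c"),
    ("u4", "a"), ("u4", "b"),
]


def build_cooccurrence(click_log):
    """Count how many users clicked each pair of items."""
    items_by_user = defaultdict(set)
    for user_id, item_id in click_log:
        items_by_user[user_id].add(item_id)

    co_counts = defaultdict(Counter)
    for items in items_by_user.values():
        for left, right in combinations(sorted(items), 2):
            co_counts[left][right] += 1
            co_counts[right][left] += 1
    return co_counts


def recommend(co_counts, item_id, k=2):
    """Return up to k items most often clicked by the same users as item_id."""
    neighbors = co_counts.get(item_id, {})
    # Sort by descending count, then by item id so ties are deterministic.
    ranked = sorted(neighbors.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:k]]


co_counts = build_cooccurrence(clicks)
print(recommend(co_counts, "a"))
```

Even this toy version depends on clean click logs with stable user and item identifiers, which is exactly the kind of input that "might not exist" yet.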

Partners, Not Siloed Services
● The closer together, the better!
● Over-communicate
○ Overlapping Slack channels
○ Sit in on planning meetings
● Share knowledge
○ Monthly demos or lunch-and-learns
○ Share detailed release notes
● Recognize differences in sizing, planning and
executing projects
Image Credit: Vector Open Stock - http://www.vectoropenstock.com/

Overcome Org Structure
● A single leader overseeing both teams, even if not directly, is ideal
○ Not always possible! Team up leaders and keep them close
● Align around projects, not org charts
● Find team members most curious about the “other side” and give them
opportunities to dip their toes in
● Share, and speak to, successes as a unified team. Perception is reality

Other Common Pitfalls
● Hiring data scientists without having data engineers
● Assuming that because you collect “data”, data scientists have what they
need
● Structuring data science work like you do software and data
engineering
● Underestimating the failure rate of data science projects in comparison
to data engineering

Final Tips & Ideas
● New tools won’t save you, but don’t ignore them
● Be flexible in your hiring. Generalists bridge gaps
● Invest in light-weight documentation, and commit to keeping it current
○ Accurate over Glossy
● Cross team interviewing and onboarding
● Question your team structure often
● When in doubt, talk!
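
One way to keep light-weight documentation current is to check it against the database automatically. The Python sketch below is an assumption-laden illustration: the data dictionary format, the documentation_gaps function, and the clicks table are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

# Hypothetical data dictionary: table name -> {column name: description}.
data_dictionary = {
    "clicks": {
        "user_id": "Internal user identifier",
        "item_id": "Identifier of the clicked content item",
        "session_id": "Browser session identifier",
    }
}


def documentation_gaps(conn, dictionary):
    """Report columns that are undocumented, or documented but no longer present."""
    gaps = {}
    for table, documented in dictionary.items():
        # Table names come from the trusted dictionary above, not user input.
        # PRAGMA table_info returns one row per column; index 1 is the name.
        actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        documented_cols = set(documented)
        undocumented = sorted(actual - documented_cols)
        stale = sorted(documented_cols - actual)
        if undocumented or stale:
            gaps[table] = {"undocumented": undocumented, "stale": stale}
    return gaps


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, item_id TEXT, clicked_at TEXT)")

gaps = documentation_gaps(conn, data_dictionary)
print(gaps)
```

A check like this could run on a schedule or in CI, which favors accurate documentation over glossy documentation.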

Thank You!
DataLiftoff.com
@jamesdensmore
