Enabling Your Data Science Team with Modern Data Engineering
1. Enabling Your Data Science Team with Modern Data Engineering
James Densmore
Data Liftoff
dataliftoff.com
@jamesdensmore
2. About Me
Founder & Consultant at Data Liftoff
Experience leading Data Science and Data Engineering Teams
Technical Background (Software Engineering and Data Engineering)
@jamesdensmore
www.dataliftoff.com
3. What is “Modern” Data Engineering?
● Thanks to highly scalable, columnar databases (usually cloud based),
we’re now able to store, structure and query extremely high volumes
of data at a low cost. Really!
● A mix of data lakes and data warehouses
● ELT instead of ETL
● Closer to software engineering than in the past
● No longer a “back office” function. Often aligned with product
development. Sometimes a stand-alone Tech team
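
To make "ELT instead of ETL" concrete, here is a minimal sketch, not taken from the talk. It uses Python's built-in sqlite3 module as a stand-in for a cloud warehouse, and the table names (raw_events, item_clicks) and sample rows are hypothetical. Raw events are loaded untouched first, and the transform runs afterwards as SQL inside the database.

```python
import sqlite3

# Hypothetical raw events: (user_id, event, item). Illustrative data only.
RAW_EVENTS = [
    (1, "click", "a"),
    (1, "view", "b"),
    (2, "click", "a"),
    (2, "click", "b"),
]


def load_raw(conn, events):
    """Extract + Load: store the events as-is, with no cleanup or aggregation."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (user_id INTEGER, event TEXT, item TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", events)


def transform(conn):
    """Transform: build a modeled table from the raw rows, inside the database."""
    conn.execute("DROP TABLE IF EXISTS item_clicks")
    conn.execute(
        """
        CREATE TABLE item_clicks AS
        SELECT item, COUNT(*) AS clicks
        FROM raw_events
        WHERE event = 'click'
        GROUP BY item
        """
    )


def run_elt():
    """Run load then transform against an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_EVENTS)
    transform(conn)
    return conn
```

Because the raw table is kept, a data scientist can still query every original event, while dashboards read the modeled item_clicks table.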
4. Difference Between Data Science and Data Engineering - Oversimplified!
Data Engineers build and maintain data infrastructure, including data
warehouses.
Data Scientists use data to make predictions, run analyses and build models
to power products.
8. What Data Scientists Should Know about Data Engineers
● They’re software engineers at heart
● They don’t always know how data is generated. Some questions are
better left to the production engineers
● They’re interested in your model, but probably not the math 😆
● They’re thinking about scale and efficiency - sometimes too much so
● You are one of many customers to them
9. What Data Engineers Should Know about Data Scientists
● They write code, but they’re usually not software engineers
● They will look into data in more detail than anyone else, including you
● Their work is difficult to put into tickets and sprints
● Scale and performance are not their top priorities
● They understand the “why” of what they’re building - just ask
10. What Data Science Needs from a Data Infrastructure
● Access to both transformed and unprocessed data
● Definitions of columns/attributes and how data is generated
● A safe space to experiment and tune models
○ Plenty of storage
○ No impact on production or other users
○ Read permissions on existing datasets, write/create space for
themselves
● A path to production
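
The "read existing datasets, write only to your own space" idea can be sketched as follows. This is an illustration under stated assumptions, not the setup from the talk: SQLite files stand in for a warehouse, and the file names, the events table and the helper functions are all hypothetical. In a real warehouse the same effect would normally come from database grants and per-user schemas.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_shared_dataset(path):
    """Stand-in for a dataset owned by data engineering (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE events (user_id INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "a"), (2, "b")]
    )
    conn.commit()
    conn.close()


def open_sandbox(sandbox_path, shared_path):
    """Open a writable sandbox database with the shared dataset attached read-only."""
    conn = sqlite3.connect(Path(sandbox_path).as_uri(), uri=True)
    shared_uri = Path(shared_path).as_uri() + "?mode=ro"
    conn.execute("ATTACH DATABASE ? AS shared", (shared_uri,))
    return conn


def demo():
    """Return the sandbox results and whether a write to shared data was blocked."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp) / "shared.db"
        sandbox = Path(tmp) / "sandbox.db"
        create_shared_dataset(shared)
        conn = open_sandbox(sandbox, shared)
        # Creating tables in the scientist's own space is allowed.
        conn.execute(
            "CREATE TABLE main.item_counts AS "
            "SELECT item, COUNT(*) AS n FROM shared.events GROUP BY item"
        )
        counts = conn.execute(
            "SELECT item, n FROM main.item_counts ORDER BY item"
        ).fetchall()
        # Modifying the shared dataset is rejected by the read-only attachment.
        try:
            conn.execute("DELETE FROM shared.events")
            write_blocked = False
        except sqlite3.OperationalError:
            write_blocked = True
        conn.close()
        return counts, write_blocked
```

Calling demo() builds a derived table in the sandbox and confirms that the attempt to delete from the shared data fails, which is the "no impact on production or other users" property in miniature.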
11. How This Differs From Other Consumers of Data
● Data warehouses traditionally serve fully transformed and aggregated
data to BI tools, dashboards and data analysts. Data Scientists need
raw data - a lot of it
● The data warehouse was once the “end of the road” for data. Data
Scientists need it in other forms and locations.
● Data products built by the data science team may end up in production.
What’s the path to get there?
12. Asking More from Data Engineering
● New pipelines to support data science
● Documenting more detail of the raw data and fielding highly specific
questions about it
● Strain on databases from ad hoc queries
● Managing data security and privacy outside of the warehouse
● Model deployment to production
13. Infrastructure Considerations
Image Credit: Amazon Web Services
● Data Lakes + Databases
● Secure storage for flat files
● VMs for building and testing models in
development
○ Discourage local development
with sensitive data
● Share best practices for accessing data
from scripts - credential management
● Data governance now extends to
development machines, VMs, and flat
file storage
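
One common practice for accessing data from scripts is to keep credentials out of the code and read them from the environment. The sketch below is one possible shape for that, not a prescription from the talk. The variable names (WAREHOUSE_HOST, WAREHOUSE_USER, WAREHOUSE_PASSWORD) and the WarehouseCredentials class are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class WarehouseCredentials:
    host: str
    user: str
    password: str

    def __repr__(self):
        # Mask the password so it never shows up in logs or notebook output.
        return (
            f"WarehouseCredentials(host={self.host!r}, "
            f"user={self.user!r}, password='***')"
        )


def load_credentials(env=None):
    """Read credentials from environment variables instead of hard-coding them."""
    env = os.environ if env is None else env
    required = ("WAREHOUSE_HOST", "WAREHOUSE_USER", "WAREHOUSE_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early, naming the missing variables but never printing any values.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return WarehouseCredentials(
        host=env["WAREHOUSE_HOST"],
        user=env["WAREHOUSE_USER"],
        password=env["WAREHOUSE_PASSWORD"],
    )
```

A script would call load_credentials() once at startup. The same function can be pointed at a secrets manager later without touching the analysis code.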
14. An Example - Building a Recommender System
● Data to build the model
○ Previous recommendations and clicks, search logs, content metadata, user profiles,
user activity history
○ What they want might not exist!
● Infrastructure to build the model
○ Storage for exports of data
○ VMs to build and run models - they need secure access to input data and a
place to write results for analysis
● Moving model to production
○ Data engineering + application engineers
● Instrumenting further tracking and data collection in production
○ Build new pipelines and select storage
● Deploy, analyze, iterate and deploy again!
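
To show why click history is on the data list above, here is a deliberately tiny recommender sketch. It is not the model from the talk: it uses simple item co-occurrence counts as a stand-in technique, and the CLICKS data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical click history: user -> items clicked. Illustrative data only.
CLICKS = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["a", "c"],
    "u4": ["b", "d"],
}


def build_cooccurrence(clicks_by_user):
    """Count how often each ordered pair of items was clicked by the same user."""
    co = defaultdict(Counter)
    for items in clicks_by_user.values():
        for left, right in permutations(set(items), 2):
            co[left][right] += 1
    return co


def recommend(co, seen, k=2):
    """Score unseen items by co-occurrence with the seen items; return the top k."""
    scores = Counter()
    for item in seen:
        for other, count in co.get(item, {}).items():
            if other not in seen:
                scores[other] += count
    # Sort by score descending, then by item name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]
```

Usage would look like co = build_cooccurrence(CLICKS) followed by recommend(co, {"a"}). Even this toy version only works if per-user click data is collected, stored and accessible, which is the infrastructure point of the slide.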
15. Partners, Not Siloed Services
● The closer together, the better!
● Over-communicate
○ Overlapping Slack channels
○ Sit in on planning meetings
● Share knowledge
○ Monthly demos or lunch-and-learns
○ Share detailed release notes
● Recognize differences in sizing, planning and
executing projects
Image Credit: Vector Open Stock - http://www.vectoropenstock.com/
16. Overcome Org Structure
● A single leader overseeing both teams, even if not directly, is ideal
○ Not always possible! Team up leaders and keep them close
● Align around projects, not org charts
● Find team members most curious about the “other side” and give them
opportunities to dip their toes in
● Share, and speak to, successes as a unified team. Perception is reality
17. Other Common Pitfalls
● Hiring data scientists without having data engineers
● Assuming that because you collect “data”, data scientists have what they
need
● Structuring data science work the way you structure software and data
engineering work
● Underestimating the failure rate of data science projects compared
with data engineering projects
18. Final Tips & Ideas
● New tools won’t save you, but don’t ignore them
● Be flexible in your hiring. Generalists bridge gaps
● Invest in lightweight documentation, and commit to keeping it current
○ Accurate over Glossy
● Cross-team interviewing and onboarding
● Question your team structure often
● When in doubt, talk!