Enabling Your Data Science
Team with Modern Data
Engineering
James Densmore
Data Liftoff
dataliftoff.com
@jamesdensmore
About Me
Founder & Consultant at Data Liftoff
Experience leading Data Science and Data Engineering Teams
Technical Background (Software Engineering and Data Engineering)
@jamesdensmore
www.dataliftoff.com
What is “Modern” Data Engineering?
● Thanks to highly scalable columnar databases (usually cloud-based),
we’re now able to store, structure, and query extremely high volumes
of data at a low cost. Really!
● A mix of data lakes and data warehouses
● ELT instead of ETL
● Closer to software engineering than in the past
● No longer a “back office” function. Often aligned with product
development. Sometimes a stand-alone Tech team
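
To make "ELT instead of ETL" concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a cloud warehouse. The table names and sample rows are hypothetical. The point is the ordering: raw data is loaded untouched first, and the transformation runs afterwards as SQL inside the database, so the raw table stays available.

```python
import sqlite3

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    ("u1", "click", "2024-01-01"),
    ("u1", "view", "2024-01-01"),
    ("u2", "click", "2024-01-02"),
]


def run_elt(conn):
    # Extract + Load: land the data untransformed in a raw table.
    conn.execute(
        "CREATE TABLE raw_events (user_id TEXT, event_type TEXT, event_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", RAW_EVENTS)

    # Transform: build a modeled table with SQL, inside the database.
    conn.execute(
        """
        CREATE TABLE clicks_per_user AS
        SELECT user_id, COUNT(*) AS clicks
        FROM raw_events
        WHERE event_type = 'click'
        GROUP BY user_id
        """
    )
    return conn.execute(
        "SELECT user_id, clicks FROM clicks_per_user ORDER BY user_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
clicks_per_user = run_elt(conn)
raw_row_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(clicks_per_user, raw_row_count)
```

In an ETL flow the aggregation would happen before loading and only the modeled table would reach the warehouse. Here both the raw and the modeled tables can be queried.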
Difference Between Data Science and Data
Engineering - Oversimplified!
Data Engineers build and maintain data infrastructure, including data
warehouses.
Data Scientists use data to make predictions, run analysis and build models
to power products.
Common Data Engineering Tools and Platforms
Common Data Science Tools and Platforms
Don’t Assume The Two Teams Understand Each
Other
What Data Scientists Should Know about Data
Engineers
● They’re software engineers at heart
● They don’t always know how data is generated. Some questions are
better left to the production engineers
● They’re interested in your model, but probably not the math 😆
● They’re thinking about scale and efficiency - sometimes too much so
● You are one of many customers to them
What Data Engineers Should Know about Data
Scientists
● They write code, but they’re usually not software engineers
● They will look into data in more detail than anyone else, including you
● Their work is difficult to put into tickets and sprints
● Scale and performance are not their top priority
● They understand the “why” of what they’re building - just ask
What Data Science Needs from a Data
Infrastructure
● Access to both transformed and unprocessed data
● Definitions of columns/attributes and how data is generated
● A safe space to experiment and tune models
○ Plenty of storage
○ No impact on production or other users
○ Read permissions on existing datasets, write/create space for
themselves
● A path to production
How This Differs From Other Consumers of Data
● Data warehouses traditionally serve fully transformed and aggregated
data to BI tools, dashboards and data analysts. Data Scientists need
raw data - a lot of it
● The data warehouse was once the “end of the road” for data. Data
Scientists need it in other forms and locations.
● Data products built by the data science team may end up in production.
What’s the path to get there?
Asking More from Data Engineering
● New pipelines to support data science
● Documenting more detail of the raw data and fielding highly specific
questions about it
● Strain on databases from ad hoc queries
● Managing data security and privacy outside of the warehouse
● Model deployment to production
Infrastructure Considerations
Image Credit: Amazon Web Services
● Data Lakes + Databases
● Secure storage for flat files
● VMs for building and testing models in
development
○ Discourage local development
with sensitive data
● Share best practices for accessing data
from scripts - credential management
● Data governance now extends to
development machines, VMs, and flat
file storage
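
One common practice for credential management in scripts is to read secrets from the environment instead of hard-coding them. The sketch below assumes three hypothetical variable names (WAREHOUSE_HOST, WAREHOUSE_USER, WAREHOUSE_PASSWORD) and a hypothetical WarehouseCredentials class. A team might equally use a secrets manager.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; use whatever convention your team agrees on.
REQUIRED_VARS = ("WAREHOUSE_HOST", "WAREHOUSE_USER", "WAREHOUSE_PASSWORD")


@dataclass(frozen=True, repr=False)
class WarehouseCredentials:
    host: str
    user: str
    password: str

    def __repr__(self) -> str:
        # Keep the password out of logs and notebook output.
        return (
            f"WarehouseCredentials(host={self.host!r}, "
            f"user={self.user!r}, password='***')"
        )


def load_credentials(env: Mapping[str, str] = os.environ) -> WarehouseCredentials:
    """Read credentials from the environment and fail loudly if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return WarehouseCredentials(
        host=env["WAREHOUSE_HOST"],
        user=env["WAREHOUSE_USER"],
        password=env["WAREHOUSE_PASSWORD"],
    )


# Demo with an explicit mapping; a real script would call load_credentials().
demo_env = {
    "WAREHOUSE_HOST": "warehouse.example.internal",
    "WAREHOUSE_USER": "ds_sandbox",
    "WAREHOUSE_PASSWORD": "not-a-real-secret",
}
creds = load_credentials(demo_env)
print(creds)
```

Because nothing secret appears in the script itself, it can be shared or committed without leaking access, which matters once governance extends to development machines and VMs.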
An Example - Building a Recommender System
● Data to build the model
○ Previous recommendations and clicks, search logs, content metadata, user profiles,
user activity history
○ What they want might not exist!
● Infrastructure to build the model
○ Storage for exports of data
○ VMs to build and run models - these need to securely read input data and write
results for analysis
● Moving model to production
○ Data engineering + application engineers
● Instrumenting further tracking and data collection in production
○ Build new pipelines and select storage
● Deploy, analyze, iterate and deploy again!
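
To give a feel for why click history is on the data wish list, here is a deliberately simple item co-occurrence recommender. This is an illustrative stand-in, not the model from the talk. The click data, function names and the choice of co-occurrence counting are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical click history: user id -> items that user clicked.
CLICKS = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "d"],
}


def build_cooccurrence(clicks):
    """Count how often each pair of items was clicked by the same user."""
    co = defaultdict(Counter)
    for items in clicks.values():
        for left, right in permutations(set(items), 2):
            co[left][right] += 1
    return co


def recommend(co, item, k=2):
    """Return up to k items most often clicked together with `item`."""
    counts = co.get(item, Counter())
    # Sort by count (descending), then by item id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:k]]


co = build_cooccurrence(CLICKS)
print(recommend(co, "a"))
print(recommend(co, "unknown-item"))
```

Even this toy version depends on per-user click logs existing in a usable form, and an item with no history gets no recommendations, which is the "what they want might not exist" problem in miniature.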
Partners, Not Siloed Services
● The closer together, the better!
● Over-communicate
○ Overlapping Slack channels
○ Sit in on planning meetings
● Share knowledge
○ Monthly demos or lunch-and-learns
○ Share detailed release notes
● Recognize differences in sizing, planning and
executing projects
Image Credit: Vector Open Stock - http://www.vectoropenstock.com/
Overcome Org Structure
● A single leader overseeing both teams, even if not directly, is ideal
○ Not always possible! Team up leaders and keep them close
● Align around projects, not org charts
● Find team members most curious about the “other side” and give them
opportunities to dip their toes in
● Share, and speak to, successes as a unified team. Perception is reality
Other Common Pitfalls
● Hiring data scientists without having data engineers
● Assuming that because you collect “data”, data scientists have what they
need
● Structuring data science work like you do software and data
engineering
● Underestimating the failure rate of data science projects in comparison
to data engineering
Final Tips & Ideas
● New tools won’t save you, but don’t ignore them
● Be flexible in your hiring. Generalists bridge gaps
● Invest in light-weight documentation, and commit to keeping it current
○ Accurate over Glossy
● Cross team interviewing and onboarding
● Question your team structure often
● When in doubt, talk!
Thank You!
DataLiftoff.com
@jamesdensmore

Mais conteúdo relacionado

Mais procurados

Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhereDmitry Tolpeko
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.Łukasz Grala
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Valdas Maksimavičius
 
What is the data analytics stack?
What is the data analytics stack? What is the data analytics stack?
What is the data analytics stack? George Mount
 
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...Denodo Data Virtualization Platform Architecture: Performance (session 2 from...
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...Denodo
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesSwiss Big Data User Group
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosSpiros Antonatos
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014carrjc2
 
Demand For Data Scientist
Demand For Data ScientistDemand For Data Scientist
Demand For Data ScientistZaranTech LLC
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Dataconomy Media
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyInfiniteGraph
 
Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel geektimecoil
 
What is bi analytics and big data
What is bi analytics and big dataWhat is bi analytics and big data
What is bi analytics and big datagaliasisense
 
Manage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesManage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesJennifer Muilenburg
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
Data vault
Data vaultData vault
Data vaultJisc
 

Mais procurados (20)

Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping
 
What is the data analytics stack?
What is the data analytics stack? What is the data analytics stack?
What is the data analytics stack?
 
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...Denodo Data Virtualization Platform Architecture: Performance (session 2 from...
Denodo Data Virtualization Platform Architecture: Performance (session 2 from...
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Data science meetup - Spiros Antonatos
Data science meetup - Spiros AntonatosData science meetup - Spiros Antonatos
Data science meetup - Spiros Antonatos
 
"Selling" Open Source 101
"Selling" Open Source 101"Selling" Open Source 101
"Selling" Open Source 101
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014
 
Demand For Data Scientist
Demand For Data ScientistDemand For Data Scientist
Demand For Data Scientist
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
 
Unit 1
Unit 1Unit 1
Unit 1
 
Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel
 
What is bi analytics and big data
What is bi analytics and big dataWhat is bi analytics and big data
What is bi analytics and big data
 
Manage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesManage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW Libraries
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Data vault
Data vaultData vault
Data vault
 

Semelhante a Enabling Your Data Science Team with Modern Data Engineering

Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teamsVenkatesh Umaashankar
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Denodo
 
Data Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIData Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIDenodo
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceMark West
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
 
Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Mark Tabladillo
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...MongoDB
 
JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceMark West
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)Denodo
 
Build The Data Driven Organization With The Help Of Data Engineering.pptx
Build The Data Driven Organization With The Help Of Data Engineering.pptxBuild The Data Driven Organization With The Help Of Data Engineering.pptx
Build The Data Driven Organization With The Help Of Data Engineering.pptxWillHunting8
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Data Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-OctoberData Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-OctoberDataMites
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Key Skills Required for Data Engineering
Key Skills Required for Data EngineeringKey Skills Required for Data Engineering
Key Skills Required for Data EngineeringFibonalabs
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Collecting and Making Sense of Diverse Data at WayUp
Collecting and Making Sense of Diverse Data at WayUpCollecting and Making Sense of Diverse Data at WayUp
Collecting and Making Sense of Diverse Data at WayUpHarlan Harris
 
How to build data accessibility for everyone
How to build data accessibility for everyoneHow to build data accessibility for everyone
How to build data accessibility for everyoneKaren Hsieh
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Productioniguazio
 

Semelhante a Enabling Your Data Science Team with Modern Data Engineering (20)

Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teams
 
Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)Advanced Analytics and Machine Learning with Data Virtualization (India)
Advanced Analytics and Machine Learning with Data Virtualization (India)
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
 
Data Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIData Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AI
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...
MongoDB World 2019: Enabling Global Tire Design Leveraging MongoDB's Document...
 
JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data Science
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
Build The Data Driven Organization With The Help Of Data Engineering.pptx
Build The Data Driven Organization With The Help Of Data Engineering.pptxBuild The Data Driven Organization With The Help Of Data Engineering.pptx
Build The Data Driven Organization With The Help Of Data Engineering.pptx
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Data Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-OctoberData Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-October
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Key Skills Required for Data Engineering
Key Skills Required for Data EngineeringKey Skills Required for Data Engineering
Key Skills Required for Data Engineering
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Collecting and Making Sense of Diverse Data at WayUp
Collecting and Making Sense of Diverse Data at WayUpCollecting and Making Sense of Diverse Data at WayUp
Collecting and Making Sense of Diverse Data at WayUp
 
How to build data accessibility for everyone
How to build data accessibility for everyoneHow to build data accessibility for everyone
How to build data accessibility for everyone
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 

Último

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf

Enabling Your Data Science Team with Modern Data Engineering

  • 1. Enabling Your Data Science Team with Modern Data Engineering
      James Densmore, Data Liftoff
      dataliftoff.com
      @jamesdensmore
  • 2. About Me
      Founder & Consultant at Data Liftoff
      Experience leading Data Science and Data Engineering Teams
      Technical Background (Software Engineering and Data Engineering)
      @jamesdensmore
      www.dataliftoff.com
  • 3. What is “Modern” Data Engineering?
      ● Thanks to highly scalable, columnar databases (usually cloud based), we’re now able to store, structure and query extremely high volumes of data at a low cost. Really!
      ● A mix of data lakes and data warehouses
      ● ELT instead of ETL
      ● Closer to software engineering than in the past
      ● No longer a “back office” function. Often aligned with product development. Sometimes a stand-alone Tech team
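To make the "ELT instead of ETL" bullet concrete, here is a minimal sketch of the pattern: raw rows are loaded untouched, and the transformation runs afterwards as SQL inside the database. It uses Python's built-in `sqlite3` as a stand-in for a cloud warehouse; the table names (`raw_events`, `user_revenue`), the columns and the sample rows are all hypothetical and not from the talk.

```python
import sqlite3


def run_elt(raw_events):
    """Load raw rows as-is, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse

    # Extract + Load: land the data untouched, even with text-typed amounts.
    conn.execute(
        "CREATE TABLE raw_events (user_id TEXT, event TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

    # Transform: a SQL step run after loading. The raw table is preserved,
    # so data scientists can still query the unprocessed rows.
    conn.execute(
        """
        CREATE TABLE user_revenue AS
        SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_events
        WHERE event = 'purchase'
        GROUP BY user_id
        """
    )
    conn.commit()
    return conn


# Hypothetical sample data for illustration only.
raw = [
    ("u1", "purchase", "9.50"),
    ("u1", "view", "0"),
    ("u2", "purchase", "20"),
    ("u1", "purchase", "0.50"),
]
conn = run_elt(raw)
rows = conn.execute(
    "SELECT user_id, revenue FROM user_revenue ORDER BY user_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

Because the transform happens after the load, both the raw table and the modeled table remain available, which is what gives data science access to unprocessed data alongside the transformed version.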
  • 4. Difference Between Data Science and Data Engineering - Oversimplified!
      Data Engineers build and maintain data infrastructure, including data warehouses.
      Data Scientists use data to make predictions, run analysis and build models to power products.
  • 5. Common Data Engineering Tools and Platforms
  • 6. Common Data Science Tools and Platforms
  • 7. Don’t Assume The Two Teams Understand Each Other
  • 8. What Data Scientists Should Know about Data Engineers
      ● They’re software engineers at heart
      ● They don’t always know how data is generated. Some questions are better left to the production engineers
      ● They’re interested in your model, but probably not the math 😆
      ● They’re thinking about scale and efficiency - sometimes too much so
      ● You are one of many customers to them
  • 9. What Data Engineers Should Know about Data Scientists
      ● They write code, but they’re usually not software engineers
      ● They will look into data in more detail than anyone else, including you
      ● Their work is difficult to put into tickets and sprints
      ● Scale and performance are not their top priority
      ● They understand the “why” of what they’re building - just ask
  • 10. What Data Science Needs from a Data Infrastructure
      ● Access to both transformed and unprocessed data
      ● Definitions of columns/attributes and how data is generated
      ● A safe space to experiment and tune models
          ○ Plenty of storage
          ○ No impact on production or other users
          ○ Read permissions on existing datasets, write/create space for themselves
      ● A path to production
  • 11. How This Differs From Other Consumers of Data
      ● Data warehouses traditionally serve fully transformed and aggregated data to BI tools, dashboards and data analysts. Data Scientists need raw data - a lot of it
      ● The data warehouse was once the “end of the road” for data. Data Scientists need it in other forms and locations.
      ● Data products built by the data science team may end up in production. What’s the path to get there?
  • 12. Asking More from Data Engineering
      ● New pipelines to support data science
      ● Documenting more detail of the raw data and fielding highly specific questions about it
      ● Strain on databases from ad hoc queries
      ● Managing data security and privacy outside of the warehouse
      ● Model deployment to production
  • 13. Infrastructure Considerations
      ● Data Lakes + Databases
      ● Secure storage for flat files
      ● VMs for building and testing models in development
          ○ Discourage local development with sensitive data
      ● Share best practices for accessing data from scripts - credential management
      ● Data governance now extends to development machines, VMs, and flat file storage
      Image Credit: Amazon Web Services
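One commonly shared practice for the credential management bullet is to keep secrets out of scripts and notebooks by reading them from environment variables. The sketch below assumes hypothetical variable names (`WAREHOUSE_HOST`, `WAREHOUSE_USER`, `WAREHOUSE_PASSWORD`) and hypothetical helper names; teams using a secrets manager would swap in its client instead.

```python
import os


def load_db_credentials(env=os.environ, prefix="WAREHOUSE_"):
    """Read connection settings from the environment, never from source code."""
    required = ("HOST", "USER", "PASSWORD")
    missing = [prefix + key for key in required if not env.get(prefix + key)]
    if missing:
        # Fail loudly so nobody is tempted to hard-code a fallback password.
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {key.lower(): env[prefix + key] for key in required}


def redact(creds):
    """Return a copy that is safe to print or log."""
    return {k: ("***" if k == "password" else v) for k, v in creds.items()}


# Usage with an explicit mapping (hypothetical values) instead of the real
# environment, so the example runs anywhere.
example_env = {
    "WAREHOUSE_HOST": "warehouse.example.internal",
    "WAREHOUSE_USER": "ds_readonly",
    "WAREHOUSE_PASSWORD": "not-a-real-secret",
}
creds = load_db_credentials(example_env)
safe_to_log = redact(creds)
```

Passing the mapping as a parameter keeps the function easy to test, and redacting before logging helps when notebooks and script output get shared between teams.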
  • 14. An Example - Building a Recommender System
      ● Data to build the model
          ○ Previous recommendations and clicks, search logs, content metadata, user profiles, user activity history
          ○ What they want might not exist!
      ● Infrastructure to build the model
          ○ Storage for exports of data
          ○ VMs to build and run models - they need to securely access input data and output results for analysis
      ● Moving model to production
          ○ Data engineering + application engineers
      ● Instrumenting further tracking and data collection in production
          ○ Build new pipelines and select storage
      ● Deploy, analyze, iterate and deploy again!
  • 15. Partners, Not Siloed Services
      ● The closer together, the better!
      ● Over-communicate
          ○ Overlapping Slack channels
          ○ Sit in on planning meetings
      ● Share knowledge
          ○ Monthly demos or lunch-and-learns
          ○ Share detailed release notes
      ● Recognize differences in sizing, planning and executing projects
      Image Credit: Vector Open Stock - http://www.vectoropenstock.com/
  • 16. Overcome Org Structure
      ● A single leader overseeing both teams, even if not directly, is ideal
          ○ Not always possible! Team up leaders and keep them close
      ● Align around projects, not org charts
      ● Find team members most curious about the “other side” and give them opportunities to dip their toes in
      ● Share, and speak to, successes as a unified team. Perception is reality
  • 17. Other Common Pitfalls
      ● Hiring data scientists without having data engineers
      ● Assuming because you collect “data”, data scientists have what they need
      ● Structuring data science work like you do software and data engineering
      ● Underestimating the failure rate of data science projects in comparison to data engineering
  • 18. Final Tips & Ideas
      ● New tools won’t save you, but don’t ignore them
      ● Be flexible in your hiring. Generalists bridge gaps
      ● Invest in lightweight documentation, and commit to keeping it current
          ○ Accurate over Glossy
      ● Cross-team interviewing and onboarding
      ● Question your team structure often
      ● When in doubt, talk!