1. The Evolving Landscape of Data Engineering
Bucharest Big Data Meetup @ TechHub
Andrei Savu / @andreisavu
2. Andrei Savu
Currently Staff Engineer @ Twitter:
* Twitter Ad Exchange Data Team
* Focus on Mobile Monetization
Co-organizer of the Data Engineering
Club in San Francisco.
Previously Tech Lead at Cloudera via
the Axemblr.com acquisition. Started
the Cloud engineering team.
One of the early founders of the
Bucharest Java User Group.
3. Topics
What is data engineering?
The Past / Drivers of innovation:
● OSS communities
● AWS history
● Google Cloud history
The Present: Common Patterns
The Future: Wish List
Where do I start?
4. What is data engineering? (vs. data science, vs. ML)
“Unlike data scientists — and inspired by
our more mature parent, software
engineering — data engineers build tools,
infrastructure, frameworks, and services. In fact,
it’s arguable that data engineering is much
closer to software engineering than it is to a data
science.”
Maxime Beauchemin
The Rise of the Data Engineer
5. The Past - OSS
Weeks of Provisioning
Static Infrastructure
Commodity Hardware
Commodity Networking
Data Locality Important
Running in the Public Cloud was unusual
CAPEX
7. The Past - Google Cloud
Visionary Products
Fast iterations
Machine Learning as a key use case
State of the Art data platform
Last 3 years on fast forward
Intelligent Billing
OPEX & Elastic
8. The Present: Patterns
Weeks to Minutes to Seconds
Hadoop/Spark ecosystem is mature and
continues to innovate.
We have a broad set of options.
Big Data is much bigger (e.g. x1e.32xlarge:
~3.9 TB mem, 128 vCPUs, 14 Gbps EBS bandwidth)
Scale continues to be hard.
Cloud economics can be very disruptive
(especially for data workloads)
High-performance networks are common.
Storage can be decoupled from compute.
Zone/DC locality is important (laws of physics)
Service Endpoints (not clusters, aka serverless,
aka managed etc.).
Sophisticated Auto-scaling (batch & streaming,
spot vs. on-demand, multi-az).
Multi-DC and Multi-Region from Day 1.
9. The Future: Wish List
A Data Catalog product as the center of the
universe.
Data Monitoring Systems:
* statistical properties, anomaly detection,
schema changes, consumption patterns etc.
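A minimal sketch of two of the checks named above, schema changes and anomaly detection on a statistical property. It is illustrative only: the function names, the sample columns, and the 3-standard-deviation threshold are assumptions, not part of any particular monitoring product.

```python
import statistics


def schema_changes(expected_columns, rows):
    """Return columns that appeared or disappeared relative to the expected schema."""
    seen = set()
    for row in rows:
        seen.update(row.keys())
    expected = set(expected_columns)
    return {"added": sorted(seen - expected), "removed": sorted(expected - seen)}


def is_anomalous(history, value, threshold=3.0):
    """Flag a value more than `threshold` sample standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical inputs: recent daily row counts and one incoming batch.
daily_row_counts = [1000, 1020, 980, 1010, 990]
batch = [{"user_id": 1, "amount": 9.5, "country": "RO"}]

changes = schema_changes(["user_id", "amount"], batch)
row_count_alert = is_anomalous(daily_row_counts, 120)
```

A real system would track many such metrics per dataset over time; the z-score test here is just the simplest stand-in for anomaly detection.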
More intelligence at the data infrastructure level:
* data format migrations, intelligent caching
based on access patterns.
Declarative data transformation vs. explicit ETL.
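To make the contrast concrete, here is a small sketch that computes the same aggregate twice: once as explicit step-by-step code, once as a declarative SQL statement. The table, column names, and data are made up for illustration, and SQLite stands in for whatever engine would run the declarative version.

```python
import sqlite3

# Hypothetical (country, revenue) rows.
events = [("ro", 10.0), ("us", 5.0), ("ro", 2.5)]


def revenue_by_country_explicit(rows):
    """Explicit ETL: the code spells out how to iterate and accumulate."""
    totals = {}
    for country, revenue in rows:
        totals[country] = totals.get(country, 0.0) + revenue
    return totals


def revenue_by_country_declarative(rows):
    """Declarative: state the desired result and let the engine plan the execution."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (country TEXT, revenue REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, SUM(revenue) FROM events GROUP BY country"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


explicit = revenue_by_country_explicit(events)
declarative = revenue_by_country_declarative(events)
```

Both versions return the same totals; the difference is that the declarative form leaves ordering, parallelism, and optimization to the engine.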
Intelligent data sampling products. Cost will
continue to be a concern even when scale is
not.
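One standard building block for sampling large datasets cheaply is reservoir sampling, which draws a uniform fixed-size sample in a single pass without knowing the stream length in advance. The sketch below is an assumption about what a sampling product might build on, not something the slide specifies; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(100_000), k=100, seed=42)
```

Memory use stays at k items regardless of input size, which is what makes this attractive when processing cost, rather than scale, is the constraint.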
10. Where do I start?
Technologies:
● SQL + Python
● Pandas + Numpy
● Jupyter or Zeppelin
● Spark
Google Cloud:
● https://www.coursera.org/specializations/gcp-data-machine-learning ($300 credit)
Domain Knowledge:
● Critical business questions
● The data needed to answer them
● Understand access patterns