There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding: many workarounds, immature technologies, and a lack of established best practices.
5. Would you be confident in a self-driving car ... knowing that your software is running it?
6. Standardize and increase the descriptive power of engineering processes by applying patterns
Or, in other words: stand on the shoulders of giants and stop reinventing the wheel
7. Why does my brain need patterns?
● The left side of your brain is responsible for analytical thinking, science, math, etc.
● It uses known building blocks to model the surrounding world
● If you like table representation of data, you will try to model everything as a table
● As an engineer, expand your tool belt by learning new patterns and new building blocks to solve business problems better.
Source: https://www.health.harvard.edu/blog/right-brainleft-brain-right-2017082512222
8. About me
● IT Architect at Cognizant
● Data Engineering, Data Science,
Cloud Computing, Agile teams
● Financial, Manufacturing,
Logistics, Retail industries
● Organizer of Vilnius Microsoft Data
Platform Meetup & Hack4Vilnius Hackathon
● Blogging on www.valdas.blog
9. Maslow’s hierarchy of needs
● Biological and physiological needs: basic life needs such as air, food, drink, shelter, warmth, sex, sleep, etc.
● Safety needs: security, employment, protection against hunger and violence
● Love and belonging needs: receive and give love, appreciation, friendship
● Esteem needs: unique individual, self-respect, etc.
● Experience purpose and meaning: realising all inner potentials
● Self-actualization: personal growth and fulfillment
11. Maslow’s hierarchy of needs for data projects
● Culture: core values, way of working
● Enterprise architecture: buy vs build, cloud readiness
● Data strategy & architecture: defensive vs offensive strategy, use cases
● Existing team skillset: databases, programming, etc.
● Design patterns, tools & principles
● Business drivers: business goals and objectives
12. Maslow’s hierarchy of needs for data projects - simplified view for today’s presentation
● Culture: core values, way of working
● Data architecture: ingestion, storage, and consumption, i.e. how data is collected, stored, transformed, distributed, and consumed
● Tools & principles: best practices, naming, patterns
14. DevOps culture
1. Foster a Collaborative Environment
2. Impose End-to-End Responsibility - you build it, you ship it
3. Encourage Continuous Improvement
4. Automate (Almost) Everything
5. Focus on the Customer’s Needs
6. Embrace Failure, and Learn From it
7. Unite Teams — and Expertise
Source: https://www.cmswire.com/information-management/7-key-principles-for-a-successful-devops-culture/
17. If you are building a data platform in the cloud, remember that ... low barrier-to-entry overshadows complexity
18. Big Data cloud architecture references
Source: https://azure.microsoft.com/en-in/solutions/architecture/modern-data-warehouse/
19. Architecture example
[Diagram]
Sources: Social, LOB, Graph, IoT, Image, CRM, Cloud
Stages: INGEST, STORE, PREP & TRAIN, DEPLOY & SERVE
Components: Data orchestration and monitoring; Big data store; Transform, Clean & Train; Results
Consumers: CRM, External systems, Digital portals, Reporting, Core systems
20. Data ingestion
[Architecture diagram repeated from slide 19]
21. Application integration approaches
● File Transfer: Have each application produce files of shared data for others to consume, and consume files that others have produced.
● Shared Database: Have the applications store the data they wish to share in a common database.
● Remote Procedure Invocation: Have each application expose some of its procedures so that they can be invoked remotely, and have applications invoke those to run behavior and exchange data.
● Messaging: Have each application connect to a common messaging system, and exchange data and invoke behavior using messages.
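As a rough illustration of the Messaging approach, here is a minimal sketch in which an in-process `queue.Queue` stands in for the common messaging system. The event shape and the names `bus`, `publish_order`, and `consume_all` are assumptions made for this example, not part of the original slides; a real platform would use a broker instead of an in-memory queue.

```python
import json
import queue

# In-process queue standing in for a shared messaging system (assumption).
bus = queue.Queue()


def publish_order(order_id, amount):
    """Producer application: serialise an event and put it on the bus."""
    event = {"type": "order_created", "order_id": order_id, "amount": amount}
    bus.put(json.dumps(event))


def consume_all():
    """Consumer application: drain the bus and decode each message."""
    events = []
    while not bus.empty():
        events.append(json.loads(bus.get()))
    return events


publish_order(1, 9.5)
publish_order(2, 20.0)
received = consume_all()
```

The producer and the consumer only share the message format, not each other's code or storage, which is the main difference from the Shared Database approach.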
22. Ingestion challenges
● Multiple data source load and prioritization -> push vs pull strategy
● Ingested data indexing and tagging -> metadata collection is mandatory
● Data validation and cleansing -> separate business from processing logic
● Data transformation and compression -> different compression and file types
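Two of the points above, mandatory metadata collection and keeping business rules apart from processing logic, can be sketched as follows. The rule set, field names, and the `ingest` function are hypothetical and only illustrate the separation; they are not taken from the slides.

```python
from datetime import datetime, timezone

# Business rules live in one declarative place (hypothetical rules).
RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country": lambda v: v in {"LT", "LV", "EE"},
}


def validate(record, rules):
    """Generic processing logic: return the fields that break a rule."""
    return [f for f, check in rules.items()
            if f not in record or not check(record[f])]


def ingest(records, source):
    """Split records into accepted/rejected and collect ingestion metadata."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record, RULES)
        target = rejected if errors else accepted
        target.append({"data": record, "errors": errors})
    metadata = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "accepted": len(accepted),
        "rejected": len(rejected),
    }
    return accepted, rejected, metadata


accepted, rejected, meta = ingest(
    [{"customer_id": 1, "country": "LT"}, {"customer_id": -5, "country": "XX"}],
    source="crm",
)
```

Changing a business rule here means editing `RULES` only; the `validate` and `ingest` functions stay untouched.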
23. Choose privacy protection patterns
● Privacy protection at the ingress
● Privacy protection at the egress
Source: https://www.valdas.blog/2019/08/06/privacy-gdpr-implementation-in-azure/
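The slide names the two patterns without detailing them. One possible reading, sketched below under that assumption, is that ingress protection transforms personal data before it is stored, while egress protection stores the raw value and hides it when serving. The key, the function names, and the masking rule are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-me"  # hypothetical key; keep real keys in a secret store


def pseudonymize(value):
    """Ingress: replace an identifier with a keyed hash before storing it."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email):
    """Egress: the raw value stays in storage, most of it is hidden on output."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


stored_key = pseudonymize("jane.doe@example.com")
served = mask_email("jane.doe@example.com")
```

The keyed hash is deterministic, so the same person still maps to the same stored key and records remain joinable without exposing the original identifier.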
24. Data storage
[Architecture diagram repeated from slide 19]
26. Data Warehouse vs Data Lake
                    | Data Warehouse                       | Data Lake
Requirements        | Relational requirements              | Diverse data, scalability, low cost
Data Value          | Data of recognised high value        | Candidate data of potential value
Data Processing     | Mostly refined calculated data       | Mostly detailed source data
Business Entities   | Known entities, tracked over time    | Raw material for discovering entities and facts
Data Standards      | Data conforms to enterprise standards | Fidelity to original format and condition
Data Integration    | Data integration upfront             | Data prep on demand
Transformation      | Data transformed, in principle       | Data repurposed later, as needs arise
Schema Definition   | Schema-on-write                      | Schema-on-read
Metadata Management | Metadata improvement                 | Metadata developed on read
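The Schema Definition row is the easiest one to show in code. The sketch below contrasts the two approaches with plain Python lists standing in for the warehouse table and the lake; the schema and all function names are assumptions made for illustration.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical warehouse table schema


def write_to_warehouse(table, row):
    """Schema-on-write: reject rows that do not match the schema at load time."""
    if set(row) != set(SCHEMA):
        raise ValueError("unexpected columns: %s" % sorted(row))
    for col, typ in SCHEMA.items():
        if not isinstance(row[col], typ):
            raise ValueError(f"{col} must be {typ.__name__}")
    table.append(row)


def write_to_lake(lake, raw_line):
    """Schema-on-read: store the raw text untouched."""
    lake.append(raw_line)


def read_from_lake(lake):
    """Apply the schema only when reading; skip lines that do not fit."""
    rows = []
    for line in lake:
        try:
            doc = json.loads(line)
            rows.append({"id": int(doc["id"]), "amount": float(doc["amount"])})
        except (ValueError, KeyError, TypeError):
            continue
    return rows


warehouse, lake = [], []
write_to_warehouse(warehouse, {"id": 1, "amount": 9.5})
write_to_lake(lake, '{"id": "2", "amount": "20", "extra": true}')
write_to_lake(lake, "not json")
rows = read_from_lake(lake)
```

The lake accepts both lines at write time and pays the cost at read time, where the malformed line is skipped and the extra field is ignored.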
30. Data preparation & training
[Architecture diagram repeated from slide 19]
31. Offer self-service tools
[Diagram]
Automated pipeline: Collect raw data -> Curate data -> Train & Score -> Take insights into actions
Self-service exploration: Make hypothesis -> Identify variables -> Split data -> Build model -> Validate model
SQL
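The "split data, build model, validate model" steps can be sketched in plain Python. This is only an illustrative example under simple assumptions: a synthetic dataset, a one-variable least-squares line as the model, and mean absolute error as the validation metric. None of these choices come from the slides.

```python
import random


def split_data(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


def build_model(train):
    """Fit y = a * x + b by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def validate_model(model, test):
    """Mean absolute error on the held-out rows."""
    a, b = model
    return sum(abs(y - (a * x + b)) for x, y in test) / len(test)


data = [(x, 2 * x + 1) for x in range(20)]  # synthetic data for the sketch
train, test = split_data(data)
model = build_model(train)
mae = validate_model(model, test)
```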
33. Serve results to end consumers
[Architecture diagram repeated from slide 19]
34. Apply domain and product thinking
● Model to describe a domain
● Unified language
● Raw or transformed datasets
● Domain team is responsible for its lifecycle, SLA
● Discoverable, addressable, trustworthy,
self-describing, interoperable, secure
● Each producer is responsible for sharing data products with the organization
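One way to make these properties concrete is a small descriptor that every domain team fills in before publishing a dataset. The sketch below is a hypothetical shape, not a standard: the field names, the `lake://` address, and the `Catalog` class are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str          # discoverable: searchable by name and tags
    domain: str        # the domain model the product belongs to
    owner_team: str    # team responsible for lifecycle and SLA
    address: str       # addressable: one stable location
    schema: dict       # self-describing: column name -> type
    sla_hours: int     # maximum data age the owner commits to
    tags: tuple = ()


class Catalog:
    """In-memory stand-in for an organization-wide data catalog."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        """Producers share their products; ownerless ones are refused."""
        if not product.owner_team or not product.schema:
            raise ValueError("a data product needs an owner and a schema")
        self._products[product.address] = product

    def search(self, tag):
        """Consumers discover products by tag."""
        return [p.name for p in self._products.values() if tag in p.tags]


catalog = Catalog()
catalog.publish(DataProduct(
    name="orders",
    domain="sales",
    owner_team="sales-data",
    address="lake://sales/orders/v1",  # hypothetical address scheme
    schema={"order_id": "int", "amount": "float"},
    sla_hours=24,
    tags=("sales", "orders"),
))
found = catalog.search("sales")
```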
50. Delay commitments and keep important decisions open
● The principle of the Last Responsible Moment originates from Lean Software Development
● It emphasises holding off on important actions and crucial decisions for as long as possible.
51. Why is the Last Responsible Moment important in cloud analytics?
Expect new improvements and upgrades all the time