The quality of business decisions, machine learning insights, and executive reports depends on the quality and integrity of the underlying data. Data in an analytical data platform can become corrupted in many ways, from de-synchronization with the system of record (SoR) to defects in data pipelines. We will show how to detect and prevent data corruption with automation, open-source tools, and machine learning.
2. Introducing Grid Dynamics technology services
Digital transformation
Big data, real-time analytics, ML & AI
Microservices replatforming
DevOps & cloud enablement
Open Source, Cloud-ready, Scalable, Automated
10. [Figure: grid of binary digits]
1. Trust is hard to build and easy to lose.
2. Distrust in data slows down decisions.
3. Slow decisions prevent agility.
17. Data Lake
[Diagram: Data source, data, Main data processing pipeline, data with confidence]
1. Compare with SoR
2. Validate business rules
3. Data profiling and anomaly detection
18. 1. Control divergence from SoR
[Diagram: Data source, Imported dataset, Data Lake; compare data in SoR and data lake]
1. Validate correctness of import.
2. Prevent stale data.
3. Prevent corruption accumulation in stream processing use cases.
4. Check data before it gets in the lake.
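A minimal sketch of the SoR comparison, assuming both sides can be read as lists of rows with a shared primary key. The function names, the `id` key, and the sample rows are hypothetical and only illustrate the idea; a real platform would run this against database extracts rather than in-memory lists.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row's values in a stable column order."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_with_sor(sor_rows, lake_rows, key="id"):
    """Report keys that are missing, unexpected, or different in the lake."""
    sor = {r[key]: row_fingerprint(r) for r in sor_rows}
    lake = {r[key]: row_fingerprint(r) for r in lake_rows}
    shared = sor.keys() & lake.keys()
    return {
        "missing_in_lake": sorted(sor.keys() - lake.keys()),
        "unexpected_in_lake": sorted(lake.keys() - sor.keys()),
        "mismatched": sorted(k for k in shared if sor[k] != lake[k]),
    }


# Hypothetical sample data: row 2 diverged, row 3 was never imported,
# row 4 exists only in the lake.
sor_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
lake_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]
report = compare_with_sor(sor_rows, lake_rows)
print(report)
```

Comparing fingerprints rather than full rows keeps the check cheap when only the hashes have to be moved between systems. An empty report suggests the import is in sync for the compared key range.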
19. 2. Validate business rules
[Diagram: Data Lake, Dataset; check for nulls and data ranges]
1. Enforce schema.
2. Check for nulls.
3. Validate data ranges.
4. Specify and enforce data invariants.
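The four rule types above can be sketched as a single row validator. This is an illustration under assumptions, not the tooling from the talk: the order schema, the ranges, and the `total = quantity * unit_price` invariant are all hypothetical.

```python
# Hypothetical schema, ranges, and invariant for an "orders" dataset.
SCHEMA = {"order_id": int, "quantity": int, "unit_price": float, "total": float}
RANGES = {"quantity": (1, 1000), "unit_price": (0.0, 10000.0)}


def validate_row(row: dict) -> list:
    """Return a list of rule violations; an empty list means the row passed."""
    errors = []
    # Rules 1 and 2: enforce schema and check for nulls.
    for col, expected_type in SCHEMA.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif row[col] is None:
            errors.append(f"null value: {col}")
        elif not isinstance(row[col], expected_type):
            errors.append(f"wrong type: {col}")
    if errors:
        return errors  # range and invariant checks need well-typed values
    # Rule 3: validate data ranges.
    for col, (low, high) in RANGES.items():
        if not low <= row[col] <= high:
            errors.append(f"out of range: {col}")
    # Rule 4: enforce a data invariant across columns.
    if abs(row["quantity"] * row["unit_price"] - row["total"]) > 0.01:
        errors.append("invariant violated: total != quantity * unit_price")
    return errors


good = {"order_id": 1, "quantity": 2, "unit_price": 5.0, "total": 10.0}
bad = {"order_id": 2, "quantity": 2, "unit_price": 5.0, "total": 12.0}
print(validate_row(good))
print(validate_row(bad))
```

Returning all violations for a row, rather than stopping at the first one, makes it easier to aggregate failure counts per rule across a dataset.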
20. 3. Anomaly detection
[Diagram: Data Lake, Dataset; data profiling and anomaly detection]
1. Fully automatic data quality enforcement.
2. Collect data profile, metrics and statistics.
3. Train ML models.
4. Find anomalies in data.
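A minimal sketch of profiling followed by anomaly detection. It uses a plain z-score over historical metric values as a stand-in for a trained ML model, which is a deliberate simplification. The metric names, the threshold, and the sample history are assumptions.

```python
import statistics


def profile(rows, column):
    """Collect a small data profile for one column of a dataset."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "mean": statistics.fmean(non_null) if non_null else None,
    }


def is_anomalous(history, value, threshold=3.0):
    """Flag a metric value that is far from its history (z-score baseline)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical daily row counts collected by earlier profiling runs.
history = [1000, 1020, 990, 1010, 1005, 995, 1015]
print(is_anomalous(history, 1008))  # close to the usual volume
print(is_anomalous(history, 400))   # sudden drop in volume

rows = [{"amount": 10.0}, {"amount": None}, {"amount": 30.0}, {"amount": 20.0}]
print(profile(rows, "amount"))
```

The same check can be applied to any profiled metric, such as null fraction or column mean, so new datasets get a baseline level of monitoring without hand-written rules.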
25. Capabilities for enterprise data quality and governance
Enables widespread adoption.
Enforces enterprise-level controls and data usage policies.
Increases consistency and confidence in decision making.
Decreases the risk of regulatory fines.
Improves data security.
Facilitates accountability for information quality.
Minimizes or eliminates duplication of effort.
Data Governance Platform
Metadata Management
Full-text Search
Data Quality Status
Schema / Summary
Data Profiling
Mapping to Glossary
Change Log
Dependency Detection
Consumers Flow Visualization
Glossary Portal
Knowledge Base
Fields Fingerprinting
Data Catalog
Dataset Profile
Lineage Dashboard
Data Glossary
Data Quality
Access and Security
Business Rules
Anomaly Detection
Alerting
Access Rules
Compliance Policies
Policy Engine