"Implementing data quality automation with open source stack" - Max Martynov, CTO of Grid Dynamics

The quality of business decisions, machine learning insights, and executive reports depends on the quality and integrity of the underlying data. Data in an analytical data platform can get corrupted in many ways, from de-synchronization with the system-of-record to defects in data pipelines. We will show how to detect and prevent data corruption with automation, open-source tools, and machine learning.


  1. Data Quality: Automatic enforcement in real-time with machine learning. Max Martynov, CTO
  2. Introducing Grid Dynamics technology services: Digital transformation; Big data, real time analytics, ML & AI; Microservices replatforming; DevOps & cloud enablement. Open Source, Cloud-ready, Scalable, Automated.
  3. 12 years of experience in digital transformation.
  4. [Chart: Retailer X, Daily Sales – Executive Summary. Series: Revenue, $M and Gross Profit, $M, plotted daily from 7/1/19 to 7/14/19 with weekends marked.]
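  5. [Diagram labels: DB, DB, DB; EDW]
  6. [Diagram labels: Data Lake; EDW; DB, DB, DB; File, File, File]
  7. [Diagram labels: Cloud; Data Lake; EDW; DB, DB, DB; File, File, File; MQ; Cloud, Cloud; App, App, API]
  8. [Diagram labels: Cloud; EDW; DB, DB, DB; File, File, File; MQ; Cloud, Cloud; App, App, API; App, App, App; Data Lake]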
  9. (1) Trust is hard to build and easy to lose. (2) Distrust in data slows down decisions. (3) Slow decisions prevent agility.
  10. Data corruption reasons: (1) Code; (2) Data Sources; (3) Infrastructure.
  11. Traditional approach to testing. [Diagram labels: Test environment; Input; Actual; Expected; ETL code; compare; Test suite; run test]
  12. Production data quality goals: Detect data corruption. Prevent it from spreading. Alert support team.
  13. Data quality enforcement in production. [Diagram labels: Production data lake; DB, DB, DB; File, File, File; MQ; App, App, API; data processing job]
  14. Data quality enforcement in production. [Diagram labels: Production data lake; DB, DB, DB; File, File, File; MQ; App, App, API; data processing job; data quality job]
  15. Data quality enforcement in production. [Diagram labels: Production data lake; DB, DB, DB; File, File, File; MQ; App, App, API; data processing job; data quality job; alert & stop pipeline; alert & continue]
  16. Main data processing pipeline: (1) Compare with SoR; (2) Validate business rules; (3) Data profiling and anomaly detection. [Diagram labels: Data source; Data Lake; data; confidence]
  17. Check 1: Control divergence from SoR. (1) Validate correctness of import. (2) Prevent stale data. (3) Prevent corruption accumulation in stream processing use cases. (4) Check data before it gets in the lake. [Diagram labels: Data source; Data Lake; Imported dataset; Compare data in SoR and data lake]
  18. Check 2: Validate business rules. (1) Enforce schema. (2) Check for nulls. (3) Validate data ranges. (4) Specify and enforce data invariants. [Diagram labels: Data Lake; Dataset; Check for nulls and data ranges]
  19. Check 3: Anomaly detection. (1) Fully automatic data quality enforcement. (2) Collect data profile, metrics and statistics. (3) Train ML models. (4) Find anomalies in data. [Diagram labels: Data Lake; Dataset; Data profiling and anomaly detection]
  20. Demo setup. [Diagram labels: Catalog; Inventory; Orders; Data Lake; Data Quality; Reporting & Alerting; Data Profile]
  21. Live demo
  22. Anomaly detection example
  23. Anomaly detection example: zooming in
  24. Capabilities for enterprise data quality and governance: Enables widespread adoption. Enforces enterprise-level controls and data usage policies. Increases consistency and confidence in decision making. Decreases the risk of regulatory fines. Improves data security. Facilitates accountability for information quality. Minimizes or eliminates duplication of effort. [Diagram labels, Data Governance Platform: Metadata Management; Full-text Search; Data Quality Status; Schema / Summary; Data Profiling; Mapping to Glossary; Change Log; Dependency Detection; Consumers; Flow Visualization; Glossary Portal; Knowledge Base; Fields Fingerprinting; Data Catalog; Dataset Profile; Lineage Dashboard; Data Glossary; Data Quality; Access and Security; Business Rules; Anomaly Detection; Alerting; Access Rules; Compliance Policies; Policy Engine]
  25. www.griddynamics.com. Thank you!
  26. Demo screenshots
  27. Dataproc cluster
  28. Airflow pipeline
  29. Griffin measures
  30. Anomaly detection: normal data
  31. Anomaly detection: anomaly
  32. Anomaly detection: return to normal
  33. Anomaly detection: historical view
  34. Anomaly detection (counts): anomaly & return
  35. Uniqueness: normal data
  36. Uniqueness: anomaly
  37. Uniqueness: return to normal
  38. Uniqueness: historical view
  39. Nulls: normal data
  40. Nulls: anomaly
  41. Nulls: return to normal
  42. Nulls: historical view
  43. Ranges: historical view
  44. Completeness: anomalies
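
The three checks on slides 16 to 19 and the stop-or-continue decision on slide 15 could be sketched roughly as follows. This is a minimal illustration in plain Python, not the Griffin-based setup shown in the demo. All names (`Finding`, `compare_with_sor`, `validate_rules`, `detect_anomaly`, `decide`) are hypothetical, a simple z-score over a metric's history stands in for the trained ML models mentioned on slide 19, and the choice of which checks block the pipeline is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Finding:
    check: str      # which check raised the finding
    detail: str     # human-readable text for the alert
    blocking: bool  # True: alert & stop pipeline; False: alert & continue


def compare_with_sor(sor_rows, lake_rows, key):
    """Check 1: rows missing from the lake or different from the SoR."""
    sor = {row[key]: row for row in sor_rows}
    lake = {row[key]: row for row in lake_rows}
    findings = []
    for k in sorted(sor.keys() - lake.keys()):
        findings.append(Finding("sor", f"{key}={k} missing from lake", True))
    for k in sorted(sor.keys() & lake.keys()):
        if sor[k] != lake[k]:
            findings.append(Finding("sor", f"{key}={k} differs from SoR", True))
    return findings


def validate_rules(rows, required, ranges):
    """Check 2: nulls in required fields and values outside allowed ranges."""
    findings = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                findings.append(Finding("rules", f"row {i}: {field} is null", True))
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not low <= value <= high:
                findings.append(
                    Finding("rules", f"row {i}: {field}={value} out of range", True)
                )
    return findings


def detect_anomaly(history, today, threshold=3.0):
    """Check 3 (stand-in): flag a profile metric far from its own history."""
    if len(history) < 2:
        return []  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma > 0:
        z = abs(today - mu) / sigma
    else:
        z = 0.0 if today == mu else float("inf")
    if z > threshold:
        # Assumption: anomalies only alert, they do not stop the pipeline.
        return [Finding("anomaly", f"metric={today}, z-score={z:.1f}", False)]
    return []


def decide(findings):
    """Every finding is alerted; any blocking finding stops the pipeline."""
    return "stop" if any(f.blocking for f in findings) else "continue"
```

A data quality job in this sketch would collect the findings from all three functions after each data processing job, send them to the support team, and call `decide` to choose between stopping the pipeline and letting it continue.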
