Data Observability
©2022 Intuit Inc. All rights reserved.
Agenda
● About Us
● Intuit Data Ecosystem: Unique consumer and small business assets at scale
● Data & Analytics at Intuit
● Product Reports and Analytics Dashboards
● Data Quality Challenges
● Understanding data issues
● Data Observability
● Cure, Detect, Prevent, Eradicate: Data Observability Model in Intuit
● Achieving Data Quality At Scale
● Preventing Data Incidents Using DQ Checks, ADS & Infrastructure Monitoring
Intuit Data Ecosystem: Unique consumer and small business assets at scale
"The monthly aggregated metrics for subscribers do not match the weekly data. Are there any issues?"
- Lisa (Frustrated Data Worker)
How do the challenges originate?
● Data Sources: Do not understand how downstream uses the data
● Data Lake: Trouble identifying the ways the data pipelines might break
● Analytics: Hard to understand what's wrong in the data and how to get help
What is not going well?
● Incorrect Reporting
● More Incidents
● High MTTD/MTTR
● Missed SLA
● Data Reload
● Parity Issues
Data Analytics Platform
● Event Parity Issues
● Source Failures / Delays
● Data / Processing Defects
● Prediction Model Defects
● Incorrect Reporting
● SLA Misses
Let's Recall the Problem
● Incorrect Reporting
● More Incidents
● High MTTD/MTTR
● Missed SLA
● Data Reload
● Parity Issues
Cure, Detect, Prevent, Eradicate
Monitoring Dashboard, Alerting Protocol
Preventing Data Incidents: Data Quality Checks
[Diagram: Circuit Breaker Checks, Low Priority Alerts, Data Validation Checks]
Key Design Decisions
● Multiple Source Support
● Performance Consistency & Scale
● Config Driven Data Profile Rules
● Capability to add Business Rules
● Run as part of Data Processing Pipelines
● Data Discrepancy and Anomaly Detection
● Fail Fast with Circuit Breaker
● Multi-channel Alerts, Single Window Reporting
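The fail-fast design decision above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Intuit's library: `Check`, `run_checks`, and `CircuitBreakerTripped` are hypothetical names, and both rules are invented for the example. A failed high-priority check raises and stops the pipeline, while a failed low-priority check only produces an alert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class CircuitBreakerTripped(Exception):
    """Raised to stop the pipeline when a high-priority check fails."""


@dataclass
class Check:
    name: str                                 # hypothetical rule name
    predicate: Callable[[List[Dict]], bool]   # returns True when the data passes
    high_priority: bool                       # True means a failure trips the breaker


def run_checks(rows: List[Dict], checks: List[Check]) -> List[str]:
    """Run checks in order; fail fast on the first high-priority failure."""
    alerts = []
    for check in checks:
        if check.predicate(rows):
            continue
        if check.high_priority:
            raise CircuitBreakerTripped(f"high-priority check failed: {check.name}")
        alerts.append(f"low-priority alert: {check.name}")
    return alerts


# Hypothetical rules for illustration only.
checks = [
    Check("no_null_ids", lambda rs: all(r["id"] is not None for r in rs), True),
    Check("row_count_at_least_3", lambda rs: len(rs) >= 3, False),
]

good_rows = [{"id": 1}, {"id": 2}]
bad_rows = [{"id": 1}, {"id": None}]

print(run_checks(good_rows, checks))
try:
    run_checks(bad_rows, checks)
except CircuitBreakerTripped as exc:
    print(exc)
```

Raising an exception is one simple way to halt downstream steps; a real pipeline would likely also route both alert levels to its alerting channels.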
[Diagram: Data Pipeline 1, Data Pipeline 2, Data Quality Checks, High Priority Alerts]
[Diagram: Data Source, Data Quality Rules, Data Quality Library, Reports / Alerts]
[Architecture diagram]
Input Sources: DB, file (.parquet, .csv, .xml, .json), Object Store
Input Configs: Spark SQL, Spark Config
Spark Process: Spark Step 1, Spark Step 2, ... Spark Step n; a Spark Loader Class per step; Dataframe 1, Dataframe 2, Dataframe 3, ... Dataframe n; Dataframe Comparator; Dataframe Validator
Output: Alerts, Triage Dashboard, DB, logs, Object Store
Spark Loader Class: Load, Transform, Save (operating on dataframes)
PreStep = {
  class-name = com.intuit.MySparkLoadClass,
  inputs = {
    my-company-gns-df = {
      order = 2,
      sql = {
        sql-type = local,
        out-path = "s3://temp/path",
        table = companies_filtered,
        sql = """select a.* from demo.company_info a
                 join demo.company_status b on a.c_id = b.c_id
                 where instr(a.company_name, 'delete') = 0"""
      },
      metadata = {is-input = true, is-save = true}
    }
  }
}
Callouts: class-name is a Custom Scala Class; sql is Custom Spark SQL.
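To see what the Custom Spark SQL in this step selects, the same query can be run against a tiny in-memory SQLite database. This is only a sketch: SQLite stands in for Spark SQL (both provide `instr` with the same meaning here), the column `status` and all sample rows are invented, and only the table names and the query text come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "demo" so demo.<table> resolves.
conn.execute("ATTACH DATABASE ':memory:' AS demo")
conn.execute("CREATE TABLE demo.company_info (c_id INTEGER, company_name TEXT)")
conn.execute("CREATE TABLE demo.company_status (c_id INTEGER, status TEXT)")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO demo.company_info VALUES (?, ?)",
    [(1, "Acme"), (2, "Old Co delete"), (3, "No Status Inc")],
)
conn.executemany(
    "INSERT INTO demo.company_status VALUES (?, ?)",
    [(1, "active"), (2, "active")],
)

# Query text as it appears in the PreStep config.
query = """select a.* from demo.company_info a
join demo.company_status b on a.c_id = b.c_id
where instr(a.company_name, 'delete') = 0"""

companies_filtered = conn.execute(query).fetchall()
print(companies_filtered)
```

Company 2 is dropped because its name contains 'delete', and company 3 is dropped by the inner join because it has no status row.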
compare_1 = {
  class-name = com.intuit.DataframeComparator,
  properties = {
    "comparator-config" = """{
      "comparisonSets": [{"Product":"df_dataset_1"},
                         {"Billing System":"df_dataset_2"}],
      "validationName": "Test QBO signup count by product",
      "comparisonType": "percent_variance",
      "threshold": "1.00",
      "comp_out_df": "df_gns_by_product"
    }"""
  }
}
Callouts: class-name is a Custom Scala Class; the config input sets the Comparison Sets, Validation Name, Comparison Type, Threshold value, and Output Dataframe.
[Diagram: df_dataset_1 and df_dataset_2 -> Dataframe Comparator (config input) -> comp_out_df]
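A `percent_variance` comparison with a threshold of 1.00 can be sketched as follows. The slides do not define the formula, so this assumes the absolute difference expressed as a percentage of the first dataset's value; the function names, the product keys, and the counts are all hypothetical.

```python
from typing import Dict, List


def percent_variance(value_1: float, value_2: float) -> float:
    """Absolute difference as a percentage of value_1 (assumed definition)."""
    if value_1 == 0:
        return 0.0 if value_2 == 0 else float("inf")
    return abs(value_1 - value_2) / abs(value_1) * 100.0


def compare(dataset_1: Dict[str, float],
            dataset_2: Dict[str, float],
            threshold: float) -> List[dict]:
    """Compare two metric-by-dimension datasets and flag rows above the threshold."""
    rows = []
    for key in sorted(set(dataset_1) | set(dataset_2)):
        v1 = dataset_1.get(key, 0.0)   # a key missing on one side counts as 0
        v2 = dataset_2.get(key, 0.0)
        variance = percent_variance(v1, v2)
        rows.append({
            "dimension_1": key,
            "metric_1_value": v1,
            "metric_2_value": v2,
            "percent_variance": variance,
            "is_valid": variance <= threshold,
        })
    return rows


# Hypothetical signup counts from the two systems being compared.
product = {"product_a": 1000.0, "product_b": 500.0}
billing = {"product_a": 1005.0, "product_b": 520.0}

result = compare(product, billing, threshold=1.00)
for row in result:
    print(row)
```

With these numbers, product_a differs by about 0.5% and passes, while product_b differs by about 4% and is flagged.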
DatamartValidation = {
  class-name = com.intuit.DataframeValidator,
  properties = {
    "validation-config" = """{
      "pipelineName": "Test",
      "validationResultSets": ["my-company-gns-df",
                               "df_gns_by_product",..],
      "resultGenericColumns": ["validation_name:string",
                               "dimension_1:string", "dimension_2:string",
                               "dataset_1:string", "metric_1_value:decimal(20,4)",
                               "dataset_2:string", "metric_2_value:decimal(20,4)",
                               "validation_result:string",
                               "is_valid:boolean"],
      "outputDirectory": "s3://temp/outpath",
      "emailId": "alertsmail@intuit.com",
      "validatorOptions": {"failJob":"1", "emailSubject":"Test Triangulation", "forwardToSplunk":"1"}
    }"""
  }
}
Callouts: class-name is a Custom Scala Class; validationResultSets lists the Validation Result Sets; resultGenericColumns defines the Result Set Columns; emailId is the Alert Email.
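One way to read the validator step is that it gathers the named result sets and, when `failJob` is "1", fails the job if any row is invalid. The sketch below assumes exactly that behavior, which the slides do not spell out; `validate` and `ValidationFailed` are hypothetical names, and the sample rows are invented. Only the keys `validation_name`, `is_valid`, and `failJob` are taken from the config above.

```python
from typing import Dict, List


class ValidationFailed(Exception):
    """Raised when failJob is enabled and at least one row is invalid."""


def validate(result_sets: Dict[str, List[dict]],
             options: Dict[str, str]) -> List[dict]:
    """Combine result sets and optionally fail the job on invalid rows."""
    # Flatten all named result sets, tagging each row with its origin.
    combined = [
        {**row, "result_set": name}
        for name, rows in result_sets.items()
        for row in rows
    ]
    failures = [row for row in combined if not row["is_valid"]]
    if failures and options.get("failJob") == "1":
        names = sorted({row["validation_name"] for row in failures})
        raise ValidationFailed(f"{len(failures)} invalid row(s): {', '.join(names)}")
    return combined


# Hypothetical comparator output for illustration.
result_sets = {
    "df_gns_by_product": [
        {"validation_name": "signup count by product", "is_valid": True},
        {"validation_name": "signup count by product", "is_valid": False},
    ],
}

report = validate(result_sets, {"failJob": "0"})   # report only, job continues
print(len(report), "row(s) reported")

try:
    validate(result_sets, {"failJob": "1"})        # job fails on invalid rows
except ValidationFailed as exc:
    print(exc)
```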
Preventing Data Incidents: Anomaly Detection Checks
Features
● Ensemble of machine learning algorithms
● Training done from historic patterns
● Supports time series data
● Scheduled and API-based triggers for training and inference
● Post-inference anomalies published for consumption
[Diagram: Time Series Dataset -> Majority Voting -> Anomaly / Not Anomaly]
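Majority voting over an ensemble can be illustrated with three simple statistical detectors. These are stand-ins chosen for brevity, not the machine learning algorithms used in the deck (which are not named): a z-score rule, an interquartile-range rule, and a median-absolute-deviation rule. All function names, limits, and the sample series are assumptions.

```python
import statistics
from typing import Callable, List, Sequence

Detector = Callable[[Sequence[float], float], bool]


def zscore_detector(history: Sequence[float], value: float, limit: float = 3.0) -> bool:
    """Flag values more than `limit` sample standard deviations from the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(value - mean) / stdev > limit


def iqr_detector(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag values outside the quartile fences [q1 - k*iqr, q3 + k*iqr]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def mad_detector(history: Sequence[float], value: float, limit: float = 3.5) -> bool:
    """Flag values whose modified z-score (based on the MAD) exceeds `limit`."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    return mad > 0 and 0.6745 * abs(value - median) / mad > limit


def majority_vote(detectors: List[Detector], history: Sequence[float], value: float) -> bool:
    """Return True (anomaly) when more than half of the detectors flag the value."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes > len(detectors) / 2


detectors = [zscore_detector, iqr_detector, mad_detector]
history = [100, 102, 98, 101, 99, 103, 97, 100]   # hypothetical daily metric

print(majority_vote(detectors, history, 130))  # far outside the historic range
print(majority_vote(detectors, history, 101))  # within the historic range
```

Voting makes the result less sensitive to any single detector's blind spots, at the cost of needing an odd number of detectors to avoid ties.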
Data Triage - Validating Data at Each Layer
Spark - Performance Optimization
Spark Lens
Ganglia Chart
Spark History Server
Data Analytics Platform
● Parity checks to identify event loss
● Source Job Failure / Delay Alerts
● Resource Health Monitoring
● Anomaly Detection for Data Outliers
● Improved Forecast Models
● Data Quality Checks & Circuit Breakers
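A parity check for event loss can be as simple as comparing the event IDs seen at the source with those that arrived downstream. The sketch below assumes events carry a unique ID and that any loss above `max_loss_pct` breaks parity; `parity_check`, its parameters, and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable


def parity_check(source_ids: Iterable[str],
                 sink_ids: Iterable[str],
                 max_loss_pct: float = 0.0) -> Dict[str, object]:
    """Report which source events never reached the sink and the loss percentage."""
    source = set(source_ids)
    sink = set(sink_ids)
    missing = sorted(source - sink)
    loss_pct = 100.0 * len(missing) / len(source) if source else 0.0
    return {
        "missing": missing,
        "loss_pct": loss_pct,
        "in_parity": loss_pct <= max_loss_pct,
    }


# Hypothetical event IDs emitted by the source and received by the data lake.
report = parity_check(["e1", "e2", "e3", "e4"], ["e1", "e2", "e4"])
print(report)
```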