3. Agenda
❖ A brief introduction to Qubole
❖ Apache Airflow
❖ Operational Challenges in managing an ETL
❖ Alerts and Monitoring
❖ Quality Assurance in ETLs
4. About Qubole Data Service
❖ A self-service platform for big data analytics.
❖ Delivers best-in-class Apache tools such as Hadoop, Hive, and Spark, integrated into an enterprise-feature-rich platform optimized to run in the cloud.
❖ Enables users to focus on their data rather than the platform.
5. Data Team @ Qubole
❖ Data Warehouse for Qubole
❖ Provides Insights and Recommendations to users
❖ Just Another Qubole Account
❖ Enabling data-driven features within QDS
6. Multi-Tenant Nature of the Data Team
[Diagram: Qubole runs multiple distributions, e.g. Distribution 1 (api.qubole.com) and Distribution 2 (azure.qubole.com), each feeding its own Data Warehouse.]
7. Apache Airflow For ETL
❖ Developer Friendly
❖ A rich collection of operators, CLI utilities, and a UI to author and manage your data pipelines.
❖ Horizontally Scalable.
❖ Tight Integration With Qubole
9. Operational Challenges in the ETL World
❖ How do we achieve continuous integration and deployment for ETLs?
❖ How do we effectively manage configuration for ETLs in a multi-tenant environment?
❖ How do we make ETLs aware of Data Warehouse migrations?
12. Airflow Variables for ETL Configuration
❖ Stores information as key-value pairs in Airflow.
❖ Extensive support via CLI, UI, and API to manage the variables.
❖ Can be used from within an Airflow script as Variable.get("variable_name").
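A minimal usage sketch; the variable names are illustrative:

from airflow.models import Variable

# plain string value
schema = Variable.get("warehouse_schema")

# JSON value deserialized into a dict, with a default when the variable is unset
etl_config = Variable.get("usage_etl_config", default_var={}, deserialize_json=True)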
13. Warehouse Management
❖ A leaf out of Ruby on Rails: Active Record migrations.
❖ Each migration is tagged and committed as a single commit to version control along with the ETL changes.
14. The Process Is Easy
1. Fetch the current migration number from Airflow Variables.
2. Checkout the target tag from version control.
3. Run any new, relevant migrations.
4. Update the migration number.
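A minimal sketch of this loop, assuming migrations live in version control as an ordered list of (number, function) pairs and the current number is stored in an Airflow Variable; all names are illustrative:

from airflow.models import Variable

def run_pending_migrations(migrations):
    # migrations: list of (number, callable) pairs committed along with ETL changes
    current = int(Variable.get("dwh_migration_number", default_var="0"))
    for number, migrate in sorted(migrations, key=lambda m: m[0]):
        if number > current:
            migrate()  # apply the warehouse schema change
            Variable.set("dwh_migration_number", str(number))  # record progress in Airflow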
15. Deployment
❖ Traditional deployment is too messy when multiple users are handling Airflow.
❖ Data Apps for ETL deployment.
❖ Provides a CLI option like <ETL_NAME> deploy -r <version_tag> -d <start_date>
The deploy flow (sketched below):
1. Checkout the Airflow template file from version control.
2. Read config values from Airflow Variables and translate them into the template.
3. Copy the final script file to the Airflow DAGs directory.
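A sketch of what the deploy step might do under the hood, assuming Jinja2 templates and a JSON config Variable; the paths, keys, and file layout are all illustrative:

from jinja2 import Template
from airflow.models import Variable

def deploy(etl_name, version_tag, start_date):
    # step 1 happens before this function runs: the template file for
    # `etl_name` has already been checked out at `version_tag`
    with open("{0}.py.template".format(etl_name)) as f:
        template = Template(f.read())

    # step 2: read config values stored as a JSON Airflow Variable and
    # translate them into the template
    config = Variable.get("{0}_config".format(etl_name), deserialize_json=True)
    rendered = template.render(start_date=start_date, version=version_tag, **config)

    # step 3: copy the final script into the Airflow DAGs directory
    with open("/usr/lib/airflow/dags/{0}.py".format(etl_name), "w") as f:
        f.write(rendered)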
19. IMPORTANCE OF DATA VALIDATION
❖ An application's correctness depends on the correctness of its data.
❖ Increase confidence in the data by quantifying data quality.
❖ Correcting existing data can be expensive: prevention is better than cure!
❖ Stop critical downstream tasks if the data is invalid.
20. TREND MONITORING
❖ Monitor dips, peaks, anomalies.
❖ Hard problem!
❖ Not real time.
❖ One size doesn't fit all: different ETLs manipulate data in different ways.
❖ Difficult to maintain.
22. Using Apache Airflow Check Operators
Approach:
1. Extend the open-source Airflow check operator for queries running on the Qubole platform.
2. Run data validation queries.
3. Fail the operator if the validation fails.
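A usage sketch of the resulting operator (contributed upstream as QuboleCheckOperator, AIRFLOW-2213); the DAG, query, and connection id are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.qubole_check_operator import QuboleCheckOperator

dag = DAG("data_quality_checks", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# runs the query on Qubole and fails the task if the first row of the
# result contains a falsy value (e.g. a zero count)
validate = QuboleCheckOperator(
    task_id="validate_events",
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM warehouse.events WHERE dt = '{{ ds }}'",
    qubole_conn_id="qubole_default",
    dag=dag,
)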
25. 1. Compare data across engines
Problem: Airflow check operators required pass_value to be defined before the ETL starts.
Use case: Validating data import logic.
Solution: Make pass_value an Airflow template field. This way it can be configured at run time; once it is a template field, the pass value can be injected through multiple mechanisms.
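A sketch of the templated pass_value, reusing the dag object from the earlier sketch; here the expected value is pulled at run time from an upstream task's XCom (task and table names are illustrative):

from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

check_count = QuboleValueCheckOperator(
    task_id="check_row_count",
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM warehouse.accounts WHERE dt = '{{ ds }}'",
    # pass_value is a template field, so it is resolved at run time,
    # e.g. from an upstream task's XCom
    pass_value="{{ task_instance.xcom_pull(task_ids='fetch_source_count') }}",
    tolerance=0.01,  # tolerate a 1% deviation between the two counts
    qubole_conn_id="qubole_default",
    dag=dag,
)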
27. 2. Validate multiline results
Problem: Currently, Apache Airflow check operators consider only a single row for comparison.
Use case: Run group-by queries and compare each of the resulting values against the pass_value.
Solution: The Qubole check operator adds a `results_parser_callable` parameter. The function it points to holds the logic to return the list of records on which the checks are performed.
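A sketch of such a callable, assuming the operator hands it the raw result rows as a list of delimited strings (the exact input format is an assumption here):

def parse_all_rows(row_list):
    # return one record per row so the check is applied to every row,
    # not just the first
    return [row.split("\t") for row in row_list]

It is then passed to the operator as results_parser_callable=parse_all_rows.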
30. ETL #1: Data Ingestion
Imports data from RDS tables into the Data Warehouse for analysis purposes.
Historical issues (mismatch with the source data):
1. Data duplication.
2. Data missing for certain durations.
Checks employed:
- Count comparison across the two data stores, source and destination.
How the checks have helped us:
- Verify and rectify the upsert logic (which is not a plain copy of RDS).
PS: Runtime fetching of expected values, sketched below!
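A sketch of the runtime-fetching half of this check, assuming the source is a MySQL-compatible RDS instance: an upstream task reads the source count and pushes it to XCom, where a templated pass_value (as in the earlier sketch) picks it up. The connection id and table name are illustrative:

from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def fetch_source_count(**context):
    hook = MySqlHook(mysql_conn_id="rds_source")
    # the return value is automatically pushed to XCom
    return hook.get_first("SELECT COUNT(*) FROM accounts")[0]

fetch = PythonOperator(
    task_id="fetch_source_count",
    python_callable=fetch_source_count,
    provide_context=True,
    dag=dag,  # the dag object from the earlier sketch
)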
31. ETL #2: Data Transformation
Repartitions a day's worth of data into hourly partitions.
Historical issues:
1. Data ending up in a single partition (the default Hive partition).
2. Wrong ordering of values in fields.
Checks employed (the first is sketched below):
1. The number of partitions created is 24 (one for every hour).
2. Check the value of the critical field, "source".
How the checks have helped us: verify and rectify the repartitioning logic.
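A sketch of the first check, reusing QuboleValueCheckOperator and the dag object from the earlier sketches. A static expected value suffices here, since the number of hourly partitions is known up front; the query and table name are illustrative:

check_partitions = QuboleValueCheckOperator(
    task_id="check_hourly_partitions",
    command_type="hivecmd",
    query="SELECT COUNT(DISTINCT hour) FROM warehouse.events_hourly "
          "WHERE dt = '{{ ds }}'",
    pass_value=24,  # one partition for every hour of the day
    dag=dag,
)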
32. ETL #3: Cost Computation
Computes Qubole Compute Unit Hours (QCUH).
Situation: We are narrowing down the granularity of cost computation from daily to hourly.
How have the checks helped? They monitor the new data and raise an alarm in case of a mismatch between the trends of the old and new data.
33. ETL #4: Data Transformation
Parses customer queries and outputs table usage information.
Historical issues:
1. Data missing for a customer account.
2. Data loss due to different syntaxes across engines.
3. Data loss due to query syntax changes across different versions of the data engines.
Checks employed (the first is sketched below):
1. Group by account ids; if any count is 0, raise an alert.
2. Group by engine type and account id; if the error percentage is high, raise an alert.
How the checks have helped us:
- Insights into the amount of data lost.
- Feedback that helped us make the syntax checking more robust.
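A sketch of the first check, reusing QuboleCheckOperator and the dag object from the earlier sketches: a group-by query combined with a results_parser_callable that keeps only the counts. It assumes the check fails when any parsed record is falsy; the query and names are illustrative:

def counts_only(row_list):
    # keep only the count column; a 0 for any account then fails the check
    return [int(row.split("\t")[1]) for row in row_list]

check_accounts = QuboleCheckOperator(
    task_id="check_account_coverage",
    command_type="hivecmd",
    query="SELECT account_id, COUNT(*) FROM warehouse.table_usage "
          "WHERE dt = '{{ ds }}' GROUP BY account_id",
    results_parser_callable=counts_only,
    dag=dag,
)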
34. FEATURES
❖ Ability to plug in different alerting mechanisms.
❖ Dependency management and failure handling.
❖ Ability to parse the output of the assert query in a user-defined manner.
❖ Run-time fetching of the pass_value against which the comparison is made.
❖ Ability to generate failure/success reports.
35. LESSONS LEARNT
❖ One size doesn't fit all: estimating data trends is a difficult problem.
❖ Delegate the validation task to the ETL itself.
36. Source code has been contributed to Apache Airflow
❖ AIRFLOW-2228: Enhancements in Check operator
❖ AIRFLOW-2213: Adding Qubole Check Operator
37. In data we trust!
THANKS!
Any questions?
You can find us at:
sakshib@qubole.com
sreenathk@qubole.com