Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30,000 employees worldwide. At that scale, small projects can quickly grow into big datasets, and processing and cleaning all that data becomes a challenge. To solve it, we have created a metadata-driven ETL framework for orchestrating data transformations through parametrised SQL scripts. It allows us to create various paths for our data and to version control them easily. Standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula: it has simplified complicated landing/stage/merge processes and made them self-documenting.
But this is only half the battle; we also want to create data products: documented, quality-assured datasets that are intuitive to use. As we move to a CI/CD approach and increase the frequency of deployments, keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and to reduce documentation, model refinement, data quality testing and filtering out bad data to a box-filling exercise. In this talk we discuss our approach to maintaining high quality data products and share examples of how we automate data quality processes.
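To make the Hive-metastore starting point concrete, here is a minimal sketch (our own illustration, not Reckitt's framework; the database and target table names are assumptions) of harvesting table and column metadata into a data dictionary table from a Databricks notebook:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Harvest table and column metadata from the Hive metastore for one database
# and persist it as a data dictionary table that can then be enriched further.
def build_data_dictionary(database, target_table):
    rows = []
    for t in spark.catalog.listTables(database):
        for c in spark.catalog.listColumns(t.name, database):
            rows.append(Row(table_name=t.name,
                            column_name=c.name,
                            data_type=c.dataType,
                            description=c.description))
    spark.createDataFrame(rows).write.mode("overwrite").saveAsTable(target_table)

build_data_dictionary("silver", "governance.data_dictionary")   # hypothetical names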
2. Agenda
§ Who are Reckitt?
§ Project background
§ Project architecture
§ Reducing complexity with a metadata-driven ETL framework
§ Turning a data set into a data product
§ Demo: data quality processes
3. Who are Reckitt?
§ FMCG company with global presence
§ 43000+ employees
§ Wide data landscape
▪ Various systems in every region
▪ 50+ Sales CRMs worldwide
▪ 1000s of sales representatives
▪ Many global & local reporting platforms
▪ 100s of data lakes
(Slide panels: Our Brands; Reckitt in numbers)
4. Karol Sawicz
§ IT Business Analyst at RB
§ Builds end-to-end reporting solutions
§ Manages the rollout of global reporting platforms
§ B.Eng. in Computer Science from the Polish-Japanese Academy of Information Technology
§ Builds electric bikes in his spare time
karol.sawicz@rb.com
5. Project background – Sales Execution Reporting
▪ Goal: to enable a solid reporting base and analytics capability for Pharmacy & Medical data globally
• Challenges
▪ Data siloed locally
▪ Analysts work on the basics and don't reuse work across regions
▪ New sales CRMs render existing reports obsolete
• Harmonization goals
▪ Automate data ingestion
▪ Make data mapping & cleansing easy
▪ Sustainably check data quality
▪ No need to rebuild for every new CRM
• Project deliverables
▪ Standardized data from many systems
▪ Clean data for analysis
▪ Tools to maintain data mapping and quality checking
• Next level analytics
▪ Enabling future data science projects by having reliable datasets
▪ Encouragement of ad-hoc analysis
▪ Cross-dataset analysis
6. Project architecture
Built on the Azure platform and Databricks
Bronze
• Separate common archive environment
• Source for all DEV/QA/PROD environments
• Configuration-driven ingestion from Salesforce
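The ingestion code itself is not shown in the slides; as a rough illustration of what configuration-driven ingestion can look like (the paths, formats and table names below are made up), a single metadata row can drive a generic Spark load into the bronze layer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: one configuration row describes a source, and a
# generic loader lands it in the bronze layer without source-specific code.
config = {
    "path": "abfss://archive@accountname.dfs.core.windows.net/salesforce/accounts/",
    "format": "parquet",
    "options": {"mergeSchema": "true"},
    "target": "bronze.salesforce_accounts",
}

df = (spark.read.format(config["format"])
      .options(**config["options"])
      .load(config["path"]))
df.write.mode("append").saveAsTable(config["target"])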
7. Project architecture
Built on the Azure platform and Databricks
Silver
• Metadata-driven ETL process
• Data harmonized into a single data model
• Local-to-global value mapping done during the pipeline
• Data quality checks performed using a rule-based data quality framework
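The rule-based framework itself is covered in the demo; as a minimal sketch of the idea (the rule names, tables and predicates below are hypothetical), each rule can be a SQL predicate evaluated against a silver table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sketch of rule-based data quality checks: each rule is a SQL
# predicate, and rows violating the predicate are counted per rule.
rules = [
    ("sales_order_has_key", "silver.sales_orders", "order_id IS NOT NULL"),
    ("quantity_not_negative", "silver.sales_orders", "quantity >= 0"),
]

for rule_name, table, predicate in rules:
    failed = spark.sql(
        f"SELECT COUNT(*) AS failed FROM {table} WHERE NOT ({predicate})"
    ).collect()[0]["failed"]
    print(rule_name, "failed rows:", failed)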
8. Project architecture
Built on the Azure platform and Databricks
Gold
• Materialized data with no downtime
• Bad quality data is filtered out through views on top of silver data
• Reporting views are defined on this layer
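As an illustration of this pattern (the table, view and record-flagging names below are assumptions, not the production model), a view over silver can exclude rows flagged by the quality checks, and the result can then be materialised into gold:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical illustration: a view over silver that filters out rows flagged
# by the data quality checks, then materialised into the gold layer.
spark.sql("""
    CREATE OR REPLACE VIEW silver.sales_orders_clean AS
    SELECT *
    FROM silver.sales_orders
    WHERE order_id NOT IN (SELECT order_id
                           FROM dq.failed_records
                           WHERE table_name = 'silver.sales_orders')
""")

spark.table("silver.sales_orders_clean") \
     .write.mode("overwrite") \
     .saveAsTable("gold.sales_orders")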
9. Richard Chadwick
§ Data Engineering consultant at Cervello, a Kearney Company
§ Services the end-to-end data journey for Sales Execution: archiving, ETL, validation and deployment
§ BSc in Mathematics from The University of Edinburgh
§ Previously worked as a professional poker player
rchadwick@mycervello.com
10. Metadata driven ETL framework
Land Data
• Metadata configuration table: file path, file type, partition structure, schema, Spark options
• Land all data as temporary views

land = LandData(config_table)
land.land_data(land_key, p_dict)

SQL Process
• SQL scripts sourced from repository
• Parametrized SQL scripts
• SQL process configuration table
• All transformations applied to temporary views

sql_p = SQLProcess(config_table)
sql_p.run_sql(sql_key, sql_dict, branch)

ETL
• Mix and match any number of land and SQL processes to create an executable end-to-end ETL plan
• Configure branch, partition and SQL dictionary arguments for the entire ETL plan
• Single Databricks notebook supports executing any ETL process

etl = ETL(config_table)
etl.run_etl(etl_key, p_dict, sql_dict, branch)
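The LandData, SQLProcess and ETL classes belong to Reckitt's internal framework and are not shown in full in the talk; purely as a simplified sketch (the class internals and configuration column names such as land_key, file_path and view_name are our assumptions), the landing step could read one row of the metadata configuration table and expose the source as a temporary view:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical, simplified version of the landing step.
class LandData:
    def __init__(self, config_table):
        self.config = spark.table(config_table)

    def land_data(self, land_key, p_dict):
        # Look up the configuration row for this landing key.
        row = self.config.filter(f"land_key = '{land_key}'").first()
        # Substitute runtime parameters (e.g. partition values) into the path.
        path = row["file_path"].format(**p_dict)
        df = (spark.read.format(row["file_type"])
              .options(**json.loads(row["spark_options"] or "{}"))
              .load(path))
        # Downstream SQL processes run against this temporary view.
        df.createOrReplaceTempView(row["view_name"])

land = LandData("etl.land_config")                   # hypothetical config table
land.land_data("sales_orders", {"region": "EMEA"})   # hypothetical key and parameters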
11. Benefits of a metadata driven ETL framework
• Configurations for ETL processes double up as documentation.
• Low-code execution enables a wide range of stakeholders to contribute to ETL processes.
• Reduces the complexity of developing and executing many similar transformation processes.
• Reduces the complexity of orchestration pipelines (Azure Data Factory).
12. Turning a data set into a data product
• Accurate data dictionary.
• Data objects have correct naming conventions, data types and a sensible ordering of columns.
• Local market values conformed to global ones, e.g. translation or category mapping.
• Data quality tested against expectations (see the sketch below).
• Strategy for data that fails data quality tests.
• Adherence to any service-level agreements.
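As an illustration of the expectation and failure-handling points above (the thresholds, table names and quarantine approach below are hypothetical, not Reckitt's production rules), an expectation can be expressed as a predicate plus a tolerated failure rate, with failing rows routed to a quarantine table instead of the data product:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sketch: test a table against an expectation and quarantine
# failing rows so the published data product only contains passing rows.
table = "silver.sales_orders"
predicate = "order_date <= current_date()"   # expectation: no future-dated orders
max_failure_rate = 0.01                      # tolerated share of failing rows

df = spark.table(table)
bad = df.filter(f"NOT ({predicate})")
failure_rate = bad.count() / max(df.count(), 1)

# Strategy for failing data: park it for investigation rather than publish it.
bad.write.mode("append").saveAsTable("quarantine.sales_orders")

if failure_rate > max_failure_rate:
    raise ValueError(f"{table} failed '{predicate}': {failure_rate:.2%} of rows")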