Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30,000 employees worldwide. At that scale, small projects can quickly grow into big datasets, and processing and cleaning all that data becomes a challenge. To solve it, we have created a metadata-driven ETL framework for orchestrating data transformations through parametrised SQL scripts. It allows us to create various paths for our data and to version control them easily. Standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula: it has simplified complicated landing/stage/merge processes and made them self-documenting.
But this is only half the battle; we also want to create data products: documented, quality-assured datasets that are intuitive to use. As we move to a CI/CD approach and increase the frequency of deployments, keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and to reduce documentation, model refinement, data quality testing and filtering out bad data to a box-filling exercise. In this talk we discuss our approach to maintaining high quality data products and share examples of how we automate data quality processes.
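To make the Hive-metastore starting point concrete, here is a minimal sketch (our own illustration, not Reckitt's framework; the database and target table names are assumptions) of harvesting table and column metadata into a data dictionary table from a Databricks notebook:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Harvest table and column metadata from the Hive metastore for one database
# and persist it as a data dictionary table that can then be enriched further.
def build_data_dictionary(database, target_table):
    rows = []
    for t in spark.catalog.listTables(database):
        for c in spark.catalog.listColumns(t.name, database):
            rows.append(Row(table_name=t.name,
                            column_name=c.name,
                            data_type=c.dataType,
                            description=c.description))
    spark.createDataFrame(rows).write.mode("overwrite").saveAsTable(target_table)

build_data_dictionary("silver", "governance.data_dictionary")   # hypothetical names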
2. Agenda
§ Who are Reckitt?
§ Project background
§ Project architecture
§ Reducing complexity with a metadata-driven ETL framework
§ Turning a data set into a data product
§ Demo: data quality processes
3. Who are Reckitt?
§ FMCG company with global presence
§ 43000+ employees
§ Wide data landscape
▪ Various systems in every region
▪ 50+ Sales CRMs worldwide
▪ 1000s of sales representatives
▪ Many global & local reporting platforms
▪ 100s of data lakes
(Slide panels: Our Brands; Reckitt in numbers)
4. Karol Sawicz
§ IT Business Analyst at RB
§ Builds end-to-end reporting solutions
§ Manages the rollout of global reporting platforms
§ B.Eng. in Computer Science from the Polish-Japanese Academy of Information Technology
§ Builds electric bikes in his spare time
karol.sawicz@rb.com
5. Project background – Sales Execution Reporting
▪ Goal: to enable a solid reporting base and analytics capability for Pharmacy & Medical data globally
• Challenges
▪ Data siloed locally
▪ Analysts work on the basics and don't reuse work across regions
▪ New sales CRMs render existing reports obsolete
• Harmonization goals
▪ Automate data ingestion
▪ Make data mapping & cleansing easy
▪ Sustainably check data quality
▪ No need to rebuild for every new CRM
• Project deliverables
▪ Standardized data from many systems
▪ Clean data for analysis
▪ Tools to maintain data mapping and quality checking
• Next level analytics
▪ Enabling future data science projects by having reliable datasets
▪ Encouragement of ad-hoc analysis
▪ Cross-dataset analysis
6. Project architecture
Built on the Azure platform and Databricks
Bronze
• Separate common archive environment
• Source for all DEV/QA/PROD environments
• Configuration-driven ingestion from Salesforce
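The ingestion code itself is not shown in the slides; as a rough illustration of what configuration-driven ingestion can look like (the paths, formats and table names below are made up), a single metadata row can drive a generic Spark load into the bronze layer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: one configuration row describes a source, and a
# generic loader lands it in the bronze layer without source-specific code.
config = {
    "path": "abfss://archive@accountname.dfs.core.windows.net/salesforce/accounts/",
    "format": "parquet",
    "options": {"mergeSchema": "true"},
    "target": "bronze.salesforce_accounts",
}

df = (spark.read.format(config["format"])
      .options(**config["options"])
      .load(config["path"]))
df.write.mode("append").saveAsTable(config["target"])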
7. Project architecture
Built on the Azure platform and Databricks
Silver
• Metadata-driven ETL process
• Data harmonized into a single data model
• Local-to-global value mapping done during the pipeline
• Data quality checks performed using a rule-based data quality framework
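The rule-based framework itself is covered in the demo; as a minimal sketch of the idea (the rule names, tables and predicates below are hypothetical), each rule can be a SQL predicate evaluated against a silver table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sketch of rule-based data quality checks: each rule is a SQL
# predicate, and rows violating the predicate are counted per rule.
rules = [
    ("sales_order_has_key", "silver.sales_orders", "order_id IS NOT NULL"),
    ("quantity_not_negative", "silver.sales_orders", "quantity >= 0"),
]

for rule_name, table, predicate in rules:
    failed = spark.sql(
        f"SELECT COUNT(*) AS failed FROM {table} WHERE NOT ({predicate})"
    ).collect()[0]["failed"]
    print(rule_name, "failed rows:", failed)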
8. Project architecture
Built on the Azure platform and Databricks
Gold
• Materialized data with no downtime
• Bad quality data is filtered out through views on top of silver data
• Reporting views are defined on this layer
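As an illustration of this pattern (the table, view and record-flagging names below are assumptions, not the production model), a view over silver can exclude rows flagged by the quality checks, and the result can then be materialised into gold:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical illustration: a view over silver that filters out rows flagged
# by the data quality checks, then materialised into the gold layer.
spark.sql("""
    CREATE OR REPLACE VIEW silver.sales_orders_clean AS
    SELECT *
    FROM silver.sales_orders
    WHERE order_id NOT IN (SELECT order_id
                           FROM dq.failed_records
                           WHERE table_name = 'silver.sales_orders')
""")

spark.table("silver.sales_orders_clean") \
     .write.mode("overwrite") \
     .saveAsTable("gold.sales_orders")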
9. Richard Chadwick
§ Data Engineering consultant at Cervello, a Kearney Company
§ Services the end-to-end data journey for Sales Execution: archiving, ETL, validation and deployment
§ BSc in Mathematics from The University of Edinburgh
§ Previously worked as a professional poker player
rchadwick@mycervello.com
10. Metadata driven ETL framework
Land Data
• Metadata configuration table: file path, file type, partition structure, schema, Spark options
• Land all data as temporary views

land = LandData(config_table)
land.land_data(land_key, p_dict)

SQL Process
• SQL scripts sourced from repository
• Parametrized SQL scripts
• SQL process configuration table
• All transformations applied to temporary views

sql_p = SQLProcess(config_table)
sql_p.run_sql(sql_key, sql_dict, branch)

ETL
• Mix and match any number of land and SQL processes to create an executable end-to-end ETL plan
• Configure branch, partition and SQL dictionary arguments for the entire ETL plan
• Single Databricks notebook supports executing any ETL process

etl = ETL(config_table)
etl.run_etl(etl_key, p_dict, sql_dict, branch)
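The LandData, SQLProcess and ETL classes belong to Reckitt's internal framework and are not shown in full in the talk; purely as a simplified sketch (the class internals and configuration column names such as land_key, file_path and view_name are our assumptions), the landing step could read one row of the metadata configuration table and expose the source as a temporary view:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical, simplified version of the landing step.
class LandData:
    def __init__(self, config_table):
        self.config = spark.table(config_table)

    def land_data(self, land_key, p_dict):
        # Look up the configuration row for this landing key.
        row = self.config.filter(f"land_key = '{land_key}'").first()
        # Substitute runtime parameters (e.g. partition values) into the path.
        path = row["file_path"].format(**p_dict)
        df = (spark.read.format(row["file_type"])
              .options(**json.loads(row["spark_options"] or "{}"))
              .load(path))
        # Downstream SQL processes run against this temporary view.
        df.createOrReplaceTempView(row["view_name"])

land = LandData("etl.land_config")                   # hypothetical config table
land.land_data("sales_orders", {"region": "EMEA"})   # hypothetical key and parameters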
11. Benefits of a metadata driven ETL framework
• Configurations for ETL processes double up as documentation.
• Low-code execution enables a wide range of stakeholders to contribute to ETL processes.
• Reduces the complexity of developing and executing many similar transformation processes.
• Reduces the complexity of orchestration pipelines (Azure Data Factory).
12. Turning a data set into a data product
• Accurate data dictionary.
• Data objects have correct naming conventions, data types and a sensible ordering of columns.
• Local market values conformed to global ones, e.g. translation or category mapping.
• Data quality tested against expectations (see the sketch below).
• Strategy for data that fails data quality tests.
• Adherence to any service-level agreements.
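As an illustration of the expectation and failure-handling points above (the thresholds, table names and quarantine approach below are hypothetical, not Reckitt's production rules), an expectation can be expressed as a predicate plus a tolerated failure rate, with failing rows routed to a quarantine table instead of the data product:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sketch: test a table against an expectation and quarantine
# failing rows so the published data product only contains passing rows.
table = "silver.sales_orders"
predicate = "order_date <= current_date()"   # expectation: no future-dated orders
max_failure_rate = 0.01                      # tolerated share of failing rows

df = spark.table(table)
bad = df.filter(f"NOT ({predicate})")
failure_rate = bad.count() / max(df.count(), 1)

# Strategy for failing data: park it for investigation rather than publish it.
bad.write.mode("append").saveAsTable("quarantine.sales_orders")

if failure_rate > max_failure_rate:
    raise ValueError(f"{table} failed '{predicate}': {failure_rate:.2%} of rows")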