SlideShare uma empresa Scribd logo
1 de 14
Baixar para ler offline
Automating Data
Quality Processes at
Reckitt
Karol Sawicz
IT Business Analyst
Richard Chadwick
Data Engineer
Agenda
§ Who are Reckitt?
§ Project background
§ Project architecture
§ Reducing complexity with a
metadata driven ETL framework
§ Turning a data set into a data
product
§ Demo Data Quality Processes
Who are Reckitt?
§ FMCG company with global presence
§ 43000+ employees
§ Wide data landscape
▪ Various systems in every region
▪ 50+ Sales CRMs worldwide
▪ 1000s of sales representatives
▪ Many global & local reporting platforms
▪ 100s of data lakes
Our Brands
Reckitt in numbers
Karol Sawicz
§ IT Business Analyst at RB
§ Build end to end reporting solutions
§ Manage rollout of global reporting
platforms
§ B.Eng in Computer Science from the
Polish-Japanese Academy of
Information Technology
§ Builds electric bikes in spare time
• Add head shot
karol.sawicz@rb.com
Project background – Sales Execution Reporting
▪ Automate data
ingestion
▪ Make data mapping
& cleansing easy
▪ Sustainably check
data quality
▪ No need to rebuild
for every new CRM
• Harmonization goals
• Challenges
▪ Standardized data
from many systems
▪ Clean data for
analysis
▪ Tools to maintain
data mapping and
quality checking
• Project deliverables
▪ Enabling future
data science
projects by having
reliable datasets
▪ Encouragement of
ad-hoc analysis
▪ Cross dataset
analysis
• Next level analytics
▪ To enable a solid reporting base and analytics capability for Pharmacy & Medical data globally
▪ Data siloed locally
▪ Analysts work on
basics and don’t
reuse across region
▪ New sales CRMs
render existing
reports obsolete
Project architecture
Build on azure platform and databricks
Bronze
• Separate common archive enviroment
• Source to all DEV/QA/PROD environments
• Configuration driven ingestion from salesforce
Project architecture
Build on azure platform and databricks
Silver
• Metadata driven ETL process
• Data harmonized into a single data model
• Local to global value mapping done during the pipeline
• Data quality checks performed using rule-based data quality framework
Project architecture
Build on azure platform and databricks
Gold
• Materialized data with no downtime
• Bad quality data is filtered out through views on top of silver data
• Reporting views are defined on this layer
Richard Chadwick
§ Data Engineering consultant at
Cervello, a Kearney Company.
§ Service end to end data journey for
Sales Execution: archiving, ETL,
validation and deployment.
§ BSc in Mathematics, The University of
Edinburgh.
§ Previously worked as professional
poker player.
rchadwick@mycervello.com
Metadata driven ETL framework
▪ Metadata configuration
table:
▪ File path
▪ File type
▪ Partition structure
▪ Schema
▪ Spark options
▪ Land all data as temporary
views
• SQL scripts sourced from
repository
• Parametrized SQL scripts
• SQL process
configuration table
• All transformations
applied to temporary
views
SQL Process
Land Data
• Mix and match any number
of land and SQL processes
to create an executable end
to end ETL plan
• Configure branch, partition
and SQL dictionary
arguments for entire ETL
plan
• Single Databricks notebook
supports executing any ETL
process
ETL
land = LandData(config_table)
land.land_data(land_key,
p_dict)
sql_p = SQLProcess(config_table)
sql_p.run_sql(sql_key,
sql_dict,
branch)
etl = ETL(config_table)
etl.run_etl(etl_key,
p_dict,
sql_dict,
branch)
Benefits of a metadata driven ETL framework
• Configurations for ETL processes double up as documentation.
• Low code execution enables wide range of stakeholders to contribute to ETL
processes.
• Reduces the complexity of developing and executing lots of similar transformation
processes.
• Reduces the complexity of orchestration pipelines (Azure Data Factory).
Turning a data set into a data product
• Accurate data dictionary.
• Data objects have correct naming conventions, data types and a sensible ordering to
columns.
• Local market values conformed to global ones e.g. translation or category mapping.
• Data quality tested against expectations.
• Strategy for data that fails data quality tests.
• Adhere to any service-level agreements.
Demo
github.com/richchad/data_quality_databricks
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Mais conteúdo relacionado

Mais procurados

Where to Begin? Application Portfolio Migration
Where to Begin? Application Portfolio MigrationWhere to Begin? Application Portfolio Migration
Where to Begin? Application Portfolio MigrationAmazon Web Services
 
Data Migration Steps PowerPoint Presentation Slides
Data Migration Steps PowerPoint Presentation Slides Data Migration Steps PowerPoint Presentation Slides
Data Migration Steps PowerPoint Presentation Slides SlideTeam
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
AWS Cloud Center Excellence Quick Start Prescriptive Guidance
AWS Cloud Center Excellence Quick Start Prescriptive GuidanceAWS Cloud Center Excellence Quick Start Prescriptive Guidance
AWS Cloud Center Excellence Quick Start Prescriptive GuidanceTom Laszewski
 
Cloud Center of Excellence - Datasheet
Cloud Center of Excellence - DatasheetCloud Center of Excellence - Datasheet
Cloud Center of Excellence - DatasheetTodd Erskine
 
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...Amazon Web Services
 
AWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and WorkshopsAWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and WorkshopsTom Laszewski
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersAmazon Web Services
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfChris Bingham
 
Microsoft Azure Cost Optimization and improve efficiency
Microsoft Azure Cost Optimization and improve efficiencyMicrosoft Azure Cost Optimization and improve efficiency
Microsoft Azure Cost Optimization and improve efficiencyKushan Lahiru Perera
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
A Practical Guide to Cloud Migration
A Practical Guide to Cloud MigrationA Practical Guide to Cloud Migration
A Practical Guide to Cloud MigrationAlaina Carter
 
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Timothy McAliley
 

Mais procurados (20)

Where to Begin? Application Portfolio Migration
Where to Begin? Application Portfolio MigrationWhere to Begin? Application Portfolio Migration
Where to Begin? Application Portfolio Migration
 
Big Data and Analytics on AWS
Big Data and Analytics on AWS Big Data and Analytics on AWS
Big Data and Analytics on AWS
 
Data Migration Steps PowerPoint Presentation Slides
Data Migration Steps PowerPoint Presentation Slides Data Migration Steps PowerPoint Presentation Slides
Data Migration Steps PowerPoint Presentation Slides
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
AWS Cloud Center Excellence Quick Start Prescriptive Guidance
AWS Cloud Center Excellence Quick Start Prescriptive GuidanceAWS Cloud Center Excellence Quick Start Prescriptive Guidance
AWS Cloud Center Excellence Quick Start Prescriptive Guidance
 
Cloud Center of Excellence - Datasheet
Cloud Center of Excellence - DatasheetCloud Center of Excellence - Datasheet
Cloud Center of Excellence - Datasheet
 
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
 
AWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and WorkshopsAWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and Workshops
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for Partners
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
 
Microsoft Azure Cost Optimization and improve efficiency
Microsoft Azure Cost Optimization and improve efficiencyMicrosoft Azure Cost Optimization and improve efficiency
Microsoft Azure Cost Optimization and improve efficiency
 
How to Streamline DataOps on AWS
How to Streamline DataOps on AWSHow to Streamline DataOps on AWS
How to Streamline DataOps on AWS
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Application Portfolio Migration
Application Portfolio MigrationApplication Portfolio Migration
Application Portfolio Migration
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Migrating to the Cloud
Migrating to the CloudMigrating to the Cloud
Migrating to the Cloud
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
A Practical Guide to Cloud Migration
A Practical Guide to Cloud MigrationA Practical Guide to Cloud Migration
A Practical Guide to Cloud Migration
 
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
 

Semelhante a Automating Data Quality Processes at Reckitt

Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Databricks
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updatedSukanta Saha
 
Resume april updated
Resume april updatedResume april updated
Resume april updatedSukanta Saha
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updatedSukanta Saha
 
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cIntegrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cEdelweiss Kammermann
 
Preparing Your Legacy Data for Automation in S1000D
Preparing Your Legacy Data for Automation in S1000DPreparing Your Legacy Data for Automation in S1000D
Preparing Your Legacy Data for Automation in S1000Ddclsocialmedia
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfIlham31574
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax
 
SQL Server 2008 Migration
SQL Server 2008 MigrationSQL Server 2008 Migration
SQL Server 2008 MigrationMark Ginnebaugh
 
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...Marcus Vinicius Miguel Pedro
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeTorsten Steinbach
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...HostedbyConfluent
 

Semelhante a Automating Data Quality Processes at Reckitt (20)

Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Resume 11 2015
Resume 11 2015Resume 11 2015
Resume 11 2015
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updated
 
Resume april updated
Resume april updatedResume april updated
Resume april updated
 
Resume
ResumeResume
Resume
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updated
 
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cIntegrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
 
Preparing Your Legacy Data for Automation in S1000D
Preparing Your Legacy Data for Automation in S1000DPreparing Your Legacy Data for Automation in S1000D
Preparing Your Legacy Data for Automation in S1000D
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
rakesh_resume
rakesh_resumerakesh_resume
rakesh_resume
 
Hegazy_Mon_EG_v2
Hegazy_Mon_EG_v2 Hegazy_Mon_EG_v2
Hegazy_Mon_EG_v2
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
SQL Server 2008 Migration
SQL Server 2008 MigrationSQL Server 2008 Migration
SQL Server 2008 Migration
 
Sandeep Grandhi (1)
Sandeep Grandhi (1)Sandeep Grandhi (1)
Sandeep Grandhi (1)
 
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...
2019 - GUOB Tech Day / Groundbreakers LAD Tour - Database Migration Methods t...
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Último

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 

Último (20)

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 

Automating Data Quality Processes at Reckitt

  • 1. Automating Data Quality Processes at Reckitt Karol Sawicz IT Business Analyst Richard Chadwick Data Engineer
  • 2. Agenda § Who are Reckitt? § Project background § Project architecture § Reducing complexity with a metadata driven ETL framework § Turning a data set into a data product § Demo Data Quality Processes
  • 3. Who are Reckitt? § FMCG company with global presence § 43000+ employees § Wide data landscape ▪ Various systems in every region ▪ 50+ Sales CRMs worldwide ▪ 1000s of sales representatives ▪ Many global & local reporting platforms ▪ 100s of data lakes Our Brands Reckitt in numbers
  • 4. Karol Sawicz § IT Business Analyst at RB § Build end to end reporting solutions § Manage rollout of global reporting platforms § B.Eng in Computer Science from the Polish-Japanese Academy of Information Technology § Builds electric bikes in spare time • Add head shot karol.sawicz@rb.com
  • 5. Project background – Sales Execution Reporting ▪ Automate data ingestion ▪ Make data mapping & cleansing easy ▪ Sustainably check data quality ▪ No need to rebuild for every new CRM • Harmonization goals • Challenges ▪ Standardized data from many systems ▪ Clean data for analysis ▪ Tools to maintain data mapping and quality checking • Project deliverables ▪ Enabling future data science projects by having reliable datasets ▪ Encouragement of ad-hoc analysis ▪ Cross dataset analysis • Next level analytics ▪ To enable a solid reporting base and analytics capability for Pharmacy & Medical data globally ▪ Data siloed locally ▪ Analysts work on basics and don’t reuse across region ▪ New sales CRMs render existing reports obsolete
  • 6. Project architecture Build on azure platform and databricks Bronze • Separate common archive enviroment • Source to all DEV/QA/PROD environments • Configuration driven ingestion from salesforce
  • 7. Project architecture Build on azure platform and databricks Silver • Metadata driven ETL process • Data harmonized into a single data model • Local to global value mapping done during the pipeline • Data quality checks performed using rule-based data quality framework
  • 8. Project architecture Build on azure platform and databricks Gold • Materialized data with no downtime • Bad quality data is filtered out through views on top of silver data • Reporting views are defined on this layer
  • 9. Richard Chadwick § Data Engineering consultant at Cervello, a Kearney Company. § Service end to end data journey for Sales Execution: archiving, ETL, validation and deployment. § BSc in Mathematics, The University of Edinburgh. § Previously worked as professional poker player. rchadwick@mycervello.com
  • 10. Metadata driven ETL framework ▪ Metadata configuration table: ▪ File path ▪ File type ▪ Partition structure ▪ Schema ▪ Spark options ▪ Land all data as temporary views • SQL scripts sourced from repository • Parametrized SQL scripts • SQL process configuration table • All transformations applied to temporary views SQL Process Land Data • Mix and match any number of land and SQL processes to create an executable end to end ETL plan • Configure branch, partition and SQL dictionary arguments for entire ETL plan • Single Databricks notebook supports executing any ETL process ETL land = LandData(config_table) land.land_data(land_key, p_dict) sql_p = SQLProcess(config_table) sql_p.run_sql(sql_key, sql_dict, branch) etl = ETL(config_table) etl.run_etl(etl_key, p_dict, sql_dict, branch)
  • 11. Benefits of a metadata driven ETL framework • Configurations for ETL processes double up as documentation. • Low code execution enables wide range of stakeholders to contribute to ETL processes. • Reduces the complexity of developing and executing lots of similar transformation processes. • Reduces the complexity of orchestration pipelines (Azure Data Factory).
  • 12. Turning a data set into a data product • Accurate data dictionary. • Data objects have correct naming conventions, data types and a sensible ordering to columns. • Local market values conformed to global ones e.g. translation or category mapping. • Data quality tested against expectations. • Strategy for data that fails data quality tests. • Adhere to any service-level agreements.
  • 14. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.