Data Cleansing
                    What about quality?




Stefan Urbanek
stefan.urbanek@gmail.com
@Stiivi                                   March 2011
Content

■   Introduction
■   What is data quality?
■   E and T from ETL
■   Summary
http://vestnik.transparency.sk
Brewery
  analytical data streams

        &
      Cubes
online analytical processing




  github/bitbucket: Stiivi
Quality
What is data quality


        ?
Dimensions
■   completeness – data provided
■   accuracy – reflecting real world
■   credibility – regarded as true
■   timeliness – up-to-date
■   consistency – matching facts across datasets
■   integrity – valid references between datasets
completeness
[Chart: completeness over time, 2005-3 to 2010-9; vertical axis from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure
completeness: 55%
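The measure on this slide ("how many % of the field is filled and successfully processed") can be sketched roughly as follows. This is only an illustration: the field name `contract_date`, the sample values and the ISO date format are assumptions, not taken from the original data.

```python
from datetime import date

def parse_date(value):
    """Return a date for 'YYYY-MM-DD' strings, or None if it cannot be processed."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def completeness(records, field, parse):
    """Share of records whose field is filled and successfully processed."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if r.get(field) and parse(r[field]) is not None)
    return ok / len(records)

# Hypothetical sample records; field name and values are made up.
records = [
    {"contract_date": "2009-03-14"},
    {"contract_date": "14.3.2009"},   # filled, but not parseable here
    {"contract_date": ""},            # not filled
    {"contract_date": "2009-05-02"},
]

score = completeness(records, "contract_date", parse_date)
print(f"completeness: {score:.0%}")
```

Counting only values that are both present and parseable is what separates this measure from a plain "not empty" count.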
type 1       type 2


         +
[Chart: completeness over time, 2005-3 to 2010-9; vertical axis from none (0%) to all (100%), higher is better]
how many % of the field is filled and successfully processed?
Quality measure
completeness: 88%
reconstruction: 5€

                     temperature: 32˚C

             accuracy
timeliness
Auto-measurable
■   completeness – easily
■   accuracy – somehow
■   credibility – not-so
■   timeliness – easily
■   consistency – yes
■   integrity – yes
What does that mean:
“high quality data?”


          ?
85%
appropriate for given
     purpose
attach quality report
Quality Measurement
   for accuracy and transparency
■ why measure?
■ when to measure?
■ where to measure?
[Diagram: data processing pipeline]
from source to staging data (since 2009): raw sources → Download → HTML files → Parse → YAML files → Load source → contracts table (staging)
from source to staging data (2005-2008): raw sources 2005-2008 → Download → Large HTML files (one per year) → Pre-process → One HTML per Procurement Document → Parse → YAML files → Load source
from staging to analytical data: contracts table (staging), REGIS (SK organisations), "unknown" suppliers map → Cleanse → staging clean data → Create cube (with analytical model description) → fact table, dimension tables → Create search index → search index, dimension index
keep intermediate results for auditability
[Diagram: the same pipeline, from source to staging data and from staging to analytical data]
insert probes at appropriate places
like unit testing:

1. write probes
2. set data quality indicators
3. pass data through
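The three steps above could look roughly like this in plain Python. It is a sketch under assumptions, not Brewery's API: the `NullProbe` class, the 10% threshold and the sample records are hypothetical.

```python
class NullProbe:
    """Pass-through probe: counts empty values per field while records flow by."""

    def __init__(self, fields):
        self.total = 0
        self.nulls = {f: 0 for f in fields}

    def watch(self, records):
        for record in records:
            self.total += 1
            for f in self.nulls:
                if record.get(f) in (None, ""):
                    self.nulls[f] += 1
            yield record  # the data itself is not changed

    def report(self, threshold):
        """Return {field: (null ratio, 'ok' or 'fail')} for a maximum allowed ratio."""
        result = {}
        for f, count in self.nulls.items():
            ratio = count / self.total if self.total else 0.0
            result[f] = (ratio, "ok" if ratio <= threshold else "fail")
        return result

# 1. write probes
probe = NullProbe(["year", "receiver_ico"])
# 2. set data quality indicators (hypothetical: at most 10% empty values)
THRESHOLD = 0.10
# 3. pass data through
source = [
    {"year": 2009, "receiver_ico": "123"},
    {"year": 2009, "receiver_ico": None},
    {"year": 2010, "receiver_ico": "456"},
    {"year": 2010, "receiver_ico": ""},
]
loaded = list(probe.watch(source))
report = probe.report(THRESHOLD)
for field, (ratio, status) in report.items():
    print(f"{field:<16}{ratio:>8.2%}{status:>8}")
```

Because the probe only observes records and yields them unchanged, it can be placed between any two processing steps without affecting the result.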
[Diagram: YAML directory → coalesce values → PostgreSQL database table; coalesce values → data audit → threshold → formatted printer]
field                            nulls     status   distinct
------------------------------------------------------------
file                             0.00%         ok        100
source_code                      0.00%         ok          6
year                             0.00%         ok          6
donor_code                       0.00%         ok          2
receiver_name                    1.25%       fail      10363
receiver_address                13.29%       fail       9979
receiver_ico                    13.53%       fail       5813
project                          0.01%         ok      28370
program                          0.00%         ok         29
subprogram                      11.60%       fail        177
project_budget                  14.48%       fail       9487
requested_amount                88.73%       fail       1356
received_amount                  9.32%       fail       2179
contract_number                 13.29%       fail      28627
contract_date                   57.88%       fail       1425
source_comment                  99.93%       fail          9
source_id                       89.52%       fail        814
E and T from ETL
     E as Extraction
HTML Documents
Ceci ne sont pas des données ("These are not data")
html
 body
  div id=#page
   div id=#page
    div id=#container
     div id=#main
      div id=#innerMain
       div (anonymous)
        div (anonymous)
         table tbody tr td
          table tbody tr td
           table tbody tr td
            table tr td
             √ value
Now: you parse!
       3 seconds




   *non-technical explanation follows
<SPAN class=podnazov>More information
</SPAN>
?
<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
...

here is a subtitle
and it should be in upper-case:
o
And here is another subtitle:
dkaz na (non-breaking space) projekt

                                              much better

               here is a label: Odkaz na projekt
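A parser has to undo this presentation markup to recover the label. The sketch below uses Python's standard `html.parser`; the `LabelExtractor` class is hypothetical, and the closing `</SPAN>` of the second span is an assumption, since the slide elides the rest of the markup.

```python
from html.parser import HTMLParser

class LabelExtractor(HTMLParser):
    """Joins the text of consecutive <span> elements, honouring uppercase styling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.stack = []  # one flag per open <span>: is it styled uppercase?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower()
            self.stack.append("text-transform: uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            return  # ignore whitespace between the spans
        text = data.replace("\xa0", " ")  # non-breaking space to plain space
        if self.stack[-1]:
            text = text.upper()
        self.parts.append(text.strip("\r\n"))

    def label(self):
        return "".join(self.parts).strip()

# Markup from the slide; the final </SPAN> is assumed.
SNIPPET = '''<SPAN class=podnazov
  style="TEXT-TRANSFORM: uppercase">o
</SPAN>
<SPAN class=podnazov>dkaz na&nbsp;projekt
</SPAN>'''

extractor = LabelExtractor()
extractor.feed(SNIPPET)
extractor.close()
print(extractor.label())
```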
“Structured”
spreadsheets


          error prone
          more work needed
✓ structured file format
[Screenshot: spreadsheet with numbered callouts 1-5]
    (1) image & title
    (2) repeating groups of columns
    (3) padding rows/columns
    (4) removed redundancy for readability
    (5) colored cells
[Screenshot: spreadsheet with numbered callouts 1-3]
(1) header with row padding
(2) multi-row logical cell
(3) broken pattern
[Screenshot: spreadsheet with numbered callouts 1-2]
(1) multi-row cell
(2) more values in a row
why?
[Diagram: source → file format parser → data extraction → raw data (id, item, class, amount)]
why not?
[Diagram: “structured” file → raw data]
E and T from ETL
    T as Transformation
Basic pattern
 slightly more technical
[Diagram: basic pattern with the elements source, lists and maps, + (append), diff, target]
SELECT ...
EXCEPT
SELECT ...

      *in PostgreSQL, not in MySQL (as of 2011; MySQL added EXCEPT in 8.0.31)
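A small illustration of the `EXCEPT` diff, run against an in-memory SQLite database as a stand-in because it ships with Python and also supports `EXCEPT`. The table names come from the slides; the column names `supplier_ico` and `ico` and all values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sta_vvo_vysledky (supplier_ico TEXT);  -- hypothetical column
CREATE TABLE sta_regis (ico TEXT);                  -- hypothetical column
INSERT INTO sta_vvo_vysledky VALUES ('111'), ('222'), ('333'), ('222');
INSERT INTO sta_regis VALUES ('111'), ('333');
""")

# Suppliers that appear in the results but not in the register.
rows = conn.execute("""
SELECT supplier_ico FROM sta_vvo_vysledky
EXCEPT
SELECT ico FROM sta_regis
""").fetchall()

unknown = [row[0] for row in rows]
print(unknown)
conn.close()
```

`EXCEPT` works on sets of rows, so the duplicated value in the first table appears only once in the result.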
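[Diagram: finding new suppliers in three numbered steps]
(1) unknown suppliers: sta_vvo_vysledky - sta_regis - map_suppliers
(2) tmp_coalesced_suppliers_sk: select (? Slovensko), append (+)
(3) new suppliers: diff (-) against sta_suppliers, append (+)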
Script or manual?

script
■ recurrent processing (weekly, monthly,...)
■ huge amount of data

manual
■ one-time processing
■ small amount of data
appropriate tool
 for given task
balance
[Diagram: the same pipeline, from source to staging data and from staging to analytical data]
Brewery
 data streams
[Diagram: Data Sources (CSV file, Google Spreadsheet, remote Excel Spreadsheet via URL) → data stream processing → Data Targets (relational database, report)]
processing streams
[Diagram: data rows flow from data source to data target; each data row is a list of values]
[Diagram: data records flow from data source to data target; each data record maps the fields id, item, class, amount to values]
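The difference between the two representations can be shown in a few lines of plain Python. The field names are the ones on the slide; the values and the two helper functions are made up for illustration and are not part of Brewery.

```python
fields = ["id", "item", "class", "amount"]

# Rows: values only, their meaning depends on position.
rows = [
    [1, "paper", "office", 120.0],
    [2, "fuel", "transport", 560.5],
]

def rows_to_records(rows, fields):
    """Turn positional rows into records keyed by field name."""
    return [dict(zip(fields, row)) for row in rows]

def records_to_rows(records, fields):
    """Turn records back into rows, in the given field order."""
    return [[record[f] for f in fields] for record in records]

records = rows_to_records(rows, fields)
print(records[0]["item"])
```

Rows are compact and fast to pass along; records are self-describing, which is convenient for sources without a fixed structure.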
Sources
■ CSV file
■ XLS file
■ SQL query
■ mongo DB
■ Google spreadsheet
■ YAML directory
■ row list
■ record list
Targets
■ CSV file
■ SQL table
■ mongo DB
■ YAML directory
■ HTML table
■ formatted printer
■ row list
■ record list
Record Operations
■ append
■ distinct
■ aggregate
■ merge (join)
■ sample
■ select
■ set select
■ data audit
■ numerical statistics*
Field Operations
■ field map
■ text substitute
■ value threshold*
■ derive*
■ string strip
■ consolidate value to type
■ histogram/bin*
■ set to flag*
[Diagram: the node stream built by the code on this slide]

nodes = {
    "source": CSVSourceNode(...),
    "clean": CoalesceValueToTypeNode(),
    "output": DatabaseTableTargetNode(...),
    "audit": AuditNode(...),
    "threshold": ValueThresholdNode(),
    "print": FormattedPrinterNode()
}

connections = [
    ("source", "clean"),
    ("clean", "output"),
    ("clean", "audit"),
    ("audit", "threshold"),
    ("threshold", "print")
]

... # configure nodes here

stream = Stream(nodes, connections)
stream.initialize()
stream.run()
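To show what "nodes plus connections" means mechanically, here is a toy runner in plain Python. It is not Brewery's implementation: `run_stream` and the three node functions are hypothetical, each node is simply a function from a list of input records to output records, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_stream(nodes, connections):
    """Run every node once, in dependency order, and return each node's output."""
    preds = {name: [] for name in nodes}
    for src, dst in connections:
        preds[dst].append(src)
    outputs = {}
    for name in TopologicalSorter(preds).static_order():
        # A node receives the concatenated output of all its predecessors.
        inputs = [record for p in preds[name] for record in outputs[p]]
        outputs[name] = list(nodes[name](inputs))
    return outputs

def source(_inputs):
    # Made-up sample data standing in for a CSV source.
    return [{"name": " Alpha ", "amount": "10"}, {"name": "Beta", "amount": ""}]

def clean(records):
    for r in records:
        yield {"name": r["name"].strip(),
               "amount": int(r["amount"]) if r["amount"] else None}

def audit(records):
    total = len(records)
    for field in ("name", "amount"):
        nulls = sum(1 for r in records if r[field] is None)
        yield {"field": field, "nulls": nulls / total}

nodes = {"source": source, "clean": clean, "audit": audit}
connections = [("source", "clean"), ("clean", "audit")]

outputs = run_stream(nodes, connections)
for line in outputs["audit"]:
    print(f"{line['field']:<10}{line['nulls']:.2%}")
```

Keeping the node dictionary separate from the connection list, as on the slide, means the same nodes can be rewired without touching their code.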

Mais conteúdo relacionado

Destaque

Data Quality Best Practices Nbk Auto May 06 2010
Data Quality Best Practices  Nbk Auto May 06 2010Data Quality Best Practices  Nbk Auto May 06 2010
Data Quality Best Practices Nbk Auto May 06 2010
Rami Mansour
 

Destaque (6)

Data Quality Best Practices Nbk Auto May 06 2010
Data Quality Best Practices  Nbk Auto May 06 2010Data Quality Best Practices  Nbk Auto May 06 2010
Data Quality Best Practices Nbk Auto May 06 2010
 
Data-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity ModelData-Ed: Best Practices with the Data Management Maturity Model
Data-Ed: Best Practices with the Data Management Maturity Model
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Scaling Big Data Cleansing
Scaling Big Data CleansingScaling Big Data Cleansing
Scaling Big Data Cleansing
 
Best Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementBest Practices: Data Admin & Data Management
Best Practices: Data Admin & Data Management
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 

Semelhante a Data Cleansing introduction (for BigClean Prague 2011)

Data Cleansing introduction (for BigClean Prague 2011)

  • 1. Data Cleansing What about quality? Stefan Urbanek stefan.urbanek@gmail.com @Stiivi March 2011
  • 2. Content ■ Introduction ■ What is data quality? ■ E and T from ETL ■ Summary
  • 4. Brewery analytical data streams & Cubes online analytical processing github/bitbucket: Stiivi
  • 6. What is data quality ?
  • 7. Dimensions ■ completeness – data provided ■ accuracy – reflecting real world ■ credibility – regarded as true ■ timeliness – up-to-date ■ consistency – matching facts across datasets ■ integrity – valid references between datasets
  • 9. [Chart: completeness by month, 2005-3 to 2010-9; vertical axis 0% to 100%, from "none" to "all", higher is better] How many % of the field is filled and successfully processed? Quality measure completeness: 55%
  • 10. type 1 type 2 +
  • 11. [Chart: completeness by month, 2005-3 to 2010-9; vertical axis 0% to 100%, from "none" to "all", higher is better] How many % of the field is filled and successfully processed? Quality measure completeness: 88%
  • 12. reconstruction: 5€ temperature: 32˚C accuracy
  • 14. Auto-measurable ■ completeness – easily ■ accuracy – somehow ■ credibility – not-so ■ timeliness – easily ■ consistency – yes ■ integrity – yes
  • 15. What does that mean: “high quality data?” ?
  • 16. 85%
  • 19. Quality Measurement for accuracy and transparency
  • 20. ■ why to measure? ■ when to measure? ■ where to measure?
  • 21. [Diagram: ETL pipeline in three parts. From source to staging data (since 2009): Download, Parse, Load source; data: raw sources, HTML files, YAML files, contracts table (staging). From source to staging data (2005-2008): Download, Pre-process, Parse, Load source; data: large HTML files (one per year), one HTML per procurement document, YAML files. From staging to analytical data: Cleanse, Create cube, Create search index; data: REGIS (SK organisations), suppliers map, "unknown", clean data, fact table, dimension tables, search index.] Keep intermediate results for auditability
  • 22. [Diagram: the ETL pipeline from slide 21] Insert probes at appropriate places
  • 23. like unit testing: 1. write probes 2. set data quality indicators 3. pass data through
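The three steps above can be sketched as a pass-through probe. This is only an illustration: the class name `NullProbe`, the 15% limit and the sample records are assumptions made for the example, not part of the original deck or of Brewery.

```python
class NullProbe:
    """Counts records and empty values of one field as data passes through."""

    def __init__(self, field):
        self.field = field
        self.total = 0
        self.nulls = 0

    def watch(self, records):
        for record in records:
            self.total += 1
            if record.get(self.field) in (None, ""):
                self.nulls += 1
            yield record  # pass the record on unchanged

    @property
    def null_ratio(self):
        return self.nulls / self.total if self.total else 0.0


# 1. write probes
probe = NullProbe("receiver_ico")
# 2. set data quality indicators (hypothetical limit: at most 15% empty)
MAX_NULL_RATIO = 0.15
# 3. pass data through
source = [
    {"receiver_ico": "123"},
    {"receiver_ico": ""},
    {"receiver_ico": "456"},
    {"receiver_ico": "789"},
]
loaded = list(probe.watch(source))
status = "ok" if probe.null_ratio <= MAX_NULL_RATIO else "fail"
print(f"{probe.field}: {probe.null_ratio:.2%} {status}")
```

Because the probe is a generator, it can sit between any two processing steps without changing the data that flows through it.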
  • 24. [Diagram: stream of nodes] YAML directory (source), coalesce values, database table (PostgreSQL), data audit, threshold, formatted printer (format {x:.2%}, shown as 15.00%)
  • 25.
      field                 nulls  status  distinct
      ------------------------------------------------------------
      file                  0.00%  ok           100
      source_code           0.00%  ok             6
      year                  0.00%  ok             6
      donor_code            0.00%  ok             2
      receiver_name         1.25%  fail       10363
      receiver_address     13.29%  fail        9979
      receiver_ico         13.53%  fail        5813
      project               0.01%  ok         28370
      program               0.00%  ok            29
      subprogram           11.60%  fail         177
      project_budget       14.48%  fail        9487
      requested_amount     88.73%  fail        1356
      received_amount       9.32%  fail        2179
      contract_number      13.29%  fail       28627
      contract_date        57.88%  fail        1425
      source_comment       99.93%  fail           9
      source_id            89.52%  fail         814
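A report of this shape could be produced roughly as follows. This is a sketch, not the Brewery audit node: the functions `audit` and `print_report`, the sample rows and the 1% threshold are assumptions for illustration (the slide only shows that 0.01% passes and 1.25% fails).

```python
def audit(records, threshold):
    """Per-field share of empty values, ok/fail status and distinct count."""
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)  # keep first-seen field order
    total = len(records)
    report = []
    for name in fields:
        values = [record.get(name) for record in records]
        filled = [v for v in values if v not in (None, "")]
        nulls = (total - len(filled)) / total if total else 0.0
        status = "ok" if nulls <= threshold else "fail"
        report.append((name, nulls, status, len(set(filled))))
    return report


def print_report(report):
    print(f"{'field':<20}{'nulls':>8}  {'status':<6}{'distinct':>9}")
    print("-" * 45)
    for name, nulls, status, distinct in report:
        print(f"{name:<20}{nulls:>8.2%}  {status:<6}{distinct:>9}")


# Hypothetical sample data
rows = [
    {"year": 2009, "program": "A", "contract_date": "2009-03-01"},
    {"year": 2009, "program": "B", "contract_date": ""},
    {"year": 2010, "program": "A", "contract_date": None},
    {"year": 2010, "program": "A", "contract_date": "2010-07-15"},
]
report = audit(rows, threshold=0.01)
print_report(report)
```

Here the distinct count ignores empty values; whether the original audit counts them is not stated on the slide.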
  • 26. E and T from ETL E as Extraction
  • 28. Ceci ne sont pas des données
  • 29.
  • 30.
  • 31. [DOM path to a single value] html → body → div id=#page → div id=#page → div id=#container → div id=#main → div id=#innerMain → div (anonymous) → div (anonymous) → several levels of nested table / tbody / tr / td → value √
  • 32.
  • 33. Now: you parse! 3 seconds *non-technical explanation follows
  • 35. ?
  • 36. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 37. ?
  • 38. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ...
  • 39. <SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> <SPAN class=podnazov>dkaz na&nbsp;projekt ... What the markup says: "here is a subtitle and it should be in upper-case: o. And here is another subtitle: dkaz na (non-breaking space) projekt". Much better: "here is a label: Odkaz na projekt"
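One way to recover the label from such markup, sketched with Python's standard `html.parser`. The class `SubtitleParser` is hypothetical, the input is a shortened, closed version of the snippet on the slide (the original is truncated), and gluing the fragments without a separator is an assumption that fits this particular source only.

```python
from html.parser import HTMLParser


class SubtitleParser(HTMLParser):
    """Joins text fragments found inside <span> elements into one label."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._upper = []  # stack: is the current span styled upper-case?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").lower()
            self._upper.append("uppercase" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self._upper:
            self._upper.pop()

    def handle_data(self, data):
        if not self._upper:
            return  # text outside the spans, such as whitespace between them
        text = data.replace("\xa0", " ").strip()
        if text:
            self.fragments.append(text.upper() if self._upper[-1] else text)

    @property
    def label(self):
        # Assumption: the fragments are one phrase split only for styling,
        # so they are glued together without a separator.
        return "".join(self.fragments)


html = ('<SPAN class=podnazov style="TEXT-TRANSFORM: uppercase">o </SPAN> '
        '<SPAN class=podnazov>dkaz na&nbsp;projekt</SPAN>')
parser = SubtitleParser()
parser.feed(html)
parser.close()
print(parser.label)
```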
  • 40. “Structured” spreadsheets error prone more work needed
  • 42. 1 2 4 3 5 (1) image & title (2) repeating groups of columns (3) padding rows/columns (4) removed redundancy for readability (5) colored cells
  • 43. 1 2 3 (1) header with row padding (2) multi-row logical cell (3) broken pattern
  • 44. 1 2 (1) multi-row cell (2) more values in a row
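Two of the problems listed on these slides, padding rows and values removed for readability, can be handled with a small amount of code. The sketch below is illustrative only: `clean_sheet`, its `fill_columns` parameter and the sample sheet are assumptions, and real spreadsheets usually need more rules than this.

```python
def clean_sheet(rows, fill_columns):
    """Drop padding rows and restore values blanked out for readability."""
    cleaned = []
    last_seen = {}
    for row in rows:
        if all(cell in (None, "") for cell in row):
            continue  # padding row
        row = list(row)
        for index in fill_columns:
            if row[index] in (None, ""):
                row[index] = last_seen.get(index)  # repeat value from above
            else:
                last_seen[index] = row[index]
        cleaned.append(row)
    return cleaned


# Hypothetical sheet: the ministry name is written only once per group.
sheet = [
    ["Ministry A", "roads", 100],
    ["", "bridges", 50],
    ["", "", ""],
    ["Ministry B", "schools", 70],
]
result = clean_sheet(sheet, fill_columns=[0])
print(result)
```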
  • 45. why? why not? [Diagram: a “structured” file passes through a file format parser and data extraction to become raw data with the columns id, item, class, amount]
  • 46. E and T from ETL T as Transformation
  • 47. Basic pattern slightly more technical
  • 48. source lists and maps ? + target ? diff ? target
  • 49. SELECT ... EXCEPT SELECT ... *in PostgreSQL, not in MySQL
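A small, self-contained illustration of the `SELECT ... EXCEPT SELECT ...` diff. It uses SQLite through Python's `sqlite3` module only so that it runs without a server (SQLite also supports `EXCEPT`); the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_suppliers (name TEXT);
    CREATE TABLE target_suppliers (name TEXT);
    INSERT INTO source_suppliers VALUES ('Alfa'), ('Beta'), ('Gama');
    INSERT INTO target_suppliers VALUES ('Alfa'), ('Gama');
""")

# Rows present in the source but missing from the target: the "diff".
new_suppliers = [row[0] for row in conn.execute("""
    SELECT name FROM source_suppliers
    EXCEPT
    SELECT name FROM target_suppliers
    ORDER BY name
""")]
print(new_suppliers)
conn.close()
```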
  • 50. sta_vvo_vysledky sta_regis - - map_suppliers 1 unknown suppliers ? Slovensko + 2 + tmp_coalesced_suppliers_sk - sta_suppliers + 3 new suppliers
  • 52. Script or manual? Script: ■ recurrent processing (weekly, monthly, ...) ■ huge amount of data. Manual: ■ one-time processing ■ small amount of data
  • 53. appropriate tool for given task
  • 55. [Diagram: the ETL pipeline from slide 21, shown again without a caption]
  • 57. Processing streams. Data sources: CSV file, Google Spreadsheet, Excel Spreadsheet, remote URL. Data stream processing in between. Data targets: relational database, report
  • 58. Rows: data source → data rows (value, value, value, value) for the fields id, item, class, amount → data target. Records: data source → data records (id: value, item: value, class: value, amount: value) → data target
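The difference between the two representations can be shown in a few lines. This is a plain-Python sketch with made-up sample values; the helper names `rows_to_records` and `records_to_rows` are assumptions and not Brewery functions.

```python
# Field names are kept once, separately from the row values.
fields = ["id", "item", "class", "amount"]
rows = [
    [1, "paper", "office", 12.5],
    [2, "fuel", "transport", 80.0],
]


def rows_to_records(fields, rows):
    """Turn value lists into dictionaries keyed by field name."""
    return [dict(zip(fields, row)) for row in rows]


def records_to_rows(fields, records):
    """Turn dictionaries back into value lists in the given field order."""
    return [[record.get(name) for name in fields] for record in records]


records = rows_to_records(fields, rows)
print(records[0])
print(records_to_rows(fields, records) == rows)
```

Rows are compact and suit tabular sources such as CSV or SQL; records carry their field names and suit document-like sources such as YAML or MongoDB.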
  • 59. Sources: CSV file, XLS file, SQL query, MongoDB, Google spreadsheet, YAML directory, row list, record list
  • 60. Targets: CSV file, SQL table, MongoDB, YAML directory, HTML table, formatted printer, row list, record list
  • 61. Record Operations: append, distinct, aggregate, merge (join), sample, select, set select, data audit, numerical statistics*
  • 62. Field Operations: field map, text substitute, value threshold*, derive*, string strip, consolidate value, histogram/bin*, set to flag*, to type
  • 63. + SQL ? <html> SQL
  • 64.
      nodes = {
          "source": CSVSourceNode(...),
          "clean": CoalesceValueToTypeNode(),
          "output": DatabaseTableTargetNode(...),
          "audit": AuditNode(...),
          "threshold": ValueThresholdNode(),
          "print": FormattedPrinterNode()
      }

      connections = [
          ("source", "clean"),
          ("clean", "output"),
          ("clean", "audit"),
          ("audit", "threshold"),
          ("threshold", "print")
      ]

      ... # configure nodes here

      stream = Stream(nodes, connections)
      stream.initialize()
      stream.run()
