Data Quality

  1. 1. Data Quality - Testing. Vijaya Kokkili, Director of Quality, CommerceHub
  2. 2. MEEEE………  Overcome the fear!!! Gardening Adrenaline Junkie
  3. 3. MEEEE………  Still to come…….
  4. 4. Agenda:
      Data Quality
      • World trending towards…
      • Facts about data quality
      • Most common business problems
      • Business benefits
      • What is data quality?
      • Dimensions of data quality
      • Definitions of dimensions
      • Real time situation
      • Measuring data quality
      • Data profiling analysis
      • When and how to conduct data profiling
      Data Quality Testing
      • How to test data quality
      • Data quality test management
      • Data quality testing challenges
      • Data quality testing best practices
  5. 5. • Operating systems • Mobile platforms • Software frameworks • Hardware • Software
  6. 6. Few facts about data quality: ● Cost of poor data quality in US - $600 Billion ● Poor Data/Lack of visibility cited as #1 reason for project cost overruns ● Poor data quality costs the US Economy $3.1 Trillion a year ● Implementing data quality best practices boosts revenue by 66% ● Median Fortune 1000 company could increase revenue by $2.01 Billion if they improved usability of data by 10%
  7. 7. Most common business problems • Billing and payment errors causing negative customer perceptions • Operating expenses are inflated • Regulatory fines are levied due to inaccurate reporting of data to government entities • Customers and revenue are lost due to an inability to track customer interactions or to recognize high-value customers • Disruption of service • Flawed analytics lead to poor tactical and strategic directions • Extra time on IT projects to reconcile data • Delays in deploying new systems
  8. 8. Business benefits: • Customer satisfaction • Strengthens trust and collaboration between trading partners • Increases supply chain efficiencies and cuts costs by reducing errors • Cuts delays at point-of-sale as a result of reduced measurement errors • Increases reliability and efficiency • Ensures better compliance
  9. 9. What is data quality? Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context
  10. 10. Dimensions of data quality: ● Consistency ● Accuracy ● Correctness ● Objectivity ● Timeliness ● Conciseness ● Precision ● Usefulness ● Unambiguity ● Usability ● Completeness ● Relevance ● Reliability ● Amount of data
  11. 11. Definitions of data quality dimensions:
      • Correctness / Accuracy: The degree to which the captured data correctly describes the real-world entity.
      • Consistency: This is about a single version of the truth. Data throughout the enterprise should be in sync.
      • Completeness: The extent to which the expected attributes of data are provided.
      • Timeliness: Getting the right data to the right person at the right time is important for business.
  12. 12. Definitions of data quality dimensions:
      • Correctness / Accuracy: The degree to which the captured data correctly describes the real-world entity.
        o Ability to draw correct conclusions from data
        o Business processes that match reality
      Examples of data accuracy issues:
      • An incident reported as $23M when the loss was $12k
      • The amount invoiced does not represent the customer’s usage
  13. 13. Definitions of data quality dimensions:
      • Consistency: This is about a single version of the truth. Data throughout the enterprise should be in sync.
        o Ability to trust data regardless of source
        o Identical information available to all processes and units
      Example of a data consistency issue:
      • Mr. A defines “reprocessing” as cancel/total and Mr. B as cancel/new.
  14. 14. Definitions of data quality dimensions:
      • Completeness: The extent to which the expected attributes of data are provided.
        o Data that does not leave any open questions
        o Ability to make a good decision based on available data
        o Closeness between “need to know” and what the data tells you
      Examples of data completeness issues:
      • We cannot tell how many cell phone contracts Mr. X has
      • A summary report includes projects that did not report status!
  15. 15. Definitions of data quality dimensions:
      • Timeliness: Getting the right data to the right person at the right time is important for business.
        o Data that is available without delay
        o Ability to know what you need, when you need it
        o Smooth information flow: “Data delayed is data denied!”
      Example of a data timeliness issue:
      • Receiving a “budget exceeded” SMS after you went over the limit!
  16. 16. Real time situation
      Many database professionals face situations like:
      1. Several data inconsistencies in the source, such as missing records or NULL values.
      2. The column chosen as the primary key is not unique throughout the table.
      3. The schema design is not consistent with the end user requirements.
      4. Any other concern with the data that should have been fixed right at the beginning.
  17. 17. What does it mean to fix data quality issues?
      • Making changes in ETL data flow packages, cleaning identified inconsistencies, etc.
      • A lot of rework to be done
      • Added costs in terms of time and effort
      So… what is the solution?
  18. 18. Solution “PREVENTION IS BETTER THAN CURE” Hence data profiling comes to the rescue
  19. 19. Measuring Data Quality
      Profiling: understand metadata
      • A point-in-time profile shows what the data looks like now
      • Automated profiling shows trends
        o Alerts to new/potential issues as they happen
        o Potentially fixes issues in near real time
  20. 20. Statistical process control
      • Automated inspection
      • Visibility shows process deviation
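
One way to picture statistical process control applied to a data quality metric is a simple control chart: derive limits from a baseline period, then flag any later measurement outside them. The sketch below is illustrative only. The metric (daily NULL rate), the sample numbers, and the three-sigma rule are assumptions, not something the slides specify.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline measurements."""
    center = mean(baseline)
    spread = stdev(baseline)
    # A rate cannot be negative, so clamp the lower limit at zero.
    lower = max(0.0, center - sigmas * spread)
    return lower, center, center + sigmas * spread

def out_of_control(values, limits):
    """Return the indexes of values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical daily NULL rates (fraction of rows with a missing value).
baseline_null_rates = [0.020, 0.022, 0.019, 0.021, 0.018, 0.020, 0.023]
limits = control_limits(baseline_null_rates)

# New measurements from an automated daily inspection (also hypothetical).
new_null_rates = [0.021, 0.019, 0.080, 0.020]
flagged_days = out_of_control(new_null_rates, limits)
print(flagged_days)  # the third measurement (index 2) is flagged
```

Running a check like this on a schedule is what turns a one-off profile into the kind of trend monitoring the slide describes.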
  21. 21. Data profiling analysis
      • Duplication
      • Pattern matching
      • Day of week
      • Character set
      • Reference data matching
      • Inter-data set comparisons
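
As an illustration of the pattern-matching analysis, the sketch below reduces each value to a "shape" (letters become `A`, digits become `9`) and counts how often each shape occurs. Values that do not follow the dominant shape are candidates for review. The function names and the sample phone numbers are hypothetical.

```python
import re
from collections import Counter

def value_pattern(value):
    """Map a string to its shape: letters -> A, digits -> 9, other characters kept."""
    shape = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", shape)

def pattern_profile(values):
    """Count how many values share each shape."""
    return Counter(value_pattern(v) for v in values)

# Hypothetical phone numbers captured in different formats.
phone_numbers = ["518-555-0101", "518-555-0102", "(518) 555-0103", "5185550104"]

profile = pattern_profile(phone_numbers)
dominant_pattern, _ = profile.most_common(1)[0]
outliers = [v for v in phone_numbers if value_pattern(v) != dominant_pattern]

print(profile)
print(outliers)
```

The same counting approach extends to the other analyses in the list, for example counting exact values to find duplicates or counting weekdays of a date column.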
  22. 22. Master data management
      Create a standard for data and distribute it so that all sources are uniform:
      • Names
      • Addresses
      • Phone numbers
      • Products
      Can hook into 3rd party sources
  23. 23. Data Governance
      • Central authority for data quality control
      • Applies information collected from data profiling uniformly across the business
      • Communication channels between business and IT groups
  24. 24. Maintenance of data quality
      Data quality results from going through the data, scrubbing it, standardizing it, removing duplicate records, and enriching it.
      1. Maintain complete data
      2. Clean up data by standardizing it using rules
      3. Use algorithms to detect duplicates
      4. Avoid entry of duplicate leads and contacts
      5. Merge existing duplicate records
      6. Use roles for security
  25. 25. Inconsistent data before cleaning up
      Invoice    Bill no  CustomerName       SSN
      Invoice 1  101      Ms Vijaya Kokkili  SSN100123
      Invoice 2  204      Ms V Kokkili       SSN100123
      Invoice 3  354      Ms Kokkili Vijaya  SSN100123
      Invoice 4  467      Ms Vijaya K        SSN100123
  26. 26. Consistent data after cleaning up
      Invoice    Bill no  CustomerName       SSN
      Invoice 1  101      Ms Vijaya Kokkili  SSN100123
      Invoice 2  204      Ms Vijaya Kokkili  SSN100123
      Invoice 3  354      Ms Vijaya Kokkili  SSN100123
      Invoice 4  467      Ms Vijaya Kokkili  SSN100123
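
The before and after invoices can be reproduced with a small standardization step. The sketch below assumes a master record keyed by SSN supplies the canonical name. The slides do not say how the standard name was chosen, so `master_names` and both helper functions are hypothetical.

```python
# Invoice records as shown on the "before cleaning up" slide.
invoices = [
    {"bill_no": 101, "customer_name": "Ms Vijaya Kokkili", "ssn": "SSN100123"},
    {"bill_no": 204, "customer_name": "Ms V Kokkili", "ssn": "SSN100123"},
    {"bill_no": 354, "customer_name": "Ms Kokkili Vijaya", "ssn": "SSN100123"},
    {"bill_no": 467, "customer_name": "Ms Vijaya K", "ssn": "SSN100123"},
]

# Assumed master data: one standard name per SSN.
master_names = {"SSN100123": "Ms Vijaya Kokkili"}

def inconsistent_keys(records):
    """Return the SSNs that appear with more than one customer name."""
    names_by_ssn = {}
    for record in records:
        names_by_ssn.setdefault(record["ssn"], set()).add(record["customer_name"])
    return sorted(ssn for ssn, names in names_by_ssn.items() if len(names) > 1)

def standardize_names(records, master):
    """Replace each name with the master name; keep it unchanged if no master exists."""
    return [
        {**record, "customer_name": master.get(record["ssn"], record["customer_name"])}
        for record in records
    ]

cleaned = standardize_names(invoices, master_names)
print(inconsistent_keys(invoices))  # SSNs with conflicting names before cleanup
print(inconsistent_keys(cleaned))   # none left after cleanup
```

Records without a master entry pass through untouched, so the step is safe to rerun as the master data grows.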
  27. 27. When and how to conduct data profiling?
      Generally, data profiling is conducted in two ways:
      1. Writing SQL queries on sample data extracts put into a database
      2. Using data profiling tools
      When to conduct data profiling? At the discovery/requirements gathering phase.
  28. 28. How to conduct data profiling?
      Data profiling involves statistical analysis of the data at source and the data being loaded, as well as analysis of metadata. These statistics may be used for various analysis purposes. Common examples of analyses are:
      • Data quality: Analyze the quality of data at the data source.
      • NULL values: Look at the number of NULL values in an attribute.
      • Candidate keys: Analyzing the extent to which certain columns are distinct gives the developer useful information for selecting candidate keys.
      • Primary key selection: Check that the candidate key column does not violate the basic requirement of having no NULL values and no duplicate values.
      • Empty string values: A string column may contain NULL or even empty string values that may create problems later.
      • String length: Analyzing the longest, shortest, and average string length of a string-type column helps decide which data type is most suitable for that column.
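
The analyses listed above map onto a handful of SQL aggregates. As a rough sketch of the "SQL queries on a sample extract" approach, the example below loads a few rows into an in-memory SQLite database and profiles one column at a time. The `customers` table, its rows, and the `profile_column` helper are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample extract loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, ""),               # empty string
        (3, None),             # NULL
        (3, "d@example.com"),  # duplicate customer_id
        (4, "ee@example.com"),
    ],
)

def profile_column(conn, table, column):
    """Collect basic profiling statistics for one column.

    Table and column names are interpolated into the SQL text, so only
    trusted identifiers should be passed in.
    """
    query = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               SUM(CASE WHEN {column} = '' THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column}),
               MIN(LENGTH({column})),
               MAX(LENGTH({column})),
               AVG(LENGTH({column}))
        FROM {table}
    """
    names = ["row_count", "null_count", "empty_count", "distinct_count",
             "min_length", "max_length", "avg_length"]
    stats = dict(zip(names, conn.execute(query).fetchone()))
    # A candidate key has no NULLs and no repeated values.
    stats["is_candidate_key"] = (
        stats["null_count"] == 0 and stats["distinct_count"] == stats["row_count"]
    )
    return stats

email_stats = profile_column(conn, "customers", "email")
id_stats = profile_column(conn, "customers", "customer_id")
print(email_stats)
print(id_stats)
```

In this sample, `customer_id` fails the candidate key check because one value repeats, which is exactly the kind of finding that is cheaper to learn during requirements gathering than after the ETL is built.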
  29. 29. How to test for data quality?
      • Row Count: Discrepancy in record counts at source & target
      • Completeness: All data at source is present at target
      • Consistency: Ensure that source & target don’t contain conflicting facts
      • Validity: Degree of conformance of data to its domain and business values
      • Redundancy: Physical and logical duplicates
      • Referential Integrity: Orphan records in targets with no corresponding parent records
      • Domain Integrity: List of valid/invalid values that are allowed, along with ranges, lookups, etc.
      • Accuracy: Degree to which data reflects the real-world objects
      • Usability: Describes the relevance & meaning of data
      • Timeliness: Describes availability of data as per SLA
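
Several of these checks can be expressed as small functions that compare a source extract with the loaded target. The sketch below covers row count, completeness, redundancy, referential integrity, and a simple domain rule. The order records, the parent key set, and the rule that amounts must be non-negative are all made up for the example.

```python
from collections import Counter

# Hypothetical source extract and loaded target rows.
source_orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "amount": 15.5},
]
target_orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},   # physical duplicate
    {"order_id": 4, "customer_id": 99, "amount": -5.0},   # orphan, invalid amount
]
target_customer_ids = {10, 11}  # parent keys present in the target

def missing_in_target(source, target, key):
    """Completeness: source keys that never arrived in the target."""
    target_keys = {row[key] for row in target}
    return sorted(row[key] for row in source if row[key] not in target_keys)

def duplicate_keys(rows, key):
    """Redundancy: key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def orphan_rows(child_rows, foreign_key, parent_keys):
    """Referential integrity: child rows whose parent key does not exist."""
    return [row for row in child_rows if row[foreign_key] not in parent_keys]

def invalid_rows(rows, field, is_valid):
    """Domain integrity: rows whose field fails the given validity rule."""
    return [row for row in rows if not is_valid(row[field])]

report = {
    "row_count_match": len(source_orders) == len(target_orders),
    "missing_in_target": missing_in_target(source_orders, target_orders, "order_id"),
    "duplicate_order_ids": duplicate_keys(target_orders, "order_id"),
    "orphan_order_ids": [
        r["order_id"]
        for r in orphan_rows(target_orders, "customer_id", target_customer_ids)
    ],
    "invalid_amount_order_ids": [
        r["order_id"]
        for r in invalid_rows(target_orders, "amount", lambda amount: amount >= 0)
    ],
}
print(report)
```

Accuracy, usability, and timeliness depend on business context and SLAs, so they are harder to reduce to a generic function and are left out of this sketch.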
  30. 30. Data quality test management
      Test planning. Requirements:
      • BRD
      • FSD
      • Test Plan
      Test design. Requirements:
      • Test scenarios
      • Test cases
      • Automated
      Test execution. Requirements:
      • Executed in test cycles
      • Test results/bugs are shared with business
      • Prioritize
      Test monitoring. Requirements:
      • Collect metrics
      • Observe trend
  31. 31. Data quality testing challenges • Lack of tools • Lack of domain knowledge • Changing requirements • Poor planning for data quality in initial phase of the application
  32. 32. Data quality testing best practices • Understand user business • Plan early in Design and testing phase • Be proactive when it comes to data growth/trending • Don’t assume! Understand data!
  33. 33. Q & A @vkokkili vkokkili@gmail.com

Editor's Notes

  • Today's world is one of heterogeneity. We have different technologies. We operate on different platforms. Large amounts of data are generated every day in all sorts of organizations and enterprises.
  • Fitbit

    Medical

    Life everyday routine
  • Facts about data quality:
    ● Cost of poor data quality in US - $600 Billion
    ● Poor Data/Lack of visibility cited as #1 reason for project cost overruns
    ● Poor data quality costs the US Economy $3.1 Trillion a year
    ● Implementing data quality best practices boosts revenue by 66%
    ● Median Fortune 1000 company could increase revenue by $2.01 Billion if they improved usability of data by 10%

    And we do have problems with data. Problems like: duplicated, inconsistent, ambiguous, incomplete.

    So there is a need to collect data in one place and clean up the data
  • Businesses are increasingly only as good as their data. High quality data is essential for capturing the interest of consumers and driving online sales.
  • Increases customer satisfaction by ensuring the accuracy of product information – ingredients, prices, nutritional information
    Strengthens trust and collaboration between trading partners
    Increases supply chain efficiencies and cuts costs by reducing errors
    Cuts delays at point-of-sale as a result of reduced measurement errors
    Increases the reliability and efficiency of product transportation and delivery to stores and warehouses
    Ensures better compliance with industry standards and regulations
  • Why does data quality matter?

    Good data is your most valuable asset, and bad data can seriously harm your business and credibility: 1. What have you missed? 2. When things go wrong. 3. Making confident decisions.

    Is the data trustworthy and credible information?
  • Accuracy stands for: a good fit between data and reality; the ability to draw correct conclusions from data; business processes that match reality.
    E.g. of data accuracy issues: an incident reported as $23M when the loss was $12k; the amount invoiced does not represent the customer’s usage.

    Consistency stands for: data in harmony across the company; the ability to trust data regardless of source; identical information available to all processes and units.
    E.g.: Mr. A defines “reprocessing” as cancel/total and Mr. B as cancel/new.

    Completeness stands for: data that does not leave any open questions; the ability to make a good decision based on available data; closeness between “need to know” and what the data tells you.
    E.g.: we cannot tell how many cell phone contracts Mr. X has; a summary report includes projects that did not report status!

    Timeliness stands for: data that is available without delay; the ability to know what you need, when you need it; smooth information flow: data delayed is data denied!
  • What is data profiling? It is the process of statistically examining and analyzing the content in a data source, and hence collecting information about the data. It consists of techniques used to analyze the data we have for accuracy and completeness.
    1. Data profiling helps us make a thorough assessment of data quality.
    2. It assists the discovery of anomalies in data.
    3. It helps us understand the content, structure, relationships, etc. of the data in the data source we are analyzing.
    4. It helps us know whether the existing data can be applied to other areas or purposes.
    5. It helps us understand the various issues/challenges we may face in a database project well before the actual work begins. This enables us to make early decisions and act accordingly.
    6. It is also used to assess and validate metadata.
  • It is important for QA to make sure these requirements are provided upfront.
