
Data Quality Analytics: Understanding what is in your data, before using it




Analytics and data science are ever-growing fields, as business decision makers continue to use data to drive decisions. The pinnacle of these fields is the models and their accuracy and fit; but what about the data? Is your data clean, and how do you know that? Our discussion will focus on best practices for data preprocessing for analytic uses, beginning with essential distributional checks of a dataset and moving to a proposed method for an automated data validation process during ETL for transactional data.


1. Data Quality Analytics: Understanding what is in your data, before using it. Scott Murdoch, PhD
2. AGENDA: What is Data Quality Analytics? Why is it needed: the cost of dirty data. Dirty data can cost any company productivity, brand perception, and, most importantly, revenue.
3. AGENDA (continued): Understanding your data. Implementing Data Quality Analytics: 'spot checking' data is no longer effective. Integration with IT.
4. What is Data Quality Analytics? Data Quality Analytics is the use of distributions and modeling techniques to understand the pitfalls within data. • Cost: the opportunity cost of time. • Savings: [None; {% of Revenue, Reputation, Embarrassment}]. • Important: the steps to follow are for preprocessing, not post-validation.
5. Cost of Dirty Data. Making decisions off 'dirty' data carries an estimated cost of $3 trillion per year for the US.1 The healthcare industry alone has an estimated cost of $314 billion from 'dirty' data.1 1http://www.hoovers.com/lc/sales-marketing-education/cost-of-dirty-data.html
6. Understanding your Data. So you are ready? So you think… 1. What is the problem you are trying to solve? 2. What type of data do you have? 3. What do you really know about the data? CAUTION: this is not as easy or straightforward as it seems. PAUSE: in your last data project, what predispositions did you have about the data? Were you right?
7. Identify Key Fields within your Data. Unique key of the dataset: Member, Date of Service, Claim Number, Claim Line. Crucial fields needed for analysis (your dependent variable and theoretical top independents): Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip code. Other fields: Medicare ID, Provider last name, Provider first name.
8. Start with simple metrics for benchmarking. Compute the following metrics for EACH crucial field: • % missing • % zero • top 20 most frequent values • histogram • minimum & maximum.
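As a concrete illustration of these benchmark checks (not from the deck itself), here is a minimal pandas sketch, assuming the claims data is loaded into a DataFrame; the file name and column names are hypothetical:

```python
import pandas as pd

def profile_field(s: pd.Series) -> dict:
    """Simple benchmark metrics for one crucial field."""
    return {
        "pct_missing": s.isna().mean() * 100,
        "pct_zero": (s == 0).mean() * 100,
        "top_20": s.value_counts().head(20).to_dict(),
        "min": s.min(),
        "max": s.max(),
    }

# Hypothetical claims extract; column names are illustrative.
claims = pd.read_csv("claims.csv")
crucial = ["allow_payment", "provider_npi", "covered_amount",
           "cpt_code", "provider_specialty", "member_zip"]
benchmarks = {col: profile_field(claims[col]) for col in crucial}

# Histogram for a numeric crucial field (plotted via pandas/matplotlib).
claims["allow_payment"].plot.hist(bins=50, title="allow_payment")
```

Keeping the `benchmarks` dictionary from a known-good load gives you the baseline to compare future loads against.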
9. Advanced Methods for Tracking Quality. Modeling techniques: regression, clustering, or neural networks, etc. 01 REGRESSION: more setup, dependent variable needed, easier to explain. 02 CLUSTERING: no dependent variable needed, harder to explain. 03 NEURAL NETWORKS: dependent variable needed, less setup, harder to explain.
10. Setting up Advanced Data Quality Methods. Build the best model, based on your choice of goodness-of-fit statistics, using the crucial fields in an OLS regression. Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty, Member Zip code. Dependent variable: Allow Payment $. IMPORTANT: KEEP the coefficients for the future; this is the most important part!
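A minimal sketch of such a fit using statsmodels, assuming the same hypothetical claims DataFrame as above; the encoding choices and file names are assumptions, not the deck's prescription:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical claims extract; Allow Payment $ is the dependent variable.
claims = pd.read_csv("claims.csv")
y = claims["allow_payment"]

# One-hot encode the categorical fields; keep numeric fields as-is.
X = pd.get_dummies(
    claims[["covered_amount", "cpt_code", "provider_specialty", "member_zip"]],
    columns=["cpt_code", "provider_specialty", "member_zip"],
    drop_first=True,
).astype(float)
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print(model.rsquared_adj)  # goodness-of-fit statistic of your choice

# KEEP the coefficients for future validation runs.
model.params.to_csv("ols_coefficients.csv")
```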
11. Setting up Advanced Data Quality Methods. K-means clustering. Fields: Allow Payment, Provider NPI, Covered Amount, CPT Code, Provider Specialty. Try building a 3-dimensional cluster using these fields. How is the fit? Do the groups make sense? IMPORTANT: KEEP the seeds for the future; this is the most important part!
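A sketch of a 3-D k-means fit with scikit-learn, assuming three numeric fields from the hypothetical extract; the specific field choice and the k=3 setting are illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical claims extract; three numeric fields give a 3-D cluster space.
claims = pd.read_csv("claims.csv")
features = claims[["allow_payment", "covered_amount", "provider_npi"]].dropna()

scaled = StandardScaler().fit_transform(features)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(scaled)

print(km.inertia_)  # within-cluster sum of squares: how is the fit?

# KEEP the seeds (fixed random_state and the fitted centers) so future
# runs are comparable to this benchmark.
pd.DataFrame(km.cluster_centers_,
             columns=features.columns).to_csv("kmeans_centers.csv")
```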
12. Integration with IT. So you have checked your data; now what? This requires partnership with Information Technology. Create a marginal error range for the benchmark metrics. Example: for the metric % missing, run a random sampling without replacement using 60% of your sample, 1,000 times; the results from the samples will serve as the acceptable range. As new data comes in, calculate these metrics and compare them to the acceptable range.
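One way this resampling could look in code, a sketch under the same hypothetical-DataFrame assumption (the alerting mechanism would live in your ETL tooling):

```python
import numpy as np
import pandas as pd

def acceptable_range(s: pd.Series, metric, n_iter=1000, frac=0.60, seed=0):
    """Draw 60% of the rows without replacement n_iter times and
    return the min/max of the metric across draws."""
    rng = np.random.default_rng(seed)
    k = int(len(s) * frac)
    vals = [metric(s.iloc[rng.choice(len(s), size=k, replace=False)])
            for _ in range(n_iter)]
    return min(vals), max(vals)

pct_missing = lambda s: s.isna().mean() * 100

claims = pd.read_csv("claims.csv")          # benchmark data (hypothetical)
lo, hi = acceptable_range(claims["allow_payment"], pct_missing)

# As each new batch lands during ETL, compare its metric to the range.
new_batch = pd.read_csv("new_claims.csv")   # hypothetical incoming load
new_val = pct_missing(new_batch["allow_payment"])
if not lo <= new_val <= hi:
    print(f"ALERT: % missing {new_val:.2f} outside [{lo:.2f}, {hi:.2f}]")
```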
13. Integration with IT. Model results are the second, more advanced stage of integration. Use your models as a method of validation in two ways: 1. Run the regression or neural network model using the coefficients from previous data, and compare the predicted fit. 2. Run a new model using the same variables, and calculate the change in the coefficients.
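A rough sketch of both validation modes, assuming the coefficients saved earlier and a numeric-only design matrix for brevity (real claims data would need the same dummy encoding as the benchmark fit; all names are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical new batch plus coefficients saved from the benchmark fit.
new_batch = pd.read_csv("new_claims.csv")
coefs = pd.read_csv("ols_coefficients.csv", index_col=0).squeeze("columns")

# Rebuild the design matrix the same way as the benchmark fit.
X_new = sm.add_constant(new_batch[["covered_amount"]].astype(float))
y_new = new_batch["allow_payment"]

# 1) Score with the OLD coefficients and compare predicted fit.
pred = X_new @ coefs.reindex(X_new.columns)
rmse = ((y_new - pred) ** 2).mean() ** 0.5
print(f"RMSE under benchmark coefficients: {rmse:.2f}")

# 2) Refit on the new data and measure coefficient drift.
refit = sm.OLS(y_new, X_new).fit()
drift = (refit.params - coefs.reindex(X_new.columns)).abs()
print(drift.sort_values(ascending=False))
```

A large jump in either the predicted-fit error or the coefficient drift is the signal that the incoming data no longer resembles the data the benchmark model was built on.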
14. Data Quality Analytics: Understanding what is in your data, before using it. Scott Murdoch, PhD. Questions?
