You've received your data, found important insights for your business and are ready to present the information to your leadership. Could you be making a costly mistake?
In this talk, we discuss the importance of exploratory data analysis, the questions to ask of your data to ensure your results are accurate, and how applying business domain knowledge to your findings helps you avoid insights that are both incorrect and very costly.
Takeaways:
What are the right questions to ask?
Common mistakes that lead to inaccurate results
What to watch out for (correlation vs. causality, spurious results, and more)
Your Instructor: Jeanette Shutay is Senior Director of Advanced Analytics at HAVI, where she leads the Advanced Analytics Center of Excellence. HAVI is a leading organization whose services provide insights and solutions for the world's largest foodservice brands.
Common Data Driven Mistakes with HAVI's Sr. Director of Advanced Analytics
1. Common Data Driven Mistakes
Promotable Presentation
Jeanette Shutay, PhD
Senior Director, Advanced Analytics
February 5, 2020
2. HAVI | Confidential & Proprietary | 2/13/2020 | 2
Current Professional Contributions
• Advanced Analytics Center of Excellence Lead at HAVI
• Adjunct Professor for NCU School of Technology
Academic Preparation
• BA in Psychology
• MA in Developmental Psychology
• PhD in Research Methodology
• Student in AIP program
• Student to start in GIS program
Personal Interests
• Family & pets
• Volunteering
• I love animals!!
• Jogging & yoga
• Sports
My son Brett (16)
My son Brendon (12)
3. Data Solutions Lifecycle
Key Concepts
• Stakeholders
• Value proposition
• Data quality & characteristics
• Interaction effects / complex relationships
• Getting to causality
• Constraints
• Scaling
4. Defining the problem
- Have you correctly and thoroughly defined the problem?
• Engage domain experts early and maintain continual engagement
- Estimate and consider the value proposition associated with solving the problem
- Identify the key performance indicators and any drivers of interest
- Operationally define all variables and indicators to be measured or observed
• Start with a priori hypotheses based on subject matter expertise and industry/academic literature
- Brainstorm with key stakeholders & involve people with diverse backgrounds & views
- Generate hypotheses to test prior to specifying data requirements
- Review and align on all assumptions
• Document business requirements
5. Specifying the data requirements
- Do your data meet all requirements?
• Granularity & Cadence
- Do you need daily level data, location level data, etc.?
- When are decisions made? Every day, every week?
• Representativeness & Fidelity
- How generalizable are the cases you are studying to the problem as a whole?
• If doing a POC or small pilot, do you have a representative set of cases?
• Are you working with cases that have a high probability of treatment fidelity?
- Example: If testing a new customer experience program, are the stores that you are using as part of the POC going to implement the program as intended? A low-fidelity situation can do more harm than not testing at all.
6. Data preparation
- Normalizing Variables
• In many cases, you will need to standardize your variables before analysis
- Using z-scores is a good way to avoid data mistakes in modeling
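The z-score standardization mentioned above can be sketched with the standard library (an illustrative sketch; the variable names and sample figures are my own, not from the talk):

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# hypothetical weekly sales figures
weekly_sales = [120, 150, 130, 170, 180]
standardized = z_scores(weekly_sales)
```

Standardizing puts variables measured on different scales on a common footing, so no single variable dominates distance- or gradient-based models simply because of its units.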
- Identifying and Managing Anomalies or Outliers
• Sometimes anomalies are what you are interested in
• When anomalies or outliers are problematic, consider dropping those cases or imputing, but watch out for errors in this approach
- Example: Some values may appear as outliers in time series data with high seasonality
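One hedged sketch of the seasonality caveat: compare each point to other points at the same position in the seasonal cycle rather than to a global mean, so that ordinary seasonal peaks are not flagged. The period and threshold below are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean, stdev

def seasonal_outliers(series, period=7, threshold=3.0):
    """Flag points that are extreme relative to their own seasonal slot.

    A raw z-score against the global mean would flag every peak of a
    strongly seasonal series; grouping by position in the cycle avoids that.
    """
    slots = defaultdict(list)
    for i, value in enumerate(series):
        slots[i % period].append(value)
    flags = []
    for i, value in enumerate(series):
        group = slots[i % period]
        m, s = mean(group), stdev(group)
        flags.append(s > 0 and abs(value - m) / s > threshold)
    return flags
```

In practice a robust center and spread (median and MAD) or a proper seasonal decomposition would be less sensitive to the outlier inflating its own group's statistics; this sketch only illustrates the idea.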
- Model assumptions
• Ensure that the characteristics of your data, and the problem you are trying to solve, align with the model you are implementing
7. Data exploration
- Go beyond univariate exploratory data analysis (EDA)
• Explore interaction effects
[Charts: the same feeder data analyzed two ways — the univariate view concludes "There is no difference between green and yellow feeders," while the view including an interaction effect concludes "There is a difference between green and yellow feeders."]
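A minimal numeric sketch of why univariate EDA can miss an interaction. The data below are invented to mirror the feeder example (color effect reverses by site), not taken from the talk:

```python
from statistics import mean

# hypothetical visit counts: the color effect reverses depending on site
visits = [
    {"color": "green",  "site": "shade", "count": 30},
    {"color": "green",  "site": "sun",   "count": 10},
    {"color": "yellow", "site": "shade", "count": 10},
    {"color": "yellow", "site": "sun",   "count": 30},
]

def mean_count(rows, **filters):
    """Mean count over rows matching all keyword filters."""
    matching = [r["count"] for r in rows
                if all(r[key] == val for key, val in filters.items())]
    return mean(matching)

# univariate view: the colors look identical (both average 20 visits)
marginal_green = mean_count(visits, color="green")
marginal_yellow = mean_count(visits, color="yellow")

# conditioning on site reveals a strong color-by-site interaction
shade_green = mean_count(visits, color="green", site="shade")
shade_yellow = mean_count(visits, color="yellow", site="shade")
```

Averaging over the interacting factor cancels the two opposite effects, which is exactly how a univariate analysis reaches the wrong "no difference" conclusion.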
8. Causality & spurious relationships
- Causality - three conditions must exist:
• X and Y must be correlated
• X must precede Y in time
• All other rival causes must be ruled out (i.e., internal validity must be established)
- Beware of Spurious Relationships & Rival Causes
• Example 1: You launch a promotion in March. You believe the success of your promotion (increased sales) is due to your marketing campaign, but it is actually the result of a third-party cause (increased consumer buying power due to tax refunds)
• Example 2: You launch a new crime watch campaign in December and see a significant decrease in crime month-over-month. The true cause is seasonality.
• Solution: Design your campaign to minimize potential rival causes. This is where including the SME is critical.
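The tax-refund example can be simulated. Below, marketing spend and sales are each driven by a hidden buying-power factor and have no direct effect on one another, yet they correlate strongly; a sketch with invented variables, not data from the talk:

```python
import random
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

random.seed(0)
buying_power = [random.gauss(0, 1) for _ in range(2000)]    # rival cause
marketing = [b + random.gauss(0, 1) for b in buying_power]  # driven by it
sales = [b + random.gauss(0, 1) for b in buying_power]      # also driven by it

# substantial correlation despite no direct causal link
spurious_r = pearson(marketing, sales)

# "controlling" for the rival cause removes the association
partial_r = pearson([m - b for m, b in zip(marketing, buying_power)],
                    [s - b for s, b in zip(sales, buying_power)])
```

This is why correlation alone satisfies only the first of the three causality conditions: the rival cause must be measured and ruled out before a causal claim is defensible.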
9. Suppressor Effects
- Look for Suppressor Effects
• Example: You launch an employee training program. You compare participants' performance at the end of the program to the general employee population, and find that those in the training program had lower performance ratings than the general population.
• Problem: You didn't consider pre-existing differences. You find out that those who were selected for the program were low performers.
• Solution: Use deltas (change from baseline to post) and/or include control variables in your model (prior performance, demographics, etc.).
[Bar chart — Suppressor Effects: Employee Performance Rating (5-point scale), baseline vs. final performance for training participants and the general population (values shown: 1.7, 3.2, 3.5, 3.6)]
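The deltas remedy in plain arithmetic, using illustrative ratings shaped like the slide's chart (training participants start much lower; the exact pairing of values is my assumption):

```python
# illustrative 5-point ratings, assumed to mirror the slide's chart
trainees = {"baseline": 1.7, "final": 3.5}
general = {"baseline": 3.2, "final": 3.6}

# raw endpoint comparison: trainees look worse (3.5 < 3.6)
raw_gap = trainees["final"] - general["final"]

# delta comparison: trainees actually improved far more
trainee_delta = trainees["final"] - trainees["baseline"]
general_delta = general["final"] - general["baseline"]
```

Comparing endpoints alone suppresses the program's effect; comparing change from baseline reverses the conclusion.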
10. Time to value & diminishing returns
- Progress vs. Perfection
• Time to value is an important factor to consider. It is better to provide something for the business to work with and continually improve than to wait until you reach perfection before sharing with the business
- Data & Analytics ROI
• Know when improving the model and/or adding more external data no longer yields the return on investment
- Cost-to-benefit analysis
- Assess forecastability
11. Other important considerations
- Are there specific constraints that might impact your approach?
• Example 1: Can’t recommend an alcoholic beverage, even if customer is likely to buy
• Example 2: Must use interpretable models
• Example 3: Must include non-significant promotions for simulation purposes
- Do you need to scale your solution?
• If you need to scale your solution, try to prototype within the same ecosystem (e.g., Azure, Python, Spark, etc.) in which you plan to scale.
- Results often fail to replicate when using different software or platforms
- Avoid using data for modeling that is not available at decision time
• Example: weather data or other data for which you have historical values but no future values
- Avoid data leakage when building models
• Don’t commingle model training data with model validation data
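A minimal sketch of the last two points: split time-ordered data chronologically so that no validation-period information (targets, or future covariates like weather) leaks into training. The record layout and field names are assumptions for illustration:

```python
def chronological_split(records, cutoff):
    """Train on records strictly before `cutoff`; validate on the rest.

    Random splits of time-ordered data let the model peek at the future;
    a date cutoff keeps training and validation cleanly separated.
    """
    train = [r for r in records if r["date"] < cutoff]
    valid = [r for r in records if r["date"] >= cutoff]
    return train, valid

# ISO date strings compare correctly in lexicographic order
orders = [
    {"date": "2020-01-05", "sales": 110},
    {"date": "2020-01-19", "sales": 125},
    {"date": "2020-02-02", "sales": 140},
    {"date": "2020-02-16", "sales": 150},
]
train, valid = chronological_split(orders, cutoff="2020-02-01")
```

The same discipline applies to feature engineering: any statistic (means, scalers, encodings) should be fit on the training partition only and then applied to the validation partition.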