1. On InfoQ and PSE: A brief introduction
Ron S. Kenett
KPA Ltd., Raanana, Israel and University of Torino, Torino, Italy
ron@kpa.co.il
2. Introduction
This presentation is about doing the right research with
statistical methods, the right way. We call this Quality
Research. Research is a critical activity leading to
knowledge acquisition and to the formulation of policies
and management decisions.
By effective research we mean research that produces an
impact, as intended by decision makers. One measure of
effective research is Information Quality (InfoQ), an
approach developed by Kenett and Shmueli (2009).
Practical Statistical Efficiency (PSE) assesses the level
of implementation of the research recommendations
(Kenett, Coleman and Stewardson, 2003).
4. Information Quality (InfoQ)
[Figure: InfoQ diagram relating Goals, Data Quality, Analysis Quality, Information Quality and Knowledge. Data are either primary or secondary, and in each case experimental or observational.]
Kenett, R. and Shmueli, G., “On Information Quality”, http://ssrn.com/abstract=1464444, 2009.
6. Practical Statistical Efficiency (PSE)
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
• V{D} = value of the data actually collected
• V{M} = value of the statistical method employed
• V{P} = value of the problem to be solved
• P{S} = probability that the problem actually gets solved
• V{PS} = value of the problem being solved
• P{I} = probability the solution is actually implemented
• T{I} = time the solution stays implemented
• E{R} = expected number of replications
Kenett, R.S., Coleman, S.Y. and Stewardson, D. (2003), “Statistical Efficiency: The
Practical Perspective”, Quality and Reliability Engineering International, 19: 265-272.
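The PSE formula is a plain product of its eight factors, which can be sketched in a few lines of Python. The function name, the dictionary layout and all numeric ratings below are hypothetical; the slides do not specify the scale on which each factor is rated.

```python
from math import prod

# The eight factors, in the order they appear in the PSE formula.
PSE_FACTORS = ["E{R}", "T{I}", "P{I}", "V{PS}", "P{S}", "V{P}", "V{M}", "V{D}"]

def practical_statistical_efficiency(factors):
    """Multiply the eight PSE factors into a single score."""
    missing = [name for name in PSE_FACTORS if name not in factors]
    if missing:
        raise ValueError(f"missing PSE factors: {missing}")
    return prod(factors[name] for name in PSE_FACTORS)

# Hypothetical ratings for one project (illustrative values only).
example = {
    "E{R}": 2.0, "T{I}": 3.0, "P{I}": 0.5, "V{PS}": 4.0,
    "P{S}": 0.5, "V{P}": 5.0, "V{M}": 1.0, "V{D}": 1.0,
}
pse = practical_statistical_efficiency(example)
```

Because the factors multiply, a low value on any single factor (for example a solution that is unlikely to be implemented) pulls the whole score down.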
8. Information Quality (InfoQ)
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Sampling bias
6. Chronology of data and goal
7. Concept operationalization
8. Communication and data visualization
9. The InfoQ Swiss Cheese Model
[Figure: Swiss cheese diagram labelled with the eight InfoQ dimensions: sampling bias, concept operationalization, communication and data visualization, chronology of data and goal, data resolution, data structure, data integration, and temporal relevance.]
10. InfoQ1: Data Resolution
• Two aspects of data resolution are measurement
scale and data aggregation.
• The measurement scale of the data must be
adequate for the purpose of the study.
• The level of aggregation of the data must also suit the
task at hand. For example, consider data on daily
purchases of over-the-counter medications at a large
pharmacy. If the goal of the analysis is to forecast
future inventory levels of different medications, and
re-stocking is done on a weekly basis, then we would
prefer weekly aggregate data to daily data.
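The pharmacy example can be sketched as a small aggregation step that brings daily counts to the weekly resolution at which re-stocking decisions are made. The function name and the purchase counts are hypothetical, and grouping by ISO week is an assumption about how "weekly" is defined.

```python
from collections import defaultdict
from datetime import date, timedelta

def aggregate_weekly(daily_sales):
    """Sum daily unit sales into totals keyed by (ISO year, ISO week)."""
    weekly = defaultdict(int)
    for day, units in daily_sales.items():
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += units
    return dict(weekly)

# Hypothetical daily purchase counts for one medication:
# 14 consecutive days starting on Monday 2024-01-01.
start = date(2024, 1, 1)
daily = {start + timedelta(days=i): 10 + i for i in range(14)}
weekly_totals = aggregate_weekly(daily)
```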
11. InfoQ2: Data Structure
• The data can combine structured quantitative
data with unstructured, semantic-based data.
• For example, in assessing the reputation of an
organization one might combine data derived
from balance sheets with data mined from text
such as newspaper archives or press reports.
12. InfoQ3: Data Integration
• Knowledge is often spread out across multiple
data sources.
• Hence, identifying the different relevant
sources, collecting the relevant data, and
integrating them all directly affect information
quality.
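As a minimal sketch of the integration step, two sources can be joined on a shared key while keeping track of records that fail to match, since unmatched records are one way information is lost. The company names, field names and values below are all hypothetical.

```python
def integrate_sources(financials, press_mentions):
    """Join two sources on company name and report keys that fail to match."""
    shared = financials.keys() & press_mentions.keys()
    matched = {
        name: {**financials[name], **press_mentions[name]} for name in shared
    }
    # Keys present in exactly one source are not integrated.
    unmatched = sorted(financials.keys() ^ press_mentions.keys())
    return matched, unmatched

# Hypothetical structured source (balance sheets) and text-mined source.
financials = {"Acme": {"revenue": 120.0}, "Globex": {"revenue": 80.0}}
press_mentions = {
    "Acme": {"negative_articles": 3},
    "Initech": {"negative_articles": 1},
}
matched, unmatched = integrate_sources(financials, press_mentions)
```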
13. InfoQ4: Temporal Relevance
• A data set contains information collected during a
certain period of time. The degree of relevance of the
data to the current goal at hand must be assessed.
• For instance, in order to learn about current online
shopping behaviors, a dataset that records online
purchase behavior (such as Comscore data,
www.comscore.com) can be irrelevant if it is even a
few years old, because of the fast-changing
online shopping environment.
14. InfoQ5: Chronology of Data and Goal
• A data set contains daily weather information for a particular
city for a certain period as well as information on the Air
Quality Index (AQI) on those days.
• For the United States such data are publicly available from
the National Oceanic and Atmospheric Administration website
(http://www.noaa.gov). To assess the quality of the
information contained in this data set, we must consider the
purpose of the analysis.
• Although AQI is widely used (for instance, for issuing a “code
red” day), how it is computed is not easy to figure out. One
analysis goal might therefore be to find out how AQI is
computed from weather data (by reverse-engineering). For
such a purpose, these data are likely to contain high-quality
information. In contrast, if the goal is to predict future AQI
levels, then the data on past temperatures contain
low-quality information.
15. InfoQ6: Sampling Bias
• A clear definition of the population of interest and how the
sample relates to that population is necessary in both primary
and secondary analyses.
• Dealing with sampling bias can be proactive or retroactive. In
studies where there is control over the design (e.g., surveys),
sampling schemes are selected to reduce bias. Such
methods do not apply to retrospective studies. However,
retroactive measures such as post-stratification weighting,
which are often used in survey analysis, can be useful in
secondary studies as well.
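Post-stratification weighting can be sketched as follows: each stratum's sample mean is weighted by that stratum's known share of the population rather than by its share of the sample. The strata, values and population shares below are hypothetical.

```python
def post_stratified_mean(sample, population_shares):
    """Weight each stratum's sample mean by its known population share."""
    by_stratum = {}
    for stratum, value in sample:
        by_stratum.setdefault(stratum, []).append(value)
    missing = set(population_shares) - set(by_stratum)
    if missing:
        raise ValueError(f"no sample observations for strata: {sorted(missing)}")
    return sum(
        share * sum(by_stratum[s]) / len(by_stratum[s])
        for s, share in population_shares.items()
    )

# Hypothetical sample in which "young" respondents are over-represented,
# while the population is assumed to be split evenly between the two strata.
sample = [("young", 4.0), ("young", 6.0), ("young", 5.0), ("old", 9.0)]
shares = {"young": 0.5, "old": 0.5}
raw_mean = sum(value for _, value in sample) / len(sample)
adjusted_mean = post_stratified_mean(sample, shares)
```

Here the unweighted mean is pulled toward the over-represented stratum, and the weighted mean corrects for that, provided the population shares are known and every stratum has at least one observation.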
16. InfoQ7: Concept Operationalization
• Observable data are an operationalization of
underlying concepts. “Anger” can be measured via a
questionnaire or by measuring blood pressure;
“economic prosperity” can be measured via income
or by unemployment rate; and “length” can be
measured in centimeters or in inches.
• The role of concept operationalization is different for
explanatory, predictive, and descriptive goals.
17. InfoQ8: Communication and Data Visualization
• If crucial information does not reach the right
person at the right time, then the quality of
information becomes poor.
• Data visualization is also directly related to the
quality of information. Poor visualization can
lead to degradation of the information
contained in the data.
18. The InfoQ Score
For each measure $Y_i(x)$, a univariate desirability function $d_i(Y_i)$ is defined,
which assigns numbers between 0 and 1 to the possible values of $Y_i$, with
$d_i(Y_i) = 0$ representing a completely undesirable value of $Y_i$ and $d_i(Y_i) = 1$
representing a completely desirable or ideal response value. The individual
desirabilities are then combined into an overall desirability index using the
geometric mean of the individual desirabilities:
Desirability Function = $\left[ d_1(Y_1) \times d_2(Y_2) \times \cdots \times d_k(Y_k) \right]^{1/k}$
with $k$ denoting the number of measures. Notice that if any response $Y_i$ is
completely undesirable ($d_i(Y_i) = 0$), then the overall desirability is zero.
We use the Desirability Function to compute an InfoQ Score based on an
assessment of indicators reflecting the 8 InfoQ dimensions.
Derringer, G., and Suich, R., (1980), "Simultaneous Optimization of Several Response
Variables," Journal of Quality Technology, 12, 4, 214-219.
Harrington, E. C. (1965). The desirability function. Industrial Quality Control, 21, 494-498.
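The geometric-mean combination described on this slide can be sketched in Python as below. The function name and the example desirability values are hypothetical.

```python
from math import prod

def overall_desirability(desirabilities):
    """Geometric mean of individual desirabilities, each in [0, 1]."""
    if not desirabilities:
        raise ValueError("at least one desirability is required")
    if any(not 0.0 <= d <= 1.0 for d in desirabilities):
        raise ValueError("each desirability must lie between 0 and 1")
    return prod(desirabilities) ** (1.0 / len(desirabilities))

# Hypothetical values: two moderately desirable measures.
balanced = overall_desirability([0.5, 0.5])
# A single completely undesirable measure drives the index to zero.
vetoed = overall_desirability([0.9, 0.8, 0.0])
```

The second call illustrates the veto property noted above: no amount of strength on the other measures compensates for one measure with desirability zero.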
19. The InfoQ Score
InfoQ Score = $\left[ d_1(Y_1) \times d_2(Y_2) \times \cdots \times d_8(Y_8) \right]^{1/8}$
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Sampling bias
6. Chronology of data and goal
7. Concept operationalization
8. Communication and data visualization
[Figure: three desirability function shapes, labelled "The lower the better", "On target" and "The higher the better", with the dimension numbers 1 to 8 marked on them.]
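The three shapes named in the figure correspond to the one-sided and two-sided desirability functions of Derringer and Suich (1980). The sketch below uses linear shapes (exponents of 1) and a hypothetical 1 to 5 rating scale; which dimension uses which shape is not stated here, so the example calls are illustrative only.

```python
def higher_is_better(y, low, target, s=1.0):
    """0 at or below `low`, 1 at or above `target`, rising in between."""
    if y <= low:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - low) / (target - low)) ** s

def lower_is_better(y, target, high, s=1.0):
    """1 at or below `target`, 0 at or above `high`, falling in between."""
    if y <= target:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - target)) ** s

def on_target(y, low, target, high, s=1.0, t=1.0):
    """1 at `target`, falling to 0 at `low` and at `high`."""
    if y <= low or y >= high:
        return 0.0
    if y <= target:
        return ((y - low) / (target - low)) ** s
    return ((high - y) / (high - target)) ** t

# Hypothetical ratings on an assumed 1-to-5 scale.
d_high = higher_is_better(4, low=1, target=5)
d_low = lower_is_better(2, target=1, high=5)
d_mid = on_target(4, low=1, target=3, high=5)
```

The exponents `s` and `t` control how sharply desirability falls away from the ideal value; values other than 1 bend the linear segments.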
20. Practical Statistical Efficiency (PSE)
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
• V{D} = value of the data actually collected
• V{M} = value of the statistical method employed
• V{P} = value of the problem to be solved
• P{S} = probability that the problem actually gets solved
• V{PS} = value of the problem being solved
• P{I} = probability the solution is actually implemented
• T{I} = time the solution stays implemented
• E{R} = expected number of replications
21. V{D} = value of the data actually collected
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
Readily accessible data are like
observations made under the lamppost,
where there is light: not necessarily
where you lost your key, or where the
answer to your problem lies.
22. V{M} = value of the statistical method employed
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
A mathematical definition of statistical
efficiency is given by:
Relative Efficiency of Test A versus Test B =
Ratio of sample size for test A to sample size
for test B, where sample sizes are determined
so that both tests reach a certain power
against the same alternative.
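As an illustration of this definition, the sketch below compares a test based on the sample mean (test A) with one based on the sample median (test B) for normally distributed data. It relies on two assumptions not taken from the slide: a one-sided normal-approximation sample-size formula, and the large-sample result that the variance of the sample median under normality is pi/2 times that of the sample mean. The function name and parameter values are hypothetical.

```python
from math import pi
from statistics import NormalDist

def required_n(sigma, delta, alpha=0.05, power=0.8, variance_factor=1.0):
    """Approximate sample size for a one-sided z-type test to detect a
    shift `delta`. `variance_factor` scales the estimator's variance
    relative to that of the sample mean."""
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return variance_factor * (z * sigma / delta) ** 2

# Same significance level, power and alternative for both tests.
n_mean = required_n(sigma=1.0, delta=0.5)
n_median = required_n(sigma=1.0, delta=0.5, variance_factor=pi / 2)

# Ratio of sample sizes, test A (mean) to test B (median), as defined above.
relative_efficiency = n_mean / n_median
```

Under these assumptions the ratio equals 2/pi, about 0.64, so the mean-based test needs roughly 64% of the observations required by the median-based test to reach the same power.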
23. V{P} = value of the problem to be solved
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
Statisticians too often forget this
part of the equation. We frequently
choose problems to be solved on
the basis of their statistical interest
rather than the value of solving
them.
24. P{S} = probability that the problem actually gets solved
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
Usually no single method or attempt
solves the entire problem, only part
of it, so this part of the equation
can be expressed as a fraction.
25. V{PS} = value of the problem being solved
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
This is both a statistical question and
a management question. Did the
method work and lead to a solution
that worked, and were the data,
information and resources available
to solve the problem?
26. P{I} = probability the solution is actually implemented
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
Here is the non-statistical part of the
equation that is often the most
difficult to evaluate. Implementing the
solution may be far harder than just
coming up with the solution.
27. T{I} = time the solution stays implemented
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
Problems tend not to stay solved.
This is why we need to place strong
emphasis on holding the gains in
any process improvement.
28. E{R} = expected number of replications
PSE = E{R} x T{I} x P{I} x V{PS} x P{S} x V{P} x V{M} x V{D}
This is the part most often missed in
companies. If the basic idea of the
solution could be replicated in other
areas of the company, the savings
could be enormous.
29. The Quality Ladder: Matching Management Approach with Statistical Methods
Management approach      Statistical method
Quality by Design        Design of Experiments
Process Improvement      Statistical Process Control
Inspection               Sampling
Fire Fighting            Data Accumulation
Kenett, R. and Zacks, S., Modern Industrial Statistics: Design and Control of Quality
and Reliability, Duxbury Press, San Francisco, 1998; Spanish edition 2002,
2nd paperback edition 2002, Chinese edition 2004.
30. The Statistical Efficiency Conjecture
Let PSE denote the practical statistical efficiency of a specific project and L the
maturity level of an organisation on the Quality Ladder (L = 1, …, 4).
PSE is a random variable with specific realisations for individual projects.
E{PSE} = the expected value of PSE in a given organisation over all
projects.
The Statistical Efficiency Conjecture links expected Practical Statistical
Efficiency with the maturity of an organisation on the Quality Ladder.
In more formal terms it is stated as:
Conditioned on the right variable,
E{PSE} is an increasing function of L.
We partially demonstrated this with 21 case studies.
Kenett, R., De Frenne, A., Tort-Martorell, X. and McCollin, C., The Statistical Efficiency
Conjecture, Chapter 4 in Applying Statistical Methods in Business and Industry:
the state of the art, Coleman, S., Greenfield, T. and Montgomery, D. (editors),
John Wiley and Sons, 2008.
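One simple way to examine the conjecture on a set of projects is to average PSE within each Quality Ladder level and check whether the averages increase with L. The sketch below does only that: it ignores the "conditioned on the right variable" qualifier, and the (level, PSE) pairs are hypothetical values, not the 21 case studies.

```python
from statistics import mean

def mean_pse_by_level(projects):
    """Average PSE per Quality Ladder level, ordered by level."""
    by_level = {}
    for level, pse in projects:
        by_level.setdefault(level, []).append(pse)
    return {level: mean(values) for level, values in sorted(by_level.items())}

def is_increasing(values):
    """True when each value is strictly greater than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical (ladder level L, project PSE) pairs, for illustration only.
projects = [
    (1, 2.0), (1, 4.0), (2, 5.0), (2, 7.0), (3, 9.0), (4, 12.0), (4, 16.0),
]
means = mean_pse_by_level(projects)
consistent_with_conjecture = is_increasing(list(means.values()))
```

Increasing sample averages are only consistent with the conjecture; they do not establish it, particularly with few projects per level.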