General Approach:
Two actual runs from TREC 3, 5-8 were used. The MAP of each run is shown in the Excel table (computed over every topic). Five significance tests were used to measure whether the difference in MAP between System A and System B was statistically significant, that is, whether the evidence supports System A in fact being better than System B. For every significance test, the p-value was calculated from the test statistic. That value is then compared with the significance level, which states the maximum value a p-value can have for the null hypothesis to be rejected. Finally, the null hypothesis is either rejected or retained.

Significance Testing
1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric.
2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between our two systems.
3. A significance level (p-value) that is computed by taking the value of the test statistic for our experimental systems and determining how likely a value at least that large could have occurred under the null hypothesis.
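The decision step of the general approach can be sketched in a few lines. This is only an illustration: the function names, the per-topic average precision (AP) values and the 0.05 significance level are assumptions made for the example, not values taken from the TREC runs.

```python
from statistics import mean


def map_difference(ap_a, ap_b):
    """Test statistic: MAP(A) - MAP(B), with MAP taken as the mean of per-topic AP."""
    return mean(ap_a) - mean(ap_b)


def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha (0.05 is an assumed default)."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Hypothetical per-topic AP scores, invented for illustration only.
ap_system_a = [0.30, 0.50, 0.40, 0.60]
ap_system_b = [0.25, 0.45, 0.45, 0.50]
observed_difference = map_difference(ap_system_a, ap_system_b)
```

Each of the significance tests supplies its own way of turning the observed difference into a p-value; the comparison against the significance level is the same for all of them.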
Null hypothesis = System A and System B have the same distribution.
Test statistic = difference in Mean Average Precision, MAP(A) - MAP(B).
P-value = (number of permutations with MAP(A) - MAP(B) <= -0.052 + number of permutations with MAP(A) - MAP(B) >= 0.052) / total number of permutations (100,000).
Characteristics: distribution-free and does not assume random sampling.
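The p-value above, counted over permutations, matches a paired randomization (permutation) test. A minimal sketch follows, assuming the inputs are per-topic AP scores for the two systems on the same topics. Under the null hypothesis the labels A and B are interchangeable on each topic, so each permutation randomly swaps them, which is the same as flipping the sign of that topic's difference. The function name, parameter names and seed are hypothetical.

```python
import random


def randomization_test(ap_a, ap_b, n_permutations=100_000, seed=0):
    """Two-sided paired randomization test on the difference in MAP."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)  # |MAP(A) - MAP(B)| for the real labelling
    extreme = 0
    for _ in range(n_permutations):
        # Swap the A/B labels on each topic with probability 0.5.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        # Small tolerance so float rounding does not exclude exact ties.
        if abs(permuted) >= observed - 1e-12:
            extreme += 1
    return extreme / n_permutations
```

With an observed difference of 0.052, the count inside the loop corresponds to the two tails (<= -0.052 and >= 0.052) in the formula above.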
It can be used as an alternative to the paired Student's t-test when the population cannot be assumed to be normally distributed. When N (the number of samples) is larger than 25, the distribution of the Wilcoxon test statistic is approximately normal.
Null hypothesis = System A and System B have the same distribution.
Test statistic = the sum of the signed ranks of the per-topic differences (often taken as the smaller of the positive-rank sum and the negative-rank sum).
P-value = the probability, under the null hypothesis, of a test statistic at least as extreme as the one observed.
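As a rough illustration of the Wilcoxon signed-rank test, the sketch below uses the normal approximation mentioned above, so it is only reasonable for N larger than about 25. It is a simplified version under stated assumptions: zero differences are dropped, tied absolute differences get average ranks, and no tie or continuity correction is applied. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with ties sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(ap_a, ap_b):
    """Two-sided p-value via the normal approximation (no tie correction)."""
    diffs = [a - b for a, b in zip(ap_a, ap_b) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)  # test statistic
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    # Two-sided tail probability of a standard normal: 2 * P(Z >= |z|).
    return math.erfc(abs(z) / math.sqrt(2))
```

For small N an exact distribution of the rank sum would be needed instead of the normal approximation.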
Null hypothesis = the scores of System A and System B are random samples from the same distribution (this differs from the null hypothesis of the randomization test, Wilcoxon test and sign test).
Test statistic = difference in Mean Average Precision (MAP).
P-value = fraction of samples in the shifted distribution that have an absolute value as large as or larger than our experiment’s difference.
Sampling with replacement: sampling schemes may be without replacement ('WOR': no element can be selected more than once in the same sample) or with replacement ('WR': an element may appear multiple times in the same sample).
Characteristics: distribution-free and assumes random sampling.
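One way to realize the shifted bootstrap distribution described above is sketched here. It assumes paired per-topic AP scores, resamples topics with replacement, and shifts the bootstrap distribution of the mean difference so that it is centred on zero, as the null hypothesis requires. The function name, the default number of samples and the seed are assumptions made for the example.

```python
import random


def bootstrap_shift_test(ap_a, ap_b, n_samples=100_000, seed=0):
    """Two-sided paired bootstrap test using a zero-centred (shifted) distribution."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    n = len(diffs)
    observed = sum(diffs) / n  # our experiment's difference in MAP
    # Each bootstrap sample draws n topics with replacement (WR).
    boot_means = [sum(rng.choices(diffs, k=n)) / n for _ in range(n_samples)]
    center = sum(boot_means) / n_samples
    # Shift so the distribution has mean zero, then count the extreme values.
    extreme = sum(1 for m in boot_means if abs(m - center) >= abs(observed))
    return extreme / n_samples
```

Unlike the randomization sketch, topics can repeat within a sample here, which is what sampling with replacement means in practice.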
Null hypothesis = the scores of System A and System B are random samples from the same normal distribution.
Test statistic = difference in Mean Average Precision (MAP), standardized as Student's t.
P-value = probability, under Student's t distribution with N - 1 degrees of freedom, of a value whose absolute value is as large as or larger than the t statistic of our experiment’s difference.
Characteristics: assumes a normal distribution and random sampling.
IMPORTANT: it only works with populations that follow a normal distribution, so it may not be suitable for every null hypothesis. Example??
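For comparison, a paired Student's t-test can be sketched with the standard library alone. This is an illustrative sketch, not a reference implementation: the tail probability of the t distribution is obtained by integrating its density numerically with Simpson's rule, and the function names and the number of integration steps are assumptions.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_t_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def paired_t_test(ap_a, ap_b):
    """Return (t statistic, two-sided p-value) for paired per-topic scores."""
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, two_sided_t_p_value(t, n - 1)
```

The numerator of t is the difference in MAP, so the test judges the same quantity as the other tests but relies on the normality assumption to obtain its distribution.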
In this section we report the amount of agreement among the p-values produced by the various significance tests. Table 1 shows the RMSE of each of the tests on a subset of the TREC run pairs. We formed this subset by removing all pairs for which all tests agreed on the p-value.
* If the tests agree with each other, there is little practical difference among the tests.
The randomization test, bootstrap test and t-test largely agree with each other. The RMSE between these three tests is approximately 0.01, which is an error of 20% for a p-value of 0.05. The Wilcoxon and sign tests do not agree with any of the other tests. Compared to the randomization test, and thus to the t-test and bootstrap, the Wilcoxon and sign tests will result in both failures to detect significance and false detections of significance.
The Root Mean Square Error (RMSE) of an estimator is one of many ways to quantify the difference between an estimator and the true value of the quantity being estimated.
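The agreement measure can be made concrete with a short sketch. It assumes two equal-length lists of p-values, one per run pair, with one test treated as the reference; the function names and the input values shown in the comment are hypothetical and are not the values behind Table 1.

```python
import math


def rmse(p_reference, p_other):
    """Root mean square error between two tests' p-values over the same run pairs."""
    n = len(p_reference)
    return math.sqrt(sum((r - o) ** 2 for r, o in zip(p_reference, p_other)) / n)


def relative_error(error, p_value):
    """Size of an error relative to a given p-value, e.g. 0.01 / 0.05 = 20%."""
    return error / p_value


# Hypothetical usage: p-values from a reference test and from another test.
# rmse([0.05, 0.10], [0.06, 0.09])
```

An RMSE of 0.01 is small in absolute terms, but near the usual 0.05 threshold it is a fifth of the p-value, which is why the 20% figure is quoted.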
Wilcoxon and sign tests: they were appropriate before affordable computation existed, but are inappropriate today.
Random sampling versus non-random sampling: an IR researcher may argue that the assumption of random samples from a population is required to draw an inference from the experiment to the larger world. This cannot be the case. IR researchers have long understood that inferences from their experiments must be carefully drawn given the construction of the test setup. Using a significance test based on the assumption of random sampling is not warranted for most IR research.
A researcher using the Wilcoxon test or the sign test is likely to spend a lot longer searching for methods that improve retrieval performance than a researcher using the randomization, bootstrap or t-test.