Dr Simone Stumpf from City University
Quantitative analysis and summary statistics can be powerful tools in UX, but their use needs to be carefully considered. Quantitative results are not context-free – a number may be the answer to the wrong question. Much more important than understanding the answer is understanding the question, in order to choose the right method for capturing and analysing quantitative data.
3. My background.
Academia
University College London
– BSc Computer Science with Cognitive Science
– PhD Computer Science
– Research Fellow
Oregon State University
– Research Manager
City University London
– Senior Lecturer
Industry
BT
– Fraud detection
– Product management
– Marketing
– Project management
White Horse
– UX Architect
5. How old was Methuselah when he died?
670
969
1254
2756
6. Cognitive bias and heuristics.
Anchoring – any number has a priming
effect on number estimates.
People, even researchers, are bad at
probability, predictions and statistics.
[Daniel Kahneman – Thinking, Fast and Slow]
7. Quantitative approaches in UX.
Quantitative data – numbers.
Quantitative analysis – statistics.
For statistical tests you have a hypothesis.
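The hypothesis-testing idea can be sketched with a simple permutation test – a minimal example, with made-up task-completion times for two designs (the data and the one-sided framing are purely illustrative, not from any study mentioned here):

```python
import random
import statistics

# Hypothetical task-completion times (seconds) for two designs.
old_design = [48, 52, 55, 60, 47, 58, 63, 51]
new_design = [41, 44, 50, 39, 46, 42, 48, 45]

# H0 (null hypothesis): the designs do not differ, so any split of the
# pooled times into two groups of 8 is equally likely.
observed_diff = statistics.mean(old_design) - statistics.mean(new_design)

pooled = old_design + new_design
rng = random.Random(0)  # seeded so the result is reproducible
count_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    perm_diff = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
    if perm_diff >= observed_diff:
        count_extreme += 1

# One-sided p-value: how often chance alone beats the observed gap.
p_value = count_extreme / n_perm
```

A small p-value lets you reject H0 in favour of H1 ("the new design is faster"); a hypothesis really is just a yes/no question put to the data.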
8. Quantitative data and/or analysis?
How many problems does a user have using
my snazzy new design?
What kind of problems does a user have using
my snazzy new design?
Do you like my snazzy new design?
Is this snazzy new design better than the old
boring design?
11. Let’s ask the user.
How much do you like the design on a scale of 1 to 5 (where 5 is best)?
Average of ratings across all users.
Then, er, do some stats?
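The "average of ratings" step might look like this – the ratings are invented, and note that because Likert responses are ordinal, the median is often more defensible than the mean:

```python
import statistics

# Hypothetical 1–5 ratings from eight participants after a user test.
ratings = [4, 3, 4, 5, 3, 4, 2, 4]

mean_rating = statistics.mean(ratings)      # sensitive to skew and anchoring
median_rating = statistics.median(ratings)  # more robust for ordinal data
```

Here the mean lands near the suspiciously familiar 3.5-ish zone while the median says something slightly different – one reason the naive average can mislead.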
12. Way around?
NASA Task Load Index (TLX)
to assess users' perceptions of:
– Mental Demand
– Physical Demand
– Temporal Demand
– Performance
– Effort
– Frustration
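Scoring the instrument can be as simple as averaging the six subscales – the "Raw TLX" (RTLX) variant. A minimal sketch, with made-up responses for one participant (the 0–100 values are illustrative, assuming Performance has already been inverted so that high = worse on every dimension):

```python
import statistics

# Hypothetical raw NASA-TLX responses for one participant (0–100 scale).
tlx = {
    "Mental Demand": 55,
    "Physical Demand": 10,
    "Temporal Demand": 60,
    "Performance": 45,
    "Effort": 65,
    "Frustration": 70,
}

# Raw TLX (RTLX): the unweighted mean of the six subscales.
raw_tlx = statistics.mean(tlx.values())
```

The full TLX procedure additionally weights the dimensions by pairwise comparisons; the unweighted mean is the common shortcut.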
13. Mea culpa!
“Responses to TLX questions (Mental
Demand, Temporal Demand, Success of
Performance, Effort, Frustration) were all
around the mid-point of the scale.”
On an interface which was truly hateful!
“However, the [Condition 1]
participants showed no significant
difference to [Condition 2]
participants’ TLX scores.” – 62
participants
At least our sample size wasn’t shabby
and we did some stats.
14. What kinds of problems does a
user have with my snazzy new
design?
15. Hold on – is that a trick question?
Surely, that’s qualitative analysis!
Yes, but no.
It starts out that way but then I expect
frequencies to back this up. No stats
though, thanks.
“4 out of 5 users could not find the Purchase
button.”
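Turning observed problems into frequencies like that "4 out of 5" needs nothing more than counting. A minimal sketch, with hypothetical problem logs per participant:

```python
from collections import Counter

# Hypothetical usability problems logged per participant (5 users).
observations = [
    ["purchase button not found", "label unclear"],
    ["purchase button not found"],
    ["purchase button not found", "slow page"],
    ["label unclear"],
    ["purchase button not found"],
]

n_users = len(observations)
problem_counts = Counter(p for session in observations for p in session)

for problem, count in problem_counts.most_common():
    print(f"{count} out of {n_users} users: {problem}")
```

Reporting the base sample alongside each count is the point – "4 out of 5" is interpretable where "some users" is not.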
18. Easy-peasy.
I’ll use a between-subject design using
objective measures.
Like… eye tracking! What could be more objective than where people look?
19. Lots of numbers – First Fixation Duration, Fixation
Duration, Time to First Fixation, …
[http://uxmag.com/articles/eye-tracking-the-best-way-to-test-rich-app-usability]
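Those metrics all fall out of the raw fixation log once you pick an area of interest (AOI). A minimal sketch with a hypothetical log format of (start_ms, duration_ms, aoi) tuples:

```python
import statistics

# Hypothetical fixation log for one participant: (start_ms, duration_ms, aoi).
fixations = [
    (120, 450, "nav"),
    (600, 300, "purchase_button"),
    (950, 700, "purchase_button"),
    (1700, 250, "nav"),
]

aoi = "purchase_button"
hits = [(start, dur) for start, dur, a in fixations if a == aoi]

time_to_first_fixation = hits[0][0]                    # ms until the AOI is first fixated
fixation_count = len(hits)                             # number of fixations on the AOI
total_fixation_duration = sum(d for _, d in hits)      # sum of all fixation durations
mean_fixation_duration = statistics.mean(d for _, d in hits)
```

The numbers are easy to compute; deciding what a "good" value is – without a comparison condition – is the hard part.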
20. Well, that was fun.
There was a highly significant difference in the number of fixations
between versions (Χ2(2,N=4257)=22.25, p<0.001). Each participant on
average fixated 240.83 times in version 1, 259.83 times in version 2 yet
only 209.33 times in version 3. The average fixation duration between
versions was also different (ANOVA, F(2,4257)=13.30, p<0.001), with
participants in version 1 spending on average 0.57 seconds per
fixation, 0.56 seconds in version 2 but 0.69 seconds in version 3.
Yay – we did stats! There were results!
The total fixation duration is the sum of all individual fixations' durations. There was no statistically significant difference in participants' total fixation durations (Kruskal-Wallis, H(2,N=18), p=0.236).
Oh bum…
21. To summarise.
Try and quantify as much as possible but be clear
about limitations of what you can measure.
Descriptive statistics are good but are relatively
meaningless without context.
If you must use statistical tests, please make sure
they are appropriate.
22. Be clear about your questions and the best way to answer them.
Numbers are awesome.
@DrSimoneStumpf
Simone.Stumpf.1@city.ac.uk
Editor's Notes
Also known as…
Also runs the MSc course in Human-Centred Systems at City University London. Lots of experience with numbers and quantitative data, both in industry and in academia.
Hands up!
Combined, this has wide-reaching implications in UX. Firstly, when you ask users to estimate something (e.g. user experiences) that involves frequency, numeric ratings or anything that looks like rational decision-making, you'll run into trouble. Secondly, researchers and UX professionals are equally likely to fall into the trap of cognitive bias and unsound methods. For example, the decision of how many users to recruit to a study quite often depends more on practicalities than on sampling of the population and statistical power. Usually, the answer is "well, let's get more than 8 and we'll see if we get a good result".
So let's talk about quantitative approaches in UX. There is quantitative data, which is anything you can count, turn into frequencies or measure as a number. There is also quantitative analysis, which usually means some kind of statistics – for example, a t-test, a Chi-squared, a Fisher's exact or a Kruskal-Wallis. Now I'm just showing off… What's important for statistics is that you have to have a hypothesis, usually a null hypothesis you reject and an H1 hypothesis which is what you want to accept because you've shown a statistically significant difference. (I won't go into the details of what statistical significance is, but suffice to say you are looking for a pretty drastic effect.) A hypothesis is usually a question with a yes or no answer.
Let’s look at some questions that are very commonly asked in UX… let’s go through each of these and see if they use Quantitative data or quantitative analysis or both.
Next I will go through 3 examples of quantitative approaches I’ve encountered to answer questions in UX. I’ll discuss some limitations of what I’ve seen and also offer some possible solutions.
Firstly, as we have seen before, using a number can be suspect: it could be prone to anchoring. From experience, if this is asked after a user test, the average will be around 3.5. Unless they really, really loathe it or really, really love it, it'll always be around 3.5. Why 3.5 and not 3? The 0.5 is for liking the facilitator and your incentives. Even if the numbers were at all meaningful, there are various downfalls – low sample size, data that is not normally distributed. Usually, researchers ignore these things for convenience.
In research, the NASA-TLX is a very popular instrument to measure users' perceived workload in carrying out a task with an interface. It has 6 dimensions, measured on a 20-point scale without using numbers. It's very easy to apply, and because so many other people have used it, it's become pretty standard.
However, even this is prone to problems. Without a comparison to a different design, the number again becomes fairly meaningless. Even on a user interface that I consider really bad, the feedback was still all in the middle. The problem is also that it is really hard to get a good enough effect, even with a decent sample size. In this study, even though all the other, softer responses were telling us the With-scaffolding design was better, the TLX still came out not statistically significant. So, what's the solution here? One perspective would suggest that we shouldn't measure subjective satisfaction using a quantitative approach at all. However, I still hold that it can be valuable to quantify the user experience, but you will need to be careful to limit your expectations of what you can get as an answer to your question.
Whenever I read “some” or “many” in a student report my heart sinks. Give me some indication of whether this is prevalent or not, and also what your base sample is so I can interpret it properly.
Here is an example of analysing the user experience of two different user interface environments – one mainly textual, the other mainly visual – in a quantitative form. We used an adapted form of the Microsoft desirability toolkit, the Product Reaction Cards, and we were able to show the frequency of how often certain words occurred, in this case in a visual form through word clouds, where the size of a word corresponds to its relative frequency. And then another trick: we counted how many of the selected words were positive and how many negative. This gives us an insight into what is going on, even without resorting to stats.
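The counting part of that reaction-card analysis is straightforward – a minimal sketch, where both the card selections and the positive word set are invented for illustration:

```python
from collections import Counter

# Hypothetical reaction-card selections pooled across participants.
selections = ["fun", "confusing", "fast", "fun", "frustrating", "fast", "fun"]
positive_words = {"fun", "fast", "usable"}  # illustrative positive subset

word_freq = Counter(selections)  # drives word-cloud sizes
positive = sum(c for w, c in word_freq.items() if w in positive_words)
negative = sum(word_freq.values()) - positive
```

The frequencies size the word cloud; the positive/negative split gives a single at-a-glance comparison between designs, with no statistical test needed.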
Finally, this is a common question, and one which is relatively tractable if you have very few changes.
Often touted as the answer and the best way to test the user experience. Partly, it's seductive because you have these pretty pictures to look at and include in a client report. However, be clear that these pictures can at best give you an intuitive answer. Once you delve down into the numbers themselves, you again need to be pretty careful. On its own, an average fixation duration of 0.54 seconds is without context: is this good or bad? Who knows?! These studies are complex to set up – you have to have an "area of interest", and any questions you have need to be related back to these.
In a recent eye-tracking study we carried out, luckily we knew our areas of interest right from the start, and we were able to compare three versions. But the data analysis still turned out to be more complex than first anticipated: we had a low sample size; to do some of the stats we actually had to get the raw eye-movement data and calculate from there; and we were dealing with dynamic content, not just static images.