Our study found that applying a performance-based "expert" weighted method to the crowd improves the crowd's wisdom, as measured by crowd accuracy. This finding contrasts with previous research, which was not able to find a significant improvement in accuracy from applying weighted methods. It indicates that, in order to optimise crowdsourcing, experts within the crowd should be given higher weighting than non-experts.
Wisdom of the Crowd: Wiser Crowds Through Weighted Methods
UNIVERSITY COLLEGE LONDON, DIVISION OF PSYCHOLOGY AND LANGUAGE SCIENCES
MSc Cognitive and Decision Sciences 2018-19
Seeking wisdom: Wiser crowd judgements through weighted methods
Ethics approval code: CPB.2013/015
Date of submission: 10th August 2019
Abstract
Introduction
Method
Design
Participants
Procedure
Results
Hypotheses and predicted results
Discussion
Review of hypotheses
Possible explanations for the current findings
Conclusion
References
Appendix
Abstract
Background: How can we make the wisdom of the crowd wiser? This question comes at a
time when society is faced with increasing amounts of misinformation dispersed at scale and
with speed through the internet. With the spread of "fake news", facts no longer seem like
facts until verified by an unbiased entity. To counter this flood of information, society is
seemingly turning to platforms where 'the crowd', or communities, can help filter some of
this misinformation through social proof techniques. However, these communities have many
weaknesses, such as low-quality contributors or spammers having the ability to skew the
crowd's judgement. In this study, we aimed to examine whether weighted methods can be
applied to the crowd to increase the crowd's accuracy in judging truthfulness, making it
arguably "wiser".
Method: Through a popular online crowdsourcing platform (Amazon MTurk), participants
completed a questionnaire of 120 questions related to current affairs and world news. A
repeated measures design was used to detect any difference between the weighted methods.
Results: We found that applying performance-based weighted methods significantly improved
the crowd's accuracy compared to the purely unweighted "democratic" method. Interestingly,
we found a significant decrease in accuracy when aggregating confidence.
Conclusion: These findings suggest that by weighting performance-based experts more highly,
the crowd's accuracy significantly improves compared to an exclusively democratic
("unweighted") aggregation method.
Introduction
Background
The internet has revolutionised the way human society interacts more than any other
invention since the printing press, which inaugurated the dissemination of printed information
to the masses in Europe between the years 1450 and 1500 (Dittmar, 2011). In this digital age,
the internet is now a stage for the “vox populi” to be heard. Digital platforms that allow
communities to be built and individual contributions to be made are making a large impact on
society. Conversely, the ease of access the internet affords has also allowed low-quality
contributors and spammers to participate more easily in crowdsourcing events. Howe (2006)
coined the term "crowdsourcing", defined as outsourcing a task or function that would
traditionally have been performed by a single agent to an undefined network of labourers.
Crowdsourcing is now a popular way to distribute tasks in exchange for a monetary reward or
recognition.
The aim of this study is to determine if the veracity of judgements at the crowd level,
on a set of claims found in the news and general knowledge factual statements, can be
improved upon by applying various weighted methods to responses within the crowd.
Wisdom of the crowd
The commonly known “wisdom of the crowd” (WoC) effect is predicated on the
belief that a diverse group of independent people can achieve a better result, measured by
accuracy in this study, than any of the individuals in the group. James Surowiecki (2005)
outlined the WoC phenomenon, describing how remarkable it is that, in the right
circumstances, collective intelligence can be smarter than the smartest person in the group.
Given our inability to recall information in our brains at will, and humans' bounded
rationality highlighted by Simon (1955), the advantages of collective decision-making appear
to remain broadly supported (Bonabeau, 2009).
Francis Galton (1907), an English Statistician, is known to have discovered the power
of crowd wisdom from his analysis of a weight-judging competition, which was held at the
annual show of the ‘West of England Fat Stock and Poultry Exhibition’. This competition
allowed attendees, mostly experienced butchers and farmers as well as people with no
specific expertise in cattle raising, to guess the weight of an ox once it had been
slaughtered and dressed, for a small entry fee of 6d (sixpence, equivalent to £1.96 in
today’s money). The participants were incentivised with prizes for those who submitted the
most accurate estimate. Galton studied the 787 estimates submitted by the crowd and
discovered that the median of the crowd's estimates (1,197 lb) was 1 lb away from the actual
weight (1,198 lb). A recent re-examination of Galton's findings demonstrated that his data
contained some errors in the original calculation; when corrected, and taking the mean as
opposed to the median, the crowd produced a perfect estimate (Wallis, 2014).
The aforementioned conclusion from Galton supported the notion that there is a
higher probability of achieving the correct judgement or decision through a democratic
mechanism. This has led to further studies that seek ways in which the crowd could be
enriched by identifying experts and eliminating the contributions of poor performers
(Budescu & Chen, 2014; Drew, 2018; Zhao & Zhu, 2014).
The search for an expert
An expert is typically perceived as someone who demonstrates good judgement and
high predictive accuracy. According to Merriam-Webster, an expert is "one with the special
skill or knowledge representing mastery of a particular subject" (Expert, n.d.). In some
cases, experts are hard to come by, subjective, highly dependent on the domain and also
relative to the group with which they are compared. Nevertheless, many people believe that
experts are above average at making good judgements and decisions compared to another
person or group, and turn to them for signals in times of uncertainty. For example, a professor
in neurology could be deemed an "expert" in neuroscience due to their depth of knowledge
built over years of experience, but a novice in sports knowledge if that is not a domain they
allocate time to or have an interest in. The difference in performance between experts,
therefore, can vary significantly depending on the task. Previous research has argued that
experts can be knowledgeable but bad at predicting outcomes (Camerer & Johnson, 1997)
and that the expectation of experts is unproven, causing the relationship between expertise
and accuracy to be unpredictable (Hinds, 1999; Norman et al., 1989). However, contrasting
research has demonstrated that good, reliable experts do exist and can perform well; for
example, a group of National Weather Service (NWS) forecasters provided reliable
probability predictions of precipitation and temperature (Murphy & Winkler, 1977).
Crowd diversity
Group diversity is an important attribute to consider in the context of the wisdom of the
crowd, and one which counters the expert-judgement theory. Research has shown crowd
diversity to have a positive effect on performance (Hong & Page, 2004). The study showed
that a team of randomly selected intelligent agents outperforms a team comprised of the
best-performing agents at problem-solving tasks. The paper focuses exclusively on functional
diversity, which is constructed on the agents' perspectives and heuristics, inspired by research
conducted in the organisational behaviour (Kephart, Hogg & Huberman, 1990; Miller, Burke
& Glick, 1998) and psychology literature (Polzer, Milton & Swann, 2002; Nisbett & Ross,
1980). Cognitive diversity in the crowd is valuable even in the context of seeking experts. An
expert is commonly domain-specific and relative to the size of the crowd. The smaller the
crowd, the more significant the role of an expert. Having a diversified crowd fills the
knowledge gaps that a group of experts may leave.
In terms of the application of group diversity, recent research has also studied the
effects of diverse leadership on businesses (Noland, Moran & Kotschwar, 2016). This study
found a positive relationship between the proportion of female leaders and net revenue.
Group diversity and independence can also negate potential "group-think" effects. A
high-profile case study of the adverse effects of the group-think phenomenon is the
Challenger disaster. The Challenger was an American space shuttle that exploded 73
seconds after launch. An investigation into the cause attributed the failure of the launch to
group-think. Key traits of group-think include a concurrence-seeking tendency and a
homogeneous group with similar ideology, social background and so on; this can lead to
symptoms such as over-estimation of the group, close-mindedness and pressures toward
uniformity, which consequently can lead to defective decision-making (Janis, 2008).
This highlights the value of crowd diversity. Through crowdsourcing, group-think effects can
be circumvented, as the crowd performs tasks independently.
Crowd motivation
When assessing crowdsourcing, it is important to consider the motivation of crowd
members. Understanding crowd motivation design and incentive structuring is integral to
mitigating biases and attracting high-quality participation. The motivation for this type
of participation sits well within the 'belonging' and 'self-esteem' categories of Maslow's
hierarchy of needs model (Maslow, 1943, 1954). There are a few ways in which a crowd
member could be motivated to participate and contribute to a platform: for example, an
immediate payoff as a payment (Lakhani & Wolf, 2005), or delayed payoffs in the form of
signalling and stakeholder feedback (Hackman & Oldham, 1980). Further studies have
analysed different motivation design approaches in the context of “ubiquitous
crowdsourcing,” crowdsourcing on the go through mobile phones, to assess the various
effects of motivational design on crowd participation and contribution quality (Goncalves,
Hosio, Rogstadius, Karapanos & Kostakos, 2015). In this study, they found a positive effect
on participation rates from using various motivation techniques such as psychological
empowerment, self-efficacy and causal importance. It should be noted, however, that
increased incentives, in particular extrinsic incentives, can have an adverse effect on quality,
as discussed in previous papers (Kittur et al., 2013).
Sourcing the crowd
As noted earlier, Howe (2006) coined the phrase "crowdsourcing", which describes a way
in which micro problem-solving tasks can be completed by a distributed network of many
agents. Brabham (2013) has also described crowdsourcing as requiring the following
ingredients:
1) An organisation that has a task that it needs performed,
2) A community (crowd) that is willing to perform the task voluntarily,
3) An online environment that allows the work to take place and the community
to interact with the organization, and
4) Mutual benefit for the organization and the community
By harnessing collective intelligence, growing companies and industries have been
able to leverage wisdom-of-the-crowd effects through the internet. A well-known
beneficiary of crowdsourcing is Wikipedia. Wikipedia, the highly successful free online
encyclopedia, has effectively leveraged crowd coordination to curate content at a scale and
quality that would be hard to replicate by another company. Wikipedia has succeeded by
having a large number of agents contribute to the platform, improving the accuracy and
completeness whilst reducing the bias of the online encyclopedia. Wikipedia has over 2.5m
pages of information with over six million contributors (Kittur & Kraut, 2008), requiring a
high amount of coordination in order to harness the crowd's wisdom. Although this is a
high-touch approach, Kittur and Kraut found that in this case organised crowdsourcing
was highly effective.
Approaches to judgement aggregation
In a recent study, Budescu and Chen (2014) explored weighted aggregation methods
that could improve upon the crowd's overall forecast accuracy. The study collected
responses on 104 "events" from 1,233 participants. Only 420 participants were used in the
data analysis, as participants who responded to fewer than 10 events could unduly skew
the crowd's effectiveness, reducing the efficacy of the study. The approach adopted was to 1)
identify experts, and 2) disregard non-expert contributions from the overall forecast. The
study observed a 39% improvement by applying and re-computing "expert weights"
periodically throughout the survey. The application of this approach, removing non-experts,
could be highly controversial in a democratic society, raising the risk of displacing
non-experts, which could reduce participation driven by the disengagement of
so-called "non-experts."
Fact-checking domain
Fact-checking has a long history. Notably, the role of an entity or person
carrying out independent validation of claims and facts dates back to 1913, when Ralph
Pulitzer and Isaac White of the New York World established the 'Bureau of Accuracy and Fair
Play.' The primary goal of this bureau was to track repeated offences of
misinformation and disinformation, seeking reprimands or public apologies from the accused
(Machor, 2008). In recent times, a significant amount of disinformation and 'fake news' has
been prevalent in the media, perceivably exerting a major influence on various democratic
processes such as the 2016 US election and the UK Brexit referendum. Fake news can be
defined as news articles or claims that are intentionally and verifiably fabricated, and likely
to mislead recipients (Allcott & Gentzkow, 2017). The proliferation of misinformation and
disinformation on social media platforms has prompted government-led inquiries concerning
the role social media platforms play in combating 'fake news'.
In a recent report published by Full Fact,[1] they discussed a fact-checking initiative
(the 'Third Party Fact Checking programme') they lead with Facebook to fact-check posts
flagged by Facebook as possibly false. These posts were added to a queue for
fact-checking. Once the content was checked for misinformation, they published the
fact-check outcome and were able to attach the fact-check article to the Facebook post
together with a rating: False, Mixture, False Headline, True, Not eligible, Satire, Opinion,
Prank generator, or Not rated. Of the 96 fact-checks published as part of the
Third Party Fact Checking programme, 61.4% of the claims were rated as 'false'.
Research has highlighted some of the effects of fake news: those who are exposed to
fake news are likely to believe it (Silverman & Singer-Vine, 2016; Pennycook & Rand,
[1] Report on the Facebook Third Party Fact Checking programme (https://fullfact.org/media/uploads/tpfc-q1q2-2019.pdf).
2018). With today's technology, the barriers to disseminating information, true or false, to a
large number of people are low. This low barrier allows individuals to be exposed to fake
news more frequently, and psychological experiments have shown that trust increases as
familiarity increases through cognitive fluency (Begg, Anas & Farinacci, 1992; Alter &
Oppenheimer, 2009), introducing the chance of familiarity bias. Another example of this
fluency effect can be found in a recent study by Pennycook et al. (2018). Pennycook's results
suggested that social media platforms help to incubate belief in "fake news" and that prior
exposure to misinformation can create an illusory truth effect, albeit within a plausible
boundary. The high prevalence of fake news in 2016 has also led to a decline in trust of mass
media amongst American voters, particularly Republicans (Swift, 2016).
What might be an effective solution to tackle the rise of fake news? Three possible
solutions are envisaged: (1) reduce the number of fake news stimuli the masses are
exposed to through structural changes on social media and other news platforms; (2)
empower the crowd to effectively evaluate news and claims, demoting those deemed false by
the crowd; and (3) use machine learning and AI techniques to parse world news and flag
misinformation. For the purpose of this paper, I will briefly expand on points 2 and 3.
Empowering the crowd. Independent and non-partisan fact-checking entities (e.g.
Full Fact, PolitiFact, Snopes and FactCheck) have been playing an increasingly important
role in recent times. However, independence alone is not sufficient: the source of funds
should also be neutral in order to maintain strong efficacy with no outside influence, political
or otherwise. There remains an imbalance between the velocity of news being "digitally
printed" and human fact-checkers' ability to perform in-depth investigations. Through online
crowdsourcing methods, distributed fact-checking could be a worthwhile solution. Attempts
have been made in this domain, in particular a blockchain project called Avow (Shamlo &
Alavi, 2018). The Avow project aims to counter disinformation and "fake news" by creating
a system which aggregates the crowd's anonymised opinions on various claims and news
items, rewarding contributors with ERC-20[2] tokens on the Ethereum[3] blockchain. This
approach uses a novel technology with an incentive system in place to encourage
participation for greater social impact, although the recent cryptocurrency price volatility is a
challenge, common to many blockchain projects, that may require further thought.
Wisdom of the machines. The relentless rate at which news is published,
compounded by the fragmented distribution channels of the internet, makes
fact-checking at scale a practically impossible human task. A recent MIT study (Baly et al.,
2018) explores tackling the problem of exponential "fake news" by using a machine learning
system, inspired by previous research (Horne et al., 2018a, 2018b), to assess whether a source
is accurate or politically biased for "fake news" detection. By collecting information from
multiple online sources, such as Wikipedia, Twitter, the article itself and Alexa Rank,[4] the
system was trained on a rich set of features from 1,066 news sources. The research found that
a news source having a Wikipedia page can help predict factuality but not political bias; it
also found that analysing the Twitter account (not tweets) does not provide any significant
indication of factuality or bias. The best-performing feature was the articles from the source
website, which highlights the importance of analysing the content of the news source;
analysing article titles alone was not strong enough. There is clearly a case and a necessity
for leveraging AI techniques to tackle the scale of "fake news", and from this study, the
advantage of assessing
[2] ERC-20 is a technical standard for smart contracts used on the Ethereum blockchain (https://en.wikipedia.org/wiki/ERC-20).
[3] Ethereum is an open-source, public, blockchain-based distributed computing platform (https://en.wikipedia.org/wiki/Ethereum).
[4] Alexa is a website traffic data provider that ranks websites (https://www.alexa.com/siteinfo).
the news source as a whole could reduce the need to assess claims one by one, if a bias and
factuality score were easily visible on all publishing platforms.
Confident crowds
Can a confident crowd be trusted? A recent study carried out by Aydin and colleagues
(2014) found that considering participant confidence can significantly improve accuracy,
particularly when combining performance with confidence. Previous studies have also shown
that expert confidence should be treated with care, as experts have been found to overestimate
their capabilities; for example, Glenberg and Epstein (1987) found that physics and music
experts over-estimated their ability to understand texts associated with physics or music,
respectively, compared to novices.
Overview of the current experiment
A related previous study (Drew, 2018) was unable to find any significant difference in
crowd accuracy from applying performance-based weighted methods to the crowd. The
inability to find an effect appeared to be primarily due to a small sample that yielded
non-parametric data (n = 36). Drew also provided a 5-point Likert scale of truthfulness, which
resulted in a proportion of the sample being removed, as it was believed that participants
giving a "neutral truthfulness" response did not provide a strong enough signal. For this
reason, the current study aims to address these limitations by increasing the sample size
(n = 113), increasing the question set (n = 120) and replacing the 5-point Likert scale option
with a binary 'mostly true' or 'mostly false' choice. The effects of these changes are
unknown; however, the literature reviewed above provides good reason to hypothesise the
following:
(i) The crowd's accuracy improves when removing low-performers.
(ii) The crowd's accuracy improves when expert judgements are primarily
considered while deliberately maintaining "crowd diversity".
(iii) The crowd's accuracy improves when only confident answers are considered.
The current experiment aims to test these hypotheses using questionnaire accuracy as the
target variable. Given that the limitations of the previous related study appeared to be mainly
caused by the sample size, which reduced the achieved power and produced non-parametric
data (Drew, 2018), we aim to overcome these limitations by increasing the sample size to 113
(previously 37) and the question set to 120 (previously 90).
Method
Design
For the purpose of this study, a repeated measures design was applied. Participants'
assessments of truthfulness, and their respective confidence responses, to a group of 120
ground-truth claims were used to examine whether weighted methods can improve the
accuracy of the group's overall performance. The questions were presented to participants as
a survey consisting of 120 claims related to general knowledge (e.g. society and culture) and
current affairs. The claims covered a broad range of news categories, the top five being
economics (16%), politics (12.5%), science (11.6%), health (10.83%) and immigration
(6.67%). All the questions were sourced from highly reputable sources, with a strong skew
towards independent fact-checking organisations. Each claim was carefully qualitatively
assessed for potential biases. 61% of claims related to UK and USA news. The survey was
slightly imbalanced in favour of more truthful
statements (52.5%) than false statements (47.5%). Each page presented a single short claim,
of the kind typically seen as a headline or tweet, to which the participant was required to
respond with whether they thought the claim was 'mostly true' or 'mostly false', together
with their level of confidence ranging from 'unconfident' to 'very confident' (see Figure 1).
Figure 1. An example of the format of the question page for
each claim.
A large majority of the claims were chosen from unbiased and rigorous fact-checking
experts. Broader news sources were also used to avoid creating a very niche knowledge quiz:
basing the survey only on fact-checking, which is predominantly skewed towards political
fact-checking, would have created a very narrow and nuanced survey for participants.
Fact-checking organisations included 'BBC Reality Check', 'Snopes', 'FullFact',
'PolitiFact', 'Fact Check', 'Channel 4', 'Washington Post', 'Africa Check' and
'World Bank'.
We decided against displaying participants' results relative to other participants, or
showing their performance mid-way through the questionnaire, as we were not interested in
the effects of this type of stimulus on participants' performance.
Self-reported questionnaire
All participants were required to self-report on a number of questions before the
questionnaire. These questions covered demographic information (age, nationality, gender,
religiosity, education-level and occupation). Participants were also asked ‘how happy do you
feel today?’ inspired by research which showed a positive correlation between emotional
intelligence and performance (Schutte, Schuettpelz & Malouff, 2001). Participants were also
required to indicate on a five-point Likert scale their attitude on free markets, Brexit vote,
frequency of news consumption and medium of news consumption. Another set of
self-reporting questions focused on the participants' level of expertise in news domains:
general news, economics and politics, science & health, pop culture, international affairs,
crime and art.
Participants
113 UK-based participants (62 males, 51 females) were recruited using the popular
crowdsourcing platform Amazon Mechanical Turk (MTurk). The crowd was incentivised
with immediate extrinsic motivation in the form of a $5.00 monetary reward upon completion
of the survey, with an additional $20 Amazon voucher offered to each of the top three
performers. The survey took approximately 21 minutes to complete on average (equivalent
to £14.28/h). This is notably higher than the median wage level of $1.38-2.30/h that previous
research has shown MTurk crowd workers to earn (Horton & Chilton, 2010; Hara, 2018).
The average age of the participants was 32.80 years (SD = 8.64, range = 18-67 yrs).
Participants were only eligible to participate if they were over the age of 18 and spoke
English as their first language or had an equal level of fluency. Participants were given a brief
to read that outlined the general objective of the survey, this being that we were assessing
their ability to assess the truthfulness of information relating to news and current affairs, so
that methods for validating information more efficiently and accurately could be applied. All
subjects were remunerated $5.00 for their participation. To incentivise high participant
performance, a bonus reward of a $20.00 Amazon voucher was also offered to each of the top
three highest-scoring participants. To prevent bots from participating in the survey,
participants were required to pass a CAPTCHA task before the survey commenced.
Participants were also required to have 100% of their previous "Human Intelligence Tasks"
("HITs") approved; these were criteria recommended on online forums to deter uncommitted
participants. The study was approved by the UCL Ethics Board Committee (ethics approval
code: CPB/2013/015).
Procedure
Participants were recruited on the widely used crowdsourcing "knowledge-worker"
platform, Amazon Mechanical Turk (www.mturk.com), and if they met the eligibility criteria
(UK-based with a 98% approval rate) they were sent to the survey brief and questionnaire,
which was hosted on the Gorilla platform. All participants were presented with news
headlines, quotes and factual statements, and a binary choice corresponding to whether the
statement was 'mostly true' or 'mostly false', along with their level of confidence in their
response. Each claim was presented on a new page. No 'back' option was given to
participants to revise their responses. From the learnings of the previous study (Drew, 2018),
it was necessary to limit the truthfulness response to two binary options, as opposed to the
5-point Likert scale applied in the previous research, which had resulted in a loss of sample
data due to insufficient direction on neutral responses. The confidence was set upon a 5-point
Likert
scale, ranging from 'unconfident' to 'highly confident' in five increments. The two data
points per claim provided the ability to assess another dimension of the
wisdom-of-the-crowd effects. Participants indicated their choice by clicking on it.
Splitting the question set
In order to establish the most efficient number of questions to use for the two sets, 120
simulations were performed on the complete set of questions. The objective of running this
simulation was to find the range at which an increase in the number of questions added a
diminishing level of improvement in the crowd's accuracy. A random sample of questions
was picked on each simulation run and the crowd's accuracy on it assessed; an additional
question was added on each subsequent run and the crowd accuracy measured per simulation.
A stabilisation of crowd accuracy can be visually observed in Figure 2 between 40 and 60
questions; the average accuracy within this range was 0.609. As efficiency in expert
classification is a key consideration for this study (i.e. the ability to identify experts within
the crowd with the fewest questions seemed highly beneficial to the application of the
wisdom-of-the-crowd effect), 40 of the 120 questions were chosen as the rank set, splitting
the two sets 40/80. This was also supported by a second test which took the mean difference
in accuracy between Set A (x) and Set B (y). Three selected splits of questions were used to
test the mean difference between the two sets of questions (20/100, 40/80, 60/60). Results
showed a significant difference between question splits, F(2, 336) = 9.83, p < .001; the
standard error was lowest for the 40/80 split (SE = 0.0532), which also supported the visual
graph (see Table 1 and Figure 3).
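The splitting simulation described above can be sketched as follows. This is a minimal illustration using synthetic data and a simple majority-vote accuracy measure, not the study's actual analysis code; the function names are hypothetical:

```python
import random

def crowd_accuracy(responses, truth, question_ids):
    """Majority-vote accuracy of the crowd on a subset of questions.

    responses: dict participant -> dict question -> +1 ('mostly true') / -1
    truth:     dict question -> +1 / -1 ground-truth answer
    """
    correct = 0
    for q in question_ids:
        vote = sum(r[q] for r in responses.values())
        crowd_answer = 1 if vote >= 0 else -1   # sign of the summed votes
        correct += crowd_answer == truth[q]
    return correct / len(question_ids)

def accuracy_curve(responses, truth, max_n, runs=100, seed=0):
    """Mean crowd accuracy for random subsets of 1..max_n questions;
    the curve flattening out indicates diminishing returns from adding
    more questions (cf. Figure 2)."""
    rng = random.Random(seed)
    questions = list(truth)
    curve = []
    for n in range(1, max_n + 1):
        accs = [crowd_accuracy(responses, truth, rng.sample(questions, n))
                for _ in range(runs)]
        curve.append(sum(accs) / runs)
    return curve
```

Plotting `accuracy_curve` against the number of questions reproduces the kind of stabilisation used here to justify the 40-question rank set.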
Figure 2. Line graph displaying the level of accuracy stabilising as
more questions are added to the simulation group.
Figure 3. Box plot showing the range of mean differences for each
question split.
Comparison of weighted methods effects
For the purpose of this study, five different weighted aggregation methods were
selected to explore the crowd's wisdom in an attempt to improve upon the mean participant
accuracy of 61% (refer to Table 1). In order to rank the participants, performance was based
on a random set of 40 questions ('Set A'), which was used as the "rank set." Participant
weights were calculated based on their individual performance on Set A. These weights were
then used in the aggregation of participants' responses on the second set of 80 questions
('Set B'). Set A and Set B questions were kept mutually exclusive. As a result of the
aggregation, the crowd's response was binarised as mostly true or mostly false for each
question. The crowd's responses were then compared to the expected answers, grounded in
fact-checking sources and implausible facts, from which an overall accuracy score was
calculated on the
complete set of 80 questions. Given the large number of possible sample combinations that
could be produced with the question dataset, C(n, r) = C(120, 40) ≈ 1.145568482 × 10^32,
this process was simulated one thousand times to reduce the likelihood of high variance and
increase reproducibility. Please refer to Table 1 for a summary of the mean accuracy of each
weighted method, which was then compared for statistical analysis.
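The evaluation loop described above can be sketched as follows. This is a minimal illustration under the assumption that the rank-set weight is simply the participant's Set A accuracy; helper names are hypothetical, not the study's analysis code:

```python
import random

def evaluate_split(responses, truth, n_rank=40, runs=1000, seed=0):
    """Repeatedly: draw a random rank set (Set A), weight each participant
    by their Set A accuracy, aggregate weighted votes on the held-out Set B,
    binarise the crowd answer by its sign, and score it against ground
    truth. Returns the mean crowd accuracy over all runs."""
    rng = random.Random(seed)
    questions = list(truth)
    total = 0.0
    for _ in range(runs):
        set_a = set(rng.sample(questions, n_rank))
        set_b = [q for q in questions if q not in set_a]
        # Weight = accuracy on Set A (one simple weighting choice).
        weights = {
            p: sum(r[q] == truth[q] for q in set_a) / n_rank
            for p, r in responses.items()
        }
        correct = 0
        for q in set_b:
            vote = sum(weights[p] * responses[p][q] for p in responses)
            correct += (1 if vote >= 0 else -1) == truth[q]
        total += correct / len(set_b)
    return total / runs
```

Repeating the draw of Set A many times, as here, is what damps the variance introduced by any single lucky or unlucky question split.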
For the purpose of data analysis, participants' individual responses x_p^q were enumerated
to 1 ('mostly true') and -1 ('mostly false'). Depending on the weighted method, a weight
w_p^a based on the participant's relative performance on Set A would then be applied to the
participant's response, and the sign of the aggregate response of the participants,
Σ_p w_p^a · x_p^q, would indicate the crowd's answer.
Unweighted method (UW)
For this method, all participants have equal weight; the mean response of all participants for each question was taken as the crowd's response. For example, if 75/113 participants voted for the claim to be mostly true, the result would be (75 - 38)/113 ≈ 0.33. A positive mean indicates the crowd believes the claim to be mostly true and a negative mean indicates the crowd believes the claim to be mostly false. The same aggregation approach applies to all tested weighted methods.
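The aggregation just described can be sketched as follows (a minimal illustration mirroring the 75-of-113 worked example; the helper name is hypothetical, not the study's actual analysis code):

```python
def crowd_answer(responses, weights=None):
    """Aggregate +1/-1 responses for one question; the sign of the
    (weighted) sum gives the crowd's answer."""
    if weights is None:
        weights = [1.0] * len(responses)  # unweighted: equal influence
    total = sum(w * r for w, r in zip(weights, responses))
    return "mostly true" if total > 0 else "mostly false"

# 75 of 113 participants vote 'mostly true' (+1), the other 38 vote -1.
votes = [1] * 75 + [-1] * 38
mean_vote = sum(votes) / len(votes)  # (75 - 38) / 113, about 0.33
```

The same `crowd_answer` helper works for every weighted method below by passing a different `weights` list.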
Performance-weighted method (PW)
Participants who were wrong more often than they were correct (scoring below .50 accuracy) in Set A were given a zero weight (w_p^a = 0) when aggregating the crowd's response on Set B.
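A sketch of this zeroing rule (whether surviving participants keep an equal weight of 1 or their Set A accuracy is our assumption; the function name is hypothetical):

```python
def pw_weights(set_a_accuracy, threshold=0.5):
    """Performance-weighted (PW) sketch: participants who were wrong
    more often than right on Set A (accuracy below .50) get zero
    weight; everyone else keeps an equal weight of 1 (an assumption)."""
    return [1.0 if acc >= threshold else 0.0 for acc in set_a_accuracy]

weights = pw_weights([0.44, 0.50, 0.61, 0.77])  # -> [0.0, 1.0, 1.0, 1.0]
```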
Expert weighted method (ExW)
Participants who were identified as "top performers," defined as being in the top 25th percentile in Set A, were assigned a performance-based weight that was used in aggregating the crowd's responses in Set B.
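One plausible reading of the ExW rule, sketched below (using each expert's Set A accuracy as the performance-based weight is our assumption, as is the function name):

```python
def exw_weights(set_a_accuracy, top_fraction=0.25):
    """Expert-weighted (ExW) sketch: only the top quarter of Set A
    performers keep a weight (their Set A accuracy, an assumption);
    all other participants are zeroed out."""
    ranked = sorted(set_a_accuracy, reverse=True)
    n_experts = max(1, round(len(ranked) * top_fraction))
    cutoff = ranked[n_experts - 1]  # lowest accuracy still deemed expert
    return [acc if acc >= cutoff else 0.0 for acc in set_a_accuracy]

weights = exw_weights([0.44, 0.52, 0.61, 0.77])  # only 0.77 survives
```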
Confidence weighted method (CW)
Participants' confidence was normalised by applying a z-score across their Set B responses. The confidence was then multiplied by the participant's response for that particular question, which was -1 for 'mostly false' and +1 for 'mostly true'. A high negative score would therefore indicate the participant is relatively confident that the claim is 'mostly false'. The aggregate of the participants' confidence-weighted responses was then taken as mostly true or mostly false depending on its sign.
Z-score weighted methods (W)
A z-score z_p^a was calculated on each participant's accuracy performance on Set A, which was then used as the exponent of a base number x when weighting Set B responses (x^(z_p^a)). By adjusting the base number on a scale (1, 4, 5, 6, 7, 1000), we were able to explore the optimal relative weight for each participant. For example, W1 weights by 1^(z_p^a) (equivalent to unweighted) and W5 weights by 5^(z_p^a).
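The z-score family can be sketched as follows (a minimal illustration; the exact standardisation details are not given in the text, so the sample statistics used here are assumptions):

```python
import statistics

def zscore_weights(set_a_accuracy, base):
    """Z-score weighted (W<base>) sketch: standardise each participant's
    Set A accuracy, then use it as the exponent of `base`.  base=1
    collapses to the unweighted method, since 1**z == 1 for any z."""
    mu = statistics.mean(set_a_accuracy)
    sd = statistics.stdev(set_a_accuracy)
    return [base ** ((acc - mu) / sd) for acc in set_a_accuracy]

w1 = zscore_weights([0.44, 0.55, 0.66, 0.77], base=1)  # all equal to 1.0
w5 = zscore_weights([0.44, 0.55, 0.66, 0.77], base=5)  # spread increases
```

Larger bases stretch the gap between high and low performers, which is why W1000 concentrates nearly all influence on the single best performer.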
Results
Hypotheses and predicted results
Hypothesis one (H1) predicts that the crowd's accuracy improves when the influence of low-performers is decreased. Hypothesis two (H2) predicts the crowd's accuracy improves when the expert judgements are heavily weighted. Hypothesis three (H3) predicts there is an optimal intermediate weighting method between expert weighting and crowdsourcing ("unweighted"). Hypothesis four (H4) predicts the crowd's accuracy improves when taking the participants' confidence into consideration. In summary:
(H1) The crowd's accuracy improves when decreasing the influence of low-performers.
(H2) The crowd's accuracy improves when the expert judgements are heavily weighted.
(H3) There is an optimal intermediate weighting method between the expert weighting and crowdsourcing ("unweighted") methods.
(H4) The crowd's accuracy improves when taking the participants' confidence into consideration.
Normality analysis
Various tests were conducted to assess whether the participants' scores were Gaussian and therefore suitable for parametric analysis. The main concern was the distribution of the crowd's accuracy. Figure 4 shows the crowd accuracy, which visually appears Gaussian, and the Q-Q plot in figure 5 shows no skew. As previous research has shown the Shapiro-Wilk test to be a reliable test of normality (Mendes & Pala, 2003), we conducted one to confirm the visual plots (W = 0.987, p = .367); this validated that the dataset was suitable for parametric statistical methods.
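A sketch of this check using SciPy's `shapiro` (the data below are simulated with an assumed seed and only roughly mimic the reported mean and SD; the study's W = 0.987, p = .367 came from the real sample):

```python
import numpy as np
from scipy import stats

# Simulated crowd-accuracy scores standing in for the real
# 113-participant sample.
rng = np.random.default_rng(42)
crowd_accuracy = rng.normal(loc=0.61, scale=0.03, size=113)

w_stat, p_value = stats.shapiro(crowd_accuracy)
# A p-value above .05 gives no evidence against normality, so parametric
# tests (t-tests, repeated-measures ANOVA) remain appropriate.
```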
Figure 4. Distribution of crowd accuracy.
This histogram shape illustrates the visual
appearance of a Gaussian distribution.
Figure 5. Q-Q Plot of crowd accuracy. This
figure illustrates there is no skew in the
accuracy.
Participant characteristics
The mean participant performance was 0.60 (min = 0.44; max = 0.77; SD = 0.07). Given the questions were predominantly related to British and American news, it appeared necessary to check whether participants of other nationalities were disadvantaged. A two-sample t-test displayed no clear evidence of a meaningful difference between British participants (M = 0.60, SD = 0.07) and non-British participants (M = 0.60, SD = 0.07); t(19) = -0.04, p = .48, d = .01.
A majority of participants had a high level of education (88% possessed A-Level or
equivalent or above). 5.3% of participants were students and 81% were employed (including
full-time, part-time and self-employed). The largest percentage of participants had British
nationality (86%), but the overall sample was diverse with a total of 14 different nationalities
represented.
Defining an expert
After carrying out pairwise comparisons on different percentiles, we found there was no significant difference between defining experts as those who scored in the top 15th versus the top 25th percentile, F(1, 999) = 3.25, p = .071, η_p² = 0.03. The greatest mean difference was observed between the unweighted method and experts defined at the 25th percentile threshold.
Repeated-measures ANOVA analysis
To test whether there was a statistically significant effect of applying weighted methods on accuracy, a repeated-measures ANOVA was used to compare the mean differences. The analysis was carried out within two distinct groups: group 1 (UW, PW, ExW) and group 2 (W1, W4, W5, W6, W7, W1000). Refer to table 2 and figure 6.
Figure 6. Graph illustrating the mean accuracy (of 1000 simulations) of each weighted method with respective error bars.
Group 1 analysis
The repeated-measures ANOVA showed there was a significant main effect of weighted methods on crowd accuracy, F(3, 2997) = 700.21, p < .001, η_p² = 0.524. To further explore what specifically was significant within the test, post-hoc analyses using Tukey's HSD indicated that the expert method (ExW) had the most significant positive effect on accuracy compared to the unweighted (UW) method (p < .001). The results also showed that weighting purely on the top performers decreased overall accuracy compared to the unweighted method. Examining the group 1 results supports the hypothesis that there is value in applying weighted methods to the crowd for better outcomes.
Figure 7. Density plot of accuracy for each weighted method (excluding z-score
weighted method - see figure 8).
Group 2 analysis
The repeated-measures ANOVA showed there was a significant main effect of weighted methods on crowd accuracy, F(5, 4995) = 895.82, p < .001, η_p² = 0.431. To further explore what specifically was significant within the test, post-hoc analyses using Tukey's HSD indicated that applying a z-score weight with a base of 4, 5 or 6 significantly improves accuracy compared to the unweighted method (W1) (p < .001). The results also showed there was no significant difference between W4, W5, W6 and W7, and that accuracy significantly decreases when experts are disproportionately weighted (W1000). This result supports the prediction that accuracy improves when decreasing the weight of low-performers (H1), while also providing insight on (H2) by showing that there is an optimal limit to how much experts should be weighted: heavily weighting experts has an adverse effect on the crowd's wisdom.
Figure 8. Density plot of accuracy for each z-score weighted method
Expert analysis
A further examination was carried out on the weighted methods which appeared to improve crowd accuracy significantly compared to the unweighted approach in both groups (ExW, W4, W5, W6, W7). A repeated-measures ANOVA showed there was a significant main effect of expert-weighted methods on crowd accuracy, F(4, 3996) = 15.16, p < .006, η_p² = 0.15. Post-hoc analyses using Tukey's HSD indicated there was only a significant difference between the W4 and W7 weighted methods.
Confidence analysis
An additional test was performed to check whether crowd confidence (CW) affects crowd accuracy. The participants were required to rate their confidence between 1 and 5 along with their answer (mostly true or mostly false). To address confidence biases, we applied a z-score normalisation to the confidence of each participant per question in Set B, then used the z-score confidence value as the exponent of a base number 2. This created a one-sided confidence distribution which was then multiplied by the participant's response for that particular question, which was -1 for 'mostly false' and +1 for 'mostly true'. This meant, for example, a high negative score would indicate the participant is relatively confident that the claim is 'mostly false'. The aggregate of the participants' confidence-weighted responses was then taken as mostly true or mostly false depending on its sign. The results showed there was a significant decrease in crowd accuracy between the unweighted (M = .71, SD = .029) and confidence-weighted (M = .52, SD = .032) methods, F(1, 999) = 17228, p < .001.
Power analysis
By conducting a simulation, 1000 results were created to compare the effects of weighted methods (N = 1000). This played a key part in detecting significance with the repeated-measures approach. The post-hoc power analysis revealed that, on the basis of the mean, the achieved power in groups 1 and 2 (d = 1.0) was large and above the recommended .80 level (Cohen, 1988).
Self-reported experts
Interestingly, in this study we examined self-reported expertise by looking at the rating each participant gave themselves on a 5-point Likert scale, ranging from 'very poor' to 'excellent', for each of seven domains (general news, economics and politics, science and health, pop culture, international affairs, crime and art). Self-reported experts were defined as those participants who reported a knowledge score above the mean (M = 23.93, max = 35, SD = 4.36). The results showed there was a significant difference in accuracy between self-reported experts (M = .585, SD = .074) and self-reported non-experts (M = .625, SD = .057), t(111) = -3.23, p < .002; these results appear to suggest that a conservative crowd is likely to perform better than a confident crowd.
Relationship between performance and level of education
Participants were asked "What is the highest degree or level of school you have completed? If currently enrolled, the highest degree received," to which they were provided 6 options ranging from 'GCSE (or equivalent)' to 'Doctorate'. A Welch's t-test showed there was a significant difference in performance between less formally educated participants, defined as those with or studying GCSEs (or equivalent) or A-levels (or equivalent) (M = 0.62, SD = 0.044, n = 29), and participants with Bachelor degrees or higher (M = 0.56, SD = 0.076, n = 84); t(111) = -2.18, p = .031. These results suggest that higher education has a negative effect on news judgement. This appears slightly counterintuitive, in particular because the "highly educated" group was larger. That said, there were many low-performing participants in the "highly educated" group.
Relationship between performance and happiness
Previous studies had shown there to be a positive correlation between happiness and performance (Schutte, Schuettpelz & Malouff, 2001). Participants were asked their level of happiness before the questionnaire commenced. A Welch's t-test showed there was a significant difference between participants who responded 'OK', 'slightly unhappy' or 'very unhappy' (M = 0.62, SD = 0.062) and those who reported a slightly happier emotional state (M = 0.59, SD = 0.072); t(111) = -2.90, p = .004. These results suggest that people who are less emotional or pessimistic perform significantly better on general knowledge tests than those who skew towards optimistic. A Welch's t-test was required because the Levene F-test showed there was a significant difference between variances, F(1, 92) = 14.11, p < .001.
Relationship between performance and age
We were interested to see whether there would be a significant difference in performance between age groups. The group was split into two by the average group age of 32 years. An independent-samples t-test showed there was no significant difference between under-32s (M = 0.61, SD = 0.063) and over-32s (M = 0.59, SD = 0.079); t(90) = 1.24, p = 0.21.
Relationship between performance and Brexit vote
The results from a Welch's t-test displayed no significant difference in performance between 'leavers' (M = 0.59, SD = 0.055) and 'remainers' (M = 0.6, SD = 0.073). A Welch's t-test was required because the Levene F-test showed there was a significant difference between variances, F(1, 92) = 4.91, p = .02.
Relationship between performance and gender
An independent-samples t-test was conducted to test whether there was a relationship between performance and gender. There was not a significant difference in the performance of males (M = 0.60, SD = 0.064) and females (M = 0.61, SD = 0.08); t(111) = 0.68, p = 0.49.
Relationship between performance and religiosity
Participants were asked "How religious are you?" and provided a 5-point Likert scale from 'Not religious at all' to 'Very religious'. The independent-samples t-test showed there was no significant relationship between performance and religiosity when comparing religious participants (M = 0.57, SD = 0.06) and non-religious participants (M = 0.64, SD = 0.05); t(94) = 6.46, p = 4.57. Although the t value was larger than the critical t statistic (1.98), the large p-value shows that the result is not significant. Participants who stated 'neutral' were excluded from both groups.
Performance of frequent news readers vs. infrequent news readers
Participants were asked "On average, how often do you read/watch the news?" and provided 5 choices from 'Rarely or never' to 'More than once a day'. An independent-samples t-test showed there was no significant difference in performance between frequent news readers (M = 0.6, SD = 0.07, n = 108) and infrequent readers (M = 0.58, SD = 0.047, n = 5); t(111) = 0.60, p = 0.54.
Figure 9. Box plots of group comparisons. Y-axis represents accuracy, X-axis represents groups.
Discussion
Review of hypotheses
Our study found a significant improvement in accuracy when applying weighted methods to crowd aggregation, in particular when applying higher weights to those who are deemed "experts" within the crowd (H1 & H2). We also found a similar positive effect on crowd accuracy whether low-performing "non-experts" were given zero weight (ExW) or down-weighted on an exponential scale relative to high-performers (W5, W6, W7). Interestingly, we also found there is an optimal intermediate weighting method between expert and unweighted methods (H3); however, we found the opposite effect on accuracy when aggregating the participant-normalised confidence (H4) in comparison to the unweighted method.
Possible explanations for the current findings
This study yielded support for the hypothesis that weighted methods can make the crowd wiser. Measuring "crowd wisdom" as overall questionnaire accuracy, we explored various aggregation methods to define an expert and weight them relative to non-experts. The group 2 analysis proved to be fruitful: by applying a plausibly objective method, as opposed to the subjective methods in group 1, we were able to explore an optimal weight that could significantly improve the crowd's accuracy, particularly compared to the unweighted method. When we explored the "weight space" for the expert, we observed a loss in accuracy with a base number greater than 5 and a significant drop in accuracy in the extreme case of x = 1000. This method is highly sensitive given its exponential nature. Due to time constraints, we were unable to find the exact optimal base number, but the results indicated it is located around the value 4 (W4). These results also showed that over-dependence on experts can have a negative effect on crowd accuracy (for example, W1000). An explanation for this could be that although experts may perform better overall, their gaps in knowledge could be domain-specific, where the crowd could compensate. For example, an academic may have limited or very little knowledge of sports and business, whereas a "generalist" in the crowd
may possess such knowledge in those areas. This can be particularly true in the case of this fact-checking survey, as many questions were spread across multiple news categories, making consistency a challenge for domain experts.
The crowd confidence weighted method (CW) did not prove to increase the accuracy of the crowd and was, in fact, the weakest performing weighted method. A normalisation was required to remove participants' confidence bias as much as possible. The decrease in crowd accuracy could be due to low-performing participants being over-confident in defective decisions and experts being conservative, a phenomenon also known as the Dunning-Kruger effect (Kruger & Dunning, 1999; Hodges, Regehr & Martin, 2001). This effect was also observed when we saw self-reported experts were less accurate than non-experts. The confidence results contradict the conclusions of Aydin et al. (2014), who found that aggregating only the 'certain' opinions and weighting by degrees of confidence was an effective method that resulted in higher accuracy. Interestingly, we observed higher accuracy in participants who were in a negative mood than in those who were in a slightly positive mood, defined by their self-reported level of happiness. This result contrasts with previous studies which found that positive mood promotes higher performance (Kavanagh, 1987).
Areas for future research
The conclusions above could not go beyond informed speculation without further experimental work to test them. Therefore, for future research in this domain, we propose the following areas of interest.
(i) Firstly, further testing should be carried out to reproduce these findings. Although we found a significant effect of weighting experts in the crowd through repeated-measures testing, the design of this study was narrowly focused on current affairs and news.
(ii) Secondly, it would be worthwhile to explore the expert space to better understand the possible and best ways in which an expert can be defined and identified. This study did not exhaustively explore the possibilities of defining an expert, for example, defining experts using the meta-data (age, nationality, news frequency).
(iii) Thirdly, if such a mechanism were to be implemented and made available to the public, further investigation should be carried out into creating a dynamic weighted method which takes into consideration the most recent performance compared to historical performance at both the individual and group level.
(iv) In this study, participants were anonymous and had no reputational risk from performing badly. As crowdsourcing systems are decentralised by design, the effects of anonymity and reputational risk in a crowdsourcing system should be further tested. Anonymity could be acceptable and have no impact on discrete tasks; however, incorporating a type of reputational risk could play an important part in designing higher-performing crowdsourcing.
(v) Lastly, creating a hybrid ('Augmented Intelligence') system between artificial intelligence and traditional human judgement to help detect misinformation on the internet could assist in improving accuracy and tackling the scale at which "fake news" is being published.
Conclusion
Our study found that applying a performance-based "expert" weighted method to the crowd improves the crowd's wisdom, measured by crowd accuracy. This finding contrasts with previous research that was not able to find a significant improvement in accuracy by applying weighted methods. This research indicates that, in order to optimise crowdsourcing, experts within the crowd should be given a higher weighting compared to non-experts. Future research should investigate ways in which experts can be assessed and weighted dynamically.
References
Allcott, H., & Gentzkow, M. (2017). Social media and fake news in the 2016 election.
Journal of economic perspectives, 31(2), 211-36.
Alter, A. L., & Oppenheimer, D. M. (2009). Uniting the tribes of fluency to form a
metacognitive nation. Personality and social psychology review, 13(3), 219-235.
Aydin, B. I., Yilmaz, Y. S., Li, Y., Li, Q., Gao, J., & Demirbas, M. (2014, June).
Crowdsourcing for multiple-choice question answering. In Twenty-Sixth IAAI
Conference.
Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., & Nakov, P. (2018). Predicting factuality
of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765.
Begg, I. M., Anas, A., & Farinacci, S. (1992). Dissociation of processes in belief: Source
recollection, statement familiarity, and the illusion of truth. Journal of Experimental
Psychology: General, 121(4), 446.
Bonabeau, E. (2009). Decisions 2.0: The power of collective intelligence. MIT Sloan Management Review, 50(2), 45.
Brabham, D. C. (2013). Crowdsourcing. Mit Press. (pp. 3)
Budescu, D. V., & Chen, E. (2014). Identifying expertise to extract the wisdom of crowds.
Management Science, 61(2), 267-280.
Camerer, C. F., & Johnson, E. J. (1997). The process-performance paradox in expert
judgment: How can experts know so much and predict so badly. Research on judgment
and decision making: Currents, connections, and controversies, 342.
Dittmar, J. E. (2011). Information technology and economic change: the impact of the
printing press. The Quarterly Journal of Economics, 126(3), 1133-1172.
Drew, I. (2018). Applying weighting methods to crowd truth judgements: Can the wisdom of the crowd be made wiser?
Expert. (n.d.). Retrieved July 12, 2019, from
https://www.merriam-webster.com/dictionary/expert
Galton, F. (1907). Vox populi (the wisdom of crowds). Nature, 75(7), 450-451.
Glenberg, A. M., & Epstein, W. (1987). Inexpert calibration of comprehension. Memory &
Cognition, 15(1), 84-93.
Goncalves, J., Hosio, S., Rogstadius, J., Karapanos, E., & Kostakos, V. (2015). Motivating
participation and improving quality of contribution in ubiquitous crowdsourcing.
Computer Networks, 90, 34-48.
Hackman, J. R., & Oldham, G. R. (1980). Work redesign.
Hammond, K. R., Hamm, R. M., Grassia, J., & Pearson, T. (1987). Direct comparison of the
efficacy of intuitive and analytical cognition in expert judgment. IEEE Transactions on
systems, man, and cybernetics, 17(5), 753-770.
Hara, K., Adams, A., Milland, K., Savage, S., Callison-Burch, C., & Bigham, J. P. (2018,
April). A data-driven analysis of workers' earnings on amazon mechanical turk. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (p.
449). ACM.
Hinds, P. J. (1999). The curse of expertise: The effects of expertise and debiasing methods on
prediction of novice performance. Journal of Experimental Psychology: Applied, 5(2),
205.
Hodges, B., Regehr, G., & Martin, D. (2001). Difficulties in recognizing one's own
incompetence: novice physicians who are unskilled and unaware of it. Academic
Medicine, 76(10), S87-S89.
Hong, L., & Page, S. E. (2004). Groups of diverse problem solvers can outperform groups of
high-ability problem solvers. Proceedings of the National Academy of Sciences,
101(46), 16385-16389.
Horne, B. D., Dron, W., Khedr, S., & Adali, S. (2018b, April). Assessing the news landscape:
A multi-module toolkit for evaluating the credibility of news. In Companion
Proceedings of The Web Conference 2018 (pp. 235-238). International World Wide
Web Conferences Steering Committee.
Horne, B. D., Khedr, S., & Adali, S. (2018a, June). Sampling the news producers: A large
news and feature data set for the study of the complex media landscape. In Twelfth
International AAAI Conference on Web and Social Media.
Horton, J. J., & Chilton, L. B. (2010, June). The labor economics of paid crowdsourcing. In
Proceedings of the 11th ACM conference on Electronic commerce (pp. 209-218).
ACM.
Howe, J. (2006). The rise of crowdsourcing. Wired magazine, 14(6), 1-4.
Janis, I. L. (2008). Groupthink. IEEE Engineering Management Review, 36(1), 36.
Kavanagh, D. J. (1987). Mood, persistence, and success. Australian Journal of Psychology,
39(3), 307-318.
Kephart, J. O., Hogg, T., & Huberman, B. A. (1990). Collective behavior of predictive
agents. Physica D: Nonlinear Phenomena, 42(1-3), 48-65.
Kittur, A., & Kraut, R. E. (2008, November). Harnessing the wisdom of crowds in wikipedia:
quality through coordination. In Proceedings of the 2008 ACM conference on
Computer supported cooperative work (pp. 37-46). ACM.
Kittur, A., Nickerson, J. V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., ... &
Horton, J. (2013, February). The future of crowd work. In Proceedings of the 2013
conference on Computer supported cooperative work (pp. 1301-1318). ACM.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: how difficulties in
recognizing one's own incompetence lead to inflated self-assessments. Journal of
personality and social psychology, 77(6), 1121.
Lakhani, K. R., Wolf, R. G., Feller, J., & Fitzgerald, B. (2005). Perspectives on free and open
source software. Perspectives on Free and Open Source Software, 1-22.
Machor, P. G. J. L. (2008). New directions in American reception study. Oxford University
Press on Demand.
Maslow, A. H. (1943). A Theory of Human Motivation. Psychological Review, 50(4), 370-96
Maslow, A. H. (1954). Motivation and Personality. New York: Harper and Row.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Mendes, M., & Pala, A. (2003). Type I error rate and power of three normality tests. Pakistan Journal of Information and Technology, 2(2), 135-139.
Miller, C. C., Burke, L. M., & Glick, W. H. (1998). Cognitive diversity among upper-echelon
executives: implications for strategic decision processes. Strategic management journal,
19(1), 39-58.
Murphy, A. H., & Winkler, R. L. (1977). Can weather forecasters formulate reliable
probability forecasts of precipitation and temperature. National weather digest, 2(2),
2-9.
Nisbett, R. E., & Ross, L. (1980). Human inference: Strategies and shortcomings of social
judgment.
Noland, M., Moran, T., & Kotschwar, B. R. (2016). Is gender diversity profitable? Evidence
from a global survey. Peterson Institute for International Economics Working Paper,
(16-3).
Norman, G. R., Rosenthal, D., Brooks, L. R., Allen, S. W., & Muzzin, L. J. (1989). The
development of expertise in dermatology. Archives of Dermatology, 125(8),
1063-1068.
Pennycook, G., & Rand, D. G. (2018). Who falls for fake news? The roles of bullshit
receptivity, overclaiming, familiarity, and analytic thinking. Journal of personality.
Pennycook, G., Cannon, T. D., & Rand, D. G. (2018). Prior exposure increases perceived
accuracy of fake news. Journal of experimental psychology: general.
Polzer, J. T., Milton, L. P., & Swann Jr, W. B. (2002). Capitalizing on diversity:
Interpersonal congruence in small work groups. Administrative Science Quarterly,
47(2), 296-324.
Schutte, N. S., Schuettpelz, E., & Malouff, J. M. (2001). Emotional intelligence and task
performance. Imagination, Cognition and Personality, 20(4), 347-354.
Shamlo, N. B., & Alavi, S. (2018, August). Fact Checker: AVOW. Retrieved July 14, 2019,
from https://www.avow.ai/
Silverman, C., & Singer-Vine, J. (2016). Most Americans who see fake news believe it, new
survey says, BuzzFeed.
Simon, H. A. (1955). A behavioral model of rational choice. The quarterly journal of
economics, 69(1), 99-118.
Surowiecki, J. (2005). The wisdom of crowds. Anchor.
Swift, A. (2016). Americans' trust in mass media sinks to new low. Gallup News, 14.
Wallis, K. F. (2014). Revisiting Francis Galton's forecasting competition. Statistical Science,
420-424.
Zhao, Y., & Zhu, Q. (2014). Evaluation on crowdsourcing research: Current status and future
direction. Inf Syst Front, 16, 417-434.
Appendix
Appendix A – The list of the 120 claims participants were required to respond to during the
quiz.
Claim
In the UK, youth unemployment is down 44% since 2010.
Ghanaians are allowed to divorce only if they attend court dressed the same clothing they
wore when they got married.
All cameras on M1 and M25 go live at midnight tonight, set at 72mph. Auto ticket
generating system with 6 point penalty. Watch your speed and tell everyone else tonight,
any speed over 90mph is instant ban & possible court & custodial sentence order!! Drive
safely.”
Lawmakers in California have proposed a new law called the "Check Your Oxygen
Privilege Act."
MDMA shown to increase empathy over other substances
Sniffing rosemary increases human memory by up to 75 percent.
UK won more gold medals in Rio Olympics 2016 than China
UK taxpayers are paying less income tax than 2010.
There are 4000 abortions a week in Britain.
NASA rejected Hillary Clinton's childhood dream of becoming an astronaut.
Over the past year the number of illegal immigrants crossing the Mexico-US has
significantly decreased.
‘A record number of people kill themselves in prisons in England and Wales in 2016,
figures show.’
The UK will be paying the Brexit divorce bill until 2064.
The number of poor [working class] students dropping out of university at the highest
level in five years.
The western cape has the lowest unemployment rate of all provinces.
Trump used crib notes during listening session with parkland survivors
Police officers have been cut by 21,000 since 2010
More than 80% of student graduates won't repay their loan in full.
China has a Panda shaped solar farm
The US is the largest donor of humanitarian aid in Syria
Some antidepressants are more effective than others
Wealthy professionals are most likely to drink regularly.
BBC Newsnight edited photos of Jeremy Corbyn to make him look close to Russia.
In a 2014 survey 24% of people thought that the USA was the country that posed the
greatest danger to world peace.
A woman who entered an Uber in Tampa, Florida, on 18 February 2019 was the victim of
an attempted kidnapping by a "sex traffic worker."
The Institute for Public Policy Research found household bills will rise by between £245
and £1,961 a year after Brexit.
Luxembourg is the capital of Luxembourg
The Walton family makes more money in one minute than Walmart workers do in an
entire year. This is what we mean when we talk about a rigged economy.
In 2018, Apple was the largest publicly traded company in the world
The single market is dependent on membership of the EU. What we’ve said all along is
that we want a tariff free trade access to the European market and a partnership with
Europe in the future.
"we just had 2 years (2016-2018) of record-breaking Global Cooling"
"Trump's action could push the Earth over the brink, to become like Venus, with a
temperature of two hundred and fifty degrees, and raining sulphuric acid."
Drug kingpin Joaquín "El Chapo" Guzmán testified that he gave millions of dollars to
Nancy Pelosi, Adam Schiff, and Hillary Clinton.
Musician Jay-Z said that "satan is our true lord" and that "only idiots believe in Jesus"
during a backstage tirade in November 2017.
President Trump's oft-repeated slogan "America First" was also a credo of the white
supremacist Ku Klux Klan organization.
The new technology, developed by private company ASI Data science, can detect Daesh
propaganda “with 99.99%% accuracy”.
The size of the world's ice caps (type of glacier) are at record high levels.
After leaving the EU, the UK will take back control of roughly £350 million per week.
Facebook shut down an AI experiment after chatbots developed their own language.
Illegal crossings at the US-Mexico border have reduced by 40%.
An individual's psychological attributes can be determined by observing and feeling the
skull.
Staying in the single market & customs union would not cover services.
The position and relative movement of continents is at least partially due to the volume of
Earth increasing.
98% of US mass shootings occur in gun-free zones
Former Federal Bureau of Investigation director Robert Mueller’s indictments (formal
accusation that a person has committed a crime) prove that there was no collusion
between Trump campaign and Russia.
Vaccines may cause autism.
Eating bacon is better for you than tilapia (common name for nearly a hundred species of
cichlid fish).
If you live in an area where the council is run by the Labour party, you pay £100 more
than under the Conservatives.
A vintage Heineken advertisement showed a toddler drinking a beer and boasted about
having the youngest customers in the business.
New study shows that marijuana leads to a 'complete remission' of Crohn's disease.
The McDonald's fast food chain announced they will be phasing out the Big Mac by July
1st.
Coffee causes cancer
There are 480,000 young people who are hidden from the unemployment figures.
The Snapchat CEO has said that the app is for rich people and so did not want to expand it
Parents should ask a baby's permission before changing their nappy/diaper.
Medical marijuana has no health risks says WHO
Trump doubled his African-american poll numbers (from 11% to 22%) in a week.
It would take $135 billion to eradicate global poverty.
More than 700 attacks have been launched from the Afrin area under PYD/YPG
Google search spike suggests many people don't know why they voted for Brexit
David Davis [Secretary of State for Exiting the European Union] has never said the
government had impact assessments of the effect of Brexit on different parts of the
economy.
“There is more money going into our schools in this country than ever before. We know
that real-terms funding per pupil is increasing across the system, and with the national
funding formula, each school will see at least a small cash increase.”
“It is an absolute scandal that the Conservatives are pressing ahead with a plan that could
leave over a million children without a hot meal in schools.”
Japan’s prime minister Shinzo Abe’s championing of women’s advancement is a factor in
“the beginning of a new era in female success”.
Donald Trump has been much tougher on Russia than Barack Obama.
“The top 1% of earners in this country are paying 28% of the tax burden. That is the
highest percentage ever, under any Government.”
“Last year, we increased the number of tourists [in South Africa] by 12.8%”
The National Institutes of Health (NIH) has plans for lifting the ban on human-animal
chimeras.
There are only 18 minutes of total action in the average baseball game.
Diesel cars are more polluting than petrol cars
Nigeria contributes 23% of the global malaria cases
47% of the population don’t earn enough money to bring in a wife or husband from
outside the EU.
Fewer than half of Britons think Princess Diana's death was accidental.
Trump signed a bill blocking Obama-era background checks on guns for people with
mental illness
Spending on mental health went up by £575 million last year
60% of UK trade is through EU trade agreements.
700,000 public workers use up half of Kenya's taxes
The type of cladding used on Grenfell Tower is banned in Britain
The total number of London murders, even excluding victims of terrorism, has risen by
38% since 2014.
In May 2018, President Donald Trump established a 'religious office' to give religious
groups a 'voice in government'.
The CIA paid two psychologists $81 million "to develop and run their torture program."
3.7 million people living in the UK are citizens of another EU country. That’s about 6%
of the UK population, according to the latest figures covering the year to June 2018.
12.7% of NHS staff say that their nationality is not British
Indian is the second most common nationality among NHS staff
76% of British people support Shamima Begum being stripped of her citizenship
Romanians are the most common EU national to live in the UK
The number of EU nurses coming to the UK has fallen by 90% since the Brexit vote.
In the Brexit campaign, parties on both sides of the EU referendum made false claims.
50% of Irish exports go to Northern Ireland.
Only 5% of Northern Ireland’s GDP goes to Ireland.
Every five minutes, 70 children will be born in the UK, 20 to mothers not born here.
The number of EEA nurses and midwives who joined the NHS for the first time fell by
91% from 2015/16 to 2017/18.
China is Britain's top trading partner
Neil Armstrong was the first man on the moon
Consumption of sugar causes type 2 diabetes
Approximately 1 in 4 people in the UK will experience a mental health problem each
year
Eating organic food doesn't come with any nutritional benefits over non-organic food
Refugees or illegal immigrants living in Britain get a total yearly benefit of £29,900.
The plane in the Malaysian Flight MH370 was hidden away and reintroduced as Flight
MH17 later the same year in order to be shot down over Ukraine for political purposes
We only use approximately 10% of our brains
Cold weather causes colds
The UK population is approximately 66 million in 2017
In January 2019, Cristiano Ronaldo had the highest number of Instagram followers in
the world
The most expensive Big Mac can be found in Switzerland at 6.62 USD
The UK drinks approximately 95 million cups of coffee per day
In 2018, Amazon was the largest publicly traded company in the world
Singapore has the highest average IQ level in the world
2 billion tons of waste was dumped in 2016
The American bison is the heaviest land mammal
The EU currently costs the UK over £350 million each week - nearly £20 billion a year
“Labour reveals over 200,000 nurses have quit the NHS since 2010.”
The Ethiopian calendar is 7.5 years behind the Gregorian calendar because it has 13
months.
Drinking lemon mixed with hot water for one to three months will cause cancer to
disappear.
75% of the world’s diet is produced from just 12 plant and five different animal species.
If you call 999 in an emergency but can’t speak, press 55 and they can track where you
are calling from using new technology.
There are 118 elements in the periodic table
Staying in the single market & customs union would not cover services.
The population of the UK in 2017 was 66 million
World War II ended in 1945
The Beatles are the best-selling artists of all time