Complete Study
American Voter Study - by Oojwal Manglik
15/04/2015
Introduction:
The Presidential Elections in the United States are interesting not just for Americans but
also for the rest of the world, owing to the status of the USA as a major player in global
events. Since these elections are characterized by high marketing spends by candidates and
close scrutiny by the general public, it is interesting to observe how American voters'
perception of the presidential candidate changes during a President's term in office.
For this study I would like to analyse the proportion of American voters who voted the same
way in two consecutive US presidential elections.
Data:
About the data set
I have used an extract of the 2012 American National Election Studies (ANES) survey. The
extract provides a sample of selected indicators from the 2012 study. The complete citation
of the data used is available in the citation section.
Data collection methodology
Data for the study was collected over a 5-month period (September 2012 to January 2013)
through face-to-face and internet-based interviews. Pre-election interviews were conducted
during the 2 months before election day, and post-election interviews during the 2 months
after the election.
Cases
Each case in the study represents a survey respondent who has reported his voting pattern
in the 2008 and 2012 elections.
Nature of Study
During data collection, surveyors only recorded the responses given by respondents and took
no measures that would introduce bias or alter the response of the voter. Data was also
collected at multiple points during the study. Hence, this is a prospective observational
study.
Scope of inference - Generalizability
This study takes data only for 2008 and 2012 elections and cannot be generalized for all US
Presidential elections. A similar analysis conducted for multiple pairs of elections can give
greater insight on how the proportion changes between different pairs of elections.
Scope of inference - Bias
The current study may not be generalizable to the complete population of the United States
of America mainly because the survey has a majority of African American and Hispanic
respondents, a demographic mix which is not representative of the complete population.
Scope of inference - Causality
For the current study, there may be some confounding variables which can make it look as
if there is a causal relationship between the voting patterns of a respondent across
elections. For example, a voter may simply vote for Barack Obama because the voter is a
lifelong Democrat. Such confounding variables may be difficult to identify exhaustively. As
a result, causality cannot be established from this observational study.
Variables
The data fields used for this study are as follows:
1. interest_whovote2008: for whom R voted for President in 2008. The first variable used in
the study is the vote cast by the respondent in the 2008 election. This is a non-ordinal
categorical variable with 3 levels - "Barack Obama", "John Mccain" and "Other {Specify}".
Additionally, 1394 values are recorded as NA (23.6% of all values) for various reasons.
#first few records of whovote2008
head(anes$interest_whovote2008)
## [1] Barack Obama Barack Obama Barack Obama <NA> Barack Obama
## [6] <NA>
## Levels: Barack Obama John Mccain Other {Specify}
#Summary of interest_whovote2008 field
VotePat2008<-anes$interest_whovote2008
VotePat2008Sumry<-summary(VotePat2008)
VotePat2008Sumry
##    Barack Obama     John Mccain Other {Specify}            NA's
##            2704            1702             114            1394
2. presvote2012_x: SUMMARY: For whom did R vote for President in 2012. The second
variable used in this study is the vote cast by the respondent in the 2012 election. This is
a non-ordinal categorical variable with 3 levels - "Barack Obama", "Mitt Romney" and "Other".
Additionally, 1608 values are recorded as NA (27.2% of all values) for various reasons.
#first few records of presvote2012_x field
head(anes$presvote2012_x)
## [1] <NA> Barack Obama Barack Obama Barack Obama Barack Obama
## [6] <NA>
## Levels: Barack Obama Mitt Romney Other
#Summary of presvote2012_x field
VotePat2012<-anes$presvote2012_x
VotePat2012Sumry<-summary(VotePat2012)
VotePat2012Sumry
## Barack Obama  Mitt Romney        Other         NA's
##         2496         1692          118         1608
Note: The Republican candidates for 2008 and 2012 were not the same (John McCain in
2008 and Mitt Romney in 2012). For this study, the two are treated as the same category.
A respondent who voted for both represents a class of voters who did not change their
voting preference (to become pro-president) between elections as a result of the
President's work during his term.
3. sample_state: SAMPLE- State of Respondent address (used for exploratory analysis)
The third variable used in this study is the state from which the respondent comes. This is
a non-ordinal categorical variable with 51 levels: the 50 states plus the District of
Columbia (DC).
#first few records of sample_state field
head(anes$sample_state)
## [1] AL AL AL AL AL AL
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
The total data set has responses from 5914 American voters. Responses which have value
of NA recorded in any of the fields of interest have been ignored for the exploratory and
statistical analysis in this study.
Exploratory data analysis:
The following bar plot depicts the number of respondents in the study who voted for the
different Presidential candidates in 2008. It shows that most respondents (2704 to be
precise) voted for Barack Obama. In terms of percentage, 59.82% of valid responses
were for Barack Obama, 37.65% were for John McCain and 2.52% were for Others. Here
valid responses are those responses which have not been categorized as NA for this
variable.
This response pattern is in line with the outcome of the 2008 Presidential election, in
which Barack Obama emerged as the winner. Of the 61.6% of eligible Americans who
cast their vote, 52.9% voted for Barack Obama, 45.7% for John McCain and 1.4% for
others (source of data is Wikipedia page -
http://en.wikipedia.org/wiki/United_States_presidential_election,_2008).
The sample proportions nevertheless differ noticeably from the reported proportions:
Barack Obama's share is about 7 percentage points higher in the sample than in the
actual result.
Note : All "NA" values have been omitted for each variable under consideration for the
purpose of this exploratory analysis.
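The percentages quoted for 2008 can be reproduced from the summary counts alone. The following is a minimal sketch rather than part of the original analysis; the vector names are illustrative, and the reported national shares are the Wikipedia figures quoted in the text.

```r
# Counts of valid 2008 responses, copied from the summary output (NA excluded)
counts2008 <- c("Barack Obama" = 2704, "John Mccain" = 1702, "Other" = 114)

# Share of each candidate among valid responses, in percent
sample_pct2008 <- round(100 * prop.table(counts2008), 2)

# Reported national vote shares quoted in the text, in percent
reported_pct2008 <- c("Barack Obama" = 52.9, "John Mccain" = 45.7, "Other" = 1.4)

# Gap between sample and reported shares, in percentage points
diff_pct2008 <- sample_pct2008 - reported_pct2008

print(sample_pct2008)
print(diff_pct2008)
```

The gap for Barack Obama comes out at roughly 7 percentage points, which is the size of the discrepancy between the sample and the official result.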
#Bar plot distribution of interest_whovote2008 field
barplot(VotePat2008Sumry[1:3],main = "Voting Pattern of Respondents in 2008",
xlab="Name of Candidates",ylab="Number of Respondents")
The following bar plot depicts the number of respondents in the study who voted for the
different Presidential candidates in 2012. It shows that most respondents (2496 to be
precise) voted for Barack Obama. In terms of percentage, 57.97% of valid responses
were for Barack Obama, 39.29% were for Mitt Romney and 2.74% were for Others. Here
valid responses are those responses which have not been categorized as NA for this
variable.
This response pattern is in line with the outcome of the 2012 Presidential election, in
which Barack Obama emerged as the winner. Of the 58.2% of eligible Americans who
cast their vote, 51.1% voted for Barack Obama, 47.2% for Mitt Romney and 1.7% for
others (source of data is Wikipedia page -
http://en.wikipedia.org/wiki/United_States_presidential_election,_2012).
It is also interesting to observe that the sample shares for 2008 and 2012 differ by less
than 2 percentage points for every candidate category.
As in 2008, the sample proportions differ noticeably from the reported proportions:
Barack Obama's share is about 7 percentage points higher in the sample than in the
actual result.
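Whether the 2012 sample shares differ from the reported national shares by more than chance can be checked with a chi-square goodness-of-fit test. This is a sketch under the assumption that the Wikipedia shares quoted in the text are the reference distribution; it is not a step the original study performed, and the object names are illustrative.

```r
# Valid 2012 responses from the summary output (NA values excluded)
counts2012 <- c("Barack Obama" = 2496, "Mitt Romney" = 1692, "Other" = 118)

# Reported national vote shares quoted in the text, as proportions (sum to 1)
reported2012 <- c(0.511, 0.472, 0.017)

# Goodness-of-fit test: do the sample counts match the reported shares?
gof2012 <- chisq.test(x = counts2012, p = reported2012)

print(gof2012$statistic)  # chi-square statistic on 2 degrees of freedom
print(gof2012$p.value)
```

A very small p value here would indicate that the survey respondents are not distributed like the national electorate, which matters for the generalizability discussed earlier.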
#Bar plot distribution of presvote2012_x field
barplot(VotePat2012Sumry[1:3],main = "Voting Pattern of Respondents in 2012",
xlab="Name of Candidates",ylab="Number of Respondents")
Now I move on to the comparison of the sample voting patterns of 2008 and 2012 by
looking at the contingency table below. The rows depict the sample voting patterns for the
2008 election and the columns depict those for the 2012 election. In this table, the
diagonal values represent respondents whose voting behaviour did not change between the
2008 and 2012 elections. Just by looking at the data I can see that a very high proportion
of respondents (3439 of 3841, about 89.5%) had the same voting preference in 2008 and
2012. This suggests that voting preferences do not change much between elections.
#Contingency table to compare voting patterns between 2008 and 2012
ContingencyTabVote<-table(VotePat2008,VotePat2012)
ContingencyTabVote
##                  VotePat2012
## VotePat2008       Barack Obama Mitt Romney Other
##   Barack Obama            2077         184    34
##   John Mccain               95        1325    35
##   Other {Specify}           22          32    37
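The share of respondents on the diagonal can be computed directly from this table. The sketch below rebuilds the table as a plain matrix so it runs without the ANES data; the object names are illustrative.

```r
# Contingency table of 2008 (rows) by 2012 (columns) votes, copied from the output
vote_tab <- matrix(c(2077,  184, 34,
                       95, 1325, 35,
                       22,   32, 37),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(
                     Vote2008 = c("Barack Obama", "John Mccain", "Other"),
                     Vote2012 = c("Barack Obama", "Mitt Romney", "Other")))

# Diagonal cells: same category in both elections
same_votes <- sum(diag(vote_tab))
total_votes <- sum(vote_tab)
prop_same <- same_votes / total_votes

# Row-wise proportions: how each 2008 group voted in 2012
row_props <- round(prop.table(vote_tab, margin = 1), 3)

print(c(same = same_votes, total = total_votes))
print(prop_same)
print(row_props)
```

The diagonal holds 3439 of the 3841 respondents with valid answers in both years, the same counts that appear in the binary comparison vector used for inference.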
The following bar plot depicts the state wise voting pattern of the respondents for the
2008 election. Visually it is evident that voting patterns vary from state to state. One
limitation of this inference is that the number of respondents is not the same for each
state, and for certain states the number of respondents is very small. However, this
inference is in line with the traditional view that certain states in the US are affiliated
with a particular political party (Democratic/Republican).
#Compare statewise voting patterns in 2008
#RespState holds the state of each respondent (sample_state field)
RespState<-anes$sample_state
ContingencyTabPat2008<-table(VotePat2008,RespState)
barplot(ContingencyTabPat2008,legend=rownames(ContingencyTabPat2008),
main = "State Wise Voting Pattern of Respondents in 2008",xlab="States",
ylab="Number of Respondents")
The following bar plot depicts the state wise voting pattern of the respondents for the
2012 election. This plot also shows that voting patterns vary from state to state. There
is also a certain consistency in the voting patterns between this plot and the previous
plot. This could possibly be attributed to the fact that the same respondents provided the
responses for both 2008 and 2012. This inference is also in line with the traditional view
that certain states in the US are affiliated with a particular political party
(Democratic/Republican).
#Compare statewise voting patterns in 2012
ContingencyTabPat2012<-table(VotePat2012,RespState)
barplot(ContingencyTabPat2012,legend=rownames(ContingencyTabPat2012),
main = "State Wise Voting Pattern of Respondents in 2012",xlab="States",
ylab="Number of Respondents")
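Because the number of respondents differs widely by state, bars of raw counts mostly show state sample sizes. One way to compare states on an equal footing is to convert each state's column to proportions. The sketch below uses hypothetical counts for three states (not ANES values) purely to illustrate the transformation.

```r
# Hypothetical counts for three states (illustrative only, not ANES values)
toy_counts <- matrix(c(40, 10, 300,
                       50,  5, 150,
                       10,  5,  50),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(
                       Vote = c("Barack Obama", "John Mccain", "Other"),
                       State = c("AL", "WY", "CA")))

# Column-wise proportions: within each state the shares sum to 1
toy_props <- prop.table(toy_counts, margin = 2)

print(round(toy_props, 3))
```

Passing a matrix like `toy_props` to `barplot()` in place of the count table would give bars of equal height, so that differences in candidate share between states are visible even for states with few respondents.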
Inference:
For the statistical analysis, my objective is to compare 2 paired categorical variables which
depict the voting behaviour of respondents in the 2008 and 2012 elections. Let us first
proceed with the 95% confidence interval analysis.
Confidence Interval
The statistical parameter of interest chosen for this purpose is the proportion. The
objective of this analysis is to find the 95% confidence interval for the proportion of
voters who voted either for the president in both elections (Barack Obama in 2008 and
2012) or not for the president in both elections (John McCain in 2008 and Mitt Romney in
2012, or other in both elections). In other words, this is the proportion of voters whose
voting pattern remained consistent across the two elections.
I begin this analysis by converting the available sample data from 2008 and 2012 to the
same levels for comparison. The levels chosen for our analysis are President, Not President
and Other. Even though the category Other also constitutes a vote not for Barack Obama, for
now I have classified it separately.
#Converting voter data into the same levels for 2008 and 2012
#revalue() comes from the plyr package
library(plyr)
Vote2008<-revalue(VotePat2008, c("Barack Obama"="President",
"John Mccain"="Not President","Other {Specify}"="Other"))
Vote2012<-revalue(VotePat2012, c("Barack Obama"="President",
"Mitt Romney"="Not President"))
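The same relabelling can be done in base R by assigning a named list to `levels()`, which avoids the plyr dependency. This is an alternative sketch, not the method used in the study; the short factors below are made-up stand-ins for the ANES columns.

```r
# Made-up stand-ins for the two ANES vote columns
v2008 <- factor(c("Barack Obama", "John Mccain", "Other {Specify}", NA, "Barack Obama"))
v2012 <- factor(c("Barack Obama", "Mitt Romney", "Other", "Mitt Romney", NA))

# Assigning a named list maps old level names (values) to new ones (names)
levels(v2008) <- list("President"     = "Barack Obama",
                      "Not President" = "John Mccain",
                      "Other"         = "Other {Specify}")
levels(v2012) <- list("President"     = "Barack Obama",
                      "Not President" = "Mitt Romney",
                      "Other"         = "Other")

print(levels(v2008))
print(levels(v2012))
```

Listing every level explicitly also fixes the level order, so both factors end up with identical level sets, which is required before two factors can be compared element by element.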
Now I consolidate the two vectors into a single vector that records the comparative voting
pattern in 2008 and 2012. If the voting pattern in the two years is the same, a value of 1
is recorded. If the voting pattern in the two years is not the same, a value of 0 is
recorded. All responses with an NA value in either year are omitted from the final vector.
The distribution of the final comparison is shown in the table output for VoteSamp below.
This vector is now a binary variable (success/failure or 1/0), where 1 means that the
respondent voted similarly in 2008 and 2012 and 0 means that the respondent did not.
#Initializing the vector to compare voting between 2008 and 2012
VoteSamp<-rep(NA,length(Vote2008))
#Populating the comparison vector VoteSamp
#If the 2008 or the 2012 value is NA then assign NA (these are removed later)
#If the 2008 and 2012 values are equal (ex. voter cast his vote for Barack
#Obama in both elections) then assign 1
#If the 2008 and 2012 values are not equal (ex. voter cast his vote for
#different candidates in the two elections) then assign 0
for (i in seq_along(Vote2008)) {
  if (is.na(Vote2008[i])||is.na(Vote2012[i])) {
    VoteSamp[i]<-NA
  } else if (Vote2008[i]!=Vote2012[i]) {
    VoteSamp[i]<- 0
  } else {
    VoteSamp[i]<- 1
  }
}
#Removing all samples for which response NA has been recorded
VoteSamp<-na.omit(VoteSamp)
#VoteSamp has now been converted into a binary distribution
table(VoteSamp)
## VoteSamp
## 0 1
## 402 3439
This table shows that 89.53% of the respondents voted similarly in 2008 and 2012.
This is the sample proportion (p-hat) of our study.
#Calculating sample proportions
SampProportionSame = table(VoteSamp)[2]/length(VoteSamp)
SampProportionNotSame = table(VoteSamp)[1]/length(VoteSamp)
SampProportionSame
## 1
## 0.8953398
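The comparison vector and the sample proportion can also be obtained without an explicit loop, since comparing two vectors element by element already yields NA wherever either value is missing. The following is a sketch on made-up data, with illustrative names, rather than a replacement for the code above.

```r
# Made-up paired votes using the harmonized level names
lv <- c("President", "Not President", "Other")
a2008 <- factor(c("President", "Not President", "Other", NA, "President"), levels = lv)
b2012 <- factor(c("President", "President", "Other", "President", NA), levels = lv)

# 1 if the vote is the same in both years, 0 if different, NA if either is missing
same_vote <- as.integer(as.character(a2008) == as.character(b2012))

# Drop respondents with a missing value in either year
same_vote <- same_vote[!is.na(same_vote)]

# Sample proportion of consistent voters (p-hat) is the mean of the 0/1 vector
p_hat_toy <- mean(same_vote)

print(same_vote)
print(p_hat_toy)
```

Taking the mean of a 0/1 vector gives the same value as dividing the count of 1s by the vector length.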
Next I construct the sampling distribution of sample proportions based on the available
VoteSamp vector. But before I do that, I calculate the sample size needed for a 1%
margin of error. For this calculation I have assumed the worst case scenario, with the
proportions of success and failure at 50% each, mainly because no reference proportion
from a reliable past study is currently available. Based on this, the required sample size
is 9604.
#Calculating the sample size required for 1% margin of error for a 95%
#confidence interval assuming equal probability of success & failure
zvalue<-qnorm(0.975)
n=(zvalue^2)*(0.5)*(0.5)/(0.01^2)
n
## [1] 9603.647
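The sample size calculation follows from solving the margin of error formula, z * sqrt(p * (1 - p) / n), for n. A small helper makes the dependence on the assumed proportion explicit; the function name `required_n` and its arguments are illustrative, not part of the original analysis.

```r
# Sample size needed for a given margin of error at a given confidence level.
# p = 0.5 is the worst case because p * (1 - p) is largest there.
required_n <- function(margin, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

n_worst_case <- required_n(0.01)            # p = 0.5
n_observed_p <- required_n(0.01, p = 0.8953) # proportion observed in this sample

print(n_worst_case)
print(n_observed_p)
```

With the worst-case proportion the helper returns 9604, matching the figure in the text. A proportion near 0.9 needs far fewer observations for the same margin.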
Now I create the sampling distribution. It consists of 500 resamples, each made up of 9604
observations drawn from VoteSamp. Sampling is done with replacement to ensure
independence of the observations. A summary and a histogram of the sampling distribution
are given below.
#Creating the sampling distribution
SamplingDistribution_Proportion<-rep(NA,500)
for (i in 1:length(SamplingDistribution_Proportion)) {
#n is not a whole number, so it is rounded up to 9604 draws
Samp<-sample(VoteSamp,ceiling(n),replace=TRUE)
SamplingDistribution_Proportion[i] = table(Samp)[2]/length(Samp)
}
#Summary of sampling distribution
summary(SamplingDistribution_Proportion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8864 0.8931 0.8953 0.8952 0.8971 0.9042
#Histogram of sampling distribution
hist(SamplingDistribution_Proportion)
From the histogram of the sampling distribution, I can see that it is nearly normal and
centred around the sample proportion that I calculated earlier. There is also very little
skew to the left or right. But before I apply the central limit theorem, let us evaluate
each required condition:
1. Independence - The available sample vector VoteSamp consists of 3841 respondents.
This is <10% of the American voter population. While constructing the sampling
distribution of the proportion, I have sampled with replacement. This ensures that the
condition of independence is met for our sampling distribution.
2. Skewness - I have 3439 successes and 402 failures in our sample. That means I have at
least 10 successes and 10 failures in our sample, which satisfies the success-failure
requirement. With this I can conclude that the sampling distribution of the proportion is
not skewed and is approximately normal, as required by the central limit theorem. The
histogram also supports the conclusion that this condition is met.
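Both conditions, and the spread implied by the central limit theorem, can be checked numerically. The sketch below rebuilds the 0/1 vector from the reported counts and, as an assumption of this sketch, resamples at the original sample size of 3841 (the usual bootstrap choice) rather than the 9604 used in the text; the seed is arbitrary.

```r
# Rebuild the binary vector from the reported counts (3439 ones, 402 zeros)
vote_same <- rep(c(1, 0), times = c(3439, 402))
n_obs <- length(vote_same)
p_hat <- mean(vote_same)

# Success-failure condition: at least 10 of each outcome
n_success <- sum(vote_same == 1)
n_failure <- sum(vote_same == 0)

# Standard error of a sample proportion from the normal approximation
se_analytic <- sqrt(p_hat * (1 - p_hat) / n_obs)

# Bootstrap: resample with replacement at the original sample size
set.seed(2012)  # arbitrary seed, only for reproducibility
boot_props <- replicate(500, mean(sample(vote_same, n_obs, replace = TRUE)))
se_boot <- sd(boot_props)

print(c(successes = n_success, failures = n_failure))
print(c(analytic = se_analytic, bootstrap = se_boot))
```

If the normal approximation holds, the standard deviation of the resampled proportions should be close to the analytic standard error, both around 0.005 for these counts.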
It is evident that we have an adequate number of observations for our analysis.
Consequently we go ahead and use the normal (z) distribution for our analysis (no need for
the t-distribution here). Now that the conditions have been met, I calculate the mean of
the sampling distribution that has been created.
#Mean of sampling distribution
SamplingDistribution_Proportion_Mean<-mean(SamplingDistribution_Proportion)
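I also calculate the margin of error for a 95% confidence interval, i.e. the z value
multiplied by the standard error of the proportion. Please note that the sample size 'n'
was chosen so that this margin is at most 1% in the worst case of p = 0.5; with the
observed proportion it is smaller.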
#Margin of error for 95% confidence interval (z value * standard error),
#stored in the variable SE
SE<-(zvalue)*sqrt(SamplingDistribution_Proportion_Mean*
(1-SamplingDistribution_Proportion_Mean)/n)
SE
## [1] 0.006125668
The confidence interval is calculated as the sampling distribution mean +/- the margin of
error computed above (stored in SE).
LowerConfidenceLimit <- SamplingDistribution_Proportion_Mean - SE
UpperConfidenceLimit <- SamplingDistribution_Proportion_Mean + SE
ConfidenceIntervalP <- c(LowerConfidenceLimit,UpperConfidenceLimit)
ConfidenceIntervalP
## [1] 0.8890837 0.9013351
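For comparison, the textbook interval for a single proportion uses the observed sample proportion and the actual number of respondents (3841) in the standard error. The sketch below computes that interval from the reported counts; it is a cross-check under the normal approximation, with illustrative names, and not the calculation performed above.

```r
# Reported counts: 3439 consistent voters out of 3841 valid respondents
n_success <- 3439
n_obs <- 3841
p_hat <- n_success / n_obs

# Standard error of the sample proportion and 95% margin of error
z_crit <- qnorm(0.975)
se_prop <- sqrt(p_hat * (1 - p_hat) / n_obs)
margin <- z_crit * se_prop

# Wald 95% confidence interval
ci_wald <- c(lower = p_hat - margin, upper = p_hat + margin)

print(margin)
print(ci_wald)
```

Because this version divides by 3841 rather than 9604, its margin of error is a little under 1 percentage point and the interval is somewhat wider than the one reported above.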
Hypothesis Testing
Now I move on to the hypothesis testing. I want to check the possibility that population
proportion for the number of people who voted similarly in the 2008 and 2012 elections is
91%. For this I construct the following hypothesis:
H0: PopulationProportion = 91%
Ha: PopulationProportion != 91%
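For this hypothesis, we will perform a two sided test using the normal distribution. We
will reuse the sampling distribution of the proportion created earlier, which has already
been shown to be nearly normal. First we calculate the z value for the null value of 91%,
dividing by the quantity SE computed above. Note that SE already includes the 1.96
multiplier, so this statistic is smaller in magnitude than a z score based on the standard
error alone: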
NullVal<-0.91
zvalue<-(SamplingDistribution_Proportion_Mean-NullVal)/SE
zvalue
## [1] -2.414526
Next I calculate the associated p value for this z score. Since this is a two tailed
test, the area of interest is the area under the curve in both tails, i.e. where
|zscore| > abs(zvalue)
Below I calculate the area of one tail and multiply it by 2 for the two tailed test.
pval<-2*pnorm(-abs(zvalue))
pval
## [1] 0.01575569
Since the obtained p value (about 1.6%) is below the 5% significance level, we can reject
the null hypothesis that the population proportion of Americans who voted similarly in the
2008 and 2012 US Presidential elections is 91%.
The result of the hypothesis test is in line with the 95% confidence interval identified
earlier, which does not contain 91%. Hence we can say that the two findings are consistent.
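A conventional one-proportion z test divides the difference between the sample proportion and the null value by the standard error under the null hypothesis, using the actual number of respondents. The sketch below applies that formula to the reported counts as a cross-check; it is not the calculation performed above, and the names are illustrative.

```r
# Reported counts and the null value from the hypothesis
n_success <- 3439
n_obs <- 3841
p_null <- 0.91
p_hat <- n_success / n_obs

# Standard error of the sample proportion if the null hypothesis were true
se_null <- sqrt(p_null * (1 - p_null) / n_obs)

# z statistic and two-sided p value
z_stat <- (p_hat - p_null) / se_null
p_value <- 2 * pnorm(-abs(z_stat))

print(z_stat)
print(p_value)
```

This version gives a z statistic of roughly -3.2 and a p value well below 0.05, so it leads to the same decision to reject the null hypothesis of 91%.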
Conclusion:
Based on this analysis, I am 95% confident that between 88.9% and 90.1% of American
voters voted consistently in the 2008 and 2012 presidential elections, i.e. the elections
for the first and second terms of the same President. The margin of error of this interval
is about 0.6 percentage points.
For the hypothesis testing, we can reject the null hypothesis that the population proportion
of Americans who voted similarly in the 2008 and 2012 US Presidential elections is 91%.
In the future, this methodology can be repeated for multiple pairs of US Presidential
elections to see if there is any statistical consistency in the findings over the years.
The main learning from this exercise has been a practical insight into how statistical
techniques can be used to strengthen our ability to draw conclusions and inferences.
Citation:
Full details of this data set are available at the following links:
Information on the study http://www.electionstudies.org/
Study Codebook
https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fanes1.html
Data Set Used http://bit.ly/dasi_anes_data
Additionally, the following Wikipedia pages were referenced for the actual results of the
presidential elections in 2008 and 2012:
http://en.wikipedia.org/wiki/United_States_presidential_election,_2008
http://en.wikipedia.org/wiki/United_States_presidential_election,_2012