Making Databases Great Again 
By Alifya Ali, Wade Cope, and Garner Vincent 
Motivation for Project 
We chose to collect data from polling results of the presidential primaries because, if there were ever a time in modern history to pay attention to politics, it is the 2016 election cycle. With multiple polls released each day for the last several months, we knew there would be no shortage of available data. Another reason we chose this primary season is that the Republican side was stacked with a record 17 candidates at the start, and several of them rose to the front of the polls before crashing and burning. It was this potential to analyze dynamic trends in relevant, real-world data over time that first drew us to political polling, and inspired us to make databases great again.
Web Scraping 
 
Figure a. Real Clear Politics Website  
The selection process for the website was simple. Political polls are a dime a dozen, but we had to find a source that would work nicely for the purposes of scraping data. Real Clear Politics is a poll-aggregation website, and for the purposes of our project an aggregate was the best way to go. Real Clear Politics splits its data very nicely into a readable table, separated first by date and then by the name of the poll, the polling agency, the percentage won by each candidate, and how many percentage points the leading candidate won by in that poll (excluding ties). The website does this for every poll conducted over roughly the month and a half before the current date. This gave us more than enough data to justify the table structure we chose, while keeping the scope manageable enough that the program still runs very fast.
 
Figure b. Real Clear Politics Source Code 
The source code of the page was also fairly straightforward. The page divides the information described above into separate table-row tags whose class is listed as empty or ‘alt’, alternating between those two distinctions. After the date information, which is extracted and then reformatted (from day of the week, month, day into year, month, day), each poll is broken up into the name of the race, the polling agency, the results of the poll, and the spread (which gives the winner and the margin they won by). The site does this for every single day across three pages’ worth of data, which comes to roughly 225 polls.
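To make that structure concrete, the rows the scraper walks look roughly like the hypothetical fragment below; the poll names and numbers are invented for illustration and are not taken from the site.

from bs4 import BeautifulSoup

# Hypothetical, simplified rows in the shape described above; not the actual
# Real Clear Politics source, and the numbers are invented.
sample_rows = """
<table>
  <tr class="date"><td>Friday, March 11</td></tr>
  <tr class=""><td>Republican Nomination, FOX News, Trump 41, Cruz 29, Kasich 18, Trump +12</td></tr>
  <tr class="alt"><td>Democratic Nomination, CNN/ORC, Clinton 53, Sanders 41, Clinton +12</td></tr>
</table>
"""

soup = BeautifulSoup(sample_rows, "html.parser")
for row in soup.find_all("tr"):
    print(row.get("class"), "->", row.get_text(" ", strip=True))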
 
 
Figure c. Web Scraping Function  
 
Because of the way the source code is broken up, scraping the data was actually very simple. First we made a BeautifulSoup object from the scraped page and made a list containing the names of the days of the week as strings. We then used a for loop to traverse all the blocks of code (called siblings in the tree structure) that fell into the classes “date”, “alt”, or an empty string. We did this with the BeautifulSoup method findAll(), which, given those parameters, neatly pulled out all of the data from the page that was actually relevant to our database. In each iteration of our loop, we split the sibling and then tested its length. We only worked with lists whose length was greater than 0, because findAll() occasionally returns empty lists. After checking the length, we checked whether the first element of each split sibling contained a weekday. If it did, the date was transformed from its text format into MySQL DATE format (e.g. 2012-01-01) and stored in a local variable to be used when loading all of the polls that occurred on that date.
If the element was not a date, we split the sibling further (using the comma as a delimiter) and then deleted any empty elements at the beginning and end of the list. Next, we executed statements according to the location of each element, formatting them as needed for insertion into our database. Candidate names and result points had to be split apart using the space character as a delimiter, and the plus symbol had to be removed from each spread. If the element was neither a date nor a candidate (another fun quirk of Beautiful Soup), we continued the loop and moved on to the next sibling, because the element was not needed for our database.
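A condensed sketch of that traversal is shown below. It is not the function from Figure c: the requests fetch, the helper to_mysql_date, and the returned structure are illustrative assumptions, and ties (which have no spread) are not handled here.

import datetime
import requests
from bs4 import BeautifulSoup

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def to_mysql_date(text, year=2016):
    # "Friday, March 11" becomes datetime.date(2016, 3, 11); the year is
    # assumed because the page omits it.
    parsed = datetime.datetime.strptime(text.split(", ")[1], "%B %d")
    return parsed.replace(year=year).date()

def scrape_polls(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    current_date = None
    polls = []
    for row in soup.find_all("tr"):
        # Only rows whose class is "date", "alt", or empty carry data.
        if (row.get("class") or []) not in (["date"], ["alt"], []):
            continue
        text = row.get_text(" ", strip=True)
        if not text:                               # skip the occasional empty row
            continue
        if text.split(",")[0] in WEEKDAYS:
            current_date = to_mysql_date(text)     # applies to the polls that follow
            continue
        fields = [f.strip() for f in text.split(",") if f.strip()]
        name, agency = fields[0], fields[1]
        results = [r.rsplit(" ", 1) for r in fields[2:-1]]    # e.g. ["Trump", "41"]
        spread = fields[-1].replace("+", "").rsplit(" ", 1)   # plus sign removed
        polls.append((current_date, name, agency, results, spread))
    return polls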
 
The Database 
 
 
 
 
 
Figure d. Entity Relationship Diagram 
Our database structure consists of a total of five tables. The ‘Polling Agency’ table houses the name of each unique polling agency and its auto-incremented primary key. An ‘agency’ is the website or entity that originally created the poll and recorded its responses. Each polling agency has one or more poll records related to it by the agency’s primary key, each housed in the ‘Poll’ table. A ‘Poll’ record consists of the name of the poll and the date it was conducted. Each ‘Poll’ record has one or more related results, each housed in the ‘Result’ table. A ‘Result’ record consists of the name of the politician the result is recorded for and the number of percentage points that politician received in the poll. The ‘Result’ table serves as a junction table between the ‘Poll’ table and the ‘Politician’ table, because each politician appears in at least one poll, and usually in many. Each record in the ‘Result’ table stores a politician’s last name as the foreign key to its ‘Politician’ record. Each ‘Politician’ record also houses the politician’s party affiliation. Finally, each ‘Poll’ record may or may not be associated with a ‘Spread’ record: a ‘Poll’ record is associated with a ‘Spread’ record only if the poll did not end in a tie.
Data Dictionary:
1. Polling Agency
a. Entity Description:
i. Our website collects polling information from many different polling agencies across the internet and other media. Examples are “FOX” and “MSNBC”.
b. Attributes:
i. pk_Id = auto-incremented primary key for each polling agency record (INT)
ii. Name = name of the polling agency, e.g. “MSNBC” (VARCHAR)
2. Poll
a. Entity Description:
i. Each polling agency conducts multiple polls over the course of weeks and months. Each record in this table is a poll on a particular day (e.g. Republican Poll, March 11th).
b. Attributes:
i. Pk = auto-incremented primary key (INT)
ii. fk_pa = foreign key to the polling agency (INT)
iii. Date = date of the poll (DATE)
iv. Name = name of the poll itself, e.g. republican vs. democrat, republicans only, democrats only (VARCHAR)
3. Result
a. Entity Description:
i. Each poll has a variable number of results. A result is a politician and the percentage points they received in that poll.
b. Attributes:
i. Fk_poll = foreign key to the poll primary key (INT)
ii. Fk_politician = foreign key to the politician primary key, the candidate’s last name (VARCHAR)
iii. Points = number of points the candidate received in the poll (INT)
4. Spread
a. Entity Description:
i. Each poll has at most one spread: the number of points by which the winning politician won the respective poll.
b. Attributes:
i. Fk_poll = foreign key to the poll (INT)
ii. Fk_politician = foreign key to the politician primary key, the candidate’s last name (VARCHAR)
iii. Points = number of points the candidate won by (the spread) in the poll (INT)
5. Politician
a. Entity Description:
i. The presidential candidates represented in the polling data collected.
b. Attributes:
i. Pk = last name of the candidate (VARCHAR)
ii. Party = Republican or Democrat (VARCHAR)
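The dictionary above corresponds to DDL roughly like the following sketch. Column sizes are assumptions, the UNIQUE constraint on the agency name reflects the duplicate-rejection behavior described later, and the exact statements in the application may differ in small ways.

# Sketch of the schema implied by the data dictionary; not the application's exact DDL.
SCHEMA = [
    """CREATE TABLE agency (
           pk_Id INT AUTO_INCREMENT PRIMARY KEY,
           Name  VARCHAR(100) NOT NULL UNIQUE)""",
    """CREATE TABLE politician (
           Pk    VARCHAR(50) PRIMARY KEY,
           Party VARCHAR(20))""",
    """CREATE TABLE poll (
           Pk    INT AUTO_INCREMENT PRIMARY KEY,
           fk_pa INT NOT NULL,
           Date  DATE,
           Name  VARCHAR(200),
           FOREIGN KEY (fk_pa) REFERENCES agency (pk_Id))""",
    """CREATE TABLE result (
           Fk_poll       INT NOT NULL,
           Fk_politician VARCHAR(50) NOT NULL,
           Points        INT,
           FOREIGN KEY (Fk_poll) REFERENCES poll (Pk),
           FOREIGN KEY (Fk_politician) REFERENCES politician (Pk))""",
    """CREATE TABLE spread (
           Fk_poll       INT NOT NULL,
           Fk_politician VARCHAR(50) NOT NULL,
           Points        INT,
           FOREIGN KEY (Fk_poll) REFERENCES poll (Pk),
           FOREIGN KEY (Fk_politician) REFERENCES politician (Pk))""",
]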
 
Overview of the Process of Creating the Database 
Our application begins by opening a connection to a MySQL server. A connection object is created upon authentication via the PyMySQL library, and a cursor object, which serves as the pointer through which PyMySQL commands are executed, is created from the connection object. Anyone can run this application by changing the connection information on line 143 to the location of, and login credentials for, their own MySQL server.
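In outline, that setup amounts to the sketch below; the host and credentials shown are placeholders, not ours.

import pymysql

# Placeholder host and credentials; replace these with your own MySQL server's
# address and login, as described above for line 143 of the application.
connection = pymysql.connect(host="localhost", user="user",
                             password="password", charset="utf8mb4")
cursor = connection.cursor()   # every later SQL statement is executed through this cursor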
Next, our application creates a database called ‘polling’ on the connected server. Note that this will fail if a database named ‘polling’ already exists on the server. After creating the database, the tables ‘agency’, ‘poll’, ‘politician’, ‘result’, and ‘spread’ are created, representing the entities ‘Polling Agency’, ‘Poll’, ‘Politician’, ‘Result’, and ‘Spread’ from the relationship diagram, respectively. After the database and tables are created, the web scraping part of the application begins. During scraping, a set of functions whose names contain the word ‘store’ is executed to insert records.
Each ‘store’ function takes relevant information from the scraper and inserts a record into the table named in the second part of the function’s name. For example, the function ‘storeAgency’ takes the title of a polling agency as an argument and inserts an agency record into the agency table. With each store function call, a new record insertion is attempted. If the insert fails (usually because of the restriction against duplicate agency titles in the case of storeAgency, which is the intended behavior), then nothing is inserted and the transaction is committed as-is. This process is similar, although not identical, across the five ‘store’ functions. Next, the ‘store’ function selects the primary key from the table in question, using one of its arguments as an identifier, and returns that primary key to the main scraping code. This makes it easy to relate records in the ‘Poll’ table: in the case of storeAgency, for example, the returned primary key is later passed as an argument into the storePoll function, where it is inserted into the database as the foreign key of the new poll record.
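The storeAgency case, for example, follows roughly this insert-then-select shape; this is a sketch against the schema sketched earlier, not the exact code, and the other ‘store’ functions differ in their tables and arguments.

import pymysql

def store_agency(cursor, connection, title):
    # Try the insert; a duplicate agency title violates the UNIQUE constraint,
    # which is the intended behavior, so the failure is simply ignored.
    try:
        cursor.execute("INSERT INTO agency (Name) VALUES (%s)", (title,))
    except pymysql.err.IntegrityError:
        pass
    connection.commit()
    # Look the primary key up by title and hand it back to the caller, which
    # later passes it to storePoll as the poll record's foreign key.
    cursor.execute("SELECT pk_Id FROM agency WHERE Name = %s", (title,))
    return cursor.fetchone()[0]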
After inserting all of the records it finds, the application drops the database created at initialization and closes both the cursor and connection objects (lines 174-178). These lines are part of a ‘finally’ statement, meaning that if any errors occur during runtime, the database will still be dropped, and the connection and cursor objects will still be closed, before the errors are thrown.
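The overall shape of that lifecycle looks roughly like the skeleton below; the credentials are placeholders as before, and the body of the try block stands in for the table creation, scraping, and ‘store’ calls described above.

import pymysql

connection = pymysql.connect(host="localhost", user="user", password="password")
cursor = connection.cursor()
try:
    cursor.execute("CREATE DATABASE polling")
    cursor.execute("USE polling")
    # ... create the five tables, scrape the site, and call the 'store' functions ...
finally:
    # Runs whether or not an error occurred above: the database is dropped and
    # the cursor and connection are closed before any exception propagates.
    cursor.execute("DROP DATABASE polling")
    cursor.close()
    connection.close()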
 
Queries 
When it came time to design the queries for our interface, we wanted to give the user multiple options for analyzing the data. We provided some commands that give general information, like which candidates were running, which agencies conducted the polls, and a list of all of the results from the past few months.
 
Figure e. The Queries 
We also provided queries that show things like changes in a candidate's poll standings over time. This was accomplished through a query that gives a candidate's 'fluctuation', a figure calculated by subtracting their lowest poll result from their highest finish. This number gives a sense of how a candidate has fared over time. If their fluctuation number is low, it suggests that the candidate has placed very consistently in polls: either they always did well, as Clinton has done from the start, or they simply weren't able to surge out of last place as they had planned to (Ohio Gov. John Kasich). A high fluctuation number means that in recent months the candidate has either risen or fallen by a large amount. It's entirely possible that they did both, as we saw from candidates like Ben Carson and Jeb Bush.
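Against the schema sketched earlier, the fluctuation figure comes down to a query along these lines; this is a sketch, not necessarily the exact query shown in Figure e.

# A candidate's highest finish minus their lowest finish, grouped per politician.
FLUCTUATION_QUERY = """
    SELECT Fk_politician,
           MAX(Points) - MIN(Points) AS fluctuation
    FROM result
    GROUP BY Fk_politician
    ORDER BY fluctuation DESC
"""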
Another important query we provided was the option to see an average percentage result 
for each candidate over a period of time. Individual poll results will vary from agency to agency, 
and even polls conducted in the same state at the same time can give wildly conflicting results. 
Taking the average of multiple polls gives a more accurate picture of the state of the race with 
less of a margin of error. 
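The period average reduces to a similar grouped query, with the date bounds supplied as parameters; again, this is a sketch against the schema above rather than the exact query from Figure e.

# Mean percentage per candidate across all polls in a caller-supplied date window.
AVERAGE_QUERY = """
    SELECT r.Fk_politician,
           AVG(r.Points) AS average_points
    FROM result r
    JOIN poll p ON r.Fk_poll = p.Pk
    WHERE p.Date BETWEEN %s AND %s
    GROUP BY r.Fk_politician
    ORDER BY average_points DESC
"""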
Challenges and Future Improvements 
Throughout the design and construction of our program, the majority of obstacles we 
faced were in the form of initial program set up, syntax errors, learning how to use PymySQL 
and Beautifulsoup, and GitHub. Once past initial setup and navigation of foreign territory, 
writing the python and SQL necessary for the program was not difficult (the first half of the 
semester helped with his). 
Writing correct syntax across multiple programming languages can get confusing, given the subtle nuances of each. We were able to pinpoint the offending syntax quickly through error messages on the command line, and once we realized which aspect of our syntax was incorrect, we found consulting our textbooks invaluable in fixing these mistakes.
Helpful future improvements to our application would include nicer output formatting, an expansion of the project to include more than three pages of data from the Real Clear Politics website, and performance improvements to our database queries.
Potential improvements to the database queries in our application exist within the store functions. As the store functions are written, a select statement is executed every time a store function is called, even when it is not necessary. For example, if a certain polling agency has already been inserted into the ‘agency’ table, we could already have captured its primary key when the record was first created. That key could be stored in a Python dictionary, keyed by the polling agency’s title, which would avoid a select statement to the database to obtain the key. While this sort of change may seem trivial and may not save noticeable execution time at our application’s current scale, a much larger version of the application could see real performance gains from this type of change.
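A sketch of that idea, building on the store_agency sketch above (illustrative only, not code from the application):

import pymysql

agency_keys = {}   # agency title -> primary key, filled in as agencies are first seen

def store_agency_cached(cursor, connection, title):
    if title in agency_keys:               # key already known: no SELECT round trip
        return agency_keys[title]
    try:
        cursor.execute("INSERT INTO agency (Name) VALUES (%s)", (title,))
    except pymysql.err.IntegrityError:
        pass
    connection.commit()
    # One lookup per new agency (cursor.lastrowid could also supply the key
    # right after a successful insert); the result is cached for later calls.
    cursor.execute("SELECT pk_Id FROM agency WHERE Name = %s", (title,))
    agency_keys[title] = cursor.fetchone()[0]
    return agency_keys[title]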