Making Databases Great Again 
By Alifya Ali, Wade Cope, and Garner Vincent 
Motivation for Project 
We chose to collect data from polling results of the presidential primaries because, if there were ever a time in modern history to pay attention to politics, it is the 2016 election cycle. With multiple polls released each day for the last several months, we knew there would be no shortage of available data. Another reason we chose this primary season is that the Republican side was stacked with a record 17 candidates at the start, and several of them rose to the front of the polls before crashing and burning. It was this potential to analyze dynamic trends in relevant, real-world data over time that first drew us to political polling, and inspired us to make databases great again.
Web Scraping 
 
Figure a. Real Clear Politics Website  
The selection process for the website was simple. Political polls are a dime a dozen, but we had to find a source that would work nicely for the purposes of scraping data. Real Clear Politics is a poll-aggregation website, and for the purposes of our project an aggregate was the best way to go. Real Clear Politics splits its data very nicely into a readable table, separated first by date and then by the name of the poll, the polling agency, the percentage won by each candidate, and how many percentage points the leading candidate won by in that poll (excluding ties). The website does this for every poll conducted over roughly the month and a half before the current date. This gave us more than enough data to justify the table structure we chose, while keeping the scope manageable enough that the program still runs very fast.
 
Figure b. Real Clear Politics Source Code 
The source code of the page was also fairly straightforward. The page divides the information described above into separate table-row tags whose class is listed as empty or ‘alt’, alternating between those two distinctions. After the date information, which is extracted and then reformatted (from day of the week, month, day into year, month, day), each poll is broken up into the name of the race, the polling agency, the results of the poll, and the spread (which gives the winner and the margin they won by). The site does this for every single day across three pages’ worth of data, which comes to roughly 225 polls.
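To make that structure concrete, the rows the scraper walks look roughly like the hypothetical fragment below; the poll names and numbers are invented for illustration and are not taken from the site.

from bs4 import BeautifulSoup

# Hypothetical, simplified rows in the shape described above; not the actual
# Real Clear Politics source, and the numbers are invented.
sample_rows = """
<table>
  <tr class="date"><td>Friday, March 11</td></tr>
  <tr class=""><td>Republican Nomination, FOX News, Trump 41, Cruz 29, Kasich 18, Trump +12</td></tr>
  <tr class="alt"><td>Democratic Nomination, CNN/ORC, Clinton 53, Sanders 41, Clinton +12</td></tr>
</table>
"""

soup = BeautifulSoup(sample_rows, "html.parser")
for row in soup.find_all("tr"):
    print(row.get("class"), "->", row.get_text(" ", strip=True))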
 
 
Figure c. Web Scraping Function  
 
Because of the way the source code is broken up, scraping the data was actually very simple. First we made a BeautifulSoup object from the scraped page and made a list containing the names of the days of the week as strings. We then used a for loop to traverse all the blocks of code (called siblings in the tree structure) that fell into the classes “date”, “alt”, or an empty string. We did this with the BeautifulSoup method findAll(), which, given those parameters, neatly pulled out all of the data from the page that was actually relevant to our database. In each iteration of our loop, we split the sibling and then tested its length. We only worked with lists whose length was greater than 0, because findAll() occasionally returns empty lists. After checking the length, we checked whether the first element of each split sibling contained a weekday. If it did, the date was transformed from its text format into MySQL DATE format (e.g. 2012-01-01) and stored in a local variable to be used when loading all of the polls that occurred on that date.
If the element was not a date, we split the sibling further (using the comma as a delimiter) and then deleted any empty elements at the beginning and end of the list. Next, we executed statements according to the location of each element, formatting them as needed for insertion into our database. Candidate names and result points had to be split apart using the space character as a delimiter, and the plus symbol had to be removed from each spread. If the element was neither a date nor a candidate (another fun quirk of Beautiful Soup), we continued the loop and moved on to the next sibling, because the element was not needed for our database.
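A condensed sketch of that traversal is shown below. It is not the function from Figure c: the requests fetch, the helper to_mysql_date, and the returned structure are illustrative assumptions, and ties (which have no spread) are not handled here.

import datetime
import requests
from bs4 import BeautifulSoup

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def to_mysql_date(text, year=2016):
    # "Friday, March 11" becomes datetime.date(2016, 3, 11); the year is
    # assumed because the page omits it.
    parsed = datetime.datetime.strptime(text.split(", ")[1], "%B %d")
    return parsed.replace(year=year).date()

def scrape_polls(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    current_date = None
    polls = []
    for row in soup.find_all("tr"):
        # Only rows whose class is "date", "alt", or empty carry data.
        if (row.get("class") or []) not in (["date"], ["alt"], []):
            continue
        text = row.get_text(" ", strip=True)
        if not text:                               # skip the occasional empty row
            continue
        if text.split(",")[0] in WEEKDAYS:
            current_date = to_mysql_date(text)     # applies to the polls that follow
            continue
        fields = [f.strip() for f in text.split(",") if f.strip()]
        name, agency = fields[0], fields[1]
        results = [r.rsplit(" ", 1) for r in fields[2:-1]]    # e.g. ["Trump", "41"]
        spread = fields[-1].replace("+", "").rsplit(" ", 1)   # plus sign removed
        polls.append((current_date, name, agency, results, spread))
    return polls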
 
The Database 
 
 
 
 
 
Figure d. Entity Relationship Diagram 
Our database structure consists of a total of five tables. The ‘Polling Agency’ table houses the name of each unique polling agency and its auto-incremented primary key. An ‘agency’ is the website or entity that originally created the poll and recorded its responses. Each polling agency has one or more poll records related to it by the agency’s primary key, each housed in the ‘Poll’ table. A ‘Poll’ record consists of the name of the poll and the date it was conducted. Each ‘Poll’ record has one or more related results, each housed in the ‘Result’ table. A ‘Result’ record consists of the name of the politician the result is recorded for and the number of percentage points that politician received in the poll. The ‘Result’ table serves as a junction table between the ‘Poll’ table and the ‘Politician’ table, because each politician appears in at least one poll, and usually in many. Each record in the ‘Result’ table stores a politician’s last name as the foreign key to its ‘Politician’ record. Each ‘Politician’ record also houses the politician’s party affiliation. Finally, each ‘Poll’ record may or may not be associated with a ‘Spread’ record: a ‘Poll’ record is associated with a ‘Spread’ record only if the poll did not end in a tie.
Data Dictionary:
1. Polling Agency
a. Entity Description:
i. Our website collects polling information from many different polling agencies across the internet and other media. Examples are “FOX” and “MSNBC”.
b. Attributes:
i. pk_Id = auto-incremented primary key for each polling agency record (INT)
ii. Name = name of the polling agency, e.g. “MSNBC” (VARCHAR)
2. Poll
a. Entity Description:
i. Each polling agency conducts multiple polls over the course of weeks and months. Each record in this table is a poll on a particular day (e.g. Republican Poll, March 11th).
b. Attributes:
i. Pk = auto-incremented primary key (INT)
ii. fk_pa = foreign key to the polling agency (INT)
iii. Date = date of the poll (DATE)
iv. Name = name of the poll itself, e.g. republican vs. democrat, republicans only, democrats only (VARCHAR)
3. Result
a. Entity Description:
i. Each poll has a variable number of results. A result is a politician and the percentage points they received in that poll.
b. Attributes:
i. Fk_poll = foreign key to the poll primary key (INT)
ii. Fk_politician = foreign key to the politician primary key, the candidate’s last name (VARCHAR)
iii. Points = number of points the candidate received in the poll (INT)
4. Spread
a. Entity Description:
i. Each poll has at most one spread: the number of points by which the winning politician won the respective poll.
b. Attributes:
i. Fk_poll = foreign key to the poll (INT)
ii. Fk_politician = foreign key to the politician primary key, the candidate’s last name (VARCHAR)
iii. Points = number of points the candidate won by (the spread) in the poll (INT)
5. Politician
a. Entity Description:
i. The presidential candidates represented in the polling data collected.
b. Attributes:
i. Pk = last name of the candidate (VARCHAR)
ii. Party = Republican or Democrat (VARCHAR)
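The dictionary above corresponds to DDL roughly like the following sketch. Column sizes are assumptions, the UNIQUE constraint on the agency name reflects the duplicate-rejection behavior described later, and the exact statements in the application may differ in small ways.

# Sketch of the schema implied by the data dictionary; not the application's exact DDL.
SCHEMA = [
    """CREATE TABLE agency (
           pk_Id INT AUTO_INCREMENT PRIMARY KEY,
           Name  VARCHAR(100) NOT NULL UNIQUE)""",
    """CREATE TABLE politician (
           Pk    VARCHAR(50) PRIMARY KEY,
           Party VARCHAR(20))""",
    """CREATE TABLE poll (
           Pk    INT AUTO_INCREMENT PRIMARY KEY,
           fk_pa INT NOT NULL,
           Date  DATE,
           Name  VARCHAR(200),
           FOREIGN KEY (fk_pa) REFERENCES agency (pk_Id))""",
    """CREATE TABLE result (
           Fk_poll       INT NOT NULL,
           Fk_politician VARCHAR(50) NOT NULL,
           Points        INT,
           FOREIGN KEY (Fk_poll) REFERENCES poll (Pk),
           FOREIGN KEY (Fk_politician) REFERENCES politician (Pk))""",
    """CREATE TABLE spread (
           Fk_poll       INT NOT NULL,
           Fk_politician VARCHAR(50) NOT NULL,
           Points        INT,
           FOREIGN KEY (Fk_poll) REFERENCES poll (Pk),
           FOREIGN KEY (Fk_politician) REFERENCES politician (Pk))""",
]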
 
Overview of the Process of Creating the Database 
Our application begins by opening a connection to a MySQL server. A connection object is created upon authentication via the PyMySQL library, and a cursor object, which serves as the pointer through which PyMySQL commands are executed, is created from the connection object. Anyone can run this application by changing the connection information on line 143 to the location of, and login credentials for, their own MySQL server.
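In outline, that setup amounts to the sketch below; the host and credentials shown are placeholders, not ours.

import pymysql

# Placeholder host and credentials; replace these with your own MySQL server's
# address and login, as described above for line 143 of the application.
connection = pymysql.connect(host="localhost", user="user",
                             password="password", charset="utf8mb4")
cursor = connection.cursor()   # every later SQL statement is executed through this cursor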
Next, our application creates a database called ‘polling’ on the connected server. Note that this will fail if a database named ‘polling’ already exists on the server. After creating the database, the tables ‘agency’, ‘poll’, ‘politician’, ‘result’, and ‘spread’ are created, representing the entities ‘Polling Agency’, ‘Poll’, ‘Politician’, ‘Result’, and ‘Spread’ from the relationship diagram, respectively. After the database and tables are created, the web scraping part of the application begins. During scraping, a set of functions whose names contain the word ‘store’ is executed to insert records.
Each ‘store’ function takes relevant information from the scraper and inserts a record into the table named in the second part of the function’s name. For example, the function ‘storeAgency’ takes the title of a polling agency as an argument and inserts an agency record into the agency table. With each store function call, a new record insertion is attempted. If the insert fails (usually because of the restriction against duplicate agency titles in the case of storeAgency, which is the intended behavior), then nothing is inserted and the transaction is committed as-is. This process is similar, although not identical, across the five ‘store’ functions. Next, the ‘store’ function selects the primary key from the table in question, using one of its arguments as an identifier, and returns that primary key to the main scraping code. This makes it easy to relate records in the ‘Poll’ table: in the case of storeAgency, for example, the returned primary key is later passed as an argument into the storePoll function, where it is inserted into the database as the foreign key of the new poll record.
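The storeAgency case, for example, follows roughly this insert-then-select shape; this is a sketch against the schema sketched earlier, not the exact code, and the other ‘store’ functions differ in their tables and arguments.

import pymysql

def store_agency(cursor, connection, title):
    # Try the insert; a duplicate agency title violates the UNIQUE constraint,
    # which is the intended behavior, so the failure is simply ignored.
    try:
        cursor.execute("INSERT INTO agency (Name) VALUES (%s)", (title,))
    except pymysql.err.IntegrityError:
        pass
    connection.commit()
    # Look the primary key up by title and hand it back to the caller, which
    # later passes it to storePoll as the poll record's foreign key.
    cursor.execute("SELECT pk_Id FROM agency WHERE Name = %s", (title,))
    return cursor.fetchone()[0]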
After inserting all of the records it finds, the application drops the database created at initialization and closes both the cursor and connection objects (lines 174-178). These lines are part of a ‘finally’ statement, meaning that if any errors occur during runtime, the database will still be dropped, and the connection and cursor objects will still be closed, before the errors are thrown.
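The overall shape of that lifecycle looks roughly like the skeleton below; the credentials are placeholders as before, and the body of the try block stands in for the table creation, scraping, and ‘store’ calls described above.

import pymysql

connection = pymysql.connect(host="localhost", user="user", password="password")
cursor = connection.cursor()
try:
    cursor.execute("CREATE DATABASE polling")
    cursor.execute("USE polling")
    # ... create the five tables, scrape the site, and call the 'store' functions ...
finally:
    # Runs whether or not an error occurred above: the database is dropped and
    # the cursor and connection are closed before any exception propagates.
    cursor.execute("DROP DATABASE polling")
    cursor.close()
    connection.close()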
 
Queries 
When it came time to design the queries for our interface, we wanted to give the user multiple options for analyzing the data. We provided some commands that give general information, like which candidates were running, which agencies conducted the polls, and a list of all of the results from the past few months.
 
Figure e. The Queries 
We also provided queries that show things like changes in a candidate's poll standings over time. This was accomplished through a query that gives a candidate's 'fluctuation', a figure calculated by subtracting their lowest poll result from their highest finish. This number gives a sense of how a candidate has fared over time. If their fluctuation number is low, it suggests that the candidate has placed very consistently in polls: either they always did well, as Clinton has done from the start, or they simply weren't able to surge out of last place as they had planned to (Ohio Gov. John Kasich). A high fluctuation number means that in recent months the candidate has either risen or fallen by a large amount. It's entirely possible that they did both, as we saw from candidates like Ben Carson and Jeb Bush.
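Against the schema sketched earlier, the fluctuation figure comes down to a query along these lines; this is a sketch, not necessarily the exact query shown in Figure e.

# A candidate's highest finish minus their lowest finish, grouped per politician.
FLUCTUATION_QUERY = """
    SELECT Fk_politician,
           MAX(Points) - MIN(Points) AS fluctuation
    FROM result
    GROUP BY Fk_politician
    ORDER BY fluctuation DESC
"""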
Another important query we provided was the option to see an average percentage result 
for each candidate over a period of time. Individual poll results will vary from agency to agency, 
and even polls conducted in the same state at the same time can give wildly conflicting results. 
Taking the average of multiple polls gives a more accurate picture of the state of the race with 
less of a margin of error. 
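The period average reduces to a similar grouped query, with the date bounds supplied as parameters; again, this is a sketch against the schema above rather than the exact query from Figure e.

# Mean percentage per candidate across all polls in a caller-supplied date window.
AVERAGE_QUERY = """
    SELECT r.Fk_politician,
           AVG(r.Points) AS average_points
    FROM result r
    JOIN poll p ON r.Fk_poll = p.Pk
    WHERE p.Date BETWEEN %s AND %s
    GROUP BY r.Fk_politician
    ORDER BY average_points DESC
"""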
Challenges and Future Improvements 
Throughout the design and construction of our program, the majority of obstacles we 
faced were in the form of initial program set up, syntax errors, learning how to use PymySQL 
and Beautifulsoup, and GitHub. Once past initial setup and navigation of foreign territory, 
writing the python and SQL necessary for the program was not difficult (the first half of the 
semester helped with his). 
Writing correct syntax across multiple programming languages can get confusing, given the subtle nuances of each. We were able to pinpoint the offending syntax quickly through error messages on the command line, and once we realized which aspect of our syntax was incorrect, we found consulting our textbooks invaluable in fixing these mistakes.
Helpful future improvements to our application would include nicer output formatting, an expansion of the project to include more than three pages of data from the Real Clear Politics website, and performance improvements to our database queries.
Potential improvements to the database queries in our application exist within the store functions. As the store functions are written, a select statement is executed every time a store function is called, even when it is not necessary. For example, if a certain polling agency has already been inserted into the ‘agency’ table, we could already have captured its primary key when the record was first created. That key could be stored in a Python dictionary, keyed by the polling agency’s title, which would avoid a select statement to the database to obtain the key. While this sort of change may seem trivial and may not save noticeable execution time at our application’s current scale, a much larger version of the application could see real performance gains from this type of change.
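A sketch of that idea, building on the store_agency sketch above (illustrative only, not code from the application):

import pymysql

agency_keys = {}   # agency title -> primary key, filled in as agencies are first seen

def store_agency_cached(cursor, connection, title):
    if title in agency_keys:               # key already known: no SELECT round trip
        return agency_keys[title]
    try:
        cursor.execute("INSERT INTO agency (Name) VALUES (%s)", (title,))
    except pymysql.err.IntegrityError:
        pass
    connection.commit()
    # One lookup per new agency (cursor.lastrowid could also supply the key
    # right after a successful insert); the result is cached for later calls.
    cursor.execute("SELECT pk_Id FROM agency WHERE Name = %s", (title,))
    agency_keys[title] = cursor.fetchone()[0]
    return agency_keys[title]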