Making Databases Great Again with Political Polling Data
Making Databases Great Again
By Alifya Ali, Wade Cope, and Garner Vincent
Motivation for Project

We chose to collect data from polling results of the presidential primaries because, if there were ever a time in modern history to pay attention to politics, it is the 2016 election cycle. With multiple polls released each day for the last several months, we knew there would be no shortage of available data. Another reason we chose this primary season is that the Republican side was stacked with a record 17 candidates at the start, and several of them rose to the front of the polls before crashing and burning. It was this potential to analyze dynamic trends in relevant, real-world data over time that first drew us to political polling as an option, and inspired us to make databases great again.

Web Scraping

Figure a. Real Clear Politics Website
The selection process for the website was very simple. Political polls are a dime a dozen, but we had to find a source that would work nicely for the purposes of scraping data. Real Clear Politics is a poll-aggregation website, and for the purposes of our project an aggregate seemed the best way to go. Real Clear Politics splits its data into a readable table, separated first by date and then by the name of the poll, the polling agency, the percentages won by each candidate, and how many percentage points the leading candidate won by in that poll (excluding ties). The website does this for every poll conducted over roughly the month and a half before the current date. This gave us more than enough data to store in our database: substantial enough to justify the table structure we chose, yet small enough that the scope was manageable and the program runs quickly.

Figure b. Real Clear Politics Source Code

The source code of the page was also straightforward. The code divides all of the information described above into separate table tags whose class attribute is either empty or 'alt', alternating between the two. After the date, which is extracted and then reformatted (from day of the week, month, day into year, month, day), each poll is broken up into the name of the race, the polling agency, the results of the poll, and the spread (which gives the winner and the margin of victory). The site does this for every day across three pages' worth of data, which comes to roughly 225 polls.

Figure c. Web Scraping Function
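The date reformatting and the per-poll splitting just described can be sketched as plain string handling. This is a hypothetical reconstruction, not the project's actual function: the input strings are assumed examples of the flattened Real Clear Politics format, and the year must be supplied by the caller because the page omits it.

```python
from datetime import datetime

def to_sql_date(header, year=2016):
    # "Wednesday, March 16" -> "2016-03-16" (the form a SQL DATE expects).
    parsed = datetime.strptime(f"{header} {year}", "%A, %B %d %Y")
    return parsed.strftime("%Y-%m-%d")

def parse_row(text):
    # A flattened poll row, roughly: "Race, Agency, Name 41, Name 29, Name +12"
    # Split on commas and drop empty elements at either end.
    parts = [p.strip() for p in text.split(",") if p.strip()]
    race, agency = parts[0], parts[1]
    results = []
    for chunk in parts[2:-1]:                  # candidate results
        name, points = chunk.rsplit(" ", 1)
        results.append((name, int(points)))
    winner, margin = parts[-1].rsplit(" ", 1)  # spread: strip the plus sign
    return race, agency, results, (winner, int(margin.lstrip("+")))
```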
Because of the way the source code is broken up, scraping the data was actually very simple. First we made a Beautiful Soup object from the scraped page and built a list containing the days of the week as strings. We then used a for loop to traverse all of the blocks of code (called siblings in the tree structure) that fell into the classes "date", "alt", or the empty string. We did this with the Beautiful Soup function findAll(), which, given those parameters, neatly pulled out all of the data on the page that was actually relevant to our database. In each iteration of the loop, we split the sibling's text and tested its length. We only worked with the lists of length greater than 0, because the findAll() method occasionally returns empty lists. After checking the length, we checked whether the first element of each split sibling contained a weekday. If it did, the element is transformed from text into MySQL DATE format (e.g. 2012-01-01) and stored in a local variable to be used when loading all polls that occurred on that date. If the element was not a date, we split the sibling further (using the comma as a delimiter) and then deleted any empty elements at the beginning and end of the list. Next, we executed statements according to the location of each element, formatting them as needed for insertion into our database. Candidate names and result points had to be separated using the space character as a delimiter, and the plus symbol had to be removed from each spread. If the element was neither a date nor a candidate (another fun quirk of Beautiful Soup), we continued the loop to the next sibling, because the element was not needed for our database.

The Database
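The five tables this section describes were created with CREATE TABLE statements at startup. A hedged sketch of the schema follows, using Python's built-in sqlite3 so it is self-contained; the real project issues MySQL DDL via PyMySQL (e.g. INT AUTO_INCREMENT rather than sqlite's INTEGER PRIMARY KEY AUTOINCREMENT), and column sizes here are assumptions the report does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Types follow the data dictionary below; the politician's last name
# serves as both its primary key and the foreign key in result/spread.
conn.executescript("""
CREATE TABLE agency (
    pk_Id INTEGER PRIMARY KEY AUTOINCREMENT,
    Name  VARCHAR(64) UNIQUE
);
CREATE TABLE politician (
    Pk    VARCHAR(32) PRIMARY KEY,   -- candidate's last name
    Party VARCHAR(16)                -- Republican or Democrat
);
CREATE TABLE poll (
    Pk    INTEGER PRIMARY KEY AUTOINCREMENT,
    fk_pa INT REFERENCES agency(pk_Id),
    Date  DATE,
    Name  VARCHAR(64)
);
CREATE TABLE result (
    Fk_poll       INT REFERENCES poll(Pk),
    Fk_politician VARCHAR(32) REFERENCES politician(Pk),
    Points        INT
);
CREATE TABLE spread (
    Fk_poll       INT REFERENCES poll(Pk),
    Fk_politician VARCHAR(32) REFERENCES politician(Pk),
    Points        INT
);
""")
```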
Figure d. Entity Relationship Diagram

Our database structure consists of a total of five tables. The 'Polling Agency' table houses the name of each unique polling agency and its auto-incremented primary key. An 'agency' is the website or entity that originally created the poll and recorded its responses. Each polling agency has one or more poll records related to it by the agency's primary key, each housed in the 'Poll' table. A 'Poll' record consists of the name of the poll and the date it was conducted. Each 'Poll' record has one or more related results, each housed in the 'Result' table. A 'Result' record consists of the name of the politician the result is recorded for and the number of percentage points that politician received in the poll. The 'Result' table serves as a junction table between the 'Poll' table and the 'Politician' table, because each politician appears in at least one poll, and usually in many. Each record in the 'Result' table stores a politician's last name as the foreign key to its 'Politician' record. Each 'Politician' record also houses the politician's party affiliation. Finally, each 'Poll' record may or may not be associated with a 'Spread' record: a 'Poll' record has a 'Spread' record only if the poll was not a tie.

Data Dictionary:

1. Polling Agency
   a. Entity Description:
      i. Our application collects polling information from many different polling agencies across the internet and other media, for example "FOX" or "MSNBC".
   b. Attributes:
      i. pk_Id = auto-incremented primary key for each polling agency record (INT)
      ii. Name = name of the polling agency, e.g. "MSNBC" (VARCHAR)
2. Poll
   a. Entity Description:
      i. Each polling agency conducts multiple polls over the course of weeks or months. Each record in the table is a poll on a particular day (e.g. Republican Poll, March 11).
   b. Attributes:
      i. Pk = auto-incremented primary key (INT)
      ii. fk_pa = foreign key to the polling agency (INT)
      iii. Date = date of the poll (DATE)
      iv. Name = name of the poll itself, e.g. Republicans vs. Democrats, Republicans only, Democrats only (VARCHAR)
3. Result
   a. Entity Description:
      i. Each poll has a variable number of results. A result is a politician and the percentage points they received in the poll.
   b. Attributes:
      i. Fk_poll = foreign key to the poll primary key (INT)
      ii. Fk_politician = foreign key to the politician primary key, the candidate's last name (VARCHAR)
      iii. Points = number of points the candidate won in the poll (INT)
4. Spread
   a. Entity Description:
      i. Each poll has one spread: the number of points by which the winning politician won the respective poll.
   b. Attributes:
      i. Fk_poll = foreign key to the poll (INT)
      ii. Fk_politician = foreign key to the politician primary key, the candidate's last name (VARCHAR)
      iii. Points = number of points the candidate won by (the spread) in the poll (INT)
5. Politician
   a. Entity Description:
      i. The names of all the presidential candidates represented in the polling data collected.
   b. Attributes:
      i. Pk = last name of the candidate (VARCHAR)
      ii. Party = Republican or Democrat (VARCHAR)

Overview of the Process of Creating the Database

Our application begins by opening a connection to a MySQL server. A connection object is created at authentication via the PyMySQL library, and a cursor object, which serves as a pointer for the actions of PyMySQL commands, is created from the connection object. Anyone can run this application by changing the connection information in line 143 to the location of, and login credentials for, their own MySQL server. Next, our application creates a database called 'polling' on the connected server. Note that this will fail if a database named 'polling' already exists on the server. After creating the database, the tables 'agency', 'poll', 'politician', 'result', and 'spread' are created, representing the entities in the relationship diagram ('Polling Agency', 'Poll', 'Politician', 'Result', and 'Spread', respectively). After creation of the database and tables, the web-scraping part of the application begins. During scraping, a set of functions whose names contain the word 'store' is executed to insert records. Each 'store' function takes relevant information from the scraper and inserts a record into the table named in the second part of the function's name. For example, the function
'storeAgency' takes the title of the polling agency as an argument and inserts an agency record into the agency table. With each store-function call, a new record insertion is attempted. If the insert fails (in the case of storeAgency, usually because of the restriction against duplicate agency titles, which is the intended behavior), nothing is inserted and the data is committed. This process is similar, although not identical, across the five 'store' functions. Next, the 'store' function selects the primary key from the table in question, using one of its arguments as an identifier, and returns that primary key to the main scraping code. This allows easy relation of records; in the case of storeAgency, for example, the returned primary key is later passed as an argument into storePoll, which inserts it into the database as the foreign key of the 'Poll' record. After inserting all found records, the application drops the database created at initialization and closes both the cursor and connection objects (lines 174-178). These lines are part of a 'finally' clause, meaning that even if errors occur during runtime, the database will always be dropped, and the connection and cursor objects will always be closed, before any errors propagate.

Queries

When it came time to design the queries for our interface, we wanted to give the user multiple options for analyzing the data. We provided some commands that give general information, such as which candidates were running, which agencies conducted the polls, and a list of all of the results from the past few months.
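General-information commands like these map to simple SELECT statements. A self-contained sketch follows, again using sqlite3 in place of PyMySQL; the tables are trimmed to the columns needed, and every row is a made-up example rather than real polling data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE politician (Pk VARCHAR(32) PRIMARY KEY, Party VARCHAR(16))")
cur.execute("CREATE TABLE agency (pk_Id INTEGER PRIMARY KEY, Name VARCHAR(64))")
cur.execute("CREATE TABLE result (Fk_poll INT, Fk_politician VARCHAR(32), Points INT)")
cur.executemany("INSERT INTO politician VALUES (?, ?)",
                [("Trump", "Republican"), ("Clinton", "Democrat")])
cur.execute("INSERT INTO agency (Name) VALUES ('FOX News')")
cur.executemany("INSERT INTO result VALUES (?, ?, ?)",
                [(1, "Trump", 41), (1, "Clinton", 48)])

# Which candidates are running, and for which party?
candidates = cur.execute("SELECT Pk, Party FROM politician ORDER BY Pk").fetchall()

# Which agencies conducted the polls?
agencies = [row[0] for row in cur.execute("SELECT Name FROM agency")]

# Every stored result.
results = cur.execute("SELECT Fk_politician, Points FROM result").fetchall()
```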
Figure e. The Queries

We also provided queries that show things like changes in a candidate's poll standings over time. This was accomplished through a query that gives a candidate's 'fluctuation', a figure calculated by subtracting their lowest poll result from their highest. This number tells you how a candidate has fared over time. If their fluctuation number is low, the candidate has placed very consistently in polls: either they always did well, as Clinton has done from the start, or they simply were never able to surge out of last place as they had planned (Ohio Gov. John Kasich). A high fluctuation number means that in recent months the candidate has risen or fallen by a large amount; it is entirely possible that they did both, as we saw with candidates like Ben Carson and Jeb Bush. Another important query we provided was the option to see an average percentage result for each candidate over a period of time. Individual poll results vary from agency to agency, and even polls conducted in the same state at the same time can give wildly conflicting results. Taking the average of multiple polls gives a more accurate picture of the state of the race, with less margin of error.

Challenges and Future Improvements
Throughout the design and construction of our program, the majority of the obstacles we faced came in the form of initial program setup, syntax errors, and learning how to use PyMySQL, Beautiful Soup, and GitHub. Once past the initial setup and the navigation of foreign territory, writing the Python and SQL necessary for the program was not difficult (the first half of the semester helped with this). Writing correct syntax across multiple programming languages can get confusing, given the subtle nuances of each, but we were able to pinpoint the offending syntax quickly through error messages on the command line. Once we realized what aspect of our syntax was incorrect, we found consulting our textbooks invaluable in fixing these mistakes. Helpful future improvements to our application would include nicer output formatting, expanding the project to include more than three pages of data from realclearpolitics.com, and performance improvements to our database queries. One potential query improvement lies in the store functions. As written, a SELECT statement executes every time a store function is called, even when it is not necessary. For example, if a certain polling agency has already been inserted into the 'agency' table, we could have captured its primary key at the initial record creation. The key could be stored in a Python dictionary, accessed using the polling agency's title, which would avoid a SELECT round trip to the database to obtain the key. While this sort of change may seem trivial and save little execution time at our application's current scale, a much larger version of the application could see real performance gains from this type of change.
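The store functions' insert-then-select pattern, and the dictionary cache proposed above, can be sketched side by side. As in the earlier examples, sqlite3 stands in for PyMySQL so the sketch is self-contained, and the table is a simplified 'agency'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE agency ("
            "pk_Id INTEGER PRIMARY KEY AUTOINCREMENT, Name VARCHAR(64) UNIQUE)")

def store_agency(name):
    # Current pattern: attempt the insert (a duplicate name fails on the
    # UNIQUE constraint, which is intended), then SELECT the key back.
    try:
        cur.execute("INSERT INTO agency (Name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        pass
    conn.commit()
    cur.execute("SELECT pk_Id FROM agency WHERE Name = ?", (name,))
    return cur.fetchone()[0]

agency_keys = {}  # proposed cache: agency title -> primary key

def store_agency_cached(name):
    # Improved pattern: repeat calls hit the dictionary and skip
    # both the failing INSERT and the extra SELECT round trip.
    if name not in agency_keys:
        agency_keys[name] = store_agency(name)
    return agency_keys[name]
```

The dictionary lookup replaces a network round trip with an in-memory hash access, which is why the gain only becomes visible at a much larger scale.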