This presentation was made by Lucky Adike, Marty McEnroe, and Dann Ormond for the CS410 class at University of Illinois taught by Prof. ChengXiang Zhai for the Spring 2013 semester
1. CS410 Course Project Presentation
Petition Predictor
CS410 Spring 2013
Lucky Adike
Martin McEnroe
Dann Ormond
2. CS410 Course Project Presentation
Problem Statement
Congress shall make no law respecting an establishment of religion, or prohibiting the free
exercise thereof; or abridging the freedom of speech, or of the press; or the right of the
people peaceably to assemble, and to petition the Government for a redress of grievances.
- The First Amendment of the United States Constitution
• January 2012: Congress proposed legislation on behalf of content distributors
• The internet community grew increasingly alarmed about the change and its side effects
• Several well-publicized events took place on January 18, 2012 as part of the SOPA blackout day: Google, Reddit, Wired, Wikipedia and 115,000 other websites modified their web presence to protest the pending legislation
• January 20: the legislation was shelved indefinitely
What useful information retrieval tool could be built?
• Could this citizenry-government action have been anticipated and predicted?
• Could information retrieval and analysis of the online conversation anticipate and predict the end result?
3. CS410 Course Project Presentation
• Must register with email and zip code
• 100,000 signatures in 30 days
• Reach threshold and White House responds
• Which new petitions will hit threshold?
Related work: on 2/21/13 the White House hosted a hackathon and released project results on 5/1.
Pulse predicts when a petition will pass the 100k threshold:
http://youtu.be/5-2P4GFZf8Y
https://github.com/DruRly/pulse
4. CS410 Course Project Presentation
Solution Approach
• 1st idea: classify the petition:
– “1”: Petition will receive 100,000 discrete, validated signatures within 30 days
– “0”: Petition will not pass the 100,000 threshold in time
• How to make a classification decision?
1. Statistical analysis of past performance.
• Wrote a Python program to scrape the White House website every 8 hours. Results are stored in a JSON object for use in subsequent analysis and retrieval.
Data collected for each petition:
• Title of petition
• Text of petition
• Signature count every 8 hours, starting 4/28
• Petition create date (but only viewable on the website after 150 signatures)
• A unique identifier, also useful as a search term
During the course of the project we changed to ranking petitions.
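The slides do not show the scraper or its JSON layout, so the following is only a minimal sketch of how the fields listed above might be kept per petition. The field names (`id`, `title`, `body`, `created`, `counts`), the function name, and the sample values are all hypothetical, and fetching the page is deliberately left out.

```python
import json


def record_snapshot(store, petition_id, title, body, created, timestamp, signatures):
    """Append one signature-count sample to an in-memory store keyed by petition ID."""
    # Hypothetical layout: static fields are stored once, counts accumulate over time.
    entry = store.setdefault(petition_id, {
        "id": petition_id,
        "title": title,
        "body": body,
        "created": created,
        "counts": [],
    })
    entry["counts"].append({"time": timestamp, "signatures": signatures})
    return entry


# Made-up placeholder values, sampled 8 hours apart
store = {}
record_snapshot(store, "exampleID", "Example title", "Example text",
                "2013-04-20", "2013-04-28T00:00", 150)
record_snapshot(store, "exampleID", "Example title", "Example text",
                "2013-04-20", "2013-04-28T08:00", 180)

# The whole store serializes to a single JSON object for later analysis
serialized = json.dumps(store, indent=2)
```

Keying the store by the petition's unique identifier keeps repeated scrapes of the same petition in one time series.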
5. CS410 Course Project Presentation
Logarithmic Curve Fit of 10 Most Likely Petitions
[Figure: number of signatures (log scale, 150 to 150,000) versus time in 8-hour increments (0 to 100), with petitions time-shifted to a common origin at their creation date. Ten series, each with a logarithmic trend line: archbishops, marijuana3, airgun, postal, Malaysian, assault, habeas, aggag, thallium, transnational.]
Threshold @ 100,000 signatures
Curve fit, then predict the 30th-day value (x = 90, since we sample every 8 hours)
Petition “fatigue” suggests a logarithmic model is a better predictor
- Ln used (base w1 = e); the base can be tuned
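The fit-then-extrapolate step can be sketched as follows. This assumes the trend line has the usual spreadsheet logarithmic form y = a + b·ln(x), fitted by ordinary least squares; the function names are hypothetical and the sample data is synthetic, not taken from any petition.

```python
import math


def fit_log_curve(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b). Requires x > 0."""
    ts = [math.log(x) for x in xs]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = sxy / sxx
    a = mean_y - b * mean_t
    return a, b


def predict(a, b, x):
    """Evaluate the fitted curve at sample index x."""
    return a + b * math.log(x)


# Synthetic samples generated from y = 150 + 1000 * ln(x), one per 8-hour interval
xs = list(range(1, 13))
ys = [150 + 1000 * math.log(x) for x in xs]

a, b = fit_log_curve(xs, ys)
day30 = predict(a, b, 90)        # 30 days = 90 samples at 8-hour spacing
will_pass = day30 >= 100_000     # compare against the signature threshold
```

Sample indices start at 1 here because ln(0) is undefined; how the original spreadsheet handled the first sample is not stated.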
6. CS410 Course Project Presentation
Twitter: Tweets and Followers
After signing, wh.gov site encourages you to promote the petition
• Used public Twitter REST API
• Search on the petition title
• Tweet Rate = count / # of days (Twitter limits the age of tweets in the API)
• Use a transformation of the rate to reward place in rank, not absolute value difference
– sublinear
– linear
– exponential
• Guess: linear
Tweet Weight Adjusted for ∑ƒ(followers)
• Are some tweeters more important than others?
• Can we develop something like authorities/hubs?
• Weighted Rate incorporates number of followers to increase/decrease score of each tweet
Adj. Score = ∑ log_5(followers) / days of tweets
Base 5 -> pivot point is w2 = 5 followers; can be tuned
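As an illustration of the two rates, here is a minimal sketch. It assumes the follower counts of the matching tweets have already been retrieved; the function names and the clamping of zero-follower accounts are assumptions, and the follower counts are made up.

```python
import math


def tweet_rate(tweet_count, days):
    """Plain tweet rate: number of matching tweets per day."""
    return tweet_count / days


def follower_weighted_rate(follower_counts, days, base=5):
    """Sum of log_base(followers) over all tweets, divided by days of tweets."""
    # Assumption: accounts with no followers are clamped to 1 so the log is defined
    # (such a tweet then contributes a weight of 0).
    total = sum(math.log(max(f, 1), base) for f in follower_counts)
    return total / days


# Made-up follower counts for four tweets collected over two days
followers = [5, 25, 125, 1]
rate = tweet_rate(len(followers), days=2)               # 4 tweets / 2 days
weighted = follower_weighted_rate(followers, days=2)    # about (1 + 2 + 3 + 0) / 2
```

With base 5, a tweet from an account with exactly 5 followers has weight 1; smaller accounts count for less than one tweet and larger accounts for more, which is why the base acts as the pivot point.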
7. CS410 Course Project Presentation
Transforming Rank to Boost Factors
• Petition rank is mapped to a boost factor via a linear function; the function type can be tuned
• A scaling parameter is applied based on our judgment of the importance of each IR category
– tweet rate: w3 = .02, 1st -> 1.10; 10th -> .90
– follower-adjusted tweet rate: w4 = .04, 1st -> 1.20; 10th -> .80
Petition ID | Ln Curve Fit (w1 = e) | Tweets | Tweet Rate per Day | Rank | Boost Factor | Boost | Follower Weighted Rate | Weighted/Rate Ratio | Rank | Boost Factor | Boost
xNskxL1q | 16,545 | 94 | 11.8 | 10 | 0.90 | -1,655 | 39.7 | 3.379 | 7 | 0.92 | -1,324
xqNMVRB4 | 9,115 | 97 | 12.1 | 9 | 0.92 | -729 | 44.1 | 3.636 | 1 | 1.20 | 1,823
khpw6LCt | 50,898 | 1022 | 127.8 | 2 | 1.08 | 4,072 | 459.8 | 3.600 | 3 | 1.12 | 6,108
drCmyCHZ | 21,280 | 231 | 28.9 | 5 | 1.02 | 426 | 103.1 | 3.570 | 4 | 1.08 | 1,702
nBqKR7bm | 446,841 | 1676 | 838.0 | 1 | 1.10 | 44,684 | 2675.7 | 3.193 | 9 | 0.84 | -71,494
kVhNfHQ1 | 14,720 | 168 | 21.0 | 6 | 0.98 | -294 | 71.7 | 3.412 | 6 | 0.96 | -589
bMJpDrNq | 6,769 | 114 | 14.3 | 8 | 0.94 | -406 | 49.6 | 3.479 | 5 | 1.04 | 271
KQWSvsKr | 5,380 | 127 | 15.9 | 7 | 0.96 | -215 | 57.7 | 3.635 | 2 | 1.16 | 861
Rd8C54p1 | 83,231 | 93 | 31.0 | 4 | 1.04 | 3,329 | 63.5 | 2.047 | 10 | 0.80 | -16,646
V3hNt2fB | 17,376 | 508 | 63.5 | 3 | 1.06 | 1,043 | 208.5 | 3.283 | 8 | 0.88 | -2,085
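The Twitter boost factors in the table above are consistent with a linear map that changes by one step of w per rank and skips a factor of exactly 1.00 between 5th and 6th place. The sketch below encodes that inferred rule for an even number of ranked petitions; the function names are hypothetical and the exact rule the authors used is an assumption.

```python
def boost_factor(rank, w, n=10):
    """Map a rank (1 = best) among n petitions to a multiplicative boost factor."""
    half = n // 2
    if rank <= half:
        # Top half: 1st place gets 1 + half * w, 5th place (for n = 10) gets 1 + w
        return 1 + w * (half + 1 - rank)
    # Bottom half: 6th place (for n = 10) gets 1 - w, last place gets 1 - half * w
    return 1 - w * (rank - half)


def boost(curve_fit, rank, w, n=10):
    """Signatures added to (or removed from) the curve-fit prediction."""
    return curve_fit * (boost_factor(rank, w, n) - 1)


# First table row: curve fit 16,545, tweet-rate rank 10, w3 = .02
first_row_factor = boost_factor(10, 0.02)     # close to 0.90
first_row_boost = boost(16545, 10, 0.02)      # close to -1654.5; the table shows -1,655
```

Because only the rank enters the formula, a petition with a vastly higher tweet rate gains no more than the step between adjacent ranks, which matches the stated goal of rewarding place in rank rather than absolute difference.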
8. CS410 Course Project Presentation
Can Google Trends help us?
Chunks,value (0 – 100)
Revoke US Visa,7
on,83
National Security Grounds,7
to,83
Venezuelan Government Officials,0
involved,65
in,93
Transnational Organized Crime,65
Converted the petition title into search phrases using OpenNLP
• sentence detector
• tokenizer & POS tagger => chunker
Some observations
• Chunking produced common terms with high scores
• It would be more useful to build a custom query background language model, but that needs more data
• It is not clear how Google Trends computes values from 0 to 100; values for different petitions are not relative to each other
• It doesn't appear to be a “bag of words” model. What about semantically equivalent terms? We were hoping for a tf-idf weighting from the web
• Is there another tool out there? Are there functions of the API we didn't exploit? Will the API evolve?
• Most unreliable IR source, therefore w5 = .01
Results from web interface
Results from API interface
9. CS410 Course Project Presentation
Authority Sites via Bing API
• Created a list of 30 authoritative web sites (e.g., cnn.com), each weighted equally
• Sent the full title of the petition as a query to the Bing API, exactly as listed on wh.gov:
“Invest and deport Jasmine Sun who was the main suspect of a famous Thallium poison murder case (victim:Zhu Lin) in China”
• Measured the number of responses in the top 50 results that came from an authoritative domain, eliminating self-posting parts of a domain: http://ireport.cnn.com/docs/DOC-965382
• Observation: most petitions do not receive mainstream attention
• Second most reliable source: w6 = .03
Petition ID | keyword | Close Date | Ln Curve Fit | Authority Sites | Rank | Boost Factor | Boost
xNskxL1q | archbishops | 5/27 | 16,545 | 5 | 3 | 1.09 | 1489
xqNMVRB4 | marijuana3 | 5/17 | 9,115 | 8 | 1 | 1.15 | 1367
khpw6LCt | airgun | 5/15 | 50,898 | 4 | 4 | 1.06 | 3054
drCmyCHZ | postal | 5/24 | 21,280 | 6 | 2 | 1.12 | 2554
nBqKR7bm | Malaysian | 6/4 | 446,841 | 2 | 6 | 0.97 | -13405
kVhNfHQ1 | assault | 5/21 | 14,720 | 0 | 10 | 0.85 | -2208
bMJpDrNq | habeas | 5/27 | 6,769 | 3 | 5 | 1.03 | 203
KQWSvsKr | aggag | 5/10 | 5,380 | 2 | 8 | 0.91 | -484
Rd8C54p1 | thallium | 6/4 | 83,231 | 0 | 9 | 0.88 | -9988
V3hNt2fB | transnational | 6/3 | 17,376 | 2 | 7 | 0.85 | -2606
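The counting step behind the Authority Sites column can be sketched as below. The sketch assumes the search results are already available as a list of URLs (no Bing API call is shown); the function name, the exclusion set, and every URL except the ireport.cnn.com one from the slide are hypothetical.

```python
from urllib.parse import urlparse


def count_authority_hits(result_urls, authority_domains, excluded_hosts=(), top_n=50):
    """Count results in the top_n whose host belongs to an authoritative domain."""
    hits = 0
    for url in result_urls[:top_n]:
        host = (urlparse(url).hostname or "").lower()
        if host in excluded_hosts:
            # Self-posting sections of an authoritative domain are not counted
            continue
        if any(host == d or host.endswith("." + d) for d in authority_domains):
            hits += 1
    return hits


results = [
    "http://www.cnn.com/2013/05/01/example-story",   # hypothetical news article
    "http://ireport.cnn.com/docs/DOC-965382",        # self-posted content, excluded
    "http://example.org/blog/post",                  # not an authoritative domain
]
hits = count_authority_hits(results, {"cnn.com"}, excluded_hosts={"ireport.cnn.com"})
```

Matching on the host suffix lets subdomains such as www.cnn.com count toward cnn.com while the explicit exclusion set removes user-contributed sections.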
10. CS410 Course Project Presentation
Putting it together
• Our focus was on acquiring data and constructing a model, automating where necessary and using open tools, APIs, and information sources
• Transfer between modules, final ranking, and computation need more automation if we are to run unattended
• Much data analysis, both manual and automated, to guess at important sources and parameters. Many initial ideas didn't pan out:
– Sentiment analysis (no such thing as bad publicity)
– Google Trends was surprisingly useless: we were forced to do manual manipulation, and have very low confidence in it as a predictor
– There is a Facebook button on wh.gov, but it didn't appear to be used as much as Twitter
– No training data to choose parameters. We chose a simple “boost” model to start and used intuition from the project to guess at the relative size of the boost from different sources.
Stop 85
seismic airgun testing 0
for 86
oil and gas 77
off 80
the U.S. East Coast . 0
11. CS410 Course Project Presentation
Putting Our Money Where Our Mouth Is…
Ranked predictions of the 10 most likely [1] of the 84 [2] petitions started between April 5 and May 4. How will we do?
1. Only petitions that have at least 150 signatures are visible to us
2. One petition (0MNp0Bys) started on 4/15 and hit 100k before we started collecting statistics, so we excluded it from our data set
Petition ID | keyword | Close Date | Linear Curve Fit | Naïve Order | Ln Curve Fit (w1 = e) | Twitter (w3 = .02) | Twitter+ (w2 = 5, w4 = .04) | Google Trends (w5 = .01) | Authority Sites (w6 = .03) | Combined model | Predicted Order
xNskxL1q | archbishops | 5/27 | 31,309 | 5 | 16,545 | -1,655 | -1,324 | 165 | 1,489 | 15,222 | 5
xqNMVRB4 | marijuana3 | 5/17 | 10,048 | 9 | 9,115 | -729 | 1,823 | 456 | 1,367 | 12,032 | 7
khpw6LCt | airgun | 5/15 | 57,185 | 4 | 50,898 | 4,072 | 6,108 | -2,036 | 3,054 | 62,096 | 2
drCmyCHZ | postal | 5/24 | 27,929 | 6 | 21,280 | 426 | 1,702 | -426 | 2,554 | 25,536 | 4
nBqKR7bm | Malaysian | 6/4 | 2,895,387 | 1 | 446,841 | 44,684 | -71,494 | 17,874 | -13,405 | 424,499 | 1
kVhNfHQ1 | assault | 5/21 | 16,568 | 7 | 14,720 | -294 | -589 | -442 | -2,208 | 11,187 | 8
bMJpDrNq | habeas | 5/27 | 12,036 | 8 | 6,769 | -406 | 271 | -68 | 203 | 6,769 | 9
KQWSvsKr | aggag | 5/10 | 5,448 | 10 | 5,380 | -215 | 861 | -269 | -484 | 5,273 | 10
Rd8C54p1 | thallium | 6/4 | 734,304 | 2 | 83,231 | 3,329 | -16,646 | 1,665 | -9,988 | 61,591 | 3
V3hNt2fB | transnational | 6/3 | 88,236 | 3 | 17,376 | 1,043 | -2,085 | 521 | -2,606 | 14,248 | 6
Baseline IR Model Prediction
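The Combined model column matches the Ln curve fit plus the four boosts, and the Predicted Order is the descending sort of that sum. A short sketch using the table's own figures is given below; the variable and function names are hypothetical. Summing the already-rounded boosts can differ from the printed total by a signature or two (for example 15,220 here versus 15,222 in the table).

```python
# keyword: (Ln curve fit, [Twitter, Twitter+, Google Trends, Authority Sites boosts])
rows = {
    "archbishops":   (16545,  [-1655, -1324, 165, 1489]),
    "marijuana3":    (9115,   [-729, 1823, 456, 1367]),
    "airgun":        (50898,  [4072, 6108, -2036, 3054]),
    "postal":        (21280,  [426, 1702, -426, 2554]),
    "Malaysian":     (446841, [44684, -71494, 17874, -13405]),
    "assault":       (14720,  [-294, -589, -442, -2208]),
    "habeas":        (6769,   [-406, 271, -68, 203]),
    "aggag":         (5380,   [-215, 861, -269, -484]),
    "thallium":      (83231,  [3329, -16646, 1665, -9988]),
    "transnational": (17376,  [1043, -2085, 521, -2606]),
}


def combined_score(curve_fit, boosts):
    """Additive boost model: baseline prediction plus each source's boost."""
    return curve_fit + sum(boosts)


scores = {key: combined_score(fit, boosts) for key, (fit, boosts) in rows.items()}

# Rank petitions from most to least likely to reach the threshold
predicted_order = sorted(scores, key=scores.get, reverse=True)
```

Sorting by the combined score reproduces the Predicted Order column, from Malaysian in first place to aggag in tenth.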
12. CS410 Course Project Presentation
Quo Vadis?
Do Research
• Collect more data, train parameters, learn different ways to make predictions
• Publish
• Awesome? Idea for a team competition as homework 5 in a future class
Sharpen CS skills
• Whitehouse.gov released an API on 5/1 and a historical corpus on 5/2
• Next White House hackathon on 6/1
Make money
• Turn this into an actual app and host it on a web site
– Business model: tweet a dashboard link to anyone who tweets a petition; the dashboard site is advertising supported
• Apply methods to other petition sites: change.org, gopetition.com, ipetitions.com, signon.org, thepetitionsite.com, care2.com (or get a job at one of these companies)
Give back
• Fraudulent petition signature detection
• Mine the web for new petition topics with high success potential
Editor's Notes
0 sec
25 sec
20 sec
15 sec: We noticed that of approximately 90 current predictions, three have to do with marijuana. This gave us an idea.
Work to do: Marty - compute numbers, sort by start date