This presentation was made by Lucky Adike, Marty McEnroe, and Dann Ormond for the CS410 class at the University of Illinois, taught by Prof. ChengXiang Zhai in the Spring 2013 semester.
Petition predictor final
1. CS410 Course Project Presentation
Petition Predictor
CS410 Spring 2013
Lucky Adike
Martin McEnroe
Dann Ormond
2. CS410 Course Project Presentation
Problem Statement
Congress shall make no law respecting an establishment of religion, or prohibiting the free
exercise thereof; or abridging the freedom of speech, or of the press; or the right of the
people peaceably to assemble, and to petition the Government for a redress of grievances.
- The First Amendment of the United States Constitution
• January 2012: Congress was considering legislation (SOPA) proposed on behalf of content distributors
• The internet community grew increasingly alarmed about the change and its side effects
• Several well-publicized events took place on January 18, 2012 as part of the SOPA
blackout day: Google, Reddit, Wired, Wikipedia and 115,000 other websites modified
their web presence to protest the pending legislation.
• On January 20 the legislation was shelved indefinitely
What useful information retrieval tool could be built?
• Could this citizenry-government action have been anticipated and predicted?
• Could information retrieval and analysis of the online conversation anticipate and
predict the end result?
3. CS410 Course Project Presentation
100,000 signatures in 30 days
Which new petitions will hit threshold?
Reach threshold and Whitehouse responds
Must register with email and zip code
Related work: on 2/21/13 the Whitehouse hosted a hackathon and released project results on 5/1.
Pulse predicts when a petition will pass the 100k threshold:
http://youtu.be/5-2P4GFZf8Y
https://github.com/DruRly/pulse
4. CS410 Course Project Presentation
Solution Approach
• 1st Idea: Classify the petition:
– “1” : Petition will receive 100,000 discrete, validated signatures within 30 days
– “0” : Petition will not pass the 100,000 threshold in time
• How to make a classification decision?
1. Statistical analysis of past performance.
• Wrote a Python program to scrape the whitehouse website every 8 hours.
Results are stored in a JSON object for use in subsequent analysis and retrieval
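A minimal sketch of how each 8-hour scrape could be appended to a JSON store. The function name, the file layout, and the dictionary keys (id, title, body, created, signatures) are assumptions for illustration; the scraping step itself is not shown and the project's actual format may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_snapshot(store_path, petitions, taken_at=None):
    """Append one scrape result to a JSON file keyed by petition ID.

    `petitions` is a list of dicts with hypothetical keys:
    id, title, body, created, signatures.
    """
    path = Path(store_path)
    store = json.loads(path.read_text()) if path.exists() else {}
    stamp = (taken_at or datetime.now(timezone.utc)).isoformat()
    for p in petitions:
        # Static fields are stored once, on the first sighting of a petition.
        entry = store.setdefault(p["id"], {
            "title": p["title"],
            "body": p["body"],
            "created": p["created"],
            "counts": [],
        })
        # One [timestamp, signature count] pair per scrape.
        entry["counts"].append([stamp, p["signatures"]])
    path.write_text(json.dumps(store, indent=2))
    return store
```

Calling this once per scrape builds the per-petition time series that the curve fit later consumes.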
Text of petition
Signature count every 8 hours starting 4/28
Petition create date (but only viewable on website after 150 signatures)
a unique identifier, also useful as a search term
Title of petition
During course of project we changed to ranking petitions
5. CS410 Course Project Presentation
Logarithmic Curve Fit of 10 Most Likely Petitions
[Figure: number of signatures (log scale, 150 to 150,000) vs. time in 8 hour increments (0 to 100), petitions time shifted to common origin = creation date. Series: archbishops, marijuana3, airgun, postal, Malaysian, assault, habeas, aggag, thallium, transnational, each with a logarithmic trend line. Threshold @ 100,000 signatures.]
Fit the curve, then predict the 30th day value
(x = 90 since we sample every 8 hours)
Petition ‘fatigue’ suggests a logarithmic
model is the better predictor
- Natural log used (base w1 = e); the base can be tuned
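A rough sketch of the curve fit and 30th-day prediction, assuming an ordinary least-squares fit of signatures = a + b * log(x) with samples indexed from x = 1. The function names are hypothetical and the deck does not say which tool produced its trend lines.

```python
import math


def fit_log_curve(counts, base=math.e):
    """Least-squares fit of counts[i] ~ a + b * log_base(i + 1).

    counts[i] is the signature count at the (i + 1)-th 8-hour sample.
    Needs at least two samples.
    """
    xs = [math.log(i + 1, base) for i in range(len(counts))]
    n = len(counts)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x=90, base=math.e):
    """Predicted signature count at sample x (x = 90 is day 30)."""
    return a + b * math.log(x, base)
```

A petition would be classified "1" when predict(a, b) reaches 100,000, or simply ranked by that predicted value.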
6. CS410 Course Project Presentation
Twitter: Tweets and Followers
After signing, wh.gov site encourages you to promote the petition
• Used public Twitter REST API
• Search on the petition title
• Tweet Rate = count / # of days (Twitter limits the age of tweets available through the API)
• Transform the rate so that the reward depends on place in rank, not on the absolute difference in value
– sublinear
– linear
– exponential
• Guess: linear
Tweet Weight Adjusted for ∑ƒ(followers)
• Are some tweeters more important than others?
• Can we develop something like authorities/hubs?
• Weighted Rate incorporates number of followers to
increase/decrease score of each tweet
Adj. Score = ∑ log5(followers) / days of tweets
Base 5 -> Pivot point is w2 = 5 followers – can be tuned
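The two rates can be sketched as below. The function names are hypothetical, and clamping accounts with no followers to 1 (so they contribute 0) is an assumption, since the slides do not say how such accounts were handled.

```python
import math


def tweet_rate(tweet_count, days):
    """Plain tweet rate: tweets found by the search divided by days covered."""
    return tweet_count / days


def follower_adjusted_rate(follower_counts, days, base=5):
    """Sum of log_base(followers) over all tweets, divided by days covered.

    With base 5, a tweeter with 5 followers contributes exactly 1;
    more followers contribute more, fewer contribute less.
    Assumption: follower counts below 1 are clamped to 1 (contribution 0).
    """
    total = sum(math.log(max(f, 1), base) for f in follower_counts)
    return total / days
```

For example, 94 tweets over 8 days gives 11.75 per day, consistent with the 11.8 shown for xNskxL1q on the next slide.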
7. CS410 Course Project Presentation
Transforming Rank to Boost Factors
• Petition rank is mapped to a boost factor via a linear function – function type can be tuned
• A tunable scaling parameter is applied, based on our judgment of the importance of each IR category
– tweet rate: w3 = .02, so 1st -> 1.10; 10th -> .90
– follower adjusted tweet rate: w4 = .04, so 1st -> 1.20; 10th -> .80
Petition ID  Ln Curve Fit  Tweets  Tweet Rate  Rank   Boost    Boost       Follower   Weighted/  Rank   Boost    Boost
                 (w1 = e)             per Day        Factor           Weighted Rate  Rate Ratio        Factor
xNskxL1q           16,545      94        11.8    10    0.90   -1,655           39.7       3.379     7    0.92   -1,324
xqNMVRB4            9,115      97        12.1     9    0.92     -729           44.1       3.636     1    1.20    1,823
khpw6LCt           50,898    1022       127.8     2    1.08    4,072          459.8       3.600     3    1.12    6,108
drCmyCHZ           21,280     231        28.9     5    1.02      426          103.1       3.570     4    1.08    1,702
nBqKR7bm          446,841    1676       838.0     1    1.10   44,684         2675.7       3.193     9    0.84  -71,494
kVhNfHQ1           14,720     168        21.0     6    0.98     -294           71.7       3.412     6    0.96     -589
bMJpDrNq            6,769     114        14.3     8    0.94     -406           49.6       3.479     5    1.04      271
KQWSvsKr            5,380     127        15.9     7    0.96     -215           57.7       3.635     2    1.16      861
Rd8C54p1           83,231      93        31.0     4    1.04    3,329           63.5       2.047    10    0.80  -16,646
V3hNt2fB           17,376     508        63.5     3    1.06    1,043          208.5       3.283     8    0.88   -2,085
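The rank-to-boost mapping can be sketched as follows. The exact formula is not stated on the slide; this one is inferred from the table values, which step by the weight per rank and skip a factor of exactly 1.00 between 5th and 6th place. The function names are hypothetical.

```python
def boost_factor(rank, weight, n=10):
    """Linear rank-to-factor map inferred from the table.

    With n = 10 and weight = .02: 1st -> 1.10, 5th -> 1.02,
    6th -> 0.98, 10th -> 0.90 (no rank maps to exactly 1.00).
    """
    half = n // 2
    steps = half - rank + 1 if rank <= half else half - rank
    return 1 + weight * steps


def boost(base_prediction, rank, weight, n=10):
    """Signatures added to (or removed from) the curve-fit prediction."""
    return base_prediction * (boost_factor(rank, weight, n) - 1)
```

For khpw6LCt, boost(50898, 2, .02) is about 4,072 and boost(9115, 1, .04) for xqNMVRB4 is about 1,823, matching the Boost columns.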
8. CS410 Course Project Presentation
Can Google Trends help us?
Chunks,value (0 – 100)
Revoke US Visa,7
on,83
National Security Grounds,7
to,83
Venezuelan Government Officials,0
involved,65
in,93
Transnational Organized Crime,65
Converted the petition title into search
phrases using OpenNLP
• sentence detector
• tokenizer & POS tagger => Chunker
Some observations
• Chunking produced common terms with high scores
• Would be more useful to build a custom Query
background language model – need more data
• Not clear how Google Trends computes values from 0
to 100 – values for different petitions are not
comparable with each other
• Doesn’t appear to be a “bag of words” model. What
about semantically equivalent terms? We were
hoping for a tf-idf weighting from the web
• Is there another tool out there? Are there functions of
the API we didn’t exploit? Will the API evolve?
• Most unreliable IR source, therefore w5 = .01
Results from web interface
Results from API interface
9. CS410 Course Project Presentation
Authority Sites via Bing API
• Created list of 30 authoritative web sites (e.g., cnn.com). Each weighted equally.
• Sent full title of petition as query to Bing API exactly as listed on wh.gov:
“Invest and deport Jasmine Sun who was the main suspect of a famous Thallium
poison murder case (victim:Zhu Lin) in China”
• Counted the results in the top 50 that came from an authoritative
domain, excluding self-posting parts of a domain (e.g., http://ireport.cnn.com/docs/DOC-965382)
• Observation: Most petitions do not receive mainstream attention
• Second most reliable source, therefore w6 = .03
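A sketch of the counting step, assuming the search results are already available as a list of URLs (the Bing API call is not shown). The constant and function names are hypothetical; the domain set holds only the one example named on the slide, and the self-posting host is the one from the example URL.

```python
from urllib.parse import urlparse

# Assumption: stand-in for the project's list of 30 authoritative sites.
AUTHORITY_DOMAINS = {"cnn.com"}
# Self-posting sections that should not count as mainstream coverage.
SELF_POSTING_HOSTS = {"ireport.cnn.com"}


def count_authority_hits(result_urls, top_n=50):
    """Count results in the top_n whose host belongs to an authority domain."""
    hits = 0
    for url in result_urls[:top_n]:
        host = (urlparse(url).hostname or "").lower()
        if host in SELF_POSTING_HOSTS:
            continue
        # Match the domain itself or any subdomain, but not lookalikes.
        if any(host == d or host.endswith("." + d) for d in AUTHORITY_DOMAINS):
            hits += 1
    return hits
```

The resulting counts are what the Authority Sites column ranks before the boost factor is applied.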
Petition ID  keyword        Close  Ln Curve Fit  Authority  Rank   Boost    Boost
                             Date                    Sites        Factor
xNskxL1q     archbishops     5/27        16,545          5     3    1.09    1,489
xqNMVRB4     marijuana3      5/17         9,115          8     1    1.15    1,367
khpw6LCt     airgun          5/15        50,898          4     4    1.06    3,054
drCmyCHZ     postal          5/24        21,280          6     2    1.12    2,554
nBqKR7bm     Malaysian        6/4       446,841          2     6    0.97  -13,405
kVhNfHQ1     assault         5/21        14,720          0    10    0.85   -2,208
bMJpDrNq     habeas          5/27         6,769          3     5    1.03      203
KQWSvsKr     aggag           5/10         5,380          2     8    0.91     -484
Rd8C54p1     thallium         6/4        83,231          0     9    0.88   -9,988
V3hNt2fB     transnational    6/3        17,376          2     7    0.85   -2,606
10. CS410 Course Project Presentation
Putting it together
• Our focus was on acquiring data and constructing a model, automating where
necessary and using open tools, APIs, and information sources
• The transfer of data between modules and the final ranking computation need
more automation if we are to run unattended
• Much data analysis, both manual and automated, went into guessing at important sources and
parameters. Many initial ideas didn’t pan out:
– Sentiment analysis (no such thing as bad publicity)
– Google Trends surprisingly useless – forced to do manual manipulation – very low
confidence in this as a predictor
– Facebook button on wh.gov, but it didn’t appear to be used as much as Twitter
– No training data to choose parameters. Chose a simple “boost” model to start and
used intuition from the project to guess at the relative size of the boost from different sources.
Stop 85
seismic airgun testing 0
for 86
oil and gas 77
off 80
the U.S. East Coast . 0
11. CS410 Course Project Presentation
Putting Our Money Where Our Mouth Is…
Ranked predictions of the 10 most likely [1] of the 84 [2] petitions started between April 5 and May 4.
How will we do?
1. Only petitions that have at least 150 signatures are visible to us
2. One petition ( 0MNp0Bys ) started on 4/15 and hit 100k before we started collecting statistics, so we excluded it from our data set
Petition ID  keyword        Close  Linear Curve  Naïve  Ln Curve Fit   Twitter          Twitter+  Google Trends  Authority Sites  Combined  Predicted
                             Date           Fit  Order        w1 = e  w3 = .02  w2 = 5, w4 = .04       w5 = .01         w6 = .03     model      Order
xNskxL1q     archbishops     5/27        31,309      5        16,545    -1,655            -1,324            165            1,489    15,222          5
xqNMVRB4     marijuana3      5/17        10,048      9         9,115      -729             1,823            456            1,367    12,032          7
khpw6LCt     airgun          5/15        57,185      4        50,898     4,072             6,108         -2,036            3,054    62,096          2
drCmyCHZ     postal          5/24        27,929      6        21,280       426             1,702           -426            2,554    25,536          4
nBqKR7bm     Malaysian        6/4     2,895,387      1       446,841    44,684           -71,494         17,874          -13,405   424,499          1
kVhNfHQ1     assault         5/21        16,568      7        14,720      -294              -589           -442           -2,208    11,187          8
bMJpDrNq     habeas          5/27        12,036      8         6,769      -406               271            -68              203     6,769          9
KQWSvsKr     aggag           5/10         5,448     10         5,380      -215               861           -269             -484     5,273         10
Rd8C54p1     thallium         6/4       734,304      2        83,231     3,329           -16,646          1,665           -9,988    61,591          3
V3hNt2fB     transnational    6/3        88,236      3        17,376     1,043            -2,085            521           -2,606    14,248          6
Baseline IR Model Prediction
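The combined column appears to be the Ln curve fit plus the four boost amounts, with the predicted order taken from sorting those totals. A small sketch under that reading; the function names are hypothetical and the numbers are copied from three rows of the table.

```python
def combined_score(base_prediction, boosts):
    """Ln curve-fit prediction plus the signed boost from each IR source."""
    return base_prediction + sum(boosts)


def predicted_order(scores):
    """Map petition ID -> rank, where rank 1 has the highest combined score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {pid: i + 1 for i, pid in enumerate(ranked)}


# Boost order: Twitter, Twitter+, Google Trends, Authority Sites.
scores = {
    "khpw6LCt": combined_score(50898, [4072, 6108, -2036, 3054]),
    "Rd8C54p1": combined_score(83231, [3329, -16646, 1665, -9988]),
    "drCmyCHZ": combined_score(21280, [426, 1702, -426, 2554]),
}
order = predicted_order(scores)
```

Among these three, airgun (khpw6LCt) edges out thallium (Rd8C54p1) even though thallium has the larger curve-fit value, which is the reordering the boosts were meant to produce.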
12. CS410 Course Project Presentation
Quo Vadis?
Do Research
• Collect more data, train parameters, learn different ways to make predictions
• Publish
• Awesome? idea for a team competition homework 5 in a future class
Sharpen CS skills
• Whitehouse.gov released API on 5/1 and a historical corpus on 5/2
• Next Whitehouse hackathon on 6/1
Make money
• Turn this into an actual app and host it on web site
– Business model: tweet dashboard link to anyone who tweets a petition, dashboard
site is advertising supported
• Apply methods to other petition sites:
change.org, gopetition.com, ipetitions.com, signon.org, thepetitionsite.com, care2.com
(or get a job at one of these companies)
Give back
• Fraudulent petition signature detection
• Mine the web for new petition topics with high success potential
Editor's Notes
We noticed that of approximately 90 current predictions, three of them have to do with marijuana. This gave us an idea.
work to do: Marty – compute numbers, sort by start date