CLASSIFYING OMNIBUS BILLS
OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016
SAMUEL WITHERSPOON, MATHEW SONKE
DISCLAIMER
THIS IS OUR FIRST ITERATION AND IS A WORK IN PROGRESS.
PURPOSE
WE WANT TO SHOW HOW WE MOVED FROM START TO A FIRST SET OF RESULTS ON AN ML PROBLEM
SUMMARY OF EFFORT
≈ 50 HOURS SPENT
≈ 120 BILLS MANUALLY CLASSIFIED
SOURCE CODE: https://github.com/switherspoon/MachineLearningMeetup
WHAT IS AN OMNIBUS BILL?
A BILL THAT HAS A WIDE VARIETY OF TOPICS
TYPICALLY VERY LONG
TYPICALLY MODIFIES LOTS OF OTHER BILLS
FOR EXAMPLE, BILL C-51
THAT DEFINITION INFORMED OUR FEATURES
FEATURES:
LENGTH OF BILL
DIVERSITY OF TOPICS IN THE BILL
NUMBER OF OTHER BILLS MODIFIED/REFERENCED
WHAT DOES AN OMNIBUS LOOK LIKE?
BILL C-51 - 41st PARLIAMENT 2nd SESSION (4/19 PAGES)
BILL C-54 - 41st PARLIAMENT 2nd SESSION (1/1 PAGE)
GETTING STARTED
WE USED PYTHON3 WITH:
1. NLTK (http://www.nltk.org/) - FOR NLP
2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER
3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL
4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT
ALL INSTALLED WITH PIP3
GETTING STARTED
(CONT…)
WE SOURCED OUR DATA FROM:
https://openparliament.ca/
http://parl.gc.ca
DATA ANALYSIS
MANUALLY SKIMMED AND EXTRACTED FEATURES FROM ≈120 BILLS AND BUILT A SPREADSHEET
LINK: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing
MODEL FEATURES
LENGTH OF BILL
NUMBER OF BILLS REFERENCED
AVERAGE SEMANTIC DISTANCE OF TOPICS IN
EACH BILL
THE MODEL
THE CLASSIFIER
NAIVE BAYES
EASY
FAST
UNDERSTANDABLE
WORKS WELL WITH SMALL TRAINING SET (MAYBE NOT THIS SMALL)
LENGTH OF BILL
LENGTH (IN CHARACTERS) OF THE RAW STRING READ IN FROM FILES
AS EASY AS:
len(raw)
NUMBER OF BILLS
REFERENCED
(1) DATA RETRIEVAL
(2) PREPROCESSING
(3) NAMED ENTITY RECOGNITION (NER)
(1) DATA RETRIEVAL
2 DATA SETS TO COLLECT
• CONSOLIDATED LIST OF ACTS
• FULL TEXT OF BILLS
DATA RETRIEVAL
CONT…
LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA
(http://laws-lois.justice.gc.ca/eng/acts/)
WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE
• SCRAPY IS POWERFUL BUT HAD NO PYTHON3 SUPPORT AT THE TIME
• IMPORT.IO WORKED WELL FOR OUR NEEDS
DATA RETRIEVAL
CONT…
TEXT OF BILLS RETRIEVED FROM OPENPARLIAMENT
DATABASE USING SQL
(2) PREPROCESSING
OPENPARLIAMENT DATABASE ISN’T PERFECT
• REMOVED DUPLICATES
• VERIFIED SESSION NUMBER WAS CORRECT
• CONVERTED EVERYTHING TO LOWERCASE
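A minimal sketch of the three cleanup steps above, assuming each bill arrives as a (number, session, text) tuple. The tuple layout, the `clean_bills` name, and the sample rows are hypothetical, since the slides do not show the OpenParliament schema. Dropping rows with an unexpected session is also an assumption; the authors may have corrected them instead.

```python
def clean_bills(rows, valid_sessions):
    """Drop duplicates and unverified sessions, then lowercase the text.

    rows: iterable of (number, session, text) tuples (hypothetical layout).
    valid_sessions: set of session labels considered correct.
    """
    seen = set()
    cleaned = []
    for number, session, text in rows:
        if session not in valid_sessions:
            continue  # session number failed verification
        key = (number, session)
        if key in seen:
            continue  # duplicate row for the same bill
        seen.add(key)
        cleaned.append((number, session, text.lower()))
    return cleaned


# Placeholder rows, not real bill text.
rows = [
    ("X-1", "41-2", "Sample Text Of A Bill"),
    ("X-1", "41-2", "Sample Text Of A Bill"),
    ("X-2", "41-9", "Another Sample"),
]
bills = clean_bills(rows, valid_sessions={"41-1", "41-2"})
```

Keying duplicates on (number, session) rather than on the text keeps two sessions' bills with the same number apart.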
(3) NAMED ENTITY
RECOGNITION
MANY APPROACHES TO THIS
• HAND-CRAFTED GRAMMAR BASED
• STATISTICAL MODELS
• MATCHING AGAINST A LIBRARY
NAMED ENTITY
RECOGNITION CONT…
WE NOTICED COMMON PHRASES LIKE “AMENDS”,
“RELATED AMENDMENTS”, “REPLACED BY” WHEN
REFERENCING ACTS
ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY (THE CONSOLIDATED LIST OF ACTS)
• THIS GAVE US GOOD RESULTS WITH LITTLE CODE
• WON’T ALWAYS WORK
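Matching against a library can be sketched as a plain substring search over lowercased text. This is an illustration under assumptions, not the code from the talk: `find_referenced_acts` is a hypothetical name, and the three act titles and the sample sentence are stand-ins for the scraped list and a real bill.

```python
def find_referenced_acts(bill_text, act_titles):
    """Return the act titles that appear verbatim in the bill text."""
    text = bill_text.lower()
    return [title for title in act_titles if title.lower() in text]


# Illustrative stand-ins for the consolidated list of acts.
act_titles = ["Criminal Code", "Income Tax Act", "Fisheries Act"]
bill_text = (
    "this enactment amends the criminal code and makes "
    "related amendments to the income tax act."
)

referenced = find_referenced_acts(bill_text, act_titles)
num_referenced = len(referenced)
```

The count `num_referenced` is the feature value. Verbatim matching misses abbreviated or paraphrased titles and can double count when one title is contained in another, which is one reason it won't always work.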
SEMANTIC DISTANCE
OF TOPICS
HYPOTHESIS:
SINCE AN OMNIBUS BILL HAS MANY DIFFERENT TOPICS, THE AVERAGE DISTANCE BETWEEN TOPICS IN AN OMNIBUS BILL WILL BE GREATER THAN IN A NON-OMNIBUS BILL.
SEMANTIC DISTANCE
OF TOPICS PROCEDURE
(1) PREPROCESS A BILL
(2) LDA TOPIC MODELLING ON THE BILL
(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)
(4) AVERAGE TOPIC DISTANCE OF THE BILL
(1) PREPROCESSING
• READ IN FILES
• TOKENIZE WORDS
• REMOVE STOP WORDS
• IGNORE WORD ORDER (BAG OF WORDS)
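The preprocessing steps above can be sketched with the standard library alone. The talk used NLTK for this; the regex tokenizer and the tiny `STOP_WORDS` set below are simplified stand-ins, and the sample sentence is made up.

```python
import re
from collections import Counter

# Tiny illustrative stop list; NLTK ships a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "by", "this"}


def bag_of_words(raw):
    """Tokenize, drop stop words, and count the remaining tokens."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(kept)  # word order is discarded, only counts remain


raw = "This Act amends the Criminal Code and the Act is amended by this enactment."
bow = bag_of_words(raw)
```

A `Counter` is enough here because a bag of words is just a mapping from token to frequency.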
(2) LDA TOPIC MODELING
• PROBABILISTIC TOPIC MODEL
• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION
• PROBABILISTICALLY PRESUMES DOCUMENTS CONTAIN A HIDDEN STRUCTURE BUILT AROUND TOPICS
• IGNORES WORD ORDER
LDA CONT…
• MANY BILLS TOO SHORT FOR MEANINGFUL ANALYSIS W/ LDA
• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY SCORE OF ‘1’
• THIS IS A REALLY BAD WORKAROUND
• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES
• THIS IS AN OPTIMIZATION PROBLEM
MORE READING:
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
(3) SEMANTIC SIMILARITY
LIN SIMILARITY
BUT WHAT DOES THIS MEAN???
WORDNET
A HIERARCHICAL TREE OF WORDS WITH MORE GENERAL WORDS AT THE ROOT AND MORE SPECIFIC WORDS AT THE LEAVES
SIMILARITY CONT…
LIN SIMILARITY
*OVERSIMPLIFICATION*
THERE IS A GRAPH/NETWORK OF SYNONYM SETS - LIN SIMILARITY COMPARES THE INFORMATION CONTENT OF THE FIRST COMMON ANCESTOR (LOWEST COMMON ANCESTOR) WITH THAT OF THE TWO WORDS BEING COMPARED
SIMILARITY CONT…
SCORES ARE BETWEEN 0 AND 1
>0.8 MEANS VERY SIMILAR
<0.2 MEANS NOT VERY SIMILAR
E.G.
CAT & DOG = 0.88 OR 0.89 (BROWN AND SEMCOR IC)
HOUND & DOG = 0.88 OR 0.87 (BROWN AND SEMCOR IC)
CHAIR & DOG = 0.16 OR 0.18 (BROWN AND SEMCOR IC)
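For reference, Lin similarity is defined from information content (IC): twice the IC of the lowest common ancestor, divided by the sum of the ICs of the two concepts, where IC is the negative log of a concept's corpus probability. The sketch below computes it from probabilities directly. The probability values are hypothetical and are not taken from the Brown or SemCor counts; with NLTK the same measure is available as `lin_similarity` on WordNet synsets, given an information content dictionary.

```python
import math


def information_content(probability):
    """IC of a concept: rarer concepts carry more information."""
    return -math.log(probability)


def lin_similarity(p_a, p_b, p_lcs):
    """Lin similarity from the corpus probabilities of two concepts
    and of their lowest common ancestor in the hierarchy."""
    ic_a = information_content(p_a)
    ic_b = information_content(p_b)
    ic_lcs = information_content(p_lcs)
    return 2.0 * ic_lcs / (ic_a + ic_b)


# Hypothetical probabilities, chosen only to show the two regimes.
close = lin_similarity(p_a=0.001, p_b=0.002, p_lcs=0.004)  # specific ancestor
far = lin_similarity(p_a=0.001, p_b=0.002, p_lcs=0.5)      # very general ancestor
```

A specific common ancestor (low probability, high IC) pushes the score toward 1, while an ancestor near the root pushes it toward 0.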
(4) AVG. TOPIC DISTANCE
IN A BILL
WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH BILL:
SUM OF ALL COMPARED SCORES / TOTAL NUMBER OF COMPARISONS
THERE ARE FLAWS IN THIS APPROACH
• NOUN ONLY
• NO WEIGHTING
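The averaging step can be sketched as a mean over all unordered pairs of topic words. The `average_similarity` name, the pluggable `similarity` argument, and the lookup table are assumptions for illustration; the cat/dog and chair/dog values come from the earlier slide, while the cat/chair value is made up. The fallback of 1 for bills with too few words mirrors the workaround described on the LDA slide.

```python
from itertools import combinations


def average_similarity(topic_words, similarity, too_short_score=1.0):
    """Mean pairwise similarity over all unordered pairs of topic words.

    Returns too_short_score when fewer than two words are available.
    """
    pairs = list(combinations(topic_words, 2))
    if not pairs:
        return too_short_score
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Lookup table standing in for a WordNet query (cat/chair is hypothetical).
scores = {
    frozenset(["cat", "dog"]): 0.88,
    frozenset(["cat", "chair"]): 0.20,
    frozenset(["chair", "dog"]): 0.16,
}


def toy_similarity(a, b):
    return scores[frozenset([a, b])]


avg = average_similarity(["cat", "dog", "chair"], toy_similarity)
short = average_similarity(["tax"], toy_similarity)
```

Every pair counts equally here, which is the "no weighting" flaw: a topic's LDA probability has no effect on the score.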
CLASSIFICATION!
WE WERE RUNNING OUT OF TIME…
WE WANTED TO COMPARE:
• NAIVE BAYES
• RANDOM FOREST (DECISION TREES)
• SVM
WE COMPARED:
• NAIVE BAYES!
CLASSIFIER
COMPARISON
:(
NAIVE BAYES
• GAUSSIAN
• MULTINOMIAL
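To show what the Gaussian variant does with three numeric features, here is a from-scratch sketch. It is not scikit-learn's implementation (the library provides `GaussianNB` and `MultinomialNB` in `sklearn.naive_bayes`), and the feature rows are hypothetical values, not data from the spreadsheet.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class priors, feature means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # log of the normal density, one independent term per feature
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical rows: [length, acts referenced, average similarity].
X = [
    [900000, 40, 0.15],
    [750000, 25, 0.20],
    [12000, 1, 0.60],
    [8000, 2, 0.70],
    [15000, 0, 1.00],
]
y = ["omnibus", "omnibus", "not", "not", "not"]
model = fit_gaussian_nb(X, y)
label = predict_gaussian_nb(model, [800000, 30, 0.18])
```

The "naive" part is the inner loop: each feature contributes its own independent term to the score.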
MODEL EVALUATION
WE WON’T SHOW YOU ACCURACY BECAUSE…
CLASS IMBALANCE!
• 9 OMNIBUS BILLS IN 120 BILLS
• 7.5% CHANCE A BILL IS AN OMNIBUS BILL
• A CLASSIFIER COULD HAVE 92.5% ACCURACY BY PICKING ‘NOT OMNIBUS’ EVERY TIME!
PRECISION
True Positives / (True Positives + False Positives)
RECALL
True Positives / (True Positives + False Negatives)
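A short worked example of why accuracy misleads here, using the 9-in-120 split from the class imbalance slide and a classifier that always answers "not omnibus". The `precision_recall` helper is a hypothetical name; returning `None` for an undefined ratio is a design choice of this sketch.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall); None when a ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# 9 omnibus bills (label 1) among 120 bills.
y_true = [1] * 9 + [0] * 111
always_negative = [0] * 120

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall = precision_recall(y_true, always_negative)
```

Accuracy comes out at 111/120, yet recall is zero because no omnibus bill is ever found, and precision is undefined because nothing is ever predicted positive.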
BUT WE HAVE A CLASS
IMBALANCE PROBLEM
PRETENDING WE DON’T HAVE
A PROBLEM
CLASS IMBALANCE
SOLUTION
REMOVE THE IMBALANCE!!!!
WE WENT FROM 65 TRAINING EXAMPLES TO 25 TO 11
BY REMOVING NEGATIVE EXAMPLES
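Removing negative examples can be sketched as undersampling: keep every omnibus example and a subset of the rest. Choosing the subset at random with a fixed seed is an assumption; the slides do not say how the negatives were picked. The integer IDs are placeholders for feature rows.

```python
import random


def undersample(examples, labels, n_negatives, positive=1, seed=0):
    """Keep every positive example and a random subset of negatives."""
    pairs = list(zip(examples, labels))
    positives = [(x, y) for x, y in pairs if y == positive]
    negatives = [(x, y) for x, y in pairs if y != positive]
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    kept = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(kept)
    return [x for x, _ in kept], [y for _, y in kept]


# 5 omnibus and 60 non-omnibus training examples (placeholder IDs).
examples = list(range(65))
labels = [1] * 5 + [0] * 60

x20, y20 = undersample(examples, labels, n_negatives=20)  # 25 examples
x6, y6 = undersample(examples, labels, n_negatives=6)     # 11 examples
```

The trade-off is that each step throws away real negative examples, so the classifier sees less of what a non-omnibus bill looks like.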
RESULTS
TRUE CLASS IMBALANCE (5:60)
NEW (5:20)
RATIOS ARE (#OMNIBUS:#NOT OMNIBUS)
REMOVING EVEN MORE
NEW (5:20)
NEWEST (5:6)
FINAL TRAINING SET
CONCLUSIONS
WE NEED EITHER:
(1) SUBSTANTIALLY MORE DATA; OR
(2) BETTER ACCURACY ON TOPIC EXTRACTION AND NAMED ENTITY RECOGNITION
LOTS OF ROOM FOR IMPROVEMENT
WE STILL THINK THREE FEATURES IS ENOUGH
NEED TO DO MORE WORK CLEANING/VALIDATING OUR
INPUT DATA
CONCLUSIONS CONT…
WE ARE PERFORMING BETTER THAN
RANDOM GUESSING!
WE WOULD LOVE HELP IMPROVING
OUR APPROACH
WAYS TO IMPROVE
USE A MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY
LINKED TOPIC MODELLING
IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS
EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS
USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)
TRY TF-IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A DOCUMENT
EXPERIMENT WITH OTHER CLASSIFIERS
EXPERIMENT WITH MORE FEATURES
…
QUESTIONS?
Machine learning is no cakewalk.
Can we form a group to help Ottawa companies achieve greater success with ML?
What would this group do?
Who would be in it?
How would it be funded?
Do we have the local talent?
What about protecting IP?
Who would make the decisions?
Why bother?
We want your feedback!
If you'd like to participate in ongoing discussions, please leave us your contact info.
RELATIVE OPERATING CHARACTERISTICS (ROC)
[FIGURE: ROC CURVES FOR RANDOM GUESS, GAUSSIAN, AND MULTINOMIAL; X-AXIS: FALSE POSITIVE RATE (0 TO 1), Y-AXIS: TRUE POSITIVE RATE (0 TO 1)]
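A minimal sketch of how the points on an ROC plot are computed: sweep the decision threshold from the highest score down and record the false positive rate and true positive rate at each step. The labels and scores below are hypothetical and are not the results from the talk.

```python
def roc_points(y_true, scores, positive=1):
    """Return (false positive rate, true positive rate) pairs obtained
    by lowering the decision threshold one distinct score at a time."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(1 for t in flagged if t == positive)
        fp = len(flagged) - tp
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and classifier scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
points = roc_points(y_true, scores)
```

A random guess follows the diagonal from (0, 0) to (1, 1), so a useful classifier should produce points above that line.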

Jan25 - Ottawa Machine Learning Meetup