8. BILL C-51 - 41st PARLIAMENT, 2nd SESSION: 4/19 PAGES
BILL C-54 - 41st PARLIAMENT, 2nd SESSION: 1/1 PAGE
9. GETTING STARTED
WE USED PYTHON3 WITH:
1. NLTK (http://www.nltk.org/) - FOR NLP
2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER
3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL
4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT
ALL INSTALLED WITH PIP3
11. DATA ANALYSIS
MANUALLY SKIMMED AND EXTRACTED FEATURES FROM
≈120 BILLS AND BUILT A SPREADSHEET
LINK: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing
12. MODEL FEATURES
LENGTH OF BILL
NUMBER OF BILLS REFERENCED
AVERAGE SEMANTIC DISTANCE OF TOPICS IN
EACH BILL
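As a rough illustration, the three features above could be held in one record per bill and flattened into a row for a classifier. The class name, field names, and numbers below are hypothetical; the slides do not say how the features were stored or what unit the bill length used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BillFeatures:
    """One bill's model features (hypothetical field names)."""
    length: float              # length of the bill; unit is not stated in the slides
    bills_referenced: int      # number of bills referenced
    avg_topic_distance: float  # average semantic distance of topics in the bill

    def as_vector(self) -> List[float]:
        # Keep a fixed feature order so every bill produces a comparable row.
        return [float(self.length),
                float(self.bills_referenced),
                float(self.avg_topic_distance)]


# Made-up values, for illustration only.
example = BillFeatures(length=12000, bills_referenced=14, avg_topic_distance=0.35)
row = example.as_vector()
```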
17. (1) DATA RETRIEVAL
2 DATA SETS TO COLLECT
• CONSOLIDATED LIST OF ACTS
• FULL TEXT OF BILLS
18. DATA RETRIEVAL CONT…
LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA
(http://laws-lois.justice.gc.ca/eng/acts/)
WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE
• SCRAPY IS POWERFUL BUT HAD NO PYTHON3 SUPPORT AT THE TIME
• IMPORT.IO WORKED WELL FOR OUR NEEDS
23. NAMED ENTITY RECOGNITION CONT…
WE NOTICED COMMON PHRASES LIKE “AMENDS”,
“RELATED AMENDMENTS”, “REPLACED BY” WHEN
REFERENCING ACTS
ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY OF ACT NAMES
• THIS GAVE US GOOD RESULTS WITH LITTLE CODE
• WON’T ALWAYS WORK
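A minimal sketch of the matching idea, assuming the library is a plain list of act titles and that a case-insensitive whole-phrase match is good enough. The function name, the sample titles, and the sample sentence are illustrative and are not taken from the authors' code.

```python
import re
from typing import Iterable, List


def find_referenced_acts(bill_text: str, act_titles: Iterable[str]) -> List[str]:
    """Return the act titles that occur in the bill text, sorted alphabetically."""
    # Collapse whitespace and lowercase so line breaks and casing do not block a match.
    normalized = re.sub(r"\s+", " ", bill_text).lower()
    found = set()
    for title in act_titles:
        # \b keeps "Income Tax Act" from matching inside a longer word.
        pattern = r"\b" + re.escape(title.lower()) + r"\b"
        if re.search(pattern, normalized):
            found.add(title)
    return sorted(found)


# Hypothetical inputs for illustration.
ACTS = ["Criminal Code", "Income Tax Act", "Canada Evidence Act"]
TEXT = ("This enactment amends the Criminal\nCode and makes related "
        "amendments to the Canada Evidence Act.")
referenced = find_referenced_acts(TEXT, ACTS)
```

A literal match like this misses acts cited by a short or altered title, which is one way such an approach can fail.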
24. SEMANTIC DISTANCE OF TOPICS
HYPOTHESIS:
SINCE AN OMNIBUS BILL COVERS MANY DIFFERENT TOPICS,
THE AVERAGE DISTANCE BETWEEN TOPICS IN AN
OMNIBUS BILL WILL BE GREATER THAN IN A NON-OMNIBUS
BILL.
25. SEMANTIC DISTANCE OF TOPICS PROCEDURE
(1) PREPROCESS A BILL
(2) LDA TOPIC MODELLING ON THE BILL
(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)
(4) AVERAGE TOPIC DISTANCE OF THE BILL
26. (1) PREPROCESSING
• READ IN FILES
• TOKENIZE WORDS
• REMOVE STOP WORDS
• IGNORE WORD ORDER (BAG OF WORDS)
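The preprocessing steps above can be sketched with the standard library alone. This is an assumption-laden stand-in: the regular-expression tokenizer and the tiny stop word set are placeholders for illustration, and a real run would more likely use NLTK's tokenizer and stop word list. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop list; a real run would use a full one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "that", "this", "for", "by"}


def read_bill(path: str) -> str:
    """Read one bill's full text from a UTF-8 file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text: str) -> Counter:
    """Drop stop words and return a bag of words (token -> count, order ignored)."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens)


bag = preprocess("The Minister amends the Act and the Minister reports to Parliament.")
```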
27. (2) LDA TOPIC MODELLING
• PROBABILISTIC TOPIC MODEL
• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION
• ASSUMES DOCUMENTS CONTAIN A HIDDEN STRUCTURE
BUILT AROUND TOPICS
• IGNORES WORD ORDER
28. LDA CONT…
• MANY BILLS ARE TOO SHORT FOR MEANINGFUL ANALYSIS WITH
LDA
• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY
SCORE OF ‘1’ (MAXIMALLY SIMILAR)
• THIS IS A REALLY BAD WORKAROUND
• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES
• THIS IS AN OPTIMIZATION PROBLEM
MORE READING:
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
32. SIMILARITY CONT…
SCORES ARE BETWEEN 0 AND 1
>0.8 MEANS VERY SIMILAR
<0.2 MEANS NOT VERY SIMILAR
E.G.
CAT & DOG = 0.88 OR 0.89 (BROWN AND SEMCOR IC)
HOUND & DOG = 0.88 OR 0.87 (BROWN AND SEMCOR IC)
CHAIR & DOG = 0.16 OR 0.18 (BROWN AND SEMCOR IC)
33. (4) AVG. TOPIC DISTANCE IN A BILL
WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH
BILL:
SUM OF ALL COMPARED SCORES / TOTAL NUMBER OF COMPARISONS
THERE ARE FLAWS IN THIS APPROACH
• NOUNS ONLY
•NO WEIGHTING
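The averaging step can be sketched as follows, assuming every pair of topic words is compared once and every score counts equally (no weighting). The slides do not say which pairs were actually compared, so the all-pairs choice, the function names, and the toy score table are assumptions. The fallback of 1 for a bill with nothing to compare loosely mirrors the short-bill workaround described earlier in the deck.

```python
from itertools import combinations
from typing import Callable, Sequence


def average_similarity(words: Sequence[str],
                       similarity: Callable[[str, str], float],
                       short_bill_score: float = 1.0) -> float:
    """Sum of all compared scores divided by the total number of comparisons."""
    pairs = list(combinations(words, 2))
    if not pairs:
        # Nothing to compare: fall back to the fixed score used for short bills.
        return short_bill_score
    scores = [similarity(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Toy lookup table standing in for a real word-similarity measure.
TOY_SCORES = {
    frozenset(("dog", "cat")): 0.88,    # value quoted on the similarity slide
    frozenset(("dog", "chair")): 0.16,  # value quoted on the similarity slide
    frozenset(("cat", "chair")): 0.20,  # made up for illustration
}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SCORES[frozenset((a, b))]


avg = average_similarity(["dog", "cat", "chair"], toy_similarity)  # (0.88 + 0.16 + 0.20) / 3
```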
34. CLASSIFICATION!
WE WERE RUNNING OUT OF TIME…
WE WANTED TO COMPARE:
•NAIVE BAYES
•RANDOM FOREST (DECISION TREES)
•SVM
WE COMPARED:
•NAIVE BAYES!
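To make the Naive Bayes idea concrete, here is a small plain-Python sketch. It is not the authors' code: they list scikit-learn as their classifier library, and the slides do not say which Naive Bayes variant was used, so the Gaussian likelihood here is an assumption. The class name and the toy rows (length, references, average similarity) are made up.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class TinyGaussianNB:
    """Gaussian Naive Bayes: each feature is modelled independently per class."""

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> "TinyGaussianNB":
        by_class: Dict[str, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        self.priors: Dict[str, float] = {}
        self.stats: Dict[str, List[Tuple[float, float]]] = {}
        for label, members in by_class.items():
            self.priors[label] = len(members) / len(rows)
            stats = []
            for column in zip(*members):
                mean = sum(column) / len(column)
                # Small constant avoids a zero variance when a feature is constant.
                var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
                stats.append((mean, var))
            self.stats[label] = stats
        return self

    def predict(self, row: Sequence[float]) -> str:
        best_label, best_score = "", -math.inf
        for label, prior in self.priors.items():
            # Log prior plus the sum of per-feature Gaussian log likelihoods.
            score = math.log(prior)
            for x, (mean, var) in zip(row, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training rows: [length, bills referenced, average similarity].
ROWS = [[300, 40, 0.20], [280, 35, 0.25],
        [10, 1, 0.80], [15, 2, 0.70], [12, 1, 0.90], [20, 3, 0.75]]
LABELS = ["omnibus", "omnibus",
          "not omnibus", "not omnibus", "not omnibus", "not omnibus"]
model = TinyGaussianNB().fit(ROWS, LABELS)
```

With toy data this cleanly separated, `model.predict([290, 38, 0.22])` lands on the omnibus class; real bills would be far less tidy.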
37. CLASS IMBALANCE!
• 9 OMNIBUS BILLS IN 120 BILLS
• 7.5% CHANCE A BILL IS AN OMNIBUS BILL
• A CLASSIFIER COULD HAVE 92.5% ACCURACY BY
PICKING ‘NOT OMNIBUS’ EVERY TIME!
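The arithmetic behind these numbers, as a short sketch. The function name is hypothetical. It also reports recall, which shows why accuracy alone is misleading here: the always-"not omnibus" classifier never finds a single omnibus bill.

```python
from typing import Tuple


def always_negative_baseline(n_total: int, n_positive: int) -> Tuple[float, float, float]:
    """Return (positive rate, accuracy, recall) for a classifier that always says 'not omnibus'."""
    positive_rate = n_positive / n_total
    # Every negative is classified correctly and every positive is missed.
    accuracy = (n_total - n_positive) / n_total
    # True positives are always zero, so recall is zero whenever positives exist.
    recall = 0 / n_positive if n_positive else float("nan")
    return positive_rate, accuracy, recall


rate, accuracy, recall = always_negative_baseline(n_total=120, n_positive=9)
# rate is about 0.075, accuracy about 0.925, recall is 0.0
```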
46. CONCLUSIONS
WE NEED EITHER:
(1) SUBSTANTIALLY MORE DATA; OR
(2) BETTER ACCURACY ON TOPIC EXTRACTION AND
NAMED ENTITY RECOGNITION
LOTS OF ROOM FOR IMPROVEMENT
WE STILL THINK THREE FEATURES ARE ENOUGH
NEED TO DO MORE WORK CLEANING/VALIDATING OUR
INPUT DATA
47. CONCLUSIONS CONT…
WE ARE PERFORMING BETTER THAN
RANDOM GUESSING!
WE WOULD LOVE HELP IMPROVING
OUR APPROACH
48. WAYS TO IMPROVE
USE A MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY
LINKED TOPIC MODELLING
IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS
EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS
USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)
TRY TF-IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A
DOCUMENT
EXPERIMENT WITH OTHER CLASSIFIERS
EXPERIMENT WITH MORE FEATURES
…
50. Machine learning is no cakewalk.
Can we form a group to help Ottawa companies achieve
greater success with ML?
What would this group do?
Who would be in it?
How would it be funded?
Do we have the local talent?
What about protecting IP?
Who would make the decisions?
Why bother?
We want your feedback!
If you'd like to participate in ongoing discussions, please leave
us your contact info.