Language Sleuthing HOWTO
                 or
   Discovering Interesting Things
           with Python's
     Natural Language Tool Kit


                           Brianna Laugher
                                modernthings.org
                         brianna[@.]laugher.id.au
Corpus linguistics on web
          texts




          why?
Because the web is full of
       language data

 Because linguistic techniques
can reveal unexpected insights

Because I don't want to have to
       read everything
Like... mailing lists
luv-main as a corpus



√ Big collection of text
x Messy data
x Not annotated
what's interesting?

   conversations

      topics

 change over time

     (authors)
Step 1:




get the data
wget vs Python script


√ wget is purpose-built

√ convenient options like
   --convert-links
Meaningful URLs FTW


              Sympa/MhonArc:


lists.luv.asn.au/wws/arc/luv-main/
                                 2009-04/
                                         msg00057.html
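
Because the archive URLs follow a regular pattern, the date and message number can be read straight off the path. A minimal sketch, assuming every message URL looks like the example above (list name, then YYYY-MM, then msgNNNNN.html); the function name and the regular expression are illustrative, not part of the talk:

```python
import re

# Assumed layout, inferred from the single example URL on the slide:
#   .../<list-name>/<YYYY-MM>/msg<NNNNN>.html
ARCHIVE_PATH = re.compile(
    r"(?P<list>[^/]+)/(?P<year>\d{4})-(?P<month>\d{2})/msg(?P<number>\d+)\.html$"
)


def parse_archive_url(url):
    """Return list name, year, month and message number, or None if no match."""
    match = ARCHIVE_PATH.search(url)
    if match is None:
        return None
    return {
        "list": match.group("list"),
        "year": int(match.group("year")),
        "month": int(match.group("month")),
        "number": int(match.group("number")),
    }


info = parse_archive_url(
    "lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html"
)
print(info)
```

Metadata recovered this way can later be attached to each file as corpus categories.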
Step 2:




clean the data
Cleaning for what?

Remove archive boilerplate

      Remove HTML

   Remove quoted text?

   Remove signatures?
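
For plain-text message bodies, quoted text and signatures can often be stripped with two common email conventions: quoted lines start with ">" and a signature follows a line containing only "-- ". A rough sketch under those assumptions (the function name and sample message are made up, and real list traffic will break these rules regularly):

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and everything from the signature delimiter on."""
    kept = []
    for line in body.splitlines():
        # Conventional delimiter is "-- "; also accept it with the space trimmed.
        if line.rstrip() == "--":
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


message = (
    "Thanks, that worked.\n"
    "\n"
    "> Did you try rebooting?\n"
    "> It might help.\n"
    "\n"
    "Cheers\n"
    "-- \n"
    "Sam\n"
    "sam@example.org"
)
cleaned = strip_quotes_and_signature(message)
print(cleaned)
```

Attribution lines such as "On ... wrote:" are left in place here; whether to drop them depends on what you want to count.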
J.W.
J.W.




       W.E.
Behind the scenes
        J.W.




 W.E.
what are we aiming for?




what do NLTK corpora look like?
Getting NLTK


sudo apt-get install python-nltk
         in Ubuntu 10.04
                 or
sudo apt-get install python-pip
         pip install nltk
                 or
  from source at nltk.org/download
Getting NLTK data...




    an “NLTKism”
NLTK corpora types
Brown corpus
A CategorizedTagged corpus:

   Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in
clearing/vbg up/in any/dti possible/jj
misconception/nn in/in your/pp$ minds/nns ,/,
wherever/wrb you/ppss are/ber ./.
The/at collective/nn by/in which/wdt I/ppss
address/vb you/ppo in/in the/at title/nn above/rb
is/bez neither/cc patronizing/vbg nor/cc jocose/jj
but/cc an/at exact/jj industrial/jj term/nn in/in
use/nn among/in professional/jj thieves/nns ./.
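
The word/tag format above is easy to take apart by hand, which helps when checking what a tagged corpus reader will hand back. A small sketch in plain Python (NLTK's corpus readers do this for you; the function name here is only illustrative):

```python
def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so tokens like ':/:' still work.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the word and leave the tag empty.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


sample = "Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb"
tagged = parse_tagged(sample)
print(tagged)
```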
Inaugural corpus
A Plaintext corpus:

My fellow citizens:

I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful
of the sacrifices borne by our ancestors. I thank
President Bush for his service to our nation, as
well as the generosity and cooperation he has
shown throughout this transition.

Forty-four Americans have now taken the
presidential oath. ...............
But we still have lots of HTML...
BeautifulSoup to the rescue



>>>   from BeautifulSoup import BeautifulSoup as BS
>>>   data = open(filename,'r').read()
>>>   soup = BS(data)
>>>   print '\n'.join(soup.findAll(text=True))
notice the blockquote!
What about blockquotes?

>>> bqs = soup.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print '\n'.join(soup.findAll(text=True))

On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it
up.  Then from the desktop after boot, right click and create the
bootable USB key yourself.  I havent actually done this myself (only
seen the option from the menu), but I am assuming it will be a fairly painless
process if you are happy with the stock image.  Would be interested in
how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
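
The same idea (keep the text, drop anything inside a blockquote) can be sketched without BeautifulSoup, using only Python 3's standard-library html.parser. This is an alternative technique, not the one used in the talk, and the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser


class TextWithoutQuotes(HTMLParser):
    """Collect text nodes, skipping anything nested inside <blockquote>."""

    def __init__(self):
        super().__init__()
        self.quote_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.quote_depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.quote_depth > 0:
            self.quote_depth -= 1

    def handle_data(self, data):
        # Only keep non-blank text that is outside every blockquote.
        if self.quote_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextWithoutQuotes()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><p>On 05/08/2007, someone wrote:</p>"
    "<blockquote><p>Can DSL boot from USB?</p></blockquote>"
    "<p>Just burn the boot disk to CD.</p></body></html>"
)
text_only = visible_text(page)
print(text_only)
```

Tracking a depth counter rather than a flag means nested blockquotes (replies to replies) are skipped as well.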
Step 3:




analyse the data
Getting it into NLTK



import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path,
                                     r'.*\.html')
What about our metadata?
Create a Python dictionary that maps filenames to
categories
e.g.
categories={}
categories['2008-12/msg00226.html'] = [
                    'year-2008',
                    'month-12',
                    'author-BM<bm@xxxxx>'
                    ]
....etc
then...
import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(
                    path, r'.*\.html', cat_map=categories)
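
The category map does not have to be typed by hand: the year and month are already in each fileid. A sketch of building it, assuming fileids of the form 'YYYY-MM/msgNNNNN.html' and an author label per file obtained elsewhere (the helper name, the second fileid and the "XY" label are made up):

```python
def categories_for(fileid, author):
    """Derive year/month/author labels from a 'YYYY-MM/msgNNNNN.html' fileid."""
    year, month = fileid.split("/")[0].split("-")
    return ["year-" + year, "month-" + month, "author-" + author]


# Hypothetical metadata: fileid -> author label, e.g. scraped from each message.
authors = {
    "2008-12/msg00226.html": "BM",
    "2009-04/msg00001.html": "XY",
}

categories = {
    fileid: categories_for(fileid, author)
    for fileid, author in authors.items()
}
print(categories["2008-12/msg00226.html"])
```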
Simple categories


cats = corpus.categories()
authorcats=[c for c in cats if c.startswith('author')]
#>>> len(authorcats)
#608
yearcats=[c for c in cats if c.startswith('year')]
monthcats=[c for c in cats if c.startswith('month')]
...who are the top posters?
posts = [(len(corpus.fileids(author)), author) for author in
authorcats]
posts.sort(reverse=True)

for count, author in posts[:10]:
   print "%5d\t%s" % (count, author)

→

 1304    author-JW
 1294    author-RC
 1243    author-CS
 1030    author-JH
  868    author-DP
  752    author-TWB
  608    author-CS#2
  556    author-TL
  452    author-BM
  412    author-RM
(email me if you're curious to know if you're on it...)
Frequency distributions
popular =['ubuntu','debian','fedora','arch']
niche = ['gentoo','suse','centos','redhat']

def getcfd(distros,limit):
  cfd = nltk.ConditionalFreqDist(
     (distro, fileid[:limit])
     for fileid in corpus.fileids()
     for w in corpus.words(fileid)
     for distro in distros
     if w.lower().startswith(distro))
  return cfd

popularcfd = getcfd(popular,4) # or 7 for months
popularcfd.plot()
nichecfd = getcfd(niche,4)
nichecfd.plot()
                       another “NLTKism”
'Popular' distros by month
'Popular' distros by year
'Niche' distros by year
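
What the conditional frequency distribution is counting can be shown in plain Python 3: one counter per distro, keyed by the first few characters of the fileid (4 for the year, 7 for year and month). A sketch with made-up documents standing in for the corpus; the function name is illustrative:

```python
from collections import Counter, defaultdict


def count_mentions(docs, distros, limit):
    """docs maps fileid -> list of tokens; returns {distro: Counter({period: n})}."""
    counts = defaultdict(Counter)
    for fileid, words in docs.items():
        period = fileid[:limit]  # '2008' for limit=4, '2008-12' for limit=7
        for w in words:
            for distro in distros:
                if w.lower().startswith(distro):
                    counts[distro][period] += 1
    return counts


# Invented sample data, not from the luv-main archive.
docs = {
    "2008-12/msg00001.html": ["Ubuntu", "and", "Debian", "packages"],
    "2009-04/msg00002.html": ["ubuntu", "upgrade", "on", "Ubuntu-server"],
}
counts = count_mentions(docs, ["ubuntu", "debian"], 4)
print(dict(counts["ubuntu"]))
```

Note that prefix matching is generous: a test like startswith('arch') also counts words such as "archive", so some curves may be inflated.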
Random text generation
import random

def generate_model(cfdist, word, num=15):
    for i in range(num):
       print word,
       # pick any word seen after the current one, ignoring how often
       words = list(cfdist[word])
       word = random.choice(words)

text = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'hi', num=20)
hi...
hi allan : ages since apparently yum erased . attempts
now venturing into config run ip 10 431 ms 57

hi serg it illegal address entries must *, t close relative info
many families continue fi into modem and reinstalled

hi wen and amended :) imageshack does for grade service
please blame . warning issued an overall environment
consists in

hi folks i accidentally due cause excitingly stupid idiots ,
deletion flag on adding option ? branded ) mounting them

hi guys do composite required </ emulator in for
unattended has info to catalyse a dbus will see atz init3
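
The generator is a bigram random walk: start from a word, then repeatedly jump to some word that has been seen directly after it. A self-contained Python 3 sketch of that idea without NLTK, using a toy word list; the function names are illustrative, and unlike the slide code this version stops if a word has no recorded follower:

```python
import random
from collections import defaultdict


def build_followers(words):
    """Map each word to the distinct words seen immediately after it."""
    followers = defaultdict(list)
    for first, second in zip(words, words[1:]):
        if second not in followers[first]:
            followers[first].append(second)
    return followers


def generate(followers, word, num=15, rng=random):
    """Random walk over the bigram table, starting from `word`."""
    out = []
    for _ in range(num):
        out.append(word)
        options = followers.get(word)
        if not options:
            break  # dead end: nothing ever followed this word
        word = rng.choice(options)
    return " ".join(out)


words = "hi all , hi folks , i have a problem".split()
followers = build_followers(words)
sentence = generate(followers, "hi", num=8, rng=random.Random(0))
print(sentence)
```

Because each step only looks one word back, the output is locally plausible and globally nonsense, which is exactly the effect in the samples above.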
hi from Peter...
text = [w.lower() for w in corpus.words(categories=
          [c for c in authorcats if 'PeterL' in c])]


hi everyone , hence the database schema and that run on memberdb on mail
store is 12 . yep ,

hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
of failure .

hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
g4 ibook here

hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
host basis

hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
! now ). txt

hi cameron , attribution for 30 seconds , and runs out on linux to on www .
luv , these
interesting collocations
                              ...or not
text = [w.lower() for w in corpus.words() if w.isalpha()]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rscem
dmmrbc dmost
dmost dmcrs
...
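
The "...or not" has a simple explanation: pointwise mutual information rewards pairs whose words almost never appear apart, so rare jargon beats common phrases. A sketch of the textbook formula, log2(count(x,y) * N / (count(x) * count(y))), in plain Python 3; this is an assumption about the general technique and NLTK's own bookkeeping may differ in detail:

```python
import math
from collections import Counter


def pmi_scores(words, min_freq=1):
    """Score each adjacent word pair by pointwise mutual information."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (w1, w2), n_pair in bigrams.items():
        if n_pair < min_freq:
            continue  # same role as a frequency filter on the finder
        scores[(w1, w2)] = math.log2(
            n_pair * total / (unigrams[w1] * unigrams[w2])
        )
    return scores


words = "the cat sat on the mat the cat ran".split()
scores = pmi_scores(words)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 2))
```

In this toy text the one-off pair ("sat", "on") outscores the repeated ("the", "cat"), which is why raising the frequency filter is the usual first remedy.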
oblig tag cloud


stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words()
                                if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]
wordmin = 1000 #YMMV
taglist = word_fd.items()
ranges = getRanges(wordmin, wordmax)
writeCloud(taglist, ranges, 'tags.html')
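
The helpers getRanges and writeCloud are not shown on the slides. As a purely hypothetical stand-in (different names, invented counts, Python 3), a tag cloud can be produced by mapping each count linearly onto a font size and writing one span per word:

```python
from html import escape


def font_size(count, low, high, smallest=10, largest=40):
    """Linearly map a count in [low, high] to a font size in points."""
    if high <= low:
        return largest
    share = (min(max(count, low), high) - low) / (high - low)
    return round(smallest + share * (largest - smallest))


def cloud_html(counts, low):
    """Return one <span> per word whose count is at least `low`."""
    high = max(counts.values())
    spans = [
        '<span style="font-size:%dpt">%s</span>'
        % (font_size(n, low, high), escape(word))
        for word, n in sorted(counts.items())
        if n >= low
    ]
    return "\n".join(spans)


# Invented counts for illustration only.
counts = {"linux": 5000, "kernel": 1000, "mutt": 40}
cloud = cloud_html(counts, 1000)
print(cloud)
```

The lower cut-off plays the part of wordmin above: it keeps the long tail of rare words out of the cloud.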
another one for Peter :)
cats =  [c for c in corpus.categories()
               if 'PeterL' in c]
words=[w.lower() for w in corpus.words(categories=cats)
                         if w.isalpha()]
wordmin = 10
  →
thanks!
for more corpus fun:
http://www.nltk.org/
                             The Book:
       'Natural Language Processing
                         with Python',
                 2nd ed. pub. Jan 2010



      These slides are © Brianna Laugher and are released under
           the Creative Commons Attribution ShareAlike license,
                    v3.0 unported. The data set is not free, sadly...

What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Language Sleuthing HOWTO with NLTK

  • 1. Language Sleuthing HOWTO or Discovering Interesting Things with Python's Natural Language Tool Kit Brianna Laugher modernthings.org brianna[@.]laugher.id.au
  • 2. Corpus linguistics on web texts: why?
  • 3. Because the web is full of language data. Because linguistic techniques can reveal unexpected insights. Because I don't want to have to read everything.
  • 5. luv-main as a corpus: √ Big collection of text. x Messy data. x Not annotated.
  • 6. what's interesting? conversations, topics, change over time, (authors)
  • 8. wget vs Python script: √ wget is purpose-built. √ convenient options like --convert-links
  • 9. Meaningful URLs FTW. Sympa/MhonArc: lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html
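Because the archive URLs are this regular, a list of pages to fetch can be generated rather than crawled. A rough Python 3 sketch follows; the helper names, the http:// scheme and the zero-padding widths are assumptions read off the single example URL above, not something the slides specify.

```python
# Hypothetical helpers: BASE and the padding widths are inferred from the one
# example URL on the slide and may not hold for every Sympa/MhonArc archive.
BASE = "http://lists.luv.asn.au/wws/arc/luv-main"


def message_url(year, month, msg_number, base=BASE):
    """Build the URL of one archived message, e.g. .../2009-04/msg00057.html."""
    return "%s/%04d-%02d/msg%05d.html" % (base, year, month, msg_number)


def month_urls(year, month, count, base=BASE):
    """URLs for messages 0..count-1 of one month; the caller supplies count."""
    return [message_url(year, month, n, base) for n in range(count)]


print(message_url(2009, 4, 57))
```

A list like this could be written to a file and handed to wget, which keeps the purpose-built downloader in charge of the actual fetching.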
  • 12. Cleaning for what? Remove archive boilerplate. Remove HTML. Remove quoted text? Remove signatures?
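For the last two questions, here is one possible approach once a message is already plain text. It is only a sketch: it assumes replies are quoted with leading ">" and that signatures follow the conventional "-- " delimiter line, which real mail does not always respect. The function name is made up for illustration.

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and everything after a '-- ' signature line."""
    kept = []
    for line in body.splitlines():
        # Conventional signature delimiter; some clients drop the trailing space.
        if line.rstrip() == "--":
            break
        # Quoted reply text, at any quoting depth.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


sample = (
    "Thanks for that.\n"
    "> old quoted line\n"
    ">> older still\n"
    "I'll try it tonight.\n"
    "-- \n"
    "Joe Bloggs\n"
)
print(strip_quotes_and_signature(sample))
```

Attribution lines such as "On ..., ... wrote:" are left alone here; whether to remove them depends on what you want to count.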
  • 13. J.W. J.W. W.E.
  • 14. Behind the scenes J.W. W.E.
  • 15. what are we aiming for? what do NLTK corpora look like?
  • 16. Getting NLTK: sudo apt-get install python-nltk (in Ubuntu 10.04); or sudo apt-get install python-pip, then pip install nltk; or from source at nltk.org/download
  • 17. Getting NLTK data... an “NLTKism”
  • 20. Brown corpus A CategorizedTagged corpus: Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in clearing/vbg up/in any/dti possible/jj misconception/nn in/in your/pp$ minds/nns ,/, wherever/wrb you/ppss are/ber ./. The/at collective/nn by/in which/wdt I/ppss address/vb you/ppo in/in the/at title/nn above/rb is/bez neither/cc patronizing/vbg nor/cc jocose/jj but/cc an/at exact/jj industrial/jj term/nn in/in use/nn among/in professional/jj thieves/nns ./.
  • 21. Inaugural corpus A Plaintext corpus: My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition. Forty-four Americans have now taken the presidential oath. ...............
  • 22. But we still have lots of HTML...
  • 24. BeautifulSoup to the rescue
      >>> from BeautifulSoup import BeautifulSoup as BS
      >>> data = open(filename,'r').read()
      >>> soup = BS(data)
      >>> print '\n'.join(soup.findAll(text=True))
  • 27. What about blockquotes?
      >>> bqs = soup.findAll('blockquote')
      >>> [bq.extract() for bq in bqs]
      >>> print '\n'.join(soup.findAll(text=True))
      On 05/08/2007, at 12:05 PM, [...] wrote:
      If u want it USB bootable, just burn the DSL boot disk to CD and fire it up.&#xA0; Then from the desktop after boot, right click and create the bootable USB key yourself.&#xA0; I havent actually done this myself (only seen the option from the menu), but I am assuming it will be a fairly painless process if you are happy with the stock image.&#xA0; Would be interested in how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
      Regards, [...]
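The same two steps (keep the text, drop the blockquotes) can also be done with nothing but the Python 3 standard library. This is a swapped-in technique, not the BeautifulSoup code from the slides, and the class and function names are invented for the example. It assumes blockquote tags are properly closed, which messy archive HTML may not guarantee.

```python
from html.parser import HTMLParser


class TextWithoutQuotes(HTMLParser):
    """Collect text nodes, skipping anything nested inside <blockquote>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &#xA0; etc. into characters
        self.depth = 0      # how many blockquotes we are currently inside
        self.chunks = []    # text kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup):
    """Return the non-quoted text of an HTML page, one text node per line."""
    parser = TextWithoutQuotes()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><p>Top reply.</p>"
    "<blockquote><p>Quoted &amp; old</p></blockquote>"
    "<p>Regards,&#xA0;J</p></body></html>"
)
print(visible_text(page))
```

Unlike the output shown on the slide, character references such as &#xA0; come out here as real characters, because convert_charrefs is switched on.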
  • 29. Getting it into NLTK
      import nltk
      path = 'path/to/files'
      corpus = nltk.corpus.PlaintextCorpusReader(path, r'.*\.html')
  • 30. What about our metadata? Create a Python dictionary that maps filenames to categories, e.g.
      categories = {}
      categories['2008-12/msg00226.html'] = ['year-2008', 'month-12', 'author-BM<bm@xxxxx>']
      # ....etc
    then...
      import nltk
      path = 'path/to/files/'
      corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path, r'.*\.html', cat_map=categories)
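The year and month categories can be read straight off the file path, so only the author needs to come from somewhere else (for instance, from the message headers). A small Python 3 sketch of building that dictionary; the function names and the authors_by_fileid input are hypothetical, and the filename pattern is assumed from the example above.

```python
import re

# Assumed layout: 'YYYY-MM/msgNNNNN.html', as in the example filename.
FILEID_RE = re.compile(r"^(\d{4})-(\d{2})/msg\d+\.html$")


def categories_for(fileid, author):
    """Return the category list for one file, e.g. ['year-2008', 'month-12', ...]."""
    match = FILEID_RE.match(fileid)
    if match is None:
        raise ValueError("unexpected fileid: %r" % fileid)
    year, month = match.groups()
    return ["year-" + year, "month-" + month, "author-" + author]


def build_cat_map(authors_by_fileid):
    """Map every fileid to its categories, given a {fileid: author} dictionary."""
    return {
        fileid: categories_for(fileid, author)
        for fileid, author in authors_by_fileid.items()
    }


categories = build_cat_map({"2008-12/msg00226.html": "BM<bm@xxxxx>"})
print(categories)
```

The resulting dictionary has the same shape as the hand-written one, so it can be passed as cat_map unchanged.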
  • 31. Simple categories
      cats = corpus.categories()
      authorcats = [c for c in cats if c.startswith('author')]
      #>>> len(authorcats)
      #608
      yearcats = [c for c in cats if c.startswith('year')]
      monthcats = [c for c in cats if c.startswith('month')]
  • 32. ...who are the top posters?
      posts = [(len(corpus.fileids(author)), author) for author in authorcats]
      posts.sort(reverse=True)
      for count, author in posts[:10]:
          print "%5d\t%s" % (count, author)
    →
       1304 author-JW
       1294 author-RC
       1243 author-CS
       1030 author-JH
        868 author-DP
        752 author-TWB
        608 author-CS#2
        556 author-TL
        452 author-BM
        412 author-RM
    (email me if you're curious to know if you're on it...)
  • 33. Frequency distributions
      popular = ['ubuntu','debian','fedora','arch']
      niche = ['gentoo','suse','centos','redhat']
      def getcfd(distros, limit):
          cfd = nltk.ConditionalFreqDist(
              (distro, fileid[:limit])
              for fileid in corpus.fileids()
              for w in corpus.words(fileid)
              for distro in distros
              if w.lower().startswith(distro))
          return cfd
      popularcfd = getcfd(popular, 4) # or 7 for months
      popularcfd.plot()
      nichecfd = getcfd(niche, 4)
      nichecfd.plot()
    another “NLTKism”
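To see what the conditional frequency distribution is counting, here is roughly the same tally in plain Python 3 with no NLTK involved. The docs dictionary is toy data standing in for the corpus, and distro_counts is a made-up name; treat it as an illustration of the idea rather than a replacement.

```python
from collections import Counter, defaultdict


def distro_counts(docs, distros, limit):
    """Count distro mentions per period, where the period is fileid[:limit]."""
    counts = defaultdict(Counter)  # distro -> Counter of period -> mentions
    for fileid, words in docs.items():
        period = fileid[:limit]  # 4 chars gives the year, 7 gives year-month
        for w in words:
            for distro in distros:
                if w.lower().startswith(distro):
                    counts[distro][period] += 1
    return counts


# Toy stand-in for corpus.fileids() / corpus.words(fileid)
docs = {
    "2008-12/msg00001.html": ["Ubuntu", "and", "Debian", "packages"],
    "2009-04/msg00002.html": ["ubuntu", "Archive", "ubuntu-server"],
}
counts = distro_counts(docs, ["ubuntu", "debian", "arch"], 4)
for distro in sorted(counts):
    print(distro, dict(counts[distro]))
```

Note that prefix matching counts "Archive" as a mention of arch, so short distro names are likely to pick up some noise on a mailing list.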
  • 37. Random text generation
      import random
      text = [w.lower() for w in corpus.words()]
      bigrams = nltk.bigrams(text)
      cfd = nltk.ConditionalFreqDist(bigrams)
      def generate_model(cfdist, word, num=15):
          for i in range(num):
              print word,
              words = list(cfdist[word])
              word = random.choice(words)
      generate_model(cfd, 'hi', num=20)
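The generator is just a walk over bigrams: print a word, pick one of the words seen after it, repeat. A self-contained Python 3 sketch of that walk follows, using a toy word list instead of the corpus; build_successors and generate are illustrative names, not NLTK functions.

```python
import random
from collections import defaultdict


def build_successors(words):
    """Map each word to the list of words that followed it (with repeats)."""
    successors = defaultdict(list)
    for first, second in zip(words, words[1:]):
        successors[first].append(second)
    return successors


def generate(successors, word, num=15, rng=random):
    """Random walk of up to num words, stopping early at a dead end."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = successors.get(word)
        if not followers:  # word never had a successor in the data
            break
        word = rng.choice(followers)
    return out


words = "hi all , hi folks , hi all .".split()
successors = build_successors(words)
print(" ".join(generate(successors, "hi", num=8, rng=random.Random(0))))
```

Two differences from the slide version: keeping repeats makes common continuations more likely (the slide picks uniformly among distinct followers), and the walk stops instead of raising an error when a word has no followers.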
  • 38. hi...
      hi allan : ages since apparently yum erased . attempts now venturing into config run ip 10 431 ms 57
      hi serg it illegal address entries must *, t close relative info many families continue fi into modem and reinstalled
      hi wen and amended :) imageshack does for grade service please blame . warning issued an overall environment consists in
      hi folks i accidentally due cause excitingly stupid idiots , deletion flag on adding option ? branded ) mounting them
      hi guys do composite required </ emulator in for unattended has info to catalyse a dbus will see atz init3
  • 39. hi from Peter...
      text = [w.lower() for w in corpus.words(categories=[c for c in authorcats if 'PeterL' in c])]
    →
      hi everyone , hence the database schema and that run on memberdb on mail store is 12 . yep ,
      hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle of failure .
      hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz g4 ibook here
      hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main host basis
      hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there ! now ). txt
      hi cameron , attribution for 30 seconds , and runs out on linux to on www . luv , these
  • 40. interesting collocations ...or not
      text = [w.lower() for w in corpus.words() if w.isalpha()]
      from nltk.collocations import *
      bigram_measures = nltk.collocations.BigramAssocMeasures()
      finder = BigramCollocationFinder.from_words(text)
      finder.apply_freq_filter(3)
      finder.nbest(bigram_measures.pmi, 10)
    →
      bufnewfile bufread
      busmaster speccycle
      cellx celly
      cheswick bellovin
      cread clocal
      curtail atl
      dmcrs rscem
      dmmrbc dmost
      dmost dmcrs
      ...
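The "...or not" makes more sense once you see what PMI rewards: pairs whose words almost never appear apart, which tends to mean rare jargon rather than interesting phrases. Below is a simplified Python 3 sketch of the scoring on toy data. It uses the textbook formula log2(count(x,y) * N / (count(x) * count(y))) with N as the number of words, and is not claimed to match NLTK's implementation detail for detail.

```python
import math
from collections import Counter


def top_pmi_bigrams(words, min_freq=3, n=10):
    """Rank bigrams seen at least min_freq times by pointwise mutual information."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_freq:
            continue  # same role as apply_freq_filter(3) on the slide
        pmi = math.log2(count * total / (unigrams[w1] * unigrams[w2]))
        scored.append((pmi, (w1, w2)))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:n]]


words = ["new", "york"] * 3 + "the cat the dog the cat the dog the cat".split()
print(top_pmi_bigrams(words))
```

Here "new york" outranks "the cat" even though both occur three times, because "the" also turns up next to other words.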
  • 41. oblig tag cloud
      stopwords = nltk.corpus.stopwords.words('english')
      words = [w.lower() for w in corpus.words() if w.isalpha()]
      words = [w for w in words if w not in stopwords]
      word_fd = nltk.FreqDist(words)
      wordmax = word_fd[word_fd.max()]
      wordmin = 1000 #YMMV
      taglist = word_fd.items()
      ranges = getRanges(wordmin, wordmax)
      writeCloud(taglist, ranges, 'tags.html')
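getRanges and writeCloud are not NLTK functions and their code is not in the slides, so what follows is only a guess at the general shape in Python 3: split the count range into equal-width buckets and give each word a CSS class by bucket. All names, the number of steps and the size0..sizeN class scheme are assumptions.

```python
from html import escape


def get_ranges(min_count, max_count, steps=5):
    """Split [min_count, max_count] into `steps` equal-width (low, high) buckets."""
    width = (max_count - min_count) / steps
    return [(min_count + i * width, min_count + (i + 1) * width) for i in range(steps)]


def bucket_for(count, ranges):
    """Index of the first bucket whose upper bound exceeds count (last one otherwise)."""
    for i, (low, high) in enumerate(ranges):
        if count < high:
            return i
    return len(ranges) - 1


def cloud_html(tag_counts, min_count, steps=5):
    """Render (word, count) pairs with count >= min_count as sized <span> tags."""
    shown = [(w, c) for w, c in tag_counts if c >= min_count]
    if not shown:
        return ""
    max_count = max(c for _, c in shown)
    ranges = get_ranges(min_count, max_count, steps)
    spans = [
        '<span class="size%d">%s</span>' % (bucket_for(c, ranges), escape(w))
        for w, c in sorted(shown)
    ]
    return " ".join(spans)


tags = [("linux", 50), ("ubuntu", 30), ("kernel", 10), ("rare", 2)]
markup = cloud_html(tags, min_count=10, steps=4)
print(markup)
```

The stylesheet then only needs one font size per sizeN class; raising min_count is the quickest way to thin out a crowded cloud.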
  • 43. another one for Peter :)
      cats = [c for c in corpus.categories() if 'PeterL' in c]
      words = [w.lower() for w in corpus.words(categories=cats) if w.isalpha()]
      wordmin = 10
  • 44. thanks! for more corpus fun: http://www.nltk.org/ The Book: 'Natural Language Processing with Python', 2nd ed. pub. Jan 2010 These slides are © Brianna Laugher and are released under the Creative Commons Attribution ShareAlike license, v3.0 unported. The data set is not free, sadly...