Franta Polach - Exploring Patent Data with Python

Exploring patent space with Python
Franta Polach
@FrantaPolach
IPberry.com
PyData 2014

Outline
● Why patents
● Data kung fu
● Topic modelling
● Future

Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn

Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫, Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete

USPTO Data
● XML, SGML, key-value store
● 1975 – present
● eight different formats
● > 70GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– PATENT SEARCH TOOL by Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data

Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is in Python 2 on Github.

Fung Institute SQL database schema

Entity-relationship diagram
Patents with citations, claims, applications and classes

Descriptive statistics

Topic modelling
● Goal: build a topic space of the patent documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: given an invention description, find semantically similar patents

Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition

Text preprocessing
Lemmatization, stemming
i.e. group together different inflected forms of a word so they can be analysed as a single item
import gensim
# needs the `pattern` package; lemmatize() was removed in gensim 4.0
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
Collocations, Named Entity Recognition
i.e. detect a sequence of words that co-occur more often than would be expected by chance
e.g. an entity such as "General Electric" stays a single token
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
Stopwords
i.e. generic words, such as "six", "then", "be", "do", ...
from gensim.parsing.preprocessing import STOPWORDS
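The collocation idea can be sketched without any library: score each adjacent word pair by pointwise mutual information (PMI), which is high when the pair occurs together more often than its individual word frequencies would predict. This is only an illustration. The toy sentences, the `bigram_pmi` helper and the `min_count` threshold are assumptions, not part of the talk; nltk's collocation finders provide this kind of scoring for real corpora.

```python
import math
from collections import Counter

def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus, made up for illustration.
sentences = [
    "general electric filed a patent for a turbine".split(),
    "the turbine patent names general electric as assignee".split(),
    "a general method for a turbine blade".split(),
]
ranked = bigram_pmi(sentences)
best_pair = ranked[0][0]
print(best_pair)
```

A pair that scores well could then be merged into a single token (for example `general_electric`) before building the dictionary.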

Data streaming
Why? The data is too large to fit into RAM
itertools is your friend
import itertools
import gensim
class PatCorpus(object):
    def __init__(self, fname):
        self.fname = fname
    def __iter__(self):
        # stream one patent per line; columns are tab-separated
        with open(self.fname) as f:
            for line in f:
                patent = line.lower().split('\t')
                tokens = gensim.utils.tokenize(patent[5], lower=True)
                title = patent[6]
                yield title, list(tokens)
corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a',
u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile',
u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items',
u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle',
u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general',
u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the',
u'brackets', u'the', u'brackets', u'are', u'flat', …
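The streaming pattern above works because a generator reads only as many lines as the consumer asks for. A minimal sketch of that behaviour follows. The function name `stream_patents` and the two sample rows are made up, `io.StringIO` stands in for the real file, and the column positions 5 and 6 are taken from the slide's code.

```python
import io
import itertools

def stream_patents(lines, abstract_col=5, title_col=6):
    """Lazily yield (title, tokens) pairs, one patent per input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(abstract_col, title_col):
            continue  # skip malformed rows instead of crashing mid-stream
        yield fields[title_col].lower(), fields[abstract_col].lower().split()

# Two made-up rows; io.StringIO stands in for open('in.tsv').
sample = io.StringIO(
    "1\t-\t-\t-\t-\tA wheel mounting bracket\tBracket system\n"
    "2\t-\t-\t-\t-\tA memory cell array\tMemory device\n"
)

# islice pulls exactly one record; the second line has not been read yet.
first = list(itertools.islice(stream_patents(sample), 1))
print(first)
```

Because nothing is materialised up front, memory use stays roughly constant regardless of file size.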

Vectorization
● First we create a dictionary, i.e. index text tokens by integers
# PatCorpus yields (title, tokens) pairs; the dictionary needs only the tokens
id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)
● Create bag-of-words vectors using a streamed corpus and a dictionary
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
def tokenize(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]
text = "A community for developers and users of Python data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
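What the dictionary and the bag-of-words step do can be shown in a few lines of plain Python. This is a conceptual sketch, not gensim's implementation. The helpers `build_vocab` and `doc2bow` are hypothetical names chosen to mirror the slide, and the documents are made up.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign a consecutive integer id to each distinct token, in first-seen order."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow(vocab, tokens):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are dropped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["memory", "cell", "array"], ["signal", "output", "signal"]]
vocab = build_vocab(docs)

# "laser" was never seen while building the vocabulary, so it is ignored.
bow = doc2bow(vocab, ["signal", "memory", "signal", "laser"])
print(bow)
```

The sparse `(id, count)` representation stores only the words that occur in a document, which keeps vectors small even when the vocabulary has tens of thousands of entries.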

Semantic transformations
● A transformation takes a corpus and outputs another corpus
● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.
model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
_ = model.print_topics(-1)
INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell +
0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address +
0.022*logic + 0.017*row
INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines +
0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics +
0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing +
0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each +
0.016*formed + 0.016*arm
INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input
+ 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit +
0.016*amplifier + 0.014*reference

Transforming unseen documents
text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[token_id], count) for token_id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1),
(u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system +
0.008*internet + ...

Evaluation
● Topic modelling is an unsupervised task, so evaluation is tricky
● Need to evaluate the improvement on the intended task
● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare it with the results of a given semantic model
● "Word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder) and let a human detect the intruder
● Method without human intervention: split each document into two parts and check that the topics of the first half are similar to the topics of the second half, while halves of different documents are dissimilar
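The split-document check from the last bullet can be sketched as a single score: mean cosine similarity between the two halves of the same document, minus the mean similarity between halves of different documents. Everything below is an assumption for illustration. `split_half_score` is a hypothetical helper, and `toy_infer` is a keyword-counting stand-in for a trained model's topic inference.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def split_half_score(docs, infer_topics):
    """Mean similarity of same-document halves minus mean for different documents."""
    halves = []
    for tokens in docs:
        mid = len(tokens) // 2
        halves.append((infer_topics(tokens[:mid]), infer_topics(tokens[mid:])))
    same = [cosine(a, b) for a, b in halves]
    cross = [cosine(halves[i][0], halves[j][1])
             for i in range(len(halves)) for j in range(len(halves)) if i != j]
    return sum(same) / len(same) - sum(cross) / len(cross)

# Stand-in for a trained model: each "topic" is just a keyword set (hypothetical).
TOPIC_WORDS = {0: {"memory", "cell", "bit"}, 1: {"signal", "clock", "phase"}}

def toy_infer(tokens):
    counts = {t: sum(tok in words for tok in tokens) for t, words in TOPIC_WORDS.items()}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items() if c} if total else {}

docs = [["memory", "cell", "bit", "memory"], ["signal", "clock", "phase", "signal"]]
score = split_half_score(docs, toy_infer)
print(score)
```

A score near 1 means halves of the same document land on the same topics while different documents do not; a score near 0 means the model cannot tell them apart. With a real model, `infer_topics` would wrap the bag-of-words and model lookup steps.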

The topic space
● A topic is a distribution over a fixed vocabulary of terms
● The idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
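To make "a topic is a distribution over words" concrete, here is a rough sketch of how LDA assumes a document comes about: for every word position, first pick a topic according to the document's topic mixture, then pick a word according to that topic's distribution. The topics and probabilities are made up, `generate_document` is a hypothetical helper, and the Dirichlet priors that LDA places on both distributions are left out for brevity.

```python
import random

def generate_document(topics, topic_weights, length, rng):
    """Sample words the way LDA assumes a document arises:
    for each position pick a topic, then pick a word from that topic."""
    topic_ids = list(topic_weights)
    words = []
    for _ in range(length):
        topic = rng.choices(topic_ids, weights=[topic_weights[t] for t in topic_ids])[0]
        vocab = list(topics[topic])
        word = rng.choices(vocab, weights=[topics[topic][w] for w in vocab])[0]
        words.append(word)
    return words

# Made-up topics: each one is a probability distribution over words.
topics = {
    "memory": {"memory": 0.5, "cell": 0.3, "array": 0.2},
    "signal": {"signal": 0.6, "output": 0.25, "input": 0.15},
}

# A document that is mostly about memory, with a little signal processing.
doc = generate_document(topics, {"memory": 0.8, "signal": 0.2},
                        length=12, rng=random.Random(0))
print(doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic distributions and each document's mixture.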

Exploring topic space
memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51
speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45
signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35
portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35

Topics distribution
Many topics in total, but each document contains just a few of them, so the model is sparse

Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes, so it scales well
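A quick back-of-the-envelope calculation shows why the all-pairs approach runs out of memory. `pdist` returns one float64 per document pair, n(n-1)/2 values, and `squareform` expands that to a full n×n matrix. The corpus size of 100,000 documents below is an assumed figure for illustration, not a number from the talk.

```python
def pairwise_bytes(n_docs, bytes_per_value=8):
    """Memory needed for all pairwise distances, stored as float64."""
    condensed = n_docs * (n_docs - 1) // 2 * bytes_per_value  # what pdist returns
    square = n_docs * n_docs * bytes_per_value                # after squareform
    return condensed, square

# Assumed corpus size, for illustration only.
condensed, square = pairwise_bytes(100_000)
print("condensed: %.1f GB, square: %.1f GB" % (condensed / 1e9, square / 1e9))
```

Memory grows quadratically with the number of documents, so an index that is sharded on disk and queried one document at a time is the practical alternative.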

Semantic distance queries
query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525,
0.80638835174553156)]
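Conceptually, a top-k query compares the query's topic vector with every indexed document and keeps only the k best scores. The sketch below does this with cosine similarity over sparse topic vectors and `heapq.nlargest`, so the full list of scores is never sorted. It illustrates the idea rather than gensim's implementation; the document vectors are made up and `top_k_similar` is a hypothetical helper.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(query, doc_vectors, k=3):
    """Return the k (doc_id, similarity) pairs with the highest similarity."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up topic vectors for four documents.
doc_vectors = [
    (0, {0: 0.9, 1: 0.1}),
    (1, {1: 1.0}),
    (2, {0: 0.5, 1: 0.5}),
    (3, {2: 1.0}),
]
query = {0: 1.0}  # a query that is entirely about topic 0

best = top_k_similar(query, doc_vectors, k=2)
print(best)
```

Because `doc_vectors` is consumed as an iterable, it could equally be a generator that streams vectors from disk.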

Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain based smart contracts
● Artificial patent lawyer
