SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
Exploring patent space
with python
Franta Polach
@FrantaPolach
IPberry.com
PyData 2014
@FrantaPolach 2
@FrantaPolach 3
@FrantaPolach 4
@FrantaPolach 5
@FrantaPolach 6
Outline
● Why patents
● Data kung fu
● Topic modelling
● Future
@FrantaPolach 7
Why patents
@FrantaPolach 8
Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn
@FrantaPolach 9
Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫 ,
Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or
practice that requires patience, energy, and time to
complete
@FrantaPolach 10
USPTO Data
● xml, SGML key-value store
● 1975 – present
● eight different formats
● > 70GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– PATENT SEARCH TOOL by Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data
@FrantaPolach 11
Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is in Python 2 on Github.
@FrantaPolach 12
Fung Institute SQL database schema
@FrantaPolach 13
Entity-relationship diagram
Patents with citations, claims, applications and classes
@FrantaPolach 14
Descriptive statistics
@FrantaPolach 15
Topic modelling
● Goal: build a topic space of the patent
documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: have invention description, find
semantically similar patents
@FrantaPolach 16
Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition
@FrantaPolach 17
Text preprocessing
Lemmatization, stemming
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and
analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
i.e. group together different inflected forms of a word so they can be analysed as a single item
Collocations, Named Entity Recognition
detect a sequence of words that co-occur more often than would be expected by chance
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
e.g. entity such as "General Electric" stays a single token
Stopwords
generic words, such as "six", "then", "be", "do"....
from gensim.parsing.preprocessing import STOPWORDS
@FrantaPolach 18
Data streaming
Why? data is too large to fit into RAM
Itertools are your friend
class PatCorpus(object):
def __init__(self, fname):
self.fname = fname
def __iter__(self):
for line in open(self.fname):
patent=line.lower().split('t')
tokens = gensim.utils.tokenize(patent[5], lower=True)
title = patent[6]
yield title, list(tokens)
corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a',
u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile',
u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items',
u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle',
u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general',
u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the',
u'brackets', u'the', u'brackets', u'are', u'flat', …
@FrantaPolach 19
Vectorization
● First we create a dictionary, i.e. index text tokens by integers
id2word = gensim.corpora.Dictionary(corpus_tokenized)
● Create bag-of-words vectors using a streamed corpus and a
dictionary
text = "A community for developers and users of Python
data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
def tokenize(text):
return [t for t in simple_preprocess(text) if t not in
STOPWORDS]
@FrantaPolach 20
Semantic transformations
● A transformation takes a corpus and outputs another corpus
● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random
Projections (RP), etc.
model = gensim.models.LdaModel(corpus, num_topics=100,
id2word=id2word, passes = 4, alpha=None)
_ = model.print_topics(-1)
INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell +
0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address +
0.022*logic + 0.017*row
INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines +
0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics +
0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing +
0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each +
0.016*formed + 0.016*arm
INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input
+ 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit +
0.016*amplifier + 0.014*reference
@FrantaPolach 21
Transforming unseen documents
text = "A method of configuring the link maximum transmission unit (MTU) in a
user equipment."
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[id], count) for id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1),
(u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system +
0.008*internet + ...
@FrantaPolach 22
Evaluation
● Topic modelling is an unsupervised task ->> evaluation tricky
● Need to evaluate the improvement of the intended task
● Our goal is to retrieve semantically similar documents, thus we tag a
set of similar documents and compare with the results of given
semantic model
● "word intrusion" method: for each trained topic, take its first ten words,
substitute one of them with a randomly chosen word (intruder!) and let
a human detect the intruder
● Method without human intervention: split each document into two parts,
and check that topics of the first half are similar to topics of the second;
halves of different documents are dissimilar
@FrantaPolach 23
The topic space
● a topic is a distribution over a fixed vocabulary
of terms
● the idea behind Latent Dirichlet Allocation is to
statistically model documents as containing
multiple hidden semantic topics
@FrantaPolach 24
memory: 188
cell: 146
plurality: 102
array: 86
bit: 71
address: 51
Exploring topic space
speed: 178
line: 163
performance: 107
characteristic: 79
skin: 63
suspension: 45
signal: 324
output: 142
input: 108
frequency: 62
phase: 49
clock: 35
portion: 310
housing: 109
end: 62
edge: 53
mounting: 43
form: 35
@FrantaPolach 25
Topics distribution
many topics in total, but each document contains just a few of them
->> sparse model
@FrantaPolach 26
Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus,
num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes
->> scales well
@FrantaPolach 27
Semantic distance queries
query = "A method of configuring the link maximum transmission unit (MTU) in a
user equipment."
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525,
0.80638835174553156)]
@FrantaPolach 28
Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain based smart contracts
● Artificial patent lawyer

Mais conteúdo relacionado

Mais procurados

Grape generative fuzzing
Grape generative fuzzingGrape generative fuzzing
Grape generative fuzzing
FFRI, Inc.
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
andrewcollette
 

Mais procurados (18)

Python for Dummies
Python for DummiesPython for Dummies
Python for Dummies
 
Functional concepts in C#
Functional concepts in C#Functional concepts in C#
Functional concepts in C#
 
Grape generative fuzzing
Grape generative fuzzingGrape generative fuzzing
Grape generative fuzzing
 
Compact ordered dict__k_lab_meeting_
Compact ordered dict__k_lab_meeting_Compact ordered dict__k_lab_meeting_
Compact ordered dict__k_lab_meeting_
 
Python for Linux System Administration
Python for Linux System AdministrationPython for Linux System Administration
Python for Linux System Administration
 
Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
 
All'ombra del Leviatano: Filesystem in Userspace
All'ombra del Leviatano: Filesystem in UserspaceAll'ombra del Leviatano: Filesystem in Userspace
All'ombra del Leviatano: Filesystem in Userspace
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
 
The Ring programming language version 1.8 book - Part 116 of 202
The Ring programming language version 1.8 book - Part 116 of 202The Ring programming language version 1.8 book - Part 116 of 202
The Ring programming language version 1.8 book - Part 116 of 202
 
Schizophrenic files
Schizophrenic filesSchizophrenic files
Schizophrenic files
 
Rust vs C++
Rust vs C++Rust vs C++
Rust vs C++
 
Lz77 by ayush
Lz77 by ayushLz77 by ayush
Lz77 by ayush
 
Seminar Hacking & Security Analysis
Seminar Hacking & Security AnalysisSeminar Hacking & Security Analysis
Seminar Hacking & Security Analysis
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI веке
 
Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
C++17 std::filesystem - Overview
C++17 std::filesystem - OverviewC++17 std::filesystem - Overview
C++17 std::filesystem - Overview
 
Slide cipher based encryption
Slide cipher based encryptionSlide cipher based encryption
Slide cipher based encryption
 

Semelhante a Franta Polach - Exploring Patent Data with Python

Php Extensions for Dummies
Php Extensions for DummiesPhp Extensions for Dummies
Php Extensions for Dummies
Elizabeth Smith
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
eComm2008
 

Semelhante a Franta Polach - Exploring Patent Data with Python (20)

Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
 
Python 3.6 Features 20161207
Python 3.6 Features 20161207Python 3.6 Features 20161207
Python 3.6 Features 20161207
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
CrateDB 101: Sensor data
CrateDB 101: Sensor dataCrateDB 101: Sensor data
CrateDB 101: Sensor data
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor Data
 
posix.pdf
posix.pdfposix.pdf
posix.pdf
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with java
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Php Extensions for Dummies
Php Extensions for DummiesPhp Extensions for Dummies
Php Extensions for Dummies
 
EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devices
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threading
 
Exploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernelExploitation of counter overflows in the Linux kernel
Exploitation of counter overflows in the Linux kernel
 
GE3151_PSPP_UNIT_5_Notes
GE3151_PSPP_UNIT_5_NotesGE3151_PSPP_UNIT_5_Notes
GE3151_PSPP_UNIT_5_Notes
 
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
 
Getting started with Clojure
Getting started with ClojureGetting started with Clojure
Getting started with Clojure
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Intro to Python
Intro to PythonIntro to Python
Intro to Python
 

Mais de PyData

Mais de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Último

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Último (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Franta Polach - Exploring Patent Data with Python

  • 1. Exploring patent space with python Franta Polach @FrantaPolach IPberry.com PyData 2014
  • 6. @FrantaPolach 6 Outline ● Why patents ● Data kung fu ● Topic modelling ● Future
  • 8. @FrantaPolach 8 Why patents ● The system is broken ● Messy, slow & costly process ● USPTO data freely available ● Data structured, mostly consistent ● A chance to learn
  • 9. @FrantaPolach 9 Data kung fu Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫 , Pinyin: gōngfu) – a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
  • 10. @FrantaPolach 10 USPTO Data ● xml, SGML key-value store ● 1975 – present ● eight different formats ● > 70GB (compressed) ● patent grants ● patent applications ● How to parse? ● Parsed data available? – Harvard Dataverse Network – Coleman Fung Institute for Engineering Leadership, UC Berkeley – PATENT SEARCH TOOL by Fung Institute – http://funginstitute.berkeley.edu/tools-and-data
  • 11. @FrantaPolach 11 Coleman Fung Institute for Engineering Leadership, UC Berkeley patent data process flow The code is in Python 2 on Github.
  • 12. @FrantaPolach 12 Fung Institute SQL database schema
  • 13. @FrantaPolach 13 Entity-relationship diagram Patents with citations, claims, applications and classes
  • 15. @FrantaPolach 15 Topic modelling ● Goal: build a topic space of the patent documents ● i.e. compute semantic similarity ● Tools: nltk, gensim ● Data: patent abstracts, claims, descriptions ● Usage: have invention description, find semantically similar patents
  • 16. @FrantaPolach 16 Text preprocessing ● Have: parsed data in a relational database ● Want: data ready for semantic analysis ● Do: – lemmatization, stemming – collocations, Named Entity Recognition
  • 17. @FrantaPolach 17 Text preprocessing Lemmatization, stemming print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data")) ['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN'] i.e. group together different inflected forms of a word so they can be analysed as a single item Collocations, Named Entity Recognition detect a sequence of words that co-occur more often than would be expected by chance import nltk from nltk.collocations import TrigramCollocationFinder from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures e.g. entity such as "General Electric" stays a single token Stopwords generic words, such as "six", "then", "be", "do".... from gensim.parsing.preprocessing import STOPWORDS
  • 18. @FrantaPolach 18 Data streaming Why? data is too large to fit into RAM Itertools are your friend class PatCorpus(object): def __init__(self, fname): self.fname = fname def __iter__(self): for line in open(self.fname): patent=line.lower().split('t') tokens = gensim.utils.tokenize(patent[5], lower=True) title = patent[6] yield title, list(tokens) corpus_tokenized = PatCorpus('in.tsv') print(list(itertools.islice(corpus_tokenized, 2))) [('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
  • 19. @FrantaPolach 19 Vectorization ● First we create a dictionary, i.e. index text tokens by integers id2word = gensim.corpora.Dictionary(corpus_tokenized) ● Create bag-of-words vectors using a streamed corpus and a dictionary text = "A community for developers and users of Python data tools." bow = id2word.doc2bow(tokenize(text)) print(bow) [(12832, 1), (28124, 1), (28301, 1), (32835, 1)] def tokenize(text): return [t for t in simple_preprocess(text) if t not in STOPWORDS]
  • 20. @FrantaPolach 20 Semantic transformations ● A transformation takes a corpus and outputs another corpus ● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc. model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes = 4, alpha=None) _ = model.print_topics(-1) INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
  • 21. @FrantaPolach 21 Transforming unseen documents text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment." 1) transform text into the bag-of-words space bow_vector = id2word.doc2bow(tokenize(text)) print([(id2word[id], count) for id, count in bow_vector]) [(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)] 2) transform text into our LDA space vector = model[bow_vector] [(0, 0.024384265946835323), (1, 0.78941547921042373),... 3) find the document's most significant LDA topic model.print_topic(max(vector, key=lambda item: item[1])[0]) 0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
  • 22. @FrantaPolach 22 Evaluation ● Topic modelling is an unsupervised task ->> evaluation tricky ● Need to evaluate the improvement of the intended task ● Our goal is to retrieve semantically similar documents, thus we tag a set of similar documents and compare with the results of given semantic model ● "word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (intruder!) and let a human detect the intruder ● Method without human intervention: split each document into two parts, and check that topics of the first half are similar to topics of the second; halves of different documents are dissimilar
  • 23. @FrantaPolach 23 The topic space ● a topic is a distribution over a fixed vocabulary of terms ● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
  • 24. @FrantaPolach 24 memory: 188 cell: 146 plurality: 102 array: 86 bit: 71 address: 51 Exploring topic space speed: 178 line: 163 performance: 107 characteristic: 79 skin: 63 suspension: 45 signal: 324 output: 142 input: 108 frequency: 62 phase: 49 clock: 35 portion: 310 housing: 109 end: 62 edge: 53 mounting: 43 form: 35
  • 25. @FrantaPolach 25 Topics distribution many topics in total, but each document contains just a few of them ->> sparse model
  • 26. @FrantaPolach 26 Semantic distance in topic space ● Semantic distance queries from scipy.spatial import distance pairwise = distance.squareform(distance.pdist(matrix)) >> MemoryError ● Document indexing from gensim.similarities import Similarity index = Similarity('tmp/index', corpus, num_features=corpus.num_terms) The Similarity class splits the index into several smaller sub-indexes ->> scales well
  • 27. @FrantaPolach 27 Semantic distance queries query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment." 1) vectorize the text into bag-of-words space bow_vector = id2word.doc2bow(tokenize(query)) 2) transform the text into our LDA space query_lda = model[bow_vector] 3) query the LDA index, get the top 3 most similar documents index.num_best = 3 print(index[query_lda]) [(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
  • 28. @FrantaPolach 28 Future ● Graph of USPTO data (Neo4j) ● Elasticsearch search and analytics ● Recommendation engine (for applications) ● Drawings analysis ● Blockchain based smart contracts ● Artificial patent lawyer