All US patents from 1976 onwards are freely available in computer-readable formats providing a large corpus for chemical text mining. Other patent offices are increasingly also offering their back-catalogues as XML, allowing chemical text mining to be performed in the same way as for recent US patents.
We investigate how much chemistry is found in non-US patents (compared to US patents) and, where the chemistry is present in publications from multiple patent offices, how long were the delays between these publications.
We show that non-US patents can be text mined for a large number of chemical reactions and analyse the overlap with reactions from US patents. Finally we use all the extracted chemical reactions to explore whether models for predicting reaction yield may be built from features such as the reaction type and its reaction conditions (as text mined from the patent text).
1. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Chemistry and reactions from non-
US patents
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
2. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Topics
• Coverage of USPTO vs EPO patents
• Extraction of non-chemical-compound data
• Can text-mining provide insights into reaction
informatics
3. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
What am I missing by only using the
USPTO data?
4. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
EPO and USPTO background
• USPTO: 303k grants published in 2013
• EPO: 67k grants published in 2013
• EPO patents must be in one or more of
English, French or German
• USPTO documents are all in English
5. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/USPTO Compounds
(using StdInChI)
3,345,877
2,674,069
445,588
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications
6. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/USPTO Reactions
(using separately sorted StdInChIs for reactants/agents and products)
706,524
317,533
99,681
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications
7. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Effect of different Languages
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Numberofstructuresextracted
English
German
French
Translation performed by Lexichem v2014.Jun
8. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
First Disclosure Occurrence by
Language
Excluding compounds
disclosed by USPTO (1976-
2013) patents
All compounds
9. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Timeliness
• Competitive intelligence requires up to date
information
• Methodology:
– Consider compounds present in both EPO and
USPTO date not disclosed by either prior to 2006
– Compare the earliest publication date for each
compound
10. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Average lag between EPO/USPTO
Compounds first disclosed between 2006-2013
Mean: US 540 days earlier
11. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
10000
20000
30000
40000
50000
60000
70000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Applications Vs Applications
Mean: US 223 days earlier
Compounds first disclosed between 2006-2013
12. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
20000
40000
60000
80000
100000
120000
140000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Grants vs Grants
Compounds first disclosed between 2006-2013
Mean: US 287 days earlier
14. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Epo/US 1976-2013 and chembl19
overlap
187,486
1,155,034
6,278,048
15. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Patents 4-5
years earlier
Patents 3-4
years earlier
Patents 2-3
years earlier
Patents 1-2
years earlier
Patents 0-1
years earlier
Patents 0-1
years later
Patents 1-2
years later
Patents 2-3
years later
Patents 3-5
years later
Patents 4-5
years later
Average lag between
Patents/CHEMbl19
Compounds first disclosed between 2006-2013
Mean: Patents 345 days
earlier
16. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Other data in patents
• Genes/proteins
• Reactions including role of reagents
• Physical quantities
• Spectra
• Diseases
• Organisms
• Companies
• …
17. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Gene/protein identification
• Identify trends in drug target popularity
• Count number of patents (in a year)
mentioning a given gene or its gene products
and map to the HGNC symbol
• Many genes are referenced by short
ambiguous terms hence great care is required
to have good precision
18. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Gene/protein identification
(Hepatocyte growth factor receptor)(Epidermal growth factor receptor)
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.80%
0.90%
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Percentageofprotein/genementionsinthatyear(ChEMBL)
Percentageofprotein/genementionsinthatyear(Patents)
EGFR
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Percentageofprotein/genementionsinthatyear(ChEMBL)
Percentageofprotein/genementionsinthatyear(Patents)
MET
21. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Melting/Boiling Point Extraction
22. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Melting/Boiling Point Extraction
Over 99k so far!
23. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
CHEMICAL reactions
24. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Predicting yield
• Features to consider:
• 2D fingerprints (especially around the reaction
centers)
• Reaction type
• Temperature
• Time
• Change in complexity e.g. chiral centres
25. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Data extraction
• Can pull out yields, quantities, times and
temperatures
• Can sanity check text-mined yield by
calculating from amounts (or even masses)
𝑚𝑎𝑠𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑(𝑔)
𝑚𝑜𝑙𝑎𝑟 𝑚𝑎𝑠𝑠 [𝑐𝑎𝑙𝑐 𝑓𝑟𝑜𝑚 𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒](𝑔𝑚𝑜𝑙−1)
= 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑝𝑟𝑜𝑑𝑢𝑐𝑡
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑙𝑖𝑚𝑖𝑡𝑖𝑛𝑔 𝑟𝑒𝑎𝑐𝑡𝑎𝑛𝑡
= %𝑦𝑖𝑒𝑙𝑑
26. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Outliers inevitable
27. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
sCale vs yield
2001-2013 US applications, Suzuki couplings
28. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Temperatures
30. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Number of steps vs Complexity
ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1)
stereoComplexityScore = log10(nStereoCenters + 1)
macrocyclePenalty = log10(nMacrocycles + 1)
sizePenalty = natoms1.005 − natoms
∑ Ertl & Schuffenhauer 2009
doi:10.1186/1758-2946-1-8
31. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Number of steps vs Complexity
0
5
10
15
20
25
30
35
40
45
50
1 2 3 4 5 6 7 8 9 10 11
Numberofproductheavyatoms
Number of steps
Average number of heavy atoms
32. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Yield vs Change in Complexity
2001-2013 US applications, all reactions
33. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Trends in Reaction Types
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Suzukicouplingsasapercentageofreactionsinayear
Classification performed using NameRXN
34. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Trends In Solvent Use
0.0%
5.0%
10.0%
15.0%
20.0%
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Percentageofreactionsinthatyear
Tetrahydrofuran
Dichloromethane
Water
Dimethylformamide
Methanol
Ethyl acetate
Ethanol
1,4-Dioxane
Toluene
Acetonitrile
Acetic acid
Chloroform
Acetone
Benzene
35. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Are solvents getting greener?
1976 2013
Water (21%) Tetrahydrofuran (15%)
Ethanol (11%) Dichloromethane (14%)
Benzene (8%) Water (13%)
Methanol (7%) Dimethylformamide (10%)
Tetrahydrofuran (5%) Methanol (8%)
Dichloromethane (4%) Ethyl acetate (7%)
Dimethylformamide (4%) Ethanol (5%)
Acetic acid (4%) 1,4-Dioxane (4%)
Chloroform (3%) Toluene (3%)
Acetone (3%) Acetonitrile (3%)
Total for top 10: 71% 82%
36. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Conclusions
• A significant amount of novel chemistry, from EPO patents, comes
from the non-English patents
• Compounds disclosed by both the USPTO and EPO are on average
published earlier by the USPTO but for many compounds an EPO
patent will be the earlier disclosure
• Gene/protein identification can identify clear changes in patenting
behaviour over time
• Text mining provides the tools to answer many reaction
informatics questions
37. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Acknowledgements
• Funding provided by:
38. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com