Babies and Bathwater
Keeping linguistics alongside machine
learning in patent search
David Woolls – CFL Software Limited, UK
Matter
• Therefore, we cannot think that matter is made of points
without extension, because no matter how many of these we
manage to put together, we never obtain something with an
extended dimension.
Carlo Rovelli, Reality Is Not What It Seems (2016, p. 12)
• Quindi non si può pensare che la materia sia fatta di punti
senza estensione, perché, per quanti ne mettessimo
insieme, non otterremmo mai qualcosa con una dimensione
estesa.
• What is the matter with this sentence? Does this matter? As
a matter of fact it does. That’s another matter.
• What does ‘matter’ mean on this page?
Imagined Readers – Text differences
"It was a dark and stormy night, the rain
came down in torrents, there were brigands on
the mountains, and wolves, and the chief of the
brigands said to Antonio, 'I'm bored - tell us a
story!'"
Janet and Allan Ahlberg
From “Paul Clifford”
LSTM and linguistics
• But there are also cases where we need more
context.
• Consider trying to predict the last word in the text “I
grew up in France… I speak fluent French.”
Humans usually provide linguistic assistance in the form of function words (grammar):
I grew up in France so I speak fluent … | Definitely French
I grew up in France and I speak fluent … | Possibly French, but maybe another language
I grew up in France but I speak fluent … | Definitely not French
I grew up in France but I also speak fluent … | Very definitely not French
I grew up in France but I don’t speak fluent … | Definitely French
I grew up in France so I don’t speak fluent … | Definitely not French
Babies, bathwater,
stems, lemmas and function words
Original sentence | Becomes
I think Christoph is brilliant | Think Christoph brilli
I thought Christoph was brilliant | Think Christoph brilli
I thought Christoph was brilliant but now I’m not so sure. | Think Christoph brilli sure
Hearing Christoph’s brilliance I asked him to speak. | Hear Christoph brilli ask speak
I wouldn’t do that if I were you! | !
This is called telegraphic language and is spoken by children between 18
months and three years old during language acquisition. Perhaps not ideal for
computers and comprehension.
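The reduction shown on this slide can be reproduced with a toy sketch. The function-word list, the lemma table and the two suffixes below are hypothetical and hand-picked for these five sentences only; a real pipeline would use a much larger stop list and a proper stemmer or lemmatiser. Output is lower-cased, which the slide does not do.

```python
import re

# Hypothetical, hand-picked resources for illustration only.
FUNCTION_WORDS = {
    "i", "is", "was", "were", "am", "m", "s", "t", "but", "now", "not", "so",
    "to", "him", "if", "you", "do", "that", "would", "wouldn",
}
LEMMAS = {"thought": "think", "hearing": "hear", "asked": "ask"}
SUFFIXES = ("ance", "ant")  # longest first, so "brilliance" loses "ance"


def telegraphic(sentence):
    """Drop function words, lemmatise, then strip a suffix from what is left."""
    kept = []
    for token in re.findall(r"[A-Za-z]+", sentence):
        word = token.lower()
        if word in FUNCTION_WORDS:
            continue  # the grammar is thrown away here
        word = LEMMAS.get(word, word)
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                word = word[: -len(suffix)]
                break
        kept.append(word)
    return kept


print(telegraphic("I thought Christoph was brilliant but now I'm not so sure."))
print(telegraphic("I wouldn't do that if I were you!"))
```

The last sentence consists entirely of function words, so nothing survives the reduction.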
Linguistic LSTM with real sentences.
• It is a truth universally acknowledged, [6]
• that a single man [4]
• in possession of a good fortune, [6]
• must be in want of a wife. [7]
• [23/4] = 6
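The bracketed figures are word counts per chunk, and the last line is their mean rounded to a whole number. A minimal sketch of that arithmetic, assuming words are simply whitespace-separated tokens:

```python
chunks = [
    "It is a truth universally acknowledged,",
    "that a single man",
    "in possession of a good fortune,",
    "must be in want of a wife.",
]

# Count whitespace-separated words in each chunk.
counts = [len(chunk.split()) for chunk in chunks]
total = sum(counts)
mean = total / len(chunks)

print(counts, total, mean, round(mean))
```

The mean is 5.75 words per chunk, which the slide rounds to 6.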
The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for
Processing Information
by George A. Miller
originally published in The Psychological Review, 1956, vol. 63, pp. 81-97
http://www.musanim.com/miller1956/
It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.
LSTM
• However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in
the minds of the surrounding families, that he is considered the
rightful property of some one or other of their daughters.
• However little known the feelings or views [7]
• of such a man may be [6]
• on his first entering a neighbourhood, [6]
• this truth is so well fixed [6]
• in the minds of the surrounding families, [7]
• that he is considered the rightful property [7]
• of some one or other of their daughters. [8]
• [47/7] = 7
Why linguistics?
• Patents are communicative documents, written in many languages.
• Communication is achieved by context which can be close or distant.
• Boolean searching gives results by document; range searching needs to be done
by claim.
• There are distractor numbers in a claim (e.g. Claim numbers, temperatures,
lengths).
• There are potential data quality or format problems introduced by OCR, machine
translation or extraction from a database.
• All these and others need to be taken into account to find only relevant
material.
ICIC 2017 8
Why linguistics for ranges?
• Range information is in the unstructured text
– The location and referent of ranges are signalled by linguistic structures and forms:
• Range then element, element then range, or both: 0,80 < Si < 1,20
• Elements by symbol (Si) or in full (Silicon or silicon)
• Implicit or explicit marking: 1-5 or between 1 and 5
• Symbolic or lexical marking: <2.5 or less than 2.5, ≥ .76 or greater than or equal to 0.76
• Variation in proximity of additional markings: 0.5%, 0.5wt%, 0.5 wt %
– There can be mixtures of these forms in a single claim.
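The forms listed above can be recognised with a handful of ordered patterns. This is a minimal sketch, not the method SpanMatch uses: the names `parse_range` and `parse_percent` and the pattern set are assumptions, and the sketch does not record whether a bound is strict or inclusive.

```python
import re

# Number with either decimal separator: "0,80", "2.5", ".76", "5".
NUM = r"\d*[.,]?\d+"


def to_float(text):
    return float(text.replace(",", "."))


# (pattern, (group holding low bound, group holding high bound)); order matters.
PATTERNS = [
    # range on both sides of the element: 0,80 < Si < 1,20
    (re.compile(rf"({NUM})\s*[<≤]\s*[A-Za-z]+\s*[<≤]\s*({NUM})"), (1, 2)),
    # explicit marking: between 1 and 5
    (re.compile(rf"between\s+({NUM})\s+and\s+({NUM})", re.I), (1, 2)),
    # implicit marking: 1-5
    (re.compile(rf"({NUM})\s*[-–~]\s*({NUM})"), (1, 2)),
    # upper bound only, symbolic or lexical
    (re.compile(rf"(?:≤|<=|<|less than or equal to|less than)\s*({NUM})", re.I), (None, 1)),
    # lower bound only, symbolic or lexical
    (re.compile(rf"(?:≥|>=|>|greater than or equal to|greater than)\s*({NUM})", re.I), (1, None)),
]

PERCENT = re.compile(rf"({NUM})\s*(wt)?\s*%")


def parse_range(text):
    """Return (low, high) for the first pattern that matches; None marks an open bound."""
    for pattern, (low_group, high_group) in PATTERNS:
        match = pattern.search(text)
        if match:
            low = to_float(match.group(low_group)) if low_group else None
            high = to_float(match.group(high_group)) if high_group else None
            return (low, high)
    return None


def parse_percent(text):
    """Return (value, unit) whatever the spacing between number, 'wt' and '%'."""
    match = PERCENT.search(text)
    if not match:
        return None
    return (to_float(match.group(1)), "wt%" if match.group(2) else "%")


for sample in ["0,80 < Si < 1,20", "1-5", "between 1 and 5", "less than 2.5", "≥ .76"]:
    print(sample, "->", parse_range(sample))
```

The two-sided element pattern is tried first because the single-bound pattern would otherwise match only the second half of `0,80 < Si < 1,20`.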
Reading
The program is a linear text reader because we need to:
1. Identify claims
2. Identify pairs of elements and ranges in each claim.
So each line in the file is read word by word just once in the
same sequence as a human reader.
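A single left-to-right pass of this kind can be sketched as a small state machine. Everything here is an assumption made for illustration: the element subset, the range markers, the `read_claim` name and the cleaned English sample are not taken from the program itself.

```python
import re

ELEMENTS = {"C", "Si", "Mn", "Ni", "Cr"}  # illustrative subset of symbols
RANGE_MARKS = {"~", "-", "to"}            # illustrative range indicators
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|[~\-<>]")


def read_claim(text):
    """Read the claim once, token by token, and collect (element, low, high)."""
    pairs = []
    element, low, expecting_high = None, None, False
    for token in TOKEN.findall(text):
        if token in ELEMENTS:
            # A new element starts a fresh pair.
            element, low, expecting_high = token, None, False
        elif token[0].isdigit():
            if element is None:
                continue  # distractor number, e.g. a claim number
            if expecting_high and low is not None:
                pairs.append((element, low, float(token)))
                element, low, expecting_high = None, None, False
            else:
                low = float(token)
        elif token in RANGE_MARKS and low is not None:
            expecting_high = True
    return pairs


claim = ("1. An alloy comprising C: 0.14 ~ 0.30 percent, "
         "Si: 0.15 ~ 0.80 percent, Mn: 20.00 ~ 27.00 percent")
print(read_claim(claim))
```

The leading claim number is skipped because no element has been seen yet, which is one simple way of handling distractor numbers.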
Reading
• Items are identified as numbers, range indicators or elements in sequence.
• As each element/range pair is identified, the relationship with the specification is
calculated.
• Following calculation, the element and the range are colour-coded and the claim is
built for potential display.
• At the conclusion of each claim the total found is compared with the total
specification.
• If the claim meets the overall specification requirement it is added to the list for
display.
• At the conclusion of the reading process, all the results are ranked and
displayed.
• The program can process the full claims of around 300 patents per second.
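The comparison and ranking steps can be sketched as follows. The slides do not say how the relationship with the specification is calculated, so this sketch assumes plain interval overlap, a count of overlapping elements as the score, and a hypothetical `required` threshold standing in for the overall specification requirement.

```python
def overlaps(found, wanted):
    """True when two closed intervals (low, high) share at least one value."""
    return found[0] <= wanted[1] and wanted[0] <= found[1]


def score_claim(pairs, spec):
    """Count element/range pairs in a claim that overlap the specification."""
    return sum(
        1 for element, low, high in pairs
        if element in spec and overlaps((low, high), spec[element])
    )


def rank_claims(claims, spec, required):
    """Keep claims with at least `required` matches, best score first."""
    results = []
    for claim_id, pairs in claims.items():
        score = score_claim(pairs, spec)
        if score >= required:
            results.append((claim_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical search specification and already-extracted claims.
spec = {"C": (0.1, 0.2), "Si": (0.5, 1.0), "Mn": (25.0, 30.0)}
claims = {
    "A1": [("C", 0.14, 0.30), ("Si", 0.15, 0.80), ("Mn", 20.0, 27.0)],
    "B1": [("C", 0.5, 0.9), ("Si", 0.6, 0.7)],
    "C1": [("Ni", 1.0, 2.0)],
}
print(rank_claims(claims, spec, required=1))
```

Raising `required` to 2 leaves only claim A1 in the list.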
Native languages v Machine Translation
Here is the problem from the PatBase collection.
<Claims><![CDATA[<CLA_MT><XXC1> <p> CN 1. A non-magnetic alloy of high strength and toughness,
characterized in that the chemical composition in weight percent of: C:.. 0 14 ~0 30 percent, Si:.. 0 15 ~0 80
percent,.. Mn: 20 00 ~27 00 percent; Ni:.. 0 60 ~2 00 percent; Cr:.. 12 50 ~19 00 percent;
</CLA_MT><CLA_CN><XXC1><p>CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: C
:0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00%
;
You can see that the machine translation (MT) into English is appalling!
You can also see that the program will be able to understand the original claim, because its presentation is clear.
Detailed example (continued)
It is not practicable to write a program that takes account of all the things that might go wrong, without also
introducing potential errors to data that is actually ok. But it is possible for SpanMatch to recognise the original
as correct as you see here.
So, given clean data, or data cleaned up as best we can, we can do this in all the languages. Once you have an
indication of potential interest you can use a good MT program to translate just the claims of interest.
This is Google Translate's version of the claim. You can see that it is struggling, but it is better than the PatBase
one.
CN is a high strength toughness nonmagnetic alloy characterized in that the chemical composition is in a weight
percentage of C: 0.14 to 0. 30 Si: 0.015 to 80 Mn: 20 to 0000. Ni: 0.60 ~ 2.00; Cr: 12. 50 ~ 1900; Mo or W
elements of one or two: 0. 60 ~ 2.50 ;; 0.8 ~ [0. LXMn (% - 0.5); 0 20 to 0.50; Ca, rare earth elements of one or
two: 0. 003 ~ 0.05;: 彡 0.03:: 彡 0.03; Fe: balance.
Use of CN, JP, KR originals - rationale
• Machine translation is often hard to understand and sometimes incomprehensible
• Using native language patents ensures data quality
• Limited inbuilt knowledge required for numerical searching
– Searching for elements requires only that a program has the CJK equivalents
for full element names; international symbols are identical.
– Searching for ranges requires knowledge of potential CJK equivalent codes
for digits
– Searching for range indicators requires language-specific identification of the hyphen, <,
> and range words.
• Accurate identification of the search specification with display of the claims means only
those claims of interest need translation by machine or human
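How little inbuilt knowledge is needed can be shown on the Chinese original from the PatBase example. This is a sketch under stated assumptions: Unicode NFKC normalisation handles full-width digits and punctuation, the wave dash is mapped to `~` by hand, and the step that rejoins `0. 14` could wrongly merge a claim number with a following value in other data. The function names are hypothetical.

```python
import re
import unicodedata

# Wave dash (U+301C) is not changed by NFKC, so map it explicitly.
WAVE_DASH = str.maketrans({"\u301c": "~"})
PAIR = re.compile(r"([A-Z][a-z]?)\s*:\s*(\d+(?:\.\d+)?)\s*~\s*(\d+(?:\.\d+)?)")


def normalise(text):
    """Bring CJK-style digits, punctuation and range marks to ASCII forms."""
    text = unicodedata.normalize("NFKC", text)  # full-width forms -> ASCII
    text = text.translate(WAVE_DASH)
    # Rejoin decimals split by a space after the point: "0. 14" -> "0.14".
    return re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)


def parse_cjk_ranges(text):
    """Return (symbol, low, high) for each 'Symbol : low ~ high' in the text."""
    return [
        (symbol, float(low), float(high))
        for symbol, low, high in PAIR.findall(normalise(text))
    ]


original = ("C :0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; "
            "Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00%")
print(parse_cjk_ranges(original))
```

Element symbols need no translation at all, so only the digits and the range mark require language-specific handling in this example.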
Thank you
Contact: d.woolls@cflsoftware.com
Website: www.cflsoftware.com

Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 

ICIC 2017: Babies and bathwater: Keeping linguistics alongside machine learning in patent search

  • 1. Babies and Bathwater Keeping linguistics alongside machine learning in patent search David Woolls – CFL Software Limited, UK
  • 2. Matter • Therefore, we cannot think that matter is made of points without extension, because no matter how many of these we manage to put together, we never obtain something with an extended dimension. Carlo Rovelli, Reality Is Not What It Seems (2016, p. 12) • Quindi non si può pensare che la materia sia fatta di punti senza estensione, perché, per quanti ne mettessimo insieme, non otterremmo mai qualcosa con una dimensione estesa. • What is the matter with this sentence? Does this matter? As a matter of fact it does. That’s another matter. • What does ‘matter’ mean on this page?
  • 3. Imagined Readers – Text differences "It was a dark and stormy night, the rain came down in torrents, there were brigands on the mountains, and wolves, and the chief of the brigands said to Antonio, 'I'm bored - tell us a story!’” Janet and Allan Ahlberg From “Paul Clifford”
  • 4. LSTM and linguistics • But there are also cases where we need more context. • Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Humans usually provide linguistic assistance in the form of function words (grammar):
      – I grew up in France so I speak fluent … → Definitely French
      – I grew up in France and I speak fluent … → Possibly French but maybe another
      – I grew up in France but I speak fluent … → Definitely not French
      – I grew up in France but I also speak fluent … → Very definitely not French
      – I grew up in France but I don’t speak fluent … → Definitely French
      – I grew up in France so I don’t speak fluent … → Definitely not French
  • 5. Babies, bathwater, stems, lemmas and function words
      – I think Christoph is brilliant → becomes: Think Christoph brilli
      – I thought Christoph was brilliant → becomes: Think Christoph brilli
      – I thought Christoph was brilliant but now I’m not so sure. → becomes: Think Christoph brilli sure
      – Hearing Christoph’s brilliance I asked him to speak. → becomes: Hear Christoph brilli ask speak
      – I wouldn’t do that if I were you! → becomes: !
    This is called telegraphic language and is spoken by children between 18 months and three years old during language acquisition. Perhaps not ideal for computers and comprehension.
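
As a rough illustration of the point made on slides 4 and 5, the sketch below strips a small stop list from two of the "France" sentences. The stop list and the function name `content_words` are assumptions made for this example; they are not taken from the talk or from any particular search system.

```python
# Illustrative stop list (an assumption, not taken from the talk).
STOP_WORDS = {"i", "up", "in", "so", "but", "and", "also", "don't"}

def content_words(sentence):
    """Lower-case the sentence and drop every word on the stop list."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

so_sentence = "I grew up in France so I speak fluent"
but_sentence = "I grew up in France but I don't speak fluent"

print(content_words(so_sentence))   # ['grew', 'france', 'speak', 'fluent']
print(content_words(but_sentence))  # ['grew', 'france', 'speak', 'fluent']
```

Both sentences reduce to the same four content words, although a human reader expects opposite continuations. The function words that carried the difference are exactly what was thrown away.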
  • 6. Linguistic LSTM with real sentences. • It is a truth universally acknowledged, [6] • that a single man [4] • in possession of a good fortune, [6] • must be in want of a wife. [7] • [23/4] = 6 The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information by George A. Miller originally published in The Psychological Review, 1956, vol. 63, pp. 81-97 http://www.musanim.com/miller1956/ It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
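
The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes the sentence has already been split into the four chunks shown on the slide, and it counts words by splitting on whitespace. The name `chunk_lengths` is hypothetical.

```python
def chunk_lengths(chunks):
    """Return the number of whitespace-separated words in each chunk."""
    return [len(chunk.split()) for chunk in chunks]

# The four chunks as given on the slide.
chunks = [
    "It is a truth universally acknowledged,",
    "that a single man",
    "in possession of a good fortune,",
    "must be in want of a wife.",
]

lengths = chunk_lengths(chunks)
total = sum(lengths)
mean = total / len(lengths)

print(lengths)       # [6, 4, 6, 7]
print(total)         # 23
print(round(mean))   # 6
```

The exact mean is 5.75, which the slide rounds to 6, comfortably inside Miller's "seven, plus or minus two".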
  • 7. LSTM • However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. • However little known the feelings or views [7] • of such a man may be [6] • on his first entering a neighbourhood, [6] • this truth is so well fixed [6] • in the minds of the surrounding families, [7] • that he is considered the rightful property [7] • of some one or other of their daughters. [8] • [47/7] ≈ 7
  • 8. Why linguistics? • Patents are communicative documents, written in many languages. • Communication is achieved by context which can be close or distant. • Boolean searching gives results by document; range searching needs to be done by claim. • There are distractor numbers in a claim (e.g. Claim numbers, temperatures, lengths). • There are potential data quality or format problems introduced by OCR, machine translation or extraction from a database. • All these and others need to be taken into account to find only relevant material. ICIC 2017 8
  • 9. Why linguistics for ranges? • Range information is in the unstructured text – The location and referent of ranges is signalled by linguistic structures and forms: • Range then element or Element then range or both 0,80 < Si < 1,20 • Elements by symbol Si or in full Silicon or silicon • Implicit or explicit marking: 1-5 or between 1 and 5 • Symbolic or lexical marking: <2.5 or less than 2.5, ≥ .76 or greater than or equal to 0.76 • Variation in proximity of additional markings 0.5%, 0.5wt%, 0.5 wt % – There can be mixtures of these forms in a single claim.
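
The range forms listed on this slide can be recognised with a handful of ordered patterns. The following is a minimal sketch under stated assumptions, not the program described in the talk: `parse_range`, `RULES` and `NUM` are hypothetical names, open bounds are represented as `None`, the distinction between strict and inclusive bounds is ignored, and unit markings such as `wt %` are not handled.

```python
import re

NUM = r"\d*[.,]?\d+"  # matches 1, 2.5, .76 or 0,80 (decimal comma)

def to_float(text):
    """Convert a numeral that may use a decimal comma."""
    return float(text.replace(",", "."))

# (pattern, builder) pairs, tried in order; None marks an open bound.
RULES = [
    # range, element, range: 0,80 < Si < 1,20
    (re.compile(rf"({NUM})\s*<\s*[A-Z][a-z]?\s*<\s*({NUM})"),
     lambda m: (to_float(m[1]), to_float(m[2]))),
    # explicit lexical marking: between 1 and 5
    (re.compile(rf"between\s+({NUM})\s+and\s+({NUM})", re.I),
     lambda m: (to_float(m[1]), to_float(m[2]))),
    # implicit marking: 1-5
    (re.compile(rf"({NUM})\s*[-~]\s*({NUM})"),
     lambda m: (to_float(m[1]), to_float(m[2]))),
    # upper bound, symbolic or lexical: <2.5, less than 2.5
    (re.compile(rf"(?:<=|≤|<|less than(?: or equal to)?)\s*({NUM})", re.I),
     lambda m: (None, to_float(m[1]))),
    # lower bound, symbolic or lexical: ≥ .76, greater than or equal to 0.76
    (re.compile(rf"(?:>=|≥|>|greater than(?: or equal to)?)\s*({NUM})", re.I),
     lambda m: (to_float(m[1]), None)),
]

def parse_range(text):
    """Return (low, high) for the first rule that matches, else None."""
    for pattern, build in RULES:
        match = pattern.search(text)
        if match:
            return build(match)
    return None

print(parse_range("0,80 < Si < 1,20"))               # (0.8, 1.2)
print(parse_range("between 1 and 5"))                # (1.0, 5.0)
print(parse_range("less than 2.5"))                  # (None, 2.5)
print(parse_range("greater than or equal to 0.76"))  # (0.76, None)
```

The rules are ordered from most to least specific so that a two-sided form is not mistaken for a one-sided one. A real claim mixes these forms, so a production reader would need to apply them token by token rather than to an isolated phrase.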
  • 10. Reading The program is a linear text reader because we need to: 1. Identify claims 2. Identify pairs of elements and ranges in each claim. So each line in the file is read word by word just once in the same sequence as a human reader.
  • 11. Reading • Items are identified as numbers, range indicators or elements in sequence. • As each element/range pair is identified, the relationship with the specification is calculated. • Following calculation the element and the range are colour-coded and the claim built for potential display. • At the conclusion of each claim the total found is compared with the total specification. • If the claim meets the overall specification requirement it is added to the list for display. • At the conclusion of the reading process, all the results are ranked and displayed. • The program can process the full claims of around 300 patents per second.
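
The reading steps on slides 10 and 11 can be sketched as a single left-to-right pass over the tokens of a claim. Everything below is an assumption made for illustration: the element subset, the token pattern, the use of interval overlap as the "relationship with the specification", and the names `matched_elements` and `claim_matches`. Colour-coding, ranking and display are omitted.

```python
import re

ELEMENTS = {"C", "Si", "Mn", "Ni", "Cr"}  # illustrative subset of symbols
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|[-~]")

def overlaps(a, b):
    """True when the closed intervals a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

def matched_elements(claim, spec):
    """Read the claim once and return the spec elements whose range overlaps."""
    matched = set()
    element = low = None
    indicator_seen = False
    for token in TOKEN.findall(claim):
        if token in ELEMENTS:
            # A new element starts a new element/range pair.
            element, low, indicator_seen = token, None, False
        elif token in ("-", "~"):
            # A range indicator only counts once a lower bound has been read.
            indicator_seen = low is not None
        elif token[0].isdigit() and element is not None:
            if low is not None and indicator_seen:
                found = (low, float(token))
                if element in spec and overlaps(found, spec[element]):
                    matched.add(element)
                element, low, indicator_seen = None, None, False
            else:
                low = float(token)
        # Numbers seen before any element (e.g. the claim number) are ignored.
    return matched

def claim_matches(claim, spec):
    """A claim qualifies when every element in the specification was matched."""
    return matched_elements(claim, spec) == set(spec)

claim = ("1. A non-magnetic alloy, characterized in that the composition in "
         "weight percent is C: 0.14-0.30, Si: 0.15-0.80, Mn: 20.00-27.00")
print(claim_matches(claim, {"Si": (0.5, 1.0), "Mn": (25.0, 30.0)}))  # True
print(claim_matches(claim, {"Si": (1.0, 2.0)}))                      # False
```

Because numbers are only interpreted after an element has been seen, the leading claim number acts as a distractor that is skipped, which is one of the concerns raised on slide 8.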
  • 12. Native languages v Machine Translation Here is the problem from the PatBase collection. <Claims><![CDATA[<CLA_MT><XXC1> <p> CN 1. A non-magnetic alloy of high strength and toughness, characterized in that the chemical composition in weight percent of: C:.. 0 14 ~0 30 percent, Si:.. 0 15 ~0 80 percent,.. Mn: 20 00 ~27 00 percent; Ni:.. 0 60 ~2 00 percent; Cr:.. 12 50 ~19 00 percent; </CLA_MT><CLA_CN><XXC1><p>CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: C :0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00% ; You can see that the MT version into English is appalling! You can also see that the original claim will be understandable by the program because the presentation is clear.
  • 13. Detailed example (continued) It is not practicable to write a program that takes account of all the things that might go wrong without also introducing potential errors into data that is actually OK. But it is possible for SpanMatch to recognise the original as correct, as you see here. So, given clean data, or cleaning the data up as best we can, we can do this in all the languages. Once you have an indication of potential interest you can use a good MT program to translate just the claims of interest. This is Google Translate translating the claim; you can see that it is struggling, but it is better than the PatBase one. CN is a high strength toughness nonmagnetic alloy characterized in that the chemical composition is in a weight percentage of C: 0.14 to 0. 30 Si: 0.015 to 80 Mn: 20 to 0000. Ni: 0.60 ~ 2.00; Cr: 12. 50 ~ 1900; Mo or W elements of one or two: 0. 60 ~ 2.50 ;; 0.8 ~ [0. LXMn (% - 0.5); 0 20 to 0.50; Ca, rare earth elements of one or two: 0. 003 ~ 0.05;: 彡 0.03:: 彡 0.03; Fe: balance.
  • 14. Use of CN, JP, KR originals - rationale • Machine translation is often hard to understand and sometimes incomprehensible • Using native language patents ensures data quality • Limited inbuilt knowledge required for numerical searching – Searching for elements requires only that a program has the CJK equivalents for full element names; international symbols are identical. – Searching for ranges requires knowledge of potential CJK equivalent codes for digits – Searching for range indicators requires language specific identification of hyphen, <, > and words. • Accurate identification of the search specification with display of the claims means only those claims of interest need translation by machine or human
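
One possible way to supply the "CJK equivalent codes for digits" and range indicators is Unicode compatibility normalisation, sketched below. This is an assumption about how such cleaning could be done, not a description of SpanMatch. `normalise` and `RANGE_MARKS` are hypothetical names, and the rule that rejoins a decimal split by a space (as in `0. 14` in the Chinese original on slide 12) is a heuristic that could, in other data, wrongly merge a list number with a following figure.

```python
import re
import unicodedata

# The wave dash (U+301C) has no compatibility mapping, so it is mapped by hand.
RANGE_MARKS = str.maketrans({"\u301c": "~"})

def normalise(text):
    """Fold full-width digits and punctuation to ASCII and rejoin split decimals."""
    # NFKC turns full-width forms such as U+FF11 or U+FF0E into '1' and '.'.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(RANGE_MARKS)
    # Heuristic: "0. 14" -> "0.14" (digit, full stop, whitespace, digit).
    return re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)

print(normalise("Si :0. 15 \u301c0. 80%"))  # Si :0.15 ~0.80%
```

After this step the numerals and the range indicator are in the same form as in a Latin-script claim, so the same range reader can be applied; only the element names written in full still need a per-language lookup.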