SlideShare uma empresa Scribd logo
1 de 8
Baixar para ler offline
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

PATENT DATABASE: A METHODOLOGY OF
INFORMATION RETRIEVAL FROM PDF
Pawan Sharma*, R.C.Tripathi**
*Research scholar, Indian institute of information technology, Allahabad India.
**Professor, Indian institute of information technology, Allahabad India.

ABSTRACT
Patent document holds wealth of information in itself. A brief detail of Indian patent application
information is published as eighteen month publication by Indian patent Office, in electronic gazette
weekly. To date, a proper database of Indian patents specifically for research determination has not been
available, making it complicated for researcher to use this data for measuring any kind of research
activities in terms of patents in India. To facilitate this, we constructed a comprehensive patent database
which incorporates the information presented in the electronic gazette. This database includes information
such as technology class, applicant, inventor, country of origin etc., of the patent submitted. We present the
methodology for the creation of this database, its basic features along with its accuracy and reliability in
this research paper. Patent based database has been developed and can be used for various innovation
researches and activities.

KEYWORDS
Patent document, Indian patent, Electronic gazette.

1. INTRODUCTION
Despite the widely shared view that technological knowledge is one of the most important factors
in economic development, many difficulties have been encountered in the measurement and
counting of technological knowledge. The sources of data capable of serving as a basis for
studying technological knowledge have been limited, stemming from the difficulties inherent in
measuring knowledge. Patent data has been considered one of the few precious sources of
standardized information for technological knowledge. However the lack of user friendly
databases, coupled with the sheer volume of the data has kept researcher from exploiting this rich
and valuable data mine. Since the national bureau of economic research of the United States made
public its NBER patent citation data file[1], thanks to heroic efforts on the part of Bronwyn hall
and her associates, research on innovation has made major strides. The rapid progress of
computing has also aided this research. Many important papers have been written based on this
valuable database, providing significant insight into the process of innovation.
Following this example, we developed a patent database based on patents applications filed with
the Indian patent office. We call this database the PRC patent database. PRC stands for patent
referral cell a research section based in IIIT-Allahabad that works in patent filing and related
activities. Access to patent database is limited. There are hardly any databases for Indian patent
DOI : 10.5121/ijdms.2013.5502

9
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

specifically apart from Ekaswa developed by Patent Facilitating Centre (PFC), Department of
Science and Technology under Technology Information Forecasting and Assessment Council
(TIFAC) in 1995. But the database is restriction in its use for research as it can be only used for
single component keyword search. An instant substitute is to develop a database in a widely used
package that can be updated and transmitted easily for research. Using this database the
materialization of new areas of research and the level of activity in other important areas can
eventually be traced over the last decade.
The rest of this paper is organized as follows. The next section 2 explains the basic features of
this database, section 3 describe the field use as parameters of the database, section 4 explains the
methodology underlying the construction of the database, and section 5 presents the database of
granted patent database followed by section 6 & 7 which explains the graphical tool and future
work with conclusions.

2. BASIC FEATURES OF DATABASE
The Indian patent office disseminates a patent electronic gazette weekly[2]. This gazette contains
detail of patent application filed by the applicants in Indian patent office and published after 18
months of application filling. This electronic gazette is freely and easily available in form of Pdf
year wise and also monthly and weekly on the official website of patent India. This pdf contains
value of information for researcher. But since the gazette being in a pdf form with a varieties of
difference in pdf it is almost impossible to be used by simple available tools which can convert
this pdf into any other format. For using this data for any kind of research activity, it should be
easily convert able to any other database programs like SQL, MYSQL etc. It is also desirable to
have plotting and data analysis capabilities to improve individual sorting tasks. Microsoft excel
word sheet, meets enough of these criteria and was identified as useful for this purpose. The
conversancy and short learning curve connected with excel assist to compliment the preliminary
necessities. The basic database formation of excel makes it a good choice for researcher who
would like to incorporate this database with one of their own, since excel database is easily
adaptable by other database programs.

3. FIELDS
The main goal in developing the PRC database was to store this information in flexible, useful
form. The gazette is brief in its form and so only limited fields are available to be used. The final
database contains eleven fields excluding claims and figures. A brief description of each field
along with its categories is presented in table 1.
Table1. Database fields defined in PRC database main table
Field title

Description and Categories

Application number

Number granted during filling patent application.

Date of filing application

Date on which application is filed.

Publication date

Date on which first information after 18 months is published.

10
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013
Title of invention

Title provided by a patent applicant.

International classification

International patent classification.

Priority Document No

Number granted by a country where the application is first filed.

Priority Date

Date on which the patent application was first filed in a country.

Name of priority country

Country name where the patent application is first filed.

Name of Applicant

Name of applicant or person who will be assignee of the patent in
future.

Name of Inventor

Name of person involved in invention. This could be a single name
or many depending on the number of people involved in an
invention.

Abstract

A brief description of the invention.

4. METHODOLOGY
Many portable document format (pdf) converters are available on internet, which simply provide
a format conversion, with the absence of structural information in the final format, except at the
very low level. Also these available converters have little or no understanding of the document’s
structure with probabilities for errors and omissions. Information contained in the original Pdf file
is in a way simply translated in another format. OCR oriented software’s provides a functionality
to convert pdf through scanning and OCR. But the drawback is that, information present in the
pdf file is clear in its nature and the image scrutiny step can introduce noise for example
unrecognized text as text zone; or unrecognized images. Also, this strategy is useful only when
the pdf file only contains images from scanning.

4.1 PDF Extraction
Data from Pdf to excel cannot be directly converted. In order to convert data from PDF to Excel,
the PDF must have visible tables i.e., visible table borders. This is because the conversion process
takes the contents in the tables of the PDF and places them into corresponding cells of the Excel
spreadsheet. If the PDF data is not in a tabular form it is pretty hard to convert it to Excel sheet. It
is always possible to convert the PDF to a DOC file regardless of tables but not to an excel sheet.
Excel spreadsheets could hold only 65,000 rows. Also in cases of very large PDF documents that
require conversion from PDF to Excel, it is possible that the number of rows in Excel could
exceed. This may cause Excel to crash or freeze. In such case, it is required that the Excel
conversion take place in two or three parts. This will ensure that the 65,000 row limit is not
exceeded and a new file for every year will be formed. The methodology we adopted involves
conversion of pdf to an intermediate XML [4]file following the basic technological principles of
information extraction, and then extracts information from the intermediate file and store it into a
Microsoft excel sheet.
11
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

Also we have developed, a database for granted patent based on the url retrieval technique which
is discussed in the later part of methodology section. The system consisted of three major
components: the basic input file, the rule method and the extraction system. The input of the basic
module is the pdf file. The intermediate output was the xml file[6], stored in buffer describing the
content and structure of the pdf file. The rule module analyzed and processed the pdf file, based
on the rules set to extract the information from pdf files and extraction module stores it into xml
files. Every page of the e gazette, contains a patent description header which starts with the
heading “(12) Patent Application Publication”. We use this valuable information to identify the
pages of our purpose from the e gazette. This heading has been used as header identification for
using the information enclosed in that page. A weekly electronic e gazette published by patent
office contains 300-400 pages and sometimes extended to 1200 pages, of which few pages in the
beginning and at the end are redundant as they contain generic information in every e gazette pdf
which is not required for our database. Setting a header rule, will filter out such pages. An
automated system is prone to result in errors if applied, without setting header rules. With this
condition applied, only pages containing patent application information are retrieved. Each pdf is
analyzed, according to the rule module and information is extracted. This rule module is based on
the string matching characteristics. Each field in the pdf has identified string and is in a fixed
format. These strings are matched, on the basis of string matching algorithm and the data is stored
in xml files.

4.2 String matching
The purpose of using string searching is to find the location of a specific text pattern within a
larger body of text. The main consideration for selecting a string matching algorithm for search
is speed and efficiency. There are a numerous strings searching algorithms in existence today, but
the one we used for our database construction is Rabin-Karp[7]. The Rabin-Karp string searching
algorithm calculates a hash value for the pattern, and for each M-character subsequence of text to
be compared. If the hash values are unequal, the algorithm will calculate the hash value for next
M-character sequence. If the hash values are equal, the algorithm will do a Brute Force
comparison between the pattern and the M-character sequence. In this way, there is only one
comparison per text subsequence, and Brute Force is only needed when hash values match.
Consider an M-character sequence as an M-digit number in base B, where B is the number of
letters in the alphabet. The text subsequence t[i .. i+M-1] is mapped to the number
X(i)=T[i].BM-1+T[i+1].BM-2+…..+t[i+M-1]
Furthermore, given X (I) we can compute X (I+1) for the next subsequence T [I+1.. I+M] in
constant time, as follows:
X ( I + 1 ) + T [ I+1 ] . BM-1 + T [I+2] . BM-2 + ……+ T [I+M]
X [I+1] = X (I) . B one digit shifted to left
-T [I] . BM subtract left most digit
+T [I+M] add new right most digit

12
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

In this way, we never openly calculate a new value. We merely regulate the existing value as we
move over one character. To elaborate more let’s say that our alphabet holds 10 characters. Our
alphabet = a, b, c, d, e, f, g, h, i, j. Let’s say that “a” corresponds to 1, “b” corresponds to 2 and so
on. The hash value for string “cah” would be ...3*100 + 1*10 + 8*1 = 318. If M is large, then the
resulting value (~bM) will be enormous. For this reason, we hash the value by taking it mod a
prime number q. The mod function (% in Java) is particularly useful in this case due to several of
its inherent properties:
(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
For these reasons:
h(i)=((t[i]⋅ bM-1 mod q) +(t[i+1]⋅ bM-2 mod q) + ...
+(t[i+M-1] mod q))mod q
h (i+1) =( h(i) ⋅ b mod q
Shift left one digit
-t[i] ⋅ b M mod q
Subtract leftmost digit
+t [i+M] mod q )
Add new rightmost digit
mod q
If a adequately large prime number is used for the hash function, the hashed values of two
dissimilar patterns will generally be different. In such case, searching takes O (A) time, where A
is the number of characters in the larger body of text. If the prime number used for hashing is
small a complexity of O (MA) case can take place but this is likely to happen if the prime number
used for hashing is small.
Applying this string matching algorithm, we match strings of each required field which we have
described earlier in the table. Words located after the strings are tagged with XML and stored in
the buffer memory. In the next step of our execution, we retrieved these tags and stored it in the
specific cells of excel file. A new excel sheet is automatically utilized once the limit of 65,000 is
exhausted and data are stored in the new excel file. A flow chart of the overall process is depicted
in figure 1. The overall research process consists of several steps. In the first phase, patent data is
collected from the pdf. This data is analyzed with string matching algorithm and stored in xml
format. Finally the tagged xml data is stored in excel sheet for final output.

13
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

Figure 1: Work flow model of the process

5. DATABASE OF GRANTED PATENTS
The database which we developed is formed from the patent application published by the patent
office India weekly. These patent applications do not result into a patent. As most of them are left
unprocessed, some which proceed with FER results into case of similarity or infringement issues.
On an average, of the total application only 40-45% applications are granted as a patent. For
example in year 2008-2009 total numbers of applications for patents filed was 36,812 where as
the number of Patent granted were 16,061. In such a situation it is important to analyze the
patents which are unprocessed or which have been left abandoned, since they are merely
applications and cannot be termed as a granted patent.
To filter such application from our main database we used the URL get method. Data was sent as
a get method and was matched with the URL of granted patent. This URL is available on the
website of Indian patent once we search for any specific patent in form of HTML page request
and is in unencrypted format. Input data is the application number. For example the URL for html
pages was in the following format http://ipindiaservices.gov.in /patentsearch/GrantedSearch/
completeSpecification.aspx? ID=491/MUMNP/2008[3]. In this URL we see that the last digits
are application number which has been assigned an electronic id. We sent the input data ranging
from 001/MUM/2008 till the last digit which was stored in our excel sheet in application number
field. The program filtered out all the granted and not granted patents. For granted patents, the
result was obtained in form of HTML format, complete specification along with details which we
stored in buffer memory. Thereafter with the help of brute force string matching algorithm[8], we
matched strings like application number, applicant name, assignee etc. Once these strings were
14
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

found the data was retrieved till it finds a new string which was set as next input. These retrieved
strings were, stored in an intermediary XML file[4], which tagged every field. In the next step of
process these tags were stored in various fields of excel sheet. The accuracy percentage is 97%
since the html pages are in format of running text and the accuracy rate cannot be expected to be
100% compared to pdf files which are published by patent office. The benefit of applying a string
matching and retrieving the data from html pages is that, we got addition information about
patents, the claim part which is not made available in the published PDF. Also using a brute force
algorithm for string matching was that it is more flexible in error and exception handling.

6. THE GRAPHICAL USER INTERFACE
We developed a tool to allow the operator to set up the whole conversion chain for a given
gazette pdf file published weekly and validate and/or correct the processing output. The
conversion takes approximately 2-3 hours for conversion of one pdf with 350 pages
approximately. A snap shot image of tool is depicted below as figure 2.

Figure 2: Snap Shot image of tool

7. CONCLUSION AND FUTURE WORK
We have presented a system to convert patent pdf into logically structured database in excel
sheet. The specificity of our approach relies in the exploitation of the native internal pdf objects
rather than using image based techniques on a pdf document. This database helps to solve the
problem which a researcher faces in his research regarding the study of Indian patents since there
are no data bases readily available which can provide such solution. Manual entry of such data is
15
International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013

time consuming and prone to errors. Commercial data bases like Delphion, Ekaswa etc., provides
solutions for patent search rather than for research work. In future we wish to form a citation
network between patents based on semantic information retrieval which will help a patent
examinee to find citation for filed patents in India and will reduce time in clearing the reports.
The database formed is solely used for research purpose and has no commercial applicability.

REFERENCES
[1]
[2]
[3]
[4]

[5]

[6]
[7]
[8]

Hall, B., jaffe, A. and M. Trajtenberg (2000), Market Value and Patent Citations: A First Look,
NBER WP 7441
http://ipindia.nic.in/ipr/patent/patents.htm managed by Indian patent office, India.
http://ipindiaservices.gov.in/patentsearch/search/index.aspx managed by Indian patent office, India.
Herve Dejean, Jean-Luc Meunier, “ A system for converting PDF documents into structured XML
format document analysis system VII Pages pp 129-140 Book Subtitle 7th International Workshop,
DAS 2006, Nelson, New Zealand, February 13-15, 2006. Proceedings Publisher Springer Berlin
Heidelberg Print ISBN 978-3-540-32140-8.
L.A. Ochoa-Franco, C.T.Haas, C.M.Dailey, A.E.Traver, Automation and Robotics in Construction
Xi Proceedings of the 11th International Symposium on Automation and Robotics in Construction
Copyright © 1994 Elsevier B.V. All rights reserved. Edited by: D.A. Chamberlain ISBN: 978-0-44482044-0
WENDE Zhang, (2008),”converting PDF files to XML files”, The Electronic Library, Vol. 26 iss:1
pp. 68-74.
Karp, Richard M.; Rabin, Michael O. (March 1987). Efficient randomized pattern-matching
algorithms 31 (2). Retrieved 2008-10-14
Christof Paar, Jan Pelzl, Bart Preneel (2010). Understanding Cryptography: A Textbook for
Students and Practitioners Springer p. 7 ISBN 3-642-04100-0

16

Mais conteúdo relacionado

Semelhante a Patent database a methodology of information retrieval from pdf

Range Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceRange Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceIJMER
 
IRJET- Determining Document Relevance using Keyword Extraction
IRJET-  	  Determining Document Relevance using Keyword ExtractionIRJET-  	  Determining Document Relevance using Keyword Extraction
IRJET- Determining Document Relevance using Keyword ExtractionIRJET Journal
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMIRJET Journal
 
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docx
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docxJournal of Physics Conference SeriesPAPER • OPEN ACCESS.docx
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docxLaticiaGrissomzz
 
Algorithm Procedure and Pseudo Code Mining
Algorithm Procedure and Pseudo Code MiningAlgorithm Procedure and Pseudo Code Mining
Algorithm Procedure and Pseudo Code MiningIRJET Journal
 
Search API in Ruby with ES and complicated factors
Search API in Ruby with ES and complicated factorsSearch API in Ruby with ES and complicated factors
Search API in Ruby with ES and complicated factorsTrung Vu
 
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET Journal
 
Comparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & PythonComparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & PythonIRJET Journal
 
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...
 HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ... HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...IJCSEA Journal
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMcscpconf
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir systemcsandit
 
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...IRJET Journal
 
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...IJCSEA Journal
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure DataIRJET Journal
 
Database System: A Case Study of A.T.E.C Central Library
Database System: A Case Study of A.T.E.C Central LibraryDatabase System: A Case Study of A.T.E.C Central Library
Database System: A Case Study of A.T.E.C Central LibraryBhojaraju Gunjal
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentationBhadra Gowdra
 
IRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET Journal
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine ScrapperIRJET Journal
 
Extract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A ReviewExtract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A ReviewIRJET Journal
 

Semelhante a Patent database a methodology of information retrieval from pdf (20)

Range Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceRange Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map Reduce
 
IRJET- Determining Document Relevance using Keyword Extraction
IRJET-  	  Determining Document Relevance using Keyword ExtractionIRJET-  	  Determining Document Relevance using Keyword Extraction
IRJET- Determining Document Relevance using Keyword Extraction
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
 
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docx
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docxJournal of Physics Conference SeriesPAPER • OPEN ACCESS.docx
Journal of Physics Conference SeriesPAPER • OPEN ACCESS.docx
 
Algorithm Procedure and Pseudo Code Mining
Algorithm Procedure and Pseudo Code MiningAlgorithm Procedure and Pseudo Code Mining
Algorithm Procedure and Pseudo Code Mining
 
Search API in Ruby with ES and complicated factors
Search API in Ruby with ES and complicated factorsSearch API in Ruby with ES and complicated factors
Search API in Ruby with ES and complicated factors
 
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...IRJET-  	  Data Analysis and Solution Prediction using Elasticsearch in Healt...
IRJET- Data Analysis and Solution Prediction using Elasticsearch in Healt...
 
Comparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & PythonComparing the performance of a business process: using Excel & Python
Comparing the performance of a business process: using Excel & Python
 
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...
 HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ... HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND ...
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir system
 
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
 
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...
HIGH SPEED DATA RETRIEVAL FROM NATIONAL DATA CENTER (NDC) REDUCING TIME AND I...
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
 
Database System: A Case Study of A.T.E.C Central Library
Database System: A Case Study of A.T.E.C Central LibraryDatabase System: A Case Study of A.T.E.C Central Library
Database System: A Case Study of A.T.E.C Central Library
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
Worldranking universities final documentation
Worldranking universities final documentationWorldranking universities final documentation
Worldranking universities final documentation
 
IRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining Techniques
 
Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine Scrapper
 
Extract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A ReviewExtract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A Review
 

Último

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Último (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Patent database a methodology of information retrieval from pdf

  • 1. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 PATENT DATABASE: A METHODOLOGY OF INFORMATION RETRIEVAL FROM PDF Pawan Sharma*, R.C.Tripathi** *Research scholar, Indian institute of information technology, Allahabad India. **Professor, Indian institute of information technology, Allahabad India. ABSTRACT Patent document holds wealth of information in itself. A brief detail of Indian patent application information is published as eighteen month publication by Indian patent Office, in electronic gazette weekly. To date, a proper database of Indian patents specifically for research determination has not been available, making it complicated for researcher to use this data for measuring any kind of research activities in terms of patents in India. To facilitate this, we constructed a comprehensive patent database which incorporates the information presented in the electronic gazette. This database includes information such as technology class, applicant, inventor, country of origin etc., of the patent submitted. We present the methodology for the creation of this database, its basic features along with its accuracy and reliability in this research paper. Patent based database has been developed and can be used for various innovation researches and activities. KEYWORDS Patent document, Indian patent, Electronic gazette. 1. INTRODUCTION Despite the widely shared view that technological knowledge is one of the most important factors in economic development, many difficulties have been encountered in the measurement and counting of technological knowledge. The sources of data capable of serving as a basis for studying technological knowledge have been limited, stemming from the difficulties inherent in measuring knowledge. Patent data has been considered one of the few precious sources of standardized information for technological knowledge. However the lack of user friendly databases, coupled with the sheer volume of the data has kept researcher from exploiting this rich and valuable data mine. Since the national bureau of economic research of the United States made public its NBER patent citation data file[1], thanks to heroic efforts on the part of Bronwyn hall and her associates, research on innovation has made major strides. The rapid progress of computing has also aided this research. Many important papers have been written based on this valuable database, providing significant insight into the process of innovation. Following this example, we developed a patent database based on patents applications filed with the Indian patent office. We call this database the PRC patent database. PRC stands for patent referral cell a research section based in IIIT-Allahabad that works in patent filing and related activities. Access to patent database is limited. There are hardly any databases for Indian patent DOI : 10.5121/ijdms.2013.5502 9
  • 2. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 specifically apart from Ekaswa developed by Patent Facilitating Centre (PFC), Department of Science and Technology under Technology Information Forecasting and Assessment Council (TIFAC) in 1995. But the database is restriction in its use for research as it can be only used for single component keyword search. An instant substitute is to develop a database in a widely used package that can be updated and transmitted easily for research. Using this database the materialization of new areas of research and the level of activity in other important areas can eventually be traced over the last decade. The rest of this paper is organized as follows. The next section 2 explains the basic features of this database, section 3 describe the field use as parameters of the database, section 4 explains the methodology underlying the construction of the database, and section 5 presents the database of granted patent database followed by section 6 & 7 which explains the graphical tool and future work with conclusions. 2. BASIC FEATURES OF DATABASE The Indian patent office disseminates a patent electronic gazette weekly[2]. This gazette contains detail of patent application filed by the applicants in Indian patent office and published after 18 months of application filling. This electronic gazette is freely and easily available in form of Pdf year wise and also monthly and weekly on the official website of patent India. This pdf contains value of information for researcher. But since the gazette being in a pdf form with a varieties of difference in pdf it is almost impossible to be used by simple available tools which can convert this pdf into any other format. For using this data for any kind of research activity, it should be easily convert able to any other database programs like SQL, MYSQL etc. It is also desirable to have plotting and data analysis capabilities to improve individual sorting tasks. Microsoft excel word sheet, meets enough of these criteria and was identified as useful for this purpose. The conversancy and short learning curve connected with excel assist to compliment the preliminary necessities. The basic database formation of excel makes it a good choice for researcher who would like to incorporate this database with one of their own, since excel database is easily adaptable by other database programs. 3. FIELDS The main goal in developing the PRC database was to store this information in flexible, useful form. The gazette is brief in its form and so only limited fields are available to be used. The final database contains eleven fields excluding claims and figures. A brief description of each field along with its categories is presented in table 1. Table1. Database fields defined in PRC database main table Field title Description and Categories Application number Number granted during filling patent application. Date of filing application Date on which application is filed. Publication date Date on which first information after 18 months is published. 10
  • 3. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 Title of invention Title provided by a patent applicant. International classification International patent classification. Priority Document No Number granted by a country where the application is first filed. Priority Date Date on which the patent application was first filed in a country. Name of priority country Country name where the patent application is first filed. Name of Applicant Name of applicant or person who will be assignee of the patent in future. Name of Inventor Name of person involved in invention. This could be a single name or many depending on the number of people involved in an invention. Abstract A brief description of the invention. 4. METHODOLOGY Many portable document format (pdf) converters are available on internet, which simply provide a format conversion, with the absence of structural information in the final format, except at the very low level. Also these available converters have little or no understanding of the document’s structure with probabilities for errors and omissions. Information contained in the original Pdf file is in a way simply translated in another format. OCR oriented software’s provides a functionality to convert pdf through scanning and OCR. But the drawback is that, information present in the pdf file is clear in its nature and the image scrutiny step can introduce noise for example unrecognized text as text zone; or unrecognized images. Also, this strategy is useful only when the pdf file only contains images from scanning. 4.1 PDF Extraction Data from Pdf to excel cannot be directly converted. In order to convert data from PDF to Excel, the PDF must have visible tables i.e., visible table borders. This is because the conversion process takes the contents in the tables of the PDF and places them into corresponding cells of the Excel spreadsheet. If the PDF data is not in a tabular form it is pretty hard to convert it to Excel sheet. It is always possible to convert the PDF to a DOC file regardless of tables but not to an excel sheet. Excel spreadsheets could hold only 65,000 rows. Also in cases of very large PDF documents that require conversion from PDF to Excel, it is possible that the number of rows in Excel could exceed. This may cause Excel to crash or freeze. In such case, it is required that the Excel conversion take place in two or three parts. This will ensure that the 65,000 row limit is not exceeded and a new file for every year will be formed. The methodology we adopted involves conversion of pdf to an intermediate XML [4]file following the basic technological principles of information extraction, and then extracts information from the intermediate file and store it into a Microsoft excel sheet. 11
  • 4. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 Also we have developed, a database for granted patent based on the url retrieval technique which is discussed in the later part of methodology section. The system consisted of three major components: the basic input file, the rule method and the extraction system. The input of the basic module is the pdf file. The intermediate output was the xml file[6], stored in buffer describing the content and structure of the pdf file. The rule module analyzed and processed the pdf file, based on the rules set to extract the information from pdf files and extraction module stores it into xml files. Every page of the e gazette, contains a patent description header which starts with the heading “(12) Patent Application Publication”. We use this valuable information to identify the pages of our purpose from the e gazette. This heading has been used as header identification for using the information enclosed in that page. A weekly electronic e gazette published by patent office contains 300-400 pages and sometimes extended to 1200 pages, of which few pages in the beginning and at the end are redundant as they contain generic information in every e gazette pdf which is not required for our database. Setting a header rule, will filter out such pages. An automated system is prone to result in errors if applied, without setting header rules. With this condition applied, only pages containing patent application information are retrieved. Each pdf is analyzed, according to the rule module and information is extracted. This rule module is based on the string matching characteristics. Each field in the pdf has identified string and is in a fixed format. These strings are matched, on the basis of string matching algorithm and the data is stored in xml files. 4.2 String matching The purpose of using string searching is to find the location of a specific text pattern within a larger body of text. The main consideration for selecting a string matching algorithm for search is speed and efficiency. There are a numerous strings searching algorithms in existence today, but the one we used for our database construction is Rabin-Karp[7]. The Rabin-Karp string searching algorithm calculates a hash value for the pattern, and for each M-character subsequence of text to be compared. If the hash values are unequal, the algorithm will calculate the hash value for next M-character sequence. If the hash values are equal, the algorithm will do a Brute Force comparison between the pattern and the M-character sequence. In this way, there is only one comparison per text subsequence, and Brute Force is only needed when hash values match. Consider an M-character sequence as an M-digit number in base B, where B is the number of letters in the alphabet. The text subsequence t[i .. i+M-1] is mapped to the number X(i)=T[i].BM-1+T[i+1].BM-2+…..+t[i+M-1] Furthermore, given X (I) we can compute X (I+1) for the next subsequence T [I+1.. I+M] in constant time, as follows: X ( I + 1 ) + T [ I+1 ] . BM-1 + T [I+2] . BM-2 + ……+ T [I+M] X [I+1] = X (I) . B one digit shifted to left -T [I] . BM subtract left most digit +T [I+M] add new right most digit 12
  • 5. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 In this way, we never openly calculate a new value. We merely regulate the existing value as we move over one character. To elaborate more let’s say that our alphabet holds 10 characters. Our alphabet = a, b, c, d, e, f, g, h, i, j. Let’s say that “a” corresponds to 1, “b” corresponds to 2 and so on. The hash value for string “cah” would be ...3*100 + 1*10 + 8*1 = 318. If M is large, then the resulting value (~bM) will be enormous. For this reason, we hash the value by taking it mod a prime number q. The mod function (% in Java) is particularly useful in this case due to several of its inherent properties: (x mod q) + (y mod q)] mod q = (x+y) mod q (x mod q) mod q = x mod q For these reasons: h(i)=((t[i]⋅ bM-1 mod q) +(t[i+1]⋅ bM-2 mod q) + ... +(t[i+M-1] mod q))mod q h (i+1) =( h(i) ⋅ b mod q Shift left one digit -t[i] ⋅ b M mod q Subtract leftmost digit +t [i+M] mod q ) Add new rightmost digit mod q If a adequately large prime number is used for the hash function, the hashed values of two dissimilar patterns will generally be different. In such case, searching takes O (A) time, where A is the number of characters in the larger body of text. If the prime number used for hashing is small a complexity of O (MA) case can take place but this is likely to happen if the prime number used for hashing is small. Applying this string matching algorithm, we match strings of each required field which we have described earlier in the table. Words located after the strings are tagged with XML and stored in the buffer memory. In the next step of our execution, we retrieved these tags and stored it in the specific cells of excel file. A new excel sheet is automatically utilized once the limit of 65,000 is exhausted and data are stored in the new excel file. A flow chart of the overall process is depicted in figure 1. The overall research process consists of several steps. In the first phase, patent data is collected from the pdf. This data is analyzed with string matching algorithm and stored in xml format. Finally the tagged xml data is stored in excel sheet for final output. 13
  • 6. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 Figure 1: Work flow model of the process 5. DATABASE OF GRANTED PATENTS The database which we developed is formed from the patent application published by the patent office India weekly. These patent applications do not result into a patent. As most of them are left unprocessed, some which proceed with FER results into case of similarity or infringement issues. On an average, of the total application only 40-45% applications are granted as a patent. For example in year 2008-2009 total numbers of applications for patents filed was 36,812 where as the number of Patent granted were 16,061. In such a situation it is important to analyze the patents which are unprocessed or which have been left abandoned, since they are merely applications and cannot be termed as a granted patent. To filter such application from our main database we used the URL get method. Data was sent as a get method and was matched with the URL of granted patent. This URL is available on the website of Indian patent once we search for any specific patent in form of HTML page request and is in unencrypted format. Input data is the application number. For example the URL for html pages was in the following format http://ipindiaservices.gov.in /patentsearch/GrantedSearch/ completeSpecification.aspx? ID=491/MUMNP/2008[3]. In this URL we see that the last digits are application number which has been assigned an electronic id. We sent the input data ranging from 001/MUM/2008 till the last digit which was stored in our excel sheet in application number field. The program filtered out all the granted and not granted patents. For granted patents, the result was obtained in form of HTML format, complete specification along with details which we stored in buffer memory. Thereafter with the help of brute force string matching algorithm[8], we matched strings like application number, applicant name, assignee etc. Once these strings were 14
  • 7. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 found the data was retrieved till it finds a new string which was set as next input. These retrieved strings were, stored in an intermediary XML file[4], which tagged every field. In the next step of process these tags were stored in various fields of excel sheet. The accuracy percentage is 97% since the html pages are in format of running text and the accuracy rate cannot be expected to be 100% compared to pdf files which are published by patent office. The benefit of applying a string matching and retrieving the data from html pages is that, we got addition information about patents, the claim part which is not made available in the published PDF. Also using a brute force algorithm for string matching was that it is more flexible in error and exception handling. 6. THE GRAPHICAL USER INTERFACE We developed a tool to allow the operator to set up the whole conversion chain for a given gazette pdf file published weekly and validate and/or correct the processing output. The conversion takes approximately 2-3 hours for conversion of one pdf with 350 pages approximately. A snap shot image of tool is depicted below as figure 2. Figure 2: Snap Shot image of tool 7. CONCLUSION AND FUTURE WORK We have presented a system to convert patent pdf into logically structured database in excel sheet. The specificity of our approach relies in the exploitation of the native internal pdf objects rather than using image based techniques on a pdf document. This database helps to solve the problem which a researcher faces in his research regarding the study of Indian patents since there are no data bases readily available which can provide such solution. Manual entry of such data is 15
  • 8. International Journal of Database Management Systems ( IJDMS ) Vol.5, No.5, October 2013 time consuming and prone to errors. Commercial data bases like Delphion, Ekaswa etc., provides solutions for patent search rather than for research work. In future we wish to form a citation network between patents based on semantic information retrieval which will help a patent examinee to find citation for filed patents in India and will reduce time in clearing the reports. The database formed is solely used for research purpose and has no commercial applicability. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] Hall, B., jaffe, A. and M. Trajtenberg (2000), Market Value and Patent Citations: A First Look, NBER WP 7441 http://ipindia.nic.in/ipr/patent/patents.htm managed by Indian patent office, India. http://ipindiaservices.gov.in/patentsearch/search/index.aspx managed by Indian patent office, India. Herve Dejean, Jean-Luc Meunier, “ A system for converting PDF documents into structured XML format document analysis system VII Pages pp 129-140 Book Subtitle 7th International Workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006. Proceedings Publisher Springer Berlin Heidelberg Print ISBN 978-3-540-32140-8. L.A. Ochoa-Franco, C.T.Haas, C.M.Dailey, A.E.Traver, Automation and Robotics in Construction Xi Proceedings of the 11th International Symposium on Automation and Robotics in Construction Copyright © 1994 Elsevier B.V. All rights reserved. Edited by: D.A. Chamberlain ISBN: 978-0-44482044-0 WENDE Zhang, (2008),”converting PDF files to XML files”, The Electronic Library, Vol. 26 iss:1 pp. 68-74. Karp, Richard M.; Rabin, Michael O. (March 1987). Efficient randomized pattern-matching algorithms 31 (2). Retrieved 2008-10-14 Christof Paar, Jan Pelzl, Bart Preneel (2010). Understanding Cryptography: A Textbook for Students and Practitioners Springer p. 7 ISBN 3-642-04100-0 16