European Journal of Scientific Research
ISSN 1450-216X Vol.20 No.4 (2008), pp.844-851
© EuroJournals Publishing, Inc. 2008
http://www.eurojournals.com/ejsr.htm


      Database Interfacing using Natural Language Processing

                                     Imran Sarwar Bajwa
           Department of Computer Science and IT, The Islamia University of Bahawalpur
                               E-mail: imransbajwa@gmail.com

                                       Shahzad Mumtaz
           Department of Computer Science and IT, The Islamia University of Bahawalpur
                               E-mail: shahzadz22@hotmail.com

                                      M. Shahid Naweed
           Department of Computer Science and IT, The Islamia University of Bahawalpur
                             E-mail: shahid_naweed@hotmail.com

                                              Abstract

             Writing technically correct SQL queries is a complex, skill-intensive task,
     especially for a novice user. The situation becomes more complex when a low-skilled
     person has to use a database management system for a specific business purpose. He or she
     has to write queries on his or her own and perform various tasks. This scenario requires
     expertise and skill in understanding and writing accurate and functional
     queries. The task of the novice user can be simplified by providing an easy interface that is
     familiar to that user. To resolve these issues, automated software is needed
     that helps both users and software engineers. The user writes the requirements in a few
     statements of simple English, and the designed system analyzes the given
     script. After composite analysis and mining of the associated information, the designed system
     generates the intended SQL queries, which can be run directly. The paper describes a system
     that creates SQL queries automatically. The designed system provides a quick and
     reliable way to generate SQL queries, saving time and budget for both the user and the system
     analyst.


     Keywords: Information extraction, Automatic Query Generation, Knowledge Retrieval,
               Natural language processing.

1.0. Introduction
Relational databases are the predominant way of storing common data repositories. Once the data
is stored in a database, an interfacing mechanism is required to communicate with the stored
repository. The conventional way of communicating with a database is to first build a connection
stream and then add, delete or update the data in the database by using a standardized
interfacing mechanism [1]. Simple command shells are typically used, and they are often bundled
with each database product. These command shells are typically simple filters which help a
user to log on to the database, execute particular commands and receive output. These command shells
provide access to the database from the machine on which the database is actually running [2]. After
connecting to a particular database, a user or a programmer requires an interface, and typically that
interface is provided by some technical languages. These languages are called query languages and
consist of the database commands typically used for asking questions of a particular database and
getting the intended response. SQL [3] (Structured Query Language) is the most popular query language
and is the de facto language of databases today. SQL is a standard tool for database
querying. Different database management systems implement this standardized language with minor
alterations and adjustments. However, in spite of these proprietary extensions by the vendors, the core
of this querying language is the same in all of the environments.
        From an application programmer's point of view, the major novelty in the relational database is
that one uses a declarative query language, SQL. Most computer languages are procedural: the
programmer tells the computer what to do, step by step, specifying a procedure. Using the SQL interface,
the programmer states his requirements and questions, and the RDBMS query planner figures out how
to get the answer [5]. There are two advantages of using a declarative language. The first is that the
queries no longer depend on the data representation; the RDBMS is free to store data according to its
own design requirements [6]. The second is improved software dependability. For various web-based
and stand-alone applications, generic SQL is used to keep things simple and straightforward.
Despite these advantages, SQL's technical and exacting interface makes the
language tedious and difficult to learn and use. It is quite difficult to remember the SQL
commands and use them accurately and precisely.
        To resolve these issues, automated software is needed that helps both users
and software engineers. With such software, the time it takes to explore all of the facilities
and services should be well under a minute, which is quite useful for the users.


2.0. Problem Description
Modern software engineering requires quick and automated solutions that can create
accurate and precise SQL queries automatically. For complex queries, even an expert programmer
needs assistance in the form of automatic query generation. He can use the automatically generated
queries after making appropriate adjustments and alterations, with less effort and in less time
than with the traditional approaches.
         The task of the novice user can be simplified by providing an easy interface that is
familiar to that user. To resolve these issues, automated software is needed that
helps both users and software engineers. The user writes the requirements in a few
statements of simple English, and the designed system analyzes the given script. After composite
analysis and mining of the associated information, the designed system generates the intended SQL queries,
which can be run directly. The designed system can create code automatically without an
external environment. It provides a quick and reliable way to generate SQL queries,
saving time and budget for both the user and the system analyst.


3.0. Used Methodology
The understanding and multi-aspect processing of natural languages, also termed "speech
languages", is one of the topics of greatest interest in the field of artificial intelligence
[8]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based
on informal grammars. Geographical, psychological and sociological factors
influence the behaviour of natural languages [12]. Their vocabularies are open-ended, and they
change and vary from area to area and from time to time. Due to these variations and inconsistencies,
natural languages have different flavours; English, for example, has more than half a dozen well-known
flavours all over the world [14]. These flavours have different accents, vocabularies and phonological
aspects. These discrepancies and inconsistencies make natural languages
difficult to process compared to formal languages [13].

       The English language statements are converted into an SQL query by using the
newly designed rule-based algorithm. The SELECT query is the common query used to choose a set of
values from a table [4]. An example college database has been used in the conducted research. Students'
data is retrieved, inserted and deleted by queries generated automatically from simple English
text.

3.1. SELECT Query
First of all, the ‘SELECT’ query has been processed. A ‘SELECT’ query has four parts, as follows:

 SELECT            *                 FROM              Students
 Keyword           Required Set      Keyword           Table Name

       A ‘SELECT’ query can easily be generated from the provided input string, as there are two
keywords, ‘SELECT’ and ‘FROM’. The other two required values are ‘Required Set’ and ‘Table Name’.
To process the natural language text and find ‘Required Set’ and ‘Table Name’, the conventional norms
of the English language and its grammatical rules are used. The conventional structure of a simple
English sentence is the key rule for comprehending and analyzing the natural language text [13], as in
the following example:
       “I need names of all students.”
       Following is the complete analysis of this simple sentence.

Table 01: Generating SELECT Query from text

 Lexicons                          Phase-I                            Phase –II
 I                                 Noun                               ----------
 need                              Verb                               ----------
 names                             Noun                               Field Name
 of                                preposition                        ----------
 all                               Noun                               *
 students                          Noun                               Table Name

       In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table
Name’ field is filled by the ‘Table Name’ attribute, as follows:
       Select * from Students
       Here the table name is searched for in the array of all available tables in the database. From all
available tables, the nearest table name is picked, which is ‘students’ in this example.
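       The Phase-II mapping of Table 01 can be sketched in Python as follows. This is an illustrative
reconstruction, not the authors' implementation: the table list `AVAILABLE_TABLES`, the function names
and the use of `difflib` to pick the "nearest" table name are all assumptions. Following the query
shown above, the sketch emits `*` whenever a token carries the `*` role, and falls back to the
extracted field names otherwise.

```python
import difflib

# Hypothetical schema; the paper only states that table names are looked up
# in an array of all available tables.
AVAILABLE_TABLES = ["Students", "Teachers", "Courses"]


def nearest_table(word, tables=AVAILABLE_TABLES):
    """Return the table whose name is closest to `word`, or None if none is close."""
    by_lower = {name.lower(): name for name in tables}
    close = difflib.get_close_matches(word.lower(), list(by_lower), n=1, cutoff=0.6)
    return by_lower[close[0]] if close else None


def build_select(tagged):
    """Build a SELECT query from (lexicon, phase2_role) pairs as in Table 01."""
    fields = [word for word, role in tagged if role == "Field Name"]
    select_all = any(role == "*" for _, role in tagged)
    table = next(
        (nearest_table(word) for word, role in tagged if role == "Table Name"),
        None,
    )
    if table is None:
        raise ValueError("no known table name found in the sentence")
    # Assumption: a '*' role overrides any extracted field names.
    required_set = "*" if select_all or not fields else ", ".join(fields)
    return f"SELECT {required_set} FROM {table}"


# Phase-II column of Table 01; None stands for the '----------' entries.
TABLE_01 = [
    ("I", None),
    ("need", None),
    ("names", "Field Name"),
    ("of", None),
    ("all", "*"),
    ("students", "Table Name"),
]

print(build_select(TABLE_01))  # SELECT * FROM Students
```

Fuzzy matching is only one possible reading of "nearest table name"; an exact or stemmed lookup would
serve equally well for this example.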

3.2. INSERT Query
After the ‘SELECT’ query, the ‘INSERT’ query has been processed. An ‘INSERT’ query has five parts, as
follows:

 INSERT         INTO           Students         VALUES         (5, ‘Ali’)
 Keyword        Keyword        Table Name       Keyword        Record

       An ‘INSERT’ query can also be produced from the given statement, as there are three keywords,
‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are ‘Table Name’ and
‘Record’. Using the same rule-based algorithm, ‘Table Name’ and ‘Record’ are extracted, as in the
following example:
                  “I want to insert a student whose Roll No. is 5 and Name is Ali.”
       Following is the complete analysis of this simple sentence.
Table 02: Generating INSERT Query from text

Lexicons                          Phase-I                            Phase –II
I                                 Noun                               -----------
want                              Verb                               -----------
to                                Preposition                        -----------
insert                            Verb                               Action
a                                 article                            -----------
student                           Noun                               Table Name
whose                             Conjunction                        -----------
Roll No                           Noun                               Attribute
is                                Helping Verb                       ------------
5                                 Noun                               Value
and                               Conjunction                        ------------
Name                              Noun                               Attribute
is                                Helping Verb                       ------------
Ali                               Noun                               Value

       In this example the ‘Record’ field is filled by the ‘Value’ attributes and the ‘Table
Name’ field is filled by the ‘Table Name’ attribute. Here the table name is searched for in the array of
all available tables in the database. From all available tables, the nearest table name is picked, which is
‘students’ in this example.
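       A minimal Python sketch of this step is given below, assuming the Phase-II roles of Table 02 are
already available. The names `AVAILABLE_TABLES`, `nearest_table`, `sql_literal` and `build_insert` are
hypothetical, and the quoting rule (digits stay bare, everything else is single-quoted) is an
assumption made for illustration.

```python
import difflib

# Hypothetical list of tables in the example college database.
AVAILABLE_TABLES = ["Students", "Teachers", "Courses"]


def nearest_table(word, tables=AVAILABLE_TABLES):
    """Return the table whose name is closest to `word`, or None if none is close."""
    by_lower = {name.lower(): name for name in tables}
    close = difflib.get_close_matches(word.lower(), list(by_lower), n=1, cutoff=0.6)
    return by_lower[close[0]] if close else None


def sql_literal(token):
    """Render a token as an SQL literal: digits stay bare, text is single-quoted."""
    if token.isdigit():
        return token
    return "'" + token.replace("'", "''") + "'"


def build_insert(tagged):
    """Build an INSERT query from (lexicon, phase2_role) pairs as in Table 02."""
    table = next(
        (nearest_table(word) for word, role in tagged if role == "Table Name"),
        None,
    )
    values = [sql_literal(word) for word, role in tagged if role == "Value"]
    if table is None or not values:
        raise ValueError("table name or record values are missing")
    return f"INSERT INTO {table} VALUES ({', '.join(values)})"


# Phase-II column of Table 02; None stands for the '-----------' entries.
TABLE_02 = [
    ("I", None), ("want", None), ("to", None), ("insert", "Action"),
    ("a", None), ("student", "Table Name"), ("whose", None),
    ("Roll No", "Attribute"), ("is", None), ("5", "Value"),
    ("and", None), ("Name", "Attribute"), ("is", None), ("Ali", "Value"),
]

print(build_insert(TABLE_02))  # INSERT INTO Students VALUES (5, 'Ali')
```

The sketch relies on the values appearing in the same order as the table's columns, as in the
example sentence. A production system would normally pass the values as bound parameters instead of
splicing them into the query text.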

3.3. DELETE Query
Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be easily processed. A ‘DELETE’
query has five parts, as follows:

 DELETE         FROM           Students         WHERE          Age > 25
 Keyword        Keyword        Table Name       Keyword        Condition

       The ‘DELETE’ query typically consists of three keywords, ‘DELETE’, ‘FROM’ and
‘WHERE’. The other two required values are ‘Table Name’ and ‘Condition’. To find the ‘Table Name’ and
‘Condition’ parameters, the grammatical rules of the English language are used, as in the following
example:
                        “I want to delete the students more than 25 years age.”
       Following is the complete analysis of this simple sentence.

Table 03: Generating DELETE Query from text

 Lexicons                         Phase-I                            Phase –II
 I                                Noun                               ---------
 want                             Verb                               ---------
 to                               preposition                        ---------
 delete                           verb                               Action
 the                              article                            ---------
 students                         Noun                               Table Name
 more                             preposition                        Condition
 than                             Noun                               ----------
 25                               Noun                               Value
 years                            Noun                               -----------
 age                              Noun                               Parameter

       For the ‘DELETE’ query, the condition is defined first. In this example the Parameter and Value
are combined with the Condition lexicon to form the ‘Condition’ part of the query. The table name is
again retrieved from the array of all available tables in the database.
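       The way Parameter, Condition and Value combine into a WHERE clause can be sketched as follows.
The `SCHEMA` and `COMPARATORS` tables are hypothetical; the paper does not list which comparison
words the rule set recognizes, and this sketch handles numeric values only.

```python
# Hypothetical schema and comparison-word table, for illustration only.
SCHEMA = {"Students": ["RollNo", "Name", "Age"]}
COMPARATORS = {"more": ">", "greater": ">", "less": "<", "fewer": "<"}


def build_delete(tagged):
    """Build a DELETE query from (lexicon, phase2_role) pairs as in Table 03."""
    found = {}
    for word, role in tagged:
        if role:
            found.setdefault(role, word)  # keep the first lexicon seen for each role

    wanted_table = found.get("Table Name", "").lower()
    table = next((t for t in SCHEMA if t.lower() == wanted_table), None)
    if table is None:
        raise ValueError("no known table name found in the sentence")

    wanted_column = found.get("Parameter", "").lower()
    column = next((c for c in SCHEMA[table] if c.lower() == wanted_column), None)
    operator = COMPARATORS.get(found.get("Condition", "").lower())
    value = found.get("Value")
    if column is None or operator is None or value is None or not value.isdigit():
        raise ValueError("could not assemble a numeric condition")

    return f"DELETE FROM {table} WHERE {column} {operator} {value}"


# Phase-II column of Table 03; None stands for the '---------' entries.
TABLE_03 = [
    ("I", None), ("want", None), ("to", None), ("delete", "Action"),
    ("the", None), ("students", "Table Name"), ("more", "Condition"),
    ("than", None), ("25", "Value"), ("years", None), ("age", "Parameter"),
]

print(build_delete(TABLE_03))  # DELETE FROM Students WHERE Age > 25
```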

4.0. Work Flow of Designed System
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, is capable of automatically generating queries. It performs
its function in a multi-phase procedure. There are four modules in total: text input acquisition,
text comprehension, information retrieval and, ultimately, generation of SQL queries. Following is a
brief description of each of these phases.

4.1. Text input Acquisition
This module acquires the input text scenario. The user provides the business scenario in the form of
text strings. The module reads the input text character by character and generates words by
concatenating the input characters. This module is the implementation of the lexical phase: lexicons
and tokens are generated here. After the lexicons have been generated, further processing can be
performed on the input text.
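The character-by-character construction of lexicons described above might look like the following
Python sketch. The function name `lexicons` and the rule that any non-alphanumeric character ends a
word are assumptions; in particular, this simple rule would split “Roll No.” into two lexicons,
whereas Table 02 treats it as one.

```python
def lexicons(text):
    """Split input text into words by concatenating consecutive alphanumeric characters."""
    words = []
    current = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)              # extend the word being built
        elif current:
            words.append("".join(current))  # a separator closes the current word
            current = []
    if current:                             # flush the last word, if any
        words.append("".join(current))
    return words


print(lexicons("I need names of all students."))
# ['I', 'need', 'names', 'of', 'all', 'students']
```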

                              Figure 01: Lexical analysis of input text string




4.2. Text Comprehension
This module reads the input from module one in the form of words or lexicons. These words are
categorized into various classes as verbs, helping verbs, nouns, pronouns, adjectives, prepositions,
conjunctions, etc. These classes are further used to understand the various parts of the given sentence.
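A lookup-based version of this classification step is sketched below. The word lists in
`CLOSED_CLASSES` are hypothetical and cover only the example sentences; any word not listed falls
back to ‘Noun’, which reproduces the Phase-I column of Table 01. The paper does not specify how the
actual system assigns classes.

```python
# Hypothetical word lists; a real system would use a much larger lexicon.
CLOSED_CLASSES = {
    "Verb": {"need", "want", "insert", "delete"},
    "Helping Verb": {"is", "are", "was", "were"},
    "Preposition": {"of", "to", "in", "from"},
    "Article": {"a", "an", "the"},
    "Conjunction": {"and", "or", "whose"},
}


def tag(words):
    """Assign each word a class from CLOSED_CLASSES, defaulting to 'Noun'."""
    tagged = []
    for word in words:
        word_class = next(
            (name for name, members in CLOSED_CLASSES.items() if word.lower() in members),
            "Noun",
        )
        tagged.append((word, word_class))
    return tagged


for word, word_class in tag(["I", "need", "names", "of", "all", "students"]):
    print(f"{word:10}{word_class}")
```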

                              Figure 02: Parts of speech tagging of input text




4.3. Information Retrieval
This module extracts the keywords of the SQL queries, such as Select, Insert, Delete, From, Into,
Where, etc. Keywords are found by matching the tokens against the given array of all possible
keywords. These keywords are further used to generate the respective queries. Information such as
the table name, field names, number of attributes and logical conditions is also extracted in this
phase.
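Matching tokens against the keyword array can be sketched as below. The contents of `SQL_KEYWORDS`
follow the keywords named in the text; the function name and the case-insensitive comparison are
assumptions. Note that a sentence such as “I need names of all students.” contains no literal SQL
keyword, so the actual system must apply further rules that the paper does not spell out.

```python
SQL_KEYWORDS = ["SELECT", "INSERT", "DELETE", "FROM", "INTO", "WHERE"]


def find_keywords(tokens, keywords=SQL_KEYWORDS):
    """Return the tokens that match a known SQL keyword, upper-cased, in sentence order."""
    known = set(keywords)
    return [token.upper() for token in tokens if token.upper() in known]


tokens = ["I", "want", "to", "insert", "a", "student", "whose",
          "Roll", "No", "is", "5", "and", "Name", "is", "Ali"]
print(find_keywords(tokens))  # ['INSERT']
```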

                                Figure 03: Query information extraction




4.4. SQL Queries generation
This module combines the keywords and the other required parameters for a particular query. The SQL
query is ultimately generated here according to the rules given in the designed algorithm. As a separate
scenario is provided for each type of query, separate functions have been implemented
for each query type.
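One common way to organize "a separate function per query type" is a dispatch table, sketched here
in Python. The function names and the shape of the `info` dictionary are hypothetical; the paper
does not describe how its functions are selected or what arguments they take.

```python
def generate_select(info):
    return f"SELECT {info['required_set']} FROM {info['table']}"


def generate_insert(info):
    return f"INSERT INTO {info['table']} VALUES ({', '.join(info['values'])})"


def generate_delete(info):
    return f"DELETE FROM {info['table']} WHERE {info['condition']}"


# One generator function per supported query type.
GENERATORS = {
    "SELECT": generate_select,
    "INSERT": generate_insert,
    "DELETE": generate_delete,
}


def generate_query(action, info):
    """Pick the generator for `action` and build the query from the extracted `info`."""
    try:
        generator = GENERATORS[action.upper()]
    except KeyError:
        raise ValueError(f"unsupported query type: {action}") from None
    return generator(info)


print(generate_query("delete", {"table": "Students", "condition": "Age > 25"}))
# DELETE FROM Students WHERE Age > 25
```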

                                  Figure 04: Generation of SQL Query




5.0. Results and Analysis
After designing and coding the query generating system, its accuracy and efficiency were tested. To
test the designed system, both simple and complex queries were
generated. Queries from each category (Select, Insert, Delete) were checked.
        15 sample queries were generated, and the results are shown in the following
table.
Table 04: Accuracy ratio of various types of queries

 Types                                Simple                      Complex                Accuracy
 SELECT                                 14                          13                    90%
 INSERT                                 13                          11                    80%
 DELETE                                 14                          12                    87%
 Total Accuracy = 86%


       A matrix representing the accuracy of the query generation test (%) for simple and complex
queries has been constructed. The overall accuracy for all types of queries is determined by
adding the total accuracy of all categories and calculating the average, which is 86% in this case.
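       The arithmetic behind Table 04 can be checked with a few lines of Python. The sketch assumes
that the 15 sample queries apply to each difficulty level of each query type (30 per type); the
paper does not state this explicitly, but it is the reading under which the reported percentages
follow from the counts.

```python
SAMPLES_PER_LEVEL = 15  # assumption: 15 simple and 15 complex queries per query type

# Correctly generated queries per type, as (simple, complex) counts from Table 04.
CORRECT = {"SELECT": (14, 13), "INSERT": (13, 11), "DELETE": (14, 12)}


def accuracy(simple, complex_, per_level=SAMPLES_PER_LEVEL):
    """Percentage of correct queries over both difficulty levels."""
    return 100 * (simple + complex_) / (2 * per_level)


per_type = {name: accuracy(*counts) for name, counts in CORRECT.items()}
overall = sum(per_type.values()) / len(per_type)

for name, value in per_type.items():
    print(f"{name}: {value:.0f}%")         # SELECT: 90%, INSERT: 80%, DELETE: 87%
print(f"Total Accuracy = {overall:.0f}%")  # Total Accuracy = 86%
```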

                              Figure 05: Graphical representation of the results
       [Bar chart: Simple and Complex series for SELECT, INSERT and DELETE; vertical axis from 0 to 14]




       The graph above shows the accuracy ratio of the SELECT, INSERT and DELETE
queries for both simple and complex queries.


6.0. Conclusion
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, helps both users and software engineers by generating SQL queries
automatically. The task of the novice user can be simplified by providing an easy interface that is
familiar to that user. To resolve such issues, automated software is
needed that helps both users and software engineers. The user writes the requirements in a few
statements of simple English, and the designed system analyzes the given script.
After composite analysis and mining of the associated information, the designed system generates the
intended SQL queries, which can be run directly. The designed system can create code
automatically without an external environment. It provides a quick and reliable way to
generate SQL queries, saving time and budget for both the user and the system analyst. A
graphical user interface has also been provided to the user for entering the input scenario in a proper
way and generating SQL queries.


7.0. Future Work
There is still room for improvement in the algorithms for generating the intended SQL queries.
The current accuracy of query generation is about 80% to 85%. It could be raised to as much as 95% by
improving the algorithms and adding a learning capability to the system. In this research only three
types of queries have been addressed: SELECT, INSERT and DELETE. Other
types of queries still require a solution.

References
[1]    Allen, J. (1994) Natural Language Understanding. Benjamin-Cummings Publishing Company,
       New York.
[2]    Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language
       Structure and Use. Cambridge Univ. Press, Cambridge, U.K.
[3]    D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003) A Comprehensive XQuery to SQL
       Translation using Dynamic Interval Encoding. In SIGMOD.
[4]    C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database
       queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and
       Language Acquisition (1997).
[5]    Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill,
       New York.
[6]    A. Rosenthal, D. Reiner, Extending the Algebraic Framework of Query Processing to Handle
       Outer joins, Proc. VLDB Singapore 1984. pp. 334-343.
[7]    Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing
       for document retrieval. Journal of the American Society for Information Science, 40 (2), 115–
       132.
[8]    J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic
       programming, in: Proceedings of the 11th National Conference on Artificial Intelligence
       (AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.
[9]    Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer,
       Boston.
[10]   Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM
       Transactions on Information Systems, 10, 115–141.
[11]   Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal
       of the American Society for Information Science, 39(1), 8–16.
[12]   Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information
       retrieval and filtering: An empirical basis for grammatical rules. Information Processing and
       Management, 32(2), 185–197.
[13]   Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language
       Processing. MIT Press, Cambridge, Mass.
[14]   Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics.
       Kluwer, Dordrecht, The Netherlands.

Mais conteúdo relacionado

Mais procurados

_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1..._var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
Ioan Tuns
 

Mais procurados (10)

CV_for_house_withPhotoSign_180716
CV_for_house_withPhotoSign_180716CV_for_house_withPhotoSign_180716
CV_for_house_withPhotoSign_180716
 
NL based Object Oriented modeling - EJSR 35(1)
NL based Object Oriented modeling - EJSR 35(1)NL based Object Oriented modeling - EJSR 35(1)
NL based Object Oriented modeling - EJSR 35(1)
 
Automated Java Code Generation (ICDIM 2006)
Automated Java Code Generation (ICDIM 2006)Automated Java Code Generation (ICDIM 2006)
Automated Java Code Generation (ICDIM 2006)
 
Your Guide to be a Software Engineer
Your Guide to be a Software EngineerYour Guide to be a Software Engineer
Your Guide to be a Software Engineer
 
_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1..._var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
_var_www_moodledata_temp_turnitintooltwo_1014058337._Ioan_Tuns-HNDCSD-PJ-19-1...
 
rams.fresher.oracle
rams.fresher.oraclerams.fresher.oracle
rams.fresher.oracle
 
A c program of Phonebook application
A c program of Phonebook applicationA c program of Phonebook application
A c program of Phonebook application
 
UML Generator (NCC18)
UML Generator (NCC18)UML Generator (NCC18)
UML Generator (NCC18)
 
Resume
ResumeResume
Resume
 
UCD Generator (ICIET 2007)
UCD Generator (ICIET 2007)UCD Generator (ICIET 2007)
UCD Generator (ICIET 2007)
 

Semelhante a NL Interface for Database - EJSR 20(4)

Final Total Preliminary Report
Final Total Preliminary ReportFinal Total Preliminary Report
Final Total Preliminary Report
Mrugen Deshmukh
 
Coverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-QueriesCoverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-Queries
Mohamed Reda
 
Sql tutorial, tutorials sql
Sql tutorial, tutorials sqlSql tutorial, tutorials sql
Sql tutorial, tutorials sql
Vivek Singh
 
A Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor LanguageA Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor Language
Sravanthi Mullapudi
 

Semelhante a NL Interface for Database - EJSR 20(4) (20)

Pattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to DatabasePattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to Database
 
IBM-TGMC e-learning resource locator_project report
IBM-TGMC e-learning resource locator_project reportIBM-TGMC e-learning resource locator_project report
IBM-TGMC e-learning resource locator_project report
 
E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)E learning resource Locator Project Report (J2EE)
E learning resource Locator Project Report (J2EE)
 
IRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET- Natural Language Query Processing
IRJET- Natural Language Query Processing
 
IRJET - Voice based Natural Language Query Processing
IRJET -  	  Voice based Natural Language Query ProcessingIRJET -  	  Voice based Natural Language Query Processing
IRJET - Voice based Natural Language Query Processing
 
Final Total Preliminary Report
Final Total Preliminary ReportFinal Total Preliminary Report
Final Total Preliminary Report
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
Hindi language as a graphical user interface to relational database for tran...
Hindi language as a graphical user interface to relational  database for tran...Hindi language as a graphical user interface to relational  database for tran...
Hindi language as a graphical user interface to relational database for tran...
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
 
Pl sql content
Pl sql contentPl sql content
Pl sql content
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMINTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAM
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
Coverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-QueriesCoverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-Queries
 
NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)
 
DBMS_final_ppt_grp3.pptx
DBMS_final_ppt_grp3.pptxDBMS_final_ppt_grp3.pptx
DBMS_final_ppt_grp3.pptx
 
Sql tutorial, tutorials sql
Sql tutorial, tutorials sqlSql tutorial, tutorials sql
Sql tutorial, tutorials sql
 
Chapter no 1
Chapter no 1Chapter no 1
Chapter no 1
 
A Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor LanguageA Dialogue System for Telugu, a Resource-Poor Language
A Dialogue System for Telugu, a Resource-Poor Language
 
methods and resources
methods and resourcesmethods and resources
methods and resources
 
IRJET- Voice based Billing System
IRJET-  	  Voice based Billing SystemIRJET-  	  Voice based Billing System
IRJET- Voice based Billing System
 

Mais de IT Industry

Mais de IT Industry (15)

The News Today 24 (https://thenewstoday24.com/)
The News Today 24 (https://thenewstoday24.com/)The News Today 24 (https://thenewstoday24.com/)
The News Today 24 (https://thenewstoday24.com/)
 
Meaning Extraction - IJCTE 2(1)
Meaning Extraction - IJCTE 2(1)Meaning Extraction - IJCTE 2(1)
Meaning Extraction - IJCTE 2(1)
 
Requirement Analysis - ijcee 2(3)
Requirement Analysis - ijcee 2(3)Requirement Analysis - ijcee 2(3)
Requirement Analysis - ijcee 2(3)
 
Virtual Telemedicine (IJITWE 5(1))
Virtual Telemedicine (IJITWE 5(1))Virtual Telemedicine (IJITWE 5(1))
Virtual Telemedicine (IJITWE 5(1))
 
NL to OCL Transformation (EDOC 2010)
NL to OCL Transformation (EDOC 2010)NL to OCL Transformation (EDOC 2010)
NL to OCL Transformation (EDOC 2010)
 
BPM & SOA for Small Business Enterprises (ICIME 2009)
BPM & SOA for Small Business Enterprises (ICIME 2009)BPM & SOA for Small Business Enterprises (ICIME 2009)
BPM & SOA for Small Business Enterprises (ICIME 2009)
 
Web Layout Mining - JECS 29(2)
Web Layout Mining - JECS 29(2)Web Layout Mining - JECS 29(2)
Web Layout Mining - JECS 29(2)
 
Web User Forms (ICOMMS 2006)
Web User Forms (ICOMMS 2006)Web User Forms (ICOMMS 2006)
Web User Forms (ICOMMS 2006)
 
Image Classification (icast 2006)
Image Classification  (icast 2006)Image Classification  (icast 2006)
Image Classification (icast 2006)
 
Reuse Software Components (IMS 2006)
Reuse Software Components (IMS 2006)Reuse Software Components (IMS 2006)
Reuse Software Components (IMS 2006)
 
GIS for Quetta (ICAST 2006)
GIS for Quetta (ICAST 2006)GIS for Quetta (ICAST 2006)
GIS for Quetta (ICAST 2006)
 
NL Context Understanding 23(6)
NL Context Understanding 23(6)NL Context Understanding 23(6)
NL Context Understanding 23(6)
 
Web Layout Generation (IC-SCCE 2006)
Web Layout Generation (IC-SCCE 2006)Web Layout Generation (IC-SCCE 2006)
Web Layout Generation (IC-SCCE 2006)
 
PCA Clouds (ICET 2005)
PCA Clouds (ICET 2005)PCA Clouds (ICET 2005)
PCA Clouds (ICET 2005)
 
Feature Based Image Classification by using Principal Component Analysis
Feature Based Image Classification by using Principal Component AnalysisFeature Based Image Classification by using Principal Component Analysis
Feature Based Image Classification by using Principal Component Analysis
 

NL Interface for Database - EJSR 20(4)

  • 1. European Journal of Scientific Research ISSN 1450-216X Vol.20 No.4 (2008), pp.844-851 © EuroJournals Publishing, Inc. 2008 http://www.eurojournals.com/ejsr.htm Database Interfacing using Natural Language Processing Imran Sarwar Bajwa Department of Computer Science and IT, The Islamia University of Bahawalpur E-mail: imransbajwa@gmail.com Shahzad Mumtaz Department of Computer Science and IT, The Islamia University of Bahawalpur E-mail: shahzadz22@hotmail.com M. Shahid Naweed Department of Computer Science and IT, The Islamia University of Bahawalpur E-mail: shahid_naweed@hotmail.com Abstract To write technically correct SQL queries is a complex and skill requiring task especially for a novel user. This situation becomes more complex when a low skilled person has to use a database management system for a specific business purpose. S/He has to write some quires at his own and perform various tasks. This scenario requires more expertise and skills in terms of understanding and writing the accurate and functional queries. The task of the novel user can be simplified by providing an easy interface that is well known to that user. In order to resolve all such issues, automated software is needed, which facilitates both users and software engineers. User writes the requirements in simple English in a few statements and the designed system has the ability to analyze the given script. After composite analysis and mining of associated information, the designed system generates the intended SQL queries that can be run directly. The paper describes a system that can create SQL queries automatically. The designed system provides a quick and reliable way to generate SQL queries to save time and budget of both the user and system analyst. Keywords: Information extraction, Automatic Query Generation, Knowledge Retrieval, Natural language processing. 1.0. Introduction Relational databases are the premier way of storing common data repositories. 
After storing the data contents in a database, an interfacing mechanism is required to talk to the prearranged repository of the stored data. The conventional way of communicating with a database is to first build a connection stream and then add, delete or update the data contents in the database by using a standardized interfacing mechanism [1]. Simple command shells are typically used, and they are often incorporated within every distinct database product. These command shells are typically simple filters which help a user to log on to the database, execute particular commands and receive output. They provide access to the database from the machine on which the database is actually running [2]. After connecting to a particular database, a user or a programmer requires an interface, and typically that
  • 2. interface is provided by some technical languages. These languages are called query languages and consist of the database commands typically used for posing questions to a particular database and getting the intended response. SQL [3] (Structured Query Language) is the most popular query language and is the de facto language of databases today. SQL is the standard tool of database querying. Different database management systems implement this standardized language with minor alterations and adjustments. However, in spite of these proprietary extensions by the vendors, the core of this querying language is the same in all environments. From an application programmer's point of view, the major novelty in the relational database is that one uses a declarative query language, SQL. Most computer languages are procedural: the programmer tells the computer what to do, step by step, specifying a procedure. Using the SQL interface, the programmer states the requirements and questions, and the RDBMS query planner figures out how to answer them [5]. There are two advantages of using a declarative language. The first is that the queries no longer depend on the data representation: the RDBMS is free to store data according to its own design requirements [6]. The second is improved software reliability. For various web-based and stand-alone applications, generic SQL is used to keep things simple and straightforward. Despite these advantages, SQL's technical interface makes the language tedious and difficult to learn and use. It is hard to remember the SQL commands and use them accurately and precisely. To resolve these issues, automated software is needed that serves both users and software engineers.
With such software, the time it takes to explore all the facilities and services should be well under a minute, and this information is quite useful for the users.
2.0. Problem Description
Modern software engineering requires quick, automated solutions that can create accurate and precise SQL queries automatically. For complex queries, even an expert programmer needs assistance in the form of automatic query generation. The programmer can use the automatically generated queries after making appropriate adjustments and alterations, with less effort and in less time than with traditional approaches. The novice user's task can be simplified by providing an easy interface that is more familiar to that user. To resolve these issues, automated software is needed that serves both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After composite analysis and mining of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system can create code automatically without an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst.
3.0. Used Methodology
The understanding and multi-aspect processing of natural languages, also termed "speech languages", is one of the topics of greatest interest in the field of artificial intelligence [8]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based on informal grammars. Geographical, psychological and sociological factors influence the behaviour of natural languages [12].
Their sets of words are open-ended and vary from area to area and from time to time. Due to these variations and inconsistencies, natural languages have different flavours; English, for example, has more than half a dozen well-known flavours around the world [14]. These flavours differ in accent, vocabulary and phonological aspects. These discrepancies and inconsistencies make natural languages more difficult to process than formal languages [13].
  • 3. English-language statements are converted into SQL queries by using the newly designed rule-based algorithm. The SELECT query is the common query used to choose a set of values from a table [4]. A college database has been used as the example in this research. Student data is retrieved, inserted and deleted by queries generated automatically from simple English text.
3.1. SELECT Query
First of all, the ‘SELECT’ query has been processed. A ‘SELECT’ query has four parts, as follows:
SELECT     *              FROM       Students
Keyword    Required Set   Keyword    Table Name
A ‘SELECT’ query can easily be generated from the provided input string, as it has two fixed keywords, ‘SELECT’ and ‘FROM’. The other two required values are ‘Required Set’ and ‘Table Name’. To process the natural-language text and find ‘Required Set’ and ‘Table Name’, the conventional norms of the English language and grammatical rules are used. The conventional structure of a simple English sentence is the key to comprehending and analyzing natural-language text [13], as in the following example: “I need names of all students.” The complete analysis of this simple sentence follows.
Table 01: Generating SELECT Query from text
Lexicons    Phase-I        Phase-II
I           Noun           ----------
need        Verb           ----------
names       Noun           Field Name
of          Preposition    ----------
all         Noun           *
students    Noun           Table Name
In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table Name’ field is filled by the ‘Table Name’ attribute, as follows: SELECT * FROM Students. Here the table name is searched for in the array of all tables available in the database. From the available tables, the nearest table name is picked, which is ‘Students’ in this example.
3.2. INSERT Query
After the ‘SELECT’ query, the ‘INSERT’ query has been processed.
An ‘INSERT’ query has five fragments, as follows:
INSERT     INTO       Students      VALUES     (5, ‘Ali’)
Keyword    Keyword    Table Name    Keyword    Record
An ‘INSERT’ query can also be produced from the given statement, as it has three fixed keywords, ‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are ‘Table Name’ and ‘Record’. Using the same rule-based algorithm, ‘Table Name’ and ‘Record’ are extracted, as in the following example: “I want to insert a student whose Roll No. is 5 and Name is Ali.” The complete analysis of this simple sentence follows.
  • 4. Table 02: Generating INSERT Query from text
Lexicons    Phase-I         Phase-II
I           Noun            ----------
want        Verb            ----------
to          Preposition     ----------
insert      Verb            Action
a           Article         ----------
student     Noun            Table Name
whose       Conjunction     ----------
Roll No     Noun            Attribute
is          Helping Verb    ----------
5           Noun            Value
and         Conjunction     ----------
Name        Noun            Attribute
is          Helping Verb    ----------
Ali         Noun            Value
In this example the ‘Record’ field is filled by the extracted ‘Value’ items and the ‘Table Name’ field is filled by the ‘Table Name’ attribute. Here the table name is searched for in the array of all tables available in the database. From the available tables, the nearest table name is picked, which is ‘Students’ in this example.
3.3. DELETE Query
Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be processed easily. A ‘DELETE’ query has five parts, as follows:
DELETE     FROM       Students      WHERE      Age > 25
Keyword    Keyword    Table Name    Keyword    Condition
The ‘DELETE’ query typically contains three keywords, ‘DELETE’, ‘FROM’ and ‘WHERE’. The other two required values are ‘Table Name’ and ‘Condition’. To find the ‘Table Name’ and ‘Condition’ parameters, the grammatical rules of the English language are used, as in the following example: “I want to delete the students more than 25 years age.” The complete analysis of this simple sentence follows.
Table 03: Generating DELETE Query from text
Lexicons    Phase-I        Phase-II
I           Noun           ----------
want        Verb           ----------
to          Preposition    ----------
delete      Verb           Action
the         Article        ----------
students    Noun           Table Name
more        Preposition    Condition
than        Noun           ----------
25          Noun           Value
years       Noun           ----------
age         Noun           Parameter
For the ‘DELETE’ query, the condition is defined first. In this example, Parameter and Value are combined with the Condition item to form the condition (Age > 25). The table name is again retrieved from the array of all tables available in the database.
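The rule-based analyses of Sections 3.1 to 3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TABLES` schema and its column names, the `COMPARATORS` map and all function names are hypothetical, and `difflib` string similarity stands in for the "nearest table name" lookup, which the paper does not specify.

```python
import difflib
import re

# Hypothetical schema: the paper only names a Students table; the column
# names are assumptions made for this illustration.
TABLES = {"Students": ["RollNo", "Name", "Age"]}

# Assumed mapping from comparison phrases to SQL operators.
COMPARATORS = {"more than": ">", "less than": "<"}


def nearest_table(word):
    """Return the table whose name is closest to the word, or None."""
    names = list(TABLES)
    lowered = [name.lower() for name in names]
    match = difflib.get_close_matches(word.lower(), lowered, n=1, cutoff=0.6)
    return names[lowered.index(match[0])] if match else None


def required_set(tokens, table):
    """'all' maps to '*', as in Table 01; otherwise match column names."""
    if "all" in tokens:
        return "*"
    cols = [c for c in TABLES[table]
            if any(t.rstrip("s") == c.lower() for t in tokens)]
    return ", ".join(cols) or "*"


def condition(sentence, table):
    """Combine Parameter, Condition and Value, as in Table 03."""
    text = sentence.lower()
    words = re.findall(r"[a-z]+", text)
    for phrase, op in COMPARATORS.items():
        found = re.search(phrase + r"\s+(\d+)", text)
        column = next((c for c in TABLES[table] if c.lower() in words), None)
        if found and column:
            return f"{column} {op} {found.group(1)}"
    return None


def to_sql(sentence):
    """Translate one simple English sentence into a SQL string."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    table = next((t for t in map(nearest_table, tokens) if t), None)
    if table is None:
        raise ValueError("no table name recognised")
    if "insert" in tokens:
        # Every value follows the helping verb 'is' (Table 02).
        values = re.findall(r"\bis\s+(\w+)", sentence)
        record = ", ".join(v if v.isdigit() else f"'{v}'" for v in values)
        return f"INSERT INTO {table} VALUES ({record})"
    if "delete" in tokens:
        cond = condition(sentence, table)
        if cond is None:
            raise ValueError("no condition recognised")
        return f"DELETE FROM {table} WHERE {cond}"
    return f"SELECT {required_set(tokens, table)} FROM {table}"


if __name__ == "__main__":
    for text in ("I need names of all students.",
                 "I want to insert a student whose Roll No. is 5 and Name is Ali.",
                 "I want to delete the students more than 25 years age."):
        print(to_sql(text))
```

The sketch builds SQL by string formatting only to mirror the query shapes shown in the text; code that executes such queries against a real database would normally pass the values as bound parameters instead.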
  • 5. 4.0. Work Flow of Designed System
The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, is capable of generating queries automatically. It performs its function in a multi-phase procedure. There are four modules in total: text input acquisition, text comprehension, information retrieval and, ultimately, generation of SQL queries. A brief description of each phase follows.
4.1. Text input Acquisition
This module acquires the input text scenario. The user provides the business scenario in the form of text strings. The module reads the input text as characters and generates words by concatenating the input characters. This module implements the lexical phase: lexicons and tokens are generated here. After lexicon generation, further processing can be performed on the input text.
Figure 01: Lexical analysis of input text string
4.2. Text Comprehension
This module reads the output of the first module in the form of words or lexicons. These words are categorized into various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions, conjunctions, etc. These classes are then used to understand the various parts of the given sentence.
Figure 02: Parts of speech tagging of input text
4.3. Information Retrieval
This module extracts the keywords of the SQL queries, such as Select, Insert, Delete, From, Into, Where, etc. Keywords are found by matching the tokens against the given array of all possible keywords. These key
  • 6. words are further used to generate the respective queries. Information such as table name, field name, number of attributes and logical conditions is also extracted in this phase.
Figure 03: Query information extraction
4.4. SQL Queries generation
This module combines the keywords and the other required parameters for a particular query. The SQL query is ultimately generated here according to the rules of the designed algorithm. As a separate scenario is provided for each type of query, separate functions have been implemented for each query type.
Figure 04: Generation of SQL Query
5.0. Results and Analysis
After designing and coding the query-generating system, its accuracy and efficiency were tested. To test the queries generated by the designed system, both simple and complex queries were generated. Queries from each category (Select, Insert, Delete) were checked. 15 sample queries were generated and the results are shown in the following table.
  • 7. Table 04: Accuracy ratio of various types of queries
Types     Simple    Complex    Total
SELECT    14        13         90%
INSERT    13        11         80%
DELETE    14        12         87%
Total Accuracy = 86%
A matrix representing the accuracy (%) of the query generation test for simple and complex queries has been constructed. The overall accuracy for all types of queries is determined by adding the total accuracy of all categories and calculating the average, which is 86% in this case.
Figure 05: Graphical representation of the results (bar chart of Simple and Complex results for SELECT, INSERT and DELETE queries)
The graph above shows the accuracy ratio of SELECT, INSERT and DELETE queries for simple and complex queries.
6.0. Conclusion
The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, helps both users and software engineers by generating SQL queries automatically. The novice user's task can be simplified by providing an easy interface that is more familiar to that user. To resolve these issues, automated software is needed that serves both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After composite analysis and mining of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system can create code automatically without an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst. A graphical user interface has also been provided for entering the input scenario properly and generating the SQL queries.
7.0. Future Work
There is still room for improvement in the algorithms for generating the intended SQL queries. The current accuracy of query generation is about 80% to 85%. It can be raised to as much as 95% by improving the algorithms and adding a learning ability to the system. In this research only three types of queries have been addressed: SELECT, INSERT and DELETE. Other types of queries still require an adequate solution.
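The percentages reported in Table 04 can be reproduced with a short calculation. The sketch below assumes that each cell of the table counts correct queries out of 15 test queries per query type and difficulty level; the text mentions 15 sample queries, and this reading matches the published percentages, but the denominator is an interpretation rather than something the paper states explicitly.

```python
# Correct-query counts (simple, complex) copied from Table 04.
correct = {"SELECT": (14, 13), "INSERT": (13, 11), "DELETE": (14, 12)}

# Assumption: 15 test queries per query type and per difficulty level.
QUERIES_PER_LEVEL = 15

# Per-type accuracy: correct queries over all queries of that type.
accuracy = {
    kind: 100 * (simple + complex_) / (2 * QUERIES_PER_LEVEL)
    for kind, (simple, complex_) in correct.items()
}

# Overall accuracy: the average of the per-type accuracies.
overall = sum(accuracy.values()) / len(accuracy)

for kind, value in accuracy.items():
    print(f"{kind}: {value:.0f}%")
print(f"Total Accuracy = {overall:.0f}%")
```

Under this assumption the per-type values round to 90%, 80% and 87%, and their average rounds to 86%, in agreement with the table.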
  • 8. References
[1] Allen, J. (1994). Natural Language Understanding. Benjamin-Cummings Publishing Company, New York.
[2] Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge Univ. Press, Cambridge, U.K.
[3] D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003) A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding. In SIGMOD.
[4] C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and Language Acquisition (1997).
[5] Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.
[6] A. Rosenthal, D. Reiner, Extending the Algebraic Framework of Query Processing to Handle Outer joins, Proc. VLDB Singapore 1984, pp. 334-343.
[7] Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2), 115–132.
[8] J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic programming, in: Proceedings of the 11th National Conference on Artificial Intelligence (AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.
[9] Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer, Boston.
[10] Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10, 115–141.
[11] Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal of the American Society for Information Science, 39(1), 8–16.
[12] Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules. Information Processing and Management, 32(2), 185–197.
[13] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass.
[14] Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics. Kluwer, Dordrecht, The Netherlands.