The document describes a system that can automatically generate SQL queries from natural language input. It discusses how the system works in multiple phases: it first acquires text input, then analyzes the text to comprehend it and extract necessary information like table names and conditions. It then generates the SQL queries based on this extracted information and predefined rules. The system was tested on sample queries and showed 80-90% accuracy in generating simple and complex queries for operations like select, insert and delete. While accurate, the authors note room for improving the algorithms to achieve higher than 85% accuracy and handle more types of queries.
2. Database Interfacing using Natural Language Processing 845
interface is provided by some technical languages. These languages are called query languages and
consist of the database commands typically used to pose questions to a particular database and
obtain the intended response. SQL [3] (Structured Query Language) is the most popular query language
and is the de facto language of databases today. SQL is the standard tool for database
querying. Different database management systems implement this standardized language with minor
alterations and adjustments. However, in spite of these proprietary extensions by the vendors, the core
of this querying language is the same in all environments.
From an application programmer's point of view, the major novelty in the relational database is
that one uses a declarative query language, SQL. Most computer languages are procedural. The
programmer tells the computer what to do, step by step, specifying a procedure. Using the SQL interface,
the programmer states the requirements and questions, and the RDBMS query planner figures out how
to answer them [5]. There are two advantages to using a declarative language. The first is that the queries
no longer depend on the data representation. The RDBMS is free to store data according to its own design
requirements [6]. The second is improved software reliability. For various web-based
and stand-alone applications, generic SQL is used to keep things simple and straightforward.
Despite these advantages, SQL’s technical interface makes the
language tedious and difficult to learn and use. It is hard to remember the SQL
commands and to use them accurately and precisely.
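The contrast between the procedural and the declarative style can be illustrated with a minimal sketch. The table, its columns and its rows are hypothetical and serve only this illustration; Python's built-in sqlite3 module stands in for an RDBMS.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (RollNo INTEGER, Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(1, "Sara", 22), (2, "Omar", 27), (3, "Hina", 31)],
)

# Procedural style: fetch every row and filter step by step in application code.
procedural = []
for roll_no, name, age in conn.execute(
        "SELECT RollNo, Name, Age FROM Students ORDER BY RollNo"):
    if age > 25:
        procedural.append(name)

# Declarative style: state the condition and let the query planner decide
# how to evaluate it.
declarative = [row[0] for row in conn.execute(
    "SELECT Name FROM Students WHERE Age > 25 ORDER BY RollNo")]

print(procedural)
print(declarative)
```

Both lists contain the same names; only the second form leaves the evaluation strategy to the database.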
To resolve these issues, automated software is needed that serves both users
and software engineers. With such software, the time taken to explore all of its facilities
and services should be well under a minute, which is useful for the users.
2.0. Problem Description
Modern software engineering requires quick, automated solutions that can create
accurate and precise SQL queries automatically. For complex queries, even an expert programmer
benefits from automatic query generation. The programmer can use the automatically generated
queries after making appropriate adjustments, with less effort and in less time
than with traditional approaches.
The task of the novice user can be simplified by providing an easy interface that is
familiar to that user. To resolve these issues, automated software is needed that
serves both users and software engineers. The user writes the requirements in simple English in a few
statements, and the designed system analyzes the given text. After
analysis and extraction of the associated information, the designed system generates the intended SQL queries,
which can be run directly. The designed system creates code automatically without
an external environment. The designed system provides a quick and reliable way to generate SQL queries
and saves the time and budget of both the user and the system analyst.
3.0. Used Methodology
The understanding and multi-aspect processing of natural languages, also termed "speech
languages", is one of the topics of greatest interest in the field of artificial intelligence
[8]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based
on informal grammars. Geographical, psychological and sociological factors
influence the behaviour of natural languages [12]. Their vocabularies are open-ended, and they
change from area to area and from time to time. Due to these variations and inconsistencies, natural
languages have different flavours; the English language has more than half a dozen well-known flavours
around the world [14]. These flavours have different accents, vocabularies and phonological
aspects. These discrepancies and inconsistencies make natural languages
more difficult to process than formal languages [13].
3. 846 Imran Sarwar Bajwa, Shahzad Mumtaz and M. Shahid Naweed
English language statements are converted into a SQL query by using the
newly designed rule-based algorithm. The Select query is the common query used to choose a set of values
from a table [4]. An example of a college database has been used in the conducted research. Student
data is retrieved, inserted and deleted by queries generated automatically from simple English
text.
3.1. SELECT Query
The ‘SELECT’ query is processed first. A ‘SELECT’ query has four parts, as follows:
SELECT     *              FROM       Students
Keyword    Required Set   Keyword    Table Name
A ‘SELECT’ query can easily be generated from the provided input string, as there are two
keywords, ‘SELECT’ and ‘FROM’. The other two required values are ‘Required Set’ and ‘Table Name’.
To process the speech language text and find ‘Required Set’ and ‘Table Name’, the conventional norms
of the English language and its grammatical rules are used. The conventional structure of a simple English
sentence is the key to comprehending and analyzing natural language text [13], as in the
following example:
“I need names of all students.”
The complete analysis of this simple sentence follows.
Table 01: Generating SELECT Query from text
Lexicons     Phase-I        Phase-II
I            Noun           ----------
need         Verb           ----------
names        Noun           Field Name
of           Preposition    ----------
all          Noun           *
students     Noun           Table Name
In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table
Name’ field is filled by the ‘Table Name’ attribute, as follows:
SELECT * FROM Students
Here the table name is searched for in the array of all available tables in the database. From all
available tables, the nearest table name is picked, which is ‘students’ in this example.
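The two phases of Table 01 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the list of tables is hypothetical, table names are matched exactly rather than by nearness, and only requests containing the word "all" are handled.

```python
import re

# Hypothetical schema: the paper uses a college database but does not list its tables.
TABLES = ["Students", "Teachers", "Courses"]

def build_select(sentence, tables=TABLES):
    """Build a SELECT query from a simple English request (illustrative only)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    by_lower_name = {name.lower(): name for name in tables}
    # Phase II: the first token that names a known table becomes 'Table Name'.
    table = next((by_lower_name[t] for t in tokens if t in by_lower_name), None)
    if table is None:
        raise ValueError("no known table name found in the input")
    # Following Table 01, the word "all" supplies the required set '*'.
    # Handling of specific field lists is left out of this sketch.
    if "all" not in tokens:
        raise ValueError("this sketch only handles requests containing 'all'")
    return f"SELECT * FROM {table}"

print(build_select("I need names of all students."))
```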
3.2. INSERT Query
After the ‘SELECT’ query, the ‘INSERT’ query is processed. An ‘INSERT’ query has five parts, as
follows:
INSERT     INTO       Students      VALUES     (5, 'Ali')
Keyword    Keyword    Table Name    Keyword    Record
An ‘INSERT’ query can also be produced from the given statement, as there are three keywords,
‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are ‘Table Name’ and
‘Record’. Using the same rule-based algorithm, ‘Table Name’ and ‘Record’ are extracted, as in the
following example:
“I want to insert a student whose Roll No. is 5 and Name is Ali.”
The complete analysis of this simple sentence follows.
Table 02: Generating INSERT Query from text
Lexicons     Phase-I         Phase-II
I            Noun            ----------
want         Verb            ----------
to           Preposition     ----------
insert       Verb            Action
a            Article         ----------
student      Noun            Table Name
whose        Conjunction     ----------
Roll No      Noun            Attribute
is           Helping Verb    ----------
5            Noun            Value
and          Conjunction     ----------
Name         Noun            Attribute
is           Helping Verb    ----------
Ali          Noun            Value
In this example the ‘Record’ field is filled by the ‘Value’ attributes and the ‘Table
Name’ field is filled by the ‘Table Name’ attribute. Here the table name is searched for in the array of
all available tables in the database. From all available tables, the nearest table name is picked, which is
‘students’ in this example.
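The extraction of ‘Table Name’ and ‘Record’ shown in Table 02 can be sketched as follows. The table list, the naive singular rule and the "Attribute is Value" clause pattern are assumptions made for this illustration; they are not taken from the paper.

```python
import re

TABLES = ["Students", "Teachers", "Courses"]  # hypothetical schema

def singular(word):
    """Naive singular form: drop one trailing 's' (an assumption, not a real stemmer)."""
    return word[:-1] if word.endswith("s") else word

def sql_literal(value):
    """Render numbers bare and everything else as a quoted SQL string."""
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return value
    return "'" + value.replace("'", "''") + "'"

def build_insert(sentence, tables=TABLES):
    """Build an INSERT query from 'insert a <table> whose A is x and B is y'."""
    text = sentence.strip().rstrip(".")
    head, sep, tail = text.partition(" whose ")
    if not sep:
        raise ValueError("expected a 'whose ...' clause listing the values")
    by_singular = {singular(name.lower()): name for name in tables}
    head_tokens = re.findall(r"[a-z]+", head.lower())
    table = next((by_singular[singular(t)] for t in head_tokens
                  if singular(t) in by_singular), None)
    if table is None:
        raise ValueError("no known table name found in the input")
    values = []
    for clause in re.split(r"\s+and\s+", tail):
        # Each clause is expected to read "<Attribute> is <Value>".
        attribute, sep, value = clause.partition(" is ")
        if not sep:
            raise ValueError(f"cannot parse clause: {clause!r}")
        values.append(sql_literal(value.strip()))
    return f"INSERT INTO {table} VALUES ({', '.join(values)})"

print(build_insert("I want to insert a student whose Roll No. is 5 and Name is Ali."))
```

The generated string is meant for display or review. Code that executes such a statement would normally pass the values as query parameters instead of embedding them in the SQL text.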
3.3. DELETE Query
Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be processed easily. A ‘DELETE’
query has five parts, as follows:
DELETE     FROM       Students      WHERE      Age > 25
Keyword    Keyword    Table Name    Keyword    Condition
The ‘DELETE’ query typically consists of three keywords, ‘DELETE’, ‘FROM’ and
‘WHERE’. The other two required values are ‘Table Name’ and ‘Condition’. To find the ‘Table Name’ and
‘Condition’ parameters, the grammatical rules of the English language are used, as in the following
example:
“I want to delete the students more than 25 years age.”
The complete analysis of this simple sentence follows.
Table 03: Generating DELETE Query from text
Lexicons     Phase-I        Phase-II
I            Noun           ----------
want         Verb           ----------
to           Preposition    ----------
delete       Verb           Action
the          Article        ----------
students     Noun           Table Name
more         Preposition    Condition
than         Noun           ----------
25           Noun           Value
years        Noun           ----------
age          Noun           Parameter
For the ‘DELETE’ query, the condition is defined first. In this example the Parameter and Value are
combined with the Condition operator to form the condition, here Age > 25. The table name is again retrieved from the array of
all available tables in the database.
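How Parameter, Condition and Value might be combined can be sketched as below. The table list, the column list and the mapping from comparison phrases to SQL operators are all hypothetical; the paper does not list its rules in this form.

```python
import re

TABLES = ["Students", "Teachers", "Courses"]   # hypothetical schema
FIELDS = ["Age", "Name", "RollNo"]             # hypothetical columns

# Hypothetical mapping from comparison phrases to SQL operators.
COMPARATORS = {
    "more than": ">",
    "greater than": ">",
    "less than": "<",
    "at least": ">=",
    "at most": "<=",
}

def build_delete(sentence, tables=TABLES, fields=FIELDS):
    """Build a DELETE query with a single numeric condition (illustrative only)."""
    text = sentence.lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    table = next((t for t in tables if t.lower() in tokens), None)
    if table is None:
        raise ValueError("no known table name found in the input")
    # Condition and Value: a comparison phrase followed by a number.
    phrases = "|".join(re.escape(p) for p in COMPARATORS)
    match = re.search(rf"\b({phrases})\s+(\d+)\b", text)
    if match is None:
        raise ValueError("no comparison with a numeric value found")
    operator, value = COMPARATORS[match.group(1)], match.group(2)
    # Parameter: the first known column mentioned in the sentence.
    parameter = next((f for f in fields if f.lower() in tokens), None)
    if parameter is None:
        raise ValueError("no known field name found in the input")
    return f"DELETE FROM {table} WHERE {parameter} {operator} {value}"

print(build_delete("I want to delete the students more than 25 years age."))
```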
4.0. Work Flow of Designed System
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, is capable of automatically generating queries. It performs
its function in a multi-phase procedure. There are four modules in total: text input acquisition,
text comprehension, information retrieval and, ultimately, generation of SQL queries. A
brief description of each phase follows.
4.1. Text input Acquisition
This module acquires the input text scenario. The user provides the business scenario in the form of strings
of text. The module reads the input text character by character and generates words by
concatenating the input characters. This module is the implementation of the lexical phase. Lexicons
and tokens are generated in this module. After the lexicons are generated, further processing can be
performed on the input text.
Figure 01: Lexical analysis of input text string
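The lexical phase described above, reading characters and concatenating them into words, can be sketched as follows. Treating every non-alphanumeric character as a separator is an assumption of this sketch.

```python
def tokenize(text):
    """Read the input character by character and concatenate characters into words."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)          # still inside a word
        elif current:
            tokens.append("".join(current))  # a separator ends the current word
            current = []
    if current:
        tokens.append("".join(current))      # flush the last word
    return tokens

print(tokenize("I need names of all students."))
```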
4.2. Text Comprehension
This module reads the input from the first module in the form of words or lexicons. These words are
categorized into classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions,
conjunctions, etc. These classes are then used to understand the various parts of the given sentence.
Figure 02: Parts of speech tagging of input text
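A dictionary lookup is one simple way to categorize words into classes. The lexicon below is hypothetical and hand-written for this sketch; unknown words default to ‘Noun’, which mirrors the labels that Tables 01 to 03 assign to pronouns and numbers.

```python
# Hypothetical hand-written lexicon; a real system would use a fuller dictionary.
LEXICON = {
    "need": "Verb", "want": "Verb", "insert": "Verb", "delete": "Verb",
    "is": "Helping Verb", "are": "Helping Verb",
    "of": "Preposition", "to": "Preposition",
    "a": "Article", "an": "Article", "the": "Article",
    "and": "Conjunction", "whose": "Conjunction",
}

def tag(tokens):
    """Pair each token with a word class; unknown words default to 'Noun'."""
    return [(tok, LEXICON.get(tok.lower(), "Noun")) for tok in tokens]

for word, word_class in tag(["I", "need", "names", "of", "all", "students"]):
    print(f"{word:10}{word_class}")
```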
4.3. Information Retrieval
This module extracts the keywords of the SQL queries, such as Select, Insert, Delete, From, Into, Where, etc.
Keywords are found by matching the tokens against the given array of all possible keywords. These
keywords are then used to generate the respective queries. Information such as the table name, field name,
number of attributes and logical conditions is also extracted in this phase.
Figure 03: Query information extraction
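Keyword matching and the search for the nearest table name can be sketched with Python's standard difflib module. The trigger words, the table list and the similarity cutoff of 0.8 are assumptions made for this illustration; the paper does not specify how nearness is measured.

```python
from difflib import get_close_matches

# Hypothetical trigger words mapped to SQL actions.
SQL_ACTIONS = {
    "select": "SELECT", "need": "SELECT", "show": "SELECT",
    "insert": "INSERT", "add": "INSERT",
    "delete": "DELETE", "remove": "DELETE",
}
TABLES = ["Students", "Teachers", "Courses"]  # hypothetical schema

def extract_info(tokens, tables=TABLES):
    """Return the SQL action and the nearest table name found in the tokens."""
    words = [t.lower() for t in tokens]
    action = next((SQL_ACTIONS[w] for w in words if w in SQL_ACTIONS), None)
    lower_to_table = {t.lower(): t for t in tables}
    table = None
    for w in words:
        # Nearest match, so that "student" still finds the table "Students".
        close = get_close_matches(w, list(lower_to_table), n=1, cutoff=0.8)
        if close:
            table = lower_to_table[close[0]]
            break
    return {"action": action, "table": table}

print(extract_info(["I", "want", "to", "insert", "a", "student"]))
```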
4.4. SQL Queries generation
This module combines the keywords and the other required parameters for a particular query. The SQL query
is ultimately generated here according to the rules given in the designed algorithm. Because a separate
scenario is provided for each type of query, a separate function has been implemented
for each query type.
Figure 04: Generation of SQL Query
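One way to organize a separate function per query type is a small dispatch table. The dictionary keys used below (action, table, required_set, values, condition) are hypothetical names chosen for this sketch.

```python
def generate_select(info):
    return f"SELECT {info['required_set']} FROM {info['table']}"

def generate_insert(info):
    return f"INSERT INTO {info['table']} VALUES ({', '.join(info['values'])})"

def generate_delete(info):
    return f"DELETE FROM {info['table']} WHERE {info['condition']}"

# One generator function per supported query type.
GENERATORS = {
    "SELECT": generate_select,
    "INSERT": generate_insert,
    "DELETE": generate_delete,
}

def generate_query(info):
    """Pick the generator that matches the extracted action and build the query."""
    try:
        generator = GENERATORS[info["action"]]
    except KeyError:
        raise ValueError(f"unsupported query type: {info.get('action')!r}") from None
    return generator(info)

print(generate_query({"action": "SELECT", "table": "Students", "required_set": "*"}))
print(generate_query({"action": "INSERT", "table": "Students", "values": ["5", "'Ali'"]}))
print(generate_query({"action": "DELETE", "table": "Students", "condition": "Age > 25"}))
```

Adding another query type would then mean writing one more function and registering it in the table.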
5.0. Results and Analysis
After the query generating system was designed and coded, its accuracy and efficiency were tested. To
test the queries generated by the designed system, simple and complex queries were
generated. Each query from each category (Select, Insert, Delete) was checked.
15 sample queries were generated, and the results are shown in the following
table.
Table 04: Accuracy ratio of various types of queries
Types      Simple    Complex    Total
SELECT     14        13         90%
INSERT     13        11         80%
DELETE     14        12         87%
Total Accuracy = 86%
A matrix representing the accuracy of the query generation test (%) for simple and complex
queries has been constructed. The overall accuracy for all types of queries is determined by
adding the total accuracy of all categories and calculating the average, which is 86% in this case.
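The percentages can be reproduced with a short calculation. It assumes that 15 sample queries were tested at each complexity level for each query type; this denominator is inferred from the reported counts and percentages and is not stated explicitly.

```python
TOTAL_PER_LEVEL = 15  # assumption: 15 sample queries per complexity level

# Correctly generated queries per type: (simple, complex), taken from Table 04.
correct = {"SELECT": (14, 13), "INSERT": (13, 11), "DELETE": (14, 12)}

def accuracy(simple, complex_):
    """Percentage of correct queries over both complexity levels."""
    return 100 * (simple + complex_) / (2 * TOTAL_PER_LEVEL)

per_type = {name: accuracy(*counts) for name, counts in correct.items()}
overall = sum(per_type.values()) / len(per_type)

for name, value in per_type.items():
    print(f"{name}: {value:.0f}%")
print(f"Overall: {overall:.0f}%")
```

Under this assumption the per-type values round to 90%, 80% and 87%, and their average rounds to 86%, in agreement with the table.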
Figure 05: Graphical representation of the results (bar chart of Simple and Complex queries for SELECT, INSERT and DELETE; vertical axis from 0 to 14)
The graph above shows the accuracy of SELECT, INSERT and DELETE
queries for simple and complex queries.
6.0. Conclusion
The designed system, “Computational Linguistics based System for Automatic Database Query
Generation”, helps both users and software engineers by generating SQL queries
automatically. The task of the novice user can be simplified by providing an easy interface that is
familiar to that user. To resolve these issues, automated software is
needed that serves both users and software engineers. The user writes the requirements in simple
English in a few statements, and the designed system analyzes the given text.
After analysis and extraction of the associated information, the designed system generates the
intended SQL queries, which can be run directly. The designed system creates code
automatically without an external environment. The designed system provides a quick and reliable way to
generate SQL queries and saves the time and budget of both the user and the system analyst. A
graphical user interface has also been provided to the user for entering the input scenario in a proper
way and generating SQL queries.
7.0. Future Work
There is still room for improvement in the algorithms for generating the intended SQL queries.
The current accuracy of generating queries is about 80% to 85%. It can be raised to 95% by
improving the algorithms and adding a learning ability to the system. In this research only three
types of queries have been addressed: SELECT, INSERT and DELETE. There are still other
types of queries that require an adequate solution.
References
[1] Allen, J. (1994) Natural Language Understanding. Benjamin-Cummings Publishing Company,
New York.
[2] Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Use. Cambridge Univ. Press, Cambridge, U.K.
[3] D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003) A Comprehensive XQuery to SQL
Translation using Dynamic Interval Encoding. In SIGMOD.
[4] C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database
queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and
Language Acquisition (1997).
[5] Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill,
New York.
[6] A. Rosenthal, D. Reiner, Extending the Algebraic Framework of Query Processing to Handle
Outer joins, Proc. VLDB Singapore 1984, pp. 334-343.
[7] Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing
for document retrieval. Journal of the American Society for Information Science, 40 (2), 115–132.
[8] J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic
programming, in: Proceedings of the 11th National Conference on Artificial Intelligence
(AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.
[9] Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer,
Boston.
[10] Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM
Transactions on Information Systems, 10, 115–141.
[11] Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal
of the American Society for Information Science, 39(1), 8–16.
[12] Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information
retrieval and filtering: An empirical basis for grammatical rules. Information Processing and
Management, 32(2), 185–197.
[13] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Mass.
[14] Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics.
Kluwer, Dordrecht, The Netherlands.