MC0077 – Advanced Database Systems
Question 1 - List and explain various Normal Forms. How does BCNF differ from the Third
Normal Form and the Fourth Normal Form?
First Normal Form - First normal form (1NF) is a property of a relation in a relational
database. A relation is in first normal form if the domain of each attribute contains only
atomic values, and the value of each attribute contains only a single value from that domain.
First normal form is an essential property of a relation in a relational database. Database
normalization is the process of representing a database in terms of relations in standard
normal forms, where first normal form is a minimal requirement. First normal form deals with the
"shape" of a record type. Under first normal form, all occurrences of a record type must
contain the same number of fields. First normal form excludes variable repeating fields and
groups.
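As a small illustration of removing a repeating group, the sketch below flattens records whose phone field holds a list into rows with one atomic value per field. The record layout and the names `raw_customers` and `to_first_normal_form` are hypothetical and chosen only for this example.

```python
# Hypothetical un-normalized records: "phones" is a variable repeating group.
raw_customers = [
    {"customer_id": 1, "name": "Asha", "phones": ["555-0100", "555-0101"]},
    {"customer_id": 2, "name": "Ben", "phones": ["555-0199"]},
]


def to_first_normal_form(records):
    """Return one row per (customer, phone) so every field is atomic."""
    rows = []
    for record in records:
        for phone in record["phones"]:
            rows.append((record["customer_id"], record["name"], phone))
    return rows


rows_1nf = to_first_normal_form(raw_customers)
for row in rows_1nf:
    print(row)
```

Every row now has the same number of fields. The customer name is repeated in the flattened rows, which is the kind of redundancy the higher normal forms address.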
Second Normal Form - Second normal form (2NF) is a normal form used in database
normalization. A table that is in first normal form (1NF) must meet additional criteria if it is to
qualify for second normal form. Specifically: a table is in 2NF if and only if it is in 1NF and no
non-prime attribute is dependent on any proper subset of any candidate key of the table. A
non-prime attribute of a table is an attribute that is not a part of any candidate key of the
table. Put simply, a table is in 2NF if and only if it is in 1NF and every non-prime attribute of
the table is either dependent on the whole of a candidate key, or on another non-prime
attribute. When a 1NF table has no composite candidate keys (candidate keys consisting of
more than one attribute), the table is automatically in 2NF. Second and third normal forms
deal with the relationship between non-key and key fields.
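A partial dependency can be spotted on sample data with a simple check, sketched below. The table `order_lines`, its candidate key (order_id, product_id), and the helper `fd_holds` are assumptions made for illustration. A dependency that holds in sample rows is only evidence, not proof, that it holds for the schema.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows that agree on the lhs attributes also agree on the rhs attributes."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical order-line table with candidate key (order_id, product_id).
order_lines = [
    {"order_id": 1, "product_id": "P1", "product_name": "Pen", "quantity": 2},
    {"order_id": 1, "product_id": "P2", "product_name": "Ink", "quantity": 1},
    {"order_id": 2, "product_id": "P1", "product_name": "Pen", "quantity": 5},
]

# product_name depends on product_id alone, a proper subset of the key (violates 2NF).
name_on_part_of_key = fd_holds(order_lines, ["product_id"], ["product_name"])
# quantity is not determined by product_id alone; it needs the whole key.
quantity_on_part_of_key = fd_holds(order_lines, ["product_id"], ["quantity"])
quantity_on_whole_key = fd_holds(order_lines, ["order_id", "product_id"], ["quantity"])
```

Moving product_name into a separate product table keyed by product_id would remove the partial dependency.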
Third normal form - Third normal form (3NF) is a normal form used in database normalization. A
table is in 3NF if and only if both of the following conditions hold: the relation R (table) is in
second normal form (2NF), and every non-prime attribute of R is non-transitively dependent (i.e.
directly dependent) on every candidate key of R.
Fourth Normal form - Under the fourth normal form, a table cannot hold two or more independent
multi-valued facts about the same entity. A multi-valued column is one where a single entity can have more than one
value for that column.
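The effect can be seen on a small example. In the sketch below, a hypothetical table stores two independent multi-valued facts (skills and languages) about an employee, so every combination has to be listed. Splitting it into two tables removes that redundancy, and joining them back reproduces the original rows. All names and values are invented for illustration.

```python
# Hypothetical table: skills and languages are independent multi-valued facts.
employee_skill_language = {
    ("Ravi", "SQL", "English"),
    ("Ravi", "SQL", "Hindi"),
    ("Ravi", "Python", "English"),
    ("Ravi", "Python", "Hindi"),
}

# Decompose into one table per multi-valued fact.
employee_skill = {(e, s) for e, s, _ in employee_skill_language}
employee_language = {(e, l) for e, _, l in employee_skill_language}

# Natural join on the employee column reconstructs the original table.
rejoined = {
    (e1, s, l)
    for e1, s in employee_skill
    for e2, l in employee_language
    if e1 == e2
}

print(len(employee_skill_language), len(employee_skill), len(employee_language))
print(rejoined == employee_skill_language)
```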
Fifth Normal Form - Fifth normal form deals with cases where information can be
reconstructed from smaller pieces of information that can be maintained with less
redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal
form generalizes to cases not covered by the others. The fifth normal form is created by
removing any columns that can be created from smaller pieces of data that can be
maintained with less redundancy.
Difference between BCNF and Third Normal Form
Both 3NF and BCNF are normal forms that are used in relational databases to minimize
redundancies in tables. In a table that is in BCNF, for every non-trivial
functional dependency of the form A → B, A is a super-key. A table that complies
with 3NF, by contrast, must be in 2NF, and every non-prime attribute must depend directly on
every candidate key of that table. BCNF is considered a stronger normal form than
3NF, and it was developed to capture some of the anomalies that could not be captured by
3NF. Obtaining a table that complies with BCNF may require decomposing a table
that is in 3NF. This decomposition results in additional join operations (or Cartesian
products) when executing queries, which increases the computational time. On the other
hand, the tables that comply with BCNF would have fewer redundancies than tables that
only comply with 3NF.
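The BCNF condition can be checked mechanically by computing attribute closures. The following is a minimal sketch, assuming a hypothetical address relation with attributes street, city and zip and the dependencies {street, city} → zip and zip → city. This relation is in 3NF (city is part of a candidate key) but not in BCNF.

```python
def closure(attributes, fds):
    """Attribute closure of `attributes` under the functional dependencies `fds`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attributes, all_attributes, fds):
    """A set of attributes is a super-key if its closure covers the whole relation."""
    return closure(attributes, fds) == set(all_attributes)


def bcnf_violations(all_attributes, fds):
    """Non-trivial dependencies A -> B where A is not a super-key."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not is_superkey(lhs, all_attributes, fds)
    ]


# Hypothetical relation and dependencies, chosen for illustration.
attrs = frozenset({"street", "city", "zip"})
fds = [
    (frozenset({"street", "city"}), frozenset({"zip"})),
    (frozenset({"zip"}), frozenset({"city"})),
]
violations = bcnf_violations(attrs, fds)
print(violations)
```

Only zip → city is reported, because zip alone does not determine street and so is not a super-key.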
Difference between BCNF and 4th Normal Form
● A table must already be in 3NF before it can be taken to BCNF, whereas a table must be
in both 3NF and BCNF to reach 4NF.
● In fourth normal form, a table has no non-trivial multi-valued dependencies except those whose
determinant is a super-key, whereas a table in BCNF can still contain multi-valued dependencies.
Question 2 - What are differences in Centralized and Distributed Database Systems? List
the relative advantages of data distribution.
A distributed database is a database that is under the control of a central database
management system (DBMS) in which storage devices are not all attached to a common
CPU. It may be stored in multiple computers located in the same physical location, or may
be dispersed over a network of interconnected computers. Collections of data (e.g. in a
database) can be distributed across multiple physical locations. A distributed database can
reside on network servers on the Internet, on corporate intranets or extranets, or on other
company networks. The replication and distribution of databases improves database
performance at end-user worksites. To ensure that the distributed databases are up to date,
there are two processes: replication and duplication. Replication involves using
specialized software that looks for changes in the distributed databases. Once the changes
have been identified, the replication process makes all the databases look the same. The
replication process can be very complex and time consuming, depending on the size and
number of the distributed databases, and it can also require a lot of
computer resources. Duplication, on the other hand, is not as complicated. It basically
identifies one database as a master and then duplicates that database. The duplication
process is normally done at a set time after hours, to ensure that each distributed
location has the same data. In the duplication process, changes are allowed only to the master
database, to ensure that local data will not be overwritten. Both processes
can keep the data current in all distributed locations. Besides distributed database
replication and fragmentation, there are many other distributed database design
technologies, for example local autonomy and synchronous and asynchronous distributed
database technologies. How these technologies are implemented depends on the
needs of the business and the sensitivity/confidentiality of the data to be stored in the
database, and hence on the price the business is willing to pay to ensure data security,
consistency and integrity.
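The contrast between duplication and replication can be sketched in a few lines. This is a toy model, not how any particular DBMS works: each site is a dictionary, each value carries a version number, and conflicts are resolved by keeping the highest version. The version-number rule, the site names, and the functions `duplicate` and `replicate` are all assumptions made for this example.

```python
import copy


def duplicate(master, sites):
    """Duplication: overwrite every site with a full copy of the master database."""
    for name in sites:
        sites[name] = copy.deepcopy(master)


def replicate(sites):
    """Replication (simplified): find changes at every site and make all copies match.

    Assumption: each value is stored as (version, data) and the highest version wins.
    """
    merged = {}
    for data in sites.values():
        for key, (version, value) in data.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for name in sites:
        sites[name] = dict(merged)


# Hypothetical sites whose copies have drifted apart.
sites = {
    "london": {"a": (1, "x")},
    "paris": {"a": (2, "y"), "b": (1, "z")},
}
replicate(sites)

# Duplication instead pushes one master copy to every site.
master = {"a": (3, "w")}
duplicated_sites = {"london": {}, "paris": {"b": (1, "z")}}
duplicate(master, duplicated_sites)
```

Replication has to inspect and merge every copy, which is why it costs more than copying a single master.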
A database user accesses the distributed database through:
Local applications: Applications which do not require data from other sites.
Global applications: Applications which do require data from other sites.
A distributed database does not share main memory or disks. A centralized database, by contrast, keeps
all its data in one place, whereas a distributed database keeps its data in
different places. Because all the data in a centralized database resides in one place, a
bottleneck can occur, and data availability is lower than in a distributed database.
Advantages of Data Distribution
The primary advantage of distributed database systems is the ability to share and access
data in a reliable and efficient manner.
1. Data sharing and Distributed Control: If a number of different sites are connected to each
other, then a user at one site may be able to access data that is available at another site. For
example, in the distributed banking system, it is possible for a user in one branch to access
data in another branch. Without this capability, a user wishing to transfer funds from one
branch to another would have to resort to some external mechanism for such a transfer. This
external mechanism would, in effect, be a single centralized database. The primary
advantage to accomplishing data sharing by means of data distribution is that each site is
able to retain a degree of control over data stored locally. In a centralized system, the
database administrator of the central site controls the database. In a distributed system,
there is a global database administrator responsible for the entire system. A part of these
responsibilities is delegated to the local database administrator for each site. Depending
upon the design of the distributed database system, each local administrator may have a
different degree of autonomy which is often a major advantage of distributed databases.
2. Reliability and Availability: If one site fails in a distributed system, the remaining sites may be
able to continue operating. In particular, if data are replicated in several sites, a transaction
needing a particular data item may find it in several sites. Thus, the failure of a site does not
necessarily imply the shutdown of the system. The failure of one site must be detected by
the system, and appropriate action may be needed to recover from the failure. The system
must no longer use the service of the failed site. Finally, when the failed site recovers or is
repaired, mechanisms must be available to integrate it smoothly back into the system.
Although recovery from failure is more complex in distributed systems than in a centralized
system, the ability of most of the system to continue to operate despite the failure of one site
results in increased availability. Availability is crucial for database systems used for real-time
applications.
3. Speedup Query Processing: If a query involves data at several sites, it may be possible to
split the query into subqueries that can be executed in parallel by several sites. Such
parallel computation allows for faster processing of a user’s query. In those cases in which
data is replicated, queries may be directed by the system to the least heavily loaded sites.
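The idea of splitting a query into parallel subqueries can be sketched with threads standing in for sites. The branch names, balances, and functions below are hypothetical. A real distributed DBMS would ship the subqueries over a network rather than run them in one process.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical account balances held at three separate sites.
SITES = {
    "branch_a": [120, 80, 300],
    "branch_b": [50, 75],
    "branch_c": [500],
}


def local_total(site_name):
    """Subquery executed at one site: sum the balances stored there."""
    return sum(SITES[site_name])


def global_total():
    """Global query: run one subquery per site in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        partial_sums = list(pool.map(local_total, SITES))
    return sum(partial_sums)


total = global_total()
print(total)
```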
Question 3 - Describe the concepts of Structural Semantic Data Model (SSM).
A data model in software engineering is an abstract model that describes how data
are represented and accessed. Data models formally define data elements and relationships
among data elements for a domain of interest. A data model explicitly determines the
structure of data or structured data. Typical applications of data models include database
models, design of information systems, and enabling exchange of data. Usually data models
are specified in a data modeling language. Communication and precision are the two key
benefits that make a data model important to applications that use and exchange data. A
data model is the medium through which project team members from different backgrounds and with
different levels of experience can communicate with one another. Precision means that the
terms and rules on a data model can be interpreted only one way and are not ambiguous. A
data model can be sometimes referred to as a data structure, especially in the context of
programming languages. Data models are often complemented by function models,
especially in the context of enterprise models.
A semantic data model in software engineering is a technique to define the meaning of data
within the context of its interrelationships with other data. A semantic data model is an
abstraction which defines how the stored symbols relate to the real world. A semantic data
model is sometimes called a conceptual data model. The logical data structure of a database
management system (DBMS), whether hierarchical, network, or relational, cannot totally
satisfy the requirements for a conceptual definition of data because it is limited in scope and
biased toward the implementation strategy employed by the DBMS. Therefore, the need to
define data from a conceptual view has led to the development of semantic data modeling
techniques, that is, techniques to define the meaning of data within the context of its
interrelationships with other data. As illustrated in the figure, the real world, in terms of
resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic
data model is an abstraction which defines how the stored symbols relate to the real world.
Thus, the model must be a true representation of the real world.
Data modeling in software engineering is the process of creating a data model by applying
formal data model descriptions using data modeling techniques. Data modeling is a
technique for defining business requirements for a database. It is sometimes called
database modeling because a data model is eventually implemented in a database. Data
architecture is the design of data for use in defining the target state and the subsequent
planning needed to hit the target state. It is usually one of several architecture domains that
form the pillars of an enterprise architecture or solution architecture. Data architecture
describes the data structures used by a business and/or its applications. There are
descriptions of data in storage and data in motion; descriptions of data stores, data groups
and data items; and mappings of those data artifacts to data qualities, applications, locations,
etc. Essential to realizing the target state, data architecture describes how data is
processed, stored, and utilized in a given system. It provides criteria for data processing
operations that make it possible to design data flows and also control the flow of data in the
system.
Question 4 - Describe the following with respect to Object Oriented Databases: a) Query
Processing in Object-Oriented Database Systems b) Query Processing Architecture
a. Query Processing in Object-Oriented Database Systems
One of the criticisms of first-generation object-oriented database management systems
(OODBMSs) was their lack of declarative query capabilities. This led some researchers to
brand first generation (network and hierarchical) DBMSs as object-oriented. It was
commonly believed that the application domains that OODBMS technology targets do not
need querying capabilities. This belief no longer holds, and declarative query capability is
accepted as one of the fundamental features of OODBMSs. Indeed, most of the current
prototype systems experiment with powerful query languages and investigate their
optimization. Commercial products have started to include such languages as well, e.g. O2
and ObjectStore.
Query optimization techniques are dependent upon the query model and language. For
example, a functional query language lends itself to functional optimization which is quite
different from the algebraic, cost-based optimization techniques employed in relational as
well as a number of object-oriented systems. The query model, in turn, is based on the data
(or object) model since the latter defines the access primitives which are used by the query
model. These primitives, at least partially, determine the power of the query model. Despite
this close relationship, in this unit we do not consider issues related to the design of object
models, query models, or query languages in any detail.
Almost all object query processors proposed to date use optimization techniques developed
for relational systems. However, there are a number of issues that make query processing
more difficult in OODBMSs. The following are some of the more important issues:
Type System - Relational query languages operate on a simple type system consisting of a
single aggregate type: relation. The closure property of relational languages implies that
each relational operator takes one or more relations as operands and produces a relation as
a result. In contrast, object systems have richer type systems. The results of object algebra
operators are usually sets of objects (or collections) whose members may be of different
types. If the object languages are closed under the algebra operators, these heterogeneous
sets of objects can be operands to other operators.
Encapsulation - Relational query optimization depends on knowledge of the physical storage
of data (access paths) which is readily available to the query optimizer. The encapsulation of
methods with the data that they operate on in OODBMSs raises (at least) two issues. First,
estimating the cost of executing methods is considerably more difficult than estimating the
cost of accessing an attribute according to an access path. In fact, optimizers have to worry
about optimizing method execution, which is not an easy problem because methods may be
written using a general-purpose programming language. Second, encapsulation raises
issues related to the accessibility of storage information by the query optimizer. Some
systems overcome this difficulty by treating the query optimizer as a special application that
can break encapsulation and access information directly.
Complex Objects and Inheritance - Objects usually have complex structures where the state
of an object references other objects. Accessing such complex objects involves path
expressions. The optimization of path expressions is a difficult and central issue in object
query languages.
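A path expression simply follows object references one attribute at a time. The sketch below uses hypothetical `Employee`, `Company` and `City` classes and a helper `follow_path` to show the traversal. Each step may touch a different stored object, which is part of why optimizing such expressions is hard.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str


@dataclass
class Company:
    name: str
    city: City


@dataclass
class Employee:
    name: str
    employer: Company


def follow_path(obj, path):
    """Evaluate a dotted path expression by dereferencing one attribute per step."""
    for attribute in path.split("."):
        obj = getattr(obj, attribute)
    return obj


# Hypothetical object graph: employee -> company -> city.
employee = Employee("Mira", Company("Acme", City("Pune")))
city_name = follow_path(employee, "employer.city.name")
print(city_name)
```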
Object Models - OODBMSs lack a universally accepted object model definition. Even though
there is some consensus on the basic features that need to be supported by any object
model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed
collections), how these features are supported differs among models and systems. As a
result, the numerous projects that experiment with object query processing follow quite
different paths and are, to a certain degree, incompatible, making it difficult to amortize on
the experiences of others.
b. Query Processing Architecture
A query processing methodology similar to that of relational DBMSs, but modified to deal with the
difficulties discussed above, can be followed in OODBMSs.
The steps of the methodology are as follows.
1. Queries are expressed in a declarative language that requires no user knowledge of
object implementations, access paths or processing strategies.
2. Calculus optimization: the calculus expression is optimized first.
3. Calculus-to-algebra transformation
4. Type check
5. Algebra optimization
6. Execution plan generation
7. Execution
Question 5 - Describe the Differences between Distributed & Centralized Databases.
1 Centralized Control vs. Decentralized Control - In centralized control, one "database
administrator" ensures the safety of the data, whereas in distributed control it is possible to use a
hierarchical control structure based on a "global database administrator", who has the central
responsibility for the whole of the data, along with "local database administrators", who have the
responsibility for their local databases.
2 Data Independence - In centralized databases, data independence means that the actual organization of data is
transparent to the application programmer. Programs are written with a "conceptual" view
of the data (called the "conceptual schema"), and the programs are unaffected by the physical
organization of data. In distributed databases, another aspect, "distribution transparency",
is added to the notion of data independence as used in centralized databases. Distribution
transparency means that programs are written as if the data were not distributed. Thus the
correctness of programs is unaffected by the movement of data from one site to another;
however, their speed of execution is affected.
3 Reduction of Redundancy - In centralized databases redundancy is reduced for two
reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b)
storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed
databases data redundancy is desirable as (a) locality of applications can be increased if
data is replicated at all sites where applications need it, (b) the availability of the system can
be increased, because a site failure does not stop the execution of applications at other sites
if the data is replicated. With data replication, retrieval can be performed on any copy, while
updates must be performed consistently on all copies.
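The rule that retrieval may use any copy while updates must reach all copies can be sketched as a strict read-one/write-all scheme. The class `ReplicatedItem`, the site names, and the choice to reject a write while any site is down are assumptions of this toy model. Real systems use more refined protocols.

```python
import random


class ReplicatedItem:
    """Toy data item replicated at several sites (read one copy, write all copies)."""

    def __init__(self, site_names, value):
        self.copies = {site: value for site in site_names}
        self.failed = set()  # sites currently unavailable

    def read(self):
        """Retrieval can be served by any copy at a live site."""
        live = [site for site in self.copies if site not in self.failed]
        if not live:
            raise RuntimeError("no replica is available")
        return self.copies[random.choice(live)]

    def write(self, value):
        """Updates must be applied to every copy, so all sites must be reachable."""
        if self.failed:
            raise RuntimeError("write-all needs every site to be reachable")
        for site in self.copies:
            self.copies[site] = value


item = ReplicatedItem(["site_a", "site_b", "site_c"], 100)
item.failed.add("site_b")
value_during_failure = item.read()  # still answered by site_a or site_c
item.failed.clear()
item.write(150)
```

Reads stay available when a site fails, while writes become the expensive operation. This is the trade-off replication introduces.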
4 Complex Physical Structures and Efficient Access - In centralized databases, complex
accessing structures such as secondary indexes and interfile chains are used. All these features
provide efficient access to data. In distributed databases, efficient access requires accessing
data from different sites. For this, an efficient distributed data access plan is required, which
can be written by the programmer or produced automatically by an optimizer.
Problems faced in the design of an optimizer can be classified in two categories: (a) global
optimization, which consists of determining which data must be accessed at which sites and which
data files must consequently be transmitted between sites; and (b) local optimization, which consists of
deciding how to perform the local database accesses at each site.
5 Integrity, Recovery and Concurrency Control - A transaction is an atomic unit of execution
and atomic transactions are the means to obtain database integrity. Failures and
concurrency are two threats to atomicity. Failures may cause the system to stop in the midst of
transaction execution, thus violating the atomicity requirement. Concurrent execution of
different transactions may permit one transaction to observe an inconsistent, transient state
created by another transaction during its execution. Concurrent execution requires
synchronization amongst the transactions, which is much harder in distributed systems.
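Atomicity can be demonstrated on a single site with Python's built-in sqlite3 module. In this sketch the `account` table, the names, and the `transfer` function are hypothetical. A failure midway through the transaction (here a violated CHECK constraint) causes the whole transfer to be rolled back, so no partial update remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, source, target, amount):
    """Move `amount` between accounts as one atomic unit of work."""
    try:
        with connection:  # commits on success, rolls back if an exception is raised
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The credit succeeds but the debit would make alice's balance negative,
# so the whole transaction is rolled back.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```

In a distributed database the same all-or-nothing guarantee has to hold across several sites, which is why it needs additional coordination between them.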
6 Privacy and Security - In traditional databases, the database administrator, having
centralized control, can ensure that only authorized access to the data is performed. In
distributed databases, local administrators face the same problem as well as two new aspects of
it: (a) security (protection) problems are intrinsic to distributed systems because of the
communication networks they rely on; (b) in databases with a high degree of "site autonomy", the owners of local data may
feel more protected because they can enforce their own protections instead of depending on
a central database administrator.
7 Distributed Query Processing - The DDBMS should be capable of gathering and presenting
data from more than one site to answer a single query. In theory a distributed system can
handle queries more quickly than a centralized one, by exploiting parallelism and reducing
disc contention; in practice the main delays (and costs) will be imposed by the
communications network. Routing algorithms must take many factors into account to
determine the location and ordering of operations. Communications costs for each link in the
network are relevant, as also are variable processing capabilities and loadings for different
nodes, and (where data fragments are replicated) trade-offs between cost and currency.
8 Distributed Directory (Catalog) Management - Catalogs for distributed databases contain
information such as fragmentation descriptions, allocation descriptions, mappings to local names,
access method descriptions, statistics on the database, and protection and integrity constraints
(consistency information), and are more detailed than the catalogs of centralized databases.
Question 6 - Describe the following: a) Data Mining Functions b) Data Mining Techniques
a) Data Mining Functions
Data mining refers to the broadly defined set of techniques for finding meaningful
patterns - or information - in large amounts of raw data. At a very high level, data mining is
performed in the following stages (note that the terminology and steps of the data mining
process vary by practitioner):
1. Data collection: gathering the input data you intend to analyze
2. Data scrubbing: removing incomplete records and filling in missing values where appropriate
3. Pre-testing: determining which variables might be important for inclusion during the
analysis stage.
4. Analysis/Training: analyzing the input data to look for patterns
5. Model building: drawing conclusions from the analysis phase and determining a
mathematical model to be applied to future sets of input data
6. Application: applying the model to new data sets to find meaningful patterns
Data mining can be used to classify or cluster data into groups or to predict likely future
outcomes based upon a set of input variables/data.
b) Data Mining Techniques
Several major data mining techniques have been developed and are used in data
mining projects.
Association - Association is one of the best-known data mining techniques. In association, a
pattern is discovered based on the relationship of a particular item to other items in the same
transaction. For example, the association technique is used in market basket analysis to
identify which products customers frequently purchase together.
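Two standard measures for such patterns are support (how often an itemset appears) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both over a handful of invented baskets. The data and function names are illustrative only.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_confidence)
```

Here bread and milk appear together in two of the four baskets, and two of the three baskets containing bread also contain milk.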
Classification - Classification is a classic data mining technique based on machine learning.
Classification is used to assign each item in a set of data to one of a predefined set
of classes or groups.
Clustering - Clustering is a data mining technique that automatically forms meaningful or useful clusters of
objects that have similar characteristics. Unlike
classification, the clustering technique also defines the classes before putting objects in them, while in
classification objects are assigned to predefined classes.
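One common clustering algorithm is k-means, used here as an example; the text does not name a specific algorithm. The sketch below works on one-dimensional values with invented data and starting centers. It repeatedly assigns each value to its nearest center and then moves each center to the mean of its cluster.

```python
def k_means_1d(values, centers, iterations=10):
    """Minimal 1-D k-means: alternate assignment and center-update steps."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data with two obvious groups and deliberately poor starting centers.
final_centers, final_clusters = k_means_1d([1, 2, 3, 10, 11, 12], [0, 5])
print(final_centers, final_clusters)
```

No class labels are supplied. The two groups emerge from the data, which is the difference from classification.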
Prediction - Prediction, as its name implies, is a data mining technique that
discovers the relationships among independent variables and the relationships between dependent
and independent variables.
Sequential Patterns - Sequential pattern analysis is a data mining technique that
seeks to discover similar patterns in transaction data over a business period. The uncovered
patterns are used for further business analysis to recognize relationships among data.
Artificial neural networks - These are non-linear, predictive models that learn through
training. Although they are powerful predictive modeling techniques, some of the power
comes at the expense of ease of use and deployment.
Decision trees - They are tree-shaped structures that represent decision sets. These
decisions generate rules, which then are used to classify data. Decision trees are the
favored technique for building understandable models.
The nearest-neighbor method - This method classifies dataset records based on similar data
in a historical dataset.
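A minimal sketch of the nearest-neighbor idea follows: a new record receives the majority label of the k most similar historical records. The historical points, the labels, and the choice of Euclidean distance with k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical historical dataset: (feature vector, known class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
    ((5.5, 6.5), "high"),
]


def classify(point, labelled, k=3):
    """Label `point` by majority vote among its k nearest historical records."""
    nearest = sorted(labelled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


first = classify((1.2, 1.4), history)
second = classify((5.2, 5.1), history)
print(first, second)
```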