2. History of databases?
There is a need to organize data structurally.
Various models have been proposed to fulfill this need.
The most common one is the relational model.
Databases that support the relational model are called Relational
Database Management Systems (RDBMS).
3. Relational Model
All data is represented in terms of tuples.
A tuple generalizes a pair. A pair is an ordered list of two items, and a
tuple is an ordered list of N items, where N is a finite non-negative integer.
Tuples are grouped into relations.
In mathematical terms, a relational model is based on first-order
predicate logic.
4. Example of Tuples and Relations
Assume a road repair company wants to track their activities on
different roads.
Let’s restrict their activities to ‘Patching’, ‘Overlay’ and ‘Crack Sealing’.
The company had overlaid I-95 on 1/12/01 and I-66 on 2/8/01.
How can we represent this information in a relational model using
tuples?
First, we see that there are two distinct things here:
Activities
Work
Next we define tuples for both of these items as follows:
Activities = {activityName}
Works = {activity, date, routeNumber}
5. Example of Tuples and Relations
We see a relation between Activities and Work – the activity that is
to be performed.
In relational model we use the concept of ‘keys’ to describe the
relationship between different tuples.
In our example, activityName can act as a ‘key’ to describe the
relation that can be named as ‘ActivityWorks’.
For optimization reasons, keys are generally of numeric type.
Therefore we modify Activities and Works to add a numeric ID
Activities = {activityId, activityName}
Works = {activityId, date, routeNumber}
7. Relational Databases
Relational Modelling is a mathematical concept.
When we translate this mathematical concept into an RDBMS, we describe tuples as rows, items in tuples as
columns and groups of tuples as tables. Relations are called relations in RDBMS terminology as well.
The example of our road repair company when translated into RDBMS would have two tables as follows:
Table Name: Activities. Columns:
activityId (Primary Key, number type)
activityName (string type)
Table Name: Works. Columns:
activityId (Foreign Key, number type)
date (date type)
routeNumber (string type)
It would have the following relation:
Relation Name: ActivityWorks. Participating Columns:
Primary Table: Activities. Key Column: activityId
Secondary Table: Works. Key Column: activityId
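The two tables and the ActivityWorks relation can be sketched in code. The following is a minimal sketch using Python's built-in sqlite3 module, not a definitive schema. Assumptions: the column types, the ISO date strings (reading 1/12/01 as 12 January 2001), the IDs 23 and 25, and the route 'I-81' are all illustrative and not taken from the slides.

```python
import sqlite3

# In-memory SQLite database; column types are an assumption of this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Activities (
    activityId   INTEGER PRIMARY KEY,
    activityName TEXT NOT NULL
);
CREATE TABLE Works (
    activityId  INTEGER NOT NULL REFERENCES Activities(activityId),
    date        TEXT NOT NULL,   -- ISO date string (assumed representation)
    routeNumber TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO Activities VALUES (?, ?)",
                 [(23, "Patching"), (24, "Overlay"), (25, "Crack Sealing")])
conn.executemany("INSERT INTO Works VALUES (?, ?, ?)",
                 [(24, "2001-01-12", "I-95"), (24, "2001-02-08", "I-66")])

# A Works row that points at a non-existent activity violates the relation.
try:
    conn.execute("INSERT INTO Works VALUES (99, '2001-03-01', 'I-81')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

works_count = conn.execute("SELECT COUNT(*) FROM Works").fetchone()[0]
```

The foreign key on Works.activityId is what turns the shared column into an enforced relation: the database refuses a Works row whose activity does not exist.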
8. More on Relations
The relation in the previous example is commonly called a ‘one-to-many’
or a ‘Master-Child’ relationship.
There are a total of three types of relationships:
One-to-One: For a row in primary table there can be at most one row in
secondary table. Commonly used to spread a single tuple across two
tables based on logical reasoning.
One-to-Many: For a row in primary table there can be multiple rows in
secondary table. Commonly used to reduce redundancy or duplication
of same data.
Many-to-Many: For multiple rows in primary table there can be multiple
rows in secondary table. Used to describe complex relationships.
Relations are always directional.
9. Querying databases
Databases provide an interface to define and manipulate data. Statements
sent through this interface are called queries.
There are two types of queries:
Data Definition Language (DDL) queries. They are used to create and
modify database structure. The DB structure is called a schema definition.
Data Manipulation Language (DML) queries. They are used to fetch and
modify the data in the database.
There are four major DML queries:
SELECT
INSERT
UPDATE
DELETE
10. SELECT Query
A SELECT query is the way to fetch data from a database.
At a minimum, it has two parts (called clauses):
The SELECT clause
The FROM clause
For example:
SELECT activityId, activityName
FROM Activities;
This query would return all rows in the Activities table.
Apart from the SELECT and FROM clauses, there are a number of other clauses that
are optional. These include (but are not limited to):
WHERE
ORDER BY
GROUP BY
11. SELECT Query – The SELECT Clause
It enables you to define the columns you want.
Sometimes you want all columns. In those cases you can use the
wildcard operator (*). For example, the previous query can be
modified as:
SELECT *
FROM Activities;
A good practice is to name the columns rather than using *.
The primary use of SELECT clause is to define a projection – a subset
of columns, so that the result can be restricted to such columns only.
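The difference between the wildcard and a projection can be tried out with a small sketch. It assumes an in-memory SQLite database; the IDs 23 and 25 are made up for illustration, and the ORDER BY is only there to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT)")
conn.executemany("INSERT INTO Activities VALUES (?, ?)",
                 [(23, "Patching"), (24, "Overlay"), (25, "Crack Sealing")])

# Wildcard: every column of every row.
all_columns = conn.execute(
    "SELECT * FROM Activities ORDER BY activityId").fetchall()

# Projection: only the named column is returned.
names_only = conn.execute(
    "SELECT activityName FROM Activities ORDER BY activityId").fetchall()
```

Naming the columns keeps the result stable even if columns are later added to the table, which is one reason it is considered good practice.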
12. SELECT Query – FROM Clause
This is where you tell the database the name of table(s) where it should
look for the columns you named in the SELECT clause.
When fetching data from multiple tables, list all tables and describe the
relation between them. For example, let us try to fetch data for all the
activities that have been performed on various routes along with dates.
SELECT Activities.activityName, Works.date, Works.routeNumber
FROM Activities INNER JOIN
Works ON Activities.activityId = Works.activityId
Notice the keywords ‘INNER JOIN’ and the part ‘ON Activities.activityId
= Works.activityId’.
The ON … part tells the database which columns to match
results on. It is also called the join condition. There can be more than
one joining condition depending on the underlying database schema.
13. SELECT Query – FROM Clause – JOINs
JOIN is a keyword that allows you to let the database know that
there are multiple tables you intend to fetch data from.
There is a table mentioned before JOIN and another after it.
The one before is called the left table and the one after is called the
right table.
The three most commonly used types of joins are:
INNER JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
14. INNER JOIN
INNER JOIN is also sometimes called a ‘strict’ join.
Many RDBMSs support dropping the ‘INNER’ and implicitly
assume it.
This type of join means that, for each row in the left table, the database finds
the matching rows in the right table and skips the row if no match is found.
This type of join helps in eliminating empty records.
For example, in our road repair example, it would omit all such
Activities rows that don’t have records in Works table.
15. OUTER JOINs
In case we don’t want to omit empty records, we can use OUTER JOINs.
A LEFT OUTER JOIN keeps every row of the left table and attaches all matching
rows from the right table, if there are any.
A RIGHT OUTER JOIN keeps every row of the right table and attaches all matching
rows from the left table, if there are any.
For example, let us find all Activities and related Works. We can do this
by:
SELECT Activities.activityName, Works.date, Works.routeNumber
FROM Activities LEFT OUTER JOIN
Works ON Activities.activityId = Works.activityId
This query would return all Activities along with their associated Works.
For the Activities that don’t have corresponding Works it would put
‘NULL’ under date and routeNumber columns.
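The difference between the two join types can be seen side by side in the following sketch. It assumes an in-memory SQLite database, ISO date strings, and made-up IDs 23 and 25; SQL NULL shows up as None in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66');
""")

select_from = ("SELECT Activities.activityName, Works.date, Works.routeNumber "
               "FROM Activities ")
join_on = "JOIN Works ON Activities.activityId = Works.activityId "
order_by = "ORDER BY Activities.activityId, Works.date"

# INNER JOIN drops activities that have no work recorded.
inner_rows = conn.execute(select_from + "INNER " + join_on + order_by).fetchall()

# LEFT OUTER JOIN keeps them and fills the Works columns with NULL.
left_rows = conn.execute(select_from + "LEFT OUTER " + join_on + order_by).fetchall()
```

With this data the inner join returns the two Overlay rows only, while the left outer join also returns one row each for Patching and Crack Sealing with empty date and routeNumber.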
16. The JOIN Conditions
The ON … part is called the joining condition.
It is essentially a condition that names columns in the left
and right tables and states how they are to be compared.
In most circumstances, there are columns (from left and right tables)
that are matched with an = operator, however, in some cases that
might not be true.
Other conditional operators such as not equal, greater than, less
than, etc. are also supported.
There can be more than one joining condition.
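A join with two conditions, one of which is not an equality, might look like the sketch below. The scenario is an assumption made up for illustration: Works is joined to itself to pair each work with later works of the same activity. Dates are stored as ISO strings so that < compares them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66');
""")

# Two joining conditions: same activity (=) and strictly earlier date (<).
pairs = conn.execute("""
    SELECT earlier.routeNumber, later.routeNumber
    FROM Works AS earlier INNER JOIN
         Works AS later ON earlier.activityId = later.activityId
                       AND earlier.date < later.date
""").fetchall()
```

Only one pair qualifies here: the I-95 overlay happened before the I-66 overlay.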
18. SELECT Query – WHERE clause
WHERE clause allows you to describe conditions on the data you want fetched.
For example, if we are interested in all Overlaying Works we’ll write a query:
SELECT *
FROM Works
WHERE activityId = 24
Another way to do the same without using an ID is:
SELECT Works.*
FROM Works INNER JOIN
Activities ON Works.activityId = Activities.activityId
WHERE Activities.activityName = 'Overlay'
However, the second example is likely to be somewhat slower, because
of the overhead of joining the tables and matching on a string column.
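Both forms of the filter can be checked against each other with a small sketch. It assumes an in-memory SQLite database; the Patching work on 'I-81' and the IDs other than 24 are made-up rows added so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66'),
                         (23, '2001-03-05', 'I-81');
""")

# Filter directly on the numeric key.
by_id = conn.execute(
    "SELECT * FROM Works WHERE activityId = 24 ORDER BY date").fetchall()

# Filter on the activity name, which needs a join to Activities.
by_name = conn.execute("""
    SELECT Works.*
    FROM Works INNER JOIN
         Activities ON Works.activityId = Activities.activityId
    WHERE Activities.activityName = 'Overlay'
    ORDER BY Works.date
""").fetchall()
```

Both queries return the same two Overlay rows; the difference is only in how much work the database has to do to find them.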
19. SELECT Query - ORDER BY Clause
Theoretically speaking, the records in a table are unordered. In practice, most
RDBMSs store them in some kind of order (usually in the order of the primary
key column), but the order of a result set is not guaranteed unless you ask for one.
In any case, there might be a requirement to order the results in a particular
way.
ORDER BY clause allows you to describe data ordering and the direction of
ordering.
For example, if we want all Activities along with their associated Works ordered
alphabetically and sorted by date in a descending order, we can do that by:
SELECT Activities.activityName, Works.date, Works.routeNumber
FROM Activities INNER JOIN
Works ON Activities.activityId = Works.activityId
ORDER BY Activities.activityName ASC, Works.date DESC
The ASC keyword is implicit and can be skipped.
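The effect of mixing ascending and descending order can be seen in this sketch. It assumes an in-memory SQLite database and ISO date strings; the Patching work on 'I-81' is a made-up row added so that more than one activity appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66'),
                         (23, '2001-03-05', 'I-81');
""")

# Names ascending (A to Z); within one name, newest work first.
ordered = conn.execute("""
    SELECT Activities.activityName, Works.date, Works.routeNumber
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    ORDER BY Activities.activityName ASC, Works.date DESC
""").fetchall()
```

Overlay sorts before Patching, and within Overlay the February work comes before the January one.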
20. Aggregating Results
Sometimes we want to fetch aggregated results. For example, we
want to find out the number of times each Activity has been carried
out from the road repair example.
The GROUP BY clause provides this functionality.
SELECT Activities.activityName, COUNT(Works.routeNumber) AS
countActivity
FROM Activities INNER JOIN
Works ON Activities.activityId = Works.activityId
GROUP BY Activities.activityName
COUNT is an aggregate function. Other commonly used
aggregate functions are SUM, AVG, MIN and MAX.
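The counting query can be run against sample data as a sketch. It assumes an in-memory SQLite database; the Patching work on 'I-81' is a made-up row, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay'), (25, 'Crack Sealing');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66'),
                         (23, '2001-03-05', 'I-81');
""")

# One output row per activity name, with the number of Works rows in each group.
counts = conn.execute("""
    SELECT Activities.activityName, COUNT(Works.routeNumber) AS countActivity
    FROM Activities INNER JOIN
         Works ON Activities.activityId = Works.activityId
    GROUP BY Activities.activityName
    ORDER BY Activities.activityName
""").fetchall()
```

Crack Sealing does not appear in the result because the INNER JOIN drops activities without any work.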
21. SELECT Query – GROUP BY Clause
When a GROUP BY clause is defined, every column in the
SELECT and ORDER BY clauses needs to be either part of an
aggregate function or mentioned in the GROUP BY clause.
For example, the following query is invalid, because Works.date is neither aggregated nor listed in the GROUP BY clause:
SELECT Activities.activityName, Works.date,
COUNT(Works.routeNumber) AS countActivity
FROM Activities INNER JOIN
Works ON Activities.activityId = Works.activityId
GROUP BY Activities.activityName
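Two possible repairs of the invalid query are sketched below: either add Works.date to the GROUP BY clause, or wrap it in an aggregate function. The sketch assumes an in-memory SQLite database and a made-up Patching row. SQLite itself tolerates the invalid form and picks a value from an arbitrary row, so only the repaired versions are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activities (activityId INTEGER PRIMARY KEY, activityName TEXT);
CREATE TABLE Works (activityId INTEGER, date TEXT, routeNumber TEXT);
INSERT INTO Activities VALUES (23, 'Patching'), (24, 'Overlay');
INSERT INTO Works VALUES (24, '2001-01-12', 'I-95'), (24, '2001-02-08', 'I-66'),
                         (23, '2001-03-05', 'I-81');
""")
join = ("FROM Activities INNER JOIN Works "
        "ON Activities.activityId = Works.activityId ")

# Repair 1: group by the date as well (one row per activity and date).
grouped_by_date = conn.execute(
    "SELECT Activities.activityName, Works.date, COUNT(Works.routeNumber) " + join +
    "GROUP BY Activities.activityName, Works.date "
    "ORDER BY Activities.activityName, Works.date").fetchall()

# Repair 2: aggregate the date (one row per activity, with its latest date).
aggregated_date = conn.execute(
    "SELECT Activities.activityName, MAX(Works.date), COUNT(Works.routeNumber) " + join +
    "GROUP BY Activities.activityName "
    "ORDER BY Activities.activityName").fetchall()
```

The two repairs answer different questions, so the right one depends on whether a count per date or a count per activity is wanted.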
22. Sub-queries
A SELECT query works on a table or a group of tables, meaning
tables are the operands for a SELECT operation.
The output of a SELECT query is (a kind of) a table.
Therefore, an output of a SELECT query can act as an
input/operand for another SELECT query.
23. Why use sub-queries?
Query optimization by breaking a large/complex query into smaller
queries that use WHERE clauses to reduce the data size.
Retrieving single valued records for related tables based on values
on some other columns in another query. Such as retrieving most
recent (or oldest) record in a table that holds data for single record
with updates over a period of time.
The above point refers to a common data warehousing use
case: storing data that changes over time while
preserving the history of those changes.
This is sometimes also referred to as a Slowly Changing Dimension (SCD).
Using a sub-query in a WHERE clause to specify a match on a range
of values.
24. Sub-queries for optimization
Assume that we have a service with one million users.
There are only about 100,000 users that have spent money on our service.
Of the 100,000 users, only about 1,000 users have ever spent 100 dollars or more in one go.
We would most likely have a database with the tables as shown in the diagram.
25. Sub-queries for optimization
You are required to analyze transactions with an amount greater than
100 dollars.
Write down the query that fetches users (userId, name, gender,
country) and their transactions (transactionDate, amount).
A sub-optimal query follows on the next slide, but don’t peek ahead.
Write down one yourself and compare with it later.
26. Sub-queries for optimization
SELECT users.userId, users.username, users.gender, users.country,
transactions.transactionDate, transactions.transactionAmount
FROM users INNER JOIN
transactions ON users.userId = transactions.userId
WHERE transactions.transactionAmount > 100;
Problems:
There were 100,000 users that had spent money. Of those, there were only 1,000 instances
where the amount spent was greater than 100.
Assume that on average there are 2 transactions per user.
Executed naively, the query above would retrieve about 200,000 joined records, and only then would the WHERE clause
be applied to pick out the 1,000 records where the amount was greater than
100.
This means that 99.5% of the data fetched initially was of no use and wasted server resources
(time and memory).
27. Sub-queries for optimization
First, we know that we are only interested in transactions worth more than 100 dollars.
The following query gets us only these transactions:
SELECT transactions.userId, transactions.transactionDate, transactions.transactionAmount
FROM transactions
WHERE transactions.transactionAmount > 100
Since the output of the above query is a table, we can JOIN it
with the users table. The resulting query would be:
SELECT users.userId, users.username, users.gender, users.country,
t1.transactionDate, t1.transactionAmount
FROM users INNER JOIN
(SELECT transactions.userId, transactions.transactionDate,
transactions.transactionAmount
FROM transactions
WHERE transactions.transactionAmount > 100) AS t1 ON users.userId = t1.userId
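The two formulations can be compared on a toy data set, as in the sketch below. It assumes an in-memory SQLite database and three made-up users; the table and column names follow the queries on the slides. The sketch only shows that both queries return the same rows. Whether the sub-query version is actually faster depends on the database's optimizer and is not measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
INSERT INTO users VALUES (1, 'ana', 'F', 'PT'), (2, 'bo', 'M', 'SE'), (3, 'cy', 'F', 'US');
INSERT INTO transactions VALUES
    (1, '2020-01-05', 20.0), (1, '2020-02-10', 150.0),
    (2, '2020-03-01', 75.0), (2, '2020-03-15', 300.0);
""")

# Join first, filter afterwards.
flat = conn.execute("""
    SELECT users.userId, users.username,
           transactions.transactionDate, transactions.transactionAmount
    FROM users INNER JOIN
         transactions ON users.userId = transactions.userId
    WHERE transactions.transactionAmount > 100
    ORDER BY users.userId
""").fetchall()

# Filter inside a sub-query, then join the (smaller) result.
nested = conn.execute("""
    SELECT users.userId, users.username, t1.transactionDate, t1.transactionAmount
    FROM users INNER JOIN
         (SELECT userId, transactionDate, transactionAmount
          FROM transactions
          WHERE transactionAmount > 100) AS t1 ON users.userId = t1.userId
    ORDER BY users.userId
""").fetchall()
```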
28. Sub-queries for Retrieving SCD
From the previous example, assume that now we’re interested in
knowing when was the last time each of our users spent money
along with their gender and country.
How can we go about doing this?
The query that does that is on the next slide, but first try thinking out
how you can do that.
29. Sub-queries for Retrieving SCD
First, let’s write a query that retrieves the latest transaction.
SELECT MAX(transactions.transactionDate) AS lastTransactionDate
FROM transactions
OR
SELECT transactions.transactionDate
FROM transactions
ORDER BY transactions.transactionDate DESC
LIMIT 1
But we want to know the last transaction for each user. We can modify the first example as:
SELECT transactions.userId, MAX(transactions.transactionDate) AS lastTransactionDate
FROM transactions
GROUP BY transactions.userId
The second one cannot be modified in a way that would give us the desired result. Why not? Consider:
SELECT transactions.userId, transactions.transactionDate
FROM transactions
ORDER BY transactions.transactionDate DESC
LIMIT 1
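Running both variants on a toy data set shows how they differ. The sketch assumes an in-memory SQLite database and made-up transactions for two users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
INSERT INTO transactions VALUES
    (1, '2020-01-05', 20.0), (1, '2020-02-10', 150.0),
    (2, '2020-03-01', 75.0), (2, '2020-03-15', 300.0);
""")

# GROUP BY evaluates MAX once per user.
per_user = conn.execute("""
    SELECT transactions.userId, MAX(transactions.transactionDate) AS lastTransactionDate
    FROM transactions
    GROUP BY transactions.userId
    ORDER BY transactions.userId
""").fetchall()

# LIMIT 1 applies to the whole result set, not to each user.
limit_one = conn.execute("""
    SELECT transactions.userId, transactions.transactionDate
    FROM transactions
    ORDER BY transactions.transactionDate DESC
    LIMIT 1
""").fetchall()
```

The grouped query returns one row per user, while the LIMIT version returns a single row for the whole table.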
30. Sub-queries for Retrieving SCD
Now, we need to combine the result with user’s gender and
country.
SELECT users.userId, users.gender, users.country,
MAX(transactions.transactionDate) AS lastTransactionDate
FROM users LEFT OUTER JOIN
transactions ON users.userId = transactions.userId
GROUP BY users.userId, users.gender, users.country
The query above gives us the desired result, but it has one problem.
What?
31. Sub-queries for Retrieving SCD
We can use the discarded query two slides back if we can parameterize it somehow so that it
evaluates for each user and gives us the last date. The following query does that:
SELECT users.userId, users.gender, users.country,
(SELECT transactions.transactionDate
FROM transactions
WHERE transactions.userId = users.userId
ORDER BY transactionDate DESC
LIMIT 1) AS lastTransactionDate
FROM users
The query above does not have a join.
It does not use an aggregate function in the main query and enables us to easily add more
columns without worrying about the GROUP BY clause.
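The parameterized (correlated) sub-query can be tried out with the sketch below. It assumes an in-memory SQLite database and three made-up users, one of whom has no transactions; SQL NULL shows up as None in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
INSERT INTO users VALUES (1, 'ana', 'F', 'PT'), (2, 'bo', 'M', 'SE'), (3, 'cy', 'F', 'US');
INSERT INTO transactions VALUES
    (1, '2020-01-05', 20.0), (1, '2020-02-10', 150.0),
    (2, '2020-03-01', 75.0), (2, '2020-03-15', 300.0);
""")

# The inner query refers to users.userId, so it is evaluated once per user row.
last_dates = conn.execute("""
    SELECT users.userId, users.gender, users.country,
           (SELECT transactions.transactionDate
            FROM transactions
            WHERE transactions.userId = users.userId
            ORDER BY transactions.transactionDate DESC
            LIMIT 1) AS lastTransactionDate
    FROM users
    ORDER BY users.userId
""").fetchall()
```

User 3 still appears in the result, with an empty lastTransactionDate, because the outer query reads from users alone.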
Modify the query above (or the one on previous slide) so that we now get the last transaction
dates for transactions worth more than 50 dollars for each user. (Answer on next slide)
32. Sub-queries for Retrieving SCD
SELECT users.userId, users.gender, users.country,
(SELECT transactions.transactionDate
FROM transactions
WHERE transactions.userId = users.userId
AND transactions.transactionAmount > 50
ORDER BY transactions.transactionDate DESC
LIMIT 1) AS lastTransactionDate
FROM users
33. Handling NULL Values
The query on previous slide would return rows for all one million users
with most of them having lastTransactionDate as NULL.
NULLs don’t look good on a result set and are of no value for further
analysis. We can resolve this situation in two ways.
Assume that we do need to see all one million users and would like
to put a default value for the users that don’t have a transaction
(such as 1.Jan.1900). Such values are called ‘sentinels’.
To replace a NULL with a sentinel value, we can use a function such as
ISNULL (SQL Server), IFNULL (MySQL, SQLite) or the standard COALESCE.
34. Handling NULL Values
SELECT users.userId, users.gender, users.country,
ISNULL((SELECT transactions.transactionDate
FROM transactions
WHERE transactions.userId = users.userId
AND transactions.transactionAmount > 50
ORDER BY transactions.transactionDate DESC
LIMIT 1), '1.Jan.1900') AS lastTransactionDate
FROM users
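The sentinel replacement can be sketched as follows. Assumptions: an in-memory SQLite database, made-up users, and the sentinel written as the ISO string '1900-01-01'. The sketch uses COALESCE rather than ISNULL because SQLite does not provide a two-argument ISNULL function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
INSERT INTO users VALUES (1, 'ana', 'F', 'PT'), (2, 'bo', 'M', 'SE'), (3, 'cy', 'F', 'US');
INSERT INTO transactions VALUES
    (1, '2020-01-05', 20.0), (1, '2020-02-10', 150.0),
    (2, '2020-03-01', 75.0), (2, '2020-03-15', 300.0);
""")

# COALESCE returns its first non-NULL argument, so users without a
# qualifying transaction get the sentinel date instead of NULL.
with_sentinel = conn.execute("""
    SELECT users.userId,
           COALESCE((SELECT transactions.transactionDate
                     FROM transactions
                     WHERE transactions.userId = users.userId
                       AND transactions.transactionAmount > 50
                     ORDER BY transactions.transactionDate DESC
                     LIMIT 1), '1900-01-01') AS lastTransactionDate
    FROM users
    ORDER BY users.userId
""").fetchall()
```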
35. Sub-queries in WHERE clause
Or, we can modify the same query as:
SELECT users.userId, users.gender, users.country,
(SELECT transactions.transactionDate
FROM transactions
WHERE transactions.userId = users.userId
AND transactions.transactionAmount > 50
ORDER BY transactions.transactionDate DESC
LIMIT 1) AS lastTransactionDate
FROM users
WHERE users.userId IN (SELECT transactions.userId
FROM transactions
WHERE transactions.transactionAmount > 50)
However, this query (and, in general, queries that use a sub-query in the WHERE clause)
tends to be sub-optimal, to the point that it can be quite a bad query.
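The IN filter can be compared with one possible alternative, a join against a grouped sub-query. The alternative is an assumption of this sketch and does not come from the slides; the data is made up, with user 3 having only a small transaction. The sketch checks that both return the same rows and does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT, gender TEXT, country TEXT);
CREATE TABLE transactions (userId INTEGER, transactionDate TEXT, transactionAmount REAL);
INSERT INTO users VALUES (1, 'ana', 'F', 'PT'), (2, 'bo', 'M', 'SE'), (3, 'cy', 'F', 'US');
INSERT INTO transactions VALUES
    (1, '2020-01-05', 20.0), (1, '2020-02-10', 150.0),
    (2, '2020-03-01', 75.0), (2, '2020-03-15', 300.0), (3, '2020-04-01', 10.0);
""")

# Sub-query in WHERE: keep only users with a transaction above 50.
in_rows = conn.execute("""
    SELECT users.userId,
           (SELECT transactions.transactionDate
            FROM transactions
            WHERE transactions.userId = users.userId
              AND transactions.transactionAmount > 50
            ORDER BY transactions.transactionDate DESC
            LIMIT 1) AS lastTransactionDate
    FROM users
    WHERE users.userId IN (SELECT transactions.userId
                           FROM transactions
                           WHERE transactions.transactionAmount > 50)
    ORDER BY users.userId
""").fetchall()

# Alternative: INNER JOIN against a grouped sub-query drops the same users.
join_rows = conn.execute("""
    SELECT users.userId, t.lastTransactionDate
    FROM users INNER JOIN
         (SELECT userId, MAX(transactionDate) AS lastTransactionDate
          FROM transactions
          WHERE transactionAmount > 50
          GROUP BY userId) AS t ON users.userId = t.userId
    ORDER BY users.userId
""").fetchall()
```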
36. Many-to-Many Relation Example
We are tasked to design a system for a college.
There are students and there are courses.
We need to provide a basic model that can store data for students,
courses and enrollment of students in courses over years and
semesters.
A student may have enrolled in multiple courses.
A course may have enrollment of multiple students.
A student may enroll in a course only once in a given semester of a
year.
Try modelling the above scenario. The slide following this shows a
common way to go about doing this.
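One possible model is sketched below: a table for each side and a third (junction) table that holds one row per enrollment. All table names, column names and rows are assumptions made up for this sketch, run on an in-memory SQLite database. The composite primary key is what enforces the rule of one enrollment per student, course, year and semester.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (studentId INTEGER PRIMARY KEY, studentName TEXT NOT NULL);
CREATE TABLE Courses  (courseId  INTEGER PRIMARY KEY, courseName  TEXT NOT NULL);
CREATE TABLE Enrollments (
    studentId INTEGER NOT NULL REFERENCES Students(studentId),
    courseId  INTEGER NOT NULL REFERENCES Courses(courseId),
    year      INTEGER NOT NULL,
    semester  TEXT    NOT NULL,
    PRIMARY KEY (studentId, courseId, year, semester)
);
INSERT INTO Students VALUES (1, 'Ana'), (2, 'Bo');
INSERT INTO Courses  VALUES (10, 'Databases'), (20, 'Statistics');
INSERT INTO Enrollments VALUES
    (1, 10, 2020, 'Fall'), (1, 20, 2020, 'Fall'),
    (2, 10, 2020, 'Fall'), (1, 10, 2021, 'Spring');
""")

# Enrolling the same student in the same course and semester twice is rejected.
try:
    conn.execute("INSERT INTO Enrollments VALUES (1, 10, 2020, 'Fall')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

enrollment_count = conn.execute("SELECT COUNT(*) FROM Enrollments").fetchone()[0]
```

The same student can still take the same course again in a different semester, as the 2021 Spring row shows.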
38. Many-to-Many Relation Example
Write a query that retrieves records of enrollment for all students
ordered chronologically.
Write a query that retrieves semester-wise enrollment count for all
courses.
Write a query that displays students that have enrolled in the same
course more than once along with the number of times they had
enrolled.
Write a query to display last enrollment for all students.