Querying Data (Programming the Perl DBI)

3.3. Querying Data

The first (and possibly most immediately useful) operation that SQL allows you to perform on data is to select and return rows of data from tables stored within the database. This activity forms the core of exactly what a database represents, a large repository of searchable information.

All SQL queries, no matter how simple or complex, use the SELECT keyword to specify the columns to fetch, the tables to fetch them from, and any conditions that must be met for the rows to be retrieved. SELECT falls into the group of commands known as Data Manipulation Language , or DML, commands.

The full syntax for SELECT can be intimidating to the new user, primarily because it sports a multitude of different ways in which the query can be customized. For example, you might wish to return only unique data rows, group certain rows together, or even specify how the returned rows should be sorted.

For the moment, we'll just look at the simplest cases.

In our example, we've tended to use capital letters for SQL commands and other reserved words, and lowercase letters for database object names (tables, columns, etc.). In most databases, the SQL commands are not case-sensitive, but the actual database object names may or may not be.

3.3.1. Simple Queries

The simplest SQL query is to ask for certain columns in all rows of a table. The SELECT syntax for this form of query can be expressed as simply as:

SELECT column, column, ..., column
FROM table

or:

SELECT *
FROM table

which will query and fetch back all the columns within the specified table.

Therefore, to select some of the rows from some columns in the megaliths table, the following SQL statement can be used:

SELECT name, location, mapref
FROM megaliths

which would return the following information:

+---------------------------------------------------------------+
| name         | location                          | mapref     |
+---------------------------------------------------------------+
| Callanish I  | Callanish, Isle of Lewis          | NB 213 330 |
| Lundin Links | Lundin Links, Fife, Scotland      | NO 404 027 |
| Stonehenge   | Near Amesbury, Wiltshire, England | SU 123 400 |
| Avebury      | Avebury, Wiltshire, England       | SU 103 700 |
| Sunhoney     | Near Insch, Aberdeenshire         | NJ 716 058 |
+---------------------------------------------------------------+

So even with the simplest SQL imaginable, the inherent flexibility of the syntax allows us to easily specify exactly which information we want from the database without having to write lots of excruciating lines of code to get it.

Another aspect of the relational database methodology is now visible, in that even though the database contains information on all the columns within a particular table, only a subset of the available columns needs to be retrieved. Therefore, we can extract exactly the data we need for a particular query and no more. This is an extremely powerful feature and separates the actual data stored within the database from our desired view of that data.

3.3.2. Queries and Condition Clauses

The previous example relied on retrieving all the rows within a table, whereas the more ordinary, everyday database operations will usually require more accurate targeting of specific rows. For example, "Tell me the names of all the stone circles in Wiltshire" is a more specific query than "Tell me about all the stone circles in the database." To achieve this task, SQL provides the ability to specify conditions that must be met before a row is returned to the user.

SQL's syntax regarding condition clauses is just as straightforward and obvious as that for specifying which columns are of interest. The condition clauses that narrow the query are specified after the list of tables from which data is being retrieved, i.e., after the FROM clause and table list.

Therefore, a query that retrieves the name and location columns from rows that contain the string ``Wiltshire'' in the location column, can be written as:

SELECT name, location
FROM megaliths
WHERE location LIKE '%Wiltshire%'

The information returned from this query would be:

+--------------------------------------------------+
| name         | location                          |
+--------------------------------------------------+
| Stonehenge   | Near Amesbury, Wiltshire, England |
| Avebury      | Avebury, Wiltshire, England       |
+--------------------------------------------------+

The returned information shows just the columns specified for the sites that have a location value containing the string, ``Wiltshire.'' The WHERE keyword is the one that specifies the beginning of the list of conditions that must be met for each row to be returned. That is, the condition states that the location column value must contain the desired string ``Wiltshire.''[24]

[24]The % character, in this case, is used as the standard SQL wildcard character to match any number of characters. A few databases use * instead.

The following table lists all of the comparison operators used by SQL for testing condition clauses. These conditions are generally Perl-like and should be familiar.

Operator	Purpose
`=`	This operator tests exact equality between columns and/or literal values. For example, the query: SELECT name, location FROM megaliths WHERE location = 'Fife' will return all rows where the location column is equal to the value "Fife".
`<>`	This operator tests for inequality between columns and/or literal values. For example, the query: SELECT name, location FROM megaliths WHERE location <> 'Fife' will return all rows where the location column is not equal to the value "Fife". Some databases alternatively use the `!=`, `^=`, or `~=` operators instead of `<>`.
`>` and `<`	These two operators represent ``greater than'' and ``less than'' tests between columns and/or literal values. For example, the query: SELECT name, location FROM megaliths WHERE id < 10 AND id > 5 will return the name and location of all megalithic sites whose `id` value is less than 10 and greater than 5. The type of comparison used depends on the datatype of the values involved. So numeric values are compared as numbers, string values are compared as strings, and date values are compared as dates. On a related note, there also exists the `<=` and `>=` operators that perform ``less than or equal to'' and ``greater than or equal to'' tests, respectively.
`IN`	This keyword tests equality of a column and/or literal value within a specified set of values. For example, the query: SELECT name, location FROM megaliths WHERE location IN ( 'Western Isles', 'Fife' ) will compare each member of the set using the equality operator against the specified column. Therefore, in this example, rows with a location column value of either ``Western Isles'' or ``Fife'' will be returned.
`LIKE`	The `LIKE` operator allows limited wildcard matching of strings against columns and/or literal values. For example, the query: SELECT name, description FROM megaliths WHERE description LIKE '%Largest%' will return the name and description columns where the description column contains the string ``Largest'' at any position in the string. Wildcard matches for characters in the string may be specified by using either the percent ( % ) character for multiple-character wildcarding or an underscore ( _ ) for single-character matching.[25] [25]Some databases use other names instead of `LIKE`, such as `MATCHES` or `CONTAINS`, and may use different wildcard characters such as `*` or `?`.

[25]Some databases use other names instead of LIKE, such as MATCHES or CONTAINS, and may use different wildcard characters such as * or ?.

Over time, the megalithic database might expand and contain information on thousands of sites in the country. Therefore, to quickly locate records, we might need to narrow the search criteria by specifying other condition clauses that must be met for a record to be returned. For example, if you wished to find information on all stone circles in Wiltshire, doing a query simply for all sites in ``Wiltshire'' could return hundreds of records, which you would have to wade through by hand. We can narrow this search by specifying as an extra condition that the mapref column must also begin with the string SU 123:

SELECT name, location
FROM megaliths
WHERE location LIKE '%Wiltshire%'
AND mapref LIKE 'SU 123%'

In this example, the second condition is simply added to the end of the list of conditions that must be met. The two conditions are joined together by a logical operator, AND. This statement now reads "Give me the name and location of all megalithic sites with a location of Wiltshire that are in the map region SU 123," which would return the name and location fields for the "Stonehenge" row, but reject the "Avebury" row.

Thus, conditions can be joined together into multiple condition lists, linked by logical boolean operators that control how the truth of the condition is evaluated.

The following table describes the boolean (or logical) operators defined by SQL that can be used to chain your condition clauses together.

Operator	Function
AND	Returns the logical AND of the two clauses on either side of the keyword. The following truth table can be used to evaluate whether the combined clause is true or not.
		TRUE	FALSE	NULL
	TRUE	true	false	null
	FALSE	false	false	false
	NULL	null	false	null
OR	Returns the logical OR of the two clauses on either side of the keyword. The following truth table can be used to evaluate whether the combined clause is true or not.
		TRUE	FALSE	NULL
	TRUE	true	true	true
	FALSE	true	false	null
	NULL	true	null	null
NOT	Negates the logical value of the following expression. The following truth table illustrates this in operation.
		TRUE	FALSE	NULL
		false	true	null

The truth of the overall condition is determined by combining the truth of each element separately using the AND, OR, and NOT operators.

It is now possible to calculate the effects of multiple condition clauses in a statement. For example, the following condition clauses:

WHERE location LIKE '%Wiltshire%'
AND mapref LIKE 'SU 123%'

evaluate in the following way for this row:

+----------------------------------------------------------+
| name          | mapref     | location                    |
+----------------------------------------------------------+
| Avebury       | SU 103 700 | Avebury, Wiltshire, England |
+----------------------------------------------------------+

location LIKE '%Wiltshire%'     => TRUE
mapref LIKE 'SU 123%'           => FALSE

TRUE AND FALSE => FALSE

thereby returning a false value for that row, causing it to be rejected by the query. However, the following row:

+---------------------------------------------------------------+
| name         | mapref     | location                          |
+---------------------------------------------------------------+
| Stonehenge   | SU 123 400 | Near Amesbury, Wiltshire, England |
+---------------------------------------------------------------+

evaluates as:

location LIKE '%Wiltshire%'     => TRUE
mapref LIKE 'SU 123%'           => TRUE

TRUE AND TRUE => TRUE

which ensures that the row is returned by the query.

When combining different logical operators, it is important to consider their precedence . The precedence (or priority) of the operators determines which gets combined first. The SQL standard specifies that NOT has the highest precedence, followed by AND, and then OR. Parentheses can be used around groups of operators to change their precedence.

For example, you might wish to select all the megalithic sites in either ``Wiltshire'' or ``Fife'' for which the description of the site contains the word ``awe-inspiring.''

This query could be wrongly expressed as:

SELECT name, location
FROM megaliths
WHERE location LIKE '%Wiltshire%' OR location LIKE '%Fife%'
AND description LIKE '%awe-inspiring%'

While this query looks correct at first, it does not take into account the order in which the condition clauses are combined. It would, in fact, select the awe-inspiring sites in Fife, but it would also select all the sites in Wiltshire regardless of their type.

This happens because the AND operator has a higher precedence than the OR operator and so is evaluated first. Therefore, our SQL statement evaluates by AND combining location LIKE %Fife% and description LIKE %awe-inspiring%. It then OR combines the result of the AND operation with location LIKE %Wiltshire%. This isn't quite what we had in mind.

This query can be more correctly written by using parentheses to logically group operators within the statement.

For example:

SELECT name, location
FROM megaliths
WHERE ( location LIKE '%Wiltshire%' OR location LIKE '%Fife%' )
AND description LIKE '%awe-inspiring%'

This changes the way in which the condition clauses are evaluated by evaluating the grouped clauses into a single truth value for the entire group. This is then used instead of truth values for each individual clause within the group.

Finally, there is another even more complex way of specifying condition clauses that is used quite frequently. This technique is to supply the values with which we are doing a comparison from a subquery.[26] For example:

[26]Support for this functionality is not necessarily present in all database systems supported by the DBI and its drivers.

SELECT name, description
FROM megaliths
WHERE name IN
    ( SELECT tourist_sitename
      FROM wiltshire_tourist_sites )

If we knew in advance that the subquery would return only a single row of information, then the = operator could be used instead of IN.

3.3.3. Queries over Multiple Tables

The previous section covered the structure of SQL statements in general, and how SQL may be used to query data from single tables in the database. However, from the discussion on relational database theory earlier in this chapter, you should remember that the power and flexibility of relational database design lies in the ability to join tables together -- that is, to link disparate records of data that are held in separate tables to reduce data duplication. This linking of records is a key part of working with relational databases.

To illustrate this concept, it is time to reintroduce the other tables we shall be using in our examples, namely the media table and site_types table.

The media table contains information on where multimedia clips for given sites can be located, allowing an external application to view or listen to these clips while the user is reading the textual information on the site stored in the megaliths table.

Similarly, the site_types table contains a lookup table of the different categorizations of megalithic sites described within the database.

To specify a SELECT statement from two or more tables in the database, we simply add the table names after the FROM keyword. Therefore, a sample query to fetch all the rows in two of the tables should theoretically look something like:

SELECT name, description, location, url, content_type
FROM megaliths, media

However, the output from this query will look somewhat scrambled. For each and every row in the first table, all the rows in the second table will be selected![27] This means that the media records for ``Lundin Links'' may be returned at the same time as the site information for ``Avebury.''

[27]This is known as a "Cartesian Product." If there were 100 rows in each, you'd get 10,000 rows returned. If there were 1,000,000 rows in each, you'd get 1,000,000,000,000 rows returned. To avoid this, you should ensure that if you have n tables, you have at least n - 1 join conditions.

How do you make sure that the values in the fields from the second table related to the values from the first -- that is, that the media clips for ``Stonehenge'' are only returned with the ``Stonehenge'' site information?

In our megaliths table, we have already defined a column called id that contains a unique identifier for each and every row stored within the table. Similarly, the media table has a column called id that performs the same purpose. Furthermore, the media table also contains a column called megaliths_id. When a row is inserted into the megaliths or media tables, a unique identifier is inserted into the id columns. Also, when a row is inserted into the media table, the megaliths_id column is populated with the unique identifier of the megalithic site to which the media clip relates.

This relationship of a link field is generally termed a primary key and foreign key relationship. The primary key is the unique value stored within the ``master'' table. The foreign key is that same value stored within multiple rows of the other ``detail'' table.

We can now write a query to fetch the appropriate information back from the database by joining the two tables on their related fields. This ensures that the media clips are associated with the correct site:

SELECT name, description, url, content_type
FROM megaliths, media
WHERE megaliths.id = megaliths_id

This illustrates another aspect of SQL conditions: instead of testing arbitrary values against columns in a table, it is possible to test against the value of another column. In the above case, we test to see if the primary and foreign keys of the two tables match, and, if so, the aggregated row created from the columns of both tables is returned.

Also note how we qualified the id field name in the condition clause by prepending it with the table name and a dot. Without that, the database would not have known if we were referring to the id field in the megaliths table or the media table and would fail with an error.

Similarly, if we wished to select the id fields of both tables, the following statement would simply confuse the database, and we'd get another error:

SELECT id, id, megaliths_id
FROM megaliths, media
WHERE id = megaliths_id

Therefore, it is good practice to explicitly specify the table name that the field belongs to, in cases where it may be ambiguous. For example:

SELECT megaliths.name, megaliths.description, 
       media.url, media.content_type
FROM megaliths, media
WHERE megaliths.id = media.megaliths_id

Of course, the downside to this process is that it takes forever to type. A saner alternative is to alias the table names. To do this, simply add the alias after its name in the FROM clause. For example:

SELECT mega.name, mega.description, med.url, med.content_type
FROM megaliths mega, media med
WHERE mega.id = med.megaliths_id

It is more common just to use the initial character of the table name provided the aliases are unique.

This theory of table joins is extensible to any number of tables.[28] In fact, it is not uncommon for some tables within a database to contain nothing but columns containing foreign keys that can be used to make multi-table joins more effective. The main point to remember here is that all tables involved in the query must be joined to another table on some column. Otherwise, a large quantity of very strange results may be returned from the database!

[28]In practice, databases that support joins often have some upper limit on the number of tables.

For example, fetching both the media associated with a site and the site type information can be expressed with the following query:

SELECT mega.name, mega.description, st.site_type,
       med.url, med.description
FROM megaliths mega, media med, site_types st
WHERE mega.id = med.megaliths_id
AND mega.site_type_id = st.id

Another type of join also possible with SQL is the outer join . In addition to the results returned by a simple join, an outer join also returns the rows from one table for which no rows from the other table satisfy the join condition. This is achieved by returning NULL values for all rows in the second table that have no matching values in the first table.

For example, we might wish to retrieve information on all of the sites located within Wiltshire and, if any exist, the URLs of any multimedia clips associated with them. Using a simple join such as:

SELECT mega.name, mega.location, med.url
FROM megaliths mega, media med
WHERE mega.id = med.megaliths_id
AND mega.location LIKE '%Wiltshire%'

would return only those sites in Wiltshire that had media clips associated with them. It would exclude those sites that had no media clips. An outer join is the way to solve this problem.

The official standard way to express an outer join is by using the phrase LEFT OUTER JOIN or RIGHT OUTER JOIN between the tables to be joined, instead of a comma, and adding an ON condition_expression clause.[29]

[29]Many database systems either don't fully support outer joins or use different syntax for it. Oracle 7 outer joins, for example, look just like inner joins but have the three characters (+) appended to one side of the join condition.

For example, the standard query to retrieve the information we desired can be written as:

SELECT mega.name, mega.location, med.url
FROM megaliths mega
     LEFT OUTER JOIN media med ON mega.id = med.megaliths_id
WHERE mega.location LIKE '%Wiltshire%'

In this example, we have made a left outer join on the id and megaliths_id columns because for any sites without media clips, there are no corresponding records in the media table. The left outer join will ensure that even if no media clip records exist, at least the name and location of each and every site in Wiltshire will be returned. A right outer join in this query would have returned the values where no entries in the megaliths table existed.

Finally, it is worth mentioning some ways to make efficient table joins. In our examples, we added additional columns to our tables to make a join between them. We could have simply added a column to the media table that contained the name of the megalithic site.

There are a few good reasons why we didn't do that:

If the name of the megalithic site was updated for some reason -- for example, if a spelling mistake needed correcting -- the name contained within the media table would be out of date and incorrect. This would break the join between the two tables for that particular row.
Integer keys use much less space than strings when building indexes on a table. Less space means more index entries per block of disk space, and therefore fewer disk reads. The smaller index and fewer disk reads make up for the slightly increased size of the master data table, and usually give you both speed and space gains.
It is slower to test strings than to test numbers, especially integers. As such, in a well-designed database, integers are often used for primary and foreign keys because they are faster to test against with comparison operators. A string, on the other hand, requires testing of every character within each string, which can be time-consuming.

Therefore, to help maximize speed on queries, you can design your database to perform joins using integer columns. It is also often worth building an index on the foreign key columns of the ``detail'' tables, if your database supports such functionality.

3.3.4. Grouping and Ordering Data

Often you'll desire a little more control over how your selected data is retrieved. The two most common ways of organizing your data are to order the retrieved rows by one or more columns, or to group the retrieved rows and apply functions to the groups instead of to individual rows.

Perl is well-suited to these tasks within your program, but performing ordering and grouping via SQL will offload the task onto the database server and also will save you writing, or using, potentially suboptimal techniques for organizing the data. Therefore, generally, use SQL rather than your own application-level code.

3.3.4.1. Ordering data

Ordering the data retrieved by a SELECT statement is easy and can be achieved simply by an ORDER BY clause. This clause is always found at the end of your queries, after all the join conditions have been specified.

The ORDER BY clause is specified as a comma-separated list of columns that should be used to order the data. For example, an ORDER BY clause of:

ORDER BY name, location

would order the rows by name and, if the names of the sites are identical, the location column would be used as a secondary ordering. You can change the direction of the ordering from the default ``ascending'' order (which goes from A to Z) to a ``descending'' order by appending the DESC keyword to any field names in the ORDER BY clause.

3.3.4.2. Grouping data

The ability to group items of data is very useful when attempting to make summarized reports. SQL features a clause called GROUP BY that allows you to group rows that share a common set of values and apply group functions to them.

A good example of this operation is where you want to total the values contained within a column in a table. In this instance, you would use the sum( ) grouping function in the following way to calculate the total value of orders taken on a given date:

SELECT order_date, sum( net ), sum( vat ), sum( total )
FROM sales
GROUP BY order_date

As with ORDER BY, groupings can be chained together in a comma-separated list to create complex subgroupings of columns.