Modifying Data Within Tables (Programming the Perl DBI)

3.4. Modifying Data Within Tables

Read-only databases (that is, databases that only allow you to SELECT data from them) are very useful. Data warehouses are typically massive read-only databases populated with archived data mangled into a form suitable for reporting. However, in the cut-and-thrust world of transaction-processing databases, the ability to modify data within the database quickly and efficiently is of paramount importance.

There are several core operations that comprise the broader definition of data modification, namely:

Inserting new data into the database
Deleting data from the database
Updating, or modifying, existing data within the database

Each of these operations falls into the grouping of Data Manipulation Language commands, or DMLs, alongside SELECT.

We shall discuss each of these tasks in turn and apply the theory to our example database.

3.4.1. Inserting Data

Before a database can be really of any use, data must be inserted into it by some means, either by manual data entry or with an automated batch loading program. The action of inserting data only applies to cases in which you wish to load a completely new record of information into the database. If the record already exists and merely requires modification of a column value, the update operation should be used instead.

Data inserts in the relational database model are done on a row-by-row basis: each record or item of information that you load into the database corresponds to a brand-new row within a given existing table. As each inserted record corresponds to a new row in one table, multitable inserts are not possible.[30]

[30]Well, this is not strictly true these days, as database servers get smarter. Oracle, for example, allows inserts into equi-join views and also supports "INSTEAD OF" triggers that make just about anything possible.

The SQL INSERT keyword provides a simple mechanism for inserting new rows of data into the database. For example, assuming the megaliths table is already present in the database and and contains the six columns shown earlier in Figure 3-1, a single row of data can be inserted into it using the following SQL statement:

INSERT INTO megaliths VALUES ( 0, 'Callanish I', 
                               '"Stonehenge of the North"',
                               'Western Isles',
                               'NB 213 330', 1 )

If you then SELECT back all the rows in the table, you should be able to see the row that has just been inserted.

Just as the SELECT statement could specify which columns from a table should be returned in the query, it is also possible (and good practice) to specify into which columns of the table the values should be inserted. The unspecified columns will take the default value, typically NULL. For example, if you wished to specify only the id and name columns of the table, allowing description and location to be NULL, the SQL statement would be:

INSERT INTO megaliths ( id, name ) VALUES ( 0, 'Callanish I' )

There must be an exact mapping between the number of columns and column values specified in the SQL statement. It is also essential to make sure that the datatypes of the supplied values and the corresponding columns match.

3.4.1.1. Using INSERT for data transfers

One of the more sneaky uses for the INSERT keyword is to transfer data from one table or column to another in one easy operation. This seems to fly in the face of our previous assertion that only one row can be inserted with each INSERT statement, but in fact, follows the rules correctly (in an underhand manner).

For example, if we wanted to make a quick copy of the megaliths table into a new table called megaliths_copy, the following SQL statement can be used:

INSERT INTO megaliths_copy
    SELECT *
    FROM megaliths

This process inserts each row returned from the SELECT statement into the new table, row by row, until an exact copy is created. This feature of INSERT is extremely useful for making quick copies of tables if you need to do some destructive maintenance work on the original, such as pruning redundant data. For this SQL to work, the original table and destination table must have an identical structure.

You can further refine this operation by specifying conditions that the rows to be transferred must meet before being inserted. For example, to copy across only the rows of data for megaliths located in Wiltshire:

INSERT INTO megaliths_copy
    SELECT *
    FROM megaliths
    WHERE location LIKE '%Wiltshire%'

Furthermore, you can make extracts of data from tables into new tables by explicitly specifying the targeted columns in the new table. This is useful when building large denormalized tables for use within a data warehouse. Therefore, if we had a table called megalocations that contained two columns called name and location, we could populate this new table from the megaliths table in the following way:

INSERT INTO megalocations
    SELECT name, location
    FROM megaliths

Or, we can even select data from multiple tables for inserting. A denormalized table containing the rows coalesced from the megaliths and media tables might contain two columns, name and url. Populating this table with an INSERT statement is easy:

INSERT INTO megamedia
    SELECT name, url
    FROM megaliths, media
    WHERE megaliths.id = media.megaliths_id

However, in general, table population via INSERT statements is usually performed by batch-loading programs that generate suitable SQL statements and execute them within the database, such as Oracle's SQL*Loader. Of course, Perl is a good example of a programming language that makes loading data from a file remarkably easy via the DBI.

3.4.2. Deleting Data

Now that you have spent copious amounts of time loading data into your tables to play with, the next thing you'll want to do is tidy it up and remove redundant or unwanted data.

The DELETE keyword defined within SQL is exactly what you are looking for, providing a simple syntax for permanently removing rows of data from tables. As with the INSERT statement, deleting rows applies only to a single table at a time; therefore, if you want to remove rows that are referred to by records in other tables, you should first delete those associated foreign key records from the secondary tables. This preserves the referential integrity of your database and is known as delete cascading.[31] Some databases support cascading delete mechanisms that automate these extra deletes.

[31]An analogy for this process is removing a file on a Unix system that has several symbolic links associated with it. It is good housekeeping to remove the stale symbolic links when the target file has been removed. This also applies to Windows and Macintosh systems with shortcuts to documents.

For example, a cascading delete applied to rows in the megaliths table would also need to remove the appropriate rows in the media table where the following join condition is true:

megaliths.id = media.megaliths_id

However, DELETE statements do not have the same "single row at a time" restriction that INSERT statements suffer from. DELETE can purge a table entirely in one statement. For example, to remove all the rows within the megaliths table, we could simply write:

DELETE FROM megaliths

Of course, we may not wish to remove all the rows from a table, but only certain rows. This can be done in a familiar manner by specifying a list of conditions that the data within a row must meet for it to be removed. Therefore, if we wanted to delete all the rows of data within the megaliths table that contain sites located in ``Wiltshire,'' then the following statement would work nicely:

DELETE FROM megaliths
WHERE location LIKE '%Wiltshire%'

To remove all the rows relating to stone circles, we could narrow the criterion that a row must meet by saying that the type of the site must be equal to ``Stone Circle.'' The tighter query would read:

DELETE FROM megaliths
WHERE location LIKE '%Wiltshire%'
AND site_type_id IN (SELECT id FROM site_types
                     WHERE site_type = 'Stone Circle')

It should be noted that deleting all the rows from a table does not actually remove the table from the database. The table will be left in place, but it will have no rows in it.[32]

[32]Some databases feature a faster and more efficient way of removing all the rows from a table with the TRUNCATE TABLE keyword. But beware! In some databases, that keyword removes all indices as well.

A more powerful way to determine which rows of data to delete can be expressed by using a subquery to return the target rows. A good example of this sort of behavior is deleting the foreign keys from a table when the primary keys are being deleted. This can be broken up into two separate DELETE statements, the first removing the foreign key rows, the second removing the primary key rows. The following examples remove the rows relating to megalithic sites in ``Wiltshire'' from both the media and megaliths tables:

DELETE FROM media
WHERE megaliths_id IN (
    SELECT id
    FROM megaliths
    WHERE location LIKE '%Wiltshire%'
  )

DELETE FROM megaliths
WHERE location LIKE '%Wiltshire%'

To sum up, removing data from tables is made extremely simple (perhaps too simple!) by use of the DELETE keyword. Later in this chapter, we'll discuss the database's perspective of the deletion process in more detail, including the all-important possibility of undoing deletions that go wrong.

3.4.3. Updating Data

The final way in which modifications can be performed on data stored within tables in a database is to make in-place modifications of existing data by updating the values of particular columns in particular rows. With an UPDATE statement, rows are neither inserted nor deleted, and the structure of the table itself is not altered.

UPDATE statements are extremely powerful, in that it is possible to update multiple rows of data in one statement. If desired, the new values may be supplied by the returned values from a SELECT statement following the syntax of the INSERT command.

The most simple and useful UPDATE is to update a column of a single row within a table to a new value. For example, if you wanted to update the location of the "Avebury" row within the megaliths table, the following SQL statement would work:

UPDATE megaliths
SET location = 'Near Devizes, Wiltshire'
WHERE name = 'Avebury'

You should notice the condition clause specified in this statement. If the statement did not check for the exact name of the site, every row within the table would have had the UPDATE statement performed against it, causing a potentially disastrous data corruption. Condition clauses may be specified in exactly the same way as used in other SQL commands such as DELETE and SELECT.

UPDATE statements may also update more than one column in a single statement, by simply listing the columns we wish to update in a comma-separated list. For example, to update both the name and description fields within the megaliths table, you can write the following SQL statement:

UPDATE megaliths
SET location = 'Callanish, Isle of Lewis',
    description = 'Complex site shaped as a buckled Celtic cross'
WHERE name = 'Callanish I'

In some database systems, it is also possible to update multiple columns simultaneously by using a subquery to return a list of values from another table. Those values are then used as the new values for the specified columns. For example, if we wanted to synchronize our megalithic database with the Wiltshire Tourist Board Database to use the same names and locations, we could use the following SQL statement:

UPDATE megaliths
SET ( name, location ) =
    ( SELECT tourist_site_name, tourist_site_location
      FROM tourist_sites
      WHERE tourist_site_name LIKE '%Avebury%'
      AND tourist_site_type = 'Stone Circle' )
WHERE name = 'Avebury'

This statement would update the name and location fields within the megaliths table with values returned by a query running against another table. An important note about this technique is that the subquery must return only a single row of data, otherwise the UPDATE will fail with an error.

3.4.4. Committing and Rolling Back Modifications

So, what happens if you make a horrible mistake when you are modifying the data within your database? Is the only course of action to resign? Fear not! Some database engines have a capability known as transaction rollback that will save not only your neck, but your data as well.

The principle of rollback is quite a simple one. For each modification to rows of data within the database, a copy of the row prior to modification is written into a log that records all the modifications made. Once you have decided that these changes are indeed correct, you can opt to commit the changes to the database. If the committed changes are actually wrong, then you're in trouble: you can clear out your desk and dust off your resume.

However, if by some sheer luck you check the modified rows and see that they are wrong before you commit the changes, you can rollback the modifications you have made, returning the rows to the values they held before you started modifying them. Your job is safe.

Even better, the changes you made within the transaction were not visible to anyone else looking at the database at that time. So no one need know of your mistake, and your reputation is safe.

Most databases automatically commit data upon disconnection from the database unless a rollback is explicitly issued. Therefore, if the software that is being run is not performing suitable error checking on the modifications it is making, it may disconnect and inadvertently commit wrong data to the database. There is a moral in this -- always check for errors!

Some database systems don't have a feature as sophisticated as rollback or undo. In these cases, it is even more important that before unleashing dramatic data manipulation SQL on your database, you make a backup. Backups are always a good idea, even in databases that do support transactions.