Database Design (Managing and Using MySQL, 2nd Edition)

7.1. Database Design Primer

Suppose you have a large collection of compact discs and you want to create a database to track them. The first step is to determine what data you are going to store. One good way to start is to think about why you want to store the data in the first place. In our case, we will most likely want to look up CDs by artist, title, and song. Since we want to look up those items, we know they must be included in the database. In addition, it is often useful to simply list items that should be tracked. One possible list might include: CD title, record label, band name, song title. As a starting point, we will store the data shown in Table 7-1.

Table 7-1. A CD database made up of a single table

Band name	CD title	Record label	Songs
Stevie Wonder	Talking Book	Motown	You Are the Sunshine of My Life, Maybe Your Baby, Superstition, ...
Miles Davis Quintet	Miles Smiles	Columbia	Orbits, Circle, ...
Wayne Shorter	Speak No Evil	Blue Note	Witch Hunt, Fee-Fi-Fo-Fum
Herbie Hancock	Headhunters	Columbia	Chameleon, Watermelon Man, ...
Herbie Hancock	Maiden Voyage	Blue Note	Maiden Voyage

For brevity's sake, we have left out most of the songs. At first glance, this table seems to meet our needs since we are storing all the data we need. Upon closer inspection, however, we find several problems. Take the band named Herbie Hancock, for example. "Band name" appears twice: once for each CD. This repetition is a problem for several reasons. First, when entering data in the database, we end up typing the same name over and over. Second, and more important, if any of the data changes, we have to update it in multiple places. For example, what if "Herbie" were misspelled? We would have to update the data in each of the two rows. The same problem would occur if the name Herbie Hancock changes in the future (like Jefferson Airplane changed to Jefferson Starship). As we add more Herbie Hancock CDs to our collection, we add to the effort required to maintain data consistency.

Another problem with the single CD table lies in the way it stores songs. We are storing them in the CD table as a list of songs in a single column. We will run into all sorts of problems if we want to use this data meaningfully. Imagine having to enter and maintain that list. And what if we want to store the length of the songs as well? What if we want to perform a search by song title? It quickly becomes clear that storing the songs in this fashion is undesirable.

This is where database design comes into play. One of the main purposes of database design is to eliminate redundancy from the database. To accomplish this task, we use a technique called normalization. Before we discuss normalization, let's start with some fundamental relational database concepts: entities, attributes, and data models.

7.1.1. Database Entities

An entity is a thing or object of importance about which data must be captured. Not all "things" are entities, only those things about which you need to capture information. Information about an entity is captured in the form of attributes and/or relationships. If something is a candidate for being an entity and it has no attributes or relationships, it is not really an entity. A database entity appears in a data model as a box with a title that is the name of the entity.

7.1.2. Entity Attributes

An attribute describes information about an entity that must be captured. Each entity has zero or more attributes that describe it, and each attribute describes exactly one entity. Each entity instance (row in the table) has exactly one value, possibly NULL, for each of its attributes. An attribute value can be numeric, a character string, a date, a time, or some other basic data value type. In the first step of designing a database, logical data modeling, we do not worry about how the attributes will be stored.

TIP: NULL provides the basis for dealing with missing information. It is specifically used for cases in which you lack a certain piece of information. As an example, consider a CD that does not list the song lengths of each of its tracks. Each song has a length, but you cannot tell from the case what that length is. You do not want to store the length as zero, since that would be incorrect. Instead, you store the length as NULL. If you are thinking you could store it as zero and use zero to mean "unknown," you are falling into one of the same traps that led to one of the Y2K problems. Not only did old systems store years as two digits, but they often gave a special meaning to 9-9-99, assuming it was safe to do that in the expectation that the system would be rewritten long before that date was ever reached.

Our example database refers to a number of things: CD titles, band names, songs, and record labels. Which of these are entities and which are attributes?

7.1.3. Data Model

A data model is a diagram of our database design. It documents and communicates how the database is structured.

Notice that we capture several pieces of data (CD title, band name, etc.) about each CD, and we absolutely cannot describe a CD without those items. CD is therefore one of those things we want to capture data about and is likely to be an entity. To start a data model, we will diagram CD as an entity. Figure 7-1 shows our sole entity as a data model.

Figure 7-1. The CD entity in a data model

By common entity naming conventions, an entity name must be singular. We therefore call the table where we store CDs CD and not CDs. We use this convention because each entity names an instance. For example, "San Francisco 49ers" is an instance of "Football Team," not "Football Teams."

At first glance, it appears that the rest of the database describes a CD. This would seem to indicate that they are attributes of CD. Figure 7-2 adds them to the CD entity in Figure 7-1. In a data model, attributes appear as names listed in their entity's box.

Figure 7-2. The CD entity with its attributes

This diagram is simple, but we are not done yet. In fact, we have only just begun. Earlier, we discussed how the purpose of data modeling is to eliminate redundancy using a technique called normalization. We have a nice diagram for our database, but we have not gotten rid of the redundancy as we set out to do. It is now time to normalize our database.