Programming Languages (Running Linux)

In this chapter, we'll explore the Linux programming environment and give you a five-cent tour of the many facilities it provides. Half of the trick to Unix programming is knowing what tools are available and how to use them effectively. Often the most useful features of these tools are not obvious to new users.

Since C programming has been the basis of most large projects (even though it is nowadays being replaced more and more by C++) and is the language common to most modern programmers--not only on Unix, but on many other systems as well--we'll start out telling you what tools are available for that. The first few sections of the chapter assume you are already a C programmer.

But several other tools are emerging as important resources, especially for system administration. We'll examine two in this chapter: Perl and Tcl/Tk. They are both scripting languages like the Unix shells, taking care of grunt work like memory allocation, so you can concentrate on your task. But both Perl and Tcl/Tk offer a degree of sophistication that makes them more powerful than shell scripts and appropriate for many programming tasks.

13.1. Programming with gcc

The C programming language is by far the most used in Unix software development. Perhaps this is because the Unix system itself was originally developed in C; it is the native tongue of Unix. Unix C compilers have traditionally defined the interface standards for other languages and tools, such as linkers, debuggers, and so on. Conventions set forth by the original C compilers have remained fairly consistent across the Unix programming board. To know the C compiler is to know the Unix system itself. Before we get too abstract, let's get to details.

The GNU C compiler, gcc, is one of the most versatile and advanced compilers around. Unlike other C compilers (such as those shipped with the original AT&T or BSD distributions or from various third-party vendors), gcc supports all the modern C standards currently in use--such as the ANSI C standard--as well as many extensions specific to gcc itself. Happily, however, gcc provides features to make it compatible with older C compilers and older styles of C programming. There is even a tool called protoize that can help you write function prototypes for old-style C programs.

gcc is also a C++ compiler. For those of you who prefer the obscure object-oriented environment, C++ is supported with all of the bells and whistles--including AT&T 3.0 C++ features, such as method templates. Complete C++ class libraries are provided as well, such as the iostream library familiar to many programmers.

For those with a taste for the particularly esoteric, gcc also supports Objective-C, an object-oriented C spinoff that never gained much popularity. But the fun doesn't stop there, as we'll see.

There's also a new kid on the block, egcs. egcs is not a completely new compiler, but is based on gcc. egcs has some advanced optimization features and is especially strong when it comes to newer C++ features like templates and namespaces. If you are going to do serious C++ programming, you will probably want to check it out. Alas, there is a problem with Version 2.0.x kernels that prevents them from being compiled with egcs. Newer kernels from the 2.1.x and those from the 2.2.x don't have this problem. But because of this, some distributors have opted to include the traditional gcc for C compilation and egcs for C++. You can read more about egcs at http://egcs.cygnus.com.

The Free Software Foundation has recently announced that egcs will become their default compiler, thus replacing egcs' own ancestor gcc.

In this section, we're going to cover the use of gcc to compile and link programs under Linux. We assume you are familiar with programming in C/C++, but we don't assume you're accustomed to the Unix programming environment. That's what we'll introduce here.

13.1.1. Quick Overview

Before imparting all of the gritty details of gcc itself, we're going to present a simple example and walk through the steps of compiling a C program on a Unix system.

Let's say you have the following bit of code, an encore of the much-overused "Hello, World!" program (not that it bears repeating):

#include <stdio.h>
int main() {
  (void)printf("Hello, World!\n");
  return 0; /* Just to be nice */
}

To compile this program into a living, breathing executable, there are several steps. Most of these steps can be accomplished through a single gcc command, but the specifics are left for later in the chapter.

First, the gcc compiler must generate an object file from this source code. The object file is essentially the machine-code equivalent of the C source. It contains code to set up the main( ) calling stack, a call to the mysterious printf( ) function, and code to return the value of 0.

The next step is to link the object file to produce an executable. As you might guess, this is done by the linker. The job of the linker is to take object files, merge them with code from libraries, and spit out an executable. The object code from the previous source does not make a complete executable. First and foremost, the code for printf() must be linked in. Also, various initialization routines, invisible to the mortal programmer, must be appended to the executable.

Where does the code for printf() come from? Answer: the libraries. It is impossible to talk for long about gcc without making mention of them. A library is essentially a collection of many object files, including an index. When searching for the code for printf(), the linker looks at the index for each library it's been told to link against. It finds the object file containing the printf() function and extracts that object file (the entire object file, which may contain much more than just the printf() function) and links it to the executable.

In reality, things are more complicated than this. As we have said, Linux supports two kinds of libraries: static and shared. What we have described in this example are static libraries: libraries where the actual code for called subroutines is appended to the executable. However, the code for subroutines such as printf() can be quite lengthy. Because many programs use common subroutines from the libraries, it doesn't make sense for each executable to contain its own copy of the library code. That's where shared libraries come in.

With shared libraries, all of the common subroutine code is contained in a single library "image file" on disk. When a program is linked with a shared library, stub code is appended to the executable, instead of actual subroutine code. This stub code tells the program loader where to find the library code on disk, in the image file, at runtime. Therefore, when our friendly "Hello, World!" program is executed, the program loader notices that the program has been linked against a shared library. It then finds the shared library image and loads code for library routines, such as printf(), along with the code for the program itself. The stub code tells the loader where to find the code for printf() in the image file.

Even this is an oversimplification of what's really going on. Linux shared libraries use jump tables that allow the libraries to be upgraded and their contents to be jumbled around, without requiring the executables using these libraries to be relinked. The stub code in the executable actually looks up another reference in the library itself--in the jump table. In this way, the library contents and the corresponding jump tables can be changed, but the executable stub code can remain the same.

But don't allow yourself to be befuddled by all this abstract information. In time, we'll approach a real-life example and show you how to compile, link, and debug your programs. It's actually very simple; most of the details are taken care of for you by the gcc compiler itself. However, it helps to have an understanding of what's going on behind the scenes.

13.1.2. gcc Features

gcc has more features than we could possibly enumerate here. Later, we present a short list and refer the curious to the gcc manual page and Info document, which will undoubtedly give you an eyeful of interesting information about this compiler. Later in this section, we'll give you a comprehensive overview of the most useful gcc features to get you started. This in hand, you should be able to figure out for yourself how to get the many other facilities to work to your advantage.

For starters, gcc supports the "standard" C syntax currently in use, specified for the most part by the ANSI C standard. The most important feature of this standard is function prototyping. That is, when defining a function foo(), which returns an int and takes two arguments, a (of type char *) and b (of type double), the function may be defined like this:

int foo(char *a, double b) {  
  /* your code here... */  
}

This is in contrast to the older, nonprototype function definition syntax, which looks like:

int foo(a, b) 
char *a;  
double b;  
{ 
  /* your code here... */  
}

which is also supported by gcc. Of course, ANSI C defines many other conventions, but this is the one most obvious to the new programmer. Anyone familiar with C programming style in modern books, such as the second edition of Kernighan and Ritchie's The C Programming Language, can program using gcc with no problem. (C compilers shipped on some other Unix systems do not support ANSI features such as prototyping.)

The gcc compiler boasts quite an impressive optimizer. Whereas most C compilers allow you to use the single switch -O to specify optimization, gcc supports multiple levels of optimization. At the highest level of optimization, gcc pulls tricks out of its sleeve such as allowing code and static data to be shared. That is, if you have a static string in your program such as Hello, World!, and the ASCII encoding of that string happens to coincide with a sequence of instruction code in your program, gcc allows the string data and the corresponding code to share the same storage. How's that for clever?

Of course, gcc allows you to compile debugging information into object files, which aids a debugger (and hence, the programmer) in tracing through the program. The compiler inserts markers in the object file, allowing the debugger to locate specific lines, variables, and functions in the compiled program. Therefore, when using a debugger, such as gdb (which we'll talk about later in the chapter), you can step through the compiled program and view the original source text simultaneously.

Among the other tricks offered by gcc is the ability to generate assembly code with the flick of a switch (literally). Instead of telling gcc to compile your source to machine code, you can ask it to stop at the assembly-language level, which is much easier for humans to comprehend. This happens to be a nice way to learn the intricacies of protected-mode assembly programming under Linux: write some C code, have gcc translate it into assembly language for you, and study that.

gcc includes its own assembler (which can be used independently of gcc), just in case you're wondering how this assembly-language code might get assembled. In fact, you can include inline assembly code in your C source, in case you need to invoke some particularly nasty magic but don't want to write exclusively in assembly.

13.1.3. Basic gcc Usage

By now, you must be itching to know how to invoke all these wonderful features. It is important, especially to novice Unix and C programmers, to know how to use gcc effectively. Using a command-line compiler such as gcc is quite different from, say, using a development system such as Borland C under MS-DOS. Even though the language syntax itself is similar, the methods used to compile and link programs are not at all the same.

Let's return to our innocent-looking "Hello, World!" example. How would you go about compiling and linking this program?

The first step, of course, is to enter the source code. This is accomplished with a text editor, such as Emacs or vi. The would-be programmer should enter the source code and save it in a file named something like hello.c. (As with most C compilers, gcc is picky about the filename extension; that is, how it can distinguish C source from assembly source from object files, and so on. The .c extension should be used for standard C source.)

To compile and link the program to the executable hello, the programmer would use the command:

papaya$ gcc -o hello hello.c

and (barring any errors), in one fell swoop, gcc compiles the source into an object file, links against the appropriate libraries, and spits out the executable hello, ready to run. In fact, the wary programmer might want to test it:

papaya$ ./hello 
Hello, World! 
papaya$

As friendly as can be expected.

Obviously, quite a few things took place behind the scenes when executing this single gcc command. First of all, gcc had to compile your source file, hello.c, into an object file, hello.o. Next, it had to link hello.o against the standard libraries and produce an executable.

By default, gcc assumes that not only do you want to compile the source files you specify but also that you want them linked together (with each other and with the standard libraries) to produce an executable. First, gcc compiles any source files into object files. Next, it automatically invokes the linker to glue all the object files and libraries into an executable. (That's right, the linker is a separate program, called ld, not part of gcc itself--although it can be said that gcc and ld are close friends.) gcc also knows about the "standard" libraries used by most programs and tells ld to link against them. You can, of course, override these defaults in various ways.

You can pass multiple filenames in one gcc command, but on large projects you'll find it more natural to compile a few files at a time and keep the .o object files around. If you want only to compile a source file into an object file and forego the linking process, use the -c switch with gcc, as in:

papaya$ gcc -c hello.c

This produces the object file hello.o and nothing else.

By default, the linker produces an executable named, of all things, a.out. By using the -o switch with gcc, you can force the resulting executable to be named something different, in this case, hello. This is just a bit of left-over gunk from early implementations of Unix, and nothing to write home about.

13.1.4. Using Multiple Source Files

The next step on your path to gcc enlightenment is to understand how to compile programs using multiple source files. Let's say you have a program consisting of two source files, foo.c and bar.c. Naturally, you would use one or more header files (such as foo.h) containing function declarations shared between the two programs. In this way, code in foo.c knows about functions in bar.c, and vice versa.

To compile these two source files and link them together (along with the libraries, of course) to produce the executable baz, you'd use the command:

papaya$ gcc -o baz foo.c bar.c

This is roughly equivalent to the three commands:

papaya$ gcc -c foo.c 
papaya$ gcc -c bar.c 
papaya$ gcc -o baz foo.o bar.o

gcc acts as a nice frontend to the linker and other "hidden" utilities invoked during compilation.

Of course, compiling a program using multiple source files in one command can be time consuming. If you had, say, five or more source files in your program, the gcc command in the previous example would recompile each source file in turn before linking the executable. This can be a large waste of time, especially if you only made modifications to a single source file since last compilation. There would be no reason to recompile the other source files, as their up-to-date object files are still intact.

The answer to this problem is to use a project manager such as make. We'll talk about make later in the chapter, in the section "Section 13.2, "Makefiles"."

13.1.5. Optimizing

Telling gcc to optimize your code as it compiles is a simple matter; just use the -O switch on the gcc command line:

papaya$ gcc -O -o fishsticks fishsticks.c

As we mentioned not long ago, gcc supports different levels of optimization. Using -O2 instead of -O will turn on several "expensive" optimizations that may cause compilation to run more slowly but will (hopefully) greatly enhance performance of your code.

You may notice in your dealings with Linux that a number of programs are compiled using the switch -O6 (the Linux kernel being a good example). The current version of gcc does not support optimization up to -O6, so this defaults to (presently) the equivalent of -O2. However, -O6 is sometimes used for compatibility with future versions of gcc to ensure that the greatest level of optimization is used.

13.1.6. Enabling Debugging Code

The -g switch to gcc turns on debugging code in your compiled object files. That is, extra information is added to the object file, as well as the resulting executable, allowing the program to be traced with a debugger such as gdb (which we'll get to later in the chapter--no worries). The downside to using debugging code is that it greatly increases the size of the resulting object files. It's usually best to use -g only while developing and testing your programs and to leave it out for the "final" compilation.

Happily, debug-enabled code is not incompatible with code optimization. This means that you can safely use the command:

papaya$ gcc -O -g -o mumble mumble.c

However, certain optimizations enabled by -O or -O2 may cause the program to appear to behave erratically while under the guise of a debugger. It is usually best to use either -O or -g, not both.

13.1.7. More Fun with Libraries

Before we leave the realm of gcc, a few words on linking and libraries are in order. For one thing, it's easy for you to create your own libraries. If you have a set of routines you use often, you may wish to group them into a set of source files, compile each source file into an object file, and then create a library from the object files. This saves you having to compile these routines individually for each program you use them in.

Let's say you have a set of source files containing oft-used routines, such as:

float square(float x) { 
  /* Code for square()... */
}

int factorial(int x, int n) {
  /* Code for factorial()... */
}

and so on (of course, the gcc standard libraries provide analogs to these common routines, so don't be misled by our choice of example). Furthermore, let's say that the code for square() is in the file square.c and that the code for factorial() is in factorial.c. Simple enough, right?

To produce a library containing these routines, all that you do is compile each source file, as so:

papaya$ gcc -c square.c factorial.c

which leaves you with square.o and factorial.o. Next, create a library from the object files. As it turns out, a library is just an archive file created using ar (a close counterpart to tar). Let's call our library libstuff.a and create it this way:

papaya$ ar r libstuff.a square.o factorial.o

When updating a library such as this, you may need to delete the old libstuff.a, if it exists. The last step is to generate an index for the library, which enables the linker to find routines within the library. To do this, use the ranlib command, as so:

papaya$ ranlib libstuff.a

This command adds information to the library itself; no separate index file is created. You could also combine the two steps of running ar and ranlib by using the s command to ar :

papaya$ ar rs libstuff.a square.o factorial.o

Now you have libstuff.a, a static library containing your routines. Before you can link programs against it, you'll need to create a header file describing the contents of the library. For example, we could create libstuff.h with the contents:

/* libstuff.h: routines in libstuff.a */
extern float square(float);
extern int factorial(int, int);

Every source file that uses routines from libstuff.a should contain an #include "libstuff.h" line, as you would do with standard header files.

Now that we have our library and header file, how do we compile programs to use them? First of all, we need to put the library and header file somewhere the compiler can find them. Many users place personal libraries in the directory lib in their home directory, and personal include files under include. Assuming we have done so, we can compile the mythical program wibble.c using the command:

papaya$ gcc -I..
/include -L..
/lib -o wibble wibble.c -lstuff

The -I option tells gcc to add the directory .. /include to the include path it uses to search for include files. -L is similar, in that it tells gcc to add the directory .. /lib to the library path.

The last argument on the command line is -lstuff, which tells the linker to link against the library libstuff.a (wherever it may be along the library path). The lib at the beginning of the filename is assumed for libraries.

Any time you wish to link against libraries other than the standard ones, you should use the -l switch on the gcc command line. For example, if you wish to use math routines (specified in math.h), you should add -lm to the end of the gcc command, which links against libm. Note, however, that the order of -l options is significant. For example, if our libstuff library used routines found in libm, you must include -lm after -lstuff on the command line:

papaya$ gcc -Iinclude -Llib -o wibble wibble.c -lstuff -lm

This forces the linker to link libm after libstuff, allowing those unresolved references in libstuff to be taken care of.

Where does gcc look for libraries? By default, libraries are searched for in a number of locations, the most important of which is /usr/lib. If you take a glance at the contents of /usr/lib, you'll notice it contains many library files--some of which have filenames ending in .a, others ending in .so.version. The .a files are static libraries, as is the case with our libstuff.a. The .so files are shared libraries, which contain code to be linked at runtime, as well as the stub code required for the runtime linker (ld.so) to locate the shared library.

At runtime, the program loader looks for shared library images in several places, including /lib. If you look at /lib, you'll see files such as libc.so.5.4.47. This is the image file containing the code for the libc shared library (one of the standard libraries, which most programs are linked against).

By default, the linker attempts to link against shared libraries. However, there are several cases in which static libraries are used. If you enable debugging code with -g, the program will be linked against the static libraries. You can also specify that static libraries should be linked by using the -static switch with gcc.

13.1.7.1. Creating shared libraries

Now that you know how to create and use static libraries, it's very easy to make the step to shared libraries. Shared libraries have a number of advantages. They reduce memory consumption if used by more than one process, and they reduce the size of the executable. Furthermore, they make developing easier: when you use shared libraries and change some things in a library, you do not need to recompile and relink your application each time. You need to relink only if you make incompatible changes, such as adding arguments to a call or changing the size of a struct.

Before you start doing all your development work with shared libraries, though, be warned that debugging with them is slightly more difficult than with static libraries, because the debugger usually used on Linux, gdb, has some problems with shared libraries.

Code that goes into a shared library needs to be position independent. This is a just a convention for object code that makes it possible to use the code in shared libraries. You make gcc emit position-independent code by passing it one of the command-line switches -fpic or -fPIC (the former is preferred, unless the modules have grown so large that the relocatable code table is simply too small in which case the compiler will emit an error message, and you have to use -fPIC. To repeat our example from the last section:

papaya$ gcc -c -fPIC square.c factorial.c

This being done, it is just a simple step to generate a shared library:[47]

[47]In the ancient days of Linux, creating a shared library was a daunting task that even wizards were afraid of. The advent of the ELF object-file format a few years ago has reduced this task to picking the right compiler switch. Things sure have improved!

papaya$ gcc -shared -o libstuff.so square.o factorial.o

Note the compiler switch -shared. There is no indexing step as with static libraries.

Using our newly created shared library is even simpler. The shared library doesn't require any change to the compile command:

papaya$ gcc -I../include -L../lib -o wibble wibble.c -lstuff -lm

You might wonder what the linker does if there is a shared library libstuff.so and a static library llibstuff.a available. In this case, the linker always picks the shared library. To make it use the static one, you will have to name it explicitly on the command line:

papaya$ gcc -I../include -L../lib -o wibble wibble.c libstuff.a -lm

Another very useful tool for working with shared libraries is ldd. It tells you which shared libraries an executable program uses. Here's an example:

papaya$ ldd wibble
        libstuff.so => libstuff.so (0x400af000)
        libm.so.5 => /lib/libm.so.5 (0x400ba000)
        libc.so.5 => /lib/libc.so.5 (0x400c3000)

The three fields in each line are the name of the library, the full path to the instance of the library that is used, and where in the virtual address space the library is mapped to.

If ldd outputs not found for a certain library, you are in trouble and won't be able to run the program in question. You will have to search for a copy of that library. Perhaps it is a library shipped with your distribution that you opted not to install, or it is already on your hard disk, but the loader (the part of the system that loads every executable program) cannot find it.

In the latter situation, try locating the libraries yourself and find out whether they're in a nonstandard directory. By default, the loader looks only in /lib and /usr/lib. If you have libraries in another directory, create an environment variable LD_LIBRARY_PATH and add the directories separated by colons.

13.1.8. Using C++

If you prefer object-oriented programming, gcc provides complete support for C++ as well as Objective-C. There are only a few considerations you need to be aware of when doing C++ programming with gcc.

First of all, C++ source filenames should end in the extension .C or .cc. This distinguishes them from regular C source filenames, which end in .c.

Second, the g++ shell script should be used in lieu of gcc when compiling C++ code. g++ is simply a shell script that invokes gcc with a number of additional arguments, specifying a link against the C++ standard libraries, for example. g++ takes the same arguments and options as gcc.

If you do not use g++, you'll need to be sure to link against the C++ libraries in order to use any of the basic C++ classes, such as the cout and cin I/O objects. Also be sure you have actually installed the C++ libraries and include files. Some distributions contain only the standard C libraries. gcc will be able to compile your C++ programs fine, but without the C++ libraries, you'll end up with linker errors whenever you attempt to use standard objects.

Chapter 13. Programming Languages

Contents: