Chapter 14. Extending MySQLWithin the spectrum of relational database management systems, MySQL has always occupied a unique position. Its open source nature and fast performance have made it the most popular database server for small-to-medium web-based applications. However, its lack of support for features such as transactions, foreign keys, row-level locking, and stored procedures have made many pass over it for other applications, though its performance would otherwise be beneficial. As MySQL has matured over time, various solutions have emerged for all these problems. Through alternative table types such as Berkeley DB and InnoDB, MySQL now supports transactions and row-level locking. Thanks to the InnoDB table type, the MySQL team has added support for foreign keys. One of MySQL's virtues is that it is intended to be extended. For example, stored procedures remain the most requested Oracle/SQL Server/Sybase feature lacking in MySQL. Though MySQL currently has no support for stored procedures (it probably will by the time you read this book), you can extend MySQL by writing your own user-defined functions (UDFs). In this chapter, we examine how to write your own functions and add character sets to MySQL. 14.1. User-Defined FunctionsUDFs are simply native programming functions. MySQL links to the libraries containing them and executes them as SQL functions in the same way you have used the PASSWORD( ) and NOW( ) functions. MySQL gives these functions access to the internals of MySQL and empowers them to manipulate data. You will generally write UDFs in C. The examples in this chapter are all in C. However, because MySQL accesses your functions through shared libraries, you can write them in any language that you can compile into a native-code shared library. In other words, you can write a UDF in C, C++, Objective C, C#, Java, VisualBasic, or even Pascal. MySQL supports two types of UDFs: standard and aggregate. MySQL applies standard functions to each row in the result set. PASSWORD( ) and DAYNAME( ) are examples of two common standard functions built into MySQL. If you execute the SQL in the mysql database: SELECT UPPER(User) FROM user; you will see the name of each user in your database in all capital letters. An aggregate function operates on groups of rows. You generally execute aggregate functions in conjunction with the SQL GROUP BY clause. This clause causes MySQL to group result sets into sections, in which each section is a single row. An aggregate function operates on all of the values of each group. AVG( ) and MAX( ) are examples of two common aggregate functions built into MySQL: SELECT AVG(age) FROM People GROUP BY city; The result of this query is a single column containing the average of all age values for people who live in the same city. You create aggregate and standard functions in much the same manner—aggregate functions simply have two extra routines. 14.1.1. Standard FunctionsCreate a standard UDF in three steps, two of which are optional:
When an SQL query calls a standard function, these routines are called in the following manner: init main main main ... deinit MySQL calls the main routine once for each row returned by the SQL query. The return values of the main routine comprise the results of the UDF as a whole. When creating a UDF, all three routines must reside in the same library. You can, however, bundle multiple functions in the same library. TIP: We use the terms "routine" and "function" in a specific manner in this section to reduce the possibility of confusion. The problem is that multiple native functions make up a single UDF. When we refer to a routine, we mean one of the native functions that make up the UDF. When we refer to a function, we mean the UDF itself. 14.1.1.1. The init routineThe init routine is the first routine called when a user attempts to execute a UDF within a SQL query. For a function with the name remove_html( ), the init routine must have the name remove_html_init( ). The init function must have the following signature: my_bool func_name_init(UDF_INIT *initid, UDF_ARGS *args, char *message); The three parameters passed to the init routine are all pointers to modifiable data that will be made available to the main function.
The return value of the init routine is a my_bool (Boolean) type. If the initialization completes successfully, the routine should return 0 (false). If an error occurs, the routine should return 1 (true). In addition, in the case of an error, a user-level error message should be placed into the message argument. 14.1.1.2. The main routineThe main routine is the heart of the UDF. In fact, if no special initialization or cleanup is required, this routine is the only one needed to create a UDF. For a function with the name remove_html( ), the main routine should have the same name: remove_html( ). The signature of the main routine depends on the return value of the UDF. There are three different SQL return values possible for UDFs: STRING, INTEGER, and REAL. The main routine for each of these types has the following signatures. For a STRING: char *func_name(UDF_INIT *initid, UDF_ARGS *args, char *result, unsigned long *length, char *is_null, char *error) For an INTEGER: long long func_name(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *error) For a REAL: double func_name(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *error) All the different routine signatures handle incoming arguments the same way. The differences lie in how each routine returns the result. We will consider the four parameters that are identical among the signatures first:
There are two additional arguments specific to UDF that return STRING results:
The return value of the main routine depends upon the return value of the UDF as a whole. For INTEGER_RETURN functions, the main routine should return the return value of the function as a long long integer value. For REAL_RETURN functions, the main routine should return the return value of the function as a double value. For STRING_RETURN functions, the main routine should return the return value of the function as a pointer to a character array. As mentioned above, if the resulting value is less than 255 characters, this pointer should be the result pointer passed into the main routine. Otherwise, it should be a pointer allocated within the init routine. 14.1.1.3. The deinit routineThe deinit routine is responsible for freeing any memory allocated by the init routine and performing any other necessary cleanup. For a UDF with the name remove_html( ), the deinit routine must have the name remove_html_deinit( ). It has the following signature: void func_name_deinit(UDF_INIT *initid) The deinit routine takes one argument, the main UDF_INIT structure used by all three routines. This structure contains the generic ptr pointer that is available for generic use throughout the routines. If this pointer was allocated during the init routine, this deinit routine is responsible for freeing that memory. The deinit routine returns no value. Example 14-1 shows a UDF that returns the contents of an image file, given the filename of that image. This function could be used to efficiently access image data without storing that data as a BLOB in the database. Example 14-1. A UDF that returns the contents of an image file based on its filename#include <mysql/my_global.h> #include <mysql/my_sys.h> #include <mysql/mysql.h> #include <mysql/m_ctype.h> #include <mysql/m_string.h> #define MAX_IMAGE_SIZE 1048576 #define MAX_FILE_NAME 1024 /* Our 'init' routine is responsible for some simple set-up: - Check the argument count. This function must have exactly one argument, the filename, so we return an error if there are any other number. - Coerce the types of the arguments. Theoretically, the image names could be stored as integer IDs (and thus stored in files with simple numbers for names in the filesystem). Therefore, we want to allow any type of argument, and coerce it to a string. - Allocate the data buffer to hold the image data. This is a fairly large buffer to accommodate images of differing sizes. */ my_bool getImage_init(UDF_INIT *initid, UDF_ARGS *args, char *message) { if (args->arg_count != 1) { strmov(message,"Usage: getImage( fileName )"); return 1; } args->arg_type[0] = STRING_RESULT; if ( !(initid->ptr = (char *) malloc( sizeof(char) * MAX_IMAGE_SIZE ) ) ) { strmov(message, "Couldn't allocate memory!"); return 1; } bzero( initid->ptr, sizeof(char) * MAX_IMAGE_SIZE ); return 0; } /* This routine frees the memory allocated by the image buffer */ void getImage_deinit(UDF_INIT *initid) { if (initid->ptr) free(initid->ptr); } /* The main routine does most of the work for this function. The argument passed to the function is the filename of an image file. This function first uses 'chdir' to set the directory to the directory of the image file. This prevents misuse of this function by using it to read an arbitrary file on the filesystem. (We also remember the old working directory so that we can reset it at the end.) Once we've set the current directory, we open the image file using the Filename passed in as the argument. If the image opens successfully, we read the contents of the file into the empty data buffer. Before returning the data, we set the length pointer to tell MySQL how Large our image buffer is. Finally, we return the pointer to the data as the string result of this function. */ char *getImage(UDF_INIT *initid, UDF_ARGS *args, char *result, unsigned long *length, char *is_null, char *error) { FILE *img_file; size_t img_size; char filename[MAX_FILE_NAME]; char old_dir[MAX_FILE_NAME]; bzero(filename, MAX_FILE_NAME); bzero(old_dir, MAX_FILE_NAME); getcwd(old_dir, MAX_FILE_NAME); chdir("/path/to/my/pictures"); strncpy( filename, args->args[0], args->lengths[0] ); img_file = fopen( filename, "r" ); if (! img_file ) { *is_null = 1; return 0; } img_size = fread( (void *)initid->ptr, sizeof(char), MAX_IMAGE_SIZE, img_file ); *length = (unsigned long)img_size; chdir(old_dir); fclose(img_file); return initid->ptr; } 14.1.2. Aggregate FunctionsAs mentioned earlier, aggregate functions work just like standard functions except that they have a couple of extra routines. Those routines drive the aggregation for each group of data:
The order in which routines are called illustrates the differences between standard and aggregate functions: init reset add add add ... main reset add add add ... main reset add add add ... main ... deinit While a standard function is comprised of a single loop, an aggregate function is a loop of loops. For a standard function, the main routine is called for each row of data, which corresponds directly with each row of the result set. For an aggregate function, the rows of the result set are made up of groups, in which each row represents a group of rows of data. The add routine is called for each row of data (except the first), while the main routine is called only once for each group (resulting in one row of the result set). While the signatures of the routines called within a standard function do not differ for an aggregate function, their roles are somewhat different. 14.1.2.1. initUnlike a standard function, the init function is almost always used for aggregate functions. The initid->ptr generic pointer plays an important role, which it does not always do for other UDFs. This pointer is initialized to store data about a single group of data. As the rows are processed by the add routine, the data structure is populated. When the main routine is finally called for each group, it accesses the populated structure to perform the desired calculation. All the other abilities of the init routine are also available for aggregate functions, such as forcing the arguments' types and sizes and indicating the return type. 14.1.2.2. resetThe reset routine works as a sort of init routine that is called once per group. For a function named calc_interest( ), the reset routine must be named calc_interest_reset( ). It has the following signature: void func_name_reset(UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error ) The arguments are those already covered under the main routine. The reset routine returns no value. At the beginning of each group, this routine is called to clear out any data that is used on per-group data. This is generally stored in the initid->ptr pointer as a data structure that contains the aggregated group data. This routine should zero out or otherwise clear this structure to prepare for a new group of data. Because of an oddity in the MySQL UDF engine, the add routine does not get called for the first row of data in every group. Therefore, if your aggregate function needs to act upon every row of data (and most will), it is necessary to explicitly call the add routine from the reset routine. This is because this first row of data is passed to the reset routine, which can then pass it to the add routine. For each subsequent row of the group, the add routine will be called directly. 14.1.2.3. addThe add routine is called on every row of data except the first within every group of the SQL query. For a function named calc_interest( ), the add routine must be named calc_interest_add( ). It has the following signature: void func_name_add(UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error ) The arguments are those already covered under the main routine. The add routine returns no value. The purpose of the add routine is to aggregate the data from the rows that comprise each group specified in the SQL query. This is generally done by taking the data from the rows and accumulating it in a data structure (in some matter that makes sense to the operation). The data structure is usually stored in the initid->ptr pointer, allocated by the init routine. 14.1.2.4. mainAs mentioned earlier, the purpose of the main routine is significantly different from that of a standard UDF. Instead of operating on single rows of data, the main routine operates on aggregated data collected by the add routine. This data is generally stored in a data structure referenced by the initid->ptr pointer, allocated by the init routine. The main routine takes the data from this structure and performs any calculation needed on the aggregated data. The main routine returns the result in the same manner as a standard UDF. 14.1.2.5. deinitThe deinit routine works identically to that of a standard function. It will almost always exist for an aggregate function, because there is almost always an allocated initid->ptr structure to free. 14.1.2.6. Aggregate exampleExample 14-2 is a UDF that calculates the average distance from the center of a range of points. This function takes two arguments, which correspond to the x and y coordinates of points on a graph. The function then aggregates all the points given to it by the SQL query and returns the distance from the center of the graph (x=0, y=0) to the average of all the points. Example 14-2. An aggregate UDF that calculates the distance from the center of a graph#include <mysql/my_global.h> #include <mysql/my_sys.h> #include <mysql/mysql.h> #include <mysql/m_ctype.h> #include <mysql/m_string.h> /* This is the structure that defines the aggregate of all of the points * It contains the sum of all 'x' values, the sum of all 'y' values, * and the total number of points counted (for a single group). */ struct point { double x; double y; long count; }; /* Our 'init' routine has several duties, this time: - Check the argument count. This function must have exactly two arguments, x and y, so we return an error if there are any other number. - Coerce the types of the arguments. Since we are dealing with decimal values, we don't want to penalize the user for passing in integers, or even numeric values stored within a string, so we coerce the incoming arguments into a 'real' type. - Set the 'maybe_null' flag to zero. This function will never return a null value. Maybe a 'zero' value, but not null. - Set the 'decimals' flag to 5 and the 'max_length' flag to 10. This will cause the results to be in the format +/-###.##### with a guarantee of 5 decimal places and up to three integer places. - Allocate the point structure. This structure will be used to house the aggregate point data as we accumulate it. */ my_bool distanceFromCenter_init( UDF_INIT* initid, UDF_ARGS* args, char* message ) { /* Check number of arguments */ if (args->arg_count != 2) { strmov(message,"Usage: distanceFromCenter( x, y )"); return 1; } /* Coerce arguments to 'real' */ args->arg_type[0] = REAL_RESULT; args->arg_type[1] = REAL_RESULT; /* Set flags */ initid->maybe_null = 0; initid->decimals = 5; initid->max_length = 10; /* Allocate pointer */ initid->ptr = malloc( sizeof( struct point ) ); if (! initid->ptr ) { strmov(message, "Cannot allocate memory for 'point' structure"); return 1; } return 0; } /* The 'deinit' routine only has to de-allocate the 'point' structure */ void distanceFromCenter_deinit( UDF_INIT* initid ) { free(initid->ptr); } /* The 'add' routine is called once for each row of data. It is responsible for aggregating the point information. It simply adds the 'x' and 'y' values to the totals stored in the point structure and then increments the count in the same structure. */ void distanceFromCenter_add( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error ) { ((struct point *)initid->ptr)->x += *((double *)args->args[0]); ((struct point *)initid->ptr)->y += *((double *)args->args[0]); ((struct point *)initid->ptr)->count++; } /* The 'reset' routine is called at the beginning of each group. It simply zeros out the information in the point structure in preperation for a new group of data. As mentioned earlier, this is the only routine called on the first row of data in a group. Therefore, we must also call the 'add' routine from this routine, to include the first row of data in the aggregate set. */ void distanceFromCenter_reset( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error ) { ((struct point *)initid->ptr)->x = 0.0; ((struct point *)initid->ptr)->y = 0.0; ((struct point *)initid->ptr)->count = 0; distanceFromCenter_add( initid, args, is_null, error ); } /* The 'main' routine has a very simple role here. It just has to take the aggregate data from the point structure and calculate a point (x,y) value for that data. It then runs the classic a^2 + b^2 = c^2 formula to determine the distance to the center of the grid for that point. double distanceFromCenter( UDF_INIT* initid, UDF_ARGS* args, char* is_null, char *error ) { double avg_x, avg_y; /* Calculate the average 'x' and the average 'y' avg_x = ((struct point *)initid->ptr)->x / (double)((struct point *)initid->ptr)->count; avg_y = ((struct point *)initid->ptr)->y / (double)((struct point *)initid->ptr)->count; /* Return the square root of a^2 + b^2 */ return sqrt( avg_x * avg_x + avg_y * avg_y ); } 14.1.3. Calling a UDFNow that we have successfully created the routines necessary for a UDF, the next step is to call those functions from an SQL query. This is accomplished in three steps:
As discussed earlier, the routines that make up a UDF can be implemented in any programming language that can be compiled into an operating system-level shared library. On most Unix-based systems, a shared library is indicated by a .so extension. In Microsoft Windows, shared libraries have .dll extensions. Once the routines are written, use whatever procedure is generally followed for compiling source code into a shared library. For example, compiling a shared library from a C or C++ file using the GNU C Compiler on a Unix-based system can be accomplished with the following command: gcc -shared -o myfunc.so myfunc.c The -shared flag indicates that a shared library should be created from the functions within the source file. Different compilers, operating systems, and programming languages will have different methods for creating a shared library. As mentioned earlier, multiple UDFs can be defined and implemented within a single shared library file. However, all the routines necessary for each function must be present within that file. That is, you can place multiple functions within a single shared library, but you cannot split the routines necessary for a single function between multiple libraries. Once a shared library has been created, it should be placed somewhere on the operating system's filesystem that is designated for shared libraries. On Unix-based systems, this generally means the /lib and /usr/lib directories as well as any directories specified in the LD_LIBRARY_PATH environment variable (although the name of the environment variable differs on some Unix-based systems). For Microsoft Windows systems, this generally means the WINDOWS\System32 directory (where WINDOWS is the main Windows directory, often C:\Winnt for Windows NT and 2000 systems). Finally, it may be necessary to restart the MySQL server for it to find the shared library file. This may not be necessary, though, so if restarting the MySQL server is an issue, first try to load the function library without restarting. Once the shared library file is in a place that is accessible to the MySQL server, it is possible to access the UDF directly from MySQL. To do this, the UDF must be loaded, which generally means loading the shared library into MySQL's memory so it can access the function routines. Loading a new UDF is accomplished through the SQL CREATE FUNCTION statement: CREATE [AGGREGATE] FUNCTION function_name RETURNS {STRING|REAL|INTEGER} SONAME shared_library_name Basically, this statement indicates the name of the UDF, its return type, and the shared library where it is located. The optional AGGREGATE keyword indicates that the function is an aggregate function. The name of the function must be the name of the main routine used to implement the function. The name is case sensitive, so it must be capitalized the same way it is in the shared library. The return type of the function is specified by one of the keywords STRING, REAL, or INTEGER. This value must match the return type of the main routine. The shared library is given as an SQL literal string. It should be only the filename (not a directory path), and it is case sensitive. The shared library given here must be accessible to MySQL. Behind the scenes, loading a UDF involves adding and removing data from the func table of the system mysql database. Therefore, to execute the CREATE FUNCTION statement, a user must have INSERT and DELETE privileges for that table (or for the whole database). In general, this should be restricted to administrative users, because loading arbitrary UDFs can give a user almost complete control over the MySQL system. (Imagine loading a function that kills the main MySQL process. Because the function is run from MySQL itself, nothing will stop it from shutting down the system abruptly.) The following SQL calls add two UDFs, one standard and the other aggregate: CREATE FUNCTION getImage RETURNS STRING SONAME 'myfuncs.so'; CREATE AGGREGATE FUNCTION distanceFromCenter RETURNS REAL SONAME 'myfuncs.so'; Once a UDF has been loaded into MySQL, it can be accessed just as if it were a built-in MySQL function, because MySQL functions are case insensitive. Even though it was necessary to specify the correct case when loading the function, once loaded, the function can be executed with any case: SELECT GETIMAGE( image_name ) AS image FROM Product; SELECT distanceFromCenter( x, y ) FROM Point GROUP BY shape_id Copyright © 2003 O'Reilly & Associates. All rights reserved. |
||||||||
|