7.2 Text ProcessingThe enormous popularity of Perl is due in large part to its sophisticated text processing capabilities. A single Perl command can replace dozens of PL/SQL INSTR and SUBSTR operations. Additionally, Perl's pattern matching capabilities are well suited for processing and validating the text entered into HTML forms. The PL/SQL toolkit has two packages that bring a subset of these capabilities to PL/SQL. The first, OWA_TEXT, manipulates large chunks of text. The second, OWA_PATTERN, allows developers to use sophisticated search patterns to perform many of the text operations found in Perl. Beyond their mere utility, these two packages are interesting examples of good package design. OWA_TEXT is similar to the abstract datatypes described in Chapter 6, PL/SQL . OWA_PATTERN builds on OWA_TEXT to create dozens of variations of the search and replace procedure, each useful in particular circumstances. 7.2.1 OWA_TEXT: Representing TextThe largest PL/SQL string can contain 32,767 characters. Unfortunately, there are many cases where we might like to process larger chunks of text. A document indexing system, for example, must almost certainly process files much larger than 32K.[ 1 ] The OWA_TEXT package overcomes the 32K limitation by breaking text streams into smaller pieces that are stored as elements in a PL/SQL array.
Additionally, it is often useful to treat the components of a string as a single entity (e.g., a sentence as an array of words). OWA_TEXT is ideal for this type of application. In practice, though, you'll probably not use OWA_TEXT directly; instead, it's used to provide more flexibility to the OWA_PATTERN package discussed later in this chapter. 7.2.1.1 Data structuresOWA_TEXT's specification declares four data structures. The first two, vc_arr and int_arr, are PL/SQL arrays that are the building blocks of the more complex types. vc_arr is a 32K string array; int_arr holds indexes to the interesting rows of vc_arr. The declarations for these two datatypes are: TYPE vc_arr IS TABLE OF VARCHAR2(32767) INDEX BY BINARY_INTEGER; TYPE int_arr IS TABLE OF INTEGER INDEX BY BINARY_INTEGER; The third data structure, called multi_line, is used to store information about an entire text stream. multi_line contains three fields: a vc_arr array to hold the individual rows of the stream, an integer to hold the number of rows in the vc_arr array, and a Boolean flag to indicate the presence of a partial row. Its declaration is: TYPE multi_line IS RECORD ( rows vc_arr, num_rows INTEGER, partial_row BOOLEAN ); The fourth data structure, row_list, is used to represent pointers into the rows in a multi_line structure. This structure is generally used by other toolkit packages, such as OWA_PATTERN, discussed later in this chapter. The declaration for the row_list structure is: TYPE row_list IS RECORD ( rows int_arr, num_rows INTEGER ); 7.2.1.2 ProceduresThe procedures in OWA_TEXT define a limited set of operations similar to those of a classic linked list. There are procedures to create a new multi_line structure, to add a new row onto the end of an existing structure, and even to print its contents. Table 7.4 summarizes these procedures:
7.2.1.3 ExampleLet's look at a quick example that illustrates OWA_TEXT in action. The following procedure, TOKENIZE, uses OWA_TEXT to break apart and print the individual words in a sentence: CREATE OR REPLACE PROCEDURE tokenize ( sentence IN VARCHAR2 DEFAULT NULL ) IS mline OWA_TEXT.multi_line; i NUMBER; n NUMBER := LENGTH (sentence); c VARCHAR2(1); BEGIN OWA_TEXT.new_multi (mline); -- Initialize the structure FOR i IN 1 .. n LOOP c := SUBSTR (sentence, i, 1); -- Fetch current character IF c = ' ' THEN -- Add a new row if the character is a space OWA_TEXT.add2multi (c, mline, FALSE); ELSE -- Otherwise, append the character to the string OWA_TEXT.add2multi (c, mline); END IF; END LOOP; /* || Print individual words in sentence */ FOR i IN 1 .. mline.num_rows LOOP HTP.print ('Word ' || i || ' is '); HTP.print (mline.rows (i) || '<br>'); END LOOP; END;
The following HTML form is used to test the procedure; note how the <html> <title>Test tokenizer procedure</title> <body> Enter the text to tokenize: <form action="http://gandalf/agent_webtest/plsql/tokenize"> <textarea name=sentence>Enter sentence here</textarea> <p> <input type=submit> </form> </body> </html> Figure 7.2 shows the results of the TOKENIZE procedure. The image on the left shows the form used to submit the sentence; the image on the right shows the corresponding output generated by TOKENIZE. Figure 7.2: Results of the TOKENIZE procedure7.2.1.4 Design noteIn a classic ADT, such as the stack presented in Chapter 6 , the datatypes would be hidden within the package body and would be accessible only through a programmatic interface. While this information hiding approach gives the developer strict control over how the package is used, it also makes it difficult to extend the package. Although it seems like an obscure issue, the placement of the declaration has a profound impact on the life of the package. Declaring everything in the specification can make the package unwieldy and hard to maintain. The other extreme, declaring everything in the body and making nothing accessible, results in a package that is rigid and difficult to use in new circumstances. OWA_TEXT resolves this tension admirably. By placing the declarations in the specification, the developers are consciously creating a general-purpose object that other packages can use. However, the package also contains a well-defined, private set of procedures that limit the operations that can be performed against its structures. In the next section, we'll see how the package is used to extend the toolkit's pattern searching capabilities. 7.2.2 OWA_PATTERN: Searching and Replacing TextThe OWA_PATTERN package is the second component of our text processing unit. As its name implies, OWA_PATTERN performs more complex text manipulation than is possible with PL/SQL's INSTR and SUBSTR functions. Regular expressions make this sophistication possible. 7.2.2.1 Regular expressions
A
regular expression
, or RegExp, is a compact description for a pattern of characters used to find matches within another string. Chances are you have used a simple RegExp to perform wildcard file searches using commands such as You can use regular expressions to create more sophisticated patterns. For example, suppose you want to take some action if any date appears within a string; you are only interested in its presence and do not know its value ahead of time. Clearly, the following INSTR test is not very effective: IF INSTR (some_string, '07/13/71') OR INSTR (some_string, '07/14/71') OR INSTR (some_string, '07/15/71') OR INSTR (some_string, '07/16/71') ... What you are really after is a pattern consisting of three sets of two digits separated by slashes (for clarity, assume the date is always DD-MM-YY). A regular expression is a mini-language that uses a compact vocabulary to describe these patterns. The first part of the RegExp vocabulary defines the different types of characters that can be matched, such as digits, letters, or tabs. These characters are sometimes called atoms because they form the basic building blocks on which the expressions are based. The atoms that can be used in OWA_PATTERN are shown in Table 7.5 .
The next part of the vocabulary defines how many characters must appear to constitute a match. For example, we may want to return a match only if there are exactly two consecutive digits. The characters in this set are called the quantifiers ; the possibilities for them are shown in Table 7.6 .
There are two possible assertions , or sets of characters used to fix the position of a match, as shown in Table 7.7 .
Finally, the vocabulary of the regular expression contains a set of flags that are used to control the behavior of the search. Unlike the atoms, quantifiers, and assertions, these flags are not included as part of the RegExp itself. Instead, they are passed as a separate parameter to control how the various OWA_TEXT procedures behave. The two available flags are shown in Table 7.8 .
There are additional special characters that remember the portions of the original string that was matched. The first special character, the ampersand (&), can be used during the replace phase of a search and replace operation. The & represents the original pattern found in a match; including it in a replace string recreates the original string of characters that matched the pattern. The second special character is a pair of parentheses. When a portion of a match sequence is enclosed in parentheses, the subsequent replace operation can remember each parenthesized match. These remembered strings are called back references (backrefs) and are stored in an array. 7.2.2.2 Data structuresYou must supply a regular expression to each function in OWA_PATTERN. Initially, the pattern is stored as a simple VARCHAR2 string. In order to use the expression, however, OWA_PATTERN transforms it into a more useful format. This relatively time-consuming process converts the regular expression from a VARCHAR2 into a PL/SQL array, using the following declaration: TYPE pattern IS TABLE OF VARCHAR2(4) INDEX BY BINARY_INTEGER; Like many of the other data structures we've seen, pattern datatypes are initialized by calling a procedure. In this case, the procedure is called GET_PAT. There are two parameters to this procedure. The first is a VARCHAR2 string called arg that holds the regular expression to be parsed. The second is a pattern datatype (declared as an IN OUT mode parameter) to hold the resultant parsed pattern. In the next section, we'll see once again how the toolkit's developers intentionally placed the declaration in the specification and not the body, even though it's a purely internal representation. This time, however, the intent is to improve the package's performance as well as its usability. 7.2.2.3 Procedures and functionsIn addition to GET_PAT, the OWA_PATTERN package contains three other basic functions: MATCH, AMATCH, and CHANGE. In an attempt to match the enormous flexibility of Perl, each function has several overloaded versions that derive from the data structures found in OWA_TEXT. For example, the MATCH function can search either a simple VARCHAR2 string or the more complex multi_line data structure. This is a great example of the power and flexibility a good package can provide. However, with 14 variations of just three functions, OWA_PATTERN reveals an API that just might be too complex. The next three sections describe the functions for this package, shown in Table 7.9 . Keep in mind that some of these functions are like the finches on the Galapagos Islands: very specialized.
7.2.2.3.1 The MATCH function.This function returns a Boolean value indicating whether a regular expression was found inside a chunk of text. There are six overloaded versions. The parameters for this function are:
Table 7.10 lists the formal parameters for each of the different versions of MATCH.
7.2.2.3.2 The AMATCH function.This function is similar to MATCH, except that it returns the position of the end of the first match found within the string.[ 2 ] The function returns if no match is found. There are four overloaded versions; the parameters for each version are:
Table 7.11 lists the four versions of AMATCH.
7.2.2.3.3 CHANGE.The CHANGE function or procedure searches a chunk of text for a pattern. When it finds a match, it replaces the matched substring with a new string. When used as a procedure, CHANGE simply updates the text with the appropriate matches. When used as a function, it makes the changes and returns the number of substitutions. This second usage is questionable. When a function changes the value of a parameter, it violates the most important rule about functions: that a function should return exactly one value. By updating the parameters, CHANGE is essentially returning two values: one for the number of updates and another for the actual results of that update. Including an OUT mode parameter to the procedural version would probably have been a better design. The parameters used in each version are:
Table 7.12 lists the four versions of CHANGE.
7.2.2.4 ExampleAs you can imagine, there are a lot of possible examples for the OWA_PATTERN package. However, it's not necessary to detail every one. Instead, let's focus on a single example, based on the CHANGE procedure, that allows you to test the effect of various regular expressions in search and replace operations. The example procedure, regexp_test, accepts the parameters of version 1 of CHANGE and builds an HTML table that breaks the final page into two columns. The first column contains a data entry form with the fields necessary to test the CHANGE procedure. The second column displays the results of the CHANGE procedure when it is executed with the regexp_test procedure's parameters.
The interesting thing about this procedure is that it preserves the values entered when the user submits the form. This is accomplished by setting the form CREATE OR REPLACE PROCEDURE regexp_test ( line IN OUT VARCHAR2 DEFAULT NULL, from_str IN VARCHAR2 DEFAULT NULL, to_str IN VARCHAR2 DEFAULT '<b>&</b>', flags IN VARCHAR2 DEFAULT NULL ) IS BEGIN HTP.print ('<html><title>Pattern Test</title><body>'); HTP.print ('<table border=1><tr><td>'); -- Used to format results HTP.print ('<form action=regexp_test>'); HTP.print ('Line:<textarea name=line>' || line || '</textarea><br>'); HTP.print ('From:<input name=from_str value="' || from_str || '"><br>'); HTP.print ('To:<input name=to_str value="' || to_str || '"><br>'); HTP.print ('Flags:<input name=flags value="' || flags || '"><br>'); HTP.print ('<input type=submit>'); HTP.print ('</form></td><td>'); -- Results print in second column -- Call the change procedure OWA_PATTERN.change (line, from_str, to_str, flags); HTP.print (line_copy); HTP.print ('</td></tr></html>'); END; Figure 7.3 shows the output of the regexp_test procedure. Figure 7.3: The results of the regexp_test procedureCopyright (c) 2000 O'Reilly & Associates. All rights reserved. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|