10. PLVprs, PLVtkn, and PLVprsps: Parsing Strings

Contents:
PLVprs: Useful String Parsing Extensions
PLVtkn: Managing PL/SQL Tokens
PLVprsps: Parsing PL/SQL Strings

Parsin g a computer program can be a very frustrating and complex experience. There are all kinds of exceptions to the rules and special rules to handle in one's logic. The only way I could handle all of these details was to deal with as narrow a portion of functionality at a time as I could. The result is a set of four different packages oriented to different levels of the parsing process. By isolating various areas of complexity into these different packages, I can keep each of the individual parsing programs brief and relatively easy to write and understand.

PL/Vision offers several different string and source code parsing packages:

PLVprs: Generic string-parsing extensions to PL/SQL . This is the lowest level of string-parsing functionality and will prove useful in many different situations.
PLVtkn: Interface to the PLV_token table, which contains more than 1,200 keywords of the PL/SQL language, including those for Oracle Forms. Use PLVtkn to determine if an identifier is a keyword.
PLVlex: PL/SQL lexical analysis package. Performs similar parsing actions as those available in PLVprs , but does so with an awareness of the syntax and delimiters of the PL/SQL language.[ 1 ]

[1] This package is not described further in the book; see the the companion disk for more information.
PLVprsps: Highest-level package to parse PL/SQL source code (hence the prsps name) into separate atomics. Relies on the other parsing packages to get its job done.

10.1 PLVprs: Useful String Parsing Extensions

The PLVprs (PL/Vision PaRSe) package offers a set of procedures and functions that provide generic and very flexible string-parsing functionality. These programs extend the builtin string functions of PL/SQL . They can:

Parse a string into its atomics
Count the number of atomics in a string
Count the frequency of a substring within a string
Return the n th atomic in a string
Wrap a long line of text into a paragraph
Display a wrapped line of text

You can use PLVprs within other packages and programs to analyze strings and display their contents. It is also used within PL/Vision by PLVvu and PLVdyn to display long messages.

All of our applications require manipulation of textual information. I have often encountered the need to parse and analyze those strings in order to answer requests like the following (and I am sure that you could add more to the list):

Count the number of words in a string.
Separate out all words and punctuation in a string into separate components.
Return the n th value in a semicolon-delimited string. This is a very common situation in Oracle Forms applications, in which a developer might pack a set of values into a global variable like this: "123;5555;6623.11;".

10.1.1 Developing a General Solution

Taken separately, it is not too hard to develop a solution to any of the items on this list. If you build solutions to each individual requirement on a case-by-case basis, you will end up with an enormous volume of redundant code which you cannot easily enhance or upgrade.

Rather than construct a smattering of different, specialized routines to analyze strings, I offer with PLVprs a set of very generic and flexible functions and procedures. Yet the only way to make sure that my approach will handle many different situations and requirements is to base that approach on a general analysis of the components of a string.

In the world of PLVprs, a string is made up of a series of atomics, the smallest indivisible elements of the string. An atomic is either a word (defined as a contiguous set of letters and/or numbers) or a delimiter (any one of a set of special characters which separate or delimit words). In the English language, for example, common delimiters, a.k.a. punctuation, include the comma, semicolon, question mark, period, etc. But in a more general scenario, we have to be careful not to assume that the set of delimiters is a constant. For example, if I want to perform operations on a string of values delimited by a semicolon, then the set of delimiters for that particular operation is a single character: the semicolon.

Once you see strings as a set of atomics, of words separated by delimiters, you can reinterpret and enhance the kinds of requests listed above, such as "count all words in a string." Do you want to count all the atomics or just the words or just the delimiters? PLVprs gives you that level of flexibility.

10.1.2 Customizing the Delimiter Set

The PLVprs package interprets a string as a series of atomics: words and delimiters. But what is a delimiter? PLVprs predefines two sets of delimiters:

Name	Characters in Delimeter Set
`std_delimeters`	!@#$%^&*()‐_=+\\|`~{{]};:''",<.>/?' plus space, tab, and newline characters
`plsql_delimeters`	!@%^&*()‐=+\\|`~{{]};:''",<.>/?' plus space, tab, and newline characters

The only difference between these lists is that the plsql_delimiters set omits the underscore, dollar sign, and pound sign characters. These characters are valid symbols in a PL/SQL identifier, which should certainly be considered a word.

Would you ever need any other delimiter sets? Sure, why not? You might have a string which is packed with values separated by the vertical bar. In this situation, when you call PLVprs.string to separate the string into separate atomics, you will want to be able to specify your special and very short delimiter list.

PLVprs lets you specify a non-default delimiter list in the following programs:


next_atom_loc


nth_atomic


display_atomics


string


numatomics

Here is an example of how I could parse a string with atomics packed between vertical bars and ignore any other delimiters:

DECLARE
   atomic PLVtab.vc2000_table;
   num INTEGER;
BEGIN
   PLVprs.string ('A#-%|12345|(*&*)|0101R|', atomic, num, '|');
   PLVtab.display (atomic,num);
END;
/

This is the output seen after executing the above script:

Contents of Table
A#-%
|
12345
|
(*&*)
|
0101R
|

With the default set of delimiters, this string would have been broken up by PLVprs.string into fifteen separate atomics, rather than just eight. So don't forget to use your own delimiter list as needed to simplify your parsing and analysis jobs.

10.1.3 Parsing Strings into Atomics

The PLVprs package offers a number of programs that perform different parsing operations on strings. They are each discussed below. The implementations of most of these programs were discussed in Oracle PL/SQL Programming , so I will not go beyond an explanation here of how you can use these functions.

10.1.3.1 next_atom_loc function

The next_atom_loc returns the location in the specified string of the next atomic. It is used by other PLVprs programs, but is also available for use by other PL/Vision packages -- and by you. Its header is:

   FUNCTION next_atom_loc 
        (string_in IN VARCHAR2, 
         start_loc_in IN NUMBER,
         direction_in IN NUMBER := +1,
         delimiters_in IN VARCHAR2 := std_delimiters)
   RETURN INTEGER;

The string_in parameter is the string to be scanned. The start_loc_in parameter provides the starting position of the search for the start of the next atomic. The direction_in parameter indicates whether the search should move forward or backward through the string. The final argument allows you to specify the set of characters to be considered delimiters (which indicate the start of a new atomic).

The next_atom_loc function returns the location in the string of the starting point of the next atomic (from the start location). The function scans forward if direction_in is +1, otherwise it scans backwards through the string. Here is the logic to determine when the next atomic starts:

If the current atomic is a delimiter (that is, if the character at the start_loc_in of the string is a delimiter), then the next character starts the next atomic since all delimiters are a single character in length.
If the current atomic is a word (that is, if the character at the start_loc_in of the string is a letter or number), then the next atomic starts at the next delimiter. Any letters or numbers in between are part of the current atomic.

The next_atomic_loc function loops through the string one character at a time and applies these tests. It also has to check for the end of string. If it scans forward, the end of string comes when the SUBSTR that pulls out the next character returns NULL. If it scans backward, then the end of the string comes when the location is less than 0.

10.1.3.2 display_atomics procedure

The display_atomics procedure displays the atomics found in the specified string. Its header is shown below:

PROCEDURE display_atomics
   (string_in IN VARCHAR2, 
    delimiters_in IN VARCHAR2 := std_delimiters);

You specify the string you want parsed and the set of characters you want to be treated as delimiters for the parsing. To make it easier to view blank lines, spaces are displayed as a description of the number of spaces present (as in "1 blank" or "6 blanks"). This feature is shown below:

SQL> exec PLVtab.showrow
SQL> exec PLVprs.display_atomics ('Compassion is a human thing.');
Parsed Atomics in String
Row 1 = Compassion
Row 2 = 1 blank
Row 3 = is
Row 4 = 1 blank
Row 5 = a
Row 6 = 1 blank
Row 7 = human
Row 8 = 2 blanks
Row 9 = thing
Row 10 = .

You can specify an alternate delimiter set in order to display the contents of a string according to the format of that string and how it is used. Compare the two calls to display_atomics below:

SQL> exec PLVprs.display_atomics ('1234|A$%|67YYY|(big)');
11
Parsed Atomics in String
1234
-|-
A
-$-
-%-
-|-
67YYY
-|-
-(-
big
-)-
SQL> exec PLVprs.display_atomics ('1234|A$%|67YYY|(big)', '|');
7
Parsed Atomics in String
1234
-|-
A$%
-|-
67YYY
-|-
(big)

In the second call, I am telling display_atomics to consider "|" as the sole delimiter in the string.

10.1.3.3 numatomics function

The numatomics function returns the number of atomics in a string. You can calculate the number of all atomics, only the words, or only the delimiters. You can specify in your call to the function what characters should be considered delimiters. The header for this function is:

FUNCTION numatomics 
   (string_in IN VARCHAR2, 
    count_type_in IN VARCHAR2 := c_all,
    delimiters_in IN VARCHAR2 := std_delimiters)
RETURN INTEGER;

The following examples demonstrate how to use the count_type_in argument. The first call relies on the default value of "all atomics." The next two calls pass in a specific request for type of atomic, relying on the constants provided in the package specification.

SQL> exec p.l (PLVprs.numatomics ('this, is%not.'))
7
SQL> exec p.l (PLVprs.numatomics ('this, is%not.', PLVprs.c_word))
3
SQL> exec p.l (PLVprs.numatomics ('this, is%not.', PLVprs.c_delim))
4

10.1.3.4 nth_atomic function

The nth_atomic function returns the n th atomic in a string. You can, for example, ask for the seventh word or the sixth delimiter. The header for nth_atomic is:

FUNCTION nth_atomic
   (string_in IN VARCHAR2, 
    nth_in IN NUMBER,
    count_type_in IN VARCHAR2 := c_all,
    delimiters_in IN VARCHAR2 := std_delimiters)
RETURN VARCHAR2;

The nth_atomic function is very flexible, following the model of the builtin functions, INSTR and SUBSTR . You can scan both forward and backward through the string. If you provide a positive nth_in argument, then nth_atomic scans forward to the n th atomic. If, on the other hand, the nth_in value is negative, nth_atomic will scan backwards through the string to the n th atomic. This feature is shown in the examples below:

SQL> exec p.l(PLVprs.nth_atomic ('this, is%not.', 2))
,
SQL> exec p.l(PLVprs.nth_atomic ('this, is%not.', 2, PLVprs.c_word))
is
SQL> exec p.l(PLVprs.nth_atomic ('this, is%not.', 3, PLVprs.c_delim))
%
SQL> exec p.l(PLVprs.nth_atomic ('this, is%not.', -3, PLVprs.c_word))
this

The PLVdyn package also utilizes nth_atomic function to extract the type of program unit and the name of the program unit that is being compiled and stored by the compile procedure:

/* Get the first word.*/
v_name1 := PLVprs.nth_atomic (stg_in, 1, PLVprs.c_word);

/* Get the second word. */
v_name2 := PLVprs.nth_atomic (stg_in, 2, PLVprs.c_word);

/* If a package body, then get the THIRD word. */
IF UPPER (v_name1||' '||v_name2) = 'PACKAGE BODY'
THEN
   v_name1 := v_name1 || ' ' || v_name2;
   v_name2 := PLVprs.nth_atomic (stg_in, 3, PLVprs.c_word);
END IF;

10.1.3.5 string procedures

There are two, overloaded versions of the string procedure , which parses a string into its separate atomics. The string procedure is available to users of the PLVprs package; it is also used by programs in the PLVprs package. The display_atomics procedure, for example, calls the string procedure to parse the specified string and then calls PLVtab.display to display the table that contains the parsed atomics.

The headers for the string procedures are:

PROCEDURE string
   (string_in IN VARCHAR2, 
    atomics_list_out OUT PLVtab.vc2000_table, 
    num_atomics_out IN OUT NUMBER,
    delimiters_in IN VARCHAR2 := std_delimiters);

PROCEDURE string 
   (string_in IN VARCHAR2, 
    atomics_list_out IN OUT VARCHAR2, 
    num_atomics_out IN OUT NUMBER,
    delimiters_in IN VARCHAR2 := std_delimiters);

The first argument is the string to be parsed. The second and third arguments contain the parsed output of the procedure call. The fourth argument allows you to specify an alternative set of delimiter characters.

The table version of string fills a PL/SQL table with the atomics of the string, one atomic per row. The VARCHAR2 version of the string procedure returns the atomics in a string separated by the vertical bar delimiter. This "string version" of string simply calls the table version and then dumps the contents of the table into a string.

10.1.3.6 numinstr function

The numinstr function is a good example of how you can supplement the fine, but finite set of builtin string functions with your own basic and quite reusable functions. The numinstr function returns the number of times a substring appears in a string. Its header is:

FUNCTION numinstr
   (string_in IN VARCHAR2,
    substring_in IN VARCHAR2,
    ignore_case_in IN VARCHAR2 := c_ignore_case)
RETURN INTEGER;

The first argument is the string, the second is the substring, and the third allows you to specify whether you want case to be ignored or respected in the search. The following examples illustrate this flexibility:

SQL> exec p.l (PLVprs.numinstr ('abcabC', 'c'));
2
SQL> exec p.l (PLVprs.numinstr ('abcabC', 'c', PLVprs.c_respect_case));
1

In the following code fragment, I call numinstr to determine the number of placeholders for bind variables in a dynamically constructed SQL string (a placeholder is defined as a colon followed by a PL/SQL identifier, such as :newname or :bindvar1 ). I then use a numeric FOR loop to issue a call to the BIND_VARIABLE procedure of the builtin DBMS_SQL package for each of the bind variables.

numvars := PLVprs.numinstr (sql_string, ':');
FOR bindvar_ind IN 1 .. numvars
LOOP
   DBMS_SQL.BIND_VARIABLE 
     (cur_handle,
      'bindvar' || TO_CHAR (bindvar_ind), 
      variables_table (bindvar_in));
END LOOP;

I believe I am a fairly good PL/SQL programmer. I can churn out hundreds of lines of complicated code that works, more or less, after just a few rounds of revisions. On the other hand, I feel I am also very much open to the possibility that others can and have done better -- and that I can learn from them. None of us has all the answers -- and some of us have more answers than others.

Let me give you an example. In my first book on PL/SQL , Oracle PL/SQL Programming (O'Reilly & Associates), I give every impression that I know what I am talking about -- and a big part of what I talk about is writing concise, high-quality code. In Chapter 11, Character Functions , I take my readers through the exercise of building a function to count the number of times a substring occurs in a string (a function not provided by PL/SQL ). I end up with two implementations, the shorter of which is shown in Example 10.1 .

Example 10.1: Counting the Frequency of a Substring Within a String

FUNCTION numinstr 
   (string_in IN VARCHAR2, substring_in IN VARCHAR2) 
RETURN INTEGER
IS
   substring_loc NUMBER;  
   return_value NUMBER := 1;
BEGIN
   LOOP
      substring_loc := 
         INSTR (UPPER (string_in), 
                UPPER (substring_in), 1, return_value);

      /* Terminate loop when no more occurrences are found. */
      EXIT WHEN substring_loc = 0;

      /* Found match, so add to total and continue. */
      return_value := return_value + 1;
   END LOOP;   
   RETURN return_value - 1;   
END numinstr;

I was quite content with this function -- until I received an email from Kevin Loney, author of ORACLE DBA Handbook (Oracle Press). This email very politely complimented me on my book and offered an alternative implementation for numinstr -- a simpler, more efficient, and more elegant implementation. As Kevin noted, he came up with this solution before the days of PL/SQL , when all he had to work with was SQL (a set-at-a-time, nonprocedural language). In his approach (shown in Example 10.2 ), he took advantage of the REPLACE builtin function to substitute NULL for any occurrences of the substring. He could then compare the size of the original string with the "replaced" string and use that as the basis for his calculation.

Example 10.2: Counting the Frequency of a Substring Within a String

FUNCTION numinstr
    (string_in IN VARCHAR2, sub_string_in IN VARCHAR2)
RETURN INTEGER
/* Divide difference of two lengths by length of substring. */
IS
BEGIN
   RETURN
     ((LENGTH (string_in) -
       NVL (LENGTH (REPLACE (string_in, sub_string_in)), 0))
     / LENGTH (sub_string_in));
END;

I felt a wave of embarrassment wash over me for just a moment when I first read Kevin's note. Then I regained sanity. Of course there are better ways of doing things. Discovering and sharing these improvements -- programming altruism -- is one of the finest aspects of our work. And programming humility -- the willingness to accept these improvements -- can make our lives much easier. If you are open to the ideas of others, then you are also open to the idea of using the work of others (with permission!). This means that you will spend much less time coding something that is already available -- you will avoid reinventing the wheel. Kevin's implentation for numinstr (enhanced to ignore or respect case) is now in version I provide in PLVprs.

10.1.4 Wrapping Strings into Paragraphs

How many times have you been confronted with the need to display a long string (defined, essentially, as anything longer than 80 characters) in a format which can be read easily? Unfortunately, PL/SQL does not have any paragraph-wrapping capabilities built into it. All you have is DBMS_OUTPUT.PUT_LINE, which accepts a maximum of 255 characters and is displayed in whatever manner handled by your environment.

PLVprs fills this gap in functionality by providing the following suite of string-wrapping programs:

wrap: Wraps a long string into a series of lines with a maximum specified length, each line of which is stored in consecutive rows in a PL/SQL table.
wrapped_string: Returns a long string wrapped into a series of lines separated by newline characters.
display_wrap: Displays a long string in paragraph-wrapped form at the specified length.

The wrap procedure supplies the core paragraph-wrapping functionality and is called both by wrapped_string and display_wrap . All three programs are described below.

10.1.4.1 wrap procedure

The header for the wrap procedure is:

PROCEDURE wrap
  (text_in IN VARCHAR2,
   line_length IN INTEGER,
   paragraph_out IN OUT PLVtab.vc2000_table,
   num_lines_out IN OUT INTEGER);

The first parameter is the text to be wrapped. The second parameter, line_length , is the length of the line into which the long text is to be wrapped. The paragraph_out argument is the PL/SQL table which will receive the wrapped text (each row, starting from 1, contains a single line of text). The num_lines_out argument contains the number of lines of text that have been placed in paragraph_out .

Examples of PLVprs.wrap are shown in the following sections.

10.1.4.2 display_wrap procedure

The display_wrap procedure comes in handy when you only want to display the end result -- the wrapped string. If you have no need to store that string in a PL/SQL table or a single string variable for further manipulation, call display_wrap . Then you will not have to declare temporary data structures to hold the wrapped text until it is displayed.

PROCEDURE display_wrap
  (text_in IN VARCHAR2,
   line_length IN INTEGER := 80,
   prefix_in IN VARCHAR2 := NULL);

The first parameter, text_in , is the string to be wrapped and then displayed. The second argument, line_length , is the length of the line within which the string is wrapped. The third argument, prefix_in , allows you to specify an optional prefix to display before each line of the wrapped string.

Here is an example of display_wrap :

SQL> exec PLVprs.display_wrap (RPAD ('a string',120,' to remember'),30);
a string to remember to
remember to remember to
remember to remember to
remember to remember to
remember to remember to

The implementation of display_wrap is short and sweet. It calls the wrap procedure to fill up a locally declared PL/SQL table with the wrapped lines of text. It then calls PLVtab.display to display the contents of that table.

The p package of PL/Vision makes use of the display_wrap procedure to automatically wrap long lines of text. As a result, when you call p.l , you don't have to worry about the DBMS_OUTPUT restriction of 255 characters displayed in a string. Instead, it will automatically check the length of the string and wrap the output as shown below:

ELSIF LENGTH (line_in) > 80
THEN
   PLVprs.display_wrap (line_in, 75, NULL);

Now why did I include a third argument of NULL in my call to display_wrap ? That (lack of a) value is, after all, the default. I could simply have called display_wrap as follows:

PLVprs.display_wrap (line_in, 75);

and received the same results.

Usually I am a strong advocate of typing only the absolute minimum necessary to get the job done. Have I violated my own guidelines? Not really. I have, in fact, typed exactly what I needed to get the job done. The question you should now ask is: What is my "job" or requirement? I want to display a wrapped string without any prefix. The way I do that is to pass NULL for the prefix; hence, the inclusion of three arguments.

Notice that I did not state my requirement as follows: "I want to display a wrapped string with the default prefix." If that were my desire, I should pass just two arguments. That way, if the default ever changes, my call to display_wrap will automatically adapt to the new value. But in the p.l package I would not want the output from PLVprs.display_wrap to change when the default value changes. I really do want a NULL value, regardless of the actual default value for the display_wrap procedure. In p.l 's use of display_wrap , in other words, the fact that the desired prefix (none) is the same as the default is nothing more than a coincidence.

This distinction between needing the default value and happening to match the default value is an important one. If you simply have a coincidence, do not rely on the default value. Instead, pass in the value, even if it is currently the default and not, strictly speaking, needed. Your code will be less likely to break as the underlying layers of code you rely on change.

10.1.4.3 wrapped_string function

The last string-wrapping program is the wrapped_string function . This function returns a string with the wrapped lines of text concatenated together, with newline characters after every wrapped line. The specification for wrapped_string is as follows:

FUNCTION wrapped_string
  (text_in IN VARCHAR2,
   line_length IN INTEGER := 80,
   prefix_in IN VARCHAR2 := NULL)
RETURN VARCHAR2;

The first argument, text_in , is the string to be wrapped and then displayed. The second argument, line_length , is the length of the line within which the string is wrapped. The third argument, prefix_in , allows you to specify an optional prefix to display before each line of the wrapped string.

This function is useful in situations (such as that currently found in Oracle Developer/2000 Version 1) where you cannot reference and use PL/SQL tables. The wrapped_string returns the text in a string which can then be immediately displayed in the right format (because of the embedded newline characters) or passed to another program which works with the wrapped text.

Here are some factors to consider when working with PLVprs:

PLVprs does not currently recognize many nonprinting characters as delimiters (only newline and tab are in the predefined delimiter lists). If you have, for example, a backspace character in your string, this will not be treated as a delimiter. You can, of course, define your own delimiter list and then pass that set of characters to PLVprs's program.
When you wrap a string, all carriage returns are replaced with single characters. Your "hard-coded" line breaks will be ignored in the wrap process.