home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


20.4. Converting ASCII to HTML

Problem

You want to convert ASCII text to HTML.

Solution

Use the simple little encoding filter in Example 20.3 .

Example 20.3: text2html

#!/usr/bin/perl -w -p00
# text2html - trivial html encoding of normal text
# -p means apply this script to each record.
# -00 mean that a record is now a paragraph

use HTML::Entities;
$_ = encode_entities($_, "\200-\377");

if (/^\s/) {
    # Paragraphs beginning with whitespace are wrapped in <PRE> 
    s{(.*)$}        {<PRE>\n$1</PRE>\n}s;           # indented verbatim
} else {
    s{^(>.*)}       {$1<BR>}gm;                    # quoted text
    s{<URL:(.*?)>}    {<A HREF="$1">$1</A>}gs         # embedded URL  (good)
                    ||
    s{(http:\S+)}   {<A HREF="$1">$1</A>}gs;        # guessed URL   (bad)
    s{\*(\S+)\*}    {<STRONG>$1</STRONG>}g;         # this is *bold* here
    s{\b_(\S+)\_\b} {<EM>$1</EM>}g;                 # this is _italics_ here
    s{^}            {<P>\n};                        # add paragraph tag
}

Discussion

Converting arbitrary plain text to HTML has no general solution because there are too many different, conflicting ways of representing formatting information in a plain text file. The more you know about the input, the better the job you can do of formatting it.

For example, if you knew that you would be fed a mail message, you could add this block to format the mail headers:

BEGIN {
    print "<TABLE>";
    $_ = encode_entities(scalar <>);
    s/\n\s+/ /g;  # continuation lines
    while ( /^(\S+?:)\s*(.*)$/gm ) {                # parse heading
        print "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    print "</TABLE><HR>";
}

See Also

The documentation for the CPAN module HTML::Entities


Previous: 20.3. Extracting URLs Perl Cookbook Next: 20.5. Converting HTML to ASCII
20.3. Extracting URLs Book Index 20.5. Converting HTML to ASCII

Library Navigation Links

Copyright © 2001 O'Reilly & Associates. All rights reserved.