home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Book HomePerl CookbookSearch this book

20.4. Converting ASCII to HTML

Problem

You want to convert ASCII text to HTML.

Solution

Use the simple little encoding filter in Example 20.3 .

Example 20.3: text2html

#!/usr/bin/perl -w -p00
# text2html - trivial html encoding of normal text
# -p means apply this script to each record.
# -00 mean that a record is now a paragraph

use HTML::Entities;
$_ = encode_entities($_, "\200-\377");

if (/^\s/) {
    # Paragraphs beginning with whitespace are wrapped in <PRE> 
    s{(.*)$}        {<PRE>\n$1</PRE>\n}s;           # indented verbatim
} else {
    s{^(>.*)}       {$1<BR>}gm;                    # quoted text
    s{<URL:(.*?)>}    {<A HREF="$1">$1</A>}gs         # embedded URL  (good)
                    ||
    s{(http:\S+)}   {<A HREF="$1">$1</A>}gs;        # guessed URL   (bad)
    s{\*(\S+)\*}    {<STRONG>$1</STRONG>}g;         # this is *bold* here
    s{\b_(\S+)\_\b} {<EM>$1</EM>}g;                 # this is _italics_ here
    s{^}            {<P>\n};                        # add paragraph tag
}

Discussion

Converting arbitrary plain text to HTML has no general solution because there are too many different, conflicting ways of representing formatting information in a plain text file. The more you know about the input, the better the job you can do of formatting it.

For example, if you knew that you would be fed a mail message, you could add this block to format the mail headers:

BEGIN {
    print "<TABLE>";
    $_ = encode_entities(scalar <>);
    s/\n\s+/ /g;  # continuation lines
    while ( /^(\S+?:)\s*(.*)$/gm ) {                # parse heading
        print "<TR><TH ALIGN='LEFT'>$1</TH><TD>$2</TD></TR>\n";
    }
    print "</TABLE><HR>";
}

See Also

The documentation for the CPAN module HTML::Entities


Previous: 20.3. Extracting URLs Perl Cookbook Next: 20.5. Converting HTML to ASCII
20.3. Extracting URLs Book Index 20.5. Converting HTML to ASCII

Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.