tc2html Notes

Paul DuBois
dubois@primate.wisc.edu

Wisconsin Regional Primate Research Center
Revision date: 9 March 1997


Introduction


tc2html is a postprocessor for converting troffcvt output to HTML. It's used by the troff2html front end. This document describes how tc2html works and some of the design issues involved in writing it.

In general, the goal of tc2html is that you should get reasonable HTML output with no need for special treatment of the troff input file. The most important thing is that you use a standard macro package. However, there are some additional principles you can follow that will improve the quality of the HTML that tc2html generates. For example, it's possible to embed hypertext links in your troff source with a little prior planning. Techniques for such things are discussed in the section "Generating Better HTML." If you're not interested in implementation details, you can skip directly to that section.

Output Format


tc2html reads output from troffcvt and produces an HTML document that has the following general form:

   <HTML>
   <HEAD>
   <TITLE>title text</TITLE>
   </HEAD>
   <BODY>
   <H1>title text</H1>
   body text
   </BODY>
   </HTML>
The document HEAD part may be missing if tc2html detects no title in the input. In this case the initial heading at the beginning of the document BODY part also will be missing. The entire document BODY may be missing or empty if the input document is empty.

Determining Input Document Structure


HTML documents typically are highly structured, being written in terms of elements such as headers, paragraphs, lists, and displays (preformatted text). But troffcvt output normally contains very little structural information beyond markers like those for inter-paragraph spacing and line breaks (in the form of \space and \break control lines). The result when tc2html reads such troffcvt output is that it produces HTML that is relatively unstructured -- just a lot of text broken by occasional <P> or <BR> markers.

However, if your document is marked up using macros from a macro package such as -ms or -man, it's possible to get output from troffcvt that's much more suitable for tc2html. The trick is to map troff requests to HTML structure markers, rather than trying to guess the structure from the low-level troffcvt output that normally results from those requests. This is accomplished using the following strategy:

Note that "extending" the troffcvt output language to include the \html control is done using request definitions in an action file. Source-level changes to troffcvt itself are not needed.

The effect of the strategy outlined above is to remap the macros in your macro package from their usual actions onto actions that produce document structure information that tc2html can recognize. For this to work well, all the important structure-related macros in a macro package must be redefined, so the redefinition files used for tc2html tend to be more extensive than those used for other postprocessors. This is really the source of most of the work involved in getting tc2html to function well. Once a set of redefinitions is written for a given macro package, translation from troff to HTML is a straightforward process that usually generates fairly reasonable HTML.

Here's an example of how the strategy described above works in practice. The .LP macro in the -ms macro package means "begin paragraph." But .LP typically is implemented by executing several other requests (restore font, margins, adjustment, spacing, point size, etc.), and the troffcvt output you'd get by processing those requests really contains nothing that specifically indicates a paragraph. To work around this, we use the fact that tc2html interprets \html para as indicating a paragraph beginning, and define a macro to generate that control:

   req H*para eol output-control "html para"
Then we can redefine the .LP macro in terms of the .H*para macro:
   req LP eol \
        break center 0 fill adjust b font R \
        push-string ".H*para\n"
The break, fill, adjust, and font actions cause troffcvt to adjust its internal state to match the effect that the .LP macro normally has. The call to .H*para results in \html para in the output, so that tc2html can recognize the paragraph beginning.

The \html markers that tc2html recognizes are shown below:

   \html title                Begin document title
   \html header N             Begin level N header
   \html header-end           End header (any level)
   \html para                 Begin paragraph
   \html blockquote           Begin block quote
   \html blockquote-end       End block quote
   \html list                 Begin list
   \html list-end             End list
   \html list-item            Begin list item
   \html display              Begin display (preformatted text)
   \html display-end          End display
   \html display-indent N     Set display indent to N spaces
   \html definition-term      Begin definition list term
   \html definition-desc      Begin definition list description
   \html shift-right          Shift left margin right
   \html shift-left           Shift left margin left
   \html anchor-href URL      Begin HREF anchor for link to URL
   \html anchor-name LABEL    Begin NAME anchor with label LABEL
   \html anchor-toc N         Begin NAME anchor for level N TOC entry
   \html anchor-end           End anchor (any kind)
The troff-level macros used to generate the \html controls are shown below. These macros are defined in the action file actions-html:
   .H*title                   Begin document title
   .H*header N                Begin level N header
   .H*header-end              End header (any level)
   .H*para                    Begin paragraph
   .H*bq                      Begin block quote
   .H*bq-end                  End block quote
   .H*list                    Begin list
   .H*list-end                End list
   .H*list-item               Begin list item
   .H*disp                    Begin display (preformatted text)
   .H*disp-end                End display
   .H*disp-indent N           Set display indent to N spaces
   .H*dterm                   Begin definition list term
   .H*ddesc                   Begin definition list description
   .H*shift-right             Shift left margin right
   .H*shift-left              Shift left margin left
   .H*ahref URL               Begin HREF anchor for link to URL
   .H*aname LABEL             Begin NAME anchor with label LABEL
   .H*atoc N                  Begin NAME anchor for level N TOC entry
   .H*aend                    End anchor (any kind)
Note that since these names are longer than two characters, they cannot be used in compatibility mode.

Invoking tc2html


The \html controls are defined in a file actions-html that you can access on the troffcvt command line using -a actions-html. If you use a macro package -mxx, you specify it on the command line, along with the general and HTML-specific troffcvt redefinitions for that macro package; these are in the action files tc.mxx and tc.mxx-html. Thus, to translate a file that you'd normally process using -ms, the command would look like this:

   % troffcvt -a actions-html -ms -a tc.ms -a tc.ms-html myfile.ms \
        | tc2html > myfile.html
That's pretty ugly, of course; it's better to use a wrapper script like troff2html that supplies the necessary options for you:
   % troff2html -ms myfile.ms > myfile.html
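The naming pattern for the action files can be sketched in a few lines of Python. This is only an illustration of how the arguments fit together, not how troff2html is actually written; the function name troffcvt_command is made up for the example, and it assumes the macro package is given without its leading dash (e.g., "ms" or "man").

```python
def troffcvt_command(package, source):
    """Build the argument lists for the troffcvt | tc2html pipeline.

    package is the macro package name without the dash ("ms", "man").
    This is a hypothetical helper, not part of the troffcvt distribution.
    """
    troffcvt = [
        "troffcvt",
        "-a", "actions-html",         # defines the \html controls
        f"-{package}",                # the macro package itself
        "-a", f"tc.{package}",        # general redefinitions for the package
        "-a", f"tc.{package}-html",   # HTML-specific redefinitions
        source,
    ]
    postprocessor = ["tc2html"]
    return troffcvt, postprocessor


if __name__ == "__main__":
    first, second = troffcvt_command("ms", "myfile.ms")
    print(" ".join(first), "|", " ".join(second))
```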

Implementation of Various HTML Constructs


This section provides some specifics on how several troff concepts are turned into HTML elements. It should be considered illustrative rather than exhaustive.

Document Titles


Title macros are implemented in terms of .H*title, which generates an \html title control. When tc2html sees this control, it goes into document HEAD collection mode. If the document contains a title, the \html title line must be the first \html control that tc2html sees. Should any other \html control or document text occur first, tc2html assumes no title is present. Any leading document whitespace (\space or \break lines) occurring prior to the title is skipped.

The title is terminated by the next \html line with a structural marker, such as \html para. The title text is used to produce the TITLE in the document HEAD part and the initial header in the document BODY part. \space and \break lines within the title do not terminate title text collection; instead, they are turned into spaces in the title and into <P> and <BR> in the initial header. Consider the following troff input (using -ms macros):

   .TL
   My
   .sp
   Title
   .LP
   This is a line.
This is converted by troffcvt into the following:
   \html title
   My
   \space
   Title
   \break
   \html para
   This is a line.
The output from troffcvt is converted in turn by tc2html into this HTML:
   <HEAD>
   <TITLE>
   My Title
   </TITLE>
   </HEAD>
   <BODY>
   <H2>
   My
   <P>
   Title
   </H2>
   <P>
   This is a line.
The -T title option may be given on the tc2html or troff2html command line to specify a title explicitly. It overrides the title in the document if there is one.
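
The title collection rules can be sketched roughly as follows. This is a simplified illustration in Python, not the tc2html source: the function name collect_title is invented, any \html line is treated as a terminating structural marker, and trailing <P> or <BR> markers are dropped from the heading because the sample output above shows none (that last point is a guess about the real behavior).

```python
def collect_title(lines):
    """Split troffcvt output lines into (title, heading_parts, rest).

    Returns (None, [], lines) if the input does not begin with a title.
    Hypothetical sketch; simplified relative to whatever tc2html really does.
    """
    whitespace = (r"\space", r"\break")
    i = 0
    # Leading whitespace controls before the title are skipped.
    while i < len(lines) and lines[i] in whitespace:
        i += 1
    if i == len(lines) or lines[i] != r"\html title":
        return None, [], list(lines)
    i += 1

    words = []      # text for <TITLE>; whitespace controls become spaces
    heading = []    # text for the initial header in the BODY
    while i < len(lines) and not lines[i].startswith("\\html "):
        line = lines[i]
        if line == r"\space":
            heading.append("<P>")
        elif line == r"\break":
            heading.append("<BR>")
        else:
            words.append(line)
            heading.append(line)
        i += 1

    # Assumption: markers left dangling at the end of the heading are dropped.
    while heading and heading[-1] in ("<P>", "<BR>"):
        heading.pop()
    return " ".join(words), heading, lines[i:]
```

Applied to the troffcvt output shown above, this yields the title "My Title" and the heading parts My, <P>, Title, with the \html para line and everything after it left for normal BODY processing.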

Standard Paragraphs


The "standard" paragraph is a paragraph with the first line flush left. There is no mechanism for writing paragraphs with an indented first line; they're treated simply as standard paragraphs.

The standard paragraph is implemented in terms of .H*para, which generates an \html para control. This is turned by tc2html into <P>.

In the document BODY part, \space is also interpreted as a paragraph marker, but during document title collection, \space is treated as described above under "Document Titles."

Indented Paragraphs


Indented paragraphs (with or without a hanging tag) are implemented using definition lists (<DL>...</DL>). The tag is written as a definition term (<DT>...</DT>) and the paragraph body is written as a definition description (<DD>...</DD>). If there is no tag, the term part is empty.

Indented paragraph macros are implemented in terms of .H*dterm and .H*ddesc, which generate \html definition-term and \html definition-desc controls.

One problem with mapping indented paragraphs onto definition lists is that it's not always clear from the troff input where the list ends. In HTML, the definition list is a container for which you must write both a beginning and ending tag, but in troff only the beginnings of paragraphs are specified. This problem is handled (perhaps poorly) by closing the list when other HTML structural elements like a standard paragraph or a header are seen. Suppose you write something like this:

   .IP (i)
   Para 1
   .IP (ii)
   Para 2
   .LP
   Para 3
This is converted by troffcvt into the following:
   \html definition-term
   (i)
   \html definition-desc
    Para 1
   \break
   \html definition-term
   (ii)
   \html definition-desc
    Para 2
   \break
   \html para
   Para 3
   \break
When tc2html sees the first \html definition-term, it begins a definition list. The second \html definition-term continues the same list. The \html para (resulting from the .LP) is part of a different structural element, so tc2html closes the list and begins a standard paragraph. The resulting HTML looks like this:
   <DL>
   <DT>
   (i)
   </DT>
   <DD>
   Para 1<BR>
   </DD>
   <DT>
   (ii)
   </DT>
   <DD>
   Para 2<BR>
   </DD>
   </DL>
   <P>
   Para 3<BR>
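The list-closing rule is easier to see as a small state machine. The Python below is a sketch of the idea only, handling just the four controls used in this example; the function name render and its internals are invented for illustration and are not taken from tc2html.

```python
def render(lines):
    """Translate a tiny subset of troffcvt controls into HTML lines.

    Hypothetical sketch: a definition list is opened on the first
    definition-term and closed when a different structural element
    (here, only a standard paragraph) is seen.
    """
    out = []
    in_list = False     # inside <DL>...</DL>?
    open_part = None    # "DT" or "DD" if one is currently open

    def close_part():
        nonlocal open_part
        if open_part:
            out.append(f"</{open_part}>")
            open_part = None

    def close_list():
        nonlocal in_list
        close_part()
        if in_list:
            out.append("</DL>")
            in_list = False

    for line in lines:
        if line == r"\html definition-term":
            if not in_list:
                out.append("<DL>")
                in_list = True
            close_part()
            out.append("<DT>")
            open_part = "DT"
        elif line == r"\html definition-desc":
            close_part()
            out.append("<DD>")
            open_part = "DD"
        elif line == r"\html para":
            close_list()            # a different structure ends the list
            out.append("<P>")
        elif line == r"\break":
            if out:
                out[-1] += "<BR>"   # attach the break to the preceding line
        else:
            out.append(line.strip())
    close_list()                    # end of input also ends any open list
    return out
```

Feeding it the troffcvt output above reproduces the HTML just shown, line for line. A fuller version would call close_list for headers, displays, and the other structural controls as well.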

Right and Left Shifts


In troff, the left margin can be shifted right and left, e.g., as is done with the -ms and -man packages using .RS and .RE. HTML has no good way of shifting the margin, so shifts are performed using <UL> and </UL>. This is admittedly a hack, but it works reasonably well. Shift macros are redefined to be implemented in terms of .H*shift*right and .H*shift*left, which generate \html shift-right and \html shift-left controls. These in turn are converted by tc2html to <UL> and </UL>.

Displays


Displays are implemented as preformatted text (<PRE>...</PRE>). Tabstops are respected within displays, although they must be approximated since character widths are unknown. tc2html assumes 10 characters/inch for determining the width of tabstops.

Display macros are implemented in terms of .H*disp and .H*disp*end. Preformatted text in HTML has no additional indent relative to the left margin, but troff displays often are indented a bit. To handle this, .H*disp*indent N can be used to set the display indent to N spaces.

.H*disp, .H*disp*end, and .H*disp*indent generate \html display, \html display-end, and \html display-indent controls. The first two of these are converted by tc2html into <PRE> and </PRE>. \html display-indent generates no output itself, but causes tc2html to add spaces to the beginning of each line of a display.
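
To make the tabstop approximation concrete, here is a rough Python sketch of expanding tabs in one display line at 10 characters/inch and then adding the display indent. The function name and its arguments are invented for the example. Two details are assumptions rather than documented behavior: the indent is added after tabs are expanded, and a tab past the last tabstop becomes a single space.

```python
CHARS_PER_INCH = 10   # the character density tc2html is described as assuming


def expand_display_line(line, tabstops_inches, indent=0):
    """Expand tabs to the given tabstops and prepend the display indent.

    tabstops_inches lists tabstop positions in inches from the left edge
    of the display. Hypothetical sketch, not the tc2html implementation.
    """
    stops = [round(t * CHARS_PER_INCH) for t in tabstops_inches]
    out = ""
    for ch in line:
        if ch != "\t":
            out += ch
            continue
        # Find the first tabstop to the right of the current column.
        nxt = next((s for s in stops if s > len(out)), None)
        if nxt is None:
            out += " "                      # assumption: no stop left
        else:
            out += " " * (nxt - len(out))   # pad out to the tabstop column
    return " " * indent + out
```

For example, with tabstops at 0.5 and 1.0 inches (columns 5 and 10) and an indent of 3, the line "a<tab>bc<tab>d" comes out with "a" in column 3, "bc" in column 8, and "d" in column 13, counting from zero.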

Centered and right-justified displays are not implemented. They're treated as regular displays.

Tables


If your input document has tables written in the tbl language, preprocess the document with tblcvt rather than with tbl. Your output will look better that way.

Table cell borders are hard to do well. In tbl you can put a border on any cell boundary, but in HTML a table has either no borders or borders around every cell. Currently, tc2html puts borders around every cell.

Font Handling


Fonts are handled in tc2html by means of a table that associates four tags with each font name. The first two tags are used to turn the font on and off in normal text. The last two tags are used to turn the font on and off in displays. This table is read at runtime from the html-fonts file. Here's an example of what the file might look like:

   R    ""          ""             ""           ""
   I    <I>         </I>           <I>          </I>
   B    <B>         </B>           <B>          </B>
   BI   <B><I>      </I></B>       <B><I>       </I></B>
   C    <TT>        </TT>          ""           ""
   CW   <TT>        </TT>          ""           ""
   CI   <TT><I>     </I></TT>      <I>          </I>
   CB   <TT><B>     </B></TT>      <B>          </B>
   CBI  <TT><B><I>  </I></B></TT>  <B><I>       </I></B>
The difference between the tags for regular text and display text is that, since browsers implicitly switch the font to monospaced font in displays, the only thing that can be done for font changes there is to change the style attributes.

The initial font when tc2html begins is R (roman). When a font change occurs, the new font's begin tag is written out after terminating the previous font by writing its end tag. Using the font table just shown, this input:

   \font R
   abc
   \font I
   def
   \font CW
   ghi
   \font R
   jkl
becomes this output:
   abc<I>def</I><TT>ghi</TT>jkl
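The font-switching rule amounts to a table lookup. The Python sketch below hard-codes part of the sample table rather than reading an html-fonts file, and the function name render_fonts is invented; it is meant only to illustrate the end-tag-then-begin-tag ordering. Closing the font still in effect at end of input is an assumption.

```python
# name: (text on, text off, display on, display off), from the sample table
FONT_TAGS = {
    "R":  ("", "", "", ""),
    "I":  ("<I>", "</I>", "<I>", "</I>"),
    "B":  ("<B>", "</B>", "<B>", "</B>"),
    "CW": ("<TT>", "</TT>", "", ""),
}


def render_fonts(lines, in_display=False):
    """Turn \\font controls into tags, using the display columns if asked.

    Hypothetical sketch, not the tc2html implementation.
    """
    on, off = (2, 3) if in_display else (0, 1)
    current = "R"                     # initial font is roman
    out = []
    for line in lines:
        if line.startswith("\\font "):
            new = line.split()[1]
            out.append(FONT_TAGS[current][off])   # end the previous font
            out.append(FONT_TAGS[new][on])        # then begin the new one
            current = new
        else:
            out.append(line)
    out.append(FONT_TAGS[current][off])   # assumption: close at end of input
    return "".join(out)
```

With the input shown above, render_fonts produces the same output line in normal text. In a display, the CW tags are empty, so the same input gives abc<I>def</I>ghijkl.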

Tabs


Tabs are ignored except in displays. Adding extra space to tab over has no effect in regular paragraphs anyway, because browsers typically collapse runs of spaces.

Right-justified and centered tabs are treated as left-justified tabs. That is, they're completely botched.

Generating Better HTML


This section describes how you can embed hypertext links in your troff source and how to produce a table of contents containing clickable links to the main sections of your document.

Generating Hypertext Links


The \html controls used to generate hypertext links are:

   \html anchor-href URL
   \html anchor-name LABEL
   \html anchor-end
The first two controls generate opening <A HREF=URL> and <A NAME=LABEL> tags; the third generates a closing </A> tag.

To embed hypertext links in your troff source, you can use the macros .H*ahref and .H*aend, or .H*aname and .H*aend. To write an HREF link, the troff source looks like this:

   .H*ahref http://www.some.host/some/path
   hypertext link
   .H*aend
The resulting HTML looks like this:
   <A HREF="http://www.some.host/some/path">
   hypertext link</A>
To write a NAME link, the troff source looks like this:
   .H*aname my-name
   name link
   .H*aend
The resulting HTML looks like this:
   <A NAME="my-name">
   name link</A>
Section-header macros are usually redefined to generate a NAME anchor for the table of contents, so don't surround a section header with anchor-generating macros. You'll end up with nested anchors, which tc2html disallows. You can generate a NAME link for a section (e.g., so that you can refer to it using a specific name) as long as you don't write the link like this:
   .H*aname better-html
   .SH "Generating Better HTML"
   .H*aend
Instead, write it like this:
   .H*aname better-html
   .H*aend
   .SH "Generating Better HTML"
Unfortunately, some browsers don't seem able to jump to NAME anchors unless there is some text between the <A NAME> and </A> tags.

You can't make a section header a hypertext link. You'd have to put the header (which generates a NAME link for the TOC) between the .H*ahref and .H*aend macros, which would result in nested anchors.
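
If you want to check a troffcvt output stream for this problem before handing it to tc2html, something like the following Python sketch would do. It is a stand-alone illustration, not part of the distribution, and the function name find_nested_anchors is invented. It assumes that a redefined .SH expands to the anchor-toc, header, header-end, and anchor-end controls in that order, as in the -ms redefinition.

```python
ANCHOR_OPENERS = ("anchor-href", "anchor-name", "anchor-toc")


def find_nested_anchors(lines):
    """Return indexes of anchor-opening controls found inside an open anchor.

    Hypothetical checker; tc2html's own handling of nested anchors
    may differ.
    """
    nested = []
    open_anchor = False
    for i, line in enumerate(lines):
        parts = line.split()
        if len(parts) < 2 or parts[0] != r"\html":
            continue                  # plain text or some other control
        if parts[1] in ANCHOR_OPENERS:
            if open_anchor:
                nested.append(i)      # an anchor opened inside another one
            open_anchor = True
        elif parts[1] == "anchor-end":
            open_anchor = False
    return nested
```

Run on the controls generated by wrapping .SH in .H*aname and .H*aend, this reports the anchor-toc line; with the anchor closed before the .SH, it reports nothing.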

Generating a Table of Contents


Putting a table of contents (TOC) into an HTML document requires some postprocessing of the tc2html output. The TOC entries can't be written to the beginning of the document because they're not all known until the input has been read entirely. The approach adopted with tc2html is as follows:

If you run tc2html directly, you must also run tc2html-toc directly. If you use troff2html, tc2html-toc is run for you automatically.

The \html controls used to generate TOC entries are:

   \html anchor-toc N
   \html anchor-end
Text occurring between \html anchor-toc and \html anchor-end pairs is written to the output, but it's also collected and remembered. When tc2html encounters end of file on its input, it writes the TOC entries to the output between two other HTML comments:
   <!-- TOC BEGIN -->
   TOC entries
   <!-- TOC END -->
If you want to generate a TOC entry explicitly in your troff source, use .H*atoc and .H*aend. For example:
   .H*atoc 1
   My TOC Entry
   .H*aend
The argument to .H*atoc is the TOC entry level (1, 2, 3, ...).

It's unnecessary to invoke TOC macros directly if the section-header macros in your macro package are redefined to invoke the TOC macros for you. For example, the .SH for the -ms package is redefined like this in the tc.ms-html action file:

   req SH parse-macro-args eol \
        break fill adjust b \
        push-string ".H*atoc 1\n" \
        push-string ".H*header 2\n" \
        push-string "$1\n" \
        push-string ".H*header*end\n" \
        push-string ".H*aend\n"
To specify the TOC title and generate the TOC position marker, use the .H*toc*title macro. Invoke it as shown below, passing the title of your TOC as the first argument:
   .H*toc*title "Table of Contents"
.H*toc*title writes the TOC title to the output followed by a special HTML comment:
   Table of Contents
   <!-- INSERT TOC HERE -->
The INSERT TOC HERE comment is used by tc2html-toc, along with the TOC BEGIN and TOC END comments, to find the TOC entries and move them to the desired location.
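
The relocation step can be pictured with a short Python sketch. This is not the tc2html-toc source; the function name move_toc is invented, the input is taken as a list of lines with each marker on a line by itself, and all three marker comments are dropped from the result, which is an assumption about what the real tool does. The advisory marker is not handled here.

```python
TOC_BEGIN = "<!-- TOC BEGIN -->"
TOC_END = "<!-- TOC END -->"
TOC_HERE = "<!-- INSERT TOC HERE -->"


def move_toc(html_lines):
    """Move the lines between TOC BEGIN and TOC END to the TOC_HERE marker.

    Hypothetical sketch. If any marker is missing, the input is returned
    unchanged (as a copy).
    """
    try:
        begin = html_lines.index(TOC_BEGIN)
        end = html_lines.index(TOC_END, begin)
    except ValueError:
        return list(html_lines)
    if TOC_HERE not in html_lines:
        return list(html_lines)

    entries = html_lines[begin + 1:end]
    # Remove the whole TOC block, then splice the entries in at the marker.
    rest = html_lines[:begin] + html_lines[end + 1:]
    here = rest.index(TOC_HERE)
    return rest[:here] + entries + rest[here + 1:]
```

Since tc2html writes the TOC entries when it reaches end of file, the TOC block normally sits at the end of the document and the marker near the beginning, but the sketch does not depend on that order.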

Action files that provide macro package redefinitions for tc2html can try to place an advisory TOC location marker in the document. This is used if you don't specify a location marker explicitly with .H*toc*title:

   <!-- INSERT TOC HERE, MAYBE -->
For instance, the -man redefinitions put out this marker when the .TH macro has been seen. The marker causes a TOC to be placed after the title line and the first man page section, unless one is specified explicitly. No TOC title is written with the advisory marker, however, so the TOC will be "title-less."