troffcvt -- A troff Converter

Paul DuBois
dubois@primate.wisc.edu

Wisconsin Regional Primate Research Center
Revision date: 18 May 1997

ABSTRACT
Yet another troff-related program, with yet another set of misassumptions about how troff interprets its input, and its own set of deficiencies and bugs.


Introduction


This document describes troffcvt ("troff convert"), a program which assists in the process of converting troff documents to other formats. troffcvt doesn't do the full job of translation itself; rather, it is a preprocessor that turns troff files into an intermediate format with a syntax that is easier to interpret than the raw troff input language. troffcvt is intended as a front end that supplies input to a postprocessor which finishes the translation to produce output in the target format. Since the job of writing a translator for a given target format then need not include writing a troff-parser, the burden of the translator writer is reduced. In a sense, troffcvt is simply another sort of ditroff, one that produces a different output language than does ditroff.

troffcvt started out as a sed script for converting troff to RTF (Rich Text Format), but it quickly became evident that that wasn't going to be a very simple job to do correctly. It seemed a better justification of effort to write a more general tool that would be useful in contexts other than that of RTF production. The source distribution contains some example translation methods (simple postprocessors) you can look at. A standard troffcvt output reader is included in the distribution; it can be configured for use with your own postprocessors.

troffcvt has a number of significant shortcomings. It doesn't do very well with input that has been passed through tbl, eqn or pic. (For input containing tables, you can use tblcvt rather than tbl to get better results.) troff constructs that involve determination of motion or sizes sometimes are calculated inaccurately since troffcvt knows nothing of font metrics. Conditional construct processing is problematic also, as are position-dependent traps. troffcvt has other limitations; if those just listed are insufficient to dissuade you from using it as the basis for a translator, see the document troffcvt -- Notes, Bugs, Deficiencies.

Translation Model


troff files consist of text to be formatted interspersed with markup requests indicating how the formatting is to be done. The language implemented by troff is essentially an inverted programming language, where document text comprises the comments and markup requests provide the program indicating how to format the comments. This language isn't especially easy to parse, which may be why there are few tools for translating troff documents into other formats. Most tools that do exist seem to use pattern-matching transformations, rather than making any attempt to actually understand the troff language. The purpose of troffcvt is to make it easier to write troff-to-XXX translators, for arbitrary XXX, by doing the hard work of turning troff input into something easier to interpret. This means part of the job is already done for postprocessor writers, who can then concentrate on producing output in the desired target format rather than on figuring out how to understand troff files.

For example, point size might be set by some disgusting sequence like this:

   .ds x *a
   .ds y *b
   .nr \(\*x\(\*x 12
   .ds \(\*y\(\*y \\n(\(\*x\(\*x
   .ps \*(\(\*y\(\*y
This is digested by troffcvt and appears in the output in somewhat simpler form:
   \point-size 12
Caveat: The quality of translators obviously depends on the quality of troffcvt's preprocessing, which is suspect. Nor is the situation improved by the fact that various versions of troff sometimes do different things with identical input. This makes it difficult for troffcvt to do the "correct" thing in all cases, especially for input files that have been tailored to work with (i.e., around bugs in) a particular version of troff.

troffcvt produces output that preserves information about the structure of a document (e.g., margins, page length) and its contents (the text it contains). The goal is not to lay out text on pages. That is left to postprocessors, which are expected to lay out document content by interpreting structure information. Postprocessors may use the structure and/or content to varying extents. For instance, a postprocessor that simply extracts the text would ignore the structure information. A postprocessor that produces a summary of the structure (e.g., page layout information) would ignore the text. Most postprocessors will fall somewhere between these extremes.

Inevitably, a certain amount of information is lost. Usually this results from not knowing all the characteristics of the output device. For instance, no font metric information is used, so it's not possible to determine the position on the current page, or even to know what the current page number is.

An example of troffcvt operation is shown below. (The default resolution of 432 units/inch is assumed.)

   Input                         Output
   .ps 14                        \point-size 14
   .vs 16                        \spacing 96
   .ce                           \center 
   .ft B                         \font B 
   troffcvt\-a troff converter   troffcvt 
   .ft                           @minus 
                                 a troff converter 
                                 \break 
                                 \adjust-full 
                                 \font R 
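
The numbers in the Output column follow from the resolution. There are 72 points to the inch, so at 432 units/inch one point is 6 units, and a vertical spacing of 16 points comes out as 96 units. (Point size is reported in points, so it is unchanged.) The following is only a sketch of that arithmetic; the function names are hypothetical and are not part of troffcvt:

```python
# Hypothetical helpers illustrating the unit arithmetic; not troffcvt code.
RESOLUTION = 432        # basic units per inch (the default assumed above)
POINTS_PER_INCH = 72

def points_to_units(points, resolution=RESOLUTION):
    """Convert a distance in points to basic units."""
    return points * resolution // POINTS_PER_INCH

def inches_to_units(inches, resolution=RESOLUTION):
    """Convert a distance in inches to basic units."""
    return int(inches * resolution)

print(points_to_units(16))   # .vs 16 at 432 units/inch -> 96
print(inches_to_units(11))   # an 11-inch page length -> 4752
```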

troffcvt Output


troffcvt produces a mixture of control and text lines. Control lines correspond to document structure. They consist of a backslash character \ followed by a control word and possibly some parameters for the control word, e.g., \space, \font R, \page-length 4752. Text lines correspond to document content, and are either plain text written literally to the output, or begin with a "@" to indicate special characters (for instance, @bullet for the bullet character or @alpha for the Greek letter alpha).

None of the control or special-text keywords overlap, but it's still convenient to use different leading characters \ and @ to make it easier for simple filter programs to distinguish between them. For example, the following command strips control lines from a file containing troffcvt output:

   % sed -e '/^\\/d' filename
troffcvt output is rife with troff-isms, such as \need and \embolden. Little effort was made to map these to more general document layout concepts since it's not clear what gain, if any, there would be in doing so.
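
Because the leading character identifies the kind of line, a filter needs very little logic. The following Python sketch classifies lines by their first character and strips control lines, much as the sed command does. The function names and the sample data are illustrative assumptions, not part of the distribution:

```python
# Illustrative filter for troffcvt output; names here are hypothetical.

def classify(line):
    """Return 'control', 'special', or 'text' for one line of output."""
    if line.startswith("\\"):
        return "control"
    if line.startswith("@"):
        return "special"
    return "text"

def strip_control(lines):
    """Drop control lines, keeping plain text and special text lines."""
    return [ln for ln in lines if classify(ln) != "control"]

sample = ["\\font B", "troffcvt", "@minus", "a troff converter", "\\break"]
print(strip_control(sample))
```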

How troffcvt Works


There are two steps to turning troff files into some other format:

troffcvt is configured by means of action files, which are described below. troffcvt postprocessor writing is a separate issue from understanding how troffcvt itself works, and is covered in a separate document, troffcvt Output Format and Postprocessor Writing.

Probably the easiest way to get some idea of the relationship of troffcvt's input and output is to run some troff files through it and look at what comes out. When troffcvt runs, it reads one or more action files to configure itself, then processes input files according to the information in the action files. These are text files containing symbolic actions that specify what happens when requests occur. Action files are also used to define special characters and to set processing parameters.

troffcvt doesn't have built-in knowledge about any troff request. Stated another way, unless troffcvt is told how to implement a given troff request by means of some action file, it ignores that request. It also knows about very few of the characters that have special meaning (by design, since these vary from one version of troff to another). All of this stuff has to be specified in an action file. By default, troffcvt reads the action file actions when it runs. You can also name additional action files on the command line using the -a option.

The format of action files is simple. Blank lines are ignored. Lines beginning with a "#" character are also ignored, so you can use them to include comments. Actions are specified on a line consisting of a leading keyword to indicate the action type (imm or req), followed by an action list of zero or more actions. (An action line may be continued to the next line by putting a backslash at the end of the line.) Action lists can be executed immediately at the time the action file is read, or they can be associated with a request, to be executed whenever the request occurs in the input.

Immediate actions consist of the word imm followed by an action list that is executed as soon as it has been read. The first imm line below sets the point size to 10 points and vertical spacing to 12 points, whereas the second sets the font to roman:

   imm point-size 10 spacing 12p
   imm font R
Request actions are similar but specify a request name, a set of actions for parsing the request's arguments, and a set of actions for processing those arguments after they have been parsed:
   req request-name parsing-actions eol post-parsing-actions ...
request-name is the name of the request (without the leading period). The parsing-actions section specifies how to parse the arguments expected by the request. If parsing-actions is empty, no request arguments are expected (or are to be ignored). The eol keyword is mandatory. It signifies the end of the parsing actions and causes troffcvt to skip to the end of the request line. (If this were not done, the remaining part of the request line would be read as a separate line to be processed.) The post-parsing-actions section specifies what should happen after the request arguments have been parsed. Typically this involves interpreting the request arguments. If the post-parsing-actions section is empty, nothing is done with the request (the request is ignored).
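
To make the line format concrete, here is a rough Python sketch of a reader for it. It is not how troffcvt itself is written. It assumes that words are separated by whitespace (real action lists can contain quoted arguments, which this ignores), and every function name in it is hypothetical:

```python
# Sketch of an action-file reader; a simplification, not troffcvt's parser.

def read_action_lines(text):
    """Yield logical lines: skip blanks and # comments, join continuations."""
    pending = ""
    for raw in text.splitlines():
        if not pending and (not raw.strip() or raw.startswith("#")):
            continue
        line = pending + raw
        pending = ""
        if line.endswith("\\"):
            pending = line[:-1] + " "    # continued on the next line
            continue
        yield line.strip()

def parse_action_line(line):
    """Split an imm or req line into its parts."""
    words = line.split()                 # assumption: no quoted arguments
    kind = words[0]
    if kind == "imm":
        return {"type": "imm", "actions": words[1:]}
    if kind == "req":
        name, rest = words[1], words[2:]
        if "eol" not in rest:
            raise ValueError("req %s lacks the mandatory eol" % name)
        cut = rest.index("eol")
        return {"type": "req", "name": name,
                "parsing": rest[:cut], "post": rest[cut + 1:]}
    raise ValueError("unknown action type: %s" % kind)

sample = """\
# initial state
imm point-size 10 spacing 12p

req so parse-filename eol push-file $1
req ce parse-num x eol \\
  center $1
"""
parsed = [parse_action_line(ln) for ln in read_action_lines(sample)]
for entry in parsed:
    print(entry)
```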

troffcvt associates each action name with the number of arguments that should follow the action when it occurs in action lists. When an action is performed, any arguments specified in the action list are passed to it. For instance, the .so request can be described like this:

   req so parse-filename eol push-file $1
The parse-filename action parses the line on which the request occurs to find a filename. This filename becomes the value of argument 1, which can be referred to later as $1. push-file pushes the file named by $1 on the input stack. Since $1 refers to the first argument parsed from the .so request, if the request is ".so junk", then "push-file $1" becomes "push-file junk", and junk becomes the current input file.

The two req lines below show how the .ps and .ce requests can be defined:

   req ps parse-absrel-num x point-size eol point-size $1
   req ce parse-num x eol center $1
The actions to take when a .ps request occurs are: parse a number, which can be an absolute setting or relative to the current point size; skip to the end of the request line; set the point size using the previously parsed number. The actions for .ce are to parse a number, skip to the end of the request line, and cause the next "$1" input lines to be centered.

"Missing" arguments are passed as empty strings. A reference to $n is passed to the action as the empty string if no n-th argument was present on the input request line. Suppose the .ds request is defined like this:

   req ds parse-name parse-string-value y eol define-string $1 $2
Then if the following input line occurs, the parse-string-value action will find no string on the line, and the define-string action will define xx as the empty string:
   .ds xx
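
The substitution rule can be sketched in a few lines of Python. This is an illustration of the behavior described here, not troffcvt's code; the function name is hypothetical and the argument list stands in for whatever the parsing actions collected:

```python
import re

def substitute_args(action_words, args):
    """Replace each $n with the n-th parsed argument, or '' if it is missing."""
    def lookup(match):
        n = int(match.group(1))
        return args[n - 1] if 1 <= n <= len(args) else ""
    return [re.sub(r"\$(\d+)", lookup, word) for word in action_words]

# ".so junk": one argument was parsed
print(substitute_args(["push-file", "$1"], ["junk"]))
# ".ds xx": the string value is missing, so $2 becomes the empty string
print(substitute_args(["define-string", "$1", "$2"], ["xx"]))
```
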
The language implemented by troff is expressive (if somewhat unwieldy), so a large number of actions seem to be necessary to allow requests to be specified properly. Descriptions of all actions are given in the troffcvt Action Reference document.

If you don't like the actions file supplied with the troffcvt distribution, you can modify it as necessary for your own purposes.

Specifying troffcvt's behavior in terms of symbolic actions rather than hardwiring them into the code allows a good deal of flexibility, because troffcvt's initial state and response to requests can be modified without changing troffcvt itself. For example, different versions of troff often know about different sets of special characters; building the list at runtime allows different versions to be accommodated. The initial page layout can also be specified this way, since although initial values for processing parameters are the same as those given in the Ossanna manual, you can change them. Thus you can set up layouts for letter size, legal, A4, etc.

This method of configuring troffcvt also means you can experiment quite easily with troffcvt's response to particular requests.

Names and Objects


Request, macro, string, and register definitions consist of two parts: a name, and the underlying object to which the name points. troffcvt allows groff-style aliases to be created, such that referring to an alias name is the same as referring to the original name. Aliases are implemented by creating multiple names that all point to the same underlying object. The object structure contains a reference count indicating how many names point to it.

For example, when a macro is defined, a name is allocated and pointed at a macro object structure that holds the macro contents (the body of the macro). The reference count in the object structure is set to one. If an alias to the macro is created, a new name is created and made to point to the same object structure as the original name. The reference count in the object structure is set to two. When a request, macro, string, or register is removed, the name is deallocated and the reference count is decremented. If the count goes to zero, the underlying object is no longer needed (no other names point to it), and the object structure is deallocated as well.

The reference count also includes the number of times an object is currently in use. When a request or macro is invoked or a string reference occurs, the reference count of the underlying object is incremented. When the request or macro terminates, or the end of the string is reached, the count is decremented. This use of the reference count has two purposes:

Consider the following example of a macro that removes itself:
   .de xx
   .rm xx
   ..
When the macro is defined initially, the name xx is created and made to point at a macro object, which is given a reference count of one. Invoking the macro results in the following actions:
Now consider the following slightly more complicated macro, which removes itself after creating an alias to itself:
   .de xx
   .als yy xx
   .rm xx
   ..
The reference count is set to one when the macro is defined, two when the macro begins executing, three when the alias is created, two when xx removes itself, and one when xx terminates. In this case, however, since the reference count is still one when xx terminates (the name yy still points to the underlying macro object), the object is not deallocated.

Aliases provide a convenient way to implement the .rn request. The new name is created as an alias for the existing name, and thus points to the same underlying object. The old name is then removed, but since the underlying object is now pointed to by another name, it persists as it should.
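
The bookkeeping in this section can be traced with a small model. The Python sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, a "freed" flag stands in for deallocation, and redefining an existing name is not handled.

```python
# Toy model of names pointing at reference-counted objects.

class Obj:
    def __init__(self, body):
        self.body = body
        self.refs = 0
        self.freed = False

class Names:
    def __init__(self):
        self.table = {}

    def _release(self, obj):
        obj.refs -= 1
        if obj.refs == 0:
            obj.freed = True          # stands in for deallocating the object

    def define(self, name, body):
        obj = Obj(body)
        obj.refs = 1                  # one name points at the new object
        self.table[name] = obj
        return obj

    def alias(self, new, old):
        obj = self.table[old]
        obj.refs += 1                 # a second name, same object
        self.table[new] = obj

    def remove(self, name):
        self._release(self.table.pop(name))

    def rename(self, old, new):       # .rn: alias first, then remove old name
        self.alias(new, old)
        self.remove(old)

    def begin_use(self, name):        # macro invoked or string referenced
        obj = self.table[name]
        obj.refs += 1
        return obj

    def end_use(self, obj):           # macro terminates or string is used up
        self._release(obj)

names = Names()
obj = names.define("xx", ".als yy xx / .rm xx")   # count 1
running = names.begin_use("xx")                   # count 2
names.alias("yy", "xx")                           # count 3
names.remove("xx")                                # count 2
names.end_use(running)                            # count 1, object survives
print(obj.refs, obj.freed, sorted(names.table))
```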

Macro Package Handling


troff is commonly invoked with some sort of -mxx flag (e.g., -man, -me, -mm, -ms), so these need to be handled by troffcvt as well. There are several ways of handling a macro package, some better than others:

You can also use a combination of the methods above. Probably the easiest way to start is to run a few documents through troffcvt and supply only an -mxx argument:
   % troffcvt -mxx myfile
This will tell you which macros troffcvt handles okay and which it botches. With that information in hand, you can construct an action file tc.mxx containing redefinitions for those macros that troffcvt needs help with. Try out the action file like this:
   % troffcvt -mxx -a tc.mxx myfile
By experimenting with tc.mxx, you can improve troffcvt's handling of any document that uses the -mxx macro package.

Some of the examples shown above demonstrate how to redefine macros, but do so by defining them using req lines. Thus, these "macros" are actually treated by troffcvt as requests. Before you redefine a macro as a request, be sure you understand the following points:

If a name that you're defining in an action file must refer to a macro and not to a request (e.g., if you want to use it with .it or .em, or if you want to be able to append to it later using .am), then don't define it using a req line. If you do, it'll be considered a request by troffcvt. Instead, use an imm line containing a push-string action to execute a string that contains the contents of a .de request. For example:
   imm push-string ".de xx\n.tm this is macro xx\n..\n"
If you provide redefinitions that might get used in concert with macro packages written for groff, here's something to watch out for: before redefining a name for which a definition may have already been read from the macro package file, it's prudent to remove the name first, like this:
   imm remove-name XX
   req XX definition...
This is due to the way that groff implements macro packages. Consider the -ms macros. These are supposed to be used such that .TL, .AU, .AI, and .AB occur in order if they are used. To make sure they aren't invoked out of order, the groff -ms definitions initially create .AU, .AI, and .AB as aliases to a macro that checks whether or not .TL has been invoked. When .TL is invoked it redefines the other macros appropriately with their "real" definitions. Now, suppose that you handle -ms by reading the macro package file and then redefining in an action file some of the macros, such as .AI and .AB. If you simply provide a new definition of .AI, what happens is that you also redefine all other names that are aliased along with .AI. In other words, you also redefine .AU and .AB! If you then redefine .AB, you also redefine .AU and .AI. Removing a name before giving it a new definition avoids this problem.

Conditions That Prevent Macro Redefinition


Suppose you normally format a document mydoc using a command something like this:

   % troff -ms mydoc
If you use .so mymacros in mydoc to read a file of macro definitions, you may have a problem if you want to process mydoc with troffcvt. In particular, if you want to redefine any of the macros in mymacros for troffcvt's benefit, you won't be able to use an action file to do so:
If you really need to redefine the macros in mymacros when you format mydoc with troffcvt, you can use the following strategy:
If you use groff, an alternative strategy can be used. Leave the .so mymacros request in mydoc, but surround each definition in mymacros with an .if d test:
   .if d xx .ig end_ignore
   ...macro definition here...
   .end_ignore
Then you can format the document with troffcvt like this:
   % troffcvt -ms -a tc.mymacros mydoc
When tc.mymacros is processed, it defines some or all of the macros used in mymacros. When mydoc is read and the .so mymacros request is processed, only those macros that were not already defined in tc.mymacros will be defined.

Similar considerations apply if you define macros directly in your troff source file. You won't be able to override them in an action file because the definition in the troff source file occurs later and will take precedence. To work around this, put the macro definitions in a separate file and use the first strategy described above, or use .if d as in the second strategy.

Input/Output Mechanisms


Character Coding


ChIn() returns values of type XChar, which is typedef'ed as an unsigned integer type. The return value falls into the following ranges:

0

This value signifies end of file on the current input source.
1..127 (0x01..0x7f)

Plain ASCII character.
128..255 (0x80..0xff)

8-bit (non-ASCII) input character.
257..511 (0x101..0x1ff)

Escape code for ASCII or 8-bit character preceded by an escape character (except \(, see below). The code for \X is constructed as 0x100|X. Note that 0x100 is not a valid escape code because null bytes are stripped from the input.
>=512 (>=0x200)

Special-character code. Sequences of the form \(xx or \[xxx] are recognized and converted to special-character codes. These codes start at 512 so that they are greater than all ASCII, 8-bit, or escape codes. If a special-character reference is encountered for a name that has no definition (i.e., the character was not defined in any action file), a new special character with an empty value is created on the fly. This is done on the following grounds:

UnChIn() takes an XChar argument, which is usually a value returned from ChIn(). UnChIn() pushes the argument onto the input pushback stack, unpacking escape and special-character codes into their original multiple-character input sequences. Unpacking is done to prevent problems. Suppose an escaped or special character is first seen in non-copy mode, then pushed back and reread in copy mode. If the escape code or special-character code itself were pushed back, the character wouldn't be reread in copy mode properly.

Values for plain ASCII and 8-bit characters can be represented in a single byte (as an unsigned character), but escape codes and special-character codes cannot, since escape codes begin at 257 and special-character codes at 512. This is why the XChar type is wider than a single byte.
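
The ranges above can be summarized in a short Python sketch. The constant and function names are hypothetical (troffcvt itself is not written in Python); only the numeric boundaries and the 0x100|X construction come from the description above:

```python
# Illustration of the XChar value ranges; names are hypothetical.
ESCAPE_BASE = 0x100      # the code for \X is 0x100|X
SPECIAL_BASE = 0x200     # special-character codes start at 512

def escape_code(ch):
    # Build the code for an escaped character from the character itself.
    return ESCAPE_BASE | ord(ch)

def classify(code):
    """Name the range that an XChar value falls into."""
    if code == 0:
        return "end of input"
    if code <= 0x7F:
        return "ascii"
    if code <= 0xFF:
        return "8-bit"
    if code == ESCAPE_BASE:
        raise ValueError("0x100 is not a valid escape code")
    if code < SPECIAL_BASE:
        return "escape"
    return "special"

print(hex(escape_code("e")), classify(escape_code("e")))
print(classify(ord("A")), classify(0xE9), classify(512))
```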

Special characters are disallowed in request arguments and escape sequences that might be written back out directly. For instance, .ft F is written out as \font F, so F isn't allowed to contain special characters. A similar restriction applies to diversion names.

Special-character names must consist entirely of printable ASCII characters. They are not allowed to be composed of other special characters, e.g., \(\(ts\(ts is disallowed.

Input Processing


Input may come from a file, a macro, a named string (created with the define-string action, usually in response to a .ds request), or an anonymous string (defined below under the description of the AChIn() function). The bulk of input usually comes from input files named on the command line, which are processed in sequence. Input sources may be nested (e.g., a macro or string may be referenced while reading a file). The current input is suspended when another input source is interpolated into the input stream, and is resumed when the interpolated source is exhausted.

ChIn() returns the next input character from the input stream. Embedded newlines (introduced with a backslash character \ at the end of a line) are deleted so that the following input line appears contiguous with the current line to any higher-level routines. Comments (introduced with \") are deleted up to (but not including) the end of line character. For instance, this makes:

   text followed by comment\" this is the comment
appear to be:
   text followed by comment
The handling of lines that begin with .\" happens properly; the comment stripping makes the line look like a line beginning with a control character but no request, so it is ignored. ChIn() also manages encoding of escaped characters, and pushing to input sources for number register, string or macro argument references. Handling of escape sequences differs depending on whether copy mode is in effect or not.

Input characters accepted by the file-input routine are non-null ASCII values (null bytes and bytes with bit 8 on are discarded). Escaped characters (\x) and special-character references (\(xx or \[xxx]) are converted to escape codes and special-character codes as described above under "Character Coding."

Input source pushing occurs automatically in ChIn() when \n, \* or \$ occur (and also \w if not in copy mode): the input source switches to a string representing the value of the number register or string, the macro argument, or the result of the width calculation. Higher level routines also can cause the current input source to be pushed down, e.g., when a .so request occurs.

ChIn() is also used for the ugly task of processing multi-line conditional input (bracketed with the \{ and \} sequences). The conditional request processor saves the current if-level when it sees a \{, bumps it up one, then processes lines until the level drops back down to the saved value. ChIn() notices \}, silently deletes it and decrements the if-level, which is then noticed by the conditional processor. Pretty horrid.

UnChIn() is used to push characters back onto the input stream. It understands how to push back escape codes and special-character codes properly. It also understands how to push back multiple characters (characters must be pushed in the reverse order from that in which they were read).

ChIn0() returns the next raw (uninterpreted) character from the input stream. If there are any pushed back characters waiting to be reread, it returns the one most recently pushed. Otherwise, it returns the next input character from the current input source. If that source is exhausted, it resumes reading from the previous source. When there is no more input, it returns endOfInput. Input source unwinding is undetectable at any level above ChIn0(), including ChIn().
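
The interplay of pushback and nested sources can be sketched as follows. This is a minimal Python model, assuming one-character strings for input characters and 0 as a stand-in for endOfInput; the class and method names are hypothetical and merely echo the function names used here:

```python
# Toy model of a source stack with pushback; not troffcvt's implementation.
END_OF_INPUT = 0            # stand-in for the endOfInput value
_EXHAUSTED = object()       # private marker for a source that has run out

class InputStack:
    def __init__(self):
        self.sources = []   # stack of iterators; the last one is current
        self.pushback = []  # most recently pushed character is reread first

    def push_source(self, text):
        self.sources.append(iter(text))

    def unchin(self, ch):
        self.pushback.append(ch)

    def chin0(self):
        if self.pushback:
            return self.pushback.pop()
        while self.sources:
            ch = next(self.sources[-1], _EXHAUSTED)
            if ch is not _EXHAUSTED:
                return ch
            self.sources.pop()      # exhausted: resume the previous source
        return END_OF_INPUT

inp = InputStack()
inp.push_source("ab")
first = inp.chin0()                 # 'a' from the file-like source
inp.push_source("XY")               # an interpolated source suspends "ab"
rest = [inp.chin0() for _ in range(4)]
print(first, rest)
```

The caller sees X, Y, b in turn and never observes the switch back to the outer source, which is the sense in which unwinding is undetectable above this level.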

FChIn(), MChIn(), and AChIn() are the lowest level input routines; they're called by ChIn0(). These return a single character from a file, macro or named string, or "anonymous" string input source. Each returns endOfInput when the source is exhausted (which only means the current source is done, not necessarily that all sources are done). EOF is not returned because that is typically -1 (negative), and the input routines return a value of type XChar, which is unsigned.

FChIn() discards nulls. (It also converts CR or CRLF to LF; this has nothing to do with troff, but allows text files from MS-DOS or Macintosh machines to be read without requiring you to convert line endings first.)
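
As an illustration of that filtering, here is a small Python sketch that drops nulls and maps CR or CRLF to LF. It works on a whole string for simplicity, whereas FChIn() returns one character at a time; the function name is hypothetical:

```python
def normalize_input(data):
    """Drop NUL characters and turn CR or CRLF into a single LF."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "\0":
            pass                                   # nulls are discarded
        elif ch == "\r":
            out.append("\n")
            if i + 1 < len(data) and data[i + 1] == "\n":
                i += 1                             # swallow the LF of a CRLF
        else:
            out.append(ch)
        i += 1
    return "".join(out)

print(repr(normalize_input("a\r\nb\rc\nd\0e")))
```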

MChIn() reads the next character from a macro or string definition. (Strings are implemented internally as macros without arguments.)

AChIn() reads the next character from an anonymous string, which is just some arbitrary string that is to be used as an input source. For instance, when a number register reference (\n) or width expression (\w) occurs, the resulting value is converted to a character string, which becomes the current input until the string is completely read. References to macro arguments are treated similarly; the argument value is retrieved and pushed on the input stack. Another source of anonymous strings is the push-string action, which can be used in action files to push an arbitrary string onto the input stream. This is convenient for processing certain requests. For instance, if you want to redefine a macro, you can define the action for that macro to be one that pushes alternative input. Here's an example that shows how the .AB macro from the -ms macro package might be redefined:

   req AB parse-macro-args eol \
     push-string ".br\n.ce\n\\fIABSTRACT\\fR\n.sp\n"
One sticky problem occurs with the .nx request, usually processed with the switch-file action. When .nx occurs, it might happen while other files or macros are active. If the current input source is a file there is no problem since the file pointer for that source is simply switched to the new file. But if the request occurs in the middle of a macro, it's less clear what should happen. Should the macro continue to be processed? I elect to terminate macro sources and unwind the source stack until a file is found, then switch the file pointer of the file source. Possibly this is wrong; the troff manual is ambiguous on this point. (Which may be why different versions of troff behave differently in this situation.)

For handling the .ex request, the end-input action is used; it sets a flag causing ChIn() to return endOfInput forever after.

Output Processing


At the lowest output level, there are two calls. One is for writing characters and it simply writes to the output file and dies if there was an error. The other is for writing strings; it calls the write-character routine for each character in the string.

The next level up manages the mechanics of collecting plain text lines and interspersing them with special text and control lines. The basic issues are insertion of spaces between successive output text lines and making sure that special text and control lines don't get written into the middle of a plain text line.

Control lines begin with a backslash character \. Any plain text output line being collected is flushed so the control string doesn't appear on the same line.

There are two kinds of text output: plain text lines, and special text lines that indicate special characters (e.g., @backslash for the \ character). Whenever text output (either kind) is written, a check is made to see whether it's necessary to write a preceding space first. A space is usually needed between consecutive input text lines (exceptions are when centering or no-fill are in effect, or if an input line ends with a \c). For special text, any plain text output line being collected is flushed so the special text doesn't appear on the same line.

The output character set for text is such that most printable ASCII characters appear as themselves, and others are written out as special text lines. The characters tab, backspace, \, and @ are printable but written as specials @tab, @backspace, @backslash, and @at. The leader character SOH is written as @leader.
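
The flushing rule can be sketched in Python as below. The sketch deliberately omits the logic that inserts a space between consecutive text lines, and the class and method names are hypothetical; only the special names (@tab, @backspace, @backslash, @at, @leader) come from the description above:

```python
# Toy output collector; illustrates flushing, not troffcvt's actual writer.
SPECIALS = {
    "\t": "tab",
    "\b": "backspace",
    "\\": "backslash",
    "@": "at",
    "\x01": "leader",       # SOH, the leader character
}

class Writer:
    def __init__(self):
        self.lines = []     # finished output lines
        self.current = ""   # plain text line being collected

    def flush(self):
        if self.current:
            self.lines.append(self.current)
            self.current = ""

    def control(self, text):
        self.flush()        # keep the control line on a line of its own
        self.lines.append("\\" + text)

    def special(self, name):
        self.flush()        # same rule for special text
        self.lines.append("@" + name)

    def text(self, chars):
        for ch in chars:
            if ch in SPECIALS:
                self.special(SPECIALS[ch])
            else:
                self.current += ch

w = Writer()
w.control("font B")
w.text("a@b")
w.control("break")
print(w.lines)
```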

Input Levels


troffcvt maintains a notion of input level. The level is incremented each time a new input source begins and decremented when the current source ends. A file interpolated with .so is an input source, but so is a macro, a macro argument, a string, or a number register. This helps avoid the problem of interpreting something like this:

   .if '\*[xx]'y' ...
when the string xx contains an apostrophe. troffcvt uses the input level in such a way that troff constructs bounded by delimiters do not consider the closing delimiter to be found unless it occurs at the same input level as the opening delimiter. (If you simply look at characters as they occur, then the apostrophe in the string prematurely terminates the scan for the first of the strings to be compared, and throws off the comparison.) The affected constructs include:
   .if 'x'y'
   .tl 'left'center'right'
   \b'abc...'
   \h'N'
   \l'Nc'
   \L'Nc'
   \o'abc...'
   \v'N'
   \w'string'
The input level also affects parsing of macro arguments that begin with a double quote. Only a quote at the same input level as the opening quote terminates the argument.

The behavior just described mimics how groff treats its input, not how standard troff treats its input. However, in compatibility mode groff ignores the input level (and thus acts like standard troff). troffcvt does the same. (Parsing routines that need to check the input level call the ILevel() function. In compatibility mode this function always returns zero, making all input appear to be at the same level.)
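
The effect of input levels on delimiter matching can be shown with a toy model. In the Python sketch below, each character carries the level of the source it came from, and a closing delimiter counts only at the level of the opening one. Everything here is an illustrative assumption: the function names are hypothetical, only \*[name] references are expanded, and the compat flag imitates a level function that always returns zero.

```python
# Toy model of level-aware delimiter scanning; not troffcvt's parser.

def expand(text, strings, level=0):
    # Yield (char, level) pairs; a \*[name] reference is interpolated
    # one level higher than the text that contains it.
    i = 0
    while i < len(text):
        if text.startswith("\\*[", i):
            end = text.index("]", i)
            name = text[i + 3:end]
            yield from expand(strings.get(name, ""), strings, level + 1)
            i = end + 1
        else:
            yield text[i], level
            i += 1

def split_delimited(chars, compat=False):
    """Split input of the form 'a'b' into fields, honoring input levels."""
    chars = list(chars)
    delim, open_level = chars[0]
    if compat:
        open_level = 0
    fields, current = [], ""
    for ch, level in chars[1:]:
        if compat:
            level = 0               # every character appears to be at level 0
        if ch == delim and level == open_level:
            fields.append(current)
            current = ""
        else:
            current += ch
    return fields

strings = {"xx": "it's"}
text = "'\\*[xx]'y'"
print(split_delimited(expand(text, strings)))
print(split_delimited(expand(text, strings), compat=True))
```

With levels honored, the apostrophe inside the string is ordinary text and the two fields come out intact; in the compatibility case it ends the first field early.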

groff produces a quoted argument list when \$@ occurs in the input. The groff documentation says that it processes the list such that the quotes surrounding an argument appear at the same input level, whereas the argument itself is processed at a higher level. (This prevents the problems that would occur if an argument contained a quote.) I take this to mean that the quotes surrounding the arguments are at a level higher than the context in which the \$@ occurs, and the arguments one level higher than that, in case something like the following occurs in a macro:

   .xx "\\$@"
If the quotes produced by \$@ here were treated as being at the same level as quotes in the surrounding text, the arguments to .xx could be messed up.

troffcvt handles \$@ by constructing a string consisting of a list of argument references that looks like this:

   "\\$1" "\\$2" ... "\\$n"
Then the string is pushed on the input stack. This causes the quotes to be processed a level higher than the surrounding text. When each argument reference in the string is encountered, the value of the argument is pushed on the stack, causing the reference to be processed another level higher.

troffcvt handles \$* in a manner similar to \$@ except that no quotes are added to the string containing the list of argument references.

Macro Argument Quoting


Macro arguments consist of strings of non-white characters. Arguments may be quoted to allow whitespace to be included. An argument that begins with a double quote is parsed in quote mode until a closing quote, and the leading and terminating quotes are stripped off.

Double quotes in macro arguments are handled as follows:

The quote stripping described above presents an interesting problem in standard troff. If an argument contains quotes, and then is used inside the macro by being passed to another macro, quote stripping occurs again. This is really ugly, because it means you must understand the implementation of the macros you're using and know how many extra quotes to put in your arguments so that they end up with the correct number when they finally reach the bottom-level macro.

Neither groff nor troffcvt has this problem, since quotes in arguments occur at a higher level than the surrounding text. (In compatibility mode, troffcvt uses the quote-stripping behavior of standard troff.)