troffcvt
Notes, Bugs, Deficiencies

Paul DuBois
dubois@primate.wisc.edu

Wisconsin Regional Primate Research Center
Revision date: 7 March 1997


General Information


This document contains miscellaneous observations about how troffcvt behaves, and tries to document its limitations. It should be read by anyone trying to write a postprocessor for troffcvt output.

troffcvt supports the full troff language, aside from some specific exceptions noted below. These are discussed in sections numbered in parallel with Ossanna's troff manual. Many are related to insignificant or obscure features of the language (e.g., .fl, .pm). Some are more significant (e.g., diversion mishandling). In some sense these exceptions form the troffcvt bug list.

troffcvt supports a limited subset of the groff extensions to standard troff. In general, you should assume that any particular groff extension is not supported by troffcvt, but there are some important exceptions such as aliases and long names. For more details, see the document troffcvt Support for groff.

The most general and pervasive exception to standard troff processing is that troffcvt knows nothing about the characteristics of any output device; in particular, it uses no font metric information. This means it doesn't know how wide or tall any character is. This exception is pervasive in that it affects handling of a number of requests and other aspects of the language. Some of the implications are:

Non-use of font metric information is deliberate; it isn't a goal of troffcvt to lay out text on pages. If it were, ditroff would be more useful. The goal is to make it easier for other programs to lay out text, by producing input for those programs that's more easily interpretable than straight troff input. Along with this is the goal of producing input that's easily transformable before it's fed into the final translator. Example transformations include: mapping \font lines onto fonts available in the target format; scaling character sizes up or down for easier reading or tighter packing, by mapping \point-size and \spacing lines; changing page layout, e.g., for production of legal or A4 size pages. Tying troffcvt output to font metrics would make these sorts of transformations difficult.

The numbering of the sections that follow corresponds to the section numbering in the Ossanna troff manual, to make it easier to determine where troffcvt bugs affect requests listed in a given section of the Ossanna manual.

1. General Explanation


1.2. Formatter and device resolution


The default resolution used by troffcvt is 432 units/inch, but it may be changed with the -r option. (A good value might be the least common multiple of 72 and the resolution you use in the target format.)
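
As a rough illustration of that suggestion, the following Python sketch computes the least common multiple of 72 and a target resolution. The function name and the sample target resolutions (1440 and 300 units/inch) are made up for the example; troffcvt defines neither.

```python
from math import gcd


def suggested_resolution(target_units_per_inch):
    """Least common multiple of 72 (points per inch) and the target resolution."""
    return 72 * target_units_per_inch // gcd(72, target_units_per_inch)


print(suggested_resolution(1440))  # 1440: 72 already divides 1440 evenly
print(suggested_resolution(300))   # 1800
```

With a resolution chosen this way, both whole points and whole target units come out as whole numbers of basic units.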

Since resolution is not fixed, postprocessors should use the value specified on the \resolution line that appears as the first line of the setup section. It indicates the number of basic units per inch. Numeric values on following control lines that are specified in basic units can be converted to other units as necessary using this resolution.
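
A postprocessor might handle this along the following lines. This is only a sketch: it assumes the line consists of the keyword \resolution followed by a single integer, and the function names are hypothetical.

```python
def parse_resolution(line):
    """Return basic units per inch from a line assumed to look like '\\resolution N'."""
    parts = line.split()
    if len(parts) != 2 or parts[0] != r"\resolution":
        raise ValueError("not a \\resolution line: %r" % line)
    return int(parts[1])


def units_to_points(units, resolution):
    """Convert basic units to points (72 points per inch)."""
    return units * 72 / resolution


def units_to_inches(units, resolution):
    """Convert basic units to inches."""
    return units / resolution


res = parse_resolution(r"\resolution 432")
print(units_to_points(60, res))   # 10.0
print(units_to_inches(216, res))  # 0.5
```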

1.3. Numerical parameter input


The default scaling for unscaled numbers in troff requests is not hardwired into troffcvt. Instead, scaling is specified in the action file, although it's a good idea to use the same default there that troff uses:

   req sp parse-num v eol break space $1   good
   req sp parse-num i eol break space $1   bad

Expressions that involve calculation of "amount of motion to reach an absolute position" (as in, e.g., |3.2c) evaluate to zero. Since the current position is unknown, the distance to any other position cannot be determined. This affects processing of tbl output particularly, since tbl is fond of using \h'|N' to line up columns.

2. Font and Character Size Control


2.2. Fonts


Fonts R, I and B are initially mounted on positions 1, 2 and 3, respectively, and the special font is mounted on all other positions. This means fonts R and 1, I and 2, etc., are considered equivalent. If a different font is mounted on a given position, references to that font, either by name or by number, are considered equivalent. This is logical to me, although in fact it doesn't reflect the behavior of all troff versions. For instance, xroff does not necessarily consider R and 1 equivalent unless you mess around with its font map.

The .f register is set to the number of the current font, or zero if the current font is not mounted (it is allowable to refer to a font simply by naming it, so the current font doesn't necessarily have any number). .ft 0 and \f0 are taken as referring to this font.

Font changes are written by name (not number) in the form \font name, where name must be interpreted by the postprocessor. This is a difficult problem since font names tend to be site-specific and idiosyncratic, although the standard troffcvt file reader provides some simple font handling support that might be useful.

Output from the .bd request typically appears as the \embolden and \embolden-special control lines. It's not clear whether it's worth it for postprocessors to support this request, particularly the special-font variant. Although one can switch to the special font explicitly (.ft S, \fS), characters from the special font are also logically part of the other default fonts, and thus referenced for particular characters even if S is not the current font. To fully support special font bolding, you'd need to keep track of all the characters in the special font and check every output character to see if it needs to come from that font. Besides, this whole business of the relationship between the special font and other fonts seems tightly linked to the particular typesetting machinery used when troff was originally written.

2.3. Character size


Any positive character size is allowed. For historical reasons, embedded absolute size changes may be one or two digits up to a size of 36, i.e., \s36 is the same as .ps 36 while \s37 is the same as .ps 3 followed by "7". Non-numeric input following \s is interpreted the same way as \s0.
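
The one-or-two-digit rule can be sketched in Python as follows. This covers only absolute sizes (no + or - prefix) and is not troffcvt's actual code. Two details are assumptions of the sketch: a two-digit size must lie between 10 and 36, and a non-numeric character is left unconsumed. The function name is hypothetical.

```python
DIGITS = "0123456789"


def parse_abs_size(text):
    """Parse what follows \\s; return (size, remaining text)."""
    if not text or text[0] not in DIGITS:
        return 0, text  # non-numeric input: treated like \s0
    # Two digits are taken only when together they form a size from 10 to 36.
    if len(text) >= 2 and text[1] in DIGITS and 10 <= int(text[:2]) <= 36:
        return int(text[:2]), text[2:]
    return int(text[0]), text[1:]


print(parse_abs_size("36x"))  # (36, 'x')
print(parse_abs_size("37x"))  # (3, '7x')
```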

For the .cs request, only the font name is written out on the \constant-width line; the width in which the characters are to be written is currently ignored.

3. Page Control


Since the current page number cannot be reliably determined, .bp and .pn requests which specify a relative page number change are not reliable.

My troff Summary and Index indicates that Vs are the default scaling unit for .bp and .po requests. The actions file supplied with the troffcvt distribution tells troffcvt to ignore scaling for .bp and to use ems for .po, which seems to make more sense.

.mk and .rt are not supported.

4. Text Filling, Adjusting and Centering


troff tosses extra spaces at the end of text lines. troffcvt tries to do the same but gets confused by sequences such as "abc\fI \fP". The trailing spaces are retained in the output, erroneously.

4.1. Filling and adjusting


Use of the .j register as the argument to the .ad request is allowed. Note: This depends on all the internal adjustment mode type values being single-digit non-negative integers so that the argument can be parsed by the parse-char action. The internal codes are not necessarily the same as those used by any particular version of troff. (The codes are known not to be the same as those assumed by tbl, but I'm not sure exactly what tbl assumes.)

No hyphenating is done; that is left for the postprocessor. The optional hyphenation character appears as @opthyphen in the output.

The .n, nl and .h registers are not set.

\p appears in the output as \break-spread.

4.2. Text Interruption


In all the CFA (center, fill, adjust) modes, text interruption in the input (\c) is processed such that the next text line appears to be logically glued to the current one. The resulting logical line counts as a single input line. (Actually, this appears to be only sometimes true, e.g., for .ce, but not, evidently, for .ul or .it. Huh.)

Text interruption in the input will not appear explicitly in the output and thus is of no importance for postprocessors. \c is manifest in troffcvt output merely as an absence of a leading space on the next text output line. Example:

   Input 1        Input 2
   abc            abc\c
   def            def
   Output 1       Output 2
   abc            abc
    def           def

Postprocessors would write these out as "abc def" and "abcdef", respectively.
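
In other words, a postprocessor does not need to know about \c at all; it only has to concatenate text lines exactly as they arrive. A minimal Python sketch, assuming the text lines have already been separated from control lines (the function name is hypothetical):

```python
def join_text_lines(lines):
    """Concatenate text lines; a leading space on a line is the word break."""
    return "".join(lines)


print(join_text_lines(["abc", " def"]))  # abc def
print(join_text_lines(["abc", "def"]))   # abcdef
```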

5. Vertical Spacing


5.2. Extra line-space


The .a register is not set.

.sv, .os, .ns, .rs are not supported.

7. Macros, Strings, Diversion, and Position Traps


The troff manual doesn't say it, but .rm allows multiple names to be specified for removal on a single request. troffcvt does, too.

The troff manual doesn't say that you can invoke macros as strings, either, but you can. troff prints "abc" when given the following input:

   .de xx
   abc
   ..
   \*(xx

You can also invoke a string as though it were a macro (i.e., by uttering the string name on a line by itself with a leading dot). The contents of the string are interpolated into the input in place of the line on which the invocation occurs. However, since strings have no terminating newline, the input line following this "macro" invocation is taken as part of the same input line on which the invocation occurs.

troffcvt treats macros and strings as essentially equivalent. The primary difference is that strings don't have arguments.

7.3. Copy Mode


The copy mode mechanism doesn't care how long strings are.

7.4. Diversions


These are "supported" in a poor way that probably should be changed. Diversion output isn't saved and just goes to stdout like everything else. Output for diversion xx is bracketed by \diversion-begin xx and \diversion-end xx for .di or by \diversion-append xx and \diversion-end xx for .da. Diversion output may be nested, which is one reason support is poor. (It puts the burden on the postprocessor to unnest them.)
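
One way a postprocessor might do the unnesting is with a stack of open diversion names, as in the Python sketch below. It assumes each bracket line is the keyword followed by exactly one name, that \diversion-begin starts a diversion afresh while \diversion-append adds to any earlier content, and that nested content belongs only to the innermost open diversion. The function name is hypothetical.

```python
def unnest_diversions(lines):
    """Split troffcvt output lines into main output and per-diversion lists."""
    main = []
    diversions = {}
    stack = []  # names of currently open diversions, innermost last
    for line in lines:
        parts = line.split()
        keyword = parts[0] if parts else ""
        if keyword == r"\diversion-begin" and len(parts) == 2:
            diversions[parts[1]] = []  # .di: start over
            stack.append(parts[1])
        elif keyword == r"\diversion-append" and len(parts) == 2:
            diversions.setdefault(parts[1], [])  # .da: keep earlier content
            stack.append(parts[1])
        elif keyword == r"\diversion-end":
            if stack:
                stack.pop()
        elif stack:
            diversions[stack[-1]].append(line)
        else:
            main.append(line)
    return main, diversions
```

This still leaves open where the collected diversion text should finally be placed, which is the harder part of the problem.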

Diversion output is not saved in a macro body, because diversions are often linked to position traps and thus might never be called. Since that would lose the output completely, I judged it better to interpolate the diversion into the output at the point at which it is created. The down side is that for diversions which are invoked explicitly, the diversion doesn't appear where it should.

Possibly diversion output should be saved in temporary files and written to the output when the diversion is done. But the question is: when is a diversion "done"? (There may be a .da later in the input.)

The .d, .h, .t, dn and dl registers are not set. The .z register is the name of the current diversion, not a numeric value. Its value is empty if no diversion is currently active, otherwise the current diversion name is interpolated into the output.

7.5. Traps


Position and diversion traps (.wh, .ch, .dt) are not supported. troffcvt ought at least to write out some of the information for these requests so that postprocessors could try to use it if they wanted.

The input line trap (.it) is supported.

8. Number Registers


The troff manual doesn't say it, but .rr allows multiple registers to be specified for removal on a single request. troffcvt does, too.

The manual also doesn't say that if the increment or format arguments are missing, and the register already exists, the existing increment and format carry into the new definition. In troffcvt, only the increment carries through, since formats are broken (see below).

You cannot set, rename, remove or change the format of read-only registers.

The number register formats i, I, a and A are broken. These all print in the default format. Formats 01, 001, etc. are not yet parsed correctly either.

The ct, dl, dn, hp, ln, nl, sb, and st registers are not supported.

The value of the % register is unreliable, since the "current page number" is unknown.

The .A, .T, .a, .d, .h, .n, .t, .x, and .y registers are not supported.

The .w register is always set to 1 en, since troffcvt calculates widths of strings by assuming that all characters are 1 en wide. (See §11.)

The .z register is anomalous, since it's not really a number; see notes for §7.4. (This isn't a troffcvt bug; troff treats .z specially, too.)

References to non-existent or unsupported registers are interpolated as "0" (zero).

9. Tabs, Leaders, and Fields


Tab and leader characters appear as @tab and @leader in the output.

.ta with no arguments is written as \reset-tabs. The postprocessor should reset tab settings to "every half-inch". If explicit settings are given, the first one is written as \first-tab position type and all following as \next-tab position type.
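
The Python sketch below shows one way a postprocessor could track these lines. It rests on several assumptions not settled by the description above: that position is an integer in basic units measured from the left margin, that type is an opaque token (the "l" and "r" in the usage are invented), and that each \first-tab discards the previous settings. The class and method names are hypothetical.

```python
class TabStops:
    def __init__(self, resolution):
        self.resolution = resolution
        self.stops = None  # None means the default: every half-inch

    def handle(self, line):
        """Update state from a tab control line; return False for other lines."""
        parts = line.split()
        if not parts:
            return False
        if parts[0] == r"\reset-tabs":
            self.stops = None
        elif parts[0] == r"\first-tab" and len(parts) == 3:
            self.stops = [(int(parts[1]), parts[2])]
        elif parts[0] == r"\next-tab" and len(parts) == 3:
            if self.stops is None:
                self.stops = []
            self.stops.append((int(parts[1]), parts[2]))
        else:
            return False
        return True

    def next_stop(self, pos):
        """Return the first tab position beyond pos, or None if there is none."""
        if self.stops is None:
            step = self.resolution // 2  # half an inch in basic units
            return (pos // step + 1) * step
        for stop, _type in self.stops:
            if stop > pos:
                return stop
        return None


tabs = TabStops(432)
print(tabs.next_stop(0))  # 216, i.e. half an inch at 432 units/inch
```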

Field delimiter characters are written as @fieldbegin or @fieldend, depending on whether they begin or end a field.

Field padding characters are written as @fieldpad when the character occurs between pairs of field delimiter characters (otherwise it is deleted, which may or may not be correct).

10. Input and Output Conventions and Character Translations


10.1. Input character translations


STX, ETX, ENQ, ACK, BEL, SO, SI and ESC are not treated specially. You deserve what you get if you have them in your input files. So there.

10.2. Ligatures


Ligature mode as set by .lg is not supported. The special characters \(ff, \(fi, \(fl, \(Fi and \(Fl normally should be defined in the action file to write out @ff, @fi, @fl, @ffi and @ffl, and postprocessors should be trained to recognize these sequences.

10.3. Backspacing, underlining, overstriking, etc.


No motion is generated for backspace characters; they appear as @backspace in the output.

Underlining is indicated by \underline for normal underlining and \cunderline for continuous underlining. These are identical in troff; postprocessors may or may not wish to consider them so, depending on the capabilities of the target format. Underlining (both kinds) is turned off with \nounderline.

10.5. Output translation


.tr doesn't work for special characters or for escaped characters. The output character can be anything, but the input character must be plain text. This is legal:

   .tr x\(**

This is not:

   .tr \(**x

10.6. Transparent throughput


Transparent mode (\!) is not supported very well.

Real-life observations of behavior of troff versions: It doesn't appear to be quite true that the rest of the line after \! is always passed as is, at least from my observations on groff and SunOS 4.1.1 nroff. Embedded newlines are still processed. Comments are still stripped. If a transparent line within a multi-line section of conditional input contains \}, the \} is recognized and terminates the conditional input if it is within a rejected clause. If it is within an accepted clause, the \} appears on the transparent line.

10.7. Comments and concealed newlines


Comments and concealed newlines are swallowed at a very low level in the input routines, and are thus unavailable to postprocessors.

11. Local Horizontal and Vertical Motions, and the Width Function


\w'string' computes widths of strings only to an approximation. Since character widths are unknown, the width is computed as though all characters in the string are 1 en wide. Font and size changes are recognized but ignored, which leads to particularly egregious errors for constructs such as \w'\s+9\s+9\s+9X\s-9\s-9\s-9'. The ramifications of the fact that \w yields only approximate results are legion, since \w may be used in any expression, e.g., in numeric arguments to requests, or in escape sequences such as \h'N'.
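
To give a feel for the size of the approximation, the Python sketch below mimics the one-en-per-character rule. It uses the usual troff definitions (an em equals the current point size and an en is half an em) and handles plain characters only; how troffcvt itself counts escape sequences inside the string is not modeled. The function name and the 10-point size are assumptions of the example.

```python
def approx_width(text, point_size, resolution=432):
    """Width in basic units, treating every character as 1 en wide."""
    em = point_size * resolution / 72  # one em in basic units
    return round(len(text) * em / 2)   # one en is half an em


print(approx_width("iii", 10))  # 90
print(approx_width("WWW", 10))  # 90, although real fonts set this much wider
```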

12. Overstrike, Bracket, Line-drawing and Zero-width Functions


\b'string' and \o'string' are supported by writing the characters in string to the output, sandwiched between \bracket-begin (\overstrike-begin) and \bracket-end (\overstrike-end). Certain characters, if present in string, are botched, such as \e.

\l'Nc' and \L'Nc' are supported but don't always work. In particular, if the repetition character is "x", as in \l'10x', the "x" is eaten as part of the expression and not recognized as the repetition character. Certain other repetition characters aren't written to the output correctly (same bug as for \b and \o).

\zc appears as \zero-width c in the output.

13. Hyphenation


.nh and .hy appear as \hyphenate N in the output. The value of N should be interpreted as indicated in the troff manual. If N is zero, hyphenation should be turned off.
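
As the troff manual describes .hy, N is a sum of flags: 1 turns hyphenation on, 2 means last lines are not hyphenated, and 4 and 8 mean the last and first two characters of a word, respectively, are not split off. A postprocessor could decode N along these lines; the function and key names are hypothetical.

```python
def decode_hyphenate(n):
    """Decode the N of a \\hyphenate line into separate settings."""
    return {
        "enabled": n != 0,
        "skip_last_lines": bool(n & 2),
        "keep_last_two_chars": bool(n & 4),
        "keep_first_two_chars": bool(n & 8),
    }


print(decode_hyphenate(0)["enabled"])  # False
print(decode_hyphenate(14))            # all three restrictions are set
```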

The current hyphenation character is recognized and appears as @opthyphen in the output.

.hw is not supported.

15. Output Line Numbering


Not supported, because there is no way to determine how a postprocessor might lay out text on a page. This is especially true for tc2html: the resulting HTML document may be reformatted dynamically whenever a user viewing the document in a Web browser window resizes the window.

16. Conditional Acceptance of Input


Unfortunately, processing of conditionals (.if, .ie/.el) is to a large extent meaningless and may introduce errors into the output. The tests for the conditions t and n are processed properly, but other tests may not be. For instance, many times a conditional will test the value of the current page number (\n%), which cannot be determined reliably.

Conditional requests are processed in a special way. Normally, to process a request, the arguments are parsed first. Then troffcvt scans to the end of the request line, to avoid having extraneous junk be parsed as text or another request, and then any actions remaining in the request's action list are executed to interpret the request arguments.

For conditional requests, that doesn't work. A different approach is taken. Here is how the conditional requests can be specified in an action file:

   req if parse-condition n eol
   req ie parse-condition y eol
   req el process-condition eol

For .if and .ie, the argument is the condition to be tested, but after parsing it the rest of the line cannot be skipped over without losing some of the conditional input. What happens instead is that the parse-condition action gobbles up the condition and skips any following whitespace. If the conditional input is a single line (no \{ present), the input-line processor is invoked once recursively, which causes the rest of the line to be processed as though it were a new line. The tricky part is that processing this line will involve reading the rest of the line, including the terminating linefeed. When the inner invocation of the line processor returns from handling the conditional input, the outer invocation of the processor that is handling the conditional request (i.e., the one performing the parse-condition action) is still in its argument-parsing phase, and still expects to skip to the end of the input line after parsing the condition. So a fake linefeed is shoved into the input before returning to the condition parser.

For conditional requests that are followed by multi-line input, a mild elaboration suffices. If the conditional input begins with \{, the current conditional level is incremented and the input processor is called repeatedly until the level returns to the original value. (The level is decremented by the input routine ChIn() which simply discards the \} and returns the next character.)

If a condition fails, the input is scanned character-by-character until the end of the current line (for single-line conditional input), or until a \} matching the beginning \{ is found (for multi-line input).
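
The skipping of rejected input can be pictured with the Python sketch below. It works on a string rather than a character stream, counts nested \{ and \} pairs, and ignores complications such as escaped backslashes, so it is a simplified illustration and not troffcvt's actual scanner. The function name is hypothetical.

```python
def skip_rejected(text):
    """Return the index in text just past the rejected conditional input.

    text starts right after the condition and any following whitespace.
    """
    if text.startswith(r"\{"):
        level = 0
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair == r"\{":
                level += 1
                i += 2
            elif pair == r"\}":
                level -= 1
                i += 2
                if level == 0:
                    return i  # just past the matching \}
            else:
                i += 1
        return len(text)  # unterminated: everything is skipped
    # Single-line conditional input: skip through the end of the line.
    newline = text.find("\n")
    return len(text) if newline < 0 else newline + 1
```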

The else part of the .ie/.el if-else construction is accepted or rejected by remembering the value of the previous .ie. If the .ie succeeded, the .el part is skipped, otherwise it's processed.

It does not appear to be necessary that the .el immediately follow .ie, so troffcvt does not require that. .el following .if is skipped, as is .el following another .el.
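
One reading of these rules is a single flag that records whether an .el is still owed, as in the Python sketch below. The class and method names are hypothetical, and troffcvt's own bookkeeping may differ in detail.

```python
class ElseTracker:
    def __init__(self):
        self.else_pending = False  # True only after an .ie that failed

    def on_if(self, condition):
        """.if: accept its input if the condition holds; no .el may follow."""
        self.else_pending = False
        return condition

    def on_ie(self, condition):
        """.ie: accept its input if the condition holds; remember the result."""
        self.else_pending = not condition
        return condition

    def on_el(self):
        """.el: accept its input only if the previous .ie failed."""
        accept = self.else_pending
        self.else_pending = False  # a second .el in a row is skipped
        return accept


tracker = ElseTracker()
tracker.on_ie(False)
print(tracker.on_el())  # True: the .ie failed, so the .el part is processed
print(tracker.on_el())  # False: .el following another .el is skipped
```

Other requests between the .ie and the .el leave the flag alone, which matches the observation that the .el need not follow immediately.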

Observations about troff versions (which don't really belong here, but I'm writing them down so I don't completely forget about them):

17. Environment Switching


It's not explicit in the troff manual, but for revertible parameters such as indent or point size, the current and previous values are saved in the environment. troffcvt does this, too.

18. Insertions from Standard Input


.rd is not supported.

19. Input/Output File Switching


.nx doesn't properly unwind the input stack if the current input source is not a file. The request is simply ignored after printing an error message.

.pi is not supported.

20. Miscellaneous


.mc, .pm, .fl are not supported. For .fl, this doesn't matter because output isn't buffered anyway.

Addendum


Most of the stuff mentioned in the troff addendum is unimplemented.

A.1. Command-Line Options


These are irrelevant to troffcvt.

A.2. Requests


The description of the .ab request doesn't specify whether the string argument is to be read in copy mode or not. Assuming that it should be, .ab can be defined in the action file as

   req ab parse-string-value n eol abort $1

The .ad, .ft and .so requests behave as described.

Other requests in this section are unsupported.

A.3. New Escape Sequences


These are all unsupported.

A.4. New Predefined Number Registers


.R and c. are supported as general registers. .R always contains a large value, since troffcvt always assumes it can get more memory.

$$, .L, .b, and .j are supported as read-only registers.

.P, .k, .T are not supported.

A.5. Other Important Changes


Conditional input is treated as described.