troffcvt Output Format
and
Postprocessor Writing

Paul DuBois
dubois@primate.wisc.edu

Wisconsin Regional Primate Research Center
Revision date: 21 May 1997

Introduction


troffcvt turns troff input files into a more easily parsed intermediate format to assist in the process of developing troff-to-XXX translators. To provide further assistance, the troffcvt distribution contains code for a library that sequences troffcvt output into tokens. The library is called the troffcvt reader (which means that it reads troffcvt output, not that it is used by troffcvt). The combination of troffcvt and the troffcvt reader essentially turns troff files into a typed token stream. This simplifies the job of writing postprocessors. Generally a postprocessor sits on one side of a pipe reading the input from troffcvt, which sits on the other side of the pipe. The reader code is linked into the postprocessor and is called by it to get the next token from the pipe.

This document describes troffcvt output format and discusses how to write postprocessors that convert such output into some target format. If you decide not to use the reader when writing a postprocessor, you must understand how to interpret troffcvt files. If you do use the reader, then you don't need to know as much about troffcvt format since the reader tokenizes everything for you. However, it's still useful to have at least a rudimentary knowledge of the format.

Output Format


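troffcvt writes three kinds of lines: control lines, which begin with a \ character; special text lines, which begin with an @ character; and plain text lines, which contain everything else.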

A troffcvt output file is structured as follows:
   \setup-begin
   \resolution N
   other setup lines...
   \setup-end
   rest of document...
The first portion of the file consists of a setup section bracketed by \setup-begin and \setup-end lines. The lines in between indicate the initial document layout. The first line of this information is \resolution N, where N is the number of basic units per inch. This indicates the resolution at which troffcvt performed its calculations. Numbers obtained from other control lines may be converted to ems, points, etc., using this resolution. For instance, if the resolution is 1440, the control line \spacing 240 indicates a baseline spacing of 1/6 inch.

The default resolution used by troffcvt is 432, but it can be changed to whatever you want. Probably the lowest resolution you want to use is the least common multiple of 72 and the resolution you expect to use in the target format. Otherwise you may end up with ugly round-off errors when you convert units back to ems, points, etc.
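
As a rough illustration of this rule of thumb, the lowest suitable resolution can be computed as a least common multiple. The sketch below is not part of troffcvt; the function name and the 300-units-per-inch target are assumptions chosen only for the example.

```python
from math import gcd

def lowest_resolution(target_units_per_inch):
    """Least common multiple of 72 (points per inch) and the target resolution."""
    return 72 * target_units_per_inch // gcd(72, target_units_per_inch)

# Hypothetical target format that works in 300 units per inch.
print(lowest_resolution(300))
```

Any multiple of the returned value divides evenly by both 72 and the target resolution.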

If the resolution is r, other common troff units may be calculated as follows. (S is the current point size.)


   Unit  Name                 Number of basic units
   i     inch                 r
   c     centimeter           r x 50/127
   P     pica = 1/6 inch      r/6
   m     em = S points        S x r/72 = Sr/72
   n     en = em/2            S x r/72 x 1/2 = Sr/144
   p     point = 1/72 inch    r/72
   u     basic unit           1
   v     vertical line space  varies; set by \spacing N
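
The conversions in the table can be sketched as a small lookup function. This is an illustration only, not code from the troffcvt distribution; the function name, its parameters, and the use of exact fractions are assumptions made for the example.

```python
from fractions import Fraction

def basic_units(unit, resolution, point_size=10, spacing=None):
    """Return the number of basic units in one troff unit.

    resolution is the N from the \\resolution line, point_size is the
    current point size S, and spacing is the N from the most recent
    \\spacing line (needed only for the v unit).
    """
    r = Fraction(resolution)
    s = Fraction(point_size)
    if unit == "v":
        if spacing is None:
            raise ValueError("the v unit needs the current \\spacing value")
        return Fraction(spacing)
    table = {
        "i": r,                # inch
        "c": r * 50 / 127,     # centimeter
        "P": r / 6,            # pica
        "m": s * r / 72,       # em at the current point size
        "n": s * r / 144,      # en = em/2
        "p": r / 72,           # point
        "u": Fraction(1),      # basic unit
    }
    return table[unit]
```

For instance, at a resolution of 1440 a pica is 240 basic units, which matches the \spacing 240 example given earlier for a baseline spacing of 1/6 inch.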


The rest of the lines in the setup section contain information for the page length, page width, indents, etc.

Special Text Lines


troffcvt knows the names of special characters from two sources of information. The input sequence and output sequence for a given special character are either built in and recognized implicitly, or taken from an action file that is read at runtime. Special characters with an input sequence of the form \(xx or \[xxx] are always in the latter category.

Built-in Special Characters


Built-in characters form a short list. Most of these are listed in the "Escape Sequences for Characters, Indicators, and Functions" section of the troff Summary and Index document.


   Input Sequence  Output Sequence  Note
   \e              @backslash       affected by .ec
   `               @quoteleft
   ´               @quoteright
   ``              @quotedblleft
   ´´              @quotedblright
   \&              @zerospace
   \^              @twelfthspace
   \|              @sixthspace
   \0              @digitspace
   \(space)        @hardspace
   \-              @minus
   \`              @grave
   \´              @acute
   \%              @opthyphen       affected by .hc
   \a, SOH         @leader
   \t, TAB         @tab
   \(backspace)    @backspace
   varies          @fieldbegin      affected by .fc
   varies          @fieldend        affected by .fc
   varies          @fieldpad        affected by .fc


The output sequence for \e actually depends on the current escape character, which may be changed with .ec. The input sequence for the optional hyphenation character may be changed with .hc.

The characters @ and \ have special meaning in troffcvt files, so they are indicated in troffcvt output by the specials @at and @backslash where they are to appear in the final output literally. Postprocessors should convert them back to @ and \ characters.

The field delimiter character defined with .fc is written out as @fieldbegin or @fieldend (odd-numbered delimiters begin fields; even-numbered ones end them). The field pad character is written out as @fieldpad.

Non-built-in Special Characters


The number of non-built-in characters is not fixed. All the special characters listed in the Ossanna troff manual are defined in the default action file supplied with the troffcvt distribution, but the list may be modified as necessary to reflect extra special characters available in your local version(s) of troff. The default action file as distributed with troffcvt includes a number of special characters known by groff.

The troffcvt Reader


The troffcvt file reader reads troffcvt output and tokenizes it, setting several global variables in the process:

   tcrClass  token class
   tcrMajor  token major number
   tcrMinor  token minor number
   tcrArgv[] token text vector
   tcrArgc   number of elements in tcrArgv[] vector
All tokens are assigned to a class. The other variables are set or not depending on the class. The classes are:
   tcrEOF    end of input
   tcrControl control line
   tcrText   plain text character
   tcrSText  special text character
Elements of tcrArgv[] are null-terminated strings. There are tcrArgc elements in the vector. For plain text and special text tokens, tcrArgc is always 1. For control tokens, the rest of the line is automatically parsed to find any following arguments and these are placed into tcrArgv[1] through tcrArgv[tcrArgc-1]. tcrArgv[tcrArgc] is NULL in all cases.
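
For a postprocessor that does not link in the reader, the same classification can be approximated directly. The sketch below is not the reader's implementation. It assumes that control-line arguments are separated by whitespace (the exact parsing rules are not given here), and the class names "control", "stext", and "text" are stand-ins for tcrControl, tcrSText, and tcrText.

```python
def tokenize(lines):
    """Yield (token_class, argv) pairs for each line of troffcvt output."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("\\"):
            word, _, rest = line.partition(" ")
            if word in ("\\comment", "\\pass"):
                # The entire string is kept as a single argument.
                argv = [word, rest] if rest else [word]
            else:
                # Assumption: other arguments are whitespace-separated.
                argv = [word] + rest.split()
            yield ("control", argv)
        elif line.startswith("@"):
            # Special text: the name, including the leading @.
            yield ("stext", [line])
        else:
            # Plain text: each character is a separate token.
            for ch in line:
                yield ("text", [ch])
```

The end of the iteration plays the role of tcrEOF.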

All numbers on control lines are written as integers. Because numbers may be quite large, postprocessors generally should convert them to long rather than to short or int.

To use the reader, call TCRInit(), then call TCRGetToken() repeatedly. TCRGetToken() returns the token class value, which it also stores in the variable tcrClass. When TCRGetToken() returns tcrEOF, the input stream is exhausted and the postprocessor can finish up.

Here is how the global variables are set for the various token classes.

tcrClass = tcrEOF:

             none of the other variables are set
tcrClass = tcrControl:
   tcrMajor  control major number (see tcr.h)
   tcrMinor  major number subtype (not set for all control words, see tcr.h)
   tcrArgv[i] for i = 0, text of control word, including leading \ character
             for i > 0, argument following control word
   tcrArgc   number of arguments, including control word
tcrClass = tcrText:
   tcrMajor  ASCII value of character (each character is a separate token)
   tcrArgv[0] one-byte string containing the character
   tcrArgc   = 1
tcrClass = tcrSText:
   tcrMajor  usually tcrSTUnknown, but see below
   tcrArgv[0] text of special character name, including leading @ character
   tcrArgc   = 1
All the built-in special characters are recognized and assigned distinct major numbers. Other specials are assigned the major number tcrSTUnknown and the postprocessor must examine the text of the token (tcrArgv[0]) to determine what it is and what to do with it. There is no way for the reader to assign fixed numbers to these since the set of special characters understood by troffcvt isn't fixed. One way of dealing with the problem is to read at runtime a file of all the special character names you expect to see (usually the same set of names specified in the action file used with troffcvt).

Note: although built-in special characters do have fixed major numbers assigned, there is nothing to prevent you from processing them like other specials, i.e., by examining the token text. It may be more convenient to treat all specials uniformly.

The reader changes the characteristics of the default token scanner. This is done in TCRInit(). If you use the token scanning library for other purposes in your application, you need to change the scanner's characteristics to what you want and then restore them, or TCRGetToken() may not work correctly.

Other Global Variables


The variable tcrLineNumber holds the current input line number. This may be useful when printing error messages. Be aware that this is not the line number of the original troff input given to troffcvt; it's the line number of the output from troffcvt.

Postprocessors


I assume in this section that you use the troffcvt reader to write a postprocessor. It's not necessary that you do so, but if you don't, most of the following comments don't apply.

It's best that you examine the source for some of the postprocessors supplied in the troffcvt distribution before trying to write one of your own. You should also read the document troffcvt -- Notes, Bugs, Deficiencies to acquaint yourself with troffcvt's many limitations.

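A postprocessor can be set up this way: call TCRInit() to initialize the reader, then call TCRGetToken() in a loop, dispatching on the token class (tcrControl, tcrText, or tcrSText) until it returns tcrEOF.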

The procedure described above may seem deceptively simple, and, considered in the abstract, it is. I find it easiest to begin with a copy of tc2null.c, which simply routes tokens through a bunch of switch statements. The switches can be filled in as you decide what to do with various tokens, which allows incremental development of a postprocessor. You'll probably find that, although this approach is simple conceptually, in practice the details quickly can become more complex than you'd like. Some of the important issues about which you need to be concerned are discussed in the following sections.

Special Character Handling


How will you specify what to do with special characters? Remember that the reader assigns distinct major numbers to only those special characters for which recognition is built into troffcvt.

Generally, postprocessors read in a list of special characters that parallels the list given in the action file used by troffcvt. If the action file list is changed, all the lists used by various postprocessors need to be changed, too. This is a headache, but at least the changes can be made by editing text files rather than by recompiling programs.
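
A minimal sketch of such a list is shown below. The file layout (one special character name and its output sequence per line, with # comments) and the names @bullet and @emdash are hypothetical; they are not the format or names used by any postprocessor in the distribution. The two entries for @at and @backslash follow the conversion described earlier.

```python
def load_special_map(lines):
    """Build a name-to-output table from lines of the form 'name output'."""
    # @at and @backslash are converted back to the literal characters.
    table = {"@at": "@", "@backslash": "\\"}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (hypothetical syntax)
        name, _, output = line.partition(" ")
        table["@" + name.lstrip("@")] = output.strip()
    return table

def render_special(token_text, table):
    """Return the output sequence for a special, or a visible placeholder."""
    return table.get(token_text, "[" + token_text[1:] + "]")
```

Emitting a visible placeholder for unknown names makes it easier to notice when the postprocessor's list has fallen out of step with the action file.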

To get a list of all the special character names, run this command in the misc directory:

   % chk-specials /dev/null > junk

This puts into junk all the special character names that are not found in /dev/null, which, since that file is empty, will be all the names. You can use the contents of junk as a basis for constructing the output sequences you want the postprocessor to emit for various special characters.

To test the postprocessor, you can run list-specials, another command in the misc directory that generates a troff-format listing of all the special characters and their names. By running the output of list-specials through troffcvt and the postprocessor, you can see how each special character is actually treated.

Text Centering, Filling, and Adjusting


Text centering, filling, and adjusting interact in troff. My understanding of how this works is indicated below. Since my conceptual scheme is instantiated in the code, let's hope it's correct.

Centering (.ce) takes precedence over filling and adjustment. When centering is not on, no-fill mode (.nf) suspends filling and adjustment; input lines are copied to the output, left justified. If centering is off and filling is on (.fi), input lines are joined as necessary to fill output lines, which are then adjusted according to the current adjustment specified by .ad. Adjustment may be suspended with .na.

Turning off filling merely suspends adjustment. The adjustment setting is remembered and goes back into effect when filling is turned back on. Similarly, centering doesn't change the filling or adjustment settings; they are suspended while centering is in effect and resume when centering terminates.

troffcvt removes the need for postprocessors to handle these centering, filling and adjusting (CFA) interactions by always explicitly writing out which CFA control code to use. This means the postprocessor need only remember the most recent one. If troffcvt did not do this, postprocessors would need to maintain a bunch of state variables (currently centering? currently filling? currently adjusting? which type of adjustment?).

The CFA control words are:

   \adjust-center
   \adjust-full
   \adjust-left
   \adjust-right
   \center
   \nofill
When \center occurs, centering should be turned on. All text up to a \break should be placed on a single output line and centered. Centering continues until a different CFA control occurs.

When \nofill occurs, no-fill mode should be turned on. All text up to a \break should be placed on a single output line, left-justified. No-fill mode continues until a different CFA control occurs.

If neither centering nor no-fill is in effect, filling is on and one of the adjustment modes \adjust-left, \adjust-right, \adjust-full or \adjust-center will be issued. All text up to the next \break should be used to fill output lines. All output lines in a paragraph except the last should be adjusted in the proper way.

Postprocessors can likely treat \center and \adjust-center as equivalent. Ditto for \nofill and \adjust-left.

Note that there are no control words such as \nocenter, \fill or \noadjust. Centering is turned off by \nofill and the adjustment indicators. Filling is turned on by the adjustment indicators. The troff no-adjust request .na seems functionally equivalent to left-adjustment and so is indicated with \adjust-left. The reason for the .na request seems to be so that .na can be followed by .ad (with no argument) to resume whatever adjustment mode was in effect prior to the .na. Since troffcvt keeps track of adjustment modes it can write out the proper indicator explicitly.

troffcvt output does not necessarily contain a single line of text for each input line when no-fill or centering is in effect. For example, when input contains special characters, each of these appears on a separate output line. Thus, it's important to read text until a \break is seen.
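
The bookkeeping described in this section can be sketched as follows. This is an illustration that reads troffcvt output lines directly rather than through the reader; the function name and the use of None for "no CFA control seen yet" are assumptions.

```python
CFA_CONTROLS = {
    "\\adjust-center", "\\adjust-full", "\\adjust-left",
    "\\adjust-right", "\\center", "\\nofill",
}

def collect_until_break(lines):
    """Yield (cfa_mode, pieces) each time a break control line is seen."""
    mode = None    # most recent CFA control word, if any
    pieces = []    # plain text lines and @special names since the last break
    for raw in lines:
        line = raw.rstrip("\n")
        word = line.split(" ", 1)[0]
        if word in CFA_CONTROLS:
            mode = word            # only the most recent one matters
        elif word == "\\break":
            yield mode, pieces
            pieces = []
        elif not line.startswith("\\"):
            pieces.append(line)    # plain or special text line
        # other control lines are ignored in this sketch
```

Each yielded group holds all the text lines, plain and special, that belong to one output line or paragraph, together with the mode in which it should be set.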

Paragraphing


Some document formats indicate paragraphs when they begin, others when they end. The postprocessor will need to follow whichever convention is used in the target format. This should be a simple matter since paragraph beginnings and endings both are readily located in troffcvt output. \break corresponds to paragraph endings. Beginnings are easily found also: the first text line begins one, and every time a \break occurs, the following text line begins one. (Remember that there may be other non-text lines between the \break and the following text line, though.)

Paragraph text should be treated conceptually as one unbroken string of text, even though it may appear physically on several lines of troffcvt output. Thus, successive text lines (either plain or special) should be considered to be part of the same paragraph until a \break control line occurs. The postprocessor should perform line filling and wrapping according to the most recent centering, filling or adjustment control line (one of \center, \nofill, \adjust-left, \adjust-right, \adjust-full or \adjust-center).

All characters on plain text lines are significant except the terminating linefeed, which should be ignored. Postprocessors should not treat leading or trailing spaces as extraneous without a good reason. Postprocessors also should not insert space characters between successive text lines; where necessary, spaces will already have been placed within the text itself. One exception is that the decision as to whether to put one or two spaces between sentences is left to the postprocessor. The main difficulty is determining when a sentence ends. If the usual suggested style for creation of troff input files is followed (i.e., that each sentence should begin on a new line), sentence-terminating periods, question marks and exclamation points will occur at the ends of lines. This property is preserved in troffcvt output. Postprocessors thus can locate sentence endings and have the information they need for determining whether to insert extra spaces, should they wish to do so.
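
A minimal sketch of the sentence-ending test described above, assuming the input followed the one-sentence-per-line style. The function name is hypothetical, and what to do with the result (for instance, whether to emit an extra space) is left to the postprocessor.

```python
def ends_sentence(text_line):
    """True if a plain text line ends with a period, question mark or exclamation point."""
    # Only the terminating linefeed is ignored; trailing spaces are significant.
    return text_line.rstrip("\n").endswith((".", "?", "!"))
```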

Font Handling


Font handling can be a difficult issue. How do troff fonts correspond to the fonts available in your target format? One problem is that you cannot predict in advance which fonts might be used in a troff document (although you can probably determine which ones are available at your site).

Another problem is that the way fonts are treated in troff doesn't correspond well to the way they're treated in other document formats (at least in my experience). In troff one switches from plain text to italic or boldface by switching fonts, e.g., from R to I, or from R to B. It is evident that troff collapses the two dimensions of typeface and style onto a single-dimensional font namespace. For some formats this can be handled by leaving the typeface the same but applying different style attributes to it.

For purposes of font support in the troffcvt reader it may be more fruitful to map troff font names onto typeface-style pairs, where the typeface is the font family a given font derives from and the style indicates those attributes that need to be applied to the plain font in that family to produce the effect of the troff font. For instance, the default troff fonts R, I and B can be described as follows:


   Font  Typeface  Style
   R     Times     plain
   I     Times     italic
   B     Times     bold


Treating fonts this way allows troff fonts to be manipulated so that "font" changes that really correspond to style changes can be handled as such.
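
The idea can be sketched with the three fonts from the table. This is not the code in the distribution and does not read the tcr-fonts file; the dictionary layout and the function name are assumptions made for illustration.

```python
# troff font name -> (typeface, style), taken from the table above
FONT_MAP = {
    "R": ("Times", "plain"),
    "I": ("Times", "italic"),
    "B": ("Times", "bold"),
}

def font_change(old, new, font_map=FONT_MAP):
    """Describe a troff font switch as a typeface change, a style change, or both.

    A value of None means that aspect stays as it is.
    """
    old_face, old_style = font_map[old]
    new_face, new_style = font_map[new]
    return {
        "typeface": new_face if new_face != old_face else None,
        "style": new_style if new_style != old_style else None,
    }
```

With this map, switching from R to I is reported as a style change only, so the postprocessor can leave the typeface alone and apply an italic attribute.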

A simple font to typeface-style map is included in the distribution (the tcr-fonts file). This file should be modified as necessary to reflect fonts available locally at your site and installed into the troffcvt library directory. r-font.c contains the code to use the font map. The sample postprocessor tc2rtf.c shows one way to use it.

Tabs


Tab stops should be interpreted relative to the current indent, not the page offset. This means that if tab stops are set and then the indent is changed, the effective tab stops relative to the page offset change. Some postprocessors may need to reset tabs in the target format when that happens.
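
As a small illustration, the positions of the tab stops measured from the page offset can be recomputed whenever the indent changes. The function name is an assumption; all values are in basic units.

```python
def tab_stops_from_page_offset(indent, tab_stops):
    """Convert indent-relative tab stops to positions relative to the page offset."""
    return [indent + stop for stop in tab_stops]
```

Calling this again after each \indent line gives the values to use if the target format expects tab stops measured from the left margin.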

Control Line Reference


This section documents the syntax of all control lines produced by troffcvt. The descriptions are grouped according to the section of the Ossanna troff manual to which they are most closely related. The exceptions are section 0, which contains descriptions for miscellaneous controls that don't correspond to anything in the troff manual, and section 15, which describes controls for table processing.

Unless otherwise indicated, numeric values on control lines are specified in basic units.

§ 0. Miscellaneous


\setup-begin


Indicates the beginning of the setup section of troffcvt output.

\setup-end


Indicates the end of the setup section of troffcvt output. When it occurs, the basic layout of the document will have been specified.

\resolution N


This line occurs first within the setup section. It indicates the resolution at which troffcvt performed its calculations, in basic units per inch.

\comment string


Indicates a comment, which may be ignored. The entire string will be in tcrArgv[1]. \comment lines may appear at any time, even before \setup-begin.

\pass string


Pass string literally through to the output without interpretation. This is used for postprocessor-specific purposes. The entire string will be in tcrArgv[1].

\line filename linenumber


This indicates the point at which line linenumber was read from file filename. \line controls are generated if the -l command line option was given to troffcvt. They provide a way of tracking the output that results from each input line.

\other string


Indicates a line that doesn't fall into any other class. The contents of string can be used for anything. string is parsed into separate arguments.

§ 2. Fonts and Character Size Control


\font F


Switch to font F.

\constant-width F


Treat font F as though it is non-proportional (each character the same width).

\noconstant-width F


Stop treating font F as though it is non-proportional.

\embolden F N


Embolden font F by smearing it N units. If N is zero, emboldening should be turned off.

If you implement emboldening, you might find it more profitable to ignore the smear value and make the font bold by some means other than reprinting the characters slightly displaced from the original printing like troff does.

\embolden-special F N


Like \embolden, but embolden characters in the special font whenever the current font is F. In practice, this probably has no meaning for most postprocessors, because the special font in troff is logically part of the three default fonts, something unlikely to be true in the target output format.

\point-size N


Set point size to N. N is number of points, not basic units.

\space-size N


Set space size to N/36 em. Note that the actual instantaneous value depends on the size of an em (and thus on the current point size). For this reason it may be best to maintain N but also recompute the actual space size in basic units whenever the point size changes.
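
The recomputation can be sketched as follows, using the em formula from the units table (an em is S x r/72 basic units). The function name and the use of exact fractions are assumptions for the example.

```python
from fractions import Fraction

def space_size_units(n, point_size, resolution):
    """Space size in basic units for \\space-size n at the given point size."""
    em = Fraction(point_size * resolution, 72)   # one em in basic units
    return Fraction(n, 36) * em
```

At the default resolution of 432, \space-size 12 gives 20 basic units at 10 points and 24 basic units at 12 points, which is why the value needs recomputing after each \point-size line.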

§ 3. Page Control


\begin-page [N]


If N is present, begin new page numbered N. Otherwise, just begin new page (presumably numbered in sequence with the current page).

\offset N


Set page offset to N.

\page-length N


Set page length to N.

\page-number N


Set page number to N. (Does not begin a new page.)

\need N


N units of vertical space are needed. If less than that remains on the current page, begin a new page.

\mark


Remember the current vertical position on the page. This assumes you have some notion of the current position, of course. Not all target formats have such a notion; troffcvt itself certainly doesn't.

§ 4. Text Filling, Adjusting, and Centering


\adjust-center

\adjust-full

\adjust-left

\adjust-right


Fill output lines, adjusting as indicated.

\nofill


Do not fill output lines.

\center


Center output lines.

\break


End of input line. Flush and terminate current output line.

\break-spread


Break at end of current word and spread output line to current line length.

§ 5. Vertical Spacing


\spacing N


Set vertical base-line spacing to N units, which becomes the meaning of 1 v.

\line-spacing N


Set line spacing to N v's.

\space N


Space vertically N units (negative = upward).

\extra-space N


Indicates that the current output line should have Nv of extra space added to it.

§ 6. Line Length and Indenting


\indent N


Set indent to N units.

\line-length N


Set line length to N units.

\temp-indent N


Temporarily set indent to N units. This is an absolute indent, not a value by which to adjust the current indent. In troff, the temporary indent value is used only for the next output line and the prevailing indent is used again after that. In other formats it likely corresponds to "paragraph first line indent" or something similar.

§ 7. Macros, Strings, Diversions, and Position Traps


\diversion-begin name


Indicates that subsequent output, until \diversion-end name occurs, was intended in the original input to be diverted to macro name.

\diversion-append name


Indicates that subsequent output, until \diversion-end name occurs, was intended in the original input to be appended to diversion macro name.

\diversion-end name


Indicates end of preceding \diversion-begin name or \diversion-append name.

§ 9. Tabs, Leaders, and Fields


\reset-tabs


Reset tab stops to default (every half-inch).

\first-tab N c


Clear tab stops and install first one at N units. c is l (left), c (center), or r (right).

\next-tab N c


Add tab stop to current set. N and c are as for \first-tab.

\tab-char [c]


Set tab repetition character to c. If c is missing, tabs should be implemented as motion.

\leader-char [c]


Set leader repetition character to c. If c is missing, leaders should be implemented as motion.

§ 10. Input and Output Conventions and Character Translations


\underline


Turn on underlining.

\cunderline


Turn on continuous underlining.

\nounderline


Turn off underlining (both kinds).

\underline-font F


Set underline font to F.

§ 11. Local Horizontal and Vertical Motions, and the Width Function


\motion N c


Move N units. c is h for horizontal motion (negative = left) or v for vertical motion (negative = upward).

\line N c


Draw line. N and c are as for \motion.

§ 12. Overstrike, Bracket, Line-drawing, and Zero-width Functions


\bracket-begin


Characters on following text lines (until \bracket-end) should be used to build a bracket.

\bracket-end


Terminates preceding \bracket-begin.

\overstrike-begin


Characters on following text lines (until \overstrike-end) should be overstruck.

\overstrike-end


Terminates preceding \overstrike-begin.

\zero-width c


Print c without changing position on page.

§ 13. Hyphenation


\hyphenate N


Set hyphenation mode. If N is zero, turn off hyphenation. If N is non-zero, interpret N as in the troff manual.

§ 14. Three Part Titles


\title-length N


Set title length to N units.

\title-begin c


Indicates the beginning of title part c, where c is l (left), c (center), or r (right). Text up to the next \title-end control line should be taken as the content of this title part. troffcvt indicates title content by writing the following output sequence:
   \title-begin l
   text of left title part
   \title-end
   \title-begin m
   text of middle title part
   \title-end
   \title-begin r
   text of right title part
   \title-end
If no text occurs between the \title-begin and \title-end lines, it means the specified title part is empty. No control words will occur between the \title-begin and \title-end lines.

\title-end


Terminates preceding \title-begin.

§ 15. Tables


The troffcvt language contains special controls to indicate table structure. These result when tblcvt is used to preprocess troffcvt input. The controls should be written by troffcvt in a particular order, but troffcvt itself does no checking to verify the ordering. It relies on tblcvt to generate table-related requests that specify table elements in the proper sequence. For more details, see the document tblcvt -- A troffcvt Preprocessor.

For testing tblcvt, see the tblcvt/tests directory, which contains the tables from the Lesk tbl document, one table per file.

\table-begin rows cols header-rows align expand box allbox doublebox


Indicates the beginning of a table.

rows and cols are the number of rows and columns in the table. (A row that draws a line is considered a data row.)

For tables that are specified to have a header (using .TS H and .TH), header-rows is non-zero. Otherwise header-rows is 0. header-rows indicates how many of the initial data rows make up the table header. If this is non-zero, that many rows form a header that should be repeated if the table spans multiple pages. For a single-page table, header rows should be treated as just an ordinary part of the table.

align is L or C to indicate the table is left-justified or centered.

expand is y if the table is expanded to the full line width, n otherwise.

The box, allbox, and doublebox values are each y or n, depending on whether or not box, allbox, and doublebox were given in the table specification. (Note that allbox and doublebox both imply box.)
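
A postprocessor might gather these arguments into a record before doing anything else with the table. The sketch below assumes the argument vector has already been split as the reader would do it (control word first); the record type and function name are hypothetical.

```python
from collections import namedtuple

TableBegin = namedtuple(
    "TableBegin",
    "rows cols header_rows align expand box allbox doublebox",
)

def parse_table_begin(argv):
    """Turn the arguments of a \\table-begin line into a TableBegin record."""
    if len(argv) != 9 or argv[0] != "\\table-begin":
        raise ValueError("not a complete \\table-begin line")
    rows, cols, header_rows = (int(a) for a in argv[1:4])
    align = argv[4]                                   # L or C
    expand, box, allbox, doublebox = (a == "y" for a in argv[5:9])
    return TableBegin(rows, cols, header_rows, align,
                      expand, box, allbox, doublebox)
```

The y/n flags become booleans, so a check such as "is any box requested" reduces to testing the box field.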

\table-end


Indicates the end of the current table.

\table-column-info width sep equal


Specifies values that apply to all cells in a table column. Following the \table-begin control, there will be one \table-column-info line for each column of the table. The column number is not specified; the controls for each column are written consecutively.

width is the minimum required width of the column. The value is non-zero if any entry in the given column specified a w option. If more than one entry specified w, the last one is used. If width is 0, no entry in the column specified w and the width is determined from the data values in the column.

sep is the column separation value.

The equal value is y if any entry in the column specified the e option, and n otherwise. All columns with an equal value of y should be made the same width.

\table-row-begin


Indicates the beginning of a row within a table.

\table-row-end


Indicates the end of the current table row.

\table-row-line N


Indicates that the table row is a single or double table-width line. The value of N indicates the type of line:
   \table-row-line 1  Table-width single line
   \table-row-line 2  Table-width double line
There is no end marker for this control, as none is needed.

\table-cell-info type vspan hspan vadjust border


Indicates layout information for a cell within a table row.

type is the cell type:
   L         Left-justified
   R         Right-justified
   C         Centered
   N         Numeric (align to decimal point)
   A         Alphanumeric
vspan and hspan are the number of rows and columns spanned by the cell, including itself. A value of zero means that the cell is itself spanned by another cell in that direction.

If all you want to know is whether or not a cell is spanned, the product of vspan and hspan is zero if and only if the cell is spanned. If you need to know whether spanning is in a particular direction, you need to examine vspan and hspan individually. This is summarized in the following table.


              hspan = 0          hspan > 0
   vspan = 0  spanned both ways  spanned from above
   vspan > 0  spanned from left  not spanned
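
The table translates directly into a small helper. The function name and the returned strings are assumptions chosen to mirror the table entries.

```python
def span_status(vspan, hspan):
    """Classify a cell from its vspan and hspan values."""
    if vspan > 0 and hspan > 0:
        return "not spanned"
    if vspan == 0 and hspan == 0:
        return "spanned both ways"
    if vspan == 0:
        return "spanned from above"
    return "spanned from left"
```

The product test mentioned above is the first branch negated: the cell is spanned exactly when vspan * hspan is zero.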


vadjust is T if the cell contents should be vertically adjusted from the top, C if the contents should be vertically centered. vadjust is meaningful only for multiple-line cells.

border is the border value. If the value is 0, there is no border. Otherwise, the value is a bitmap with the following fields:
   Bits      Value     Meaning
   0-1       1         Left border, single line
             3         Left border, double line
   2-3       1         Right border, single line
             3         Right border, double line
   4-5       1         Top border, single line
             3         Top border, double line
   6-7       1         Bottom border, single line
             3         Bottom border, double line
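
Decoding the bitmap might look like the sketch below. It assumes that bit 0 is the least significant bit of the border value, which is not stated explicitly above, and it reports the field value 2, which the table does not define, as "undefined". The names are hypothetical.

```python
SIDES = ("left", "right", "top", "bottom")       # bits 0-1, 2-3, 4-5, 6-7
STYLES = {0: None, 1: "single", 3: "double"}

def decode_border(border):
    """Map each side of a cell to None, "single", or "double"."""
    result = {}
    for i, side in enumerate(SIDES):
        field = (border >> (2 * i)) & 3          # extract the two-bit field
        result[side] = STYLES.get(field, "undefined")
    return result
```

For example, under this bit-order assumption a border value of 29 decodes to a single left border, a double right border, a single top border, and no bottom border.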

\table-cell-begin


Indicates the beginning of a table cell.

\table-cell-end


Indicates the end of the current table cell.

\table-empty-cell


Indicates a table cell that is empty. There is no end marker for this control, as none is needed.

\table-spanned-cell


Indicates a table cell that is spanned by an earlier cell. There is no end marker for this control, as none is needed. Note that a spanned cell may be spanned by a cell with data in it, an empty cell, or a line-drawing cell.

\table-cell-line N


Indicates that the content of a table cell is a line. The value of N indicates the type of line:
   \table-cell-line 0 Column-data-width single line
   \table-cell-line 1 Column-width single line
   \table-cell-line 2 Column-width double line
There is no end marker for this control, as none is needed.