tc2rtf Notes

Paul DuBois
dubois@primate.wisc.edu

Wisconsin Regional Primate Research Center
Revision date: 20 May 1997

Introduction


tc2rtf is a postprocessor for converting troffcvt output to RTF. This document describes how it works and some of the design issues involved in writing it.

General Paragraph Formatting Properties


In RTF, paragraph formatting properties can only be set once per paragraph, which means that once a paragraph has begun its properties are frozen. Two ways of resetting them are: (i) after the \par at the end of the previous paragraph, issue a \pard followed by new settings; (ii) put each paragraph in a group, and issue settings within each group. The approaches are similar in that paragraph properties are reset to some default and can then be set as appropriate for the new paragraph.

There are some differences between the approaches. The first approach resets paragraph properties to the RTF defaults. The second resets them to the paragraph state in effect at the time the group for the first paragraph is begun. This means it's possible to set up some arbitrary default which can be restored simply by beginning a new group. But it also undoes any changes made to character properties within the group. The first approach is "flatter" because there are fewer groups, and simpler in the sense that it's not necessary to restore any character formatting properties. The second approach is simpler in the sense that it's likely fewer paragraph properties will need to be reset, since the default state is more likely to be close to the format used throughout the document.

It's not obvious that either approach enjoys clear advantages over the other. tc2rtf uses the first approach.
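As a rough illustration of the two approaches, here is a minimal Python sketch. It is not taken from tc2rtf; the function names are hypothetical, and it assumes each paragraph is supplied as a (settings, text) pair in which the settings are already valid RTF control words and the text needs no escaping.

```python
def paragraphs_with_pard(paragraphs):
    """Approach (i): \\pard resets to the RTF defaults, then settings follow."""
    out = []
    for settings, text in paragraphs:
        # The space ends the last control word; it is not part of the text.
        out.append("\\pard" + settings + " " + text + "\\par")
    return "\n".join(out)


def paragraphs_with_groups(paragraphs):
    """Approach (ii): each paragraph lives in its own group."""
    out = []
    for settings, text in paragraphs:
        # A space directly after "{" would be literal text, so only add
        # the delimiter when there is a control word to terminate.
        prefix = settings + " " if settings else ""
        out.append("{" + prefix + text + "\\par}")
    return "\n".join(out)


paras = [(r"\li720\fi-360", "First"), ("", "Second")]
print(paragraphs_with_pard(paras))
print(paragraphs_with_groups(paras))
```

In the second form, closing the group restores whatever state was in effect when the group began, which is why no explicit reset appears.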

The above discussion assumes all changes to paragraph properties occur between paragraphs and not within paragraph text. It's possible for troffcvt output to contain within-paragraph changes, however, since troff requests can occur anywhere, and can be specified with a no-break control character. If such changes are written in the middle of a paragraph, they do bad things to RTF readers (e.g., Microsoft Word 5.0 botches a paragraph badly if \li or \fi is set in the middle). Two ways to handle this problem are to force a \par if a paragraph format change occurs within a paragraph, or to ignore the change when it occurs and let it take effect after the paragraph ends ("lazy evaluation!"). It's not evident that either solution is "correct." tc2rtf adopts the latter.
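The deferred ("lazy") handling can be sketched as follows. This is an illustration only, not tc2rtf's actual code: the class and method names are hypothetical, and properties are assumed to be simple name/value pairs such as li and fi.

```python
class ParagraphState:
    """Defers paragraph property changes until the next paragraph starts."""

    def __init__(self):
        self.active = {}           # properties of the paragraph being written
        self.pending = {}          # changes waiting for the next paragraph
        self.in_paragraph = False
        self.out = []

    def set_property(self, name, value):
        # Never written immediately; picked up when a paragraph begins.
        self.pending[name] = value

    def text(self, s):
        if not self.in_paragraph:
            self._begin_paragraph()
        self.out.append(s)

    def par(self):
        if not self.in_paragraph:
            self._begin_paragraph()
        self.out.append("\\par\n")
        self.in_paragraph = False

    def _begin_paragraph(self):
        self.active.update(self.pending)
        self.pending.clear()
        props = "".join(f"\\{k}{v}" for k, v in sorted(self.active.items()))
        self.out.append("\\pard" + props + " ")
        self.in_paragraph = True


s = ParagraphState()
s.set_property("li", 720)
s.text("one ")
s.set_property("li", 1440)   # arrives mid-paragraph, so it is deferred
s.text("two")
s.par()
s.text("three")
s.par()
print("".join(s.out))
```

The mid-paragraph change to li does not disturb the first paragraph; it shows up in the \pard line of the second.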

Margins and Indents


troff has concepts of page offset, indent, temporary indent, and line length. (These are expressed in troffcvt output as \offset, \indent, \temp-indent and \line-length.) These are not isomorphic to RTF, which has concepts of left and right margins, left and right indents, and a first-line indent for the first line of a paragraph. (These are expressed in RTF as \margl, \margr, \li, \ri and \fi.)

The troff settings can be changed at any time. The RTF left and right margin values are document formatting properties, and can only be set once (before any document text). The indents can only be set once per paragraph, as discussed above.

Differences between the two methods of expressing page layout are handled as follows. Output is turned off while tc2rtf is reading the setup section of troffcvt output. When the setup information has been completely read (\setup-end has been seen), tc2rtf assumes that the current offset+indent should be the document left margin, and that any space on the right not taken up by offset, indent or line length should be the right margin. Thereafter, changes in offset or indent may change the left indent, relative to the left margin. Changes in offset, indent or line length may change the right indent, relative to the right margin.
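The arithmetic might look something like the sketch below. Several things here are assumptions rather than facts about tc2rtf: all values are taken to be already converted to twips, the page is assumed to be US Letter width, the line length is assumed to be measured from the page offset (as in troff, so that it includes the indent), and the function names are hypothetical.

```python
PAGE_WIDTH = 12240  # twips; assumes an 8.5 inch page at 1440 twips per inch


def document_margins(offset, indent, line_length, page_width=PAGE_WIDTH):
    """Margins fixed once, from the state in effect at the end of setup."""
    margl = offset + indent
    margr = page_width - (offset + line_length)
    return margl, margr


def paragraph_indents(offset, indent, line_length, margl, margr,
                      page_width=PAGE_WIDTH):
    """Left and right indents for later paragraphs, relative to the margins."""
    li = (offset + indent) - margl
    ri = (page_width - margr) - (offset + line_length)
    return li, ri


# Setup section: 1 inch offset, no indent, 6.5 inch line length.
margl, margr = document_margins(1440, 0, 9360)

# Later the indent grows by half an inch and the line shortens by the same.
li, ri = paragraph_indents(1440, 720, 8640, margl, margr)
print(margl, margr, li, ri)
```

Because the margins cannot change after document text begins, every later change in offset, indent or line length has to be absorbed by \li and \ri.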

Changes in the temporary indent are mapped onto first-line indent, on the assumption that \temp-indent will normally occur before the text of a paragraph. A difference between troff and RTF is that the troff temporary indent is relative to the page offset, while RTF first-line indent is relative to the current left indent.

The temporary indent is reset to be equal to the left indent at each \par since in troff the .ti setting is transient.
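A small sketch of this mapping follows. It assumes, as stated above, that the troff temporary indent and indent are both measured from the page offset, so the RTF first-line indent is simply their difference. The class name and methods are hypothetical and values are assumed to be in twips.

```python
class IndentTracker:
    """Tracks troff indent and temporary indent; derives the RTF \\fi value."""

    def __init__(self, indent=0):
        self.indent = indent
        self.temp_indent = indent   # no temporary indent pending

    def set_temp_indent(self, value):
        self.temp_indent = value

    def fi(self):
        # \fi is relative to the left indent, not to the page offset.
        return self.temp_indent - self.indent

    def par(self):
        # .ti is transient: it applies to one paragraph only.
        self.temp_indent = self.indent


t = IndentTracker(indent=720)
t.set_temp_indent(0)     # hanging first line, half an inch to the left
print(t.fi())
t.par()
print(t.fi())
```

A temporary indent smaller than the indent yields a negative \fi, which is how RTF expresses a hanging first line.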

Another difference between troff and RTF is that in troff the temporary indent changes the tab settings for the first line of a paragraph, whereas the first-line indent in RTF does not. tc2rtf does not attempt to simulate troff's behavior, since there isn't any way of knowing when the second line of a paragraph has been reached. (RTF includes no mechanism for expressing or discovering font metrics.)

Tabs


\leader-char and \tab-char are both ignored. Leaders and tabs are always written as plain tab characters.

Tables


A document containing tbl input is best handled by using tblcvt to preprocess the document before feeding the result to troffcvt and tc2rtf.

Tables are a pain to do well in RTF. As the RTF specification says, "tables are probably the trickiest part of RTF to read and write correctly." While adding support for tblcvt-related output to tc2rtf, I found it alarmingly easy to crash or lock Word (both Macintosh and Windows versions) unless table controls were written just right. Even now I'm not overly confident that tables are written correctly, though tc2rtf table output no longer seems to cause crashes. One of the keys is to make sure to write \intbl in each cell, even for empty cells. Further, it's typically a good idea to emit \pard for each cell, but it must be written before the \intbl, not after. Otherwise Word seems to forget that it's in a cell. (This seems silly. Surely if you've seen \intbl but not \cell or \row it's reasonable to expect Word to consider itself still in the cell? Apparently not.)
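To make the ordering concrete, here is a minimal sketch of writing one table row with \pard ahead of \intbl in every cell, including empty ones. It is not tc2rtf's code: the function name and the fixed cell width are hypothetical, and cell text is assumed to need no escaping.

```python
def table_row(cells, cell_width=2880):
    """Return RTF for one row; cell_width is in twips (assumed uniform)."""
    parts = ["\\trowd"]
    # \cellx gives the right edge of each cell, measured from the left.
    for i in range(len(cells)):
        parts.append(f"\\cellx{(i + 1) * cell_width}")
    parts.append("\n")
    for text in cells:
        # \pard first, then \intbl, even when the cell has no text.
        parts.append("\\pard\\intbl " + text + "\\cell\n")
    parts.append("\\row\n")
    return "".join(parts)


print(table_row(["left", "", "right"]))
```

Writing \pard after \intbl instead would reset the paragraph properties, \intbl among them, which fits the behavior described above.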

Brackets, Overstrikes


I don't know how to do these in RTF, so what tc2rtf does is write out ugly but highly visible sequences to make it obvious to the user that the document contains stuff that needs some hand tuning. Bracket characters are written out surrounded by <BRACKET< and >BRACKET>. Characters which should be overstruck are written out surrounded by <OVERSTRIKE< and >OVERSTRIKE>. Ick.
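The marker scheme amounts to nothing more than the following sketch (the function names are hypothetical; only the marker strings come from the description above):

```python
def mark_bracket(chars):
    """Wrap bracket-building characters in the visible BRACKET markers."""
    return "<BRACKET<" + chars + ">BRACKET>"


def mark_overstrike(chars):
    """Wrap characters that should be overstruck in OVERSTRIKE markers."""
    return "<OVERSTRIKE<" + chars + ">OVERSTRIKE>"


print(mark_overstrike("o/"))
```

The markers are easy to find with a text search in the converted document, which is the point.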