Augmented Plain Text (APT) Version 1.0

simplified web authoring

Locator: http://www.altheim.com/specs/NOTE-apt1.html
Author: Murray Altheim
Email: m.altheim@open.ac.uk
Revised: 24 May 2002
Revision: $Id: NOTE-apt1.html,v 1.2 2002/07/17 11:49:49 altheim Exp $
Neocortext.net: The Ceryle Project

Copyright ©2002 Murray Altheim. All Rights Reserved.

Abstract

The Augmented Plain Text (APT) specification is a design for a simple set of keyword tokens that, when added to a plain text document, enable an APT processor to autogenerate valid XHTML documents. This can be used for conversion of existing plain text sources for the Web, or as a simple alternative to hypertext authoring.

Status of this Document

This document is intended for review and comment by interested parties. It is a “work in progress,” currently has no formal status, and its publication should not be construed as endorsement by any corporate or academic body. This document may be updated, replaced, rendered obsolete by other documents, or removed from circulation at any time. It is inappropriate to use this document as reference material, or cite it as anything other than a “work in progress.” Distribution of this document is unlimited.

Introduction to Augmented Plain Text (APT)

The 'APT' notation is a simple, augmented plain text notation designed to simplify creation of XHTML documents from existing text sources such as email messages. Most current editing software has some facility to generate plain text output, and some products claim to generate HTML. Unfortunately, the "HTML" generated by almost every known product is far from adhering to any known HTML specification. In the case of MS Word (the worst example), the output is so obtuse and convoluted that a slew of translators have been written to turn its "HTML" into something akin to HTML, though in many cases not without substantial losses of content (and certainly of formatting).

APT was designed to fill a different niche: those wishing to author in plain text, those who have existing text sources, and those who'd like an XHTML-valid document with an autogenerated table of contents and hierarchically-numbered sections from what is ostensibly a plain text document (with a few simple codes added). Yes, APT is simple. It doesn't have every known Web feature, doesn't create JavaScript buttons or fry your bacon for you. It does simplify web authoring for those who think web authoring should be simple and straightforward. You can spend some time playing with the CSS stylesheet if you really want your output to look different than the default, or take the output document as input for further processing (perhaps adding your own JavaScript buttons and bacon).

An APT document looks something like this:

    #APT 1.0
    #AUTHOR Tim Bunwich
    #TITLE Not a Normal Day at the Park

    Today I went to http://centralpark.org/home.html #LINK Central Park
    to feed some squirrels, a thing I do most every day. Well, for some reason
    the squirrels seemed agitated, a bit put off at simply accepting the nuts
    I was handing out.

    I was at the point of positioning a piece of pecan in front of this big
    black squirrel's nose when suddenly he ran up my arm and stood on the top
    of my head, then began squealing wildly. I froze in place, not knowing
    quite what to do. These little buggers have very sharp claws and teeth,
    and me with no hair... well, I was afraid for my scalp.

    All around me were squirrels, all just a few feet from my Berkenstock'd
    toes. I knew my life was about to change for the worse.

Pretty simple? Note that inline HTML links are a Level 2 feature, so the inline link isn't active in this sample. Here's the above fragment converted into XHTML, plus some more samples of before and after:

before        after              description
sample1.apt   sample-apt1.html   a simple APT conversion showing a mostly plain text source
sample2.apt   sample-apt2.html   a display of the hierarchical headings and autogenerated TOC
sample3.apt   sample-apt3.html   the same as sample 2 except in Spanish
sample4.apt   sample-apt4.html   an exercise of Level 1 and some Level 2 features

Block and Inline Content

HTML bases its formatting model on a distinction between block and inline content, that is, content that appears as separated blocks in paragraph style, and content that appears within the individual lines of those blocks.

Likewise (since the destination of APT processing is XHTML), APT's formatting model uses the same distinction. Blocks of plain text separated by one or more blank lines are considered separate paragraphs of content, and several of the APT statements expect blocks of content, scanning until a blank line. These include #P (paragraph), #PRE (preformatted), and #DFN (definition).

The rest of the APT statements scan until the end of the line in which the keyword occurs. Examples include #AUTHOR (author name), and #H1 through #H6 (heading levels 1 through 6).
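The block half of this model can be illustrated with a minimal Python sketch. It is not part of the specification: the function name split_blocks is hypothetical, and it assumes that a line containing only whitespace counts as blank.

```python
def split_blocks(text):
    """Group the lines of a source text into blocks separated by blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the block in progress; extra blanks are skipped.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

Each returned block is a list of lines that a processor could then treat as one paragraph or hand to a block-scanning statement.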


Syntax

APT uses keywords that start with a hash symbol (e.g., #AUTHOR) occurring in column 1 (i.e., the beginning of a line) to denote an APT statement. APT parsers should ignore keywords occurring elsewhere, or unknown keywords (perhaps emitting a warning when this occurs), so that other instances of hash characters followed by unknown tokens have no effect (other than to be replicated in the output file).
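As a rough illustration of the column-1 rule, the following Python sketch recognizes a statement only when a known keyword starts the line. The names parse_statement and LEVEL1_KEYWORDS are hypothetical, and the regular expression is one possible reading of "keyword followed by whitespace", not a normative grammar.

```python
import re

# Level 1.0 keywords as listed in this specification.
LEVEL1_KEYWORDS = {
    "APT", "TITLE", "SUB", "AUTHOR", "EMAIL", "REF", "REV", "TOC", "COM",
    "H1", "H2", "H3", "H4", "H5", "H6", "H", "DFN", "LI", "PRE", "HR", "LINK",
}

# A hash in column 1, a keyword, then optional whitespace and content.
_STATEMENT = re.compile(r"^#([A-Z][A-Z0-9]*)(?:[ \t]+(.*))?$")


def parse_statement(line, known=LEVEL1_KEYWORDS):
    """Return (keyword, content) for an APT statement, or None for plain text."""
    match = _STATEMENT.match(line)
    if match is None or match.group(1) not in known:
        return None  # unknown or misplaced keywords pass through as text
    return match.group(1), (match.group(2) or "").strip()
```

Lines for which the function returns None would simply be replicated in the output; a processor may also choose to emit a warning for unknown keywords.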

The APT syntax is designed according to implementation levels, to allow for varying levels of support. The core level, Level 1.0, is quite simple but provides support for headings and other block text types. Level 2.0 provides general linking support, with Level 3.0 providing inclusions and other advanced features.

Level 1.0 and 2.0 processors will autogenerate the document title, heading markup, divisions and paragraphs, as well as a table of contents using heading titles. Optional features include autogeneration of hierarchical section numbers. Note that if necessary, a hash character can be escaped using its XML character reference equivalent ("&#35;"). A processor should be labeled according to its implementation level. If an APT processor receives an APT source document whose level is greater than its implementation supports, it should refuse to process the document.

Most APT statements start with an APT keyword beginning in column one and continue to the end of the line. Lines may be continued with a backslash onto the next line. The exception to the column-one rule (and there is only one) is the inline form of #LINK that APT Level 2 adds to Level 1's #LINK statement.
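Backslash continuation might be handled in a pre-pass such as the Python sketch below. The function name join_continued_lines is hypothetical. The sketch assumes trailing whitespace is stripped before the backslash test, and it does not implement the special case in which a lone backslash inside #PRE content stands for a blank line.

```python
def join_continued_lines(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    joined = []
    pending = ""
    for line in lines:
        stripped = line.rstrip()
        if stripped.endswith("\\"):
            # Drop the backslash and keep collecting the logical line.
            pending += stripped[:-1]
        else:
            joined.append(pending + stripped)
            pending = ""
    if pending:
        joined.append(pending)  # a dangling continuation at end of input
    return joined
```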

APT Level 1.0

The following keywords should be supported in all APT Level 1.0 processors (the APT header keywords are identified after the list):

#APT WS level
    APT implementation level
#TITLE WS content
    document title
#SUB WS content
    document subtitle
#AUTHOR WS content
    author name
#EMAIL WS content
    author email
#REF WS linkURL
    URL reference to this document
#REV WS content
    document version or revision info (e.g., CVS Id)
#TOC
    insert table of contents here
#COM WS content
    comment (i.e., as XML comment)
#H1 WS content
    level 1 heading
#H2 WS content
    level 2 heading
#H3 WS content
    level 3 heading
#H4 WS content
    level 4 heading
#H5 WS content
    level 5 heading
#H6 WS content
    level 6 heading
#H WS content
    neutral heading
#DFN WS term "|" definition
    definition list item
#LI WS content
    list item (unordered)
#PRE WS content
    preformatted content
#HR WS [percentage]
    horizontal rule [optional width]
#LINK WS linkURL linkText
    a hypertext link

APT Header Keywords

The keywords #APT, #TITLE, #SUB, #AUTHOR, #EMAIL, #REF, #REV, and #TOC are considered APT header keywords, and should only occur once in an APT file, grouped at the top (order is unimportant). "WS" stands for whitespace: space or tab characters.
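The once-only rule for header keywords could be checked as in the Python sketch below. The function name collect_header and the warning wording are hypothetical, and keeping the first occurrence while ignoring later ones is an assumption; the specification only asks for a warning.

```python
HEADER_KEYWORDS = ("#APT", "#TITLE", "#SUB", "#AUTHOR",
                   "#EMAIL", "#REF", "#REV", "#TOC")


def collect_header(lines):
    """Return (header, warnings) for the header statements found in lines."""
    header = {}
    warnings = []
    for number, line in enumerate(lines, start=1):
        parts = line.split(None, 1)  # WS is space or tab
        if not line.startswith("#") or parts[0] not in HEADER_KEYWORDS:
            continue
        keyword = parts[0]
        content = parts[1].strip() if len(parts) > 1 else ""
        if keyword in header:
            # Assumption: the first occurrence wins, later ones only warn.
            warnings.append(f"line {number}: duplicate {keyword} ignored")
        else:
            header[keyword] = content
    return header, warnings
```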

APT Level 2.0

The following keywords should be supported in all APT Level 2.0 processors:

linkURL [#LINK WS link text]
    an inline hypertext link
#IMG WS imageURL WS alt-text
    an image
#LOGO WS imageURL WS alt-text
    a logo image for top-of-page header
#FIG WS imageURL WS id WS alt-text
    an image treated as a figure
#SETHEAD WS content
    presets heading numbers (e.g., content = "3.2.1")

APT Level 3.0

The following keywords should be supported in all APT Level 3.0 processors:

#INCLUDE WS linkURL
    a transclusion (see note)

Plaintext (i.e., whitespace-based) tables are supported using three statements:

#TABLE WS [parameters]
    a table
#TH WS [column markers] [heading titles]
    a table heading
#END
    table end

Plaintext-based forms are currently under development.

[Other Level 3 features under consideration.]


Keyword Documentation

Quick Reference:
1: #APT | #TITLE | #SUB | #AUTHOR | #EMAIL | #REF | #REV | #TOC | #COM
   | #H1 | #H2 | #H3 | #H4 | #H5 | #H6 | #H | #DFN | #LI | #PRE | #HR | #LINK
2: #LINK | #IMG | #LOGO | #FIG
3: #SETHEAD | #INCLUDE | #TABLE | #TH | #END

 

APT Level 1

#APT:   APT Implementation Level

Syntax
    #APT WS number
  

The #APT keyword signals the APT processor as to the expected implementation level necessary for correctly processing the document. Accepted implementation levels are "1.0", "2.0" or "3.0". If this APT statement occurs, it should be the first statement in the document. If missing, "1.0" is the default.
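The level rules (default of "1.0", the three accepted values, and refusal of documents above the processor's own level) can be sketched in Python as follows. The names declared_level, require_level and UnsupportedLevelError are hypothetical, as is the choice to raise exceptions rather than report errors some other way.

```python
ACCEPTED_LEVELS = ("1.0", "2.0", "3.0")


class UnsupportedLevelError(Exception):
    """The document asks for a higher level than the processor implements."""


def declared_level(lines):
    """Return the level named by the first #APT statement, or "1.0" if absent."""
    for line in lines:
        parts = line.split()
        if line.startswith("#") and parts[0] == "#APT":
            if len(parts) != 2 or parts[1] not in ACCEPTED_LEVELS:
                raise ValueError(f"malformed #APT statement: {line!r}")
            return parts[1]
    return "1.0"


def require_level(lines, supported="2.0"):
    """Return the declared level, refusing documents above `supported`."""
    level = declared_level(lines)
    if float(level) > float(supported):
        raise UnsupportedLevelError(
            f"document needs Level {level}, processor supports {supported}")
    return level
```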


#TITLE:   Document Title

Syntax
    #TITLE WS title
  

The #TITLE keyword indicates that the text following is the document title. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one. This title is used as the HTML <title> as well as the displayed document title at the top of the document itself.


#SUB:   Document Subtitle

Syntax
    #SUB WS subtitle
  

The #SUB keyword indicates that the text following is the document subtitle. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one.


#AUTHOR:   Author Name

Syntax
    #AUTHOR WS author name
  

The #AUTHOR keyword indicates that the text following is the author name. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one.


#EMAIL:   Email Address of Author or Contact

Syntax
    #EMAIL WS email address
  

The #EMAIL keyword indicates that the text following is the contact email address. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one.


#REF:   Self Reference

Syntax
    #REF WS url
  

The #REF keyword indicates that the text following is a URL reference to the document. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one.


#REV:   Document Revision

Syntax
    #REV WS revision number
  

The #REV keyword indicates that the text following is the revision number of the document. This should only occur once in an APT file, and a warning should be generated by the processor on encountering more than one.


#TOC:   Table of Contents

Syntax
    #TOC WS [title prefix]
  

The #TOC keyword indicates that a table of contents is desired in the output document. Absent this keyword, a table of contents will not be generated, though some processors may wish to generate a table of contents at some default location (Ceryle has a property "urn:ceryle:properties:defaultTableOfContents" that determines which behaviour, though its default is false).

The optional literal string (i.e., in quotes) following the keyword is used as the prefix string for the section numbers appearing in the headings and table of contents. If an empty string (""), no text is used (just the section numbers); if "NONE" no section numbers will appear. A typical string might be "Section" or "Sec".


#COM:   Comment

Syntax
    #COM WS comment
  

The #COM keyword indicates that the text following is a text comment, to be used in creating an XHTML comment. Such comments are visible in the XHTML source, but are not displayed by Web browsers.


#H1 to #H6:   Hierarchical Headings, Level 1 to 6

Syntax
    #H1 WS level 1 heading text
    #H2 WS level 2 heading text
    #H3 WS level 3 heading text
    #H4 WS level 4 heading text
    #H5 WS level 5 heading text
    #H6 WS level 6 heading text
  

The #H1 through #H6 keywords indicate that the text following is content for use in a heading. Such headings and the sections they head play a role in the general document hierarchy.
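Hierarchical section numbers for such headings can be derived with a simple counter per level, as in this Python sketch. The function name number_headings is hypothetical, and the treatment of skipped levels (a zero is filled in) is an assumption that the specification does not address.

```python
def number_headings(levels):
    """Map a sequence of heading levels (1 to 6) to dotted section numbers."""
    counters = []
    numbers = []
    for level in levels:
        while len(counters) < level:
            counters.append(0)      # open deeper levels as needed
        del counters[level:]        # leaving a subsection resets deeper counters
        counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers
```

For example, the heading sequence #H1, #H2, #H2, #H3, #H1 yields the numbers 1, 1.1, 1.2, 1.2.1 and 2.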


#H:   Neutral Heading

Syntax
    #H WS heading text
  

The #H keyword indicates that the text following is heading text for use in a neutral heading. Such headings are used when the heading and its section are not to play a role in the document hierarchy, when the document does not have or need such a hierarchy, or when your life is already too complicated to care about such things as heading levels.

Neutral headings still create a wrapper <div> element, so that document manipulation via division is possible. Such "neutral divisions" are never nested; they are always leaf node divisions in the overall division structure.

This keyword is useful for expressing the document abstract, foreword, preface, or other front or back matter sections. It will have a class attribute of "ndivx" (where 'x' is the display heading level of 1-6) so that display can be customized in a stylesheet.


#DFN:   Definition

Syntax
    #DFN WS term WS "|" WS definition
  

The #DFN keyword indicates that the text following is a term and its definition, divided by a vertical bar and whitespace. If a series of definitions are located together without intervening codes, blank lines, or text, the group will be treated together as an XHTML definition list (<dl> element).
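A minimal Python sketch of this conversion follows. It assumes the caller has already collected a run of adjacent #DFN lines; the function name render_definition_list is hypothetical, and splitting at the first vertical bar is an assumption.

```python
from html import escape


def render_definition_list(lines):
    """Render adjacent #DFN statements as one XHTML definition list."""
    items = []
    for line in lines:
        body = line[len("#DFN"):].strip()
        # Everything before the first bar is the term, the rest the definition.
        term, _, definition = body.partition("|")
        items.append(f"<dt>{escape(term.strip())}</dt>"
                     f"<dd>{escape(definition.strip())}</dd>")
    return "<dl>\n" + "\n".join(items) + "\n</dl>"
```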


#LI:   List Item

Syntax
    #LI WS list item text
  

The #LI keyword indicates that the text following is a list item. If a series of list items are located together without intervening codes, blank lines, or text, the group will be treated together as an XHTML unordered list (<ul> element) with a class attribute of "list" so that display can be customized in a stylesheet.
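The grouping rule can be sketched in Python with itertools.groupby, which collects consecutive #LI lines into runs. The function names are hypothetical, and non-list lines are passed through untouched here, where a real processor would handle them with its other rules.

```python
from html import escape
from itertools import groupby


def is_list_item(line):
    """True when the line is a #LI statement starting in column 1."""
    return line.startswith("#LI ") or line.startswith("#LI\t")


def render_lists(lines):
    """Wrap each run of adjacent #LI lines in a <ul class="list"> element."""
    out = []
    for is_item, run in groupby(lines, key=is_list_item):
        if is_item:
            out.append('<ul class="list">')
            out.extend(f"<li>{escape(line[3:].strip())}</li>" for line in run)
            out.append("</ul>")
        else:
            out.extend(run)  # blank lines and other text end the list
    return out
```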


#PRE:   Preformatted Content

Syntax
    #PRE WS preformatted content
  

The #PRE keyword indicates that the text following is preformatted text. In such content, whitespace is preserved. This is used for display of poetry, program code, ASCII tables or art, etc. The preformatted block will continue until the first blank line. Lines will be continued if their last character is a backslash ("\"). A line containing only a backslash is considered as one blank line.


#HR:   Horizontal Divider ("Rule")

Syntax
    #HR WS [width]
  

The #HR keyword indicates display of a horizontal line in the text. An optional width may be included as a percentage. If a width is included, the line will be left-justified.


#LINK:   Level 1 Hypertext Link

Syntax
    #LINK WS linkURL linkText
  

In APT Level 1, links must begin on a line with the #LINK keyword, and cannot occur inline.

Example

The following APT statement

      #LINK http://www.altheim.com/lit/everest.html archy climbs everest
  

would be rendered as a link to archy climbs everest.


APT Level 2

#LINK:   Level 2 Hypertext Link

Syntax
    url WS [#LINK WS link text]
  

In APT Level 1, links must begin on a line with the #LINK keyword, and cannot occur inline.

With APT Level 2, all http:, ftp:, and file: URLs embedded in APT source files are converted to links automatically*, so the #LINK keyword can be considered optional in cases where you want the link text to show up as a URL. Now if a URL is immediately followed (after some whitespace) by a #LINK keyword, the text following the keyword up until the end of the line is used as the link text.

* The APT processor scans for the strings "http://", "ftp://" and "file://" and ends the URL with the first whitespace encountered.

Example

The following APT statement

      Amazing is the http://www.altheim.com/murray/compbun.html #LINK The \
  Comprehensive Bunny Name List

would be rendered as a link to The Comprehensive Bunny Name List. Note the use of the backslash.
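The scanning rule described in the footnote can be sketched in Python as below. The function name link_inline is hypothetical. The sketch works on a single logical line (after any backslash continuation has been joined) and assumes the surrounding text should be XHTML-escaped in the same pass.

```python
import re
from html import escape

# A URL starts with one of the three schemes and ends at the first whitespace.
_URL = re.compile(r"(?:http|ftp|file)://\S+")


def link_inline(line):
    """Convert embedded URLs in one logical line into XHTML links."""
    match = _URL.search(line)
    if match is None:
        return escape(line)
    url = match.group(0)
    before, after = line[:match.start()], line[match.end():]
    text, rest = url, after
    tail = after.lstrip()
    if tail.startswith("#LINK") and (len(tail) == 5 or tail[5] in " \t"):
        # The remainder of the line becomes the link text.
        text = tail[5:].strip() or url
        rest = ""
    anchor = f'<a href="{escape(url)}">{escape(text)}</a>'
    return escape(before) + anchor + (link_inline(rest) if rest else "")
```

Applied to the joined form of the example above, the sketch produces the text "Amazing is the " followed by a link whose text is "The Comprehensive Bunny Name List".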


#IMG:   Image

Syntax
    #IMG WS url WS alternate text
  

The #IMG keyword indicates display of an image. The source URL follows the APT keyword, then after the first whitespace following the URL all text until the end of line is used as the alternate text (used when the image is not displayed, in text-only browsers, and as an aid for the sight-impaired).

Example

The following APT statement

      #IMG img/neocortext-net.png Neocortext.net: Where Brain and Document Meet
  

would be rendered as the image located at img/neocortext-net.png, relative to the location of the generated HTML file, with the remaining text as 'alt' text. In this example, the image file is located in a subdirectory of the document's directory named "img". If desired, an absolute URL can be used rather than a relative reference.


#LOGO:   Top of Page Logo

Syntax
    #LOGO WS url WS alternate text
  

The #LOGO keyword indicates display of an image at the top right of the page as a logo. The source URL follows the APT keyword, then after the first whitespace following the URL all text until the end of line is used as the alternate text.


#FIG:   Figure

Syntax
    #FIG WS url WS id WS alternate text
  

The #FIG keyword indicates display of an image as a figure. The source URL follows the APT keyword, then the figure ID, then all text until the end of line is used as the alternate text. The ID may be used to link to the figure.


APT Level 3

#SETHEAD:   Preset Heading Numbers

Syntax
    #SETHEAD WS ["heading prefix" WS ] initial heading number
  

Because documents are sometimes broken up into multiple web files, it is necessary to be able to set the initial heading number of a document so that it doesn't start a new hierarchy (i.e., displaying a "1."). The number is of the form of the desired section number.

Following the #SETHEAD keyword is an optional, double quote delimited string containing a text string to be prefixed in front of each heading, e.g., the word "Section" in "Section 1.1", "Section 1.2". This can be set to an empty string ("") if no text is desired, or to "NONE" if no section numbers are desired at all.

Note:
This keyword may appear anywhere in a file in order to allow setting of the heading prefix and a section number, but results may be rather unpredictable if care is not taken.
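Parsing a #SETHEAD statement into its optional prefix and its number might look like the following Python sketch. The function name parse_sethead and the regular expression are assumptions based on the syntax line above; how a processor then applies the number to subsequent headings is left open here.

```python
import re

# Optional double-quoted prefix, then a dotted section number such as 3.2.1.
_SETHEAD = re.compile(
    r'^#SETHEAD[ \t]+(?:"([^"]*)"[ \t]+)?(\d+(?:\.\d+)*)[ \t]*$')


def parse_sethead(line):
    """Return (prefix, numbers) or None if the line is not a valid #SETHEAD.

    prefix is None when no quoted string is given, "" for an empty string.
    """
    match = _SETHEAD.match(line)
    if match is None:
        return None
    prefix, number = match.groups()
    return prefix, [int(part) for part in number.split(".")]
```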


#INCLUDE:   Transclude Content

Syntax
    #INCLUDE WS url
  

The #INCLUDE keyword indicates that an external APT file is to be transcluded at the point in which it occurs. Heading information in the transcluded file is ignored.

Example:

      #APT 3.0
      #AUTHOR Don Marquis
      #TITLE The Cruise of the Jasper B.
      #SUB Night, Tempest, Love and Battle
      #INCLUDE "http://www.altheim.com/lit/jasperb.15.html"
  

Transclusions: To include an external file as an answer to a question:

      ...
      #DFN My first question is how many monkeys? |
      #INCLUDE answer1.html
  

Ed.Note:
This will include either an APT or XHTML document, in this case the XHTML document 'jasperb.15.html'. Because the included file may be an APT, HTML or XHTML file, and either of the latter two may turn out not to be well-formed, the processor must be capable of cleaning up the content to the degree necessary to include it without corrupting the output. Error handling is therefore currently the biggest design question.
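Because a transcluded APT file may itself contain #INCLUDE statements, a processor needs some guard against loops. The Python sketch below shows one possible approach for APT sources only. The function name expand_includes and the load callable (which maps a URL to source text) are hypothetical, and the sketch neither cleans up HTML content nor drops header statements from the included file.

```python
def expand_includes(name, load, _active=()):
    """Return the lines of document `name` with #INCLUDE statements expanded.

    `load` is a caller-supplied function returning the text for a URL.
    `_active` tracks the chain of documents currently being expanded.
    """
    if name in _active:
        chain = " -> ".join((*_active, name))
        raise ValueError(f"transclusion loop: {chain}")
    out = []
    for line in load(name).splitlines():
        parts = line.split(None, 1)
        if line.startswith("#") and parts[0] == "#INCLUDE" and len(parts) == 2:
            target = parts[1].strip().strip('"')  # quotes appear in one example
            out.extend(expand_includes(target, load, (*_active, name)))
        else:
            out.append(line)
    return out
```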


#TABLE:   Tables

Syntax
    #TABLE [ WS parameter ]*
  

The #TABLE keyword begins a series of statements considered together as a table. The end of the table content is marked by a #END keyword (absent this keyword, parsers will scan to the end of the file). The #TABLE keyword may be followed (on the same line) by parameters of the form name="value" separated by whitespace. These allow setting of optional table attributes including summary, width, border, frame, rules, cellpadding and cellspacing.

Whitespace is used to delineate both rows and columns. Column breaks are indicated by the first line of the table, using vertical bar characters (|) as indicators. Absent a first line whose first non-whitespace character is a vertical bar, the table will be created as one column. Row breaks are empty lines, or lines containing only whitespace. Use of tab characters is not recommended due to potential confusion (i.e., editor configuration may alter display of tabs), but when they occur are each converted to four space characters.

Table headings are set by the #TH statement, using vertical bar characters as delimiters between individual table headings. If there is a mismatch between the number of headings and the number of indicated columns (as described in the above paragraph), a warning should be issued.

Example:

      #TABLE border="0" width="80%"
      #TH Qty. | Description  | Price (ea.) | Total
               |                     |        | 
      3doz.       Free Range Eggs       1.55   4.65
                  12 medium 12s

      2lbs.       Cornhill Bacon        1.79   3.58
                  cooked smoked 40g

      1 loaf      Hovis white bread     0.59   0.59
                  medium sliced 800g
      Total                                     8.82
      #END
  

Note:
In the above example, note that the vertical bar characters of the heading (#TH) do not line up with those in the first line of data. In the former case they are used as delimiters between heading cell contents, in the latter as column indicators. But since extra whitespace is not significant within table cells, there would have been no harm in aligning them, and indeed it would make the table easier to read during editing of the APT file.
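The column and row rules can be sketched in Python as below: the marker line gives the character offsets of the column breaks, each data line is sliced at those offsets, and lines between blank lines are merged into one row. All function names are hypothetical, and joining the fragments of a multi-line cell with a single space is an assumption.

```python
def column_breaks(marker_line):
    """Return the character offsets of the "|" column indicators."""
    return [i for i, ch in enumerate(marker_line) if ch == "|"]


def split_cells(line, breaks):
    """Slice one line into cells at the given offsets, trimming whitespace."""
    line = line.replace("\t", "    ")  # each tab counts as four spaces
    starts = [0] + breaks
    ends = breaks + [None]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


def parse_rows(lines, breaks):
    """Group data lines into rows; blank lines separate rows."""
    rows, current = [], None
    for line in lines:
        if not line.strip():
            if current is not None:
                rows.append(current)
                current = None
            continue
        cells = split_cells(line, breaks)
        if current is None:
            current = cells
        else:
            # A further line in the same row extends the cells above it.
            current = [" ".join(p for p in pair if p)
                       for pair in zip(current, cells)]
    if current is not None:
        rows.append(current)
    return rows
```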


#TH:   Table Headings

Syntax
    #TH WS [ heading_title "|" ] [ WS + ]
  

The #TH keyword indicates the table headings for the current table. If the first character of the last line is a "+" character, that line will be used to set the column breaks for the table. Since alignment of the column divisions requires alignment of the "+" characters, it's recommended that this string of characters begin on a newline, using a backslash to begin the new line (see the example below). See #TABLE for more information.

Note that any characters occurring at the column breaks are ignored, as they are expected to be whitespace. The following example shows conversion of a code table:

    #TABLE
    #TH | UN Code | Country Name | ISO 3166 Code \
       +                +
    246 Finland          FIN
    250 France           FRA
    254 French Guiana    GUF
    258 French Polynesia PYF
    266 Gabon            GAB
    270 Gambia           GMB
    268 Georgia          GEO
    276 Germany          DEU
    288 Ghana            GHA
    292 Gibraltar        GIB
    300 Greece           GRC
    304 Greenland        GRL
    #END
  

#END:   Table End

Syntax
    #END
  

The #END keyword indicates the end of a table. See #TABLE for more information.



Linking

Ed.Note:
This section is currently in progress...

If APT were not to support the generation of hypertext documents, it might be considered rather deficient, though creating a web of hyperlinks can also be quite difficult to manage.

A link consists of a source anchor and its target. Each link in a hypertext document has a purpose, and discerning this purpose simplifies to some extent its creation and management.


Processing

An APT source document must be a text document. This source document undergoes a number of transformations, enumerated below:

Ed.Note:
This section is currently in progress...

  1. Keyword Handling
    The file is processed line-by-line from the beginning, APT keywords are processed according to their individual specification. This generates an XHTML document (actually, a DOM Document in the case of Ceryle) that is populated by APT parsing events.
  2. Whitespace Handling
    Because an APT processor is searching for blank lines as paragraph delimiters, all lines consisting entirely of whitespace are converted into empty lines, and trailing whitespace at the end of each line is eliminated.
  3. Character Escaping
    As the document is processed, blocks of content are processed by "escaping" various characters so that they are "safe" in XHTML. This includes markup characters occurring in the text (such as ampersands), as well as known non-ASCII characters supported in the XHTML character entity set, which are converted to named character entities (such as &uuml;). Unicode characters beyond the 8-bit range (code points above 255) not covered by these provisions are converted to numeric character references (e.g., &#3204;). The exception to this is when a block of text is started with a #P (paragraph) keyword with the explicit goal of avoiding the conversion of existing XHTML markup.
  4. Transclusions
    When Level 3.0 transclusions are supported and the feature is active, all transclusions are processed at the time they are encountered. Transcluded documents are expected to be APT documents, so special note should be made about possible recursive loops should transcluded documents subsequently reference for transclusion documents that transclude documents that transclude documents that transclude documents that... (e-gad)
  5. Post Processing
    Once the source document has been parsed and the DOM Document generated, post processing includes optional generation of section numbers, generation of the table of contents, and creation of the default header and footer content, including a link to the default stylesheet.
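Steps 2 and 3 above can be sketched in Python as follows. The function names are hypothetical. The sketch relies on the standard library's html.entities.codepoint2name table for the named entities, and as a simplification it emits a numeric reference for every non-ASCII character that has no named entity, including those at or below 255.

```python
from html.entities import codepoint2name

MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def normalize_whitespace(text):
    """Strip trailing whitespace; whitespace-only lines become empty lines."""
    return "\n".join(line.rstrip() for line in text.splitlines())


def escape_text(text):
    """Make a block of plain text safe for inclusion in XHTML."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in MARKUP:
            out.append(MARKUP[ch])
        elif code < 128:
            out.append(ch)                            # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")   # named entity
        else:
            out.append(f"&#{code};")                  # numeric reference
    return "".join(out)
```

A block started with #P would bypass escape_text, since the exception noted in step 3 is meant to preserve existing XHTML markup.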

Feedback

Feedback on the APT design can be sent to its author at <m.altheim@open.ac.uk>


References

[references]
