title: Non-ASCII characters in RFCXML
published: true
date: 2021-12-16 04:45:35 UTC
editor: markdown
dateCreated: 2021-12-16 02:40:30 UTC

The use of non-ASCII characters in RFCXML is detailed in RFC 7997. Your file encoding must be set as UTF-8 (the default).

Non-ASCII characters used directly

Non-ASCII characters in RFCXML (and in I-Ds in general) may appear within the body of the document. The <u> element is required where the non-ASCII characters are needed for correct protocol operation.

Notes on ascii attributes

  • The <author> and <contact> elements have fullname, initials, and surname attributes that can hold non-ASCII characters, along with asciiFullname, asciiInitials, and asciiSurname attributes that hold the ASCII equivalents when the non-ASCII characters are not in the Unicode Latin blocks.
  • The postal address elements <street>, <city>, <region>, and <country>, as well as <email>, also have an ascii attribute to hold the ASCII equivalent, which will also appear in the output format.
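
As a rough illustration of the attribute pairs above, the following Python sketch lists which ASCII counterparts are absent from an <author> element whose name attributes contain letters outside the Latin script. It is not part of xml2rfc; the helper names, the sample authors, and the test for "Latin" (a Unicode character name beginning with LATIN) are assumptions made for this example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Attribute pairs named in the notes above: (attribute, ASCII counterpart).
AUTHOR_ATTRS = [
    ("fullname", "asciiFullname"),
    ("initials", "asciiInitials"),
    ("surname", "asciiSurname"),
]


def outside_latin(text):
    """True if any letter in text is not a Latin-script letter.

    Assumption: a letter counts as Latin when its Unicode name starts
    with "LATIN". This is a heuristic, not the rule from RFC 7997.
    """
    return any(
        c.isalpha() and not unicodedata.name(c, "").startswith("LATIN")
        for c in text
    )


def attrs_lacking_ascii(author_xml):
    """List the ASCII counterpart attributes that are absent but may be wanted."""
    author = ET.fromstring(author_xml)
    return [
        ascii_attr
        for attr, ascii_attr in AUTHOR_ATTRS
        if outside_latin(author.get(attr, "")) and not author.get(ascii_attr)
    ]


# Hypothetical authors, used only to exercise the check.
print(attrs_lacking_ascii('<author fullname="\u674e\u534e" surname="\u674e"/>'))
print(attrs_lacking_ascii('<author fullname="Jos\u00e9 Example"/>'))
```

The first call reports asciiFullname and asciiSurname; the second reports nothing, because the accented letter is still a Latin letter under this heuristic.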

Non-ASCII characters wrapped in <u>

When non-ASCII characters are needed for correct protocol operation, they must be wrapped in the <u> element, with the format attribute specifying how they are rendered.

The simplified format consists of dash-separated keywords, where each keyword represents a possible expansion of the Unicode character or string. For example, <u format="lit-num-name">foo</u> expands the text to its literal value, code point values, and code point names.

A combination of up to 3 of the following keywords may be used, separated by dashes: "num", "lit", "name", "ascii", "char". The keywords are expanded as follows and combined, with the second and third enclosed in parentheses if present:

  • "ascii" - The value of the 'ascii' attribute on the <u> element
  • "char" - The literal element text, without quotes
  • "lit" - The literal element text, enclosed in quotes
  • "name" - The Unicode name(s) of the element text
  • "num" - The numeric value(s) of the element text, in U+1234 notation

In order to ensure that no specification mistakes can result from rendering methods that cannot render all Unicode code points, "num" MUST always be part of the specified format.

The following RFCXML:

<t>Temperature changes are indicated by the character <u>Δ</u></t>

Generates the following outputs depending on the setting of format:

  • format="num-lit": Temperature changes are indicated by the character U+0394 ("Δ")
  • format="num-name": Temperature changes are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA)
  • format="num-lit-name": Temperature changes are indicated by the character U+0394 ("Δ", GREEK CAPITAL LETTER DELTA)
  • format="num-name-lit": Temperature changes are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA, "Δ")
  • format="name-lit-num": Temperature changes are indicated by the character GREEK CAPITAL LETTER DELTA ("Δ", U+0394)
  • format="lit-name-num": Temperature changes are indicated by the character "Δ" (GREEK CAPITAL LETTER DELTA, U+0394)

The default value is "lit-name-num".
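
The expansions listed above can be reproduced with a short Python sketch. It is an illustration of the simplified format rules as described here, not the xml2rfc implementation; the function names are hypothetical, and joining the names of several code points with ", " is an assumption.

```python
import unicodedata


def expansion(key, text, ascii_equiv=None):
    """Return the expansion of one keyword for the element text."""
    if key == "ascii":
        if ascii_equiv is None:
            raise ValueError('"ascii" needs the ascii attribute to be set')
        return ascii_equiv
    if key == "char":
        return text
    if key == "lit":
        return f'"{text}"'
    if key == "name":
        # Assumption: names of several code points are joined with ", ".
        return ", ".join(unicodedata.name(c) for c in text)
    if key == "num":
        return " ".join(f"U+{ord(c):04X}" for c in text)
    raise ValueError(f"unknown keyword: {key}")


def expand_simple(text, fmt="lit-name-num", ascii_equiv=None):
    """Expand text according to a dash-separated simplified format."""
    keys = fmt.split("-")
    if len(keys) > 3:
        raise ValueError("at most 3 keywords may be combined")
    if "num" not in keys:
        raise ValueError('"num" must be part of the format')
    # The first expansion stands alone; the rest go in parentheses.
    first, *rest = [expansion(k, text, ascii_equiv) for k in keys]
    return f"{first} ({', '.join(rest)})" if rest else first


delta = "\u0394"  # GREEK CAPITAL LETTER DELTA
for fmt in ("num-lit", "num-name", "lit-name-num"):
    print(f"{fmt}: {expand_simple(delta, fmt)}")
```

Rejecting a format without "num" mirrors the requirement stated above; a real processor may report the problem differently.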

Expansion of <u> multi-codepoint strings

If the <u> element encloses a sequence of Unicode codepoints, rather than a single one, the rendering reflects this. For example:

   <u format="num-lit">ᏚᎢᎵᎬᎢᎬᏒ</u>

will be expanded to "U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 ("ᏚᎢᎵᎬᎢᎬᏒ")".
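
The "num" part of that expansion is one U+ value per code point, separated by spaces. A minimal Python sketch of this step follows; the function name is hypothetical, and the string is written with escapes so that the code points are unambiguous.

```python
def code_points(text):
    """Render each code point of text in U+XXXX notation, space-separated."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


# The Cherokee string from the example above, spelled out as escapes.
cherokee = "\u13DA\u13A2\u13B5\u13AC\u13A2\u13AC\u13D2"

# Mimics format="num-lit": the numbers first, then the quoted literal.
print(f'{code_points(cherokee)} ("{cherokee}")')
```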

Unicode characters in document text which are not enclosed in <u> will be replaced with a question mark (?) and a warning will be issued.

Non-simplified <u> format specifications

In order to provide for cases where the simplified format above is insufficient, without relinquishing the requirement that the number of a code point must always be rendered, the format attribute can also accept a full format string. This format uses placeholders that consist of any of the keywords above enclosed in curly braces; outside the placeholders, any ASCII text is permissible. For example,

   The <u format="{lit} character ({num})">Δ</u>

will be rendered as

   The "Δ" character (U+0394).

As with the simplified format, "num" MUST always be part of the specified format, in order to ensure that no specification mistakes can result from rendering methods that cannot render all Unicode code points.
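
The placeholder substitution described in this section can be sketched in Python as follows. This is an illustration under the rules stated here, not the xml2rfc code; the function name is hypothetical, and joining the names of several code points with ", " is an assumption.

```python
import re
import unicodedata

# A placeholder is one of the five keywords enclosed in curly braces.
PLACEHOLDER = re.compile(r"\{(num|lit|name|ascii|char)\}")


def expand_full(fmt, text, ascii_equiv=None):
    """Expand a full format string by substituting each placeholder."""
    if "{num}" not in fmt:
        raise ValueError('"{num}" must be part of the format')

    def value(match):
        key = match.group(1)
        if key == "num":
            return " ".join(f"U+{ord(c):04X}" for c in text)
        if key == "lit":
            return f'"{text}"'
        if key == "char":
            return text
        if key == "name":
            # Assumption: several names are joined with ", ".
            return ", ".join(unicodedata.name(c) for c in text)
        if ascii_equiv is None:
            raise ValueError('"{ascii}" needs the ascii attribute to be set')
        return ascii_equiv

    # Text outside the placeholders is copied through unchanged.
    return PLACEHOLDER.sub(value, fmt)


print("The " + expand_full("{lit} character ({num})", "\u0394") + ".")
```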

Split expansion of <u> elements

There are cases that cannot be handled with either the simplified or the full <u> format specification. One is exemplified in Table 1 of the CSS sample document at https://rfc-format.github.io/draft-iab-rfc-css-bis/sample2-v2.html#s-3. Rendering this with <u> elements requires that the non-ASCII content be rendered in one place (a table cell in one column) while the expansion is rendered in another cell in a different column. Provision for this has been made by modifying the expansion of <u> when it is referenced by an <xref>. This table, with <u> elements referenced by <xref> instances:

   <table>
     <name>A Sample of Legal Nicknames</name>
     <thead>
       <tr>
          <th>#</th>
          <th>Nickname</th>
          <th>Output for comparison</th>
       </tr>
     </thead>
     <tbody>
       <tr>
          <td>1</td>
          <td>&lt;Foo&gt;</td>
          <td>&lt;foo&gt;</td>
       </tr>
       <tr>
          <td>2</td>
          <td>&lt;foo&gt;</td>
          <td>&lt;foo&gt;</td>
       </tr>
       <tr>
          <td>3</td>
          <td>&lt;Foo Bar&gt;</td>
          <td>&lt;foo bar&gt;</td>
       </tr>
       <tr>
          <td>4</td>
          <td>&lt;foo bar&gt;</td>
          <td>&lt;foo bar&gt;</td>
       </tr>
       <tr>
          <td>5</td>
          <td>
             &lt;
             <u format="name-num" anchor="greek-upper-sigma">Σ</u>
             &gt;
          </td>
          <td> <xref target="greek-upper-sigma" /> </td>
       </tr>
       <tr>
          <td>6</td>
          <td>
             &lt;
             <u format="name-num" anchor="greek-lower-sigma">σ</u>
             &gt;
          </td>
          <td> <xref target="greek-lower-sigma" /> </td>
       </tr>
       <tr>
          <td>7</td>
          <td>
             &lt;
             <u format="name-num" anchor="greek-final-sigma">ς</u>
             &gt;
          </td>
          <td> <xref target="greek-final-sigma" /> </td>
       </tr>
       <tr>
          <td>8</td>
          <td>
             &lt;
             <u format="name-num" anchor="black-chess-king">♚</u>
             &gt;
          </td>
          <td>
             <xref target="black-chess-king" format="default"/>
          </td>
       </tr>
       <tr>
          <td>9</td>
          <td>
             &lt;Richard
             <u format="{char}> ({num})" anchor="richard-iv">Ⅳ</u>
             &gt;
          </td>
          <td>&lt;richard iv&gt;</td>
       </tr>
     </tbody>
   </table>

comes out as shown below:

| #   | Nickname               | Output for comparison                   |
| --: | :--------------------- | :-------------------------------------- |
| 1   | \<Foo\>                | \<foo\>                                 |
| 2   | \<foo\>                | \<foo\>                                 |
| 3   | \<Foo Bar\>            | \<foo bar\>                             |
| 4   | \<foo bar\>            | \<foo bar\>                             |
| 5   | \<Σ\>                  | GREEK CAPITAL LETTER SIGMA (U+03A3)     |
| 6   | \<σ\>                  | GREEK SMALL LETTER SIGMA (U+03C3)       |
| 7   | \<ς\>                  | GREEK SMALL LETTER FINAL SIGMA (U+03C2) |
| 8   | \<♚\>                  | BLACK CHESS KING (U+265A)               |
| 9   | \<Richard Ⅳ\> (U+2163) | \<richard iv\>                          |
_Table 1: A Sample of Legal Nicknames_
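
The split behaviour seen in rows 5 to 8 can be sketched in Python: a <u> that is the target of an <xref> shows only its literal text in place, and the <xref> shows the expansion. This is a simplified illustration inferred from the table above, not the xml2rfc implementation; it handles only format="name-num", and the function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# One table row in the style of row 5 above; \u03a3 is GREEK CAPITAL LETTER SIGMA.
ROW = """
<tr>
  <td>5</td>
  <td>&lt;<u format="name-num" anchor="greek-upper-sigma">\u03a3</u>&gt;</td>
  <td><xref target="greek-upper-sigma"/></td>
</tr>
"""


def name_num(text):
    """Expand text as NAME (U+XXXX), the "name-num" format."""
    names = ", ".join(unicodedata.name(c) for c in text)
    nums = " ".join(f"U+{ord(c):04X}" for c in text)
    return f"{names} ({nums})"


def render_cells(row_xml):
    """Return the rendered text of each <td> in the row."""
    row = ET.fromstring(row_xml)
    targets = {x.get("target") for x in row.iter("xref")}
    anchors = {u.get("anchor"): u for u in row.iter("u")}
    cells = []
    for td in row.findall("td"):
        parts = [td.text or ""]
        for child in td:
            if child.tag == "u":
                # Referenced by an <xref>: literal only; otherwise expand in place.
                referenced = child.get("anchor") in targets
                parts.append(child.text if referenced else name_num(child.text))
            elif child.tag == "xref":
                # The <xref> carries the expansion of the <u> it points at.
                parts.append(name_num(anchors[child.get("target")].text))
            parts.append(child.tail or "")
        cells.append("".join(parts).strip())
    return cells


print(render_cells(ROW))
```

A <u> with no <xref> pointing at it, as in row 9, keeps its full expansion in place.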