Chapter 6

Unicode

Scheme 48 fully supports ISO 10646 (Unicode): Scheme characters represent Unicode scalar values, and Scheme strings are arrays of scalar values. More information on Unicode can be found at the Unicode web site.

6.1 Characters and their codes

Scheme 48 internally represents characters as Unicode scalar values. The unicode structure contains procedures for converting between characters and scalar values:

(char->scalar-value char) -> integer
(scalar-value->char integer) -> char
(scalar-value? integer) -> boolean

Char->scalar-value returns the scalar value of a character, and scalar-value->char converts in the other direction. Scalar-value->char signals an error if passed an integer that is not a scalar value. Scalar-value? tests whether an integer is a Unicode scalar value.

Note that the Unicode scalar value range is [0, #xD7FF] ∪ [#xE000, #x10FFFF].

In particular, this excludes the surrogates, which UTF-16 uses to encode scalar values with two 16-bit words. Note that this representation differs from that of Java, which uses UTF-16 code units as the character representation -- Scheme 48 effectively uses UTF-32, and is thus in line with other Scheme implementations and the current Unicode proposal for R⁶RS, as set forth in SRFI 75.

The R⁵RS procedures char->integer and integer->char are synonyms for char->scalar-value and scalar-value->char, respectively.
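As a brief illustration of the procedures above, the following sketch converts in both directions and guards a conversion with scalar-value?. It assumes the unicode structure has been opened at the REPL; maybe-scalar-value->char is a hypothetical helper and not part of Scheme 48.

```scheme
;; At the Scheme 48 REPL, first make the structure available:
;;   ,open unicode

(char->scalar-value #\A)        ; 65
(scalar-value->char #x3BB)      ; the character U+03BB (Greek small lambda)
(scalar-value? #xD800)          ; #f: a surrogate is not a scalar value
(scalar-value? #x10FFFF)        ; #t: the largest scalar value

;; Hypothetical helper: convert only when the integer is a scalar value,
;; returning #f instead of signalling an error.
(define (maybe-scalar-value->char n)
  (if (scalar-value? n)
      (scalar-value->char n)
      #f))
```

A caller that processes untrusted integers can test the result for #f rather than handle the error that scalar-value->char would signal.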

6.2 Character and string literals

The syntax specified here is in line with the current Unicode proposal for R⁶RS, as set forth in SRFI 75, except for case-sensitivity. (Scheme 48 is case-insensitive.)

6.2.1 Character literals

The following character names are available in addition to what R⁵RS provides:

#\nul (ASCII 0)
#\alarm (ASCII 7)
#\backspace (ASCII 8)
#\tab (ASCII 9)
#\vtab (ASCII 11)
#\page (ASCII 12)
#\return (ASCII 13)
#\esc (ASCII 27)
#\rubout (ASCII 127)
#\x<x><x>... hex, explicitly or implicitly delimited, where <x><x>... denotes the scalar value of the character
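The names above can be checked against their codes with the R⁵RS procedure char->integer. This is only a small sketch; the expected values in the comments follow directly from the list.

```scheme
(char->integer #\nul)                  ; 0
(char->integer #\tab)                  ; 9
(char->integer #\rubout)               ; 127
(char->integer #\x41)                  ; 65, the same character as #\A
(char=? #\x3bb (integer->char 955))    ; #t: hex 3BB is decimal 955
```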

6.2.2 String literals

The following escape characters in string literals are available in addition to what R⁵RS provides:

\a: alarm (ASCII 7)
\b: backspace (ASCII 8)
\t: tab (ASCII 9)
\n: linefeed (ASCII 10)
\v: vertical tab (ASCII 11)
\f: formfeed (ASCII 12)
\r: return (ASCII 13)
\e: escape (ASCII 27)
\': quote (ASCII 39, same as unquoted)
\<newline><intraline whitespace>: elided (allows a single-line string to span source lines)
\x<x><x>...; hex, where <x><x>... denotes the scalar value of the character
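A few expressions illustrate how the escapes above affect the contents of a string. This is a sketch to try at the REPL; the last expression relies on the line-continuation rule exactly as described in the list.

```scheme
(string-length "a\tb")         ; 3: a, tab, b
(string-length "\x3bb;x")      ; 2: U+03BB followed by x
(char->integer (string-ref "\e" 0))   ; 27

;; The backslash, the newline, and the leading whitespace of the
;; next line are elided, so both strings should be equal.
(string=? "one \
           line"
          "one line")
```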

6.2.3 Identifiers and symbol literals

Where R⁵RS allows a <letter>, Scheme 48 allows in addition any character whose scalar value is greater than 127 and whose Unicode general category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co.

Moreover, when a backslash appears in a symbol, it must start a \x<x><x>...; escape, which identifies an arbitrary character to include in the symbol. Note that a backslash itself can be specified as \x5C;.
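The escape syntax for symbols can be tried out as follows. This is a sketch that assumes the source text is read in an encoding that can represent the non-ASCII letter.

```scheme
(symbol->string 'caf\xe9;)                   ; "café"
(string-length (symbol->string 'a\x5C;b))    ; 3: a, backslash, b
(eq? 'λ '\x3bb;)                             ; #t: one symbol, spelled two ways
```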

6.3 Character classification and case mappings

The R⁵RS character predicates -- char-whitespace?, char-lower-case?, char-upper-case?, char-numeric?, and char-alphabetic? -- all treat the full Unicode range.

Char-upcase and char-downcase as well as char-ci=?, char-ci<?, char-ci<=?, char-ci>?, char-ci>=?, string-ci=?, string-ci<?, string-ci>?, string-ci<=?, string-ci>=? all use the standard simple locale-insensitive Unicode case folding.

In addition, Scheme 48 provides the unicode-char-maps structure for more complete access to the Unicode character classification with the following procedures and macros:

(general-category general-category-name) -> general-category (syntax)
(general-category? x) -> boolean
(general-category-id general-category) -> string
(char-general-category char) -> general-category

The syntax general-category returns a Unicode general category object associated with general-category-name. (See Figure 2 below.) General-category? is the predicate for general-category objects. General-category-id returns the Unicode category id as a string (also listed in Figure 2). Char-general-category returns the general category of a character.

general-category-name	primary-category-name	Unicode category id
`uppercase-letter`	`letter`	`"Lu"`
`lowercase-letter`	`letter`	`"Ll"`
`titlecase-letter`	`letter`	`"Lt"`
`modified-letter`	`letter`	`"Lm"`
`other-letter`	`letter`	`"Lo"`
`non-spacing-mark`	`mark`	`"Mn"`
`combining-spacing-mark`	`mark`	`"Mc"`
`enclosing-mark`	`mark`	`"Me"`
`decimal-digit-number`	`number`	`"Nd"`
`letter-number`	`number`	`"Nl"`
`other-number`	`number`	`"No"`
`opening-punctuation`	`punctuation`	`"Ps"`
`closing-punctuation`	`punctuation`	`"Pe"`
`initial-quote-punctuation`	`punctuation`	`"Pi"`
`final-quote-punctuation`	`punctuation`	`"Pf"`
`dash-punctuation`	`punctuation`	`"Pd"`
`connector-punctuation`	`punctuation`	`"Pc"`
`other-punctuation`	`punctuation`	`"Po"`
`currency-symbol`	`symbol`	`"Sc"`
`mathematical-symbol`	`symbol`	`"Sm"`
`modifier-symbol`	`symbol`	`"Sk"`
`other-symbol`	`symbol`	`"So"`
`space-separator`	`separator`	`"Zs"`
`paragraph-separator`	`separator`	`"Zp"`
`line-separator`	`separator`	`"Zl"`
`control-character`	`miscellaneous`	`"Cc"`
`formatting-character`	`miscellaneous`	`"Cf"`
`surrogate`	`miscellaneous`	`"Cs"`
`private-use-character`	`miscellaneous`	`"Co"`
`unassigned`	`miscellaneous`	`"Cn"`

Figure 2: Unicode general categories and primary categories

(general-category-primary-category general-category) -> primary-category
(primary-category primary-category-name) -> primary-category (syntax)
(primary-category? x) -> boolean

General-category-primary-category maps the general category to its associated primary category -- also listed in Figure 2. The primary-category syntax returns the primary-category object associated with primary-category-name. Primary-category? is the predicate for primary-category objects.
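The classification procedures might be combined as in the following sketch. It assumes the unicode-char-maps structure is open, and it assumes that category objects are unique, so that comparing them with eq? is meaningful; letter? is a hypothetical helper.

```scheme
;; ,open unicode-char-maps

(define category (char-general-category #\A))

(general-category? category)       ; #t
(general-category-id category)     ; "Lu"

;; Assumption: each category is represented by a single object.
(eq? category (general-category uppercase-letter))

;; Hypothetical helper: is CHAR a letter of any kind (Lu, Ll, Lt, Lm, Lo)?
(define (letter? char)
  (eq? (general-category-primary-category (char-general-category char))
       (primary-category letter)))
```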

The unicode-char-maps structure also provides the following additional case-mapping procedures for characters:

(char-titlecase? char) -> boolean
(char-titlecase char) -> char
(char-foldcase char) -> char

Char-titlecase? tests if a character is in titlecase. Char-titlecase returns the titlecase counterpart of a character. Char-foldcase folds the case of a character, i.e. maps it to uppercase first, then to lowercase. The following case-mapping procedures on strings are available:

(string-upcase string) -> string
(string-downcase string) -> string
(string-titlecase string) -> string
(string-foldcase string) -> string

These implement the simple case mappings defined by the Unicode standard -- note that the length of the output string may be different from that of the input string.
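The following sketch exercises the case-mapping procedures on ASCII input, where the results are unambiguous. It assumes the string procedures are available after opening unicode-char-maps; the structure described in the R⁶RS section exports procedures of the same names.

```scheme
;; ,open unicode-char-maps

(char-titlecase #\a)               ; #\A
(char-foldcase #\A)                ; #\a
(char-titlecase? #\A)              ; #f: uppercase (Lu), not titlecase
(char-titlecase? #\x1c5)           ; #t: U+01C5 has general category Lt

(string-upcase "scheme")           ; "SCHEME"
(string-downcase "SCHEME")         ; "scheme"
(string-titlecase "hello world")   ; "Hello World"
(string-foldcase "Scheme")         ; "scheme"
```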

6.4 SRFI 14

The SRFI 14 (``Character Sets'') implementation in the srfi-14 structure is fully Unicode-compliant.

6.5 R⁶RS

The r6rs-unicode structure exports the procedures from the (r6rs unicode) library of the 5.91 draft of R⁶RS that are not already in the scheme structure:

string-normalize-nfd
string-normalize-nfkd
string-normalize-nfc
string-normalize-nfkc
char-titlecase
char-title-case?
char-foldcase
string-upcase
string-downcase
string-foldcase
string-titlecase

The r6rs-unicode structure also exports a char-general-category procedure compatible with the (r6rs unicode) library. Note that, as Scheme 48 treats source code case-insensitively, the symbols it returns are all-lowercase.
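The normalization procedures can be illustrated with the letter e followed by a combining acute accent (U+0301), which composes to U+00E9 under NFC. This is a sketch that assumes the structure is opened under the name r6rs-unicode.

```scheme
;; ,open r6rs-unicode

(string-length "e\x301;")                             ; 2
(string-length (string-normalize-nfc "e\x301;"))      ; 1: composed to U+00E9
(string-length (string-normalize-nfd "\xe9;"))        ; 2: decomposed again
(string=? (string-normalize-nfc "e\x301;") "\xe9;")   ; #t
```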

6.6 I/O

Ports must encode any text a program writes to an output port to a byte sequence, and conversely decode byte sequences when a program reads text from an input port. Therefore, each port has an associated text codec that describes how to encode and decode text.

Note that the interface to the text codec functionality is experimental and very likely to change in the future.

6.6.1 Text codecs

The i/o structure defines the following procedures:

(port-text-codec port) -> text-codec
(set-port-text-codec! port text-codec)

These two procedures retrieve and set the text codec associated with a port, respectively. A program can set the text codec of a port at any time, even if it has already performed I/O on the port.

The text-codecs structure defines the following procedures and macros:

(text-codec? x) -> boolean
null-text-codec (text-codec)
us-ascii-codec (text-codec)
latin-1-codec (text-codec)
utf-8-codec (text-codec)
utf-16le-codec (text-codec)
utf-16be-codec (text-codec)
utf-32le-codec (text-codec)
utf-32be-codec (text-codec)
(find-text-codec string) -> text-codec or #f

Text-codec? is the predicate for text codecs. Null-text-codec is primarily meant for null ports that never yield input and swallow all output. The remaining text codecs implement the US-ASCII, Latin-1, Unicode UTF-8, Unicode UTF-16 (little-endian), Unicode UTF-16 (big-endian), Unicode UTF-32 (little-endian), and Unicode UTF-32 (big-endian) encodings, respectively.

Find-text-codec finds the codec associated with an encoding name. The names of the above encodings are "null", "US-ASCII", "ISO8859-1", "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", and "UTF-32BE", respectively.
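A short sketch of how these pieces might fit together: looking up a codec by name and switching a freshly opened output port to UTF-16 (little-endian). The procedure write-greeting and the file name passed to it are hypothetical. Since the interface is experimental, treat this as an illustration only.

```scheme
;; ,open i/o text-codecs

(text-codec? (find-text-codec "UTF-8"))     ; #t
(find-text-codec "no-such-encoding")        ; #f

;; Hypothetical example: write a line encoded as UTF-16LE.
(define (write-greeting filename)
  (call-with-output-file filename
    (lambda (port)
      ;; New ports default to UTF-8; override before writing.
      (set-port-text-codec! port utf-16le-codec)
      (display "hello" port)
      (newline port))))
```

With this codec every ASCII character occupies two bytes, so the file written by (write-greeting "greeting.txt") should be 12 bytes long.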

6.6.2 Text-codec utilities

The text-codec-utils structure exports a few utilities for dealing with text codecs:

(guess-port-text-codec-according-to-bom port) -> text-codec or #f
(set-port-text-codec-according-to-bom! port) -> boolean

These procedures look at the byte-order-mark (also called the ``BOM'', U+FEFF) at the beginning of a port and guess the appropriate text codec. This works only for UTF-16 (little-endian and big-endian) and UTF-8. Guess-port-text-codec-according-to-bom returns the text codec, or #f if it found no UTF-16 or UTF-8 BOM. Note that this actually reads from the port. If the guess does not succeed, it is probably a good idea to re-open the port. Set-port-text-codec-according-to-bom! calls guess-port-text-codec-according-to-bom, sets the port text codec to the result if successful and returns #t. If it is not successful, it returns #f. As with guess-port-text-codec-according-to-bom, this reads from the port, whether successful or not.
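The advice about re-opening the port could be followed as in this sketch. The name open-input-file/bom is a hypothetical helper and not part of Scheme 48; it assumes the port is backed by a file that can simply be opened a second time.

```scheme
;; ,open i/o text-codecs text-codec-utils

;; Hypothetical helper: open FILENAME for input and pick the text codec
;; from a leading BOM if there is one.
(define (open-input-file/bom filename)
  (let ((port (open-input-file filename)))
    (if (set-port-text-codec-according-to-bom! port)
        port                              ; codec set from the BOM
        (begin
          ;; No BOM found, but bytes were consumed while looking:
          ;; start over with a fresh port and its default codec.
          (close-input-port port)
          (open-input-file filename)))))
```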

6.6.3 Creating text codecs

(make-text-codec strings encode-proc decode-proc) -> text-codec
(text-codec-names text-codec) -> list of strings
(text-codec-encode-char-proc text-codec) -> encode-proc
(text-codec-decode-char-proc text-codec) -> decode-proc
(define-text-codec id name encode-proc decode-proc) (syntax)
(define-text-codec id (name ...) encode-proc decode-proc) (syntax)

Make-text-codec constructs a text codec from a list of names, and an encode and a decode procedure. (See below on how to construct encode and decode procedures.) Text-codec-names, text-codec-encode-char-proc, and text-codec-decode-char-proc are the accessors for text codecs. The define-text-codec syntax is a shorthand for binding a global identifier to a text codec. Its first form is for codecs with only one name, the second for codecs with several names.

Encoding and decoding procedures work as follows:

(encode-proc char buffer start count) -> boolean maybe-count
(decode-proc buffer start count) -> maybe-char count

An encode-proc consumes a character char to encode, a byte vector buffer to receive the encoding, an index start into the buffer, and a block size count. It is supposed to write the encoding of char into the block at [start, start + count). If the encoding is successful, the procedure must return #t and the number of bytes needed by the encoding. If the character cannot be encoded at all, the procedure must return #f and #f. If the encoding is possible but the space is not sufficient, the procedure must return #f and the total number of bytes needed for the encoding.

A decode-proc consumes a byte vector buffer, an index start into the buffer, and a block size count. It is supposed to decode the bytes at indices [start, start + count). If the decoding is successful, it must return the decoded character at the beginning of the block, and the number of bytes consumed. If the block cannot begin with or be a prefix of a valid encoding, the procedure must return #f and #f. If the block contains a true prefix of a valid encoding, the procedure must return #f and a total count of bytes (including the buffer) needed to complete the encoding. Note that this byte count is only a guess: the system will provide that many bytes, but the decoding procedures might still signal an incomplete encoding, causing the system to try to obtain more.
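To make the protocol concrete, here is a minimal sketch of a single-byte, 7-bit codec written against the descriptions above. The names encode-ascii, decode-ascii, my-ascii-codec, and "X-MY-ASCII" are made up for illustration. The sketch assumes that byte-vector-ref and byte-vector-set! come from the byte-vectors structure; it is not how the built-in us-ascii-codec is implemented.

```scheme
;; ,open text-codecs byte-vectors unicode

(define (encode-ascii char buffer start count)
  (let ((sv (char->scalar-value char)))
    (cond ((>= sv 128) (values #f #f))     ; cannot be encoded at all
          ((< count 1) (values #f 1))      ; encodable, but one byte is needed
          (else
           (byte-vector-set! buffer start sv)
           (values #t 1)))))               ; success, one byte written

(define (decode-ascii buffer start count)
  (if (< count 1)
      (values #f 1)                        ; empty block: one byte is needed
      (let ((byte (byte-vector-ref buffer start)))
        (if (< byte 128)
            (values (scalar-value->char byte) 1)   ; character, bytes consumed
            (values #f #f)))))             ; not a valid encoding

(define-text-codec my-ascii-codec "X-MY-ASCII" encode-ascii decode-ascii)
```

A multi-byte codec would follow the same shape, except that the "space is not sufficient" and "true prefix" cases would return the full length of the encoding rather than 1.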

6.7 Default encodings

The default encoding for new ports is UTF-8. For the default current-input-port, current-output-port, and current-error-port, Scheme 48 consults the OS for encoding information.

For Unix, it consults nl_langinfo(3), which in turn consults the LC_ environment variables. If the encoding is not defined that way, Scheme 48 reverts to US-ASCII.

Under Windows, Scheme 48 uses Unicode I/O (using UTF-16) for the default ports connected to the console, and Latin-1 for default ports that are not.
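To see which encoding was actually chosen for a default port, a program could inspect the port's codec with the procedures from the I/O section. This is a sketch; the result depends on the platform and environment, so no particular output is shown.

```scheme
;; ,open i/o text-codecs

;; Returns the list of names of the codec in use on the
;; current output port.
(text-codec-names (port-text-codec (current-output-port)))
```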