Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Tuesday 02 September 2003 Arabica

More codecvt facet renaming. I, like many people, use the terms UTF16 and UCS2 interchangably when talking about slinging Unicode text as wide characters, even though I knew there was a difference between the two. Here's what the Unicode glossary says
  • UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets. (See Appendix C, Relationship to ISO/IEC 10646.)
  • UTF-16 Encoding Form. The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000 .. U+D7FF and U+E000 .. U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicde scalar value in the range U+10000 .. U+10FFFF to a surrogate pair, according to the Table 3-4, UTF-16 Bit Distribution.
  • UTF-16 Encoding Scheme. The UTF-16 encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian formats.
All clear? No, I didn't think so either. However, with Arabica (and XML processing more generally) it's sometimes helpful to distinguish between a Unicode text as wide characters, and a byte sequence encoding thos wide characters. Therefore, I'm going to use UTF16 to mean a byte sequence, and UCS2 to mean a character sequence. This is still probably not quite right, but I'll take my chances.

Upshot of this is I've renamed utf8utf16codecvt to utf8ucs2codecvt. I've also committed two new codecvts, utf16beucscodecvt and utf16leucs2codecvt, which perform UCS2 to big-endian and little-endian UTF16 conversion.
Tagged code, arabica, xml, and c++


Jez Higgins

Freelance software generalist
software created
extended or repaired

Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About