northanger (northanger) wrote,

characters: set, code, point, encode

at work finally trying to understand this stuff ... notes to self.

Tutorial: Character sets & encodings in XHTML, HTML and CSS

  • character set: The set of characters you will use for a particular purpose.
  • coded character set: Set of characters for which a unique number has been assigned to each character.
  • code point: Unit of a coded character set.
  • character encoding: The way these abstract characters are mapped to bytes for manipulation in a computer.
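
To keep the code point vs. encoding distinction straight, here is a minimal Python sketch (my own illustration, not from the tutorial). The sample character and the three encodings are arbitrary choices; the point is only that one code point can map to different byte sequences.

```python
# Code point: the number the coded character set (here Unicode) assigns to a character.
ch = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)          # 233, written U+00E9

# Character encoding: how that one code point is turned into bytes.
utf8_bytes = ch.encode("utf-8")         # two bytes
latin1_bytes = ch.encode("iso-8859-1")  # one byte
utf16_bytes = ch.encode("utf-16-be")    # two bytes, big-endian, no BOM

print(f"U+{code_point:04X}", utf8_bytes, latin1_bytes, utf16_bytes)
# prints: U+00E9 b'\xc3\xa9' b'\xe9' b'\x00\xe9'
```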

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more common characters in use. Around a million further code point positions are available in the Unicode character set. Characters in this latter range are referred to as supplementary characters.
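
The arithmetic behind "65,536" and "around a million" can be checked with a short sketch. The helper name `is_supplementary` and the two sample characters are my own, chosen only for illustration; the UTF-16 comparison is an extra observation, not something the tutorial states here.

```python
# Unicode code points run from U+0000 to U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1                          # 1,114,112
BMP_CODE_POINTS = 0x10000                                 # 65,536
SUPPLEMENTARY_CODE_POINTS = TOTAL_CODE_POINTS - BMP_CODE_POINTS  # 1,048,576


def is_supplementary(ch: str) -> bool:
    """Return True if the single character lies outside the BMP (above U+FFFF)."""
    return ord(ch) > 0xFFFF


bmp_char = "\u20ac"        # EURO SIGN, U+20AC (inside the BMP)
supp_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, U+1D11E (supplementary)

# In UTF-16 a BMP character takes one 16-bit unit; a supplementary
# character takes two (a surrogate pair).
bmp_units = len(bmp_char.encode("utf-16-be")) // 2
supp_units = len(supp_char.encode("utf-16-be")) // 2

print(is_supplementary(bmp_char), bmp_units)    # False 1
print(is_supplementary(supp_char), supp_units)  # True 2
```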

For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode.)

HTML Document Representation
The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
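
A rough sketch of why the "charset" parameter matters: the same bytes decode to different characters depending on which encoding the user agent assumes. The byte string and the Content-Type header value below are made up for illustration; `email.message.Message` is used only as a convenient standard-library header parser, not as what a browser actually does.

```python
from email.message import Message

raw = b"caf\xc3\xa9"   # bytes as a server might send them (UTF-8 for "café")

# Decoding with the right and the wrong charset.
as_utf8 = raw.decode("utf-8")          # 'café'
as_latin1 = raw.decode("iso-8859-1")   # 'cafÃ©' (each byte read as one character)

# Pulling the charset out of a hypothetical Content-Type header.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
charset = headers.get_content_charset()   # normalised to lower case

print(as_utf8, as_latin1, charset)
```

Decoding with ISO-8859-1 never fails, since every byte maps to some character, which is why a wrong charset tends to show up as garbled text rather than as an error.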

Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings 1.0
1.4 User agents addressed. User agents, in this current version, means a number of mainstream browsers. (The scope may grow as resources and test results become available for other user agents.) In an attempt to make the task of tracking browser applicability manageable, we have chosen a 'base version' for each of the user agents we are tracking for applicability. This base version represents a fairly recent, standards-compliant version of the browser. Where a browser operates in both standards- and quirks-mode, standards-mode is assumed (i.e., you should use a DOCTYPE statement). The base versions considered for this version of the document include: Internet Explorer 6 (Windows), Mozilla 1.4, Opera 7, Netscape Navigator 7, Safari, Internet Explorer 5 (Mac).

Alan Wood’s Unicode Resources
Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications

Universal Character Set
The Universal Character Set is a coded character set defined by the international standard ISO/IEC 10646. It maps more than a hundred thousand abstract characters, each identified by an unambiguous name, to integers called code points.

The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) were commonly used before 2000. This began to change in 2000, when the People's Republic of China mandated that computer systems sold there support GB18030, which required those systems to move beyond the BMP.

ISO 10646 and Unicode have an identical repertoire and numbering: the same characters with the same numbers exist in both standards. The difference between them is that Unicode adds rules and specifications that are lacking in ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.
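
Two of the things Unicode adds on top of the plain character map, normalisation and bidirectional properties, can be poked at with Python's standard `unicodedata` module. This is only a small sketch with sample characters I picked myself; it looks up per-character properties and does not run the full bidirectional algorithm.

```python
import unicodedata

# Normalisation: the same visible character can be one code point or two.
composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

same_raw = composed == decomposed                                # False
same_nfc = unicodedata.normalize("NFC", decomposed) == composed  # True
same_nfd = unicodedata.normalize("NFD", composed) == decomposed  # True

# Bidirectional category: the per-character input to the bidi algorithm.
latin_dir = unicodedata.bidirectional("a")        # 'L'  (left-to-right)
hebrew_dir = unicodedata.bidirectional("\u05d0")  # 'R'  (right-to-left), HEBREW LETTER ALEF
arabic_dir = unicodedata.bidirectional("\u0627")  # 'AL' (Arabic letter), ARABIC LETTER ALEF

print(same_raw, same_nfc, same_nfd, latin_dir, hebrew_dir, arabic_dir)
# prints: False True True L R AL
```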

