Byte Order Mark

From Wikipedia, the free encyclopedia

A byte order mark (BOM) is the character at code point U+FEFF ("zero-width no-break space") when it is placed at the start of a text to denote the endianness of UCS/Unicode characters encoded in UTF-16 or UTF-32, or to mark the text as being encoded in UTF-8, UTF-16 or UTF-32.

In most encodings the encoded BOM is a byte sequence that is unlikely to occur at the start of text in more conventional encodings or in other Unicode encodings (it usually looks like a run of obscure control codes). If a BOM is misinterpreted as an ordinary character within the text, it is generally invisible, because it is a zero-width no-break space. The "zero-width no-break space" semantics of U+FEFF were deprecated in Unicode 3.2 (U+2060 WORD JOINER is the recommended replacement for that purpose), so that U+FEFF can be used solely as a BOM.
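A small Python sketch of what "invisible" means in practice: a BOM that survives decoding becomes a real U+FEFF character at the start of the string, so the text looks unchanged but no longer compares equal. The helper name `remove_leading_bom` is illustrative and not part of any library.

```python
import unicodedata

BOM = "\ufeff"

def remove_leading_bom(text: str) -> str:
    """Drop a single U+FEFF from the start of text, if one is present."""
    return text[1:] if text.startswith(BOM) else text

# Python's plain "utf-8" codec keeps the BOM as an ordinary character.
decoded = b"\xef\xbb\xbfname".decode("utf-8")

print(unicodedata.name(decoded[0]))           # ZERO WIDTH NO-BREAK SPACE
print(decoded == "name")                      # False: the hidden character is still there
print(remove_leading_bom(decoded) == "name")  # True
```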

In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string to indicate that the encoded characters that follow use big-endian byte order, or as the byte sequence FF FE to indicate little-endian order. The value U+FFFE is guaranteed never to be a Unicode character, whereas U+FEFF is one, so a decoder that reads the first code unit as FFFE can tell that it has assumed the wrong byte order.
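The check described above can be sketched in Python as follows. The function names and the big-endian fallback for text without a BOM are assumptions made for illustration; only the `codecs` constants come from the standard library.

```python
import codecs
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' according to the leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

def decode_utf16(data: bytes, default_order: str = "big") -> str:
    """Decode UTF-16 bytes, using and then discarding a BOM when there is one."""
    order = utf16_byte_order(data)
    if order is None:
        order = default_order                  # assumption: caller chooses the fallback
    else:
        data = data[2:]                        # skip the two BOM bytes
    return data.decode("utf-16-be" if order == "big" else "utf-16-le")

print(decode_utf16(b"\xfe\xff\x00A"))  # A
print(decode_utf16(b"\xff\xfeA\x00"))  # A
```

Note that FF FE is also a prefix of the UTF-32 little-endian BOM (FF FE 00 00), so a detector that must handle both encodings has to test the longer sequence first.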

While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Many Windows programs (including Windows Notepad) add one to UTF-8 files. On Unix-like systems (which make heavy use of text files for configuration), however, this practice is not recommended, because it interferes with the correct processing of significant leading bytes such as the hash-bang ("#!") at the start of an interpreted script. It may also interfere with source code for programming languages that do not recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, the BOM is sent to the browser as the start of the page, which prevents the PHP script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.
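The following Python sketch illustrates the hash-bang problem and two ways of dropping the mark. The sample script contents and the helper name `strip_utf8_bom` are made up for the example; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, leaving other input untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A shell script saved by an editor that prepends a UTF-8 BOM.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

print(script.startswith(b"#!"))                     # False: the BOM hides the hash-bang
print(strip_utf8_bom(script).startswith(b"#!"))     # True
print(script.decode("utf-8-sig").startswith("#!"))  # True: this codec skips the BOM
print(codecs.BOM_UTF8.decode("iso-8859-1"))         # ï»¿
```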

A BOM can also be used with UTF-32, but this encoding is almost never used for transmission.

Representations of byte order marks by encoding

Encoding	Representation
UTF-8	`EF BB BF`
UTF-16 Big Endian	`FE FF`
UTF-16 Little Endian	`FF FE`
UTF-32 Big Endian	`00 00 FE FF`
UTF-32 Little Endian	`FF FE 00 00`
SCSU	`0E FE FF`
UTF-7	`2B 2F 76` and one of the following byte sequences: `[ 38 | 39 | 2B | 2F | 38 2D ]` (*)
UTF-EBCDIC	`DD 73 66 73`
BOCU-1	`FB EE 28`

(*) Note: In UTF-7, the value that becomes the fourth byte of the BOM is, before encoding as base64, 001111xx in binary, where xx depends on the next character (the first character after the BOM). Strictly speaking, the fourth byte is therefore not purely part of the BOM; it also carries information about the next (non-BOM) character. For xx = 00, 01, 10, 11 this value is 60, 61, 62 or 63 in decimal respectively, which base64 encodes as the bytes 38, 39, 2B or 2F in hex (the characters "8", "9", "+" and "/"). The sequence 38 2D ("8-") occurs when the base64 run ends immediately after the BOM.
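As a rough illustration, the table and the note can be turned into a small detector in Python. The names `SIGNATURES`, `utf7_fourth_byte` and `detect_bom` are invented for this sketch. It only inspects leading bytes, so it cannot tell UTF-32 little-endian text apart from UTF-16 little-endian text whose first character after the BOM is U+0000.

```python
import string
from typing import Optional

# Fixed signatures from the table, longest first so that FF FE 00 00
# (UTF-32 LE) is tested before its two-byte prefix FF FE (UTF-16 LE).
SIGNATURES = [
    ("UTF-32 Big Endian", bytes.fromhex("00 00 FE FF")),
    ("UTF-32 Little Endian", bytes.fromhex("FF FE 00 00")),
    ("UTF-EBCDIC", bytes.fromhex("DD 73 66 73")),
    ("UTF-8", bytes.fromhex("EF BB BF")),
    ("SCSU", bytes.fromhex("0E FE FF")),
    ("BOCU-1", bytes.fromhex("FB EE 28")),
    ("UTF-16 Big Endian", bytes.fromhex("FE FF")),
    ("UTF-16 Little Endian", bytes.fromhex("FF FE")),
]

BASE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
UTF7_PREFIX = bytes.fromhex("2B 2F 76")         # the characters "+/v"
UTF7_FOURTH = BASE64[60:64].encode("ascii")     # values 60..63, i.e. b"89+/"

def utf7_fourth_byte(next_char: str) -> int:
    """Fourth BOM byte when next_char is base64-encoded right after the BOM."""
    xx = next_char.encode("utf-16-be")[0] >> 6  # top two bits of the next code unit
    return ord(BASE64[0b111100 | xx])           # sextet 1111xx as a base64 character

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding name whose BOM starts data, or None if none matches."""
    for name, signature in SIGNATURES:
        if data.startswith(signature):
            return name
    if data.startswith(UTF7_PREFIX) and len(data) > 3 and data[3] in UTF7_FOURTH:
        return "UTF-7"
    return None

print(detect_bom(bytes.fromhex("EF BB BF") + b"text"))  # UTF-8
print(hex(utf7_fourth_byte("\u4e2d")))                  # 0x39
```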