Seed7 Program: Toutf8

Toutf8 converts files from various codepages to UTF-8.

The following codepages are supported:

: 437, 708, 720, 737, 775, 850, 852, 855, 857, 858, 860, 861, 862, 863, 864, 865, 866, 869, 874, 1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 8859-1, 8859-2, 8859-3, 8859-4, 8859-5, 8859-6, 8859-7, 8859-8, 8859-9, 8859-10, 8859-11, 8859-13, 8859-14, 8859-15, 8859-16, latin-1, latin-2, latin-3, latin-4, latin-5, latin-6, latin-7, latin-8, latin-9, latin-10, 037, 273, 277, 280, 285, 297, 500, 1047

The following IANA/MIME charset names are also accepted:

: ANSI_X3.4-1968, ARMSCII-8, ASCII, CP437, CP850, GEOSTD8, IBM437, IBM850, ISO_8859-1, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, KOI8-R, KOI8-U, MACINTOSH, NS_4551-1, TIS-620, US-ASCII, UTF-16BE, UTF-16LE, UTF-7, UTF-8, UTF8, VISCII, WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256, WINDOWS-1257, WINDOWS-1258

Example of toutf8 usage (converting "infile" from codepage 437 and writing the result to "outfile"):

s7 toutf8 -437 infile outfile
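The effect of such a call can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Seed7 implementation: it uses Python's built-in "cp437" codec instead of "charsets.s7i", and the function name convert_to_utf8 is hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, codepage: str = "cp437") -> None:
    """Decode src with the given codepage and write the text as UTF-8.

    Hypothetical helper for illustration; toutf8 itself is written in Seed7.
    """
    # Work on raw bytes so that no newline translation happens on any platform.
    raw = src.read_bytes()
    text = raw.decode(codepage)            # codepage bytes -> Unicode text
    dst.write_bytes(text.encode("utf-8"))  # Unicode text -> UTF-8 bytes
```

Calling convert_to_utf8(Path("infile"), Path("outfile")) would correspond roughly to the command line above.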

Operation method

The "charsets.s7i" library contains definitions of various codepages. Codepages are represented as constant strings of length 256. Such a codepage string contains the Unicode representation of all characters defined by the codepage. To convert the character 'Σ' from codepage 437 to UTF-32, the following needs to be done:

: In codepage 437 the character 'Σ' is encoded with the number 228. The Unicode representation of 'Σ' is found at cp_437[229]. Note that 229 is used instead of 228 because the index of the first character in a string is 1 (cp_437[1] defines character 0 of codepage 437). The expression ord(cp_437[229]) is 931 (or 16#03A3), which is the Unicode code point of 'Σ'.
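The table lookup described above can be sketched as follows. The sketch is written in Python rather than Seed7, and it builds its own 256-character table from Python's built-in "cp437" codec instead of using "charsets.s7i". The names CP_437_TABLE and to_code_point are hypothetical and exist only for this illustration.

```python
# Illustrative sketch, not the Seed7 library code.
# A 256-character table for codepage 437, built from Python's own codec.
# In Seed7 the corresponding table is the string constant cp_437.
CP_437_TABLE = bytes(range(256)).decode("cp437")


def to_code_point(table: str, code: int) -> int:
    """Return the Unicode code point for a codepage value in 0..255."""
    # Python strings are 0-based, so the codepage value is the index itself.
    # Seed7 strings are 1-based, which is why the text uses cp_437[229].
    return ord(table[code])


sigma = to_code_point(CP_437_TABLE, 228)
print(sigma, hex(sigma), chr(sigma))  # prints: 931 0x3a3 Σ
```

The only difference from the Seed7 expression is the index offset: Python looks up position 228, while Seed7 looks up position 229 for the same character.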

All Seed7 source files (*.sd7 and *.s7i) use UTF-8 encoding. Therefore the Unicode characters used in "charsets.s7i" are encoded with UTF-8 (a UTF-8-aware editor should be used to edit files containing UTF-8 characters). Although UTF-8 is used in the source files, the codepage strings define the conversion to UTF-32 (the internal representation for chars and strings).
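The difference between the UTF-8 form stored in a source file and the UTF-32 value held in memory can be shown with 'Σ' as an example. The following Python sketch is only an illustration of the two encodings and says nothing about how Seed7 handles them internally. The variable names are chosen for this example.

```python
# Illustrative sketch: one character, three views of it.
ch = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

code_point = ord(ch)                  # the Unicode code point
utf8_bytes = ch.encode("utf-8")       # the form stored in a UTF-8 file
utf32_bytes = ch.encode("utf-32-be")  # big-endian UTF-32, without a BOM

print(code_point)         # prints: 931
print(utf8_bytes.hex())   # prints: cea3
print(utf32_bytes.hex())  # prints: 000003a3
```

In UTF-8 the character occupies two bytes, while its UTF-32 form is simply the code point 931 stored in four bytes.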

Write usage and allowed conversions