Seed7 - The extensible programming language
Toutf8 Source Code

Toutf8 converts files from various codepages to UTF-8.

The following codepages are supported:

437, 708, 720, 737, 775, 850, 852, 855, 857, 858, 860, 861, 862, 863, 864, 865, 866, 869, 874, 1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 8859-1, 8859-2, 8859-3, 8859-4, 8859-5, 8859-6, 8859-7, 8859-8, 8859-9, 8859-10, 8859-11, 8859-13, 8859-14, 8859-15, 8859-16, latin-1, latin-2, latin-3, latin-4, latin-5, latin-6, latin-7, latin-8, latin-9, latin-10, 037, 273, 277, 280, 285, 297, 500, 1047

The following IANA/MIME charset names are also accepted:

ANSI_X3.4-1968, ARMSCII-8, ASCII, CP437, CP850, GEOSTD8, IBM437, IBM850, ISO_8859-1, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, KOI8-R, KOI8-U, MACINTOSH, NS_4551-1, TIS-620, US-ASCII, UTF-16BE, UTF-16LE, UTF-7, UTF-8, UTF8, VISCII, WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256, WINDOWS-1257, WINDOWS-1258

Example of toutf8 usage (converting infile from codepage 437 to UTF-8):

s7 toutf8 -437 infile outfile

Operation method

The "charsets.s7i" library contains definitions of various codepages. Codepages are represented as constant strings of length 256. Such a codepage string contains the Unicode representation of all characters defined by the codepage. To convert the character 'Σ' from codepage 437 to UTF-32 the following needs to be done:

In codepage 437 the character 'Σ' is encoded with the number 228. The Unicode representation of 'Σ' is found at cp_437[229]. Note that 229 is used instead of 228 because the index of the first character in a string is 1 (cp_437[1] defines character 0 of codepage 437). The expression ord(cp_437[229]) is 931 (or 16#03A3), which is the Unicode code point of 'Σ'.
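
The lookup described above can be sketched in Python as a rough analogy; it is not the Seed7 code from "charsets.s7i". The table CP_437 and the helper codepage_to_unicode are hypothetical names, and the table is built from Python's built-in cp437 codec as a stand-in for the Seed7 constant cp_437. Because Python strings are indexed from 0, the offset by one disappears here.

```python
# Hypothetical stand-in for the Seed7 constant cp_437: a 256-character
# string built from Python's own cp437 codec (an assumption, not charsets.s7i).
CP_437 = bytes(range(256)).decode("cp437")


def codepage_to_unicode(data: bytes, table: str) -> str:
    """Map each byte to the character stored at that position in the table."""
    return "".join(table[byte] for byte in data)


# Python indexes from 0, so byte 228 is found at CP_437[228]
# (Seed7 indexes from 1, hence cp_437[229] in the text above).
sigma = CP_437[228]
print(len(CP_437))                         # 256
print(sigma, ord(sigma), hex(ord(sigma)))  # Σ 931 0x3a3
```

A whole file can be converted the same way by passing its raw bytes to codepage_to_unicode and encoding the result as UTF-8.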

All Seed7 source files (*.sd7 and *.s7i) use UTF-8 encoding. Therefore the Unicode characters used in "charsets.s7i" are encoded with UTF-8 (a UTF-8-aware editor should be used to edit files containing UTF-8 characters). Although UTF-8 is used in the source files, the codepage strings define the conversion to UTF-32 (the internal representation for chars and strings).
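
The difference between the stored UTF-8 bytes and the UTF-32 value can be illustrated with a small Python sketch. Python is used here only for illustration, and its encode calls stand in for what happens when a source file is read and a string is held in memory. The character 'Σ' occupies two bytes in a UTF-8 file, while its UTF-32 form is the single 32-bit value 931.

```python
sigma = "\u03a3"                          # 'Σ', code point 931 (16#03A3)

utf8_bytes = sigma.encode("utf-8")        # how the character is stored in a UTF-8 file
utf32_bytes = sigma.encode("utf-32-be")   # one fixed-width 32-bit unit per character

print(utf8_bytes.hex())    # cea3
print(utf32_bytes.hex())   # 000003a3
print(ord(sigma))          # 931
```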


Write usage and allowed conversions
