
Converting 8-bit characters

In a UDO source file you can use characters beyond 7-bit ASCII without having to know how they must be represented in a destination format such as LaTeX or Windows Help. You can type a German ß or ä without worry; UDO converts it for you automatically.

UDO expects source files to use the system character set of the operating system it runs on. On an MS-DOS computer, UDO expects text files written in the IBM PC character set by default; on an Atari computer, it expects the TOS character set by default.

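UDO can also handle files that are written in a different character set. You simply have to tell UDO which character set your source file uses with !code_source [<charset>]. For example, !code_source [cp1252] declares that the source file is written in Windows codepage 1252.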

UDO supports various codepages for various systems. The list below shows all currently supported systems and codepages; some codepages have several descriptors. Descriptors are not case-sensitive. (They are based on the former UDO descriptors and on those supported by the Unix command iconv.)

System   Encoding       Descriptors
Unicode  UTF-8          UTF-8 UTF8
Windows  Codepage 1250  CP1250 MS-EE WINDOWS-1250
Windows  Codepage 1251  CP1251 MS-CYRL RUSSIAN WINDOWS-1251
Windows  Codepage 1252  CP1252 MS-ANSI WINDOWS-1252 WIN
Windows  Codepage 1253  CP1253 GREEK MS-GREEK WINDOWS-1253
Windows  Codepage 1254  CP1254 MS-TURK TURKISH WINDOWS-1254
Windows  Codepage 1255  CP1255 HEBREW MS-HEBR WINDOWS-1255
Windows  Codepage 1256  CP1256 ARABIC MS-ARAB WINDOWS-1256
Windows  Codepage 1258  CP1258 WINDOWS-1258
ISO      8859-1         ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 LATIN1 L1 CSISOLATIN1
ISO      8859-2         ISO-8859-2 ISO-IR-101 ISO8859-2 ISO_8859-2 LATIN2 L2 CSISOLATIN2
ISO      8859-3         ISO-8859-3 ISO-IR-109 ISO8859-3 ISO_8859-3 LATIN3 L3 CSISOLATIN3
ISO      8859-4         ISO-8859-4 ISO-IR-110 ISO8859-4 ISO_8859-4 LATIN4 L4 CSISOLATIN4
ISO      8859-6         ISO-8859-6 ISO-IR-127 ISO8859-6 ISO_8859-6 ARABIC CSISOLATINARABIC ASMO-708 ECMA-114
ISO      8859-7         ISO-8859-7 ISO-IR-126 ISO8859-7 ISO_8859-7 GREEK GREEK8 CSISOLATINGREEK ECMA-118 ELOT_928
ISO      8859-8         ISO-8859-8 ISO-IR-138 ISO8859-8 ISO_8859-8 HEBREW CSISOLATINHEBREW
ISO      8859-9         ISO-8859-9 ISO-IR-148 ISO8859-9 ISO_8859-9 LATIN5 L5 CSISOLATIN5 TURKISH
ISO      8859-10        ISO-8859-10 ISO-IR-157 ISO8859-10 ISO_8859-10 LATIN6 L6 CSISOLATIN6 NORDIC
ISO      8859-11        ISO-8859-11 ISO8859-11 ISO_8859-11 THAI
ISO      8859-13        ISO-8859-13 ISO-IR-179 ISO8859-13 ISO_8859-13 LATIN7 L7 CSISOLATIN7 BALTIC
ISO      8859-14        ISO-8859-14 ISO-IR-199 ISO8859-14 ISO_8859-14 LATIN8 L8 CSISOLATIN8 CELTIC ISO-CELTIC
ISO      8859-15        ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 LATIN9 L9 CSISOLATIN9
ISO      8859-16        ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 LATIN10 L10 CSISOLATIN10
DOS      Codepage 437   437 CP437 IBM437 CSPC8CODEPAGE437 DOS
DOS      Codepage 850   850 CP850 IBM850 CSPC850MULTILINGUAL OS2
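The idea of several case-insensitive descriptors per codepage can be sketched in Python. This is not UDO's implementation: the table ALIASES, the function resolve_descriptor and the choice of Python codec names as targets are assumptions made for illustration, and only a small excerpt of the descriptors listed above is included.

```python
# Hypothetical alias table: an excerpt of the descriptors listed above,
# mapped to Python codec names (an assumption for illustration only).
ALIASES = {
    "UTF-8": "utf-8", "UTF8": "utf-8",
    "CP1252": "cp1252", "MS-ANSI": "cp1252",
    "WINDOWS-1252": "cp1252", "WIN": "cp1252",
    "ISO-8859-1": "iso8859-1", "LATIN1": "iso8859-1", "L1": "iso8859-1",
    "437": "cp437", "CP437": "cp437", "IBM437": "cp437", "DOS": "cp437",
}


def resolve_descriptor(descriptor: str) -> str:
    """Return the codec name for a descriptor, ignoring upper/lower case."""
    key = descriptor.strip().upper()
    if key not in ALIASES:
        raise ValueError(f"unknown charset descriptor: {descriptor!r}")
    return ALIASES[key]


# Different spellings of the same codepage resolve to the same codec.
print(resolve_descriptor("latin1"), resolve_descriptor("Win"))
```

Folding the descriptor to upper case before the lookup is what makes "latin1", "Latin1" and "LATIN1" equivalent in this sketch.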

Important: Before version 7, UDO assigned Windows codepage 1252 to the descriptor latin1. Since version 7, latin1 correctly denotes ISO-8859-1. If your older UDO documents use latin1, you should therefore change the descriptor to e.g. cp1252!
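Why the distinction matters can be seen outside UDO with a few lines of Python. The sketch below uses Python's own codecs cp1252 and iso8859-1 as stand-ins for the two codepages; it says nothing about how UDO itself decodes files.

```python
# Bytes 0x80-0x9F are where the two encodings differ.
euro_byte = b"\x80"
as_cp1252 = euro_byte.decode("cp1252")      # printable euro sign
as_latin1 = euro_byte.decode("iso8859-1")   # C1 control character U+0080

# Bytes from 0xA0 upward decode identically in both encodings.
umlaut_byte = b"\xe4"
same_umlaut = umlaut_byte.decode("cp1252") == umlaut_byte.decode("iso8859-1")

print(repr(as_cp1252), repr(as_latin1), same_umlaut)
```

A document that contains only letters such as ä or ß looks the same under both descriptors; the difference shows up with characters such as the euro sign or typographic quotes, which Windows codepage 1252 places in the range 0x80 to 0x9F.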

When you work with so-called 1-byte codepages (all codepages supported by UDO except Unicode) and use one codepage for your UDO documents but a different one for your output documents, keep in mind that each codepage contains a different set of characters. A codepage is a collection of 256 characters selected from the whole range of characters defined in the Unicode standard.
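As a small illustration, assuming Python's built-in codecs correspond to the codepages of the same name, the single byte 0xE4 stands for a different character in each of four codepages from the list:

```python
BYTE = b"\xe4"

# Decode the same byte with four different 1-byte codepages.
for codec in ("iso8859-1", "cp437", "cp1251", "iso8859-7"):
    char = BYTE.decode(codec)
    print(f"{codec:10} 0xE4 -> {char} (U+{ord(char):04X})")
```

The byte value alone therefore does not identify a character; a converter also needs to know which codepage the file was written in.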

Imagine you have created a UDO document in the DOS encoding that uses DOS graphic characters, but your target format uses e.g. Apple MacRoman. The DOS graphic characters will then not appear in the output, because MacRoman does not contain them. Likewise, if you have used the Hebrew letters from the Atari TOS encoding, you will not find them in most other codepages.

In these cases we recommend using UTF-8, if it is available for the target format. Internally, UDO keeps the characters of all codepages in Unicode format, so you can use e.g. the Hebrew Alef from the TOS character set and see it rendered properly in UTF-8 and in Windows codepage 1255.
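Whether a given character survives in a target codepage can be checked with a short Python sketch. The helper can_encode is hypothetical, and Python's codecs are used as stand-ins for the target codepages; Python ships no Atari TOS codec, so the characters are written as Unicode code points.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


ALEF = "\u05d0"        # Hebrew letter Alef
BOX_CORNER = "\u250c"  # box-drawing corner, one of the DOS graphic characters

# One line per target encoding: codec, Alef available, box corner available.
for codec in ("utf-8", "cp1255", "cp1252", "cp437", "mac_roman"):
    print(f"{codec:10} {can_encode(ALEF, codec)!s:6} {can_encode(BOX_CORNER, codec)}")
```

UTF-8 is the only target in this list that can represent both characters, which is the reason for the recommendation above.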

Convert multiple files to Unicode
If you want to convert older project files from a 1-byte codepage to UTF-8 but don't want to convert every single file by hand, the Unix command iconv can help. It is usually available on Unix machines and on Mac OS X.
Here is a simple example that converts any number of files with the suffix *.cs recursively (i.e. in any number of subfolders) from Czech (e.g. encoded in ISO-8859-2) to UTF-8 in one go, using the bash shell in the Terminal application on Mac OS X. Note the single quotes: they keep the shell from expanding *.cs and $1 itself, so that find can run the inner command once for each file, even if a file name contains spaces.
find . -name '*.cs' -exec sh -c 'iconv -f ISO-8859-2 -t UTF-8 "$1" > "$1.utf8" && mv "$1.utf8" "$1"' sh {} \;
The conversion cannot write directly to the same file, because the shell would empty the file before iconv reads it. The command therefore writes to a temporary *.utf8 file and, only if iconv succeeded, renames it to the original file name, which replaces the original file.
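Where iconv is not available, the same temporary-file approach can be sketched in Python. The function convert_tree and its parameter names are hypothetical, and the defaults simply mirror the Czech example; treat it as a starting point and try it on a copy of your project first.

```python
import os
import shutil
import tempfile
from pathlib import Path


def convert_tree(root, pattern="*.cs", source="iso8859-2", target="utf-8"):
    """Re-encode every file matching pattern below root; return the paths."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # Raises UnicodeDecodeError if the bytes are not valid in `source`.
        text = path.read_bytes().decode(source)
        # Write to a temporary file in the same folder first, so the
        # original stays untouched until the new content is complete.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".utf8")
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(text.encode(target))
        shutil.copymode(path, tmp_name)  # keep the original permission bits
        os.replace(tmp_name, path)       # swap the converted file in
        converted.append(path)
    return converted
```

Calling convert_tree(".") from the project folder would convert all *.cs files below it. Run it only once per file set: a second run would decode the already converted UTF-8 bytes as ISO-8859-2 again and garble every non-ASCII character.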

Copyright © (Contact)
Last updated on May 19, 2014
