Humanities Computing Testbed
The future can be best predicted by inventing it.
October 2004 Archives
As I am preparing a LaTeX seminar, I am evaluating several free introductory texts. I will have to pick one of them, but here is a list of several good candidates:
Visit the Humanities Computing Testbed or email Oliver your question and the answer will be included here.
Unicode is a standard that allows us to encode many, and in theory all, scripts of the world using the same encoding method. Historically, there have been many different character encodings, even for the same language. English and other languages written in the Latin script are often encoded in ASCII, but some large IBM computers use the EBCDIC encoding. Ten different mutually incompatible encodings are necessary for the languages of Europe written in the Latin script alone; the last addition was a new encoding that contains the Euro sign. In addition to the multitude of standard encodings, many computer programs use their own encodings, incompatible with anything else. Whenever you want to move a file from one computer to another that expects a different encoding, you have to reencode it, and you may lose data if the source encoding does not map unambiguously to the target encoding. Unicode promises a lasting solution to this problem:
“Unicode provides a unique
number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.”
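The reencoding problem described above can be made concrete with a small Python sketch. It assumes Python's standard `latin-1` and `iso8859_15` codecs as stand-ins for two of the European Latin-script encodings; the variable names are only illustrative.

```python
# The same byte means different things in two legacy encodings.
raw = bytes([0xA4])

as_latin1 = raw.decode("latin-1")      # ISO 8859-1: generic currency sign
as_latin9 = raw.decode("iso8859_15")   # ISO 8859-15: Euro sign

print(as_latin1, as_latin9)

# Reencoding can fail outright: Latin-1 has no Euro sign at all.
try:
    "\u20ac".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

# A Unicode encoding such as UTF-8 can hold both characters at once.
text = "\u00a4\u20ac"
round_trip = text.encode("utf-8").decode("utf-8")
```

The byte value never changes; only the interpretation does, which is why a file moved between systems without reencoding can silently show the wrong character.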
Here are some documents you might find interesting:
What is Unicode? A very short non-technical introduction to Unicode. On this page you'll also find links to translations of this document in many different languages, such as Swedish or Thai, that you can use to check whether your web browser or another application you're using displays a certain character correctly.
The Unicode® Standard: A Technical Introduction is a very short technical introduction to Unicode and the UTF formats used to represent it.
Where is my Character? contains useful information if you want to represent a certain character on your computer with Unicode (and Unicode almost certainly will have a way to represent your character) but don't know how that character is encoded.
And finally there's the Unicode standard itself. This is not something you will want to consult every day, but it is the authoritative source on all things Unicode.
The 8-bit encoding (or the Y2K) problem all over again: Unicode in its full glory uses a number between 0 and 1114111 for each character. Now 1114111 is a fairly large number, and many makers of supposedly Unicode-compliant software don’t think that you really need that many different characters. Therefore they only support the first 65536 characters (the ones that fit into 16 bits) of the Unicode encoding. These first 65536 characters make up what is called the Basic Multilingual Plane (BMP), which contains all the characters necessary for the world’s common languages. However, the characters for some rare scripts of mainly scholarly interest, such as Gothic or Shavian, are assigned numbers outside of the Basic Multilingual Plane, and programs that call themselves ‘Unicode-compliant,’ but really support only the BMP, won’t be able to handle these scripts.
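As a rough illustration of the BMP limit, the following Python sketch looks at one Gothic letter. The choice of U+10330 and the variable names are just for the example.

```python
# The highest Unicode code point is 1114111 (hexadecimal 10FFFF).
MAX_CODE_POINT = 0x10FFFF

# A Gothic letter, assigned a number outside the BMP.
gothic_letter = "\U00010330"
code_point = ord(gothic_letter)

# The BMP ends at 65535 (hexadecimal FFFF).
in_bmp = code_point <= 0xFFFF

# In UTF-16 the letter needs two 16-bit units (a surrogate pair),
# which is where software that assumes "one character = 16 bits" breaks.
utf16 = gothic_letter.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(code_point, in_bmp, [hex(u) for u in units])
```

A program that treats each 16-bit unit as a character will see two meaningless halves here instead of one letter.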
There may be more than one representation for a given character. This makes it difficult to search for a given text. For example, the character ‘Ä’ may be represented by the number 196 (Unicode inherited its first 256 characters, numbered 0 to 255, from Latin-1), or it may be represented by the character ‘A’ with the number 65 followed by the combining diaeresis character with the number 776. If you want to search and replace in Unicode-encoded texts, you might want to consider ‘normalizing’ your Unicode, i.e. making sure that each appearance of a character is represented in the same way. The unicode.org FAQ has a section on Unicode normalization.
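Here is a minimal sketch of normalization in Python, using the standard `unicodedata` module. The sample word and variable names are made up for the example.

```python
import unicodedata

precomposed = "\u00c4"     # 'Ä' as the single number 196
decomposed = "A\u0308"     # 'A' (65) followed by combining diaeresis (776)

# The two look identical on screen but do not compare equal.
same_before = precomposed == decomposed

# NFC prefers the composed form, NFD the decomposed form.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", precomposed)

# A search only works reliably once both sides are normalized the same way.
document = "A\u0308rger"   # decomposed spelling of 'Ärger'
query = "\u00c4rger"       # precomposed spelling of the same word
found_raw = query in document
found_normalized = (
    unicodedata.normalize("NFC", query) in unicodedata.normalize("NFC", document)
)

print(same_before, found_raw, found_normalized)
```

Which form you normalize to matters less than applying the same form to both the text and the search term.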
One glyph, as in the example with the letter ‘Ä,’ might be represented by a sequence of characters. Therefore you cannot simply cut and paste arbitrary pieces of Unicode text; if you cut the combining diaeresis without the letter ‘A’ and pasted it somewhere else, it would combine with whatever character happens to precede it there. In addition, if your Unicode text is encoded in UTF-8 or UTF-16, you cannot rely on byte or word boundaries being character boundaries. To avoid these problems, only manipulate Unicode-encoded text with programs or libraries that can take care of all of the Unicode handling for you.
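Both pitfalls can be shown in a few lines of Python. This is only a sketch; the variable names and the letter ‘o’ used as the paste target are illustrative choices.

```python
import unicodedata

# 'Ä' (number 196) occupies two bytes in UTF-8.
encoded = "\u00c4".encode("utf-8")

# Cutting between those two bytes leaves an invalid fragment.
try:
    encoded[:1].decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

# Cutting at a character boundary yields valid text, but it can still
# separate a combining mark from its base letter.
decomposed = "A\u0308"        # 'A' + combining diaeresis
stray_mark = decomposed[1:]   # the diaeresis on its own
pasted = "o" + stray_mark     # pasted after an 'o', it attaches to the 'o'

print(len(encoded), split_is_valid, unicodedata.normalize("NFC", pasted))
```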
If you want to enter a rare character every once in a while and you’re using one of the usual office suites, the menu item ‘Insert/Special Character’ will give you a list of all the odd characters your font can display and let you select the character you want.
If you often want to enter text in a language not covered by the U.S. keyboard layout, you might want to switch keyboard layouts. All modern graphical user environments allow you to switch from one keyboard layout to another, usually by pressing a certain key combination or by clicking on a little flag icon somewhere on the screen. If you don’t like the idea of having to type Greek on an American keyboard, you might consider plugging a Greek USB keyboard into your computer in addition to the American one.
There’s also a clever way of entering unusual characters that originated in the Unix world, but is also available for other graphical user environments: You have a key called ‘Compose’ on your keyboard; or, since most likely you don’t really have that key, you define another one, such as the right ‘Ctrl’ key, to be your ‘Compose’ key. In order to enter a character that looks like a combination of two other characters, you press ‘Compose’ and then the keys of the characters of which your desired character is composed. Thus, ‘Compose’ + ‘A’ + ‘'’ gives the letter ‘Á,’ ‘Compose’ + ‘e’ + ‘`’ gives the letter ‘è,’ ‘Compose’ + ‘s’ + ‘s’ gives the letter ‘ß,’ and so on.
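Conceptually, the ‘Compose’ mechanism is a lookup table from key sequences to characters. The Python sketch below models just the three sequences mentioned above; the table and the `compose` function are hypothetical and are not how any particular system stores its Compose definitions.

```python
# Hypothetical miniature Compose table: (first key, second key) -> character.
COMPOSE_TABLE = {
    ("A", "'"): "\u00c1",   # Á
    ("e", "`"): "\u00e8",   # è
    ("s", "s"): "\u00df",   # ß
}


def compose(first, second):
    """Return the composed character, or None if the sequence is unknown."""
    return COMPOSE_TABLE.get((first, second))


print(compose("A", "'"), compose("e", "`"), compose("s", "s"))
```

Real Compose tables are far larger and are configured by the operating system or desktop environment.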