8.25.2009

Universal computer language

I was wondering what people who write in languages other than English (especially languages that don't use the same alphabet) do when they want to build websites. Well, I found out that they use UTF-8:

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.
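One way to poke at that sentence is to count the bytes yourself. Here is a minimal sketch, assuming Python 3; the sample characters and the helper name `utf8_length` are my own picks for illustration, not anything from the quoted description.

```python
# Sample characters chosen so that each needs a different number of bytes.
samples = [
    "A",           # U+0041, plain US-ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, grinning face emoji
]

def utf8_length(ch):
    """Return how many bytes (octets) UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print("U+%04X takes %d byte(s)" % (ord(ch), utf8_length(ch)))
```

Only the first sample is one of the 128 US-ASCII characters, so it is the only one that fits in a single byte.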

I have no idea what they're talking about. And the more detailed description is even more baffling:

The UTF-8 encoding is variable-width, ranging from 1-4 bytes. Each byte has 0-4 leading 1 bits followed by a zero bit to indicate its type. N 1 bits indicates the first byte in an N-byte sequence, with the exception that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a continuation byte in a multi-byte sequence (this was done for ASCII compatibility). The scalar value of the Unicode code point is the concatenation of the non-control bits.
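The bit-juggling in that description can be written out by hand. What follows is only a rough sketch, assuming Python 3; `encode_utf8` and `bits` are hypothetical helper names I made up, and in real code you would simply call the built-in `str.encode("utf-8")`.

```python
def encode_utf8(code_point):
    """Hand-rolled UTF-8 encoder for a single code point (illustration only)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:
        # Zero leading 1 bits: a one-byte sequence, same as ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # Leading 110: first byte of a two-byte sequence.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Leading 1110: first byte of a three-byte sequence.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # Leading 11110: first byte of a four-byte sequence.
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is outside the Unicode range")

def bits(data):
    """Show each byte as eight binary digits so the leading bits are visible."""
    return " ".join(format(b, "08b") for b in data)

# The euro sign, U+20AC, needs three bytes.
print(bits(encode_utf8(0x20AC)))
```

Every byte that starts with 10 is a continuation byte carrying six payload bits. Stripping the leading marker bits and gluing the rest together gives back the code point, which is what "concatenation of the non-control bits" means.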

Well, here's something that's comprehensible, if your screen can display it all: a web page that has been "encoded directly in UTF-8", which explains why you might not be able to see some of the languages.

Anyway, it's pretty cool that people are able to program in a universal computer language. Too bad my brain isn't big enough to understand it :(
