About
Codejunkie
Monologues of a mobile retro coder.
skeezix[at]codejedi.com
www.codejedi.com
(With apologies to Stanislaw Lem for butchering a line of his.) (Rant on.) It is probably fortunate for the industry that many of today's computing technologies were birthed in the west -- or more to the point -- birthed in a location where the local alphabet, counting system, punctuation marks and hints can all fit within a single byte of memory[1], using a character set called ASCII[2] (in nerdlish.) Sure, it's made my life easier with all the popular programming languages having their keywords in English, but my point is that it probably assisted growth in the field: nobody had to tackle pictographic languages on day one, so inventors could instead focus on getting the machine to display to a screen instead of a lineprinter.
Okay, so I don't lose anyone, a few definitions up front. A single-byte character set can encode at most 256 different character glyphs. An example is pure ASCII, which fits within a single byte and encodes 'A' separately from 'a', as well as including punctuation and numbers. This neatly covers the entire English language, and with a few tweaks can include French, German, Spanish and others quite handily.
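To make the single-byte idea concrete, here is a minimal Python sketch. The variable names and the `fits_in_ascii` helper are mine, made up for illustration, and I'm using ISO-8859-1 (Latin-1) as one example of the "few tweaks" that squeeze accented letters into the upper half of the byte.

```python
# ord() gives the number a character is stored as; 'A' and 'a' get separate codes.
upper_a = ord("A")   # 65
lower_a = ord("a")   # 97

# Pure ASCII only defines values 0-127, so each character is exactly one byte.
ascii_bytes = "Hello World".encode("ascii")

# Latin-1 uses the values 128-255 for accented letters, still one byte each.
# "J\u00f6hn" is John spelled with an o-umlaut.
latin1_bytes = "J\u00f6hn".encode("latin-1")


def fits_in_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character is in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

Plain "John" passes `fits_in_ascii`, while the umlaut version does not, even though Latin-1 still stores it in four bytes.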
On the other hand, a language based upon written symbols per word instead of symbols per letter or sound will require thousands of glyphs to encode it on paper, and of course this needs to be translated into something a computer can deal with. As a result, numerous different technologies evolved to encode all these symbols into something a computer program can read and write coherently, most of which take up more than one byte of memory per symbol. The Unicode specification is an attempt to unify these under a few well-documented encodings so that applications have a reasonable shot at talking to each other and getting the text right. Applications from one area can almost always talk to each other, but the difficulty kicks in when trying to talk between applications or datasets from different geographical locations.
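A rough illustration of the "more than one byte" point, again as a Python sketch. The three encodings are just ones I picked as examples (two Unicode encodings plus Shift-JIS, an older Japanese one); the text above doesn't single them out.

```python
# Three kanji: the word for the Japanese language.
text = "\u65e5\u672c\u8a9e"

# The same three characters take a different number of bytes in each encoding.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "shift_jis")
}

# Reading Shift-JIS bytes as if they were UTF-8 is the "different geographical
# locations" problem in miniature: the bytes are not even valid UTF-8.
sjis_bytes = text.encode("shift_jis")
try:
    sjis_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

Here `sizes` comes out as 9 bytes for UTF-8 and 6 each for UTF-16 and Shift-JIS, and the cross-decode fails outright. When it doesn't fail, you get garbage instead, which is arguably worse.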
It's the same as when two people who speak different languages use the same 'sound' to represent different concepts. Hilarity ensues.
Traditional programming languages like 'C' include built-in support for simple strings of characters, such as "Hello World", and so an enormous volume of code in the world (that is still being written today) exists with the assumption of a simple single-byte character set such as ANSI. Newer (notice that I do not say 'better', since utility varies by task!) languages such as Java pretty much assume a Unicode undercarriage instead. Well and good if your source data is encoded in the same way as the application running it, such as data within the same program. When it talks to someone else though...
All well and good. Old programs can read and write 'John' just fine, and throw a hissy fit if you feed them a name with an umlaut in it. New programs written on top of a system based on Unicode can usually handle the umlaut just fine, since they just handle it like any other character. Further, when the application asks the system to reveal how long a string might be, the system functions are smart enough to say "even though we see this deep pile of data, we know that it's really just 'John' with an umlaut," so life is good.
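Here's a small sketch of that length question in Python, whose strings are Unicode-based. The `c_strlen` function is a hypothetical stand-in I wrote to mimic how C's `strlen()` counts bytes up to the first NUL; it is not a real library call.

```python
def c_strlen(buffer: bytes) -> int:
    """Count bytes up to the first NUL byte, the way C's strlen() would."""
    end = buffer.find(b"\x00")
    return len(buffer) if end == -1 else end


name = "J\u00f6hn"                   # 'John' with an o-umlaut

char_count = len(name)               # what a Unicode-aware system reports
utf8 = name.encode("utf-8")          # the umlaut becomes two bytes
utf16 = name.encode("utf-16-le")     # every character becomes two bytes

old_style_utf8_len = c_strlen(utf8 + b"\x00")
old_style_utf16_len = c_strlen(utf16 + b"\x00\x00")
```

The Unicode-aware count is 4. The byte-counting view says 5 for UTF-8, and for UTF-16 it stops after a single byte, because the second byte of 'J' is already zero. That last one is the hissy fit.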
The problems come out when you either have to 'go to the metal' and write tight fast code that reveals these inner workings, or you have to write code to support multiple operating systems (even different versions of the same operating system), or you have to translate between old-style data and new-style data. Or, like me, you have a person write to you who is half Jewish and half Japanese and wants your program to handle both right-to-left and top-to-bottom words on the same screen. No joke. Rough :)
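For the right-to-left half of that request, Unicode at least records a direction class per character. A minimal sketch using Python's standard `unicodedata` module follows; the helper names are mine. Note the assumption: this only detects mixed horizontal direction. Top-to-bottom text is a layout decision that this property says nothing about.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def mixes_directions(text: str) -> bool:
    """Hypothetical check: does the text contain both RTL and LTR characters?"""
    classes = set(bidi_classes(text))
    has_rtl = bool(classes & {"R", "AL"})   # Hebrew is 'R', Arabic is 'AL'
    has_ltr = "L" in classes                # Latin and kanji are both 'L'
    return has_rtl and has_ltr


hebrew = "\u05e9\u05dc\u05d5\u05dd"   # a Hebrew word (shalom)
japanese = "\u65e5\u672c"             # a Japanese word (Nihon)
mixed = hebrew + " " + japanese
```

`mixes_directions(mixed)` is true, so a renderer would have to reorder runs of characters on that line. Knowing that is the easy part; actually drawing it is where the rough comes in.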
As I alluded to above, the problems tend to come from history. Back in the day we were ignorant of other character sets, or hacked it by substituting one character set for another as long as it fit into the same amount of space, and so most of our modern technology is derived from these flawed but pragmatic assumptions. Of course, these technologies are still in widespread use today. With Palm OS for instance, we have an OS that assumes a single-byte character set, exchanging data with a desktop system that more likely assumes a Unicode character set. Worse still, given a stream of data there is no way to guess what encoding it is ... it could be English, Swahili or Martian for all an application knows -- it just has to decode it in the native way and hope to god it's correct. For instance, someone comes along and places their PDA in a cradle and presses hotsync, and the application gets a letter encoded with the number 65 -- is it 'A' as in ASCII and ANSI, or is it merely the beginning of a multi-byte encoded letter or what? The application would assume whatever is normal for a local person, but if the PDA was just flown over from elsewhere or is working with an email attachment from a foreign compatriot... dinnered.
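The "number 65" problem is easy to reproduce. In this Python sketch the byte values and encodings are examples I chose (Windows code page 1252 standing in for "ANSI", plus Shift-JIS and UTF-16); nothing here is specific to Palm OS or HotSync.

```python
# Two bytes where 65 (0x41) comes second.
raw = bytes([0x83, 0x41])
as_cp1252 = raw.decode("cp1252")       # a Western desktop sees two characters
as_shift_jis = raw.decode("shift_jis") # a Japanese one sees a single katakana

# Two bytes where 65 (0x41) comes first.
raw2 = bytes([0x41, 0x30])
as_ascii = raw2.decode("ascii")        # 'A' followed by the digit '0'
as_utf16 = raw2.decode("utf-16-le")    # one hiragana character, U+3041
```

Same bytes, four different answers, and every one of those decodes succeeds without complaint. Nothing in the data itself says which reading is right.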
Many frustrations will strike the developer; just a few of the ones that come to mind right off the top of my head...
Anyway, I don't want to go too too far into this; it's late at night while I'm writing most of this, and I was annoyed with Windows CE at the time for botching some operations because one system call returned data that another call should have accepted :) WCE makes a lot of things easy -- like developing for a desktop -- but it being closer to Win32 (thank god!) brings you back to the Dark Days of Encoding Hell. Most of the time you can stick to one format of data and it Works, but when you're sharing data with people from all around the world on different operating systems from DOS to Mac to Windows ... the hurt comes...
So there :)
(Rant off.)
Oh, and just to add insult to injury -- if you're a developer, ask yourself if you've made your applications Left-Handed friendly. This is something desktop developers rarely consider, but mobile application developers suffer over :)
[1] Strictly speaking, ASCII fits into 7 bits while a byte usually refers to 8 bits. Further, many machines exist which do not work in terms of 8-bit units, so a machine with extra bits could encode more complex alphabets. Again, I ask for slack to avoid turning this into a research paper :) But yes, I've worked with 20-bit machines and such so there :)
[2] ASCII has been exceedingly common over the last few decades, though there are numerous compatible variations such as "ANSI", not to mention the many "codepages" and other encodings such as EBCDIC.