When is Unicode not Unicode? When Microsoft gets involved!

Windows programmers of the C/C++ variety, how many of you realise that since Windows 9x Microsoft has been lying to you about what constitutes Unicode? They would have you believe that Unicode requires you to use a WCHAR (wide) character type and that Unicode cannot be represented by a CHAR (narrow) character type. In fact, both of these statements are completely and utterly false. Microsoft has misled you in the most egregious way.

Before we go any further, I need to clarify some terminology that is often confused. This is especially true of Windows programmers who, quite often, mistakenly believe that using a wide character type means they are using Unicode:

Character Set: This is a complete set of characters recognized by the computer hardware and software.

Character Encoding: This is a way of encoding a character set, generally to fit within the boundaries of a particular data type. ASCII, ANSI and UTFx are all examples of character encodings.

Character Type: This is a fundamental data type used to represent a character.

These three things are intrinsically related. The character type chosen to represent a character set has a direct impact on the character encoding used. In C++, the usual fundamental character types are wchar_t (wide) and char (narrow). The sizes of the narrow and wide types are platform dependent, although C++11 has introduced fixed-size character types. For the purposes of this discussion, it being Windows centric, we will assume wide is 16 bits and narrow is 8 bits.
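
To make that concrete, here is a tiny sketch of my own (nothing standard about the output format) that prints the sizes of the character types in play; on a typical Windows/MSVC build you would expect 1, 2, 2 and 4 bytes, while on Linux/GCC wchar_t is usually 4 bytes:

```cpp
#include <cstdio>

int main()
{
    // Platform dependent: char is 1 byte everywhere, but wchar_t is
    // 2 bytes on Windows/MSVC and typically 4 bytes on Linux/GCC.
    std::printf("char     : %zu byte(s)\n", sizeof(char));
    std::printf("wchar_t  : %zu byte(s)\n", sizeof(wchar_t));

    // C++11 fixed-size character types: always 16 and 32 bits wide.
    std::printf("char16_t : %zu byte(s)\n", sizeof(char16_t));
    std::printf("char32_t : %zu byte(s)\n", sizeof(char32_t));
}
```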

Unicode code points need up to 21 bits, which in practice means a 32 bit data type. That’s it, the end. If you want to work with raw Unicode code points you have no choice but to use a 32 bit character type. That said, some very clever people who work for The Unicode Consortium realised that the majority of the western world uses the Latin alphabet, most of which can be represented using just 8 bits. The majority of the rest of the world uses characters that can be represented in 16 bits. The remainder requires more than 16 bits. On that basis, forcing the world to adopt 32 bit character types would be, for most of us, completely insane. Of course, the same could be said for 16 bits… eh, Microsoft?

Those clever people went on to invent a number of Unicode Transformation Formats (character encodings) that allow Unicode to be encoded using character types smaller than 32 bits. The most common of these are UTF16 and UTF8, although other, less common, encodings do exist. These formats represent each 32 bit code point as a sequence of one or more 16 or 8 bit units. Of the two, UTF8 is by far the more efficient for the majority of cases and has the advantage of being directly backwards compatible with systems designed to only use ASCII (meaning old, ASCII-only programs will just work).
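
To show what that multi-unit encoding actually looks like, here is a deliberately simplified UTF8 encoder sketch of my own (no validation of surrogates or out-of-range values, purely illustrative) that packs a single code point into 1 to 4 bytes:

```cpp
#include <cstdint>
#include <vector>

// Simplified UTF-8 encoder: packs one Unicode code point into 1-4 bytes.
// Illustrative only: surrogate code points and values above U+10FFFF
// are not rejected as they should be in production code.
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    auto byte = [&](std::uint32_t b) { out.push_back(static_cast<std::uint8_t>(b)); };

    if (cp < 0x80) {                 // up to 7 bits  -> 1 byte (plain ASCII)
        byte(cp);
    } else if (cp < 0x800) {         // up to 11 bits -> 2 bytes
        byte(0xC0 | (cp >> 6));
        byte(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // up to 16 bits -> 3 bytes
        byte(0xE0 | (cp >> 12));
        byte(0x80 | ((cp >> 6) & 0x3F));
        byte(0x80 | (cp & 0x3F));
    } else {                         // up to 21 bits -> 4 bytes
        byte(0xF0 | (cp >> 18));
        byte(0x80 | ((cp >> 12) & 0x3F));
        byte(0x80 | ((cp >> 6) & 0x3F));
        byte(0x80 | (cp & 0x3F));
    }
    return out;
}
```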

Unfortunately, Microsoft decided to jump on the Unicode bandwagon without really thinking things through and, in their infinite wisdom, decided to adopt UTF16 as the standard encoding format for Unicode on the Windows platform. Frankly, this couldn’t have been a worse decision, and it is one that has plagued Windows programmers the world over ever since. The rest of the sane world realised that UTF16 was just stupid and decided to use UTF8. Amazingly enough, the rest of the world has no significant problems writing programs that work with Unicode in a portable fashion. Windows on the other hand… um… no!

The reason UTF16 makes no sense is because not only is it very wasteful for the majority of us who just use plain old ASCII most of the time, it’s also a real pain to use, especially if you want to be able to generate data that is portable and can be used cross-platform. You see, each code unit in a UTF16 encoding is larger than a byte, and so the storage and retrieval of text encoded in this format requires that the reader be able to identify and (if necessary) convert the data to the correct endianness for the platform.
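
As an illustration of what that identification step involves, here is a rough sketch (the function name and the Unknown fallback are my own choices) that peeks at a file’s byte order mark to decide which flavour of UTF16 it holds:

```cpp
#include <fstream>
#include <string>

enum class Utf16Endian { LittleEndian, BigEndian, Unknown };

// Peek at the first two bytes of a file and look for a UTF-16 BOM:
// FF FE means little endian, FE FF means big endian.
Utf16Endian detect_utf16_endianness(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char bom[2] = {0, 0};
    if (!in.read(reinterpret_cast<char*>(bom), 2))
        return Utf16Endian::Unknown;

    if (bom[0] == 0xFF && bom[1] == 0xFE) return Utf16Endian::LittleEndian;
    if (bom[0] == 0xFE && bom[1] == 0xFF) return Utf16Endian::BigEndian;
    return Utf16Endian::Unknown;   // no BOM: the encoding has to be guessed
}
```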

Also, most legacy programs are written using narrow character types and so to “port” these over to use Unicode means making major changes to the code base to use wide character types. Now, this might sound like a simple “search and replace” but it’s really not. In a language such as C or C++, where the programmer, and not the compiler, is completely responsible for preserving the integrity of memory, introducing a larger data type without reviewing each and every change to make sure it doesn’t bust data boundaries is coding suicide. Basically, the decision to use UTF16 meant that all existing code had to be broken and then fixed to be international friendly. That is a huge cost to business, and so most just didn’t (and don’t) bother!

Further, regardless of what Microsoft would have you believe, UTF16 is still a variable-length, multi-unit format because you can’t represent the full set of code points using single 16 bit units. Sure, the majority of commonly used code points will fit into 16 bits, but it is a lie to say that Unicode can be represented by single wide character types. It’s just impossible. 32 bits into 16 bits does not fit! A quart does not fit into a pint pot! The use of “Unicode” in Windows is a broken promise that just makes life oh so unnecessarily hard for the software engineer.
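
For the doubters, here is the arithmetic in code form: any code point above U+FFFF has to be split into a UTF16 surrogate pair, so a single character costs two 16 bit units. A minimal sketch (the naming is mine):

```cpp
#include <cstdint>
#include <utility>

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
// For example U+1F600 (a common emoji) becomes 0xD83D, 0xDE00:
// one "character" needs two 16-bit units.
std::pair<char16_t, char16_t> to_surrogate_pair(std::uint32_t cp)
{
    cp -= 0x10000;                                                 // 20 bits remain
    char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
    return { high, low };
}
```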

By contrast, other platforms (Linux, for example) use UTF8 natively. This means that all data can be stored and retrieved using narrow types. Because UTF8 is a byte level encoding format it has no sense of endianness and so is easy to port between different platforms. It is also way more efficient than UTF16 because, in the common case of plain ASCII text, each character requires only 1 byte (rather than 2). Even for text that isn’t pure ASCII it’s normally still more efficient, because any ASCII characters mixed in still cost only one byte each and most other common characters need just two or three bytes. UTF8 is a highly efficient encoding format; UTF16 is just not!
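
To put rough numbers on that, this toy example (compiled as C++11/14/17; C++20 changes the element type of u8 literals, so it would need tweaking there) counts how many bytes UTF8 spends on a few sample characters, each of which would cost 2 bytes in UTF16:

```cpp
#include <cstdio>
#include <cstring>

int main()
{
    // u8 string literals are UTF-8 encoded by the compiler (C++11 onwards).
    // strlen() counts bytes, not characters.
    std::printf("U+0041 (A)         : %zu byte(s)\n", std::strlen("A"));         // ASCII -> 1
    std::printf("U+00E9 (e-acute)   : %zu byte(s)\n", std::strlen(u8"\u00E9"));  // 2 bytes
    std::printf("U+20AC (euro sign) : %zu byte(s)\n", std::strlen(u8"\u20AC"));  // 3 bytes
}
```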

When Microsoft talks of Unicode please don’t be confused. They are NOT referring to Unicode, they are referring to the UTF16 encoding format. They do this because they want the world to believe that only 16 bits and UTF16 can be used to represent Unicode. They do this because they don’t want you to know how stupid they were to decide to use this pointless encoding format. Yes, this is a way to represent Unicode, but it is not the only way and from a software engineering point of view it is quite probably the stupidest way.

Further, whereas the rest of the sensible software engineering world uses UTF8 as a narrow character encoding format to represent Unicode, Microsoft insists on sticking with ANSI Code Pages. Unlike UTF8, these cannot represent the full range of the Unicode character set and, worse, unless you know the original code-page you have absolutely no idea what the encoding format actually represents. You may as well be working with a file of random binary, because that’s about as useful as an ANSI format file with no code-page information would be. It wouldn’t be so bad if Microsoft offered UTF8 as a native encoding format, but to date, this isn’t the case. It’s UTF16, ANSI or nothing!

So, Windows programmers, when you start talking about your project being “Unicode” please remember that to the rest of the sane world this phrase is meaningless. All you are saying is that your project uses wide rather than narrow data types for representing characters, and you just so happen to have been fooled into using UTF16 when you could just as easily have used UTF8. That’s right, you don’t have to use UTF16, even on Windows, to be Unicode friendly; you can use UTF8. There, I said it. The secret is out! I always code all my projects using narrow character types and, internally, I work with UTF8. I only convert (on Windows) to UTF16 when I absolutely have to (at the system API boundary).
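
In case you are wondering what converting at the API boundary looks like, here is a hedged sketch using the Win32 MultiByteToWideChar call with CP_UTF8; the helper name and the bare-bones error handling are my own:

```cpp
#include <string>
#include <windows.h>

// Convert a UTF-8 string to UTF-16 just before calling a wide Win32 API.
// Minimal sketch: real code would report failures rather than return "".
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    const int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                          static_cast<int>(utf8.size()), nullptr, 0);
    if (len <= 0) return std::wstring();

    std::wstring wide(static_cast<size_t>(len), L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                          static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// Usage: keep UTF-8 internally, widen only at the call site, e.g.
//   ::SetWindowTextW(hwnd, utf8_to_utf16(title).c_str());
```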

But why do this? Doesn’t that make life hard? Good question. Yes and no. Yes, because it means at some point I still have to convert to UTF16. No, because C++11 provides nice, efficient tools to do this conversion, so it is pretty painless. What it does mean is that my code will work on any platform. By using a platform agnostic encoding, which UTF8 is, my code will run just as well on Windows as it will on Linux or OS X.
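
One way to do that conversion with nothing but the C++11 standard library is std::wstring_convert with std::codecvt_utf8_utf16 from <codecvt>; treat this as a sketch rather than a recommendation, not least because the committee later deprecated these classes in C++17:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 <-> UTF-16 using only the C++11 standard library.
std::u16string to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```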

For more reading on why we should all be using UTF8, why forcing us to use UTF16 is just silly and why Microsoft owes us all a very large apology for the mess they have made of “Unicode” on Windows, I highly recommend taking a look at the excellent UTF8 Everywhere website.
