Friday, August 26, 2005

Danny Thorpe on Unicode and VCL

As always, debates about Delphi and its future rage on in the borland.public.delphi.non-technical newsgroups. Feelings often run high, and speculation, FUD, trolling and flamewars are the order of the day - you have been warned ;).

One way to get at the nuggets of good, useful technical information in the newsgroups is to read them "vertically" - I often sort the newsgroup posts by poster name and read most or all posts from people I know from experience to be good posters.

One source of such enlightening posts that never disappoints is Borland's Chief Scientist Danny Thorpe. Recently he posted some interesting points about how he views the challenge of a Unicode-enabled native VCL (VCL for .NET already supports Unicode, of course). You can click the link to see Danny's full posts in context from the Google cache.

Here is my interpretation of what I understand are the main points:

  • Delphi .dfm files are already Unicode ready - strings are stored as UTF-8
  • Delphi already has the types required to support old and new code (Char/AnsiChar/WideChar
    and String/AnsiString/WideString)
  • Keeping both an Ansi VCL and a Unicode VCL means new 3rd party controls, numerous porting issues (char size etc.), duplicate IDE designers, and so on
  • For performance and memory usage reasons, WideString should be made reference counted (using OleStr for external calls)
  • It makes sense to keep the Win32 VCL Ansi, while targeting a Unicode VCL for a Win64 Delphi platform

Here are some quotes from Danny's posts (published here with permission, of course) - emphasis is mine:

"Danny Thorpe" wrote:
> Does anyone know if Borland ever plans to add Unicode support in VCL?

The main question is: How much compatibility are you willing to sacrifice to get a Unicode VCL? Unicode VCL for Win32 will not be fully compatible with many third party components out there. Unicode VCL for Win32 will require new component designers in the IDE that will not be compatible with Ansi VCL. Unicode VCL for Win32 will require new design interfaces in the IDE which will not be compatible with the existing design-time interfaces.

 [...]

Yes, we intend to produce a Unicode VCL. We already have in VCL.NET, and the only sane choice for 64 bit VCL is all-Unicode. The cost of adding Unicode support is less when you are starting with a new platform base which already has a compatibility barrier.

-Danny

Another quote:

"Danny Thorpe" wrote:
> The only thing you would have to do is update any literals stored in the DFM.

String literals in DFM files are already stored as UTF-8, a compressed Unicode encoding. UTF-8 looks like ANSI/ASCII for chars < 128. No DFM update utility is required.

> The break would be so minor, it shouldn't take more than a week to convert several hundred units.

The breaks I'm referring to run far deeper than DFMs. How much code do you have that runs through a PChar array by incrementing a pointer by one? In a Unicode world, PChar = PWideChar, which means each char is 2 bytes.

Similarly, any code that scans a string assuming that the first zero byte is the null terminator will fail with Unicode strings, because most Unicode chars (for English) have a zero high byte.

For most Win32 APIs involving string data, there are matching Ansi and Unicode definitions. But not all. Which of the Win32 APIs that you rely on today are not symmetric?

How much code do you have that is aware of multibyte character encodings for Middle Eastern or Far Eastern languages? In a Unicode world, most MBCS gymnastics are completely unnecessary and most are benign, but a few MBCS code patterns actually fail on Unicode. See the byte assumption above.

WideStrings are currently implemented by Delphi as OLEStr, aka BStrs, allocated using the SysAllocString Win32 API. These are not reference counted, and are rather promiscuous in copying themselves for every reference. Clearly, the Delphi WideString implementation needs to be changed to a reference counted WideString to save memory and performance if WideString is to become the primary string data type. But that means Delphi's WideString will have different allocation semantics from OleStr. Reference counted WideStrings will have to be converted to single-reference copies before being passed out of the application to Win32 APIs expecting PWideChar buffers.

Breaking the WideString = OleStr type alias means that all the Win32 APIs that are now declared as taking WideString will need to be changed to OleStr. We'll handle Windows.pas and the other Win32 API units we provide, but you will have to do the equivalent work on any other DLL API declarations your applications use. Until you find them all and fix them, your app will compile fine but will crash mysteriously at runtime. The compiler can't help you here because the compiler can't tell if the DLL you're calling actually expects OleStr or if it's a Delphi DLL that's actually expecting a Delphi reference counted WideString. The compiler has to rely on you to get the declarations right.

If your code and the components you use have been ported to Linux or .NET in the past, then chances are these kinds of things have already been found and modified to be char size agnostic.

Unicode VCL sounds like such a simple, little thread... until you start pulling on it.

-Danny
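
To make the zero-byte assumption concrete, here is a minimal sketch of my own (not code from Danny's post). The helper BytesBeforeFirstZero is hypothetical; it mimics Ansi-era code that walks a buffer one byte at a time looking for a null terminator. The sketch assumes UTF-16 data stored little-endian, as on Windows.

```pascal
program ZeroByteDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts the bytes before the first zero byte,
// the way byte-oriented Ansi code often searches for a terminator.
function BytesBeforeFirstZero(P: PAnsiChar): Integer;
begin
  Result := 0;
  while P^ <> #0 do
  begin
    Inc(P);        // advances by exactly one byte
    Inc(Result);
  end;
end;

var
  W: WideString;
begin
  W := 'Delphi';
  // View the UTF-16 buffer as raw bytes: 'D' is stored as $44 $00,
  // so the byte-wise scan stops after the very first byte.
  Writeln(BytesBeforeFirstZero(PAnsiChar(Pointer(PWideChar(W)))));
  // The real length, counted in 16-bit chars.
  Writeln(Length(W));
end.
```

The byte-wise scan reports 1 while Length reports 6, which is exactly the kind of silent truncation that char-size assumptions produce.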

Final quote:

"Danny Thorpe" wrote:
> many applications (particularly those connecting to external systems) can never be 100% unicode. They will always have a mix of unicode and non-unicode sections.

True, there is always a need to be able to specify which parts are wide and which parts are not. That's why we have 3 char types (AnsiChar, Char, WideChar) and 3 string types (AnsiString, String, and WideString). All those types continue to exist in Unicode places such as Delphi.NET and Kylix, but the definition of the middle one changes.

The issue is not that there is missing capability in the types. The issue in any port or redefinition of core semantics is that people very rarely write code that is multi-platform ready unless they are actually testing and debugging across multiple platforms. If you write your code to always use the never-changing types whenever you incorporate assumptions about char size, and always use size-flexible types when you should, then you'll have fewer porting issues. The issue is, people don't code that way unless they are being forced to.

> TNT unicode is probably the most used unicode solution

TNT is a good compromise, but it does not present a complete solution that includes design time support and architectural simplicity/uniformity.

> XChar/XString (8 or 16, depending on project options).

You already have types like that, and you have had them for 10 years. They are: AnsiChar/AnsiString, WideChar/WideString, and Char/String.

[…]

There's no need for an additional type. Other programming languages that span Ansi and Unicode have the same issue, and the same points of failure - code that was not written with both camps in mind.

The only languages that do not have this issue are those that don't support both camps. Java, for example, has always been taxed with memory consumption issues associated with having only Unicode strings. The .NET platform is fully Unicode (including Delphi.NET), so the only issue is code that was written prior to Unicode availability, and more recently code that was written in a Unicode context which fails to handle the more complicated world of Ansi and multibyte encoded character sets.

Corollaries to Murphy's Law: If something is adjustable, someone will adjust it incorrectly. If something has an option, someone will write code that does not handle that option correctly.

This is why I fight strongly against "just make it an option or a switch" solutions. The ideal is to have a single solution, so that there is no room to get it wrong. That's why I believe Unicode VCL is a better fit for something like a Win64 Delphi, because Unicode VCL would be the one and only 64 bit VCL. No flippin switchiness to add complexity to get between the programmer and his/her objective.

-Danny
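
As a small illustration of the fixed-size versus size-flexible types, here is a sketch of my own, under the assumption that the same source may be compiled where Char is AnsiChar as well as where Char is WideChar. The program name and the sample string are made up for the example.

```pascal
program CharSizeAgnostic;
{$APPTYPE CONSOLE}

var
  S: string;
begin
  S := 'VCL';
  // The never-changing types: safe to use when code assumes a char size.
  Writeln(SizeOf(AnsiChar));   // always 1
  Writeln(SizeOf(WideChar));   // always 2
  // The size-flexible type: compute byte counts with SizeOf(Char)
  // instead of hard-coding 1 or 2.
  Writeln(Length(S) * SizeOf(Char));
end.
```

The last line prints 3 where Char is AnsiChar and 6 where Char is WideChar, without any change to the source.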

Thanks, Danny! Keep them posts coming!

9 comments:

Anonymous said...

In regards to a refcounted WideString, perhaps an introduction of a WideString2 type that is refcounted would be better. This type would not break the OleStr compatibility and could be used in cases where performance is more critical.

IMO, it's much easier to do a search/replace on this and have the compiler able to help than it is to review all declarations/uses of WideString.

If WideString2 sounds too Oracleish, feel free to call it whatever you want.

Anonymous said...

# It makes sense to keep Win32 VCL Ansi, while targeting Unicode VCL for a Win64 Delphi platform



This certainly makes no sense to me. Unicode has been around in 32-bit Windows for over a decade now and I use it every day (Elpack). Give me a 32-bit Unicode-enabled Delphi and I'll order it YESTERDAY.


Arthur Hoornweg

Anonymous said...

"If WideString2 sounds too Oracleish, feel free to call it whatever you want."

UniString? :-) We could have Uni16String, Uni32String...

Anonymous said...

Using UTF-8 for all strings would not cause problems for most applications, as long as you don't need to fiddle with the case of characters.

Anonymous said...

Danny announces that dfm files store text as Unicode strings, but a short test done with the Delphi 2005 trial version demonstrates that this is only partially true for Win32 programs.

In my company, we develop scientific software. This software is used anywhere, including Japan, China, Korea...

So, we need to write English text containing characters like λ (lambda = wavelength), μm (micrometer) or Å (angstrom). The two Greek characters are stored correctly, but the angstrom character is changed to a question mark (?).

This test was done
.with the French version of Delphi 2005
.installed on a Japanese version of Windows XP
.creating a Win32 application.

Anonymous said...

Hi Danny,

Nice details and info about Delphi. Just a quick question: Do you know how you can store Unicode Chars/Bytes in an AnsiString? When I try to assign a WideString with Unicode Chars to an AnsiString, the chars get garbled. Is there a safer and cleaner way to assign a WideString or WideChar to an AnsiString?

Thanks in advance for all your help.

Regards

Hallvards New Blog said...

You can use the UTF8Encode function in the System unit to convert from a WideString to a UTF-8 encoded string stored in an AnsiString without losing any information. UTF8Decode converts in the other direction.
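
A minimal round-trip sketch, assuming a Win32 Delphi of this era where UTF8String is an alias for AnsiString. The sample text ('m' followed by a Greek lambda) is made up for the example.

```pascal
program Utf8RoundTrip;
{$APPTYPE CONSOLE}

var
  Wide, Back: WideString;
  Utf8: UTF8String;
begin
  Wide := 'm';
  Wide := Wide + WideChar($03BB);   // append Greek small lambda (U+03BB)

  Utf8 := UTF8Encode(Wide);         // WideString -> UTF-8 bytes
  Back := UTF8Decode(Utf8);         // UTF-8 bytes -> WideString

  Writeln(Length(Wide));            // 2 UTF-16 code units
  Writeln(Length(Utf8));            // 3 bytes: 1 for 'm', 2 for lambda
  Writeln(Back = Wide);             // TRUE: nothing was lost
end.
```

Note that a plain assignment from WideString to AnsiString converts through the system's Ansi code page instead, which is where characters outside that code page get lost.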

Anonymous said...

It's not that easy with WideStrings, because a Unicode code point (aka character) occupies up to 21 bits. You simply cannot store every code point in a 16-bit WideChar.

Luckily, the Unicode character set defines surrogates (which are not characters) in the range U+D800–U+DFFF. A pair of surrogates can encode a Unicode code point beyond the 16-bit boundary.

In the UTF-16 encoding, a code unit is either a complete code point or a surrogate, so every character within the BMP is encoded as one 16-bit code unit. All the others need a pair of 16-bit code units.

In Delphi we have OLEStrs (WideStrings), which are UCS-2 on Win9x and UTF-16 on WinNT+. UCS-2 is a kind of subset of UTF-16, so there is no compatibility break between them (thanks to the surrogates, which are used only by the UTF-16 encoding).

So how can we use WideString with code points beyond $FFFF?

I can't imagine using these without a 32-bit char type, let's say UnicodeChar. WideChar is too small to store every code point.

Concatenating a UnicodeChar to a WideString should add one or two code units, depending on the code point.

Also, when iterating over a WideString, here:

for i := 0 to Length(WS) - do
// WS[i] is a 16 bit code point

but there:

for c in WS do
// c is Unicode Char

So we still have backward compatibility, and when using for..in we are freed from manually (de)composing a Unicode code point from a code unit (or a pair of them).

And something from "Myth Busters":
- A Unicode char is not 16 bits

Anonymous said...

is:

for i := 0 to Length(WS) - do
// WS[i] is a 16 bit code point

should be:

for i := 0 to Length(WS) - 1 do
// WS[i] is a 16 bit code *unit*
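
For illustration, the surrogate arithmetic for a code point beyond $FFFF can be sketched as follows. The helper CodePointToUtf16 is hypothetical (it is not an RTL function), and it does no validation: it assumes the input is a valid code point no larger than $10FFFF and not itself a surrogate.

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: encodes one code point as one or two
// UTF-16 code units. No range validation is performed.
function CodePointToUtf16(CodePoint: Cardinal): WideString;
begin
  if CodePoint < $10000 then
    Result := WideChar(CodePoint)          // BMP: a single code unit
  else
  begin
    Dec(CodePoint, $10000);                // leaves a 20-bit value
    // Upper 10 bits go into the high surrogate ($D800..$DBFF),
    // lower 10 bits into the low surrogate ($DC00..$DFFF).
    Result := WideChar($D800 + (CodePoint shr 10));
    Result := Result + WideChar($DC00 + (CodePoint and $3FF));
  end;
end;

var
  WS: WideString;
  I: Integer;
begin
  WS := CodePointToUtf16($1D11E);          // U+1D11E, musical symbol G clef
  for I := 1 to Length(WS) do
    Writeln(IntToHex(Ord(WS[I]), 4));      // each WS[I] is a code unit
end.
```

This prints D834 and DD1E: one code point, two code units. WideString indexing in Delphi is 1-based, so the loop runs from 1 to Length(WS).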



Copyright © 2004-2007 by Hallvard Vassbotn