Danny Thorpe on Unicode and VCL
As always, debates about Delphi and its future rages on in the borland.public.delphi.non-technical newsgroups. Often feelings race high and speculations, FUD, trolling and flamewars is the order of the day - you have been warned ;).
One way of getting at some nibbles of good, technical and useful information from the newsgroups is to read them "vertically" - I find myself often arranging the newsgroup posts by poster name, and reading most or all posts of people I know from experience are good posters.
One never disappointing source of such enlightening posts is Borland's Chief Scientist Danny Thorpe. Recently he posted some interesting points about how he views the challenge of a Unicode enabled native VCL (VCL for .NET already supports Unicode, of course). You can click the link to see Danny's full posts in context from the Google cache.
Here is my interpretation of what I understand are the main points:
- Delphi .dfm files are already Unicode ready - strings are stored as utf-8
- Delphi already has the types required to support old and new code (Char/AnsiChar/WideChar
and String/AnsiString/WideString)
- Keeping both Ansi VCL and Unicode VCL means; new 3rd party controls, numerous porting issues (char size etc.), duplicate IDE designers, etc., etc.
- For performance and memory usage reasons, WideString should be made reference counted (using OleStr for external calls).
- It makes sense to keep Win32 VCL Ansi, while targeting UniCode VCL for a Win64 Delphi platform
Here are some quotes from Danny's posts (published here with permission, of course) - emphasis is mine:
Another quote:"Danny Thorpe" wrote:
> Does anyone know if Borland if ever plans to add Unicode support in VCL.The main question is: How much compatibility are you willing to sacrifice to get a Unicode VCL? Unicode VCL for Win32 will not be fully compatible with many third party components out there. Unicode VCL for Win32 will require new component designers in the IDE that will not be compatible with Ansi VCL. Unicode VCL for Win32 will require new design interfaces in the IDE which will not be compatible with the existing design-time interfaces.
[...]
Yes, we intend to produce a Unicode VCL. We already have in VCL.NET, and the only sane choice for 64 bit VCL is all-Unicode. The cost of adding Unicode support is less when you are starting with a new platform base which already has a compatibility barrier.
-Danny
Final quote:"Danny Thorpe" wrote:
> The only thing you would have to do is update any literals stored in the DFM.String literals in DFM files are already stored as UTF-8, a compressed Unicode encoding. UTF-8 looks like ANSI/ASCII for chars < 128. No DFM update utility is required.
> The break would be so minor, it shouldn't take more then a week to convert several hundred units.
The breaks I'm referring to run far deeper than DFMs. How much code do you have that runs through a PChar array by incrementing a pointer by one? In a Unicode world, PChar = PWideChar, which means each char is 2 bytes.
Similarly, any code that scans a string assuming that the first zero byte is the null terminator will fail with Unicode strings, because most Unicode chars (for English) have a zero high byte.
For most Win32 APIs involving string data, there are matching Ansi and Unicode definitions. But not all. Which of the Win32 APIs that you rely on today are not symmetric?
How much code do you have that is aware of multibyte character encodings for Middle Eastern or Far Eastern languages? In a Unicode world, most MBCS gymnastics are completely unnecessary and most are benign, but a few MBCS code patterns actually fail on Unicode. See the byte assumption above.
WideStrings are currently implemented by Delphi as OLEStr, aka BStrs, allocated using the SysAllocString Win32 API. These are not reference counted, and are rather promiscuous in copying themselves for every reference. Clearly, the Delphi WideString implementation needs to be changed to a reference counted WideString to save memory and performance if WideString is to become the primary string data type. But that means Delphi's WideString will have different allocation semantics from OleStr. Reference counted WideStrings will have to be converted to single-reference copies before being passed out of the application to Win32 APIs expecting PWideChar buffers.
Breaking the WideString = OleStr type alias means that all the Win32 APIs that are now declared as taking WideString will need to be changed to OleStr. We'll handle Windows.pas and the other Win32 API units we provide, but you will have to do the equivalent work on any other DLL API declarations your applications use. Until you find them all and fix them, your app will compile fine but will crash mysteriously at runtime. The compiler can't help you here because the compiler can't tell if the DLL you're calling actually expects OleStr or if it's a Delphi DLL that's actually expecting a Delphi reference counted WideString. The compiler has to rely on you to get the declarations right.
If your code and the components you use have been ported to Linux or .NET in the past, then chances are these kinds of things have already been found and modified to be char size agnostic.
Unicode VCL sounds like such a simple, little thread... until you start pulling on it.
-Danny
Thanks, Danny! Keep them posts coming!"Danny Thorpe" wrote:
> many applications (particularly those connecting to external systems) can never be 100% unicode. They will always have a mix of unicode and non-unicode sections.True, there is always a need to be able to specify which parts are wide and which parts are not. That's why we have 3 char types (AnsiChar, Char, WideChar) and 3 string types (AnsiString, String, and WideString). All those types continue to exist in Unicode places such as Delphi.NET and Kylix, but the definition of the middle one changes.
The issue is not that there is missing capability in the types. The issue in any port or redefinition of core semantics is that people very rarely write code that is multi-platform ready unless they are actually testing and debugging across multiple platforms. If you write your code to always use the never-changing types whenever you incorporate assumptions about char size, and always use size-flexible types when you should, then you'll have fewer porting issues. The issue is, people don't code that way unless they are being forced to.
> TNT unicode is probably the most used unicode solution
TNT is a good compromise, but it does not present a complete solution that includes design time support and architectural simplicity/uniformity.
> XChar/XString (8 or 16, depending on project options).
You already have types like that, and you have had them for 10 years. They are: AnsiChar/AnsiString, WideChar/WideString, and Char/String.
[…]
There's no need for an additional type. Other programming languages that span Ansi and Unicode have the same issue, and the same points of failure - code that was not written with both camps in mind.
The only languages that do not have this issue are those that don't support both camps. Java, for example, has always been taxed with memory consumption issues associated with having only Unicode strings. The .NET platform is fully Unicode (include Delphi.NET), so the only issue is code that was written prior to Unicode availability, and more recently code that was written in a Unicode context which fails to handle the more complicated world of Ansi and multibyte encoded character sets.
Correllaries to Murphy's Law: If something is adjustable, someone will adjust it incorrectly. If something has an option, someone will write code that does not handle that option correctly.
This is why I fight strongly against "just make it an option or a switch" solutions. The ideal is to have a single solution, so that there is no room to get it wrong. That's why I believe Unicode VCL is a better fit for something like a Win64 Delphi, because Unicode VCL would be the one and only 64 bit VCL. No flippin switchiness to add complexity to get between the programmer and his/her objective.
-Danny