Hallvard's Blog: August 2005

Friday, August 26, 2005

Danny Thorpe on Unicode and VCL

As always, debates about Delphi and its future rages on in the borland.public.delphi.non-technical newsgroups. Often feelings race high and speculations, FUD, trolling and flamewars is the order of the day - you have been warned ;).

One way of getting at some nibbles of good, technical and useful information from the newsgroups is to read them "vertically" - I find myself often arranging the newsgroup posts by poster name, and reading most or all posts of people I know from experience are good posters.

One never disappointing source of such enlightening posts is Borland's Chief Scientist Danny Thorpe. Recently he posted some interesting points about how he views the challenge of a Unicode enabled native VCL (VCL for .NET already supports Unicode, of course). You can click the link to see Danny's full posts in context from the Google cache.

Here is my interpretation of what I understand are the main points:

Delphi .dfm files are already Unicode ready - strings are stored as utf-8
Delphi already has the types required to support old and new code (Char/AnsiChar/WideChar
and String/AnsiString/WideString)
Keeping both Ansi VCL and Unicode VCL means; new 3rd party controls, numerous porting issues (char size etc.), duplicate IDE designers, etc., etc.
For performance and memory usage reasons, WideString should be made reference counted (using OleStr for external calls).
It makes sense to keep Win32 VCL Ansi, while targeting UniCode VCL for a Win64 Delphi platform

Here are some quotes from Danny's posts (published here with permission, of course) - emphasis is mine:

"Danny Thorpe" wrote:
> Does anyone know if Borland if ever plans to add Unicode support in VCL.

The main question is: How much compatibility are you willing to sacrifice to get a Unicode VCL? Unicode VCL for Win32 will not be fully compatible with many third party components out there. Unicode VCL for Win32 will require new component designers in the IDE that will not be compatible with Ansi VCL. Unicode VCL for Win32 will require new design interfaces in the IDE which will not be compatible with the existing design-time interfaces.

[...]

Yes, we intend to produce a Unicode VCL. We already have in VCL.NET, and the only sane choice for 64 bit VCL is all-Unicode. The cost of adding Unicode support is less when you are starting with a new platform base which already has a compatibility barrier.

-Danny

Another quote:

"Danny Thorpe" wrote:
> The only thing you would have to do is update any literals stored in the DFM.

String literals in DFM files are already stored as UTF-8, a compressed Unicode encoding. UTF-8 looks like ANSI/ASCII for chars < 128. No DFM update utility is required.

> The break would be so minor, it shouldn't take more then a week to convert several hundred units.

The breaks I'm referring to run far deeper than DFMs. How much code do you have that runs through a PChar array by incrementing a pointer by one? In a Unicode world, PChar = PWideChar, which means each char is 2 bytes.

Similarly, any code that scans a string assuming that the first zero byte is the null terminator will fail with Unicode strings, because most Unicode chars (for English) have a zero high byte.

For most Win32 APIs involving string data, there are matching Ansi and Unicode definitions. But not all. Which of the Win32 APIs that you rely on today are not symmetric?

How much code do you have that is aware of multibyte character encodings for Middle Eastern or Far Eastern languages? In a Unicode world, most MBCS gymnastics are completely unnecessary and most are benign, but a few MBCS code patterns actually fail on Unicode. See the byte assumption above.

WideStrings are currently implemented by Delphi as OLEStr, aka BStrs, allocated using the SysAllocString Win32 API. These are not reference counted, and are rather promiscuous in copying themselves for every reference. Clearly, the Delphi WideString implementation needs to be changed to a reference counted WideString to save memory and performance if WideString is to become the primary string data type. But that means Delphi's WideString will have different allocation semantics from OleStr. Reference counted WideStrings will have to be converted to single-reference copies before being passed out of the application to Win32 APIs expecting PWideChar buffers.

Breaking the WideString = OleStr type alias means that all the Win32 APIs that are now declared as taking WideString will need to be changed to OleStr. We'll handle Windows.pas and the other Win32 API units we provide, but you will have to do the equivalent work on any other DLL API declarations your applications use. Until you find them all and fix them, your app will compile fine but will crash mysteriously at runtime. The compiler can't help you here because the compiler can't tell if the DLL you're calling actually expects OleStr or if it's a Delphi DLL that's actually expecting a Delphi reference counted WideString. The compiler has to rely on you to get the declarations right.

If your code and the components you use have been ported to Linux or .NET in the past, then chances are these kinds of things have already been found and modified to be char size agnostic.

Unicode VCL sounds like such a simple, little thread... until you start pulling on it.

-Danny

Final quote:

"Danny Thorpe" wrote:
> many applications (particularly those connecting to external systems) can never be 100% unicode. They will always have a mix of unicode and non-unicode sections.

True, there is always a need to be able to specify which parts are wide and which parts are not. That's why we have 3 char types (AnsiChar, Char, WideChar) and 3 string types (AnsiString, String, and WideString). All those types continue to exist in Unicode places such as Delphi.NET and Kylix, but the definition of the middle one changes.

The issue is not that there is missing capability in the types. The issue in any port or redefinition of core semantics is that people very rarely write code that is multi-platform ready unless they are actually testing and debugging across multiple platforms. If you write your code to always use the never-changing types whenever you incorporate assumptions about char size, and always use size-flexible types when you should, then you'll have fewer porting issues. The issue is, people don't code that way unless they are being forced to.

> TNT unicode is probably the most used unicode solution

TNT is a good compromise, but it does not present a complete solution that includes design time support and architectural simplicity/uniformity.

> XChar/XString (8 or 16, depending on project options).

You already have types like that, and you have had them for 10 years. They are: AnsiChar/AnsiString, WideChar/WideString, and Char/String.

[…]

There's no need for an additional type. Other programming languages that span Ansi and Unicode have the same issue, and the same points of failure - code that was not written with both camps in mind.

The only languages that do not have this issue are those that don't support both camps. Java, for example, has always been taxed with memory consumption issues associated with having only Unicode strings. The .NET platform is fully Unicode (include Delphi.NET), so the only issue is code that was written prior to Unicode availability, and more recently code that was written in a Unicode context which fails to handle the more complicated world of Ansi and multibyte encoded character sets.

Correllaries to Murphy's Law: If something is adjustable, someone will adjust it incorrectly. If something has an option, someone will write code that does not handle that option correctly.

This is why I fight strongly against "just make it an option or a switch" solutions. The ideal is to have a single solution, so that there is no room to get it wrong. That's why I believe Unicode VCL is a better fit for something like a Win64 Delphi, because Unicode VCL would be the one and only 64 bit VCL. No flippin switchiness to add complexity to get between the programmer and his/her objective.

-Danny

Thanks, Danny! Keep them posts coming!

Friday, August 12, 2005

The ultimate Delphi IDE start-up hack

He's done it again! Petr Vones has hacked together and published an at-your-own-risk patching tool that will trim the start-up time of your Delphi IDE. I've tested it with my Delphi 7, 8 and 2005 IDEs and in each case, the startup time of the IDE is noticeably faster after applying Petr's clever hack.

Before you use the patcher, read Petr’s readme file where he notes:

Warning: removing the dup unit check code may cause unexpected errors, USE AT YOUR OWN RISK !!!

Reading the rest of this article, doesn’t hurt either. ;-)

Background information
At start-up the Delphi IDE (like any other application that uses the magic of Delphi's run-time packages) will load statically and dynamically linked packages (plain Windows .DLL files using a .BPL extension and containing Borland's magic plumbing to make it all work). Each package can contain any number of units and require any number of external packages. In this context, the main thing is that Bad Things™ (such as random AVs, .DFM streaming errors etc) can happen if you were to load to packages containing the same (or same-named) unit in two or more packages.

To prevent this from happening, the Delphi RTL contains a SysUtils.LoadPackage routine that is responsible for dynamically loading a new package. This is used by the IDE to load all packages (containing the standard VCL components, third party components, design time editors, wizards, exports etc.) One of the tasks LoadPackage performs is to check that the currently loaded package does not contain a unit name that exists in one of the already loaded packages. If it detects a unit-name collision, a EPackageError exception with the message

Cannot load package 'PackageA' It contains unit 'UnitName' which is also contained in package 'PackageB'

is raised and the newly loaded package is unloaded again.

Analysing the algorithm
If you have the Delphi RTL source code available (all serious Delphi programmers should) you have a look at this logic by searching for "CheckForDuplicateUnits". Stop reading now and take a look at that code.

Ok, back already? Pretty complicated-looking code, don't you think? I've taken a quick shot at trying to analyse the complexity of checking the uniqueness of all unit names contained in all loaded packages (i.e. how it theoretically performs when number of loaded packages and units increase). There are five nested loops:

The outer loop (in the IDE) loading all the installed packages (read from the Registry)
One recursive loop (InternalUnitCheck calling itself) to handle all the requires-links between the current package and all the sub-packages
One iterative loop (in InternalUnitCheck) of all the contained units in the current package
One iterative loop (in IsUnitPresent) of all currently loaded modules
One iterative loop (nested in IsUnitPresent) of all the contained units in each loaded module

So the algorithm looks like it has the complexity of O(P * R * M * U * U) where

P is the number of installed packages that will be loaded by the IDE
R is the average number of required packages
M is the average number of loaded modules (starts low and increases as more and more modules have been loaded)
U is the average number of units contained in a package

M varies from 0 to ~P, and on average it will be about P / 2. Let's be nice and set the average number of required packages to just 2. The expression then simplifies to O( P * 2 * (P/2) * U^2) or O(P^2 * U^2).

U^2 will be some constant (but potentially large) number, so the major complexity is O(N^2).

What this all this jumble-bumble means in practical terms is that the more packages you have to load, the slower it gets - no surprise there. But it also means that the running time rises exponentially as the number of packages and units increase. So theoretically, if loading 100 packages takes 10 seconds, then loading 200 packages should take in the neighbourhood of 40 seconds (instead of the 20 seconds that would be expected with a linear O(N) algorithm).

Is there a problem?
For most people the start-up time of the IDE should not be a major issue. The safety that the unit-name checks gives you is probably worth the extra time it takes. If you are often changing the installed packages, testing freeware and commercial packages, installing your own development packages etc. the convenience of having the IDE explicitly inform you of unit name conflicts outweighs any start-up time improvements.

However, for people that have a fixed group of a large number of packages, the start-up time can get noticeably long. If you are willing to live a little dangerously and risk shooting yourself in the foot (getting random crashes or weird issues caused by duplicate unit problems), you may consider using Petr's hack.

What is the solution?
If we conclude that given a trend of increasing number of packages typically installed into the Delphi IDE, the start-up time starts getting uncomfortably long, what can we do about it?

Well, for the fool-hearted (brave?) among us, there is Petr's patching hack, of course. This is a brute-force solution that simply turns off all unit-name checks "simply" by patching the first instruction of the CheckForDuplicateUnits routine to be a RET instruction, effectively turning it into a NOP operation.

But if Borland decided to do something about this, what could they do? Considering the high complexity and O(N^2) behaviour of the existing CheckForDuplicateUnits implementation, the most obvious thing to do would be to change the algorithm into something a little more efficient, like O(NlogN) or even O(N). How can this be done?

The goal is to detect collisions between a large set of strings. A hashtable using a string key would be perfectly suited for this. One complication is that the current algorithm uses the linked structure of all currently loaded packages to iterate over them all.

If *all* loaded packages are loaded dynamically through calling LoadLibrary this should not be a problem. However, if one or more packages are loaded statically (an application or package require a set of other packages at load time), the OS will load this packages and the LoadLibrary routine (and its unit-name checking logic) will never "see" these packages, and thus will fail to detect collisions with the unit names contained in them.

In addition a hashing list solution will have to take into consideration unloading of packages, removing unit names from the hashlist. The current algorithm doesn't have to take this into account as it uses the automatically linked up/torn down module chain.

Instead of maintaining a global hash-list that is kept between invocations of CheckForDuplicateUnits, it could create an empty one from scratch on each call and just use it as a relatively quick way of finding unit-name collisions. It would do this by first iterating over the already loaded modules, adding their unit names to the hash list. Then it can try to add the currently loading package unit names to the hash list – if a collision is detected here raise an error and unload it.

I’ve not analysed the possible implementations in details but (assuming it would be possible to implement correctly) I’d guess that a global hash list implementation would give O(N) performance, while a local hash list implementation (that is re-populated on each call) would give O(NlogN).

A completely different solution (and one that could be implemented even if the basic algorithm is improved), is to have the IDE detect when a new graph of packages/units that have not been checked before is being loaded. Turn off the checks if there are no new or changed packages since the last check.

Conclusions
Hardware is getting faster, but the growing size and complexity of software may be eating up all the benefits. Specifically for the Delphi IDE with a large number of packages installed, start-up times can become sluggish.

I think that Petr's patching tool will end up being used by die-hard hackers like myself until a cleaner and safer solution is present. That is not necessarily a bad thing - it gives us the nice feeling of living on the edge and getting a faster IDE start-up time than everybody else ;).