Tuesday, March 06, 2007

Review: Delphi 2007 for Win32 (Beta) - part two

Read part one here first.

What more is new in the compiler?

New published vs $M+ behavior

The new "W1055 PUBLISHED caused RTTI ($M+) to be added to type '%s'" warning is interesting. It solves one of the issues we discussed earlier. In earlier versions of the compiler, if you compiled code like this:

TMyClass = class
private
  FName: string;
published
  property Name: string read FName write FName;
end;

The published property would not have RTTI generated, but would be silently treated as a public property. This is because TMyClass does not derive from a class compiled with $M+ enabled (such as TPersistent) and $M+ is not specified for TMyClass itself. When you compile the same code in Delphi 2007 you will now get a warning:

[DCC Warning] ThSort.pas(13): W1055 PUBLISHED caused RTTI ($M+) to be added to type 'TMyClass'

This indicates that despite the missing $M+, the compiler has kept the properties published and generated RTTI for them. To remove the warning, add a $M+ directive in front of the class - or change published to public - if that is what you really meant. Nice touch!
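As a minimal sketch of the first fix, the class below asks for RTTI explicitly and then reads the property back by name through the TypInfo unit. The program name and the test value are just placeholders; note that the field is moved into a private section, since with $M+ active a string field in the default section would itself become published.

```pascal
program PublishedRtti;

{$APPTYPE CONSOLE}

uses
  TypInfo;

type
  {$M+} // request RTTI explicitly, so the compiler has nothing to warn about
  TMyClass = class
  private
    FName: string;
  published
    property Name: string read FName write FName;
  end;
  {$M-}

var
  Obj: TMyClass;
begin
  Obj := TMyClass.Create;
  try
    Obj.Name := 'Spacely';
    // IsPublishedProp and GetStrProp only see properties that have RTTI
    if IsPublishedProp(Obj, 'Name') then
      Writeln(GetStrProp(Obj, 'Name'));
  finally
    Obj.Free;
  end;
end.
```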


The compiler accepts a new $DYNAMICBASE ON/OFF compiler directive, and a corresponding command line parameter called "--dynamicbase". This is a shortcut for the existing $SetPEOptFlags directive. $DYNAMICBASE ON corresponds to $SetPEOptFlags $40 where $40 is defined by Windows as:

IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE 0x0040 // The DLL can be relocated at load time.

The SetPEOptFlags directive sets a field in the PE header that is called DllCharacteristics. It is used by the Windows loader to determine capabilities of a loaded module (DLL or BPL, for instance). The bit $40 seems to have a specific meaning in Windows Vista and enables a feature called Address Space Layout Randomization (ASLR). It means that the OS loads the module at a more or less random address (actually a random offset between 0 and 255) instead of trying to load it at exactly the specified imagebase address. This is a security and anti-remote-hack measure, making it harder to construct buffer overflow attacks that call known system DLL routines at fixed addresses. You can read more about it here.
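To see the effect of the directive, a small console program can inspect its own PE header at run time. This is only a sketch: HasDynamicBase is a hypothetical helper, and the constant is declared locally under its Windows SDK name on the assumption that Windows.pas may not define it.

```pascal
program AslrCheck;

{$APPTYPE CONSOLE}
{$DYNAMICBASE ON}          // the new Delphi 2007 shortcut
// {$SetPEOptFlags $40}    // the longer spelling of the same request

uses
  Windows;

const
  // Windows SDK name for bit $40, declared locally (assumption: not in Windows.pas)
  IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = $0040;

// Reads the DllCharacteristics field from the PE header of the running module.
function HasDynamicBase: Boolean;
var
  Dos: PImageDosHeader;
  Nt: PImageNtHeaders;
begin
  // HInstance is the load address of this module, where the DOS header starts
  Dos := PImageDosHeader(HInstance);
  // _lfanew holds the offset from the DOS header to the NT headers
  Nt := PImageNtHeaders(PAnsiChar(Dos) + Dos^._lfanew);
  Result := (Nt^.OptionalHeader.DllCharacteristics and
    IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE) <> 0;
end;

begin
  Writeln('Dynamic base flag set: ', HasDynamicBase);
end.
```

Compiling the same program with $DYNAMICBASE OFF should make the check report that the flag is clear.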

--description command line option

The compiler also has a new command line option --description:<string> that corresponds to the $DESCRIPTION compiler directive. It lets you set the module description entry in the PE header.

What's new in the RTL?

The run time library contains the lowest level building blocks in the Delphi eco-system above the compiler. The special SysInit and System units are tightly bound to the internals of the compiler and linker. When using packages, SysInit is compiled into every module (EXE and package BPLs), while the System unit is only compiled into rtl100.bpl. That is the reason they are split into two physically different units.

That's enough nitty-gritty background. What is new, then? Well, SysInit hasn't changed. I guess this is an indication that it has stabilized and doesn't need any changes or fixes. So, good news. Let's dive into the inner secrets of the System unit changes.

The System unit changes

There is a new Boolean global variable in the interface section, with the catchy name NeverSleepOnMMThreadContention. It is used in the new FastMM based memory manager (courtesy of Pierre le Riche) that resides in the getmem.inc file that is included in the implementation section of System.pas. Here is the comment that accompanies the variable declaration:

{Set this variable to true to employ a "busy waiting" loop instead of putting
the thread to sleep if a thread contention occurs inside the memory manager.
This may improve performance on multi-CPU systems with a relatively low thread
count, but will hurt performance otherwise.}
NeverSleepOnMMThreadContention: Boolean;

When the FastMM memory manager needs to protect shared resources, it uses light-weight atomic operations (such as the lock cmpxchg assembly instruction) instead of the more heavy-duty OS-level critical section APIs. If there is contention (the resource has already been locked by another thread), it normally sleeps to release the CPU and allow the other thread to finish its work and release the lock.

As the comment above alludes to, in some situations on multiprocessor machines (which are becoming mainstream) it may be more efficient to just keep looping and checking the lock availability (also called a spin-lock) while waiting for the other thread, executing on one of the other processors, to release the lock. Here is an example of what the Pascal version of this logic looks like:

while LockCmpxchg(0, 1, @MediumBlocksLocked) <> 0 do
  if not NeverSleepOnMMThreadContention then
  begin
    Sleep(InitialSleepTime); // release the CPU before retrying
    if LockCmpxchg(0, 1, @MediumBlocksLocked) = 0 then Break;
  end;
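The same spin-or-sleep choice can be tried outside the memory manager. The following is a rough sketch under a few assumptions: it uses InterlockedCompareExchange from the Windows unit in place of the RTL-internal LockCmpxchg, the names Locked, NeverSleep, AcquireLock and ReleaseLock are made up for the example, and Sleep(0) merely stands in for whatever delay the RTL uses.

```pascal
program SpinOrSleep;

{$APPTYPE CONSOLE}

uses
  Windows;

var
  Locked: Integer = 0;          // 0 = free, 1 = held (hypothetical lock word)
  NeverSleep: Boolean = False;  // plays the role of NeverSleepOnMMThreadContention

procedure AcquireLock;
begin
  // Atomically change 0 to 1. The old value is returned, so a non-zero
  // result means another thread already holds the lock.
  // Argument order: destination, new value, value to compare against.
  while InterlockedCompareExchange(Locked, 1, 0) <> 0 do
    if not NeverSleep then
      Sleep(0);  // give up the time slice; otherwise spin and retry at once
end;

procedure ReleaseLock;
begin
  InterlockedExchange(Locked, 0);
end;

begin
  AcquireLock;
  Writeln('after AcquireLock: ', Locked);
  ReleaseLock;
  Writeln('after ReleaseLock: ', Locked);
end.
```

With NeverSleep set to True the loop never yields, which only pays off when the lock holder is running on another processor and releases the lock quickly.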

Another improvement to the assembly versions (which are used by default) of these spin-locks is the use of the pause x86 instruction to be more processor and OS-friendly:

{The pause instruction improves spinlock performance}

In addition, the FastMM code size has been slightly reduced (I haven't measured) by using the equivalent of a method-extract refactoring on some of the assembly code (the logic for resizing a large memory block has been refactored into an internal ReallocateLargeBlock routine).

Finally, it looks like there is some improved logic to support segmented large blocks. Pierre's comment in the code about this is:

{Is this large block segmented? I.e. is it actually built up from more than
one chunk allocated through VirtualAlloc? (Only used by large blocks.)}

Optimized RTL routines

As Steve Trefethen has already blogged about, there are now updated and new FastCode replacement functions included in the stock RTL. The compiler magic System routine _LStrCmp is used whenever you compare strings, so it is a performance critical piece of code. Pierre le Riche wrote a new, faster version of this routine (based on the FastCode winner of the CompareStr challenge) and submitted it as Quality Central report #31328. This routine has now been incorporated into the RTL.

Note that _LStrCmp is a general routine that is called for all string compares, including the string compare operators >, >=, =, <>, < and <=. In Pierre's QC entry he suggests changing the compiler so that it calls a different, more specific function (_LStrEqual) for the = and <> operators - in most cases, it would be able to determine inequality simply by comparing the strings' lengths. As this change would probably break binary .dcu compatibility, it is not included in Spacely, but it is a possibility for Highlander (that will probably become BDS 2007).
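To illustrate why an equality-only routine can be cheaper than a general compare, here is a small sketch. FastEqual is a hypothetical helper written for this example; it is not Pierre's _LStrEqual and not what the RTL does.

```pascal
program EqualSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: answers "equal or not" without ordering the strings.
function FastEqual(const A, B: string): Boolean;
begin
  if Pointer(A) = Pointer(B) then
    Result := True    // same string data, or both strings are empty
  else if Length(A) <> Length(B) then
    Result := False   // different lengths can never be equal
  else
    // Same non-zero length: only now are the characters compared
    Result := CompareMem(Pointer(A), Pointer(B), Length(A) * SizeOf(Char));
end;

begin
  Writeln(FastEqual('Delphi', 'Delphi'));
  Writeln(FastEqual('Delphi', 'Delphi 2007'));
  Writeln(FastEqual('Delphi', 'Delphj'));
end.
```

A general compare has to find the first differing character to decide which string sorts first, while the length test above rejects most unequal pairs without touching the character data at all.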

The other FastCode replacement functions reside in SysUtils. UpperCase and LowerCase were already FastCode functions (by Aleksandr Sharahov) in BDS 2006 - there are now even faster versions by John O'Harrow in Delphi 2007. In addition, CompareStr and StrLen are new FastCode functions from Pierre le Riche.

In addition, the FileAge and FileSearch functions have been optimized. FileAge now calls the more efficient GetFileAttributesExA API (when available) instead of the FindFirst/FindClose pair. FileSearch now leaves early if the Name parameter is an empty string. These two changes also help improve the speed of the compiler and linker.
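The idea behind the FileAge change can be sketched as follows. QuickFileAge is a hypothetical helper and not the RTL's implementation: it links statically to GetFileAttributesEx from the Windows unit instead of probing for the API at run time, and it has no FindFirst fallback.

```pascal
program FileAgeSketch;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: returns the DOS-style timestamp of a file using a
// single attribute query, or -1 if the file cannot be queried.
function QuickFileAge(const FileName: string): Integer;
var
  Data: TWin32FileAttributeData;
  LocalTime: TFileTime;
begin
  Result := -1;
  // One API call, with no search handle to open and close afterwards
  if not GetFileAttributesEx(PChar(FileName), GetFileExInfoStandard, @Data) then
    Exit;
  // Directories are not reported as files
  if (Data.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0 then
    Exit;
  // Convert the UTC write time to a local DOS date/time pair
  if FileTimeToLocalFileTime(Data.ftLastWriteTime, LocalTime) then
    if not FileTimeToDosDateTime(LocalTime, LongRec(Result).Hi,
      LongRec(Result).Lo) then
      Result := -1;
end;

begin
  Writeln(QuickFileAge(ParamStr(0)));         // the running executable
  Writeln(QuickFileAge('no-such-file.xyz'));  // a file assumed not to exist
end.
```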

Finally, the hash-table logic used in the CheckForDuplicateUnits routine during loading of packages has been further optimized. This code was greatly improved in BDS 2006 and it is now even faster. This helps reduce the startup time of applications that load lots of packages dynamically (such as the Delphi IDE itself).

Desktop Window Manager API

There is a new DwmApi unit (in the source\Win32\rtl\win directory) that dynamically binds to the Windows Vista specific APIs exported by DWMAPI.DLL. On other platforms the routines are stubbed out to return the E_NOTIMPL error code. The constants and routines in this unit are the basis of all the new Windows Vista and Glass specific functionality added to the VCL and Delphi IDE. It has exciting-sounding routine names like DwmExtendFrameIntoClientArea and DwmUpdateThumbnailProperties. Most of the time the VCL shields you from using these nitty-gritty APIs directly, but it's handy to know where to find them.

Math unit and complete boolean evaluation

The math unit has had a couple of interesting changes to the trio (Integer, Int64 and Double versions) of the InRange functions. Basically, the code has been rearranged from (BDS 2006 version):

function InRange(const AValue, AMin, AMax: Integer): Boolean;
begin
  Result := (AValue >= AMin) and (AValue <= AMax);
end;

to (Delphi 2007 version):

function InRange(const AValue, AMin, AMax: Integer): Boolean;
var
  A, B: Boolean;
begin
  A := (AValue >= AMin);
  B := (AValue <= AMax);
  Result := B and A;
end;

The code change forces the compiler to evaluate both boolean expressions before setting the Result variable. This reduces the number of conditional jumps and thus typically improves performance (particularly in the cases where InRange returns true - then it has to evaluate both expressions anyway). The generated code differences are as follows - here is the assembly code generated for the old BDS 2006 version of InRange:

Result := (AValue >= AMin) and (AValue <= AMax);
//0040841C 3BD0 cmp edx,eax
//0040841E 7F04 jnle $00408424
//00408420 3BC8 cmp ecx,eax
//00408422 7D03 jnl $00408426
//00408424 33C0 xor eax,eax
//00408426 C3 ret

and here is the assembly code for the new version:

A := (AValue >= AMin);
//0040842C 3BD0 cmp edx,eax
//0040842E 0F9EC2 setle dl
B := (AValue <= AMax);
//00408431 3BC8 cmp ecx,eax
//00408433 0F9DC0 setnl al
Result := B and A;
//00408436 22C2 and al,dl
//00408438 C3 ret

As you can see, the second version does not have any branches, while the first one has two conditional branch instructions. Modern processors work better with branchless code, so the net effect is better performance. We might revisit this topic in the future - and the alternative way of accomplishing the same thing - using the $B+ compiler directive to enable complete boolean evaluation.
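For the curious, the $B+ alternative could look something like the sketch below. InRangeB is a hypothetical name chosen to avoid clashing with the Math unit, and I have not checked what code the compiler generates for it; the sketch only shows where the directive goes.

```pascal
program BoolEvalSketch;

{$APPTYPE CONSOLE}

{$B+} // complete boolean evaluation: both comparisons are always evaluated
function InRangeB(const AValue, AMin, AMax: Integer): Boolean;
begin
  Result := (AValue >= AMin) and (AValue <= AMax);
end;
{$B-} // back to the default short-circuit evaluation

begin
  Writeln(InRangeB(5, 1, 10));
  Writeln(InRangeB(0, 1, 10));
  Writeln(InRangeB(11, 1, 10));
end.
```

Because $B is a local switch, it can be turned on around a single routine like this without affecting the rest of the unit.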

That concludes part two of this review - part three will follow shortly. Stay tuned! ;)


Anthony Mills said...

Very interesting, Hallvard.

By the way, you might want to update your shortlist of blogs. Chris Brumme hasn't blogged for almost three years now, Danny Thorpe is off in Microsoft Live-land, and the Delphi team isn't really Borland any more. :)

Anonymous said...

Very interesting. Thanks for posting!

Anonymous said...

Very interesting. Thanks.

unused said...

Great details! I love it.

BTW, you say "Highlander (that will probably become BDS 2006)" but BDS 2006 is already available. I suspect Highlander will either be BDS 2007 or BDS 2008.

Unknown said...

Highlander (that will probably become BDS 2006)

I suspect that's not what you meant there.

Anonymous said...

GetFileAttributesEx fails if the file is locked. I hope they auto fall-back to FindFirst which doesn't have that limitation.

Anonymous said...

Nice, Hallvard.

A small typo though: "that will probably become BDS 2006" should have been "2007", right?

Unknown said...

Jim: Thanks - I've fixed it now.

Unknown said...

Anthony: You're right - I need to fix that list RSN.

Anonymous said...

Doesn't using 8-bit registers (al, dl) lead to a partial register stall? Or is this problem solved in the latest P4 processors?

Anonymous said...

Great to see CodeGear is leveraging contributions from external developers to make Delphi a better product. The Delphi community has always been an important factor in Borland's success, yet it didn't seem to get the kind of treatment it deserved. IMHO, CodeGear needs to leverage the community contributions and flesh out the bundled VCL to keep Win32/64 development as attractive as possible to prevent eventual defections to .NET due to its massive FCL.

Anonymous said...

very nice review, thanks ! :)

Copyright © 2004-2007 by Hallvard Vassbotn