
Saturday, July 25, 2009

I just discovered that Sun Studio has a strange idea of "optimization" when it comes to copying structures by value. A modern compiler, on a modern platform (Sun Studio included), transforms things like

struct _tag {
    int x;
    int y;
} v1, v2;

....

v1 = v2;


into the best possible sequence of instructions to copy the contents of the source structure into the target. For example, on 64-bit architectures, or whenever the target architecture can store 64-bit words, the assignment above is turned into a single move operation. Otherwise, the compiler simply copies the two 32-bit variables. Good compilers are also smart enough to understand when it's worth emitting a "rep" cycle, that is, instructing the CPU to perform a fast repeated memory-move operation, for example on very big structures. Very smart compilers are even able to select the proper repeated move size, e.g. looping over 32-bit values and then finishing off the tail of an unpadded structure with one to three byte copies.

But on sparc32, Sun Studio 11 seems not to be that smart. Our problem was that, on a relatively old Sparc32 machine, the Falcon implementation of the recursive Fibonacci test for fib(30) took 280 seconds to complete. The same test took 1.5 seconds on different hardware; on an old sparc64 machine it took approximately 4-5 seconds. There was definitely something wrong. Funnily enough, the debug version with no optimization consistently took 1/3 of the time to complete, for any value of the Fibonacci series. Ever heard of a compiler slowing down the optimized version by a factor of three?

I have optimized several things since that first test; mainly, the values stored in the virtual machine opcode sequences are now converted to native endianness at load time on Sparc machines, so the test now took 180 seconds to complete (100 in debug). Still something wrong... So, wondering what on earth it could be, I ran the test through the Sun Studio profiler, and noticed that...


Functions sorted by metric: Exclusive User CPU Time

Excl.      Incl.      Name
User CPU   User CPU
  sec.       sec.
17.502     17.502
 3.422      3.422    Falcon::ItemArray::append(const Falcon::Item&)
 2.272      2.302    Falcon::VMachine::callReturn()
 2.242      2.272    Falcon::opcodeHandler_RETV(Falcon::VMachine*)
 2.061      2.262    Falcon::VMContext::createFrame(unsigned,bool(*)(Falcon::VMachine*))
 1.381      1.381    Falcon::VMachine::getOpcodeParam(unsigned)
 1.141      1.141    Falcon::co_int_sub(const Falcon::Item&,const Falcon::Item&,Falcon::Item&)
 1.021      2.712    Falcon::opcodeHandler_SUB(Falcon::VMachine*)
 0.991      1.001    Falcon::opcodeHandler_POP(Falcon::VMachine*)
 0.931      0.931    Falcon::co_int_compare(const Falcon::Item&,const Falcon::Item&)
 0.660      1.191    Falcon::opcodeHandler_ADD(Falcon::VMachine*)
 0.520      0.520    Falcon::co_int_add(const Falcon::Item&,const Falcon::Item&,Falcon::Item&)
 0.130      0.130    memset


See? The vast majority of the time is spent in the array append and in the return frame. The array append is actually a push on the virtual machine's stack, that is, mostly a matter of copying an item. But what put me on alert is the handler of RETV, which just copies an item into the A register of the virtual machine.

Falcon has items. They are just a flat structure of data, or pointers, and information about their content (type, flags and so on). One of the most common operations in the whole virtual machine is the item copy, which is done through an inline function that just copies the structure flatly, C style, like

all = other.all;

where "all" is a set of structures and unions.

On 32-bit machines, the size of a Falcon item is exactly 16 bytes; so I did something like:

int *pthis = (int*)&this->all, *pother = (int*) &other.all;
pthis[0] = pother[0];
pthis[1] = pother[1];
pthis[2] = pother[2];
pthis[3] = pother[3];


and that brought the time of fib(30) down to 68 seconds. 180 -> 68. A good gain, considering that the above code is exactly what the compiler is supposed to generate when it encounters a structure assignment.

However, it was still a lot; the same test runs in 20 seconds in Perl on the target machine. The profile record now looked like...

Functions sorted by metric: Exclusive User CPU Time

Excl.      Incl.      Name
User CPU   User CPU
  sec.       sec.
 5.584      5.584
 1.531      1.531    Falcon::VMachine::getOpcodeParam(unsigned)
 1.241      1.241    Falcon::co_int_sub(const Falcon::Item&,const Falcon::Item&,Falcon::Item&)
 1.041      1.041    Falcon::co_int_compare(const Falcon::Item&,const Falcon::Item&)
 0.580      0.600    Falcon::co_int_add(const Falcon::Item&,const Falcon::Item&,Falcon::Item&)


getOpcodeParam is the most stressed function, and it's also the one that does the real heavy lifting in this version of Falcon, so I wasn't surprised. But sub, compare and add being so expensive was something new. They do simple calculations on 64-bit integers, and they are supposed to play a minor role in a test such as recursive Fibonacci.

I was suddenly enlightened. The only superfluous operation that co_int_sub performs is forcefully setting the integer type in the target item (the one that receives the result of the subtraction). For historical reasons, the type of an item was stored in the third byte of the last word of the item structure; remembering that the Sparc32 architecture holds a grudge against unaligned operations, I realized that a byte access at an unaligned address may have caused quite a mess. So I moved the type byte to the top of the last word of the item structure.

The test now took about 8 seconds to complete. 68 -> 8 just for having reordered the fields in a structure. This time the fault is not the compiler's, but it's still weird that the position of a byte access is so relevant. If it's that relevant on a certain architecture, the compiler MUST perform aligned operations and access the data through fast bitwise register masks, at least when maximum optimization is requested, as I did (-xO5). Conversely, it's plain to see that the compiler generated code that drove the target processor (-xarch=native, so no architecture mismatch) mad.

Also, the issue is totally irrelevant on Sparc64, which is as sensitive to unaligned operations as Sparc32 but is able to treat alignment at byte granularity for byte reads, so it was hard to spot the problem there.

