
Monday, January 21, 2013

I am facing a fairly big dilemma about whether to always enable parallel programming support on the script side of the new Falcon engine. Doing so costs roughly 50% of performance in the mean case, but not doing it has heavy consequences. There might also be solutions combining the best of both worlds that I simply overlooked, that is, a way to perform the heavy operations needed to ensure item write-visibility consistency across agents that is not as heavy as the simple, but effective, way I am using now.

I wrote the following post in the Falcon Programming Language Google group; I am repeating it here on my blog to show it also to those who have no access to that resource.

Hey ppl,

I need an informed opinion from the community on how to resolve a problem, and I don't want to have just one mindset on this. I need more brains on this problem, so if you can think about it and possibly spread the challenge, you'll be doing a great favor to all the future users of our language.

Here I am reporting the latest results of the development of the new engine, now closing in on the alpha 1.0 release: I have finally coded the item locking system that allows items to be accessed concurrently from different parallel agents.

In the following sample:

you can see two agents concurrently writing to the t.a property; one writes a number, the other a string. A third agent reads the result and prints it out as fast as it can. As the output stream is, well, a stream, we all understand that the actual output rate is limited by the O/S granting us write access to it.

The experiment shows the check agent randomly reading either a string or an integer from the t.a property; the point of the experiment is that the read is clean, and we never get partial data in the item (that is, we never see an item with the type ID of an integer but the data of a String).
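The original Falcon sample isn't reproduced above; as a rough Python analogue of the scenario it describes — two writer agents, one checker, and a per-item lock standing in for the engine's item locking — something like the sketch below could be used. The `Item` class and its fields are purely illustrative, not the engine's actual item layout.

```python
import threading

class Item:
    """Rough analogue of a Falcon item: a type tag plus data, stored
    separately so an unguarded update could be observed half-done."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the per-item lock
        self._type = int
        self._data = 0

    def write(self, value):
        with self._lock:                # type tag and data change atomically
            self._type = type(value)
            self._data = value

    def read(self):
        with self._lock:                # reader always sees a consistent pair
            return self._type, self._data

item = Item()
stop = threading.Event()
torn = []                               # records any tag/data mismatches

def writer(values):
    while not stop.is_set():
        for v in values:
            item.write(v)

def checker(n):
    for _ in range(n):
        t, d = item.read()
        if type(d) is not t:            # a "torn" read: int tag, string data
            torn.append((t, d))
    stop.set()

agents = [threading.Thread(target=writer, args=([1, 2, 3],)),
          threading.Thread(target=writer, args=(["a", "b", "c"],)),
          threading.Thread(target=checker, args=(200_000,))]
for a in agents: a.start()
for a in agents: a.join()
print("torn reads:", len(torn))   # → torn reads: 0
```

With the lock removed from `read`/`write`, the checker could in principle observe a mismatched tag/data pair; that is exactly the inconsistency the item locking system prevents, at the cost discussed below.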

The agents run in the test parallel engine, which is currently configured to run on 4 processors. On a 4-CPU machine, you'll see the two generator agents running at full speed; on a 2-CPU machine, one of the agents will be preempted by the check agent, which will then block in order to write the data to I/O. In the future, we can introduce some scheduling optimizations so that agents are swapped out of VM processors during I/O waits while calculating agents progress in the meanwhile (we already use this strategy in our 0.9.x garbage collection system), but for now, consider this test as showing the full potential of this technique on a 4-CPU machine only.

This script runs "start to stop" in approximately 3.5 seconds on my test machine (an old HP Intel 2.4G x4, 32-bit). This includes a lot of things we don't really want to measure now, such as the compiler, module serializer, etc.

These are the time results on an average run:
real 0m3.416s
user 0m5.020s
sys 0m0.160s

As we don't have any GC in place yet (which is parallel anyhow and should not impact performance negatively), the script generates some 300M strings in memory; this heavily stresses the system memory manager, and it's likely that with a GC as good as the one we have in the 0.9 version, this time could drop by some 20%.

A Lua script doing roughly something equivalent is the following:

real 0m2.112s
user 0m2.108s
sys 0m0.000s

Notice that the first loop, the one dealing just with numbers, takes only 0.3s thanks to an excellent optimization in the Lua engine, something we can have as well; so it should probably be taken out of the equation for the purposes of the opinion I am about to ask for.

The performance difference is not so impressive when both scripts use two string-generating loops.

(Notice also that increasing the print rate by reducing the %1000 to %100, or %50 on some systems, just kills the Lua version.)
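Neither benchmark script survives in this repost. As a hedged sketch only, the shape of the loop being measured — one short-lived string allocated per iteration, with output throttled by the %1000 modulus mentioned above — is roughly the following; the function name and counts are illustrative, not taken from the original scripts.

```python
def string_loop(n, print_every=1000):
    """Sketch of the benchmark's inner loop: allocation-dominated work,
    with I/O throttled by a modulus (the %1000 discussed above)."""
    printed = 0
    for i in range(n):
        s = "value " + str(i)       # one short-lived string per iteration
        if i % print_every == 0:    # lowering this (e.g. %100) shifts the
            printed += 1            # cost toward I/O
    return printed

print(string_loop(100_000))  # → 100
```

The point of the shape is that with a high modulus the benchmark measures string allocation and the VM loop, while with a low modulus it degenerates into an I/O test — which is why reducing the modulus "kills" the serial version.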

Ok, now you have all the parameters to help me out in this decision.

As it is now, the engine looks to be some 50% slower than Lua on serial problems; OTOH, on parallel problems it can easily beat Lua, at times even by a great deal, thanks to real parallelism in I/O and calculus. In real-world situations, e.g. high-order math done with math libraries, we should have a LONG leading edge. OTOH, on plain calculus-intensive, single-streamlined script loops, the engine is roughly 25% to 75% slower than Lua.

The point is that removing the item lock feature, that is, the feature enabling concurrently visible read-writes of items, makes the engine consistently 50% faster. Yep, putting it on par with Lua on serial loops too. And we have no optimizations in place yet, and there are MANY extra things in the engine that can get faster in future releases. In other words, without the parallel support, we could really be playing on par with the fastest scripting language EVEN in its own field, and EVEN if our primary concern is NOT the raw speed of the VM loop (but OTHER kinds of speed performance, e.g. the performance of application code sharing data with our engine).

We have mainly two choices, and possibly a third:
1) Create two versions of the engine, one with the Parallel() stuff turned off: this would give programmers who need a parallel environment (e.g. in robotics) a solid platform to work on, while giving general scripters (e.g. simple game AIs) more raw power.
2) OTOH, forcing parallel always on would require third-party module writers to ALWAYS take the parallelization of the engine into account and write thread-safe modules. If they do not, we might end up with a bunch of modules that simply wouldn't work in the parallel version.
3) The third option would be forfeiting the visibility rule that allows "a=1" on shared variables to be seen consistently across all agents. As we do now with the Threading module, each agent would have a complete copy of its own memory, and there would be streamlined communication channels to send stuff across agents.
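A minimal sketch of this third model — each agent owning its memory outright, with data crossing only through channels — might look like the following in Python, using `queue.Queue` as a stand-in for the streamlined channels. The names here are illustrative, not the Threading module's actual API.

```python
import threading, queue

# Sketch of option 3: no shared items, no item locks. Each agent keeps
# private state, and only copies of values travel through the channel.
channel = queue.Queue()          # stand-in for a communication channel

def producer():
    local_total = 0              # private to this agent; never shared
    for i in range(10):
        local_total += i
        channel.put(("partial", local_total))  # a copy crosses the channel
    channel.put(("done", local_total))

def consumer(results):
    while True:
        tag, value = channel.get()
        if tag == "done":        # final value received; agent terminates
            results.append(value)
            return

results = []
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(results,))]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # → [45]
```

The trade-off is exactly the one stated above: no per-item locking cost on any access, but also no consistent shared-variable visibility — agents can only communicate by sending.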

Notice that even in case 1, the engine stays intrinsically MT, and as such it can be used proficiently in MT applications, and used asynchronously, even if the scripts themselves wouldn't be able to exploit this asynchronism.

As a side note, I have already taken several steps to experiment with item locking done only when strictly needed. For instance, I tested a version where item locking is done ONLY on property access (the case used in the sample script), and times are roughly the same, proving that at least 95% of the 50% performance drop when using item locking is due to the read-write concurrency on the accessed properties themselves.

I also entertained the possibility of starting to use item locking only after a Parallel.launch() command has been issued, but I haven't found any reasonable way to do that efficiently enough atm.

So, if any brilliant mind is listening and willing to take the challenge, I am eager to have your POV and suggestions. And if you can, also your coding hands.


