Page 1 of 1

Threading and Garbage Collection - 360 Revisited

Posted: Fri Oct 19, 2012 6:06 pm
by Ninja
Hey Ross, moved this question/exploration into the general forum area taking note that you frequent this board more often.
Put up your response to the question and the below I addressed any additional information that might or might not be pertinent.
The engine is already fully multithreaded; check out the Internal Multithreading documentation here on codeplex for a minimal demonstration of setting it up (it amounts to calling Space.ThreadManager.AddThread).

At a glance, the ParallelTasks library appears to have a fairly similar forloop implementation (which is all that really matters in BEPUphysics). I'd guess that they would perform pretty similarly once ParallelTasks was tuned for the simulation, though the built-in specialized one may have the advantage of not trying to do as much feature-wise. The built-in loop threading is used over the .NET 4.0 Task Parallel Library on the PC for a similar reason.

(I should mention that measuring the performance difference between these optimized libraries in real situations can be quite tricky- the speed differences caused by different decent thread managers are very nearly nonexistent compared to the cost of whatever they're computing.)

As for garbage, the upscaling of some lists when the simulation expands is a known source. However, this should be a very small quantity (on the order of kilobytes), and it's a one time cost. Further, it can be almost entirely prevented by initializing the various pools to match the demands of the simulation beforehand. This is rarely worth the development effort, though.

You will, however, see more than just a few kilobytes being allocated when the simulation ramps up. This is normal- allocations are not always garbage; the vast majority is data absolutely required for the simulation to function. It won't be garbage collected in the future (unless you get rid of the simulation entirely).

If you see the simulation itself producing garbage persistently or in large quantities, please report it with a debuggable project that reproduces the issue if possible.

By the way, I generally check the BEPUphysics forums more often than the issue tracker; it's a better place for discussions :)
First off, I hate poking holes or attempting to with something that is already *very* awesome and obviously you have a much greater insight to the full inner workings of the physics engine design...so some of my question and/or suggestions could be naive and/or just flat out wrong. My time is very limited as I am coding furiously to get a project finished due to the game company I worked for went out of business and other reasons which at this point places me in a position to try and generate revenue from building my own indie games.

So...if you run across anything below that is a non-issue and have the time it would be nice to know more about why it would not be an issue regarding the 360 and garbage collection as well as the threading implementaiton.

========================================================================
Regarding Multi-threading vs. Parallel processing:
========================================================================
Not sure if it was a miscommunication or if I am just missing something in the multithreading portion of Bepu, but it appears that each thread works on its own handling jobs/tasks assigned to them. From my brief initial look, it appears to be spreading specific tasks over the various processors, however the tasks in full are being pushed to a single processor as opposed to the task being split into say 4 individual tasks (data being processed split into 4 groups) and done at all the same time.

My discovery with the parallel tasks was based on my initial game engine design having some issues with large chunks of entities/objects having to be processed over time as well as, the most important part, the synchronization of the results of the data relative to the game engine loop. The bottom line, after some basic testing, I ended up redesigning a large portion of the entity processing areas such that the data being processed was broken into 4 equal (as possible) chunks at the start of a level, and then I just processed each chunk of data for specific results in a linear fashion based on what needed to be known/available to any "grouping" of data being processed.

The end result was that I saved about 3ms of processing time and had close to *zero* hits on any possible SpinLock calls. I also no longer had to worry about any form of synchronization of data as it was all being handles in a linear fashion (which also greatly reduced the complexity of the code base...but it could have just been my own error in design...although I have designed and redesigned a few engines with threading and/or added threading...so without spending a day or two combing over the threading code it could just be "the way it has to be".)

So, why I say all of this is that without Bepu physics enabled my garbage collection looks pretty much flat...as very low to no object creation/memory allocation happening during game play. With Bepu enabled and utilizing all 4 threads, I am finding that there are about 600+ objects being generated on average per frame, with additional allocations happening with the Threading.SpinLock calls. SpinLock pretty much creates a handle each call and for each thread hitting the region in question.

One good article on some of the issues with SpinLock and usage of:
http://www.c-sharpcorner.com/UploadFile ... g-C-Sharp/
With some good points on usage:
In general, while holding a spin lock, one should avoid any of the following actions.
• Blocking.
• Calling anything that itself may block.
• Holding more than one spin lock at once.
• Making dynamically dispatched calls (interface and virtuals).
• Making statically dispatched calls into any code one doesn't own.
• Allocating memory.
• Do not store SpinLock instances in readonly fields.

As an example, with Bepu Trunk (1.2) setup to use 4 CPUs as per documentation, 256 Sphere Entity shapes, 2 forcefields, what I see is around 2715 instances of BepuPhysics.Threading.SpinLock. This leads me to believe that a good chunk of the threading is breaking things up into "micro-tasks" to be performed over one of the 4 threads available. Although, I could be wrong about this... but without Bepu that number goes down to less than 50 per frame at any given time. The amount of memory consumed from this one object alone ends up being around 44k, which if tasks are being dynamically generated based on demand (if I am reading the code properly) would mean that it could be generating allocations and deallocations on a regular basis.

With 2715 instances of SpinLock, I think there is a lot of blocking happening that can consume upwards of close to 1ms+ of time depending upon the number of objects being processed at any time.

So, my original interest in the possibility of utilizing a more parallel approach...however I did spend a little bit of time looking at how one might do this and it looks like it would be a bit of a task to break things up in that fashion as the original design called for expanding and collapsing lists of objects (rightfully so most people would design it this way)... so it might just not be something that can be done.

========================================================================
Garbage Collection Suggestion:
========================================================================
I also noticed things like BEPUphysics.Constraints.Collision.ContactPenetrationConstraint seem to increase and decrease over time, which also led me to believe there might be several expanding and collapsing lists such as this implementation:
CollidableCollection : IList<Collidable>

Shawn H. has a good post on usage of lists and how they can impact performance:
http://blogs.msdn.com/b/shawnhar/archiv ... rvana.aspx

As a side note, my first pass to try and optimize my game engine was to just follow the above suggestions to the "T" as Shawn points out... and it worked... as in worked very well. Using the CLR profiler you can track down what is continuously allocating things on a regular basis...typically it is adding to Lists...but there can be a *double* hit for Type cast lists as well as using IList as opposed to just List due to the way IList instantiates as well as how C# handles accessing an IList vs List. Really, unless it was something that wasn't going to change hundreds of times potentially in less than a few frames, I would just resort to using *arrays* as opposed to lists and pre-allocating the arrays with the maximum number of entries expected...later I modified many of these arrays to arrays of 4 and would then use those 4 arrays (with index counters for each) to process all of the data at the same time over the 4 processors using the Parallel Tasks library (which yielded the best results from both a threading and garbage collection perspective).

So, it is just a semi-heads up (which you probably already know about) regarding some of the lists and usage of lists and how it impacts performance on the 360...pretty much any time you see a "new", unless it is part ofof the primary BEPUPhysics initialization process, it is a potential candidate to add to the 1MB trigger that forces a GC on the 360...and pretty much any time the garbage collection happens on the 360 you will 100% of the time get a small "hitch"/"stutter"...which is why Shawn pretty much says:
You can achieve good performance by avoiding allocations, so the garbage collector will never run. This works regardless of how complex your heap may be.

You can also achieve good performance by keeping your heap nice and simple, so the garbage collector will finish quickly. This works regardless of whether you allocate memory during gameplay.

But you cannot achieve good performance by mixing these two approaches! There is little to be gained by allocating only a medium amount of memory and having only a medium complex heap. That will produce a game which has a medium size glitch every medium number of seconds.

If your goal is to avoid garbage collection glitches altogether, you must pick just one path and follow it to the bitter end.
Again... I am poking not to point out flaws...because I couldn't even begin to contemplate the time spent thus far...and for the use of everyone...which I can only say *is greatly appreciated*...
But...I am at a point where I had to take my engine from 1024 "world entities" down to about 256 in order to avoid performance hits which seems to appear when there is allot of entities colliding and/or forces being applied, which as I looked at some of the culprits involved in the performance led me down to the inner bowels of the BEPU engine where many of the above issues came into mind as I looked at why there were a significant number of garbage collections happening while many things were colliding...well...I say many... during the time that I typed all of this I only got 109 garbage collections of which the total garbage collection time was about 5001ms...which means, on average, the garbage collection time consumes around ~45ms (which is actually skewed a bit due to my start up and garbage collections during that time...which I only call once after everything has been loaded and then let the GC handle the rest during runtime). With BEPU disabled I see like about 2-5 GCs in the same amount of time (i.e. about 10-15minutes).

Either case, if you haven't read Shawn's blog on the Garbage Collector and optimizations, it might be worth the time (pretty short)...and as you continue to provide the *awesome* BEPU improvements and updates you might keep some of the tips in mind.

Here are some additional reasons behind moving more towards such optimizations:
1.) Making those changes only further improves performance under Windows, Windows has a very fast GC and is not as big of a hit as it is on the 360, but you do see a noticable performance improvement after making said changes.
2.) Arranging data away from the traditional C# List and Type morphable classes and more to "almost C like" data arrays, the C# version of BEPU will already be more "Windows 8" friendly... :)
3.) By using the GC optimization techniques offered by Shawn H, it will only help to even further improve the performance of BEPU on the 360...bc GC on 360 = bad nasty hitches... :)


Anyway, these are the only things I noticed when using BEPU that have currently forced me to reduce my entity count down to about 1/4 the number that the engine can handle (at around 1024 entities without BEPU it ran around 5-6ms total update time but with BEPU physics it hit around 10-14ms at any given point and could hit 16ms+ if all or most of the entities were in relatively close proximity).

The system has no "world geometry" and keeps everything at a similar Y axis level, so collision is limited to only body to body and not world to body and body to body... so it shouldn't be too much of a load.

Perhaps there are additional optimizations I am missing, but I have tested this with the SemiSpeedy and SuperSpeedy settings, which did drop the over-all time down by about 1-2ms, but the majority of the time and visual performance hitches appears to do with the SpinLock and Garbage collection due to "a thousand paper cuts". :)

Either case, I will keep digging around as I would like to get it up to around 512 entities at a minimum and am going to look into ways to try to reduce the number of object creation per frame (600+ with Bepu running).

Hopefully some of this may shed new light on possible avenues and/or new paths to further improve BEPU...or perhaps you are already aware and just adjusting systems as you have time.

Most definitely...love what I see so far!

Cheers,

-N

Re: Threading and Garbage Collection - 360 Revisited

Posted: Sat Oct 20, 2012 12:31 am
by Norbo
Before the rest of this long post, I should point out the most important part:
I also noticed things like BEPUphysics.Constraints.Collision.ContactPenetrationConstraint seem to increase and decrease over time
This garbage is probably related to a known bug in v1.2.0 which is fixed in the development version. You can grab the development version here: http://bepuphysics.codeplex.com/SourceC ... evelopment

The engine is designed, intended, and expected to run without garbage collections. Any persistent deviation from this is likely just a bug that managed to sneak through testing.

[There are minor exceptions to this which I'll get into further below.]
From my brief initial look, it appears to be spreading specific tasks over the various processors, however the tasks in full are being pushed to a single processor as opposed to the task being split into say 4 individual tasks (data being processed split into 4 groups) and done at all the same time.
While there exists a task-based thread manager in BEPUphysics which does not attempt to split tasks, it is only there for legacy support. No system in BEPUphysics uses it. (Way back in the day when it was still used, tasks were created in a manner which distributed the load according to the thread count, but that's no longer required.)

Fortunately, BEPUphysics is essentially a sequence of for-loops. It's possible to specialize a threading system to handle for-loops more efficiently than general tasks. So, in all modern versions of BEPUphysics, the ParallelLoopManager is the primary threading workhorse.

It doesn't function at the level of tasks. Instead, it just gets told that there's an index-based loop, and those indices are shared amongst the available threads. Load balancing is basically automatic and nearly free- each worker grabs a chunk of the loop as needed using a quick little interlocked atomic operation.
I am finding that there are about 600+ objects being generated on average per frame, with additional allocations happening with the Threading.SpinLock calls. SpinLock pretty much creates a handle each call and for each thread hitting the region in question.
Using an already-existing SpinLock does not cause managed allocations. SpinLocks are kept around in their parent objects until the parent object (most frequently, directly or indirectly an entity) is collected. SpinLocks can be found all over the place protecting extremely brief structure accesses, but their count does not independently vary. If you see SpinLocks getting allocated, it is because some other object was created that happens to use a SpinLock.
This leads me to believe that a good chunk of the threading is breaking things up into "micro-tasks" to be performed over one of the 4 threads available.
Task splitting (or, as I should say, forloop chunk consumption) is performed without ever using SpinLocks and, as mentioned, is pretty much free.

The main usage of SpinLocks in the engine is in the Solver and is required by the algorithm. Trying to avoid those synchronization points introduces other issues- either nontrivial work to batch the simulation into parallel chunks or fundamentally different algorithms that avoid the same synchronization requirement (unfortunately, the ones I'm aware of have worse convergence). These approaches are generally found in environments where avoiding the mere possibility of lock contention is extremely valuable during the solving phase- like a GPU implementation, for example.

This synchronization does indeed introduce some overhead, but it is not as bad as you might expect. Contention is rare and brief. Note in those graphs that the solver-heavy simulations typically scale very well.

There are other areas where SpinLock is used in the engine, but they are insignificant to the total run time and are not a part of core calculations.

So, my original interest in the possibility of utilizing a more parallel approach...however I did spend a little bit of time looking at how one might do this and it looks like it would be a bit of a task to break things up in that fashion as the original design called for expanding and collapsing lists of objects (rightfully so most people would design it this way)... so it might just not be something that can be done.
So, the engine is, conceptually, already nearly* as parallel as it can possibly be given the used solver algorithm.

(*There are two Amdahlian problems in the limit: sequential stale object removal in the narrow phase, and to a lesser extent, deactivation-related fixed costs. These combined only absorb a fraction of a millisecond, but Amdahl's law is unforgiving. I'll probably try to tackle those later on when I'm working on a big serverside simulation, but neither of these is related to spinlock synchronization or task work distribution nor do they contribute significantly to performance deficits on the Xbox360 due to the low thread count.)

There are a few other areas where the threading implementation could be improved to extract a percent or two speedup, but those are micro-optimizations, not fundamental changes.
which also led me to believe there might be several expanding and collapsing lists such as this implementation:
CollidableCollection : IList<Collidable>
That particular collection is actually just a convenience wrapper around a privately held data collection. The CollidableCollection itself cannot resize and will not be responsible for any garbage. In fact, even newing up more instances of it is 'free' (heap-wise) as it is a value type.

(A new'd struct (or any value type) will not involve the heap- except for, of course, a struct explicitly stored on the heap by putting it in some heap-stored object :) This can be a bit weird at first for people coming from a C/C++ background.)

There are pools which will grow when the simulation increases in complexity, and the internal array may need to be resized to accommodate. This is a one-time cost and the resulting garbage is very tiny. The pools never try to shrink unless you tell them to, so they will not continually resize and create more garbage.

If you want, you can avoid the pool resizing cost by forcing the pools to adjust their capacities ahead of time. This isn't generally done, though, because the amount of garbage produced is essentially irrelevant- even on the Xbox.

Other non-pool lists will sometimes resize too, but again, this is relatively rare and tiny and is a necessary response to expanding simulation conditions.

(There is a case, however, where it is actually a good idea to pre-allocate certain types. If I create a simulation that I know will rapidly grow to 100,000 box-box collision pairs, I should call the NarrowPhaseHelper.Factories.BoxBox.EnsureCount(100000). This isn't to avoid garbage; it's to create the 100,000 pairs up front. Creating that many objects at run time can be noticeably expensive.)

To give you an idea of how all this expansion and resizing looks in practice, here's a snapshot of the CLR profiler result for the PlanetDemo in the latest development version. It's got 1000 boxes floating in space, occasionally interacting with each other. It sounds passingly similar to your configuration.
flat.JPG
flat.JPG (178.9 KiB) Viewed 9305 times
It jumps up at the beginning to handle the increased simulation complexity (there are several hundred collisions happening), and then it levels off once the simulation reaches its peak. From then on, there are only occasional additional minor allocations.

In fact, if you select the last 200 seconds of the simulation (all the boxes are still orbiting around and occasionally interacting), the simulation itself does not allocate a single byte.

While there are certainly a few resizings contained within that heap from the expansion phase, they are a matter of kilobytes. Even if this had been running on the Xbox360, it wouldn't be nearly enough to trigger a GC by itself.

[Digression: The conceptual construct of a 'List' is not really a performance issue. In performance-critical environments, the user must be aware of the internal array representation and its capacity, but this is relatively easy to deal with and- as shown- does not have to produce any meaningful quantity of garbage.

Further, ILists and other interfaces get a bad wrap for their foreach allocation behavior that gets undeservedly applied to other aspects of their usage and other List implementations. It's true that the enumerator returned by an IList will get boxed if the concrete enumerator was a struct (thus invalidating the concrete List enumerator's heaplessness and generating garbage), but merely indexing into an IList has no such problem. Sure, it will go through a layer of indirection and add a few cycles, but that only matters in performance critical areas.

That's why ILists are generally used in external APIs- it gives users convenience while allowing the internals of the library to grab and process the data into whatever speedy internal representation is required. If you look beneath the surface in BEPUphysics, you'll most frequently find a RawList hiding out behind any exposed list (ReadOnlyLists or ILists).

Those RawLists are BEPUphysics's micro-optimization for avoiding the overhead of recopying stored structures and allowing direct usage of array-stored values by ref-parametered math functions. Basically, the engine spends most of its time working directly with arrays already.]
Either case, if you haven't read Shawn's blog on the Garbage Collector and optimizations, it might be worth the time (pretty short)...and as you continue to provide the *awesome* BEPU improvements and updates you might keep some of the tips in mind.
Don't worry, I have delved into some of the deepest and most dangerous caves and studied the darkest arts in the pursuit of performance. The nittiest of gritties in memory management, threading, and optimization are no stranger to me ;)
Anyway, these are the only things I noticed when using BEPU that have currently forced me to reduce my entity count down to about 1/4 the number that the engine can handle (at around 1024 entities without BEPU it ran around 5-6ms total update time but with BEPU physics it hit around 10-14ms at any given point and could hit 16ms+ if all or most of the entities were in relatively close proximity).
1024 is a lot of physics entities for the Xbox360 to handle under any circumstances. The Xbox360 running under the compact framework is an order of magnitude slower (or worse) than even a fairly low end PC. The power of Xenon's VMX instructions are locked away from managed developers and the CF CLR/JIT is pretty poor compared to the desktop version. It's also hardware from the year 2005 :)

Updating versions to get rid of any garbage-related bugs will stop the spikes, but that won't help the regular update times.

[For reference, 1000 boxes in the PlanetDemo with around 800-900 collision pairs takes around 16 milliseconds per update on the Xbox360. So, it sounds like the values you see are fairly reasonable. Unfortunately, it's just not possible for the Xbox360 to come near the 27,000 boxes that my 3770K can do in even less time :)]
Perhaps there are additional optimizations I am missing, but I have tested this with the SemiSpeedy and SuperSpeedy settings, which did drop the over-all time down by about 1-2ms
The superspeedy settings aren't all that helpful unless the simulation is solver bottlenecked. If it's a bunch of sparse objects that rarely collide (and when they do, the solution is trivial), there isn't really one single place which is easily targeted for tuning-based optimizations.

There are probably some other approaches you could use to speed it up (e.g. 30hz physics update rate plus interpolation or some simulation organization tricks), but I can't offer much in the way of specifics without knowing more about the simulation.