I've put your response to the question up top, and below it I've addressed any additional information that might or might not be pertinent.
First off, I hate poking holes (or attempting to) in something that is already *very* awesome, and obviously you have a much greater insight into the full inner workings of the physics engine design... so some of my questions and/or suggestions could be naive and/or just flat out wrong. My time is very limited, as I am coding furiously to get a project finished because the game company I worked for went out of business (among other reasons), which at this point places me in a position where I need to generate revenue by building my own indie games.

The engine is already fully multithreaded; check out the Internal Multithreading documentation here on CodePlex for a minimal demonstration of setting it up (it amounts to calling Space.ThreadManager.AddThread).
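For reference, a minimal sketch of that setup, assuming a typical v1-style Space (the Internal Multithreading documentation is the authoritative source; details may vary between BEPUphysics versions):

```csharp
using System;
using BEPUphysics;

// Minimal sketch of the internal multithreading setup, based on the
// Space.ThreadManager.AddThread call mentioned above.
Space space = new Space();

// Add one worker thread per hardware thread available on the machine.
for (int i = 0; i < Environment.ProcessorCount; i++)
    space.ThreadManager.AddThread();
```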
At a glance, the ParallelTasks library appears to have a fairly similar for-loop implementation (which is all that really matters in BEPUphysics). I'd guess that they would perform pretty similarly once ParallelTasks was tuned for the simulation, though the built-in specialized one may have the advantage of not trying to do as much feature-wise. The built-in loop threading is used over the .NET 4.0 Task Parallel Library on the PC for a similar reason.
(I should mention that measuring the performance difference between these optimized libraries in real situations can be quite tricky: the speed differences caused by different decent thread managers are very nearly nonexistent compared to the cost of whatever they're computing.)
As for garbage, the upscaling of some lists when the simulation expands is a known source. However, this should be a very small quantity (on the order of kilobytes), and it's a one-time cost. Further, it can be almost entirely prevented by initializing the various pools to match the demands of the simulation beforehand. This is rarely worth the development effort, though.
You will, however, see more than just a few kilobytes being allocated when the simulation ramps up. This is normal: allocations are not always garbage; the vast majority is data absolutely required for the simulation to function. It won't be garbage collected in the future (unless you get rid of the simulation entirely).
If you see the simulation itself producing garbage persistently or in large quantities, please report it with a debuggable project that reproduces the issue if possible.
By the way, I generally check the BEPUphysics forums more often than the issue tracker; it's a better place for discussions.
So... if you run across anything below that is a non-issue, and you have the time, it would be nice to know more about why it is not an issue with respect to the 360, garbage collection, and the threading implementation.
========================================================================
Regarding Multi-threading vs. Parallel processing:
========================================================================
Not sure if it was a miscommunication or if I am just missing something in the multithreading portion of Bepu, but it appears that each thread works on the jobs/tasks assigned to it. From my brief initial look, it appears to be spreading specific tasks over the various processors; however, each task in full is being pushed to a single processor, as opposed to the task being split into, say, 4 individual tasks (the data being processed split into 4 groups) that all run at the same time.
My discovery with parallel tasks was based on my initial game engine design having some issues with large chunks of entities/objects having to be processed over time, as well as (the most important part) the synchronization of the results of the data relative to the game engine loop. The bottom line: after some basic testing, I ended up redesigning a large portion of the entity processing areas such that the data being processed was broken into 4 chunks (as equal as possible) at the start of a level, and then I just processed each chunk of data for specific results in a linear fashion, based on what needed to be known/available to any "grouping" of data being processed.
The end result was that I saved about 3ms of processing time and had close to *zero* hits on any possible SpinLock calls. I also no longer had to worry about any form of synchronization of data, as it was all being handled in a linear fashion (which also greatly reduced the complexity of the code base... but it could have just been my own error in design... although I have designed and redesigned a few engines with threading and/or added threading... so without spending a day or two combing over the threading code, it could just be "the way it has to be").
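For illustration, a rough sketch of that chunking scheme (Entity and its Update method here are hypothetical placeholders for my engine's types, not BEPU's):

```csharp
using System;

// Hypothetical sketch: split the level's entities into 4 fixed chunks once
// at level load, then process each chunk on its own thread with no locking.
const int ChunkCount = 4;
Entity[][] chunks = new Entity[ChunkCount][];

void BuildChunks(Entity[] allEntities)
{
    // Chunks are built once per level, so this allocation never recurs
    // during gameplay.
    int chunkSize = (allEntities.Length + ChunkCount - 1) / ChunkCount;
    for (int i = 0; i < ChunkCount; i++)
    {
        int start = i * chunkSize;
        int count = Math.Max(Math.Min(chunkSize, allEntities.Length - start), 0);
        chunks[i] = new Entity[count];
        Array.Copy(allEntities, start, chunks[i], 0, count);
    }
}

void UpdateChunk(Entity[] chunk)
{
    // Each worker owns its chunk exclusively, so no SpinLock is needed;
    // results are merged only after all chunks finish (a simple barrier).
    for (int i = 0; i < chunk.Length; i++)
        chunk[i].Update();
}
```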
So, the reason I say all of this is that without Bepu physics enabled, my garbage collection looks pretty much flat... very low to no object creation/memory allocation happening during gameplay. With Bepu enabled and utilizing all 4 threads, I am finding that there are about 600+ objects being generated on average per frame, with additional allocations happening with the Threading.SpinLock calls. SpinLock pretty much creates a handle on each call and for each thread hitting the region in question.
One good article covers some of the issues with SpinLock and its usage:
http://www.c-sharpcorner.com/UploadFile ... g-C-Sharp/
With some good points on usage:
In general, while holding a spin lock, one should avoid any of the following actions.
• Blocking.
• Calling anything that itself may block.
• Holding more than one spin lock at once.
• Making dynamically dispatched calls (interface and virtuals).
• Making statically dispatched calls into any code one doesn't own.
• Allocating memory.
• Storing SpinLock instances in readonly fields.
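To illustrate the readonly-field point in particular: the .NET 4 System.Threading.SpinLock is a *mutable struct*, so a readonly field hands each caller a defensive copy and silently destroys mutual exclusion. The canonical usage pattern looks like this (note this is the BCL SpinLock, not Bepu's internal BEPUphysics.Threading.SpinLock):

```csharp
using System.Threading;

class Counter
{
    // Must NOT be readonly: SpinLock is a mutable struct, and a readonly
    // field would give every caller its own copy, so no locking occurs.
    private SpinLock spinLock = new SpinLock();
    private int value;

    public void Increment()
    {
        bool lockTaken = false;
        try
        {
            spinLock.Enter(ref lockTaken);
            value++; // Keep the protected region tiny: no allocation,
                     // no virtual/interface calls, nothing that can block.
        }
        finally
        {
            if (lockTaken) spinLock.Exit();
        }
    }
}
```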
As an example, with the Bepu trunk (1.2) set up to use 4 CPUs per the documentation, 256 sphere entity shapes, and 2 force fields, what I see is around 2715 instances of BepuPhysics.Threading.SpinLock. This leads me to believe that a good chunk of the threading is breaking things up into "micro-tasks" to be performed on one of the 4 available threads. Although I could be wrong about this... without Bepu, that number goes down to fewer than 50 per frame at any given time. The amount of memory consumed by this one object alone ends up being around 44k, which, if tasks are being dynamically generated based on demand (if I am reading the code properly), would mean that it could be generating allocations and deallocations on a regular basis.
With 2715 instances of SpinLock, I think there is a lot of blocking happening, which can consume upwards of 1ms of time depending on the number of objects being processed at any given moment.
Hence my original interest in the possibility of utilizing a more parallel approach... However, I did spend a little bit of time looking at how one might do this, and it looks like it would be a bit of a task to break things up in that fashion, as the original design calls for expanding and collapsing lists of objects (rightfully so; most people would design it this way)... so it might just not be something that can be done.
========================================================================
Garbage Collection Suggestion:
========================================================================
I also noticed that things like BEPUphysics.Constraints.Collision.ContactPenetrationConstraint seem to increase and decrease in count over time, which also led me to believe there might be several expanding and collapsing lists, such as this implementation:
CollidableCollection : IList<Collidable>
Shawn H. has a good post on the usage of lists and how they can impact performance:
http://blogs.msdn.com/b/shawnhar/archiv ... rvana.aspx
As a side note, my first pass at optimizing my game engine was to just follow the above suggestions to the "T", as Shawn points out... and it worked... as in, worked very well. Using the CLR Profiler, you can track down what is continuously allocating on a regular basis... typically it is adding to Lists... but there can be a *double* hit for type-cast lists, as well as for using IList as opposed to just List, due to the way an IList is instantiated and how C# handles accessing an IList vs. a List. Really, unless it was something that wasn't going to change potentially hundreds of times in less than a few frames, I would just resort to using *arrays* as opposed to lists, pre-allocating the arrays with the maximum number of entries expected. Later I modified many of these arrays into sets of 4, and would then use those 4 arrays (with index counters for each) to process all of the data at the same time over the 4 processors using the Parallel Tasks library (which yielded the best results from both a threading and a garbage collection perspective).
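A hypothetical sketch of that array-based approach (the names and sizes here are illustrative, not from my actual engine): one fixed-size array per worker, with an index counter playing the role of List&lt;T&gt;.Add, so no per-frame allocation (and thus no GC pressure) ever occurs.

```csharp
// Pre-allocate everything once; per-frame "adds" and "clears" are just
// counter updates, so the GC never sees any new objects during gameplay.
const int MaxEntriesPerWorker = 256;
const int WorkerCount = 4;

float[][] results = new float[WorkerCount][];
int[] resultCounts = new int[WorkerCount];

void Initialize()
{
    for (int i = 0; i < WorkerCount; i++)
        results[i] = new float[MaxEntriesPerWorker]; // allocated once, up front
}

void AddResult(int worker, float value)
{
    // Equivalent of List<float>.Add, but with zero allocation:
    results[worker][resultCounts[worker]++] = value;
}

void BeginFrame()
{
    // "Clearing" the arrays is just resetting the counters.
    for (int i = 0; i < WorkerCount; i++)
        resultCounts[i] = 0;
}
```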
So, this is just a semi-heads-up (which you probably already know about) regarding some of the lists, their usage, and how they impact performance on the 360. (Again, I am poking not to point out flaws... because I couldn't even begin to contemplate the time spent thus far... and it's for the use of everyone... which I can only say *is greatly appreciated*.) Pretty much any time you see a "new", unless it is part of the primary BEPUphysics initialization process, it is a potential candidate to add toward the 1MB allocation threshold that forces a GC on the 360... and pretty much any time garbage collection happens on the 360, you will 100% of the time get a small "hitch"/"stutter"... which is why Shawn pretty much says:
You can achieve good performance by avoiding allocations, so the garbage collector will never run. This works regardless of how complex your heap may be.
You can also achieve good performance by keeping your heap nice and simple, so the garbage collector will finish quickly. This works regardless of whether you allocate memory during gameplay.
But you cannot achieve good performance by mixing these two approaches! There is little to be gained by allocating only a medium amount of memory and having only a medium complex heap. That will produce a game which has a medium size glitch every medium number of seconds.
If your goal is to avoid garbage collection glitches altogether, you must pick just one path and follow it to the bitter end.
But... I am at a point where I had to take my engine from 1024 "world entities" down to about 256 in order to avoid the performance hits that seem to appear when there are a lot of entities colliding and/or forces being applied. Looking at some of the culprits led me down into the inner bowels of the BEPU engine, where many of the above issues came to mind as I examined why a significant number of garbage collections were happening while many things were colliding. Well... I say many... during the time I typed all of this, I got 109 garbage collections, with a total garbage collection time of about 5001ms... which means that, on average, each garbage collection consumes around ~45ms (this is actually skewed a bit by startup and the collections during that time... I only force a collection once after everything has been loaded and then let the GC handle the rest during runtime). With BEPU disabled, I see about 2-5 GCs in the same amount of time (i.e. about 10-15 minutes).
In either case, if you haven't read Shawn's blog on the garbage collector and optimizations, it might be worth the time (it's pretty short)... and as you continue to provide the *awesome* BEPU improvements and updates, you might keep some of the tips in mind.
Here are some additional reasons behind moving more towards such optimizations:
1.) Making those changes only further improves performance under Windows. Windows has a very fast GC, so the hit is not as big as it is on the 360, but you do see a noticeable performance improvement after making said changes.
2.) By arranging data away from the traditional C# List and type-morphable classes and more toward "almost C-like" data arrays, the C# version of BEPU will already be more "Windows 8" friendly...
3.) Using the GC optimization techniques offered by Shawn H. will only help to further improve the performance of BEPU on the 360... because GC on the 360 = bad, nasty hitches...
Anyway, these are the only things I noticed when using BEPU that have currently forced me to reduce my entity count to about 1/4 of the number the engine can handle (at around 1024 entities without BEPU, it ran at around 5-6ms total update time, but with BEPU physics it hit around 10-14ms at any given point, and could hit 16ms+ if all or most of the entities were in relatively close proximity).
The system has no "world geometry" and keeps everything at a similar Y-axis level, so collision is limited to body-to-body only, not world-to-body plus body-to-body... so it shouldn't be too much of a load.
Perhaps there are additional optimizations I am missing, but I have tested this with the SemiSpeedy and SuperSpeedy settings, which did drop the overall time by about 1-2ms; however, the majority of the time, as well as the visual performance hitches, appears to be due to the SpinLock and garbage collection ("a thousand paper cuts").
In either case, I will keep digging around, as I would like to get it up to around 512 entities at a minimum, and I am going to look into ways to reduce the number of objects created per frame (600+ with Bepu running).
Hopefully some of this may shed new light on possible avenues and/or new paths to further improve BEPU... or perhaps you are already aware and are just adjusting systems as you have time.
Most definitely...love what I see so far!
Cheers,
-N