I don't care very much about rotation. Would adding a special case for spheres that don't rotate help performance significantly?
Very little, most likely. You can set an entity's LocalInertiaTensorInverse to a zero matrix, which disallows rotation. The integration math is still performed, but having infinite angular inertia can simplify the solving process a little.
You could also try lowering the solver's iteration limit (Space.Solver.IterationLimit). It will help a little; if it's a bunch of isolated spheres on the ground, it's already early-outing really fast, but you may be able to save an iteration or two here and there.
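To illustrate why a zero inverse inertia tensor stops rotation, here is a minimal sketch in Python. It is not BEPUphysics code: `apply_angular_impulse` and both constants are hypothetical names made up for the example. It assumes the usual rigid body update, where the change in angular velocity is the inverse inertia tensor times the angular impulse, so a zero matrix always yields a zero change.

```python
# Sketch only: hypothetical helper, not part of BEPUphysics.

def apply_angular_impulse(angular_velocity, inverse_inertia, angular_impulse):
    """Return the angular velocity after an impulse: w' = w + I^-1 * L."""
    delta = [
        sum(inverse_inertia[row][col] * angular_impulse[col] for col in range(3))
        for row in range(3)
    ]
    return [w + d for w, d in zip(angular_velocity, delta)]


# A zero inverse inertia tensor behaves like infinite angular inertia.
ZERO_INVERSE_INERTIA = [[0.0, 0.0, 0.0] for _ in range(3)]

# Inverse inertia of a solid sphere with mass 1 and radius 1: 1 / (2/5) = 2.5.
UNIT_SPHERE_INVERSE_INERTIA = [
    [2.5, 0.0, 0.0],
    [0.0, 2.5, 0.0],
    [0.0, 0.0, 2.5],
]

impulse = [0.0, 1.0, 0.0]
spinning = apply_angular_impulse([0.0, 0.0, 0.0], UNIT_SPHERE_INVERSE_INERTIA, impulse)
locked = apply_angular_impulse([0.0, 0.0, 0.0], ZERO_INVERSE_INERTIA, impulse)
print(spinning, locked)
```

The multiplication still runs in both cases, which matches the point above that the integration math is performed either way.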
Is the sphere-terrain collision expensive -- maybe I could start optimizing there?
It's not particularly expensive. The primary consideration is the number of triangles that are possibly colliding with a single object. If a single sphere overlaps 40 triangles, it's obviously going to be slower than if it's only overlapping 2. Secondarily, the configuration of collisions matters. If the plane of the triangle is found to be the colliding feature, the collision test goes extremely fast. If it has to resort to boundary tests, it may slow down a bit. It should still go pretty fast due to the simplicity of the involved shapes. The worst case for this sort of issue is a terrain composed of a bunch of spikes.
The convex-triangle special case is fast enough in general that there is not currently a sphere-triangle special case, though there may be in the future. The difference wouldn't be gigantic.
With spheres, the issue of triangle boundary bumps is mostly eliminated. Depending on your game, you may be able to turn off terrain.ImproveBoundaryBehavior. Having that enabled allows boxes and other shapes to slide smoothly over edges and vertices based on their connectivity with adjacent triangles. It won't be a huge performance win, but it is worth trying out.
I'm willing to give up some detection precision. Mostly these spheres will not be moving too quickly. Maybe I should cap the number of sphere-cast steps. I care very little about time of impact precision. Maybe TOI sorting is costing a lot?
No TOI sorting is used. CCD should have extremely low overhead in most cases since it only activates when the relative velocities put objects in danger of tunnelling.
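As a rough picture of why a velocity-gated CCD is cheap, here is a minimal Python sketch. This is an assumption about the general idea, not the engine's actual test: `needs_continuous_detection` is a hypothetical name, and the threshold (relative motion in one step exceeding the smaller radius) is just one common heuristic.

```python
import math


def needs_continuous_detection(relative_velocity, dt, radius_a, radius_b):
    """Return True when one step of relative motion could skip past the smaller sphere.

    Sketch of a common heuristic only; the real engine's criterion may differ.
    """
    distance_moved = math.sqrt(sum(v * v for v in relative_velocity)) * dt
    return distance_moved > min(radius_a, radius_b)


dt = 1.0 / 60.0
# Slow spheres: about 0.017 units of motion per step against a radius of 0.5.
print(needs_continuous_detection([1.0, 0.0, 0.0], dt, 0.5, 0.5))
# Fast spheres: about 1.67 units of motion per step against a radius of 0.5.
print(needs_continuous_detection([100.0, 0.0, 0.0], dt, 0.5, 0.5))
```

For slow-moving spheres, a gate like this rejects nearly every pair with one square root and one comparison, so the expensive sweep would rarely run.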
I'm not familiar with your broadphase technique. It seems that about 80 active spheres take about 20ms to update on the Xbox (obviously this depends greatly on their positioning). I might be able to get away with that if I can also have ~200-300 inactive spheres at the same time. I'm off to go test that next.
With only 80 objects, it's extremely unlikely that the broad phase is the bottleneck. In addition, the broad phase will still do almost all of its work even if objects are inactive because the broad phase is responsible for triggering the sequence which can result in the activation of entities.
If you're interested, there are a few new broad phases currently in testing.
-One is an extremely simple traditional 1D SAP (SortAndSweep1D) which can beat the current DBVH by quite a bit in certain configurations (primarily when object density is sparse). It does not support any accelerated queries, though.
-Another is a hybridized 2D grid + SAP (Grid2DSortAndSweep) which has a slightly higher baseline overhead, but eliminates most of the 1D SAP's worst-case configuration issues. It beats DBVH quite handily most of the time and supports easy parallelization. It conceptually supports queries, though they are not implemented yet (and when they are, they will be slower than the DBVH's).
-The DBVH is scheduled for a rewrite. The current version is a bit messy and a memory hog.
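For anyone unfamiliar with the 1D SAP idea, here is a minimal Python sketch of the general technique. It is not the SortAndSweep1D source: `sort_and_sweep_1d` and the box tuple layout are made up for illustration. Boxes are sorted by their minimum x, and each box is only compared against later boxes until one starts past its maximum x.

```python
def sort_and_sweep_1d(boxes):
    """Find overlapping axis-aligned boxes using a sort and sweep along x.

    boxes: list of (box_id, (min_x, min_y, min_z), (max_x, max_y, max_z)).
    Returns a set of (smaller_id, larger_id) pairs.
    """
    ordered = sorted(boxes, key=lambda box: box[1][0])
    pairs = set()
    for i, (id_a, min_a, max_a) in enumerate(ordered):
        for id_b, min_b, max_b in ordered[i + 1:]:
            if min_b[0] > max_a[0]:
                # Sorted by min x, so no later box can overlap box a on x.
                break
            # The x intervals overlap; confirm on the remaining two axes.
            if all(min_a[k] <= max_b[k] and min_b[k] <= max_a[k] for k in (1, 2)):
                pairs.add((min(id_a, id_b), max(id_a, id_b)))
    return pairs


boxes = [
    (0, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    (1, (0.5, 0.5, 0.5), (1.5, 1.5, 1.5)),  # overlaps box 0
    (2, (5.0, 0.0, 0.0), (6.0, 1.0, 1.0)),  # far away on x
    (3, (0.2, 9.0, 0.0), (0.8, 10.0, 1.0)),  # overlaps box 0 on x only
]
print(sort_and_sweep_1d(boxes))
```

When objects are spread out along the sort axis, the inner loop breaks almost immediately. When many objects stack up at the same x, as box 3 does here, the sweep has to test pairs that turn out not to overlap, which is presumably the kind of configuration the grid hybrid is meant to handle better.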
All of the new stuff can be found in the development fork (though beware bugs):
http://bepuphysics.codeplex.com/SourceC ... evelopment
By the way, I don't know how your whole architecture is set up, but you can get some pretty significant gains from using internal multithreading. A fork-and-join type approach where the physics engine has full access to the Xbox's 3 cores will take the per-update time down a lot, though it means other compute-heavy systems can't be running on those threads at the same time. It also simplifies the access patterns a lot since you don't have to worry about reading corrupt data mid-update.
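The fork-and-join shape can be sketched as follows in Python. This only shows the control flow, not the engine's threading code: `integrate_chunk` and `update` are hypothetical names, the work is a toy 1D position integration, and Python threads will not give the speedup that native threads on separate cores would.

```python
from concurrent.futures import ThreadPoolExecutor


def integrate_chunk(positions, velocities, dt):
    """Advance one slice of 1D positions by one time step."""
    return [p + v * dt for p, v in zip(positions, velocities)]


def update(positions, velocities, dt, worker_count=3):
    """Fork the work across workers, then join before returning."""
    # Ceiling division so every element lands in some chunk.
    chunk = max(1, -(-len(positions) // worker_count))
    ranges = [(start, start + chunk) for start in range(0, len(positions), chunk)]
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # Fork: each worker gets its own slice and shares no mutable state.
        futures = [
            pool.submit(integrate_chunk, positions[a:b], velocities[a:b], dt)
            for a, b in ranges
        ]
        # Join: block until every slice is done.
        results = [future.result() for future in futures]
    return [p for part in results for p in part]


print(update([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], 0.5))
```

Because `update` does not return until every worker has finished, the caller never sees a half-updated state, which is the access pattern benefit described above.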
EDIT:
I tested BEPU with a much flatter terrain, and updates took 30ms with 392 spheres. Holy crap, that's awesome! So BEPU clearly wins in the fight against my simulator. I'll start transitioning everything over to BEPU in the next day or two.
Could I ask what the original terrain test looked like? I may be able to use it to do additional optimizing in the future.