Saturday, Sep 19, 2015

Blazing Fast Trees

The prototype for the first big piece of BEPUphysics v2.0.0 is pretty much done: a tree.

This tree will (eventually) replace all the existing trees in BEPUphysics and act as the foundation of the new broad phase.

So how does the current prototype compare with v1.4.0's broad phase?

It's a lot faster.

The measured 'realistic' test scene includes 65536 randomly positioned cubic leaves ranging from 1 to 100 units across, with leaf size given by 1 + 99 * X^10, where X is a uniform random value from 0 to 1. In other words, there are lots of smaller objects and a few big objects, and the average size is 10 units. All leaves are moving in random directions with speeds given by 10 * X^10, where X is a uniform random value from 0 to 1, and they bounce off the predefined world bounds (a large cube) so that they stay in the same volume. The number of overlaps ranges between 65600 and 66300.
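As a quick check of that average (not part of the original measurements), the mean and median of the size formula for a uniform X work out as follows:

```latex
\mathbb{E}\left[1 + 99X^{10}\right] = 1 + 99\int_0^1 x^{10}\,dx = 1 + \frac{99}{11} = 10
\qquad
\operatorname{median} = 1 + 99 \cdot 0.5^{10} \approx 1.1
```

So roughly half of the leaves are barely larger than 1 unit, and the mean is pulled up to 10 by the rare big ones.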

Both simulations are multithreaded with 8 threads on a 3770K @ 4.5GHz. Notably, the benchmarking environment was not totally clean. The small spikes visible in the new implementation do not persist between runs and are just other programs occasionally interfering.

So, the first obvious thing you might notice is that the old version spikes like crazy. Those spikes were a driving force behind this whole rewrite. What's causing them, and how bad can they get?

 

The answers are refinement and really bad. Each one of those spikes represents a reconstruction of part of the tree which has expanded beyond its optimal size. Those reconstructions aren't cheap, and more importantly, they are unbounded. If a reconstruction starts near the root, it may force a reconstruction of large fractions of the tree. If you're really unlucky, it will be so close to the root that the main thread has to do it. In the worst case, the root itself might get reconstructed- see that spike on frame 0? The graph is actually cut off; it took 108ms. While a full root reconstruction usually only happens on the first frame, the other reconstructions are clearly bad enough. These are multi-frame spikes that a user can definitely notice if they're paying attention. Imagine how that would feel in VR.

To be fair to the old broad phase, this test is a lot more painful than most simulations. The continuous divergent motion nearly maximizes the amount of reconstruction required. 

But there's something else going on, and it might be even worse. Notice that slow upward slope in the first graph? The new version doesn't have it at all, so it's not a property of the scene itself. What does the tree quality look like?

 

This graph represents the computed cost of the tree. If you've heard of surface area heuristic tree builders in raytracing, this is basically the same thing except the minimized metric is volume instead of surface area. (Volume queries and self collision tests have a probability of overlap proportional to volume; ray-AABB intersection probability is proportional to surface area. They usually produce pretty similar trees, though.)
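To make the metric concrete, here is a minimal Python sketch of a volume-based tree cost. It assumes the cost is the sum of internal node volumes divided by the root's volume; the `Node` layout and the choice to skip leaves are assumptions made for illustration, not the prototype's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    # Hypothetical node layout, for illustration only.
    bmin: Vec3
    bmax: Vec3
    children: List["Node"] = field(default_factory=list)  # empty list means leaf


def volume(node: Node) -> float:
    return ((node.bmax[0] - node.bmin[0])
            * (node.bmax[1] - node.bmin[1])
            * (node.bmax[2] - node.bmin[2]))


def tree_cost(root: Node) -> float:
    """Sum of internal node volumes, normalized by the root's volume."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:  # assumption: leaf volumes are not counted
            total += volume(node)
            stack.extend(node.children)
    return total / volume(root)
```

Lower is better: a tree whose internal nodes hug their leaves tightly has less total volume for queries to fall into.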

The new tree starts with poor quality since the tree was built using incremental insertion, but the new refinement process quickly reduces cost. It gets to around 37.2, compared to a full sweep rebuild of around 31.9.

The old tree starts out better since the first frame's root reconstruction does a full median split build. But what happens afterward? That doesn't look good. What happens if the tree churns faster? How about a bunch of objects moving 10-100 instead of 0-10 units per second, with the same distribution?

 

 

Uh oh. The cost increases pretty quickly, and the self test cost rises in step. By the end, the new version is well over 10 times as fast. As you might expect, faster leaf speeds are even worse. I neglected to fully benchmark that since a cost metric 10000 times higher than it should be slows things down a little.

What's happening?

The old tree reconstructs nodes when their volume goes above a certain threshold. After the reconstruction, a new threshold is computed based on the result of the reconstruction. Unfortunately, that new threshold lets the tree degrade further next time around. Eventually, the threshold ratchets high enough that very few meaningful refinements occur. Note in the graph that the big refinement time spikes are mostly gone after frame 1000. If enough objects are moving chaotically for long periods of time, this problem could show up in a real game.

This poses a particularly large problem for long-running simulations like those on a persistent game server. The good news is that the new version has no such problem; the bad news is that there is no good workaround for the old version. For now, if you run into this problem, try periodically calling DynamicHierarchy.ForceRebuild (or look for the internal ForceRevalidation in older versions). As the name implies, it will reset the tree quality but at a hefty price. Expect to drop multiple frames.

(This failure is blindingly obvious in hindsight, and I don't know how I missed it when designing it, benchmarking it, or using it. I'm also surprised no one's reported it to my knowledge. Oops!)

So, how about if nothing is moving?

 

The old version manages to maintain a constant slope, though it still has some nasty spikes. Interestingly, those aren't primarily from refinement, as we'll see in a moment.

This is also a less favorable comparison for the new tree, "only" being 3 times as fast.

Splitting the time contributions helps explain both observations:
  

The old version's spikes can't be reconstructions given that everything is totally stationary, and the self test shows them too. I didn't bother fully investigating this, but one possible source is poor load balancing. It uses a fairly blind work collector, making it very easy to end up with one thread overworked. The new version, in contrast, is smarter about selecting subtasks of similar size and also collects more of them.

So why is the new refinement only a little bit faster if the self test is 3.5 times faster? Two reasons. First, the new refinement is never satisfied with doing no work, so in this kind of situation it does a bit too much. Second, I just haven't spent much time optimizing the refinement blocks for low work situations like this. These blocks are fairly large compared to the needs of a totally stationary tree, so very few of them need to be dispatched. In this case, there were only 2. The other threads sit idle during that particular subphase. In other words, the new tree is currently tuned for harder workloads.

Now, keeping leaves stationary, what happens when the density of leaves is varied? First, a sparse distribution with 8 times the volume (and so about one eighth the overlaps):

 

A bit over twice as fast. A little disappointing, but this is another one of those 'easy' cases where the new refinement implementation doesn't really adapt to very small workloads, providing marginal speedups.

How about the reverse? 64 times more dense than the above, with almost 500000 overlaps. With about 8 overlaps per leaf, this is roughly the density of a loose pile.

 

Despite the fact that the refinement suffers from the same 'easy simulation' issue, the massive improvement in test times brings the total speedup to over 5 times faster. The new tree's refinement takes less than a millisecond on both the sparse and dense cases, but the dense case stresses the self test vastly more. And the old tree is nowhere near as fast at collision tests.

Next up: while maintaining the same medium density of leaves (about one overlap per leaf), vary the number. Leaves are moving at the usual 0-10 speed again for these tests. First, a mere 16384 leaves instead of 65536:

Only about 2.5 times faster near the end. The split timings are interesting, though: 

The self test marches along at around 3.5 times as fast near the end, but the refinement is actually slower... if you ignore the enormous spikes of the old version. Once again, there's just not enough work to do and the work chunks are too big at the moment. 400 microseconds is pretty okay, though.

How about a very high leaf count, say, 262144 leaves? 

Around 4 times as fast. Refinement has enough to chomp on.

Refinement alone hangs around 2.5-2.75 times as fast, which is pretty fancy considering how much more work it's doing. As usual, the self test is super speedy, only occasionally dropping below 4.20 times as fast.

How about multithreaded scaling? I haven't investigated higher core counts yet, but here are the new tree's results for single threaded versus full threads on the 3770K under the original 65536 'realistic' case:

 

Very close to exactly 4 times as fast total. Self tests float around 4.5 times faster. As described earlier, this kind of 'easy' simulation results in a fairly low scaling in refinement- only about 2.3 times faster. If everything was flying around at higher speeds, refinement would be stressed more and more work would be available.

For completeness, here's the new tree versus the old tree, singlethreaded, in the same simulation:

 

3 times faster refines (ignoring spikes), and about 4.5 times faster in general.

How does it work?

The biggest conceptual change is the new refinement phase. It has three subphases:

1) Refit

As objects move, the node bounds must adapt. Rather than doing a full tree reconstruction every frame, the node bounds are recursively updated to contain all their children.

During the refit traversal, two additional pieces of information are collected. First, nodes with a child leaf count below a given threshold are added to a 'refinement candidates' set. These candidates are the roots of a bunch of parallel subtrees. Second, the change in volume of every node is computed. The sum of every node's change in volume divided by the root's volume provides the change in the cost metric of the tree for this frame.
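A rough Python sketch of a refit pass along these lines is below. The `Node` layout, the recursive form, and the rule that a candidate is the topmost internal node at or under the leaf threshold are all assumptions made for illustration; the real tree stores nodes in an array and runs this in parallel.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    # Hypothetical node layout, for illustration only.
    bmin: Vec3
    bmax: Vec3
    children: List["Node"] = field(default_factory=list)  # empty list means leaf
    leaf_count: int = 1


def volume(bmin: Vec3, bmax: Vec3) -> float:
    return (bmax[0] - bmin[0]) * (bmax[1] - bmin[1]) * (bmax[2] - bmin[2])


def refit(node: Node, threshold: int, candidates: List[Node]) -> float:
    """Update bounds bottom-up; return the summed volume change of internal nodes."""
    if not node.children:
        node.leaf_count = 1
        return 0.0  # leaf bounds are owned by the moving object itself
    old_volume = volume(node.bmin, node.bmax)
    change = 0.0
    for child in node.children:
        change += refit(child, threshold, candidates)
    node.leaf_count = sum(c.leaf_count for c in node.children)
    node.bmin = tuple(min(c.bmin[i] for c in node.children) for i in range(3))
    node.bmax = tuple(max(c.bmax[i] for c in node.children) for i in range(3))
    # Assumed candidate rule: the topmost internal nodes at or under the
    # threshold, so the collected subtrees never overlap each other.
    if node.leaf_count > threshold:
        for child in node.children:
            if child.children and child.leaf_count <= threshold:
                candidates.append(child)
    return change + volume(node.bmin, node.bmax) - old_volume
```

Dividing the returned value by the root's refitted volume gives the per-frame change in the cost metric described above.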

2) Binned Refine

A subset of the refinement candidates collected by the refit traversal are selected. The number of selected candidates is based on the refit's computed change in cost; a bigger increase means more refinements. The frame index is used to select different refinement candidates as time progresses, guaranteeing that the whole tree eventually gets touched.
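One way such a selection could look is sketched below. The function name, the `base_fraction` and `gain` tuning constants, and the rotating-window scheme are all hypothetical; the post only states that the count grows with the cost change and that the frame index rotates the choice.

```python
import math


def select_targets(candidates, cost_change, frame_index,
                   base_fraction=0.05, gain=1.0):
    """Pick a frame-rotated window of candidates; more when cost rose more.

    base_fraction and gain are made-up tuning constants for this sketch.
    """
    n = len(candidates)
    if n == 0:
        return []
    fraction = min(1.0, base_fraction + gain * max(0.0, cost_change))
    count = max(1, min(n, math.ceil(fraction * n)))
    # Advance the window by its own size each frame so that, at a steady
    # count, every candidate is visited within ceil(n / count) frames.
    start = (frame_index * count) % n
    return [candidates[(start + i) % n] for i in range(count)]
```

With ten candidates and no cost change, this picks one candidate per frame and cycles through all ten over ten frames.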

The root always gets added as a refinement target. However, the refinement is bounded. All of these refinements tend to be pretty small. Currently, any individual refinement in a tree with 65536 leaves will collect no more than 768 subtrees, a little over 1%. That's why there are no spikes in performance.

Here's an example of candidates and targets in a tree with 24 leaves:

The number within each node is the number of leaves in the children of that node. Green circles are leaf nodes, purple circles are refinement candidates that weren't picked, and red circles are the selected refinement targets. In this case, the maximum number of subtrees for any refinement was chosen as 8.

Since the root has so many potential nodes available, it has options about which nodes to refine. Rather than just diving down the tree a fixed depth, it seeks out the largest nodes by volume. Typically, large nodes tend to be a high leverage place to spend refine time. Consider a leaf node that's moved far enough from its original position that it should be in a far part of the tree. Its parents will tend to have very large bounds, and refinement will see that.
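A minimal sketch of that kind of volume-prioritized collection, using a heap, is shown here. The `Node` fields (a precomputed `volume` and a `marked` flag for other refinements' treelet roots) and the stopping rule are assumptions for illustration rather than the prototype's actual logic.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical node layout, for illustration only.
    name: str
    volume: float
    children: List["Node"] = field(default_factory=list)  # empty list means leaf
    marked: bool = False  # True if this node roots another refinement's treelet


def collect_subtrees(target: Node, max_subtrees: int) -> List[Node]:
    """Repeatedly expand the largest-volume internal node under target."""
    heap = []    # expandable internal nodes, largest volume first
    fixed = []   # leaves and marked nodes: collected but never expanded
    pushed = 0   # tie-breaker so the heap never compares Node objects

    def add(node: Node) -> None:
        nonlocal pushed
        if node.children and not node.marked:
            heapq.heappush(heap, (-node.volume, pushed, node))
            pushed += 1
        else:
            fixed.append(node)

    for child in target.children:
        add(child)
    while heap:
        largest = heap[0][2]
        # Expanding swaps one collected subtree for its children.
        if len(heap) + len(fixed) - 1 + len(largest.children) > max_subtrees:
            break
        heapq.heappop(heap)
        for child in largest.children:
            add(child)
    return fixed + [entry[2] for entry in heap]
```

The returned subtrees are what a builder would then rearrange; marked nodes are treated as opaque so that other refinements can work below them at the same time.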

For multithreading, refinement targets are marked (only the refinement treelet root, though- no need to mark every involved node). Refinement node collection will avoid collecting nodes beyond any marked node, allowing refinements to proceed in parallel.

The actual process applied to each refinement target is just a straightforward binned builder that operates on the collected nodes. (For more about binned builders, look up "On fast Construction of SAH-based Bounding Volume Hierarchies" by Ingo Wald.)
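For readers who haven't seen one, here is a small Python sketch of a single binned split step in the spirit of that paper, using volume rather than surface area as the cost. Boxes are plain `(min, max)` tuples, and the function name, bin count, and degenerate-case fallback are choices made for this sketch, not details taken from the prototype.

```python
def union(a, b):
    """Merge two (min, max) boxes; None stands for an empty box."""
    if a is None:
        return b
    if b is None:
        return a
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))


def volume(box):
    lo, hi = box
    return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])


def binned_split(boxes, bin_count=8):
    """Split boxes in two along the widest centroid axis using bins."""
    centers = [tuple((lo[i] + hi[i]) * 0.5 for i in range(3)) for lo, hi in boxes]
    cmin = [min(c[i] for c in centers) for i in range(3)]
    cmax = [max(c[i] for c in centers) for i in range(3)]
    axis = max(range(3), key=lambda i: cmax[i] - cmin[i])
    extent = cmax[axis] - cmin[axis]
    if extent <= 0.0:
        half = len(boxes) // 2  # all centroids coincide: just halve the list
        return boxes[:half], boxes[half:]
    bin_of = [min(bin_count - 1, int((c[axis] - cmin[axis]) / extent * bin_count))
              for c in centers]
    bin_bounds = [None] * bin_count
    bin_counts = [0] * bin_count
    for box, k in zip(boxes, bin_of):
        bin_bounds[k] = union(bin_bounds[k], box)
        bin_counts[k] += 1
    # Suffix bounds and counts, so each split plane is evaluated in O(1).
    right_bounds = [None] * (bin_count + 1)
    right_counts = [0] * (bin_count + 1)
    for k in range(bin_count - 1, -1, -1):
        right_bounds[k] = union(right_bounds[k + 1], bin_bounds[k])
        right_counts[k] = right_counts[k + 1] + bin_counts[k]
    best_cost, best_split = None, None
    left_bounds, left_count = None, 0
    for split in range(1, bin_count):
        left_bounds = union(left_bounds, bin_bounds[split - 1])
        left_count += bin_counts[split - 1]
        if left_count == 0 or right_counts[split] == 0:
            continue
        cost = (volume(left_bounds) * left_count
                + volume(right_bounds[split]) * right_counts[split])
        if best_cost is None or cost < best_cost:
            best_cost, best_split = cost, split
    left = [b for b, k in zip(boxes, bin_of) if k < best_split]
    right = [b for b, k in zip(boxes, bin_of) if k >= best_split]
    return left, right
```

A full builder would apply this recursively to each half; the point of binning is that only the per-bin bounds and counts are swept, not every box for every candidate plane.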

3) Cache Optimize 

The old tree allocated nodes as reference types and left them scattered through memory. Traversing the tree was essentially a series of guaranteed cache misses. This is not ideal.

The new tree is just a single contiguous array. While adding/removing elements and binned refinements can scramble the memory order relative to tree traversal order, it's possible to cheaply walk through parts of the tree and shuffle nodes around so that they're in the correct relative positions. A good result only requires optimizing a fraction of the tree; 3% to 5% works quite well when things aren't moving crazy fast. The fraction of cache optimized nodes scales with refit-computed cost change as well, so it compensates for the extra scrambling effects of refinement. In most cases, the tree will sit at 80-95% of cache optimal. (Trees with only a few nodes, say less than 4096, will tend to have a harder time keeping up right now, but they take microseconds anyway.)
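To illustrate what "correct relative positions" means, the sketch below rewrites an entire array-backed tree into depth-first order and remaps the child indices. Note the difference from the post: the new tree only touches a small fraction of nodes per frame, whereas this hypothetical `depth_first_layout` helper reorders everything at once, and the dict-based node format is made up for the example.

```python
def depth_first_layout(nodes, root=0):
    """Return a copy of nodes in depth-first (preorder) order.

    Each node is a dict with a "children" list of indices into nodes;
    any other keys are carried over unchanged.
    """
    order = []
    stack = [root]
    while stack:
        index = stack.pop()
        order.append(index)
        # Reverse so the first child is popped (and placed) first.
        stack.extend(reversed(nodes[index]["children"]))
    new_index = {old: new for new, old in enumerate(order)}
    return [
        {**nodes[old], "children": [new_index[c] for c in nodes[old]["children"]]}
        for old in order
    ]
```

After this pass, a depth-first traversal walks forward through memory instead of jumping around the array.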

Cache optimization can double performance all by itself, so it's one of the most important improvements.

As for the self test phase that comes after refinement, it's pretty much identical to the old version in concept. It's just made vastly faster by a superior node memory layout, cache friendliness, greater attention to tiny bits, and no virtual calls. 
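In concept, a tree self test looks something like the following Python sketch. It uses pointer-style nodes and recursion for readability, which is exactly the kind of layout the new tree avoids, so treat it as an illustration of the idea only; the `Node` type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    # Hypothetical node layout, for illustration only.
    name: str
    bmin: Vec3
    bmax: Vec3
    children: List["Node"] = field(default_factory=list)  # empty list means leaf


def overlaps(a: Node, b: Node) -> bool:
    return all(a.bmin[i] <= b.bmax[i] and b.bmin[i] <= a.bmax[i] for i in range(3))


def cross_test(a: Node, b: Node, pairs: list) -> None:
    """Report overlapping leaf pairs between two different subtrees."""
    if not a.children and not b.children:
        pairs.append((a.name, b.name))
    elif a.children:
        for child in a.children:
            if overlaps(child, b):
                cross_test(child, b, pairs)
    else:
        for child in b.children:
            if overlaps(a, child):
                cross_test(a, child, pairs)


def self_test(node: Node, pairs: list) -> None:
    """Report overlapping leaf pairs within one subtree."""
    kids = node.children
    for i, a in enumerate(kids):
        self_test(a, pairs)
        for b in kids[i + 1:]:
            if overlaps(a, b):
                cross_test(a, b, pairs)
```

Each pair of leaves is visited at most once, because sibling subtrees are only ever tested against the siblings that come after them.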

Interestingly, SIMD isn't a huge part of the speedup. It's used here and there (mainly refit), but not to its full potential. The self test in particular, despite being the dominant cost, doesn't use SIMD at all. 

Future work

1) Solving the refinement scaling issue for 'easy' simulations would be nice.

2) SIMD is a big potential area for improvement. As mentioned, this tree is mostly scalar in nature. At best, refit gets decent use of 3-wide operations. My attempts at creating fully vectorized variants tended to do significantly better than the old one, but they incurred too much overhead in many phases and couldn't beat the mostly scalar new version. I'll probably fiddle with it some more when a few more SIMD instructions are exposed, like shuffles; it should be possible to get at least another 1.5 to 2 times speedup.

3) Refinement currently does some unnecessary work on all the non-root treelets. They actually use the same sort of priority queue selection, even though they are guaranteed to eat the whole subtree by the refinement candidate collection threshold. Further, it should be possible to improve the node collection within refinement by taking into account the change in subtree volume on a per-node level. The root refinement would seek out high entropy parts of the tree. Some early testing implied this would help, but I removed it due to memory layout conflicts. 

4) I suspect there are some other good options for the choice of refinement algorithm. I already briefly tried agglomerative and sweep refiners (which were too slow relative to their quality advantage), but I didn't get around to trying things like brute forcing small treelet optimization (something like "Fast Parallel Construction of High-Quality Bounding Volume Hierarchies"). I might revisit this when setting up the systems of the next point.

5) It should be possible to improve the cache optimization distribution. Right now, the multithreaded version is forced into a suboptimal optimization order and suffers from overhead introduced by lots of atomic operations. Some experiments with linking cache optimization to the subtrees being refined showed promise. It converged with little effort, but it couldn't handle the scrambling effect of root refinement. I think this is solvable, maybe in combination with #4.

6) Most importantly, all of the above assumes a bunch of dynamic leaves. Most simulations have tons of static or inactive objects. The benchmarks show that the new tree doesn't do a bad job on these by any means, but imagine all the leaves were static meshes. There's no point in being aggressive with refinements or cache optimizations because nothing is moving or changing, and there's no need for any collision self testing if static-static collisions don't matter.

This is important because the number of static objects can be vastly larger than the number of dynamic objects. A scene big enough to have 5000 active dynamic objects might have hundreds of thousands of static/inactive objects. The old broad phase would just choke and die completely, requiring extra work to use a StaticGroup or something (which still wouldn't provide optimal performance for statics, and does nothing for inactive dynamics). In contrast, a new broad phase that has a dedicated static/inactive tree could very likely handle it with very little overhead.

When I have mentioned big planned broad phase speedups in the past ("over 10 times on some scenes"), this is primarily what I was referring to. The 4 times speedup of the core rewrite was just gravy.

Now what?

If you're feeling adventurous, you can grab the tree inside of the new scratchpad repository on github. Beware, it's extremely messy and not really packaged in any way. There are thousands of lines of dead code and diagnostics, a few dependencies are directly referenced .dlls rather than nice nuget packages, and there's no documentation. The project also contains some of the vectorized trees (with far fewer features) and some early vectorized solver prototyping. Everything but the Trees/SingleArray tree variant is fairly useless, but it might be interesting to someone.

In the future, the scratchpad repo will be where I dump incomplete code scribblings, mostly related to BEPUphysics.

I'm switching developmental gears to some graphics stuff that will use the new tree. It will likely get cleaned up over time and turned into a more usable form over the next few months. A proper BEPUphysics v2.0.0 repository will probably get created sometime in H1 2016, though it will remain incomplete for a while after that.

Monday, Jun 15, 2015

BEPUphysics v1.4.0 released!

Grab it on codeplex or nuget! Check the change log!

Now for the fun stuff.

Sunday, Apr 19, 2015

BEPUphysics in a CoreCLR World

A lot of exciting stuff has happened in the .NET world over the last year, and BEPUphysics is approaching some massive breaking changes. It seems like a good time to condense the plans in one spot.

First, expect v1.4.0 to get packaged up as a stable release in the next couple of months. At this time, I expect that v1.4.0 will likely be the last version designed with XNA platform compatibility in mind.

Following what seems to be every other open source project in existence, BEPUphysics will probably be moving to github after v1.4.0 is released.

Now for the fun stuff:


BEPUphysics v2.0.0

High Level Overview:

Performance drives almost everything in v2.0.0. Expect major revisions; many areas will undergo total rewrites. Applications may require significant changes to adapt. The revisions follow the spirit of the DX11/OpenGL to DX12/Vulkan shift. The engine will focus on providing the highest possible performance with a minimal API.

Expect the lowest level engine primitives like Entity to become much 'dumber', behaving more like simple opaque data blobs instead of a web of references, interfaces, and callbacks. The lowest layer will likely assume the user knows what they're doing. For example, expect a fundamental field like LinearVelocity to be exposed directly and without any automatic activation logic. "Safe" layers that limit access and provide validation may be built above this to give new users fewer ways to break everything.

Features designed for convenience will be implemented at a higher level explicitly separated from the core simulation or the responsibility will be punted to the user.

Some likely victims of this redesign include:
-Internal timestepping. There is really nothing special about internal timestepping- it's just one possible (and very simple) implementation of fixed timesteps that could, and probably should, be implemented externally.
-Space-resident state buffers and state interpolation. Users who need these things (for asynchronous updates or internal timestepping) have to opt in anyway, and there's no reason to have them baked into the engine core.
-All deferred collision events, and many immediate collision events. The important degrees of access will be retained to enable such things to be implemented externally, but the engine will do far less.
-'Prefab' entity types like Box, Sphere, and so on are redundant and only exist for legacy reasons. Related complicated inheritance hierarchies and generics to expose typed fields in collidables will also likely go away.
-'Fat' collision filtering. Some games can get by with no filtering, or just bitfields. The engine and API shouldn't be hauling around a bunch of pointless dictionaries for such use cases.
And more. 
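As an example of how little is needed to move one of these features outside the engine, here is a hedged sketch of external fixed timestepping with a time accumulator. The `FixedTimestepper` class and its `step` callback are hypothetical; the callback stands in for whatever call advances the simulation by one fixed step.

```python
class FixedTimestepper:
    """Runs a fixed-size step callback zero or more times per frame."""

    def __init__(self, step, dt=1.0 / 60.0, max_steps_per_frame=3):
        self.step = step                  # hypothetical callback: step(dt)
        self.dt = dt
        self.max_steps_per_frame = max_steps_per_frame
        self.accumulated = 0.0

    def update(self, elapsed):
        """Consume elapsed real time; return how many steps were taken."""
        self.accumulated += elapsed
        steps = 0
        while self.accumulated >= self.dt and steps < self.max_steps_per_frame:
            self.step(self.dt)
            self.accumulated -= self.dt
            steps += 1
        if self.accumulated >= self.dt:
            # Still behind after hitting the cap: drop the backlog rather
            # than spiral into ever longer frames.
            self.accumulated = 0.0
        return steps
```

A game loop would call `update` once per rendered frame with the measured frame time; interpolation between the last two states, if wanted, can be layered on top using `accumulated / dt` as the blend factor.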

Platform Support:

Expect older platforms like Xbox360 and WP7 to be abandoned. The primary target will be .NET Core. RyuJIT and the new SIMD-accelerated numeric types will be assumed. Given the new thriving open source initiative, I think this is a safe bet.

Going forward, expect the engine to adopt the latest language versions and platform updates more rapidly. The latest version of VS Community edition will be assumed. Backwards compatibility will be limited to snapshots, similar to how v1.4.0 will be a snapshot for the XNA-era platforms.

Areas of Focus:

1) Optimizing large simulations with many inactive or static objects

In v1.4.0 and before, a common recommendation is to avoid broadphase pollution. Every static object added to the Space is one more object to be dynamically handled by the broad phase. To mitigate this issue, bundling many objects into parent objects like StaticGroups is recommended. However, StaticGroups require explicit effort, lack dynamic flexibility, and are not as efficient as they could be.

Inactive objects are also a form of broadphase pollution, but unlike static objects, they cannot be bundled into StaticGroups. Further, these inactive objects pollute most of the other stages. In some cases, the Solver may end up spending vastly more time testing activity states than actually solving anything.

Often, games with these sorts of simulations end up implementing some form of entity tracking to remove objects outside of player attention for performance reasons. While it works in many cases, it would be better to not have to do it at all.

Two large changes are required to address these problems:
-The BroadPhase will be aware of the properties of static and inactive objects. In the normal case, additional static or inactive objects will incur almost no overhead. (In other words, expect slightly less overhead than the StaticGroup incurs, while supporting inactive dynamic objects.)
-Deactivation will be redesigned. Persistent tracking of constraint graphs will be dropped in favor of incremental analysis of the active set, substantially reducing deactivation maintenance overhead. Stages will only consider the active set, rather than enumerating over all objects and checking activity after the fact.

On the type of simulations hamstrung by the current implementation, these changes could improve performance hugely. In extreme cases, a 10x speedup without considering the other implementation improvements or SIMD should be possible.

2) Wide parallel scaling for large server-style workloads

While the engine scales reasonably well up to around 4 to 6 physical cores, there remain sequential bottlenecks and lock-prone bits of code. The NarrowPhase's tracking of obsolete collision pairs is the worst sequential offender. More speculatively, the Solver's locking may be removed in favor of a batching model if some other changes pan out.

The end goal is decent scaling on 16-64 physical cores for large simulations, though fully achieving this will likely require some time.

3) SIMD

With RyuJIT's support for SIMD types comes an opportunity for some transformative performance improvements. However, the current implementation would not benefit significantly from simply swapping out the BEPUutilities types for the new accelerated types. Similarly, future offline optimizing/autovectorizing compilers don't have much to work with under the current design. As it is, these no-effort approaches would probably end up providing an incremental improvement of 10-50% depending on the simulation.

To achieve big throughput improvements, the engine needs cleaner data flow, and that means a big redesign. The solver is the most obvious example. Expect constraints to undergo unification and a shift in data layout. The Entity object's data layout will likely be affected by these changes. The BroadPhase will also benefit, though how much is still unclear since the broad phase is headed for a ground up rewrite.

The NarrowPhase is going to be the most difficult area to adapt; there are a lot of different collision detection routines with very complicated state. There aren't as many opportunities for unification, so it's going to be a long case-by-case struggle to extract as much performance as possible. The most common few collision types will most likely receive in-depth treatment, and the remainder will be addressed as required.

Miscellaneous Changes:

-The demos application will move off of XNA, eliminating the need for an XNA Game Studio install. The drawer will be rewritten, and will get a bit more efficient. Expect the new drawer to use DX11 (feature level 11_0) through SharpDX. Alternate rendering backends for OpenGL (or hopefully Vulkan, should platform and driver support be promising at the time) may be added later for use in cross platform debugging.

-As alluded to previously, expect a new broad phase with a much smoother (and generally lower) runtime profile. It will focus on incremental refinement; the final quality of the tree may actually end up higher than the current 'offline' hierarchies offered by BEPUphysics.

-StaticGroup will likely disappear in favor of the BroadPhase just handling it automatically, but the non-BroadPhase hierarchies used by other types like the StaticMesh should still get upgraded to at least match the BroadPhase's quality.

-Collision pair handlers are a case study in inheritance hell. Expect something to happen here, but I'm not yet sure what.

-Wider use of more GC-friendly data structures like the QuickList/QuickSet to avoid garbage and heap complexity.

-Convex casts should use a proper swept test against the broad phase acceleration structure. Should make long unaligned casts much faster.

-More continuous collision detection options. Motion clamping CCD is not great for all situations- particularly systems of lots of dynamic objects, like passengers on a plane or spaceship. The existing speculative contacts implementation helps a little to stabilize things, but its powers are limited. Granting extra power to speculative contacts while limiting ghost collisions would be beneficial.

-The CompoundShape could use some better flexibility. The CompoundHelper is testament to how difficult it can be to do some things efficiently with it.

Schedule Goals:

Variable. Timetable depends heavily on what else is going on in development. Be very suspicious of all of these targets.

Expect the earliest changes to start showing up right after v1.4.0 is released. The first changes will likely be related to the debug drawer rewrite.

The next chunk may be CCD/collision pair improvements and the deactivation/broadphase revamp for large simulations. The order of these things is uncertain at this time because there may turn out to be some architectural dependencies. This work will probably cover late spring to mid summer 2015.

Early attempts at parallelization improvements will probably show up next. Probably later in summer 2015.

SIMD work will likely begin at some time in late summer 2015. It may take a few months to adapt the Solver and BroadPhase.

The remaining miscellaneous changes, like gradual improvements to collision detection routines, will occur over the following months and into 2016. I believe all the big changes should be done by some time in spring 2016.

This work won't be contiguous; I'll be hopping around to other projects throughout.

Future Wishlist:

-The ancient FluidVolume, though slightly less gross than it once was, is still very gross. It would be nice to fix it once and for all. This would likely involve some generalizations to nonplanar water- most likely procedural surfaces that would be helpful in efficiently modeling waves, but maybe to simple dynamic heightfields if the jump is short enough.

-Fracture simulation. This has been on the list for a very long time, but there is still a chance it will come up. It probably won't do anything fancy like runtime carving or voronoi shattering. More likely, it will act on some future improved version of CompoundShapes, providing different kinds of simple stress simulation that respond to collisions and environmental effects to choose which parts get fractured. (This isn't a very complicated feature, and as mentioned elsewhere on the forum, I actually implemented something like it once before in a spaceship game prototype- it just wasn't quite as efficient or as clean as a proper release would require.)

On GPU Physics:

In the past, I've included various kinds of GPU acceleration on the development wishlist. At this point, however, I do not expect to release any GPU-accelerated rigid body physics systems in the foreseeable future; BEPUphysics itself will stay exclusively on the CPU.

I've revisited the question of GPU accelerated physics a few times over the last few years, including a few prototypes. However, GPU physics in games is still primarily in the realm of decoration. It's not impossible to use for game logic, but having all of the information directly accessible in main memory with no latency is just a lot easier. 

And implementing individually complicated objects like the CharacterController would be even more painful in the coherence-demanding world of GPUs. (I would not be surprised if a GPU version of a bunch of full-featured CharacterControllers actually ran slower due to the architectural mismatch.) There might be a hybrid approach somewhere in here, but the extra complexity is not attractive.

And CPUs can give pretty-darn-decent performance. BEPUphysics is already remarkably quick for how poorly it uses the capabilities of a modern CPU.

And our own game is not a great fit for GPU simulation, so we have no strong internal reason to pursue it. Everything interacts heavily with game logic, there are no deformable objects, there are no fluids, any cloth is well within the abilities of CPU physics, and the clients' GPUs are going to be busy making pretty pictures.

This all makes implementing runtime GPU simulation a bit of a hard sell.

That said, there's a small chance that I'll end up working on other types of GPU accelerated simulation. For example, one of the GPU prototypes was a content-time tool to simulate flesh and bone in a character to automatically generate vertex-bone weights and pose-specific morph targets. We ended up going another direction in the end, but it's conceivable that other forms of tooling (like BEPUik) could end up coming out of continued development.

 

Have some input? Concerned about future platform support? Want to discuss the upcoming changes? Post on the forum thread this was mirrored from, or just throw tweets at me.

Wednesday, May 21, 2014

BEPUik v0.3.0 now available!

64 bit Blender builds with the latest BEPUik full body inverse kinematics addon now available!
Don't forget to follow Norbo and Squashwell on twitter to read tweets about things.

Friday, Dec 20, 2013

BEPUphysics v1.3.0 released!

Grab the new version! Check out the new stuff!

With this new version comes some changes to the forks. I've dropped the SlimDX and SharpDX forks in favor of focusing on the dependency free main fork.

The XNA fork will stick around for now, but the XNA fork's library will use BEPUutilities math instead of XNA math. Going forward, the fork will be used to maintain project configuration and conditional compilation requirements for XNA platforms.