First of all, thanks for creating BEPUphysics, I love it. Having a well designed, performant physics engine in pure C# is a godsend (it saves the pain of updating wrappers for a native engine and recompiling for each target platform, for starters, plus it's fun to play with!). I'll be throwing some money your way soon.
Glad you like it!
Performing a ConvexCast on a very high-poly instanced mesh is very expensive when the cast encompasses a large chunk of triangles. I only need to detect whether there's an intersection or not; I don't need any hit details or the number of intersected triangles. What is the best way to do this efficiently?
Two things will likely give you significant improvements:
1) Make your own boolean-only version of the convex cast. It would be almost identical to the existing one, but instead of using MPRToolbox.Sweep, it would use MPRToolbox.AreSweptShapesIntersecting. The regular sweep actually uses this as a prepass, so this only saves time when the convex cast is intersecting. (See the first sketch after this list.)
2) Make the candidate selection less dumb. I implemented convex casts very lazily under the assumption that they would always be very, very short. Their primary purpose is in CCD, so they mainly need to span the distance between an object's previous-frame and current-frame positions. So, rather than having a dedicated tree traversal, they expand the bounding box to contain the entire sweep... which means that for a long diagonal sweep, the vast majority of the volume is wasted. Adding a dedicated swept-AABB traversal to the trees would help a lot. (See the second sketch after this list.)
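For (1), here is a minimal sketch of the per-triangle core. The method name is illustrative, and the MPRToolbox/MinkowskiToolbox signatures are reconstructed from memory of the v1 source, so verify them against your copy; the mesh-level cast would gather candidate triangles from the tree exactly as the existing ConvexCast does and return true on the first hit.

```csharp
// Boolean-only swept test between the cast shape and one world-space triangle.
// No time of impact, normal, or contact data is computed.
static bool SweptShapeHitsTriangle(ConvexShape castShape, ref RigidTransform startingTransform,
                                   ref Vector3 sweep, ref Vector3 a, ref Vector3 b, ref Vector3 c)
{
    // TriangleShape recenters its vertices; the returned center becomes the
    // triangle's world position.
    Vector3 center;
    var triangle = new TriangleShape(a, b, c, out center);
    var triangleTransform = new RigidTransform(center);

    // MPR works in the pair's local space, so express the triangle's transform
    // relative to the cast shape.
    RigidTransform localTransformB;
    MinkowskiToolbox.GetLocalTransform(ref startingTransform, ref triangleTransform, out localTransformB);

    // The boolean prepass that MPRToolbox.Sweep runs internally, used directly.
    // The mesh is static, so 'sweep' is also the relative sweep.
    Vector3 position;
    return MPRToolbox.AreSweptShapesIntersecting(castShape, triangle, ref sweep, ref localTransformB, out position);
}
```

For (2), the traversal predicate itself is pure math and independent of the library internals: expand each node's bounds by the cast's half extents (a Minkowski sum) and run a segment slab test against the expanded box, descending into the node only on overlap. A sketch:

```csharp
// True if a box with the given half extents, starting at 'start' and translating
// by 'sweep', touches 'node' at any t in [0, 1]. Standard slab test against the
// node's bounds expanded by the half extents.
static bool SweptBoxOverlapsNode(ref Vector3 start, ref Vector3 halfExtents,
                                 ref Vector3 sweep, ref BoundingBox node)
{
    float tMin = 0, tMax = 1;
    for (int axis = 0; axis < 3; ++axis)
    {
        float s = axis == 0 ? start.X : axis == 1 ? start.Y : start.Z;
        float d = axis == 0 ? sweep.X : axis == 1 ? sweep.Y : sweep.Z;
        float h = axis == 0 ? halfExtents.X : axis == 1 ? halfExtents.Y : halfExtents.Z;
        float min = (axis == 0 ? node.Min.X : axis == 1 ? node.Min.Y : node.Min.Z) - h;
        float max = (axis == 0 ? node.Max.X : axis == 1 ? node.Max.Y : node.Max.Z) + h;
        if (Math.Abs(d) < 1e-9f)
        {
            // Parallel to this slab; reject if outside it.
            if (s < min || s > max)
                return false;
        }
        else
        {
            float t0 = (min - s) / d, t1 = (max - s) / d;
            if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
            tMin = Math.Max(tMin, t0);
            tMax = Math.Min(tMax, t1);
            if (tMin > tMax)
                return false;
        }
    }
    return true;
}
```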
On the hacky side of things, there's the option of spraying a bunch of rays to approximate the convex cast, as sketched below.
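A minimal sketch of that hack, assuming a handful of parallel rays spread over the cast shape's cross-section is acceptable (the method name, ray pattern, count, and 'radius' are arbitrary tuning choices; InstancedMesh exposes RayCast through its BroadPhaseEntry base):

```csharp
// Approximate a boolean convex cast with parallel rays. Purely a heuristic:
// thin features that pass between rays will be missed, so tune the pattern
// to the mesh and the cast shape's size.
static bool ApproximateConvexCast(InstancedMesh mesh, Vector3 start, Vector3 sweep, float radius)
{
    float length = sweep.Length();
    Vector3 direction = sweep / length;

    // Two axes perpendicular to the sweep direction for the offset pattern.
    Vector3 up = Math.Abs(direction.Y) < 0.9f ? Vector3.Up : Vector3.Right;
    Vector3 right = Vector3.Normalize(Vector3.Cross(direction, up));
    up = Vector3.Cross(right, direction);

    Vector3[] offsets =
    {
        Vector3.Zero,
        right * radius, -right * radius,
        up * radius, -up * radius
    };
    foreach (var offset in offsets)
    {
        RayHit hit;
        if (mesh.RayCast(new Ray(start + offset, direction), length, out hit))
            return true;
    }
    return false;
}
```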
(Future note: v2 will absolutely have a proper traversal. It wouldn't even take long to add to v1.4.X's tree, but I'm trying to stop myself from getting distracted.)
When I use an InstancedMesh collider with a high-density mesh, raycasts against the collider ignore triangles smaller than a certain threshold, while a convex body sweep works fine. I found that this is caused by a (relatively speaking) too-large epsilon in the FindRayTriangleIntersection function, where it checks for a degenerate triangle. Decreasing the epsilon or skipping the check seems to work; I went with the former, so that actual degenerate triangles are still filtered. Could a smaller epsilon cause inefficiencies (or other problems) in other scenarios? Most of my other objects have "regular" sizes.
Decreasing epsilon across the board could have some sneaky side effects considering all the places it is used; it's related to world scale. (I'd like to make these scale tuning factors less pernicious in v2...)
If you know for a fact that all meshes being raytraced won't have degenerate triangles, then eliminating the check in FindRayTriangleIntersection will be fine. You could also just use a much more aggressive epsilon for that function specifically; realistically, it could go down a few orders of magnitude without causing problems.
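For illustration, the check in question looks roughly like this in a local copy of Toolbox.FindRayTriangleIntersection (reconstructed from memory of the v1 source, so verify against your copy; the 1e-6f scale factor is just an arbitrary example of "a few orders of magnitude"):

```csharp
// Degenerate-triangle rejection inside FindRayTriangleIntersection. The cross
// product's squared length is (2 * triangle area)^2, so a fixed threshold also
// rejects very small but perfectly valid triangles. Scaling it down keeps the
// degeneracy filter while letting small triangles through.
Vector3 ab, ac;
Vector3.Subtract(ref b, ref a, out ab);
Vector3.Subtract(ref c, ref a, out ac);
Vector3.Cross(ref ab, ref ac, out hit.Normal);
if (hit.Normal.LengthSquared() < Toolbox.Epsilon * 1e-6f) // was: < Toolbox.Epsilon
    return false;
```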
Could having a larger bounding box cause some other problems, though? Would it be okay/worth it if I ran an efficient bounding box calculation on a background thread once the object stops moving, and then just assigned the result once it's done? (I'm not using asynchronous updating at this point.)
Unnecessarily large bounding boxes just make the object's presence in the broad phase larger. If the larger bounding box puts it in testing range of more objects, it will increase collision detection cost; if there's nothing in the vicinity, it's no problem at all.
If the collision detection overhead caused by conservative bounds is an issue, asynchronously computing tight bounds could be useful. In that case, however, it may be easier to use something like the MobileMeshShape's bounding hull vertices. By default, it computes the convex hull of the input mesh and uses that hull to compute the bounding box. In the worst case, the convex hull operation doesn't help at all (e.g., a highly tessellated sphere), but most of the time it eliminates quite a few vertices. The MobileMeshShape can also be constructed with arbitrary bounding hull vertices, so a simplified/approximate set could be provided.
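If you do try the asynchronous route, the bounds computation itself is pure math and safe to run off the main thread. A minimal sketch, assuming 'vertices' and 'worldTransform' were captured from the mesh and that you have some way to apply the result afterwards (stock v1 updates bounding boxes internally, so the final assignment may itself need a small accessor added):

```csharp
// Compute a tight world-space AABB from the mesh's vertices in the background.
Task<BoundingBox> tightBoundsTask = Task.Run(() =>
{
    var min = new Vector3(float.MaxValue, float.MaxValue, float.MaxValue);
    var max = new Vector3(float.MinValue, float.MinValue, float.MinValue);
    for (int i = 0; i < vertices.Length; ++i)
    {
        Vector3 v;
        AffineTransform.Transform(ref vertices[i], ref worldTransform, out v);
        Vector3.Min(ref min, ref v, out min);
        Vector3.Max(ref max, ref v, out max);
    }
    return new BoundingBox(min, max);
});
// Later, on the main thread, once the object has stopped moving:
// ApplyTightBounds(mesh, tightBoundsTask.Result); // hypothetical hook into the collidable
```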
Additionally, is there a nicer way to do this that wouldn't require me to use a modified version of the library? (It's not a big problem, but I'd like to avoid modifying the library itself if possible.)
There may be ways of eliminating library modifications if the problem can be changed a bit. As it is, the library just wasn't built with this use case in mind, so it needs some hand-holding, as you've found. My usual advice for a game-like application is to massively simplify the physics representation of the mesh, preferably into simple convex bits. Ideally, a single box or sphere would suffice, but sometimes complex concavity is needed.
For complex concavity, I would recommend running offline mesh simplification and convex decomposition, like V-HACD. Note that convex hull collision detection costs are roughly linear in the number of vertices in the hull, so the simplification step can be important. BEPUphysics doesn't have any of these offline processes built in, but something like Blender could handle them.
This sort of decomposition will be almost mandatory if you expect a nontrivial number of collisions, each spanning many triangles of the mesh. In that situation, the speedup could be one to three orders of magnitude, and most likely no library changes would be required.
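For completeness, a minimal sketch of wiring the decomposed pieces back up, assuming 'hullVertexLists' holds the vertex sets produced by V-HACD (the constructor overloads are from the v1 API as I remember it; for a dynamic object, pass a mass to CompoundBody):

```csharp
// Build one physics object from a set of convex pieces produced offline.
var shapes = new List<CompoundShapeEntry>();
foreach (List<Vector3> hullVertices in hullVertexLists)
{
    // ConvexHullShape recenters its vertices; the entry's position puts the piece back.
    Vector3 center;
    var hullShape = new ConvexHullShape(hullVertices, out center);
    shapes.Add(new CompoundShapeEntry(hullShape, center));
}
// No mass specified: a kinematic compound, suitable for static scenery.
var compound = new CompoundBody(shapes);
space.Add(compound);
```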