
I made a voxel raycaster in Unity using a compute shader and a texture, but at 1080p it is limited to a view distance of only 100 at 30 fps. With no light bounces or anything yet, I am quite disappointed with this performance.

I tried learning Vulkan, but the best tutorials are based on rasterization, and I guess all I really want to do is compute pixels in parallel on the GPU. I am familiar with CUDA, and I've read that it is sometimes used for rendering? Or is there a simple way of just computing pixels in parallel in Vulkan? I've already got a template Vulkan project that opens a blank window. I don't need to get any data back from the GPU, just render straight to the screen after giving it data.

And would the code below be significantly faster in Vulkan as opposed to a Unity compute shader? It has A LOT of if/else statements in it, which I have read is bad for GPUs, but I can't think of any other way of writing it.

EDIT: I optimized it as much as I could, but it's still pretty slow: about 30 fps at 1080p.

Here is the compute shader:

#pragma kernel CSMain

RWTexture2D<float4> Result; // the actual array of pixels the player sees
const float width; // in pixels
const float height;

const StructuredBuffer<int> voxelMaterials; // for now just getting a flat voxel array
const int voxelBufferRowSize;
const int voxelBufferPlaneSize;
const int voxelBufferSize;
const StructuredBuffer<float3> rayDirections; // I'm now actually using it as points instead of directions
const float maxRayDistance;

const float3 playerCameraPosition; // relative to the voxelData, ie the first voxel's bottom, back, left corner position, no negative coordinates
const float3 playerWorldForward;
const float3 playerWorldRight;
const float3 playerWorldUp;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = float4(0, 0, 0, 0); // setting the pixel to black by default
    float3 pointHolder = playerCameraPosition; // initializing the first point to the player's position
    const float3 p = rayDirections[id.x + (id.y * width)]; // vector transformation getting the world space directions of the rays relative to the player
    const float3 u1 = p.x * playerWorldRight;
    const float3 u2 = p.y * playerWorldUp;
    const float3 u3 = p.z * playerWorldForward;
    const float3 direction = u1 + u2 + u3; // the direction to that point

    float distanceTraveled = 0;
    int3 directionAxes; // 1 for positive, 0 for zero, -1 for negative
    int3 directionIfReplacements = { 0, 0, 0 }; // 1 if that direction axis is positive, 0 otherwise
    float3 axesUnit = { 1 / abs(direction.x), 1 / abs(direction.y), 1 / abs(direction.z) };
    float3 distancesXYZ = { 1000, 1000, 1000 };
    int face = 0; // 1 = x, 2 = y, 3 = z // the current face the while loop point is on

    // comparing the floats once in the beginning so the rest of the ray traversal can compare ints
    if (direction.x > 0) {
        directionAxes.x = 1;
        directionIfReplacements.x = 1;
    }
    else if (direction.x < 0) {
        directionAxes.x = -1;
    }
    else {
        distanceTraveled = maxRayDistance; // just ending the ray for now if one of its direction axes is exactly 0. You'll see a line of black pixels if the player's rotation is zero, but this never happens naturally
        directionAxes.x = 0;
    }
    if (direction.y > 0) {
        directionAxes.y = 1;
        directionIfReplacements.y = 1;
    }
    else if (direction.y < 0) {
        directionAxes.y = -1;
    }
    else {
        distanceTraveled = maxRayDistance;
        directionAxes.y = 0;
    }
    if (direction.z > 0) {
        directionAxes.z = 1;
        directionIfReplacements.z = 1;
    }
    else if (direction.z < 0) {
        directionAxes.z = -1;
    }
    else {
        distanceTraveled = maxRayDistance;
        directionAxes.z = 0;
    }

    // calculating the first point
    if (playerCameraPosition.x < voxelBufferRowSize &&
        playerCameraPosition.x >= 0 &&
        playerCameraPosition.y < voxelBufferRowSize &&
        playerCameraPosition.y >= 0 &&
        playerCameraPosition.z < voxelBufferRowSize &&
        playerCameraPosition.z >= 0)
    {
        int voxelIndex = floor(playerCameraPosition.x) + (floor(playerCameraPosition.z) * voxelBufferRowSize) + (floor(playerCameraPosition.y) * voxelBufferPlaneSize); // the voxel index in the flat array

        switch (voxelMaterials[voxelIndex]) {
        case 1:
            Result[id.xy] = float4(1, 0, 0, 0);
            distanceTraveled = maxRayDistance; // to end the while loop
            break;
        case 2:
            Result[id.xy] = float4(0, 1, 0, 0);
            distanceTraveled = maxRayDistance;
            break;
        case 3:
            Result[id.xy] = float4(0, 0, 1, 0);
            distanceTraveled = maxRayDistance;
            break;
        default:
            break;
        }
    }

    // traversing the ray beyond the first point
    while (distanceTraveled < maxRayDistance) 
    {
        switch (face) {
        case 1:
            distancesXYZ.x = axesUnit.x;
            distancesXYZ.y = (floor(pointHolder.y + directionIfReplacements.y) - pointHolder.y) / direction.y;
            distancesXYZ.z = (floor(pointHolder.z + directionIfReplacements.z) - pointHolder.z) / direction.z;
            break;
        case 2:
            distancesXYZ.y = axesUnit.y;
            distancesXYZ.x = (floor(pointHolder.x + directionIfReplacements.x) - pointHolder.x) / direction.x;
            distancesXYZ.z = (floor(pointHolder.z + directionIfReplacements.z) - pointHolder.z) / direction.z;
            break;
        case 3:
            distancesXYZ.z = axesUnit.z;
            distancesXYZ.x = (floor(pointHolder.x + directionIfReplacements.x) - pointHolder.x) / direction.x;
            distancesXYZ.y = (floor(pointHolder.y + directionIfReplacements.y) - pointHolder.y) / direction.y;
            break;
        default:
            distancesXYZ.x = (floor(pointHolder.x + directionIfReplacements.x) - pointHolder.x) / direction.x;
            distancesXYZ.y = (floor(pointHolder.y + directionIfReplacements.y) - pointHolder.y) / direction.y;
            distancesXYZ.z = (floor(pointHolder.z + directionIfReplacements.z) - pointHolder.z) / direction.z;
            break;
        }

        face = 0; // 1 = x, 2 = y, 3 = z
        float smallestDistance = 1000;
        if (distancesXYZ.x < smallestDistance) {
            smallestDistance = distancesXYZ.x;
            face = 1;
        }
        if (distancesXYZ.y < smallestDistance) {
            smallestDistance = distancesXYZ.y;
            face = 2;
        }
        if (distancesXYZ.z < smallestDistance) {
            smallestDistance = distancesXYZ.z;
            face = 3;
        }
        if (smallestDistance == 0) {
            break;
        }

        int3 facesIfReplacement = { 1, 1, 1 };
        switch (face) { // directionIfReplacements is positive if positive but I want to subtract so invert it to subtract 1 when negative subtract nothing when positive
        case 1:
            facesIfReplacement.x = 1 - directionIfReplacements.x;
            break;
        case 2:
            facesIfReplacement.y = 1 - directionIfReplacements.y;
            break;
        case 3:
            facesIfReplacement.z = 1 - directionIfReplacements.z;
            break;
        }

        pointHolder += direction * smallestDistance; // the actual ray marching
        distanceTraveled += smallestDistance;

        int3 voxelIndexXYZ = { -1,-1,-1 }; // the integer coordinates within the buffer
        voxelIndexXYZ.x = ceil(pointHolder.x - facesIfReplacement.x);
        voxelIndexXYZ.y = ceil(pointHolder.y - facesIfReplacement.y);
        voxelIndexXYZ.z = ceil(pointHolder.z - facesIfReplacement.z);

        //check if voxelIndexXYZ is within bounds of the voxel buffer before indexing the array
        if (voxelIndexXYZ.x < voxelBufferRowSize &&
            voxelIndexXYZ.x >= 0 &&
            voxelIndexXYZ.y < voxelBufferRowSize &&
            voxelIndexXYZ.y >= 0 &&
            voxelIndexXYZ.z < voxelBufferRowSize &&
            voxelIndexXYZ.z >= 0)
        {
            int voxelIndex = voxelIndexXYZ.x + (voxelIndexXYZ.z * voxelBufferRowSize) + (voxelIndexXYZ.y * voxelBufferPlaneSize); // the voxel index in the flat array
            switch (voxelMaterials[voxelIndex]) {
            case 1:
                Result[id.xy] = float4(1, 0, 0, 0) * (1 - (distanceTraveled / maxRayDistance));
                distanceTraveled = maxRayDistance; // to end the while loop
                break;
            case 2:
                Result[id.xy] = float4(0, 1, 0, 0) * (1 - (distanceTraveled / maxRayDistance));
                distanceTraveled = maxRayDistance;
                break;
            case 3:
                Result[id.xy] = float4(0, 0, 1, 0) * (1 - (distanceTraveled / maxRayDistance));
                distanceTraveled = maxRayDistance;
                break;
            }
        }
        else {
            break; // end the ray once it leaves the buffer (in the actual game the player will always be inside it)
        }
    }
}

Depending on the voxel data you give it, it produces this: [screenshot of the raycast voxel output]

And here is the shader after "optimizing" it and taking out all branching or diverging conditional statements (I think):

#pragma kernel CSMain

RWTexture2D<float4> Result; // the actual array of pixels the player sees
const float width; // in pixels
const float height;

const Buffer<int> voxelMaterials; // for now just getting a flat voxel array
const Buffer<float4> voxelColors;
const int voxelBufferRowSize;
const int voxelBufferPlaneSize;
const int voxelBufferSize;
const Buffer<float3> rayDirections; // I'm now actually using it as points instead of directions
const float maxRayDistance;

const float3 playerCameraPosition; // relative to the voxelData, ie the first voxel's bottom, back, left corner position, no negative coordinates
const float3 playerWorldForward;
const float3 playerWorldRight;
const float3 playerWorldUp;

[numthreads(16, 16, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float4 resultHolder = float4(0, 0, 0, 0); // the pixel color, black by default (local, so each thread has its own copy; a non-static global can't be written in HLSL)
    float3 pointHolder = playerCameraPosition; // initializing the first point to the player's position
    const float3 p = rayDirections[id.x + (id.y * width)]; // vector transformation getting the world space directions of the rays relative to the player
    const float3 u1 = p.x * playerWorldRight;
    const float3 u2 = p.y * playerWorldUp;
    const float3 u3 = p.z * playerWorldForward;
    const float3 direction = u1 + u2 + u3; // the transformed ray direction in world space
    const bool anyDir0 = direction.x == 0 || direction.y == 0 || direction.z == 0; // preventing a division by zero
    float distanceTraveled = maxRayDistance * anyDir0;

    const float3 nonZeroDirection = { // to prevent a division by zero
        direction.x + (1 * anyDir0),
        direction.y + (1 * anyDir0),
        direction.z + (1 * anyDir0)
    };
    const float3 axesUnits = { // the distances if the axis is an integer
        1.0f / abs(nonZeroDirection.x),
        1.0f / abs(nonZeroDirection.y),
        1.0f / abs(nonZeroDirection.z)
    };
    const bool3 isDirectionPositiveOr0 = {
        direction.x >= 0,
        direction.y >= 0,
        direction.z >= 0
    };

    while (distanceTraveled < maxRayDistance)
    {
        const bool3 pointIsAnInteger = {
            (int)pointHolder.x == pointHolder.x,
            (int)pointHolder.y == pointHolder.y,
            (int)pointHolder.z == pointHolder.z
        };

        const float3 distancesXYZ = {
            ((floor(pointHolder.x + isDirectionPositiveOr0.x) - pointHolder.x) / direction.x * !pointIsAnInteger.x)  +  (axesUnits.x * pointIsAnInteger.x),
            ((floor(pointHolder.y + isDirectionPositiveOr0.y) - pointHolder.y) / direction.y * !pointIsAnInteger.y)  +  (axesUnits.y * pointIsAnInteger.y),
            ((floor(pointHolder.z + isDirectionPositiveOr0.z) - pointHolder.z) / direction.z * !pointIsAnInteger.z)  +  (axesUnits.z * pointIsAnInteger.z)
        };

        float smallestDistance = min(distancesXYZ.x, distancesXYZ.y);
        smallestDistance = min(smallestDistance, distancesXYZ.z);

        pointHolder += direction * smallestDistance;
        distanceTraveled += smallestDistance;

        const int3 voxelIndexXYZ = {
            floor(pointHolder.x) - (!isDirectionPositiveOr0.x && (int)pointHolder.x == pointHolder.x), 
            floor(pointHolder.y) - (!isDirectionPositiveOr0.y && (int)pointHolder.y == pointHolder.y),
            floor(pointHolder.z) - (!isDirectionPositiveOr0.z && (int)pointHolder.z == pointHolder.z)
        };

        const bool inBounds = (voxelIndexXYZ.x < voxelBufferRowSize && voxelIndexXYZ.x >= 0) && (voxelIndexXYZ.y < voxelBufferRowSize && voxelIndexXYZ.y >= 0) && (voxelIndexXYZ.z < voxelBufferRowSize && voxelIndexXYZ.z >= 0);

        const int voxelIndexFlat = (voxelIndexXYZ.x + (voxelIndexXYZ.z * voxelBufferRowSize) + (voxelIndexXYZ.y * voxelBufferPlaneSize)) * inBounds; // out-of-bounds indices collapse to 0, so voxel (0,0,0) must always be empty and acts as our index-out-of-range guard

        if (voxelMaterials[voxelIndexFlat] > 0) {
            resultHolder = voxelColors[voxelMaterials[voxelIndexFlat]] * (1 - (distanceTraveled / maxRayDistance));
            break;
        }   
        if (!inBounds) break;
    }
    Result[id.xy] = resultHolder;
}
  • "*I tried learning Vulkan and the best tutorials are based on rasterization, and I guess all I really want to do is compute pixels in parallel on the GPU.*" Rasterization *is* computing pixels, in parallel, on the GPU. So how is what you want different from that? – Nicol Bolas Apr 03 '21 at 03:05
  • No, to rasterize you need to calculate vertices and triangles and then get the pixels from that using a completely different system from raytracing. I just want to say `pixels[n] = color` without all the extra stuff. – Tristan367 Apr 03 '21 at 10:03

1 Answer


A compute shader is what it is: a program that runs on the GPU, whether through Vulkan or Unity, so you are doing it in parallel either way. The point of Vulkan, however, is that it gives you more control over the commands being executed on the GPU: synchronization, memory, etc. So it's not necessarily going to be faster in Vulkan than in Unity. What you should do is actually optimise your shaders.

Also, the main problem with if/else is divergence within groups of invocations which operate in lock-step. If you can avoid that, the performance impact will be far smaller.
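For example, a per-lane select usually compiles to a conditional move instead of a jump, so every lane runs the same instruction stream (a C-style sketch for illustration; the HLSL ternary behaves the same way):

// Divergent: lanes of the same wavefront that disagree on the condition
// execute both sides of the branch one after the other.
float shadeBranchy(int material, float falloff) {
    if (material == 1) return 1.0f * falloff;
    return 0.5f * falloff;
}

// Select: every lane runs the same instructions; the condition only
// picks which value is kept (usually a conditional move, no jump).
float shadeSelect(int material, float falloff) {
    return (material == 1 ? 1.0f : 0.5f) * falloff;
}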


If you still want to do all that in Vulkan...

Since you are not going to do any of the triangle rasterisation, you probably won't need the render passes or graphics pipelines that the tutorials generally show. Instead you are going to need a compute pipeline. Those are far simpler than graphics pipelines, requiring only one shader and the pipeline layout (the inputs and outputs are bound via descriptor sets).
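A minimal sketch of that creation call, assuming `device`, a SPIR-V `shaderModule`, and a `pipelineLayout` built from your descriptor set layouts already exist:

VkPipelineShaderStageCreateInfo stage{};
stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
stage.module = shaderModule; // SPIR-V compiled from your compute shader
stage.pName  = "main";       // entry point name in the shader

VkComputePipelineCreateInfo pipelineInfo{};
pipelineInfo.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
pipelineInfo.stage  = stage;
pipelineInfo.layout = pipelineLayout; // describes the descriptor sets the shader uses

VkPipeline computePipeline;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &computePipeline);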

You just need to pass the swapchain image to the compute shader as a storage image in a descriptor (and of course any other data your shader may need; everything is passed via descriptors). For that you need to specify VK_IMAGE_USAGE_STORAGE_BIT in your swapchain creation structure.
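Roughly like this (the other swapchain fields are filled in as in the tutorials; note that not every surface supports storage usage, so check supportedUsageFlags from vkGetPhysicalDeviceSurfaceCapabilitiesKHR first):

VkSwapchainCreateInfoKHR swapchainInfo{};
swapchainInfo.sType      = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
// ... surface, format, extent, etc. as in the tutorials ...
swapchainInfo.imageUsage = VK_IMAGE_USAGE_STORAGE_BIT           // compute shader writes
                         | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT; // usual present usage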

Then, in your command buffer, you bind the descriptor sets with the image and other data, bind the compute pipeline, and dispatch it much as you probably do in Unity. Swapchain presentation and command buffer submission shouldn't be any different from how the graphics path works in the tutorials.
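The per-frame recording then boils down to something like this sketch, assuming `cmd` is a command buffer in the recording state, `width`/`height` are the resolution in pixels, and the descriptor set already points at the acquired swapchain image plus the voxel data:

vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayout,
                        0, 1, &descriptorSet, 0, nullptr);
// One workgroup per 8x8 pixel tile, matching the shader's [numthreads(8,8,1)].
vkCmdDispatch(cmd, (width + 7) / 8, (height + 7) / 8, 1);
// An image memory barrier is still needed to transition the swapchain image
// from VK_IMAGE_LAYOUT_GENERAL to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before presenting.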

  • I optimized it and updated my posted code, and now there are only 5 if statements and 3 switch statements in the actual ray traversal, and it is barely even noticeably faster. I see videos of people raytracing polygons in 4k with light bounces in Vulkan; it just seems weird that I can hardly even raycast voxels at 30 fps with less view distance if I should be getting similar performance. My algorithm can't be that bad, right? Thank you. – Tristan367 Apr 03 '21 at 02:56
  • @Tristan367 First, what is your GPU, and theirs? Second, you could try to translate others' algorithms into your code to see a more direct comparison... Also, consider using textures instead of `StructuredBuffer`. And floats on GPU are the cheapest thing ever... – user369070 Apr 03 '21 at 11:50
  • I have the GTX 1050. It is a decent little GPU and it can handle Blender raytracing with lots of samples just fine. I think there must be some Unity overhead or something. I have to "dispatch" the compute shader every frame from a MonoBehaviour (Unity class) in the Unity update loop. – Tristan367 Apr 03 '21 at 20:29
  • @Tristan367 The dispatch is how you tell the GPU to start doing work; you do that in Vulkan too, although there you would bundle it with other commands... While it is not impossible that Unity adds a lot of overhead, I personally doubt it. Other possible overheads are having to copy the data from CPU RAM to the GPU, GPU-CPU sync, inefficient data structures... – user369070 Apr 04 '21 at 09:53
  • I just took out ALL conditional statements. There is one tiny if statement left that I can't figure out, but there is no noticeable improvement in speed. I edited the code into my post; check it out. – Tristan367 Apr 04 '21 at 23:35
  • @Tristan367 First, break; and return; exist. Then, you could replace some of your integer operations with bitwise ones to make them more readable (&, |, ^ - and, or, xor). And more importantly, you are likely bound by memory rather than execution: try changing data structures. And try changing resolution, to see if fps changes proportionally. Optimising shaders is not just about removing `if`s: it's about searching for and removing bottlenecks. – user369070 Apr 05 '21 at 03:05
  • Optional breaks and returns require branching, right? And I don't think it's memory; the thing that makes the biggest difference is the view distance. Even at 4k, performance is great if the view distance is only a few units. Lowering the resolution helps dramatically too. It just seems odd to me that I can't raycast a few simple cubes at 30 fps yet Blender can rayTRACE thousands of polygons at about the same speed on the same computer. And what "data structures?" Indexing a flat array is about as fast of a data structure as you can get. – Tristan367 Apr 05 '21 at 04:06
  • And there wasn't room in that comment but changing the voxel buffer size I send to the GPU makes no noticeable change in performance, whether it is one voxel or 5 million. Is that what you meant by memory? – Tristan367 Apr 05 '21 at 04:15
  • First, read the second paragraph of my reply again: branches are only bad if they cause divergence. And you will return when out of the loop anyway. Returning from part of a warp just turns those lanes inactive and only wastes their share of the performance. By memory, I meant the time it takes the GPU to load data onto the silicon itself. Flat arrays aren't that great at holding 3D grids (locality problems); try textures instead, as those may also use specialised hardware memory... Try making the view distance an integer that holds the number of voxels traversed, so that threads which never hit anything finish at the same time... – user369070 Apr 05 '21 at 05:00
  • Welp, since low-level optimisations don't help anymore, it's time to rethink your approach altogether. I heard more complicated engines are using trees to accelerate marching: blocks of empty space clumped together and marched in one single step. For a (maybe) simpler approach you could try using shared memory or lane shuffles, checking multiple voxels per step for intersections in separate lanes, and adding the progress to the correct threads. Avoiding divergence will be harder there, and it requires more time thinking overall... Also, branches are not fire, and your compiler will try to optimise some out... – user369070 Apr 29 '21 at 15:58
  • Thank you for the advice; yes, I think I'm going to try using octrees for this. I can already get max vanilla Minecraft view distance if I turn the resolution down to 1280x720, so it's not far off. Traversing an octree would probably allow it to exceed Minecraft view distances, or at least match them at 1080p. However, this is without any light bouncing implemented yet. And yes, I got a little carried away with taking out the branching lol, it is actually slightly faster though. – Tristan367 Apr 29 '21 at 20:23
  • How could I structure the data in an HLSL buffer? HLSL won't let me make a struct with a member of its own type, so I can't think of a way to store nodes of data like that which wouldn't be super messy and complicated. – Tristan367 Apr 30 '21 at 09:34
  • That's something for a separate question (and googling). But I'd try an array of ints (nodes?), where each bit specifies whether a child node exists, with some constant maximum depth; see the sketch below. – user369070 Apr 30 '21 at 15:55
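A minimal sketch of that kind of flat, pointer-free octree layout (C++ for illustration only; all names here are hypothetical, and HLSL's countbits would play the role of std::popcount):

#include <bit>
#include <cstdint>
#include <vector>

// Each node packs an 8-bit child-existence mask with the index of its
// first child; children that exist are stored contiguously in `nodes`,
// so no self-referential struct is needed.
struct FlatOctree {
    std::vector<uint32_t> nodes; // node = childMask | (firstChildIndex << 8)
};

bool childExists(uint32_t node, int octant) { return (node >> octant) & 1u; }

// Index of the child in `octant`: first-child index plus the number of
// existing siblings stored before it (popcount of the lower mask bits).
uint32_t childIndex(uint32_t node, int octant) {
    uint32_t mask  = node & 0xFFu;                 // which of the 8 children exist
    uint32_t below = mask & ((1u << octant) - 1u); // siblings stored before this one
    return (node >> 8) + std::popcount(below);
}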