
Permafrost

Project 7

In an arctic lab, shoot and bash your way through your mutated coworkers and use your wits to find your best friend.

Contribution

In Permafrost I mostly did engine- and rendering-related work, namely global illumination, depth-based occlusion culling, and shadow map optimizations. I already had indirect rendering and compute-based frustum culling in place from the previous project, so implementing occlusion culling with a Hi-Z depth map was fairly straightforward.

Video showing occlusion culling

However, soon after getting occlusion culling implemented we hit an instance count limit of 2048 instances per dispatch, which caused some interesting rendering bugs. This instance limit was a result of the GPU-driven culling implementation I had introduced in the previous project.

Video showcasing what my nightmares consisted of during this time

Occlusion Culling

The occlusion culling works by first combining all instances into one gigantic instance buffer and then running a compute shader for culling. This culling shader outputs a buffer of booleans, where each boolean is a predicate saying whether the corresponding instance from the input instance buffer should be drawn or not.

Input: Instance Buffer    Output: Predicate Buffer

Instance 1                True
Instance 2                False
Instance 3                False
Instance 4                True
Instance 5                False
Instance 6                True
Instance 7                True
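To make the flow concrete, here is a minimal sketch of what such a culling pass can look like in HLSL. The buffer and function names (gInstances, gPredicates, IsOccludedHiZ) are hypothetical, and the actual test against the Hi-Z pyramid is stubbed out; this shows the shape of the pass, not our exact shader.

struct InstanceData
{
    float4x4 transform;
    float4   boundsCenterRadius; // xyz = bounding sphere center, w = radius
};

StructuredBuffer<InstanceData> gInstances  : register(t0);
RWStructuredBuffer<uint>       gPredicates : register(u0);

cbuffer CullConstants : register(b0)
{
    uint gInstanceCount;
};

bool IsOccludedHiZ(float4 centerRadius)
{
    // Details omitted: project the bounding sphere to screen space,
    // pick the Hi-Z mip whose texels cover its footprint, and compare
    // the sphere's nearest depth against the stored depth.
    return false;
}

[numthreads(64, 1, 1)]
void CSCull(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= gInstanceCount)
        return;

    // 1 = keep and draw this instance, 0 = cull it.
    gPredicates[id.x] = IsOccludedHiZ(gInstances[id.x].boundsCenterRadius) ? 0u : 1u;
}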

This leaves me with a 1:1 boolean buffer representation of my input instance buffer. Next, I pipe it through a stream compaction shader that outputs a dense instance buffer containing only the instances from the input buffer whose predicate was true. I can then use this culled, compact instance buffer in my vertex shaders, and combined with writing the instance count for each mesh to an indirect command buffer, I achieve GPU-side culling and perfect instancing. Or that was the goal, at least. Most of this was already done in the previous project, but my implementation, heavily based on this blog, had one major flaw: my stream compaction shader could only work with 2048 instances per dispatch because of the way I had used thread group shared memory to do efficient compaction.

The method for stream compaction I used is called parallel prefix sum scan, originally proposed by Blelloch; an implementation can also be found in GPU Gems. It works by treating the booleans in the predicate buffer as integers and producing an array of integers as follows:

y0 = 0
y1 = x0
y2 = x0 + x1
…
y(n-1) = x0 + x1 + … + x(n-2)

This leaves me with a buffer of integers that represent the output indices of the predicates in the original boolean buffer, and using it I can write a dense instance buffer. Taking the predicate buffer from the earlier figure, [1, 0, 0, 1, 0, 1, 1], the exclusive scan yields [0, 1, 1, 1, 2, 2, 3], so instances 1, 4, 6 and 7 land at indices 0, 1, 2 and 3 of the dense buffer.
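For reference, below is a trimmed-down sketch of what this single-group compaction can look like, following the work-efficient scan from GPU Gems. The names are again hypothetical, and the predicate buffer is assumed to be zero-padded up to 2048 entries; one group of 1024 threads scans two elements each.

#define GROUP_SIZE 1024
#define SCAN_SIZE  (GROUP_SIZE * 2) // two elements per thread

struct InstanceData
{
    float4x4 transform;
    float4   boundsCenterRadius;
};

StructuredBuffer<InstanceData>   gInstances       : register(t0);
StructuredBuffer<uint>           gPredicates      : register(t1);
RWStructuredBuffer<InstanceData> gCulledInstances : register(u0);

groupshared uint sScan[SCAN_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSCompact(uint tid : SV_GroupThreadID)
{
    // Each thread loads two predicate flags into shared memory.
    sScan[2 * tid]     = gPredicates[2 * tid];
    sScan[2 * tid + 1] = gPredicates[2 * tid + 1];

    // Up-sweep: build partial sums in place (Blelloch scan).
    uint offset = 1;
    for (uint d = SCAN_SIZE >> 1; d > 0; d >>= 1)
    {
        GroupMemoryBarrierWithGroupSync();
        if (tid < d)
        {
            uint ai = offset * (2 * tid + 1) - 1;
            uint bi = offset * (2 * tid + 2) - 1;
            sScan[bi] += sScan[ai];
        }
        offset *= 2;
    }

    // Zero the root, then down-sweep to turn the partial sums into an
    // exclusive prefix sum.
    if (tid == 0)
        sScan[SCAN_SIZE - 1] = 0;

    for (uint d2 = 1; d2 < SCAN_SIZE; d2 *= 2)
    {
        offset >>= 1;
        GroupMemoryBarrierWithGroupSync();
        if (tid < d2)
        {
            uint ai = offset * (2 * tid + 1) - 1;
            uint bi = offset * (2 * tid + 2) - 1;
            uint t  = sScan[ai];
            sScan[ai] = sScan[bi];
            sScan[bi] += t;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // Scatter: every surviving instance lands at its scanned index,
    // producing a dense output buffer.
    if (gPredicates[2 * tid] != 0)
        gCulledInstances[sScan[2 * tid]] = gInstances[2 * tid];
    if (gPredicates[2 * tid + 1] != 0)
        gCulledInstances[sScan[2 * tid + 1]] = gInstances[2 * tid + 1];
}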

The thread group shared memory block I used to write these indices into was 2048 elements in size. With a dispatch group size of (1024, 1, 1) and only one group launched in each dimension, each thread worked on two instances at once, which allowed me to compact 2048 instances in total. This worked fine as long as we stayed under that instance count, and we managed to do so for the entirety of project six, but now that we had hit the limit and seen the issues it brought, I panicked and considered just turning off culling and indirect rendering altogether for the time being.

 

However, after giving it some thought I came up with my own solution that would allow up to 65536 instances per cull. My idea was to introduce another dimension to the dispatch and use the thread groups along Y to issue parallel culls, since each group would have its own group shared memory. The only problem was how to sync and combine the results properly, and I was stuck on that for a few days. I tried a bunch of ideas, getting desperate enough to create an entirely separate RWBuffer that would only hold the different index offsets per thread group. As is often the case, the solution I eventually came up with was stupidly simple, and I had been over-complicating the problem in my head.

The solution I came up with was to just introduce another group-shared variable representing the offset into the final output instance buffer: the index where the previous thread group left off. So if my groupID.y is 0, my toffset is 0; if my groupID.y is 1, my toffset is wherever the groupID.y == 0 group left off.

This had roughly been my idea for the last couple of days, but I could not figure out a way to share this offset between thread groups: since they run in parallel, I could never know how far the first thread group had written while the second was still writing. The stupidly simple solution I eventually landed on, shown in the image above, was to just recalculate the previous thread group's offset at the start of the next one.

This recalculation works by simply adding together the predicate flags from the preceding thread groups' ranges, which gives me how many instances they wrote to the output buffer. And with that I finally had working occlusion culling that could handle up to 65536 instances per dispatch.
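Below is a sketch of how that recalculation can look, building on the previous snippet (it reuses GROUP_SIZE, SCAN_SIZE, gPredicates, gInstances, gCulledInstances and sScan from there, and the scan body is elided for brevity; all names remain hypothetical). Each thread group along Y owns a 2048-instance slice and, before scattering, re-sums every predicate flag that precedes its slice to find where the earlier groups left off.

groupshared uint sGroupOffset; // where the previous groups left off

[numthreads(GROUP_SIZE, 1, 1)]
void CSCompactWide(uint tid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    uint sliceBase = gid.y * SCAN_SIZE;

    // Recalculate the running offset: sum every predicate flag that
    // precedes this group's slice, with the work spread across threads.
    if (tid == 0)
        sGroupOffset = 0;
    GroupMemoryBarrierWithGroupSync();

    uint partial = 0;
    for (uint i = tid; i < sliceBase; i += GROUP_SIZE)
        partial += gPredicates[i];
    InterlockedAdd(sGroupOffset, partial);
    GroupMemoryBarrierWithGroupSync();

    // ... exclusive scan of this slice exactly as in the single-group
    // version, but reading gPredicates[sliceBase + ...] ...

    // Scatter with the group offset applied, e.g.:
    // if (gPredicates[sliceBase + 2 * tid] != 0)
    //     gCulledInstances[sGroupOffset + sScan[2 * tid]] =
    //         gInstances[sliceBase + 2 * tid];
}

With 32 groups along Y (a Dispatch(1, 32, 1) from the CPU side) this covers 32 × 2048 = 65536 instances. The re-summing is redundant work, but it avoids any cross-group synchronization, which thread groups within a single dispatch cannot do safely.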

Global Illumination

Very early in pre-production the graphics artists requested some form of global illumination, as it would help greatly with the atmosphere we were trying to build. My first naive attempt was to implement reflective shadow maps, which worked for local indirect light: light could bounce off a red wall and tint the ground next to it, but it couldn't fill a room with light.


So after realizing that reflective shadow maps weren't going to be enough, I considered implementing light propagation volumes, or LPV for short, but that didn't seem too far off in effort from implementing VXGI, which I deemed a better fit for our game, so that's what I eventually decided on. Implementing a minimum viable product of VXGI went much faster than I originally anticipated, and after a few days we had global illumination, albeit a bit flickery and performance heavy.

This was deemed good enough, but I made a mental note to come back and fix some of the issues later: I wanted to shrink the memory footprint of holding all those voxels, and I wanted to make it run faster. This eventually became my specialization project, and you can read more about it here.

Team

Safety First

Programming

Niklas Fredriksson

Simon Igelström

Zoe Thysell

Erik Ljungman

Semi Asani

Joakim Larsson

Neo Nemeth

Animation

Jack Thell Malmberg

Elias Runelid

Tanya Bengtsson

Level Design

Mina Mirhosseini

Tilde Persson

Christoffer Carlsvärd

Graphics

Victor Ek

Mirjam Hildahl

Philip Tingberg

Daniel Gryningstjerna

William Holster

Technical Artist

Frode Ödehn

Malte Hedenström

Sara Ekstam
