Improving Memory Usage in the Daemon Renderer

After the move to compressed texture formats, we looked for ways to reduce the memory used for storing the geometry of models and the world map.

Modern GPUs are highly parallel: hundreds or even thousands of threads run at the same time, but they all share a single memory bus. Accesses to global memory therefore have to be serialized, so the more memory a shader has to read, the more likely it is to stall.

Unlike textures, which either store RGB(A) colours or normal vectors, the data that makes up a model’s geometry is more complex. There are several per-vertex attributes that the engine uses to draw a model:

the vertex position
the vertex normal, tangent and binormal vectors
the texture coordinates
the lightmap coordinates
the vertex colour
the indexes of the bones the vertex is attached to (up to 4)
the weights of effect of the bones on the vertex position (up to 4)

The engine used to pass all these attributes as 4-float vectors, even attributes that used fewer than 4 components (e.g. texture coordinates are only 2D, but the engine stored them in an (x, y, pad, pad) vector). So overall the engine used nine 4-float vectors per vertex, which means 36 floats or 144 bytes.

position.x	position.y	position.z	pad
normal.x	normal.y	normal.z	pad
tangent.x	tangent.y	tangent.z	pad
binormal.x	binormal.y	binormal.z	pad
texcoord.s	texcoord.t	pad	pad
lmcoord.s	lmcoord.t	pad	pad
colour.r	colour.g	colour.b	colour.a
blendindex[0]	blendindex[1]	blendindex[2]	blendindex[3]
blendweight[0]	blendweight[1]	blendweight[2]	blendweight[3]
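
As a rough sketch of that old layout, the struct below mirrors the nine padded vectors. The struct, its field names and the `kPaddingFloats` constant are made up for illustration and are not taken from the engine source.

```cpp
#include <cstddef>

// Hypothetical mirror of the old per-vertex layout: every attribute
// occupies a full 4-float vector, whether it needs 4 components or not.
struct UncompressedVertex {
    float position[4];    // x, y, z, pad
    float normal[4];      // x, y, z, pad
    float tangent[4];     // x, y, z, pad
    float binormal[4];    // x, y, z, pad
    float texCoord[4];    // s, t, pad, pad
    float lightCoord[4];  // s, t, pad, pad
    float colour[4];      // r, g, b, a
    float blendIndex[4];  // up to 4 bone indexes
    float blendWeight[4]; // up to 4 bone weights
};

// Floats that carry no information: one per 3-component vector,
// two per 2-component vector.
constexpr std::size_t kPaddingFloats = 4 * 1 + 2 * 2;

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");
static_assert(sizeof(UncompressedVertex) == 36 * sizeof(float), "nine 4-float vectors");
static_assert(sizeof(UncompressedVertex) == 144, "36 floats are 144 bytes");
```

Under these assumptions, 8 of the 36 floats (32 bytes per vertex) are pure padding before any precision is reduced.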

To reduce this number, we use several methods like storing data at lower precision, combining multiple attributes into one or transforming the data into a different mathematical representation altogether.

Vertex Position

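The next smaller formats for the vertex position would be 16-bit floats or 16-bit normalized integers. Those are good enough for models, but they don’t provide the necessary precision for map geometry. So non-skeletal data keeps full float positions, and only skeletal models use 16-bit normalized integers (see the packed layouts below).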
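
To get a feel for the precision argument, the following sketch quantizes one coordinate to a signed normalized 16-bit integer. The `extent` parameter (the largest absolute coordinate the data may contain) and all function names are assumptions made for illustration; the text does not say how the engine scales positions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a coordinate in [-extent, +extent] to a signed normalized
// 16-bit integer. Values outside the range are clamped.
inline int16_t quantizeSnorm16(float value, float extent) {
    assert(extent > 0.0f);
    float n = value / extent;                 // normalize to [-1, 1]
    if (n > 1.0f) n = 1.0f;
    if (n < -1.0f) n = -1.0f;
    return static_cast<int16_t>(std::lround(n * 32767.0f));
}

// Reconstruct the coordinate from its quantized form.
inline float dequantizeSnorm16(int16_t q, float extent) {
    return static_cast<float>(q) / 32767.0f * extent;
}

// Distance between two adjacent representable coordinates.
inline float snorm16Step(float extent) {
    return extent / 32767.0f;
}
```

With these definitions the step size grows linearly with the extent: a model that fits within 64 units of its origin gets steps of roughly 0.002 units, while geometry spanning 32768 units in each direction would be limited to steps of about one unit. The numbers are only examples, but they show why a format that is fine for models can be too coarse for a whole map.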

Vertex Colours

The vertex colour is usually stored as 8-bit normalized values in the model and map files, so storing it as floats in the engine doesn’t provide any benefit. Switching from a 4-float vector to a 4-byte vector saves 12 bytes per vertex.
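
A minimal sketch of that conversion, assuming colours arrive as floats in the 0 to 1 range (the type and function names are hypothetical):

```cpp
#include <cstdint>

// Four normalized 8-bit channels, 4 bytes in total.
struct ColourRGBA8 {
    uint8_t r, g, b, a;
};
static_assert(sizeof(ColourRGBA8) == 4, "4 bytes instead of 16");

// Convert one channel from [0, 1] to an 8-bit normalized value,
// clamping out-of-range input and rounding to the nearest step.
constexpr uint8_t floatToUnorm8(float c) {
    return static_cast<uint8_t>(
        (c < 0.0f ? 0.0f : (c > 1.0f ? 1.0f : c)) * 255.0f + 0.5f);
}

// Pack a 4-float colour into the 4-byte representation.
inline ColourRGBA8 packColour(const float rgba[4]) {
    return { floatToUnorm8(rgba[0]), floatToUnorm8(rgba[1]),
             floatToUnorm8(rgba[2]), floatToUnorm8(rgba[3]) };
}
```

On the GPU side no unpacking code is needed: in OpenGL, for example, such an attribute can be declared as `GL_UNSIGNED_BYTE` with the `normalized` argument of `glVertexAttribPointer` set to `GL_TRUE`, and the shader still sees floats between 0 and 1.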

Texture and Lightmap coordinates

The texture and lightmap coordinates each used only 2 components of a 4-float vector, so the first improvement is to store both in a single vector: the texture coordinates in the x and y channels and the lightmap coordinates in the z and w channels. That alone reduces the vertex size by 16 bytes.

But the texture and lightmap coordinates don’t need the full float precision, so we can store them as 16-bit values. If texture coordinates were always in the 0 .. 1 range, a normalized 16-bit value would be optimal, but the engine should support texture wrapping, so a larger range of values is required. That means we have to use either a scaled integer coordinate or a 16-bit float value.

The scaled integer format provides constant precision for a given range, but this range has to be defined globally in the engine. A large texture coordinate range allows repeating textures, which are often used on walls, floors, etc., but it limits the available precision. A range that works for large walls and small structures simultaneously is difficult to find.

So we chose the 16-bit float format because it provides a greater range while still having enough precision for the most often used coordinates between 0 and 1.

That reduces the vertex size by another 8 bytes.
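
The sketch below shows what such a conversion could look like on the CPU. It is a simplified float-to-half routine, not the engine’s code: it rounds to nearest, flushes values too small for a normal half to zero, and turns overflow (and NaN) into infinity. All names are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Simplified 32-bit float -> 16-bit float conversion (see caveats above).
inline uint16_t floatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const uint32_t sign = (bits >> 16) & 0x8000u;
    const int32_t exponent = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mantissa = bits & 0x7FFFFFu;
    if (exponent <= 0) return static_cast<uint16_t>(sign);             // too small: signed zero
    if (exponent >= 31) return static_cast<uint16_t>(sign | 0x7C00u);  // too large: infinity
    uint32_t half = (static_cast<uint32_t>(exponent) << 10) | (mantissa >> 13);
    if (mantissa & 0x1000u) ++half;  // round up; a carry into the exponent is still correct
    return static_cast<uint16_t>(sign | half);
}

// Inverse conversion; half subnormals are treated as zero to match the above.
inline float halfToFloat(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exponent = (h >> 10) & 0x1Fu;
    const uint32_t mantissa = h & 0x3FFu;
    uint32_t bits = sign;
    if (exponent == 31) bits |= 0x7F800000u | (mantissa << 13);
    else if (exponent != 0) bits |= ((exponent - 15 + 127) << 23) | (mantissa << 13);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Texture and lightmap coordinates share one 4 x 16-bit vector (8 bytes).
struct PackedTexCoords {
    uint16_t s, t, lmS, lmT;
};
static_assert(sizeof(PackedTexCoords) == 8, "8 bytes instead of 16");

inline PackedTexCoords packTexCoords(float s, float t, float lmS, float lmT) {
    // 65504 is the largest finite half value; larger coordinates would overflow.
    assert(std::fabs(s) <= 65504.0f && std::fabs(t) <= 65504.0f);
    assert(std::fabs(lmS) <= 65504.0f && std::fabs(lmT) <= 65504.0f);
    return { floatToHalf(s), floatToHalf(t), floatToHalf(lmS), floatToHalf(lmT) };
}
```

A 16-bit float has 10 explicit mantissa bits, so between 0.5 and 1 adjacent values are 2^-11 (about 0.0005) apart, and the spacing doubles with every power of two above that. This is the trade-off described above: precision is concentrated near the commonly used 0 to 1 range, while wrapped coordinates remain representable at a coarser step.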

Bone Indexes and Weights

As the engine supports a maximum of 128 bones, we need only a single byte per bone index. The bone weights are always between 0 and 1, and a precision of 8 bits is enough for practical purposes, so they could be stored in another 4-byte vector.

However, we pack them together into a 4-ushort vector, where the most significant byte of each element is the bone index and the least significant byte is the bone weight × 256. A weight of exactly 1 would not fit into the byte, but fortunately the weights are guaranteed to sum to 1 and to be sorted from largest to smallest. That means the first weight can never be 0, and the second, third and fourth weights can never be 1 (the second is at most 0.5). Thus we can store (1 - weight) for the first weight and keep everything in the 0 to 255 range.

This reduces the vertex size by 24 bytes.
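
The packing scheme can be sketched as follows. The function names, the `isFirst` flag and the choice to round to the nearest step are assumptions of this sketch; only the byte layout and the (1 - weight) trick come from the description above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Pack one bone index and weight into 16 bits: index in the high byte,
// weight * 256 in the low byte. The first (largest) weight is stored as
// (1 - weight) so that a weight of exactly 1 becomes 0 and fits.
inline uint16_t packBlend(unsigned boneIndex, float weight, bool isFirst) {
    assert(boneIndex < 128);  // the engine supports at most 128 bones
    const float stored = isFirst ? 1.0f - weight : weight;
    long q = std::lround(stored * 256.0f);
    if (q < 0) q = 0;
    if (q > 255) q = 255;     // defensive clamp; sorted, normalized weights never reach it
    return static_cast<uint16_t>((boneIndex << 8) | static_cast<unsigned>(q));
}

inline unsigned unpackBoneIndex(uint16_t packed) {
    return packed >> 8;
}

inline float unpackWeight(uint16_t packed, bool isFirst) {
    const float w = static_cast<float>(packed & 0xFFu) / 256.0f;
    return isFirst ? 1.0f - w : w;
}
```

For a vertex attached to a single bone, the first element stores a low byte of 0 (weight 1) and the remaining three store 0 as well (weight 0), so the common rigid case round-trips exactly.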

Normals, Tangents and Binormals

The normal, tangent and binormal vectors define the basis in which the normal map vectors are expressed. Although they don’t have to be orthogonal unit vectors, a basis that is not orthonormal in practice either wastes normal map precision or makes some normals impossible to represent. So, assuming that the vectors are already close to orthogonal unit vectors, the engine now enforces this mathematically.

Although this has an influence on the normal maps in theory, the differences are not noticeable in practice.

Knowing that the vectors are orthogonal unit vectors allows the use of some mathematical identities to reduce the number of values needed to compute them, e.g. the third vector has to be either the cross product of the other two or it is the negative of the cross product. The method used in the engine is based on the observation that orthogonal unit bases that have the same orientation as the standard basis can always be obtained by a rotation of the standard basis vectors.

A rotation can be efficiently stored as a unit quaternion, which is a 4-element vector of numbers between -1 and 1, so a 4-short vector is sufficient. However, there are also orthogonal unit bases that have the opposite orientation. To encode those we can just negate one of the vectors, giving us a basis with the right orientation, and encode that, but we need a way to store the extra bit of information that the vector has been negated. Adding an extra bool attribute would be possible, but there is a way to store that flag inside the quaternion itself: a unit quaternion q and its negation -q always encode the same rotation, so the engine can use the sign of the w coordinate to encode the orientation of the basis. A positive w means no negation is needed and a negative w means the vector has to be flipped.

This works great, but if the quaternion has a w component of 0, the method can’t store a negative sign and the renderer may use the wrong orientation. To avoid this anomaly, the w component is set to the minimal nonzero value if it is zero. The error added by this (for 16-bit precision) is approx. 0.2 degrees, so it is not noticeable in practice.

This technique was first used by Crytek in CryEngine 3 (see their paper on QTangents).

All this means that the normal, tangent and binormal vectors can be squeezed into a 4-short vector, saving a further 40 bytes per vertex.
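
The encoding steps can be sketched as below. This is an illustration under assumptions, not the engine’s implementation: the type and function names are invented, the input basis is assumed to be orthonormal already, and negating the binormal (rather than another vector) is an arbitrary choice for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };
struct QTangent { int16_t x, y, z, w; };  // normalized shorts, 8 bytes

inline Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline QTangent negated(QTangent q) {
    return { static_cast<int16_t>(-q.x), static_cast<int16_t>(-q.y),
             static_cast<int16_t>(-q.z), static_cast<int16_t>(-q.w) };
}

// Encode an orthonormal tangent/binormal/normal basis into 4 shorts.
inline QTangent encodeQTangent(Vec3 t, Vec3 b, Vec3 n) {
    assert(std::fabs(dot(t, b)) < 1e-3f && std::fabs(dot(t, n)) < 1e-3f &&
           std::fabs(dot(b, n)) < 1e-3f);
    // A basis with the opposite orientation is fixed by negating one vector.
    const bool flipped = dot(cross(t, b), n) < 0.0f;
    if (flipped) b = { -b.x, -b.y, -b.z };

    // Convert the rotation matrix with columns t, b, n to a unit quaternion,
    // dividing by the largest component for numerical stability.
    float x, y, z, w;
    const float trace = t.x + b.y + n.z;
    if (trace > 0.0f) {
        const float s = 2.0f * std::sqrt(trace + 1.0f);
        w = 0.25f * s; x = (b.z - n.y) / s; y = (n.x - t.z) / s; z = (t.y - b.x) / s;
    } else if (t.x > b.y && t.x > n.z) {
        const float s = 2.0f * std::sqrt(1.0f + t.x - b.y - n.z);
        w = (b.z - n.y) / s; x = 0.25f * s; y = (b.x + t.y) / s; z = (n.x + t.z) / s;
    } else if (b.y > n.z) {
        const float s = 2.0f * std::sqrt(1.0f + b.y - t.x - n.z);
        w = (n.x - t.z) / s; x = (b.x + t.y) / s; y = 0.25f * s; z = (n.y + b.z) / s;
    } else {
        const float s = 2.0f * std::sqrt(1.0f + n.z - t.x - b.y);
        w = (t.y - b.x) / s; x = (n.x + t.z) / s; y = (n.y + b.z) / s; z = 0.25f * s;
    }

    auto toShort = [](float v) -> int16_t {
        v = std::fmax(-1.0f, std::fmin(1.0f, v));
        return static_cast<int16_t>(std::lround(v * 32767.0f));
    };
    QTangent q = { toShort(x), toShort(y), toShort(z), toShort(w) };
    if (q.w < 0) q = negated(q);   // q and -q are the same rotation: make w non-negative
    if (q.w == 0) q.w = 1;         // keep w nonzero so its sign can carry the flag
    if (flipped) q = negated(q);   // negative w marks a negated binormal
    return q;
}

// Rebuild the basis; the rotation is unaffected by the sign of q.
inline void decodeQTangent(QTangent q, Vec3& t, Vec3& b, Vec3& n) {
    const float s = 1.0f / 32767.0f;
    const float x = q.x * s, y = q.y * s, z = q.z * s, w = q.w * s;
    t = { 1.0f - 2.0f * (y * y + z * z), 2.0f * (x * y + w * z), 2.0f * (x * z - w * y) };
    b = { 2.0f * (x * y - w * z), 1.0f - 2.0f * (x * x + z * z), 2.0f * (y * z + w * x) };
    n = { 2.0f * (x * z + w * y), 2.0f * (y * z - w * x), 1.0f - 2.0f * (x * x + y * y) };
    if (q.w < 0) b = { -b.x, -b.y, -b.z };
}
```

Because every term in the decoded vectors is a product of two quaternion components, negating the whole quaternion leaves the rotation unchanged, which is what frees the sign of w for the orientation flag. In a real renderer the decoding would run in the vertex shader.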

Packing it all together

The vertex data is interleaved in memory to get the best cache performance. To achieve good alignment of the data, there are two separate formats for skeletal and non-skeletal vertices.

Non-skeletal data uses floats for the position, but it doesn’t need the bone indexes and weights, so its attributes are packed into 32 bytes as follows.

position.x	position.y	position.z	colour.rgba
qtangent.xy	qtangent.zw	texcoord.st	lmcoord.st

Skeletal data uses 16-bit normalized integers to store the position and drops the lightmap coordinates, to make some room for the bone indexes and weights.

position.xy	position.zw	texcoord.st	colour.rgba
qtangent.xy	qtangent.zw	blend[0..1]	blend[2..3]
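
Expressed as C++ structs, the two layouts could look like this. The struct and field names are illustrative rather than taken from the engine source, and the meaning of the fourth position component in the skeletal format is not specified in the text, so the sketch leaves it open.

```cpp
#include <cstddef>
#include <cstdint>

// Non-skeletal (map and static model) vertex, 32 bytes.
struct StaticVertex {
    float    position[3];    // 12 bytes, full float precision
    uint8_t  colour[4];      //  4 bytes, normalized RGBA
    int16_t  qtangent[4];    //  8 bytes, normalized quaternion
    uint16_t texCoord[2];    //  4 bytes, 16-bit floats
    uint16_t lightCoord[2];  //  4 bytes, 16-bit floats
};

// Skeletal vertex, 32 bytes.
struct SkeletalVertex {
    int16_t  position[4];    //  8 bytes, normalized integers (x, y, z, w)
    uint16_t texCoord[2];    //  4 bytes, 16-bit floats
    uint8_t  colour[4];      //  4 bytes, normalized RGBA
    int16_t  qtangent[4];    //  8 bytes, normalized quaternion
    uint16_t blend[4];       //  8 bytes, bone index (high byte) and weight (low byte)
};

static_assert(sizeof(StaticVertex) == 32, "non-skeletal vertex is 32 bytes");
static_assert(sizeof(SkeletalVertex) == 32, "skeletal vertex is 32 bytes");
static_assert(offsetof(StaticVertex, qtangent) == 16, "second 16-byte row starts here");
static_assert(offsetof(SkeletalVertex, qtangent) == 16, "second 16-byte row starts here");
```

Both layouts split into two 16-byte rows, matching the tables above, and neither needs compiler-inserted padding because every field starts at a multiple of its own alignment.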

Summary

Taking all changes together, the size of a vertex has been reduced from 144 bytes to 32 bytes, i.e. the engine needs more than 1 MB less for a 10,000-vertex model than before. This may seem a small gain compared to the amount of memory saved by texture compression. However, the geometry data has to be fully processed every frame, unlike textures, where often only small parts of the data are needed, so the vertex compression should have a similar effect on memory bandwidth as the texture compression.