Intro
Thinking performance.
Identify the target framerate, aim your approach on hitting that target framerate.
- Everything needs to be as efficient as possible
- Adjust pipelines to engine and hardware restrictions
- Try to offload parts to pre-calculations
- Use the engine’s pool of techniques to achieve quality at suitable cost
What is rendering?

- CPU and GPU handle different parts of the rendering calculations
- They are interdependent and can bottleneck each other
- Know how to the load is distributed between the 2
Shading Techniques
- Deferred shading
- Compositing based using the G-Buffer
- Shading happens in deferred passes
- Good at rendering dynamic lighting
- More flexible when it comes to disabling features, less flexible when it comes to surface attributes
Before Rendering
Rendering threads

Rendering is heavily parallel process. It happens on multiple threads, main threads are CPU(Game), CPU(Draw) and
GPU, reality there is many threads that branch and converge.
UE4 Cmd (stat unit, stat unitgraph)
CPU – Game thread
Before we can render anything we first need to know where everything will be,
Calculate all logic and transforms
- Animations
- Positions of models and objects
- Physics
- AI
- Spawn and destroy, hide and unhide
Results: UE4 now knows all transforms of all models.
CPU – Draw thread
Before we can use the transforms to render the image we need to know what to include in the rendering, ignoring this
question might take rendering expensive on GPU.
Occlusion process – Builds up a list of all visible models/objects, happens per object – not per triangle
4 Stage process
- Distance Culling (manually, LOD Component, Cull Distance Volume)
- Frustum Culling (what is in front of the camera, wide FOV more objects to render)
- Precomputed Visibility
- Occlusion Culling
Precomputed visibility answers more complex occlusion questions,
Objects occluded by other objects, divides the scene into a grid, each grid cell remembers what is visible at that location.
Dynamic Occlusion Culling checks the visibility state on every model, that is mostly run on the CPU but some parts are
GPU handled.
Occlusion Performance Implication
- Set up manual culling (i.e. distance culling, per-computed vis)
- Even things like particles occlude
- Many small objects cause more stress on CPU for culling
- Large models will rarely occlude and thus increase GPU
- Know your world and balance objects size vs count

Results: UE4 now has a list of models to render.
Geometry Rendering
The GPU now has a list of models and transforms but if we just render this info out we could possibly cause a lot of
redundant pixel rendering. Similar to excluding objects, we need to exclude pixels, we need to figure out which pixels
are occluded.
To do this, we generate a depth pass and use it to determine if the given pixel is in front and visible.
GPU – Prepass / Early-Z Pass

Render teapot first, then render the box.

Previous strategy doesn’t work, that why we need to depend on the depth of those pixels to know if this object or this
pixel is behind in front of another object and then decide if we need to render it.
Question 1. How does the renderer associate the early-z pass with an actual object in the scene?
It doesn’t really associated object per object what it happens, it knows the position of pixel on the screen, so what it
need to render an object or a pixel it knows. Ignore it or keep it.
Draw calls
- Now we are ready to actually rendering some geometries, so in order to be efficient, the GPU render a drawcall by
drawcall, not a triangle by a triangle.
- A drawcall is it’s a group of triangles that share the same properties.
- Drawcalls are prepared by CPU(Draw) thread
- Distilling rendering info for objects into a GPU state ready for submission
2,000 – 3,000 is reasonable, more than 5,000 is getting high, more than 10,000 is probably a problem, on mobile this number
is far lower (few hundred max), draw calls is determined by visible objects.
- Drawcalls have a huge impact on the CPU (draw) thread
- Has high overhead for preparing GPU states
- Usually we hit the issues with drawcalls way before issues with tri count.
Imagination of the overhead of a draw call vs that triangles,
Copying 1 single 1GB file vs Copying 1 million 1KB files.
Drawcalls performance implications:
- Render your triangles with as few drawcalls as possible
- 50,000 triangles can run worse than 50 million dependents on scene setup (Drawcalls)
- When optimizing scene, know your bottleneck (Drawcall vs Tri count)
Optimizing Drawcalls
Merging objects
To lower the drawcalls it is better to use fewer larger models than many small ones. You can do that too much, it impacts
other things negatively
- Occlusion
- Lightmapping
- Collision calculation
- Memory
Good balance between size and count is a good strategy.
Drawcall is related directly to how many objects you have and how many unique material IDs you have.
Merging guidelines
- Target low poly objects
- Merge only meshes within the same area
- Merge only meshes sharing the same material
- Meshes with no or simple collision are better for merging
- Distant geometry is usually great to merge (fine with culling)
HLODs
Hierarchical Level of Detail
- Regular LODs means a model becomes lower poly in the distance
- Essentially swaps one object for another simpler object (less materials)
- Hierarchical LOD (HLOD) is a bigger version, it merges objects together in the distance to lower the drawcalls
- Groups objects together into single drawcalls
- Grouping need to be done manually
Instanced Rendering
- Groups objects together into single drawcalls
- Grouping need to be done manually
Vertex Processing
First thing processing the Drawcall
Vertex shader takes care of this process
Vertex shader is a small program specialized in vertex processing
Runs completely on the GPU and so they are fast
Input is vertex data in 3D space output vertex data in screen-space
Vertex shaders – Common tasks
- It converts local VTX positions to world position
- It handles vertex shading/coloring
- It can apply additional offsets to vertex positions
Practical examples of world position offset vertex shaders are
- Cloth
- Water displacement
- Foliage wind animation
Why animate things this way?
Scalability with very high number of vertices, imagine a forest, it could involve millions of vertices to animate.
Vertex shaders do not modify the actual object or affect the scene state, it is purely a visual effect.
The CPU is not aware of what the vertex shaders do, thus things like physics or collisions will not take it into account.
Vertex Shaders Performance Implications
- The more complex the animations performed the slower
- The more vertices affected the slower
- Disable complex vertex shader effects on distant geometry
Rasterizing and G-Buffer
Rasterizing
GPU ready to render pixels, determine which pixels should be shaded called rasterizing, done drawcall by drawcall,
then tri by tri.

See here on this magnified pixel grid, we have a blue triangle and the rasterization for this triangle gives up those
orange pixels. Now the thing that need to know it happens drawcall by drawcall again to be more efficient, then it
goes triangle by triangle by same order it is submitted to the GPU.
Pixel shaders are responsible for calculating the pixel color, input is generally interpolated vertex data, texture
samplers, … etc.
Rasterizing inefficiency
When rasterizing dense meshes at distance, they converge to only few pixels. A waste of vertex processing.
i.e. A 100k tris object seen from so far away that it would be 1 pixel big, will only show 1 pixel of its closest triangle!
Overshading
Due to hardware design, it always uses a 2x2 pixel quad for processing. If a triangle is very small or very thin then
it means it might process 4 pixels while only 1 pixel is actually filled.

The gray gird is basically pixels, and the orange grid on the top which is the pixel quads that the GPU can process.
So, if we have a tiny triangle like first pic, ideally, we only having three pixels so we just need to process through those
three pixels to output the final color. However, in reality this is not what happens on the GPU, the GPU need to process
12 pixels just to render us those three pixels at the end. So, here we see like our first waste of pixel processing for small triangles.
Even worse case,

How to visualize overshading, Lit -> Optimization View modes -> Quad Overdraw.
Rasterization and Overshading Performance Implications
- Triangles are more expensive to render in great density
- When seen at a distance the density increases
- Thus, reducing triangle count at a distance (lodding / culling) is critical
- Very thin triangles are inefficient because they pass through many 2x2 pixel quads yet only fill a fraction of them
- The more complex the pixel shader is the more expensive overshading
Result are written out to:
- Multiple G-Buffers in case of deferred shading
- Shaded buffer in case of forward shading
G-Buffer
It is a rendered image encoding special data, these buffers are then used for different uses – mainly lighting, the frame rendered out in multiple G-Buffers.