Overview with pseudocode

For the course on game engines I teach in the IT University of Copenhagen's MSc in Games, one project is to implement a 3d wireframe software renderer, without using OpenGL or other existing APIs, in order to fully understand what's going on behind the scenes in 3d rendering.

This document gives a concise overview of the series of mathematical transforms needed to do that: to take triangles, given by their coordinates in 3d space, and turn them into a wireframe view on a 2d screen, as seen from a movable camera. I aim to provide all the details needed to implement a renderer from scratch, without using existing libraries, while being much more concise than the lengthy mathematical treatments given in a typical graphics textbook (some of those are linked at the end of this document, for those wanting further information).

(I do apologize for the lack of illustrative diagrams, as I haven't had a chance to make any yet.)

A geometric transform is simply an operation that moves a point or set of points: translates them, rotates them about an axis, etc. Conceptually, geometric transforms are just applications of trigonometry. We could use trigonometric formulas directly. For example, if we want to take a point $(x,y,z)$, and rotate it by $\theta$ degrees around the $z$ axis (i.e., rotate it in its $x$–$y$ plane), the resulting point is located at: \begin{align*} x' &= x \cos \theta - y \sin \theta\\ y' &= x \sin \theta + y \cos \theta\\ z' &= z \end{align*}
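
The formulas above translate almost directly into code. Here is a minimal Python sketch; the function name `rotate_z` is just an illustrative choice, and the angle is converted to radians because that is what Python's `math` functions expect.

```python
import math

def rotate_z(point, theta_degrees):
    """Rotate (x, y, z) about the z axis using the trigonometric formulas directly."""
    x, y, z = point
    theta = math.radians(theta_degrees)  # math.sin/math.cos expect radians
    x_new = x * math.cos(theta) - y * math.sin(theta)
    y_new = x * math.sin(theta) + y * math.cos(theta)
    return (x_new, y_new, z)

# Rotating (1, 0, 5) by 90 degrees moves the point onto the y axis
# (up to floating-point rounding); z is unchanged.
rotated = rotate_z((1.0, 0.0, 5.0), 90.0)
```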

It's convenient to represent these in matrix form, however, because then they
can be easily stored and chained. In addition, chains of transforms can be
accumulated into a single precomputed matrix for future use: multiplying two
transform matrices gives a new matrix that applies *both* of the original
transforms.
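
To illustrate accumulation, here is a small Python sketch that combines two rotations into one precomputed matrix. The helper names (`rotation_z`, `mat_mul`, `mat_vec`) are hypothetical, matrices are stored as plain lists of rows, and the sketch assumes the column-vector convention that this document settles on.

```python
import math

def rotation_z(theta):
    """3x3 matrix (column-vector form) rotating by theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def mat_mul(a, b):
    """Matrix product a * b for square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

first = rotation_z(math.radians(30))
second = rotation_z(math.radians(60))
combined = mat_mul(second, first)  # computed once, reused for every point

point = [1.0, 0.0, 0.0]
step_by_step = mat_vec(second, mat_vec(first, point))  # two multiplications
all_at_once = mat_vec(combined, point)                 # one multiplication
```

Both results agree up to rounding: rotating by 30 and then 60 degrees is the same as applying the single combined 90-degree matrix.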

There are two conventions for matrix-based transforms: we can represent our
point as a *row vector*, or as a *column vector*. Either way results
in a transform identical to the direct trigonometric formulas above.

With row vectors: \begin{equation*} \begin{bmatrix}x' & y' & z'\end{bmatrix} = \begin{bmatrix}x & y & z\end{bmatrix} \begin{bmatrix} \cos \theta & \sin \theta & 0 \\ -\sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \end{equation*}

With column vectors: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z'\end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z\end{bmatrix} \end{equation*}

If you multiply them out, you can verify that both forms are equivalent to the three equations at the top. The transformation matrix for a column vector is the transpose of the transformation matrix for a row vector (and vice versa). Which convention you choose is arbitrary, but make sure you use the right transformation matrices for the convention you're using! In this document, we'll use the column-vector convention.

The choice of convention also impacts the direction of chaining. As can be seen above, the transformation matrix for row vectors is multiplied on the right, and in general chains towards the right. For column vectors, the transformation matrix is multiplied on the left, and transforms chain towards the left. This can be important to remember when order of application makes a difference.
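
The following Python sketch shows both conventions side by side, and shows that order matters. The helper names are hypothetical, and the "stretch" matrix (which doubles $x$) is simply a convenient second transform that doesn't commute with a rotation.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][j] for k in range(len(v))) for j in range(len(m[0]))]

# Column-vector forms: a quarter turn about z, and a stretch along x.
quarter_turn = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
stretch_x = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [1.0, 0.0, 0.0]

# Column vectors: "turn, then stretch" chains towards the left.
col_result = mat_vec(mat_mul(stretch_x, quarter_turn), p)

# Row vectors: transpose each matrix and chain towards the right.
row_result = vec_mat(p, mat_mul(transpose(quarter_turn), transpose(stretch_x)))

# Swapping the order ("stretch, then turn") gives a different point.
swapped = mat_vec(mat_mul(quarter_turn, stretch_x), p)
```

Here `col_result` and `row_result` are both the point $(0,1,0)$, while `swapped` is $(0,2,0)$.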

A number of 3d transforms can be represented as $3 \times 3$ matrices like the
rotation example given in the previous section. These are called the
*linear* transforms (in the same sense as *linear algebra*). In
computer graphics, these cover almost all of the commonly used transforms, with
one important exception: translation, where we simply move a point by an offset.
The set of linear transforms, plus translation, is called the *affine
transforms*.

In equation form, translation of a point $(x,y,z)$ by an offset $(t_x,t_y,t_z)$ produces a new point: \begin{align*} x' &= x + t_x\\ y' &= y + t_y\\ z' &= z + t_z \end{align*}

You can verify that there's no way to write a $3 \times 3$ transformation
matrix that is equivalent to those equations: any $3 \times 3$ matrix maps the
origin $(0,0,0)$ to itself, while a translation moves it to $(t_x,t_y,t_z)$. We
would *really* like to be able to encode all affine transforms in
transformation matrices, however, so that we have a uniform representation for
transforms.

The solution is to work in an augmented space called *homogeneous
coordinates*. We augment our points to 4 dimensions, by adding a "dummy"
coordinate, $w$; for our purposes, this is always normalized to 1. These new
4-dimensional points therefore have $4 \times 4$ transformation matrices
instead of $3 \times 3$ ones. In this augmented space, it turns out that it's
possible to encode some kinds of transforms of 3-dimensional points which
aren't possible to encode directly in 3d vectors with $3 \times 3$
transformation matrices.

Translation can now be defined: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

In addition, we can easily "upgrade" all $3 \times 3$ linear transformation matrices to $4 \times 4$ homogeneous transformation matrices, by adding $\begin{bmatrix}0 & 0 & 0 & 1\end{bmatrix}$ as the last row and column (that just sets $w'=w$, and keeps the definitions of $x',y',z'$ the same as before, ignoring $w$). The next section shows the homogeneous-coordinate forms of all the major affine transforms.
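
As a small sketch of both ideas in Python (the function names `translation` and `to_homogeneous` are hypothetical, and matrices are plain lists of rows):

```python
def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix (column-vector convention)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def to_homogeneous(m3):
    """Upgrade a 3x3 linear transform to 4x4 by adding [0 0 0 1] as last row and column."""
    m4 = [list(row) + [0.0] for row in m3]
    m4.append([0.0, 0.0, 0.0, 1.0])
    return m4

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

point = [2.0, 3.0, 4.0, 1.0]  # the 3d point (2, 3, 4) with w = 1

moved = mat_vec(translation(10.0, 0.0, -1.0), point)
# moved is [12.0, 3.0, 3.0, 1.0]

double_x = to_homogeneous([[2.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
stretched = mat_vec(double_x, point)
# stretched is [4.0, 3.0, 4.0, 1.0]; w passes through unchanged
```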

Note that, while we initially set $w$ to 1, some kinds of transforms might
change it (affine transforms don't, but the perspective transform does).
Therefore, after a series of transforms, and before the resulting point is
used, we need to divide through by $w'$ to *normalize* it back to
1—we'll mention this again in the section on the perspective transform
and clip space.
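
The normalization itself is tiny. A minimal Python sketch, with a hypothetical function name and an explicit guard against $w' = 0$ (how to handle that case is left open here):

```python
def normalize_homogeneous(v):
    """Divide a homogeneous point through by w so that w becomes 1 again."""
    x, y, z, w = v
    if w == 0:
        raise ValueError("cannot normalize a point with w = 0")
    return [x / w, y / w, z / w, 1.0]

normalized = normalize_homogeneous([2.0, 4.0, 6.0, 2.0])
# normalized is [1.0, 2.0, 3.0, 1.0]
```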

Transformation matrices are usually built up out of several base transforms, which can then be chained together to produce more complex transforms. This section lists the most common ones.

Rotate by $\theta$ around the $x$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos \theta & -\sin \theta & 0 \\ 0 & \sin \theta & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

Rotate by $\theta$ around the $y$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \cos \theta & 0 & \sin \theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \theta & 0 & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

Rotate by $\theta$ around the $z$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta & 0 & 0 \\ \sin \theta & \cos \theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

Translate by an offset $(t_x,t_y,t_z)$: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

Scale by a factor $(s_x,s_y,s_z)$: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

For scaling by a uniform factor $s$, just set $s_x=s_y=s_z=s$. It's common for game engines to support only uniform scaling, because it simplifies some things. For example, if an engine only supports uniform scaling, spheres will always stay spherical, so we can safely assume spherical colliders will continue to work. Note that scaling happens around the origin, so objects not centered at the origin will have their centers move. They'll move towards the origin when scaled by factors less than 1, or away from the origin when scaled by factors greater than 1. To avoid this, either scale the object when it's located at the origin, or perform a translation afterwards to move the object back to where it should be.
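
One way to package the "scale at the origin" idea is to chain three matrices: move the object's center to the origin, scale, and move it back. The Python sketch below does that for uniform scaling; `scale_about` and the other helper names are hypothetical.

```python
def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def scaling(sx, sy, sz):
    return [[sx, 0.0, 0.0, 0.0], [0.0, sy, 0.0, 0.0],
            [0.0, 0.0, sz, 0.0], [0.0, 0.0, 0.0, 1.0]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def scale_about(center, s):
    """Uniform scale by s that keeps `center` fixed (column-vector convention)."""
    cx, cy, cz = center
    to_origin = translation(-cx, -cy, -cz)
    back = translation(cx, cy, cz)
    # Chains towards the left: to_origin is applied first, back last.
    return mat_mul(back, mat_mul(scaling(s, s, s), to_origin))

center = (10.0, 0.0, 0.0)
corner = [12.0, 0.0, 0.0, 1.0]

naive = mat_vec(scaling(0.5, 0.5, 0.5), corner)   # drifts towards the origin
fixed = mat_vec(scale_about(center, 0.5), corner)  # stays near the center
```

With plain scaling the corner ends up at $x=6$, dragged towards the origin. With `scale_about` it ends up at $x=11$, half as far from the center as before.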

Models typically start in *model space*, also called *object space*.
They are usually centered at the origin, as exported from a 3d modeling
program. Each model starts in its own model space, where it sits at the origin.

A scene is made up of a single coordinate space, the *world space*, which
may contain many objects. Objects are inserted into the scene, but if just
inserted directly, they would all end up at the origin, and possibly at strange
sizes (depending on what units were used in the modeling program). An object is
transformed from model space to world space by scaling, translating, and
rotating it, so it reaches the size, location, and orientation it should have
in the world.
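
As a sketch, a model-to-world matrix can be built by chaining the base transforms. The ordering used here (scale first, then rotate, then translate) is a common choice but an assumption of this example, as are the helper names.

```python
import math

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def scaling(s):
    return [[s, 0.0, 0.0, 0.0], [0.0, s, 0.0, 0.0],
            [0.0, 0.0, s, 0.0], [0.0, 0.0, 0.0, 1.0]]

def rotation_y(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s, 0.0], [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0], [0.0, 0.0, 0.0, 1.0]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

# Column-vector convention: the rightmost matrix is applied first.
model_to_world = mat_mul(translation(5.0, 0.0, 0.0),
                         mat_mul(rotation_y(math.radians(90)), scaling(2.0)))

model_vertex = [1.0, 0.0, 0.0, 1.0]
world_vertex = mat_vec(model_to_world, model_vertex)
```

The vertex $(1,0,0)$ is doubled to $(2,0,0)$, rotated to $(0,0,-2)$, and then moved to approximately $(5,0,-2)$ in world space.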

To render a scene, we need to view the world from somewhere, which we think of
as the "camera". The camera is located at some position in the world (in
world coordinates), and has an orientation. There are different ways an engine
can use to specify the orientation. Here, we'll use the *look-at*
approach, which is a common one. In that approach, the camera is oriented by
specifying two things. First, a *look point* (also in world coordinates),
specifies where the camera is pointing. Second, an *up direction*
specifies how to rotate the scene so the correct side is at the top of the
screen. In many games, the up direction is just $(0,1,0)$, with the $y$ axis in
world space being the up direction (exceptions include things like flight
simulators, where which way is up might rotate when the plane banks).

The world as it's seen from the camera's vantage point is the *view
space*. In view space, the camera is located at the origin, and by convention,
looks along the $z$ axis, perpendicular to the $x$–$y$ plane. That way, $x$ and
$y$ coordinates will eventually map onto screen $x$ and $y$ coordinates, while
$z$ will represent depth into the scene (distance from the camera).

To transform objects in our scene from world space into view space, we
apply a series of transformations to *all* elements of the scene. The
elements all start out in world space, and the series of transformations
converts their coordinates into view-space coordinates.

The first transformation is easy. The camera is located at position $\vec{c} = (c_x,c_y,c_z)$ in world space. In view space, we want it to be at the origin, $(0,0,0)$. This simply requires translating by an offset of $-\vec{c} = (-c_x,-c_y,-c_z)$, using the translation matrix given earlier.

Next, we need to point the camera at the look point, and orient it so the correct side is up. To calculate the transformation matrix for this step requires a little vector mathematics.

In world-space coordinates, the camera is at $\vec{c}$, and should be looking
at the look point, $\vec{\ell}$. That means it should be looking in direction
$\vec{d} = \vec{\ell} - \vec{c}$, in world space. Normalizing this to length 1,
$\vec{n} = \frac{\vec{d}}{|\vec{d}|}$, gives us a direction
vector, called the *look direction*, which is what we want to align with
the $z$ axis. It's also called the *view plane normal*, because the
direction we're looking is perpendicular to ("normal to") the desired view
plane.

At this point, we could begin applying rotations: find the $x$-axis and $y$-axis angles between $\vec{n}$ and the $z$ axis, and rotate so that the look direction coincides with the $z$ axis. However, instead we'll fix the rest of the camera's orientation and do the transformation all at once. The idea is to think of the view transformation as an overall change of coordinates: we start out with $x$–$y$–$z$ coordinates, and we want to transform into a new axis system. We already have one of the three axes, with $\vec{n}$ representing the new $z$, so we just need the other two.

How do we find these axes? The other piece of information we have is the
*up* vector. It is not necessarily orthogonal to the look direction, so it is
not one of the axes itself. However, a version of the *up* vector should
represent our new $y$ axis; we just need a version that lies within the view
plane. We can get that in two steps, using cross products, and in the process
get the third axis as well. First, we compute the following vector
(normalized again):
$$\vec{u} = \frac{\vec{up} \times \vec{n}}{|\vec{up} \times \vec{n}|}$$

By the definition of cross products, $\vec{u}$ is orthogonal to
both $\vec{up}$ and $\vec{n}$. Since it's orthogonal to the view plane's
normal, it must be in the view plane. And since it's orthogonal to the up
direction, it must point to the *side* of the view plane. Therefore, it's
the new $x$ axis. Finally, from this we can get the new $y$ axis by taking
another cross product, $\vec{v} = \vec{n} \times \vec{u}$. Since this is
orthogonal to both the view plane's normal, and to a vector pointing to the
side of the view plane, it must point up in the view plane.
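
Here is a minimal Python sketch of the two cross products. The function names are hypothetical. Note that it breaks down if the up vector is parallel to the look direction, since the first cross product is then zero and can't be normalized.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def normalize(a):
    length = math.sqrt(dot(a, a))  # zero length would raise ZeroDivisionError
    return [c / length for c in a]

def camera_axes(camera, look_point, up):
    """Return the (u, v, n) axes as defined in the text."""
    n = normalize(sub(look_point, camera))  # look direction
    u = normalize(cross(up, n))             # sideways, within the view plane
    v = cross(n, u)                         # up, within the view plane
    return u, v, n

# A camera at the origin, tilted to look diagonally upwards.
u, v, n = camera_axes([0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 0.0])
```

The three vectors come out mutually orthogonal with length 1, even though the original up vector $(0,1,0)$ was not orthogonal to the look direction.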

Now we have a new three-axis coordinate system, defined by the $\vec{u}$–$\vec{v}$–$\vec{n}$ direction vectors. To transform the scene into view space, we need to transform the objects from world-space coordinates to this new coordinate system. This could be done by a series of rotations, but it can also be done by constructing a single change-of-coordinate matrix from these three direction vectors which represent the new axes: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

After multiplying an object by this transformation matrix, its coordinates are now in view space.
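
Putting the translation and the change of coordinates together gives the complete world-to-view transform. The following Python sketch (with hypothetical function names) builds it and checks two easy cases: the camera itself should land on the origin, and the look point should land on the $z$ axis.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    length = math.sqrt(sum(c * c for c in a))
    return [c / length for c in a]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def view_matrix(camera, look_point, up):
    # Step 1: translate by -c so the camera sits at the origin.
    location = [[1.0, 0.0, 0.0, -camera[0]],
                [0.0, 1.0, 0.0, -camera[1]],
                [0.0, 0.0, 1.0, -camera[2]],
                [0.0, 0.0, 0.0, 1.0]]
    # Step 2: change coordinates to the u-v-n axes (one axis per row).
    n = normalize(sub(look_point, camera))
    u = normalize(cross(up, n))
    v = cross(n, u)
    look = [u + [0.0], v + [0.0], n + [0.0], [0.0, 0.0, 0.0, 1.0]]
    return mat_mul(look, location)  # translation is applied first

camera = [3.0, 0.0, 0.0]
look_point = [0.0, 0.0, 0.0]
view = view_matrix(camera, look_point, [0.0, 1.0, 0.0])

camera_in_view = mat_vec(view, camera + [1.0])          # the origin
look_point_in_view = mat_vec(view, look_point + [1.0])  # (0, 0, 3): on the z axis
```

With this matrix, the look point ends up at a $z$ value equal to its distance from the camera.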

There is a final transform, a projection into *normalized device
coordinates* (NDC). In NDC, all coordinates within the viewing space are mapped to
a cube, spanning from $-1$ to $1$ along each axis. This requires transforming
the *viewing
frustum*, which is a chopped-pyramid shape defining the 3d region of
space that the camera sees, into a cube.

We'll restrict ourselves to the common case of a symmetric viewing frustum.
When symmetric, it is defined by four values: the *near* and *far*
distances specify the near and far planes, between which objects are viewable.
Then, the *width* and *height* specify the dimensions of the near
viewing plane.

How do we choose those? The *near* and *far* are freely chosen based
on what kind of depths we want to be viewable. They are given as negative
numbers, because in a right-handed coordinate system (the default used by most
modeling software and engines), the $z$ axis going into the screen is
negative. (Note that some APIs hide this step. For example, you give
OpenGL positive values for the near and far planes, and it internally negates
them.) The *width* and *height* control the aspect ratio and the
field of view. We can also go the other way around, and calculate *width*
and *height* from a desired aspect ratio, and a desired field of view. For
example, it's common to use a horizontal field of view of 75 or 90 degrees, and
an aspect ratio equal to the aspect ratio of your screen. If you have a desired
horizontal field of view of $fov$ and width-to-height aspect ratio of $r$, then by basic geometry,
\begin{align*}
width &= -2 \cdot near \cdot \tan(\frac{fov}{2})\\
height &= \frac{width}{r}
\end{align*}
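
In code, that calculation might look like the following Python sketch. The function name is hypothetical, the field of view is taken in degrees, and *near* is passed as a negative number, as described above.

```python
import math

def near_plane_size(near, fov_degrees, aspect):
    """Width and height of the near plane from a horizontal field of view.

    `near` is negative, following the convention in the text;
    `aspect` is the width-to-height ratio.
    """
    width = -2.0 * near * math.tan(math.radians(fov_degrees) / 2.0)
    height = width / aspect
    return width, height

# A 90-degree field of view at near = -1 gives a near plane about 2 units wide.
width, height = near_plane_size(-1.0, 90.0, 2.0)
```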

Now, we have *near*, *far*, *width*, and *height*. Given
those four values, we need to do four things, and an optional but common fifth.
They are: 1) scale the $x$ axis to $[-1,1]$ according to the width and distance
of the near plane; 2) scale the $y$ axis to $[-1,1]$ according to the height
and distance of the near plane; 3) center the $z$ axis between the near and far
planes and scale it to $[-1,1]$; and finally 4) apply perspective scaling so
objects further from the viewer look smaller. In addition, we usually reflect
the scene so that positive $z$ goes into the screen: this converts it to a
*left-handed* coordinate system where $z$ depths are positive, which is
more convenient for rendering.

We won't go over the mathematical derivation of the projection matrix that performs these transformations (the references at the end of this document do go over it), but the end result is: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \frac{2 \cdot near}{width} & 0 & 0 & 0\\ 0 & \frac{2 \cdot near}{height} & 0 & 0 \\ 0 & 0 & \frac{-(far+near)}{far-near} & \frac{-2 \cdot far \cdot near}{far-near} \\ 0 & 0 & -1 & 0 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}

Note that the perspective transform, unlike the previous transforms we discussed, changes $w$, so the new $w'$ will not be 1 anymore. This requires normalizing, by dividing through by $w'$.
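
As a rough sketch of this step, the Python below builds the matrix exactly as written above and then divides through by $w'$. The function names are hypothetical. It assumes *near* and *far* are passed as the signed (negative) values described earlier. Working the numbers with those values, a view-space point with $z = |near|$ lands on NDC $z=-1$ and one with $z = |far|$ lands on $z=+1$, with $w'$ negative before the divide. That is an observation about this particular sketch, so check the signs against your own conventions.

```python
def perspective_matrix(near, far, width, height):
    """Projection matrix as given in the text (near and far are signed values)."""
    return [[2.0 * near / width, 0.0, 0.0, 0.0],
            [0.0, 2.0 * near / height, 0.0, 0.0],
            [0.0, 0.0, -(far + near) / (far - near),
             -2.0 * far * near / (far - near)],
            [0.0, 0.0, -1.0, 0.0]]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def project(matrix, view_point):
    """Apply the projection, then normalize by dividing through by w'."""
    x, y, z, w = mat_vec(matrix, view_point)
    return [x / w, y / w, z / w]

# Assumed example values: near = -1, far = -100, a 2 x 2 near plane.
projection = perspective_matrix(-1.0, -100.0, 2.0, 2.0)

clip = mat_vec(projection, [1.0, 0.0, 1.0, 1.0])  # w' is -1 here, not 1
on_near_plane = project(projection, [1.0, 0.0, 1.0, 1.0])
on_far_plane = project(projection, [0.0, 0.0, 100.0, 1.0])
```

The first point sits on the edge of the near plane and normalizes to roughly $(1, 0, -1)$. The second sits on the far plane and normalizes to roughly $(0, 0, 1)$.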

An optional optimization: we can actually remove points
that are outside of the viewable region (called frustum culling, or clipping)
before normalizing. The space of un-normalized homogeneous coordinates that we
get after multiplying by the projection matrix, but before normalizing, is
called the *clip space*. A point should be culled if its clip space
coordinate, on any axis, is outside the range $[-w',w']$. This saves us from
doing potentially many wasted division operations, from normalizing points that
are just going to be culled anyway. The reason why we can do this is simple. In
normalized device coordinates, the viewable space has been mapped to the cube
that spans $[-1,1]$ on each axis. Therefore, before the normalization by $w'$,
this is equivalent to a cube spanning $[-w',w']$ on each axis. So frustum
culling becomes a simple comparison with $w'$. Then we normalize as usual, but
only the points that weren't culled.
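
The comparison can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes $w'$ is positive for points in front of the camera. If your conventions produce a negative $w'$ there, negate all four components first, which represents the same homogeneous point.

```python
def inside_clip_volume(clip_point):
    """True if x', y' and z' all lie within [-w', w'] (assumes w' > 0 when visible)."""
    x, y, z, w = clip_point
    return w > 0 and all(-w <= c <= w for c in (x, y, z))

def cull_and_normalize(clip_points):
    """Drop culled points, and divide through by w' only for the survivors."""
    visible = []
    for point in clip_points:
        if inside_clip_volume(point):
            x, y, z, w = point
            visible.append((x / w, y / w, z / w))
    return visible

clip_points = [(1.0, 1.0, 1.0, 2.0),   # inside: every coordinate within [-2, 2]
               (5.0, 0.0, 0.0, 2.0)]   # outside: x' exceeds w'
visible = cull_and_normalize(clip_points)
# visible is [(0.5, 0.5, 0.5)]
```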

Finally, we have normalized device coordinates, which are almost the final result. All that's left is to draw them to the screen. The $x$ and $y$ axes are mapped to screen pixels, according to the width and height of the screen. On the $x$ axis, $-1$ maps to the left edge and $+1$ to the right edge. On the $y$ axis, $+1$ maps to the top edge and $-1$ to the bottom edge, since screen $y$ coordinates conventionally grow downward from the top-left corner (the pseudocode at the end of this document flips $y$ for this reason). To avoid distortion, the screen should have the same aspect ratio that was used to construct the projection matrix. The $z$ axis provides the depth values indicating how far from the camera each point is. That would be needed if we were going beyond a wireframe renderer and drawing opaque surfaces, where it's necessary to know which surfaces are behind others.
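
A minimal Python sketch of the screen mapping, following the mapping used in the pseudocode at the end of this document, which flips $y$ so that $+1$ lands on the top row of the screen. The function name is hypothetical.

```python
def ndc_to_screen(ndc_x, ndc_y, screen_width, screen_height):
    """Map NDC x, y in [-1, 1] to pixel coordinates with (0, 0) at the top-left."""
    screen_x = ndc_x * (screen_width / 2.0) + (screen_width / 2.0)
    screen_y = -ndc_y * (screen_height / 2.0) + (screen_height / 2.0)
    return (screen_x, screen_y)

center = ndc_to_screen(0.0, 0.0, 640, 480)         # (320.0, 240.0)
top_left = ndc_to_screen(-1.0, 1.0, 640, 480)      # (0.0, 0.0)
bottom_right = ndc_to_screen(1.0, -1.0, 640, 480)  # (640.0, 480.0)
```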

Tying it all together, to render a scene, we need four things: a list of triangles to render (in 3d world coordinates), the location of the camera in the world, a point that the camera is looking at, and the up direction of the camera.

We need to transform each triangle's 3d points to 2d screen coordinates, in order to draw them on screen (or to an image). The transformation from 3d coordinates to 2d points is carried out via the view and perspective transforms discussed above.

We can build the overall rendering transform out of three intermediate transformation matrices:

`cameraLocationTransform`

: Sets the camera at the origin. This is just a translation matrix.

`cameraLookTransform`

: Points the camera towards the *look point*, with the correct *up* direction. This one requires some vector math to build the transformation matrix, as explained in the "View space" section above.

`perspectiveTransform`

: Adds perspective to the view and converts to normalized device coordinates. This transformation matrix is the one shown in the "Perspective projection" section above.

Once those matrices are computed, we can do wireframe rendering with this pseudocode:

    foreach triangle in triangles {
        foreach vertex in triangle.vertices {
            // apply the view and perspective transforms
            vertexViewSpace = perspectiveTransform * cameraLookTransform * cameraLocationTransform * vertex

            // normalize by dividing through by the homogeneous coordinate w
            vertexViewSpace.x = vertexViewSpace.x / vertexViewSpace.w
            vertexViewSpace.y = vertexViewSpace.y / vertexViewSpace.w
            vertexViewSpace.z = vertexViewSpace.z / vertexViewSpace.w

            if any of .x/.y/.z are outside the range [-1,1]
                skip to next vertex

            // now map [-1,1] into the screen coordinates (0,width) and (0,height)
            // where (0,0) is the top-left corner of the screen
            screenCoordinate.x = vertexViewSpace.x * (width/2.0) + (width/2.0)
            screenCoordinate.y = -vertexViewSpace.y * (height/2.0) + (height/2.0)
            drawPoint(screenCoordinate)
        }
        draw three lines connecting the three vertices' 2d coordinates
    }

And that, which turns out not to be all that much code once you understand it, is a fully self-contained, wireframe 3d renderer.
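
For reference, here is one way the pseudocode might be turned into runnable Python. It is a sketch under several assumptions rather than a definitive implementation. All function names are hypothetical. It returns 2d line segments instead of drawing them. It skips a whole triangle when any vertex falls outside the viewable cube, a simplification that avoids drawing lines to missing vertices. It also passes *near* and *far* as negative numbers, as in the projection section.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    length = math.sqrt(sum(c * c for c in a))
    return [c / length for c in a]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def build_transform(camera, look_point, up, near, far, fov_degrees, aspect):
    """perspectiveTransform * cameraLookTransform * cameraLocationTransform."""
    location = [[1.0, 0.0, 0.0, -camera[0]], [0.0, 1.0, 0.0, -camera[1]],
                [0.0, 0.0, 1.0, -camera[2]], [0.0, 0.0, 0.0, 1.0]]
    n = normalize(sub(look_point, camera))
    u = normalize(cross(up, n))
    v = cross(n, u)
    look = [u + [0.0], v + [0.0], n + [0.0], [0.0, 0.0, 0.0, 1.0]]
    width = -2.0 * near * math.tan(math.radians(fov_degrees) / 2.0)
    height = width / aspect
    perspective = [[2.0 * near / width, 0.0, 0.0, 0.0],
                   [0.0, 2.0 * near / height, 0.0, 0.0],
                   [0.0, 0.0, -(far + near) / (far - near),
                    -2.0 * far * near / (far - near)],
                   [0.0, 0.0, -1.0, 0.0]]
    return mat_mul(perspective, mat_mul(look, location))

def render_wireframe(triangles, transform, screen_width, screen_height):
    """Return 2d line segments ((x0, y0), (x1, y1)) for the visible triangles."""
    segments = []
    for triangle in triangles:
        screen_points = []
        for vertex in triangle:
            x, y, z, w = mat_vec(transform, list(vertex) + [1.0])
            if w == 0:
                break
            ndc = (x / w, y / w, z / w)  # normalize by w
            if any(c < -1.0 or c > 1.0 for c in ndc):
                break  # simplification: drop the whole triangle
            screen_points.append(
                (ndc[0] * (screen_width / 2.0) + (screen_width / 2.0),
                 -ndc[1] * (screen_height / 2.0) + (screen_height / 2.0)))
        if len(screen_points) == 3:
            a, b, c = screen_points
            segments += [(a, b), (b, c), (c, a)]
    return segments

transform = build_transform(camera=[0.0, 0.0, -5.0], look_point=[0.0, 0.0, 0.0],
                            up=[0.0, 1.0, 0.0], near=-1.0, far=-100.0,
                            fov_degrees=90.0, aspect=1.0)
triangles = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
segments = render_wireframe(triangles, transform, 200, 200)
```

For this example scene, the triangle's first vertex lands at the center of the 200 by 200 screen, and the three returned segments can be passed to whatever line-drawing routine you have available.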

This article is, by design, a bit of a whirlwind tour of the entire process. If you'd like a more detailed exposition, including mathematical derivations of each step, there are a number of good sources:

*Game Engine
Architecture* by Jason Gregory (AK Peters, 2009), Chapter 3, "3d Math
for Games", provides an overview of matrix mathematics and an introduction to
transformation matrices, homogeneous coordinates, and the affine transforms.

*Interactive
Computer Graphics: A Top-Down Approach with Shader-Based OpenGL* by
Edward Angel and Dave Shreiner (Addison-Wesley, 2011), Chapter 3, "Geometric
Objects and Transformations", also provides an overview of matrix mathematics
and geometric transforms, with a somewhat more mathematics-heavy approach
grounded in linear algebra.

In the same Angel & Shreiner book, Chapter 4, "Viewing", provides an extensive introduction to cameras, viewing, perspective transforms, and related topics.

Song Ho Ahn's OpenGL tutorials provide a description of what goes on inside OpenGL's implementation of 3d rendering. The section on "Transformation", in particular, traces the series of transformations from object coordinates through to screen coordinates.