In recent decades the performance of GPUs (in graphics cards) has increased immensely compared to CPUs, with the result that significant speed-ups can now be obtained by moving onto the GPU processing that would normally be done on the CPU. I’ve been working on implementing general purpose GPU programming in DBPro to greatly improve the performance of calculations, and I have succeeded. It wasn’t that hard. This tutorial will introduce the concept using simplified perlin noise texture generation, with a speed boost of >20x!
About the GPU
Before we get to the nitty gritty I guess we should say a word or two about the GPU and its capabilities, since not all processing will run better on the GPU than the CPU. The key to understanding the GPU is that, not surprisingly, it is designed for processing graphics, so vertexes, faces and pixels; however, since this involves a lot of vector and matrix mathematics it is also adept at really hard sums. This maths geekiness is hardwired into the GPU, since its registers are designed with four-dimensional vectors in mind (r,g,b,a). Processes involving heavy vector or matrix operations will work considerably faster on the GPU than on the CPU.
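As a toy illustration of what this means in shader code (this snippet is just a sketch of the idea, not part of the perlin shader later in the tutorial), a single HLSL statement works on a whole four-component register at once:
//Each of these lines is a single operation on all four components at once
float4 a = float4(1.0, 2.0, 3.0, 4.0);
float4 b = float4(0.5, 0.5, 0.5, 0.5);
float4 c = a * b + b;     //component-wise multiply and add
float  d = dot(a, b);     //four multiplies and three adds in one instruction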
GPU performance is also boosted by parallel processing, which means the same program runs on several processing units at one time. Graphics cards have more parallel processors than CPUs: a GeForce 6, for example, has six processors working in parallel on vertex data and sixteen processors working on pixels. We can, however, get these processors to do other stuff if we pretend that stuff is images. We can then get the GPU to do number crunching for us super fast.
What GPUs are not good at is communicating. One consequence of parallel processing is that there are significant restrictions on where and when data can be read or written in GPU programs. This makes GPU programming somewhat fiddly, since everything has its place in GPU code.
GPU Programs
We’re all familiar with GPU programs: they’re called shaders. This tutorial will not cover shader programming in too much detail; there are many tutorials out there on the subject. To start using the GPU to speed up DBPro you’ll need to know how to code in High Level Shader Language (HLSL).
We do need to know a few things about shader programming before we start. Perhaps the most important is that there are two main parts to a shader: (a) the vertex shader, and (b) the pixel shader. These are just two small pieces of code which look like functions, but are really more like separate programs.
The vertex shader performs operations on vertexes, the most important being the position and normals of each vertex, both in the world and relative to the camera (i.e. their projection). There is much cool stuff you can do in the vertex shader to manipulate model meshes; however, for most GP-GPU work the vertex shader will simply pass our data on to the pixel shader, which does most of the work. The vertex shader runs before the pixel shader, once for every vertex in a mesh, and each instance runs in parallel on one of the vertex processors. Importantly for us, vertex shaders cannot read texture data (or can only do so on the most modern graphics cards, and even then they are slow at it).
The pixel shader performs operations on screen pixels. One instance of the pixel shader runs for every pixel, in parallel on the pixel processors. Instances of the pixel shader are effectively generated by the vertex shader, since it creates faces from vertexes which are then turned into pixels by the rasterizer. You can think of this process like spawning: the vertex shader makes projections of faces that spawn an instance of the pixel shader for every pixel that should be in the triangular face. The pixel shader then calculates the colour of that one pixel and passes it out to graphics memory for display. The key to understanding the pixel shader is that it has no control over where its output is written; that has already been determined by the vertex shader and the rasterizer.
Finally we should mention the last part of every shader, the technique. This piece of code specifies the conditions under which the vertex and pixel shaders operate (e.g. back face culling, whether to enable z write), and which vertex and pixel shaders are run (we can have several in a single shader). The technique also allows us a rudimentary form of program control, using passes to run data through several vertex and pixel shaders sequentially. Texture data can also be transferred from one pass to another using a RENDERCOLORTARGET.
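To give a rough idea of the shape of a multi-pass technique (the shader names here are invented for illustration, and the annotations that bind the render target to a pass are omitted, so treat this as a sketch rather than a drop-in example):
//Intermediate texture: pass 0 renders into it, pass 1 reads it back
texture firstPassResult : RENDERCOLORTARGET;
technique TwoPassExample
{
    pass Pass0   //writes its result into firstPassResult rather than the screen
    {
        VertexShader = compile vs_2_0 FirstVS();
        PixelShader  = compile ps_2_0 FirstPS();
    }
    pass Pass1   //samples firstPassResult and produces the final image
    {
        VertexShader = compile vs_2_0 SecondVS();
        PixelShader  = compile ps_2_0 SecondPS();
    }
}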
Programming the GPU: In a nutshell
Although the GPU is designed and set up for processing graphics, we can trick it into doing all sorts of useful stuff. All we need to do is present it with data in the form of a texture, write pixel shader code that does the really hard sums we want done, then let it pass the results back to us as an output texture. Since a texture is pretty much an array of data, what we are doing is passing it a structured list of numbers and getting back a resulting structured list of numbers, which we can then read and use however we like.
When tricking the GPU into painting our fence for us (i.e. the Tom Sawyer method of coding) we have to remember the limitations of the gullible boy doing the painting. The data we pass in will be in a texture; the larger the texture, the slower it is passed to the graphics card and back. The texture also has limited depth, so each channel of a pixel stores a number from 0 to 255. We can, however, split our data up into the 4 channels (red, green, blue and alpha), which means we can pass a much larger range of data in a texture if we stop thinking of it as a picture and start thinking of it as a limited array. The same, of course, is true of the data when it comes out of the shader.
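As a sketch of the "limited array" idea (not something the perlin example below needs), here is how a pixel shader might rebuild a 16-bit value that had been packed into two 8-bit channels on the DBPro side; it reuses the inputTextureSample sampler declared in the shader later in the tutorial:
//Each channel arrives in the shader as a float in the range 0-1
float4 UnpackPS(float2 coords : TEXCOORD0) : COLOR
{
    float4 texel = tex2D(inputTextureSample, coords);
    float value = texel.r*255.0 + texel.g*255.0*256.0;   //0-65535 spread across two channels
    return value / 65535.0;   //scale back to 0-1 so it can be written out as a colour
}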
As well as read/write issues, GPUs have other limitations resulting from their parallel nature, of which program flow is perhaps the most different from CPU programming. Each pixel shader can be thought of as the inner code of a loop that runs over all pixels in the projected image (since the code is run for every pixel). Adding program flow commands to pixel shaders, however, causes significant drops in performance, since the parallel processors must process commands in unison, and so logic statements such as IF THEN are problematic. Loops are also a problem in pixel shaders unless their length is fixed at compile time; using a global to define the total number of loop iterations will, for example, prevent a shader from compiling.
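For example, a loop whose count is a literal is fine because the compiler can unroll it, whereas one driven by a global usually won't compile under ps_2_0 (again a sketch, reusing the inputTextureSample sampler from the perlin shader below):
float4 FixedLoopPS(float2 coords : TEXCOORD0) : COLOR
{
    //The loop count is a literal, so the ps_2_0 compiler simply unrolls it
    float4 sum = 0;
    for (int i = 0; i < 4; i++)
    {
        sum += tex2D(inputTextureSample, coords / pow(2, i)) / 4;
    }
    //Replacing the 4 with a global such as numLoops set from DBPro will
    //typically stop the shader compiling for ps_2_0.
    return sum;
}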
Another awkward limitation is the restricted number of instructions (lines of code) that can be included in older shader models, which are the most widely compatible, although the use of functions allows additional instructions to be squeezed in.
To make the GPU do work for us in DBPro, all we need is a way to pass texture data to a shader and get texture data back from it.
You can read much more about general purpose GPU programming and concepts in this:
http://http.developer.nvidia.com/GPUGems2/gpugems2_frontmatter.html
GPU Example: Generating Perlin Noise
Perlin noise is commonly used in generating terrains, clouds and random textures, but is computationally expensive to generate in DBPro. We can, for example, use memblocks to construct, scale and smooth the various layers of noise we need to combine to generate a perlin noise texture. This method is, however, very slow for textures larger than 256x256 pixels and not appropriate for generating perlin noise in realtime. In this example we’ll generate some simplified perlin noise using a shader, grab the texture generated and save it as a file. Since this is really only a demonstration of the concept of GP-GPU, the algorithm I use in the shader to generate the noise is rather simple: it suffers from linear smoothing artifacts, it is not seamless, and it has a fixed number of octaves. You’ll find more complex shaders out there that will do a better job of generating perlin noise. The great thing about this technique is that you can just swap the guts of the shader over if you want it to do something else.
Here is the DBPro function that I’ve used to generate perlin noise. What this function does is generate a random noise image using memblock commands and load this into an image. This noise image will be used as the input texture to the perlin noise shader, by creating a plane object and applying the texture to it with the texture object command. We then load the shader using load effect and apply it to the object we’ve just created with set object effect. Grabbing the output of the shader is a little more tricky. The simplest solution is to use the camera to view the final texture on the object and copy this to a bitmap. To do this we must set the aspect, distance and range of the camera so that it captures the image without distortion (i.e. using a small-angle, near-isometric camera) and captures every pixel of the image; with a 2 degree field of view the half-angle is 1 degree, so our unit-sized plane exactly fills the view at a distance of 0.5/tan(1°) ≈ 28.6 units. We can then grab an image from the bitmap using get image. You’d probably want to reset the camera aspect and range back to their original values after you’ve done this…but I haven’t bothered.
Function makePerlinWithGPU(seed, startingRes)
Randomize Seed
Local color as Dword
dt=timer()
//Generate a single noise image at the highest resolution.
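//memperlin is assumed to be a global or #constant image/memblock ID defined elsewhere in the program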
memblockID = memperlin
if bitmap exist(1) then delete bitmap 1
create bitmap 1, startingres, startingres
set current bitmap 0
if memblock exist(memblockID) then delete memblock memblockID
make memblock from bitmap memblockID, 1
for x = 0 to startingres-1
    for y = 0 to startingres-1
        r=rnd(255)
        color = rgb(r,r,r)
        Write Memblock Dword memblockID, 12+(x + y*startingres)*4, color
    next
next
make image from memblock memperlin, memblockID
delete memblock memblockID
//Make an object and apply our shader
if object exist(10) then delete object 10
make object plane 10, 1, 1
texture object 10, memperlin
convert object fvf 10, 530 ` XYZ+NORMAL+TEX2
if effect exist(2) then delete effect 2
load effect "shaders\perlinnoise.fx", 2, 0
set object effect 10, 2
//Set up isometric square camera and calculate view distance and range to return full image
dist# = 1.0/(2.0*tan(1))
position camera 0, 0, dist#
set camera range dist#*0.75, dist#*1.25
set camera aspect 1.0
set camera fov 2
point camera 0, 0, 0
create bitmap 2, startingres, startingres
sync
get image memperlin+1, 0, 0, startingres, startingres, 3
set current bitmap 0
endfunction
The shader perlinnoise.fx is shown below. There are a couple of points to notice. The input texture is grabbed from the model. All the vertex shader does is project the model to the current view, generating one pixel shader instance per pixel of our input texture, which ensures the pixel shader code processes every pixel of the input and produces one pixel of output in the same position. The pixel shader is very simple: it takes the input texture, samples it at a series of scales and combines the samples with weights. The smoothing necessary to generate perlin noise is achieved by the interpolation of the tex2D function and is bilinear. This is about the simplest implementation of perlin noise; it generates a texture that has some of the artifacts caused by bilinear smoothing and isn’t seamless, since we aren’t using repeated boundary conditions when smoothing the image. However, the beauty of the technique is that the shader is independent of the DBPro code, so we can easily swap in a different shader.
//Generates Simplified Perlin Noise
float4x4 WorldViewProj : WorldViewProjection;
int numOctaves=6;
texture inputTexture
<
    string ResourceName = "";
>;

sampler2D inputTextureSample = sampler_state {
    Texture = <inputTexture>;
    MinFilter = Linear;
    MagFilter = Linear;
    MipFilter = Linear;
    AddressU = Wrap;
    AddressV = Wrap;
};

struct inputData
{
    float4 Position : POSITION;
    float2 TextureCoords : TEXCOORD0;
};

struct outputData
{
    float4 Position : POSITION;
    float2 TextureCoords : TEXCOORD0;
};

//Vertex Shader passes on data to pixel shader including the texture coordinates of the texture
outputData PerlinVS(inputData IN)
{
    outputData OUT;
    float4 pos = mul( IN.Position, WorldViewProj );
    OUT.Position = pos;
    OUT.TextureCoords = IN.TextureCoords;
    return OUT;
}

//Pixel Shader combines weighted and scaled interpolated (smoothed) versions of texture
float4 PerlinPS(outputData IN): COLOR
{
    float4 perlin = tex2D(inputTextureSample , IN.TextureCoords/pow(2,numOctaves-1))/pow(2,numOctaves-5);
    perlin += tex2D(inputTextureSample , IN.TextureCoords/pow(2,numOctaves-2))/pow(2,numOctaves-4);
    perlin += tex2D(inputTextureSample , IN.TextureCoords/pow(2,numOctaves-3))/pow(2,numOctaves-3);
    perlin += tex2D(inputTextureSample , IN.TextureCoords/pow(2,numOctaves-4))/pow(2,numOctaves-2);
    perlin += tex2D(inputTextureSample , IN.TextureCoords/pow(2,numOctaves-5))/pow(2,numOctaves-1);
    perlin += tex2D(inputTextureSample , IN.TextureCoords)/pow(2,numOctaves);
    return perlin;
}

technique PerlinNoise
{
    pass Pass0
    {
        VertexShader = compile vs_2_0 PerlinVS();
        PixelShader = compile ps_2_0 PerlinPS();
    }
}
Performance
Okay...proof of pudding time...the graph below shows the relative performance of the memblock method (CPU) compared to the shader method (GPU) in generating perlin noise. You can see that at small sizes (64x64) the GPU method is twice the speed, but at large sizes this factor rises to around 20x faster. Ultimately there is a plateau at around 17x faster; however, much of this plateau is due to the speed of the DarkBASIC commands used to set up the noise function and create bitmaps. The GPU speed enhancement in actually doing the calculations is much larger than 20x at large sizes.
In terms of realtime generation of perlin noise, the memblock method in this test took ~100 millisecs to generate a 128x128 perlin noise texture, which is probably the limit for realtime generation. The GPU method can generate a 512x512 texture in the same time. Actually, most of the overhead here is still the memblock commands used to generate the random noise input image; if that were precomputed then even larger images could be generated in realtime.
For precomputation of textures at startup or during level loading the GPU can process larger textures: in this test a 1024x1024 texture took 450 millisecs to process, whilst a 2048x2048 texture took 2011 millisecs. Of course, the exact boost will depend on the system; these tests were run on a system with an i7 quad core CPU and an Nvidia GTX 675M graphics card.
I hope you agree, this is quite a mouthwatering pudding.
Other Applications
A huge range of image processing tasks would benefit from utilisation of the GPU, such as blur, sharpening, Gaussian smoothing etc., and would be significantly faster than even the built-in CPU techniques. GPU processing is not, however, restricted to graphics: any data can be packed into an image and processed by the graphics card, for example modelling of physics. The limitations of graphics cards do mean there are certain tasks that GPUs don’t do well, such as those involving many logical statements or requiring random read/write (e.g. AI pathfinding).
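For instance, a simple blur is just a different pixel shader dropped into the same DBPro framework. A minimal sketch of a 3x3 box blur (it reuses the outputData struct and inputTextureSample sampler from the perlin shader above; texelSize is assumed to be one over the texture width, either hard-coded or passed in from DBPro):
float texelSize = 1.0/512.0;   //assumes a 512x512 input texture

float4 BoxBlurPS(outputData IN) : COLOR
{
    //Average the 3x3 neighbourhood around the current pixel
    float4 sum = 0;
    for (int x = -1; x <= 1; x++)
    {
        for (int y = -1; y <= 1; y++)
        {
            sum += tex2D(inputTextureSample, IN.TextureCoords + float2(x, y)*texelSize);
        }
    }
    return sum / 9.0;
}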
Future Developments?
You’ll have noticed that the method used to output data from a shader into an image so it can be used by the CPU (i.e. your DBPro code) is fiddly and causes a major slow down. Camera effect shaders allow a texture to be passed back to an image; however, the input would have to come from the camera. It should be possible to make a new shader interface that simply inputs an image and outputs an image…say “image effect”…if The Game Creators could rewrite one of their DLLs to do that it could speed up GPU programs considerably.
Products
Forester Pro (tree and plant creator) - http://www.hptware.co.uk/forester.php
Medusa Pro (rock, cliff, cave creator) - http://www.hptware.co.uk/medusa.php
Mr Normal (normal map generator) - http://www.hptware.co.uk/mrnormal.php