Screen culling - what does it mean?

Author

Message

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 26th Mar 2010 11:16

Link

Well, that's my whole question.

Thanks

Frans

Back to top

Profile PM

Masqutti

15

Years of Service

User Offline

Joined: 8th Jan 2010

Location: insanity

Posted: 26th Mar 2010 15:25 Edited at: 26th Mar 2010 15:26

Link

If it's frustum culling, that means the graphics engine doesn't draw the polygons outside the cameraview that you see on your screen. The shape of camera viewport is kinda like frustum that's why the name. I bet this is the same thing ?

This increases the performance greatly..

DBpro does this automatically.

hmmmh.. that didn't compile

Back to top

Profile PM

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 26th Mar 2010 15:31

Link

Masqutti thanks.

Back to top

Profile PM

Masqutti

15

Years of Service

User Offline

Joined: 8th Jan 2010

Location: insanity

Posted: 26th Mar 2010 15:34

Link

np

hmmmh.. that didn't compile

Back to top

Profile PM

James H

18

Years of Service

User Offline

Joined: 21st Apr 2007

Location: St Helens

Posted: 26th Mar 2010 17:28

Link

Quote: "DBpro does this automatically."

Not exactly - the static world is I think. For standard dbp objects though only objects beyond the draw distance set by camera range are but anything within that radius is not. This is what the object inscreen() command is for.

Back to top

Profile PM

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 26th Mar 2010 17:32

Link

I'll keep it in mind, thanks.

Back to top

Profile PM

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 26th Mar 2010 18:43

Link

It might help to know in what context the words "screen culling" is mentioned.

Final culling is done by DX or the graphics card. It still checks all the object's vertices to see if it needs to render it. The fastest render occurs when you exclude anything that you know won't be rendered and only pass those that "may" be rendered to the card. I'm not 100% sure on all the steps taken, but there are multiple ways and multiple checks to cull unrendered objects/vertices. I don't know if DBP has any built-in automatic culling, but I know there is a command for taking an object out of the render proceedure (I think it's "exclude object on"). This method is, of course, the best and fastest run-time of all the methods since it's the first in the line of culling methods (this means that none of the others even have to care about testing it).

All that rant was for explaining to the best of my knowlege what culling means for 3D objects. Screen culling could also mean that nothing outside the "view box" matters.

The fastest code is the code never written.

Back to top

Profile PM Email Website

IanM

Retired Moderator

22

Years of Service

User Offline

Joined: 11th Sep 2002

Location: In my moon base

Posted: 26th Mar 2010 21:29

Link

Quote: "For standard dbp objects though only objects beyond the draw distance set by camera range are but anything within that radius is not"

Not true.

DBPro uses the calculated centre of the object, and the radius of the object from this point, and checks that this sphere is on the inside of the box defined by the planes that represent the camera frustum.

In some circumstances though, DBPro does not cull (and I'll probably forget one or two here):
- Locked objects
- Glued objects
- All objects when reflection shading is active
- When an object casts a shadow
- For physics objects
- For objects with negative radius (SET OBJECT RADIUS) or a very large radius

Utility plug-ins collection (updated 02 Mar 2010) and
http://www.matrix1.demon.co.uk

Back to top

Profile PM Email Website

James H

18

Years of Service

User Offline

Joined: 21st Apr 2007

Location: St Helens

Posted: 27th Mar 2010 04:04

Link

Hmmm, thanks Ian, I must apologise for any misguidance I have envoked. However I have utilised the object inscreen command and excluded/hidden anything outside of the the frustrum and get much higher frame rates - yet set object radius -1 does produce a significant difference in framerate even with the object inscreen exclusion/hidden checks in place. I don`t suppose you could give a fuller explanation of what is going on here? Perhaps in code if possible. I really don`t understand why auto detection and action does not out perform manual detection and action. Obviously I have missed something here. Many thanks.

Back to top

Profile PM

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 04:17 Edited at: 27th Mar 2010 04:18

Link

Quote: "I really don`t understand why auto detection and action does not out perform manual detection and action."

My mantra: The fastest code is the code never written
This works just as well in your case: The fastest scene has no objects to render.

That means if the program doesn't process the object, it will run faster. When you manually "exclude" an object from the render loop, all the other "culling" computations done by other parts of DBP, DirectX, and your video card don't even happen. Therefore faster! Moreover, your camera doesn't move location or orientation very fast in most games, so when you tell DBP not to include an object in the loop, all those frames-per-second's add up.

The fastest code is the code never written.

Back to top

Profile PM Email Website

James H

18

Years of Service

User Offline

Joined: 21st Apr 2007

Location: St Helens

Posted: 27th Mar 2010 05:27

Link

Thankyou so much - that makes sense. Here comes the obvious question; Why does dbp not use the code behind object inscreen() with its auto checks if you get my meaning - why does it make all of those computations instead? - I assume there is a very good reason. (Probably obvious too!)
Also when I set x number of objects radius -1 there is a signficant drop in frame rate. If I include manual checks, the frame rate is lower than than not changing those objects radius and still carrying out manual checks - why is that?

Back to top

Profile PM

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 06:02

Link

Quote: "If I include manual checks, the frame rate is lower than than not changing those objects radius and still carrying out manual checks - why is that? "

I have no idea what you are asking....

When YOU look at a scene, it's obvious to you which objects are within your view. That's not true for a program. It has to test each object you have given it to determine if it's within the camera's view (even the one's clearly behind the camera). DBP does this somehow. Most likely it takes the radius of an object and tests that first to see if that "radius" is within the view. Objects can have multiple limbs, so it also tests the limbs and then each vertex of the individual meshes. I'm not sure how detailed DBP goes within that chain before it sends it to the DirectX pipeline, but each of those checks takes processor cycles which means frame rate for your program; because each of those checks is code (remember my mantra).
Whatever "object inscreen()" is, it's code and code takes processor cycles--loss of frame rate. The DBP designers used the fastest code they could make with the best results they could achieve.

In short, use any means you need to make your code more streamline (fast not just "neat"). Sometimes the faster code looks ugly. Making call to functions, goto's, gosubs, and any choice commands clear the processor's Instruction Pointer's "look ahead" buffer (can't remember what that's called). That "look ahead" buffer is used by the processor to go through the code faster.... Kinda getting technical now..... We don't need to get into all that. Just remember that when you write any proceedure, keep it as precise as possible:

+ Code Snippet

A as float
A=1.5
A=A*MyMultiplyer

is slower than

+ Code Snippet

A as float =1.5*MyMultiplyer

Also, instead of using several "if" statements with conditions of the same variable, use "switch-case" statements.

Good luck and good night.

The fastest code is the code never written.

Back to top

Profile PM Email Website

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 27th Mar 2010 11:48

Link

Hawkblood,

If I am not mistaking, you said that GOTO's and GOSUB's are faster than FUNCTION calls.

If so, did you actually test that?

Frans

Back to top

Profile PM

Kira Vakaan

16

Years of Service

User Offline

Joined: 1st Dec 2008

Location: MI, United States

Posted: 27th Mar 2010 12:25 Edited at: 27th Mar 2010 12:27

Link

Whoa, that's a dangerous thing to say on this forum...

lol

@Hawkblood: Not really sure what you're talking about with the "Instruction Pointer's 'look ahead' buffer"... The Instruction Pointer is simply a register in the processor that stores the address of the current instruction. So when you call a function or subroutine, the value in the Instruction Pointer just changes.

Edit: typo

Back to top

Profile PM Email

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 15:38

Link

Quote: "If I am not mistaking, you said that GOTO's and GOSUB's are faster than FUNCTION calls."

You are mistaken. I said that they all have a cost; a cost that is much greater than having whatever code they have in them in it's place (linear).

Quote: " the value in the Instruction Pointer just changes."

It does just change, but the buffer has to be dumped because what WAS ahead is no longer valid.

The fastest code is the code never written.

Back to top

Profile PM Email Website

Kevin Picone

22

Years of Service

User Offline

Joined: 27th Aug 2002

Location: Australia

Posted: 27th Mar 2010 16:21 Edited at: 23rd Aug 2010 00:16

Link

Quote: "You are mistaken. I said that they all have a cost; a cost that is much greater than having whatever code they have in them in it's place (linear)."

yes & no. On old cpu's that don't have an instruction cache (68000 for example), then executing long linear streams of instructions is quicker than a tight loop. As each instruction is being fetched from memory regardless.

However, instruction level caches negate this. Now ideal code design means writing solutions that fit inside the available instruction cache. Unrolling beyond the cache size causes fetch penalties, as the cache is flushed, read, flushed read etc.

New PlayBasic Learning Edition Released 24th April 2010

Back to top

Profile PM Website

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 27th Mar 2010 16:49

Link

Hawkblood, I understand that every call to whatever block off non-linear or non-inline code costs.

Kevin Picone, I think to understand the first part of what you're saying, but the 2nd part goes beyond my knowledge.

Thanks

Back to top

Profile PM

empty

22

Years of Service

User Offline

Joined: 26th Aug 2002

Location: 3 boats down from the candy

Posted: 27th Mar 2010 17:39 Edited at: 27th Mar 2010 17:39

Link

Let's assume the instruction cache can hold 10 instructions. And lets assume the following pseudo code (each line = one instruction)

+ Code Snippet

loop_start:
do_stuff
do_more_stuff
inc counter
if counter < 5 goto loop_start

When executed, this code will go through 16 instructions, albeit it'll fit in our cache as there are only 4 instructions to cache. Now we unroll it

+ Code Snippet

do_stuff
do_more_stuff
do_stuff
do_more_stuff
do_stuff
do_more_stuff
do_stuff
do_more_stuff

That's only 8 instructions and also well within our cache, so it'll execute faster.
However, if we want to loop 20 times, we'll have 40 instructions. That means the cpu needs to clear and re-new the cache 4 (40/10) times. So there are chances that a tight, not unrolled loop will execute faster.

Back to top

Profile PM

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 18:15

Link

The information I am going on is from my assembly language days (about 20 years ago). Things may have changed as far as the way the instruction cache works, but whenever you "jump", the cache WILL dump and refill from the new instruction pointer location which neggates the advantage of having that cache. So:

+ Code Snippet

do_stuff
do_more_stuff
do_stuff
do_more_stuff
do_stuff
do_more_stuff
do_stuff
do_more_stuff
.... to 1000 stuffs

will be faster than

+ Code Snippet

loop_start:
do_stuff
inc counter
if counter<1000 then goto loop_start

The second is neater, but not as fast. It's impractical to make the code so linear.

The fastest code is the code never written.

Back to top

Profile PM Email Website

IanM

Retired Moderator

22

Years of Service

User Offline

Joined: 11th Sep 2002

Location: In my moon base

Posted: 27th Mar 2010 18:44 Edited at: 27th Mar 2010 19:04

Link

That's no longer true for any high-speed processor (incl x86). The cache is generally large enough that only the least recently used cache lines will be dumped.

The x86 series includes branch prediction, so it will 'guess' at whether the jump will be made or not, and fill the pipeline with the most appropriate path of instructions - if it finds that it guessed wrong, then the pipeline will be 'stall' and be reset and restart processing instructions from the other path.

In addition, fetching new instructions from memory into cache is relatively expensive compared to re-executing the same instructions again from the cache. Unrolling generally forces the instructions to be fetched to cache.

x86 is very good at the branch prediction, so generally, looped code will be faster than unrolled code, assuming that the loop is relatively small.

Basically, don't worry too much about loops - they aren't slow, and your second pseudo-code will most likely be faster (depending on the actual content of 'do_stuff').

[EDIT]
@James H,
Forgot to answer your post ...

Firstly, DBPro builds lists of objects every frame that must be considered for rendering. If you have excluded your object or hidden it then it won't be added to these lists, reducing the cost of rebuilding these lists - this is minimal though, unless you have increased the number of objects, which could cause these lists be be expanded in size.

Secondly, DBPro sorts these lists in various ways - sorting is expensive, no matter how good your sort algorithms. In certain rendering modes, DBPro will sort these lists every time a camera is rendered, and in others, may sort these lists when you change textures etc.

Thirdly, I happen to know that the OBJECT IN SCREEN code was updated in the last release for improved accuracy and uses different code than the rendering code - whether the new code is actually faster I don't know, but I believe the rendering code is a little 'looser' in what it considers to be inside the frustum.

Utility plug-ins collection (updated 02 Mar 2010) and
http://www.matrix1.demon.co.uk

Back to top

Profile PM Email Website

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 19:00

Link

You said my annalasys is correct and incorrect in the same post. I realize that processors have a more inteligent way of guessing jumps than before, but each wrong guess will cause that section to dump and be refetched.

My whole annalasys is that there are ways to make code faster just by the way it is written. Take the IF statemtent for instance:

+ Code Snippet

if a<5 and a>-5 and a<>0 then do somthing

is faster than

+ Code Snippet

if a<5
 if a>-5
  if a<>0
   do something
  endif
 endif
endif

Most applications won't notice any difference, but when that's expanded out to a huge nest, on slower computers this could make all the difference in the world.
For most computer applications, the processor is waiting for vsync to occur and the odd times that the processor "guesses" wrong only takes a few msec's more. On apps that are right at the edge of the computer's capabilities, these wrong "guesses" can be a killer.

The fastest code is the code never written.

Back to top

Profile PM Email Website

IanM

Retired Moderator

22

Years of Service

User Offline

Joined: 11th Sep 2002

Location: In my moon base

Posted: 27th Mar 2010 19:18

Link

Actually, I'm not sure that I did agree.

In summary, I thought my analysis said that it's faster to reuse code from the cache than it is to fetch from memory, and that in the majority of cases in loops, that the processor will correctly guess the correct branch to take. It will not simply dump the cache because of the existence of a 'jump' instruction, whether it's a forward jump or a backward jump, or whether it's a conditional jump or a fixed jump.

When the processor hits the branch instruction, it may guess incorrectly and have to stall the pipeline and jump backwards for a loop - it still has the instructions cached, so you only pay the price of a mis-prediction. When the processor hits that branch again, it will generally use its previous guess, and this time will not stall. It will repeat until the loop conditions cause it to exit the loop, causing again a mis-prediction - this means that if you are looping 1000 times, it will run the first iteration of the loop at a delayed speed (caused by fetching from memory to cache), mis-predict on the first branch causing the pipeline to stall, but on the next 999 iterations of the loop run at full processor speed, before finally mis-predicting the last branch and then continuing on.

This is as opposed to the continual stalls that will be introduced by coding 1000 occurrences of the same code, as they need to be fetched from memory into cache at one quarter of the speed of the processor (assume 533mhz FSB on a 2.6ghz processor) or worse.

Those pieces of code, if coded in assembly or a ... let's say 'more optimised language', would be identical.

I do agree though that in DBPro, they will produce difference code. You can see the code for yourself by looking in your DBPro TEMP directory at the *.dbm file.

Utility plug-ins collection (updated 02 Mar 2010) and
http://www.matrix1.demon.co.uk

Back to top

Profile PM Email Website

Hawkblood

15

Years of Service

User Offline

Joined: 5th Dec 2009

Location:

Posted: 27th Mar 2010 19:30

Link

I think this horse is dead now.
To the original question of culling: If YOU cull it, it will be faster than to let other steps cull it.

The fastest code is the code never written.

Back to top

Profile PM Email Website

Frans

16

Years of Service

User Offline

Joined: 12th Sep 2008

Location: Netherlands

Posted: 27th Mar 2010 20:09

Link

You people made me dizzy ...

Back to top

Profile PM

Sorry your browser is not supported!

DarkBASIC Professional Discussion / Screen culling - what does it mean?