Author Topic:   Code Optimizations
JohnMancine
Member
posted October 06, 1999 07:23 AM            
Hey everybody...

I am getting a really slow framerate on my current landscape rendering... and I haven't even started with texturing etc. etc.

Some of the things I think to watch for (or will fix later to speed things up) are conversions from floats to integers, and making sure I pass pointers/references for big data objects, etc.

Can anyone else share what they watch for when they want to make sure their code is running as fast as possible? What other bottlenecks do I need to watch out for?

Thanks!
John


LDA Seumas
unregistered
posted October 06, 1999 07:07 PM           
Along with float to int conversions, other killers are divides, modulos, and square roots, both integer and floating point.

Whenever possible, convert integer divides (and multiplies) into left and right shift operators ("<<" and ">>"), and convert modulus ("%") into a bit-and. With floating point, try converting an oft-needed divide into a reciprocal and multiplies: if you need to divide by X twice or more, compute RX = 1.0f / X, then multiply by RX instead of dividing by X. For places you might need a distance computed with a sqrt(), see if you can get away with a squared distance. If you're just comparing two distances for less than or greater than, comparing the squared distances gives the same result as comparing the true square-root distances.
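A minimal sketch of the two floating-point tricks above (the function names scale_all and nearer are mine, for illustration):

```cpp
// Replace repeated divides by X with one reciprocal and multiplies,
// and compare squared distances instead of calling sqrtf().

void scale_all(float* v, int n, float x) {
    float rx = 1.0f / x;          // one divide...
    for (int i = 0; i < n; i++)
        v[i] *= rx;               // ...then only multiplies
}

// True if point a is nearer to the origin than point b, without sqrt:
// sqrt is monotonic, so comparing squared lengths gives the same answer.
bool nearer(float ax, float ay, float bx, float by) {
    return ax * ax + ay * ay < bx * bx + by * by;
}
```

Note the reciprocal trick trades a tiny bit of precision for speed, since RX is itself rounded.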

Anyone have any other favorites?

------------------
-- Seumas McNally, Lead Programmer, Longbow Digital Arts


Kaeto
Member
posted October 09, 1999 03:32 PM            
You can do a modulus by right-shifting and then left-shifting (I don't know if this is any faster than a bit-and, though. Seems like it should be).


Kaeto
Member
posted October 09, 1999 03:33 PM            
Oops, wrote that backwards (I think). You << and then >>.


LDA Seumas
unregistered
posted October 11, 1999 03:26 AM           
You could do a modulo with a left shift << followed by a right shift >>, but you'd have to do it on an unsigned long to avoid erroneous sign extension on the right shift. Those two operations would probably also be slower than the single bit-and that I think would do the exact same thing, namely setting the X high-order bits to 0 and leaving the remaining low-order bits as they are. That's assuming the left and right shifts are by the same number of bits, and that you're talking about the operation I think you are (a fast version of (x % y) where y is a power of 2).
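Both versions can be sketched side by side (function names are mine; assumes 32-bit unsigned and 0 < log2y < 32, since shifting by the full width is undefined):

```cpp
// For y a power of two, x % y == x & (y - 1) for non-negative x.
// The shift pair ((unsigned)x << k) >> k computes the same thing
// when k = 32 - log2(y), but needs an unsigned type so the right
// shift doesn't sign-extend.

unsigned mod_pow2_and(unsigned x, unsigned y) {    // y must be a power of 2
    return x & (y - 1);                            // single bit-and
}

unsigned mod_pow2_shift(unsigned x, unsigned log2y) {  // 0 < log2y < 32
    unsigned k = 32u - log2y;
    return (x << k) >> k;                          // two shifts, same result
}
```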

------------------
-- Seumas McNally, Lead Programmer, Longbow Digital Arts


assen
Member
posted October 11, 1999 04:53 AM            
The most important rule is "Don't Ever Trust The ****** Compiler To Generate Even Remotely Sensible Code".
If you have something like

p->a->b->c = Something;
p->a->b->d = SomethingElse;
p->a->b->e = SomethingThird;

don't trust the bugger, do it like this:

TypeB *pb = p->a->b;
pb->c = Something;
pb->d = SomethingElse;
pb->e = SomethingThird;

or even

p->a->b->Somethings();
...

inline void TypeB::Somethings()
{
c = Something;
d = SomethingElse;
e = SomethingThird;
}

The sad reality is, Visual C++ can't be trusted to generate equally good code in all three cases.

Replace

int a[100][100];
for(int i=0; i<100; i++)
for(int j=0; j<100; j++)
a[i][j] = Something(i,j);

with

int a[10000], *pa = a;
for(int i=0; i<100; i++)
for(int j=0; j<100; j++)
*pa++ = Something(i,j);

or even with

int a[10000], *pa = a;
for(int i=0; i<10000; i++)
*pa++ = Something(i/100, i%100);

Of course, this is only worth it if Something() is not too expensive - if you have a couple of sqrt()'s or atan2()'s in it, the loop overhead won't matter.
(Here it becomes obvious why programmers like powers of 2 - if a were 128x128, the division and modulus would reduce to a shift and a bit-and.)
(BTW, even the most braindamaged of compilers would recognise the division/modulus pair and reduce it to one instruction - x86 CPUs calculate the quotient and the remainder together anyway.)

My Split2() function (see Seumas' ramblings on Binary Trees on the Programming page; it is basically what he shows, with the priority computation, visibility culling and backface simplification rolled in) used to take something like three screens of convoluted pointer dereferences and bitfield accesses, which took about 70% of my program's time (far more than the OpenGL rendering itself, which is absurd). I managed to bring that percentage down to 30% just by reorganizing the operations.

Second rule: be wary of bitfields and operator overloading, because you might be tempted to think that something which says:

a = b + c;

is simple and fast. If a, b, and c are bitfields, this compiles into a bunch of shifts and masks you could avoid if you were working with the representation of the bitfields themselves. If a, b and c are vectors, the above line would sum all four components, even though in some cases you actually might need just two of them.
If you had to write:
a = Vec4fSum(b, c);
or
a = (b + c) & 0x70;
you wouldn't use this so light-heartedly.
Code clarity is a Good Thing (TM), but in the "inner loop" you cannot afford to sacrifice performance for clarity. Keep the clear code in a comment block nearby, so you'll know later what you were doing, then optimize it away into a jumble of <<, &, and look-up tables.
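The vector case above can be sketched like this (Vec4 and the function names are illustrative, not from the post):

```cpp
struct Vec4 {
    float x, y, z, w;
};

// The overloaded + always does all four adds, whether the caller
// needs them or not.
Vec4 operator+(const Vec4& a, const Vec4& b) {
    return Vec4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

// When only two components matter, spelling it out does half the work:
inline void add_xy(Vec4& dst, const Vec4& a, const Vec4& b) {
    dst.x = a.x + b.x;
    dst.y = a.y + b.y;
}
```

The explicit name makes the cost visible at the call site, which is the point being made here.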

Be careful with look-up tables. On modern CPUs you're often better off doing a few calculations yourself than thrashing the cache with a look-up table.

Spend some time examining the compiler output. Turn on Listing Files: Source with Assembler in the C++ tab of Project Settings in VC++, then take a good look at what monstrous hodge-podge your "simple and elegant" code turns into. If you can't read assembler... well... try to get some idea about it. You can achieve "algorithm-level efficiency" without knowing sh*t about the underlying architecture, but really efficient "inner loop" code cannot be written without a clear awareness of the code generated. You rarely need to code directly in assembler (except, for example, when you have a good reason to use SSE, MMX or 3DNow!, which aren't adequately exposed by compilers), but you need to write your C++ in a way which doesn't leave the compiler the choice of generating bad code.


assen
Member
posted October 11, 1999 03:12 PM            
Another technique which can be useful in select cases is loop unrolling. Basically, it means that instead of

for(i=0; i<4; i++)
DoSomething();

you write
DoSomething();
DoSomething();
DoSomething();
DoSomething();

If you don't know the exact number of iterations you'll need,
you can use the following trick which leads to good code:

switch(i) {
case 4: DoSomething(); // no break; fallthru following cases
case 3: DoSomething(); // no break; fallthru following cases
case 2: DoSomething(); // no break; fallthru following cases
case 1: DoSomething(); // no break; fallthru following cases
}

So it'll look something like this:

while(i>4) {
DoSomething();
DoSomething();
DoSomething();
DoSomething();
i -= 4;
}
switch(i) {
case 4: DoSomething();
case 3: DoSomething();
case 2: DoSomething();
case 1: DoSomething();
}

This way you can reduce the loop overhead by a factor of 4; of course, this will result in a significant gain only if DoSomething() is a relatively cheap operation.
You can devise a macro to perform this unrolling for you, which you'd use like this:

UNROLL(DoSomething(), nTimes);
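One possible UNROLL along those lines (the macro name comes from the post; the body is my sketch of the while-plus-switch pattern shown above):

```cpp
// Runs EXPR exactly N times: four per loop pass, then a fall-through
// switch mops up the 0-3 leftover iterations.
#define UNROLL(EXPR, N)                      \
    do {                                     \
        int n_ = (N);                        \
        while (n_ >= 4) {                    \
            EXPR; EXPR; EXPR; EXPR;          \
            n_ -= 4;                         \
        }                                    \
        switch (n_) {                        \
        case 3: EXPR; /* fall through */     \
        case 2: EXPR; /* fall through */     \
        case 1: EXPR; /* fall through */     \
        case 0: break;                       \
        }                                    \
    } while (0)
```

Since EXPR is pasted textually several times, it should be a cheap expression with no side effects you'd mind repeating, exactly as the post says.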


Sam K
Member
posted February 18, 2000 08:16 PM            
I'm new to this board, good work on the sharing ideas and techniques.. a very worthwhile read.

On the subject of optimizations, my brother suggested something the other day... which I haven't tried yet, but may just work if anyone is interested in having a go:

Floating point numbers have a mantissa and an exponent. The exponent holds the power of two the mantissa is scaled by, and it is biased by 127 (the exponent being 8 bits of a "float"): i.e. 127 = a power of 0; <127 = a negative power (e.g. 0.0003); >127 = a positive power (e.g. 30000.0).

Now we all know integer numbers can be divided and multiplied by a power of two with a shift right or left.
And we also know you can't do that with floats, which is also why fixed point is so attractive (even though floating point is supposedly faster these days, floating point divides are a killer, and especially wasteful when the divisor is just 2, or 4, or whatever).

So, my brother said: "Subtracting 1 from the exponent of the float divides the number by two, and adding 1 to the exponent multiplies the number by two."

**Thus you can divide and multiply floating point numbers by ANY power of 2 with an add or a subtract.**

Also remember the float is stored like this:
SEEEEEEEEMMMMMM......
The exponent byte is offset by one bit (the float's overall sign bit), so writing to the exponent looks a tiny bit trickier. In practice, though, since the mantissa sits entirely below the exponent, you can just add or subtract 1 << 23 on the integer representation; no rotates are needed.
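A minimal sketch to try the idea (the function name is mine; assumes IEEE-754 binary32 floats, and breaks on zero, denormals, infinities and NaNs, which real code would have to guard against):

```cpp
#include <cstring>

// Multiply f by 2^k (k may be negative) by adjusting the 8-bit
// exponent field directly. The mantissa occupies the low 23 bits,
// so adding k << 23 to the integer representation is enough.
float times_pow2(float f, int k) {
    unsigned bits;
    std::memcpy(&bits, &f, sizeof bits);   // type-pun without aliasing UB
    bits += (unsigned)k << 23;             // wraps correctly for negative k
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

On today's hardware an ordinary multiply by 0.5f is usually at least as fast, since moving the value between float and integer registers has its own cost, so this is mostly of interest as a bit-level curiosity.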

I can't reach a computer for a couple of weeks to try it out, so if anyone fancies giving it a go, I'd be most interested.

Or if anyone has heard of this already, and can confirm it.

cheers

sam

[This message has been edited by Sam K (edited February 19, 2000).]
