Branchless code

Discussion

SP4CEBAR 2022-06-21 19:02 (Edited)

Since I watched this video years ago, I've been writing branchless code when possible.

Branchless code is writing

X=A*-(X<0)-B*(X>=0)
(TRUE has the value -1, that's why everything is negative)

Instead of

IF X<0 THEN X=A ELSE X=B END IF

However, branchless code isn't always faster. It all depends on what the compiler does with your code; the only ways to figure out whether it is faster are to look at the assembly code or to measure the CPU cycles.
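To check the arithmetic behind the two forms above, here is a minimal sketch in Python rather than LowRes NX BASIC. Python's True is 1, so the hypothetical helper `basic_bool` recreates the BASIC convention where a true comparison yields -1. The function names are made up for illustration, and the sketch only shows that both forms give the same result; it says nothing about which one costs fewer cycles.

```python
def basic_bool(condition):
    # Assumption: mimic a BASIC comparison, which yields -1 for true and 0 for false.
    return -int(condition)


def select_branchless(x, a, b):
    # Mirrors X=A*-(X<0)-B*(X>=0): exactly one of the two terms is non-zero.
    return a * -basic_bool(x < 0) - b * basic_bool(x >= 0)


def select_branching(x, a, b):
    # Mirrors IF X<0 THEN X=A ELSE X=B
    if x < 0:
        return a
    return b


# Both forms agree on negative, zero, and positive inputs.
for x in (-5, -1, 0, 1, 5):
    assert select_branchless(x, 10, 20) == select_branching(x, 10, 20)
```

Both terms are always evaluated in the branchless form, which is part of why it is not automatically the cheaper option.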

Which compiler(s) does LowRes NX use? Also, are there more people who write branchless code?

McPepic 2022-06-21 19:29

I switch back and forth. I think it runs faster if you write out the condition. It’s also easier to read. Sometimes I want to write it in as few lines as possible, in which case I use branchless. I think it’s up to preference, though. If you really want to check, there is a program on LowRes that counts clock cycles, so you could use that. By the way, if your condition runs all on one line, you don’t need to write “end if”.

SP4CEBAR 2022-06-21 20:56 (Edited)

I didn't realize it was one line. It should have been five lines, but I forgot to add a double space after each line, so there were no markdown line breaks and it appears as one line.

nathanielbabiak 2022-06-22 00:42 (Edited)

The user rilden has uploaded Cycle counter. Also check out Timo's comments here and here.

From the manual...

LowRes NX has a simplified simulation of CPU cycles. There is a fixed limit of cycles per frame. This assures the same program execution speed on all devices, so if you optimize your program on your device to run smoothly, it will run the same on all other devices.

Each execution of a command, function or operator, as well as access to a variable or a constant, counts 1 cycle. Some operations have additional costs:

String creation and modification count 1 cycle per letter.
Array initialization counts 1 cycle per element.
Memory area modification counts 1 cycle per byte (not single byte modifications like POKE).
BG area modification and text output count 2 cycles per cell (not single cell modifications like CELL).

Total cycles per frame: 17556

Cycles per VBL interrupt: 1140

Cycles per raster interrupt: 51

The main program may spend any number of cycles, but when the limit is reached before a WAIT VBL or WAIT command, the execution continues in the next frame. If interrupts exceed their limit, you will see black scanlines on the screen.
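As a rough illustration of the budget figures quoted above, here is a small Python sketch. The three constants come from the manual excerpt; the function names and the per-iteration cost used in the example are hypothetical, and the sketch ignores interrupts and the extra per-letter, per-element, per-byte, and per-cell costs.

```python
import math

# Figures quoted from the manual excerpt above.
CYCLES_PER_FRAME = 17556
CYCLES_PER_VBL = 1140
CYCLES_PER_RASTER = 51


def frames_needed(total_cycles):
    # Work beyond the per-frame limit continues in the following frame(s).
    return max(1, math.ceil(total_cycles / CYCLES_PER_FRAME))


def iterations_per_frame(cycles_per_iteration):
    # How many times a loop body of a given (assumed) cost fits in one frame.
    return CYCLES_PER_FRAME // cycles_per_iteration


def fits_in_raster_interrupt(cycles):
    # Exceeding this limit is what produces black scanlines.
    return cycles <= CYCLES_PER_RASTER


print(frames_needed(40000))         # spills over into later frames
print(iterations_per_frame(10))     # 10 cycles per iteration is an assumed cost
print(fits_in_raster_interrupt(60)) # False: over the 51-cycle limit
```

Plugging in costs measured with a cycle counter gives a quick estimate of how much work fits in a frame before the program starts to slow down.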

nathanielbabiak 2022-06-22 00:53 (Edited)

Branchless code tends to be really hard to read, debug, modify, etc.

There aren't too many uploads on this site currently that require speed. Regardless, the best way to get faster code is to change your algorithm "big picture" rather than syntactically (saving only a few clock cycles).

A good exercise when looking at an algorithm "big picture" is:

First, separate the algorithm into steps
Then, identify a single step where you would benefit from a "magic" subprogram or variable, meaning a placeholder that magically does exactly what you need
Lastly, ask yourself, "how could I use way more memory or arrays to track the magic?"

You'll likely find you can trade execution time for memory usage. Here are some cool examples:

the engine of Duke Nukem 3D was coded in QBasic so the developer could see how algorithmic changes affected execution time! After he'd make changes, he'd port those changes over to C for compiling in the commercial version. He didn't focus on line-by-line optimization.
the file compressor uses over a half megabyte of working memory, but in exchange allows on-console compression
the raycasting engine of my three Wolf3D uploads is essentially the same between the versions, but the display driver uses entirely new algorithms each time. The (comparable) framerate of those uploads is 8, 15, then 30 FPS - the difference is huge!
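The time-for-memory trade described above can be sketched with a precomputed lookup table. This is a generic Python illustration, not the method used in any of the uploads mentioned; `TABLE_SIZE`, `SIN_TABLE`, and `fast_sin` are hypothetical names, and the table resolution is an arbitrary choice.

```python
import math

TABLE_SIZE = 256  # assumed resolution: angles measured in 1/256ths of a full turn

# Pay the computation cost once, up front, and keep the results in memory.
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def fast_sin(step):
    # A single array read replaces the sine calculation; the index wraps around.
    return SIN_TABLE[step % TABLE_SIZE]


# A quarter turn (64 of 256 steps) is 90 degrees.
assert abs(fast_sin(64) - 1.0) < 1e-9
```

The table is the "magic" placeholder from the exercise: it costs 256 stored values, and in exchange each later lookup is a single array access.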

That said, I've explored the console's clock cycles for syntactic gains. My results are published here, with recommendations published here.

SP4CEBAR 2022-06-22 15:20 (Edited)

At some point I'll need to optimize my game engine, which always has an NX CPU usage of 100% (it also contains a lot of branchless code).

Oh, wait you won't receive a notification for this, so you probably won't see this