# Double Precision in LowRes NX


nathanielbabiak 2022-10-06 04:31 (Edited)

LowRes NX uses 32-bit floating point, but if you want more, you have to code it yourself.

For background, the 32 bits aren't all digits. You get 23 bits for digits (pick any value from 0 to 2^23-1), I'll call it variable s, and that 23-bit value is part of a fraction: (s + 2^23)/2^23. The fraction evaluates to a number between 1.0 and 1.9999999. That's seven 9s, or about seven significant figures (in base 10). There's another part of the 32-bit encoding (8 bits) that allows the fraction to cover a wider range: it's a straightforward power-of-two multiplier on the fraction, and it ranges from 2^-126 to 2^127, or roughly 10^-38 to 10^38. The one remaining bit is the sign.
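As a rough illustration of that layout, here's a Python sketch (Python is used only as a convenient way to poke at FP32 outside LowRes NX). It assumes the standard IEEE 754 single-precision layout, where the fraction is (s + 2^23)/2^23 and the multiplier is 2^(exponent - 127). The function names are made up for this example, and it only handles normal numbers (no zero, infinity, or NaN).

```python
import struct

def fp32_fields(x):
    """Split x (rounded to FP32) into (sign, exponent, s)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    s = bits & 0x7FFFFF             # the 23 bits of digits
    return sign, exponent, s

def fp32_value(sign, exponent, s):
    """Rebuild the value from its fields (normal numbers only)."""
    fraction = (s + 2**23) / 2**23  # between 1.0 and just under 2.0
    return (-1) ** sign * fraction * 2.0 ** (exponent - 127)

sign, exponent, s = fp32_fields(0.15625)  # 0.15625 = 1.25 * 2^-3
print(sign, exponent, s)                  # 0 124 2097152
print(fp32_value(sign, exponent, s))      # 0.15625
```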

Here's a quick-reference, one compact statement summarizing the background info, and another extending it to 64 bits:

• FP32 has 7 significant figures with a range of 10^(+/-38)
• FP64 has 16 significant figures with a range up to 10^(+/-308)

LowRes NX, using FP32, literally can't represent numbers 10^39 or larger without a software-implementation.
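Both limits are easy to see off-device. This is a small Python sketch, not LowRes NX code; the helper `f32` is a made-up name that rounds a Python float (FP64) to the nearest FP32 value using the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Precision: 2^24 + 1 needs more digits than FP32 has, so the 1 is lost.
print(f32(16777217.0))   # 16777216.0

# Precision again: 0.1 is only good to about seven or eight figures.
print(f32(0.1))          # 0.10000000149011612

# Range: 10^39 does not fit in FP32 at all.
try:
    f32(1e39)
    fits = True
except OverflowError:
    fits = False
print(fits)              # False
```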

If you ignore the difference in range (since 10^38 seems big enough) and focus only on increasing the number of significant figures, ultimately you'd be coding an algorithm that works on a collection of FP32 variables, each with 23 bits for digits, with multiple variables comprising a more precise "float-float" number.
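To make the "collection of FP32 variables" idea concrete, here's a minimal Python sketch of a two-variable float-float value: a high part holding the leading digits and a low part holding what the high part missed. The names `f32` and `to_float_float` are hypothetical, and FP32 is emulated by rounding through the `struct` module.

```python
import math
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_float_float(x):
    """Split x into two FP32 values (hi, lo) with hi + lo close to x."""
    hi = f32(x)        # the leading ~7 significant figures
    lo = f32(x - hi)   # the leftover, again rounded to FP32
    return hi, lo

hi, lo = to_float_float(math.pi)
print(hi, lo)
print(abs(hi - math.pi))         # error with one FP32 variable
print(abs((hi + lo) - math.pi))  # error with the pair, far smaller
```

The pair is only ever useful if the arithmetic on it is done carefully, since adding `hi + lo` back together in FP32 would just give `hi` again.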

But why did I call it a "float-float" number? Because this is an area of current research. CPUs handle FP64 natively, but many GPUs are limited to FP32 (or are much slower at FP64), so the goal is to get FP64 "precision" out of FP32 hardware. For example, compilers encounter the C keyword "double" (FP64) while compiling binaries for GPUs (FP32).

The research is called double-double arithmetic, and it's pretty well understood at this point for common architectures. The structure of the algorithm is normalize (sort) and then operate (+-*/...). There are some really theoretical articles on this stuff that discuss minimizing rounding error, but you can just ignore them and implement the algorithms instead. (...Unless you plan to use LowRes NX in nuclear or aerospace haha!)
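For a feel of what such an algorithm looks like, here's a hedged Python sketch of float-float addition. It's a simplified add of the kind used in double-double libraries, not necessarily the exact variant from any particular paper. All function names are assumptions for this example, FP32 is emulated through the `struct` module, and a value is a `(hi, lo)` tuple. The final `quick_two_sum` step is the normalization: it assumes its first argument is the larger one in magnitude.

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def two_sum(a, b):
    """FP32 sum s and its rounding error e, so s + e equals a + b."""
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e

def quick_two_sum(a, b):
    """Same as two_sum, but cheaper; assumes |a| >= |b|."""
    s = f32(a + b)
    e = f32(b - f32(s - a))
    return s, e

def to_ff(x):
    """Split an FP64 value into a float-float pair (hi, lo)."""
    hi = f32(x)
    return hi, f32(x - hi)

def ff_add(x, y):
    """Add two float-float pairs, then renormalize the result."""
    s, e = two_sum(x[0], y[0])      # add the high parts, keep the error
    e = f32(e + f32(x[1] + y[1]))   # fold in both low parts
    return quick_two_sum(s, e)      # renormalize so lo is tiny next to hi

hi, lo = ff_add(to_ff(1.0), to_ff(1e-9))
print(f32(1.0 + 1e-9))  # 1.0, plain FP32 loses the small term
print(hi, lo)           # the pair keeps it in lo
```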

The easiest/fastest source for this is Yozo Hida's. Cite is Hida, Y., "Library for Double-Double and Quad-Double Arithmetic", 2007.

SP4CEBAR 2022-10-06 13:33

Here's something I made a while ago, it's decimal (10^7 instead of 2^23)


SP4CEBAR 2022-10-06 13:58

Here's an unfinished concept


nathanielbabiak 2022-10-12 05:05 (Edited)

The essential concept missing from those nx examples is that LowRes NX performs all calculations as FP32. If you want it to work, a good starting point would be pdf-page 4 of the cite provided above. Try to implement the "two-sum" function. You can check your implementation with almost any modern calculator software except LowRes NX (since the modern stuff will hopefully be more accurate than FP32).

(As a hint, note the dot operator and circle operators are defined at the bottom of pdf-page 3.)

The great thing about this approach is that you'll be able to rely on LowRes NX to calculate any carry or overflow for you... since LowRes NX performs all calculations as FP32.

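SUB TWO_SUM( A, B, S, E )
REM S IS THE FP32-ROUNDED SUM, E IS THE ROUNDING ERROR (S + E = A + B)
S = A + B
REM V IS THE PART OF B THAT ACTUALLY MADE IT INTO S
V = S - A
E = ( A - ( S - V ) ) + ( B - V )
END SUB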
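One way to check an implementation like this outside LowRes NX is to port it to Python and compare against exact rational arithmetic. This is only a sketch under assumptions: `f32` is a made-up helper that rounds every intermediate result to FP32 (mimicking how LowRes NX does all its math), and `Fraction` from the standard library does the exact bookkeeping.

```python
import struct
from fractions import Fraction

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def two_sum(a, b):
    """Port of TWO_SUM: every operation is rounded to FP32."""
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e

a, b = f32(16777216.0), f32(1.0)   # 2^24 + 1 does not fit in FP32
s, e = two_sum(a, b)
print(s, e)                        # 16777216.0 1.0

# Exact check: nothing was lost, the rounding error landed in e.
print(Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b))  # True
```

So S is the sum as FP32 sees it, and E is whatever FP32 had to round away.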

SP4CEBAR 2022-10-12 14:08

Okay
I probably won't continue my DOUBLE "concept" program
If I understand the example correctly:
S is sum and E is carry?
E = A - A_clipped + B - B_clipped