jump to navigation

Shaving bits off a float, the shiny way August 30, 2008

Posted by winden in coding, demoscene.

One of the most used tricks for space optimization on 4k intros is to mangle floating point values on the executables before compressing.

Recap on standard floating point number storage

To recap, a float is stored in memory using 4 bytes, with bit-fields packed like this:

  • 1 bit for sign, s
  • 8 bits for exponent, e
  • 23 bits for mantissa, m

The exponent is stored with 127 added to it, so the final value is basically

  • value = power(2, e – 127) * (1.0 + m)

What this means is that the exponent is used to select which bracket to use:

  • e = 125 –> value is between 0.25 and 0.5
  • e = 126 –> value is between 0.5 and1
  • e = 127 –> value is between 1 and 2
  • e = 128 –> value is between 2 and 4
  • e = 129 –> value is betweem 4 and 8

And then the mantissa is used to select a number inside than bracket.

Shaving off bits

Compressor engines work by detecting repeating patterns and encoding them in non-obvious ways. The most obvious pattern to detect are strings of zero bits or zero bytes.

That can be used in our advantage: the executable or datafile will compress better if there are many zero bytes, and that is easy to do if we somehow force some bits on the float values to zero. Let’s say we blank out the 2 lower bytes on a float. That means that we end up with this floating point representation:

  • 1 bit for sign
  • 8 bits for exponent
  • 7 bits for mantissa

You could say that this forces us to use numbers other than what we wanted, but that was already the case since floats are only an approximation. How do you store π on a normal float? You can’t, since a float is just a 23-binary digits approximation to the value we want!

I really need a value that is not available by using fewer bits… Help!

That’s so wrong it hurts, you don’t really need them for making a demo. A demo is created by faking, fudging and tweaking until it looks good. You can probably achieve what you want by mangling either that number, another number or the whole formula in a random way.

How to shave bits, the shiny way (aka “in the future the compilers will obsolete the coders”)

Ideally, we would want to automate the whole process of shaving off bytes. One of the nicest parts of C++ is that compilers can detect and eliminate operations at compile time if they can prove that the result will be the same as doing it at run time. Given C++ templates and reasonably modern C++ compilers such as gcc 4, we can try this way:

class f16 {
union {
mutable unsigned int _iv;
mutable float _v;
inline f16(float const v) {
_v = v;
_iv = _iv & mask;
inline operator float() const {
return _v;
inline operator unsigned int() const {
return _iv;

What is happening here is that we create an object from a float, and the constructor stores that value with the lower 16 bits forced to zero. As the real storage place is marked as private and thus hidden from any other code, the compiler is able to optimise it and the resulting executable only has the masked value.

How to shave bits, the reliable way (aka “compilers are evil and does things behind my back”)

A big problem with the previous method is that it depends on the compiler realizing that he can optimize the code that way. If you need to be sure that the compiler only stores truncated numbers in the executable, the easiest way is to simply never feed him numbers other than the ones that are already in the format we want.

There are two ways:

  1. Place the proper numbers my hand or by preprocessor macros
  2. Make a filter that reads the source code, detects the floating point values and outputs the code to another file with truncated numbers.

How to shave bits, the backhanded way (aka “compilers are dumb, I better fix their output afterwards”)

Nowadays the leading executable compressor Crinkler is capable of automatically truncating bits. This is done by using the name of the variable as a hint to the compressor, which dutifully takes the variable and masks off lower bits before starting to compress anything. While nice and really automatic, I’d consider this a catch-all rule and just use any of the previous techniques in addition to achieve better control of the numbers.

Relevant links:

Wikipedia article on floating point numbers

IQ/RGBA article on truncating floating point numbers by hand

in4k article on truncating floating point numbers both by hand and by script

Crinkler executable linker and compressor


1. slack - August 31, 2008

Well, as you know I tend to use the “compilers are evil and do things behind my back” way using a filter on my sources, but I wouldn’t call it that way. It’s more like “I don’t want to wrap every floating point constant inside f16( )”. The nice thing of your method, IMHO, is that you have control over which values get truncated and which ones don’t, without using hundreds of preprocessor macros iq-style ;)

2. j'endors Messon - January 30, 2010

i don’t understand where “mask” come from?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: