language agnostic - Why Are Floating Point Numbers Inaccurate?


Why do some numbers lose accuracy when stored as floating point numbers?

For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:

32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875

How can such an apparently simple number be "too big" to express in 64 bits of memory?

In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:

5179139571476070 * 2^-49

Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.

9.2 may simply be 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
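You can see this directly in Python: `float.as_integer_ratio()` returns the exact ratio a float actually stores, and its denominator is always a power of 2, never 10. Cross-multiplying against 92/10 compares the two ratios exactly, with no rounding:

```python
# The exact ratio actually stored for the literal 9.2.
num, den = (9.2).as_integer_ratio()
print(num, "/", den)

# The denominator is a power of 2 (exactly one bit set).
print(den & (den - 1) == 0)   # True

# Cross-multiply to compare against 92/10 exactly:
print(num * 10 == 92 * den)   # False: the stored float is not exactly 9.2
```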


Seeing the data

First, a few functions to see the components that make up a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):

import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:          # single precision
        int_pack      = 'I'
        float_pack    = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:        # double precision. all python floats are this
        int_pack      = 'Q'
        float_pack    = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    # Reinterpret the float's bytes as an unsigned integer, then render
    # that integer as a zero-padded binary string.
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]

There's a lot of complexity behind that function, and it'd be quite a tangent to explain it, but if you're interested, the important resource for our purposes is the struct module.

Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double precision has a separate type, double, which is often implemented as 64 bits.

When we call that function with our example, 9.2, here's what we get:

>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']

Interpreting the data

You'll see I've split the return value into three components. These components are:

  • Sign
  • Exponent
  • Mantissa (also called significand, or fraction)

The sign

The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
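A quick way to see the sign bit flip, using only the standard struct module (a minimal sketch that reinterprets the float's bytes as an unsigned integer and inspects the top bit):

```python
import struct

def sign_bit(x):
    # Reinterpret the 64-bit float's bytes as an unsigned integer;
    # the sign is the most significant of the 64 bits.
    return struct.unpack('Q', struct.pack('d', x))[0] >> 63

print(sign_bit(9.2))    # 0: positive
print(sign_bit(-9.2))   # 1: negative
```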

The exponent

The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^(# of bits - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
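Following that arithmetic in Python, using the middle component returned above for 9.2:

```python
# Decode the exponent field of 9.2's double representation.
exponent_field = '10000000010'                # middle component for 9.2
stored = int(exponent_field, 2)               # raw stored value
bias = 2 ** (len(exponent_field) - 1) - 1     # 2^(11-1) - 1 = 1023
print(stored)                                 # 1026
print(stored - bias)                          # 3, the true exponent
```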

The mantissa

The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:

6.0221413 x 10^23

The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:

1.0010011001100110011001100110011001100110011001100110

This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.

When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:

0.0010011001100110011001100110011001100110011001100110

In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
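Checking that arithmetic in Python: the 52 stored mantissa bits, read as an integer and divided by 2^52, give exactly the fractional part of 9.2's true mantissa:

```python
# The 52 stored mantissa bits of 9.2, taken as an integer...
mantissa_bits = '0010011001100110011001100110011001100110011001100110'
numerator = int(mantissa_bits, 2)
print(numerator)                    # 675539944105574

# ...divided by 2^52 to shift the radix point 52 places left.
fraction = numerator / 2 ** 52
print(fraction)                     # approximately 0.15, but not exactly
```

Every step here is exact in binary: dividing by a power of 2 only moves the radix point, so `fraction` is precisely the value of those 52 bits.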

Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.

Recapping the components

  • Sign (first component): 0 for positive, 1 for negative
  • Exponent (middle component): subtract 2^(# of bits - 1) - 1 to get the true exponent
  • Mantissa (last component): divide by 2^(# of bits) and add 1 to get the true mantissa
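Those three rules can be rolled into a small decoder. This is a sketch; `bin_parts_to_float` is a hypothetical helper (the inverse of `float_to_bin_parts`), and it only handles ordinary "normal" numbers, not zeros, subnormals, infinities or NaNs:

```python
def bin_parts_to_float(parts, exponent_bias=1023):
    """Rebuild a float from ['sign', 'exponent', 'mantissa'] bit strings."""
    sign, exponent, mantissa = parts
    true_exponent = int(exponent, 2) - exponent_bias            # remove the bias
    true_mantissa = 1 + int(mantissa, 2) / 2 ** len(mantissa)   # restore hidden 1
    return (-1) ** int(sign) * true_mantissa * 2 ** true_exponent

# The parts of 9.2 from earlier round-trip back to the stored value:
parts_9_2 = ['0', '10000000010',
             '0010011001100110011001100110011001100110011001100110']
print(bin_parts_to_float(parts_9_2))   # 9.2
```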

Calculating the number

Putting all three parts together, we're given this binary number:

1.0010011001100110011001100110011001100110011001100110 x 10^11

Which we can then convert from binary to decimal:

1.1499999999999999 x 2^3 (inexact!)

And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:

9.1999999999999993


Representing the fraction

9.2

Now that we've built the number, it's possible to reconstruct it into a simple fraction:

1.0010011001100110011001100110011001100110011001100110 x 10^11

Shift the mantissa to a whole number:

10010011001100110011001100110011001100110011001100110 x 10^(11-110100)

Convert to decimal:

5179139571476070 x 2^(3-52)

Subtract the exponent:

5179139571476070 x 2^-49

Turn the negative exponent into division:

5179139571476070 / 2^49

Multiply out the exponent:

5179139571476070 / 562949953421312

Which equals:

9.1999999999999993
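Python confirms the fraction exactly: dividing an integer by a power of 2 is exact here, so the comparison below involves no rounding of its own.

```python
# 9.2 as stored is exactly 5179139571476070 / 2^49.
print(5179139571476070 / 2 ** 49 == 9.2)   # True
print(2 ** 49)                             # 562949953421312
```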

9.5

>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']

You can already see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces anyway.

Assemble the binary scientific notation:

1.0011 x 10^11

Shift the decimal point:

10011 x 10^(11-100)

Subtract the exponent:

10011 x 10^-1

Convert binary to decimal:

19 x 2^-1

Turn the negative exponent into division:

19 / 2^1

Multiply out the exponent:

19 / 2

Which equals:

9.5
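Unlike 9.2, this fraction already has a power-of-2 denominator, so storage is exact, which Python can confirm:

```python
# 9.5 is exactly 19/2: the denominator is a power of 2.
print((9.5).as_integer_ratio())   # (19, 2)
print(19 / 2 == 9.5)              # True
```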



Further reading

