Arithmetic

Floating Point Representation (IEEE 754)

e.g. F = ± 1.xxx * 2^E

"precision"	S	E'	f	total	magnitude	precision
4-byte single	1	8	23	32	[2 x 10^-38, 2 x 10^38]	7 significant decimal digits
8-byte double	1	11	52	64	[2 x 10^-308, 2 x 10^308]	15 significant decimal digits

S : 0 = positive, 1 = negative
E' : biased exponent, E' = E + bias, bias = 127 for s.p., 1023 for d.p.
s.p.:
- E'_min = 0000 0001 ; E_min = 1 - 127 = -126
- E'_max = 1111 1110 ; E_max = 254 - 127 = +127
- E' is never negative (stored as an unsigned field), which facilitates comparing f.p. numbers with an integer ALU
f : fraction; the leading one is implicit and not encoded
F = (-1)^S * 1.f * 2^(E' - bias)
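The field layout and the formula for F can be checked by pulling a single-precision bit pattern apart. This is a minimal sketch, not a reference decoder: the helper names `float_to_bits` and `decode_single` are made up for illustration, and the reconstruction is only valid for normalized numbers (E' between 1 and 254).

```python
import struct

BIAS = 127  # single-precision bias


def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_single(bits: int):
    """Split a 32-bit pattern into (S, E', f) and rebuild F = (-1)^S * 1.f * 2^(E' - bias)."""
    s = bits >> 31                  # 1 sign bit
    e_biased = (bits >> 23) & 0xFF  # 8 exponent bits
    f = bits & 0x7FFFFF             # 23 fraction bits (leading one not stored)
    value = (-1) ** s * (1 + f / 2**23) * 2.0 ** (e_biased - BIAS)
    return s, e_biased, f, value


# -6.5 = -1.101 x 2^2, so S = 1, E' = 2 + 127 = 129, f = 101 followed by zeros
print(decode_single(float_to_bits(-6.5)))

# Because E' is stored unsigned above f, positive floats order like their bit patterns
print(float_to_bits(1.5) < float_to_bits(2.0) < float_to_bits(2.5))
```

The last line illustrates the integer-ALU comparison for positive values only; negative numbers use sign-magnitude and need the sign handled separately.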

e.g. represent 0.75 ₁₀ in s.p. format

0.75 x 2 = 1.5 -> 1
0.5 x 2 = 1.0 -> 1
0.75 ₁₀ = 0.11 ₂ = 1.1 x 2^-1 (normalized)

S = 0, f = 1000...0 (23 bits), E' = -1 + 127 = 126 = 0111 1110
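The hand encoding of 0.75 can be cross-checked against what Python's `struct` module produces. A small sketch; `encode_single` is a hypothetical helper that only handles normalized numbers and does no range checking.

```python
import struct

BIAS = 127  # single-precision bias


def encode_single(s: int, e_true: int, f: int) -> int:
    """Pack sign, true exponent and 23-bit fraction into a 32-bit pattern."""
    e_biased = e_true + BIAS
    return (s << 31) | (e_biased << 23) | f


# 0.75 = 1.1 x 2^-1: S = 0, E = -1, f = 1 followed by 22 zeros
bits = encode_single(0, -1, 1 << 22)
print(f"{bits:032b}")  # 00111111010000000000000000000000

# Compare with the pattern the platform produces for 0.75
assert bits == struct.unpack(">I", struct.pack(">f", 0.75))[0]
```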

E' reserved values: 0000 0000, 1111 1111

E'	f	value
0000 0000	zero	0
0000 0000	non-zero	denormalized, very small results
1111 1111	zero	infinity
1111 1111	non-zero	NaN

denormalized representation (s.p.): F = (-1)^S x 0.f x 2^-126
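The reserved-value table and the denormalized formula translate directly into a classifier. This is an illustrative sketch; `classify_single` and `denormal_value` are made-up names, and the bit patterns in the usage lines are chosen by hand.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern using the reserved E' values."""
    e_biased = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e_biased == 0x00:
        return "zero" if f == 0 else "denormalized"
    if e_biased == 0xFF:
        return "infinity" if f == 0 else "NaN"
    return "normalized"


def denormal_value(bits: int) -> float:
    """Evaluate F = (-1)^S x 0.f x 2^-126 for a pattern with E' = 0."""
    s = bits >> 31
    f = bits & 0x7FFFFF
    return (-1) ** s * (f / 2**23) * 2.0**-126


print(classify_single(0x00000000))  # zero
print(classify_single(0x00000001))  # denormalized
print(classify_single(0x7F800000))  # infinity
print(classify_single(0x7FC00000))  # NaN
print(denormal_value(0x00000001) == 2.0**-149)  # smallest positive denormal
```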

e.g. 1.000 x 2^-1 - 1.110 x 2^-2 using 4 significant digits
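One way to work the subtraction above is align, subtract, normalize. The sketch below assumes a 4-bit significand stored as an integer with 3 fraction bits; `subtract` and `value` are hypothetical helpers, and alignment simply shifts bits out, which happens to lose nothing for these operands.

```python
FRAC_BITS = 3  # 4 significant digits = 1 integer bit + 3 fraction bits


def value(sig: int, exp: int) -> float:
    """Interpret (significand, exponent) as sig / 2^FRAC_BITS * 2^exp."""
    return sig / (1 << FRAC_BITS) * 2.0**exp


def subtract(sig_a: int, exp_a: int, sig_b: int, exp_b: int):
    """Compute a - b for positive a > b with exp_a >= exp_b."""
    # Step 1: align the operand with the smaller exponent to the larger one
    sig_b_aligned = sig_b >> (exp_a - exp_b)
    # Step 2: subtract the significands
    sig, exp = sig_a - sig_b_aligned, exp_a
    # Step 3: normalize until the leading one is back in the integer position
    while sig and sig < (1 << FRAC_BITS):
        sig <<= 1
        exp -= 1
    return sig, exp


# 1.000 x 2^-1 - 1.110 x 2^-2
#   = 1.000 x 2^-1 - 0.111 x 2^-1 = 0.001 x 2^-1 = 1.000 x 2^-4
sig, exp = subtract(0b1000, -1, 0b1110, -2)
print(f"{sig:04b} x 2^{exp} = {value(sig, exp)}")
```

In decimal this is 0.5 - 0.4375 = 0.0625.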

if multiplying, add exponents and subtract bias
E'_3 = E'_1 + E'_2 - bias
     = (E^true_1 + bias) + (E^true_2 + bias) - bias
     = E^true_1 + E^true_2 + bias
     = E^true_3 + bias

if dividing, subtract exponents and add bias
E'_3 = E'_1 - E'_2 + bias
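The exponent rules for multiplication and division can be checked on real single-precision patterns. A sketch under one assumption: the operands are chosen so that the significand product or quotient needs no renormalization (which would otherwise change the exponent by one). The helper names are illustrative.

```python
import struct

BIAS = 127  # single-precision bias


def biased_exponent(x: float) -> int:
    """Extract the stored exponent field E' of x in single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def mul_exponent(e1: int, e2: int) -> int:
    """Biased exponent of a product: add the exponents, subtract the bias."""
    return e1 + e2 - BIAS


def div_exponent(e1: int, e2: int) -> int:
    """Biased exponent of a quotient: subtract the exponents, add the bias."""
    return e1 - e2 + BIAS


# 0.75 x 4.0 = 3.0 : 126 + 129 - 127 = 128
print(mul_exponent(biased_exponent(0.75), biased_exponent(4.0)), biased_exponent(3.0))

# 3.0 / 4.0 = 0.75 : 128 - 129 + 127 = 126
print(div_exponent(biased_exponent(3.0), biased_exponent(4.0)), biased_exponent(0.75))
```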

IEEE 754: intermediate results keep 3 extra bits
x = 1.b_-1 b_-2 ... b_-23 b_-24 b_-25 b_-26

rounding schemes: truncation, Von Neumann, round-to-nearest-even
error ≡ round(x) - x

truncation:
b_-24 b_-25 b_-26	round(x)
000 - 111	1.b_-1 b_-2 ... b_-23
error is never positive, so it accumulates with successive operations

Von Neumann:
b_-24 b_-25 b_-26	round(x)
000	1.b_-1 b_-2 ... b_-23
001 - 111	1.b_-1 b_-2 ... b_-22 1
error tends to cancel out with successive operations

round-to-nearest-even:
b_-24 b_-25 b_-26	round(x)
000 - 011	1.b_-1 b_-2 ... b_-23
100	1.b_-1 b_-2 ... b_-23 (if b_-23 == 0)
100	1.b_-1 b_-2 ... b_-23 + 2^-23 (if b_-23 == 1)
101 - 111	1.b_-1 b_-2 ... b_-23 + 2^-23
error w.r.t. b_-23 ∈ [-.100, +.100] (binary, i.e. at most half the weight of b_-23)
error tends to cancel out and has smaller range
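The three rounding schemes can be sketched on an integer whose low 3 bits are the extra bits b_-24 b_-25 b_-26. The function names are illustrative, the examples use short 4-bit kept parts instead of a full 24-bit significand, and a round-up that carries out of the significand (which would need renormalization) is not handled.

```python
def truncate(x: int) -> int:
    """Drop the 3 extra bits."""
    return x >> 3


def von_neumann(x: int) -> int:
    """If any extra bit is set, force the last kept bit to 1; otherwise keep as is."""
    kept = x >> 3
    return kept | 1 if x & 0b111 else kept


def round_nearest_even(x: int) -> int:
    """Round to nearest; on a tie (extra bits == 100) round so the last kept bit is 0."""
    kept, extra = x >> 3, x & 0b111
    if extra > 0b100 or (extra == 0b100 and kept & 1):
        kept += 1
    return kept


def mean_error(scheme) -> float:
    """Average of round(x) - x, in units of the last kept bit, over 16 patterns."""
    values = range(0b1000_000, 0b1010_000)  # last kept bit 0 and 1, all 8 extra patterns
    return sum(scheme(x) - x / 8 for x in values) / len(values)


print(f"{round_nearest_even(0b1010_100):04b}")  # tie, last kept bit 0 -> 1010
print(f"{round_nearest_even(0b1011_100):04b}")  # tie, last kept bit 1 -> 1100
print(mean_error(truncate))            # -0.4375: biased, so errors accumulate
print(mean_error(von_neumann))         # 0.0: errors cancel on average
print(mean_error(round_nearest_even))  # 0.0: errors cancel, and |error| <= 0.5
```

Over these uniformly spaced patterns truncation shows a constant negative bias, while the other two schemes average to zero; round-to-nearest-even additionally keeps each individual error within half of the last kept bit.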