Extended precision
Floating point precisions 

IEEE 754 
Other 
Extended precision refers to floating point number formats that provide greater precision and more exponent range than the basic floating point formats.^{[1]} Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitraryprecision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).
Contents
Extended precision implementations
IBM extended precision formats
The IBM 1130 offered two floating point formats: a 32bit "standard precision" format and a 40bit "extended precision" format. Standard precision format contained a 24bit two's complement significand while extended precision utilized a 32bit two's complement significand. The latter format could make full use of the cpu's 32bit integer operations. The characteristic in both formats was an 8bit field containing the power of two biased by 128. Floatingpoint arithmetic operations were performed by software, and double precision was not supported at all. The extended format occupied three 16bit words, with the extra space simply ignored.^{[2]}
The IBM System/360 supports a 32bit "short" floating point format and a 64bit "long" floating point format.^{[3]} The 360/85 and followon System/370 added support for a 128bit "extended" format.^{[4]} These formats are still supported in the current design, where they are now called the "hexadecimal floating point" (HFP) formats.
IEEE 754 extended precision formats
The IEEE 754 floating point standard recommends that implementations provide extended precision formats. The standard specifies the minimum requirements for an extended format but does not specify an encoding.^{[5]} The encoding is the implementor's choice.^{[6]}
The IA32 and x8664 and Itanium processors support an 80bit "double extended" extended precision format with a 64bit significand. The Intel 8087 math coprocessor was the first x86 device which supported floating point arithmetic in hardware. It was designed to support a 32bit "single precision" format and a 64bit "double precision" format for encoding and interchanging floating point numbers. The temporary real (extended) format was designed not to store data at higher precision as such, but rather primarily to allow for the computation of double results more reliably and accurately by minimising overflow and roundofferrors in intermediate calculations:^{[7]}^{[8]}^{[9]} for example, many floating point algorithms (e.g. exponentiation) suffer from significant precision loss when computed using the most direct implementations. To mitigate such issues the internal registers in the 8087 were designed to hold intermediate results in an 80bit "extended precision" format. The 8087 automatically converts numbers to this format when loading floating point registers from memory and also converts results back to the more conventional formats when storing the registers back into memory. To enable intermediate subexpressions results to be saved in extended precision scratch variables and continued across programming language statements, and otherwise interrupted calculations to resume where they were interrupted, it provides instructions which transfer values between these internal registers and memory without performing any conversion, which therefore enables access to the extended format for calculations^{[10]} also reviving the issue of the accuracy of functions of such numbers, but at a higher precision.
The Floating Point Unit on all subsequent x86 processors have supported this format. As a result software can be developed which takes advantage of the higher precision provided by this format. William Kahan, a primary designer of the x87 arithmetic and initial IEEE 754 standard proposal notes on the development of the x87 floating point: "An Extended format as wide as we dared (80 bits) was included to serve the same support role as the 13decimal internal format serves in HewlettPackard’s 10decimal calculators."^{[11]} Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087,^{[12]} and that the x87 extended precision was designed to be extensible to higher precision in future processors: "For now the 10byte Extended format is a tolerable compromise between the value of extraprecise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16byte format... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for FloatingPoint Arithmetic was framed."^{[13]}
The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors support this same 64bit significand extended precision type (similar to the Intel format although padded to a 96bit format with 16 unused bits inserted between the exponent and significand fields^{[14]}). The followon Coldfire processors do not support this 96bit extended precision format.^{[15]}
The x87 and Motorola 68881 80bit formats meet the requirements of the IEEE 754 double extended format,^{[16]} as does the IEEE 754 128bit format.
x86 Extended Precision Format
The x86 Extended Precision Format is an 80bit format first implemented in the Intel 8087 math coprocessor and is supported by all processors that are based on the x86 design which incorporate a floating point unit. This 80bit format uses one bit for the sign of the significand, 15 bits for the exponent field (i.e. the same range as the 128bit quadruple precision IEEE 754 format) and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2.^{[17]} An exponent field value of 32767 (all fifteen bits 1) is reserved so as to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a denormal number and the exponent of 2 is −16382.^{[18]}
File:X86 Extended Floating Point Format.svg
In the following table, "s" is the value of the sign bit (0 means positive, 1 means negative), "e" is the value of the exponent field interpreted as a positive integer, and "m" is the significand interpreted as a positive binary number where the binary point is located between bits 63 and 62. The "m" field is the combination of the integer and fraction parts in the above diagram.
Exponent  Significand  Meaning  

All Zeros  Bit 63  Bits 620  
Zero  Zero  Zero. The sign bit gives the sign of the zero.  
Nonzero  Denormal. The value is (−1)^{s} × m × 2^{−16382}  
One  Anything  Pseudo Denormal. The 80387 and later properly interpret this value but will not generate it.
The value is (−1)^{s} × m × 2^{−16382} 

All Ones  Bits 63,62  Bits 610  
00  Zero  PseudoInfinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as Infinity. The 80387 and later treat this as an invalid operand.  
Nonzero  Pseudo Not a Number. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.  
01  Anything  Pseudo Not a Number. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.  
10  Zero  Infinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as a Signaling Not a Number. The 8087 and 80287 coprocessors used the pseudoinfinity representation for infinities.  
Nonzero  Signalling Not a Number, the sign bit is meaningless.  
11  Zero  Floatingpoint Indefinite, the result of invalid calculations such as square root of a negative number, logarithm of a negative number, 0/0, infinity / infinity, infinity times 0, and others when the processor has been configured to not generate exceptions for invalid operands. The sign bit is meaningless. This is a special case of a Quiet Not a Number.  
Nonzero  Quiet Not a Number, the sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number.  
All other values  Bit 63  Bits 620  
Zero  Anything  Unnormal. Only generated on the 8087 and 80287. The 80387 and later treat this as an invalid operand.
The value is (−1)^{s} × m × 2^{e−16383} 

One  Anything  Normalized value.
The value is (−1)^{s} × m × 2^{e−16383} 
In contrast to the single and doubleprecision formats, this format does not utilize an implicit/hidden bit. Rather, bit 63 contains the integer part of the significand and bits 620 hold the fractional part. Bit 63 will be 1 on all normalized numbers. There were several advantages to this design when the 8087 was being developed:
 Calculations can be completed a little faster if all bits of the significand are present in the register.
 A 64bit significand provides sufficient precision to avoid loss of precision when the results are converted back to double precision format in the vast number of cases.
 This format provides a mechanism for indicating precision loss due to underflow which can be carried through further operations. For example, the calculation 2×10^{−4930} × 3×10^{−10} × 4×10^{20} generates the intermediate result 6×10^{−4940} which is a denormal and also involves precision loss. The product of all of the terms is 24×10^{−4920} which can be represented as a normalized number. The 80287 could complete this calculation and indicate the loss of precision by returning an "unnormal" result (exponent not 0, bit 63 = 0).^{[19]}^{[20]} Processors since the 80387 no longer generate unnormals and do not support unnormal inputs to operations. They will generate a denormal if an underflow occurs but will generate a normalized result if subsequent operations on the denormal can be normalized.^{[21]}
Introduction to use
The 80bit floating point format was widely available by 1984,^{[22]} after the development of C, Fortran and similar computer languages, which initially offered only the common 32 and 64bit floating point sizes. On the x86 design most C compilers now support 80bit extended precision via the long double type, and this was specified in the C99 / C11 standards (IEC 60559 floatingpoint arithmetic (Annex F)). Compilers on x86 for other languages often support extended precision as well, sometimes via nonstandard extensions: for example, Turbo Pascal offers an extended
type, and several Fortran compilers have a REAL*10
type (analogous to REAL*4
and REAL*8
). Such compilers also typically include extendedprecision mathematical subroutines, such as square root and trigonometric functions, in their standard libraries.
Working range
The 80bit floating point format has a range (including subnormals) from approximately 3.65×10^{−4951} to 1.18×10^{4932}. Although log_{10}(2^{64}) ≅ 19.266, this format is usually described as giving approximately eighteen significant digits of precision. The use of decimal when talking about binary is unfortunate because most decimal fractions are recurring sequences in binary just as 2/3 is in decimal. Thus, a value such as 10.15 is represented in binary as equivalent to 10·1499996185 etc. in decimal for REAL*4 but 10·15000000000000035527etc. in REAL*8: interconversion will involve approximation except for those few decimal fractions that represent an exact binary value, such as 0.625. For REAL*10, the decimal string is 10.1499999999999999996530553etc. The last 9 digit is the eighteenth fractional digit and thus the twentieth significant digit of the string. Bounds on conversion between decimal and binary for the 80bit format can be given as follows: if a decimal string with at most 18 significant digits is correctly rounded to an 80bit IEEE 754 binary floating point value (as on input) then converted back to the same number of significant decimal digits (as for output), then the final string will exactly match the original; while, conversely, if an 80bit IEEE 754 binary floating point value is correctly converted and (nearest) rounded to a decimal string with at least 21 significant decimal digits then converted back to binary format it will exactly match the original.^{[16]} These approximations are particularly troublesome when specifying the best value for constants in formulae to high precision, as might be calculated via arbitrary precision arithmetic.
Need for the 80bit format
A notable example of the need for a minimum of 64 bits of precision in the significand of the extended precision format is the need to avoid precision loss when performing exponentiation on double precision values.^{[23]}^{[24]}^{[25]}^{[26]} The x86 floating point units do not provide an instruction that directly performs exponentiation. Instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:
In order to avoid precision loss, the intermediate results "log_{2} x" and "y log_{2} x" must be computed with much higher precision because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently the significand field of the intermediate result is split between the exponent and significand fields of the final result when 2^{intermediate result} is calculated. The following discussion describes this requirement in more detail.
An IEEE 754 double precision value can be represented as:
where s is the sign of the exponent (either 0 or 1), E is the unbiased exponent which is an integer that ranges from 0 to 1023, and M is the significand which is a 53bit value that falls in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision because it is constrained to be greater than or equal to one i.e. the hidden bit does not count towards the precision (Note that in situations where M is less than 1, the value is actually a denormal and therefore may have already suffered precision loss. This situation is beyond the scope of this article).
Taking the log of this representation of a double precision number and simplifying results in the following:
This result demonstrates that when taking base2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.
Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of log_{2} M will fall in the range 0 ≤ log_{2} M < 1 so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice values of M less than require 53 bits to the right of the radix point and values of M less than require 54 bits to the right of the radix point to avoid precision loss. Balancing this requirement for added precision to the right of the radix point, exponents less than 512 only require 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.
The final part of the exponentiation calculation is computing 2^{intermediate result}. The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative then a slight adjustment is needed to get a positive fractional part because both "I" and "F" are negative numbers.
For positive intermediate results:
For negative intermediate results:
Thus the integer part of the intermediate result ("I" or "I1") plus a bias becomes the exponent of the final result and transformed positive fractional part of the intermediate result: 2^{F} or 2^{1+F} becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.
In conclusion, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double precision numbers.
The number of bits needed for the exponent of the extended precision format follows from the requirement that the product of two double precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double precision value is 1023 so the exponent of the largest possible product of two double precision numbers is 2047 (an 11bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.
Combining these requirements: 1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand means that the extended precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80bit format (in particular the IEEE 754 standard requires the exponent range of an extended precision format to match that of the next largest, quad, precision format which is 15 bits).^{[24]}
Another example benefitting from extended precision arithmetic is iterative refinement schemes in numerical linear algebra.^{[27]}
Language support
 Some C/C++ implementations (e.g. GCC, Clang, Intel C++) implement long double using 80bit floating point numbers on x86 systems. However, this is implementationdefined behavior and is not required, but allowed by the standard, as specified for IEEE 754 hardware in the C99 standard "Annex F IEC 60559 floatingpoint arithmetic".
 D programming language implements real using largest floating point size implemented in hardware (80 bits for x86 CPUs or double precision, whichever is larger).
 The Swift standard library provides the Float80 datatype
 The Racket runtime system provies the 80bit extflonum datatype on x86 systems.
 Delphi (Pascal) has in addition to SINGLE and DOUBLE the EXTENDED Type
See also
 MPFR – the GNU "Multiple Precision FloatingPoint Reliably" library for C
 IBM Floating Point Architecture
 IEEE 754
 long double
References
 ↑ IEEE 754 (2008, ¶ 2.1.21) defines extended precision format as "A format that extends a supported basic format by providing wider precision and range."
 ↑ IBM 1130 Subroutine Library 9th ed (PDF). IBM Corporation. 1974. p. 93.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ IBM System/360 Principles of Operation (9th ed.). IBM Corporation. 1970.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>, p 41
 ↑ IBM System/370 Principles of Operation (7th ed.). IBM Corporation. 1980.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>, pp 92 thru 93
 ↑ IEEE Computer Society (August 29, 2008), IEEE Standard for FloatingPoint Arithmetic, IEEE, doi:10.1109/IEEESTD.2008.4610935, ISBN 9780738157528, IEEE Std 7542008<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>, § 3.7
 ↑ Kevin Brewer. "Kevin's Report". IEEE754 Reference Material. Retrieved 20120219.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ x87 designer Kahan notes: "This format is intended mainly to help programmers enhance the integrity of their Single and Double software, and to attenuate degradation by roundoff in Double matrix computations of larger dimensions, and can easily be used in such a way that substituting Quadruple for Extended need never invalidate its use."William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary FloatingPoint Arithmetic" (PDF). p. 5.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Bo Einarsson (2005). Accuracy and reliability in scientific computing. SIAM. pp. 9–. ISBN 9780898718157. Retrieved 3 May 2013.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Intel corp. (March 2012). "Intel® 64 and IA32 Architectures Software Developer's Manual. Vol. 1 sec. 8.2".<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ "Highlevel languages will use extended (invisibly) to evaluate intermediate subexpressions, and later may provide extended as a declarable data type." (page 70) Jerome T. Coonen (January 1980). "An Implementation Guide to a Proposed Standard for FloatingPoint Arithmetic". IEEE Computer: 68–79.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ William Kahan (22 November 1983). "Mathematics Written in Sand  the hp15C, Intel 8087, etc" (PDF).<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ David Goldberg (March 1991). "What every computer scientist should know about floatingpoint arithmetic. ACM Computing Surveys, volume 23, issue 1" (PDF). p. 192.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 43.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Motorola MC68000 Family Programmer's Reference Manual (PDF). Freescale Semiconductor. 1992. pp. 1–16.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ColdFire Family Programmer’s Reference Manual (PDF). Freescale semiconductor. 2005. p. 77.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{16.0} ^{16.1} William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary FloatingPoint Arithmetic" (PDF).<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Intel 80C187 datasheet
 ↑ Intel® 64 and IA32 Architectures Developer's Manual: Vol. 1. Intel Corporation. pp. 46 thru 49 and 418 thru 421.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Palmer, John F.; Morse, Stephen P. (1984). The 8087 Primer. Wiley Press. p. 14. ISBN 0471875694.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 91–111. ISBN 0471831859.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Intel® 64 and IA32 Architectures Developer's Manual: Vol. 1. Intel Corporation. pp. 821 thru 822.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Charles Severance (20 February 1998). "An Interview with the Old Man of FloatingPoint".<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Palmer, John F.; Morse, Stephen P. (1984). The 8087 Primer. Wiley Press. p. 16. ISBN 0471875694.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{24.0} ^{24.1} Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 96–98. ISBN 0471831859.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Hough, David (March 1981). "Applications of the proposed IEEE 754 standard for floating point arithmetic". IEEE Computer. 14 (3): 70–74. doi:10.1109/CM.1981.220381.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ "The presence of at least as many extra bits of precision in extended as in the exponent field of the basic format it supports greatly simplifies the accurate computation of the transcendental functions, inner products, and the power function y^{x}." (page 70) Jerome T. Coonen (January 1980). "An Implementation Guide to a Proposed Standard for FloatingPoint Arithmetic". IEEE Computer: 68–79.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Demmel, James; Hida, Yozo; Kahan, William; Li, Xiaoye S.; Mukherjee, Sonil; Riedy, E. Jason (June 2006). "Error Bounds from ExtraPrecise Iterative Refinement" (PDF). ACM Transactions on Mathematical Software. 32 (2): 325–351. doi:10.1145/1141885.1141894. Retrieved 20140418.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>