@node Floating-Point Limits
@chapter Floating-Point Limits
@pindex float.h
@cindex floating-point number representation
@cindex representation of floating-point numbers

Because floating-point numbers are represented internally as approximate
quantities, algorithms for manipulating floating-point data often need
to be parameterized in terms of the accuracy of the representation.
Some of the functions in the C library itself need this information; for
example, the algorithms for printing and reading floating-point numbers
(@pxref{I/O on Streams}) and for calculating trigonometric and
irrational functions (@pxref{Mathematics}) use information about the
underlying floating-point representation to avoid round-off error and
loss of accuracy.  User programs that implement numerical analysis
techniques also often need to be parameterized in this way in order to
minimize or compute error bounds.

The specific representation of floating-point numbers varies from
machine to machine.  The GNU C Library defines a set of parameters that
characterize each of the supported floating-point representations on a
particular system.

@menu
* Floating-Point Representation::       Definitions of terminology.
* Floating-Point Parameters::           Descriptions of the library facilities.
* IEEE Floating Point::                 An example of a common representation.
@end menu

@node Floating-Point Representation
@section Floating-Point Representation

This section introduces the terminology used to characterize the
representation of floating-point numbers.

You are probably already familiar with most of these concepts in terms
of scientific or exponential notation for floating-point numbers.  For
example, the number @code{123456.0} could be expressed in exponential
notation as @code{1.23456e+05}, a shorthand notation indicating that the
mantissa @code{1.23456} is multiplied by the base @code{10} raised to
the power @code{5}.
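As a small illustration of the notation above, the following sketch
formats a value in exponential notation with the standard
@code{snprintf} function.  The helper name @code{format_exponential} is
hypothetical and exists only for this example.

```c
#include <stdio.h>

/* Hypothetical helper: write VALUE into BUF in exponential notation,
   with DIGITS digits after the decimal point of the mantissa.
   Returns whatever snprintf returns (the length of the full text).  */
int
format_exponential (char *buf, size_t size, double value, int digits)
{
  return snprintf (buf, size, "%.*e", digits, value);
}
```

Calling @code{format_exponential (buf, sizeof buf, 123456.0, 5)} stores
the text @code{1.23456e+05} in @code{buf}.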
More formally, the internal representation of a floating-point number
can be characterized in terms of the following parameters:

@itemize @bullet
@item
The @dfn{sign} is either @code{-1} or @code{1}.
@cindex sign (of floating-point number)

@item
The @dfn{base} or @dfn{radix} for exponentiation; an integer greater
than @code{1}.  This is a constant for the particular representation.
@cindex base (of floating-point number)
@cindex radix (of floating-point number)

@item
The @dfn{exponent} to which the base is raised.  The upper and lower
bounds of the exponent value are constants for the particular
representation.
@cindex exponent (of floating-point number)

Sometimes, in the actual bits representing the floating-point number,
the exponent is @dfn{biased} by adding a constant to it, to make it
always be represented as an unsigned quantity.  This is only important
if you have some reason to pick apart the bit fields making up the
floating-point number by hand, which is something for which the GNU
library provides no support.  So this is ignored in the discussion that
follows.
@cindex bias, in exponent (of floating-point number)

@item
The value of the @dfn{mantissa} or @dfn{significand}, which is an
unsigned quantity.
@cindex mantissa (of floating-point number)
@cindex significand (of floating-point number)

@item
The @dfn{precision} of the mantissa.  If the base of the representation
is @var{b}, then the precision is the number of base-@var{b} digits in
the mantissa.  This is a constant for the particular representation.

Many floating-point representations have an implicit @dfn{hidden bit} in
the mantissa.  Any such hidden bits are counted in the precision.
Again, the GNU library provides no facilities for dealing with such
low-level aspects of the representation.
@cindex precision (of floating-point number)
@cindex hidden bit, in mantissa (of floating-point number)
@end itemize

The mantissa of a floating-point number actually represents an implicit
fraction whose denominator is the base raised to the power of the
precision.  Since the largest representable mantissa is one less than
this denominator, the value of the fraction is always strictly less than
@code{1}.  The mathematical value of a floating-point number is then the
product of this fraction, the sign, and the base raised to the exponent.

If the floating-point number is @dfn{normalized}, the mantissa is also
greater than or equal to the base raised to the power of one less
than the precision (unless the number represents a floating-point zero,
in which case the mantissa is zero).  The fractional quantity is
therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
the base.
@cindex normalized floating-point number

@node Floating-Point Parameters
@section Floating-Point Parameters

@strong{Incomplete:}  This section needs some more concrete examples
of what these parameters mean and how to use them in a program.

These macro definitions can be accessed by including the header file
@file{float.h} in your program.

Macro names starting with @samp{FLT_} refer to the @code{float} type,
while names beginning with @samp{DBL_} refer to the @code{double} type
and names beginning with @samp{LDBL_} refer to the @code{long double}
type.  (In implementations that do not support @code{long double} as
a distinct data type, the values for those constants are the same
as the corresponding constants for the @code{double} type.)@refill

Note that only @code{FLT_RADIX} is guaranteed to be a constant
expression, so the other macros listed here cannot be reliably used in
places that require constant expressions, such as @samp{#if}
preprocessing directives and array size specifications.
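The following sketch shows one way a program might use these macros at
run time, where it does not matter whether they are constant
expressions.  The function names are hypothetical and chosen only for
illustration.  The two @code{frexpf} helpers assume that
@code{FLT_RADIX} is @code{2}, because @code{frexpf} always splits a
value into a binary fraction and a power of 2.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Print the <float.h> parameters for type float.  The values are
   evaluated when the function runs, not by the preprocessor.  */
void
print_float_parameters (void)
{
  printf ("FLT_RADIX    = %d\n", FLT_RADIX);
  printf ("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);
  printf ("FLT_DIG      = %d\n", FLT_DIG);
  printf ("FLT_MIN_EXP  = %d\n", FLT_MIN_EXP);
  printf ("FLT_MAX_EXP  = %d\n", FLT_MAX_EXP);
  printf ("FLT_MIN      = %e\n", FLT_MIN);
  printf ("FLT_MAX      = %e\n", FLT_MAX);
  printf ("FLT_EPSILON  = %e\n", FLT_EPSILON);
}

/* Return the exponent E for which X equals FRACTION * 2^E with
   |FRACTION| in [1/2, 1).  X must be nonzero and finite.
   Assumption: FLT_RADIX is 2.  */
int
binary_exponent (float x)
{
  int exponent;
  (void) frexpf (x, &exponent);
  return exponent;
}

/* Return nonzero if the fraction reported by frexpf for the nonzero,
   finite value X lies in [1/2, 1), that is, in [1/b, 1) for b = 2.  */
int
fraction_is_normalized (float x)
{
  int exponent;
  float fraction = frexpf (x, &exponent);
  if (fraction < 0)
    fraction = -fraction;
  return fraction >= 0.5f && fraction < 1.0f;
}
```

Under the stated assumption, @code{binary_exponent (FLT_MIN)} equals
@code{FLT_MIN_EXP} and @code{binary_exponent (FLT_MAX)} equals
@code{FLT_MAX_EXP}, which matches the convention that the fraction is
strictly less than @code{1}.  On some systems the program must be linked
with the math library for @code{frexpf} to be found.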
Although the ANSI C standard specifies minimum and maximum values for
most of these parameters, the GNU C implementation uses whatever
floating-point representations are supported by the underlying hardware.
So whether GNU C actually satisfies the ANSI C requirements depends on
what machine it is running on.

@comment float.h
@comment ANSI
@defvr Macro FLT_ROUNDS
This value characterizes the rounding mode for floating-point addition.
The following values indicate standard rounding modes:

@table @code
@item -1
The mode is indeterminable.
@item 0
Rounding is towards zero.
@item 1
Rounding is to the nearest number.
@item 2
Rounding is towards positive infinity.
@item 3
Rounding is towards negative infinity.
@end table

@noindent
Any other value represents a machine-dependent nonstandard rounding
mode.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_RADIX
This is the value of the base, or radix, of exponent representation.
This is guaranteed to be a constant expression, unlike the other macros
described in this section.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{float} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{long double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_DIG
This is the number of decimal digits of precision for the @code{float}
data type.
Technically, if @var{p} and @var{b} are the precision and
base (respectively) for the representation, then the decimal precision
@var{q} is the maximum number of decimal digits such that any
floating-point number with @var{q} base 10 digits can be rounded to a
floating-point number with @var{p} base @var{b} digits and back again,
without change to the @var{q} decimal digits.

The value of this macro is guaranteed to be at least @code{6}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{double} data
type.  The value of this macro is guaranteed to be at least @code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{long double}
data type.  The value of this macro is guaranteed to be at least
@code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_EXP
This is the minimum negative integer such that @code{FLT_RADIX} raised
to one less than this power can be represented as a normalized
floating-point number of type @code{float}.  In terms of the actual
implementation, this is just the smallest value that can be represented
in the exponent field of the number.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_10_EXP
This is the minimum negative integer such that @code{10} raised to this
power can be represented as a normalized floating-point number of type
@code{float}.  This is guaranteed to be no greater than @code{-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the
@code{long double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_EXP
This is the maximum integer such that @code{FLT_RADIX} raised to one
less than this power can be represented as a floating-point number of
type @code{float}.  In terms of the actual implementation, this is just
the largest value that can be represented in the exponent field of the
number.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_10_EXP
This is the maximum integer such that @code{10} raised to this power can
be represented as a normalized floating-point number of type
@code{float}.  This is guaranteed to be at least @code{37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the
@code{long double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX
The value of this macro is the maximum representable floating-point
number of type @code{float}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{long double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{float}, and is
guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{double}, and
is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{long double},
and is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_EPSILON
This is the difference between @code{1} and the smallest floating-point
number of type @code{float} that is greater than @code{1}.  It is
guaranteed to be no greater than @code{1E-5}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{double}
type.  It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
type.  It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@node IEEE Floating Point
@section IEEE Floating Point

Here is an example showing how these parameters work for a common
floating-point representation, specified by the @cite{IEEE Standard for
Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)}.

The IEEE single-precision float representation uses a base of 2.  There
is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
precision is 24 base-2 digits), and an 8-bit exponent that can represent
values in the range -125 to 128, inclusive.

So, for an implementation that uses this representation for the
@code{float} data type, appropriate values for the corresponding
parameters are:

@example
FLT_RADIX                       2
FLT_MANT_DIG                   24
FLT_DIG                         6
FLT_MIN_EXP                  -125
FLT_MIN_10_EXP                -37
FLT_MAX_EXP                   128
FLT_MAX_10_EXP                +38
FLT_MIN           1.17549435E-38F
FLT_MAX           3.40282347E+38F
FLT_EPSILON       1.19209290E-07F
@end example
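The relationships among the values above can be checked with a short
sketch.  It assumes that @code{FLT_RADIX} is @code{2}, as it is for the
IEEE representation, and uses the standard @code{ldexpf} function to
scale by powers of 2.  The function names are hypothetical and serve
only this example.

```c
#include <float.h>
#include <math.h>

/* Smallest normalized positive float: b^(FLT_MIN_EXP - 1).
   Assumption: FLT_RADIX is 2.  */
float
min_normalized_float (void)
{
  return ldexpf (1.0f, FLT_MIN_EXP - 1);
}

/* Gap between 1 and the next larger float: b^(1 - FLT_MANT_DIG).  */
float
float_epsilon (void)
{
  return ldexpf (1.0f, 1 - FLT_MANT_DIG);
}

/* Largest finite float: (1 - b^(-FLT_MANT_DIG)) * b^FLT_MAX_EXP.
   The fraction is the largest one below 1 that fits in the
   mantissa, so the product does not overflow.  */
float
max_finite_float (void)
{
  float fraction = 1.0f - ldexpf (1.0f, -FLT_MANT_DIG);
  return ldexpf (fraction, FLT_MAX_EXP);
}
```

With the IEEE single-precision parameters, these functions compute
2^-126, 2^-23, and (1 - 2^-24) * 2^128, which correspond to the
@code{FLT_MIN}, @code{FLT_EPSILON}, and @code{FLT_MAX} values listed in
the table.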