Diffstat (limited to 'manual/=float.texinfo')
-rw-r--r-- manual/=float.texinfo 414
1 files changed, 0 insertions, 414 deletions
diff --git a/manual/=float.texinfo b/manual/=float.texinfo
deleted file mode 100644
index d4e3920f8c..0000000000
--- a/manual/=float.texinfo
+++ /dev/null
@@ -1,414 +0,0 @@
-@node Floating-Point Limits
-@chapter Floating-Point Limits
-@pindex <float.h>
-@cindex floating-point number representation
-@cindex representation of floating-point numbers
-
-Because floating-point numbers are represented internally as approximate
-quantities, algorithms for manipulating floating-point data often need
-to be parameterized in terms of the accuracy of the representation.
-Some of the functions in the C library itself need this information; for
-example, the algorithms for printing and reading floating-point numbers
-(@pxref{I/O on Streams}) and for calculating trigonometric and
-irrational functions (@pxref{Mathematics}) use information about the
-underlying floating-point representation to avoid round-off error and
-loss of accuracy. User programs that implement numerical analysis
-techniques also often need to be parameterized in this way in order to
-minimize or compute error bounds.
-
-The specific representation of floating-point numbers varies from
-machine to machine. The GNU C Library defines a set of parameters which
-characterize each of the supported floating-point representations on a
-particular system.
-
-@menu
-* Floating-Point Representation:: Definitions of terminology.
-* Floating-Point Parameters:: Descriptions of the library facilities.
-* IEEE Floating Point:: An example of a common representation.
-@end menu
-
-@node Floating-Point Representation
-@section Floating-Point Representation
-
-This section introduces the terminology used to characterize the
-representation of floating-point numbers.
-
-You are probably already familiar with most of these concepts in terms
-of scientific or exponential notation for floating-point numbers. For
-example, the number @code{123456.0} could be expressed in exponential
-notation as @code{1.23456e+05}, a shorthand notation indicating that the
-mantissa @code{1.23456} is multiplied by the base @code{10} raised to
-the power @code{5}.
-
-More formally, the internal representation of a floating-point number
-can be characterized in terms of the following parameters:
-
-@itemize @bullet
-@item
-The @dfn{sign} is either @code{-1} or @code{1}.
-@cindex sign (of floating-point number)
-
-@item
-The @dfn{base} or @dfn{radix} for exponentiation; an integer greater
-than @code{1}. This is a constant for the particular representation.
-@cindex base (of floating-point number)
-@cindex radix (of floating-point number)
-
-@item
-The @dfn{exponent} to which the base is raised. The upper and lower
-bounds of the exponent value are constants for the particular
-representation.
-@cindex exponent (of floating-point number)
-
-Sometimes, in the actual bits representing the floating-point number,
-the exponent is @dfn{biased} by adding a constant to it, to make it
-always be represented as an unsigned quantity. This is only important
-if you have some reason to pick apart the bit fields making up the
-floating-point number by hand, which is something for which the GNU
-library provides no support. So this is ignored in the discussion that
-follows.
-@cindex bias, in exponent (of floating-point number)
-
-@item
-The value of the @dfn{mantissa} or @dfn{significand}, which is an
-unsigned quantity.
-@cindex mantissa (of floating-point number)
-@cindex significand (of floating-point number)
-
-@item
-The @dfn{precision} of the mantissa. If the base of the representation
-is @var{b}, then the precision is the number of base-@var{b} digits in
-the mantissa. This is a constant for the particular representation.
-
-Many floating-point representations have an implicit @dfn{hidden bit} in
-the mantissa. Any such hidden bits are counted in the precision.
-Again, the GNU library provides no facilities for dealing with such low-level
-aspects of the representation.
-@cindex precision (of floating-point number)
-@cindex hidden bit, in mantissa (of floating-point number)
-@end itemize
-
-The mantissa of a floating-point number actually represents an implicit
-fraction whose denominator is the base raised to the power of the
-precision. Since the largest representable mantissa is one less than
-this denominator, the value of the fraction is always strictly less than
-@code{1}. The mathematical value of a floating-point number is then the
-product of this fraction, the sign, and the base raised to the exponent.
-
-If the floating-point number is @dfn{normalized}, the mantissa is also
-greater than or equal to the base raised to the power of one less
-than the precision (unless the number represents a floating-point zero,
-in which case the mantissa is zero). The fractional quantity is
-therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
-the base.
-@cindex normalized floating-point number
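The model described in this section can be sketched in C. This is only an illustration under stated assumptions: `decompose` is a hypothetical helper, not a library function, and it uses the convention of this section that the fraction of a normalized number lies between 1/b (inclusive) and 1 (exclusive). The repeated scaling is exact when `FLT_RADIX` is 2 and the argument is not subnormal.

```c
#include <float.h>

/* Hypothetical helper (not a library function): split a positive,
   finite X into a fraction F and an exponent E such that
   X == F * FLT_RADIX^E and 1/FLT_RADIX <= F < 1.
   For zero, negative, infinite or NaN input, store 0 in *EXPONENT
   and return 0.  */
double
decompose (double x, int *exponent)
{
  int e = 0;

  if (!(x > 0.0) || x > DBL_MAX)
    {
      *exponent = 0;
      return 0.0;
    }

  /* Scale down until the fraction is strictly less than 1.  */
  while (x >= 1.0)
    {
      x /= FLT_RADIX;
      e++;
    }
  /* Scale up until the fraction is at least 1/FLT_RADIX.  */
  while (x < 1.0 / FLT_RADIX)
    {
      x *= FLT_RADIX;
      e--;
    }

  *exponent = e;
  return x;
}
```

With a base of 2, `decompose (123456.0, &e)` yields the fraction 0.94189453125 and stores 17 in `e`, since 123456 is 0.94189453125 times 2 raised to the power 17.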
-
-@node Floating-Point Parameters
-@section Floating-Point Parameters
-
-@strong{Incomplete:} This section needs some more concrete examples
-of what these parameters mean and how to use them in a program.
-
-These macro definitions can be accessed by including the header file
-@file{<float.h>} in your program.
-
-Macro names starting with @samp{FLT_} refer to the @code{float} type,
-while names beginning with @samp{DBL_} refer to the @code{double} type
-and names beginning with @samp{LDBL_} refer to the @code{long double}
-type. (In implementations that do not support @code{long double} as
-a distinct data type, the values for those constants are the same
-as the corresponding constants for the @code{double} type.)@refill
-
-Note that only @code{FLT_RADIX} is guaranteed to be a constant
-expression, so the other macros listed here cannot be reliably used in
-places that require constant expressions, such as @samp{#if}
-preprocessing directives and array size specifications.
-
-Although the @w{ISO C} standard specifies minimum and maximum values for
-most of these parameters, the GNU C implementation uses whatever
-floating-point representations are supported by the underlying hardware.
-So whether GNU C actually satisfies the @w{ISO C} requirements depends on
-what machine it is running on.
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_ROUNDS
-This value characterizes the rounding mode for floating-point addition.
-The following values indicate standard rounding modes:
-
-@table @code
-@item -1
-The mode is indeterminable.
-@item 0
-Rounding is towards zero.
-@item 1
-Rounding is to the nearest number.
-@item 2
-Rounding is towards positive infinity.
-@item 3
-Rounding is towards negative infinity.
-@end table
-
-@noindent
-Any other value represents a machine-dependent nonstandard rounding
-mode.
-@end defvr
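As a small sketch of how a program might report the value of `FLT_ROUNDS`, the following maps each value in the table above to a description. The function names `rounding_mode_name` and `current_rounding_mode` are hypothetical and exist only for this illustration.

```c
#include <float.h>

/* Hypothetical helper: describe a FLT_ROUNDS value in words.  */
const char *
rounding_mode_name (int mode)
{
  switch (mode)
    {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "machine-dependent";
    }
}

/* Describe the rounding mode that FLT_ROUNDS reports.  */
const char *
current_rounding_mode (void)
{
  return rounding_mode_name (FLT_ROUNDS);
}
```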
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_RADIX
-This is the value of the base, or radix, of exponent representation.
-This is guaranteed to be a constant expression, unlike the other macros
-described in this section.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{float} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{double} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MANT_DIG
-This is the number of base-@code{FLT_RADIX} digits in the floating-point
-mantissa for the @code{long double} data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_DIG
-This is the number of decimal digits of precision for the @code{float}
-data type. Technically, if @var{p} and @var{b} are the precision and
-base (respectively) for the representation, then the decimal precision
-@var{q} is the maximum number of decimal digits such that any floating
-point number with @var{q} base 10 digits can be rounded to a floating
-point number with @var{p} base @var{b} digits and back again, without
-change to the @var{q} decimal digits.
-
-The value of this macro is guaranteed to be at least @code{6}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_DIG
-This is similar to @code{FLT_DIG}, but is for the @code{double} data
-type. The value of this macro is guaranteed to be at least @code{10}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_DIG
-This is similar to @code{FLT_DIG}, but is for the @code{long double}
-data type. The value of this macro is guaranteed to be at least
-@code{10}.
-@end defvr
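The round-trip property behind `FLT_DIG` can be checked with a short sketch. The helper name `survives_float_round_trip` is hypothetical, and the input is assumed to be written exactly in the form that `printf` produces for `%e` with `FLT_DIG` significant digits (for example `1.23456e+05` when `FLT_DIG` is 6).

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#string_placeholder
```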
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN_EXP
-This is the minimum negative integer such that @code{FLT_RADIX} raised
-to one less than this power can be represented as a normalized
-floating-point number of type @code{float}. In terms of the actual
-implementation, this is just the smallest value that can be
-represented in the exponent field of the number.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN_EXP
-This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
-type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN_EXP
-This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN_10_EXP
-This is the minimum negative integer such that the mathematical value
-@code{10} raised to this power lies in the range of normalized
-floating-point numbers of type @code{float}. This is guaranteed to be
-no greater than @code{-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN_10_EXP
-This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN_10_EXP
-This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long
-double} data type.
-@end defvr
-
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX_EXP
-This is the maximum integer such that @code{FLT_RADIX} raised to one
-less than this power can be represented as a finite floating-point
-number of type @code{float}. In terms of the actual implementation,
-this is just the largest value that can be represented in the exponent
-field of the number.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX_EXP
-This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
-type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX_EXP
-This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX_10_EXP
-This is the maximum integer such that the mathematical value
-@code{10} raised to this power lies in the range of representable
-finite floating-point numbers of type @code{float}. This is
-guaranteed to be at least @code{37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX_10_EXP
-This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
-data type.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX_10_EXP
-This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long
-double} data type.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{float}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{double}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MAX
-The value of this macro is the maximum representable floating-point
-number of type @code{long double}, and is guaranteed to be at least
-@code{1E+37}.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{float}, and is
-guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{double}, and
-is guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_MIN
-The value of this macro is the minimum normalized positive
-floating-point number that is representable by type @code{long double},
-and is guaranteed to be no more than @code{1E-37}.
-@end defvr
-
-
-@comment float.h
-@comment ISO
-@defvr Macro FLT_EPSILON
-This is the difference between @code{1.0} and the smallest value of
-type @code{float} that is greater than @code{1.0}, so that
-@code{1.0 + FLT_EPSILON != 1.0} is true. It's guaranteed to be no
-greater than @code{1E-5}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro DBL_EPSILON
-This is similar to @code{FLT_EPSILON}, but is for the @code{double}
-type. The maximum value is @code{1E-9}.
-@end defvr
-
-@comment float.h
-@comment ISO
-@defvr Macro LDBL_EPSILON
-This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
-type. The maximum value is @code{1E-9}.
-@end defvr
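One common use of the epsilon macros is a relative comparison of two computed values. The following is a sketch of that idea, not a recommended general-purpose test: `nearly_equal` and its `tolerance` parameter are hypothetical names, and a suitable tolerance depends on the computation that produced the values.

```c
#include <float.h>

/* Hypothetical helper: return nonzero if A and B differ by at most
   TOLERANCE units of DBL_EPSILON, relative to the larger magnitude.  */
int
nearly_equal (double a, double b, double tolerance)
{
  double diff = a > b ? a - b : b - a;
  double mag_a = a < 0 ? -a : a;
  double mag_b = b < 0 ? -b : b;
  double scale = mag_a > mag_b ? mag_a : mag_b;

  return diff <= tolerance * DBL_EPSILON * scale;
}
```

For example, `nearly_equal (0.1 + 0.2, 0.3, 4.0)` is true on a system with IEEE double precision, although `0.1 + 0.2 == 0.3` is false there.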
-
-
-
-@node IEEE Floating Point
-@section IEEE Floating Point
-
-Here is an example showing how these parameters work for a common
-floating-point representation, specified by the @cite{IEEE Standard for
-Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 or ANSI/IEEE
-Std 854-1987)}.
-
-The IEEE single-precision float representation uses a base of 2. There
-is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
-precision is 24 base-2 digits), and an 8-bit exponent that can represent
-values in the range -125 to 128, inclusive.
-
-So, for an implementation that uses this representation for the
-@code{float} data type, appropriate values for the corresponding
-parameters are:
-
-@example
-FLT_RADIX 2
-FLT_MANT_DIG 24
-FLT_DIG 6
-FLT_MIN_EXP -125
-FLT_MIN_10_EXP -37
-FLT_MAX_EXP 128
-FLT_MAX_10_EXP +38
-FLT_MIN 1.17549435E-38F
-FLT_MAX 3.40282347E+38F
-FLT_EPSILON 1.19209290E-07F
-@end example
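The values in this table follow from the representation model: with base b, precision p, and exponent bounds emin and emax, the epsilon is b raised to 1 - p, the smallest normalized number is b raised to emin - 1, and the largest number is (1 - b raised to -p) times b raised to emax. The sketch below checks these relations against the macros. The function names are hypothetical, and the arithmetic is done in `double` on the assumption that `double` can hold each of these values exactly, as it can for the IEEE formats.

```c
#include <float.h>

/* Hypothetical helper: FLT_RADIX raised to the power N, computed in
   double.  */
static double
radix_power (int n)
{
  double r = 1.0;

  for (; n > 0; n--)
    r *= FLT_RADIX;
  for (; n < 0; n++)
    r /= FLT_RADIX;
  return r;
}

/* Return nonzero if FLT_EPSILON, FLT_MIN and FLT_MAX agree with the
   values the representation model predicts from FLT_RADIX,
   FLT_MANT_DIG, FLT_MIN_EXP and FLT_MAX_EXP.  */
int
float_limits_match_model (void)
{
  double eps = radix_power (1 - FLT_MANT_DIG);
  double min = radix_power (FLT_MIN_EXP - 1);
  double max = (1.0 - radix_power (-FLT_MANT_DIG))
               * radix_power (FLT_MAX_EXP);

  return FLT_EPSILON == eps && FLT_MIN == min && FLT_MAX == max;
}
```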