2<HTML>
3
4<HEAD>
5<TITLE>Berkeley SoftFloat Library Interface</TITLE>
6</HEAD>
7
8<BODY>
9
10<H1>Berkeley SoftFloat Release 3e: Library Interface</H1>
11
12<P>
13John R. Hauser<BR>
142018 January 20<BR>
15</P>
16
17
18<H2>Contents</H2>
19
20<BLOCKQUOTE>
21<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
22<COL WIDTH=25>
23<COL WIDTH=*>
24<TR><TD COLSPAN=2>1. Introduction</TD></TR>
25<TR><TD COLSPAN=2>2. Limitations</TD></TR>
26<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
27<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
28<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
29<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
30<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
31<TR>
32 <TD></TD>
33 <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
34</TR>
35<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
36<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
37<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
38<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
39<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
40<TR>
41 <TD></TD>
42 <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
43</TR>
44<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
45<TR><TD COLSPAN=2>8. Function Details</TD></TR>
46<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
47<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
48<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
49<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
50<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
51<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
52<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
53<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
54<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
55<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
56<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
57<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
58<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
59<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
60<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
61<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
62<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
63<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
64<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
65</TABLE>
66</BLOCKQUOTE>
67
68
69<H2>1. Introduction</H2>
70
71<P>
72Berkeley SoftFloat is a software implementation of binary floating-point that
73conforms to the IEEE Standard for Floating-Point Arithmetic.
74The current release supports five binary formats: <NOBR>16-bit</NOBR>
75half-precision, <NOBR>32-bit</NOBR> single-precision, <NOBR>64-bit</NOBR>
76double-precision, <NOBR>80-bit</NOBR> double-extended-precision, and
77<NOBR>128-bit</NOBR> quadruple-precision.
78The following functions are supported for each format:
79<UL>
80<LI>
81addition, subtraction, multiplication, division, and square root;
82<LI>
83fused multiply-add as defined by the IEEE Standard, except for
84<NOBR>80-bit</NOBR> double-extended-precision;
85<LI>
86remainder as defined by the IEEE Standard;
87<LI>
88round to integral value;
89<LI>
90comparisons;
91<LI>
92conversions to/from other supported formats; and
93<LI>
94conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
95signed and unsigned.
96</UL>
97All operations required by the original 1985 version of the IEEE Floating-Point
98Standard are implemented, except for conversions to and from decimal.
99</P>
100
101<P>
102This document gives information about the types defined and the routines
103implemented by SoftFloat.
104It does not attempt to define or explain the IEEE Floating-Point Standard.
105Information about the standard is available elsewhere.
106</P>
107
108<P>
109The current version of SoftFloat is <NOBR>Release 3e</NOBR>.
110This release modifies the behavior of the rarely used <I>odd</I> rounding mode
111(<I>round to odd</I>, also known as <I>jamming</I>), and also adds some new
112specialization and optimization examples for those compiling SoftFloat.
113</P>
114
115<P>
116The previous <NOBR>Release 3d</NOBR> fixed bugs that were found in the square
117root functions for the <NOBR>64-bit</NOBR>, <NOBR>80-bit</NOBR>, and
118<NOBR>128-bit</NOBR> floating-point formats.
119(Thanks to Alexei Sibidanov at the University of Victoria for reporting an
120incorrect result.)
121The bugs affected all prior <NOBR>Release-3</NOBR> versions of SoftFloat
122<NOBR>through 3c</NOBR>.
123The flaw in the <NOBR>64-bit</NOBR> floating-point square root function was of
124very minor impact, causing a <NOBR>1-ulp</NOBR> error (<NOBR>1 unit</NOBR> in
125the last place) a few times out of a billion.
126The bugs in the <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> square root
127functions were more serious.
128Although incorrect results again occurred only a few times out of a billion,
129when they did occur a large portion of the less-significant bits could be
130wrong.
131</P>
132
133<P>
134Among earlier releases, 3b was notable for adding support for the
135<NOBR>16-bit</NOBR> half-precision format.
136For more about the evolution of SoftFloat releases, see
137<A HREF="SoftFloat-history.html"><NOBR><CODE>SoftFloat-history.html</CODE></NOBR></A>.
138</P>
139
140<P>
141The functional interface of SoftFloat <NOBR>Release 3</NOBR> and later differs
142in many details from the releases that came before.
143For specifics of these differences, see <NOBR>section 9</NOBR> below,
144<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
145</P>
146
147
148<H2>2. Limitations</H2>
149
150<P>
151SoftFloat assumes the computer has an addressable byte size of 8 or
152<NOBR>16 bits</NOBR>.
153(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
154</P>
155
156<P>
157SoftFloat is written in C and is designed to work with other C code.
158The C compiler used must conform at a minimum to the 1989 ANSI standard for the
159C language (same as the 1990 ISO standard) and must in addition support basic
160arithmetic on <NOBR>64-bit</NOBR> integers.
161Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
162single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
163did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
164starting with <NOBR>Release 3</NOBR>.
165Since 1999, ISO standards for C have mandated compiler support for
166<NOBR>64-bit</NOBR> integers.
167A compiler conforming to the 1999 C Standard or later is recommended but not
168strictly required.
169</P>
170
171<P>
172Most operations not required by the original 1985 version of the IEEE
173Floating-Point Standard but added in the 2008 version are not yet supported in
174SoftFloat <NOBR>Release 3e</NOBR>.
175</P>
176
177
178<H2>3. Acknowledgments and License</H2>
179
180<P>
181The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
182<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
183supplanting earlier releases.
184The project to create <NOBR>Release 3</NOBR> (now <NOBR>through 3e</NOBR>) was
185done in the employ of the University of California, Berkeley, within the
186Department of Electrical Engineering and Computer Sciences, first for the
187Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
188The work was officially overseen by Prof. Krste Asanovic, with funding provided
189by these sources:
190<BLOCKQUOTE>
191<TABLE>
192<COL>
193<COL WIDTH=10>
194<COL>
195<TR>
196<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
197<TD></TD>
198<TD>
199Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
200(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
201NVIDIA, Oracle, and Samsung.
202</TD>
203</TR>
204<TR>
205<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
206<TD></TD>
207<TD>
208DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
209ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
210Oracle, and Samsung.
211</TD>
212</TR>
213</TABLE>
214</BLOCKQUOTE>
215</P>
216
217<P>
218The following applies to the whole of SoftFloat <NOBR>Release 3e</NOBR> as well
219as to each source file individually.
220</P>
221
222<P>
223Copyright 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 The Regents of the
224University of California.
225All rights reserved.
226</P>
227
228<P>
229Redistribution and use in source and binary forms, with or without
230modification, are permitted provided that the following conditions are met:
231<OL>
232
233<LI>
234<P>
235Redistributions of source code must retain the above copyright notice, this
236list of conditions, and the following disclaimer.
237</P>
238
239<LI>
240<P>
241Redistributions in binary form must reproduce the above copyright notice, this
242list of conditions, and the following disclaimer in the documentation and/or
243other materials provided with the distribution.
244</P>
245
246<LI>
247<P>
248Neither the name of the University nor the names of its contributors may be
249used to endorse or promote products derived from this software without specific
250prior written permission.
251</P>
252
253</OL>
254</P>
255
256<P>
257THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS &ldquo;AS IS&rdquo;,
258AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
259IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
260DISCLAIMED.
261IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
262INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
263BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
264DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
265LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
266OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
267ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
268</P>
269
270
271<H2>4. Types and Functions</H2>
272
273<P>
274The types and functions of SoftFloat are declared in header file
275<CODE>softfloat.h</CODE>.
276</P>
277
278<H3>4.1. Boolean and Integer Types</H3>
279
280<P>
281Header file <CODE>softfloat.h</CODE> depends on standard headers
282<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
283<CODE>bool</CODE> and several integer types.
284These standard headers have been part of the ISO C Standard Library since 1999.
285With any recent compiler, they are likely to be supported, even if the compiler
286does not claim complete conformance to the latest ISO C Standard.
287For older or nonstandard compilers, a port of SoftFloat may have substitutes
288for these headers.
289Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
290<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
291<CODE>&lt;stdint.h&gt;</CODE>:
292<BLOCKQUOTE>
293<PRE>
294uint16_t
295uint32_t
296uint64_t
297int32_t
298int64_t
299uint_fast8_t
300uint_fast32_t
301uint_fast64_t
302int_fast32_t
303int_fast64_t
304</PRE>
305</BLOCKQUOTE>
306</P>
307
308
309<H3>4.2. Floating-Point Types</H3>
310
311<P>
312The <CODE>softfloat.h</CODE> header defines five floating-point types:
313<BLOCKQUOTE>
314<TABLE CELLSPACING=0 CELLPADDING=0>
315<TR>
316<TD><CODE>float16_t</CODE></TD>
317<TD><NOBR>16-bit</NOBR> half-precision binary format</TD>
318</TR>
319<TR>
320<TD><CODE>float32_t</CODE></TD>
321<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
322</TR>
323<TR>
324<TD><CODE>float64_t</CODE></TD>
325<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
326</TR>
327<TR>
328<TD><CODE>extFloat80_t&nbsp;&nbsp;&nbsp;</CODE></TD>
329<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
330Motorola format)</TD>
331</TR>
332<TR>
333<TD><CODE>float128_t</CODE></TD>
334<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
335</TR>
336</TABLE>
337</BLOCKQUOTE>
338The non-extended types are each exactly the size specified:
339<NOBR>16 bits</NOBR> for <CODE>float16_t</CODE>, <NOBR>32 bits</NOBR> for
340<CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for <CODE>float64_t</CODE>, and
341<NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
342Aside from these size requirements, the definitions of all these types may
343differ for different ports of SoftFloat to specific systems.
344A given port of SoftFloat may or may not define some of the floating-point
345types as aliases for the C standard types <CODE>float</CODE>,
346<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
347</P>
348
349<P>
350Header file <CODE>softfloat.h</CODE> also defines a structure,
351<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
352<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
353This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
354at least these two fields (not necessarily in this order):
355<BLOCKQUOTE>
356<PRE>
357uint16_t signExp;
358uint64_t signif;
359</PRE>
360</BLOCKQUOTE>
361Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
362value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
363encoded exponent in the other <NOBR>15 bits</NOBR>.
364Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
365the floating-point value.
366(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
367leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
368in the most significant bit of the significand.)
369</P>
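<P>
The field layout described above can be unpacked with a few shifts and masks.
The following is a minimal sketch, not code from SoftFloat: the structure
<CODE>demo_extFloat80M</CODE> and the three helper functions are hypothetical
stand-ins so that the example compiles on its own.
Real code would include <CODE>softfloat.h</CODE> and use
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, whose field order depends on the
port.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M.  The real definition comes
   from softfloat.h, and its field order is port-specific. */
struct demo_extFloat80M {
    uint16_t signExp;
    uint64_t signif;
};

/* The sign is bit 15 of signExp. */
bool demo_sign( const struct demo_extFloat80M *p )
{
    return (p->signExp >> 15) != 0;
}

/* The encoded exponent is the low 15 bits of signExp. */
uint16_t demo_exp( const struct demo_extFloat80M *p )
{
    return (uint16_t) (p->signExp & 0x7FFF);
}

/* The explicit leading significand bit is bit 63 of signif. */
bool demo_leadingBit( const struct demo_extFloat80M *p )
{
    return (p->signif >> 63) != 0;
}
```

<P>
For instance, the value &minus;1 is normally encoded with
<CODE>signExp</CODE> equal to <CODE>0xBFFF</CODE> and <CODE>signif</CODE> equal
to <CODE>0x8000000000000000</CODE>, so the helpers report a set sign bit, an
encoded exponent of <CODE>0x3FFF</CODE>, and a leading significand bit
<NOBR>of 1</NOBR>.
</P>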
370
371<H3>4.3. Supported Floating-Point Functions</H3>
372
373<P>
374SoftFloat implements these arithmetic operations for its floating-point types:
375<UL>
376<LI>
377conversions between any two floating-point formats;
378<LI>
379for each floating-point format, conversions to and from signed and unsigned
380<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
381<LI>
382for each format, the usual addition, subtraction, multiplication, division, and
383square root operations;
384<LI>
385for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
386operation defined by the IEEE Standard;
387<LI>
388for each format, the floating-point remainder operation defined by the IEEE
389Standard;
390<LI>
391for each format, a &ldquo;round to integer&rdquo; operation that rounds to the
392nearest integer value in the same format; and
393<LI>
394comparisons between two values in the same floating-point format.
395</UL>
396</P>
397
398<P>
399The following operations required by the 2008 IEEE Floating-Point Standard are
400not supported in SoftFloat <NOBR>Release 3e</NOBR>:
401<UL>
402<LI>
403<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
404<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
405<LI>
406conversions between floating-point formats and decimal or hexadecimal character
407sequences;
408<LI>
409all &ldquo;quiet-computation&rdquo; operations (<B>copy</B>, <B>negate</B>,
410<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
411manipulation of the floating-point sign bit); and
412<LI>
413all &ldquo;non-computational&rdquo; operations other than <B>isSignaling</B>
414(which is supported).
415</UL>
416</P>
417
418<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>
419
420<P>
421Because the <NOBR>80-bit</NOBR> double-extended-precision format,
422<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
423finite floating-point numbers are encodable in this type in multiple equivalent
424forms.
425Of these multiple encodings, there is always a unique one with the least
426encoded exponent value, and this encoding is considered the <I>canonical</I>
427representation of the floating-point number.
428Any other equivalent representations (having a higher encoded exponent value)
429are <I>non-canonical</I>.
430For a value in the subnormal range (including zero), the canonical
431representation always has an encoded exponent of zero and a leading significand
432bit <NOBR>of 0</NOBR>.
433For finite values outside the subnormal range, the canonical representation
434always has an encoded exponent that is nonzero and a leading significand bit
435<NOBR>of 1</NOBR>.
436</P>
437
438<P>
439For an infinity or NaN, the leading significand bit is similarly expected to
440<NOBR>be 1</NOBR>.
441An infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
442considered non-canonical.
443Hence, altogether, to be canonical, a value of type <CODE>extFloat80_t</CODE>
444must have a leading significand bit <NOBR>of 1</NOBR>, unless the value is
445subnormal or zero, in which case the leading significand bit and the encoded
446exponent must both be zero.
447</P>
448
449<P>
450SoftFloat&rsquo;s functions are not guaranteed to operate as expected when
451inputs of type <CODE>extFloat80_t</CODE> are non-canonical.
452Assuming all of a function&rsquo;s <CODE>extFloat80_t</CODE> inputs (if any)
453are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
454be canonical.
455</P>
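<P>
The canonical-form rules of this section reduce to a single test on the encoded
exponent and the leading significand bit.
The sketch below is illustrative only: <CODE>demo_extF80_isCanonical</CODE> is
a hypothetical helper, not a SoftFloat function, and it takes the two fields as
separate arguments so that it does not depend on any particular structure
layout.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the (signExp, signif) pair is canonical according to the
   rules stated in this section.  Hypothetical helper, not part of SoftFloat. */
bool demo_extF80_isCanonical( uint16_t signExp, uint64_t signif )
{
    uint16_t exp = (uint16_t) (signExp & 0x7FFF);  /* encoded exponent */
    bool leadingBit = (signif >> 63) != 0;         /* explicit leading bit */

    /* Subnormal or zero: encoded exponent 0 and leading bit 0.
       Everything else (normal, infinity, NaN): leading bit 1. */
    return (exp == 0) ? !leadingBit : leadingBit;
}
```

<P>
A caller that receives <CODE>extFloat80_t</CODE> values from an untrusted
source could apply a check of this kind before passing them to SoftFloat.
</P>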
456
457<H3>4.5. Conventions for Passing Arguments and Results</H3>
458
459<P>
460Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
461<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
462cases passed as function arguments by value.
463Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
464is always returned directly as the function result.
465Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
466floating-point values has this simple signature:
467<BLOCKQUOTE>
468<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
469</BLOCKQUOTE>
470</P>
471
472<P>
473The story is more complex when function inputs and outputs are
474<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
475For these types, SoftFloat always provides a function that passes these larger
476values into or out of the function indirectly, via pointers.
477For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
478SoftFloat supplies this function:
479<BLOCKQUOTE>
480<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
481</BLOCKQUOTE>
482The first two arguments point to the values to be added, and the last argument
483points to the location where the sum will be stored.
484The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
485that the <NOBR>128-bit</NOBR> inputs and outputs are &ldquo;in memory&rdquo;,
486pointed to by pointer arguments.
487</P>
488
489<P>
490All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
491types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
492At the same time, SoftFloat ports may also implement alternate versions of
493these same functions that pass <CODE>extFloat80_t</CODE> and
494<CODE>float128_t</CODE> by value, like the smaller formats.
495Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
496SoftFloat port may also supply an equivalent function with this signature:
497<BLOCKQUOTE>
498<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
499</BLOCKQUOTE>
500</P>
501
502<P>
503As a general rule, on computers where the machine word size is
504<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
505(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
506and <CODE>float128_t</CODE>, because passing such large types directly can have
507significant extra cost.
508On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
509function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
510provided, because the cost of passing by value is then more reasonable.
511Applications that must be portable across both classes of computers must use
512the pointer-based functions, as these are always implemented.
513However, if it is known that SoftFloat includes the by-value functions for all
514platforms of interest, programmers can use whichever version they prefer.
515</P>
516
517
518<H2>5. Reserved Names</H2>
519
520<P>
521In addition to the variables and functions documented here, SoftFloat defines
522some symbol names for its own private use.
523These private names always begin with the prefix
524&lsquo;<CODE>softfloat_</CODE>&rsquo;.
525When a program includes header <CODE>softfloat.h</CODE> or links with the
526SoftFloat library, all names with prefix &lsquo;<CODE>softfloat_</CODE>&rsquo;
527are reserved for possible use by SoftFloat.
528Applications that use SoftFloat should not define their own names with this
529prefix, and should reference only such names as are documented.
530</P>
531
532
533<H2>6. Mode Variables</H2>
534
535<P>
536The following global variables control rounding mode, underflow detection, and
537the <NOBR>80-bit</NOBR> extended format&rsquo;s rounding precision:
538<BLOCKQUOTE>
539<CODE>softfloat_roundingMode</CODE><BR>
540<CODE>softfloat_detectTininess</CODE><BR>
541<CODE>extF80_roundingPrecision</CODE>
542</BLOCKQUOTE>
543These mode variables are covered in the next several subsections.
544For some SoftFloat ports, these variables may be <I>per-thread</I> (declared
545<CODE>thread_local</CODE>), meaning that different execution threads have their
546own separate copies of the variables.
547</P>
548
549<H3>6.1. Rounding Mode</H3>
550
551<P>
552All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
553implemented for all operations that require rounding.
554Some ports of SoftFloat may also implement the <I>round-to-odd</I> mode.
555</P>
556
557<P>
558The rounding mode is selected by the global variable
559<BLOCKQUOTE>
560<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
561</BLOCKQUOTE>
562This variable may be set to one of the values
563<BLOCKQUOTE>
564<TABLE CELLSPACING=0 CELLPADDING=0>
565<TR>
566<TD><CODE>softfloat_round_near_even</CODE></TD>
567<TD>round to nearest, with ties to even</TD>
568</TR>
569<TR>
570<TD><CODE>softfloat_round_near_maxMag&nbsp;&nbsp;</CODE></TD>
571<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
572</TR>
573<TR>
574<TD><CODE>softfloat_round_minMag</CODE></TD>
575<TD>round to minimum magnitude (toward zero)</TD>
576</TR>
577<TR>
578<TD><CODE>softfloat_round_min</CODE></TD>
579<TD>round to minimum (down)</TD>
580</TR>
581<TR>
582<TD><CODE>softfloat_round_max</CODE></TD>
583<TD>round to maximum (up)</TD>
584</TR>
585<TR>
586<TD><CODE>softfloat_round_odd</CODE></TD>
587<TD>round to odd (jamming), if supported by the SoftFloat port</TD>
588</TR>
589</TABLE>
590</BLOCKQUOTE>
591Variable <CODE>softfloat_roundingMode</CODE> is initialized to
592<CODE>softfloat_round_near_even</CODE>.
593</P>
594
595<P>
596When <CODE>softfloat_round_odd</CODE> is the rounding mode for a function that
597rounds to an integer value (either conversion to an integer format or a
598&lsquo;<CODE>roundToInt</CODE>&rsquo; function), if the input is not already an
599integer, the rounded result is the closest <EM>odd</EM> integer.
600For other operations, this rounding mode acts as though the floating-point
601result is first rounded to minimum magnitude, the same as
602<CODE>softfloat_round_minMag</CODE>, and then, if the result is inexact, the
603least-significant bit of the result is set <NOBR>to 1</NOBR>.
604Rounding to odd is also known as <EM>jamming</EM>.
605</P>
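<P>
The &ldquo;round to minimum magnitude, then set the least-significant bit if
inexact&rdquo; rule can be illustrated on plain integers.
The function below is a minimal sketch with a hypothetical name; it is not
claimed to match how SoftFloat implements the mode internally.
It discards the low <CODE>dist</CODE> bits of a value and jams any lost
information into the least-significant bit of the result.
</P>

```c
#include <stdint.h>

/* Shift `a` right by `dist` bits (1 <= dist <= 63), rounding to odd:
   truncate first, then set the least-significant bit of the result if any
   discarded bit was nonzero.  Hypothetical helper for illustration. */
uint64_t demo_shiftRightRoundOdd( uint64_t a, unsigned dist )
{
    uint64_t truncated = a >> dist;
    uint64_t discarded = a & ((UINT64_C( 1 ) << dist) - 1);
    return truncated | (uint64_t) (discarded != 0);
}
```

<P>
For example, shifting binary <CODE>1011</CODE> right by two bits truncates to
<CODE>10</CODE>; because the discarded bits <CODE>11</CODE> are nonzero, the
result becomes <CODE>11</CODE>.
Shifting binary <CODE>1000</CODE> by two bits is exact, so the result stays
<CODE>10</CODE>.
</P>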
606
607<H3>6.2. Underflow Detection</H3>
608
609<P>
610In the terminology of the IEEE Standard, SoftFloat can detect tininess for
611underflow either before or after rounding.
612The choice is made by the global variable
613<BLOCKQUOTE>
614<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
615</BLOCKQUOTE>
616which can be set to either
617<BLOCKQUOTE>
618<CODE>softfloat_tininess_beforeRounding</CODE><BR>
619<CODE>softfloat_tininess_afterRounding</CODE>
620</BLOCKQUOTE>
621Detecting tininess after rounding is usually better because it results in fewer
622spurious underflow signals.
623The other option is provided for compatibility with some systems.
624Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
625always detects loss of accuracy for underflow as an inexact result.
626</P>
627
628<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
629
630<P>
631For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
632arithmetic operations is controlled by the global variable
633<BLOCKQUOTE>
634<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
635</BLOCKQUOTE>
636The operations affected are:
637<BLOCKQUOTE>
638<CODE>extF80_add</CODE><BR>
639<CODE>extF80_sub</CODE><BR>
640<CODE>extF80_mul</CODE><BR>
641<CODE>extF80_div</CODE><BR>
642<CODE>extF80_sqrt</CODE>
643</BLOCKQUOTE>
644When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
645these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
646double-extended-precision format, as occurs for the other formats.
647Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
648operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
649<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
650<CODE>float64_t</CODE>), respectively.
651When rounding to reduced precision, additional bits in the result significand
652beyond the rounding point are set to zero.
653The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
654other than 32, 64, or 80 are not specified.
655Operations other than the ones listed above are not affected by
656<CODE>extF80_roundingPrecision</CODE>.
657</P>
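<P>
The effect of reduced rounding precision on the stored significand can be
checked with a bit mask.
The sketch below assumes the usual IEEE significand widths (24 bits for
<NOBR>32-bit</NOBR> single-precision and 53 bits for <NOBR>64-bit</NOBR>
double-precision) within the <NOBR>64-bit</NOBR> significand of the extended
format.
Both functions are hypothetical helpers, not part of SoftFloat.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of significand bits kept for each documented setting, assuming the
   usual IEEE widths: 24 (binary32), 53 (binary64), 64 (80-bit extended).
   Returns 0 for settings whose behavior is left unspecified. */
unsigned demo_keptSignificandBits( unsigned roundingPrecision )
{
    switch ( roundingPrecision ) {
    case 32: return 24;
    case 64: return 53;
    case 80: return 64;
    default: return 0;
    }
}

/* True if every bit of `signif` below the rounding point is zero, which is
   what a result rounded to the given precision should look like. */
bool demo_isRoundedTo( uint64_t signif, unsigned roundingPrecision )
{
    unsigned kept = demo_keptSignificandBits( roundingPrecision );
    uint64_t lowMask;

    if ( kept == 0 ) return false;   /* unspecified setting */
    if ( kept == 64 ) return true;   /* full precision, nothing to clear */
    lowMask = (UINT64_C( 1 ) << (64 - kept)) - 1;
    return (signif & lowMask) == 0;
}
```

<P>
Under these assumptions, a precision setting of 32 leaves the low
<NOBR>40 bits</NOBR> of the significand zero, and a setting of 64 leaves the
low <NOBR>11 bits</NOBR> zero.
</P>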
658
659
660<H2>7. Exceptions and Exception Flags</H2>
661
662<P>
663All five exception flags required by the IEEE Floating-Point Standard are
664implemented.
665Each flag is stored as a separate bit in the global variable
666<BLOCKQUOTE>
667<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
668</BLOCKQUOTE>
669The positions of the exception flag bits within this variable are determined by
670the bit masks
671<BLOCKQUOTE>
672<CODE>softfloat_flag_inexact</CODE><BR>
673<CODE>softfloat_flag_underflow</CODE><BR>
674<CODE>softfloat_flag_overflow</CODE><BR>
675<CODE>softfloat_flag_infinite</CODE><BR>
676<CODE>softfloat_flag_invalid</CODE>
677</BLOCKQUOTE>
678Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
679meaning no exceptions.
680</P>
681
682<P>
683For some SoftFloat ports, <CODE>softfloat_exceptionFlags</CODE> may be
684<I>per-thread</I> (declared <CODE>thread_local</CODE>), meaning that different
685execution threads have their own separate instances of it.
686</P>
687
688<P>
689An individual exception flag can be cleared with the statement
690<BLOCKQUOTE>
691<CODE>softfloat_exceptionFlags &= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
692</BLOCKQUOTE>
693where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
694To raise a floating-point exception, function <CODE>softfloat_raiseFlags</CODE>
695should normally be used.
696</P>
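<P>
A common pattern is to test a flag and clear it in one step, leaving the other
flags untouched.
The following sketch uses stand-in names so that it compiles on its own:
<CODE>demo_exceptionFlags</CODE> and the <CODE>demo_flag_</CODE> constants play
the roles of <CODE>softfloat_exceptionFlags</CODE> and the
<CODE>softfloat_flag_</CODE> masks, and the numeric bit values shown are
assumptions made for the example.
Real code should take the variable and the masks from
<CODE>softfloat.h</CODE>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the masks and the flags variable declared in softfloat.h.
   The bit values here are assumed for illustration only. */
enum {
    demo_flag_inexact   =  1,
    demo_flag_underflow =  2,
    demo_flag_overflow  =  4,
    demo_flag_infinite  =  8,
    demo_flag_invalid   = 16
};
uint_fast8_t demo_exceptionFlags = 0;

/* Report whether `flag` is currently set, then clear only that flag. */
bool demo_testAndClearFlag( uint_fast8_t flag )
{
    bool wasSet = (demo_exceptionFlags & flag) != 0;
    demo_exceptionFlags &= (uint_fast8_t) ~flag;
    return wasSet;
}
```

<P>
Clearing a flag before an operation and testing it afterward is one way to find
out whether that particular operation raised the exception.
</P>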
697
698<P>
699When SoftFloat detects an exception other than <I>inexact</I>, it calls
700<CODE>softfloat_raiseFlags</CODE>.
701The default version of this function simply raises the corresponding exception
702flags.
703Particular ports of SoftFloat may support alternate behavior, such as exception
704traps, by modifying the default <CODE>softfloat_raiseFlags</CODE>.
705A program may also supply its own <CODE>softfloat_raiseFlags</CODE> function to
706override the one from the SoftFloat library.
707</P>
708
709<P>
710Because inexact results occur frequently under most circumstances (and thus are
711hardly exceptional), SoftFloat does not ordinarily call
712<CODE>softfloat_raiseFlags</CODE> for <I>inexact</I> exceptions.
713It does always raise the <I>inexact</I> exception flag as required.
714</P>
715
716
717<H2>8. Function Details</H2>
718
719<P>
720In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
721a substitute for one of these abbreviations:
722<BLOCKQUOTE>
723<TABLE CELLSPACING=0 CELLPADDING=0>
724<TR>
725<TD><CODE>f16</CODE></TD>
726<TD>indicates <CODE>float16_t</CODE>, passed by value</TD>
727</TR>
728<TR>
729<TD><CODE>f32</CODE></TD>
730<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
731</TR>
732<TR>
733<TD><CODE>f64</CODE></TD>
734<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
735</TR>
736<TR>
737<TD><CODE>extF80M&nbsp;&nbsp;&nbsp;</CODE></TD>
738<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
739</TR>
740<TR>
741<TD><CODE>extF80</CODE></TD>
742<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
743</TR>
744<TR>
745<TD><CODE>f128M</CODE></TD>
746<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
747</TR>
748<TR>
749<TD><CODE>f128</CODE></TD>
750<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
751</TR>
752</TABLE>
753</BLOCKQUOTE>
754The circumstances under which values of floating-point types
755<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
756value or indirectly via pointers were discussed earlier in
757<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
758</P>
759
760<H3>8.1. Conversions from Integer to Floating-Point</H3>
761
762<P>
763All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
764signed or unsigned, to a floating-point format are supported.
765Functions performing these conversions have these names:
766<BLOCKQUOTE>
767<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
768<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
769<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
770<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
771</BLOCKQUOTE>
772Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
773double-precision and larger formats are always exact, and likewise conversions
774from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
775double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
776always exact.
777</P>
778
779<P>
780Each conversion function takes one input of the appropriate type and generates
781one output.
782The following illustrates the signatures of these functions in cases when the
783floating-point result is passed either by value or via pointers:
784<BLOCKQUOTE>
785<PRE>
786float64_t i32_to_f64( int32_t <I>a</I> );
787</PRE>
788<PRE>
789void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
790</PRE>
791</BLOCKQUOTE>
792</P>
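<P>
Whether an integer converts exactly comes down to whether its significant bits
fit in the destination significand.
The sketch below expresses that test with a hypothetical helper,
<CODE>demo_fitsSignificand</CODE>; it considers only significand width (11, 24,
53, 64, and 113 bits for the five formats) and ignores exponent range, which
additionally limits <CODE>float16_t</CODE>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an integer of the given magnitude can be held without rounding in
   a binary significand of `significandBits` bits.  Trailing zero bits cost
   nothing because they are absorbed by the exponent.
   Hypothetical helper; exponent range is not considered. */
bool demo_fitsSignificand( uint64_t magnitude, unsigned significandBits )
{
    if ( magnitude == 0 || significandBits >= 64 ) return true;
    while ( (magnitude & 1) == 0 ) magnitude >>= 1;  /* drop trailing zeros */
    return (magnitude >> significandBits) == 0;
}
```

<P>
Every <NOBR>32-bit</NOBR> magnitude passes the test for a <NOBR>53-bit</NOBR>
significand, and every <NOBR>64-bit</NOBR> magnitude passes for 64 bits or
more, which matches the exactness guarantees stated above.
A value such as 2<SUP>24</SUP>&nbsp;+&nbsp;1 does not fit in 24 bits, so its
conversion to <CODE>float32_t</CODE> must round.
</P>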

<H3>8.2. Conversions from Floating-Point to Integer</H3>

<P>
Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
</BLOCKQUOTE>
The functions have signatures as follows, depending on whether the
floating-point input is passed by value or via pointers:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t
 f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
the conversion.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
</P>
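<P>
To make the effect of <CODE><I>roundingMode</I></CODE> concrete, the sketch
below models five rounding modes with exact integer arithmetic on a rational
value.
It is an illustration only and does not call SoftFloat: the enumeration
<CODE>round_mode</CODE>, its constants, and the function
<CODE>round_ratio</CODE> are hypothetical names, and the enumerator values are
not those of the corresponding <CODE>softfloat_round_*</CODE> constants.
The optional mode <CODE>softfloat_round_odd</CODE> is not modeled.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for SoftFloat's rounding-mode constants.  The numeric
   values are arbitrary and are not taken from softfloat.h. */
enum round_mode {
    ROUND_NEAR_EVEN,     /* cf. softfloat_round_near_even:   nearest, ties to even     */
    ROUND_MIN_MAG,       /* cf. softfloat_round_minMag:      toward zero               */
    ROUND_MIN,           /* cf. softfloat_round_min:         toward negative infinity  */
    ROUND_MAX,           /* cf. softfloat_round_max:         toward positive infinity  */
    ROUND_NEAR_MAX_MAG   /* cf. softfloat_round_near_maxMag: nearest, ties away from 0 */
};

/* Rounds the exact rational value num/den (den > 0) to an integer.
   Intended for small operands; arithmetic overflow is not guarded against. */
int64_t round_ratio(int64_t num, int64_t den, enum round_mode mode)
{
    assert(den > 0);
    int64_t toward_zero = num / den;   /* C integer division truncates */
    int64_t rem = num % den;           /* has the same sign as num */
    if (rem == 0) return toward_zero;  /* the value is already an integer */

    int64_t lower = (rem < 0) ? toward_zero - 1 : toward_zero;  /* floor   */
    int64_t upper = lower + 1;                                  /* ceiling */
    int64_t above = num - lower * den; /* distance above floor, 0 < above < den */

    switch (mode) {
    case ROUND_MIN_MAG: return toward_zero;
    case ROUND_MIN:     return lower;
    case ROUND_MAX:     return upper;
    case ROUND_NEAR_EVEN:
    case ROUND_NEAR_MAX_MAG:
        if (2 * above < den) return lower;
        if (2 * above > den) return upper;
        /* Exact tie between lower and upper. */
        if (mode == ROUND_NEAR_EVEN) return (lower % 2 == 0) ? lower : upper;
        return (num < 0) ? lower : upper;   /* away from zero */
    }
    return toward_zero;  /* not reached for the modes defined above */
}
```

<P>
For example, the value 5/2 rounds to 2 under the ties-to-even model and to 3
under the ties-away model, while &minus;7/2 becomes &minus;3 when rounded
toward zero and &minus;4 when rounded toward negative infinity.
</P>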

<P>
A conversion from floating-point to integer format raises the <I>invalid</I>
exception if the source value cannot be rounded to a representable integer of
the desired size (32 or 64 bits).
In such circumstances, the integer result returned is determined by the
particular port of SoftFloat, although typically this value will be either the
maximum or minimum value of the integer format.
The functions that convert to integer types never raise the floating-point
<I>overflow</I> exception.
</P>

<P>
Because languages such <NOBR>as C</NOBR> require that conversions to integers
be rounded toward zero, the following functions are provided for improved speed
and convenience:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
</BLOCKQUOTE>
These functions round only toward zero (to minimum magnitude).
The signatures for these functions are the same as above without the redundant
<CODE><I>roundingMode</I></CODE> argument:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
</P>
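<P>
The range check behind the <I>invalid</I> exception can be sketched on a host
<CODE>double</CODE>, used here as a stand-in for <CODE>float64_t</CODE>.
The function <CODE>model_f64_to_i32_r_minMag</CODE> is a hypothetical model,
not SoftFloat code: it reports the <I>invalid</I> case through a flag
parameter, it ignores the <CODE><I>exact</I></CODE> argument and the
<I>inexact</I> flag, and the saturated values it returns are only one choice
that a port might make.
</P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Models a truncating conversion from a host double to int32_t.
   *invalid is set when the input is a NaN or when the truncated value lies
   outside the int32_t range.  The values returned in that case (maximum for
   NaN and positive overflow, minimum for negative overflow) are an assumption
   of this sketch; SoftFloat leaves them to each port. */
int32_t model_f64_to_i32_r_minMag(double a, bool *invalid)
{
    assert(invalid != NULL);
    *invalid = false;

    if (a != a) {                       /* a NaN compares unequal to itself */
        *invalid = true;
        return INT32_MAX;
    }
    /* After truncation, every value strictly between -2^31 - 1 and 2^31
       is a representable int32_t. */
    if (a >= 2147483648.0) {
        *invalid = true;
        return INT32_MAX;
    }
    if (a <= -2147483649.0) {
        *invalid = true;
        return INT32_MIN;
    }
    return (int32_t)a;                  /* well defined: the result is in range */
}
```

<P>
A plain C cast such as <CODE>(int32_t)a</CODE> has undefined behavior when the
truncated value is out of range, which is one reason to perform the check
explicitly before converting.
</P>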

<H3>8.3. Conversions Among Floating-Point Types</H3>

<P>
Conversions between floating-point formats are done by functions with these
names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
</BLOCKQUOTE>
All combinations of source and result type are supported where the source and
result are different formats.
There are four different styles of signature for these functions, depending on
whether the input and the output floating-point values are passed by value or
via pointers:
<BLOCKQUOTE>
<PRE>
float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
Conversions from a smaller to a larger floating-point format are always exact
and so require no rounding.
</P>
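<P>
The asymmetry between widening and narrowing can be observed with the host
compiler&rsquo;s own types.
The sketch below uses <CODE>float</CODE> and <CODE>double</CODE> as stand-ins
for <CODE>float32_t</CODE> and <CODE>float64_t</CODE>, which assumes the host
types are the IEEE <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> binary formats.
The function names are hypothetical and SoftFloat is not called.
</P>

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* True if widening x to double and narrowing it back recovers x.
   x must not be a NaN, because a NaN never compares equal to itself. */
bool widening_round_trips(float x)
{
    /* The sketch assumes 24-bit and 53-bit significands. */
    assert(FLT_MANT_DIG == 24 && DBL_MANT_DIG == 53);
    volatile double wide = (double)x;   /* exact: every float value is a double value */
    volatile float  back = (float)wide;
    return back == x;
}

/* True if narrowing x to float loses no information. */
bool narrowing_is_exact(double x)
{
    volatile float narrow = (float)x;   /* may round to the nearest float */
    return (double)narrow == x;
}
```

<P>
With these definitions, the widening round trip succeeds for any non-NaN
<CODE>float</CODE>, whereas narrowing is exact only for values such as 0.5
that already fit in the smaller format; the <CODE>double</CODE> nearest to 0.1
does not.
</P>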

<H3>8.4. Basic Arithmetic Functions</H3>

<P>
The following basic arithmetic functions are provided:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
</BLOCKQUOTE>
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root) which takes only one.
The operands and result are all of the same floating-point format.
Signatures for these functions take the following forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_add(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
location where the result is stored.
</P>

<P>
Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
(<CODE>extFloat80_t</CODE>) functions is affected by variable
<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
<NOBR>section 6.3</NOBR>,
<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
</P>

<H3>8.5. Fused Multiply-Add Functions</H3>

<P>
The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
multiply-add</I> operation that does a combined multiplication and addition
with only a single rounding.
SoftFloat implements fused multiply-add with functions
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
</BLOCKQUOTE>
Unlike other operations, fused multiply-add is not supported for the
<NOBR>80-bit</NOBR> double-extended-precision format,
<CODE>extFloat80_t</CODE>.
</P>

<P>
Depending on whether floating-point values are passed by value or via pointers,
the fused multiply-add functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
void
 f128M_mulAdd(
     const float128_t *<I>aPtr</I>,
     const float128_t *<I>bPtr</I>,
     const float128_t *<I>cPtr</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
The functions compute
<NOBR>(<CODE><I>a</I></CODE> &times; <CODE><I>b</I></CODE>)
 + <CODE><I>c</I></CODE></NOBR>
with a single rounding.
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
If one of the multiplication operands <CODE><I>a</I></CODE> and
<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
the <I>invalid</I> exception even if operand <CODE><I>c</I></CODE> is a quiet
NaN.
</P>
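<P>
The difference that a single rounding makes can be seen with small operands.
The sketch below uses the host types <CODE>float</CODE> and
<CODE>double</CODE> (assumed to be the IEEE <NOBR>32-bit</NOBR> and
<NOBR>64-bit</NOBR> binary formats) instead of SoftFloat, and both function
names are hypothetical.
The second function forms the product exactly in <CODE>double</CODE>, but it is
not a general fused multiply-add, because the addition in <CODE>double</CODE>
can itself round; it matches the single-rounding result only when that sum is
exact.
</P>

```c
#include <assert.h>
#include <float.h>

/* a*b + c with two roundings: the product is rounded to float before the
   addition takes place. */
float mul_then_add_f32(float a, float b, float c)
{
    volatile float product = a * b;   /* volatile forces the rounding to float */
    return product + c;
}

/* a*b + c with the product held exactly in double.  Two 24-bit significands
   multiply to at most 48 bits, which fits in the 53 bits of a double.
   The sum in double may round, so this equals a true single-rounding result
   only when that sum is exact. */
float mul_add_exact_product_f32(float a, float b, float c)
{
    assert(FLT_MANT_DIG == 24 && DBL_MANT_DIG == 53);
    volatile double sum = (double)a * (double)b + (double)c;
    return (float)sum;
}
```

<P>
Take <CODE><I>a</I></CODE> = 1 + 2<SUP>&minus;13</SUP>,
<CODE><I>b</I></CODE> = 1 &minus; 2<SUP>&minus;13</SUP>, and
<CODE><I>c</I></CODE> = &minus;1.
The exact product is 1 &minus; 2<SUP>&minus;26</SUP>, which rounds to 1 in the
<NOBR>32-bit</NOBR> format, so the two-step computation returns zero, whereas
rounding only once preserves the result &minus;2<SUP>&minus;26</SUP>.
</P>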

<H3>8.6. Remainder Functions</H3>

<P>
For each format, SoftFloat implements the remainder operation defined by the
IEEE Floating-Point Standard.
The remainder functions have names
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_rem</CODE>
</BLOCKQUOTE>
Each remainder operation takes two floating-point operands of the same format
and returns a result in the same format.
Depending on whether floating-point values are passed by value or via pointers,
the remainder functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_rem(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
The IEEE Standard remainder operation computes the value
<NOBR><CODE><I>a</I></CODE>
 &minus; <I>n</I> &times; <CODE><I>b</I></CODE></NOBR>,
where <I>n</I> is the integer closest to
<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
If <NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR> is exactly
halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
The IEEE Standard&rsquo;s remainder operation is always exact and so requires
no rounding.
</P>
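<P>
The definition above can be exercised on integers, where every step is exact.
The function <CODE>model_ieee_remainder</CODE> below is a hypothetical model of
the arithmetic rule only; it is not how SoftFloat computes remainders, and it
assumes operands small enough that no intermediate value overflows.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Computes a - n*b, where n is the integer nearest to a/b and ties go to the
   even n.  Models the IEEE remainder rule on integers; b must be nonzero and
   the operands must be small enough to avoid overflow. */
int64_t model_ieee_remainder(int64_t a, int64_t b)
{
    assert(b != 0);
    int64_t n = a / b;            /* quotient truncated toward zero */
    int64_t r = a - n * b;        /* truncated remainder, |r| < |b| */

    int64_t abs_r = (r < 0) ? -r : r;
    int64_t abs_b = (b < 0) ? -b : b;
    int64_t gap   = abs_b - abs_r;          /* distance from |r| up to |b| */
    int     n_is_odd = (n % 2) != 0;

    /* Move n one step away from zero when that integer is nearer to a/b,
       or when the two candidates tie and the truncated n is odd. */
    if (abs_r > gap || (abs_r == gap && n_is_odd)) {
        n += ((a < 0) != (b < 0)) ? -1 : 1;
        r = a - n * b;
    }
    return r;
}
```

<P>
Unlike the C operator <CODE>%</CODE>, this remainder can be negative for
positive operands: 7 &divide; 2 lies halfway between 3 and 4, so <I>n</I> is 4
and the result is &minus;1, whereas 5 &divide; 2 gives <I>n</I> = 2 and a
result of 1.
</P>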

<P>
Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.
This is an inherent characteristic of the remainder operation itself and is not
a flaw in the SoftFloat implementation.
</P>

<H3>8.7. Round-to-Integer Functions</H3>

<P>
For each format, SoftFloat implements the round-to-integer operation specified
by the IEEE Floating-Point Standard.
These functions are named
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
</BLOCKQUOTE>
Each round-to-integer operation takes a single floating-point operand.
This operand is rounded to an integer according to a specified rounding mode,
and the resulting integer value is returned in the same floating-point format.
(Note that the result is not an integer type.)
</P>

<P>
The signatures of the round-to-integer functions are similar to those for
conversions to an integer type:
<BLOCKQUOTE>
<PRE>
float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
void
 f128M_roundToInt(
     const float128_t *<I>aPtr</I>,
     uint_fast8_t <I>roundingMode</I>,
     bool <I>exact</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers,
<CODE><I>aPtr</I></CODE> points to the input operand and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
apply.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
</P>

<H3>8.8. Comparison Functions</H3>

<P>
For each format, the following floating-point comparison functions are
provided:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt</CODE>
</BLOCKQUOTE>
Each comparison takes two operands of the same type and returns a Boolean.
The abbreviation <CODE>eq</CODE> stands for &ldquo;equal&rdquo; (=);
<CODE>le</CODE> stands for &ldquo;less than or equal&rdquo; (&le;);
and <CODE>lt</CODE> stands for &ldquo;less than&rdquo; (&lt;).
Depending on whether the floating-point operands are passed by value or via
pointers, the comparison functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
The usual greater-than (&gt;), greater-than-or-equal (&ge;), and not-equal
(&ne;) comparisons are easily obtained from the functions provided.
The not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the arguments in reverse order, and likewise the greater-than
function is identical to the less-than function with the arguments reversed.
</P>
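<P>
The three derived comparisons can be written as thin wrappers.
The sketch below is an illustration with hypothetical names: the
<CODE>model_eq</CODE>, <CODE>model_le</CODE>, and <CODE>model_lt</CODE>
functions stand in for SoftFloat functions such as <CODE>f64_eq</CODE>,
<CODE>f64_le</CODE>, and <CODE>f64_lt</CODE>, and they use the host
<CODE>double</CODE> comparisons so that the example is self-contained.
</P>

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for <float>_eq, <float>_le and <float>_lt, built on the host's
   double comparisons.  In a SoftFloat program these would be calls such as
   f64_eq, f64_le and f64_lt. */
static bool model_eq(double a, double b) { return a == b; }
static bool model_le(double a, double b) { return a <= b; }
static bool model_lt(double a, double b) { return a <  b; }

/* The three comparisons that are derived rather than provided directly. */
bool model_ne(double a, double b) { return !model_eq(a, b); }   /* complement of eq */
bool model_ge(double a, double b) { return model_le(b, a); }    /* a >= b is b <= a */
bool model_gt(double a, double b) { return model_lt(b, a); }    /* a >  b is b <  a */

/* Self-check for ordinary (non-NaN) operands with smaller < larger. */
void check_derived_comparisons(double smaller, double larger)
{
    assert(model_lt(smaller, larger));
    assert(model_gt(larger, smaller));
    assert(model_ge(larger, smaller));
    assert(model_ne(smaller, larger));
}
```

<P>
Reversing the arguments, rather than negating the opposite comparison, matters
when an operand is a NaN: in that case less-than and greater-than-or-equal are
both false, so the complement of one is not the other.
Only not-equal is defined as a complement.
</P>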

<P>
The IEEE Floating-Point Standard specifies that the less-than-or-equal and
less-than comparisons by default raise the <I>invalid</I> exception if either
operand is any kind of NaN.
Equality comparisons, on the other hand, are defined by default to raise the
<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
For completeness, SoftFloat provides these complementary functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the default
equality comparisons except that the <I>invalid</I> exception is raised for any
NaN input, not just for signaling NaNs.
Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
default counterparts except that the <I>invalid</I> exception is not raised for
quiet NaNs.
</P>
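<P>
The rules above for when a comparison raises <I>invalid</I> can be tabulated
in a few lines.
The enumeration <CODE>cmp_kind</CODE> and the function
<CODE>cmp_raises_invalid</CODE> are hypothetical names used only to restate
those rules; the sketch does not inspect operands or call SoftFloat.
</P>

```c
#include <assert.h>
#include <stdbool.h>

/* Names local to this sketch for the six comparison variants. */
enum cmp_kind {
    CMP_EQ,             /* <float>_eq           */
    CMP_LE,             /* <float>_le           */
    CMP_LT,             /* <float>_lt           */
    CMP_EQ_SIGNALING,   /* <float>_eq_signaling */
    CMP_LE_QUIET,       /* <float>_le_quiet     */
    CMP_LT_QUIET        /* <float>_lt_quiet     */
};

/* Returns true if a comparison of the given kind raises invalid, given
   whether any operand is a signaling NaN and whether any is a quiet NaN. */
bool cmp_raises_invalid(enum cmp_kind kind, bool any_signaling_nan, bool any_quiet_nan)
{
    switch (kind) {
    case CMP_EQ:
    case CMP_LE_QUIET:
    case CMP_LT_QUIET:
        return any_signaling_nan;                   /* quiet NaNs pass silently */
    case CMP_LE:
    case CMP_LT:
    case CMP_EQ_SIGNALING:
        return any_signaling_nan || any_quiet_nan;  /* any NaN raises invalid */
    }
    assert(!"unknown comparison kind");
    return false;
}
```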

<H3>8.9. Signaling NaN Test Functions</H3>

<P>
Functions for testing whether a floating-point value is a signaling NaN are
provided with these names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
</BLOCKQUOTE>
The functions take one floating-point operand and return a Boolean indicating
whether the operand is a signaling NaN.
Accordingly, the functions have the forms
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>
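<P>
As a rough picture of what such a test involves, the sketch below examines a
raw <NOBR>64-bit</NOBR> pattern.
It assumes the widely used convention in which the most significant fraction
bit is 1 for a quiet NaN and 0 for a signaling NaN; that convention is an
assumption of this sketch and may not match the target a given port emulates.
Both function names are hypothetical, and the code is not taken from SoftFloat.
</P>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Tests a raw 64-bit double-precision pattern for a signaling NaN, assuming
   the top fraction bit is set for quiet NaNs and clear for signaling NaNs. */
bool model_bits_is_signaling_nan_f64(uint64_t bits)
{
    const uint64_t exp_mask  = UINT64_C(0x7FF0000000000000);
    const uint64_t frac_mask = UINT64_C(0x000FFFFFFFFFFFFF);
    const uint64_t quiet_bit = UINT64_C(0x0008000000000000);

    /* A NaN has an all-ones exponent and a nonzero fraction. */
    bool is_nan = (bits & exp_mask) == exp_mask && (bits & frac_mask) != 0;
    return is_nan && (bits & quiet_bit) == 0;
}

/* Convenience wrapper for a host double, assumed to use the same encoding. */
bool model_is_signaling_nan_f64(double a)
{
    uint64_t bits;
    assert(sizeof a == sizeof bits);
    memcpy(&bits, &a, sizeof bits);   /* copy the object representation */
    return model_bits_is_signaling_nan_f64(bits);
}
```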

<H3>8.10. Raise-Exception Function</H3>

<P>
SoftFloat provides a single function for raising floating-point exceptions:
<BLOCKQUOTE>
<PRE>
void softfloat_raiseFlags( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
exceptions to raise.
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
In addition to setting the specified exception flags in variable
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raiseFlags</CODE>
function may cause a trap or abort appropriate for the current system.
</P>
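<P>
The flag-setting part of this behavior amounts to a bitwise OR into a status
variable.
The sketch below is a self-contained model with hypothetical names: the
<CODE>MODEL_FLAG_*</CODE> constants and their values are illustrative and are
not the library&rsquo;s <CODE>softfloat_flag_*</CODE> constants, and
<CODE>model_exception_flags</CODE> stands in for
<CODE>softfloat_exceptionFlags</CODE>.
Any trap or abort that a real port performs is not modeled.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits.  The names and values are local to this sketch;
   real programs should use the softfloat_flag_* constants from softfloat.h. */
enum {
    MODEL_FLAG_INEXACT = 1 << 0,
    MODEL_FLAG_INVALID = 1 << 4
};

/* Stand-in for the variable softfloat_exceptionFlags. */
static uint_fast8_t model_exception_flags = 0;

/* Stand-in for softfloat_raiseFlags: ORs the mask into the flags variable,
   so that flags accumulate until they are cleared. */
void model_raise_flags(uint_fast8_t exceptions)
{
    model_exception_flags |= exceptions;
}

/* Returns the flags selected by mask and clears them, one way to find out
   whether a region of code raised a particular exception. */
uint_fast8_t model_test_and_clear_flags(uint_fast8_t mask)
{
    uint_fast8_t raised = model_exception_flags & mask;
    model_exception_flags &= (uint_fast8_t)~mask;
    return raised;
}

/* Self-check: flags accumulate independently and stay set until cleared. */
void model_flags_self_check(void)
{
    model_raise_flags(MODEL_FLAG_INVALID);
    model_raise_flags(MODEL_FLAG_INEXACT);
    assert(model_test_and_clear_flags(MODEL_FLAG_INVALID) == MODEL_FLAG_INVALID);
    assert(model_test_and_clear_flags(MODEL_FLAG_INVALID) == 0);
    assert(model_test_and_clear_flags(MODEL_FLAG_INEXACT) == MODEL_FLAG_INEXACT);
}
```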


<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>

<P>
Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
SoftFloat introduced numerous technical differences compared to earlier
releases.
</P>

<H3>9.1. Name Changes</H3>

<P>
The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
is that the names of most functions and variables have changed, even when the
behavior has not.
First, the floating-point types, the mode variables, the exception flags
variable, the function to raise exceptions, and various associated constants
have been renamed as follows:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>old name, Release 2:</TD>
<TD>new name, Release 3:</TD>
</TR>
<TR>
<TD><CODE>float32</CODE></TD>
<TD><CODE>float32_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float64</CODE></TD>
<TD><CODE>float64_t</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80</CODE></TD>
<TD><CODE>extFloat80_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float128</CODE></TD>
<TD><CODE>float128_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float_rounding_mode</CODE></TD>
<TD><CODE>softfloat_roundingMode</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_nearest_even</CODE></TD>
<TD><CODE>softfloat_round_near_even</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_to_zero</CODE></TD>
<TD><CODE>softfloat_round_minMag</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_down</CODE></TD>
<TD><CODE>softfloat_round_min</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_up</CODE></TD>
<TD><CODE>softfloat_round_max</CODE></TD>
</TR>
<TR>
<TD><CODE>float_detect_tininess</CODE></TD>
<TD><CODE>softfloat_detectTininess</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_before_rounding&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_after_rounding</CODE></TD>
<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80_rounding_precision</CODE></TD>
<TD><CODE>extF80_roundingPrecision</CODE></TD>
</TR>
<TR>
<TD><CODE>float_exception_flags</CODE></TD>
<TD><CODE>softfloat_exceptionFlags</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_inexact</CODE></TD>
<TD><CODE>softfloat_flag_inexact</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_underflow</CODE></TD>
<TD><CODE>softfloat_flag_underflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_overflow</CODE></TD>
<TD><CODE>softfloat_flag_overflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_divbyzero</CODE></TD>
<TD><CODE>softfloat_flag_infinite</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_invalid</CODE></TD>
<TD><CODE>softfloat_flag_invalid</CODE></TD>
</TR>
<TR>
<TD><CODE>float_raise</CODE></TD>
<TD><CODE>softfloat_raiseFlags</CODE></TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>used in names in Release 3:</TD>
</TR>
<TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
<TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
</TABLE>
</BLOCKQUOTE>
Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
is now <CODE>f32_add</CODE>.
Lastly, there have been a few other changes to function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>used in names in Release 3:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>relevant functions:</TD>
</TR>
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<H3>9.2. Changes to Function Arguments</H3>

<P>
Besides simple name changes, some operations were given a different interface
in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
<CODE>uint32_t</CODE>, whereas previously their types could be defined
differently for each port of SoftFloat, usually using traditional C types such
as <CODE>unsigned</CODE> <CODE>int</CODE>.
Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
previously these were again passed as a port-specific type (usually
<CODE>int</CODE>).
</P>

<LI>
<P>
As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
values through pointers, meaning that functions take pointer arguments and then
read or write floating-point values at the locations indicated by the pointers.
In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
passed by value, regardless of their size.
</P>

<LI>
<P>
Functions that round to an integer have additional
<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
they did not have in <NOBR>Release 2</NOBR>.
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
since <NOBR>Release 3</NOBR>.
For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
same global variable that affects the basic arithmetic operations (now called
<CODE>softfloat_roundingMode</CODE> but previously known as
<CODE>float_rounding_mode</CODE>).
Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
an exact integer value, and if the <I>invalid</I> exception was not raised by
the function, the <I>inexact</I> exception was always raised.
<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
case.
Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
effect as <NOBR>Release 2</NOBR> by passing variable
<CODE>softfloat_roundingMode</CODE> for argument
<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
<CODE><I>exact</I></CODE>.
</P>

</UL>
</P>

<H3>9.3. Added Capabilities</H3>

<P>
With <NOBR>Release 3</NOBR>, some new features have been added that were not
present in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
A port of SoftFloat can now define any of the floating-point types
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
<CODE>float128_t</CODE> as aliases for C&rsquo;s standard floating-point types
<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
This potential convenience was not supported under <NOBR>Release 2</NOBR>.
</P>

<P>
(Note, however, that there may be a performance cost to defining
SoftFloat&rsquo;s floating-point types this way, depending on the platform and
the applications using SoftFloat.
Ports of SoftFloat may choose to forgo the convenience in favor of better
speed.)
</P>

<LI>
<P>
As of <NOBR>Release 3b</NOBR>, <NOBR>16-bit</NOBR> half-precision,
<CODE>float16_t</CODE>, is supported.
</P>

<LI>
<P>
Functions have been added for converting between the floating-point types and
unsigned integers.
<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
</P>

<LI>
<P>
Fused multiply-add functions have been added for all floating-point formats
except <NOBR>80-bit</NOBR> double-extended-precision,
<CODE>extFloat80_t</CODE>.
</P>

<LI>
<P>
New rounding modes are supported:
<CODE>softfloat_round_near_maxMag</CODE> (round to nearest, with ties to
maximum magnitude, away from zero), and, as of <NOBR>Release 3c</NOBR>,
optional <CODE>softfloat_round_odd</CODE> (round to odd, also known as
jamming).
</P>

</UL>
</P>

<H3>9.4. Better Compatibility with the C Language</H3>

<P>
<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
Standard&rsquo;s rules for portability.
For example, older releases of SoftFloat employed type conversions in ways
that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior the C Standard now defines more strictly.
</P>
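<P>
As an illustration of the kind of change described, the sketch below reads the
bit pattern of a <CODE>float</CODE> through a union instead of through a cast
pointer.
The function name is hypothetical, the code is not taken from SoftFloat, and
the example assumes that the host <CODE>float</CODE> is the IEEE
<NOBR>32-bit</NOBR> binary format.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Returns the bit pattern of a float by storing one union member and reading
   the other.  C99 (with its technical corrigenda) and C11 describe such a read
   as reinterpreting the stored bytes. */
uint32_t f32_bits_via_union(float f)
{
    union { float f; uint32_t u; } pun;
    assert(sizeof pun.f == sizeof pun.u);   /* the sketch assumes a 32-bit float */
    pun.f = f;
    return pun.u;
}

/* The commonly practiced alternative is shown only as a comment, because
   reading a float object through a uint32_t pointer breaks C's aliasing rules:

       uint32_t bits = *(uint32_t *) &f;
*/
```

<P>
Under the stated assumption, the value 1.0 yields the pattern
<CODE>0x3F800000</CODE>.
</P>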

<H3>9.5. New Organization as a Library</H3>

<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
Previously, SoftFloat compiled into a single, monolithic object file containing
all the SoftFloat functions, with the consequence that a program linking with
SoftFloat would get every SoftFloat function in its binary file even if only a
few functions were actually used.
With SoftFloat in the form of a library, a program that is linked by a standard
linker will include only those functions of SoftFloat that it needs and no
others.
</P>

<H3>9.6. Optimization Gains (and Losses)</H3>

<P>
Individual SoftFloat functions have been variously improved in
<NOBR>Release 3</NOBR> compared to earlier releases.
In particular, better, faster algorithms have been deployed for the operations
of division, square root, and remainder.
For functions operating on the larger <NOBR>80-bit</NOBR> and
<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE>, code size has also generally been reduced.
</P>

<P>
However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
single object file, compilers could make optimizations across function calls
when one SoftFloat function calls another.
Now that the functions of SoftFloat are compiled separately and only afterward
linked together into a program, there is not usually the same opportunity to
optimize across function calls.
Some loss of speed has been observed due to this change.
</P>


<H2>10. Future Directions</H2>

<P>
The following improvements are anticipated for future releases of SoftFloat:
<UL>
<LI>
more functions from the 2008 version of the IEEE Floating-Point Standard;
<LI>
consistent, defined behavior for non-canonical representations of extended
format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).

</UL>
</P>


<H2>11. Contact Information</H2>

<P>
At the time of this writing, the most up-to-date information about SoftFloat
and the latest release can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></NOBR></A>.
</P>


</BODY>