SoftFloat.html

Last change on this file was 94480, checked in by vboxsync, 3 years ago
libs/softfloat-3e: Copied from vendor branch (SoftFloat-3e.zip, md5: 7dac954ea4aed0697cbfee800ba4f492). bugref:9898

2	<HTML>
3
4	<HEAD>
5	<TITLE>Berkeley SoftFloat Library Interface</TITLE>
6	</HEAD>
7
8	<BODY>
9
10	<H1>Berkeley SoftFloat Release 3e: Library Interface</H1>
11
12	<P>
13	John R. Hauser<BR>
14	2018 January 20<BR>
15	</P>
16
17
18	<H2>Contents</H2>
19
20	<BLOCKQUOTE>
21	<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
22	<COL WIDTH=25>
23	<COL WIDTH=*>
24	<TR><TD COLSPAN=2>1. Introduction</TD></TR>
25	<TR><TD COLSPAN=2>2. Limitations</TD></TR>
26	<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
27	<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
28	<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
29	<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
30	<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
31	<TR>
32	<TD></TD>
33	<TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
34	</TR>
35	<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
36	<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
37	<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
38	<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
39	<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
40	<TR>
41	<TD></TD>
42	<TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
43	</TR>
44	<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
45	<TR><TD COLSPAN=2>8. Function Details</TD></TR>
46	<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
47	<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
48	<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
49	<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
50	<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
51	<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
52	<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
53	<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
54	<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
55	<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
56	<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
57	<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
58	<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
59	<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
60	<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
61	<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
62	<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
63	<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
64	<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
65	</TABLE>
66	</BLOCKQUOTE>
67
68
69	<H2>1. Introduction</H2>
70
71	<P>
72	Berkeley SoftFloat is a software implementation of binary floating-point that
73	conforms to the IEEE Standard for Floating-Point Arithmetic.
74	The current release supports five binary formats: <NOBR>16-bit</NOBR>
75	half-precision, <NOBR>32-bit</NOBR> single-precision, <NOBR>64-bit</NOBR>
76	double-precision, <NOBR>80-bit</NOBR> double-extended-precision, and
77	<NOBR>128-bit</NOBR> quadruple-precision.
78	The following functions are supported for each format:
79	<UL>
80	<LI>
81	addition, subtraction, multiplication, division, and square root;
82	<LI>
83	fused multiply-add as defined by the IEEE Standard, except for
84	<NOBR>80-bit</NOBR> double-extended-precision;
85	<LI>
86	remainder as defined by the IEEE Standard;
87	<LI>
88	round to integral value;
89	<LI>
90	comparisons;
91	<LI>
92	conversions to/from other supported formats; and
93	<LI>
94	conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
95	signed and unsigned.
96	</UL>
97	All operations required by the original 1985 version of the IEEE Floating-Point
98	Standard are implemented, except for conversions to and from decimal.
99	</P>
100
101	<P>
102	This document gives information about the types defined and the routines
103	implemented by SoftFloat.
104	It does not attempt to define or explain the IEEE Floating-Point Standard.
105	Information about the standard is available elsewhere.
106	</P>
107
108	<P>
109	The current version of SoftFloat is <NOBR>Release 3e</NOBR>.
110	This release modifies the behavior of the rarely used <I>odd</I> rounding mode
111	(<I>round to odd</I>, also known as <I>jamming</I>), and also adds some new
112	specialization and optimization examples for those compiling SoftFloat.
113	</P>
114
115	<P>
116	The previous <NOBR>Release 3d</NOBR> fixed bugs that were found in the square
117	root functions for the <NOBR>64-bit</NOBR>, <NOBR>80-bit</NOBR>, and
118	<NOBR>128-bit</NOBR> floating-point formats.
119	(Thanks to Alexei Sibidanov at the University of Victoria for reporting an
120	incorrect result.)
121	The bugs affected all prior <NOBR>Release-3</NOBR> versions of SoftFloat
122	<NOBR>through 3c</NOBR>.
123	The flaw in the <NOBR>64-bit</NOBR> floating-point square root function was of
124	very minor impact, causing a <NOBR>1-ulp</NOBR> error (<NOBR>1 unit</NOBR> in
125	the last place) a few times out of a billion.
126	The bugs in the <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> square root
127	functions were more serious.
128	Although incorrect results again occurred only a few times out of a billion,
129	when they did occur, a large portion of the less-significant bits could be
130	wrong.
131	</P>
132
133	<P>
134	Among earlier releases, 3b was notable for adding support for the
135	<NOBR>16-bit</NOBR> half-precision format.
136	For more about the evolution of SoftFloat releases, see
137	<A HREF="SoftFloat-history.html"><NOBR><CODE>SoftFloat-history.html</CODE></NOBR></A>.
138	</P>
139
140	<P>
141	The functional interface of SoftFloat <NOBR>Release 3</NOBR> and later differs
142	in many details from the releases that came before.
143	For specifics of these differences, see <NOBR>section 9</NOBR> below,
144	<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
145	</P>
146
147
148	<H2>2. Limitations</H2>
149
150	<P>
151	SoftFloat assumes the computer has an addressable byte size of 8 or
152	<NOBR>16 bits</NOBR>.
153	(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
154	</P>
155
156	<P>
157	SoftFloat is written in C and is designed to work with other C code.
158	The C compiler used must conform at a minimum to the 1989 ANSI standard for the
159	C language (same as the 1990 ISO standard) and must in addition support basic
160	arithmetic on <NOBR>64-bit</NOBR> integers.
161	Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
162	single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
163	did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
164	starting with <NOBR>Release 3</NOBR>.
165	Since 1999, ISO standards for C have mandated compiler support for
166	<NOBR>64-bit</NOBR> integers.
167	A compiler conforming to the 1999 C Standard or later is recommended but not
168	strictly required.
169	</P>
170
171	<P>
172	Most operations not required by the original 1985 version of the IEEE
173	Floating-Point Standard but added in the 2008 version are not yet supported in
174	SoftFloat <NOBR>Release 3e</NOBR>.
175	</P>
176
177
178	<H2>3. Acknowledgments and License</H2>
179
180	<P>
181	The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
182	<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
183	supplanting earlier releases.
184	The project to create <NOBR>Release 3</NOBR> (now <NOBR>through 3e</NOBR>) was
185	done in the employ of the University of California, Berkeley, within the
186	Department of Electrical Engineering and Computer Sciences, first for the
187	Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
188	The work was officially overseen by Prof. Krste Asanovic, with funding provided
189	by these sources:
190	<BLOCKQUOTE>
191	<TABLE>
192	<COL>
193	<COL WIDTH=10>
194	<COL>
195	<TR>
196	<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
197	<TD></TD>
198	<TD>
199	Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
200	(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
201	NVIDIA, Oracle, and Samsung.
202	</TD>
203	</TR>
204	<TR>
205	<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
206	<TD></TD>
207	<TD>
208	DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
209	ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
210	Oracle, and Samsung.
211	</TD>
212	</TR>
213	</TABLE>
214	</BLOCKQUOTE>
215	</P>
216
217	<P>
218	The following applies to the whole of SoftFloat <NOBR>Release 3e</NOBR> as well
219	as to each source file individually.
220	</P>
221
222	<P>
223	Copyright 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 The Regents of the
224	University of California.
225	All rights reserved.
226	</P>
227
228	<P>
229	Redistribution and use in source and binary forms, with or without
230	modification, are permitted provided that the following conditions are met:
231	<OL>
232
233	<LI>
234	<P>
235	Redistributions of source code must retain the above copyright notice, this
236	list of conditions, and the following disclaimer.
237	</P>
238
239	<LI>
240	<P>
241	Redistributions in binary form must reproduce the above copyright notice, this
242	list of conditions, and the following disclaimer in the documentation and/or
243	other materials provided with the distribution.
244	</P>
245
246	<LI>
247	<P>
248	Neither the name of the University nor the names of its contributors may be
249	used to endorse or promote products derived from this software without specific
250	prior written permission.
251	</P>
252
253	</OL>
254	</P>
255
256	<P>
257	THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”,
258	AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
259	IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
260	DISCLAIMED.
261	IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
262	INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
263	BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
264	DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
265	LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
266	OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
267	ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
268	</P>
269
270
271	<H2>4. Types and Functions</H2>
272
273	<P>
274	The types and functions of SoftFloat are declared in header file
275	<CODE>softfloat.h</CODE>.
276	</P>
277
278	<H3>4.1. Boolean and Integer Types</H3>
279
280	<P>
281	Header file <CODE>softfloat.h</CODE> depends on standard headers
282	<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
283	<CODE>bool</CODE> and several integer types.
284	These standard headers have been part of the ISO C Standard Library since 1999.
285	With any recent compiler, they are likely to be supported, even if the compiler
286	does not claim complete conformance to the latest ISO C Standard.
287	For older or nonstandard compilers, a port of SoftFloat may have substitutes
288	for these headers.
289	Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
290	<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
291	<CODE>&lt;stdint.h&gt;</CODE>:
292	<BLOCKQUOTE>
293	<PRE>
294	uint16_t
295	uint32_t
296	uint64_t
297	int32_t
298	int64_t
299	uint_fast8_t
300	uint_fast32_t
301	uint_fast64_t
302	int_fast32_t
303	int_fast64_t
304	</PRE>
305	</BLOCKQUOTE>
306	</P>
307
308
309	<H3>4.2. Floating-Point Types</H3>
310
311	<P>
312	The <CODE>softfloat.h</CODE> header defines five floating-point types:
313	<BLOCKQUOTE>
314	<TABLE CELLSPACING=0 CELLPADDING=0>
315	<TR>
316	<TD><CODE>float16_t</CODE></TD>
317	<TD><NOBR>16-bit</NOBR> half-precision binary format</TD>
318	</TR>
319	<TR>
320	<TD><CODE>float32_t</CODE></TD>
321	<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
322	</TR>
323	<TR>
324	<TD><CODE>float64_t</CODE></TD>
325	<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
326	</TR>
327	<TR>
328	<TD><CODE>extFloat80_t   </CODE></TD>
329	<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
330	Motorola format)</TD>
331	</TR>
332	<TR>
333	<TD><CODE>float128_t</CODE></TD>
334	<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
335	</TR>
336	</TABLE>
337	</BLOCKQUOTE>
338	The non-extended types are each exactly the size specified:
339	<NOBR>16 bits</NOBR> for <CODE>float16_t</CODE>, <NOBR>32 bits</NOBR> for
340	<CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for <CODE>float64_t</CODE>, and
341	<NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
342	Aside from these size requirements, the definitions of all these types may
343	differ for different ports of SoftFloat to specific systems.
344	A given port of SoftFloat may or may not define some of the floating-point
345	types as aliases for the C standard types <CODE>float</CODE>,
346	<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
347	</P>
348
349	<P>
350	Header file <CODE>softfloat.h</CODE> also defines a structure,
351	<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
352	<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
353	This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
354	at least these two fields (not necessarily in this order):
355	<BLOCKQUOTE>
356	<PRE>
357	uint16_t signExp;
358	uint64_t signif;
359	</PRE>
360	</BLOCKQUOTE>
361	Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
362	value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
363	encoded exponent in the other <NOBR>15 bits</NOBR>.
364	Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
365	the floating-point value.
366	(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
367	leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
368	in the most significant bit of the significand.)
369	</P>
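<P>
As a minimal sketch of the field layout just described, the following shows how
the sign, the encoded exponent, and the explicit leading significand bit could
be read out of the two fields.
The structure <CODE>demo_extFloat80M</CODE> and the three helper functions are
hypothetical stand-ins written for this illustration; the real
<CODE>struct</CODE> <CODE>extFloat80M</CODE> comes from <CODE>softfloat.h</CODE>,
and its field order and any padding depend on the port.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M (illustration only).
   The real structure is defined by softfloat.h and may order its
   fields differently. */
struct demo_extFloat80M {
    uint16_t signExp;
    uint64_t signif;
};

/* The sign is the most significant bit (bit 15) of signExp. */
static inline bool demo_sign( const struct demo_extFloat80M *p )
{
    return (p->signExp >> 15) != 0;
}

/* The encoded exponent is the other 15 bits of signExp. */
static inline uint16_t demo_exp( const struct demo_extFloat80M *p )
{
    return (uint16_t) (p->signExp & 0x7FFF);
}

/* The leading significand bit is stored explicitly as bit 63 of signif. */
static inline bool demo_leadingBit( const struct demo_extFloat80M *p )
{
    return (p->signif >> 63) != 0;
}
```

<P>
With the format's exponent bias of 16383 (<CODE>0x3FFF</CODE>), the value 1.0
would be held as <CODE>signExp</CODE> = <CODE>0x3FFF</CODE> and
<CODE>signif</CODE> = <CODE>0x8000000000000000</CODE>.
</P>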
370
371	<H3>4.3. Supported Floating-Point Functions</H3>
372
373	<P>
374	SoftFloat implements these arithmetic operations for its floating-point types:
375	<UL>
376	<LI>
377	conversions between any two floating-point formats;
378	<LI>
379	for each floating-point format, conversions to and from signed and unsigned
380	<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
381	<LI>
382	for each format, the usual addition, subtraction, multiplication, division, and
383	square root operations;
384	<LI>
385	for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
386	operation defined by the IEEE Standard;
387	<LI>
388	for each format, the floating-point remainder operation defined by the IEEE
389	Standard;
390	<LI>
391	for each format, a “round to integer” operation that rounds to the
392	nearest integer value in the same format; and
393	<LI>
394	comparisons between two values in the same floating-point format.
395	</UL>
396	</P>
397
398	<P>
399	The following operations required by the 2008 IEEE Floating-Point Standard are
400	not supported in SoftFloat <NOBR>Release 3e</NOBR>:
401	<UL>
402	<LI>
403	<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
404	<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
405	<LI>
406	conversions between floating-point formats and decimal or hexadecimal character
407	sequences;
408	<LI>
409	all “quiet-computation” operations (<B>copy</B>, <B>negate</B>,
410	<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
411	manipulation of the floating-point sign bit); and
412	<LI>
413	all “non-computational” operations other than <B>isSignaling</B>
414	(which is supported).
415	</UL>
416	</P>
417
418	<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>
419
420	<P>
421	Because the <NOBR>80-bit</NOBR> double-extended-precision format,
422	<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
423	finite floating-point numbers are encodable in this type in multiple equivalent
424	forms.
425	Of these multiple encodings, there is always a unique one with the least
426	encoded exponent value, and this encoding is considered the <I>canonical</I>
427	representation of the floating-point number.
428	Any other equivalent representations (having a higher encoded exponent value)
429	are <I>non-canonical</I>.
430	For a value in the subnormal range (including zero), the canonical
431	representation always has an encoded exponent of zero and a leading significand
432	bit <NOBR>of 0</NOBR>.
433	For finite values outside the subnormal range, the canonical representation
434	always has an encoded exponent that is nonzero and a leading significand bit
435	<NOBR>of 1</NOBR>.
436	</P>
437
438	<P>
439	For an infinity or NaN, the leading significand bit is similarly expected to
440	<NOBR>be 1</NOBR>.
441	An infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
442	considered non-canonical.
443	Hence, altogether, to be canonical, a value of type <CODE>extFloat80_t</CODE>
444	must have a leading significand bit <NOBR>of 1</NOBR>, unless the value is
445	subnormal or zero, in which case the leading significand bit and the encoded
446	exponent must both be zero.
447	</P>
448
449	<P>
450	SoftFloat’s functions are not guaranteed to operate as expected when
451	inputs of type <CODE>extFloat80_t</CODE> are non-canonical.
452	Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any)
453	are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
454	be canonical.
455	</P>
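<P>
The summary rule above can be sketched as a small predicate that a caller
might apply to untrusted <CODE>extFloat80_t</CODE> data before handing it to
SoftFloat.
The function name <CODE>demo_extF80_isCanonical</CODE> is hypothetical and is
not part of the SoftFloat interface; its two parameters are assumed to carry
the contents of the <CODE>signExp</CODE> and <CODE>signif</CODE> fields
described in <NOBR>section 4.2</NOBR>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not provided by SoftFloat): returns true if the
   encoding satisfies the canonical-form rule stated in this section. */
static inline bool demo_extF80_isCanonical( uint16_t signExp, uint64_t signif )
{
    uint16_t exp = (uint16_t) (signExp & 0x7FFF);   /* encoded exponent */
    bool leadingBit = (signif >> 63) != 0;          /* explicit leading bit */

    if ( leadingBit ) {
        /* Normal numbers, infinities, and NaNs: a leading 1 must be paired
           with a nonzero encoded exponent. */
        return exp != 0;
    }
    /* Leading bit 0: canonical only for zero and subnormals, which also
       require an encoded exponent of zero. */
    return exp == 0;
}
```

<P>
Under this sketch, an infinity encoded with a leading significand bit
<NOBR>of 0</NOBR> and a finite value with a nonzero exponent but a leading bit
<NOBR>of 0</NOBR> are both reported as non-canonical.
</P>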
456
457	<H3>4.5. Conventions for Passing Arguments and Results</H3>
458
459	<P>
460	Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
461	<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
462	cases passed as function arguments by value.
463	Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
464	is always returned directly as the function result.
465	Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
466	floating-point values has this simple signature:
467	<BLOCKQUOTE>
468	<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
469	</BLOCKQUOTE>
470	</P>
471
472	<P>
473	The story is more complex when function inputs and outputs are
474	<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
475	For these types, SoftFloat always provides a function that passes these larger
476	values into or out of the function indirectly, via pointers.
477	For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
478	SoftFloat supplies this function:
479	<BLOCKQUOTE>
480	<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
481	</BLOCKQUOTE>
482	The first two arguments point to the values to be added, and the last argument
483	points to the location where the sum will be stored.
484	The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
485	that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”,
486	pointed to by pointer arguments.
487	</P>
488
489	<P>
490	All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
491	types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
492	At the same time, SoftFloat ports may also implement alternate versions of
493	these same functions that pass <CODE>extFloat80_t</CODE> and
494	<CODE>float128_t</CODE> by value, like the smaller formats.
495	Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
496	SoftFloat port may also supply an equivalent function with this signature:
497	<BLOCKQUOTE>
498	<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
499	</BLOCKQUOTE>
500	</P>
501
502	<P>
503	As a general rule, on computers where the machine word size is
504	<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
505	(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
506	and <CODE>float128_t</CODE>, because passing such large types directly can have
507	significant extra cost.
508	On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
509	function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
510	provided, because the cost of passing by value is then more reasonable.
511	Applications that must be portable across both classes of computers must use
512	the pointer-based functions, as these are always implemented.
513	However, if it is known that SoftFloat includes the by-value functions for all
514	platforms of interest, programmers can use whichever version they prefer.
515	</P>
516
517
518	<H2>5. Reserved Names</H2>
519
520	<P>
521	In addition to the variables and functions documented here, SoftFloat defines
522	some symbol names for its own private use.
523	These private names always begin with the prefix
524	‘<CODE>softfloat_</CODE>’.
525	When a program includes header <CODE>softfloat.h</CODE> or links with the
526	SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’
527	are reserved for possible use by SoftFloat.
528	Applications that use SoftFloat should not define their own names with this
529	prefix, and should reference only such names as are documented.
530	</P>
531
532
533	<H2>6. Mode Variables</H2>
534
535	<P>
536	The following global variables control rounding mode, underflow detection, and
537	the <NOBR>80-bit</NOBR> extended format’s rounding precision:
538	<BLOCKQUOTE>
539	<CODE>softfloat_roundingMode</CODE><BR>
540	<CODE>softfloat_detectTininess</CODE><BR>
541	<CODE>extF80_roundingPrecision</CODE>
542	</BLOCKQUOTE>
543	These mode variables are covered in the next several subsections.
544	For some SoftFloat ports, these variables may be <I>per-thread</I> (declared
545	<CODE>thread_local</CODE>), meaning that different execution threads have their
546	own separate copies of the variables.
547	</P>
548
549	<H3>6.1. Rounding Mode</H3>
550
551	<P>
552	All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
553	implemented for all operations that require rounding.
554	Some ports of SoftFloat may also implement the <I>round-to-odd</I> mode.
555	</P>
556
557	<P>
558	The rounding mode is selected by the global variable
559	<BLOCKQUOTE>
560	<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
561	</BLOCKQUOTE>
562	This variable may be set to one of the values
563	<BLOCKQUOTE>
564	<TABLE CELLSPACING=0 CELLPADDING=0>
565	<TR>
566	<TD><CODE>softfloat_round_near_even</CODE></TD>
567	<TD>round to nearest, with ties to even</TD>
568	</TR>
569	<TR>
570	<TD><CODE>softfloat_round_near_maxMag  </CODE></TD>
571	<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
572	</TR>
573	<TR>
574	<TD><CODE>softfloat_round_minMag</CODE></TD>
575	<TD>round to minimum magnitude (toward zero)</TD>
576	</TR>
577	<TR>
578	<TD><CODE>softfloat_round_min</CODE></TD>
579	<TD>round to minimum (down)</TD>
580	</TR>
581	<TR>
582	<TD><CODE>softfloat_round_max</CODE></TD>
583	<TD>round to maximum (up)</TD>
584	</TR>
585	<TR>
586	<TD><CODE>softfloat_round_odd</CODE></TD>
587	<TD>round to odd (jamming), if supported by the SoftFloat port</TD>
588	</TR>
589	</TABLE>
590	</BLOCKQUOTE>
591	Variable <CODE>softfloat_roundingMode</CODE> is initialized to
592	<CODE>softfloat_round_near_even</CODE>.
593	</P>
594
595	<P>
596	When <CODE>softfloat_round_odd</CODE> is the rounding mode for a function that
597	rounds to an integer value (either conversion to an integer format or a
598	‘<CODE>roundToInt</CODE>’ function), if the input is not already an
599	integer, the rounded result is the closest <EM>odd</EM> integer.
600	For other operations, this rounding mode acts as though the floating-point
601	result is first rounded to minimum magnitude, the same as
602	<CODE>softfloat_round_minMag</CODE>, and then, if the result is inexact, the
603	least-significant bit of the result is set <NOBR>to 1</NOBR>.
604	Rounding to odd is also known as <EM>jamming</EM>.
605	</P>
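<P>
The second case described above (round to minimum magnitude, then set the
least-significant bit if the result is inexact) can be sketched on a plain
integer that is being shortened by a right shift.
The helper name <CODE>demo_shiftRightJam64</CODE> is hypothetical and is not
part of the documented SoftFloat interface; the sketch only illustrates the
jamming step itself, not a complete floating-point rounding.
</P>

```c
#include <stdint.h>

/* Hypothetical helper (illustration only): shift a right by dist bits,
   1 <= dist <= 63, rounding the result to odd ("jamming"). */
static inline uint64_t demo_shiftRightJam64( uint64_t a, unsigned dist )
{
    /* Truncation is rounding to minimum magnitude. */
    uint64_t truncated = a >> dist;
    /* The bits shifted out tell whether the truncated result is inexact. */
    uint64_t discarded = a & (((uint64_t) 1 << dist) - 1);
    /* If anything was lost, force the least-significant bit to 1. */
    return truncated | (uint64_t) (discarded != 0);
}
```

<P>
For instance, shifting binary 1001 right by two places gives 10 by truncation
and 11 after jamming, while binary 1000 shifts to exactly 10 and is left
unchanged.
</P>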
606
607	<H3>6.2. Underflow Detection</H3>
608
609	<P>
610	In the terminology of the IEEE Standard, SoftFloat can detect tininess for
611	underflow either before or after rounding.
612	The choice is made by the global variable
613	<BLOCKQUOTE>
614	<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
615	</BLOCKQUOTE>
616	which can be set to either
617	<BLOCKQUOTE>
618	<CODE>softfloat_tininess_beforeRounding</CODE><BR>
619	<CODE>softfloat_tininess_afterRounding</CODE>
620	</BLOCKQUOTE>
621	Detecting tininess after rounding is usually better because it results in fewer
622	spurious underflow signals.
623	The other option is provided for compatibility with some systems.
624	Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
625	always detects loss of accuracy for underflow as an inexact result.
626	</P>
627
628	<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
629
630	<P>
631	For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
632	arithmetic operations is controlled by the global variable
633	<BLOCKQUOTE>
634	<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
635	</BLOCKQUOTE>
636	The operations affected are:
637	<BLOCKQUOTE>
638	<CODE>extF80_add</CODE><BR>
639	<CODE>extF80_sub</CODE><BR>
640	<CODE>extF80_mul</CODE><BR>
641	<CODE>extF80_div</CODE><BR>
642	<CODE>extF80_sqrt</CODE>
643	</BLOCKQUOTE>
644	When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
645	these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
646	double-extended-precision format, as occurs for other formats.
647	Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
648	operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
649	<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
650	<CODE>float64_t</CODE>), respectively.
651	When rounding to reduced precision, additional bits in the result significand
652	beyond the rounding point are set to zero.
653	The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
654	other than 32, 64, or 80 are not specified.
655	Operations other than the ones listed above are not affected by
656	<CODE>extF80_roundingPrecision</CODE>.
657	</P>
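<P>
To make concrete what it means for bits beyond the rounding point to be set to
zero, here is a rough sketch that rounds a <NOBR>64-bit</NOBR> significand to
a shorter precision.
It assumes that <NOBR>32-bit</NOBR> precision keeps 24 significand bits and
<NOBR>64-bit</NOBR> precision keeps 53, matching <CODE>float32_t</CODE> and
<CODE>float64_t</CODE>, and it handles only the round-to-nearest-even mode.
The helper <CODE>demo_roundSignif</CODE> is hypothetical and does not
reproduce SoftFloat's actual implementation; exponent handling is left to the
caller.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): round sig to its top keepBits
   bits (1 <= keepBits <= 63), ties to even, zeroing all lower bits.
   *carryOut is set when rounding carries out of bit 63; the significand
   returned is then 0x8000000000000000 and the caller would have to
   increment the exponent. */
static inline uint64_t
 demo_roundSignif( uint64_t sig, unsigned keepBits, bool *carryOut )
{
    unsigned dropBits = 64 - keepBits;
    uint64_t unit = (uint64_t) 1 << dropBits;   /* weight of last kept bit */
    uint64_t half = unit >> 1;
    uint64_t dropped = sig & (unit - 1);        /* bits past rounding point */
    uint64_t kept = sig & ~(unit - 1);          /* low bits already zeroed */
    bool roundUp =
        (dropped > half) || ((dropped == half) && ((kept & unit) != 0));

    *carryOut = false;
    if ( roundUp ) {
        kept += unit;
        if ( kept == 0 ) {                      /* wrapped past bit 63 */
            *carryOut = true;
            kept = UINT64_C( 0x8000000000000000 );
        }
    }
    return kept;
}
```

<P>
Calling this sketch with <CODE>keepBits</CODE> of 24 or 53 leaves the low 40
or 11 bits of the returned significand at zero, respectively.
</P>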
658
659
660	<H2>7. Exceptions and Exception Flags</H2>
661
662	<P>
663	All five exception flags required by the IEEE Floating-Point Standard are
664	implemented.
665	Each flag is stored as a separate bit in the global variable
666	<BLOCKQUOTE>
667	<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
668	</BLOCKQUOTE>
669	The positions of the exception flag bits within this variable are determined by
670	the bit masks
671	<BLOCKQUOTE>
672	<CODE>softfloat_flag_inexact</CODE><BR>
673	<CODE>softfloat_flag_underflow</CODE><BR>
674	<CODE>softfloat_flag_overflow</CODE><BR>
675	<CODE>softfloat_flag_infinite</CODE><BR>
676	<CODE>softfloat_flag_invalid</CODE>
677	</BLOCKQUOTE>
678	Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
679	meaning no exceptions.
680	</P>
681
682	<P>
683	For some SoftFloat ports, <CODE>softfloat_exceptionFlags</CODE> may be
684	<I>per-thread</I> (declared <CODE>thread_local</CODE>), meaning that different
685	execution threads have their own separate instances of it.
686	</P>
687
688	<P>
689	An individual exception flag can be cleared with the statement
690	<BLOCKQUOTE>
691	<CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
692	</BLOCKQUOTE>
693	where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
694	To raise a floating-point exception, function <CODE>softfloat_raiseFlags</CODE>
695	should normally be used.
696	</P>
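<P>
The clearing statement above is often paired with a test of the same bit.
A minimal sketch of that test-and-clear idiom follows.
So that it compiles on its own, it uses stand-in names
(<CODE>demo_exceptionFlags</CODE>, <CODE>demo_flag_inexact</CODE>,
<CODE>demo_flag_invalid</CODE>) with assumed mask values; code linked against
SoftFloat would instead use <CODE>softfloat_exceptionFlags</CODE> and the
<CODE>softfloat_flag_</CODE> masks from <CODE>softfloat.h</CODE>, whose bit
positions are determined by the library.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for illustration only.  The mask values are assumptions; the
   real masks and the real flags variable are declared in softfloat.h. */
enum {
    demo_flag_inexact = 1,
    demo_flag_invalid = 16
};
static uint_fast8_t demo_exceptionFlags = 0;

/* Report whether the flag selected by mask is set, then clear only that
   flag, leaving all other flags untouched. */
static inline bool demo_testAndClearFlag( uint_fast8_t mask )
{
    bool wasSet = (demo_exceptionFlags & mask) != 0;
    demo_exceptionFlags &= (uint_fast8_t) ~mask;
    return wasSet;
}
```

<P>
A typical pattern would be to clear the flags of interest, perform one or more
floating-point operations, and then test the flags to learn whether any of the
operations raised them.
</P>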
697
698	<P>
699	When SoftFloat detects an exception other than <I>inexact</I>, it calls
700	<CODE>softfloat_raiseFlags</CODE>.
701	The default version of this function simply raises the corresponding exception
702	flags.
703	Particular ports of SoftFloat may support alternate behavior, such as exception
704	traps, by modifying the default <CODE>softfloat_raiseFlags</CODE>.
705	A program may also supply its own <CODE>softfloat_raiseFlags</CODE> function to
706	override the one from the SoftFloat library.
707	</P>
708
709	<P>
710	Because inexact results occur frequently under most circumstances (and thus are
711	hardly exceptional), SoftFloat does not ordinarily call
712	<CODE>softfloat_raiseFlags</CODE> for <I>inexact</I> exceptions.
713	It does always raise the <I>inexact</I> exception flag as required.
714	</P>
715
716
717	<H2>8. Function Details</H2>
718
719	<P>
720	In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
721	a substitute for one of these abbreviations:
722	<BLOCKQUOTE>
723	<TABLE CELLSPACING=0 CELLPADDING=0>
724	<TR>
725	<TD><CODE>f16</CODE></TD>
726	<TD>indicates <CODE>float16_t</CODE>, passed by value</TD>
727	</TR>
728	<TR>
729	<TD><CODE>f32</CODE></TD>
730	<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
731	</TR>
732	<TR>
733	<TD><CODE>f64</CODE></TD>
734	<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
735	</TR>
736	<TR>
737	<TD><CODE>extF80M   </CODE></TD>
738	<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
739	</TR>
740	<TR>
741	<TD><CODE>extF80</CODE></TD>
742	<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
743	</TR>
744	<TR>
745	<TD><CODE>f128M</CODE></TD>
746	<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
747	</TR>
748	<TR>
749	<TD><CODE>f128</CODE></TD>
750	<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
751	</TR>
752	</TABLE>
753	</BLOCKQUOTE>
754	The circumstances under which values of floating-point types
755	<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
756	value or indirectly via pointers were discussed earlier in
757	<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
758	</P>
759
760	<H3>8.1. Conversions from Integer to Floating-Point</H3>
761
762	<P>
763	All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
764	signed or unsigned, to a floating-point format are supported.
765	Functions performing these conversions have these names:
766	<BLOCKQUOTE>
767	<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
768	<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
769	<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
770	<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
771	</BLOCKQUOTE>
772	Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
773	double-precision and larger formats are always exact, and likewise conversions
774	from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
775	double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
776	always exact.
777	</P>
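<P>
The exactness claims above follow from significand width: an integer converts
exactly when its significant bits fit within the precision of the target
format.
The sketch below checks that condition for a given magnitude.
The helper name <CODE>demo_fitsSignificand</CODE> is hypothetical, and the
precisions quoted in the comment (24, 53, 64, and 113 bits) are the usual
significand widths of the corresponding IEEE and extended formats rather than
values taken from <CODE>softfloat.h</CODE>.
The check ignores exponent range, which can matter for the narrow
<CODE>float16_t</CODE> format.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): true if an integer with the
   given magnitude can be held exactly in a significand of precisionBits
   bits (for example 24 for float32_t, 53 for float64_t, 64 for
   extFloat80_t, 113 for float128_t).  Exponent range is not considered. */
static inline bool
 demo_fitsSignificand( uint64_t magnitude, unsigned precisionBits )
{
    if ( magnitude == 0 ) return true;
    /* Trailing zero bits are absorbed by the exponent and cost nothing. */
    while ( (magnitude & 1) == 0 ) magnitude >>= 1;
    /* Whatever remains must fit within the available significand bits. */
    return (precisionBits >= 64) || ((magnitude >> precisionBits) == 0);
}
```

<P>
Every <NOBR>32-bit</NOBR> magnitude passes this check for 53 bits, and every
<NOBR>64-bit</NOBR> magnitude passes it for 64 bits or more, whereas a value
such as 2<SUP>53</SUP>&nbsp;+&nbsp;1 does not fit in 53 bits and so would be
rounded on conversion to <CODE>float64_t</CODE>.
</P>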
778
779	<P>
780	Each conversion function takes one input of the appropriate type and generates
781	one output.
782	The following illustrates the signatures of these functions in cases when the
783	floating-point result is passed either by value or via pointers:
784	<BLOCKQUOTE>
785	<PRE>
786	float64_t i32_to_f64( int32_t <I>a</I> );
787	</PRE>
788	<PRE>
789	void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
790	</PRE>
791	</BLOCKQUOTE>
792	</P>
793
794	<H3>8.2. Conversions from Floating-Point to Integer</H3>
795
796	<P>
797	Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
798	<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
799	functions:
800	<BLOCKQUOTE>
801	<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
802	<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
803	<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
804	<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
805	</BLOCKQUOTE>
806	The functions have signatures as follows, depending on whether the
807	floating-point input is passed by value or via pointers:
808	<BLOCKQUOTE>
809	<PRE>
810	int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
811	</PRE>
812	<PRE>
813	int_fast32_t
814	f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
815	</PRE>
816	</BLOCKQUOTE>
817	</P>
818
819	<P>
820	The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
821	the conversion.
822	The variable that usually indicates rounding mode,
823	<CODE>softfloat_roundingMode</CODE>, is ignored.
824	Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
825	exception flag is raised if the conversion is not exact.
826	If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
827	be raised;
828	otherwise, it will not be, even if the conversion is inexact.
829	</P>
830
831	<P>
832	A conversion from floating-point to integer format raises the <I>invalid</I>
833	exception if the source value cannot be rounded to a representable integer of
834	the desired size (32 or 64 bits).
835	In such circumstances, the integer result returned is determined by the
836	particular port of SoftFloat, although typically this value will be either the
837	maximum or minimum value of the integer format.
838	The functions that convert to integer types never raise the floating-point
839	<I>overflow</I> exception.
840	</P>
841
842	<P>
843	Because languages such <NOBR>as C</NOBR> require that conversions to integers
844	be rounded toward zero, the following functions are provided for improved speed
845	and convenience:
846	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
851	</BLOCKQUOTE>
852	These functions round only toward zero (to minimum magnitude).
853	The signatures for these functions are the same as above without the redundant
854	<CODE><I>roundingMode</I></CODE> argument:
855	<BLOCKQUOTE>
856	<PRE>
857	int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
858	</PRE>
859	<PRE>
860	int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
861	</PRE>
862	</BLOCKQUOTE>
863	</P>
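<P>
The behavior described in this section can be modeled in plain C.
The sketch below is not SoftFloat code: it uses a native <CODE>double</CODE> in
place of <CODE>float64_t</CODE>, and the names beginning with
<CODE>model_</CODE> are hypothetical stand-ins for SoftFloat's flag constants
and flags variable.
The value returned in the <I>invalid</I> case is an assumption of the sketch,
since the real result depends on the port.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for SoftFloat's exception flag bits and flags
   variable, so that this sketch compiles without the library. */
enum { model_flag_inexact = 1, model_flag_invalid = 16 };
uint_fast8_t model_exceptionFlags = 0;

/* Models the documented behavior of a round-toward-zero conversion to a
   32-bit signed integer: raise invalid when the rounded value does not fit,
   and raise inexact only when exact is true. */
int_fast32_t model_to_i32_r_minMag( double a, bool exact )
{
    int32_t result;
    /* Every double strictly between -2^31 - 1 and 2^31 truncates to a
       representable int32_t.  A NaN fails both comparisons. */
    if ( ! (a > -2147483649.0 && a < 2147483648.0) ) {
        model_exceptionFlags |= model_flag_invalid;
        /* The real result is port-specific; this sketch picks an extreme. */
        return (a < 0) ? INT32_MIN : INT32_MAX;
    }
    result = (int32_t) a;               /* a C cast rounds toward zero */
    if ( exact && (double) result != a ) {
        model_exceptionFlags |= model_flag_inexact;
    }
    return result;
}
```

<P>
Passing <CODE>false</CODE> for <CODE><I>exact</I></CODE> leaves the flags
untouched for an in-range input such as 2.75, while an out-of-range input
raises only <I>invalid</I>, never <I>overflow</I>.
</P>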
864
865	<H3>8.3. Conversions Among Floating-Point Types</H3>
866
867	<P>
868	Conversions between floating-point formats are done by functions with these
869	names:
870	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
872	</BLOCKQUOTE>
873	All combinations of source and result type are supported where the source and
874	result are different formats.
875	There are four different styles of signature for these functions, depending on
876	whether the input and the output floating-point values are passed by value or
877	via pointers:
878	<BLOCKQUOTE>
879	<PRE>
880	float32_t f64_to_f32( float64_t <I>a</I> );
881	</PRE>
882	<PRE>
883	float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
884	</PRE>
885	<PRE>
886	void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
887	</PRE>
888	<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
890	</PRE>
891	</BLOCKQUOTE>
892	</P>
893
894	<P>
895	Conversions from a smaller to a larger floating-point format are always exact
896	and so require no rounding.
897	</P>
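<P>
The asymmetry between widening and narrowing can be observed with C's native
types.
This is only an illustration, assuming that <CODE>float</CODE> and
<CODE>double</CODE> are the IEEE <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
binary formats; the function names are hypothetical and no SoftFloat function
is called.
</P>

```c
#include <stdbool.h>

/* True if narrowing x to float and widening it back recovers x, meaning
   the narrowing conversion did not have to round.  Assumes x is finite and
   within the range of float. */
bool narrows_exactly( double x )
{
    volatile float narrowed = (float) x;   /* volatile forces a real float */
    return (double) narrowed == x;
}

/* True for every finite f: each float value is also a double value, so
   widening never rounds and the round trip returns the original. */
bool widens_exactly( float f )
{
    volatile double widened = f;
    return (float) widened == f;
}
```

<P>
For instance, 0.5 narrows exactly but 0.1 does not, because the
<NOBR>64-bit</NOBR> approximation of 0.1 has more significant bits than a
<NOBR>32-bit</NOBR> format can hold.
</P>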
898
899	<H3>8.4. Basic Arithmetic Functions</H3>
900
901	<P>
902	The following basic arithmetic functions are provided:
903	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
909	</BLOCKQUOTE>
910	Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
911	(square root) which takes only one.
912	The operands and result are all of the same floating-point format.
913	Signatures for these functions take the following forms:
914	<BLOCKQUOTE>
915	<PRE>
916	float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
917	</PRE>
918	<PRE>
919	void
920	f128M_add(
const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
922	</PRE>
923	<PRE>
924	float64_t f64_sqrt( float64_t <I>a</I> );
925	</PRE>
926	<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
928	</PRE>
929	</BLOCKQUOTE>
930	When floating-point values are passed indirectly through pointers, arguments
931	<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
932	operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
933	location where the result is stored.
934	</P>
935
936	<P>
937	Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
938	(<CODE>extFloat80_t</CODE>) functions is affected by variable
939	<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
940	<NOBR>section 6.3</NOBR>,
941	<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
942	</P>
943
944	<H3>8.5. Fused Multiply-Add Functions</H3>
945
946	<P>
947	The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
948	multiply-add</I> operation that does a combined multiplication and addition
949	with only a single rounding.
950	SoftFloat implements fused multiply-add with functions
951	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
953	</BLOCKQUOTE>
Unlike other operations, fused multiply-add is not supported for the
955	<NOBR>80-bit</NOBR> double-extended-precision format,
956	<CODE>extFloat80_t</CODE>.
957	</P>
958
959	<P>
960	Depending on whether floating-point values are passed by value or via pointers,
961	the fused multiply-add functions have signatures of these forms:
962	<BLOCKQUOTE>
963	<PRE>
964	float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
965	</PRE>
966	<PRE>
967	void
968	f128M_mulAdd(
969	const float128_t *<I>aPtr</I>,
970	const float128_t *<I>bPtr</I>,
971	const float128_t *<I>cPtr</I>,
972	float128_t *<I>destPtr</I>
973	);
974	</PRE>
975	</BLOCKQUOTE>
976	The functions compute
977	<NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>)
978	+ <CODE><I>c</I></CODE></NOBR>
979	with a single rounding.
980	When floating-point values are passed indirectly through pointers, arguments
981	<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
982	<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
983	<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
984	<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
985	</P>
986
987	<P>
988	If one of the multiplication operands <CODE><I>a</I></CODE> and
989	<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
the <I>invalid</I> exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
991	</P>
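<P>
The difference between one rounding and two can be shown with a toy model.
The sketch below is not SoftFloat code: it works on small nonnegative integers
and rounds them to a chosen number of significant bits, to nearest with ties
to even, so that the effect is visible with a <NOBR>4-bit</NOBR> significand.
All function names are hypothetical.
</P>

```c
#include <stdint.h>

/* Round a nonnegative integer to at most precisionBits significant bits,
   to nearest with ties to even.  Intended for small inputs only. */
uint64_t round_to_bits( uint64_t x, int precisionBits )
{
    int width = 0;
    int shift;
    uint64_t t, quotient, rest, half;
    for ( t = x; t != 0; t >>= 1 ) ++width;      /* bit length of x */
    if ( width <= precisionBits ) return x;      /* already fits */
    shift = width - precisionBits;
    quotient = x >> shift;                       /* bits that are kept */
    rest = x & ((UINT64_C( 1 ) << shift) - 1);   /* bits that are dropped */
    half = UINT64_C( 1 ) << (shift - 1);
    if ( rest > half || (rest == half && (quotient & 1)) ) {
        ++quotient;
    }
    return quotient << shift;
}

/* Two roundings: the product is rounded, then the sum is rounded. */
uint64_t mul_then_add( uint64_t a, uint64_t b, uint64_t c, int precisionBits )
{
    uint64_t product = round_to_bits( a * b, precisionBits );
    return round_to_bits( product + c, precisionBits );
}

/* One rounding: the exact value of (a * b) + c is rounded once. */
uint64_t fused_mul_add( uint64_t a, uint64_t b, uint64_t c, int precisionBits )
{
    return round_to_bits( a * b + c, precisionBits );
}
```

<P>
With four significant bits, 11 &times; 13 + 8 is exactly 151.
Rounding once gives 144, whereas rounding the product first (143 becomes 144)
and then the sum (152, a tie) gives 160.
</P>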
992
993	<H3>8.6. Remainder Functions</H3>
994
995	<P>
996	For each format, SoftFloat implements the remainder operation defined by the
997	IEEE Floating-Point Standard.
998	The remainder functions have names
999	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_rem</CODE>
1001	</BLOCKQUOTE>
1002	Each remainder operation takes two floating-point operands of the same format
1003	and returns a result in the same format.
1004	Depending on whether floating-point values are passed by value or via pointers,
1005	the remainder functions have signatures of these forms:
1006	<BLOCKQUOTE>
1007	<PRE>
1008	float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
1009	</PRE>
1010	<PRE>
1011	void
1012	f128M_rem(
const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
1014	</PRE>
1015	</BLOCKQUOTE>
1016	When floating-point values are passed indirectly through pointers, arguments
1017	<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
1018	<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
1019	<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
1020	</P>
1021
1022	<P>
1023	The IEEE Standard remainder operation computes the value
1024	<NOBR><CODE><I>a</I></CODE>
1025	− <I>n</I> × <CODE><I>b</I></CODE></NOBR>,
1026	where <I>n</I> is the integer closest to
1027	<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
1028	If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly
1029	halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
1030	<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
1031	The IEEE Standard’s remainder operation is always exact and so requires
1032	no rounding.
1033	</P>
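<P>
The choice of <I>n</I> is what distinguishes this operation from C's
<CODE>%</CODE> operator, which truncates the quotient toward zero.
As a rough illustration on integers only (not SoftFloat code, with a
hypothetical function name), the rule can be written as:
</P>

```c
#include <stdint.h>

/* Integer model of the IEEE remainder: returns a - n*b, where n is the
   integer nearest to a/b, with ties going to the even n.  Requires b != 0. */
int64_t model_ieee_rem( int32_t a, int32_t b )
{
    int64_t n = (int64_t) a / b;        /* quotient truncated toward zero */
    int64_t r = (int64_t) a % b;        /* remainder with the sign of a */
    int64_t absR = (r < 0) ? -r : r;
    int64_t absB = (b < 0) ? -(int64_t) b : b;
    /* Move n one step away from zero if that lands closer to a/b, or if the
       two candidates are equally close and the truncated n is odd. */
    if ( 2 * absR > absB || (2 * absR == absB && n % 2 != 0) ) {
        if ( (r < 0) == (b < 0) ) {
            r -= b;                     /* n becomes n + 1 */
        } else {
            r += b;                     /* n becomes n - 1 */
        }
    }
    return r;
}
```

<P>
For operands 7 and 2 the quotient 3.5 is a tie, so <I>n</I> is 4 and the
result is &minus;1, whereas <CODE>7 % 2</CODE> in C is 1.
</P>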
1034
1035	<P>
1036	Depending on the relative magnitudes of the operands, the remainder
1037	functions can take considerably longer to execute than the other SoftFloat
1038	functions.
1039	This is an inherent characteristic of the remainder operation itself and is not
1040	a flaw in the SoftFloat implementation.
1041	</P>
1042
1043	<H3>8.7. Round-to-Integer Functions</H3>
1044
1045	<P>
1046	For each format, SoftFloat implements the round-to-integer operation specified
1047	by the IEEE Floating-Point Standard.
1048	These functions are named
1049	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
1051	</BLOCKQUOTE>
1052	Each round-to-integer operation takes a single floating-point operand.
1053	This operand is rounded to an integer according to a specified rounding mode,
1054	and the resulting integer value is returned in the same floating-point format.
1055	(Note that the result is not an integer type.)
1056	</P>
1057
1058	<P>
1059	The signatures of the round-to-integer functions are similar to those for
1060	conversions to an integer type:
1061	<BLOCKQUOTE>
1062	<PRE>
1063	float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
1064	</PRE>
1065	<PRE>
1066	void
1067	f128M_roundToInt(
1068	const float128_t *<I>aPtr</I>,
1069	uint_fast8_t <I>roundingMode</I>,
1070	bool <I>exact</I>,
1071	float128_t *<I>destPtr</I>
1072	);
1073	</PRE>
1074	</BLOCKQUOTE>
1075	When floating-point values are passed indirectly through pointers,
1076	<CODE><I>aPtr</I></CODE> points to the input operand and
1077	<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
1078	</P>
1079
1080	<P>
1081	The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
1082	apply.
1083	The variable that usually indicates rounding mode,
1084	<CODE>softfloat_roundingMode</CODE>, is ignored.
1085	Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
1086	exception flag is raised if the conversion is not exact.
1087	If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
1088	be raised;
1089	otherwise, it will not be, even if the conversion is inexact.
1090	</P>
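<P>
To make the effect of the <CODE><I>roundingMode</I></CODE> argument concrete,
the following sketch rounds a value given in quarters to an integer under five
rounding modes.
It is an integer-only model rather than SoftFloat code, and the enumeration is
a hypothetical stand-in for constants such as
<CODE>softfloat_round_near_even</CODE> and
<CODE>softfloat_round_minMag</CODE>.
</P>

```c
#include <stdint.h>

/* Hypothetical stand-ins for SoftFloat's rounding mode constants. */
enum model_round {
    model_round_near_even,     /* nearest, ties to even */
    model_round_minMag,        /* toward zero */
    model_round_min,           /* toward negative infinity */
    model_round_max,           /* toward positive infinity */
    model_round_near_maxMag    /* nearest, ties away from zero */
};

/* Round the value quarters/4 to an integer under the given mode. */
int32_t model_round_quarters( int32_t quarters, enum model_round mode )
{
    int32_t lower = quarters / 4;
    int32_t frac;
    if ( quarters % 4 != 0 && quarters < 0 ) --lower;   /* floor, not truncate */
    frac = quarters - 4 * lower;                        /* 0, 1, 2, or 3 */
    if ( frac == 0 ) return lower;                      /* already an integer */
    switch ( mode ) {
     case model_round_min:
        return lower;
     case model_round_max:
        return lower + 1;
     case model_round_minMag:
        return (quarters < 0) ? lower + 1 : lower;
     case model_round_near_even:
        if ( frac != 2 ) return (frac < 2) ? lower : lower + 1;
        return (lower % 2 == 0) ? lower : lower + 1;
     case model_round_near_maxMag:
        if ( frac != 2 ) return (frac < 2) ? lower : lower + 1;
        return (quarters < 0) ? lower : lower + 1;
    }
    return lower;   /* not reached for valid modes */
}
```

<P>
The value &minus;2.5 (that is, &minus;10 quarters) rounds to &minus;2 under
near-even, minMag, and max, but to &minus;3 under min and near-maxMag.
</P>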
1091
1092	<H3>8.8. Comparison Functions</H3>
1093
1094	<P>
1095	For each format, the following floating-point comparison functions are
1096	provided:
1097	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt</CODE>
1101	</BLOCKQUOTE>
1102	Each comparison takes two operands of the same type and returns a Boolean.
1103	The abbreviation <CODE>eq</CODE> stands for “equal” (=);
1104	<CODE>le</CODE> stands for “less than or equal” (≤);
and <CODE>lt</CODE> stands for “less than” (&lt;).
1106	Depending on whether the floating-point operands are passed by value or via
1107	pointers, the comparison functions have signatures of these forms:
1108	<BLOCKQUOTE>
1109	<PRE>
1110	bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
1111	</PRE>
1112	<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
1114	</PRE>
1115	</BLOCKQUOTE>
1116	</P>
1117
1118	<P>
The usual greater-than (&gt;), greater-than-or-equal (≥), and not-equal
1120	(≠) comparisons are easily obtained from the functions provided.
1121	The not-equal function is just the logical complement of the equal function.
1122	The greater-than-or-equal function is identical to the less-than-or-equal
1123	function with the arguments in reverse order, and likewise the greater-than
1124	function is identical to the less-than function with the arguments reversed.
1125	</P>
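<P>
The derivations above can be sketched with native <CODE>double</CODE>
comparisons standing in for <CODE>f64_eq</CODE>, <CODE>f64_le</CODE>, and
<CODE>f64_lt</CODE>.
The <CODE>model_</CODE> names are hypothetical, and the sketch assumes IEEE
comparison semantics for <CODE>double</CODE>.
It also shows why the arguments are reversed rather than the result negated:
with a NaN operand the two approaches disagree.
</P>

```c
#include <math.h>
#include <stdbool.h>

/* Native-double stand-ins for f64_eq, f64_le, and f64_lt. */
static bool model_eq( double a, double b ) { return a == b; }
static bool model_le( double a, double b ) { return a <= b; }
static bool model_lt( double a, double b ) { return a < b; }

/* The three derived comparisons, built as the text describes. */
bool model_ne( double a, double b ) { return ! model_eq( a, b ); }
bool model_ge( double a, double b ) { return model_le( b, a ); }
bool model_gt( double a, double b ) { return model_lt( b, a ); }

/* With a NaN operand, "not less than" is true but "greater than or equal"
   is false, so negating lt is not a substitute for reversing le. */
bool negation_differs_from_reversal( void )
{
    double notANumber = NAN;
    return ! model_lt( notANumber, 1.0 ) && ! model_ge( notANumber, 1.0 );
}
```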
1126
1127	<P>
1128	The IEEE Floating-Point Standard specifies that the less-than-or-equal and
1129	less-than comparisons by default raise the <I>invalid</I> exception if either
1130	operand is any kind of NaN.
1131	Equality comparisons, on the other hand, are defined by default to raise the
1132	<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
1133	For completeness, SoftFloat provides these complementary functions:
1134	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
1138	</BLOCKQUOTE>
1139	The <CODE>signaling</CODE> equality comparisons are identical to the default
1140	equality comparisons except that the <I>invalid</I> exception is raised for any
1141	NaN input, not just for signaling NaNs.
1142	Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
1143	default counterparts except that the <I>invalid</I> exception is not raised for
1144	quiet NaNs.
1145	</P>
1146
1147	<H3>8.9. Signaling NaN Test Functions</H3>
1148
1149	<P>
1150	Functions for testing whether a floating-point value is a signaling NaN are
1151	provided with these names:
1152	<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
1154	</BLOCKQUOTE>
1155	The functions take one floating-point operand and return a Boolean indicating
1156	whether the operand is a signaling NaN.
1157	Accordingly, the functions have the forms
1158	<BLOCKQUOTE>
1159	<PRE>
1160	bool f64_isSignalingNaN( float64_t <I>a</I> );
1161	</PRE>
1162	<PRE>
1163	bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
1164	</PRE>
1165	</BLOCKQUOTE>
1166	</P>
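<P>
For readers unfamiliar with the encoding, the test can be sketched on the raw
bits of a <NOBR>64-bit</NOBR> double-precision value.
The function below is a hypothetical illustration, not the library's
implementation, and it assumes the common convention that a set most
significant fraction bit marks a quiet NaN.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Tests a 64-bit double-precision bit pattern for a signaling NaN: exponent
   field all ones, quiet bit (most significant fraction bit) clear, and at
   least one lower fraction bit set.  The sign bit is ignored. */
bool model_f64_bits_isSignalingNaN( uint64_t bits )
{
    return ((bits & UINT64_C( 0x7FF8000000000000 ))
                == UINT64_C( 0x7FF0000000000000 ))
        && ((bits & UINT64_C( 0x0007FFFFFFFFFFFF )) != 0);
}
```

<P>
The second condition is what separates a signaling NaN from an infinity, whose
fraction field is entirely zero.
</P>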
1167
1168	<H3>8.10. Raise-Exception Function</H3>
1169
1170	<P>
1171	SoftFloat provides a single function for raising floating-point exceptions:
1172	<BLOCKQUOTE>
1173	<PRE>
1174	void softfloat_raiseFlags( uint_fast8_t <I>exceptions</I> );
1175	</PRE>
1176	</BLOCKQUOTE>
1177	The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
1178	exceptions to raise.
1179	(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
1180	In addition to setting the specified exception flags in variable
1181	<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raiseFlags</CODE>
1182	function may cause a trap or abort appropriate for the current system.
1183	</P>
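<P>
The flag-setting part of this function can be modeled in a few lines.
The names below are hypothetical stand-ins so that the sketch compiles without
SoftFloat, and the sketch assumes that raising a flag is a bitwise OR into the
flags variable; any trap or abort behavior of a real port is not modeled.
</P>

```c
#include <stdint.h>

/* Hypothetical stand-ins for two of SoftFloat's flag masks and for the
   variable softfloat_exceptionFlags. */
enum { model_flag_inexact = 1, model_flag_invalid = 16 };
uint_fast8_t model_exceptionFlags = 0;

/* Models the flag-setting part of softfloat_raiseFlags: the requested bits
   are ORed in, so flags raised earlier stay set until the caller clears
   the variable. */
void model_raiseFlags( uint_fast8_t exceptions )
{
    model_exceptionFlags |= exceptions;
}
```

<P>
A caller that wants to detect exceptions from one operation would clear the
variable first, perform the operation, and then test the bits of interest.
</P>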
1184
1185
1186	<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>
1187
1188	<P>
1189	Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
1190	SoftFloat introduced numerous technical differences compared to earlier
1191	releases.
1192	</P>
1193
1194	<H3>9.1. Name Changes</H3>
1195
1196	<P>
1197	The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
1198	is that the names of most functions and variables have changed, even when the
1199	behavior has not.
1200	First, the floating-point types, the mode variables, the exception flags
1201	variable, the function to raise exceptions, and various associated constants
1202	have been renamed as follows:
1203	<BLOCKQUOTE>
1204	<TABLE>
1205	<TR>
1206	<TD>old name, Release 2:</TD>
1207	<TD>new name, Release 3:</TD>
1208	</TR>
1209	<TR>
1210	<TD><CODE>float32</CODE></TD>
1211	<TD><CODE>float32_t</CODE></TD>
1212	</TR>
1213	<TR>
1214	<TD><CODE>float64</CODE></TD>
1215	<TD><CODE>float64_t</CODE></TD>
1216	</TR>
1217	<TR>
1218	<TD><CODE>floatx80</CODE></TD>
1219	<TD><CODE>extFloat80_t</CODE></TD>
1220	</TR>
1221	<TR>
1222	<TD><CODE>float128</CODE></TD>
1223	<TD><CODE>float128_t</CODE></TD>
1224	</TR>
1225	<TR>
1226	<TD><CODE>float_rounding_mode</CODE></TD>
1227	<TD><CODE>softfloat_roundingMode</CODE></TD>
1228	</TR>
1229	<TR>
1230	<TD><CODE>float_round_nearest_even</CODE></TD>
1231	<TD><CODE>softfloat_round_near_even</CODE></TD>
1232	</TR>
1233	<TR>
1234	<TD><CODE>float_round_to_zero</CODE></TD>
1235	<TD><CODE>softfloat_round_minMag</CODE></TD>
1236	</TR>
1237	<TR>
1238	<TD><CODE>float_round_down</CODE></TD>
1239	<TD><CODE>softfloat_round_min</CODE></TD>
1240	</TR>
1241	<TR>
1242	<TD><CODE>float_round_up</CODE></TD>
1243	<TD><CODE>softfloat_round_max</CODE></TD>
1244	</TR>
1245	<TR>
1246	<TD><CODE>float_detect_tininess</CODE></TD>
1247	<TD><CODE>softfloat_detectTininess</CODE></TD>
1248	</TR>
1249	<TR>
1250	<TD><CODE>float_tininess_before_rounding    </CODE></TD>
1251	<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
1252	</TR>
1253	<TR>
1254	<TD><CODE>float_tininess_after_rounding</CODE></TD>
1255	<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
1256	</TR>
1257	<TR>
1258	<TD><CODE>floatx80_rounding_precision</CODE></TD>
1259	<TD><CODE>extF80_roundingPrecision</CODE></TD>
1260	</TR>
1261	<TR>
1262	<TD><CODE>float_exception_flags</CODE></TD>
1263	<TD><CODE>softfloat_exceptionFlags</CODE></TD>
1264	</TR>
1265	<TR>
1266	<TD><CODE>float_flag_inexact</CODE></TD>
1267	<TD><CODE>softfloat_flag_inexact</CODE></TD>
1268	</TR>
1269	<TR>
1270	<TD><CODE>float_flag_underflow</CODE></TD>
1271	<TD><CODE>softfloat_flag_underflow</CODE></TD>
1272	</TR>
1273	<TR>
1274	<TD><CODE>float_flag_overflow</CODE></TD>
1275	<TD><CODE>softfloat_flag_overflow</CODE></TD>
1276	</TR>
1277	<TR>
1278	<TD><CODE>float_flag_divbyzero</CODE></TD>
1279	<TD><CODE>softfloat_flag_infinite</CODE></TD>
1280	</TR>
1281	<TR>
1282	<TD><CODE>float_flag_invalid</CODE></TD>
1283	<TD><CODE>softfloat_flag_invalid</CODE></TD>
1284	</TR>
1285	<TR>
1286	<TD><CODE>float_raise</CODE></TD>
1287	<TD><CODE>softfloat_raiseFlags</CODE></TD>
1288	</TR>
1289	</TABLE>
1290	</BLOCKQUOTE>
1291	</P>
1292
1293	<P>
1294	Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
1295	function names:
1296	<BLOCKQUOTE>
1297	<TABLE>
1298	<TR>
1299	<TD>used in names in Release 2:<CODE>    </CODE></TD>
1300	<TD>used in names in Release 3:</TD>
1301	</TR>
1302	<TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
1303	<TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
1304	<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
1305	<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
1306	<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
1307	<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
1308	</TABLE>
1309	</BLOCKQUOTE>
1310	Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
1311	numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
1312	is now <CODE>f32_add</CODE>.
1313	Lastly, there have been a few other changes to function names:
1314	<BLOCKQUOTE>
1315	<TABLE>
1316	<TR>
1317	<TD>used in names in Release 2:<CODE>   </CODE></TD>
1318	<TD>used in names in Release 3:<CODE>   </CODE></TD>
1319	<TD>relevant functions:</TD>
1320	</TR>
1321	<TR>
1322	<TD><CODE>_round_to_zero</CODE></TD>
1323	<TD><CODE>_r_minMag</CODE></TD>
1324	<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
1325	</TR>
1326	<TR>
1327	<TD><CODE>round_to_int</CODE></TD>
1328	<TD><CODE>roundToInt</CODE></TD>
1329	<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
1330	</TR>
1331	<TR>
1332	<TD><CODE>is_signaling_nan    </CODE></TD>
1333	<TD><CODE>isSignalingNaN</CODE></TD>
1334	<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
1335	</TR>
1336	</TABLE>
1337	</BLOCKQUOTE>
1338	</P>
1339
1340	<H3>9.2. Changes to Function Arguments</H3>
1341
1342	<P>
1343	Besides simple name changes, some operations were given a different interface
1344	in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
1345	<UL>
1346
1347	<LI>
1348	<P>
1349	Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
1351	<CODE>uint32_t</CODE>, whereas previously their types could be defined
1352	differently for each port of SoftFloat, usually using traditional C types such
1353	as <CODE>unsigned</CODE> <CODE>int</CODE>.
1354	Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
1356	previously these were again passed as a port-specific type (usually
1357	<CODE>int</CODE>).
1358	</P>
1359
1360	<LI>
1361	<P>
1362	As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
1363	Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
1364	later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
1365	values through pointers, meaning that functions take pointer arguments and then
1366	read or write floating-point values at the locations indicated by the pointers.
1367	In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
1368	passed by value, regardless of their size.
1369	</P>
1370
1371	<LI>
1372	<P>
1373	Functions that round to an integer have additional
1374	<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
1375	they did not have in <NOBR>Release 2</NOBR>.
1376	Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
1377	since <NOBR>Release 3</NOBR>.
1378	For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
1379	same global variable that affects the basic arithmetic operations (now called
1380	<CODE>softfloat_roundingMode</CODE> but previously known as
1381	<CODE>float_rounding_mode</CODE>).
1382	Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
1383	an exact integer value, and if the <I>invalid</I> exception was not raised by
1384	the function, the <I>inexact</I> exception was always raised.
1385	<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
1386	case.
1387	Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
1388	effect as <NOBR>Release 2</NOBR> by passing variable
1389	<CODE>softfloat_roundingMode</CODE> for argument
1390	<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
1391	<CODE><I>exact</I></CODE>.
1392	</P>
1393
1394	</UL>
1395	</P>
1396
1397	<H3>9.3. Added Capabilities</H3>
1398
1399	<P>
1400	With <NOBR>Release 3</NOBR>, some new features have been added that were not
1401	present in <NOBR>Release 2</NOBR>:
1402	<UL>
1403
1404	<LI>
1405	<P>
1406	A port of SoftFloat can now define any of the floating-point types
1407	<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
1408	<CODE>float128_t</CODE> as aliases for C’s standard floating-point types
1409	<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
1410	<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
1411	This potential convenience was not supported under <NOBR>Release 2</NOBR>.
1412	</P>
1413
1414	<P>
1415	(Note, however, that there may be a performance cost to defining
1416	SoftFloat’s floating-point types this way, depending on the platform and
1417	the applications using SoftFloat.
1418	Ports of SoftFloat may choose to forgo the convenience in favor of better
1419	speed.)
1420	</P>
1421
<LI>
<P>
1424	As of <NOBR>Release 3b</NOBR>, <NOBR>16-bit</NOBR> half-precision,
1425	<CODE>float16_t</CODE>, is supported.
1426	</P>
1427
<LI>
<P>
1430	Functions have been added for converting between the floating-point types and
1431	unsigned integers.
1432	<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
1433	</P>
1434
<LI>
<P>
1437	Fused multiply-add functions have been added for all floating-point formats
1438	except <NOBR>80-bit</NOBR> double-extended-precision,
1439	<CODE>extFloat80_t</CODE>.
1440	</P>
1441
<LI>
<P>
1444	New rounding modes are supported:
1445	<CODE>softfloat_round_near_maxMag</CODE> (round to nearest, with ties to
1446	maximum magnitude, away from zero), and, as of <NOBR>Release 3c</NOBR>,
1447	optional <CODE>softfloat_round_odd</CODE> (round to odd, also known as
1448	jamming).
1449	</P>
1450
1451	</UL>
1452	</P>
1453
1454	<H3>9.4. Better Compatibility with the C Language</H3>
1455
1456	<P>
1457	<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
1458	Standard’s rules for portability.
1459	For example, older releases of SoftFloat employed type conversions in ways
1460	that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior the C Standard now specifies more strictly.
1463	</P>
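<P>
As an illustration of the union technique, the following sketch reads the bit
pattern of a floating-point value by writing one union member and reading
another.
It is not taken from SoftFloat: the names are hypothetical, a native
<CODE>float</CODE> stands in for <CODE>float32_t</CODE>, and the sketch assumes
that <CODE>float</CODE> is the <NOBR>32-bit</NOBR> IEEE binary format.
</P>

```c
#include <stdint.h>

/* Overlay a float and a 32-bit integer of the same size. */
union model_ui32_f32 { uint32_t ui; float f; };

/* Return the bit pattern of f by writing one union member and reading the
   other, rather than casting a float pointer to an integer pointer. */
uint32_t model_float_bits( float f )
{
    union model_ui32_f32 u;
    u.f = f;
    return u.ui;
}
```

<P>
The alternative of dereferencing <CODE>(uint32_t *) &amp;f</CODE> is the kind
of commonly practiced conversion that the C Standard does not fully define.
</P>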
1464
1465	<H3>9.5. New Organization as a Library</H3>
1466
1467	<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
1469	Previously, SoftFloat compiled into a single, monolithic object file containing
1470	all the SoftFloat functions, with the consequence that a program linking with
1471	SoftFloat would get every SoftFloat function in its binary file even if only a
1472	few functions were actually used.
1473	With SoftFloat in the form of a library, a program that is linked by a standard
1474	linker will include only those functions of SoftFloat that it needs and no
1475	others.
1476	</P>
1477
1478	<H3>9.6. Optimization Gains (and Losses)</H3>
1479
1480	<P>
1481	Individual SoftFloat functions have been variously improved in
1482	<NOBR>Release 3</NOBR> compared to earlier releases.
1483	In particular, better, faster algorithms have been deployed for the operations
1484	of division, square root, and remainder.
1485	For functions operating on the larger <NOBR>80-bit</NOBR> and
1486	<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
1487	<CODE>float128_t</CODE>, code size has also generally been reduced.
1488	</P>
1489
1490	<P>
1491	However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
1492	single object file, compilers could make optimizations across function calls
1493	when one SoftFloat function calls another.
1494	Now that the functions of SoftFloat are compiled separately and only afterward
1495	linked together into a program, there is not usually the same opportunity to
1496	optimize across function calls.
1497	Some loss of speed has been observed due to this change.
1498	</P>
1499
1500
1501	<H2>10. Future Directions</H2>
1502
1503	<P>
1504	The following improvements are anticipated for future releases of SoftFloat:
1505	<UL>
1506	<LI>
1507	more functions from the 2008 version of the IEEE Floating-Point Standard;
1508	<LI>
1509	consistent, defined behavior for non-canonical representations of extended
1510	format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
1511	<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).
1512
1513	</UL>
1514	</P>
1515
1516
1517	<H2>11. Contact Information</H2>
1518
1519	<P>
1520	At the time of this writing, the most up-to-date information about SoftFloat
1521	and the latest release can be found at the Web page
1522	<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></NOBR></A>.
1523	</P>
1524
1525
1526	</BODY>
1527
