1 |
|
---|
2 | <HTML>
|
---|
3 |
|
---|
4 | <HEAD>
|
---|
5 | <TITLE>Berkeley SoftFloat Library Interface</TITLE>
|
---|
6 | </HEAD>
|
---|
7 |
|
---|
8 | <BODY>
|
---|
9 |
|
---|
10 | <H1>Berkeley SoftFloat Release 3e: Library Interface</H1>
|
---|
11 |
|
---|
12 | <P>
|
---|
13 | John R. Hauser<BR>
|
---|
14 | 2018 January 20<BR>
|
---|
15 | </P>
|
---|
16 |
|
---|
17 |
|
---|
18 | <H2>Contents</H2>
|
---|
19 |
|
---|
20 | <BLOCKQUOTE>
|
---|
21 | <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
|
---|
22 | <COL WIDTH=25>
|
---|
23 | <COL WIDTH=*>
|
---|
24 | <TR><TD COLSPAN=2>1. Introduction</TD></TR>
|
---|
25 | <TR><TD COLSPAN=2>2. Limitations</TD></TR>
|
---|
26 | <TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
|
---|
27 | <TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
|
---|
28 | <TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
|
---|
29 | <TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
|
---|
30 | <TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
|
---|
31 | <TR>
|
---|
32 | <TD></TD>
|
---|
33 | <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
|
---|
34 | </TR>
|
---|
35 | <TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
|
---|
36 | <TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
|
---|
37 | <TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
|
---|
38 | <TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
|
---|
39 | <TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
|
---|
40 | <TR>
|
---|
41 | <TD></TD>
|
---|
42 | <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
|
---|
43 | </TR>
|
---|
44 | <TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
|
---|
45 | <TR><TD COLSPAN=2>8. Function Details</TD></TR>
|
---|
46 | <TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
|
---|
47 | <TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
|
---|
48 | <TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
|
---|
49 | <TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
|
---|
50 | <TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
|
---|
51 | <TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
|
---|
52 | <TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
|
---|
53 | <TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
|
---|
54 | <TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
|
---|
55 | <TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
|
---|
56 | <TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
|
---|
57 | <TR><TD></TD><TD>9.1. Name Changes</TD></TR>
|
---|
58 | <TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
|
---|
59 | <TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
|
---|
60 | <TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
|
---|
61 | <TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
|
---|
62 | <TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
|
---|
63 | <TR><TD COLSPAN=2>10. Future Directions</TD></TR>
|
---|
64 | <TR><TD COLSPAN=2>11. Contact Information</TD></TR>
|
---|
65 | </TABLE>
|
---|
66 | </BLOCKQUOTE>
|
---|
67 |
|
---|
68 |
|
---|
69 | <H2>1. Introduction</H2>
|
---|
70 |
|
---|
71 | <P>
|
---|
72 | Berkeley SoftFloat is a software implementation of binary floating-point that
|
---|
73 | conforms to the IEEE Standard for Floating-Point Arithmetic.
|
---|
74 | The current release supports five binary formats: <NOBR>16-bit</NOBR>
|
---|
75 | half-precision, <NOBR>32-bit</NOBR> single-precision, <NOBR>64-bit</NOBR>
|
---|
76 | double-precision, <NOBR>80-bit</NOBR> double-extended-precision, and
|
---|
77 | <NOBR>128-bit</NOBR> quadruple-precision.
|
---|
78 | The following functions are supported for each format:
|
---|
79 | <UL>
|
---|
80 | <LI>
|
---|
81 | addition, subtraction, multiplication, division, and square root;
|
---|
82 | <LI>
|
---|
83 | fused multiply-add as defined by the IEEE Standard, except for
|
---|
84 | <NOBR>80-bit</NOBR> double-extended-precision;
|
---|
85 | <LI>
|
---|
86 | remainder as defined by the IEEE Standard;
|
---|
87 | <LI>
|
---|
88 | round to integral value;
|
---|
89 | <LI>
|
---|
90 | comparisons;
|
---|
91 | <LI>
|
---|
92 | conversions to/from other supported formats; and
|
---|
93 | <LI>
|
---|
94 | conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
|
---|
95 | signed and unsigned.
|
---|
96 | </UL>
|
---|
97 | All operations required by the original 1985 version of the IEEE Floating-Point
|
---|
98 | Standard are implemented, except for conversions to and from decimal.
|
---|
99 | </P>
|
---|
100 |
|
---|
101 | <P>
|
---|
102 | This document gives information about the types defined and the routines
|
---|
103 | implemented by SoftFloat.
|
---|
104 | It does not attempt to define or explain the IEEE Floating-Point Standard.
|
---|
105 | Information about the standard is available elsewhere.
|
---|
106 | </P>
|
---|
107 |
|
---|
108 | <P>
|
---|
109 | The current version of SoftFloat is <NOBR>Release 3e</NOBR>.
|
---|
110 | This release modifies the behavior of the rarely used <I>odd</I> rounding mode
|
---|
111 | (<I>round to odd</I>, also known as <I>jamming</I>), and also adds some new
|
---|
112 | specialization and optimization examples for those compiling SoftFloat.
|
---|
113 | </P>
|
---|
114 |
|
---|
115 | <P>
|
---|
116 | The previous <NOBR>Release 3d</NOBR> fixed bugs that were found in the square
|
---|
117 | root functions for the <NOBR>64-bit</NOBR>, <NOBR>80-bit</NOBR>, and
|
---|
118 | <NOBR>128-bit</NOBR> floating-point formats.
|
---|
119 | (Thanks to Alexei Sibidanov at the University of Victoria for reporting an
|
---|
120 | incorrect result.)
|
---|
121 | The bugs affected all prior <NOBR>Release-3</NOBR> versions of SoftFloat
|
---|
122 | <NOBR>through 3c</NOBR>.
|
---|
123 | The flaw in the <NOBR>64-bit</NOBR> floating-point square root function was of
|
---|
124 | very minor impact, causing a <NOBR>1-ulp</NOBR> error (<NOBR>1 unit</NOBR> in
|
---|
125 | the last place) a few times out of a billion.
|
---|
126 | The bugs in the <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> square root
|
---|
127 | functions were more serious.
|
---|
128 | Although incorrect results again occurred only a few times out of a billion,
|
---|
129 | when they did occur a large portion of the less-significant bits could be
|
---|
130 | wrong.
|
---|
131 | </P>
|
---|
132 |
|
---|
133 | <P>
|
---|
134 | Among earlier releases, 3b was notable for adding support for the
|
---|
135 | <NOBR>16-bit</NOBR> half-precision format.
|
---|
136 | For more about the evolution of SoftFloat releases, see
|
---|
137 | <A HREF="SoftFloat-history.html"><NOBR><CODE>SoftFloat-history.html</CODE></NOBR></A>.
|
---|
138 | </P>
|
---|
139 |
|
---|
140 | <P>
|
---|
141 | The functional interface of SoftFloat <NOBR>Release 3</NOBR> and later differs
|
---|
142 | in many details from the releases that came before.
|
---|
143 | For specifics of these differences, see <NOBR>section 9</NOBR> below,
|
---|
144 | <I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
|
---|
145 | </P>
|
---|
146 |
|
---|
147 |
|
---|
148 | <H2>2. Limitations</H2>
|
---|
149 |
|
---|
150 | <P>
|
---|
151 | SoftFloat assumes the computer has an addressable byte size of 8 or
|
---|
152 | <NOBR>16 bits</NOBR>.
|
---|
153 | (Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
|
---|
154 | </P>
|
---|
155 |
|
---|
156 | <P>
|
---|
157 | SoftFloat is written in C and is designed to work with other C code.
|
---|
158 | The C compiler used must conform at a minimum to the 1989 ANSI standard for the
|
---|
159 | C language (same as the 1990 ISO standard) and must in addition support basic
|
---|
160 | arithmetic on <NOBR>64-bit</NOBR> integers.
|
---|
161 | Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
|
---|
162 | single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
|
---|
163 | did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
|
---|
164 | starting with <NOBR>Release 3</NOBR>.
|
---|
165 | Since 1999, ISO standards for C have mandated compiler support for
|
---|
166 | <NOBR>64-bit</NOBR> integers.
|
---|
167 | A compiler conforming to the 1999 C Standard or later is recommended but not
|
---|
168 | strictly required.
|
---|
169 | </P>
|
---|
170 |
|
---|
171 | <P>
|
---|
172 | Most operations not required by the original 1985 version of the IEEE
|
---|
173 | Floating-Point Standard but added in the 2008 version are not yet supported in
|
---|
174 | SoftFloat <NOBR>Release 3e</NOBR>.
|
---|
175 | </P>
|
---|
176 |
|
---|
177 |
|
---|
178 | <H2>3. Acknowledgments and License</H2>
|
---|
179 |
|
---|
180 | <P>
|
---|
181 | The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
|
---|
182 | <NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
|
---|
183 | supplanting earlier releases.
|
---|
184 | The project to create <NOBR>Release 3</NOBR> (now <NOBR>through 3e</NOBR>) was
|
---|
185 | done in the employ of the University of California, Berkeley, within the
|
---|
186 | Department of Electrical Engineering and Computer Sciences, first for the
|
---|
187 | Parallel Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
|
---|
188 | The work was officially overseen by Prof. Krste Asanovic, with funding provided
|
---|
189 | by these sources:
|
---|
190 | <BLOCKQUOTE>
|
---|
191 | <TABLE>
|
---|
192 | <COL>
|
---|
193 | <COL WIDTH=10>
|
---|
194 | <COL>
|
---|
195 | <TR>
|
---|
196 | <TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
|
---|
197 | <TD></TD>
|
---|
198 | <TD>
|
---|
199 | Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
|
---|
200 | (Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
|
---|
201 | NVIDIA, Oracle, and Samsung.
|
---|
202 | </TD>
|
---|
203 | </TR>
|
---|
204 | <TR>
|
---|
205 | <TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
|
---|
206 | <TD></TD>
|
---|
207 | <TD>
|
---|
208 | DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
|
---|
209 | ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
|
---|
210 | Oracle, and Samsung.
|
---|
211 | </TD>
|
---|
212 | </TR>
|
---|
213 | </TABLE>
|
---|
214 | </BLOCKQUOTE>
|
---|
215 | </P>
|
---|
216 |
|
---|
217 | <P>
|
---|
218 | The following applies to the whole of SoftFloat <NOBR>Release 3e</NOBR> as well
|
---|
219 | as to each source file individually.
|
---|
220 | </P>
|
---|
221 |
|
---|
222 | <P>
|
---|
223 | Copyright 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 The Regents of the
|
---|
224 | University of California.
|
---|
225 | All rights reserved.
|
---|
226 | </P>
|
---|
227 |
|
---|
228 | <P>
|
---|
229 | Redistribution and use in source and binary forms, with or without
|
---|
230 | modification, are permitted provided that the following conditions are met:
|
---|
231 | <OL>
|
---|
232 |
|
---|
233 | <LI>
|
---|
234 | <P>
|
---|
235 | Redistributions of source code must retain the above copyright notice, this
|
---|
236 | list of conditions, and the following disclaimer.
|
---|
237 | </P>
|
---|
238 |
|
---|
239 | <LI>
|
---|
240 | <P>
|
---|
241 | Redistributions in binary form must reproduce the above copyright notice, this
|
---|
242 | list of conditions, and the following disclaimer in the documentation and/or
|
---|
243 | other materials provided with the distribution.
|
---|
244 | </P>
|
---|
245 |
|
---|
246 | <LI>
|
---|
247 | <P>
|
---|
248 | Neither the name of the University nor the names of its contributors may be
|
---|
249 | used to endorse or promote products derived from this software without specific
|
---|
250 | prior written permission.
|
---|
251 | </P>
|
---|
252 |
|
---|
253 | </OL>
|
---|
254 | </P>
|
---|
255 |
|
---|
256 | <P>
|
---|
257 | THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”,
|
---|
258 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
---|
259 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
|
---|
260 | DISCLAIMED.
|
---|
261 | IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
|
---|
262 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
|
---|
263 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
---|
264 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
---|
265 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
|
---|
266 | OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
|
---|
267 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
---|
268 | </P>
|
---|
269 |
|
---|
270 |
|
---|
271 | <H2>4. Types and Functions</H2>
|
---|
272 |
|
---|
273 | <P>
|
---|
274 | The types and functions of SoftFloat are declared in header file
|
---|
275 | <CODE>softfloat.h</CODE>.
|
---|
276 | </P>
|
---|
277 |
|
---|
278 | <H3>4.1. Boolean and Integer Types</H3>
|
---|
279 |
|
---|
280 | <P>
|
---|
281 | Header file <CODE>softfloat.h</CODE> depends on standard headers
|
---|
282 | <CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
|
---|
283 | <CODE>bool</CODE> and several integer types.
|
---|
284 | These standard headers have been part of the ISO C Standard Library since 1999.
|
---|
285 | With any recent compiler, they are likely to be supported, even if the compiler
|
---|
286 | does not claim complete conformance to the latest ISO C Standard.
|
---|
287 | For older or nonstandard compilers, a port of SoftFloat may have substitutes
|
---|
288 | for these headers.
|
---|
289 | Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
|
---|
290 | <CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
|
---|
291 | <CODE>&lt;stdint.h&gt;</CODE>:
|
---|
292 | <BLOCKQUOTE>
|
---|
293 | <PRE>
|
---|
294 | uint16_t
|
---|
295 | uint32_t
|
---|
296 | uint64_t
|
---|
297 | int32_t
|
---|
298 | int64_t
|
---|
299 | uint_fast8_t
|
---|
300 | uint_fast32_t
|
---|
301 | uint_fast64_t
|
---|
302 | int_fast32_t
|
---|
303 | int_fast64_t
|
---|
304 | </PRE>
|
---|
305 | </BLOCKQUOTE>
|
---|
306 | </P>
|
---|
307 |
|
---|
308 |
|
---|
309 | <H3>4.2. Floating-Point Types</H3>
|
---|
310 |
|
---|
311 | <P>
|
---|
312 | The <CODE>softfloat.h</CODE> header defines five floating-point types:
|
---|
313 | <BLOCKQUOTE>
|
---|
314 | <TABLE CELLSPACING=0 CELLPADDING=0>
|
---|
315 | <TR>
|
---|
316 | <TD><CODE>float16_t</CODE></TD>
|
---|
317 | <TD><NOBR>16-bit</NOBR> half-precision binary format</TD>
|
---|
318 | </TR>
|
---|
319 | <TR>
|
---|
320 | <TD><CODE>float32_t</CODE></TD>
|
---|
321 | <TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
|
---|
322 | </TR>
|
---|
323 | <TR>
|
---|
324 | <TD><CODE>float64_t</CODE></TD>
|
---|
325 | <TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
|
---|
326 | </TR>
|
---|
327 | <TR>
|
---|
328 | <TD><CODE>extFloat80_t </CODE></TD>
|
---|
329 | <TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
|
---|
330 | Motorola format)</TD>
|
---|
331 | </TR>
|
---|
332 | <TR>
|
---|
333 | <TD><CODE>float128_t</CODE></TD>
|
---|
334 | <TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
|
---|
335 | </TR>
|
---|
336 | </TABLE>
|
---|
337 | </BLOCKQUOTE>
|
---|
338 | The non-extended types are each exactly the size specified:
|
---|
339 | <NOBR>16 bits</NOBR> for <CODE>float16_t</CODE>, <NOBR>32 bits</NOBR> for
|
---|
340 | <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for <CODE>float64_t</CODE>, and
|
---|
341 | <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
|
---|
342 | Aside from these size requirements, the definitions of all these types may
|
---|
343 | differ for different ports of SoftFloat to specific systems.
|
---|
344 | A given port of SoftFloat may or may not define some of the floating-point
|
---|
345 | types as aliases for the C standard types <CODE>float</CODE>,
|
---|
346 | <CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
|
---|
347 | </P>
|
---|
348 |
|
---|
349 | <P>
|
---|
350 | Header file <CODE>softfloat.h</CODE> also defines a structure,
|
---|
351 | <CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
|
---|
352 | <NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
|
---|
353 | This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
|
---|
354 | at least these two fields (not necessarily in this order):
|
---|
355 | <BLOCKQUOTE>
|
---|
356 | <PRE>
|
---|
357 | uint16_t signExp;
|
---|
358 | uint64_t signif;
|
---|
359 | </PRE>
|
---|
360 | </BLOCKQUOTE>
|
---|
361 | Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
|
---|
362 | value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
|
---|
363 | encoded exponent in the other <NOBR>15 bits</NOBR>.
|
---|
364 | Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
|
---|
365 | the floating-point value.
|
---|
366 | (In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
|
---|
367 | leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
|
---|
368 | in the most significant bit of the significand.)
|
---|
369 | </P>
|
---|
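<P>
The field layout described above can be unpacked with a few shifts and masks.
The following is a minimal sketch, not part of SoftFloat: the structure
<CODE>demo_extFloat80M</CODE> and the helper names are hypothetical stand-ins
(a real program would use <CODE>struct</CODE> <CODE>extFloat80M</CODE> from
<CODE>softfloat.h</CODE>), and the field order shown is only one possibility.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M.  The real definition lives
   in softfloat.h and may order or pad its fields differently. */
struct demo_extFloat80M {
    uint16_t signExp;   /* sign in bit 15, encoded exponent in bits 14..0 */
    uint64_t signif;    /* complete 64-bit significand, explicit leading bit */
};

/* True if the sign bit (bit 15 of signExp) is set. */
bool demo_sign(const struct demo_extFloat80M *aPtr)
{
    return (aPtr->signExp >> 15) != 0;
}

/* Encoded exponent: the low 15 bits of signExp. */
uint16_t demo_exp(const struct demo_extFloat80M *aPtr)
{
    return (uint16_t)(aPtr->signExp & 0x7FFF);
}

/* Explicit leading significand bit: bit 63 of signif. */
bool demo_leadingBit(const struct demo_extFloat80M *aPtr)
{
    return (aPtr->signif >> 63) != 0;
}
```

<P>
For instance, the value &minus;1 in the usual <NOBR>80-bit</NOBR> encoding has
<CODE>signExp</CODE> equal to <CODE>0xBFFF</CODE> and <CODE>signif</CODE> equal
to <CODE>0x8000000000000000</CODE>, so the helpers above would report a set
sign bit, an encoded exponent of <CODE>0x3FFF</CODE>, and a leading significand
bit <NOBR>of 1</NOBR>.
</P>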
370 |
|
---|
371 | <H3>4.3. Supported Floating-Point Functions</H3>
|
---|
372 |
|
---|
373 | <P>
|
---|
374 | SoftFloat implements these arithmetic operations for its floating-point types:
|
---|
375 | <UL>
|
---|
376 | <LI>
|
---|
377 | conversions between any two floating-point formats;
|
---|
378 | <LI>
|
---|
379 | for each floating-point format, conversions to and from signed and unsigned
|
---|
380 | <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
|
---|
381 | <LI>
|
---|
382 | for each format, the usual addition, subtraction, multiplication, division, and
|
---|
383 | square root operations;
|
---|
384 | <LI>
|
---|
385 | for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
|
---|
386 | operation defined by the IEEE Standard;
|
---|
387 | <LI>
|
---|
388 | for each format, the floating-point remainder operation defined by the IEEE
|
---|
389 | Standard;
|
---|
390 | <LI>
|
---|
391 | for each format, a “round to integer” operation that rounds to the
|
---|
392 | nearest integer value in the same format; and
|
---|
393 | <LI>
|
---|
394 | comparisons between two values in the same floating-point format.
|
---|
395 | </UL>
|
---|
396 | </P>
|
---|
397 |
|
---|
398 | <P>
|
---|
399 | The following operations required by the 2008 IEEE Floating-Point Standard are
|
---|
400 | not supported in SoftFloat <NOBR>Release 3e</NOBR>:
|
---|
401 | <UL>
|
---|
402 | <LI>
|
---|
403 | <B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
|
---|
404 | <B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
|
---|
405 | <LI>
|
---|
406 | conversions between floating-point formats and decimal or hexadecimal character
|
---|
407 | sequences;
|
---|
408 | <LI>
|
---|
409 | all “quiet-computation” operations (<B>copy</B>, <B>negate</B>,
|
---|
410 | <B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
|
---|
411 | manipulation of the floating-point sign bit); and
|
---|
412 | <LI>
|
---|
413 | all “non-computational” operations other than <B>isSignaling</B>
|
---|
414 | (which is supported).
|
---|
415 | </UL>
|
---|
416 | </P>
|
---|
417 |
|
---|
418 | <H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>
|
---|
419 |
|
---|
420 | <P>
|
---|
421 | Because the <NOBR>80-bit</NOBR> double-extended-precision format,
|
---|
422 | <CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
|
---|
423 | finite floating-point numbers are encodable in this type in multiple equivalent
|
---|
424 | forms.
|
---|
425 | Of these multiple encodings, there is always a unique one with the least
|
---|
426 | encoded exponent value, and this encoding is considered the <I>canonical</I>
|
---|
427 | representation of the floating-point number.
|
---|
428 | Any other equivalent representations (having a higher encoded exponent value)
|
---|
429 | are <I>non-canonical</I>.
|
---|
430 | For a value in the subnormal range (including zero), the canonical
|
---|
431 | representation always has an encoded exponent of zero and a leading significand
|
---|
432 | bit <NOBR>of 0</NOBR>.
|
---|
433 | For finite values outside the subnormal range, the canonical representation
|
---|
434 | always has an encoded exponent that is nonzero and a leading significand bit
|
---|
435 | <NOBR>of 1</NOBR>.
|
---|
436 | </P>
|
---|
437 |
|
---|
438 | <P>
|
---|
439 | For an infinity or NaN, the leading significand bit is similarly expected to
|
---|
440 | <NOBR>be 1</NOBR>.
|
---|
441 | An infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
|
---|
442 | considered non-canonical.
|
---|
443 | Hence, altogether, to be canonical, a value of type <CODE>extFloat80_t</CODE>
|
---|
444 | must have a leading significand bit <NOBR>of 1</NOBR>, unless the value is
|
---|
445 | subnormal or zero, in which case the leading significand bit and the encoded
|
---|
446 | exponent must both be zero.
|
---|
447 | </P>
|
---|
448 |
|
---|
449 | <P>
|
---|
450 | SoftFloat’s functions are not guaranteed to operate as expected when
|
---|
451 | inputs of type <CODE>extFloat80_t</CODE> are non-canonical.
|
---|
452 | Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any)
|
---|
453 | are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
|
---|
454 | be canonical.
|
---|
455 | </P>
|
---|
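<P>
The canonical-form rule summarized above reduces to a comparison between the
encoded exponent and the leading significand bit.
The function below is a rough sketch under stated assumptions, not a SoftFloat
routine: the name <CODE>demo_extF80_isCanonical</CODE> is hypothetical, and it
takes the two fields separately rather than an <CODE>extFloat80_t</CODE>.
One corner is a judgment call: an encoding with a zero exponent and a leading
bit <NOBR>of 1</NOBR> is treated here as non-canonical, following the statement
that values outside the subnormal range have a nonzero encoded exponent.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reports whether an 80-bit encoding looks canonical.
   - encoded exponent zero (zero or subnormal): leading bit must be 0;
   - encoded exponent nonzero (normal, infinity, NaN): leading bit must be 1.
   Assumption: exponent zero with leading bit 1 is reported as non-canonical. */
bool demo_extF80_isCanonical(uint16_t signExp, uint64_t signif)
{
    uint16_t exp = (uint16_t)(signExp & 0x7FFF);   /* strip the sign bit */
    bool leadingBit = (signif >> 63) != 0;
    return (exp != 0) == leadingBit;
}
```

<P>
A check of this kind could be applied to <CODE>extFloat80_t</CODE> values that
arrive from outside the program (files, hardware registers) before they are
handed to arithmetic functions.
</P>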
456 |
|
---|
457 | <H3>4.5. Conventions for Passing Arguments and Results</H3>
|
---|
458 |
|
---|
459 | <P>
|
---|
460 | Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
|
---|
461 | <NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
|
---|
462 | cases passed as function arguments by value.
|
---|
463 | Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
|
---|
464 | is always returned directly as the function result.
|
---|
465 | Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
|
---|
466 | floating-point values has this simple signature:
|
---|
467 | <BLOCKQUOTE>
|
---|
468 | <CODE>float64_t f64_add( float64_t, float64_t );</CODE>
|
---|
469 | </BLOCKQUOTE>
|
---|
470 | </P>
|
---|
471 |
|
---|
472 | <P>
|
---|
473 | The story is more complex when function inputs and outputs are
|
---|
474 | <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
|
---|
475 | For these types, SoftFloat always provides a function that passes these larger
|
---|
476 | values into or out of the function indirectly, via pointers.
|
---|
477 | For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
|
---|
478 | SoftFloat supplies this function:
|
---|
479 | <BLOCKQUOTE>
|
---|
480 | <CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
|
---|
481 | </BLOCKQUOTE>
|
---|
482 | The first two arguments point to the values to be added, and the last argument
|
---|
483 | points to the location where the sum will be stored.
|
---|
484 | The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
|
---|
485 | that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”,
|
---|
486 | pointed to by pointer arguments.
|
---|
487 | </P>
|
---|
488 |
|
---|
489 | <P>
|
---|
490 | All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
|
---|
491 | types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
|
---|
492 | At the same time, SoftFloat ports may also implement alternate versions of
|
---|
493 | these same functions that pass <CODE>extFloat80_t</CODE> and
|
---|
494 | <CODE>float128_t</CODE> by value, like the smaller formats.
|
---|
495 | Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
|
---|
496 | SoftFloat port may also supply an equivalent function with this signature:
|
---|
497 | <BLOCKQUOTE>
|
---|
498 | <CODE>float128_t f128_add( float128_t, float128_t );</CODE>
|
---|
499 | </BLOCKQUOTE>
|
---|
500 | </P>
|
---|
501 |
|
---|
502 | <P>
|
---|
503 | As a general rule, on computers where the machine word size is
|
---|
504 | <NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
|
---|
505 | (e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
|
---|
506 | and <CODE>float128_t</CODE>, because passing such large types directly can have
|
---|
507 | significant extra cost.
|
---|
508 | On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
|
---|
509 | function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
|
---|
510 | provided, because the cost of passing by value is then more reasonable.
|
---|
511 | Applications that must be portable across both classes of computers must use
|
---|
512 | the pointer-based functions, as these are always implemented.
|
---|
513 | However, if it is known that SoftFloat includes the by-value functions for all
|
---|
514 | platforms of interest, programmers can use whichever version they prefer.
|
---|
515 | </P>
|
---|
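<P>
The relationship between the two calling styles can be illustrated without
SoftFloat itself.
In the sketch below, everything is a hypothetical stand-in: type
<CODE>demo128_t</CODE> plays the role of <CODE>float128_t</CODE>, and the
&ldquo;arithmetic&rdquo; is plain <NOBR>128-bit</NOBR> integer addition used
only as a placeholder.
It shows how a by-value function can be a thin wrapper around the
pass-by-pointer form, which is one plausible way to picture the pair
<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>, not a description of how any
port implements them.
</P>

```c
#include <stdint.h>

/* Hypothetical 128-bit container standing in for float128_t. */
typedef struct { uint64_t lo, hi; } demo128_t;

/* Pass-by-pointer style (compare f128M_add): operands and result are in
   memory.  Placeholder operation: 128-bit integer addition modulo 2^128. */
void demo128M_add(const demo128_t *aPtr, const demo128_t *bPtr, demo128_t *zPtr)
{
    uint64_t lo = aPtr->lo + bPtr->lo;
    uint64_t carry = (lo < aPtr->lo);      /* 1 if the low half wrapped */
    zPtr->hi = aPtr->hi + bPtr->hi + carry;
    zPtr->lo = lo;
}

/* By-value style (compare f128_add): same operation, arguments and result
   passed directly.  Implemented here by delegating to the pointer version. */
demo128_t demo128_add(demo128_t a, demo128_t b)
{
    demo128_t z;
    demo128M_add(&a, &b, &z);
    return z;
}
```

<P>
Code that calls only the pointer-based form stays usable on ports that omit the
by-value form, at the cost of naming a destination object for every result.
</P>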
516 |
|
---|
517 |
|
---|
518 | <H2>5. Reserved Names</H2>
|
---|
519 |
|
---|
520 | <P>
|
---|
521 | In addition to the variables and functions documented here, SoftFloat defines
|
---|
522 | some symbol names for its own private use.
|
---|
523 | These private names always begin with the prefix
|
---|
524 | ‘<CODE>softfloat_</CODE>’.
|
---|
525 | When a program includes header <CODE>softfloat.h</CODE> or links with the
|
---|
526 | SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’
|
---|
527 | are reserved for possible use by SoftFloat.
|
---|
528 | Applications that use SoftFloat should not define their own names with this
|
---|
529 | prefix, and should reference only such names as are documented.
|
---|
530 | </P>
|
---|
531 |
|
---|
532 |
|
---|
533 | <H2>6. Mode Variables</H2>
|
---|
534 |
|
---|
535 | <P>
|
---|
536 | The following global variables control rounding mode, underflow detection, and
|
---|
537 | the <NOBR>80-bit</NOBR> extended format’s rounding precision:
|
---|
538 | <BLOCKQUOTE>
|
---|
539 | <CODE>softfloat_roundingMode</CODE><BR>
|
---|
540 | <CODE>softfloat_detectTininess</CODE><BR>
|
---|
541 | <CODE>extF80_roundingPrecision</CODE>
|
---|
542 | </BLOCKQUOTE>
|
---|
543 | These mode variables are covered in the next several subsections.
|
---|
544 | For some SoftFloat ports, these variables may be <I>per-thread</I> (declared
|
---|
545 | <CODE>thread_local</CODE>), meaning that different execution threads have their
|
---|
546 | own separate copies of the variables.
|
---|
547 | </P>
|
---|
548 |
|
---|
549 | <H3>6.1. Rounding Mode</H3>
|
---|
550 |
|
---|
551 | <P>
|
---|
552 | All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
|
---|
553 | implemented for all operations that require rounding.
|
---|
554 | Some ports of SoftFloat may also implement the <I>round-to-odd</I> mode.
|
---|
555 | </P>
|
---|
556 |
|
---|
557 | <P>
|
---|
558 | The rounding mode is selected by the global variable
|
---|
559 | <BLOCKQUOTE>
|
---|
560 | <CODE>uint_fast8_t softfloat_roundingMode;</CODE>
|
---|
561 | </BLOCKQUOTE>
|
---|
562 | This variable may be set to one of the values
|
---|
563 | <BLOCKQUOTE>
|
---|
564 | <TABLE CELLSPACING=0 CELLPADDING=0>
|
---|
565 | <TR>
|
---|
566 | <TD><CODE>softfloat_round_near_even</CODE></TD>
|
---|
567 | <TD>round to nearest, with ties to even</TD>
|
---|
568 | </TR>
|
---|
569 | <TR>
|
---|
570 | <TD><CODE>softfloat_round_near_maxMag </CODE></TD>
|
---|
571 | <TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
|
---|
572 | </TR>
|
---|
573 | <TR>
|
---|
574 | <TD><CODE>softfloat_round_minMag</CODE></TD>
|
---|
575 | <TD>round to minimum magnitude (toward zero)</TD>
|
---|
576 | </TR>
|
---|
577 | <TR>
|
---|
578 | <TD><CODE>softfloat_round_min</CODE></TD>
|
---|
579 | <TD>round to minimum (down)</TD>
|
---|
580 | </TR>
|
---|
581 | <TR>
|
---|
582 | <TD><CODE>softfloat_round_max</CODE></TD>
|
---|
583 | <TD>round to maximum (up)</TD>
|
---|
584 | </TR>
|
---|
585 | <TR>
|
---|
586 | <TD><CODE>softfloat_round_odd</CODE></TD>
|
---|
587 | <TD>round to odd (jamming), if supported by the SoftFloat port</TD>
|
---|
588 | </TR>
|
---|
589 | </TABLE>
|
---|
590 | </BLOCKQUOTE>
|
---|
591 | Variable <CODE>softfloat_roundingMode</CODE> is initialized to
|
---|
592 | <CODE>softfloat_round_near_even</CODE>.
|
---|
593 | </P>
|
---|
594 |
|
---|
595 | <P>
|
---|
596 | When <CODE>softfloat_round_odd</CODE> is the rounding mode for a function that
|
---|
597 | rounds to an integer value (either conversion to an integer format or a
|
---|
598 | ‘<CODE>roundToInt</CODE>’ function), if the input is not already an
|
---|
599 | integer, the rounded result is the closest <EM>odd</EM> integer.
|
---|
600 | For other operations, this rounding mode acts as though the floating-point
|
---|
601 | result is first rounded to minimum magnitude, the same as
|
---|
602 | <CODE>softfloat_round_minMag</CODE>, and then, if the result is inexact, the
|
---|
603 | least-significant bit of the result is set <NOBR>to 1</NOBR>.
|
---|
604 | Rounding to odd is also known as <EM>jamming</EM>.
|
---|
605 | </P>
|
---|
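<P>
For the non-integer case, the &ldquo;truncate, then set the last bit if
inexact&rdquo; rule can be sketched on a bare significand.
The helper below is illustrative only: the name
<CODE>demo_roundOddDropBits</CODE> is hypothetical, it works on an unsigned
magnitude, and it ignores signs, exponents, and special values, all of which a
real implementation must handle.
</P>

```c
#include <stdint.h>

/* Hypothetical helper: drop the low `dropBits` bits of `sig`
   (0 <= dropBits <= 63), rounding to odd.  The value is first truncated
   (rounded to minimum magnitude); if any dropped bit was 1, the result is
   inexact and its least-significant bit is forced to 1. */
uint64_t demo_roundOddDropBits(uint64_t sig, unsigned dropBits)
{
    uint64_t kept = sig >> dropBits;
    uint64_t lost = sig & ((UINT64_C(1) << dropBits) - 1);
    return kept | (uint64_t)(lost != 0);
}
```

<P>
With two bits dropped, the binary input <CODE>10010</CODE> truncates to
<CODE>100</CODE> and, being inexact, becomes <CODE>101</CODE>; the exact input
<CODE>10000</CODE> stays <CODE>100</CODE>.
</P>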
606 |
|
---|
607 | <H3>6.2. Underflow Detection</H3>
|
---|
608 |
|
---|
609 | <P>
|
---|
610 | In the terminology of the IEEE Standard, SoftFloat can detect tininess for
|
---|
611 | underflow either before or after rounding.
|
---|
612 | The choice is made by the global variable
|
---|
613 | <BLOCKQUOTE>
|
---|
614 | <CODE>uint_fast8_t softfloat_detectTininess;</CODE>
|
---|
615 | </BLOCKQUOTE>
|
---|
616 | which can be set to either
|
---|
617 | <BLOCKQUOTE>
|
---|
618 | <CODE>softfloat_tininess_beforeRounding</CODE><BR>
|
---|
619 | <CODE>softfloat_tininess_afterRounding</CODE>
|
---|
620 | </BLOCKQUOTE>
|
---|
621 | Detecting tininess after rounding is usually better because it results in fewer
|
---|
622 | spurious underflow signals.
|
---|
623 | The other option is provided for compatibility with some systems.
|
---|
624 | Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
|
---|
625 | always detects loss of accuracy for underflow as an inexact result.
|
---|
626 | </P>
|
---|
627 |
|
---|
628 | <H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
|
---|
629 |
|
---|
630 | <P>
|
---|
631 | For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
|
---|
632 | arithmetic operations is controlled by the global variable
|
---|
633 | <BLOCKQUOTE>
|
---|
634 | <CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
|
---|
635 | </BLOCKQUOTE>
|
---|
636 | The operations affected are:
|
---|
637 | <BLOCKQUOTE>
|
---|
638 | <CODE>extF80_add</CODE><BR>
|
---|
639 | <CODE>extF80_sub</CODE><BR>
|
---|
640 | <CODE>extF80_mul</CODE><BR>
|
---|
641 | <CODE>extF80_div</CODE><BR>
|
---|
642 | <CODE>extF80_sqrt</CODE>
|
---|
643 | </BLOCKQUOTE>
|
---|
644 | When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
|
---|
645 | these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
|
---|
646 | double-extended-precision format, as occurs for other formats.
|
---|
647 | Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
|
---|
648 | operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
|
---|
649 | <CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
|
---|
650 | <CODE>float64_t</CODE>), respectively.
|
---|
651 | When rounding to reduced precision, additional bits in the result significand
|
---|
652 | beyond the rounding point are set to zero.
|
---|
653 | The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
|
---|
654 | other than 32, 64, or 80 are not specified.
|
---|
655 | Operations other than the ones listed above are not affected by
|
---|
656 | <CODE>extF80_roundingPrecision</CODE>.
|
---|
657 | </P>
|
---|
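<P>
The effect of a reduced rounding precision on a result significand can be
modeled as follows.
This is only an illustrative sketch, not SoftFloat's internal code: the
function name <CODE>demo_roundSig64</CODE> and its interface are hypothetical,
it assumes round-to-nearest-even, and it takes the number of significand bits
to keep (24 for <CODE>float32_t</CODE> precision, 53 for
<CODE>float64_t</CODE> precision, 64 for full precision) rather than the value
stored in <CODE>extF80_roundingPrecision</CODE>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model only (hypothetical helper, not part of SoftFloat).
 * Rounds a 64-bit significand, with its explicit integer bit in bit 63, to
 * `precisionBits` significant bits (assumed to be 24, 53, or 64) using
 * round-to-nearest-even.  Bits below the rounding point are cleared.
 * If rounding carries out of bit 63, `*carryPtr` is set and the significand
 * is renormalized; a caller would then increment the exponent.
 */
uint64_t
 demo_roundSig64( uint64_t sig, unsigned precisionBits, bool *carryPtr )
{
    *carryPtr = false;
    if ( precisionBits >= 64 ) return sig;

    unsigned dropBits = 64 - precisionBits;
    uint64_t roundMask = (UINT64_C( 1 ) << dropBits) - 1;
    uint64_t half = UINT64_C( 1 ) << (dropBits - 1);
    uint64_t discarded = sig & roundMask;
    uint64_t kept = sig & ~roundMask;
    bool keptIsOdd = ((kept >> dropBits) & 1) != 0;

    if ( (discarded > half) || ((discarded == half) && keptIsOdd) ) {
        kept += UINT64_C( 1 ) << dropBits;
        if ( kept == 0 ) {
            /* The addition wrapped: carry out of bit 63. */
            kept = UINT64_C( 0x8000000000000000 );
            *carryPtr = true;
        }
    }
    return kept;
}
```

<P>
In this model, a significand lying exactly halfway between two 24-bit values
rounds to the one whose last kept bit is zero, and every bit below the
rounding point in the returned value is zero.
</P>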
658 |
|
---|
659 |
|
---|
660 | <H2>7. Exceptions and Exception Flags</H2>
|
---|
661 |
|
---|
662 | <P>
|
---|
663 | All five exception flags required by the IEEE Floating-Point Standard are
|
---|
664 | implemented.
|
---|
665 | Each flag is stored as a separate bit in the global variable
|
---|
666 | <BLOCKQUOTE>
|
---|
667 | <CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
|
---|
668 | </BLOCKQUOTE>
|
---|
669 | The positions of the exception flag bits within this variable are determined by
|
---|
670 | the bit masks
|
---|
671 | <BLOCKQUOTE>
|
---|
672 | <CODE>softfloat_flag_inexact</CODE><BR>
|
---|
673 | <CODE>softfloat_flag_underflow</CODE><BR>
|
---|
674 | <CODE>softfloat_flag_overflow</CODE><BR>
|
---|
675 | <CODE>softfloat_flag_infinite</CODE><BR>
|
---|
676 | <CODE>softfloat_flag_invalid</CODE>
|
---|
677 | </BLOCKQUOTE>
|
---|
678 | Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
|
---|
679 | meaning no exceptions.
|
---|
680 | </P>
|
---|
681 |
|
---|
682 | <P>
|
---|
683 | For some SoftFloat ports, <CODE>softfloat_exceptionFlags</CODE> may be
|
---|
684 | <I>per-thread</I> (declared <CODE>thread_local</CODE>), meaning that different
|
---|
685 | execution threads have their own separate instances of it.
|
---|
686 | </P>
|
---|
687 |
|
---|
688 | <P>
|
---|
689 | An individual exception flag can be cleared with the statement
|
---|
690 | <BLOCKQUOTE>
|
---|
691 | <CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
|
---|
692 | </BLOCKQUOTE>
|
---|
693 | where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
|
---|
694 | To raise a floating-point exception, function <CODE>softfloat_raiseFlags</CODE>
|
---|
695 | should normally be used.
|
---|
696 | </P>
|
---|
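<P>
The bit operations involved in testing, clearing, and raising flags can be
sketched in isolation.
The names prefixed <CODE>demo_</CODE> below are hypothetical stand-ins so that
the fragment compiles without the library, and the mask values are
placeholders chosen for the sketch; a real program would use
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_flag_</CODE> masks,
and <CODE>softfloat_raiseFlags</CODE> from the SoftFloat header instead.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit masks; real code takes the masks from the SoftFloat header. */
enum {
    demo_flag_inexact   =  1,
    demo_flag_underflow =  2,
    demo_flag_overflow  =  4,
    demo_flag_infinite  =  8,
    demo_flag_invalid   = 16
};

/* Stand-in for the global variable softfloat_exceptionFlags. */
uint_fast8_t demo_exceptionFlags = 0;

/* Returns true if any flag selected by `mask` is currently raised. */
bool demo_testFlags( uint_fast8_t mask )
{
    return (demo_exceptionFlags & mask) != 0;
}

/* Clears the flags selected by `mask`, leaving all other flags untouched. */
void demo_clearFlags( uint_fast8_t mask )
{
    demo_exceptionFlags &= (uint_fast8_t) ~mask;
}

/* Models a raise function that only sets the requested flags. */
void demo_raiseFlags( uint_fast8_t mask )
{
    demo_exceptionFlags |= mask;
}
```

<P>
Because the flags are sticky, a typical pattern is to clear the flags of
interest, perform a sequence of operations, and then test the flags once at
the end.
</P>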
697 |
|
---|
698 | <P>
|
---|
699 | When SoftFloat detects an exception other than <I>inexact</I>, it calls
|
---|
700 | <CODE>softfloat_raiseFlags</CODE>.
|
---|
701 | The default version of this function simply raises the corresponding exception
|
---|
702 | flags.
|
---|
703 | Particular ports of SoftFloat may support alternate behavior, such as exception
|
---|
704 | traps, by modifying the default <CODE>softfloat_raiseFlags</CODE>.
|
---|
705 | A program may also supply its own <CODE>softfloat_raiseFlags</CODE> function to
|
---|
706 | override the one from the SoftFloat library.
|
---|
707 | </P>
|
---|
708 |
|
---|
709 | <P>
|
---|
710 | Because inexact results occur frequently under most circumstances (and thus are
|
---|
711 | hardly exceptional), SoftFloat does not ordinarily call
|
---|
712 | <CODE>softfloat_raiseFlags</CODE> for <I>inexact</I> exceptions.
|
---|
713 | It does, however, always raise the <I>inexact</I> exception flag as required.
|
---|
714 | </P>
|
---|
715 |
|
---|
716 |
|
---|
717 | <H2>8. Function Details</H2>
|
---|
718 |
|
---|
719 | <P>
|
---|
720 | In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
|
---|
721 | a substitute for one of these abbreviations:
|
---|
722 | <BLOCKQUOTE>
|
---|
723 | <TABLE CELLSPACING=0 CELLPADDING=0>
|
---|
724 | <TR>
|
---|
725 | <TD><CODE>f16</CODE></TD>
|
---|
726 | <TD>indicates <CODE>float16_t</CODE>, passed by value</TD>
|
---|
727 | </TR>
|
---|
728 | <TR>
|
---|
729 | <TD><CODE>f32</CODE></TD>
|
---|
730 | <TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
|
---|
731 | </TR>
|
---|
732 | <TR>
|
---|
733 | <TD><CODE>f64</CODE></TD>
|
---|
734 | <TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
|
---|
735 | </TR>
|
---|
736 | <TR>
|
---|
737 | <TD><CODE>extF80M </CODE></TD>
|
---|
738 | <TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
|
---|
739 | </TR>
|
---|
740 | <TR>
|
---|
741 | <TD><CODE>extF80</CODE></TD>
|
---|
742 | <TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
|
---|
743 | </TR>
|
---|
744 | <TR>
|
---|
745 | <TD><CODE>f128M</CODE></TD>
|
---|
746 | <TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
|
---|
747 | </TR>
|
---|
748 | <TR>
|
---|
749 | <TD><CODE>f128</CODE></TD>
|
---|
750 | <TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
|
---|
751 | </TR>
|
---|
752 | </TABLE>
|
---|
753 | </BLOCKQUOTE>
|
---|
754 | The circumstances under which values of floating-point types
|
---|
755 | <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
|
---|
756 | value or indirectly via pointers were discussed earlier in
|
---|
757 | <NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
|
---|
758 | </P>
|
---|
759 |
|
---|
760 | <H3>8.1. Conversions from Integer to Floating-Point</H3>
|
---|
761 |
|
---|
762 | <P>
|
---|
763 | All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
|
---|
764 | signed or unsigned, to a floating-point format are supported.
|
---|
765 | Functions performing these conversions have these names:
|
---|
766 | <BLOCKQUOTE>
|
---|
767 | <CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
|
---|
768 | <CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
|
---|
769 | <CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
|
---|
770 | <CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
|
---|
771 | </BLOCKQUOTE>
|
---|
772 | Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
|
---|
773 | double-precision and larger formats are always exact, and likewise conversions
|
---|
774 | from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
|
---|
775 | double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are
|
---|
776 | always exact.
|
---|
777 | </P>
|
---|
778 |
|
---|
779 | <P>
|
---|
780 | Each conversion function takes one input of the appropriate type and generates
|
---|
781 | one output.
|
---|
782 | The following illustrates the signatures of these functions in cases when the
|
---|
783 | floating-point result is passed either by value or via pointers:
|
---|
784 | <BLOCKQUOTE>
|
---|
785 | <PRE>
|
---|
786 | float64_t i32_to_f64( int32_t <I>a</I> );
|
---|
787 | </PRE>
|
---|
788 | <PRE>
|
---|
789 | void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
|
---|
790 | </PRE>
|
---|
791 | </BLOCKQUOTE>
|
---|
792 | </P>
|
---|
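<P>
The exactness claims above follow from significand widths: a
<NOBR>64-bit</NOBR> double-precision value has a <NOBR>53-bit</NOBR>
significand, enough for any <NOBR>32-bit</NOBR> integer but not for every
<NOBR>64-bit</NOBR> integer.
The sketch below checks this using the host compiler's own conversion as a
stand-in for <CODE>i64_to_f64</CODE>; the function name is hypothetical, and
the host's <CODE>double</CODE> is assumed to be the IEEE <NOBR>64-bit</NOBR>
format.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if converting `a` to the host's double loses no information.
 * Hypothetical helper: it uses the compiler's integer-to-double conversion,
 * not SoftFloat, and assumes double is the IEEE 64-bit format.
 */
bool demo_i64FitsInDouble( int64_t a )
{
    /* volatile forces the value to be stored with double precision. */
    volatile double d = (double) a;

    /* 2^63 is not representable as int64_t, so guard before converting back. */
    if ( d >= 9223372036854775808.0 ) return false;
    return (int64_t) d == a;
}
```

<P>
Every <NOBR>32-bit</NOBR> integer passes this check, whereas the integer
2<SUP>53</SUP>&nbsp;+&nbsp;1 does not, since it must be rounded to fit a
<NOBR>53-bit</NOBR> significand.
</P>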
793 |
|
---|
794 | <H3>8.2. Conversions from Floating-Point to Integer</H3>
|
---|
795 |
|
---|
796 | <P>
|
---|
797 | Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
|
---|
798 | <NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
|
---|
799 | functions:
|
---|
800 | <BLOCKQUOTE>
|
---|
801 | <CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
|
---|
802 | <CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
|
---|
803 | <CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
|
---|
804 | <CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
|
---|
805 | </BLOCKQUOTE>
|
---|
806 | The functions have signatures as follows, depending on whether the
|
---|
807 | floating-point input is passed by value or via pointers:
|
---|
808 | <BLOCKQUOTE>
|
---|
809 | <PRE>
|
---|
810 | int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
|
---|
811 | </PRE>
|
---|
812 | <PRE>
|
---|
813 | int_fast32_t
|
---|
814 | f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
|
---|
815 | </PRE>
|
---|
816 | </BLOCKQUOTE>
|
---|
817 | </P>
|
---|
818 |
|
---|
819 | <P>
|
---|
820 | The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
|
---|
821 | the conversion.
|
---|
822 | The variable that usually indicates rounding mode,
|
---|
823 | <CODE>softfloat_roundingMode</CODE>, is ignored.
|
---|
824 | Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
|
---|
825 | exception flag is raised if the conversion is not exact.
|
---|
826 | If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
|
---|
827 | be raised;
|
---|
828 | otherwise, it will not be, even if the conversion is inexact.
|
---|
829 | </P>
|
---|
830 |
|
---|
831 | <P>
|
---|
832 | A conversion from floating-point to integer format raises the <I>invalid</I>
|
---|
833 | exception if the source value cannot be rounded to a representable integer of
|
---|
834 | the desired size (32 or 64 bits).
|
---|
835 | In such circumstances, the integer result returned is determined by the
|
---|
836 | particular port of SoftFloat, although typically this value will be either the
|
---|
837 | maximum or minimum value of the integer format.
|
---|
838 | The functions that convert to integer types never raise the floating-point
|
---|
839 | <I>overflow</I> exception.
|
---|
840 | </P>
|
---|
841 |
|
---|
842 | <P>
|
---|
843 | Because languages such <NOBR>as C</NOBR> require that conversions to integers
|
---|
844 | be rounded toward zero, the following functions are provided for improved speed
|
---|
845 | and convenience:
|
---|
846 | <BLOCKQUOTE>
|
---|
847 | <CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
|
---|
848 | <CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
|
---|
849 | <CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
|
---|
850 | <CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
|
---|
851 | </BLOCKQUOTE>
|
---|
852 | These functions round only toward zero (to minimum magnitude).
|
---|
853 | The signatures for these functions are the same as above without the redundant
|
---|
854 | <CODE><I>roundingMode</I></CODE> argument:
|
---|
855 | <BLOCKQUOTE>
|
---|
856 | <PRE>
|
---|
857 | int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
|
---|
858 | </PRE>
|
---|
859 | <PRE>
|
---|
860 | int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
|
---|
861 | </PRE>
|
---|
862 | </BLOCKQUOTE>
|
---|
863 | </P>
|
---|
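<P>
The behavior described in this section for a round-toward-zero conversion can
be modeled on the host's <CODE>double</CODE> type.
The following is a sketch under stated assumptions, not SoftFloat code: the
function name and the <CODE><I>invalidPtr</I></CODE> output are hypothetical
(SoftFloat reports <I>invalid</I> through its exception flags instead), and
the choice to return the maximum or minimum integer on an invalid conversion
is one possibility among those a port may adopt.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Models a conversion from double to a signed 32-bit integer with rounding
 * toward zero.  Hypothetical helper: `*invalidPtr` reports the condition for
 * which SoftFloat would raise the invalid exception.
 */
int_fast32_t demo_f64_to_i32_r_minMag( double a, bool *invalidPtr )
{
    *invalidPtr = false;
    if ( a != a ) {
        /* NaN cannot be rounded to any integer. */
        *invalidPtr = true;
        return INT32_MAX;
    }
    if ( a >= 2147483648.0 ) {
        /* Truncated value would exceed INT32_MAX. */
        *invalidPtr = true;
        return INT32_MAX;
    }
    if ( a <= -2147483649.0 ) {
        /* Truncated value would be below INT32_MIN. */
        *invalidPtr = true;
        return INT32_MIN;
    }
    /* In range: a C cast from double to integer truncates toward zero. */
    return (int32_t) a;
}
```

<P>
Note that the range test applies to the rounded value, not the source: in this
model &minus;2147483648.5 truncates to the representable value
&minus;2147483648 and so is not an invalid conversion.
</P>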
864 |
|
---|
865 | <H3>8.3. Conversions Among Floating-Point Types</H3>
|
---|
866 |
|
---|
867 | <P>
|
---|
868 | Conversions between floating-point formats are done by functions with these
|
---|
869 | names:
|
---|
870 | <BLOCKQUOTE>
|
---|
871 | <CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
|
---|
872 | </BLOCKQUOTE>
|
---|
873 | All combinations of source and result type are supported where the source and
|
---|
874 | result are different formats.
|
---|
875 | There are four different styles of signature for these functions, depending on
|
---|
876 | whether the input and the output floating-point values are passed by value or
|
---|
877 | via pointers:
|
---|
878 | <BLOCKQUOTE>
|
---|
879 | <PRE>
|
---|
880 | float32_t f64_to_f32( float64_t <I>a</I> );
|
---|
881 | </PRE>
|
---|
882 | <PRE>
|
---|
883 | float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
|
---|
884 | </PRE>
|
---|
885 | <PRE>
|
---|
886 | void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
|
---|
887 | </PRE>
|
---|
888 | <PRE>
|
---|
889 | void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
|
---|
890 | </PRE>
|
---|
891 | </BLOCKQUOTE>
|
---|
892 | </P>
|
---|
893 |
|
---|
894 | <P>
|
---|
895 | Conversions from a smaller to a larger floating-point format are always exact
|
---|
896 | and so require no rounding.
|
---|
897 | </P>
|
---|
898 |
|
---|
899 | <H3>8.4. Basic Arithmetic Functions</H3>
|
---|
900 |
|
---|
901 | <P>
|
---|
902 | The following basic arithmetic functions are provided:
|
---|
903 | <BLOCKQUOTE>
|
---|
904 | <CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
|
---|
905 | <CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
|
---|
906 | <CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
|
---|
907 | <CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
|
---|
908 | <CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
|
---|
909 | </BLOCKQUOTE>
|
---|
910 | Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
|
---|
911 | (square root) which takes only one.
|
---|
912 | The operands and result are all of the same floating-point format.
|
---|
913 | Signatures for these functions take the following forms:
|
---|
914 | <BLOCKQUOTE>
|
---|
915 | <PRE>
|
---|
916 | float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
|
---|
917 | </PRE>
|
---|
918 | <PRE>
|
---|
919 | void
|
---|
920 | f128M_add(
|
---|
921 | const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
|
---|
922 | </PRE>
|
---|
923 | <PRE>
|
---|
924 | float64_t f64_sqrt( float64_t <I>a</I> );
|
---|
925 | </PRE>
|
---|
926 | <PRE>
|
---|
927 | void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
|
---|
928 | </PRE>
|
---|
929 | </BLOCKQUOTE>
|
---|
930 | When floating-point values are passed indirectly through pointers, arguments
|
---|
931 | <CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
|
---|
932 | operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
|
---|
933 | location where the result is stored.
|
---|
934 | </P>
|
---|
935 |
|
---|
936 | <P>
|
---|
937 | Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
|
---|
938 | (<CODE>extFloat80_t</CODE>) functions is affected by variable
|
---|
939 | <CODE>extF80_roundingPrecision</CODE>, as explained earlier in
|
---|
940 | <NOBR>section 6.3</NOBR>,
|
---|
941 | <I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
|
---|
942 | </P>
|
---|
943 |
|
---|
944 | <H3>8.5. Fused Multiply-Add Functions</H3>
|
---|
945 |
|
---|
946 | <P>
|
---|
947 | The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
|
---|
948 | multiply-add</I> operation that does a combined multiplication and addition
|
---|
949 | with only a single rounding.
|
---|
950 | SoftFloat implements fused multiply-add with functions
|
---|
951 | <BLOCKQUOTE>
|
---|
952 | <CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
|
---|
953 | </BLOCKQUOTE>
|
---|
954 | Unlike other operations, fused multiply-add is not supported for the
|
---|
955 | <NOBR>80-bit</NOBR> double-extended-precision format,
|
---|
956 | <CODE>extFloat80_t</CODE>.
|
---|
957 | </P>
|
---|
958 |
|
---|
959 | <P>
|
---|
960 | Depending on whether floating-point values are passed by value or via pointers,
|
---|
961 | the fused multiply-add functions have signatures of these forms:
|
---|
962 | <BLOCKQUOTE>
|
---|
963 | <PRE>
|
---|
964 | float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
|
---|
965 | </PRE>
|
---|
966 | <PRE>
|
---|
967 | void
|
---|
968 | f128M_mulAdd(
|
---|
969 | const float128_t *<I>aPtr</I>,
|
---|
970 | const float128_t *<I>bPtr</I>,
|
---|
971 | const float128_t *<I>cPtr</I>,
|
---|
972 | float128_t *<I>destPtr</I>
|
---|
973 | );
|
---|
974 | </PRE>
|
---|
975 | </BLOCKQUOTE>
|
---|
976 | The functions compute
|
---|
977 | <NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>)
|
---|
978 | + <CODE><I>c</I></CODE></NOBR>
|
---|
979 | with a single rounding.
|
---|
980 | When floating-point values are passed indirectly through pointers, arguments
|
---|
981 | <CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
|
---|
982 | <CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
|
---|
983 | <CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
|
---|
984 | <CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
---|
985 | </P>
|
---|
986 |
|
---|
987 | <P>
|
---|
988 | If one of the multiplication operands <CODE><I>a</I></CODE> and
|
---|
989 | <CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
|
---|
990 | the invalid exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
|
---|
991 | </P>
|
---|
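<P>
The practical difference between one rounding and two can be seen with host
arithmetic alone.
The sketch below does not call SoftFloat, and its function names are
hypothetical.
It assumes the host's <CODE>float</CODE> and <CODE>double</CODE> are the IEEE
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> formats.
The second function uses <CODE>double</CODE> as a wider intermediate: the
product of two <CODE>float</CODE> values is always exact in
<CODE>double</CODE>, but the sum may itself round, so this is not a general
replacement for a true fused multiply-add such as <CODE>f32_mulAdd</CODE>.
</P>

```c
/*
 * Unfused: the product is rounded to float before the addition.
 * The volatile store keeps the compiler from fusing or widening the two steps.
 */
float demo_mulThenAdd( float a, float b, float c )
{
    volatile float product = a * b;
    return product + c;
}

/*
 * Models a single final rounding by computing in double and rounding to float
 * once.  This matches a fused multiply-add only when the double computation
 * is exact, as it is for the operands discussed below.
 */
float demo_mulAddOnce( float a, float b, float c )
{
    return (float) ((double) a * (double) b + (double) c);
}
```

<P>
With <CODE><I>a</I></CODE> = <CODE><I>b</I></CODE> =
1&nbsp;+&nbsp;2<SUP>&minus;12</SUP> and <CODE><I>c</I></CODE> =
&minus;(1&nbsp;+&nbsp;2<SUP>&minus;11</SUP>), the exact product is
1&nbsp;+&nbsp;2<SUP>&minus;11</SUP>&nbsp;+&nbsp;2<SUP>&minus;24</SUP>.
Rounding that product to <CODE>float</CODE> first discards the last term, so
the unfused result is zero, whereas the singly rounded result is
2<SUP>&minus;24</SUP>.
</P>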
992 |
|
---|
993 | <H3>8.6. Remainder Functions</H3>
|
---|
994 |
|
---|
995 | <P>
|
---|
996 | For each format, SoftFloat implements the remainder operation defined by the
|
---|
997 | IEEE Floating-Point Standard.
|
---|
998 | The remainder functions have names
|
---|
999 | <BLOCKQUOTE>
|
---|
1000 | <CODE>&lt;<I>float</I>&gt;_rem</CODE>
|
---|
1001 | </BLOCKQUOTE>
|
---|
1002 | Each remainder operation takes two floating-point operands of the same format
|
---|
1003 | and returns a result in the same format.
|
---|
1004 | Depending on whether floating-point values are passed by value or via pointers,
|
---|
1005 | the remainder functions have signatures of these forms:
|
---|
1006 | <BLOCKQUOTE>
|
---|
1007 | <PRE>
|
---|
1008 | float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
|
---|
1009 | </PRE>
|
---|
1010 | <PRE>
|
---|
1011 | void
|
---|
1012 | f128M_rem(
|
---|
1013 | const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
|
---|
1014 | </PRE>
|
---|
1015 | </BLOCKQUOTE>
|
---|
1016 | When floating-point values are passed indirectly through pointers, arguments
|
---|
1017 | <CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
|
---|
1018 | <CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
|
---|
1019 | <CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
---|
1020 | </P>
|
---|
1021 |
|
---|
1022 | <P>
|
---|
1023 | The IEEE Standard remainder operation computes the value
|
---|
1024 | <NOBR><CODE><I>a</I></CODE>
|
---|
1025 | − <I>n</I> × <CODE><I>b</I></CODE></NOBR>,
|
---|
1026 | where <I>n</I> is the integer closest to
|
---|
1027 | <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
|
---|
1028 | If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly
|
---|
1029 | halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
|
---|
1030 | <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
|
---|
1031 | The IEEE Standard’s remainder operation is always exact and so requires
|
---|
1032 | no rounding.
|
---|
1033 | </P>
|
---|
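<P>
The definition above, including the tie-to-even rule for <I>n</I>, can be
traced with a small model.
This is a sketch for illustration only, with a hypothetical function name: it
assumes <CODE><I>b</I></CODE> is nonzero, both operands are finite, and the
quotient is small and exactly representable, none of which SoftFloat's
remainder functions require.
</P>

```c
/*
 * Models the IEEE remainder a - n * b, where n is the integer nearest a / b
 * and ties go to the even integer.  Hypothetical helper; valid only when
 * a / b is computed exactly and fits comfortably in a long long.
 */
double demo_ieeeRemainder( double a, double b )
{
    double q = a / b;
    long long n = (long long) q;        /* the cast truncates toward zero */
    if ( (double) n > q ) --n;          /* adjust so that n = floor( q ) */

    double frac = q - (double) n;       /* fractional part, in [0, 1) */
    if ( (frac > 0.5) || ((frac == 0.5) && (n % 2 != 0)) ) {
        ++n;                            /* round up, or break a tie to even */
    }
    return a - (double) n * b;
}
```

<P>
For example, 5&nbsp;&divide;&nbsp;2 is halfway between 2 and 3, so
<I>n</I>&nbsp;=&nbsp;2 and the remainder is 1, while 7&nbsp;&divide;&nbsp;2 is
halfway between 3 and 4, so <I>n</I>&nbsp;=&nbsp;4 and the remainder is
&minus;1.
Unlike the result of C's <CODE>%</CODE> operator on integers, the IEEE
remainder can be negative even when both operands are positive.
</P>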
1034 |
|
---|
1035 | <P>
|
---|
1036 | Depending on the relative magnitudes of the operands, the remainder
|
---|
1037 | functions can take considerably longer to execute than the other SoftFloat
|
---|
1038 | functions.
|
---|
1039 | This is an inherent characteristic of the remainder operation itself and is not
|
---|
1040 | a flaw in the SoftFloat implementation.
|
---|
1041 | </P>
|
---|
1042 |
|
---|
1043 | <H3>8.7. Round-to-Integer Functions</H3>
|
---|
1044 |
|
---|
1045 | <P>
|
---|
1046 | For each format, SoftFloat implements the round-to-integer operation specified
|
---|
1047 | by the IEEE Floating-Point Standard.
|
---|
1048 | These functions are named
|
---|
1049 | <BLOCKQUOTE>
|
---|
1050 | <CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
|
---|
1051 | </BLOCKQUOTE>
|
---|
1052 | Each round-to-integer operation takes a single floating-point operand.
|
---|
1053 | This operand is rounded to an integer according to a specified rounding mode,
|
---|
1054 | and the resulting integer value is returned in the same floating-point format.
|
---|
1055 | (Note that the result is not an integer type.)
|
---|
1056 | </P>
|
---|
1057 |
|
---|
1058 | <P>
|
---|
1059 | The signatures of the round-to-integer functions are similar to those for
|
---|
1060 | conversions to an integer type:
|
---|
1061 | <BLOCKQUOTE>
|
---|
1062 | <PRE>
|
---|
1063 | float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
|
---|
1064 | </PRE>
|
---|
1065 | <PRE>
|
---|
1066 | void
|
---|
1067 | f128M_roundToInt(
|
---|
1068 | const float128_t *<I>aPtr</I>,
|
---|
1069 | uint_fast8_t <I>roundingMode</I>,
|
---|
1070 | bool <I>exact</I>,
|
---|
1071 | float128_t *<I>destPtr</I>
|
---|
1072 | );
|
---|
1073 | </PRE>
|
---|
1074 | </BLOCKQUOTE>
|
---|
1075 | When floating-point values are passed indirectly through pointers,
|
---|
1076 | <CODE><I>aPtr</I></CODE> points to the input operand and
|
---|
1077 | <CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
---|
1078 | </P>
|
---|
1079 |
|
---|
1080 | <P>
|
---|
1081 | The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
|
---|
1082 | apply.
|
---|
1083 | The variable that usually indicates rounding mode,
|
---|
1084 | <CODE>softfloat_roundingMode</CODE>, is ignored.
|
---|
1085 | Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
|
---|
1086 | exception flag is raised if the conversion is not exact.
|
---|
1087 | If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
|
---|
1088 | be raised;
|
---|
1089 | otherwise, it will not be, even if the conversion is inexact.
|
---|
1090 | </P>
|
---|
1091 |
|
---|
1092 | <H3>8.8. Comparison Functions</H3>
|
---|
1093 |
|
---|
1094 | <P>
|
---|
1095 | For each format, the following floating-point comparison functions are
|
---|
1096 | provided:
|
---|
1097 | <BLOCKQUOTE>
|
---|
1098 | <CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
|
---|
1099 | <CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
|
---|
1100 | <CODE>&lt;<I>float</I>&gt;_lt</CODE>
|
---|
1101 | </BLOCKQUOTE>
|
---|
1102 | Each comparison takes two operands of the same type and returns a Boolean.
|
---|
1103 | The abbreviation <CODE>eq</CODE> stands for “equal” (=);
|
---|
1104 | <CODE>le</CODE> stands for “less than or equal” (≤);
|
---|
1105 | and <CODE>lt</CODE> stands for “less than” (&lt;).
|
---|
1106 | Depending on whether the floating-point operands are passed by value or via
|
---|
1107 | pointers, the comparison functions have signatures of these forms:
|
---|
1108 | <BLOCKQUOTE>
|
---|
1109 | <PRE>
|
---|
1110 | bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
|
---|
1111 | </PRE>
|
---|
1112 | <PRE>
|
---|
1113 | bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
|
---|
1114 | </PRE>
|
---|
1115 | </BLOCKQUOTE>
|
---|
1116 | </P>
|
---|
1117 |
|
---|
1118 | <P>
|
---|
1119 | The usual greater-than (>), greater-than-or-equal (≥), and not-equal
|
---|
1120 | (≠) comparisons are easily obtained from the functions provided.
|
---|
1121 | The not-equal function is just the logical complement of the equal function.
|
---|
1122 | The greater-than-or-equal function is identical to the less-than-or-equal
|
---|
1123 | function with the arguments in reverse order, and likewise the greater-than
|
---|
1124 | function is identical to the less-than function with the arguments reversed.
|
---|
1125 | </P>
|
---|
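<P>
The derivations just described can be written out directly.
In the sketch below, <CODE>demo_eq</CODE>, <CODE>demo_le</CODE>, and
<CODE>demo_lt</CODE> are hypothetical stand-ins built on the host's
<CODE>double</CODE> comparisons so that the fragment compiles without the
library; with SoftFloat they would be replaced by calls such as
<CODE>f64_eq</CODE>, <CODE>f64_le</CODE>, and <CODE>f64_lt</CODE>.
</P>

```c
#include <stdbool.h>

/* Stand-ins for the three provided comparisons, using the host's double. */
static bool demo_eq( double a, double b ) { return a == b; }
static bool demo_le( double a, double b ) { return a <= b; }
static bool demo_lt( double a, double b ) { return a < b; }

/* Not-equal is the logical complement of equal. */
bool demo_ne( double a, double b ) { return !demo_eq( a, b ); }

/* Greater-than-or-equal is less-than-or-equal with the arguments reversed. */
bool demo_ge( double a, double b ) { return demo_le( b, a ); }

/* Greater-than is less-than with the arguments reversed. */
bool demo_gt( double a, double b ) { return demo_lt( b, a ); }

/* Produces a NaN at run time, for exercising the unordered case. */
double demo_quietNaN( void )
{
    volatile double zero = 0.0;
    return zero / zero;
}
```

<P>
Reversing the arguments, rather than negating the opposite comparison, matters
when an operand is a NaN: every ordered comparison involving a NaN is false,
so greater-than-or-equal must not be computed as the complement of less-than.
</P>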
1126 |
|
---|
1127 | <P>
|
---|
1128 | The IEEE Floating-Point Standard specifies that the less-than-or-equal and
|
---|
1129 | less-than comparisons by default raise the <I>invalid</I> exception if either
|
---|
1130 | operand is any kind of NaN.
|
---|
1131 | Equality comparisons, on the other hand, are defined by default to raise the
|
---|
1132 | <I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
|
---|
1133 | For completeness, SoftFloat provides these complementary functions:
|
---|
1134 | <BLOCKQUOTE>
|
---|
1135 | <CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
|
---|
1136 | <CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
|
---|
1137 | <CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
|
---|
1138 | </BLOCKQUOTE>
|
---|
1139 | The <CODE>signaling</CODE> equality comparisons are identical to the default
|
---|
1140 | equality comparisons except that the <I>invalid</I> exception is raised for any
|
---|
1141 | NaN input, not just for signaling NaNs.
|
---|
1142 | Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
|
---|
1143 | default counterparts except that the <I>invalid</I> exception is not raised for
|
---|
1144 | quiet NaNs.
|
---|
1145 | </P>
|
---|
1146 |
|
---|
1147 | <H3>8.9. Signaling NaN Test Functions</H3>
|
---|
1148 |
|
---|
1149 | <P>
|
---|
1150 | Functions for testing whether a floating-point value is a signaling NaN are
|
---|
1151 | provided with these names:
|
---|
1152 | <BLOCKQUOTE>
|
---|
1153 | <CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
|
---|
1154 | </BLOCKQUOTE>
|
---|
1155 | The functions take one floating-point operand and return a Boolean indicating
|
---|
1156 | whether the operand is a signaling NaN.
|
---|
1157 | Accordingly, the functions have the forms
|
---|
1158 | <BLOCKQUOTE>
|
---|
1159 | <PRE>
|
---|
1160 | bool f64_isSignalingNaN( float64_t <I>a</I> );
|
---|
1161 | </PRE>
|
---|
1162 | <PRE>
|
---|
1163 | bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
|
---|
1164 | </PRE>
|
---|
1165 | </BLOCKQUOTE>
|
---|
1166 | </P>
|
---|
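<P>
As an illustration of what such a test involves, the sketch below examines the
raw <NOBR>64-bit</NOBR> encoding of a double-precision value.
It is not SoftFloat's code, and the function name is hypothetical.
It assumes the common convention in which the most-significant fraction bit is
1 for quiet NaNs and 0 for signaling NaNs; which encodings count as signaling
may differ between ports.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if `bits` encodes a signaling NaN in the 64-bit format,
 * assuming the most-significant fraction bit (bit 51) marks quiet NaNs.
 * Hypothetical helper operating on the raw encoding.
 */
bool demo_f64BitsIsSignalingNaN( uint64_t bits )
{
    /* Exponent field (bits 62..52) all ones and quiet bit (bit 51) clear. */
    bool expAllOnesAndQuietBitClear =
        (bits & UINT64_C( 0x7FF8000000000000 )) == UINT64_C( 0x7FF0000000000000 );

    /* Remaining fraction bits nonzero, which distinguishes NaN from infinity. */
    bool restOfFractionNonzero = (bits & UINT64_C( 0x0007FFFFFFFFFFFF )) != 0;

    return expAllOnesAndQuietBitClear && restOfFractionNonzero;
}
```

<P>
The sign bit is ignored, so both positive and negative encodings are
classified the same way, and an infinity (all-zero fraction) is never reported
as a signaling NaN.
</P>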
1167 |
|
---|
1168 | <H3>8.10. Raise-Exception Function</H3>
|
---|
1169 |
|
---|
1170 | <P>
|
---|
1171 | SoftFloat provides a single function for raising floating-point exceptions:
|
---|
1172 | <BLOCKQUOTE>
|
---|
1173 | <PRE>
|
---|
1174 | void softfloat_raiseFlags( uint_fast8_t <I>exceptions</I> );
|
---|
1175 | </PRE>
|
---|
1176 | </BLOCKQUOTE>
|
---|
1177 | The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
|
---|
1178 | exceptions to raise.
|
---|
1179 | (See <NOBR>section 7</NOBR>, <I>Exceptions and Exception Flags</I>, earlier.)
|
---|
1180 | In addition to setting the specified exception flags in variable
|
---|
1181 | <CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raiseFlags</CODE>
|
---|
1182 | function may cause a trap or abort appropriate for the current system.
|
---|
1183 | </P>
|
---|
1184 |
|
---|
1185 |
|
---|
1186 | <H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>
|
---|
1187 |
|
---|
1188 | <P>
|
---|
1189 | Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
|
---|
1190 | SoftFloat introduced numerous technical differences compared to earlier
|
---|
1191 | releases.
|
---|
1192 | </P>
|
---|
1193 |
|
---|
1194 | <H3>9.1. Name Changes</H3>
|
---|
1195 |
|
---|
1196 | <P>
|
---|
1197 | The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
|
---|
1198 | is that the names of most functions and variables have changed, even when the
|
---|
1199 | behavior has not.
|
---|
1200 | First, the floating-point types, the mode variables, the exception flags
|
---|
1201 | variable, the function to raise exceptions, and various associated constants
|
---|
1202 | have been renamed as follows:
|
---|
1203 | <BLOCKQUOTE>
|
---|
1204 | <TABLE>
|
---|
1205 | <TR>
|
---|
1206 | <TD>old name, Release 2:</TD>
|
---|
1207 | <TD>new name, Release 3:</TD>
|
---|
1208 | </TR>
|
---|
1209 | <TR>
|
---|
1210 | <TD><CODE>float32</CODE></TD>
|
---|
1211 | <TD><CODE>float32_t</CODE></TD>
|
---|
1212 | </TR>
|
---|
1213 | <TR>
|
---|
1214 | <TD><CODE>float64</CODE></TD>
|
---|
1215 | <TD><CODE>float64_t</CODE></TD>
|
---|
1216 | </TR>
|
---|
1217 | <TR>
|
---|
1218 | <TD><CODE>floatx80</CODE></TD>
|
---|
1219 | <TD><CODE>extFloat80_t</CODE></TD>
|
---|
1220 | </TR>
|
---|
1221 | <TR>
|
---|
1222 | <TD><CODE>float128</CODE></TD>
|
---|
1223 | <TD><CODE>float128_t</CODE></TD>
|
---|
1224 | </TR>
|
---|
1225 | <TR>
|
---|
1226 | <TD><CODE>float_rounding_mode</CODE></TD>
|
---|
1227 | <TD><CODE>softfloat_roundingMode</CODE></TD>
|
---|
1228 | </TR>
|
---|
1229 | <TR>
|
---|
1230 | <TD><CODE>float_round_nearest_even</CODE></TD>
|
---|
1231 | <TD><CODE>softfloat_round_near_even</CODE></TD>
|
---|
1232 | </TR>
|
---|
1233 | <TR>
|
---|
1234 | <TD><CODE>float_round_to_zero</CODE></TD>
|
---|
1235 | <TD><CODE>softfloat_round_minMag</CODE></TD>
|
---|
1236 | </TR>
|
---|
1237 | <TR>
|
---|
1238 | <TD><CODE>float_round_down</CODE></TD>
|
---|
1239 | <TD><CODE>softfloat_round_min</CODE></TD>
|
---|
1240 | </TR>
|
---|
1241 | <TR>
|
---|
1242 | <TD><CODE>float_round_up</CODE></TD>
|
---|
1243 | <TD><CODE>softfloat_round_max</CODE></TD>
|
---|
1244 | </TR>
|
---|
1245 | <TR>
|
---|
1246 | <TD><CODE>float_detect_tininess</CODE></TD>
|
---|
1247 | <TD><CODE>softfloat_detectTininess</CODE></TD>
|
---|
1248 | </TR>
|
---|
1249 | <TR>
|
---|
1250 | <TD><CODE>float_tininess_before_rounding </CODE></TD>
|
---|
1251 | <TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
|
---|
1252 | </TR>
|
---|
1253 | <TR>
|
---|
1254 | <TD><CODE>float_tininess_after_rounding</CODE></TD>
|
---|
1255 | <TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
|
---|
1256 | </TR>
|
---|
1257 | <TR>
|
---|
1258 | <TD><CODE>floatx80_rounding_precision</CODE></TD>
|
---|
1259 | <TD><CODE>extF80_roundingPrecision</CODE></TD>
|
---|
1260 | </TR>
|
---|
1261 | <TR>
|
---|
1262 | <TD><CODE>float_exception_flags</CODE></TD>
|
---|
1263 | <TD><CODE>softfloat_exceptionFlags</CODE></TD>
|
---|
1264 | </TR>
|
---|
1265 | <TR>
|
---|
1266 | <TD><CODE>float_flag_inexact</CODE></TD>
|
---|
1267 | <TD><CODE>softfloat_flag_inexact</CODE></TD>
|
---|
1268 | </TR>
|
---|
1269 | <TR>
|
---|
1270 | <TD><CODE>float_flag_underflow</CODE></TD>
|
---|
1271 | <TD><CODE>softfloat_flag_underflow</CODE></TD>
|
---|
1272 | </TR>
|
---|
1273 | <TR>
|
---|
1274 | <TD><CODE>float_flag_overflow</CODE></TD>
|
---|
1275 | <TD><CODE>softfloat_flag_overflow</CODE></TD>
|
---|
1276 | </TR>
|
---|
1277 | <TR>
|
---|
1278 | <TD><CODE>float_flag_divbyzero</CODE></TD>
|
---|
1279 | <TD><CODE>softfloat_flag_infinite</CODE></TD>
|
---|
1280 | </TR>
|
---|
1281 | <TR>
|
---|
1282 | <TD><CODE>float_flag_invalid</CODE></TD>
|
---|
1283 | <TD><CODE>softfloat_flag_invalid</CODE></TD>
|
---|
1284 | </TR>
|
---|
1285 | <TR>
|
---|
1286 | <TD><CODE>float_raise</CODE></TD>
|
---|
1287 | <TD><CODE>softfloat_raiseFlags</CODE></TD>
|
---|
1288 | </TR>
|
---|
1289 | </TABLE>
|
---|
1290 | </BLOCKQUOTE>
|
---|
1291 | </P>
|
---|
1292 |
|
---|
1293 | <P>
|
---|
1294 | Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
|
---|
1295 | function names:
|
---|
1296 | <BLOCKQUOTE>
|
---|
1297 | <TABLE>
|
---|
1298 | <TR>
|
---|
1299 | <TD>used in names in Release 2:<CODE> </CODE></TD>
|
---|
1300 | <TD>used in names in Release 3:</TD>
|
---|
1301 | </TR>
|
---|
1302 | <TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
|
---|
1303 | <TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
|
---|
1304 | <TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
|
---|
1305 | <TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
|
---|
1306 | <TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
|
---|
1307 | <TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
|
---|
1308 | </TABLE>
|
---|
1309 | </BLOCKQUOTE>
|
---|
1310 | Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
|
---|
1311 | numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
|
---|
1312 | is now <CODE>f32_add</CODE>.
|
---|
1313 | Lastly, there have been a few other changes to function names:
|
---|
1314 | <BLOCKQUOTE>
|
---|
1315 | <TABLE>
|
---|
1316 | <TR>
|
---|
1317 | <TD>used in names in Release 2:<CODE> </CODE></TD>
|
---|
1318 | <TD>used in names in Release 3:<CODE> </CODE></TD>
|
---|
1319 | <TD>relevant functions:</TD>
|
---|
1320 | </TR>
|
---|
1321 | <TR>
|
---|
1322 | <TD><CODE>_round_to_zero</CODE></TD>
|
---|
1323 | <TD><CODE>_r_minMag</CODE></TD>
|
---|
1324 | <TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
|
---|
1325 | </TR>
|
---|
1326 | <TR>
|
---|
1327 | <TD><CODE>round_to_int</CODE></TD>
|
---|
1328 | <TD><CODE>roundToInt</CODE></TD>
|
---|
1329 | <TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
|
---|
1330 | </TR>
|
---|
1331 | <TR>
|
---|
1332 | <TD><CODE>is_signaling_nan </CODE></TD>
|
---|
1333 | <TD><CODE>isSignalingNaN</CODE></TD>
|
---|
1334 | <TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
|
---|
1335 | </TR>
|
---|
1336 | </TABLE>
|
---|
1337 | </BLOCKQUOTE>
|
---|
1338 | </P>
|
---|
1339 |
|
---|
1340 | <H3>9.2. Changes to Function Arguments</H3>
|
---|
1341 |
|
---|
1342 | <P>
|
---|
1343 | Besides simple name changes, some operations were given a different interface
|
---|
1344 | in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
|
---|
1345 | <UL>
|
---|
1346 |
|
---|
1347 | <LI>
|
---|
1348 | <P>
|
---|
1349 | Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
|
---|
1350 | standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
|
---|
1351 | <CODE>uint32_t</CODE>, whereas previously their types could be defined
|
---|
1352 | differently for each port of SoftFloat, usually using traditional C types such
|
---|
1353 | as <CODE>unsigned</CODE> <CODE>int</CODE>.
|
---|
1354 | Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
|
---|
1355 | standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
|
---|
1356 | previously these were again passed as a port-specific type (usually
|
---|
1357 | <CODE>int</CODE>).
|
---|
1358 | </P>
|
---|
1359 |
|
---|
1360 | <LI>
|
---|
1361 | <P>
|
---|
1362 | As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
|
---|
1363 | Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
|
---|
1364 | later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
|
---|
1365 | values through pointers, meaning that functions take pointer arguments and then
|
---|
1366 | read or write floating-point values at the locations indicated by the pointers.
|
---|
1367 | In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
|
---|
1368 | passed by value, regardless of their size.
|
---|
1369 | </P>
|
---|
1370 |
|
---|
1371 | <LI>
|
---|
1372 | <P>
|
---|
1373 | Functions that round to an integer have additional
|
---|
1374 | <CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
|
---|
1375 | they did not have in <NOBR>Release 2</NOBR>.
|
---|
1376 | Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
|
---|
1377 | as they stand in <NOBR>Release 3</NOBR> and later.
|
---|
1378 | For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
|
---|
1379 | same global variable that affects the basic arithmetic operations (now called
|
---|
1380 | <CODE>softfloat_roundingMode</CODE> but previously known as
|
---|
1381 | <CODE>float_rounding_mode</CODE>).
|
---|
1382 | Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
|
---|
1383 | an exact integer value, and if the <I>invalid</I> exception was not raised by
|
---|
1384 | the function, the <I>inexact</I> exception was always raised.
|
---|
1385 | <NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
|
---|
1386 | case.
|
---|
1387 | Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
|
---|
1388 | effect as <NOBR>Release 2</NOBR> by passing variable
|
---|
1389 | <CODE>softfloat_roundingMode</CODE> for argument
|
---|
1390 | <CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
|
---|
1391 | <CODE><I>exact</I></CODE>.
|
---|
1392 | </P>
|
---|
1393 |
|
---|
1394 | </UL>
|
---|
1395 | </P>
|
---|
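<P>
The difference between passing by value and passing through pointers can be
sketched on a stand-in type.
This is only an illustration under stated assumptions, not SoftFloat code: the
type <CODE>demo128_t</CODE>, the functions <CODE>demo128_neg</CODE> and
<CODE>demo128M_neg</CODE>, and the choice of which word holds the sign bit are
all hypothetical.
SoftFloat's own pointer-based functions are the ones described in
<NOBR>section 4.5</NOBR>.
</P>

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit format: two 64-bit words.
   This sketch assumes word 1 carries the sign in its top bit. */
typedef struct { uint64_t v[2]; } demo128_t;

/* By-value style, as Release 2 used for every format:
   the whole 128-bit object is copied in and copied out. */
static demo128_t demo128_neg(demo128_t a)
{
    a.v[1] ^= UINT64_C(1) << 63;   /* flip the sign bit */
    return a;
}

/* By-pointer style: the function reads the operand at *aPtr
   and writes the result at *zPtr.  aPtr and zPtr may be equal. */
static void demo128M_neg(const demo128_t *aPtr, demo128_t *zPtr)
{
    zPtr->v[0] = aPtr->v[0];
    zPtr->v[1] = aPtr->v[1] ^ (UINT64_C(1) << 63);
}
```

<P>
A caller of the pointer form supplies storage for the result, for example
<CODE>demo128M_neg(&amp;a, &amp;z);</CODE>, instead of assigning a returned
value.
</P>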
1396 |
|
---|
1397 | <H3>9.3. Added Capabilities</H3>
|
---|
1398 |
|
---|
1399 | <P>
|
---|
1400 | <NOBR>Release 3</NOBR> adds several features that were not present in
|
---|
1401 | <NOBR>Release 2</NOBR>:
|
---|
1402 | <UL>
|
---|
1403 |
|
---|
1404 | <LI>
|
---|
1405 | <P>
|
---|
1406 | A port of SoftFloat can now define any of the floating-point types
|
---|
1407 | <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
|
---|
1408 | <CODE>float128_t</CODE> as aliases for C’s standard floating-point types
|
---|
1409 | <CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
|
---|
1410 | <CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
|
---|
1411 | This potential convenience was not supported under <NOBR>Release 2</NOBR>.
|
---|
1412 | </P>
|
---|
1413 |
|
---|
1414 | <P>
|
---|
1415 | (Note, however, that there may be a performance cost to defining
|
---|
1416 | SoftFloat’s floating-point types this way, depending on the platform and
|
---|
1417 | the applications using SoftFloat.
|
---|
1418 | Ports of SoftFloat may choose to forgo the convenience in favor of better
|
---|
1419 | speed.)
|
---|
1420 | </P>
|
---|
1421 |
|
---|
1422 | <LI>
|
---|
1423 | <P>
|
---|
1424 | As of <NOBR>Release 3b</NOBR>, <NOBR>16-bit</NOBR> half-precision,
|
---|
1425 | <CODE>float16_t</CODE>, is supported.
|
---|
1426 | </P>
|
---|
1427 |
|
---|
1428 | <LI>
|
---|
1429 | <P>
|
---|
1430 | Functions have been added for converting between the floating-point types and
|
---|
1431 | unsigned integers.
|
---|
1432 | <NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
|
---|
1433 | </P>
|
---|
1434 |
|
---|
1435 | <LI>
|
---|
1436 | <P>
|
---|
1437 | Fused multiply-add functions have been added for all floating-point formats
|
---|
1438 | except <NOBR>80-bit</NOBR> double-extended-precision,
|
---|
1439 | <CODE>extFloat80_t</CODE>.
|
---|
1440 | </P>
|
---|
1441 |
|
---|
1442 | <LI>
|
---|
1443 | <P>
|
---|
1444 | New rounding modes are supported:
|
---|
1445 | <CODE>softfloat_round_near_maxMag</CODE> (round to nearest, with ties to
|
---|
1446 | maximum magnitude, away from zero), and, as of <NOBR>Release 3c</NOBR>,
|
---|
1447 | optional <CODE>softfloat_round_odd</CODE> (round to odd, also known as
|
---|
1448 | jamming).
|
---|
1449 | </P>
|
---|
1450 |
|
---|
1451 | </UL>
|
---|
1452 | </P>
|
---|
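<P>
The two added rounding modes can be illustrated on a small scale.
The following is a minimal sketch, not SoftFloat code: it rounds a magnitude
with two fraction bits (the value <CODE>q</CODE>/4) to an integer under
round-to-nearest-even, round-to-nearest with ties to maximum magnitude, and
round-to-odd.
The names <CODE>demo_round</CODE>, <CODE>DEMO_NEAR_EVEN</CODE>,
<CODE>DEMO_NEAR_MAXMAG</CODE>, <CODE>DEMO_ODD</CODE>, and
<CODE>demo_round_quarters</CODE> are hypothetical and exist only for this
example.
</P>

```c
#include <stdint.h>

/* Hypothetical mode names for this sketch only.  They mirror the ideas behind
   softfloat_round_near_even, softfloat_round_near_maxMag, and
   softfloat_round_odd, but are not SoftFloat's constants. */
enum demo_round { DEMO_NEAR_EVEN, DEMO_NEAR_MAXMAG, DEMO_ODD };

/* Round the magnitude q/4 (two fraction bits) to an integer. */
static uint32_t demo_round_quarters(uint32_t q, enum demo_round mode)
{
    uint32_t whole = q >> 2;   /* truncated integer part */
    uint32_t frac  = q & 3;    /* discarded bits: 0, 1/4, 1/2, or 3/4 */

    if (frac == 0) return whole;   /* exact: every mode agrees */

    switch (mode) {
    case DEMO_NEAR_EVEN:
        /* Above half rounds up; an exact tie goes to the even neighbor. */
        if (frac > 2 || (frac == 2 && (whole & 1))) whole++;
        break;
    case DEMO_NEAR_MAXMAG:
        /* Above half rounds up; a tie goes to the larger magnitude. */
        if (frac >= 2) whole++;
        break;
    case DEMO_ODD:
        /* Truncate, then force the low bit to 1 because the result is
           inexact ("jamming"). */
        whole |= 1;
        break;
    }
    return whole;
}
```

<P>
For the tie 2.5 (<CODE>q</CODE> = 10), the three modes give 2, 3, and 3
respectively; for the tie 3.5 (<CODE>q</CODE> = 14), they give 4, 4, and 3.
</P>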
1453 |
|
---|
1454 | <H3>9.4. Better Compatibility with the C Language</H3>
|
---|
1455 |
|
---|
1456 | <P>
|
---|
1457 | <NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
|
---|
1458 | Standard’s rules for portability.
|
---|
1459 | For example, older releases of SoftFloat employed type conversions in ways
|
---|
1460 | that, while commonly practiced, are not fully defined by the C Standard.
|
---|
1461 | Such problematic type conversions have generally been replaced by the use of
|
---|
1462 | unions, whose behavior is now more strictly defined by the C Standard.
|
---|
1463 | </P>
|
---|
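<P>
As an illustration of the kind of change described above, the sketch below
reinterprets the bits of a C <CODE>float</CODE> through a union instead of
through a pointer cast.
It is not taken from the SoftFloat sources: the names
<CODE>demo_f32_bits</CODE>, <CODE>demo_float_to_bits</CODE>, and
<CODE>demo_bits_to_float</CODE> are hypothetical, and the example assumes that
<CODE>float</CODE> is a <NOBR>32-bit</NOBR> IEEE binary32 value on the target
platform.
</P>

```c
#include <stdint.h>

/* Commonly practiced but not fully defined by the C Standard:
       uint32_t bits = *(uint32_t *) &f;
   because it reads a float object through an incompatible lvalue type. */

/* Assumption of this sketch: float occupies exactly 32 bits. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical union used only for this example. */
union demo_f32_bits {
    float    f;
    uint32_t ui;
};

/* Return the bit pattern of f by writing one member and reading the other. */
static uint32_t demo_float_to_bits(float f)
{
    union demo_f32_bits u;
    u.f = f;
    return u.ui;
}

/* Inverse direction: build a float from a 32-bit pattern. */
static float demo_bits_to_float(uint32_t ui)
{
    union demo_f32_bits u;
    u.ui = ui;
    return u.f;
}
```

<P>
Under the binary32 assumption, <CODE>demo_float_to_bits(1.0f)</CODE> yields
<CODE>0x3F800000</CODE>.
</P>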
1464 |
|
---|
1465 | <H3>9.5. New Organization as a Library</H3>
|
---|
1466 |
|
---|
1467 | <P>
|
---|
1468 | Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
|
---|
1469 | Previously, SoftFloat compiled into a single, monolithic object file containing
|
---|
1470 | all the SoftFloat functions, with the consequence that a program linking with
|
---|
1471 | SoftFloat would get every SoftFloat function in its binary file even if only a
|
---|
1472 | few functions were actually used.
|
---|
1473 | With SoftFloat in the form of a library, a program that is linked by a standard
|
---|
1474 | linker will include only those functions of SoftFloat that it needs and no
|
---|
1475 | others.
|
---|
1476 | </P>
|
---|
1477 |
|
---|
1478 | <H3>9.6. Optimization Gains (and Losses)</H3>
|
---|
1479 |
|
---|
1480 | <P>
|
---|
1481 | Individual SoftFloat functions have been improved in various ways in
|
---|
1482 | <NOBR>Release 3</NOBR> compared to earlier releases.
|
---|
1483 | In particular, better, faster algorithms have been deployed for the operations
|
---|
1484 | of division, square root, and remainder.
|
---|
1485 | For functions operating on the larger <NOBR>80-bit</NOBR> and
|
---|
1486 | <NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
|
---|
1487 | <CODE>float128_t</CODE>, code size has also generally been reduced.
|
---|
1488 | </P>
|
---|
1489 |
|
---|
1490 | <P>
|
---|
1491 | However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
|
---|
1492 | single object file, compilers could make optimizations across function calls
|
---|
1493 | when one SoftFloat function calls another.
|
---|
1494 | Now that the functions of SoftFloat are compiled separately and only afterward
|
---|
1495 | linked together into a program, there is not usually the same opportunity to
|
---|
1496 | optimize across function calls.
|
---|
1497 | Some loss of speed has been observed due to this change.
|
---|
1498 | </P>
|
---|
1499 |
|
---|
1500 |
|
---|
1501 | <H2>10. Future Directions</H2>
|
---|
1502 |
|
---|
1503 | <P>
|
---|
1504 | The following improvements are anticipated for future releases of SoftFloat:
|
---|
1505 | <UL>
|
---|
1506 | <LI>
|
---|
1507 | more functions from the 2008 version of the IEEE Floating-Point Standard;
|
---|
1508 | <LI>
|
---|
1509 | consistent, defined behavior for non-canonical representations of extended
|
---|
1510 | format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
|
---|
1511 | <I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).
|
---|
1512 |
|
---|
1513 | </UL>
|
---|
1514 | </P>
|
---|
1515 |
|
---|
1516 |
|
---|
1517 | <H2>11. Contact Information</H2>
|
---|
1518 |
|
---|
1519 | <P>
|
---|
1520 | At the time of this writing, the most up-to-date information about SoftFloat
|
---|
1521 | and the latest release can be found at the Web page
|
---|
1522 | <A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><NOBR><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></NOBR></A>.
|
---|
1523 | </P>
|
---|
1524 |
|
---|
1525 |
|
---|
1526 | </BODY>
|
---|
1527 |
|
---|