?? l_setox.s
字號(hào):
/* l_setox.s - Motorola 68040 FP exponential routines (LIB) *//* Copyright 1991-1993 Wind River Systems, Inc. */ .data .globl _copyright_wind_river .long _copyright_wind_river/*modification history--------------------01f,12nov94,dvs fixed clearcase conversion search/replace errors.01e,21jul93,kdl added .text (SPR #2372).01d,23aug92,jcf changed bxxx to jxx.01c,26may92,rrr the tree shuffle01b,09jan92,kdl added modification history; general cleanup.01a,15aug91,kdl original version, from Motorola FPSP v2.0.*//*DESCRIPTION setoxsa 3.1 12/10/90 The entry point __l_setox computes the exponential of a value. __l_setoxd does the same except the input value is a denormalized number. __l_setoxm1 computes exp(X)-1, and __l_setoxm1d computes exp(X)-1 for denormalized X. INPUT ----- Double-extended value in memory location pointed to by address register a0. OUTPUT ------ exp(X) or exp(X)-1 returned in floating-point register fp0. ACCURACY and MONOTONICITY ------------------------- The returned result is within 0.85 ulps in 64 significant bit, i.e. within 0.5001 ulp to 53 bits if the result is subsequently rounded to double precision. The result is provably monotonic in double precision. SPEED ----- Two timings are measured, both in the copy-back mode. The first one is measured when the function is invoked the first time (so the instructions and data are not in cache), and the second one is measured when the function is reinvoked at the same input argument. The program __l_setox takes approximately 210/190 cycles for input argument X whose magnitude is less than 16380 log2, which is the usual situation. For the less common arguments, depending on their values, the program may run faster or slower -- but no worse than 10 slower even in the extreme cases. The program __l_setoxm1 takes approximately ???/??? cycles for input argument X, 0.25 <= |X| < 70log2. For |X| < 0.25, it takes approximately ???/??? cycles. For the less common arguments, depending on their values, the program may run faster or slower -- but no worse than 10 slower even in the extreme cases. ALGORITHM and IMPLEMENTATION NOTES ---------------------------------- __l_setoxd ------ Step 1. Set ans := 1.0 Step 2. Return ans := ans + sign(X)*2^(-126). Exit. Notes: This will always generate one exception -- inexact. __l_setox ----- Step 1. Filter out extreme cases of input argument. 1.1 If |X| >= 2^(-65), go to Step 1.3. 1.2 Go to Step 7. 1.3 If |X| < 16380 log(2), go to Step 2. 1.4 Go to Step 8. Notes: The usual case should take the branches 1.1 -> 1.3 -> 2. To avoid the use of floating-point comparisons, a compact representation of |X| is used. This format is a 32-bit integer, the upper (more significant) 16 bits are the sign and biased exponent field of |X|| the lower 16 bits are the 16 most significant fraction (including the explicit bit) bits of |X|. Consequently, the comparisons in Steps 1.1 and 1.3 can be performed by integer comparison. Note also that the constant 16380 log(2) used in Step 1.3 is also in the compact form. Thus taking the branch to Step 2 guarantees |X| < 16380 log(2). There is no harm to have a small number of cases where |X| is less than, but close to, 16380 log(2) and the branch to Step 9 is taken. Step 2. Calculate N = round-to-nearest-int( X * 64/log2 ). 2.1 Set AdjFlag := 0 (indicates the branch 1.3 -> 2 was taken) 2.2 N := round-to-nearest-integer( X * 64/log2 ). 2.3 Calculate J = N mod 64| so J = 0,1,2,..., or 63. 2.4 Calculate M = (N - J)/64| so N = 64M + J. 2.5 Calculate the address of the stored value of 2^(J/64). 2.6 Create the value Scale = 2^M. Notes: The calculation in 2.2 is really performed by Z := X * constant N := round-to-nearest-integer(Z) where constant := single-precision( 64/log 2 ). Using a single-precision constant avoids memory access. Another effect of using a single-precision "constant" is that the calculated value Z is Z = X*(64/log2)*(1+eps), |eps| <= 2^(-24). This error has to be considered later in Steps 3 and 4. Step 3. Calculate X - N*log2/64. 3.1 R := X + N*L1, where L1 := single-precision(-log2/64). 3.2 R := R + N*L2, L2 := extended-precision(-log2/64 - L1). Notes: a) The way L1 and L2 are chosen ensures L1+L2 approximate the value -log2/64 to 88 bits of accuracy. b) N*L1 is exact because N is no longer than 22 bits and L1 is no longer than 24 bits. c) The calculation X+N*L1 is also exact due to cancellation. Thus, R is practically X+N(L1+L2) to full 64 bits. d) It is important to estimate how large can |R| be after Step 3.2. N = rnd-to-int( X*64/log2 (1+eps) ), |eps|<=2^(-24) X*64/log2 (1+eps) = N + f, |f| <= 0.5 X*64/log2 - N = f - eps*X 64/log2 X - N*log2/64 = f*log2/64 - eps*X Now |X| <= 16446 log2, thus |X - N*log2/64| <= (0.5 + 16446/2^(18))*log2/64 <= 0.57 log2/64. This bound will be used in Step 4. Step 4. Approximate exp(R)-1 by a polynomial p = R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*A5)))) Notes: a) In order to reduce memory access, the coefficients are made as "short" as possible: A1 (which is 1/2), A4 and A5 are single precision| A2 and A3 are double precision. b) Even with the restrictions above, |p - (exp(R)-1)| < 2^(-68.8) for all |R| <= 0.0062. Note that 0.0062 is slightly bigger than 0.57 log2/64. c) To fully utilize the pipeline, p is separated into two independent pieces of roughly equal complexities p = [ R + R*S*(A2 + S*A4) ] + [ S*(A1 + S*(A3 + S*A5)) ] where S = R*R. Step 5. Compute 2^(J/64)*exp(R) = 2^(J/64)*(1+p) by ans := T + ( T*p + t) where T and t are the stored values for 2^(J/64). Notes: 2^(J/64) is stored as T and t where T+t approximates 2^(J/64) to roughly 85 bits| T is in extended precision and t is in single precision. Note also that T is rounded to 62 bits so that the last two bits of T are zero. The reason for such a special form is that T-1, T-2, and T-8 will all be exact --- a property that will give much more accurate computation of the function EXPM1. Step 6. Reconstruction of exp(X) exp(X) = 2^M * 2^(J/64) * exp(R). 6.1 If AdjFlag = 0, go to 6.3 6.2 ans := ans * AdjScale 6.3 Restore the user fpcr 6.4 Return ans := ans * Scale. Exit. Notes: If AdjFlag = 0, we have X = Mlog2 + Jlog2/64 + R, |M| <= 16380, and Scale = 2^M. Moreover, exp(X) will neither overflow nor underflow. If AdjFlag = 1, that means that X = (M1+M)log2 + Jlog2/64 + R, |M1+M| >= 16380. Hence, exp(X) may overflow or underflow or neither. When that is the case, AdjScale = 2^(M1) where M1 is approximately M. Thus 6.2 will never cause over/underflow. Possible exception in 6.4 is overflow or underflow. The inexact exception is not generated in 6.4. Although one can argue that the inexact flag should always be raised, to simulate that exception cost to much than the flag is worth in practical uses. Step 7. Return 1 + X. 7.1 ans := X 7.2 Restore user fpcr. 7.3 Return ans := 1 + ans. Exit Notes: For non-zero X, the inexact exception will always be raised by 7.3. That is the only exception raised by 7.3. Note also that we use the FMOVEM instruction to move X in Step 7.1 to avoid unnecessary trapping. (Although the FMOVEM may not seem relevant since X is normalized, the precaution will be useful in the library version of this code where the separate entry for denormalized inputs will be done away with.) Step 8. Handle exp(X) where |X| >= 16380log2. 8.1 If |X| > 16480 log2, go to Step 9. (mimic 2.2 - 2.6) 8.2 N := round-to-integer( X * 64/log2 ) 8.3 Calculate J = N mod 64, J = 0,1,...,63 8.4 K := (N-J)/64, M1 := truncate(K/2), M = K-M1, AdjFlag := 1. 8.5 Calculate the address of the stored value 2^(J/64). 8.6 Create the values Scale = 2^M, AdjScale = 2^M1. 8.7 Go to Step 3. Notes: Refer to notes for 2.2 - 2.6. Step 9. Handle exp(X), |X| > 16480 log2. 9.1 If X < 0, go to 9.3 9.2 ans := Huge, go to 9.4 9.3 ans := Tiny. 9.4 Restore user fpcr. 9.5 Return ans := ans * ans. Exit. Notes: Exp(X) will surely overflow or underflow, depending on X's sign. "Huge" and "Tiny" are respectively large/tiny extended-precision numbers whose square over/underflow with an inexact result. Thus, 9.5 always raises the inexact together with either overflow or underflow. __l_setoxm1d -------- Step 1. Set ans := 0 Step 2. Return ans := X + ans. Exit. Notes: This will return X with the appropriate rounding precision prescribed by the user fpcr. __l_setoxm1 ------- Step 1. Check |X| 1.1 If |X| >= 1/4, go to Step 1.3. 1.2 Go to Step 7. 1.3 If |X| < 70 log(2), go to Step 2. 1.4 Go to Step 10. Notes: The usual case should take the branches 1.1 -> 1.3 -> 2. However, it is conceivable |X| can be small very often because EXPM1 is intended to evaluate exp(X)-1 accurately when |X| is small. For further details on the comparisons, see the notes on Step 1 of __l_setox. Step 2. Calculate N = round-to-nearest-int( X * 64/log2 ). 2.1 N := round-to-nearest-integer( X * 64/log2 ). 2.2 Calculate J = N mod 64| so J = 0,1,2,..., or 63. 2.3 Calculate M = (N - J)/64| so N = 64M + J. 2.4 Calculate the address of the stored value of 2^(J/64). 2.5 Create the values Sc = 2^M and OnebySc := -2^(-M). Notes: See the notes on Step 2 of __l_setox. Step 3. Calculate X - N*log2/64. 3.1 R := X + N*L1, where L1 := single-precision(-log2/64). 3.2 R := R + N*L2, L2 := extended-precision(-log2/64 - L1). Notes: Applying the analysis of Step 3 of __l_setox in this case shows that |R| <= 0.0055 (note that |X| <= 70 log2 in this case). Step 4. Approximate exp(R)-1 by a polynomial p = R+R*R*(A1+R*(A2+R*(A3+R*(A4+R*(A5+R*A6))))) Notes: a) In order to reduce memory access, the coefficients are made as "short" as possible: A1 (which is 1/2), A5 and A6 are single precision| A2, A3 and A4 are double precision. b) Even with the restriction above, |p - (exp(R)-1)| < |R| * 2^(-72.7) for all |R| <= 0.0055. c) To fully utilize the pipeline, p is separated into two independent pieces of roughly equal complexity p = [ R*S*(A2 + S*(A4 + S*A6)) ] + [ R + S*(A1 + S*(A3 + S*A5)) ] where S = R*R. Step 5. Compute 2^(J/64)*p by p := T*p where T and t are the stored values for 2^(J/64). Notes: 2^(J/64) is stored as T and t where T+t approximates 2^(J/64) to roughly 85 bits| T is in extended precision and t is in single precision. Note also that T is rounded to 62 bits so that the last two bits of T are zero. The reason for such a special form is that T-1, T-2, and T-8 will all be exact --- a property that will be exploited in Step 6 below. The total relative error in p is no bigger than 2^(-67.7) compared to the final result. Step 6. Reconstruction of exp(X)-1 exp(X)-1 = 2^M * ( 2^(J/64) + p - 2^(-M) ). 6.1 If M <= 63, go to Step 6.3. 6.2 ans := T + (p + (t + OnebySc)). Go to 6.6 6.3 If M >= -3, go to 6.5. 6.4 ans := (T + (p + t)) + OnebySc. Go to 6.6 6.5 ans := (T + OnebySc) + (p + t). 6.6 Restore user fpcr. 6.7 Return ans := Sc * ans. Exit. Notes: The various arrangements of the expressions give accurate evaluations. Step 7. exp(X)-1 for |X| < 1/4. 7.1 If |X| >= 2^(-65), go to Step 9. 7.2 Go to Step 8. Step 8. Calculate exp(X)-1, |X| < 2^(-65). 8.1 If |X| < 2^(-16312), goto 8.3 8.2 Restore fpcr| return ans := X - 2^(-16382). Exit. 8.3 X := X * 2^(140). 8.4 Restore fpcr| ans := ans - 2^(-16382). Return ans := ans*2^(140). Exit Notes: The idea is to return "X - tiny" under the user precision and rounding modes. To avoid unnecessary inefficiency, we stay away from denormalized numbers the best we can. For |X| >= 2^(-16312), the straightforward 8.2 generates the inexact exception as the case warrants. Step 9. Calculate exp(X)-1, |X| < 1/4, by a polynomial p = X + X*X*(B1 + X*(B2 + |... + X*B12)) Notes: a) In order to reduce memory access, the coefficients are made as "short" as possible: B1 (which is 1/2), B9 to B12 are single precision| B3 to B8 are double precision| and B2 is double extended. b) Even with the restriction above, |p - (exp(X)-1)| < |X| 2^(-70.6) for all |X| <= 0.251. Note that 0.251 is slightly bigger than 1/4. c) To fully preserve accuracy, the polynomial is computed as X + ( S*B1 + Q ) where S = X*X and Q = X*S*(B2 + X*(B3 + |... + X*B12)) d) To fully utilize the pipeline, Q is separated into two independent pieces of roughly equal complexity Q = [ X*S*(B2 + S*(B4 + |... + S*B12)) ] + [ S*S*(B3 + S*(B5 + |... + S*B11)) ] Step 10. Calculate exp(X)-1 for |X| >= 70 log 2. 10.1 If X >= 70log2 , exp(X) - 1 = exp(X) for all practical purposes. Therefore, go to Step 1 of __l_setox. 10.2 If X <= -70log2, exp(X) - 1 = -1 for all practical purposes. ans := -1 Restore user fpcr Return ans := ans + 2^(-126). Exit. Notes: 10.2 will always create an inexact and return -1 + tiny in the user rounding precision and mode. Copyright (C) Motorola, Inc. 1990 All Rights Reserved THIS IS UNPUBLISHED PROPRIETARY SOURCE CODE OF MOTOROLA The copyright notice above does not evidence any actual or intended publication of such source code.__l_setox IDNT 2,1 Motorola 040 Floating Point Software Package section 8NOMANUAL*/#include "fpsp040L.h"L2: .long 0x3FDC0000,0x82E30865,0x4361C4C6,0x00000000EXPA3: .long 0x3FA55555,0x55554431EXPA2: .long 0x3FC55555,0x55554018HUGE: .long 0x7FFE0000,0xFFFFFFFF,0xFFFFFFFF,0x00000000TINY: .long 0x00010000,0xFFFFFFFF,0xFFFFFFFF,0x00000000EM1A4: .long 0x3F811111,0x11174385EM1A3: .long 0x3FA55555,0x55554F5AEM1A2: .long 0x3FC55555,0x55555555,0x00000000,0x00000000EM1B8: .long 0x3EC71DE3,0xA5774682EM1B7: .long 0x3EFA01A0,0x19D7CB68EM1B6: .long 0x3F2A01A0,0x1A019DF3EM1B5: .long 0x3F56C16C,0x16C170E2EM1B4: .long 0x3F811111,0x11111111EM1B3: .long 0x3FA55555,0x55555555EM1B2: .long 0x3FFC0000,0xAAAAAAAA,0xAAAAAAAB .long 0x00000000TWO140: .long 0x48B00000,0x00000000TWON140: .long 0x37300000,0x00000000EXPTBL: .long 0x3FFF0000,0x80000000,0x00000000,0x00000000 .long 0x3FFF0000,0x8164D1F3,0xBC030774,0x9F841A9B .long 0x3FFF0000,0x82CD8698,0xAC2BA1D8,0x9FC1D5B9 .long 0x3FFF0000,0x843A28C3,0xACDE4048,0xA0728369 .long 0x3FFF0000,0x85AAC367,0xCC487B14,0x1FC5C95C .long 0x3FFF0000,0x871F6196,0x9E8D1010,0x1EE85C9F .long 0x3FFF0000,0x88980E80,0x92DA8528,0x9FA20729 .long 0x3FFF0000,0x8A14D575,0x496EFD9C,0xA07BF9AF .long 0x3FFF0000,0x8B95C1E3,0xEA8BD6E8,0xA0020DCF .long 0x3FFF0000,0x8D1ADF5B,0x7E5BA9E4,0x205A63DA .long 0x3FFF0000,0x8EA4398B,0x45CD53C0,0x1EB70051 .long 0x3FFF0000,0x9031DC43,0x1466B1DC,0x1F6EB029 .long 0x3FFF0000,0x91C3D373,0xAB11C338,0xA0781494 .long 0x3FFF0000,0x935A2B2F,0x13E6E92C,0x9EB319B0 .long 0x3FFF0000,0x94F4EFA8,0xFEF70960,0x2017457D .long 0x3FFF0000,0x96942D37,0x20185A00,0x1F11D537 .long 0x3FFF0000,0x9837F051,0x8DB8A970,0x9FB952DD .long 0x3FFF0000,0x99E04593,0x20B7FA64,0x1FE43087 .long 0x3FFF0000,0x9B8D39B9,0xD54E5538,0x1FA2A818 .long 0x3FFF0000,0x9D3ED9A7,0x2CFFB750,0x1FDE494D .long 0x3FFF0000,0x9EF53260,0x91A111AC,0x20504890 .long 0x3FFF0000,0xA0B0510F,0xB9714FC4,0xA073691C .long 0x3FFF0000,0xA2704303,0x0C496818,0x1F9B7A05 .long 0x3FFF0000,0xA43515AE,0x09E680A0,0xA0797126 .long 0x3FFF0000,0xA5FED6A9,0xB15138EC,0xA071A140 .long 0x3FFF0000,0xA7CD93B4,0xE9653568,0x204F62DA .long 0x3FFF0000,0xA9A15AB4,0xEA7C0EF8,0x1F283C4A .long 0x3FFF0000,0xAB7A39B5,0xA93ED338,0x9F9A7FDC .long 0x3FFF0000,0xAD583EEA,0x42A14AC8,0xA05B3FAC .long 0x3FFF0000,0xAF3B78AD,0x690A4374,0x1FDF2610 .long 0x3FFF0000,0xB123F581,0xD2AC2590,0x9F705F90 .long 0x3FFF0000,0xB311C412,0xA9112488,0x201F678A .long 0x3FFF0000,0xB504F333,0xF9DE6484,0x1F32FB13 .long 0x3FFF0000,0xB6FD91E3,0x28D17790,0x20038B30 .long 0x3FFF0000,0xB8FBAF47,0x62FB9EE8,0x200DC3CC .long 0x3FFF0000,0xBAFF5AB2,0x133E45FC,0x9F8B2AE6 .long 0x3FFF0000,0xBD08A39F,0x580C36C0,0xA02BBF70 .long 0x3FFF0000,0xBF1799B6,0x7A731084,0xA00BF518 .long 0x3FFF0000,0xC12C4CCA,0x66709458,0xA041DD41 .long 0x3FFF0000,0xC346CCDA,0x24976408,0x9FDF137B .long 0x3FFF0000,0xC5672A11,0x5506DADC,0x201F1568 .long 0x3FFF0000,0xC78D74C8,0xABB9B15C,0x1FC13A2E .long 0x3FFF0000,0xC9B9BD86,0x6E2F27A4,0xA03F8F03 .long 0x3FFF0000,0xCBEC14FE,0xF2727C5C,0x1FF4907D .long 0x3FFF0000,0xCE248C15,0x1F8480E4,0x9E6E53E4 .long 0x3FFF0000,0xD06333DA,0xEF2B2594,0x1FD6D45C .long 0x3FFF0000,0xD2A81D91,0xF12AE45C,0xA076EDB9 .long 0x3FFF0000,0xD4F35AAB,0xCFEDFA20,0x9FA6DE21 .long 0x3FFF0000,0xD744FCCA,0xD69D6AF4,0x1EE69A2F .long 0x3FFF0000,0xD99D15C2,0x78AFD7B4,0x207F439F .long 0x3FFF0000,0xDBFBB797,0xDAF23754,0x201EC207
?? 快捷鍵說(shuō)明
復(fù)制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號(hào)
Ctrl + =
減小字號(hào)
Ctrl + -