?? s_cosl.s
字號(hào):
.file "sincosl.s"// Copyright (c) 2000 - 2004, Intel Corporation// All rights reserved.//// Contributed 2000 by the Intel Numerics Group, Intel Corporation//// Redistribution and use in source and binary forms, with or without// modification, are permitted provided that the following conditions are// met://// * Redistributions of source code must retain the above copyright// notice, this list of conditions and the following disclaimer.//// * Redistributions in binary form must reproduce the above copyright// notice, this list of conditions and the following disclaimer in the// documentation and/or other materials provided with the distribution.//// * The name of Intel Corporation may not be used to endorse or promote// products derived from this software without specific prior written// permission.// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.//// Intel Corporation is the author of this code, and requests that all// problem reports or change requests be submitted to it directly at// http://www.intel.com/software/products/opensource/libraries/num.htm.////*********************************************************************//// History:// 02/02/00 (hand-optimized)// 04/04/00 Unwind support added// 07/30/01 Improved speed on all paths// 08/20/01 Fixed bundling typo// 05/13/02 Changed interface to __libm_pi_by_2_reduce// 02/10/03 Reordered header: .section, .global, .proc, .align;// used data8 for long double table values// 10/13/03 Corrected final .endp name to match .proc// 10/26/04 Avoided using r14-31 as scratch so not clobbered by dynamic loader////*********************************************************************//// Function: Combined sinl(x) and cosl(x), where//// sinl(x) = sine(x), for double-extended precision x values// cosl(x) = cosine(x), for double-extended precision x values////*********************************************************************//// Resources Used://// Floating-Point Registers: f8 (Input and Return Value)// f32-f99//// General Purpose Registers:// r32-r58//// Predicate Registers: p6-p13////*********************************************************************//// IEEE Special Conditions://// Denormal fault raised on denormal inputs// Overflow exceptions do not occur// Underflow exceptions raised when appropriate for sin// (No specialized error handling for this routine)// Inexact raised when appropriate by algorithm//// sinl(SNaN) = QNaN// sinl(QNaN) = QNaN// sinl(inf) = QNaN// sinl(+/-0) = +/-0// cosl(inf) = QNaN// cosl(SNaN) = QNaN// cosl(QNaN) = QNaN// cosl(0) = 1////*********************************************************************//// Mathematical Description// ========================//// The computation of FSIN and FCOS is best handled in one piece of// code. The main reason is that given any argument Arg, computation// of trigonometric functions first calculate N and an approximation// to alpha where//// Arg = N pi/2 + alpha, |alpha| <= pi/4.//// Since//// cosl( Arg ) = sinl( (N+1) pi/2 + alpha ),//// therefore, the code for computing sine will produce cosine as long// as 1 is added to N immediately after the argument reduction// process.//// Let M = N if sine// N+1 if cosine.//// Now, given//// Arg = M pi/2 + alpha, |alpha| <= pi/4,//// let I = M mod 4, or I be the two lsb of M when M is represented// as 2's complement. I = [i_0 i_1]. Then//// sinl( Arg ) = (-1)^i_0 sinl( alpha ) if i_1 = 0,// = (-1)^i_0 cosl( alpha ) if i_1 = 1.//// For example:// if M = -1, I = 11// sin ((-pi/2 + alpha) = (-1) cos (alpha)// if M = 0, I = 00// sin (alpha) = sin (alpha)// if M = 1, I = 01// sin (pi/2 + alpha) = cos (alpha)// if M = 2, I = 10// sin (pi + alpha) = (-1) sin (alpha)// if M = 3, I = 11// sin ((3/2)pi + alpha) = (-1) cos (alpha)//// The value of alpha is obtained by argument reduction and// represented by two working precision numbers r and c where//// alpha = r + c accurately.//// The reduction method is described in a previous write up.// The argument reduction scheme identifies 4 cases. For Cases 2// and 4, because |alpha| is small, sinl(r+c) and cosl(r+c) can be// computed very easily by 2 or 3 terms of the Taylor series// expansion as follows://// Case 2:// -------//// sinl(r + c) = r + c - r^3/6 accurately// cosl(r + c) = 1 - 2^(-67) accurately//// Case 4:// -------//// sinl(r + c) = r + c - r^3/6 + r^5/120 accurately// cosl(r + c) = 1 - r^2/2 + r^4/24 accurately//// The only cases left are Cases 1 and 3 of the argument reduction// procedure. These two cases will be merged since after the// argument is reduced in either cases, we have the reduced argument// represented as r + c and that the magnitude |r + c| is not small// enough to allow the usage of a very short approximation.//// The required calculation is either//// sinl(r + c) = sinl(r) + correction, or// cosl(r + c) = cosl(r) + correction.//// Specifically,//// sinl(r + c) = sinl(r) + c sin'(r) + O(c^2)// = sinl(r) + c cos (r) + O(c^2)// = sinl(r) + c(1 - r^2/2) accurately.// Similarly,//// cosl(r + c) = cosl(r) - c sinl(r) + O(c^2)// = cosl(r) - c(r - r^3/6) accurately.//// We therefore concentrate on accurately calculating sinl(r) and// cosl(r) for a working-precision number r, |r| <= pi/4 to within// 0.1% or so.//// The greatest challenge of this task is that the second terms of// the Taylor series//// r - r^3/3! + r^r/5! - ...//// and//// 1 - r^2/2! + r^4/4! - ...//// are not very small when |r| is close to pi/4 and the rounding// errors will be a concern if simple polynomial accumulation is// used. When |r| < 2^-3, however, the second terms will be small// enough (6 bits or so of right shift) that a normal Horner// recurrence suffices. Hence there are two cases that we consider// in the accurate computation of sinl(r) and cosl(r), |r| <= pi/4.//// Case small_r: |r| < 2^(-3)// --------------------------//// Since Arg = M pi/4 + r + c accurately, and M mod 4 is [i_0 i_1],// we have//// sinl(Arg) = (-1)^i_0 * sinl(r + c) if i_1 = 0// = (-1)^i_0 * cosl(r + c) if i_1 = 1//// can be accurately approximated by//// sinl(Arg) = (-1)^i_0 * [sinl(r) + c] if i_1 = 0// = (-1)^i_0 * [cosl(r) - c*r] if i_1 = 1//// because |r| is small and thus the second terms in the correction// are unneccessary.//// Finally, sinl(r) and cosl(r) are approximated by polynomials of// moderate lengths.//// sinl(r) = r + S_1 r^3 + S_2 r^5 + ... + S_5 r^11// cosl(r) = 1 + C_1 r^2 + C_2 r^4 + ... + C_5 r^10//// We can make use of predicates to selectively calculate// sinl(r) or cosl(r) based on i_1.//// Case normal_r: 2^(-3) <= |r| <= pi/4// ------------------------------------//// This case is more likely than the previous one if one considers// r to be uniformly distributed in [-pi/4 pi/4]. Again,//// sinl(Arg) = (-1)^i_0 * sinl(r + c) if i_1 = 0// = (-1)^i_0 * cosl(r + c) if i_1 = 1.//// Because |r| is now larger, we need one extra term in the// correction. sinl(Arg) can be accurately approximated by//// sinl(Arg) = (-1)^i_0 * [sinl(r) + c(1-r^2/2)] if i_1 = 0// = (-1)^i_0 * [cosl(r) - c*r*(1 - r^2/6)] i_1 = 1.//// Finally, sinl(r) and cosl(r) are approximated by polynomials of// moderate lengths.//// sinl(r) = r + PP_1_hi r^3 + PP_1_lo r^3 +// PP_2 r^5 + ... + PP_8 r^17//// cosl(r) = 1 + QQ_1 r^2 + QQ_2 r^4 + ... + QQ_8 r^16//// where PP_1_hi is only about 16 bits long and QQ_1 is -1/2.// The crux in accurate computation is to calculate//// r + PP_1_hi r^3 or 1 + QQ_1 r^2//// accurately as two pieces: U_hi and U_lo. The way to achieve this// is to obtain r_hi as a 10 sig. bit number that approximates r to// roughly 8 bits or so of accuracy. (One convenient way is//// r_hi := frcpa( frcpa( r ) ).)//// This way,//// r + PP_1_hi r^3 = r + PP_1_hi r_hi^3 +// PP_1_hi (r^3 - r_hi^3)// = [r + PP_1_hi r_hi^3] +// [PP_1_hi (r - r_hi)// (r^2 + r_hi r + r_hi^2) ]// = U_hi + U_lo//// Since r_hi is only 10 bit long and PP_1_hi is only 16 bit long,// PP_1_hi * r_hi^3 is only at most 46 bit long and thus computed// exactly. Furthermore, r and PP_1_hi r_hi^3 are of opposite sign// and that there is no more than 8 bit shift off between r and// PP_1_hi * r_hi^3. Hence the sum, U_hi, is representable and thus// calculated without any error. Finally, the fact that//// |U_lo| <= 2^(-8) |U_hi|//// says that U_hi + U_lo is approximating r + PP_1_hi r^3 to roughly// 8 extra bits of accuracy.//// Similarly,//// 1 + QQ_1 r^2 = [1 + QQ_1 r_hi^2] +// [QQ_1 (r - r_hi)(r + r_hi)]// = U_hi + U_lo.//// Summarizing, we calculate r_hi = frcpa( frcpa( r ) ).//// If i_1 = 0, then//// U_hi := r + PP_1_hi * r_hi^3// U_lo := PP_1_hi * (r - r_hi) * (r^2 + r*r_hi + r_hi^2)// poly := PP_1_lo r^3 + PP_2 r^5 + ... + PP_8 r^17// correction := c * ( 1 + C_1 r^2 )//// Else ...i_1 = 1//// U_hi := 1 + QQ_1 * r_hi * r_hi// U_lo := QQ_1 * (r - r_hi) * (r + r_hi)// poly := QQ_2 * r^4 + QQ_3 * r^6 + ... + QQ_8 r^16// correction := -c * r * (1 + S_1 * r^2)//// End//// Finally,//// V := poly + ( U_lo + correction )//// / U_hi + V if i_0 = 0// result := |// \ (-U_hi) - V if i_0 = 1//// It is important that in the last step, negation of U_hi is// performed prior to the subtraction which is to be performed in// the user-set rounding mode.////// Algorithmic Description// =======================//// The argument reduction algorithm is tightly integrated into FSIN// and FCOS which share the same code. The following is complete and// self-contained. The argument reduction description given// previously is repeated below.////// Step 0. Initialization.//// If FSIN is invoked, set N_inc := 0; else if FCOS is invoked,// set N_inc := 1.//// Step 1. Check for exceptional and special cases.//// * If Arg is +-0, +-inf, NaN, NaT, go to Step 10 for special// handling.// * If |Arg| < 2^24, go to Step 2 for reduction of moderate// arguments. This is the most likely case.// * If |Arg| < 2^63, go to Step 8 for pre-reduction of large// arguments.// * If |Arg| >= 2^63, go to Step 10 for special handling.//// Step 2. Reduction of moderate arguments.//// If |Arg| < pi/4 ...quick branch// N_fix := N_inc (integer)// r := Arg// c := 0.0// Branch to Step 4, Case_1_complete// Else ...cf. argument reduction// N := Arg * two_by_PI (fp)// N_fix := fcvt.fx( N ) (int)// N := fcvt.xf( N_fix )// N_fix := N_fix + N_inc// s := Arg - N * P_1 (first piece of pi/2)// w := -N * P_2 (second piece of pi/2)//// If |s| >= 2^(-33)// go to Step 3, Case_1_reduce// Else// go to Step 7, Case_2_reduce// Endif// Endif//// Step 3. Case_1_reduce.//// r := s + w// c := (s - r) + w ...observe order//// Step 4. Case_1_complete//// ...At this point, the reduced argument alpha is// ...accurately represented as r + c.// If |r| < 2^(-3), go to Step 6, small_r.//// Step 5. Normal_r.//// Let [i_0 i_1] by the 2 lsb of N_fix.// FR_rsq := r * r// r_hi := frcpa( frcpa( r ) )// r_lo := r - r_hi//// If i_1 = 0, then// poly := r*FR_rsq*(PP_1_lo + FR_rsq*(PP_2 + ... FR_rsq*PP_8))// U_hi := r + PP_1_hi*r_hi*r_hi*r_hi ...any order// U_lo := PP_1_hi*r_lo*(r*r + r*r_hi + r_hi*r_hi)// correction := c + c*C_1*FR_rsq ...any order// Else// poly := FR_rsq*FR_rsq*(QQ_2 + FR_rsq*(QQ_3 + ... + FR_rsq*QQ_8))// U_hi := 1 + QQ_1 * r_hi * r_hi ...any order// U_lo := QQ_1 * r_lo * (r + r_hi)// correction := -c*(r + S_1*FR_rsq*r) ...any order// Endif//// V := poly + (U_lo + correction) ...observe order//// result := (i_0 == 0? 1.0 : -1.0)//// Last instruction in user-set rounding mode//// result := (i_0 == 0? result*U_hi + V :// result*U_hi - V)//// Return//// Step 6. Small_r.//// ...Use flush to zero mode without causing exception// Let [i_0 i_1] be the two lsb of N_fix.//// FR_rsq := r * r//// If i_1 = 0 then// z := FR_rsq*FR_rsq; z := FR_rsq*z *r// poly_lo := S_3 + FR_rsq*(S_4 + FR_rsq*S_5)// poly_hi := r*FR_rsq*(S_1 + FR_rsq*S_2)// correction := c// result := r// Else// z := FR_rsq*FR_rsq; z := FR_rsq*z// poly_lo := C_3 + FR_rsq*(C_4 + FR_rsq*C_5)// poly_hi := FR_rsq*(C_1 + FR_rsq*C_2)// correction := -c*r// result := 1// Endif//// poly := poly_hi + (z * poly_lo + correction)//// If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )// Return//// Step 7. Case_2_reduce.//// ...Refer to the write up for argument reduction for// ...rationale. The reduction algorithm below is taken from// ...argument reduction description and integrated this.//// w := N*P_3// U_1 := N*P_2 + w ...FMA// U_2 := (N*P_2 - U_1) + w ...2 FMA// ...U_1 + U_2 is N*(P_2+P_3) accurately//// r := s - U_1// c := ( (s - r) - U_1 ) - U_2//// ...The mathematical sum r + c approximates the reduced// ...argument accurately. Note that although compared to// ...Case 1, this case requires much more work to reduce// ...the argument, the subsequent calculation needed for// ...any of the trigonometric function is very little because// ...|alpha| < 1.01*2^(-33) and thus two terms of the// ...Taylor series expansion suffices.//// If i_1 = 0 then// poly := c + S_1 * r * r * r ...any order// result := r// Else// poly := -2^(-67)// result := 1.0// Endif//// If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )//// Return////// Step 8. Pre-reduction of large arguments.//// ...Again, the following reduction procedure was described// ...in the separate write up for argument reduction, which// ...is tightly integrated here.// N_0 := Arg * Inv_P_0// N_0_fix := fcvt.fx( N_0 )// N_0 := fcvt.xf( N_0_fix)// Arg' := Arg - N_0 * P_0// w := N_0 * d_1// N := Arg' * two_by_PI// N_fix := fcvt.fx( N )// N := fcvt.xf( N_fix )// N_fix := N_fix + N_inc//// s := Arg' - N * P_1// w := w - N * P_2//// If |s| >= 2^(-14)// go to Step 3// Else// go to Step 9// Endif//// Step 9. Case_4_reduce.//// ...first obtain N_0*d_1 and -N*P_2 accurately// U_hi := N_0 * d_1 V_hi := -N*P_2// U_lo := N_0 * d_1 - U_hi V_lo := -N*P_2 - U_hi ...FMAs//// ...compute the contribution from N_0*d_1 and -N*P_3// w := -N*P_3// w := w + N_0*d_2// t := U_lo + V_lo + w ...any order//// ...at this point, the mathematical value// ...s + U_hi + V_hi + t approximates the true reduced argument// ...accurately. Just need to compute this accurately.//// ...Calculate U_hi + V_hi accurately:// A := U_hi + V_hi// if |U_hi| >= |V_hi| then// a := (U_hi - A) + V_hi// else// a := (V_hi - A) + U_hi// endif// ...order in computing "a" must be observed. This branch is// ...best implemented by predicates.// ...A + a is U_hi + V_hi accurately. Moreover, "a" is// ...much smaller than A: |a| <= (1/2)ulp(A).//// ...Just need to calculate s + A + a + t// C_hi := s + A t := t + a// C_lo := (s - C_hi) + A// C_lo := C_lo + t//// ...Final steps for reduction// r := C_hi + C_lo// c := (C_hi - r) + C_lo//// ...At this point, we have r and c// ...And all we need is a couple of terms of the corresponding// ...Taylor series.//// If i_1 = 0// poly := c + r*FR_rsq*(S_1 + FR_rsq*S_2)// result := r// Else// poly := FR_rsq*(C_1 + FR_rsq*C_2)// result := 1// Endif//// If i_0 = 1, result := -result//// Last operation. Perform in user-set rounding mode//// result := (i_0 == 0? result + poly :// result - poly )// Return//// Large Arguments: For arguments above 2**63, a Payne-Hanek// style argument reduction is used and pi_by_2 reduce is called.//RODATA.align 16LOCAL_OBJECT_START(FSINCOSL_CONSTANTS)sincosl_table_p:data8 0xA2F9836E4E44152A, 0x00003FFE // Inv_pi_by_2data8 0xC84D32B0CE81B9F1, 0x00004016 // P_0data8 0xC90FDAA22168C235, 0x00003FFF // P_1data8 0xECE675D1FC8F8CBB, 0x0000BFBD // P_2data8 0xB7ED8FBBACC19C60, 0x0000BF7C // P_3data8 0x8D848E89DBD171A1, 0x0000BFBF // d_1data8 0xD5394C3618A66F8E, 0x0000BF7C // d_2LOCAL_OBJECT_END(FSINCOSL_CONSTANTS)LOCAL_OBJECT_START(sincosl_table_d)data8 0xC90FDAA22168C234, 0x00003FFE // pi_by_4data8 0xA397E5046EC6B45A, 0x00003FE7 // Inv_P_0data4 0x3E000000, 0xBE000000 // 2^-3 and -2^-3data4 0x2F000000, 0xAF000000 // 2^-33 and -2^-33data4 0x9E000000, 0x00000000 // -2^-67data4 0x00000000, 0x00000000 // padLOCAL_OBJECT_END(sincosl_table_d)LOCAL_OBJECT_START(sincosl_table_pp)data8 0xCC8ABEBCA21C0BC9, 0x00003FCE // PP_8data8 0xD7468A05720221DA, 0x0000BFD6 // PP_7data8 0xB092382F640AD517, 0x00003FDE // PP_6data8 0xD7322B47D1EB75A4, 0x0000BFE5 // PP_5data8 0xFFFFFFFFFFFFFFFE, 0x0000BFFD // C_1
?? 快捷鍵說(shuō)明
復(fù)制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號(hào)
Ctrl + =
減小字號(hào)
Ctrl + -