s_erfl.s
.file "erfl.s"

// Copyright (c) 2001 - 2003, Intel Corporation
// All rights reserved.
//
// Contributed 2001 by the Intel Numerics Group, Intel Corporation
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// * Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// * The name of Intel Corporation may not be used to endorse or promote
// products derived from this software without specific prior written
// permission.

// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Intel Corporation is the author of this code, and requests that all
// problem reports or change requests be submitted to it directly at
// http://www.intel.com/software/products/opensource/libraries/num.htm.
//
// History
//==============================================================
// 11/21/01 Initial version
// 05/20/02 Cleaned up namespace and sf0 syntax
// 08/14/02 Changed mli templates to mlx
// 02/06/03 Reordered header: .section, .global, .proc, .align
//
// API
//==============================================================
// long double erfl(long double)
//
// Overview of operation
//==============================================================
//
// Algorithm description
// ---------------------
//
// There are 4 paths:
//
// 1. Special path: x = 0, Inf, NaNs, denormal
//    Return erfl(x) = +/-0.0 for zeros
//    Return erfl(x) = QNaN for NaNs
//    Return erfl(x) = sign(x)*1.0 for Inf
//    Return erfl(x) = (A0H+A0L)*x + x^2, ((A0H+A0L) = 2.0/sqrt(Pi))
//    for denormals
//
// 2. [0;1/8] path: 0.0 < |x| < 1/8
//    Return erfl(x) = x*(A1H+A1L) + x^3*A3 + ... + x^15*A15
//
// 3. Main path: 1/8 <= |x| < 6.53
//    For several ranges of 1/8 <= |x| < 6.53
//    Return erfl(x) = sign(x)*((A0H+A0L) + y*(A1H+A1L) + y^2*(A2H+A2L) +
//                     + y^3*A3 + y^4*A4 + ... + y^25*A25 )
//    where y = (|x|/a) - b
//
//    For each range there is a particular set of coefficients.
//    Below is the list of ranges:
//    1/8  <= |x| < 1/4    a = 0.125, b = 1.5
//    1/4  <= |x| < 1/2    a = 0.25,  b = 1.5
//    1/2  <= |x| < 1.0    a = 0.5,   b = 1.5
//    1.0  <= |x| < 2.0    a = 1.0,   b = 1.5
//    2.0  <= |x| < 3.25   a = 2.0,   b = 1.5
//    3.25 <= |x| < 4.0    a = 2.0,   b = 2.0
//    4.0  <= |x| < 6.53   a = 4.0,   b = 1.5
//    ( the [3.25;4.0] subrange is separated to resolve monotonicity issues )
//
// 4. Saturation path: 6.53 <= |x| < +INF
//    Return erfl(x) = sign(x)*(1.0 - tiny_value)
//    (tiny_value ~ 1e-1233)
//
// Implementation notes
// --------------------
//
// 1. Special path: x = 0, INF, NaNs, denormals
//
//    This branch is cut off by one fclass operation.
//    Then zeros+NaNs, infinities and denormals are processed separately.
//    For denormals we had to use a multiprecision A0 coefficient to reach
//    the necessary accuracy: (A0H+A0L)*x-x^2
//
// 2. [0;1/8] path: 0.0 < |x| < 1/8
//
//    The first coefficient of the polynomial must be split into
//    multiprecision parts too.
//    Also we can parallelise computations:
//    (x*(A1H+A1L)) is calculated in parallel with the "tail"
//    (x^3*A3 + ... + x^15*A15)
//    Furthermore, the second part is factorized using a binary tree technique.
//
// 3. Main path: 1/8 <= |x| < 6.53
//
//    Multiprecision has to be used only for the first few
//    polynomial terms (up to the 3rd degree of y).
//    Here we use the same parallelisation approach as above:
//    split the whole polynomial into a first, "multiprecision" part, and a
//    second, so called "tail", native precision part.
//
//    1) Multiprecision part:
//       [v1=(A0H+A0L)+y*(A1H+A1L)] + [v2=y^2*((A2H+A2L)+y*A3)]
//       The v1 and v2 terms are calculated in parallel.
//
//    2) Tail part:
//       v3 = y^4 * ( A4 + y*A5 + ... + y^21*A25 )
//       v3 is split into 2 even parts (10 coefficients in each one).
//       These 2 parts are also factorized using a binary tree technique.
//
//    So the cost of the Multiprecision and Tail parts is almost the same
//    and we have both results ready before the final summation.
//
// 4. Saturation path: 6.53 <= |x| < +INF
//
//    We use the formula sign(x)*(1.0 - tiny_value) instead of simple sign(x)*1.0
//    just to meet IEEE requirements for different rounding modes in this case.
//
// Registers used
//==============================================================
// Floating Point registers used:
// f8 - input & output
// f32 -> f90

// General registers used:
// r2, r3, r32 -> r52

// Predicate registers used:
// p0, p6 -> p11, p14, p15

// p6 - arg is zero, denormal or special IEEE
// p7 - arg is in [4;8] binary interval
// p8 - arg is in [3.25;4] interval
// p9 - arg < 1/8
// p10 - arg is NOT in [3.25;4] interval
// p11 - arg in saturation domain
// p14 - arg is positive
// p15 - arg is negative

// Assembly macros
//==============================================================
rDataPtr = r2
rTailDataPtr = r3

rBias = r33
rSignBit = r34
rInterval = r35

rArgExp = r36
rArgSig = r37
r3p25Offset = r38
r2to4 = r39
r1p25 = r40
rOffset = r41
r1p5 = r42
rSaturation = r43
r3p25Sign = r44
rTiny = r45
rAddr1 = r46
rAddr2 = r47
rTailAddr1 = r48
rTailAddr2 = r49
rTailOffset = r50
rTailAddOffset = r51
rShiftedDataPtr = r52

//==============================================================
fA0H = f32
fA0L = f33
fA1H = f34
fA1L = f35
fA2H = f36
fA2L = f37
fA3 = f38
fA4 = f39
fA5 = f40
fA6 = f41
fA7 = f42
fA8 = f43
fA9 = f44
fA10 = f45
fA11 = f46
fA12 = f47
fA13 = f48
fA14 = f49
fA15 = f50
fA16 = f51
fA17 = f52
fA18 = f53
fA19 = f54
fA20 = f55
fA21 = f56
fA22 = f57
fA23 = f58
fA24 = f59
fA25 = f60

fArgSqr = f61
fArgCube = f62
fArgFour = f63
fArgEight = f64

fArgAbsNorm = f65
fArgAbsNorm2 = f66
fArgAbsNorm2L = f67
fArgAbsNorm3 = f68
fArgAbsNorm4 = f69
fArgAbsNorm11 = f70

fRes = f71
fResH = f72
fResL = f73
fRes1H = f74
fRes1L = f75
fRes1Hd = f76
fRes2H = f77
fRes2L = f78
fRes3H = f79
fRes3L = f80
fRes4 = f81

fTT = f82
fTH = f83
fTL = f84
fTT2 = f85
fTH2 = f86
fTL2 = f87

f1p5 = f88
f2p0 = f89
fTiny = f90

// Data tables
//==============================================================
RODATA

.align 64
LOCAL_OBJECT_START(erfl_data)
////////// Main tables ///////////
_0p125_to_0p25_data: // exp = 2^-3
// Polynomial coefficients for the erf(x), 1/8 <= |x| < 1/4
data8 0xACD9ED470F0BB048, 0x0000BFF4 //A3 = -6.5937529303909561891162915809e-04
data8 0xBF6A254428DDB452 //A2H = -3.1915980570631852578089571182e-03
data8 0xBC131B3BE3AC5079 //A2L = -2.5893976889070198978842231134e-19
data8 0x3FC16E2D7093CD8C //A1H = 1.3617485043469590433318217038e-01
data8 0x3C6979A52F906B4C //A1L = 1.1048096806003284897639351952e-17
data8 0x3FCAC45E37FE2526 //A0H = 2.0911767705937583938791135552e-01
data8 0x3C648D48536C61E3 //A0L = 8.9129592834861155344147026365e-18
data8 0xD1FC135B4A30E746, 0x00003F90 //A25 = 6.3189963203954877364460345654e-34
data8 0xB1C79B06DD8C988C, 0x00003F97 //A24 = 6.8478253118093953461840838106e-32
data8 0xCC7AE121D1DEDA30, 0x0000BF9A //A23 = -6.3010264109146390803803408666e-31
data8 0x8927B8841D1E0CA8, 0x0000BFA1 //A22 = -5.4098171537601308358556861717e-29
data8 0xB4E84D6D0C8F3515, 0x00003FA4 //A21 = 5.7084320046554628404861183887e-28
data8 0xC190EAE69A67959A, 0x00003FAA //A20 = 3.9090359419467121266470910523e-26
data8 0x90122425D312F680, 0x0000BFAE //A19 = -4.6551806872355374409398000522e-25
data8 0xF8456C9C747138D6, 0x0000BFB3 //A18 = -2.5670639225386507569611436435e-23
data8 0xCDCAE0B3C6F65A3A, 0x00003FB7 //A17 = 3.4045511783329546779285646369e-22
data8 0x8F41909107C62DCC, 0x00003FBD //A16 = 1.5167830861896169812375771948e-20
data8 0x82F0FCB8A4B8C0A3, 0x0000BFC1 //A15 = -2.2182328575376704666050112195e-19
data8 0x92E992C58B7C3847, 0x0000BFC6 //A14 = -7.9641369349930600223371163611e-18
LOCAL_OBJECT_END(erfl_data)

LOCAL_OBJECT_START(_0p25_to_0p5_data)
// Polynomial coefficients for the erf(x), 1/4 <= |x| < 1/2
data8 0xF083628E8F7CE71D, 0x0000BFF6 //A3 = -3.6699405305266733332335619531e-03
data8 0xBF978749A434FE4E //A2H = -2.2977018973732214746075186440e-02
data8 0xBC30B3FAFBC21107 //A2L = -9.0547407100537663337591537643e-19
data8 0x3FCF5F0CDAF15313 //A1H = 2.4508820238647696654332719390e-01
data8 0x3C1DFF29F5AD8117 //A1L = 4.0653155218104625249413579084e-19
data8 0x3FD9DD0D2B721F38 //A0H = 4.0411690943482225790717166092e-01
data8 0x3C874C71FEF1759E //A0L = 4.0416653425001310671815863946e-17
data8 0xA621D99B8C12595E, 0x0000BFAB //A25 = -6.7100271986703749013021666304e-26
data8 0xBD7BBACB439992E5, 0x00003FAE //A24 = 6.1225362452814749024566661525e-25
data8 0xFF2FEFF03A98E410, 0x00003FB2 //A23 = 1.3192871864994282747963195183e-23
data8 0xAE8180957ABE6FD5, 0x0000BFB6 //A22 = -1.4434787102181180110707433640e-22
data8 0xAF0566617B453AA6, 0x0000BFBA //A21 = -2.3163848847252215762970075142e-21
data8 0x8F33D3616B9B8257, 0x00003FBE //A20 = 3.0324297082969526400202995913e-20
data8 0xD58AB73354438856, 0x00003FC1 //A19 = 3.6175397854863872232142412590e-19
data8 0xD214550E2F3210DF, 0x0000BFC5 //A18 = -5.6942141660091333278722310354e-18
data8 0xE2CA60C328F3BBF5, 0x0000BFC8 //A17 = -4.9177359011428870333915211291e-17
data8 0x88D9BB274F9B3873, 0x00003FCD //A16 = 9.4959118337089189766177270051e-16
data8 0xCA4A00AB538A2DB2, 0x00003FCF //A15 = 5.6146496538690657993449251855e-15
data8 0x9CC8FFFBDDCF9853, 0x0000BFD4 //A14 = -1.3925319209173383944263942226e-13
LOCAL_OBJECT_END(_0p25_to_0p5_data)

LOCAL_OBJECT_START(_0p5_to_1_data)
// Polynomial coefficients for the erf(x), 1/2 <= |x| < 1
data8 0xDB742C8FB372DBE0, 0x00003FF6 //A3 = 3.3485993187250381721535255963e-03
data8 0xBFBEDC5644353C26 //A2H = -1.2054957547410136142751468924e-01
data8 0xBC6D7215B023455F //A2L = -1.2770012232203569059818773287e-17
data8 0x3FD492E42D78D2C4 //A1H = 3.2146553459760363047337250464e-01
data8 0x3C83A163CAC22E05 //A1L = 3.4053365952542489137756724868e-17
data8 0x3FE6C1C9759D0E5F //A0H = 7.1115563365351508462453011816e-01
data8 0x3C8B1432F2CBC455 //A0L = 4.6974407716428899960674098333e-17
data8 0x95A6B92162813FF8, 0x00003FC3 //A25 = 1.0140763985766801318711038400e-18
data8 0xFE5EC3217F457B83, 0x0000BFC6 //A24 = -1.3789434273280972156856405853e-17
data8 0x9B49651031B5310B, 0x0000BFC8 //A23 = -3.3672435142472427475576375889e-17