technical_background.txt

This is the source code of a VDR plugin for the Linux environment; it supports card reading, card sharing and similar operations for CA systems such as Irdeto, Seca, Viaccess, Nagra, Conax & Cryptoworks.
-------FFdecsa-------

This doc is for people who looked into the source code and found it difficult to believe that this is a decsa algorithm, as it appears completely different from other decsa implementations.

It appears different because it is different. Being different is what enables it to be a lot faster than all the others (currently it has more than 800% the speed of the best version I was able to find).

The csa algo was designed to be run in hardware, but people are now running it in software.

Hardware has data lines carrying bits and functional blocks doing calculations (logic operations, adders, shifters, table lookup, ...); software instead uses memory to contain data values and executes a sequence of instructions to transform the values. As a consequence, writing a software implementation of a hardware algorithm can be inefficient.

For example, if you have 32 data lines, you can permute the bits at zero cost in hardware (you just permute the physical traces), but if you have the bits in a 32 bit variable you have to use 32 "and" operations with 32 different masks, 32 shifts and 31 "or" operations (if you suggest using "if"s testing the bits one by one, you know nothing about how branch prediction works in modern processors).

So the approach is *emulating the hardware*.

Then there are some additional cool tricks.

TRICK NUMBER 0: emulate the hardware
------------------------------------

We will work on bits one by one, that is, a 4 bit word is now four variables. In this way complex software operations turn into hardware emulation:

  software                      hardware
  -------------------------------------------
  copy values                   copy values
  logic op                      logic op
  (bit permut.) ands+shifts+ors copy values
  additions                     logic op emulating adders
  (comparisons) if              logic op selecting one of the two results
  lookup tables                 logic op synthesizing a ROM (*)

(*) sometimes lookup tables can be converted to logic expressions

The sboxes in the stream cypher have been converted to efficient logic operations using custom-written software (look into the logic directory), and this is responsible for a lot of the speed increase. Maybe there exists a slightly better way to express the sboxes as logical expressions, but it would be a minuscule improvement. The sbox in the block cypher can't be converted to efficient logic operations (8 bits of input are just too many) and is implemented with a traditional lookup in an array.

But there is a problem: we want to process bits, but our external input and output are bytes. We need conversion routines. Conversion routines are similar to the awful permutations we described before, so they have to be done efficiently somehow.

TRICK NUMBER 1: virtual shift registers
---------------------------------------

Shift registers are normally implemented by moving all data around. It is better to leave the data in the same memory locations and redefine where the start of the register is (updating a pointer). That is called a virtual shift register.

TRICK NUMBER 2: parallel bitslice
---------------------------------

Implementing the algorithm as described in tricks 0 and 1 gives us about 15% of the speed of a traditional implementation. This happens because we work on only one bit, even if our CPU is 32 bit wide. But *we can process 32 different packets at the same time*. This is called the "bitslice" method. It can be done only if the program flow is not dependent on the data (if, while, ...).
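Trick number 0 turns additions into logic ops emulating adders, and in the bitslice layout one 32 bit word carries the same bit of 32 different packets. A minimal sketch combining the two, assuming a hypothetical layout where word k holds bit k (LSB first) of 32 independent 4 bit values, with lane j of every word belonging to value j. It only illustrates the idea and is not taken from FFdecsa; the function names are made up, and the conversion loops are deliberately naive (making that step fast is the subject of trick number 6):

```c
#include <assert.h>
#include <stdint.h>

/* Bitsliced ripple-carry adder (hypothetical helper): a[k], b[k], sum[k]
   hold bit k (k = 0 is the least significant) of 32 independent 4 bit
   numbers; lane j of every word belongs to number j. Result is mod 16. */
static void bitslice_add4(const uint32_t a[4], const uint32_t b[4], uint32_t sum[4])
{
    uint32_t carry = 0;                          /* carry-in is 0 in every lane */
    for (int k = 0; k < 4; k++) {
        uint32_t axb = a[k] ^ b[k];
        sum[k] = axb ^ carry;                    /* full-adder sum bit, 32 lanes at once */
        carry  = (a[k] & b[k]) | (carry & axb);  /* full-adder carry out */
    }
}

/* Normal -> bitslice (naive): bit k of vals[j] goes to lane j of out[k]. */
static void to_slices4(const uint8_t vals[32], uint32_t out[4])
{
    for (int k = 0; k < 4; k++) {
        out[k] = 0;
        for (int j = 0; j < 32; j++)
            out[k] |= (uint32_t)((vals[j] >> k) & 1u) << j;
    }
}

/* Bitslice -> normal (naive): the inverse of to_slices4. */
static void from_slices4(const uint32_t in[4], uint8_t vals[32])
{
    for (int j = 0; j < 32; j++) {
        vals[j] = 0;
        for (int k = 0; k < 4; k++)
            vals[j] |= (uint8_t)(((in[k] >> j) & 1u) << k);
    }
}
```

The adder loop runs 4 times with 5 logic ops each, however many lanes the words carry, which is where the 32x gain of the bitslice method comes from.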
Luckily, the csa program flow can be made independent of the data. Things like

  if(a){
    b=c&d;
  }
  else{
    b=e&f;
  }

can be coded as (think of how hardware would implement this)

  b1=c&d;
  b2=e&f;
  b=b2^(a&(b1^b2));

and things like

  if(a){
    b=c&d;
  }

can be transformed in the same way, as they may be written as

  if(a){
    b=c&d;
  }
  else{
    b=b;
  }

It may look wasteful, but it is not, and it removes the data dependency of the program flow. Our code takes the same time as before, but produces 32 results, so speed is now 480% the speed of a traditional implementation.

TRICK NUMBER 3: multimedia instructions
---------------------------------------

If our CPU is 32 bit but it can also process larger blocks of data efficiently (multimedia instructions), we can use them. We only need logic ops, and these are typically available.

We can use MMX and work on 64 packets, or SSE and work on 128 packets. The speed doesn't automatically double going from 32 to 64 because the integer registers of the processor are normally faster. However, some speed is gained in this way.

Multimedia instructions are often used by writing assembler by hand, but compilers are very good at register allocation, loop unrolling and instruction scheduling, so it is better to write the code in C and use native multimedia data types (intrinsics).

Depending on the number of available registers, execution latency and number of execution units in the CPU, it may be good to process more than one data block at the same time, for example two 64 bit MMX values. In this case we work on 128 bits by simulating a 128 bit op with two consecutive 64 bit ops. This may or may not help (apparently not, because the x86 architecture has a small number of registers).

We can also try working on 96 bits, pairing an MMX and an int op, or 192 bits by using MMX and SSE.
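The select formula b=b2^(a&(b1^b2)) and the idea of simulating a 128 bit op with two consecutive 64 bit ops fit together in a small portable sketch. The group128 type and the g_ function names are hypothetical (FFdecsa's own group macros are not reproduced here); each bit lane is one packet, so one call computes the select for 128 packets:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-lane group: one bit per packet, stored as two 64 bit
   halves; every 128 bit op is simulated with two consecutive 64 bit ops. */
typedef struct { uint64_t lo, hi; } group128;

static inline group128 g_and(group128 x, group128 y)
{
    return (group128){ x.lo & y.lo, x.hi & y.hi };
}

static inline group128 g_xor(group128 x, group128 y)
{
    return (group128){ x.lo ^ y.lo, x.hi ^ y.hi };
}

/* Branch-free "if(a) b=c&d; else b=e&f;" evaluated in every lane:
   where a is 1 the result is c&d, where a is 0 it is e&f. */
static inline group128 g_select(group128 a, group128 c, group128 d,
                                group128 e, group128 f)
{
    group128 b1 = g_and(c, d);
    group128 b2 = g_and(e, f);
    return g_xor(b2, g_and(a, g_xor(b1, b2)));
}
```

Replacing the struct with an SSE2 __m128i and the paired 64 bit ops with _mm_and_si128 / _mm_xor_si128 gives the intrinsics variant of the same code.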
Working on 96 or 192 bits is doable in theory and could exploit different execution units in the CPU, but speed doesn't improve (maybe because of cache line handling problems inside the CPU).

Besides int, MMX and SSE, we can use long long int (64 bit) and, why not, unsigned char.

Using groups of unsigned chars (8 or 16) could give the compiler an opportunity to insert multimedia instructions automatically. For example, icc can use one MMX instruction to do

  int i;
  unsigned char a[8],b[8],c[8];
  for(i=0;i<8;i++){
    a[i]=b[i]&c[i];
  }

Some compilers (like icc) are efficient in this case, but using intrinsics manually is generally faster.

All these experiments can be easily done if the code is written in a way which abstracts the data type used. This is not easy but doable: all the operations on data become (inlined) function calls or preprocessor macros. Good compilers are able to simplify all the abstraction at compile time and generate perfect code (gcc is great).

The data abstraction used in the code is called "group".

TRICK NUMBER 4: parallel byteslice
----------------------------------

The bitslice method works wonderfully on the stream cypher, but can't be applied to the block cypher because of the evil big look up table.

As we have to convert input data from normal to bitslice before starting processing and from bitslice to normal before output, we convert the stream cypher output to normal before the block calculations and do the block stage in a traditional way.

There are some xors in the block cypher, so we arrange bytes from different packets side by side and use multimedia instructions to work on many bytes at the same time. This is not exactly bitslice; maybe it is called byteslice. The conversion routines are similar (just a bit simpler).

The data type we use to do this in the code is called "batch".

The virtual shift register described in trick number 1 is useful too.

The table lookup is the only thing which is done serially, one byte at a time.
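A minimal sketch of the byteslice idea, assuming a hypothetical batch of 8 packets held in one uint64_t, where byte p belongs to packet p. The xor with a key byte handles all 8 packets with one 64 bit op, while the table lookup stays serial, one byte at a time. The table is a made-up permutation, not the csa block cypher sbox, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the block cypher table; any 256 byte table
   serves for the illustration. */
static uint8_t sbox[256];

static void init_sbox(void)
{
    for (int i = 0; i < 256; i++)
        sbox[i] = (uint8_t)(i * 167 + 13);  /* 167 is odd, so this is a permutation */
}

/* One byteslice step on a batch of 8 packets: xor every byte with the same
   key byte in a single 64 bit op, then do the lookups one byte at a time. */
static uint64_t batch_step(uint64_t batch, uint8_t key)
{
    uint64_t x = batch ^ (0x0101010101010101ULL * key);  /* key copied into all 8 bytes */
    uint64_t out = 0;
    for (int p = 0; p < 8; p++) {                         /* serial part */
        uint8_t byte = (uint8_t)(x >> (8 * p));
        out |= (uint64_t)sbox[byte] << (8 * p);
    }
    return out;
}
```

With a wider batch type (MMX, SSE) the xor covers more packets per instruction, while the lookup loop just gets longer and unrolls well.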
Luckily, when the lookup is done on 32 or 64 bytes, the loop is heavily unrolled, and the compiler and the CPU manage to get good speed because there is little dependency between instructions.

TRICK NUMBER 5: efficient bit permutation
-----------------------------------------

The block cypher has a bit permutation part. As we are not in bitsliced form at that point, permuting bits in a byte takes 8 masks, 8 and, 7 or; but three bits move by the same shift, so they can share one mask and we do it with 6 masks, 6 and, 5 or. Batch processing through multimedia instructions is applicable too.

TRICK NUMBER 6: efficient normal<->slice conversion
---------------------------------------------------

The bitslice<->normal conversion routines are a sort of transposition operation: you have bits in rows and want them in columns. This can be done efficiently. For example, the transposition of 8 bytes (a matrix of 8x8=64 bits) can be done this way (we want to exchange bit[i][j] with bit[j][i], and we assume bit 0 is the MSB in the byte):

  // untested code, may be buggy
  int i,j;
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      b[i]|=((a[j]>>(7-i))&1)<<(7-j);
    }
  }

but it is slow (128 shifts, 64 and, 64 or), or

  // untested code, may be buggy
  int i,j;
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      if(a[j]&(1<<(7-i))) b[i]|=1<<(7-j);
    }
  }

but it is very very slow (128 shifts, 64 and, 64 or, 64 unpredictable if!), or using a>>=1 and b<<=1, which gains you nothing, or

  // untested code, may be buggy
  int i,j;
  unsigned char a[8];
  unsigned char b[8];
  unsigned char top,bottom;
  // stage 1: swap the upper-right and lower-left 4x4 blocks
  for(j=0;j<1;j++){
    for(i=0;i<4;i++){
      top=   a[8*j+i];
      bottom=a[8*j+4+i];
      a[8*j+i]=   (top&0xf0)    |((bottom&0xf0)>>4);
      a[8*j+4+i]=((top&0x0f)<<4)| (bottom&0x0f);
    }
  }
  // stage 2: swap the off-diagonal 2x2 blocks inside each 4x4 block
  for(j=0;j<2;j++){
    for(i=0;i<2;i++){
      top=   a[4*j+i];
      bottom=a[4*j+2+i];
      a[4*j+i]  = (top&0xcc)    |((bottom&0xcc)>>2);
      a[4*j+2+i]=((top&0x33)<<2)| (bottom&0x33);
    }
  }
  // stage 3: swap the off-diagonal bits inside each 2x2 block
  for(j=0;j<4;j++){
    for(i=0;i<1;i++){
      top=   a[2*j+i];
      bottom=a[2*j+1+i];
      a[2*j+i]  = (top&0xaa)    |((bottom&0xaa)>>1);
      a[2*j+1+i]=((top&0x55)<<1)| (bottom&0x55);
    }
  }
  for(i=0;i<8;i++) b[i]=a[i]; //easy to integrate into one of the stages above

which is very fast (24 shifts, 48 and, 24 or) and has redundant loops and address calculations which will be optimized away by the compiler. It can be written as 3 nested loops, but it becomes less readable and makes it difficult to have the results in b without an extra copy. The compiler always unrolls heavily.

The gain is much bigger when operating with 32 bit or 64 bit values (we are going from N^2 to N log(N)). This method is used for rectangular matrices too (they have to be seen as square matrices side by side).

Warning: this code is not *endian independent* if you use ints to work on 4 bytes. Running it on a big endian processor will give you a different and strange kind of bit rotation if you don't modify masks and shifts.

This is done in the code using int or long long int. It should be possible to use MMX instead of long long int and it could be faster, but this code doesn't cost a great fraction of the total time. There are problems with the shifts, as multimedia instructions do not have all the kinds of shift we need (SSE has none!).

TRICK NUMBER 7: try hard to process packets together
----------------------------------------------------

As we are able to process many packets together, we have to avoid running with many slots empty. Processing one packet or 64 packets takes the same time if the internal parallelism is 64! So we try hard to aggregate packets that can be processed together; for simplicity we don't mix packets with even and odd parity (different keys), even if it should be doable with a little effort. Sometimes the transition from even to odd parity and vice versa is not sharp, but there are sequences like EEEEEOEEOEEOOOO.
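For a parity sequence like the one above, the packets of one parity can be picked out in a single pass while keeping their order. A minimal sketch, assuming a hypothetical representation of the sequence as a string of 'E'/'O' flags (the real FFdecsa interface is different, and both function names are made up); it also counts the runs a strictly in-order scheme would need:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stores, in arrival order, the indices of up to
   max_slots packets whose parity flag equals 'want' ('E' or 'O'). */
static size_t collect_parity(const char *parity, size_t n, char want,
                             size_t *slots, size_t max_slots)
{
    size_t used = 0;
    for (size_t i = 0; i < n && used < max_slots; i++)
        if (parity[i] == want)
            slots[used++] = i;
    return used;
}

/* Runs needed when packets are processed strictly in order:
   one new run every time the parity changes. */
static size_t in_order_runs(const char *parity, size_t n)
{
    size_t runs = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || parity[i] != parity[i - 1])
            runs++;
    return runs;
}
```

For EEEEEOEEOEEOOOO the in-order scheme needs 6 runs (EEEEE, O, EE, O, EE, OOOO), while grouping by parity needs 2, as long as each parity fits in the available slots.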
We try to group all the E packets together even if there are O packets between them. This out-of-order processing complicates the interface to the applications a bit, but saves us three or four runs with many empty slots.

We also have logic to process together packets with different payload sizes (the payload is not always 184 bytes). This involves sorting the packets by size before processing and careful operation of the 23-iteration loop to exclude some packets from the calculations. It is not CPU heavy.

Packets with a payload of less than 8 bytes are identical before and after decryption (!), so we skip them without using a slot (according to DVB specs this kind of packet shouldn't happen, but it is used in the real world).

TRICK NUMBER 8: try to avoid doing the same thing many times
------------------------------------------------------------

Some calculations related to keys are only done when the keys are set; then all the values depending on keys are stored in a convenient form and used every time we convert a group of packets.

TRICK NUMBER 9: compiler
------------------------

Compilers have a lot of optimization options. I used -march to target my CPU and played with unusual options. In particular

  "--param max-unrolled-insns=500"

does a good job on the tricky table lookup in the block cypher. Bigger values unroll too much somewhere and lose speed. All the testing has been done on an AthlonXP CPU with a specific version of gcc

  gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)

Other combinations of CPU and compiler can give different speeds. If the compiler is not able to simplify the group and batch structures and stores everything in memory instead of registers, performance will be low.

Absolutely use a good compiler!

Note: the same code can be compiled in C or C++ mode.
g++ gives a 3% speed increase compared to gcc (I suppose some stricter constraint on arrays and pointers in C++ mode gives the optimizer more freedom).

TRICK NUMBER a: a lot of brain work
-----------------------------------

The code started as a very slow but correct implementation and was then tweaked for months with a lot of experimentation, adding all the good ideas one after another to achieve little steps toward the best possible speed, while continuously testing that nothing had been broken.

Many hours were spent on this code.

Enjoy the result.
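Trick number 1 (virtual shift registers) can be illustrated with a small sketch. The register length, the vshift type and the vs_ function names are hypothetical. Storing every cell twice, so that reads never need a wrap-around test, is one common way to implement the idea and is an assumption here, not necessarily what FFdecsa does:

```c
#include <assert.h>
#include <stdint.h>

#define VS_LEN 10                 /* hypothetical register length */

/* Hypothetical virtual shift register: the cells never move; only 'start',
   the physical index of logical cell 0, changes. Every cell is stored twice
   (at j and j + VS_LEN) so that start + i never needs a wrap-around test. */
typedef struct {
    uint8_t cells[2 * VS_LEN];
    int start;                    /* 0 <= start < VS_LEN */
} vshift;

/* Reads logical cell i (0 is the most recently shifted-in value). */
static uint8_t vs_get(const vshift *r, int i)
{
    assert(i >= 0 && i < VS_LEN);
    return r->cells[r->start + i];
}

/* Shifts by one position: logical cell i becomes cell i + 1, the oldest
   value falls off the end and 'in' becomes cell 0. Instead of moving
   VS_LEN values, only the start index changes and one cell (plus its
   copy) is written. */
static void vs_shift_in(vshift *r, uint8_t in)
{
    r->start = (r->start == 0) ? VS_LEN - 1 : r->start - 1;
    r->cells[r->start] = in;
    r->cells[r->start + VS_LEN] = in;
}
```

The cost of a shift no longer depends on the register length; the price is twice the memory and one extra store per shift.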

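Trick number 6 notes that the gain is much bigger with 32 or 64 bit values. As an illustration, the same block-exchange idea can transpose a whole 8x8 bit matrix held in one 64 bit word, here going from the smallest blocks to the largest. This is a standard bit-matrix technique shown under an assumed layout (row r is byte r of the word, column c is bit c counted from the LSB), not the FFdecsa routine; with another byte order or bit numbering the masks and shifts change, as the endianness warning in trick number 6 says. The function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Transposes an 8x8 bit matrix packed in one 64 bit word: bit 8*r + c
   (row r, column c) moves to bit 8*c + r. Each stage swaps pairs of bits
   with one masked xor exchange: first inside 2x2 blocks, then the
   off-diagonal 2x2 blocks of each 4x4 block, then the off-diagonal 4x4
   blocks of the whole matrix. */
static uint64_t transpose8x8(uint64_t x)
{
    uint64_t t;
    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL; x ^= t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL; x ^= t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL; x ^= t ^ (t << 28);
    return x;
}

/* Slow bit-by-bit reference, useful for checking the fast version. */
static uint64_t transpose8x8_naive(uint64_t x)
{
    uint64_t y = 0;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            y |= ((x >> (8 * r + c)) & 1ULL) << (8 * c + r);
    return y;
}
```

That is 18 operations (6 shifts, 3 and, 9 xor) for all 64 bits, against 24 shifts, 48 and, 24 or for the byte-wise version.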