.EQ
delim $$
define <- ?< "\h'-0.5m'" up 10 "\(em" down 10 ?
define gtorder ?"\z>\d\~\u"?
define EXIST ?"\z\-\d\z\-\r\-\d\v'0.2m'\(br\v'-0.2m'"?
define ALL ?"\o'V-'"?
define 0M '0~...~M-1'
define LH 'lo~...~hi'
define RR 'bold R'
define HH 'bold H'
define KK 'bold K'
define or '"\fBor\fI"~'
define and '"\fBand\fI"~'
define if '"\fBif\fI"~'
define then '"\fBthen\fI"~'
define else '"\fBelse\fI"~'
define repeat '"\fBrepeat\fI"~'
define until '"\fBuntil\fI"~'
define while '"\fBwhile\fI"~'
define do '"\fBdo\fI"~'
define case '"\fBcase\fI"~'
define end '"\fBend\fI"~'
define begin '"\fBbegin\fI"~'
define elseif '"\fBelseif\fI"~'
define for '"\fBfor\fI"~'
define From '"\fBfrom\fI"~'
define To '"\fBto\fI"~'
define exit '"\fBexit\fI"~'
.EN
.ls 1
.ce
COMPACT HASH TABLES USING BIDIRECTIONAL LINEAR PROBING
.sp 3
.ce
John G. Cleary
.ce
The University of Calgary, Alberta, Canada.
.sp 3
.sp 20
\u1\dAuthor's Present Address: Man-Machine Systems Group, Department of
Computer Science, The University of Calgary, 2500 University Drive NW,
Calgary, Canada T2N 1N4.
.sp
\u2\dThis research was supported by the Natural Sciences and Engineering
Research Council of Canada.
.sp 2
.ls 2
.bp
Index Terms -- Searching, hash storage, open addressing, bidirectional
linear probing, address calculation, information retrieval, scatter
storage, performance analysis, memory compaction.
.bp
.pp
Abstract -- An algorithm is developed which reduces the memory
requirements of hash tables.
This is achieved by storing only a part of each key along with a few
extra bits needed to ensure that all keys are stored unambiguously.
The fraction of each key stored decreases as the size of the hash table
increases.
Significant reductions in total memory usage can be achieved, especially
when the key size is not much larger than the size of a memory index and
when only a small amount of data is stored with each key.
The algorithm is based on bidirectional linear probing.
Search and insertion times are shown by simulation to be similar to
those for ordinary bidirectional linear probing.
.bp
.sh "1 Introduction"
.pp
The retrieval of a single item from among many others is a common
problem in computer science.
I am particularly concerned here with the case where the item is
retrieved on the basis of a single label or key attached to each entry
and where the keys are not ordered in any particular way.
There is a well known solution to this problem in the form of hash
tables.
Knuth [8], Knott [7] and Maurer and Lewis [11] provide good
introductions to this subject.
.pp
An efficient version of hashing, called
.ul
bidirectional linear probing (BLP),
was developed by Amble and Knuth [1].
As it forms the basis of what follows, it is described in more detail
in the following section.
Section 3 shows how it can be modified so as to significantly reduce
its memory requirements.
This is done by storing only a small part of each key -- a few extra
bits are needed to ensure that different keys that look the same after
truncation are correctly distinguished.
.pp
The execution time of this compact hashing algorithm is considered in
Section 4.
It is shown by simulation to be similar to that of ordinary BLP for
both successful searches and insertions.
It is significantly better for unsuccessful searches.
.pp
A hashing scheme similar to compact hashing, in that not all of the key
is stored, has been proposed by Andreae [2] (Chapter 1).
However, his technique has a small but finite probability of retrieving
an incorrect key.
Although compact hashing is not based on this earlier technique, it
provided the impetus to seek the current solution.
.pp
In hashing algorithms that use an overflow area and a linked list of
synonyms, or variations of this using buckets (see Maurer and Lewis
[11]), only the remainder of each key need be stored.
This has been known since at least 1965 (Feldman and Low [6] and Knuth
[8], sec. 6.4, exercise 13, p. 543).
However, each entry (including the original hash location) requires a
pointer to the next overflow record.
This pointer will be about the same size as the reduction in the key
size, so there is no net memory saving over open addressing techniques
such as BLP.
.pp
Amongst the possible applications of compact hashing is the storage of
trees and TRIES without the use of pointers but still preserving a
$log N$ retrieval time.
It is hoped to report on this application in more detail later.
.pp
Pascal versions of the algorithms described below are available from
the author.
.sh "2 Bidirectional Linear Probing"
.pp
I will now introduce the open addressing technique which forms the
basis of compact hashing.
The
.ul
hash table
in which the keys will be stored is an array $T[ 0M ]$.
I will be concerned only with the keys themselves, as the items
associated with each key do not significantly affect the algorithms.
In order to compute the location for each key I will use two functions:
$t$, which randomises the original keys, and $h$, which computes a
value in the range $0M$.
.pp
Let $KK$ be the set of all possible keys and $HH$ be the set of all
possible transformed keys.
Then $t: KK -> HH$ is an invertible function.
This function is introduced to ensure that the stored keys are random
and so, as a consequence, the hashing procedure has a satisfactory
average performance.
In what follows these transformed keys will be used rather than the
original keys.
For example, it is the transformed keys that are stored in $T$.
($-1$ is used to indicate an unoccupied location in $T$.)
.pp
$h: HH ->"{" 0M "}"$ and has the property that for
$H sub 1 ~, H sub 2 ~ "\(mo" HH$,
$H sub 1 ~<=~ H sub 2$ implies $h(H sub 1 ) ~<=~ h(H sub 2 )$.
As a consequence the keys will be mapped into the hash table in the
same order as the values of their transformed keys.
This ordering is essential to the compaction attained later.
Suitable functions $t$ and $h$ have been extensively discussed (Carter
and Wegman [3]; Knott [7]; Lum [9]; Lum, Yuen and Dodd [10]).
These authors show that there are functions which almost always make
the distribution of transformed keys random.
I will not consider any particular functions for $t$, although some
examples of $h$ will be introduced later.
.pp
To retrieve a key, $K$, from the hash table the transformed key and the
hash location are first computed.
If the (transformed) key stored at the hash location is less than
$t(K)$ then the table is searched upward until one of three things
happens: an empty location is found, $T[j]=-1$; the sought key is
found, $T[j]=t(K)$; or a key greater than the sought key is found,
$T[j]>t(K)$.
If the first key examined is greater than $t(K)$ then an analogous
search is done down the hash table.
The search is successful if the sought key is found, that is, if the
last location examined is equal to $t(K)$, and is unsuccessful
otherwise.
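.pp
To make the search concrete, the following sketch of mine renders it in
C (the paper's own algorithms are distributed in Pascal, as noted
above).
The names $T$, $t$, $h$, $lo$ and $hi$ follow the text; the toy table
size, the stand-in transform, the scaling hash and the hand-placed keys
are illustrative assumptions only.
.sp
.nf
.ls 1
#include <stdio.h>
#include <stdbool.h>

enum { M = 8, LO = -20, HI = M + 19 };     /* lo and hi as in the text */
#define EMPTY (-1L)

static long storage[HI - LO + 1];
static long *T;            /* T[LO..HI]; the breathing room stays empty */

static long t(long key) { return key; }    /* stand-in randomising transform */
static long h(long tk)  { return tk / 8; } /* order-preserving, 0..M-1 for keys 0..63 */

static bool blp_search(long key)
{
    long tk = t(key);
    long j  = h(tk);                 /* start at the hash location */
    if (T[j] < tk)                   /* empty slots are -1, so they stop a probe */
        while (T[j] != EMPTY && T[j] < tk) j++;  /* key, if stored, lies above */
    else
        while (T[j] != EMPTY && T[j] > tk) j--;  /* key, if stored, lies below */
    return T[j] == tk;               /* last location examined equals t(K)? */
}

int main(void)
{
    T = storage - LO;
    for (long i = LO; i <= HI; i++) T[i] = EMPTY;
    T[1] = 10; T[2] = 13; T[3] = 17; T[4] = 33;  /* sorted, gap-free, near home */
    printf("%d %d\n", blp_search(13), blp_search(20));  /* prints "1 0" */
    return 0;
}
.fi
.ls 2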
(See Amble and Knuth [1] for the full details of this search
algorithm.)
.pp
For a given set of keys there are many ways that they can be arranged
in $T$ so that the search algorithm above will still work correctly.
There is thus freedom, when designing an algorithm to insert new keys,
to choose different strategies for positioning the keys.
There are two conditions that must be satisfied when a new key is
inserted.
One is that all keys in the memory must remain in ascending order, and
the other is that there must be no empty locations between the original
hash location of any key and its actual storage position.
These imply that all keys sharing the same initial hash location must
form a single unbroken group.
.pp
Within these constraints one would like to insert a new key so as to
minimise later retrieval times and the time to do the insertion itself.
Intuitively, keys which share the same initial hash location should be
centered around that initial address.
There are two ways of inserting keys which cause little disturbance to
the memory.
One is to find the position where the key should be placed according to
its ordering and then to create a gap for it by moving
.ul
up
all entries from this position up to the next empty location.
The second way is symmetric to this and creates a gap by moving entries
.ul
down
one location.
The insertion algorithm given by Amble and Knuth [1] chooses which of
these two moves to make using a strategy which is guaranteed to
minimise the number of locations in $T$ examined during later
successful or unsuccessful searches, although it is not guaranteed to
minimise the insertion time itself.
.pp
One consequence of this insertion strategy is that it is sometimes
necessary to move entries below $0$ and above $M-1$ in the array $T$.
One solution would be to make the array circular and move entries from
$0$ to $M-1$ and vice versa.
However, following Amble and Knuth [1], I will instead extend the array
$T$, and the other arrays to be defined later, at their top and bottom.
This gives 'breathing room' for the array to expand.
An extra 20 entries at the top and bottom were found to be quite
sufficient for all the simulation runs reported in Section 4.
Accordingly I will define $lo ~=~-20$ and $hi~=~M+19$ and define the
array $T$ over the range $lo$ to $hi$.
.sh "3 Compact Hashing Using Bidirectional Linear Probing"
.pp
I will now show that the memory required to store the keys in BLP can
be significantly reduced.
First consider the case when the number of possible keys in $KK$ is
less than $M$: every possible key can then be assigned its own location
in $T$ without possibility of collision.
In this case $T$ degenerates to an ordinary indexed array and the keys
need never be stored.
At worst a single bit might be needed to say whether a particular key
has been stored or not.
This raises the question of whether it is necessary to hold the entire
key in memory when the key space $KK$ is only slightly larger than $M$.
For example, if $KK$ were, say, four times larger than $M$ then it
might be possible to hold only two bits of the key rather than the
entire key.
The reasoning here is that the main function of the stored keys is to
ensure that entries which collide at the same location can be correctly
separated.
Provided $h$ is suitably chosen, at most four keys can be mapped to a
single location.
The two bits might then be sufficient to store four different values
for these four keys.
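.pp
To make this arithmetic concrete, here is a small sketch of mine
(anticipating the remainder function defined formally in the next
section): splitting a key into an address and two stored bits loses no
information.
The names h4 and r4 are hypothetical.
.sp
.nf
.ls 1
#include <assert.h>

enum { RM = 4 };       /* key space assumed RM times larger than the table */

static long h4(long H) { return H / RM; } /* which location the key belongs to */
static long r4(long H) { return H % RM; } /* which of the at most RM colliders */

int main(void)
{
    for (long H = 0; H < 4096; H++)       /* the defining property:        */
        assert(h4(H) * RM + r4(H) == H);  /* (h4, r4) determines H exactly */
    return 0;
}
.fi
.ls 2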
It is in fact possible to realise this reduction in stored key size,
although a fixed amount of extra information is needed at each location
in order to correctly handle collisions.
.pp
So that I can talk about the part of the key which is in excess of the
address space I will now introduce a
.ul
remainder function
$r$.
$r$ maps from the transformed keys $HH$ to a set of remainders
$RR~==~"{"0,~1,~2,~...,~Rm-1"}"$.
It is these remainders that will be stored in lieu of the transformed
keys.
The essential property of $r$ is that $r(H)$ and $h(H)$ together are
sufficient to uniquely determine $H$.
.pp
.ne 9
Formally,
.sp
$RR ~~==~~ "{"0,~1,~2,~...,~Rm-1"}"$
.sp
$r: HH -> RR$
.sp
and $h( H sub 1 )~=~h( H sub 2 )~and~r( H sub 1 )~=~r( H sub 2 )~~
"\fBiff\fP" ~~ H sub 1 ~~=~~ H sub 2$ .
.sp
For a given function $h$ there are usually many possible functions $r$.
One particularly simple pair of functions, referred to by Maurer and
Lewis [11] as the
.ul
division method,
is $h(H)~~=~~ left floor^ H/Rm right floor$ and
$r(H)~~=~~ H~ "\fBmod\fP"~Rm$ .
.sp
When $r$ is defined as above and $Rm$ is between $2 sup d$ and
$2 sup { d+1 }$, the number of bits needed to specify a remainder is
the number of bits in the key less $d$.
.pp
Consider a new array $R [ LH ]$ into which the remainders will be
stored.
In what follows $R$ will be kept in place of $T$, but it will be useful
to talk about $T$ as if it were still there.
$R$ and the additional arrays to be introduced shortly specify just the
information in $T$, albeit more compactly.
Each value $R[i]$ will hold the value $r(T[i])$, with the exception
that when $T[i]$ is $-1$ (marking an empty location) then $R[i]$ is
also set to $-1$.
If there have been no collisions then each $R[i]$ paired with the value
$i$ unambiguously gives the transformed key that would have been stored
in $T[i]$.
However, if there have been collisions it is not possible to tell
whether a value $R[i]$ is at its home location or whether it has been
moved from, say, $i-1$ and corresponds to a key $H$ where
$r(H)~=~R[i]$ and $h(H)~=~i-1$.
If there were some way to locate for each $R[i]$ where it was
originally hashed, then the original keys could all be unambiguously
determined.
This can be done by maintaining two extra arrays of bits: the virgin
array $V$ and the change array $C$.
.pp
The virgin array $V [ LH ]$ records which locations have ever been
hashed to.
That is, $V[i]$ is $1$ if any of the keys stored in the hash table has
$i$ as its hash address, and $0$ otherwise.
$V$ is maintained by initialising it to $0$ and thereafter setting
$V[h(H)] <-~1$ whenever a key $H$ is inserted in the memory.
The virginity of a location is unaffected by the move operations during
insertion.
The $V$ array is similar to the array of pass bits recommended in [1].
.pp
To understand the change array $C [ LH ]$ it is necessary to look more
closely at the distribution of the values $R[i]$.
These remainders can be grouped according to whether or not they share
the same original hash address.
Recall also that the hash table, as in BLP, is ordered, so all the
remainders in a particular group will occur at consecutive locations.
The change bits $C[i]$ are used to delimit the boundaries of these
groups.
This is done by marking the first remainder of each group (the one
stored at the lowest address) with a $1$.
All other members of a group have $C[i]=0$.
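.pp
As a concrete illustration (my toy figures, not the paper's): take
$M~=~8$, the division method with $Rm~=~8$, and the four transformed
keys 10, 13, 17 and 33 stored at locations 1 to 4 as in Section 2.
Their remainders are 2, 5, 1 and 1, and their hash addresses are 1, 1,
2 and 4, giving
.sp
.nf
.ls 1
location i    0    1    2    3    4
R[i]         -1    2    5    1    1
V[i]          0    1    1    0    1
C[i]          .    1    0    1    1
.fi
.ls 2
.sp
$C[2]~=~0$ because locations 1 and 2 hold the one group with hash
address 1; the change bits of empty locations (shown as dots) are the
subject of the convention introduced next.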
To simplify the search and insertion algorithms it is also convenient
to set $C[i]$ to $1$ for all locations which are empty ($R[i]=-1$).
Thus we have the following formal definitions of the values of $V$ and
$C$ in terms of the now notional array $T$ (the array $A$ is described
later):
.sp
.EQ
R[i] ~~==~~ left { lpile { r(T[i]) above -1 } ~~~~ lpile { T[i] != -1 above T[i] = -1 }
.EN
.sp
.EQ
V[i] ~~==~~ left { lpile { 1 above 0 } ~~~~ lpile { EXIST ~j~~ h(T[j]) = i above roman otherwise }
.EN
.sp
.EQ
C[i] ~~==~~ left { lpile { 1 above 0 } ~~~~ lpile { T[i] = -1 ~roman or~ T[i-1] = -1 ~roman or~ h(T[i]) != h(T[i-1]) above roman otherwise }
.EN
.sp
.EQ
A[i] ~~==~~ left { lpile { a(i) above inf } ~~~~ lpile { -Na <= a(i) <= Na above roman otherwise }
.EN
.sp
where
.sp
.EQ
Na ~>=~ 0
.EN
.br
.EQ
a(i) ~~==~~ sum from j=lo to i [ C[j]=1 ~roman and~ R[j] != -1 ] ~-~ sum from j=lo to i V[j]
.EN
.sp
(here a bracketed condition has the value $1$ when it holds and $0$
otherwise).
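.pp
These definitions already suffice for a search, if a deliberately naive
one: since the groups appear in the same order as the hashed-to
addresses, the $k$-th non-empty group holds the remainders of the
$k$-th non-virgin address.
The sketch below (mine; the paper obtains the same effect from local
scans and the $A$ array rather than scanning from $lo$) continues the
toy figures above, in C.
.sp
.nf
.ls 1
#include <stdio.h>
#include <stdbool.h>

enum { M = 8, LO = -20, HI = M + 19, RM = 8 };
#define EMPTY (-1)

static int Rs[HI - LO + 1], Vs[HI - LO + 1], Cs[HI - LO + 1];
static int *R, *V, *C;         /* indexed LO..HI via the offsets in main */

static long h(long H) { return H / RM; }   /* division method */
static long r(long H) { return H % RM; }

/* Search for transformed key H using only R, V and C. */
static bool compact_search(long H)
{
    long i = h(H);
    if (!V[i]) return false;              /* no key was ever hashed to i */

    long k = 0;                           /* rank of i among hashed-to addresses */
    for (long j = LO; j <= i; j++) k += V[j];

    long s = LO - 1, g = 0;               /* find the k-th non-empty group start */
    while (g < k) { s++; if (C[s] == 1 && R[s] != EMPTY) g++; }

    for (long j = s; j == s || C[j] == 0; j++)  /* group ends at the next change bit */
        if (R[j] == r(H)) return true;
    return false;
}

int main(void)
{
    R = Rs - LO; V = Vs - LO; C = Cs - LO;
    for (long i = LO; i <= HI; i++) { R[i] = EMPTY; V[i] = 0; C[i] = 1; }
    R[1] = 2; R[2] = 5; R[3] = 1; R[4] = 1;   /* keys 10, 13, 17 and 33 */
    V[1] = 1; V[2] = 1; V[4] = 1;             /* hash addresses 1, 2, 4 */
    C[1] = 1; C[2] = 0; C[3] = 1; C[4] = 1;   /* empty slots keep C = 1 */
    printf("%d %d %d\n",
           compact_search(13),    /* 1: stored              */
           compact_search(21),    /* 0: collides but absent */
           compact_search(40));   /* 0: virgin hash address */
    return 0;
}
.fi
.ls 2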