?? readme
字號:
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.18 2007/09/12 22:10:26 tgl Exp $This directory contains a correct implementation of Lehman and Yao'shigh-concurrency B-tree management algorithm (P. Lehman and S. Yao,Efficient Locking for Concurrent Operations on B-Trees, ACM Transactionson Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We alsouse a simplified version of the deletion logic described in Lanin andShasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).The Lehman and Yao algorithm and insertions-------------------------------------------We have made the following changes in order to incorporate the L&Y algorithminto Postgres:The requirement that all btree keys be unique is too onerous,but the algorithm won't work correctly without it. Fortunately, it isonly necessary that keys be unique on a single tree level, because L&Yonly use the assumption of key uniqueness when re-finding a key in aparent page (to determine where to insert the key for a split page).Therefore, we can use the link field to disambiguate multipleoccurrences of the same user key: only one entry in the parent levelwill be pointing at the page we had split. (Indeed we need not look atthe real "key" at all, just at the link field.) We can distinguishitems at the leaf level in the same way, by examining their links toheap tuples; we'd never have two items for the same heap tuple.Lehman and Yao assume that the key range for a subtree S is describedby Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parentpage. This does not work for nonunique keys (for example, if we haveenough equal keys to spread across several leaf pages, there *must* besome equal bounding keys in the first level up). Therefore we assumeKi <= v <= Ki+1 instead. A search that finds exact equality to abounding key in an upper tree level must descend to the left of thatkey to ensure it finds any equal keys in the preceding page. Aninsertion that sees the high key of its target page is equal to the keyto be inserted has a choice whether or not to move right, since the newkey could go on either page. (Currently, we try to find a page wherethere is room for the new key without a split.)Lehman and Yao don't require read locks, but assume that in-memorycopies of tree pages are unshared. Postgres shares in-memory buffersamong backends. As a result, we do page-level read locking on btreepages in order to guarantee that no record is modified while we areexamining it. This reduces concurrency but guaranteees correctbehavior. An advantage is that when trading in a read lock for awrite lock, we need not re-read the page after getting the write lock.Since we're also holding a pin on the shared buffer containing thepage, we know that buffer still contains the page and is up-to-date.We support the notion of an ordered "scan" of an index as well asinsertions, deletions, and simple lookups. A scan in the forwarddirection is no problem, we just use the right-sibling pointers thatL&Y require anyway. (Thus, once we have descended the tree to thecorrect start point for the scan, the scan looks only at leaf pagesand never at higher tree levels.) To support scans in the backwarddirection, we also store a "left sibling" link much like the "rightsibling". (This adds an extra step to the L&Y split algorithm: whileholding the write lock on the page being split, we also lock its formerright sibling to update that page's left-link. This is safe since nowriter of that page can be interested in acquiring a write lock on ourpage.) A backwards scan has one additional bit of complexity: afterfollowing the left-link we must account for the possibility that theleft sibling page got split before we could read it. So, we have tomove right until we find a page whose right-link matches the page wecame from. (Actually, it's even harder than that; see deletion discussionbelow.)Page read locks are held only for as long as a scan is examining a page.To minimize lock/unlock traffic, an index scan always searches a leaf pageto identify all the matching items at once, copying their heap tuple IDsinto backend-local storage. The heap tuple IDs are then processed whilenot holding any page lock within the index. We do continue to hold a pinon the leaf page, to protect against concurrent deletions (see below).In this state the scan is effectively stopped "between" pages, eitherbefore or after the page it has pinned. This is safe in the presence ofconcurrent insertions and even page splits, because items are never movedacross pre-existing page boundaries --- so the scan cannot miss any itemsit should have seen, nor accidentally return the same item twice. The scanmust remember the page's right-link at the time it was scanned, since thatis the page to move right to; if we move right to the current right-linkthen we'd re-scan any items moved by a page split. We don't similarlyremember the left-link, since it's best to use the most up-to-dateleft-link when trying to move left (see detailed move-left algorithm below).In most cases we release our lock and pin on a page before attemptingto acquire pin and lock on the page we are moving to. In a few placesit is necessary to lock the next page before releasing the current one.This is safe when moving right or up, but not when moving left or down(else we'd create the possibility of deadlocks).Lehman and Yao fail to discuss what must happen when the root pagebecomes full and must be split. Our implementation is to split theroot in the same way that any other page would be split, then constructa new root page holding pointers to both of the resulting pages (whichnow become siblings on the next level of the tree). The new root pageis then installed by altering the root pointer in the meta-data page (seebelow). This works because the root is not treated specially in anyother way --- in particular, searches will move right using its linkpointer if the link is set. Therefore, searches will find the datathat's been moved into the right sibling even if they read the meta-datapage before it got updated. This is the same reasoning that makes asplit of a non-root page safe. The locking considerations are similar too.When an inserter recurses up the tree, splitting internal pages to insertlinks to pages inserted on the level below, it is possible that it willneed to access a page above the level that was the root when it began itsdescent (or more accurately, the level that was the root when it read themeta-data page). In this case the stack it made while descending does nothelp for finding the correct page. When this happens, we find the correctplace by re-descending the tree until we reach the level one above thelevel we need to insert a link to, and then moving right as necessary.(Typically this will take only two fetches, the meta-data page and the newroot, but in principle there could have been more than one root splitsince we saw the root. We can identify the correct tree level by means ofthe level numbers stored in each page. The situation is rare enough thatwe do not need a more efficient solution.)Lehman and Yao assume fixed-size keys, but we must deal withvariable-size keys. Therefore there is not a fixed maximum number ofkeys per page; we just stuff in as many as will fit. When we split apage, we try to equalize the number of bytes, not items, assigned toeach of the resulting pages. Note we must include the incoming item inthis calculation, otherwise it is possible to find that the incomingitem doesn't fit on the split page where it needs to go!The deletion algorithm----------------------Before deleting a leaf item, we get a super-exclusive lock on the targetpage, so that no other backend has a pin on the page when the deletionstarts. This is not necessary for correctness in terms of the btree indexoperations themselves; as explained above, index scans logically stop"between" pages and so can't lose their place. The reason we do it is toprovide an interlock between non-full VACUUM and indexscans. Since VACUUMdeletes index entries before deleting tuples, the super-exclusive lockguarantees that VACUUM can't delete any heap tuple that an indexscanningprocess might be about to visit. (This guarantee works only for simpleindexscans that visit the heap in sync with the index scan, not for bitmapscans. We only need the guarantee when using non-MVCC snapshot rules suchas SnapshotNow, so in practice this is only important for system catalogaccesses.)Because a page can be split even while someone holds a pin on it, it ispossible that an indexscan will return items that are no longer stored onthe page it has a pin on, but rather somewhere to the right of that page.To ensure that VACUUM can't prematurely remove such heap tuples, we requirebtbulkdelete to obtain super-exclusive lock on every leaf page in the index,even pages that don't contain any deletable tuples. This guarantees thatthe btbulkdelete call cannot return while any indexscan is still holdinga copy of a deleted index tuple. Note that this requirement does not saythat btbulkdelete must visit the pages in any particular order. (See alsoon-the-fly deletion, below.) There is no such interlocking for deletion of items in internal pages,since backends keep no lock nor pin on a page they have descended past.Hence, when a backend is ascending the tree using its stack, it mustbe prepared for the possibility that the item it wants is to the left ofthe recorded position (but it can't have moved left out of the recordedpage). Since we hold a lock on the lower page (per L&Y) until we havere-found the parent item that links to it, we can be assured that theparent item does still exist and can't have been deleted. Also, becausewe are matching downlink page numbers and not data keys, we don't have anyproblem with possibly misidentifying the parent item.We consider deleting an entire page from the btree only when it's becomecompletely empty of items. (Merging partly-full pages would allow betterspace reuse, but it seems impractical to move existing data items left orright to make this happen --- a scan moving in the opposite directionmight miss the items if so. We could do it during VACUUM FULL, though.)Also, we *never* delete the rightmost page on a tree level (thisrestriction simplifies the traversal algorithms, as explained below).To delete an empty page, we acquire write lock on its left sibling (ifany), the target page itself, the right sibling (there must be one), andthe parent page, in that order. The parent page must be found using thesame type of search as used to find the parent during an insertion split.
?? 快捷鍵說明
復制代碼
Ctrl + C
搜索代碼
Ctrl + F
全屏模式
F11
切換主題
Ctrl + Shift + D
顯示快捷鍵
?
增大字號
Ctrl + =
減小字號
Ctrl + -