Then we update the side-links in the siblings, mark the target page
deleted, and remove the downlink from the parent, as well as the parent's
upper bounding key for the target (the one separating it from its right
sibling).  This causes the target page's key space to effectively belong
to its right sibling.  (Neither the left nor right sibling pages need to
change their "high key" if any; so there is no problem with possibly not
having enough space to replace a high key.)  The side-links in the target
page are not changed.

(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway.  So we simplify the algorithm by moving key space right.)

To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree.  Since we can't possibly do that
atomically, we forbid this case.  That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child.

When we delete the last remaining child of a parent page, we mark the
parent page "half-dead" as part of the atomic update that deletes the
child page.  This implicitly transfers the parent's key space to its right
sibling (which it must have, since we never delete the overall-rightmost
page of a level).  Searches ignore the half-dead page and immediately move
right.  We need not worry about insertions into a half-dead page ---
insertions into upper tree levels happen only as a result of splits of
child pages, and the half-dead page no longer has any children that could
split.
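The atomic deletion step above can be illustrated with a toy in-memory
model.  This is a hedged sketch only: the Page class, the
(child, upper_key) representation of the parent, and delete_page are
inventions of this illustration, not PostgreSQL structures; the real code
manipulates locked buffer pages and WAL-logs the change.

```python
class Page:
    """Toy page: just the side-links and a deleted flag (illustrative)."""
    def __init__(self, no, left=0, right=0):
        self.no, self.left, self.right = no, left, right
        self.deleted = False

def delete_page(pages, parent, target):
    """One atomic deletion step.  'parent' is a list of
    (child pageno, upper bounding key) pairs, rightmost key None.
    This sketch assumes the target has a right sibling under the
    same parent, per the rule stated in the text."""
    t = pages[target]
    if t.left:
        pages[t.left].right = t.right   # left sibling now skips the target
    pages[t.right].left = t.left        # right sibling points back past it
    t.deleted = True                    # target's own side-links unchanged
    i = next(i for i, (child, _) in enumerate(parent) if child == target)
    del parent[i]   # drops both the downlink and the upper bounding key,
                    # so the target's key space moves to its right sibling
```

For example, with leaf pages 2, 3, 4 and parent entries
[(2, 10), (3, 20), (4, None)], deleting page 3 leaves
[(2, 10), (4, None)], so the keyspace (10, 20] now falls to page 4.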
Therefore the page stays empty even when we don't have lock on it, and we
can complete its deletion in a second atomic action.

The notion of a half-dead page means that the key space relationship
between the half-dead page's level and its parent's level may be a little
out of whack: key space that appears to belong to the half-dead page's
parent on the parent level may really belong to its right sibling.  To
prevent any possible problems, we hold lock on the deleted child page
until we have finished deleting any now-half-dead parent page(s).  This
prevents any insertions into the transferred keyspace until the operation
is complete.  The reason for doing this is that a sufficiently large
number of insertions into the transferred keyspace, resulting in multiple
page splits, could propagate keys from that keyspace into the parent
level, resulting in transiently out-of-order keys in that level.  It is
thought that that wouldn't cause any serious problem, but it seems too
risky to allow.

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings).  These
processes must observe that the page is marked dead and recover
accordingly.  Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

Moving left in a backward scan is complicated because we must consider
the possibility that the left sibling was just split (meaning we must find
the rightmost page derived from the left sibling), plus the possibility
that the page we were just on has now been deleted and hence isn't in the
sibling chain at all anymore.  So the move-left algorithm becomes:

0. Remember the page we are on as the "original page".
1. Follow the original page's left-link (we're done if this is zero).
2. If the current page is live and its right-link matches the "original
   page", we are done.
3.
   Otherwise, move right one or more times looking for a live page whose
   right-link matches the "original page".  If found, we are done.  (In
   principle we could scan all the way to the right end of the index, but
   in practice it seems better to give up after a small number of tries.
   It's unlikely the original page's sibling split more than a few times
   while we were in flight to it; if we do not find a matching link in a
   few tries, then most likely the original page is deleted.)
4. Return to the "original page".  If it is still live, return to step 1
   (we guessed wrong about it being deleted, and should restart with its
   current left-link).  If it is dead, move right until a non-dead page
   is found (there must be one, since rightmost pages are never deleted),
   mark that as the new "original page", and return to step 1.

This algorithm is correct because the live page found by step 4 will have
the same left keyspace boundary as the page we started from.  Therefore,
when we ultimately exit, it must be on a page whose right keyspace
boundary matches the left boundary of where we started --- which is what
we need to be sure we don't miss or re-scan any items.

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.  We implement this by waiting until all
transactions that were running at the time of deletion are dead; which is
overly strong, but is simple to implement within Postgres.  When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than the oldest open transaction.  (NOTE: VACUUM FULL can reclaim
such pages immediately.)

Reclaiming a page doesn't actually change its state on disk --- we simply
record it in the shared-memory free space map, from which it will be
handed out the next time a new page is needed for a page split.
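The move-left steps can be simulated over a toy page table.  This is a
sketch under simplifying assumptions: pages is a plain dict of page
numbers to dicts, MAX_TRIES and the function name are made up for this
illustration, and all locking is ignored.

```python
MAX_TRIES = 4   # step 3: give up after a few right-moves (arbitrary here)

def move_left(pages, original):
    """Toy version of the move-left algorithm; 0 means 'no page'."""
    while True:
        left = pages[original]["left"]            # step 1
        if left == 0:
            return 0                              # no left sibling: done
        cur = left
        for _ in range(MAX_TRIES):                # steps 2 and 3
            p = pages[cur]
            if not p["dead"] and p["right"] == original:
                return cur                        # found the true left sibling
            cur = p["right"]
            if cur == 0:
                break
        # step 4: either we guessed wrong, or the original page is dead
        if not pages[original]["dead"]:
            continue                              # retry from its left-link
        while pages[original]["dead"]:            # rightmost pages are never
            original = pages[original]["right"]   # deleted, so this terminates
```

For instance, if the scan's page was just deleted, the loop recovers by
moving right to a live page and restarting from its left-link; if the left
sibling was just split, the right-moves in step 3 find the rightmost page
derived from it.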
The deleted page's contents will be overwritten by the split operation.
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)

Because we never delete the rightmost page of any level (and in particular
never delete the root), it's impossible for the height of the tree to
decrease.  After massive deletions we might have a scenario in which the
tree is "skinny", with several single-page levels below the root.
Operations will still be correct in this case, but we'd waste cycles
descending through the single-page levels.  To handle this we use an idea
from Lanin and Shasha: we keep track of the "fast root" level, which is
the lowest single-page level.  The meta-data page keeps a pointer to this
level as well as the true root.  All ordinary operations initiate their
searches at the fast root not the true root.  When we split a page that is
alone on its level or delete the next-to-last page on a level (both cases
are easily detected), we have to make sure that the fast root pointer is
adjusted appropriately.  In the split case, we do this work as part of the
atomic update for the insertion into the parent level; in the delete case
as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks).  This
avoids race conditions if two such operations are executing concurrently.

VACUUM needs to do a linear scan of an index to search for deleted pages
that can be reclaimed because they are older than all open transactions.
For efficiency's sake, we'd like to use the same linear scan to search for
deletable tuples.
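The fast-root rule itself is tiny.  As a hedged sketch (the
list-of-page-counts representation and the function name are inventions
here, not the actual metapage format), the fast root is simply the lowest
level containing exactly one page:

```python
def fast_root_level(pages_per_level):
    """pages_per_level[0] is the leaf level; the last entry is the true
    root level (always one page).  Searches can start at the lowest
    single-page level instead of the true root."""
    for lvl, npages in enumerate(pages_per_level):
        if npages == 1:
            return lvl
```

For a "skinny" tree with page counts [7, 3, 1, 1, 1] the fast root is
level 2, saving two useless descent steps; for a healthy tree such as
[5, 2, 1] the fast root coincides with the true root.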
Before Postgres 8.2, btbulkdelete scanned the leaf pages in index order,
but it is possible to visit them in physical order instead.  The tricky
part of this is to avoid missing any deletable tuples in the presence of
concurrent page splits: a page split could easily move some tuples from a
page not yet passed over by the sequential scan to a lower-numbered page
already passed over.  (This wasn't a concern for the index-order scan,
because splits always split right.)  To implement this, we provide a
"vacuum cycle ID" mechanism that makes it possible to determine whether a
page has been split since the current btbulkdelete cycle started.  If
btbulkdelete finds a page that has been split since it started, and has a
right-link pointing to a lower page number, then it temporarily suspends
its sequential scan and visits that page instead.  It must continue to
follow right-links and vacuum dead tuples until reaching a page that
either hasn't been split since btbulkdelete started, or is above the
location of the outer sequential scan.  Then it can resume the sequential
scan.  This ensures that all tuples are visited.  It may be that some
tuples are visited twice, but that has no worse effect than an inaccurate
index tuple count (and we can't guarantee an accurate count anyway in the
face of concurrent activity).  Note that this still works if the
has-been-recently-split test has a small probability of false positives,
so long as it never gives a false negative.  This makes it possible to
implement the test with a small counter value stored on each index page.

On-the-fly deletion of index tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple.  The
"known dead" marking works by setting the index item's lp_flags state
to LP_DEAD.
This is currently only done in plain indexscans, not bitmap scans,
because only plain scans visit the heap and index "in sync" and so
there's not a convenient way to do it for bitmap scans.

Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.  We separate the steps
because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock.  In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap.  This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples.  (However, the
same situation is created by REINDEX, since it doesn't enter dead
tuples into the index.)

It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple.  This is another reason why