Overview of lock-free concurrency
Anyone out there using locks for concurrency, go look
up lock-free methods. I finally read that paper on hazard
pointers for memory reclamation in lock-free systems. Wow, that
changes everything. Back to the drawing board.
Lock-free overview, as I know it so far:
- Every step of the way is patented. I've spotted patents by Sun (1999),
IBM, and Microsoft so far.
- There's a chip-supported atomic
operation, Compare And Swap (CAS). It looks like
CAS(ptr, old, new) {
    if (*ptr == old) {
        *ptr = new;
        return TRUE;
    } else {
        return FALSE;
    }
}
except it does it atomically.
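In C11 that primitive is available through <stdatomic.h>; here's a minimal wrapper (the name cas() and the node type are mine, reused in the sketches below):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct node node;

/* Succeeds, storing new into *ptr, iff *ptr still equals old. */
static bool cas(_Atomic(node *) *ptr, node *old, node *new)
{
    /* On failure atomic_compare_exchange_strong writes the value it
       saw into its second argument, so pass a local copy. */
    node *expected = old;
    return atomic_compare_exchange_strong(ptr, &expected, new);
}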
- Consider inserting B into a linked list A->C. First you set
B->C. Then you do CAS(A, C, B). If you're inserting B1 and
someone else is inserting B2, one of you will go first, and the
other's CAS will see A is no longer pointing to C and fail. So you
can insert into a linked list concurrently with no locks.
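A sketch of that insert using the cas() wrapper above (the node layout is my assumption; this ignores delete marks, which Harris's trick below adds):

struct node { _Atomic(node *) next; void *data; };

/* Insert b after a.  A failed CAS means a->next changed under us,
   so re-read it and retry. */
static void insert_after(node *a, node *b)
{
    for (;;) {
        node *c = atomic_load(&a->next);
        b->next = c;                /* set B->C first...    */
        if (cas(&a->next, c, b))    /* ...then CAS(A, C, B) */
            return;
    }
}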
- (Harris's trick) If your nodes are aligned to 4-byte boundaries,
the bottom two bits of pointers to nodes are always 0. You can mark a
pointer by setting its bottom bit, then CAS will see both the pointer
and the mark. Consider deleting B lock-free from A->B->C.
First mark B->C to say that B has been deleted, so B is actually
pointing to C' = C | 1. Others trying to insert after B will see that, so
CAS(B,C,X) will fail, and CAS(B,C',X) won't be attempted. The deleter
(or anyone else) can actually remove B from the list by doing CAS(A,B,C).
Another way to do it is to insert a special "deleting" node between
B and C. Then B no longer points to C, and everyone knows not to
insert after the "deleting" node. The deleting node acts just like a
marked pointer, except you need to allocate it, recycle it somehow,
traverse it etc. But at least you just have normal pointers.
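A sketch of the marked-pointer variant of that delete (macro names are mine; assumes node alignment leaves bit 0 free):

#include <stdint.h>

#define MARKED(p)    ((node *)((uintptr_t)(p) | 1))  /* C' = C | 1 */
#define UNMARKED(p)  ((node *)((uintptr_t)(p) & ~(uintptr_t)1))
#define IS_MARKED(p) ((uintptr_t)(p) & 1)

/* Delete b, which follows a.  Marking b->next makes concurrent
   inserts after b fail their CAS; then anyone may unlink b.  If the
   final CAS fails because a->next changed, a later traversal can
   retry the unlink. */
static void delete_after(node *a, node *b)
{
    node *c;
    do {
        c = atomic_load(&b->next);
        if (IS_MARKED(c)) break;               /* someone beat us to it */
    } while (!cas(&b->next, c, MARKED(c)));    /* CAS(B, C, C')         */
    cas(&a->next, b, UNMARKED(atomic_load(&b->next)));  /* CAS(A, B, C) */
}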
- (Hazard pointers) Memory reclamation looks tricky. Consider the
deleted B above. Suppose you successfully insert it into a free list.
But when you do the CAS(A,B,C) to pull it off the free list and reuse
it, what if someone else already did that, used it, deleted it again,
and now B is pointing at Q? This is the "ABA problem": CAS thinks
nothing changed, when really things changed and then changed back
to their original value.
Hazard pointers get around this. Ready? What you really want to
do is make sure the node you're freeing isn't referenced by anyone
else. You can do this, of course, by traversing every relevant
pointer in every other thread after you've made the node inaccessible.
Pain. To make things even more painful, you could walk all those
pointers in every other thread, put their values in a hash table
(paying the cost of hashing each one on top of the cost of looking
them all up), then look up the node you want to delete in that
hash table. Oh this is silly. But ... what if you delete more than
one node at once? If you can limit every thread to say 3 pointers to
freeable memory at any given time, and there's only 32 threads, that's
under 100 pointers to traverse. If you wait until you've got a list
of 1000 nodes to be deleted, you can build the 100-item hash table of
in-use pointer values, do 1000 O(1) lookups, and delete at least 900
of those nodes. Voila, no unnecessary restrictions, no reference
counts, amortized constant time checks per node being deleted. It's
wonderful. The pointers you need to traverse are called "hazard
pointers".
How does that help with the ABA problem? Back to reusing a node.
Point a hazard pointer at B before removing it. Then nobody can
recycle it under your feet.
For hash tables, I had found two other ways to get around the ABA
problem before reading about hazard pointers. One is to keep a
reference count and deleted-chain in each bucket, not really recycling
the deleted-chain until the reference count goes to 0. Reusing a node
in the freelist would require CASing the head of the freelist to null,
then CASing it to the next item in the freelist. If it had switched
away from null in the meantime then there's extra work. Another way
was to keep a reference count per bucket and two deleted-chains in
each bucket. When the reference count goes to 0 you toggle which
deleted-chain you can reuse nodes from.
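Back to the hazard pointers themselves, a rough sketch of the retire-and-scan step (the sizes, names, and layout are my assumptions; Michael's paper is more careful about how hazard pointers get published and validated):

#include <stdlib.h>

#define NTHREADS 32
#define NHAZARDS 3     /* pointers to freeable memory per thread    */
#define BATCH    1000  /* retired nodes to collect before scanning  */

/* Each thread publishes its in-use pointers here before using them. */
_Atomic(node *) hazard[NTHREADS][NHAZARDS];

/* Nodes this thread has unlinked but not yet freed. */
static _Thread_local node *retired[BATCH];
static _Thread_local int   nretired;

static void retire(node *n)
{
    retired[nretired++] = n;
    if (nretired < BATCH) return;

    /* Snapshot all (under 100) hazard pointers.  The paper puts them
       in a hash table for O(1) lookups; a linear scan is shown here
       for brevity. */
    node *live[NTHREADS * NHAZARDS];
    int nlive = 0;
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < NHAZARDS; i++) {
            node *p = atomic_load(&hazard[t][i]);
            if (p) live[nlive++] = p;
        }

    /* Free every retired node that no hazard pointer still covers. */
    int keep = 0;
    for (int j = 0; j < nretired; j++) {
        int hazardous = 0;
        for (int i = 0; i < nlive && !hazardous; i++)
            hazardous = (live[i] == retired[j]);
        if (hazardous) retired[keep++] = retired[j];
        else           free(retired[j]);
    }
    nretired = keep;
}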
- (Cursors) Hash table usage tends to have a lot of
build key;
if (!find(table, key)) {
    build data;
    insert(table, key, data);
}
Typically, the find is going to hash the key and do a table lookup,
then the insert is going to rehash the key and redo the table lookup.
To avoid doing 2x the needed amount of work, you can moderate all
access to the hash table with a "cursor" that remembers the last key,
hash, and table position. Cursors are a good idea even for
single-threaded hash tables. But for a multithreaded hash table with
hazard pointers, note that all pointers to freeable shared memory are
in the cursor. So you don't need special hazard pointers. All you
need is a chain of all the cursors on the table; you use the pointers
in the cursors as the hazard pointers.
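What such a cursor might carry (field names are mine):

/* One cursor per thread per table.  The node pointers double as this
   thread's hazard pointers; reclamation walks the chain of all
   cursors instead of a separate hazard array. */
typedef struct cursor {
    struct cursor  *next_cursor;  /* chain of all cursors on the table  */
    const void     *key;          /* last key looked up                 */
    uint32_t        hash;         /* its hash, so insert needn't rehash */
    _Atomic(node *) prev, cur;    /* position within the bucket's chain */
} cursor;

find() fills in the hash and position; an insert() through the same cursor skips straight to the remembered spot instead of rehashing and re-walking the bucket.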
- (Split-ordered hash tables) Shalev and Shavit ('03) had a good trick:
you can split a hash table bucket into two buckets if you order
the keys so that all the keys for the first bucket precede all the
keys for the second bucket. Since buckets may be split multiple
times, this has to be done recursively. This turns out to be the same
as ordering keys by their hash value with the bits reversed. To avoid
the race where a bucket is split, then someone inserts into the old
bucket a key that belongs at the head of the new bucket (which is
never told to point at it), they insert the new bucket itself into the
old list as a dummy node when they split.
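Split order is just the hash with its bits reversed; a standard 32-bit bit reversal (a sketch; the real scheme also reserves a bit to tell bucket dummy nodes from real nodes):

/* Ordering chains by split_order(h) keeps each bucket's keys in one
   contiguous run that splits cleanly when another hash bit is used. */
static uint32_t split_order(uint32_t h)
{
    h = ((h & 0x55555555) << 1) | ((h >> 1) & 0x55555555);  /* bits    */
    h = ((h & 0x33333333) << 2) | ((h >> 2) & 0x33333333);  /* pairs   */
    h = ((h & 0x0f0f0f0f) << 4) | ((h >> 4) & 0x0f0f0f0f);  /* nibbles */
    return (h << 24) | ((h & 0xff00) << 8) |                /* bytes   */
           ((h >> 8) & 0xff00) | (h >> 24);
}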
- (Dynamic Array) Due to limitations on the size of contiguous
chunks of memory, large arrays (like the array of buckets for a large
hash table) have to be represented by a tree of smaller (say 8k)
blocks. To avoid scanning, every block has a known number of leaves
in its subtree, so you can do arithmetic to determine which branch to
follow. If you want to track some property for the array as a whole,
like "are all buckets split?" or "how full is the array?", you can
store that info in the branches of the tree describing their subtree.
Different subtrees can be maintained concurrently; the only
interference would be in recording that info in their common root
path. (I don't think this is patented. It's been done this way since
Moses had to govern the tribes of Israel, if not before.)
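A sketch of that arithmetic (block size and field names are my assumptions):

#include <stdint.h>

/* A tree of 8k blocks standing in for one huge array.  per_child is
   the number of entries under each child subtree, so each step of the
   descent is one divide, no scanning.  Per-subtree summaries ("all
   buckets split?", "how full?") would also live here. */
typedef struct block {
    uint64_t per_child;   /* entries under each child; 1 at a leaf */
    union {
        struct block *child[8192 / sizeof(void *)];
        void         *entry[8192 / sizeof(void *)];
    } u;
} block;

static void *dynarray_get(block *b, uint64_t i)
{
    while (b->per_child > 1) {
        block *next = b->u.child[i / b->per_child];
        i %= b->per_child;
        b = next;
    }
    return b->u.entry[i];
}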
A design for a shrinking/growing lock-free hash table
This is preliminary. I haven't written the code yet, or fully
thought through insert/delete conflicts in the various phases, and
surely there's other stuff I've missed.
To keep hash table lookup fast,
we'll get it to the correct bucket with no if-statements. But since
we're shrinking and growing the hash table, there may be multiple
versions of the table at once! So we'll have a pointer to a guaranteed-correct
version, and lookup will use that pointer. That pointer can be
changed whenever an old and new version of the hash table are both complete.
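So lookup might look like this (a sketch; I'm assuming the current bucket-bit count travels with the tree):

/* The one guaranteed-correct version, flipped only when old and new
   versions are both complete. */
typedef struct table { block *tree; unsigned nbits; } table;
_Atomic(table *) live;

/* Straight-line path to the bucket head: one atomic load, then
   arithmetic through dynarray_get(), no if-statements. */
static node *lookup_bucket(uint32_t hash)
{
    table *t = atomic_load(&live);
    return (node *)dynarray_get(t->tree, hash >> (32 - t->nbits));
}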
To avoid dummy nodes for the new bucket, I'll have insert/delete
maintain both the old and new version. They'll check whether they need
to maintain the new version after they've updated the old one, so
there's no window where someone else splits the bucket and they fail to
maintain the new version because they don't know about it. And
since maintaining two versions is expensive, I'll also have every
insert/delete split 100 or so buckets (which needs to be done
anyhow, and makes the extra maintenance negligible in comparison).
Phases of the hash table and the cursors:
- OK: no resize is going on, only one version of the tree. Move to
NEW when the table needs to be resized.
- NEW: allocate a new tree. The dynamic array will keep track of
which subtrees still need some allocations. Move to FILL when the new
tree is completely allocated.
- FILL: Insert/delete fill the new tree while maintaining the old
tree. They fill all the buckets corresponding to the buckets in the
old dynamic array block this insert/delete hashed to, or some other
block's buckets if this one is being scanned already. The dynamic
array keeps track of which subtrees still need work. Move to FLIP
once all buckets in the new tree are filled.
- FLIP: Flip the lookup pointer from the old tree to the new tree.
Insert/delete continue to fill both the old and new tree, old first
new second. Move to DONE once all read cursors are in state FLIP.
(Insert/delete determines this by scanning all cursors. A hung read
could keep us here indefinitely.)
- DONE: Insert/delete are told to stop maintaining the old tree.
Move to FREE once all insert/delete cursors are in state DONE.
- FREE: Insert/delete nulls out the pointer to the old tree with
CAS(). If CAS() says we're the lucky winner who nulled out the old
tree, we have to free all the blocks of the old tree. Move to OK once
the pointer to the old tree is null.
- NONE (cursors only): Any cursor not currently executing a hash
table operation is in the NONE state. The insert/delete in charge of
freeing the old tree goes to the NONE state before it starts freeing.
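In code the phases might be a small enum, stored in the table and in each cursor (names are mine):

typedef enum phase {
    PH_OK,    /* one version, no resize in progress               */
    PH_NEW,   /* allocating the new tree                          */
    PH_FILL,  /* filling the new tree, maintaining the old one    */
    PH_FLIP,  /* lookups use the new tree; still maintaining both */
    PH_DONE,  /* old tree no longer maintained                    */
    PH_FREE,  /* CAS the old-tree pointer to null; winner frees   */
    PH_NONE   /* cursors only: not inside any table operation     */
} phase;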
Since a whole new tree is allocated, old and new buckets won't
overlap, so growing the hash table can append a low-order bit rather
than a high-order bit to the bucket number. Old bucket 0 will map to new 0 and
1, and old bucket 1 will map to new 2 and 3. That way the bucket can
be determined by the hash in split-order. The high bits of the hash
value can be used as-is for both bucket lookup and ordering within the
bucket. Current machine instructions make hashes aimed at high-order
bits slightly faster than ones aimed at low-order bits (for example,
multiplication might be a good enough hash function).
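Concretely, appending a low-order bit means old bucket i covers exactly new buckets 2i and 2i+1; a tiny check of that mapping:

#include <assert.h>

static void check_split(uint32_t h, unsigned nbits)
{
    uint32_t old_bucket = h >> (32 - nbits);        /* before growing */
    uint32_t new_bucket = h >> (32 - (nbits + 1));  /* after growing  */
    assert(new_bucket / 2 == old_bucket);  /* old i -> new 2i or 2i+1 */
}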
You need two flags per array block to track whether operations are
done on subblocks. Buckets are just a next pointer, and nodes are
just a next pointer and a data pointer. No dummy nodes, no reference
counts, no mapping the hash value to split-order, no extra info stored
per bucket. Delete still needs a bit per node (in the next pointer,
Harris's trick). CAS() is still needed all over the place.
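For the two per-block flags, something like (names are mine):

enum block_flags {
    BLK_ALLOCATED = 1,  /* NEW phase: this block's subtree is fully allocated     */
    BLK_FILLED    = 2   /* FILL phase: its buckets are all filled in the new tree */
};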