Overview of lock-free concurrency
Anyone out there using locks for concurrency, go look
up lock-free methods. I finally read that paper on hazard
pointers for memory reclamation in lock-free systems. Wow, that
changes everything. Back to the drawing board.
Lock-free overview, as I know it so far:
- Every step of the way is patented. I've spotted patents by Sun (1999),
IBM, and Microsoft so far.
- There's a chip-supported atomic
operation, Compare And Swap (CAS). It looks like
CAS(ptr, old, new) {
    if (*ptr == old) {
        *ptr = new;
        return TRUE;
    } else {
        return FALSE;
    }
}
except it does it atomically.
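In C11 that primitive is available through <stdatomic.h>; here's a minimal wrapper (the name cas() and the node type are mine, reused in the sketches below):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct node node;

/* Succeeds, storing new into *ptr, iff *ptr still equals old. */
static bool cas(_Atomic(node *) *ptr, node *old, node *new)
{
    /* On failure atomic_compare_exchange_strong writes the value it
       saw into its second argument, so pass a local copy. */
    node *expected = old;
    return atomic_compare_exchange_strong(ptr, &expected, new);
}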
- Consider inserting B into a linked list A->C. First you set
B->C. Then you do CAS(A, C, B). If you're inserting B1 and
someone else is inserting B2, one of you will go first, and the
other's CAS will see A is no longer pointing to C and fail. So you
can insert into a linked list concurrently with no locks.
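A sketch of that insert using the cas() wrapper above (the node layout is my assumption; this ignores delete marks, which Harris's trick below adds):

struct node { _Atomic(node *) next; void *data; };

/* Insert b after a.  A failed CAS means a->next changed under us,
   so re-read it and retry. */
static void insert_after(node *a, node *b)
{
    for (;;) {
        node *c = atomic_load(&a->next);
        b->next = c;                /* set B->C first...    */
        if (cas(&a->next, c, b))    /* ...then CAS(A, C, B) */
            return;
    }
}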
- (Harris's trick) If your nodes are aligned to 4-byte boundaries,
the bottom two bits of pointers to nodes are always 0. You can mark a
pointer by setting its bottom bit, then CAS will see both the pointer
and the mark. Consider deleting B lock-free from A->B->C.
First mark B->C to say that B has been deleted, so B is actually
pointing to C' = C | 1. Others trying to insert after B will see that, so
CAS(B,C,X) will fail, and CAS(B,C',X) won't be attempted. The deleter
(or anyone else) can actually remove B from the list by doing CAS(A,B,C).
Another way to do it is to insert a special "deleting" node between
B and C. Then B no longer points to C, and everyone knows not to
insert after the "deleting" node. The deleting node acts just like a
marked pointer, except you need to allocate it, recycle it somehow,
traverse it etc. But at least you just have normal pointers.
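A sketch of the marked-pointer variant of that delete (macro names are mine; assumes node alignment leaves bit 0 free):

#include <stdint.h>

#define MARKED(p)    ((node *)((uintptr_t)(p) | 1))  /* C' = C | 1 */
#define UNMARKED(p)  ((node *)((uintptr_t)(p) & ~(uintptr_t)1))
#define IS_MARKED(p) ((uintptr_t)(p) & 1)

/* Delete b, which follows a.  Marking b->next makes concurrent
   inserts after b fail their CAS; then anyone may unlink b.  If the
   final CAS fails because a->next changed, a later traversal can
   retry the unlink. */
static void delete_after(node *a, node *b)
{
    node *c;
    do {
        c = atomic_load(&b->next);
        if (IS_MARKED(c)) break;               /* someone beat us to it */
    } while (!cas(&b->next, c, MARKED(c)));    /* CAS(B, C, C')         */
    cas(&a->next, b, UNMARKED(atomic_load(&b->next)));  /* CAS(A, B, C) */
}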
- (Hazard pointers) Memory reclamation looks tricky. Consider the
deleted B above. Suppose you successfully insert it into a free list.
But when you do the CAS(A,B,C) to pull it off the free list and reuse
it, what if someone else already did that, used it, deleted it again,
and now B is pointing at Q? This is the "ABA problem": CAS thinks
nothing changed, when really things changed and then changed back
to their original value.
Hazard pointers get around this. Ready? What you really want to
do is make sure the node you're freeing isn't referenced by anyone
else. You can do this, of course, by traversing every relevant
pointer in every other thread after you've made the node inaccessible.
Pain. To make things even more painful, you could walk all those
pointers in every other thread, put their values in a hash table
(paying the cost of hashing each one on top of the cost of looking
them all up), then look up the node you want to delete in that
hash table. Oh this is silly. But ... what if you delete more than
one node at once? If you can limit every thread to say 3 pointers to
freeable memory at any given time, and there's only 32 threads, that's
under 100 pointers to traverse. If you wait until you've got a list
of 1000 nodes to be deleted, you can build the 100-item hash table of
in-use pointer values, do 1000 O(1) lookups, and delete at least 900
of those nodes. Voila, no unnecessary restrictions, no reference
counts, amortized constant time checks per node being deleted. It's
wonderful. The pointers you need to traverse are called "hazard
pointers".
How does that help with the ABA problem? Back to reusing a node.
Point a hazard pointer at B before removing it. Then nobody can
recycle it under your feet.
For hash tables, I had found two other ways to get around the ABA
problem before reading about hazard pointers. One is to keep a
reference count and deleted-chain in each bucket, not really recycling
the deleted-chain until the reference count goes to 0. Reusing a node
in the freelist would require CASing the head of the freelist to null,
then CASing it to the next item in the freelist. If it had switched
away from null in the meantime then there's extra work. Another way
was to keep a reference count per bucket and two deleted-chains in
each bucket. When the reference count goes to 0 you toggle which
deleted-chain you can reuse nodes from.
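Back to the hazard pointers themselves, a rough sketch of the retire-and-scan step (the sizes, names, and layout are my assumptions; Michael's paper is more careful about how hazard pointers get published and validated):

#include <stdlib.h>

#define NTHREADS 32
#define NHAZARDS 3     /* pointers to freeable memory per thread    */
#define BATCH    1000  /* retired nodes to collect before scanning  */

/* Each thread publishes its in-use pointers here before using them. */
_Atomic(node *) hazard[NTHREADS][NHAZARDS];

/* Nodes this thread has unlinked but not yet freed. */
static _Thread_local node *retired[BATCH];
static _Thread_local int   nretired;

static void retire(node *n)
{
    retired[nretired++] = n;
    if (nretired < BATCH) return;

    /* Snapshot all (under 100) hazard pointers.  The paper puts them
       in a hash table for O(1) lookups; a linear scan is shown here
       for brevity. */
    node *live[NTHREADS * NHAZARDS];
    int nlive = 0;
    for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < NHAZARDS; i++) {
            node *p = atomic_load(&hazard[t][i]);
            if (p) live[nlive++] = p;
        }

    /* Free every retired node that no hazard pointer still covers. */
    int keep = 0;
    for (int j = 0; j < nretired; j++) {
        int hazardous = 0;
        for (int i = 0; i < nlive && !hazardous; i++)
            hazardous = (live[i] == retired[j]);
        if (hazardous) retired[keep++] = retired[j];
        else           free(retired[j]);
    }
    nretired = keep;
}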
- (Cursors) Hash table usage tends to have a lot of
build key;
if (!find(table, key)) {
    build data;
    insert(table, key, data);
}
Typically, the find is going to hash the key and do a table lookup,
then the insert is going to rehash the key and redo the table lookup.
To avoid doing 2x the needed amount of work, you can moderate all
access to the hash table with a "cursor" that remembers the last key,
hash, and table position. Cursors are a good idea even for
single-threaded hash tables. But for a multithreaded hash table with
hazard pointers, note that all pointers to freeable shared memory are
in the cursor. So you don't need special hazard pointers. All you
need is a chain of all the cursors on the table; you use the pointers
in the cursors as the hazard pointers.
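What such a cursor might carry (field names are mine):

/* One cursor per thread per table.  The node pointers double as this
   thread's hazard pointers; reclamation walks the chain of all
   cursors instead of a separate hazard array. */
typedef struct cursor {
    struct cursor  *next_cursor;  /* chain of all cursors on the table  */
    const void     *key;          /* last key looked up                 */
    uint32_t        hash;         /* its hash, so insert needn't rehash */
    _Atomic(node *) prev, cur;    /* position within the bucket's chain */
} cursor;

find() fills in the hash and position; an insert() through the same cursor skips straight to the remembered spot instead of rehashing and re-walking the bucket.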
- (Split-ordered hash tables) Shalev and Shavit ('03) had a good trick:
you can split a hash table bucket into two buckets if you order
the keys so that all the keys for the first bucket precede all the
keys for the second bucket. Since buckets may be split multiple
times, this has to be done recursively. This turns out to be the same
as ordering keys by their hash value with the bits reversed. To avoid
the race where a bucket is split, then someone inserts into the old
bucket a key that belongs at the head of the new bucket (which is
never told to point at it), they insert the new bucket itself into the
old list as a dummy node when they split.
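Split order is just the hash with its bits reversed; a standard 32-bit bit reversal (a sketch; the real scheme also reserves a bit to tell bucket dummy nodes from real nodes):

/* Ordering chains by split_order(h) keeps each bucket's keys in one
   contiguous run that splits cleanly when another hash bit is used. */
static uint32_t split_order(uint32_t h)
{
    h = ((h & 0x55555555) << 1) | ((h >> 1) & 0x55555555);  /* bits    */
    h = ((h & 0x33333333) << 2) | ((h >> 2) & 0x33333333);  /* pairs   */
    h = ((h & 0x0f0f0f0f) << 4) | ((h >> 4) & 0x0f0f0f0f);  /* nibbles */
    return (h << 24) | ((h & 0xff00) << 8) |                /* bytes   */
           ((h >> 8) & 0xff00) | (h >> 24);
}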
- (Dynamic Array) Due to limitations on the size of contiguous
chunks of memory, large arrays (like the array of buckets for a large
hash table) have to be represented by a tree of smaller (say 8k)
blocks. To avoid scanning, every block has a known number of leaves
in its subtree, so you can do arithmetic to determine which branch to
follow. If you want to track some property for the array as a whole,
like "are all buckets split?" or "how full is the array?", you can
store that info in the branches of the tree describing their subtree.
Different subtrees can be maintained concurrently; the only
interference would be in recording that info in their common root
path. (I don't think this is patented. It's been done this way since
Moses had to govern the tribes of Israel, if not before.)
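A sketch of that arithmetic (block size and field names are my assumptions):

#include <stdint.h>

/* A tree of 8k blocks standing in for one huge array.  per_child is
   the number of entries under each child subtree, so each step of the
   descent is one divide, no scanning.  Per-subtree summaries ("all
   buckets split?", "how full?") would also live here. */
typedef struct block {
    uint64_t per_child;   /* entries under each child; 1 at a leaf */
    union {
        struct block *child[8192 / sizeof(void *)];
        void         *entry[8192 / sizeof(void *)];
    } u;
} block;

static void *dynarray_get(block *b, uint64_t i)
{
    while (b->per_child > 1) {
        block *next = b->u.child[i / b->per_child];
        i %= b->per_child;
        b = next;
    }
    return b->u.entry[i];
}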
A design for a shrinking/growing lock-free hash table
This is preliminary. I haven't written the code yet, or fully
thought through insert/delete conflicts in the various phases, and
surely there's other stuff I've missed.
To keep hash table lookup fast,
we'll get it to the correct bucket with no if-statements. But since
we're shrinking and growing the hash table, there may be multiple
versions of the table at once! So we'll have a pointer to a guaranteed-correct
version, and lookup will use that pointer. That pointer can be
changed whenever an old and new version of the hash table are both complete.
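So lookup might look like this (a sketch; I'm assuming the current bucket-bit count travels with the tree):

/* The one guaranteed-correct version, flipped only when old and new
   versions are both complete. */
typedef struct table { block *tree; unsigned nbits; } table;
_Atomic(table *) live;

/* Straight-line path to the bucket head: one atomic load, then
   arithmetic through dynarray_get(), no if-statements. */
static node *lookup_bucket(uint32_t hash)
{
    table *t = atomic_load(&live);
    return (node *)dynarray_get(t->tree, hash >> (32 - t->nbits));
}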
To avoid dummy nodes for the new bucket, I'll have insert/delete
maintain both the old and new version. They'll check whether they need
to maintain the new version after they've updated the old one, so
there's no window where someone else splits the bucket and they fail to
maintain the new version because they don't know about it. And
since maintaining two versions is expensive, I'll also have every
insert/delete split 100 or so buckets (which needs to be done
anyhow, and makes the extra maintenance negligible in comparison).
Phases of the hash table and the cursors:
- OK: no resize is going on, only one version of the tree. Move to
NEW when the table needs to be resized.
- NEW: allocate a new tree. The dynamic array will keep track of
which subtrees still need some allocations. Move to FILL when the new
tree is completely allocated.
- FILL: Insert/delete fill the new tree while maintaining the old
tree. They fill all the buckets corresponding to the buckets in the
old dynamic array block this insert/delete hashed to, or some other
block's buckets if this one is being scanned already. The dynamic
array keeps track of which subtrees still need work. Move to FLIP
once all buckets in the new tree are filled.
- FLIP: Flip the lookup pointer from the old tree to the new tree.
Insert/delete continue to fill both the old and new tree, old first
new second. Move to DONE once all read cursors are in state FLIP.
(Insert/delete determines this by scanning all cursors. A hung read
could keep us here indefinitely.)
- DONE: Insert/delete are told to stop maintaining the old tree.
Move to FREE once all insert/delete cursors are in state DONE.
- FREE: Insert/delete nulls out the pointer to the old tree with
CAS(). If CAS() says we're the lucky winner who nulled out the old
tree, we have to free all the blocks of the old tree. Move to OK once
the pointer to the old tree is null.
- NONE (cursors only): Any cursor not currently executing a hash
table operation is in the NONE state. The insert/delete in charge of
freeing the old tree goes to the NONE state before it starts freeing.
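In code the phases might be a small enum, stored in the table and in each cursor (names are mine):

typedef enum phase {
    PH_OK,    /* one version, no resize in progress               */
    PH_NEW,   /* allocating the new tree                          */
    PH_FILL,  /* filling the new tree, maintaining the old one    */
    PH_FLIP,  /* lookups use the new tree; still maintaining both */
    PH_DONE,  /* old tree no longer maintained                    */
    PH_FREE,  /* CAS the old-tree pointer to null; winner frees   */
    PH_NONE   /* cursors only: not inside any table operation     */
} phase;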
Since a whole new tree is allocated, old and new buckets won't
overlap, so growing the hash table can append a low-order bit rather
than a high-order bit to the bucket number. Old bucket 0 will map to new 0 and
1, and old bucket 1 will map to new 2 and 3. That way the bucket can
be determined by the hash in split-order. The high bits of the hash
value can be used as-is for both bucket lookup and ordering within the
bucket. Current machine instructions make hashes aimed at high-order
bits slightly faster than ones aimed at low-order bits (for example,
multiplication might be a good enough hash function).
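Concretely, appending a low-order bit means old bucket i covers exactly new buckets 2i and 2i+1; a tiny check of that mapping:

#include <assert.h>

static void check_split(uint32_t h, unsigned nbits)
{
    uint32_t old_bucket = h >> (32 - nbits);        /* before growing */
    uint32_t new_bucket = h >> (32 - (nbits + 1));  /* after growing  */
    assert(new_bucket / 2 == old_bucket);  /* old i -> new 2i or 2i+1 */
}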
You need two flags per array block to track whether operations are
done on subblocks. Buckets are just a next pointer, and nodes are
just a next pointer and a data pointer. No dummy nodes, no reference
counts, no mapping the hash value to split-order, no extra info stored
per bucket. Delete still needs a bit per node (in the next pointer,
Harris's trick). CAS() is still needed all over the place.
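For the two per-block flags, something like (names are mine):

enum block_flags {
    BLK_ALLOCATED = 1,  /* NEW phase: this block's subtree is fully allocated     */
    BLK_FILLED    = 2   /* FILL phase: its buckets are all filled in the new tree */
};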