style="display:inline-block;width:300px;height:250px"
data-ad-client="ca-pub-5935214489160196"
data-ad-slot="8007533899">

Semaphores (Linus Torvalds)

http://yarchive.net/comp/linux/semaphores.html

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: NT kernel guy playing with Linux
Date: 27 Jun 1999 03:25:59 GMT

In article <7l3oan$mrj$1@wire.cadcamlab.org>,
Peter Samuelson wrote:

>I say this in the hope of sparing others the confusion I had about what
>a spinlock is. (It looked a lot like what CS calls a semaphore, so why
>don’t they just call it a semaphore, and what’s this semaphore_t…?)

If your CS courses didn’t tell you the difference between a semaphore
and a spinlock, your CS courses were bad (or just didn’t cover much
about concurrency, which is fairly common).

Blame your professors, don’t blame the Linux kernel code.

>A spinlock is a semaphore used for very short critical sections; while
>waiting for a spinlock to be released, the kernel sits on the CPU and
>does nothing.

A spinlock is a mutual exclusion mechanism, not a semaphore (a semaphore
is a very specific _kind_ of mutual exclusion).

But yes, you’re right in that it’s (by design) a busy-waiting one.
That’s why they are called “spinlocks” – they “spin” waiting for the
lock to go away.

> A `semaphore_t’ is a longer-term semaphore which has the
>advantage that the kernel puts the needy process to sleep and goes and
>does something else rather than busy-waiting.

A semaphore also has the ability to have more than one process enter the
critical region. Basically semaphores were (as far as I know) first
proposed by Dijkstra, and they explicitly imply a “sleep”/”wakeup”
behaviour, ie they are _not_ spinlocks. They originally had operations
called “P()” and “V()”, but nobody ever remembers whether P() was down()
or up(), so nobody uses those names any more. Dijkstra was probably a
bit heavy on drugs or something (I think the official explanation is
that P and V are the first letters in some Dutch words, but I personally
find the drug overdose story much more believable).

Also, unlike basic spinlocks, a semaphore has a “count” value: each
process that does a down() operation decrements the count if positive,
until the count would go negative. Only then do they sleep. The
original intent of this was to allow multiple entries to the region
protected by the semaphore: by initializing the count to 4, you allow
four down() operations and only the fifth one will block.

However, almost all practical use of semaphores is a special case where
the counter is initialized to 1, and where they are used as simple
mutual exclusion with only one user allowed in the critical region.
Such a semaphore is often called a “mutex” semaphore for MUTual
EXclusion.

I’ve never really seen anybody use the more complex case of semaphores,
although I do know of cases where it can be useful. For example, one use
of a more complex semaphore is as a “throttle”, where you do something
like this:

/* Maximum concurrent users */
#define MAX_CONCURRENT_USERS 20
struct semaphore sem;

init_sema(&sem, MAX_CONCURRENT_USERS);

and then each user does a down() on the semaphore before starting an
operation. It won’t block until you have 20 users – you’ve not created
a mutual exclusion, but you HAVE created a throttling mechanism. See?

Potentially useful, as I said, but the common case (and the only case
currently in use in the Linux kernel – even though the implementation
definitely can handle the general case) is certainly the mutex one.

>However, you need to grab a spinlock for the purpose of grabbing a
>semaphore_t, so for short critical sections a spinlock (even with the
>potential of busy-waiting) is more efficient.

Only in bad implementations or on bad hardware.

All reasonably modern CPU’s can do a semaphore without having grabbed a
spinlock. Often a spinlock is needed for the _contention_ case, but if
done right that is very rare. The contention case for a semaphore is
usually the very expensive one, because that’s when you have to
re-schedule etc.

So basically spinlocks are much simpler, and faster under short-lived
contention, so that’s why they tend to be used. Also, semaphores cannot
be used by interrupt handlers, as Linux doesn’t allow interrupt handlers
to sleep, so anything that protects interrupts needs to be a spinlock.

In addition to semaphores, there are other mutual exclusion notions, the
most popular being a “read-write” lock – something that requires
exclusive access for writers, but allows any number of readers at a
time. Linux has the spinning version of this, but not the blocking one.
We’ll probably add a blocking version some day, as it’s often very
useful, but it hasn’t been a major issue yet.

For example, the per-VM memory management semaphore could very usefully
be a blocking read-write lock, but without heavy thread contention a
mutex semaphore is basically equivalent.

> Since the usual practice
>is to make your critical sections as short as possible, the result is
>that the kernel uses a lot more spinlocks than semaphore_t’s.

True. Semaphores are really only useful around anything that does IO,
for example. When the potential contention period is multiple
milliseconds as opposed to nano- or microseconds, blocking operations
(ie operations that cause a re-schedule on contention) are the right way
to go.

Note that some people believe in a mix-and-match approach, where you
have a spinlock that gets upgraded to a semaphore if it waits too long.
Personally, I think that only makes sense if (a) you’re in user space
and don’t know what the scheduling rules are or (b) your locking is so
badly designed that you have tons of short-lived contention on your
semaphores, so you want to try the light approach first.

Linus

Date: Sun, 8 Apr 2001 20:08:13 -0700 (PDT)
From: Linus Torvalds
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Sun, 8 Apr 2001, Andrew Morton wrote:
>
> One issue with the current implementation is the (necessary)
> design where one reader (or writer) is sleeping in
> down_x_failed_biased(), with its bias in place, whereas
> all other sleepers are sleeping in down_x_failed(), with
> their bias not in place.

Why not do this the same way the _real_ semaphores do it?

You have a fast-path that has a single integer, and a slow path which has
more state but and protected by a spinlock. The _only_ worry I see is to
make sure that we properly have a “contention” state, because both readers
and writers need to be able to know when they should wake things up. But
there’s a _trivial_ contention state. Read on:

Forget about the BIAS stuff, and go back to the old simple “negative is
writer, positive is reader” implementation:

0 – unlocked
1..n – 1-n readers
0xc0000000 – writer
other <0 - contention

Do you see anything wrong with this?

Implementation:

- fast path:

down_read:
lock incl (%sem)
js __down_read_failed

down_write:
xorl %eax,%eax
movl $0xc0000000,%r
lock cmpxchgl %r,(%sem)
jne __down_write_failed

up_read:
lock decl (%sem)
js __up_read_wakeup

up_write:
lock andl $0x3fffffff,(%sem)
jne __up_write_wakeup

The above are all fairly obvious for the non-failure case, agreed?
Including the guarantee that _if_ there is contention, the "up()"
routines will always go to the slow path to handle the contention.

Now, the _slow_ case could be your generic "protected by a spinlock" code,
although I have a suggestion. As follows:

The only half-way "subtle" case is that __down_write_failed needs to make
sure that it marks itself as a contender (the down_read() fast-case code
will have done so already by virtue of incrementing the counter, which is
guaranteed to have resulted in a "contention" value.

While down_read() automatically gets a "contention value" on failure,
"down_write()" needs to do extra work. The extra work is not all that
expensive: the simplest thing is to just do

subl $0x8000,(%sem)

at the beginning - which will cause a "contention" value regardless of
whether it was pure-reader before (it goes negative by the simple
assumption that there will never be more than 32k concurrent readers), or
whether it was locked by a writer (it stays negative due to the 0x40000000
bit in the write lock logic). In both cases, both a up_write() and a
up_read() will notice that they need to handle the contention of a waiting
writer-to-be.

Would you agree that the above fast-path implementation guarantees that we
always get to the spinlock-protected slow path?

Now, the above _heavily_ favours readers. There's no question that the
above is unfair. I think that's ok. I think fairness and efficiency tend
to be at odds. But queuing theory shows that the faster you can get out of
the critical region, the smaller your queue will be - exponentially. So
speed is worth it.

Let's take an example at this point:

- lock is zero

- writer(1) gets lock: lock is 0xc0000000

- reader(2) tries to get in: lock becomes 0xc0000001, synchronizes on
spinlock.

- another writer(3) tries to get in: lock becomes 0xbfff8001, synchronizes
on spinlock.

- writer(1) does up_write: lock becomes 0x3fff8001, != 0, so writer decides
it needs to wake up, and synchronizes on spinlock.

- another reader(4) comes on on another CPU, increments, and notices that
it can do so without it being negative: lock becomes 0x3fff8002 and
this one does NOT synchronize on the spinlock.

End result: reader(4) "stole base" and actually got the lock without ever
seeing any contention, and we now have (1), (2) and (3) who are serialized
inside the spinlock.

So we get to what the serializers have to do, ie the slow path:

First, let's do the __up_read/write_wakeup() case, because that one is the
easy one. In fact, I think it ends up being the same function:

spin_lock(&sem->lock);
wake_up(&sem->waiters);
spin_unlock(&sem->lock);

and we’re all done. The only optimization here (which we should do for
regular semaphores too) is to use the same spinlock for the semaphore lock
and the wait-queue lock.

The above is fairly obviously correct, and sufficient: we have shown that
we’ll get here if there is contention, and the only other thing that the
wakup could sanely do would possibly be to select which process to wake.
Let’s not do that yet.

The harder case is __down_read/write_failed(). Here is my suggested
pseudo-code:

__down_write_failed(sem)
{
DECLARE_WAIT_QUEUE(wait, current);

lock subl $0x8000,(%sem) /* Contention marker */
spin_lock(&sem->lock);
add_wait_queue_exclusive(&sem->wait, &wait);
for (;;) {
unsigned int value, newvalue;

set_task_state(TASK_SLEEPING);
value = sem->value;

/*
* Ignore other pending writers: but if there is
* are pending readers or a write holder we should
* sleep
*/
if (value & 0xc0007fff) {
spin_unlock(&sem->lock);
schedule();
spin_lock(&sem->lock);
continue;
}

/*
* This also undoes our contention marker thing, while
* leaving other waiters contention markers in place
*/
newvalue = (value + 0x8000) | 0xc0000000;
if (lock_cmpxchg(sem->value, value, newvalue))
break; /* GOT IT! */

/* Damn, somebody else changed it from under us */
continue;
}
remove_wait_queue(&sem->wait, &wait);
spin_unlock(&sem->lock);
}

The down_read() slow case is equivalent, but ends up being much simpler
(because we don’t need to play with contention markers or ignore other
peoples contention markers):

__down_read_failed(sem)
{
DECLARE_WAIT_QUEUE(wait, current);

spin_lock(&sem->lock);
add_wait_queue(&sem->wait, &wait);
for (;;) {
set_task_state(TASK_SLEEPING);
/*
* Yah! We already did our “inc”, so if we ever see
* a positive value we’re all done.
*/
if (sem->value > 0)
break;
spin_unlock(&sem->lock);
schedule();
spin_lock(&sem->lock);
}
remove_wait_queue(&sem->wait, &wait);
spin_unlock(&sem->lock);
}

Can anybody shoot any holes in this? I haven’t actually tested it, but
race conditions in locking primitives are slippery things, and I’d much
rather have an algorithm we can _think_ about and prove to be working. And
I think the above one is provably correct.

Not that I want to go to that kind of extremes.

Anybody? Andrew? Mind giving the above a whirl on your testbed? Or can you
see some thinko in it without even testing?

Note in particular how the above keeps the non-contention “down_read()” /
“up_read()” cases as two single instructions, no slower than a spinlock.

(There are the following assumptions in the code: there are at most 32k
active readers, and also at most 32k pending writers. The limits come from
the writer contention marker logic. You have the 32 bits split up as:

– 2 bits “write lock owner” (it’s really only one bit, but the other bit
is set so that the writer contention marker won’t turn the most
negative number into a positive one, so we have one “extra” bit set to
keep the thing negative for the whole duration of a write lock)
– 15 “reader count” bits
– 15 “pending writer count” bits

and the 15 bits is why you have the 32k user limitation. I think it’s an
acceptable one – and if not you can expand on it by adding extra fields
that are only accessed within the spinlock).

Linus

Date: Sun, 8 Apr 2001 21:18:20 -0700 (PDT)
From: Linus Torvalds
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

The “down_writer_failed()” case was wrong:

On Sun, 8 Apr 2001, Linus Torvalds wrote:
>
> __down_write_failed(sem)
> {
> ….
> /*
> * Ignore other pending writers: but if there is
> * are pending readers or a write holder we should
> * sleep
> */
> if (value & 0xc0007fff) {
….

The “value & 0xc0007ffff” test is wrong, because it’s actually always true
for some normal contention case (multiple pending writers waiting for a
reader to release the lock). Because even if there is no write lock
holder, other pending writers (and we’re one) will have caused the high
bits to be set, so the above would end up causing us to think that we
can’t get the lock even if there is no real lock holder.

The comment is right, though. It’s just the test that is simplistic and
wrong.

The pending readers part is correct, and obvious enough:

(value & 0x7fff) != 0

implies that there are readers. In which case we should try again. Simple
enough.

The pending writers part is slightly more subtle: we know that there is at
least _one_ pending writer, namely us. It turns out that we must check the
two high bits, and the logic is:

– 11: only pending writers, no write lock holder (if we had a write lock
holder, he would have set the bits to 11, but a pending writer
would have borrowed from the lower bit, so you’d get bit pattern
10).
– 10: we have a real lock holder, and the pending writers borrowed from
the low lock bit when they did the “subl 0x8000″ to mark off
contention.
– 01: must not happen. BUG.
– 00: we had a real write lock holder, but he released the lock and
cleared both bits.

So the “is there a write lock holder” test basically becomes

(value & 0xc0000000) == 0x80000000

and the full test should be

if ((value & 0x7fff) || ((value & 0xc0000000) == 0x80000000) {
spin_unlock();
schedule();
spin_lock();
continue;
}

which might be rewritten some simpler way. I’m too lazy to think about it
even more.

For similar reasons, the “newvalue” calculation was subtly bogus: we must
make sure that we maintain the correct logic for the two upper bits in the
presense of _other_ pending writers. We can’t just do the unconditional
binary “or” operation to set the two upper bits, because then we’d break
the above rules if there are other pending writers. So the newvalue
calculation should be something along the lines of

/* Undo _our_ contention marker */
newvalue = value + 0x8000;

/* Get rid of stale writer lock bits */
newvalue &= 0x3fffffff;

/*
* If there were no other pending writers (newvalue == 0), we set
* both high bits, otherwise we only set bit 31.
* (see above on the “borrow bit being clear” logic).
*/
if (!newvalue)
newvalue = 0xc0000000;
newvalue |= 0x80000000;

And THEN I think the algorithm in the email I’m following up to should
actually really work.

Does anybody find any other details I missed?

And no, I have not yet actually _tested_ any of this. But all my code
works on the first try (or, in this case, second try if you want to be a
stickler for details).

No wonder we didn’t get this right first time through. It’s not really all
that horribly complicated, but the _details_ kill you.

Linus

Date: Mon, 9 Apr 2001 22:43:53 -0700 (PDT)
From: Linus Torvalds
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Tachino Nobuhiro wrote:
>
> I am not familiar with semaphore or x86, so this may not be correct,
> but if the following sequence is possible, the writer can call wake_up()
> before the reader calls add_wait_queue() and reader may sleep forever.
> Is it possible?

The ordering is certainly possible, but if it happens,
__down_read_failed() won’t actually sleep, because it will notice that the
value is positive and just return immediately. So it will do some
unnecessary work (add itself to the wait-queue only to remove itself
immediately again), but it will do the right thing.

Linus

Date: Tue, 10 Apr 2001 12:42:07 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, David Howells wrote:
>
> Here’s a patch that fixes RW semaphores on the i386 architecture. It is very
> simple in the way it works.

XADD only works on Pentium+.

That’s no problem if we make this SMP-specific – I doubt anybody actually
uses SMP on i486’s even if the machines exist, as I think they all had
special glue logic that Linux would have trouble with anyway. But the
advantages of being able to use one generic kernel that works on plain UP
i386 machines as well as SMP P6+ machines is big enough that I would want
to be able to say “CONFIG_X86_GENERIC” + “CONFIG_SMP”.

Even if it would be noticeably slower (ie a fallback to a spinlock might
be perfectly ok).

If you do this, I would suggest having asm-i386/{rwsem.h|rwsem-xadd.h},
and just having a

#ifndef CONFIG_XADD
#include
#else
#include
#endif

(And adding “CONFIG_XADD” to the list of generated optimization
configuration options in arch/i386/config.in, of course).

That way we don’t make the semaphore.h file even more unreadable.

Linus

Date: Tue, 10 Apr 2001 13:16:10 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Andi Kleen wrote:
>
> I guess 386 could live with an exception handler that emulates it.

That approach is fine, although I’d personally prefer to take the
exception just once and just rewrite the instuction as a “call”. The
places that need xadd would have to follow some strict guidelines (long
modrms or other instructions to pad out to enough size, and have the
arguments in fixed registers)

> (BTW an generic exception handler for CMPXCHG would also be very useful
> for glibc — currently it has special checking code for 386 in its mutexes)
> The 386 are so slow that nobody would probably notice a bit more slowness
> by a few exceptions.

Ehh. I find that the slower the machine is, the more easily I _notice_
that it is slow. So..

Linus

Date: Tue, 10 Apr 2001 17:55:09 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Weinehall wrote:
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn’t be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> But the code would be much slower, and it’s on 386’s and similarly
> slow beasts you need every cycle you can get, NOT on a Pentium IV.

Note that the “fixup” approach is not necessarily very painful at all,
from a performance standpoint (either on 386 or on newer CPU’s). It’s not
really that hard to just replace the instruction in the “undefined
instruction” handler by having strict rules about how to use the “xadd”
instruction.

For example, you would not actually fix up the xadd to be a function call
to something that emulates “xadd” itself on a 386. You would fix up the
whole sequence of “inline down_write()” with a simple call to an
out-of-line “i386_down_write()” function.

Note that down_write() on an old 386 is likely to be complicated enough
that you want to do it out-of-line anyway, so the code-path you take
(afetr the first time you’ve trapped on that particular location) would be
the one you would take for an optimized 386 kernel anyway. And similarly,
the restrictions you place on non-386-code to make it fixable are simple
enough that it probably shouldn’t make a difference for performance on
modern chips.

Linus

Date: Wed, 11 Apr 2001 11:41:06 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [PATCH] i386 rw_semaphores fix
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Howells wrote:
>
> > These numbers are infinity :)
>
> I know, but I think Linus may be happy with the resolution for the moment. It
> can be extended later by siphoning off excess quantities of waiters into a
> separate counter (as is done now) and by making the access count use a larger
> part of the variable.

I’m certainly ok with the could being limited to “thousands”. I don’t see
people being able to exploit it any practical way. But we should remember
to be careful: starting thousands of threads and trying to make them all
take page faults and overflowing the read counter would be a _nasty_
attack, It would probably not be particularly easy to arrange, but still.

Note that blocking locks are different from spinlocks: for spinlocks we
can get by with just 7 bits in a byte, and be guaranteed that that is
enough for any machine with less than 128 processors. For the blocking
locks, that is not true.

(Right now the default “max user processes” ulimit already protects us
from this exploit, I just wanted to make sure that people _think_ about
this).

So a 16-bit count is _fine_. And I could live with less.

We should remember the issue, though. If we ever decide to expand it, it
would be easy enough to make an alternative “rwsem-reallybig.h” that uses
cmpxchg8b instead, or whatever. You don’t need to start splitting the
counts up to expand them past 16 bits, you could have a simple system
where the _read_ locks only look at one (fast) 32-bit word for their fast
case, and only the write lock actually needs to use cmpxchg8b.

(I think it’s reasonable to optimize the read-lock more than the
write-lock: in the cases where you want to do rw-locking, the common
reason is because you really _want_ to allow many concurrent readers. That
also implies that the read case is the common one).

So you could fairly easily expand past 16-bit counters by using a 31-bit
counter for the reader, and making the high bit in the reader count be the
“contention” bit. Then the slow path (and the write path) would use the
64-bit operations offered by cmpxchg8b.

And yes, it’s a Pentium+ instruction (and this time I -checked- ;), but by
the time you worry about hundreds of thousands of threads I think you can
safely just say “you’d better be running on a big a modern machine”, and
just make the code conditional on CONFIG_PENTIUM+

So no real trickiness needed for expanding the number space, but certainly
also no real _reason_ for it at this time.

Linus

Date: Mon, 16 Apr 2001 10:05:57 -0700 (PDT)
From: Linus Torvalds
Subject: Re: rw_semaphores
Newsgroups: fa.linux.kernel

On Mon, 16 Apr 2001 yodaiken@fsmlabs.com wrote:
>
> I’m trying to imagine a case where 32,000 sharing a semaphore was anything but a
> major failure and I can’t. To me: the result of an attempt by the 32,768th locker
> should be a kernel panic. Is there a reasonable scenario where this is wrong?

Hint: “I’m trying to imagine a case when writing all zeroes to /etc/passwd
is anything but a major failure, but I can’t. So why don’t we make
/etc/passwd world-writable?”

Right. Security.

There is _never_ any excuse for panic’ing because of some inherent
limitation of the data structures. You can return -ENOMEM, -EAGAIN or
something like that, but you must _not_ allow a panic (or a roll-over,
which would just result in corrupted kernel data structures).

Note that the limit is probably really easy to work around even without
extending the number of bits: a sleeper that notices that the count is
even _halfway_ to rolling around could easily do something like:

– undo “this process” action
– sleep for 1 second
– try again from the beginning.

I certainly agree that no _reasonable_ pattern can cause the failure, but
we need to worry about people who are malicious. The above trivial
approach would take care of that, while not penalizing any non-malicious
users.

So I’m not worried about this at all. I just want people _always_ to think
about “how could I mis-use this if I was _truly_ evil”, and making sure it
doesn’t cause problems for others on the system.

(NOTE: This does not mean that the kernel has to do anything _reasonable_
under all circumstances. There are cases where Linux has decided that
“this is not something a reasonable program can do, and if you try to do
it, we’ll give you random results back – but they will not be _security_
holes”. We don’t need to be _nice_ to unreasonable requests. We just must
never panic, otherwise crash or allow unreasonable requests to mess up
_other_ people)

Linus

Date: Fri, 20 Apr 2001 10:46:01 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [andrea@suse.de: Re: generic rwsem [Re: Alpha “process table
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, David Howells wrote:
>
> The file should only be used for the 80386 and maybe early 80486’s where
> CMPXCHG doesn’t work properly, everything above that can use the XADD
> implementation.

Why are those not using the generic files? The generic code is obviously
more maintainable.

> But if you want it totally non-inline, then that can be done. However, whilst
> developing it, I did notice that that slowed things down, hence why I wanted
> it kept in line.

I want to keep the _fast_ case in-line.

I do not care at ALL about the stupid spinlock version. That should be the
_fallback_, and it should be out-of-line. It is always going to be the
slowest implementation, modulo bugs in architecture-specific code.

For i386 and i486, there is no reason to try to maintain a complex fast
case. The machines are unquestionably going away – we should strive to not
burden them unnecessarily, but we should _not_ try to save two cycles.

In short:
– the only case that _really_ matters for performance is the uncontended
read-lock for “reasonable” machines. A i386 no longer counts as
reasonable, and designing for it would be silly. And the write-lock
case is much less compelling.
– We should avoid any inlines where the inline code is >2* the
out-of-line code. Icache issues can overcome any cycle gains, and do
not show up well in benchmarks (benchmarks tend to have very hot
icaches). Note that this is less important for the out-of-line code in
another segment that doesn’t get brought into the icache at all for the
non-contention case, but that should still be taken _somewhat_ into
account if only because of kernel size issues.

Both of the above rules implies that the generic spin-lock implementation
should be out-of-line.

> (1) asm-i386/rwsem-spin.h is wrong, and can probably be replaced with the
> generic spinlock implementation without inconveniencing people much.
> (though someone has commented that they’d want this to be inline as
> cycles are precious on the slow 80386).

Icache is also precious on the 386, which has no L2 in 99% of all cases.
Make it out-of-line.

> (2) “fix up linux/rwsem-spinlock.h”: do you want the whole generic spinlock
> implementation made non-inline then?

Yes. People who care about performance _will_ have architecture-specific
inlines on architectures where they make sense (ie 99% of them).

Linus

Date: Fri, 20 Apr 2001 16:45:32 -0700 (PDT)
From: Linus Torvalds
Subject: Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, Andrea Arcangeli wrote:
>
> While dropping the list_empty check to speed up the fast path I faced the same
> complexity of the 2.4.4pre4 lib/rwsem.c and so before reinventing the wheel I
> read how the problem was solved in 2.4.4pre4.

I would suggest the following:

– the generic semaphores should use the lock that already exists in the
wait-queue as the semaphore spinlock.

– the generic semaphores should _not_ drop the lock. Right now it drops
the semaphore lock when it goes into the slow path, only to re-aquire
it. This is due to bad interfacing with the generic slow-path routines.

I suspect that this lock-drop is why Andrea sees problems with the
generic semaphores. The changes to “count” and “sleeper” aren’t
actually atomic, because we don’t hold the lock over them all. And
re-using the lock means that we don’t need the two levels of
spinlocking for adding ourselves to the wait queue. Easily done by just
moving the locking _out_ of the wait-queue helper functions, no?

– the generic semaphores are entirely out-of-line, and are just declared
universally as regular FASTCALL() functions.

The fast-path x86 code looks ok to me. The debugging stuff makes it less
readable than it should be, I suspect, and is probably not worth it at
this stage. The users of rw-semaphores are so well-defined (and so well
debugged) that the debugging code only makes the code harder to follow
right now.

Comments? Andrea? Your patches have looked ok, but I absolutely refuse to
see the non-inlined fast-path for reasonable x86 hardware..

Linus

Date: Sat, 21 Apr 2001 10:18:06 -0700 (PDT)
From: Linus Torvalds
Subject: Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem
Newsgroups: fa.linux.kernel

On Sat, 21 Apr 2001, Russell King wrote:
>
> Erm, spin_lock()? What if up_read or up_write gets called from interrupt
> context (is this allowed)?

Currently that is not allowed.

We allow it for regular semaphores, but not for rw-semaphores.

We may some day have to revisit that issue, but I suspect we won’t have
much reason to.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 21:43:03 UTC
Message-ID:
Original-Message-ID:

On Fri, 16 Dec 2005, Thomas Gleixner wrote:

> On Thu, 2005-12-15 at 21:32 +0100, Geert Uytterhoeven wrote:
> > > Why have the “MUTEX” part in there? Shouldn’t that just be DECLARE_SEM
> > > (oops, I mean DEFINE_SEM). Especially that MUTEX_LOCKED! What is that?
> > > How does a MUTEX start off as locked. It can’t, since a mutex must
> > > always have an owner (which, by the way, helped us in the -rt patch to
> > > find our “compat_semaphores”). So who’s the owner of a
> > > DEFINE_SEM_MUTEX_LOCKED?
> >
> > No one. It’s not really a mutex, but a completion.
>
> Well, then let us use a completion and not some semantically wrong
> workaround

It is _not_ wrong to have a semaphore start out in locked state.

For example, it makes perfect sense if the data structures that the
semaphore needs need initialization. The way you _should_ handle that is
to make the semaphore come up as locked, and the data structures in some
“don’t matter” state, and then the thing that initializes stuff can do so
properly and then release the semaphore.

Yes, in some cases such a locked semaphore is only used once, and ends up
being a “completion”, but that doesn’t invalidate the fact that this is
a perfectly fine way to handle a real issue.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 22:21:08 UTC
Message-ID:
Original-Message-ID:

On Fri, 16 Dec 2005, Thomas Gleixner wrote:
>
> Well, in case of a semaphore it is a semantically correct use case. In
> case of of a mutex it is not.

I disagree.

Think of “initialization” as a user. The system starts out initializing
stuff, and as such the mutex should start out being held. It’s that
simple. It _is_ mutual exclusion, with one user being the early bootup
state.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 18:43:42 UTC
Message-ID:
Original-Message-ID:

On Sun, 18 Dec 2005, Russell King wrote:
>
> On Sat, Dec 17, 2005 at 10:30:41PM -0800, Linus Torvalds wrote:
> > An interrupt can never change the value without changing it back, except
> > for the old-fashioned use of “up()” as a completion (which I don’t think
> > we do any more – we used to do it for IO completion a looong time ago).
>
> I doubt you can guarantee that statement, or has the kernel source
> been audited for this recently?

Well, _if_ it’s a noticeable performance win, we should just do it. We
already know that people don’t call “down()” in interrupts (it just
wouldn’t work), we can instrument “up()” too.

It’s easy enough to add a “might_sleep()” to the up(). Not strictly true,
but conceptually it would make sense to make up/down match in that sense.
We’d have to mark the (few) places that do down_trylock() + up() in
interrupt context with a special “up_in_interrupt()”, but that would be ok
even from a documentation standpoint.

> However, the real consideration is stability – if a semaphore was
> used for a completion and it was merged, would it be found and
> fixed? Probably not, because it won’t cause any problems on
> architectures where semaphores have atomic properties.

Actually, the reason we have completions is that using semaphores as
completions caused some really subtle problems that had nothing to do with
atomicity of the operations themselves, so if you find somebody who uses a
semaphore from an interrupt, I think we want to know about it.

Completions actually have another – and more important – property than the
fact that they have a more logical name for a particular usage.

The completion has “don’t touch me” guarantees. A thread or interrupt that
does an “up()” on a semaphore may still touch the memory that was
allocated for the semaphore after the “down()” part has been released.
And THAT was the reason for the completions: we allocate them on the stack
or in temporary allocations, and the thing that does the “down()” to wait
for something to finish will also do the _release_ of the memory.

With semaphores, that caused problems, because the side doing the “up()”
would thus possibly touch memory that got released from under it.

This problem happens only on SMP (since you need to have the downer and
the upper running at the same time), but it’s totally independent of the
other atomicity issues. And almost any semaphore that is used as a
completion from an interrupt will have this problem, so yes, if you find
somebody doing an “up()” in interrupt context, we’ll fix it.

It would be good to make the rules clear, that you can never touch a
semaphore from irq context without changing it back before you return.

Of course, that still leaves the following sequence

if (!down_trylock(..)) {
… do something ..
up(..);
}

which is actually used from interrupts too. At least the console layer
does that (printk() can happen from interrupts, and we do a down_trylock
on the console semaphore. But that one shouldn’t mess with the _count_,
although it does mean that the wait-queue preparation etc (for when the
fast case fails) does still need to be protected against interrupts.

But that would be the slow case, so from a performance standpoint, it
would still allow the case that really _matters_ to be done with
interrupts enabled.

> Unless of course sparse can be extended to detect the use of unbalanced
> semaphores in interrupt contexts.

In theory, yes, but in practice I’d much rather just do the stupid brute
force things.

> > (Of course, maybe it’s not worth it. It might not be a big performance
> > issue).
>
> Balancing the elimination of 4 instructions per semaphore operation,
> totalling about 4 to 6 cycles, vs stability I’d go for stability
> unless we can prove the above assertion via (eg) sparse.

I agree, if arm interrupt disables are fast. For example, on x86 (where
this isn’t needed, because you can have an “interrupt-safe” decrement by
just having it as a single instruction, even if it isn’t SMP-safe),
disabling and re-enabling interrupts is just one instruction each, but the
combination is usually something like 50+ cycles. So if this was an issue
on x86, we’d definitely care.

But if you don’t think it’s a big issue on ARM, it just doesn’t matter.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 19:55:47 UTC
Message-ID:
Original-Message-ID:

On Sun, 18 Dec 2005, James Bottomley wrote:
>
> Actually, I don’t think you want might_sleep(): there are a few cases
> where we do an up() from under a spinlock, which will spuriously trigger
> this. I’d suggest WARN_ON(in_interrupt()) instead.

Ahh, good point. Yes.

However, if even the arm people aren’t all that interested in this, maybe
it simply doesn’t matter. A lot of other architectures either have
“decrement in memory” or can already use ll/sc for it.

(of course, on some architectures, ll/sc is really really slow, so they
might well prefer using a normal load and store instead).

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:12:41 UTC
Message-ID:
Original-Message-ID:

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> in fact, generic mutexes are _more_ fair than struct semaphore in their
> wait logic. In the stock semaphore implementation, when a waiter is
> woken up, it will retry the lock, and if it fails, it goes back to the
> _tail_ of the queue again – waiting one full cycle again.

Ingo, I don’t think that is true.

It shouldn’t be true, at least. The whole point with the “sleeper” count
was to not have that happen. Of course, bugs happen, so I won’t guarantee
that’s actually true, but ..

If you are woken up as a waiter on a semaphore, you shouldn’t fail to get
it. You will be woken first, and nobody else will get at it, because the
count has been kept negative or zero even by the waiters, so that a
fast-path user shouldn’t be able to get the lock without going through the
slow path and adding itself (last) to the list.

But hey, somebody should test it with kernel threads that just do
down()/up() and some make-believe work in between to make sure there
really is contention, and count how many times they actually get the
semaphore. That code has been changed so many times that it may not work
the way it is advertized 😉

[ Oh. I’m looking at the semaphore code, and I realize that we have a
“wake_up(&sem->wait)” in the __down() path because we had some race long
ago that we fixed by band-aiding over it. Which means that we wake up
sleepers that shouldn’t be woken up. THAT may well be part of the
performance problem.. The semaphores are really meant to wake up just
one at a time, but because of that race hack they’ll wake up _two_ at a
time – once by up(), once by down().

That also destroys the fairness. Does anybody remember why it’s that
way? ]

Ho humm.. That’s interesting.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:57:13 UTC
Message-ID:
Original-Message-ID:

On Mon, 19 Dec 2005, Benjamin LaHaise wrote:
>
> The only thing I can see as an improvement that a mutex can offer over
> the current semaphore implementation is if we can perform the same
> optimization that spinlocks perform in the unlock operation: don’t use
> a locked, serialising instruction in the up() codepath. That might be
> a bit tricky to implement, but it’s definately a win on the P4 where the
> cost of serialisation can be quite high.

Good point. However, it really _is_ hard, because we also need to know if
the mutex was under contention. A spinlock doesn’t care, so we can just
overwrite the lock value. A mutex would always care, in order to know
whether it needs to do the slow wakeup path.

So I suspect you can’t avoid serializing the unlock path for a mutex. The
issue of “was there contention while I held it” fundamentally _is_ a
serializing question.

> > [ Oh. I’m looking at the semaphore code, and I realize that we have a
> > “wake_up(&sem->wait)” in the __down() path because we had some race long
> > ago that we fixed by band-aiding over it. Which means that we wake up
> > sleepers that shouldn’t be woken up. THAT may well be part of the
> > performance problem.. The semaphores are really meant to wake up just
> > one at a time, but because of that race hack they’ll wake up _two_ at a
> > time – once by up(), once by down().
> >
> > That also destroys the fairness. Does anybody remember why it’s that
> > way? ]
>
> History?

Oh, absolutely, I already checked the old BK history too, and that extra
wake_up() has been there at least since before we even started using BK.
So it’s very much historical, I’m just wondering if somebody remembers far
enough back that we’d know.

I don’t see why it’s needed (since we re-try the “atomic_add_negative()”
inside the semaphore wait lock, and any up() that saw contention should
have always been guaranteed to do a wakeup that should fill the race in
between that atomic_add_negative() and the thing going to sleep).

It may be that it is _purely_ historical, and simply isn’t needed. That
would be funny/sad, in the sense that we’ve had it there for years and
years 😉

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 20:14:24 UTC
Message-ID:
Original-Message-ID:

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> average cost per op: 206.59 usecs
> average cost per op: 512.13 usecs

(mutex vs semaphore).

That looks suspiciously like exactly double the cost, so I do believe that
the double wake_up() might be exactly what is going on.

However:

> hm, removing that wakeup quickly causes hung test-tasks.

So clearly it really is still hiding some bug.

> and even considering that the current semaphore implementation may have
> a fairness bug, i cannot imagine that making it more fair would also
> speed it up.

That’s not the point. The extra wakeup() in the semaphore code wakes up
two processes for every single up(), so the semaphores end up not just
being unfair, they also end up doing twice the work (because it will
result in the other processes effectively just doing the down() twice).

> I personally find the semaphore implementation clever but too complex,
> maybe that’s a reason why such bugs might be hiding there. (possibly
> for many years already …)

Oh, absolutely. It is too complex.

And don’t get me wrong: if it’s easier to just ignore the performance bug,
and introduce a new “struct mutex” that just doesn’t have it, I’m all for
it. However, if so, I do NOT want to do the unnecessary renaming. “struct
semaphore” should stay as “struct semaphore”, and we should not affect old
code in the _least_.

Then code can switch to “struct mutex” if people want to. And if one
reason for it ends up being that the code avoids a performance bug in the
process, all the better 😉

IOW, I really think this should be a series of small patches that don’t
touch old users of “struct semaphore” at all. None of this “semaphore” to
“arch_semaphore” stuff, and the new “struct mutex” would not re-use _any_
of the names that the old “struct semaphore” uses.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem,
add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 20:13:01 UTC
Message-ID:
Original-Message-ID:

On Tue, 20 Dec 2005, Russell King wrote:
>
> Also, Nico has an alternative idea for mutexes which does not
> involve decrementing or incrementing – it’s an atomic swap.
> That works out at about the same cycle count on non-Intel ARM
> CPUs as the present semaphore path. I’m willing to bet that
> it will be faster than the present semaphore path on Intel ARM
> CPUs.

Don’t be so sure, especially not in the future.

An atomic “swap” operation is, from a CPU design standpoint, fundamentally
more expensive that a “load + store”.

Now, most ARM architectures don’t notice this, because they are all
in-order, and not SMP-aware anyway. No suble memory ordering, no nothing.
Which is the only case when “swap” basically becomes a cheap “load+store”.

What I’m trying to say is that a plain “load + store” is almost always
going to be the best option in the long run.

It’s also almost certainly always the best option for UP + non-preempt,
for both present and future CPU’s. The reason is simply that a
microarchitecture will _always_ be optimized for that case, since it’s
pretty much by definition the common situation.

Is preemption even the common case on ARM? I’d assume not. Why are people
so interested in the preemption case? IOW, why don’t you just do

ldr lr,[%0]
subs lr, lr, %1
str lr,[%0]
blmi failure

as the _base_ timings, since that should be the common case. That’s the
drop-dead fastest UP case.

There’s an additional advantage of the regular load/store case: if some
CPU has scheduling issues, you can actually write this out as C code (with
an extra empty ASM to make sure that the compiler doesn’t move anything
out of the critical region). Again, that probably doesn’t matter on most
ARM chips, but in the general case it sure does matter.

(Btw, inlining _any_ of these except perhaps the above trivial case, is
probably wrong. None of the ARM chips tend to have caches all that big, I
bet).

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem,
add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 22:07:04 UTC
Message-ID:
Original-Message-ID:

On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> I mean…… what is it with mutexes that you dislike to the point of
> bending backward that far, and even after seeing the numbers, with such
> a semaphore implementation that _I_ even wouldn’t trust people to use
> correctly?

Quite frankly, what has disgusted me about this mutex discussion is the
totally specious arguments for the new mutexes that just rubs me entirely
the wrong way.

If it had _started_ with a mutex implementation that was faster, simpler,
and didn’t rename the old and working semaphores, I’d have been perfectly
fine with it.

As it is, the discussion has been pretty much everything but that.

And then people who argue about single cycles, end up dismissing the
single cycles when I argue that “ld+st” is faster – like you just did.

Be consistent, dammit. If single cycles matter, they matter. If they
don’t, then the existing code is better, since it’s existing and works.
You can’t have it both ways.

In other words: if people didn’t mix up issues that had nothing to do with
this into it, I’d be happier. I’ve already said that a mutex that does
_not_ replace semaphore (and doesn’t mess with naming) is acceptable.

We’ve done that before. But do it RIGHT, dammit. And don’t mix existing
semaphores into it (for example, completions didn’t change any old users).

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/8] mutex subsystem, add asm-generic/mutex-[dec|xchg].h
Date: Thu, 22 Dec 2005 23:58:36 UTC
Message-ID:
Original-Message-ID:

On Fri, 23 Dec 2005, Ingo Molnar wrote:
>
> add the two generic mutex fastpath implementations.

Now this looks more like it. This is readable code without any #ifdef’s in
the middle.

Now the only #ifdef’s seem to be for mutex debugging. Might it be
worthwhile to have a generic debugging, that just uses spinlocks and just
accept that it’s going to be slow, but shared across absolutely all
architectures?

Then you could have just doing a single

#ifdef CONFIG_MUTEX_DEBUG
# include
#else
# include
#endif

and have mutex-dbg.h just contain prototypes (no point in inlining them,
they’re going to be big anyway) and then have a

obj$(CONFIG_MUTEX_DEBUG) += mutex-debug.c

in the kernel/ subdirectory? That way you could _really_ have a clean
separation, with absolutely zero pollution of any architecture mess or
debugging #ifdef’s in any implementation code.

At that point I’d like to switch to mutexes just because the code is
cleaner!

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 08/19] mutex subsystem, core
Date: Tue, 03 Jan 2006 15:40:21 UTC
Message-ID:
Original-Message-ID:

On Tue, 3 Jan 2006, Ingo Molnar wrote:
> >
> > Is this an interrupt deadlock, or do you not allow interrupt contexts
> > to even trylock a mutex?
>
> correct, no irq contexts are allowed. This is also checked for if
> CONFIG_DEBUG_MUTEXES is enabled.

Note that semaphores are definitely used from interrupt context, and as
such you can’t replace them with mutexes if you do this.

The prime example is the console semaphore. See kernel/printk.c, look for
“down_trylock()”, and realize that they are all about interrupts.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Replace completions with semaphores
Date: Tue, 15 Apr 2008 17:00:38 UTC
Message-ID:

On Tue, 15 Apr 2008, Andi Kleen wrote:
>
> > – probably add support for completions to do counting
>
> But that’s just a semaphore, isn’t it?

Exactly. But the point here is:

– nobody should use semaphores anyway (use mutexes)
– making *more* code use semaphores is wrong
– completions have a different _mental_ model

IOW, this is not about implementation issues. It’s about how you think
about the operations.

We should _not_ implement completions as semaphores, simply because we
want to get *rid* of semaphores some day.

So rather than this long and involved patch series that first makes
semaphores generic, and then makes them be used as completions, I’d much
rather just skip this whole pointless exercise entirely.

Why have “generic semaphores” at all, if we want to get rid of them?

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Replace completions with semaphores
Date: Tue, 15 Apr 2008 18:20:47 UTC
Message-ID:

On Tue, 15 Apr 2008, Matthew Wilcox wrote:
>
> > In other words, what makes me not like this is hat we first turn
> > semaphores into the generic code (which is largely what completions were:
> > just a special case of the generic semaphores!) and then turns completions
> > into these things. That just doesn’t make any sense to me!
>
> Blame me for not realising that completions were semaphores under a
> different name.

The origin of completions is literally the semaphore code – just
simplified to use spinlocks and be usable as just a mutex. We used to use
semaphores, and because of the subtle race with lockless semaphores I
wrote that stupid completion code as a “generic semaphore with a very
specific usage scenario” and called them “completions”.

The completions _could_ have been extended/used as mutex semaphores, but
the difference was really the mental model for them. That then limited the
implementation of them: the functions working on completions are defined
on purpose to be limited – it doesn’t really have “up()” and “down()”
functions: “complete()” is really a up(), but “wait_for_completion()” is
more like a “wait_until_I_could_do_a_trydown()” function.

Would it make sense to use completions for countable events too? Yeah. In
fact, we have some things that really would like to do counting, both in
the sense of “wait for events to all complete” _and_ in the sense of
“allow up to events to be outstanding”. Both of which could be done as
a counting function (just make “complete” increment the counter, and then
make “wait for events” initialize it to negative, while “allow
outstanding events” would be a positive counter, and make
“wait_for_completion()” basically be a “decrement and wait until it
is zero”.

IOW, completions() really follow the same patterns as semaphores, and it
*does* make sense to just have one single code-base. But if we want to
make semaphores go away, I think that it would be better to implement
semaphores in terms of “extended completions” rather than the other way
around. That way, we could one day really get rid of semaphores entirely.

Linus