Multithreading support in memcached

OVERVIEW

By default, memcached is compiled as a single-threaded application. This is
the most CPU-efficient mode of operation, and it is appropriate for memcached
instances running on single-processor servers or whose request volume is
low enough that available CPU power is not a bottleneck.

More heavily used memcached instances can benefit from multithreaded mode.
To enable it, pass the "--enable-threads" option to the configure script:

    ./configure --enable-threads

You must have the POSIX thread functions (pthread_*) on your system in order
to use memcached's multithreaded mode.

Once you have a thread-capable memcached executable, you can control the
number of threads using the "-t" option; the default is 4. On a machine
that's dedicated to memcached, you will typically want one thread per
processor core. Because memcached's architecture is nonblocking, there is no
real advantage to running more threads than the machine has cores; doing so
increases lock contention and is likely to degrade performance.


INTERNALS

The threading support is mostly implemented as a series of wrapper functions
that protect calls to underlying code with one of a small number of locks.
In single-threaded mode, the wrappers are replaced with direct invocations
of the target code using #define; that is done in memcached.h. This approach
allows memcached to be compiled in either single- or multi-threaded mode.
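
As a rough illustration of the wrapper pattern described above, here is a
minimal sketch. All of the names in it (store_item, do_store_item,
mt_store_item, store_lock, and the USE_THREADS macro) are assumptions made
for the example; they are not copied from memcached.h.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the underlying, lock-unaware code. */
static size_t stored_items = 0;

static size_t do_store_item(void) {
    return ++stored_items;
}

#ifdef USE_THREADS
/* Hypothetical lock guarding the stand-in state above. */
static pthread_mutex_t store_lock = PTHREAD_MUTEX_INITIALIZER;

/* Threaded build: the wrapper takes the lock around the real call. */
static size_t mt_store_item(void) {
    size_t n;
    pthread_mutex_lock(&store_lock);
    n = do_store_item();
    pthread_mutex_unlock(&store_lock);
    return n;
}
#define store_item() mt_store_item()
#else
/* Single-threaded build: the wrapper name maps straight to the target. */
#define store_item() do_store_item()
#endif
```

Callers always write store_item(); whether that call takes a lock is decided
at compile time, so the single-threaded build pays no locking cost.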

Each thread has its own instance of libevent ("base" in libevent terminology).
The only direct interaction between threads is for new connections. One of
the threads handles the TCP listen socket; each new connection is passed to
a different thread on a round-robin basis. After that, each thread operates
on its set of connections as if it were running in single-threaded mode,
using libevent to manage nonblocking I/O as usual.
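
The round-robin choice of a target thread can be sketched as follows. This
is only an illustration: struct dispatcher and next_worker are hypothetical
names, and the actual hand-off of the connection to the chosen thread is
left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatcher state; the names are illustrative only. */
struct dispatcher {
    size_t nthreads;  /* number of worker threads */
    size_t next;      /* index of the worker that gets the next connection */
};

/* Return the worker index for a new connection, then advance the cursor. */
size_t next_worker(struct dispatcher *d) {
    size_t chosen;
    assert(d->nthreads > 0);
    chosen = d->next;
    d->next = (d->next + 1) % d->nthreads;
    return chosen;
}
```

With three workers and the cursor starting at 0, successive calls return
0, 1, 2 and then wrap back to 0.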

UDP requests are a bit different, since there is only one UDP socket that's
shared by all clients. The UDP socket is monitored by all of the threads.
When a datagram comes in, all the threads that aren't already processing
another request will receive "socket readable" callbacks from libevent.
Only one thread will successfully read the request; the others will go back
to sleep or, in the case of a very busy server, will read whatever other
UDP requests are waiting in the socket buffer. On a moderately busy server
this increases CPU consumption, because threads are repeatedly woken only to
find no input waiting for them. Avoiding that would require much more
invasive changes to the I/O code.
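
The "only one thread wins" behavior follows from nonblocking reads: a thread
that is woken after the datagram has already been consumed simply sees
"would block" and returns to its event loop. The sketch below shows that
read pattern in isolation. The helper names set_nonblocking and
try_read_datagram are hypothetical, and the code is not taken from
memcached's I/O path.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read one datagram without blocking.
 * Returns 1 and stores the length in *nread if a datagram was read,
 * 0 if nothing was waiting (for example, another reader got there first),
 * or -1 on any other error.
 */
int try_read_datagram(int fd, char *buf, size_t len, ssize_t *nread) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n >= 0) {
        *nread = n;
        return 1;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;
    return -1;
}
```

If two readers are woken for a single datagram, the first call returns 1 and
the second returns 0. The second wake-up is the wasted work described above.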


TO DO

The locking is currently very coarse-grained. There is, for example, one
lock that protects all the calls to the hashtable-related functions. Since
memcached spends much of its CPU time on command parsing and response
assembly, rather than managing the hashtable per se, this is not a huge
bottleneck for small numbers of processors. However, the locking will likely
have to be refined if memcached needs to run well on massively parallel
machines.

One cheap optimization to reduce contention on that lock: move the hash value
computation so it occurs before the lock is obtained whenever possible.
Right now the hash is computed at the lowest levels of the functions in
assoc.c. If it were instead computed in memcached.c, then passed along with
the key and length into the items.c code and down into assoc.c, each thread
would hold the hashtable lock for less time.
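
A minimal sketch of that idea, under stated assumptions: the table here is
just an array of per-bucket counters, the hash is 32-bit FNV-1a standing in
for whatever hash the real table uses, and every name (hash_key,
do_touch_bucket, touch_key, table_lock) is hypothetical rather than taken
from assoc.c or items.c.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024u

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned bucket_hits[NBUCKETS];

/* Placeholder hash (32-bit FNV-1a); needs no shared state, so no lock. */
uint32_t hash_key(const char *key, size_t nkey) {
    uint32_t h = 2166136261u;
    size_t i;
    for (i = 0; i < nkey; i++) {
        h ^= (unsigned char)key[i];
        h *= 16777619u;
    }
    return h;
}

/* Lowest level: takes a precomputed hash, so it never hashes under the lock. */
static unsigned do_touch_bucket(uint32_t hv) {
    return ++bucket_hits[hv % NBUCKETS];
}

/* Top level: hash first, then hold the lock only for the table access. */
unsigned touch_key(const char *key, size_t nkey) {
    uint32_t hv = hash_key(key, nkey);  /* computed outside the lock */
    unsigned hits;
    pthread_mutex_lock(&table_lock);
    hits = do_touch_bucket(hv);
    pthread_mutex_unlock(&table_lock);
    return hits;
}
```

The critical section shrinks to a single array update. The cost of hashing,
which grows with key length, is paid before the lock is taken.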