An application might not always be interested in events arriving on all of its open file descriptors. For example, as mentioned in Section 8.1, the Squid proxy server temporarily ignores data arriving in dribbles; it would rather process large buffers, if possible.
Therefore, our API includes a system call allowing a thread
to declare its interest (or lack of interest) in a file
descriptor:
#define EVENT_READ    0x1
#define EVENT_WRITE   0x2
#define EVENT_EXCEPT  0x4
int declare_interest(int fd, int interestmask, int *statemask);
Once the thread has declared its interest, the kernel tracks event arrivals for the descriptor. Each arrival is added to a per-thread queue. If multiple threads are interested in a descriptor, a per-socket option selects between two ways to choose the proper queue (or queues). The default is to enqueue an event-arrival record for each interested thread, but by setting the SO_WAKEUP_ONE flag, the application indicates that it wants an event arrival delivered only to the first eligible thread.
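For instance, wake-one delivery might be requested with an ordinary setsockopt() call; this is a sketch only, since the text specifies merely that this is a per-socket option, and the SOL_SOCKET level used here is an assumption:

    /* Sketch: request wake-one delivery for this socket.  SOL_SOCKET is
     * an assumption; the text says only that this is a per-socket option. */
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_WAKEUP_ONE, &one, sizeof(one));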
If the statemask argument is non-NULL, then declare_interest() also reports the current state of the file descriptor. For example, if the EVENT_READ bit is set in this value, then the descriptor is ready for reading. This feature avoids a race in which a state change occurs after the file has been opened (perhaps via an accept() system call) but before declare_interest() has been called. The implementation guarantees that the statemask value reflects the descriptor's state before any events are added to the thread's queue. Otherwise, to avoid missing any events, the application would have to perform a non-blocking read or write after calling declare_interest().
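The following sketch illustrates this idiom for a newly accepted connection; it assumes the declarations above, and handle_readable() is a hypothetical application routine rather than part of the proposed interface:

    #include <fcntl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Sketch only: register interest in a new connection and use the
     * statemask to catch data that arrived before declare_interest(). */
    void register_connection(int listen_fd)
    {
        int state;
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            return;

        /* Use non-blocking I/O so a racing state change cannot stall us. */
        fcntl(conn_fd, F_SETFL, fcntl(conn_fd, F_GETFL, 0) | O_NONBLOCK);

        if (declare_interest(conn_fd, EVENT_READ | EVENT_EXCEPT, &state) < 0) {
            close(conn_fd);
            return;
        }

        /* Data that arrived between accept() and declare_interest() is
         * reflected in the statemask, so no speculative read is needed. */
        if (state & EVENT_READ)
            handle_readable(conn_fd);
    }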
To wait for additional events, a
thread invokes another new system call:
typedef struct {
    int      fd;
    unsigned mask;
} event_descr_t;
int get_next_event(int array_max, event_descr_t *ev_array, struct timeval *timeout);
Because an application can request an arbitrary number of event reports in one call, it can amortize the cost of this call over multiple events. However, if at least one event is queued when the call is made, it returns immediately; we do not block the thread simply to fill up its ev_array.
If no events are queued for the thread, then the call blocks until at least one event arrives, or until the timeout expires.
Note that in a multi-threaded application (or in an application where the same socket or file is simultaneously open via several descriptors), a race could make a descriptor unready before the application acts on the reported mask bits. The application should therefore use non-blocking operations to read or write these descriptors, even if they appear to be ready. The implementation of get_next_event() does attempt to report the current state of a descriptor, rather than simply the most recent state transition, and internally suppresses reports that are no longer meaningful; this should reduce the frequency of such races.
The implementation also attempts to coalesce multiple reports for the same descriptor. This may be of value when, for example, a bulk data transfer arrives as a series of small packets. The application might consume all of the buffered data in one system call; it would be inefficient if the application had to consume dozens of queued event notifications corresponding to one large buffered read. However, it is not possible to entirely eliminate duplicate notifications, because of races between new event arrivals and the read, write, or similar system calls.
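A minimal event loop built on this interface might look like the following sketch. It again assumes the declarations above; BATCH_SIZE and process_fd() are illustrative placeholders. The loop requests a batch of reports per call to amortize system-call overhead, and treats each report as a hint, relying on non-blocking I/O as recommended above:

    #define BATCH_SIZE 64            /* illustrative, not part of the API */

    void event_loop(void)
    {
        event_descr_t events[BATCH_SIZE];

        for (;;) {
            /* Block until at least one event is queued (NULL = no timeout);
             * up to BATCH_SIZE reports may be returned in one call. */
            int n = get_next_event(BATCH_SIZE, events, NULL);
            if (n <= 0)
                continue;

            for (int i = 0; i < n; i++) {
                /* Treat each report as a hint: another thread may already
                 * have consumed the data, so process_fd() should use
                 * non-blocking reads and tolerate EWOULDBLOCK. */
                if (events[i].mask & EVENT_READ)
                    process_fd(events[i].fd);   /* hypothetical handler */
            }
        }
    }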