Server Strategies
In this section, we'll take a look at several strategies for handling resources depending on the nature of the server. The more control you have over the design of both the client and the server, the better you can design them together to avoid the limitations and bottlenecks discussed previously. Again, no single method works in every situation. Servers can be divided roughly into two categories: high throughput and high connections. A high throughput server is more concerned with pushing data on a small number of connections. Of course, the meaning of the phrase “small number of connections” is relative to the amount of resources available on the server. A high connection server is more concerned with handling a large number of connections and is not attempting to push large amounts of data.
In the next two sections, we'll discuss both high throughput and high connection server strategies. After that, we'll look at performance numbers gathered from the server samples provided on the companion CD.
High Throughput
An FTP server is an example of a high throughput server. It is concerned with delivering bulk content. In this case, the server is concerned with processing each connection so as to minimize the amount of time required to transfer the data. To do so, the server must limit the number of concurrent connections, because the more simultaneous connections there are, the lower the throughput will be on each connection. An example would be an FTP server that refuses a connection because it is too busy.
The goal for this strategy is to maximize I/O throughput. The server should keep enough receives or sends posted to maximize throughput. Because each overlapped I/O requires memory to be locked as well as a small portion of non-paged pool for each IRP associated with the operation, it is important to limit I/O to a small set of connections. It is possible for the server to continually accept connections and have a relatively high number of established connections, but I/O must be limited to a smaller set.
In this case, the server may post a number of sends or receives on a subset of the established clients. For example, the server could handle client connections in a first-in, first-out manner and post a number of overlapped sends and/or receives on the first 100 connections. After those clients are handled, the server can move on to the next set of clients in the queue. In this model, the number of outstanding send and receive operations is limited to a smaller set of connections. This prevents the server from blindly posting I/O operations on every connection, which could quickly exhaust the server's resources.
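The first-in, first-out scheme described above can be sketched in portable C. This is only an illustration of the bookkeeping, not the code of any sample server: the names io_window, window_fill, and window_release are hypothetical, the cap of 100 is taken from the example in the text, and the post_io callback stands in for whatever overlapped sends or receives a real server would post.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIVE 100  /* cap from the example above; a real server would tune this */

/* Hypothetical bookkeeping for a FIFO queue of established connections. */
typedef struct {
    int    *conn_ids;  /* established connections, in arrival order */
    size_t  count;     /* total number of established connections */
    size_t  next;      /* index of the first connection not yet given I/O */
    size_t  active;    /* connections that currently have I/O posted */
} io_window;

/* Start I/O on queued connections, oldest first, until the cap is reached.
   post_io (may be NULL) is a placeholder for posting overlapped operations.
   Returns the number of connections newly started. */
static size_t window_fill(io_window *w, void (*post_io)(int conn_id))
{
    size_t started = 0;
    while (w->active < MAX_ACTIVE && w->next < w->count) {
        if (post_io != NULL)
            post_io(w->conn_ids[w->next]);
        w->next++;
        w->active++;
        started++;
    }
    return started;
}

/* Call once a connection has been fully handled; this frees one slot. */
static void window_release(io_window *w)
{
    assert(w->active > 0);
    w->active--;
}
```

A server built this way would call window_fill after accepting connections and again after each window_release, so that the number of connections with I/O outstanding never exceeds the cap.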
The server should take care to monitor the number of operations outstanding on each connection so it may prevent malicious clients from attacking it. For example, a server designed to receive data from a client, process it, and send some sort of response should keep track of how many sends are outstanding. If the client is simply flooding the server with data but not posting any receives, the server may end up posting dozens of overlapped sends that will never complete. In this case, once the server finds that there are too many outstanding operations, it can close the connection.
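A minimal sketch of this per-connection accounting follows. The structure and function names are hypothetical, and the threshold of 32 is an arbitrary assumption chosen for illustration; the text only says "too many." Where the sketch sets the closed flag, a real server would also close the socket.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OUTSTANDING_SENDS 32  /* assumed threshold, not a value from the text */

/* Hypothetical per-connection state. */
typedef struct {
    int  outstanding_sends;  /* sends posted but not yet completed */
    bool closed;             /* set once the server gives up on the client */
} conn_state;

/* Call before posting a send. Returns true if the send may be posted.
   If the client already has too many sends pending, the connection is
   marked closed and false is returned. */
static bool conn_begin_send(conn_state *c)
{
    if (c->closed)
        return false;
    if (c->outstanding_sends >= MAX_OUTSTANDING_SENDS) {
        c->closed = true;  /* a real server would close the socket here */
        return false;
    }
    c->outstanding_sends++;
    return true;
}

/* Call from the completion handler of each send. */
static void conn_send_completed(conn_state *c)
{
    assert(c->outstanding_sends > 0);
    c->outstanding_sends--;
}
```

A well-behaved client keeps the counter near zero because its receives let the sends complete; a client that never reads pushes the counter to the limit and is dropped.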
Maximizing Connections
Maximizing the number of concurrent client connections is the more difficult of the two strategies. Handling the I/O on each connection becomes difficult. A server cannot simply post one or more sends or receives on each connection because the amount of memory required (both in terms of locked pages and non-paged pool) would be too great. In this scenario, the server is interested in handling many connections at the expense of throughput on each connection. An example of this would be an instant messenger server. The server would handle many thousands of connections but would need to send or receive only a small number of bytes at a time.
For this strategy, the server does not necessarily want to post an overlapped receive on each connection because this would involve locking many pages for each of the receive buffers. Instead, the server can post an overlapped zero-byte receive. Once the receive completes, the server would perform non-blocking receives until WSAEWOULDBLOCK is returned. This allows the server to immediately read all data buffered on that connection. Because this model is geared toward clients that send data intermittently, it minimizes the number of locked pages but still allows processing of data on each connection.
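The drain step that follows completion of the zero-byte receive can be sketched as a loop. To keep the sketch portable and self-contained, the receive call is abstracted behind a function pointer: in a Winsock server that callback would wrap recv on a non-blocking socket and map SOCKET_ERROR with WSAEWOULDBLOCK to the RECV_WOULD_BLOCK value used here. All names in the block (recv_fn, drain_connection, fake_socket, fake_recv) are hypothetical, and errors other than would-block are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RECV_WOULD_BLOCK (-1)  /* stand-in for SOCKET_ERROR + WSAEWOULDBLOCK */
#define RECV_CLOSED      0     /* peer closed the connection */

/* Hypothetical receive callback: returns bytes read, RECV_CLOSED,
   or RECV_WOULD_BLOCK when no more data is buffered. */
typedef int (*recv_fn)(void *ctx, char *buf, int len);

/* Read until the callback reports that no more data is buffered.
   Returns the total number of bytes read; *peer_closed is set to 1
   if the peer closed the connection during the drain. */
static size_t drain_connection(recv_fn do_recv, void *ctx,
                               char *buf, int len, int *peer_closed)
{
    size_t total = 0;
    *peer_closed = 0;
    for (;;) {
        int n = do_recv(ctx, buf, len);
        if (n == RECV_WOULD_BLOCK)
            break;                   /* buffered data exhausted */
        if (n == RECV_CLOSED) {
            *peer_closed = 1;
            break;
        }
        assert(n > 0 && n <= len);
        total += (size_t)n;          /* a real server would process buf[0..n) here */
    }
    return total;
}

/* Fake data source used only to exercise the loop without a network. */
typedef struct { size_t pending; } fake_socket;

static int fake_recv(void *ctx, char *buf, int len)
{
    fake_socket *s = (fake_socket *)ctx;
    size_t n;
    if (s->pending == 0)
        return RECV_WOULD_BLOCK;
    n = s->pending < (size_t)len ? s->pending : (size_t)len;
    memset(buf, 'x', n);
    s->pending -= n;
    return (int)n;
}
```

After the drain returns without the peer having closed, the server would post the next zero-byte overlapped receive and wait for it to complete, so no receive buffer stays locked while the connection is idle.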
Performance Numbers
This section covers performance numbers from the different servers provided in Chapters 5 and 6. The servers tested use blocking sockets, non-blocking sockets with select, WSAAsyncSelect, WSAEventSelect, overlapped I/O with events, and overlapped I/O with completion ports. Table 6-3 summarizes the results of these tests. For all of these tests, the server is an echo server. That is, for each connection that is accepted, data is received and sent back to the client. None of the sample servers limits the number of concurrent connections. Each I/O model has two or three entries. The first entry represents a high-throughput scenario in which 7000 connections were attempted from three clients and each client sent data to the server as fast as possible. The remaining entries represent tests in which the clients limited the rate at which they sent data so as not to overrun the bandwidth available on the network. The second entry represents 12,000 attempted connections with rate limiting. If the server was able to handle the majority of the 12,000 connections, the third entry shows the maximum number of clients the server was able to handle.
As we mentioned, the servers used are those provided in Chapter 5, except for the I/O completion port server, which is a slightly modified version of the Chapter 5 completion port server that limits the number of outstanding operations. This completion port server limits the number of outstanding send operations to 200 and posts just a single receive on each client connection. The client used in this test is the I/O completion port client from Chapter 5. Connections were established in blocks of 1000 clients by specifying the '-c 1000' option on the client. The two x86-based clients initiated a maximum of 12,000 connections, and the Itanium system was used to establish the remaining clients in blocks of 4000. In the tests that were rate limited, each client block was limited to 200,000 bytes per second (using the '-r 200000' switch). That is, the average send throughput for an entire block of clients was limited to 200,000 bytes per second; the limit did not apply to each client individually.
Table 6-3 I/O Method Performance Comparison

| I/O Model | Attempted/Connected | Memory Used (KB) | Non-Paged Pool | CPU Usage | Threads | Throughput (Send/Receive Bytes Per Second) |
| --- | --- | --- | --- | --- | --- | --- |
| Blocking | 7000/1008 | 25,632 | 36,121 | 10–60% | 2016 | 2,198,148/2,198,148 |
| | 12,000/1008 | 25,408 | 36,352 | 5–40% | 2016 | 404,227/402,227 |
| Non-blocking | 7000/4011 | 4208 | 135,123 | 95–100%* | 1 | 0/0 |
| | 12,000/5779 | 5224 | 156,260 | 95–100%* | 1 | 0/0 |
| WSAAsyncSelect | 7000/1956 | 3640 | 38,246 | 75–85% | 3 | 1,610,204/1,637,819 |
| | 12,000/4077 | 4884 | 42,992 | 90–100% | 3 | 652,902/652,902 |
| WSAEventSelect | 7000/6999 | 10,502 | 36,402 | 65–85% | 113 | 4,921,350/5,186,297 |
| | 12,000/11,080 | 19,214 | 39,040 | 50–60% | 192 | 3,217,493/3,217,493 |
| | 46,000/45,933 | 37,392 | 121,624 | 80–90% | 791 | 3,851,059/3,851,059 |
| Overlapped (events) | 7000/5558 | 21,844 | 34,944 | 65–85% | 66 | 5,024,723/4,095,644 |
| | 12,000/12,000 | 60,576 | 48,060 | 35–45% | 195 | 1,803,878/1,803,878 |
| | 49,000/48,997 | 241,208 | 155,480 | 85–95% | 792 | 3,865,152/3,834,511 |
| Overlapped (completion port) | 7000/7000 | 36,160 | 31,128 | 40–50% | 2 | 6,282,473/3,893,507 |
| | 12,000/12,000 | 59,256 | 38,862 | 40–50% | 2 | 5,027,914/5,027,095 |
| | 50,000/49,997 | 242,272 | 148,192 | 55–65% | 2 | 4,326,946/4,326,496 |
The server was a Pentium 4 1.7 GHz Xeon with 768 MB memory. Clients were established from three machines: a Pentium 2 233 MHz with 128 MB memory, a Pentium 2 350 MHz with 128 MB memory, and an Itanium 733 MHz with 1 GB memory. The test network was an isolated 100 Mbps hub. All of the machines tested had Windows XP installed.
The blocking model is the poorest performing of all the models. The blocking server spawns two threads for each client connection: one for sending data and one for receiving it, which is why Table 6-3 shows 2016 threads for 1008 connections. In both test cases, the server was able to handle only a fraction of the connections because it hit a system resource limit on creating threads: the CreateThread call failed with ERROR_NOT_ENOUGH_MEMORY. The remaining client connections failed with WSAECONNREFUSED.
The non-blocking model fared only somewhat better. It was able to accept more connections but ran into a CPU limitation. The non-blocking server puts all the connected sockets into an fd_set, which is passed into select. When select completes, the server uses the FD_ISSET macro on each socket to determine whether that socket is signaled. This becomes inefficient as the number of connections increases. Just to determine if a socket is signaled, a linear search through the array is required! To partially alleviate this problem, the server can be redesigned so that it iteratively steps through the fd_sets returned from select. The only issue is that the server then needs to be able to quickly find the SOCKET_INFO structure associated with that socket handle. In this case, the server can provide a more sophisticated cataloging mechanism, such as a hash tree, which allows quicker lookups. Also note that the non-paged pool usage is extremely high. This is because both AFD and TCP are buffering data on the client connections: the server, limited by its high CPU usage, is unable to read the data fast enough (as indicated by the zero-byte throughput).
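The lookup problem can be illustrated with a small map from socket handle to per-connection data. Note that this sketch substitutes a plain open-addressing hash table for the hash tree mentioned above, purely because it is shorter to show; it is not the structure used by any sample server. The names sock_table, table_insert, and table_find are hypothetical, the handle is modeled as a uintptr_t, the info pointer stands in for a SOCKET_INFO pointer, and removal of entries is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024  /* power of two; assumed capacity for the sketch */

typedef struct {
    uintptr_t sock;    /* socket handle value */
    void     *info;    /* would point at the connection's SOCKET_INFO */
    int       in_use;  /* nonzero if this slot holds an entry */
} sock_slot;

typedef struct {
    sock_slot slots[TABLE_SIZE];
    size_t    used;
} sock_table;

static void table_init(sock_table *t)
{
    memset(t, 0, sizeof *t);
}

/* Multiplicative hash reduced to a slot index. */
static size_t hash_sock(uintptr_t s)
{
    return (size_t)((s * (uintptr_t)2654435761u) >> 16) & (TABLE_SIZE - 1);
}

/* Insert or update an entry. Returns 0 if the table is too full. */
static int table_insert(sock_table *t, uintptr_t sock, void *info)
{
    size_t i = hash_sock(sock);
    while (t->slots[i].in_use) {
        if (t->slots[i].sock == sock) {
            t->slots[i].info = info;   /* update existing entry */
            return 1;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (t->used >= TABLE_SIZE / 2)
        return 0;                      /* keep at least half the slots empty */
    t->slots[i].sock = sock;
    t->slots[i].info = info;
    t->slots[i].in_use = 1;
    t->used++;
    return 1;
}

/* Average-case constant-time lookup; returns NULL if not present. */
static void *table_find(const sock_table *t, uintptr_t sock)
{
    size_t i = hash_sock(sock);
    while (t->slots[i].in_use) {
        if (t->slots[i].sock == sock)
            return t->slots[i].info;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}
```

With such a map, the server can walk only the handles that select reports as signaled and call table_find on each, rather than testing every connected socket with FD_ISSET.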
The WSAAsyncSelect model is acceptable for a small number of clients but does not scale well because the overhead of the message loop quickly bogs down its capability to process messages fast enough. In both tests, the server is able to handle only about a third of the connections made. The clients receive many WSAECONNREFUSED errors, indicating that the server cannot handle the FD_ACCEPT messages quickly enough, so the listen backlog is exhausted. Moreover, even for the connections that were accepted, you will notice that the average throughput is rather low (even in the case of the rate-limited clients).
Surprisingly, the WSAEventSelect model performed very well. In all the tests, the server was, for the most part, able to handle all the incoming connections while obtaining very good data throughput. The drawback to this model is the overhead required to manage the thread pool for new connections. Because each thread can wait on only 64 events, when new connections are established new threads have to be created to handle them. Also, in the last test case in which more than 45,000 connections were established, the machine became very sluggish. This was most likely due to the great number of threads created to service the many connections. The overhead for switching between the 791 threads becomes significant. The server reached a point at which it was unable to accept any more connections due to numerous WSAENOBUFS errors. In addition, the client application reached its limitation and was unable to sustain the already established connections (we'll discuss this in detail later).
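The thread growth follows directly from the 64-event limit on a single wait. As a rough worked example, the function below computes the minimum number of waiting threads needed if every thread waits on a full 64 socket events. The function name is hypothetical, and the calculation is only a lower bound: it assumes perfectly packed threads and ignores any listening or bookkeeping threads.

```c
#include <assert.h>

#define EVENTS_PER_THREAD 64  /* per-thread wait limit cited in the text */

/* Lower bound on the number of waiting threads for a given number of
   connections, assuming one event per connection and fully packed threads. */
static unsigned long min_wait_threads(unsigned long connections)
{
    return (connections + EVENTS_PER_THREAD - 1) / EVENTS_PER_THREAD;
}
```

For the 45,933 connections in the last WSAEventSelect test this gives a minimum of 718 threads. Table 6-3 reports 791, which is consistent with threads not all being filled to capacity, although the text does not state the exact cause.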
The overlapped I/O with events model is similar to the WSAEventSelect model in terms of scalability. Both models rely on thread pools for event notification, and both reach a limit at which the thread switching overhead becomes a factor in how well they handle client communication. The performance numbers for this model almost exactly mirror those of WSAEventSelect. It does surprisingly well until the number of threads increases.
The last entry is for overlapped I/O with completion ports, which is the best performing of all the I/O models. The memory usage (both user and non-paged pool) and the number of accepted clients are similar to those of the overlapped I/O with events and WSAEventSelect models. However, the real difference is in CPU usage. The completion port model used only around 60 percent of the CPU, but the other two models required substantially more horsepower to maintain the same number of connections. Another significant difference is that the completion port model also allowed for slightly better throughput.
While carrying out these tests, it became apparent that the nature of the data interaction between client and server introduced a limitation. The server is designed to be an echo server, so all data received from the client is sent back. Also, each client continually sends data (even if it's at a lower rate) to the server. This results in data always pending on the server's socket (either in the TCP buffers or in AFD's per-socket buffers, which are all non-paged pool). For the three well-performing models, only a single receive is performed at a time; however, this means that for the majority of the time, there is still data pending. It is possible to modify the server to perform a non-blocking receive once data is indicated on the connection. This would drain the data buffered on the machine. The drawback to this approach in this instance is that the client is constantly sending, so the non-blocking receive could return a great deal of data, which would lead to starvation of other connections (because the worker thread or completion thread would not be able to handle other events or completion notices). Typically, calling a non-blocking receive until WSAEWOULDBLOCK is returned works well on connections where data is transmitted at intervals rather than continuously.
From these performance numbers, it is clear that WSAEventSelect and overlapped I/O offer the best performance. For the two event-based models, setting up a thread pool for handling event notification is cumbersome but still allows for excellent performance on a moderately stressed server. Once the number of connections, and with it the number of threads, increases, scalability becomes an issue as more CPU is consumed by context switching between threads. The completion port model still offers the ultimate scalability because CPU usage is less of a factor as the number of clients increases.