Recently, I worked on a performance issue involving a bge interface. The bge interface is a gigabit interface based on the Broadcom 5704 (IIRC) chipset, also known as the NetXtreme interface.
Anyway, traditionally, Sun’s products are tuned to allow greater throughput. This is usually what enterprise users want. They buy gigabit interfaces so that they can move large amounts of data around quickly, and they complain a lot when they do not get the throughput they expect. For instance, in order to run gigabit interfaces at near wire speed, we had to re-architect the way we handled the network stack.
Sometimes this runs afoul of the requirements of other customers. In this particular case, the customer was less concerned with network throughput than with latency. Their application is memcached, and their normal mode of operation involves a web server on one system retrieving a data object from the memcached server. The primary datum being measured is the interval from the time the request is written to the time the read returns with the response.
The first thing we noticed is that the webserver first reads 5 bytes which are examined to see if the response is a cache hit or miss. This means that the interval represents the time it takes for the first packet of the response to be returned, no matter how large the response actually is. Thus, this application is extremely latency sensitive.
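For illustration only, a measurement of this kind can be sketched in Python. This is not the customer's test harness. The function names time_first_bytes and is_hit are hypothetical, and the idea that the 5 bytes separate a "VALUE" hit from an "END\r\n" miss is an assumption based on memcached's text protocol, not something taken from the actual test.

```python
import socket
import time


def time_first_bytes(sock, request, nbytes=5):
    """Send one request and time how long the first nbytes of the reply take.

    Returns (elapsed_seconds, first_bytes). The clock starts just before
    the write and stops as soon as nbytes have been read, so the size of
    the rest of the response does not affect the result.
    """
    start = time.perf_counter()
    sock.sendall(request)
    buf = b""
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            break  # peer closed the connection early
        buf += chunk
    elapsed = time.perf_counter() - start
    return elapsed, buf


def is_hit(first_bytes):
    """Assumed check: a hit starts with b"VALUE", a miss is b"END\\r\\n"."""
    return first_bytes == b"VALUE"
```

Calling time_first_bytes on a connected TCP socket with a request such as b"get somekey\r\n" would give one sample. Averaging many samples and also keeping the minimum gives the two numbers that matter in a comparison like this one.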
Initially, we found that Solaris Nevada had an average response time of 251µs while Red Hat Fedora Core 4 was responding at an average of 182µs, on identical hardware. Not good for us. I looked a little further and found that the minimum time it took was identical, namely 90µs. This led me to suspect a feature called interrupt coalescence.
Interrupt coalescence sets the NIC to delay packets slightly so that a single interrupt can be used to process more than one packet. This is great for throughput, since the CPU spends fewer cycles per packet, but it is not so good for latency. The penalty is usually small, since in a production environment you are likely to have many streams of packets triggering those interrupts, so any one packet waits only briefly. However, in these tests the request always fit in a single packet and it was the only network stream. Thus the request would get whacked with the maximum delay nearly every time.
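A toy model makes the difference between the two cases easier to see. The sketch below is not how the bge hardware or driver works. It assumes a simplified scheme in which an interrupt fires either ticks_us microseconds after the first pending packet arrives, or as soon as max_frames packets are pending, whichever comes first. The function name coalesced_delays and both parameter names are hypothetical.

```python
def coalesced_delays(arrivals_us, ticks_us, max_frames):
    """Return the delay (in µs) each packet waits for its interrupt.

    arrivals_us must be sorted arrival times in microseconds.
    Toy model: a batch starts with the first pending packet; the
    interrupt fires at its arrival + ticks_us, or earlier if
    max_frames packets have arrived by then.
    """
    delays = []
    i = 0
    n = len(arrivals_us)
    while i < n:
        deadline = arrivals_us[i] + ticks_us
        j = i
        # Gather packets that arrive before the timer expires,
        # stopping once the frame-count threshold is reached.
        while j < n and j - i < max_frames and arrivals_us[j] <= deadline:
            j += 1
        if j - i == max_frames:
            fire = arrivals_us[j - 1]  # threshold hit: fire immediately
        else:
            fire = deadline            # timer expired first
        delays.extend(fire - t for t in arrivals_us[i:j])
        i = j
    return delays


if __name__ == "__main__":
    # One isolated request per millisecond: every packet eats the full timer.
    lone = coalesced_delays([0, 1000, 2000], ticks_us=128, max_frames=8)
    print(lone)  # [128, 128, 128]

    # A busy link, one packet every 10 µs: the frame threshold fires first.
    busy = coalesced_delays([10 * k for k in range(16)], ticks_us=128, max_frames=8)
    print(sum(busy) / len(busy))  # 35.0
```

In this model a lone packet always waits for the whole timer, while packets on a busy link wait much less on average, which matches the behaviour described above for a single-stream test.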
Looking at the source code for the bge driver, I found 4 parameters that control this process. Specifically, the culprit is bge_rx_ticks_norm, which is set to 128. This number is in units of µs. The other parameters had much less impact.
Luckily, these parameters can be changed at boot time by setting them in the /etc/system file.
I changed the 128 to 8 by adding this line to the /etc/system file: “set bge:bge_rx_ticks_norm = 8”. This brought our response times in line with the Linux numbers.