White Papers

SCSI Host CPU Utilization and Caching
Author: Y.P. Cheng

Low Adapter CPU Utilization - What Does It Mean?

In a multi-tasking file server environment, many workstation clients request data concurrently. To improve response time for these requests, the data for different clients is spread onto many disk drives. This reduces disk arm contention, a major contributor to response delay. File servers can have from half a dozen disk drives in a small office application to hundreds of disk drives in an application such as an airline reservation system, where literally thousands of travel agents request information for thousands of flights.

In these file servers, the requests are handled by the device driver of an I/O host adapter, to which some or all disk drives are attached. When handling many disk drives, multiple ICs are often used on a single I/O host adapter to spread out the work. Most disk drives today handle over 100 requests per second, taking less than 10 milliseconds per request. With twenty disk drives connected to a file server, it is possible to have 2,000 requests per second requiring service. If the time an adapter takes to process a request is known, this information can be used to determine how many disk drives an application needs - for example, an airline reservation system servicing 200,000 requests per second at peak hours.

To focus on the topic of I/O adapter CPU utilization, assume the CPU of the file server is not a bottleneck. As an example, assume adapter A can pass one request to a disk drive every 100 microseconds, while adapter B needs 250 microseconds. Furthermore, assume that on a very busy file server only 10% of CPU bandwidth, or 100 milliseconds per second, is available for the device driver of the adapters. (Note: a file server needs time to get requests from the network and process them before sending an I/O request to the I/O host adapter.)
Then with the 100 milliseconds available, adapter A can send 1,000 requests to disk drives while adapter B can only send 400. What this means is that adapter A can keep 10 disk drives busy while adapter B can only keep 4 disk drives busy. Continuing with the airline reservation example of 200,000 requests per second, 2,000 disk drives would be required to service these requests, along with 200 adapter A's or 500 adapter B's.

Conclusion: In a high performance file server, the CPU usage per I/O request by an adapter is extremely important. Low CPU usage, or utilization, means more overall throughput.

Low Adapter CPU Utilization - How Does AdvanSys Accomplish It?

To ensure minimum CPU utilization, all AdvanSys SCSI host adapters have the following features:

1. An on-chip application-specific RISC engine performs 100% of SCSI handshakes and bus master data transfers. The RISC engine runs at 40 MIPS on the AdvanSys Ultra2 and Ultra3 products, with only 500 instructions needed to complete a SCSI request.
2. Local RAM stores a large number of requests, so that there is minimal delay in starting a new request after detecting a free SCSI bus.
3. Only a single wakeup call is required to start the RISC engine fetching new requests.
4. All requests are linked into a list, so that the RISC engine can fetch as many requests as its local RAM allows.
5. Using its own DMA channel, the RISC engine copies back the completed request with updated status and signals request completion. With a single interrupt reset instruction to acknowledge completion, the device driver can immediately process the completed request.

Needless to say, by having the RISC engine perform all the SCSI handshakes, there is little demand for CPU processing by the adapter. By keeping a large number of requests in its local RAM, the adapters from AdvanSys increase total throughput by quickly turning an idle SCSI bus busy again.
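The adapter throughput arithmetic above can be sketched as a quick back-of-the-envelope calculation. The figures (per-request CPU cost, the 10% CPU budget, 100 requests per second per drive) come from the example in the text; the function names and the simple linear model are ours, for illustration only.

```python
# Back-of-the-envelope model of adapter throughput, using the numbers
# from the example above. Illustrative only, not AdvanSys data.

def requests_per_second(cpu_cost_us, cpu_budget_ms_per_s=100):
    """Requests/s an adapter can issue, given its CPU cost per request
    and the CPU time per second available to its device driver."""
    return (cpu_budget_ms_per_s * 1000) // cpu_cost_us

def drives_kept_busy(cpu_cost_us, drive_req_per_s=100):
    """Drives kept busy, assuming each drive services 100 requests/s."""
    return requests_per_second(cpu_cost_us) // drive_req_per_s

# Adapter A: 100 us per request; adapter B: 250 us per request.
print(requests_per_second(100), drives_kept_busy(100))  # 1000 req/s, 10 drives
print(requests_per_second(250), drives_kept_busy(250))  # 400 req/s, 4 drives

# Airline example: 200,000 requests/s at peak.
print(200_000 // requests_per_second(100))  # 200 adapter A's
print(200_000 // requests_per_second(250))  # 500 adapter B's
```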
However, in addition to the RISC engine and local RAM, the most important feature that stands out among the competition is the single wakeup call and interrupt reset. In a 500 MHz dual-Pentium file server, the keys to performance are pipelined instruction execution and the L2 cache. By keeping the most recently used program data in the cache, the pipelined CPU can go through a number of instructions lightning fast. However, any access to an I/O adapter, whose command and status registers cannot be cached, stalls the pipeline. Requiring only a single wakeup call to the adapter minimizes the chance of stalling the instruction execution pipeline. Since the RISC engine on the adapter performs DMA to fetch a request and to update its status at completion, there is also minimal impact on the cache contents, because the DMA does not pass through the cache. With the cache contents undisturbed and the pipeline flowing continuously, the dual-Pentium CPU can process more requests.

Software Cache Driver - What Does It Mean?

To achieve the highest possible performance benchmark scores on workstations with only one or two disk drives active at a time, the key is turning a single-tasking software application into a multitasking one, i.e., making the I/O requests overlap with the application processing. To do so, the application software must have its write I/O requests happen in the background. Currently, almost all application software uses "synchronous" requests, which wait until the request is complete before continuing. On the other hand, if the application software uses an "asynchronous" request, which allows the I/O request to be processed in the background, the application software can overlap its own processing with the I/O request. Of course, the application must be aware of the delayed availability of the requested data. Multi-threaded software has one thread send an I/O request while another thread works on previously received data.
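The synchronous-versus-asynchronous distinction can be illustrated with a small sketch. Here `disk_write` and `process` are hypothetical stand-ins that simply sleep for 10 ms each to model disk latency and application work; the point is only that the overlapped version finishes sooner, which is the effect the text describes.

```python
# Minimal sketch of overlapping I/O with processing. disk_write is a
# stand-in for a real I/O request; it just sleeps to model disk latency.
import threading
import time

def disk_write(data):
    time.sleep(0.01)           # pretend the write takes 10 ms

def process(chunk):
    time.sleep(0.01)           # pretend processing also takes 10 ms

def synchronous(chunks):
    for c in chunks:
        disk_write(c)          # application waits for each write
        process(c)

def asynchronous(chunks):
    for c in chunks:
        t = threading.Thread(target=disk_write, args=(c,))
        t.start()              # write proceeds in the background
        process(c)             # overlap processing with the I/O
        t.join()               # be aware of delayed completion

for fn in (synchronous, asynchronous):
    start = time.perf_counter()
    fn(range(5))
    print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

With five chunks, the synchronous loop pays for write and processing back to back, while the overlapped loop pays only for the longer of the two at each step.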
When I/O requests and processing are overlapped, the total processing time is reduced. AdvanSys has developed a software cache driver which caches all disk write data. By first moving the write data into a cache buffer and immediately reporting completion, the application software moves forward believing the write is completed. While the application software works on its next task, for example collecting more data for the next write, the device driver of the I/O adapter schedules a disk write that is processed by the RISC engine of the adapter without any intervention from the CPU. The overlapped application processing and disk writes increase benchmark scores. Using this cache driver with AdvanSys SCSI host adapters and measuring performance with the ZD Lab benchmark, overall system performance improved by as much as 15%.

Software Cache Driver - How Does It Work?

The AdvanSys cache driver is a regular NT mini-port driver (MPD), which allocates a portion of system memory as cache. The size of the cache can be selected by the end user. The cache driver works like a normal MPD when no cache is allocated. With cache selected, the driver divides the cache into buffer blocks and provides a directory to track the contents of each buffer. Initially, the cache is empty. On a write request, the I/O adapter driver quickly moves the application data into its cache and reports completion. This allows the application to move forward, working in parallel with the actual disk writes. In the meantime, the cache directory is updated to track the contents of the cache. A cache block is dirty before its data is written to disk; it is clean after the data is written to disk. Since the disk write is scheduled after completion is reported, this technique is also known as write-behind. Many hard disk manufacturers currently provide write-behind caching on their disk drives to improve system performance.
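The write path just described can be sketched in miniature. The dict-based directory and all the names below are illustrative, not the actual driver's data structures; a dict stands in for the disk so the dirty/clean life cycle is visible.

```python
# Toy write-behind cache: a write lands in a cache block, is marked
# dirty, and completion is reported immediately; the disk write happens
# later, when the driver schedules a flush. Illustrative only.

class WriteBehindCache:
    def __init__(self, disk):
        self.disk = disk          # dict standing in for the disk drive
        self.directory = {}       # block number -> [data, dirty?]

    def write(self, block, data):
        self.directory[block] = [data, True]   # dirty until flushed
        return "complete"         # application moves forward at once

    def flush(self):
        # Scheduled later by the driver: write dirty blocks to disk,
        # then mark them clean.
        for block, entry in self.directory.items():
            data, dirty = entry
            if dirty:
                self.disk[block] = data
                entry[1] = False

disk = {}
cache = WriteBehindCache(disk)
assert cache.write(7, b"payload") == "complete"
assert disk == {}                 # block 7 is dirty, not yet on disk
cache.flush()
assert disk == {7: b"payload"}    # clean after the scheduled write
```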
If a write fails, recovery action is taken by retrying the write, including finding an alternative good disk sector. On a read request, the driver searches the cache directory to determine if the requested data is in the cache. If it is, the data is delivered to the application software without accessing the disk. If the requested data is not found, or only a portion of it is found, the read request is sent directly to the disk drive. The cache directory is organized to minimize the delay added by the search. When the cache is full, clean buffer blocks are reallocated to accommodate new data. When every cache block is dirty, write requests are queued to wait for clean cache blocks.

Last but not least, a user can activate a delayed write-behind feature, which waits for a specified amount of time before writing. This provides the added benefit of eliminating duplicated writes, or multiple writes to the same disk location. Only the very last write is honored.

There has always been debate about whether write-behind leaves system data integrity vulnerable to a sudden loss of power. In reality, every operating system caches data, and even the disk drives cache write data. Regardless of what the MPD driver is doing, a sudden loss of power can expose the system to data loss. This is why Windows 98 and NT scan disk drives after an improper shutdown of the system, and every UNIX system runs a file system check (fsck) when rebooting after a power disruption. The only sure way to avoid loss of data from a power disruption is to provide an uninterruptible power supply (UPS) to the system.
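The read path and the delayed write-behind coalescing can be sketched the same way. Again, the dict-based directory and the names are assumptions made for illustration: a hit is served from the cache with no disk access, a miss goes to the disk, and repeated writes to one block before the delayed flush collapse to a single entry, so only the last write reaches the disk.

```python
# Toy cache read path plus delayed-write coalescing. Illustrative only,
# not the actual driver's structures.

class CachedDisk:
    def __init__(self, disk):
        self.disk = disk     # dict standing in for the disk drive
        self.cache = {}      # directory: block number -> data

    def write(self, block, data):
        self.cache[block] = data      # write-behind: flushed later

    def read(self, block):
        if block in self.cache:       # cache hit: no disk access
            return self.cache[block]
        return self.disk[block]       # miss: go to the disk drive

    def delayed_flush(self):
        # Runs after the user-specified delay; by then, multiple writes
        # to one block have already collapsed to a single entry.
        for block, data in self.cache.items():
            self.disk[block] = data

disk = {3: b"on-disk"}
cd = CachedDisk(disk)
assert cd.read(3) == b"on-disk"   # miss: served from the disk
cd.write(5, b"v1")
cd.write(5, b"v2")                # duplicate write, coalesced in cache
assert cd.read(5) == b"v2"        # hit: served from the cache
cd.delayed_flush()
assert disk[5] == b"v2"           # only the last write reached disk
```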