White Papers

SCSI Host CPU Utilization and Caching
Author: Y.P. Cheng

Low Adapter CPU Utilization - What Does It Mean?

In a multi-tasking file server environment, many workstation clients request data concurrently. To improve response time for these requests, the data for different clients is spread onto many disk drives. This reduces disk arm contention, a major contributor to response delay. File servers can have from half a dozen disk drives in a small office application to hundreds of disk drives in an application such as an airline reservation system, where literally thousands of travel agents request information for thousands of flights.

In these file servers, the requests are handled by the device driver of an I/O host adapter, to which some or all disk drives are attached. When handling many disk drives, multiple ICs are often used on a single I/O host adapter to spread out the work. Most disk drives today handle over 100 requests per second, taking less than 10 milliseconds per request. With twenty disk drives connected to a file server, it is possible to have 2,000 requests per second requiring service. If the time an adapter takes to process a request is known, this information can be used to determine how many disk drives an application needs - for example, an airline reservation system servicing 200,000 requests per second at peak hours.

To focus on the topic of I/O adapter CPU utilization, assume the CPU of the file server is not a bottleneck. As an example, assume adapter A can pass one request to a disk drive every 100 microseconds, while adapter B needs 250 microseconds. Furthermore, assume that on a very busy file server only 10% of CPU bandwidth, or 100 milliseconds per second, is available for the device driver of the adapters. (Note: a file server needs time to get requests from the network and process them before sending an I/O request to the I/O host adapter.)
Then with the 100 milliseconds available, adapter A can send 1,000 requests to disk drives while adapter B can only send 400. What this means is that adapter A can keep 10 disk drives busy while adapter B can only keep 4 disk drives busy. Continuing with the airline reservation example of 200,000 requests per second, 2,000 disk drives would be required to service these requests, along with 200 adapter A's or 500 adapter B's.

Conclusion: In a high performance file server, the CPU usage per I/O request by an adapter is extremely important. Low CPU usage, or utilization, means more overall throughput.

Low Adapter CPU Utilization - How Does AdvanSys Accomplish It?

To ensure minimum CPU utilization, all AdvanSys SCSI host adapters have the following features:

1. An on-chip application-specific RISC engine performs 100% of SCSI handshakes and bus master data transfers. The RISC engine runs at 40 MIPS on the AdvanSys Ultra2 and Ultra3 products, with only 500 instructions needed to complete a SCSI request.
2. Local RAM stores a large number of requests, so that there is minimal delay in starting a new request after detecting a free SCSI bus.
3. Only a single wakeup call is required to start the RISC engine fetching new requests.
4. All requests are linked into a list, so that the RISC engine can fetch as many requests as its local RAM allows.
5. Using its own DMA channel, the RISC engine copies back the completed request with updated status and signals request completion. With a single interrupt reset instruction to acknowledge completion, the device driver can immediately process the completed request.

Needless to say, by having the RISC engine perform all the SCSI handshakes, there is little demand for CPU processing by the adapter. By keeping a large number of requests in its local RAM, the adapters from AdvanSys increase total throughput by quickly turning an idle SCSI bus busy again.
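The adapter throughput arithmetic above can be sketched as a quick back-of-the-envelope calculation. The figures (per-request CPU cost, the 10% CPU budget, 100 requests per second per drive) come from the example in the text; the function names and the simple linear model are ours, for illustration only.

```python
# Back-of-the-envelope model of adapter throughput, using the numbers
# from the example above. Illustrative only, not AdvanSys data.

def requests_per_second(cpu_cost_us, cpu_budget_ms_per_s=100):
    """Requests/s an adapter can issue, given its CPU cost per request
    and the CPU time per second available to its device driver."""
    return (cpu_budget_ms_per_s * 1000) // cpu_cost_us

def drives_kept_busy(cpu_cost_us, drive_req_per_s=100):
    """Drives kept busy, assuming each drive services 100 requests/s."""
    return requests_per_second(cpu_cost_us) // drive_req_per_s

# Adapter A: 100 us per request; adapter B: 250 us per request.
print(requests_per_second(100), drives_kept_busy(100))  # 1000 req/s, 10 drives
print(requests_per_second(250), drives_kept_busy(250))  # 400 req/s, 4 drives

# Airline example: 200,000 requests/s at peak.
print(200_000 // requests_per_second(100))  # 200 adapter A's
print(200_000 // requests_per_second(250))  # 500 adapter B's
```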
However, in addition to the RISC engine and local RAM, the most important feature that stands out among the competition is the single wakeup call and interrupt reset. In a 500 MHz dual-Pentium file server, the keys to performance are pipelined instruction execution and the L2 cache. By keeping the most recently used program data in the cache, the pipelined CPU can go through a number of instructions lightning fast. However, any access to an I/O adapter, whose command and status registers cannot be cached, stalls the pipeline. Requiring only a single wakeup call to the adapter minimizes the chance of stalling the instruction execution pipeline. Since the RISC engine on the adapter performs DMA to fetch a request and to update its status at completion, there is also minimal impact on the cache contents, because the DMA does not pass through the cache. With the cache contents undisturbed and the pipeline flowing continuously, the dual-Pentium CPU can process more requests.

Software Cache Driver - What Does It Mean?

To achieve the highest possible performance benchmark scores on workstations with only one or two disk drives active at a time, the key is turning a single-tasking software application into a multitasking one, i.e., making the I/O requests overlap with the application processing. To do so, the application software must have its write I/O requests happen in the background. Currently, almost all application software uses "synchronous" requests, which wait until the request is complete before continuing. On the other hand, if the application software uses an "asynchronous" request, which allows the I/O request to be processed in the background, the application software can overlap its own processing with the I/O request. Of course, the application must be aware of the delayed availability of the requested data. Multi-threaded software has one thread send an I/O request while another thread works on previously received data.
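The synchronous-versus-asynchronous distinction can be illustrated with a small sketch. Here `disk_write` and `process` are hypothetical stand-ins that simply sleep for 10 ms each to model disk latency and application work; the point is only that the overlapped version finishes sooner, which is the effect the text describes.

```python
# Minimal sketch of overlapping I/O with processing. disk_write is a
# stand-in for a real I/O request; it just sleeps to model disk latency.
import threading
import time

def disk_write(data):
    time.sleep(0.01)           # pretend the write takes 10 ms

def process(chunk):
    time.sleep(0.01)           # pretend processing also takes 10 ms

def synchronous(chunks):
    for c in chunks:
        disk_write(c)          # application waits for each write
        process(c)

def asynchronous(chunks):
    for c in chunks:
        t = threading.Thread(target=disk_write, args=(c,))
        t.start()              # write proceeds in the background
        process(c)             # overlap processing with the I/O
        t.join()               # be aware of delayed completion

for fn in (synchronous, asynchronous):
    start = time.perf_counter()
    fn(range(5))
    print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

With five chunks, the synchronous loop pays for write and processing back to back, while the overlapped loop pays only for the longer of the two at each step.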
When I/O requests and processing are overlapped, the total processing time is reduced. AdvanSys has developed a software cache driver which caches all disk write data. By first moving the write data into a cache buffer and immediately reporting completion, the application software moves forward believing the write is completed. While the application software works on its next task, for example collecting more data for the next write, the device driver of the I/O adapter schedules a disk write that is processed by the RISC engine of the adapter without any intervention from the CPU. The overlapped application processing and disk writes increase benchmark scores. Using this cache driver with AdvanSys SCSI host adapters and measuring performance with the ZD Lab benchmark, overall system performance improved by as much as 15%.

Software Cache Driver - How Does It Work?

The AdvanSys cache driver is a regular NT mini-port driver (MPD), which allocates a portion of system memory as cache. The size of the cache can be selected by the end user. The cache driver works like a normal MPD when no cache is allocated. With cache selected, the driver divides the cache into buffer blocks and provides a directory to track the contents of each buffer. Initially, the cache is empty. On a write request, the I/O adapter driver quickly moves the application data into its cache and reports completion. This allows the application to move forward, working in parallel with the actual disk writes. In the meantime, the cache directory is updated to track the contents of the cache. A cache block is dirty before its data is written to disk; it is clean after the data is written to disk. Since the disk write is scheduled after completion is reported, this technique is also known as write-behind. Many hard disk manufacturers currently provide write-behind caching on their disk drives to improve system performance.
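The write path just described can be sketched in miniature. The dict-based directory and all the names below are illustrative, not the actual driver's data structures; a dict stands in for the disk so the dirty/clean life cycle is visible.

```python
# Toy write-behind cache: a write lands in a cache block, is marked
# dirty, and completion is reported immediately; the disk write happens
# later, when the driver schedules a flush. Illustrative only.

class WriteBehindCache:
    def __init__(self, disk):
        self.disk = disk          # dict standing in for the disk drive
        self.directory = {}       # block number -> [data, dirty?]

    def write(self, block, data):
        self.directory[block] = [data, True]   # dirty until flushed
        return "complete"         # application moves forward at once

    def flush(self):
        # Scheduled later by the driver: write dirty blocks to disk,
        # then mark them clean.
        for block, entry in self.directory.items():
            data, dirty = entry
            if dirty:
                self.disk[block] = data
                entry[1] = False

disk = {}
cache = WriteBehindCache(disk)
assert cache.write(7, b"payload") == "complete"
assert disk == {}                 # block 7 is dirty, not yet on disk
cache.flush()
assert disk == {7: b"payload"}    # clean after the scheduled write
```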
If a write fails, recovery action is taken by retrying the write, including finding an alternative good disk sector. On a read request, the driver searches the cache directory to determine if the requested data is in the cache. If it is, the data is delivered to the application software without accessing the disk. If the requested data is not found, or only a portion of it is found, the read request is sent directly to the disk drive. The cache directory is organized to minimize the delay added by the search. When the cache is full, clean buffer blocks are reallocated to accommodate new data. When every cache block is dirty, write requests are queued to wait for clean cache blocks.

Last but not least, a user can activate a delayed write-behind feature, which waits for a specified amount of time before writing. This provides the added benefit of eliminating duplicated writes, or multiple writes to the same disk location. Only the very last write is honored.

There has always been debate about whether write-behind leaves system data integrity vulnerable to a sudden loss of power. In reality, every operating system caches data, and even the disk drives cache write data. Regardless of what the MPD driver is doing, a sudden loss of power can expose the system to data loss. This is why Windows 98 and NT scan disk drives after an improper shutdown of the system, and every UNIX system runs a file system check (fsck) when rebooting after a power disruption. The only sure way to avoid loss of data from a power disruption is to provide an uninterruptible power supply (UPS) to the system.
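The read path and the delayed write-behind coalescing can be sketched the same way. Again, the dict-based directory and the names are assumptions made for illustration: a hit is served from the cache with no disk access, a miss goes to the disk, and repeated writes to one block before the delayed flush collapse to a single entry, so only the last write reaches the disk.

```python
# Toy cache read path plus delayed-write coalescing. Illustrative only,
# not the actual driver's structures.

class CachedDisk:
    def __init__(self, disk):
        self.disk = disk     # dict standing in for the disk drive
        self.cache = {}      # directory: block number -> data

    def write(self, block, data):
        self.cache[block] = data      # write-behind: flushed later

    def read(self, block):
        if block in self.cache:       # cache hit: no disk access
            return self.cache[block]
        return self.disk[block]       # miss: go to the disk drive

    def delayed_flush(self):
        # Runs after the user-specified delay; by then, multiple writes
        # to one block have already collapsed to a single entry.
        for block, data in self.cache.items():
            self.disk[block] = data

disk = {3: b"on-disk"}
cd = CachedDisk(disk)
assert cd.read(3) == b"on-disk"   # miss: served from the disk
cd.write(5, b"v1")
cd.write(5, b"v2")                # duplicate write, coalesced in cache
assert cd.read(5) == b"v2"        # hit: served from the cache
cd.delayed_flush()
assert disk[5] == b"v2"           # only the last write reached disk
```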