Walter B. Ligon III and Robert B. Ross
Parallel Architecture Research Lab, Clemson University
The Parallel Virtual File System (PVFS) Project is an effort to provide a parallel file system for PC clusters. As a parallel file system, PVFS provides a global name space, striping of data across multiple I/O nodes, and multiple user interfaces. The system is implemented at the user level, so no kernel modifications are necessary to install or run it. All communication is performed using TCP/IP, so no additional message passing libraries are needed, and support is included for using existing binaries on PVFS files. This paper describes the key aspects of the PVFS system and presents recent performance results on a 64-node Beowulf workstation. Conclusions are drawn and areas of future work are discussed.
As the use of PC clusters has grown, it has become clear that better system software support is needed if parallel computing is to continue to flourish on this popular platform. In particular, the file systems commonly used on this type of machine (AFS, NFS) do not provide the services needed by parallel applications, especially as machine sizes grow beyond a few nodes. Thus new solutions are necessary to fill the need for true parallel file systems on PC clusters.
The Parallel Virtual File System (PVFS) project began in the early 1990's as an experiment in user-level parallel file systems for clusters of workstations, but it has since grown into a usable, freely available, high performance parallel file system for PC clusters. This work provides an overview of the design and features of PVFS, presents some of our latest performance measurements on a Beowulf workstation, and points to future work in the development of PVFS.
The PVFS system consists of three components: the manager daemon, which runs on a single node; the I/O daemons, one of which runs on each I/O node; and the application library, through which applications communicate with the PVFS daemons. The manager daemon handles permission checking for file creation, open, close, and remove operations. The I/O daemons handle all file I/O without the intervention of the manager.
In the following sections we will describe how PVFS stores file data, how it stores file metadata, one method by which applications can interface to the system, how data is transferred between applications and I/O nodes, and how existing binaries can operate on PVFS files.
Each I/O daemon stores its portion of a PVFS file as an ordinary file in the local file system of its node. As a result, the UNIX mmap(), read(), and write() calls can be used directly by the I/O daemons to perform file I/O. The operating system also caches file data on behalf of PVFS, so no caching is performed in the PVFS code proper. A disadvantage of this approach is that we give up control of block allocation and cannot directly control what data is cached.
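To illustrate the point above, the sketch below shows how an I/O daemon can serve requests with nothing more than standard UNIX file calls on its local stripe file, leaving caching to the operating system's buffer cache. This is an illustrative sketch in Python with hypothetical function names (PVFS itself is written in C); it is not PVFS source code.

```python
import os
import tempfile

def serve_write(local_fd, offset, data):
    """Handle a write request by writing into the daemon's ordinary
    local file; block allocation is left entirely to the local FS."""
    return os.pwrite(local_fd, data, offset)

def serve_read(local_fd, offset, size):
    """Handle a read request with a plain positional read; the OS
    buffer cache, not PVFS, decides what stays in memory."""
    return os.pread(local_fd, size, offset)

# Demo: the daemon's share of a PVFS file is just a regular file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
fd = os.open(path, os.O_RDWR)
serve_write(fd, 0, b"stripe data held by this I/O daemon")
chunk = serve_read(fd, 7, 4)   # read 4 bytes at logical offset 7
print(chunk)                   # b'data'
os.close(fd)
os.unlink(path)
```

Because the local file system performs the actual block allocation and caching, the daemon stays simple, at the cost of the control mentioned above.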
With PVFS, an NFS-mounted file system is used to store the metadata. We found this scheme more convenient than our previous attempts: it provides a single name space, gives applications a directory structure to browse, and lightens the load on the manager.
The Vesta parallel file system was originally designed for the Vulcan parallel computer. Its interface was interesting in that it allowed processes to "partition" a file in such a way that the process would only see a subset of the file data.
We combined these two concepts into a "Partitioned File Interface," which forms the basic interface to PVFS. This interface provides the same functionality as the standard UNIX file I/O functions but extends them with a partitioning mechanism. With this interface, applications can specify partitions on files that let them access simple strided regions of the file with single read() and write() calls, reducing the number of I/O calls for many common applications.
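To make the partitioning idea concrete, here is a small sketch of how a simple-strided partition maps a process's logical view onto physical file offsets. The parameter names (offset, gsize, stride) and the Python function are illustrative assumptions, not the actual PVFS C interface.

```python
def strided_regions(offset, gsize, stride, total_bytes):
    """Enumerate the (file_offset, length) regions visible through a
    simple-strided partition: gsize bytes every stride bytes,
    starting at offset, until total_bytes have been covered."""
    regions = []
    covered = 0
    pos = offset
    while covered < total_bytes:
        length = min(gsize, total_bytes - covered)
        regions.append((pos, length))
        covered += length
        pos += stride
    return regions

# A process that owns every fourth 1 KB record of a file sees:
print(strided_regions(offset=0, gsize=1024, stride=4096, total_bytes=3072))
# -> [(0, 1024), (4096, 1024), (8192, 1024)]
```

A single read() through such a partition gathers all three regions, where the plain UNIX interface would need three separate calls, which is exactly the reduction in I/O calls described above.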
Application tasks communicate directly with I/O nodes when file data is transferred, and connections are kept open between applications and I/O nodes throughout the lifetime of the application in order to avoid the time penalty of opening TCP connections multiple times. A predetermined ordering is imposed on all data transfers to minimize control messages, and simple-strided requests are supported by the I/O nodes directly to allow for larger request sizes.
Figure 1: Example of an I/O Stream
Figure 1 shows an example of the I/O stream between an application and an I/O node resulting from a strided request. Each side calculates the intersection of the physical stripes and the strided request. The data is always passed in ascending byte order and is packed into TCP packets by the underlying networking software.
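The intersection computation described above can be sketched as follows. This is an illustrative reconstruction, not PVFS source: real code would operate on byte ranges rather than individual bytes, but the idea is the same — because both the application and the daemon can derive the same list independently and in ascending byte order, the stream needs no per-block control messages.

```python
def request_bytes(offset, gsize, stride, total_bytes):
    """Logical byte offsets touched by a simple-strided request,
    in ascending order."""
    covered = 0
    pos = offset
    while covered < total_bytes:
        length = min(gsize, total_bytes - covered)
        for b in range(length):
            yield pos + b
        covered += length
        pos += stride

def stripe_intersection(node, nnodes, ssize,
                        offset, gsize, stride, total_bytes):
    """Bytes of the strided request that live on I/O node 'node',
    assuming round-robin striping with stripe unit 'ssize'."""
    return [b for b in request_bytes(offset, gsize, stride, total_bytes)
            if (b // ssize) % nnodes == node]

# Two I/O nodes, 4-byte stripe units, request of 2 bytes every 4 bytes:
print(stripe_intersection(0, 2, 4, 0, 2, 4, 8))  # -> [0, 1, 8, 9]
print(stripe_intersection(1, 2, 4, 0, 2, 4, 8))  # -> [4, 5, 12, 13]
```

Since the two lists partition the request deterministically, each node can stream its share in order and the application knows exactly how to interleave what arrives.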
Two different models of Seagate disks were used in the system, with advertised sustained transfer rates of 5.0 and 7.9MB/sec. Testing with the Bonnie disk benchmark showed 8.81MB/sec writing with 27.1% CPU utilization and 7.51MB/sec reading with 17.3% CPU utilization. Using ttcp version 1.1, TCP throughput was measured at 11.0MB/sec.
For the purposes of these tests, the 64 nodes in the system were divided into 16 I/O nodes and 48 compute nodes; the number of I/O and compute nodes actually used was varied from test to test. The test application, run under MPICH, concurrently wrote data to a PVFS file and then read it back.
Figure 2: Effects of Increasing Number of I/O Nodes
In Figure 2 we see that we reach a maximum of around 30MB/sec for 4 I/O nodes, 60MB/sec for 8 I/O nodes, and 120MB/sec for 16 I/O nodes. These values closely match the maximum performance we would expect to get out of the disks on each node, although it is likely that at the points where these peaks are occurring we are mostly working from cache. In all cases we find that network performance is a bottleneck for small numbers of application tasks, but it appears that disk is the bottleneck for larger numbers of application tasks (and thus larger amounts of data).
Figure 3: Write Performance for 16 I/O Nodes
Figure 3 focuses on PVFS write performance using 16 I/O nodes with application tasks writing between 4MB and 32MB of data each. We see a significant drop-off in performance for 16-20 application tasks, but performance climbs again after this point. The subsequent rise in performance indicates that we have not hit the limit in performance of disk or network, but rather that we are inappropriately using one or both of these resources.
Figure 4: Read Performance for 4 I/O Nodes
At the same time, PVFS is stable enough for regular use and provides compatibility with existing binaries, which makes parallel I/O a real option for Beowulf workstations and Piles-of-PCs. For more information on obtaining and installing PVFS, see the PVFS Project pages.
 Peter F. Corbett, Dror G. Feitelson, Jean-Pierre Prost, and Sandra Johnson Baylor, "Parallel Access to Files in the Vesta File System," Proceedings of Supercomputing '93, pages 472--481, Portland, OR, 1993. IEEE Computer Society Press.
 Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best, "File-Access Characteristics of Parallel Scientific Workloads," IEEE Transactions on Parallel and Distributed Systems, 7(10):1075--1089, October 1996.
 Daniel Ridge, Donald Becker, Phillip Merkey, and Thomas Sterling, "Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs," Proceedings of the 1997 IEEE Aerospace Conference, 1997.
 Rajeev Thakur, William Gropp, and Ewing Lusk, "On Implementing MPI-IO Portably and with High Performance," Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 23--32, May 1999.