(Old) Parallel Virtual File System Frequently Asked Questions List
PVFS is a virtual parallel file system which operates on clusters of PCs
running Linux. It is virtual in that file data is actually stored on
multiple file systems on local disks, not by PVFS itself. By parallel
we mean that data is stored on multiple independent PCs, or nodes, and
that multiple clients can access this data simultaneously.
What architectures does PVFS support?
PVFS (at least as of version 1.5.4) is known to compile and work properly
on Alpha, x86, and IA64 based Linux systems. If you have had success with any
other platforms please let us know.
How do I install PVFS?
If you want to get your system up and running quickly, you should
probably start off by reading the quickstart guide. It can be found
here.
What are these enablemgr and enableiod scripts all about?
They are simple scripts that can be used to set up the links in the rc.d
directories on Red Hat machines in order to start the iod or manager at
boot time. For example, if you wanted to have the manager started on a
machine on boot, you should run the enablemgr script once on that
machine. These scripts (if used) only need to be run once for each
machine. The daemons will start on boot from then on.
How can I store PVFS data on multiple disks on a single node?
You have two options. One is to use the md driver or something similar
to create a disk array or RAID, create a file system on that, and use
one I/O daemon to perform accesses to the new file system.
The alternative is to run multiple I/O daemons on the same node, one
per file system you wish to use.
How can I run multiple I/O daemons on the same node?
This is easy; you just give them separate ports on which to communicate.
This involves both the .iodtab file and some iod.conf files for
configuration.
Remember, the .iodtab file exists in the root of the metadata directory
tree and is used by the manager to determine the locations of iods.
Examples are in the User's Guide.
For example, let's assume that you are going to run PVFS with two nodes
used for I/O, but you want to use two disks on each node. Your .iodtab
file might look like:
192.168.0.1:7000
192.168.0.1:7001
192.168.0.2:7000
192.168.0.2:7001
This would specify that four iods will be used for the file system. Two
are running on the machine 192.168.0.1. One is listening on port 7000,
the other on port 7001. Likewise there are two iods running on
192.168.0.2 listening on ports 7000 and 7001.
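As a rough illustration of how the file above maps to iods, here is a small Python sketch that reads .iodtab-style text into an ordered list of (host, port) pairs. The function name parse_iodtab is made up for this example and is not part of PVFS. It assumes one "host:port" entry per line, as in the example, and reports a missing port as None rather than guessing a default.

```python
from typing import List, Optional, Tuple


def parse_iodtab(text: str) -> List[Tuple[str, Optional[int]]]:
    """Return (host, port) pairs in file order; port is None if not given."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, sep, port = line.partition(":")
        entries.append((host, int(port) if sep else None))
    return entries


# The four-iod example from the text above.
EXAMPLE = """\
192.168.0.1:7000
192.168.0.1:7001
192.168.0.2:7000
192.168.0.2:7001
"""

if __name__ == "__main__":
    for index, (host, port) in enumerate(parse_iodtab(EXAMPLE)):
        print(f"iod {index}: host={host} port={port}")
```

The list keeps the file order on purpose, since the order of entries in .iodtab matters to PVFS.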
The iod.conf file tells a given iod about its configuration. We'll
continue the example. Let's assume that the disks are mounted on the
nodes at /pvfs_disk0 and /pvfs_disk1 (on both nodes). So we'll build a
couple of configuration files:
Config file 1, "/etc/iod0.conf":
port 7000
user nobody
group nobody
rootdir /
datadir /pvfs_disk0
logdir /tmp
Config file 2, "/etc/iod1.conf":
port 7001
user nobody
group nobody
rootdir /
datadir /pvfs_disk1
logdir /tmp
Ok. So we copy these two files out to the two nodes. Then we start the
iods (on each of the two nodes) with "iod /etc/iod0.conf" and
"iod /etc/iod1.conf". The iods will read their respective configuration
files and prepare themselves to service requests.
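If you have more than a couple of disks per node, writing these files by hand gets tedious. The following Python sketch generates the text of one iod.conf per data directory, using only the keys shown in the example above. The helper names (render_iod_conf, confs_for_disks) and the choice to number ports upward from 7000 are assumptions made for illustration, not anything PVFS requires.

```python
from typing import Dict, List


def render_iod_conf(port: int, datadir: str, user: str = "nobody",
                    group: str = "nobody", rootdir: str = "/",
                    logdir: str = "/tmp") -> str:
    """Build the text of one iod.conf using the keys from the example."""
    lines = [
        f"port {port}",
        f"user {user}",
        f"group {group}",
        f"rootdir {rootdir}",
        f"datadir {datadir}",
        f"logdir {logdir}",
    ]
    return "\n".join(lines) + "\n"


def confs_for_disks(datadirs: List[str], base_port: int = 7000) -> Dict[str, str]:
    """Map a config file path to its contents, one iod per data directory."""
    return {
        f"/etc/iod{i}.conf": render_iod_conf(base_port + i, datadir)
        for i, datadir in enumerate(datadirs)
    }


if __name__ == "__main__":
    for path, body in confs_for_disks(["/pvfs_disk0", "/pvfs_disk1"]).items():
        print(f"# {path}")
        print(body)
```

The ports generated here must match the ports listed for that node in the .iodtab file.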
I ran Bonnie and the performance is terrible. Why? Is there
anything I can do?
Bonnie is a file system benchmark written by Tim Bray (see
http://www.textuality.com/bonnie/).
With PVFS v1.4.2 and later Bonnie will run fine, but the performance numbers
are likely to be very low. Bonnie uses a 16Kbyte buffer for accessing
the file which it is writing to, which is a particularly small access
size for the PVFS system. PVFS performs poorly at this size, because
TCP overhead is very apparent at requests this small.
There really isn't much to do about this at this time. Future versions
of PVFS using different network transfer protocols will hopefully have
better small-access performance. In the meantime you can hack Bonnie
to use larger accesses (the value is "Chunk") and see what larger
accesses will do if you want to see some better numbers.
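If you would rather not modify Bonnie, a quick way to see the effect of access size is a tiny write test like the Python sketch below. This is not Bonnie and does not reproduce its measurements. It simply writes the same amount of data using different request sizes and times each run. The function name timed_write, the sizes, and the file names are all hypothetical, and the example writes to a temporary directory that you would replace with a directory on a mounted PVFS file system.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, chunk_bytes):
    """Write total_bytes to path in chunk_bytes-sized requests; return seconds."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    buf = memoryview(bytes(chunk_bytes))
    written = 0
    start = time.perf_counter()
    # buffering=0 so each write() call issues one system call of the chosen size.
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            written += f.write(buf[: total_bytes - written])
        os.fsync(f.fileno())
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace this temporary directory with a directory on a mounted PVFS
    # file system to get meaningful numbers.
    with tempfile.TemporaryDirectory() as target_dir:
        total = 8 * 1024 * 1024
        for chunk_kib in (16, 64, 256):
            path = os.path.join(target_dir, f"chunk_{chunk_kib}k.bin")
            seconds = timed_write(path, total, chunk_kib * 1024)
            print(f"{chunk_kib:>4} KiB requests: {total / seconds / 1e6:.1f} MB/s")
```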
Why is program XXX so slow?
Many applications use rather small buffers by default, and this can
cause poor PVFS performance. Applications such as "tar" and "dd" are
good examples. In cases such as this, if there is an option to set a
block size, use it (smile)! Try something around 16-64K; it will almost
certainly help things out.
As an example, the program "cpio" uses a 512 byte block by default. The
--block-size option can be used to set the block size to some multiple
of 512 bytes, so "cpio --block-size=128" would use a 64K buffer, which
should perform much better.
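The arithmetic behind that cpio example is just the buffer size divided by 512. A small Python sketch, with a made-up helper name, for working out the value to pass for other buffer sizes:

```python
CPIO_UNIT_BYTES = 512  # cpio counts --block-size in units of 512 bytes


def cpio_block_size(buffer_bytes: int) -> int:
    """Return the --block-size value that yields a buffer of buffer_bytes."""
    if buffer_bytes <= 0 or buffer_bytes % CPIO_UNIT_BYTES != 0:
        raise ValueError("buffer size must be a positive multiple of 512 bytes")
    return buffer_bytes // CPIO_UNIT_BYTES


if __name__ == "__main__":
    for kib in (16, 32, 64):
        value = cpio_block_size(kib * 1024)
        print(f"{kib}K buffer: cpio --block-size={value}")
```

For a 64K buffer this gives 65536 / 512 = 128, matching the example above.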
Does PVFS support redundancy? What if a node fails?
Nope! Sure doesn't. We've talked about it, we have some ideas, but we
haven't implemented any redundancy. So, if an I/O node fails, PVFS
accesses that need that node will also fail. Generally though, barring
disk destruction, restarting the node (and sometimes restarting the
other PVFS daemons) will get you right back where you were, no data
lost.
PVFS will run on top of RAID file systems, however. This can provide at
least some measure of redundancy at the disk level. It does not protect
against more catastrophic hardware failures such as IDE controller
failure or spontaneous combustion.
Why do my modification dates change on PVFS files that I am reading
from?
The PVFS manager is not involved in I/O operations, so it has no direct
way of knowing if a file has been modified or not. It updates the
modification time any time a file is closed. Really we should check to
see if the file was opened for writing, but we don't at the moment.
Note that this behavior has been fixed as of revision 1.5.3.
How do I get MPI-IO for PVFS?
See the ROMIO web pages (http://www.mcs.anl.gov/romio). ROMIO is an
MPI-IO implementation that is included with MPICH, but generally you
will need to recompile to get PVFS support. This is discussed in the
ROMIO documentation.
When I try to compile ROMIO (MPI-IO) with
pvfs support it fails with a list of "undefined reference" errors.
How do I fix this?
The problem here is the pvfs library that needs to be linked in during
the compilation. This must be specified when you run the configure
script. Here is an
example of the command line needed to build the full MPICH distribution
with ROMIO and PVFS support:
./configure -opt=-O -device=ch_p4 --with-romio="-file_system=pvfs" \
-lib="-L/usr/lib -lpvfs" -cflags="-I/usr/include/pvfs"
Can I directly manipulate PVFS files on the
manager or I/O servers without going through the client interface?
The short answer is no. The metadata and file data are not meant to be
modified directly by users. Doing so may cause corrupt data, lost
storage space, etc. If you wish to delete or move files, always do so
through a PVFS client interface, whether it is through the kernel module
or the native PVFS library.
How can I back up my PVFS file system?
This isn't as easy as it should be, but it can be done. First, I'll
give some specifics about why this is troublesome, then I'll discuss
solutions and a suggestion for making this easier.
The problem with backing up PVFS comes from a design decision made by me
(Rob) with respect to handing out unique handles (inode numbers) for
files. The manager has to pick these numbers, and at the time it seemed
like a good idea to just use the inode number from the actual metadata
file. This was convenient because the data was already stored (as part
of the file) and was guaranteed to be unique by the file system.
This is great, but it becomes a real problem for backups -- if you go to
recreate the metadata directory, it's next to impossible with standard
tools (e.g. tar) to get the same inode numbers back, especially
since tools like tar don't save them anyway. This is all that keeps one
from just tar'ing up all the local directories that make up PVFS and
backing it up in that manner.
There are two solutions. The first is to use tar or some similar tool,
through the PVFS interfaces (either the kernel or the library one) to
pull all data off of PVFS. This will result in an archive that could be
restored to a newly built PVFS file system with no problems. However,
you have to have a storage device big enough to hold all the data, and
pulling all the data off in this manner will likely take a long time for
a large PVFS file system.
The second solution works by backing up the local directories
individually, avoiding the need for a single large device to archive to
(you could still use one device if you like) and allowing for archiving
to take place in parallel on the machine. If you are not familiar with
disk partitioning, "dd", and writing raw partitions back to disk, just
don't try this.
In order for this scheme to work, the metadata directory should be
stored on its own file system, preferably one that isn't too large.
Remember that the metadata files are quite small, so a file system of 50
Mbytes is probably enough to hold all the metadata files you will ever
create. So use a little partition to store the metadata.
Then the backup is simple. Use tar to archive the data on the I/O
nodes. There's nothing special about those directory structures that
tar won't keep up with. Then use dd to grab the entire partition that
the metadata is stored on. Stuff it in a file, gzip or bzip2 it, and
keep it with your I/O archives.
Then if there are problems and you need to restore, dd the partition
back into place, untar the I/O directories onto the right machines, and
away you go.
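To make the second scheme concrete, here is a Python sketch that only builds and prints the commands involved. It does not run them. The device name /dev/sdb1, the /backup paths, and the helper names are hypothetical, and you would substitute your own. It also assumes you take the backup while the file system is quiet, which the text above does not spell out but seems prudent when copying a raw partition.

```python
import shlex


def metadata_backup_cmds(meta_device, image_path):
    """Commands to copy the raw metadata partition to a file and compress it."""
    return [
        ["dd", f"if={meta_device}", f"of={image_path}"],
        ["gzip", image_path],  # leaves image_path + ".gz"
    ]


def iod_backup_cmd(datadir, archive_path):
    """Command to archive one I/O daemon's data directory with tar."""
    return ["tar", "-cf", archive_path, "-C", datadir, "."]


def metadata_restore_cmds(image_path, meta_device):
    """Commands to decompress the image and write it back to the partition."""
    return [
        ["gunzip", image_path + ".gz"],
        ["dd", f"if={image_path}", f"of={meta_device}"],
    ]


if __name__ == "__main__":
    # Hypothetical device and paths; adjust for your own layout.
    plan = (
        metadata_backup_cmds("/dev/sdb1", "/backup/pvfs-meta.img")
        + [iod_backup_cmd("/pvfs_disk0", "/backup/iod0.tar")]
        + metadata_restore_cmds("/backup/pvfs-meta.img", "/dev/sdb1")
    )
    for argv in plan:
        print(shlex.join(argv))
```

The dd commands run on the manager node and the tar command runs on each I/O node, once per data directory. Double-check the of= target before running any restore, since dd will overwrite whatever partition you name.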
How can I contribute to the PVFS project?
We are always looking for help with implementing new features, testing,
or simply commenting on what we are doing. If you are interested, have
a look at the developers page for more information.
What are the glibc wrappers and/or where are they now?
The glibc wrappers are no longer supported as of PVFS version 1.5.0.
They were discontinued because their
functionality has been subsumed by the pvfs-kernel package.
There may still be references to these wrappers in documentation
occasionally, but this will be corrected over time. For curious
readers, the wrappers were a mechanism for providing compatibility with
existing applications by wrapping libc I/O function calls and trapping
the ones that dealt with PVFS. As you may imagine, it was rather
difficult to maintain software that depended on very specific versions
of libc to operate correctly. We now provide a much higher level of
compatibility through a kernel module client side implementation.
When did you add symlinks support to PVFS v1?
We added symbolic link support in PVFS 1.6.1. Hard links are not supported.
Can I add, remove, or change the order of the
I/O daemons on an existing PVFS file system?
No. If you need to add, remove, or swap I/O servers in the existing
list (which can usually be found in the /pvfs-meta/.iodtab file), we
recommend that you rebuild your file system. The safest thing to do is
to copy all of your data to another location, delete all of the files on
the existing PVFS file system, make your changes, restart PVFS, and then copy your data
back onto the file system. All of the PVFS components rely on the ordering of
I/O servers listed in the .iodtab file, and altering it will result in
file corruption. We realize that this is really inconvenient, but we
really don't have a better solution at this time. Future releases will
hopefully better address this issue.
Does PVFS work across heterogeneous
architectures?
Some. Currently PVFS only works on mixed x86 and IA64 clusters, or on
Alpha-only clusters.
How do I keep the locate cron job from scanning
the PVFS directory?
Most Linux distributions allow you to control this from existing configuration
files. On SuSE, you can edit the UPDATEDB_PRUNEPATHS setting in /etc/rc.config. On
Red Hat, you can edit /etc/cron.daily/slocate.cron.
Why does df show less free space than I think it
should? What can I do about that?
PVFS calculates free space by multiplying the minimum amount free on any
one iod by the number of iods in use. It does this because when you
fill up the disk on one iod, PVFS is no longer able to write files
out with the default stripe (in general). PVFS doesn't try to modify
the default stripe to adjust to full disks.
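The calculation described above can be written out in a few lines of Python. The function name and the sample numbers are made up for illustration, but the formula is the one given in the text: the minimum free space on any one iod, multiplied by the number of iods.

```python
from typing import Sequence


def pvfs_reported_free(free_per_iod: Sequence[int]) -> int:
    """Free space as described above: min free on any iod times iod count."""
    if not free_per_iod:
        raise ValueError("need at least one iod")
    return min(free_per_iod) * len(free_per_iod)


if __name__ == "__main__":
    # Hypothetical free space per iod, in gigabytes; one iod is nearly full.
    free_gb = [40, 40, 10, 40]
    print("sum of free space on all iods:", sum(free_gb), "GB")
    print("free space by the PVFS formula:", pvfs_reported_free(free_gb), "GB")
```

With these sample numbers the disks hold 130 GB of free space in total, but the formula yields 4 x 10 = 40 GB, because the fullest iod limits how much striped data can still be written.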
If the I/O server local file systems are used for things other than PVFS,
or if large numbers of small files are being stored on PVFS, then the
free space available on I/O servers might not be roughly equal. In the
case of non-PVFS files on the I/O server's file system, you should move
them :). If lots of small files are being stored on PVFS, you might
want to consider using the u2p utility to redistribute the files and/or
using the random base (-r) option on the manager in order to place new
files on other servers. Note that the use of the random base option
will likely result in a situation where new I/O servers cannot be added
to the existing file system, so keep that in mind.
When I try to compile pvfs-kernel, I get
an error message that says: /usr/include/linux/modversions.h:1:2:
#error Modules should never use kernel-headers system headers.
What's wrong?
Make sure that you have the proper kernel source/headers installed and configured
on your system. Check the INSTALL file included in pvfs-kernel for more details.
Can I use multiple managers (mgr processes)
in PVFS?
No. PVFS only allows one mgr process per file system.
You can run as many iods as you like, however. Fortunately,
the constraint on the number of mgr processes is not as much of
a bottleneck as most people expect. The mgr is _not_ involved in
I/O operations (reads/writes) at all; these are handled directly
between clients and iods. The mgr is only used for handling
metadata, and therefore is contacted only for operations such
as directory listings, opening and closing files, and changing
permissions. A single mgr process is sufficient for these types
of operations in most environments.
I see unresolved symbols errors when I try to
load the kernel module. What should I do?
Please check the INSTALL file in pvfs-kernel for more information. Most likely you need to verify that you have the correct kernel headers installed and configured on your system.
When I load the pvfs module I see the
following message: "devfs_register(pvfsd): could not append to
parent, err: -17". What does this mean?
Everything is fine. This is a warning that occurs on later 2.4.x kernels (at least 2.4.18 and above) if you are using devfs. It happens because of the way we handle creating device entries for backwards compatibility with kernels that do not use devfs.
How can I use multiple disks on each of my I/O servers?
Currently the best way to take advantage of multiple disks in a single
I/O server is to use some form of RAID or disk array solution to first
create a local file system that spans the disks, then tell the iod to
use a subdirectory on that disk to store its data.
You should search online for "Linux software RAID" for more
information on setting up such a local disk configuration with
commodity hardware.
Does PVFS have a maximum file system size? If so, what is it?
No, PVFS doesn't have an inherent maximum file system size.
For a long time Linux had a maximum addressable block device size of
2TB, which meant that any file system residing on a single block
device could be at most 2TB in size. At the time of writing, patches
existed that would work around this, but these patches for the most
part weren't available in most kernel distributions. This means that
the local file systems for iods can only be 2TB, so you're limited to
2TB per iod due to this constraint.
PVFS doesn't deal with block devices, nor does the pvfs kernel code
deal in terms of block devices, so this 2TB limit has no impact on
PVFS other than the above mentioned limit on the local storage region
size for an iod.
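As a back-of-the-envelope illustration, the per-iod limit puts an upper bound on the raw storage of a whole PVFS file system that grows with the number of iods. The Python sketch below assumes "2TB" means 2 x 2^40 bytes and that every iod sits on its own maximum-size local file system. Both are assumptions made for this example, as is the function name.

```python
TIB = 2 ** 40
PER_IOD_LIMIT_BYTES = 2 * TIB  # assumed reading of the 2TB block device limit


def max_raw_capacity_bytes(num_iods: int) -> int:
    """Upper bound on raw storage when each iod is capped at 2TB."""
    if num_iods < 1:
        raise ValueError("need at least one iod")
    return num_iods * PER_IOD_LIMIT_BYTES


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n} iods: up to {max_raw_capacity_bytes(n) // TIB} TB of raw storage")
```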