It is a clustered file system. It breaks a file into blocks of a configured size, less than 1 megabyte each, which are distributed across multiple cluster nodes. The system stores data on standard block storage volumes, but includes an internal RAID layer that can virtualize those volumes for redundancy and parallel access, much like a RAID block storage system. It can also replicate across volumes at the higher file level.

Features of the architecture include:
• Distributed metadata, including the directory tree. There is no single "directory controller" or "index server" in charge of the filesystem.
• Efficient indexing of directory entries for very large directories.
• Distributed locking. This allows for full POSIX filesystem semantics, including locking for exclusive file access.
• Partition awareness. A network failure may partition the filesystem into two or more groups of nodes that can only see the nodes in their own group. This is detected through a heartbeat protocol, and when a partition occurs the filesystem remains live for the largest partition formed. This offers graceful degradation: some machines keep working.
• Online filesystem maintenance. Most filesystem maintenance chores (adding new disks, rebalancing data across disks) can be performed while the filesystem is live, maximizing the availability of the filesystem, and thus of the supercomputer cluster itself.

Other features include high availability, the ability to be used in a heterogeneous cluster, disaster recovery, security, DMAPI, HSM and ILM.
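Because the distributed lock manager provides full POSIX semantics, ordinary byte-range locks behave on a clustered mount just as they do on a local filesystem. A minimal sketch using Python's standard `fcntl` module (the file path is purely illustrative):

```python
import fcntl
import os

# Open (or create) a file that, on a clustered filesystem, could be
# visible from every node. The path is illustrative, not prescriptive.
path = "shared_data.bin"
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

# Take an exclusive lock on the first 4096 bytes. On a filesystem with
# distributed locking, a conflicting lock attempt from a process on
# another node would block until this lock is released.
fcntl.lockf(fd, fcntl.LOCK_EX, 4096, 0, os.SEEK_SET)

os.write(fd, b"updated by exactly one process at a time")

# Release the byte range and close the file.
fcntl.lockf(fd, fcntl.LOCK_UN, 4096, 0, os.SEEK_SET)
os.close(fd)
```

The same `fcntl`/`lockf` calls work unchanged on a single machine; what the clustered filesystem adds is that the exclusion is enforced across all nodes, not just among local processes.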
==Compared to Hadoop Distributed File System (HDFS)==
Hadoop's HDFS filesystem is designed to store similar or greater quantities of data on commodity hardware, that is, in datacenters without RAID disks or a storage area network (SAN).
• HDFS also breaks files up into blocks, and stores them on different filesystem nodes.
• GPFS has full POSIX filesystem semantics.
• GPFS distributes its directory indices and other metadata across the filesystem. Hadoop, in contrast, keeps this on the Primary and Secondary Namenodes, large servers which must store all index information in RAM.
• GPFS breaks files up into small blocks. Hadoop HDFS prefers blocks of 64 MB or more, as this reduces the storage requirements of the Namenode. Small blocks or many small files fill up a filesystem's indices quickly, and so limit the filesystem's size.

==Information lifecycle management==