Summary Examples of features specific to ZFS include: • Designed for long-term storage of data and for indefinitely scaled datastore sizes, with zero data loss and high configurability. • Hierarchical
checksumming of all data and
metadata, ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent
block, rather than with the block itself. This contrasts with many file systems where checksums (if held) are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect. • Can store a user-specified number of copies of data or metadata, or selected types of data, to improve the ability to recover from data corruption of important files and structures. • Automatic rollback of recent changes to the file system and data, in some circumstances, in the event of an error or inconsistency. • Automated and (usually) silent self-healing of data inconsistencies and write failure when detected, for all errors where the data is capable of reconstruction. Data can be reconstructed using all of the following: error detection and correction checksums stored in each block's parent block; multiple copies of data (including checksums) held on the disk; write intentions logged on the SLOG (ZIL) for writes that should have occurred but did not occur (after a power failure); parity data from RAID/RAID-Z disks and volumes; copies of data from mirrored disks and volumes. • Native handling of standard RAID levels and additional ZFS RAID layouts ("
RAID-Z"). The RAID-Z levels stripe data across only the disks required, for efficiency (many RAID systems stripe indiscriminately across all devices), and checksumming allows rebuilding of inconsistent or corrupted data to be minimized to those blocks with defects; • Native handling of tiered storage and caching devices, which is usually a volume related task. Because ZFS also understands the file system, it can use file-related knowledge to inform, integrate, and optimize its tiered storage handling which a separate device cannot; • Native handling of snapshots and backup/
replication which can be made efficient by integrating the volume and file handling. Relevant tools are provided at a low level and require external scripts and software for utilization. • Native
data compression and
deduplication, although the latter is largely handled in
RAM and is memory hungry. • Efficient rebuilding of RAID arrays—a RAID controller often has to rebuild an entire disk, but ZFS can combine disk and file knowledge to limit any rebuilding to data which is actually missing or corrupt, greatly speeding up rebuilding; • Unaffected by RAID hardware changes which affect many other systems. On many systems, if self-contained RAID hardware such as a RAID card fails, or the data is moved to another RAID system, the file system will lack information that was on the original RAID hardware, which is needed to manage data on the RAID array. This can lead to a total loss of data unless near-identical hardware can be acquired and used as a "stepping stone". Since ZFS manages RAID itself, a ZFS pool can be migrated to other hardware, or the operating system can be reinstalled, and the RAID-Z structures and data will be recognized and immediately accessible by ZFS again. • Ability to identify data that would have been found in a cache but has been discarded recently instead; this allows ZFS to reassess its caching decisions in light of later use and facilitates very high cache-hit levels (ZFS cache hit rates are typically over 80%); • Alternative caching strategies can be used for data that would otherwise cause delays in data handling. For example, synchronous writes which are capable of slowing down the storage system can be converted to asynchronous writes by being written to a fast separate caching device, known as the SLOG (sometimes called the ZIL—ZFS Intent Log). • Highly tunable—many internal parameters can be configured for optimal functionality. • Can be used for
high availability clusters and computing, although not fully designed for this use.
Data integrity One major feature that distinguishes ZFS from other
file systems is that it is designed with a focus on data integrity by protecting the user's data on disk against
silent data corruption caused by
data degradation, power surges (
voltage spikes), bugs in disk
firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc. A 1999 study showed that neither the then-major and widespread filesystems (such as
UFS,
Ext,
XFS,
JFS, or
NTFS) nor hardware RAID (which has
some issues with data integrity) provided sufficient protection against data corruption problems. Initial research indicates that ZFS protects data better than earlier efforts. It is also faster than UFS and can be seen as its replacement. Within ZFS, data integrity is achieved by using a
Fletcher-based checksum or a
SHA-256 hash throughout the file system tree. Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at
its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a
Merkle tree. It is optionally possible to provide additional in-pool redundancy by specifying copies=2 (or copies=3), which means that data will be stored twice (or three times) on the disk, effectively halving (or, for copies=3, reducing to one-third) the storage capacity of the disk. Additionally, some kinds of data used by ZFS to manage the pool are stored multiple times by default for safety, even with the default copies=1 setting. If other copies of the damaged data exist or can be reconstructed from checksums and
parity data, ZFS will use a copy of the data (or recreate it via a RAID recovery mechanism) and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored. If there are no copies of the damaged data, ZFS puts the pool in a faulted state, preventing its future use and providing no documented ways to recover pool contents. Consistency of data held in memory, such as cached data in the ARC, is not checked by default, as ZFS is expected to run on enterprise-quality hardware with
error-correcting RAM. However, the capability to check in-memory data exists and can be enabled using "debug flags".
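The parent-held checksum arrangement described above can be illustrated with a short sketch (the structures and names below are illustrative only, not ZFS's on-disk format): each block pointer kept by a parent carries the checksum of the child it references, so a read verifies the child against its parent's copy, and corruption of the child cannot silently corrupt the checksum used to detect it.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# "Disk": block address -> block contents (illustrative, not ZFS's on-disk layout).
disk = {
    "leaf0": b"file data, part 0",
    "leaf1": b"file data, part 1",
}

# The parent (indirect) block stores, for each child, an address *and* the
# child's checksum -- the checksum lives with the parent, not with the child.
indirect_block = [(addr, checksum(disk[addr])) for addr in ("leaf0", "leaf1")]

# The root in turn checksums the indirect block, continuing the Merkle-style
# tree all the way to the top of the pool.
root_checksum = checksum(repr(indirect_block).encode())

def read_child(addr: str, expected: str) -> bytes:
    data = disk[addr]
    if checksum(data) != expected:
        raise IOError(f"{addr}: checksum mismatch, block is corrupt")
    return data

# Simulate silent corruption on disk: the parent's checksum exposes it on read.
disk["leaf1"] = b"bit rot"
for addr, expected in indirect_block:
    try:
        read_child(addr, expected)
    except IOError as err:
        print(err)   # leaf1: checksum mismatch, block is corrupt
```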
RAID For ZFS to be able to guarantee data integrity, it needs multiple copies of the data or parity information, usually spread across multiple disks. This is typically achieved by using either a
RAID controller or so-called "soft" RAID (built into a
file system).
Avoidance of hardware RAID controllers While ZFS
can work with hardware
RAID devices, it functions more efficiently and with greater data protection when provided with raw access to all storage devices. ZFS relies on direct communication with the disk to determine the moment data is confirmed as safely written and has numerous
algorithms designed to optimize its use of
caching,
cache flushing, and disk handling. Disks connected to the system using a hardware, firmware, other "soft" RAID, or any other controller that modifies the ZFS-to-disk
I/O path will affect ZFS performance and data integrity. If a third-party device performs caching or presents drives to ZFS as a single system without the
low level view ZFS relies upon, there is a much greater chance that the system will perform
suboptimally and that ZFS will be less able to prevent failures, will recover from them more slowly, or may lose data to a write failure. For example, if a hardware RAID card is used, ZFS may not be able to determine the condition of disks, determine whether the RAID array is degraded or rebuilding, detect all data corruption, place data optimally across the disks, make selective repairs, control how repairs are balanced with ongoing use, or make repairs that it could usually undertake. The hardware RAID card will interfere with ZFS's algorithms. RAID controllers also usually add controller-dependent data to the drives, which prevents software RAID from accessing the user data. In the case of a hardware RAID controller failure, it may be possible to read the data with another compatible controller, but this is not always possible and a replacement may not be available. Alternate hardware RAID controllers may not understand the original manufacturer's custom data required to manage and restore an array.
offload resources and processing to enhance performance and reliability, with ZFS it is strongly recommended that these methods
not be used, as they typically
reduce the system's performance and reliability. If disks must be attached through a RAID or other controller, it is recommended to minimize the amount of processing done in the controller by using a plain
HBA (host adapter), a simple
fanout card, or by configuring the card in
JBOD mode (i.e. turn off RAID and caching functions), to allow devices to be attached with minimal changes in the ZFS-to-disk I/O pathway. A RAID card in JBOD mode may still interfere if it has a cache or, depending upon its design, may detach drives that do not respond in time (as has been seen with many energy-efficient consumer-grade hard drives), and as such, may require
Time-Limited Error Recovery (TLER)/CCTL/ERC-enabled drives to prevent drive dropouts, so not all cards are suitable even with RAID functions disabled.
ZFS's approach: RAID-Z and mirroring Instead of hardware RAID, ZFS employs "soft" RAID, offering
RAID-Z (
parity based like
RAID 5 and similar) and
disk mirroring (similar to
RAID 1). The schemes are highly flexible. RAID-Z is a data/parity distribution scheme like
RAID-5, but uses dynamic stripe width: every block is its own RAID stripe, regardless of blocksize, resulting in every RAID-Z write being a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, eliminates the
write hole error. RAID-Z is also faster than traditional RAID 5 because it does not need to perform the usual
read-modify-write sequence. As all stripes are of different sizes, RAID-Z reconstruction has to traverse the filesystem metadata to determine the actual RAID-Z geometry. This would be impossible if the filesystem and the RAID array were separate products, whereas it becomes feasible when there is an integrated view of the logical and physical structure of the data. Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes, whereas traditional RAID products usually cannot do this. The need for RAID-Z3 arose in the early 2000s as multi-terabyte capacity drives became more common. This increase in capacity—without a corresponding increase in throughput speeds—meant that rebuilding an array due to a failed drive could "easily take weeks or months" to complete.
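A minimal sketch of the dynamic-stripe idea, assuming a single-parity layout and plain XOR parity (real RAID-Z also supports double and triple parity, pads allocations, and checksums every block): one logical block is split across only as many data columns as it needs, parity is computed for that block alone, and a lost column can be rebuilt from the survivors.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raidz1_write(block: bytes, ndisks: int, sector: int = 512):
    """Toy single-parity (RAID-Z1-like) full-stripe write for ONE logical block.
    Real RAID-Z also offers double/triple parity, pads allocations, and
    checksums every block; this only shows the dynamic stripe width idea."""
    # Use only as many data columns as this particular block needs.
    columns = [block[i:i + sector].ljust(sector, b"\0")
               for i in range(0, len(block), sector)]
    assert len(columns) <= ndisks - 1, "block too large for this toy stripe"
    parity = reduce(xor, columns)
    return columns, parity

def rebuild(columns, parity, lost: int) -> bytes:
    """Recover one lost data column from the survivors plus parity."""
    survivors = [c for i, c in enumerate(columns) if i != lost] + [parity]
    return reduce(xor, survivors)

cols, parity = raidz1_write(b"A" * 1200, ndisks=4)   # 3 data columns + 1 parity
assert rebuild(cols, parity, lost=1) == cols[1]      # lost column reconstructed
```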
Resilvering and scrub (array syncing and integrity checking) ZFS has no tool equivalent to
fsck (the standard Unix and Linux data checking and repair tool for file systems). Instead, ZFS has a built-in
scrub function which regularly examines all data and repairs silent corruption and other problems. Some differences are: • fsck must be run on an offline filesystem, which means the filesystem must be unmounted and is not usable while being repaired, while scrub is designed to be used on a mounted, live filesystem, and does not need the ZFS filesystem to be taken offline. • fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still not match the original data as stored. • fsck cannot always validate and repair data when checksums are stored with data (often the case in many file systems), because the checksums may also be corrupted or unreadable. ZFS always stores checksums separately from the data they verify, improving reliability and the ability of scrub to repair the volume. ZFS also stores multiple copies of data—metadata, in particular, may have upwards of 4 or 6 copies (multiple copies per disk and multiple disk mirrors per volume), greatly improving the ability of scrub to detect and repair extensive damage to the volume, compared to fsck. • scrub checks everything, including metadata and the data. The effect can be observed by comparing fsck to scrub times—sometimes an fsck on a large RAID completes in a few minutes, which means only the metadata was checked. Traversing all metadata and data on a large RAID takes many hours, which is exactly what scrub does. • while fsck detects and tries to fix errors using available filesystem data, scrub relies on redundancy to recover from issues. While fsck offers to fix the file system with partial data loss, scrub puts it into a faulted state if there is no redundancy.
Capacity ZFS is a
128-bit file system. Some theoretical limits in ZFS are: • 16 exbibytes (2^64 bytes): maximum size of a single file • 2^48: number of entries in any individual directory • 16 exbibytes: maximum size of any attribute • 2^56: number of attributes of a file (actually constrained to 2^48 for the number of files in a directory) • 256 quadrillion zebibytes (2^128 bytes): maximum size of any zpool • 2^64: number of devices in any zpool • 2^64: number of file systems in a zpool • 2^64: number of zpools in a system
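A quick arithmetic check of how the headline figures follow from the 64-bit and 128-bit field widths (the unit conversions below are the only assumption):

```python
# How the quoted limits follow from ZFS's 64-bit and 128-bit field widths.
EiB = 2 ** 60   # exbibyte
ZiB = 2 ** 70   # zebibyte

print(2 ** 64 // EiB)    # 16 -> 16 EiB maximum single-file size (2^64 bytes)
print(2 ** 128 // ZiB)   # 288230376151711744 ZiB = 256 * 2^50 ZiB,
                         # i.e. the "256 quadrillion zebibytes" per zpool
```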
Encryption With Oracle Solaris, the encryption capability in ZFS is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system offline. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command to switch to a new data encryption key, for a clone or at any time, is provided—this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism. The encryption feature is also fully integrated into OpenZFS 0.8.0, available for Debian and Ubuntu Linux distributions. There have been anecdotal end-user reports of failures when using ZFS native encryption; an exact cause has not been established.
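The wrapping-key/data-key split can be sketched as follows; the sketch uses the third-party Python cryptography package's Fernet recipe purely for illustration (ZFS itself uses AES in CCM or GCM mode, and nothing here is part of any ZFS API). Changing the wrapping key only re-wraps the stored data encryption key, which is why existing data never needs to be re-encrypted.

```python
# Sketch of the wrapping-key / data-encryption-key split. Fernet (from the
# third-party "cryptography" package) stands in for the real cipher purely
# for illustration -- ZFS uses AES in CCM or GCM mode, not Fernet.
from cryptography.fernet import Fernet

wrapping_key = Fernet.generate_key()   # supplied/managed by the administrator
data_key = Fernet.generate_key()       # random, created when the dataset is made

# The data key is only ever stored in wrapped (encrypted) form.
wrapped_data_key = Fernet(wrapping_key).encrypt(data_key)

ciphertext = Fernet(data_key).encrypt(b"file contents")

# Changing the wrapping key re-wraps the small data key; the bulk data is
# never re-encrypted, which is why the key change is cheap and online.
new_wrapping_key = Fernet.generate_key()
plain_data_key = Fernet(wrapping_key).decrypt(wrapped_data_key)
wrapped_data_key = Fernet(new_wrapping_key).encrypt(plain_data_key)

assert Fernet(plain_data_key).decrypt(ciphertext) == b"file contents"
```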
Read/write efficiency ZFS will automatically allocate data storage across all vdevs in a pool (and all devices in each vdev) in a way that generally maximises the performance of the pool. ZFS also updates its write strategy to take account of new disks when they are added to a pool. As a general rule, ZFS allocates writes across vdevs based on the free space in each vdev, so that vdevs which already hold proportionately less data are given more writes when new data is to be stored. This helps to ensure that, as the pool fills, some vdevs do not become full and force writes to occur on a limited number of devices. It also means that when data is read (and reads are much more frequent than writes in most uses), different parts of the data can be read from as many disks as possible at the same time, giving much higher read performance. Therefore, as a general rule, pools and vdevs should be managed and new storage added so that some vdevs in a pool do not end up almost full while others are almost empty, as this will make the pool less efficient. Free space in ZFS tends to become fragmented with usage. ZFS does not have a mechanism for defragmenting free space, and there are anecdotal end-user reports of diminished performance when high free-space fragmentation is coupled with disk space over-utilization.
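A rough sketch of the free-space-biased placement described above (a deliberate simplification: the real ZFS allocator weighs additional factors beyond free space, and the vdev names here are invented):

```python
import random

def choose_vdev(vdevs):
    """Pick a vdev for the next write, weighted by remaining free space.
    A simplification: the real ZFS allocator considers additional factors,
    but this captures the bias toward emptier vdevs described above."""
    free = [v["size"] - v["used"] for v in vdevs]
    return random.choices(vdevs, weights=free, k=1)[0]

vdevs = [
    {"name": "mirror-0", "size": 4000, "used": 3500},   # nearly full
    {"name": "mirror-1", "size": 4000, "used": 500},    # recently added, mostly empty
]

# Most new writes land on the emptier vdev, evening out utilisation over time.
placements = [choose_vdev(vdevs)["name"] for _ in range(10_000)]
print(placements.count("mirror-1") / len(placements))   # roughly 0.875
```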
Other features Storage devices, spares, and quotas Pools can have hot spares to compensate for failing disks. When mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis. Storage pool composition is not limited to similar devices, but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to datasets (file system instances or ZVOLs) as needed. Arbitrary storage device types can be added to existing pools to expand their size. The storage capacity of all vdevs is available to all of the file system instances in the zpool. A
quota can be set to limit the amount of space a file system instance can occupy, and a
reservation can be set to guarantee that space will be available to a file system instance.
Caching mechanisms: ARC, L2ARC, Transaction groups, ZIL, SLOG, Special VDEV ZFS uses different layers of disk cache to speed up read and write operations. Ideally, all data should be stored in RAM, but that is usually too expensive. Therefore, data is automatically cached in a hierarchy to optimize performance versus cost; these are often called "hybrid storage pools". Frequently accessed data will be stored in RAM, and less frequently accessed data can be stored on slower media, such as
solid-state drives (SSDs). Data that is not often accessed is not cached and is left on the slow hard drives. If old data is suddenly read a lot, ZFS will automatically move it to SSDs or to RAM.
solid-state drives (SSDs)), for a total of four caches. A number of other caches, cache divisions, and queues also exist within ZFS. For example, each VDEV has its own data cache, and the ARC cache is divided between data stored by the user and
metadata used by ZFS, with control over the balance between these.
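A toy model of the read side of this hierarchy, assuming a simple LRU policy at both levels (the real ARC uses an adaptive replacement policy, and all names here are illustrative): a read is served from RAM if possible, then from the fast device, and only then from the pool's disks, with entries evicted from RAM demoted to the fast device.

```python
from collections import OrderedDict

class TieredReadCache:
    """Toy read path through the two cache levels described above: RAM first,
    then a fast device, then the pool's disks. LRU is used here for brevity;
    the real ARC uses an adaptive replacement policy."""

    def __init__(self, ram_slots: int, fast_slots: int):
        self.ram = OrderedDict()    # stands in for the ARC (in RAM)
        self.fast = OrderedDict()   # stands in for the L2ARC (on SSD)
        self.ram_slots, self.fast_slots = ram_slots, fast_slots

    def read(self, block_id, read_from_disk):
        if block_id in self.ram:                 # level 1 hit (RAM)
            self.ram.move_to_end(block_id)
            return self.ram[block_id]
        if block_id in self.fast:                # level 2 hit (fast device)
            data = self.fast.pop(block_id)
        else:                                    # miss: read from the pool disks
            data = read_from_disk(block_id)
        self._promote(block_id, data)
        return data

    def _promote(self, block_id, data):
        self.ram[block_id] = data
        if len(self.ram) > self.ram_slots:       # demote the coldest RAM entry
            old_id, old_data = self.ram.popitem(last=False)
            self.fast[old_id] = old_data
            if len(self.fast) > self.fast_slots:
                self.fast.popitem(last=False)

cache = TieredReadCache(ram_slots=2, fast_slots=4)
print(cache.read("blk-1", lambda b: f"data for {b}".encode()))
```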
Special VDEV Class In OpenZFS 0.8 and later, it is possible to configure a Special VDEV class to preferentially store filesystem metadata, and optionally the Data Deduplication Table (DDT) and small filesystem blocks.
Copy-on-write transactional model ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a checksum of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any
metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and ZIL (
intent log) write cache is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see
Merkle signature scheme).
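A compact sketch of this copy-on-write update path, with made-up structures (transaction grouping, checksums, and the ZIL are omitted): modifying a leaf allocates a new leaf, a new indirect block that references it, and a new root, while every old block remains intact and readable, which is what makes snapshots cheap.

```python
import itertools

# Toy copy-on-write update: blocks are never overwritten in place, so changing
# a leaf allocates a new leaf, a new indirect block pointing at it, and a new
# root. Transaction grouping and checksums are omitted for brevity.
disk = {}                                 # address -> immutable block contents
next_addr = itertools.count()

def write_block(data: bytes) -> int:
    addr = next(next_addr)                # always a freshly allocated block
    disk[addr] = data
    return addr

# Initial tree: two leaves referenced by an indirect block, referenced by root.
leaf_a = write_block(b"old contents of A")
leaf_b = write_block(b"contents of B")
indirect = write_block(f"{leaf_a},{leaf_b}".encode())
root = write_block(str(indirect).encode())

# Copy-on-write update of A: new blocks all the way up, old ones left intact.
new_leaf_a = write_block(b"new contents of A")
new_indirect = write_block(f"{new_leaf_a},{leaf_b}".encode())
new_root = write_block(str(new_indirect).encode())

assert disk[leaf_a] == b"old contents of A"   # old version still readable
```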
Snapshots and clones An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing the old data can be retained, allowing a
snapshot version of the file system to be maintained. ZFS snapshots are consistent (they reflect the entire data as it existed at a single point in time), and can be created extremely quickly, since all the data composing the snapshot is already stored, with the entire storage pool often snapshotted several times per hour. They are also space-efficient, since any unchanged data is shared among the file system and its snapshots. Snapshots are inherently read-only, ensuring they will not be modified after creation, although they should not be relied on as a sole means of backup. Entire snapshots can be restored, as can individual files and directories within them. Writable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the copy-on-write principle.
Sending and receiving snapshots ZFS file systems can be moved to other pools, including pools on remote hosts over the network, as the send command creates a stream representation of the file system's state. This stream can either describe the complete contents of the file system at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size depends on the number of blocks changed between the snapshots. This provides an efficient strategy, e.g., for synchronizing offsite backups or high-availability mirrors of a pool.
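One way to see why the delta is cheap to compute: ZFS block pointers record the transaction group in which each block was born, so an incremental stream only has to carry blocks born after the base snapshot. The sketch below uses made-up structures, not the actual zfs send stream format.

```python
# Each block pointer records the transaction group ("birth txg") in which the
# block was written, so the delta between two snapshots is simply the blocks
# born after the base snapshot. Illustrative only, not the zfs send format.
blocks = [
    {"addr": 1, "birth_txg": 100, "data": b"unchanged since the base snapshot"},
    {"addr": 2, "birth_txg": 180, "data": b"rewritten after the base snapshot"},
    {"addr": 3, "birth_txg": 205, "data": b"newly written file data"},
]

def delta_stream(blocks, from_txg, to_txg):
    """Blocks an incremental send from @from to @to would have to carry."""
    return [b for b in blocks if from_txg < b["birth_txg"] <= to_txg]

print(delta_stream(blocks, from_txg=150, to_txg=210))   # only addresses 2 and 3
```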
Dynamic striping Dynamic
striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus, all disks in a pool are used, which balances the write load across them.
Variable block sizes ZFS uses variable-sized blocks, with 128 KB as the default size. Available features allow the administrator to tune the maximum block size which is used, as certain workloads do not perform well with large blocks. If
data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
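A hedged sketch of that size decision (the exact threshold ZFS applies is a policy detail; here the compressed form is kept only if it saves at least one whole allocation unit, and zlib stands in for the pool's compression algorithm):

```python
import zlib

def on_disk_form(block: bytes, sector: int = 4096):
    """Keep the compressed form only if it frees at least one whole allocation
    unit; otherwise store the block uncompressed. Illustrative policy only,
    with zlib standing in for the pool's compression algorithm."""
    compressed = zlib.compress(block)
    raw_sectors = -(-len(block) // sector)            # ceiling division
    comp_sectors = -(-len(compressed) // sector)
    if comp_sectors < raw_sectors:
        return compressed, comp_sectors * sector
    return block, raw_sectors * sector

data = b"highly repetitive log line\n" * 4000          # compresses very well
stored, on_disk = on_disk_form(data)
print(len(data), on_disk)                              # 108000 bytes -> 4096 on disk
```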
Lightweight filesystem creation In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or expand a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.
Adaptive endianness Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an
endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory. This does not affect the stored data; as is usual in
POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
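A small sketch of the endian-adaptive idea, with an invented record layout (ZFS's actual metadata structures differ): the writer stores fields in its native byte order plus a flag, and the reader byte-swaps only when the stored order differs from its own.

```python
import struct
import sys

# Toy endian-adaptive record: fields are written in the writer's native byte
# order together with a one-byte flag, and byte-swapped on read only if the
# reader's order differs. The field layout is invented, not ZFS's dnode format.
def write_record(object_id: int, size: int) -> bytes:
    flag = b"L" if sys.byteorder == "little" else b"B"
    return flag + struct.pack("=QQ", object_id, size)     # native byte order

def read_record(buf: bytes):
    stored_order = "<QQ" if buf[:1] == b"L" else ">QQ"    # honour the stored order
    return struct.unpack(stored_order, buf[1:])

print(read_record(write_record(object_id=42, size=131072)))   # (42, 131072)
```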
Deduplication Data deduplication capabilities were added to the ZFS source repository at the end of October 2009, and relevant OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128). Effective use of deduplication may require large RAM capacity; recommendations range between 1 and 5 GB of RAM for every TB of storage. An accurate assessment of the memory required for deduplication is made by referring to the number of unique blocks in the pool, and the number of bytes on disk and in RAM ("core") required to store each record—these figures are reported by inbuilt commands such as zpool and zdb. Insufficient physical memory or lack of ZFS cache can result in virtual memory
thrashing when using deduplication, which can cause performance to plummet, or result in complete memory starvation. Because deduplication occurs at write-time, it is also very CPU-intensive and this can also significantly slow down a system. Other storage vendors use modified versions of ZFS to achieve very high
data compression ratios. Two examples in 2012 were GreenBytes and Tegile. In May 2014, Oracle bought GreenBytes for its ZFS deduplication and replication technology. As described above, deduplication is usually
not recommended due to its heavy resource requirements (especially RAM) and impact on performance (especially when writing), other than in specific circumstances where the system and data are well-suited to this space-saving technique.
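The arithmetic behind the "GB of RAM per TB of storage" guidance can be sketched as follows; the per-entry size and average block size below are assumptions for the sake of the example (real figures should be taken from zpool and zdb output as noted above):

```python
# Rough arithmetic behind the "RAM per TB" guidance. Both figures below are
# assumptions for the sake of the example; real per-record sizes and unique
# block counts should be read from zpool/zdb output as noted above.
TiB = 2 ** 40
pool_data = 10 * TiB                  # amount of data to be deduplicated
avg_block_size = 64 * 2 ** 10         # assumed 64 KiB average block size
bytes_per_ddt_entry = 320             # assumed in-core size of one table entry

unique_blocks = pool_data // avg_block_size      # worst case: nothing deduplicates
ddt_ram = unique_blocks * bytes_per_ddt_entry
print(f"{ddt_ram / 2**30:.0f} GiB of RAM for the deduplication table")   # 50 GiB
```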
Additional capabilities • Explicit I/O priority with deadline scheduling. • Claimed globally optimal I/O sorting and aggregation. • Multiple independent prefetch streams with automatic length and stride detection. • Parallel, constant-time directory operations. • End-to-end checksumming, using a kind of "
Data Integrity Field", allowing data corruption detection (and recovery if there is redundancy in the pool). A choice of three hashes can be used, optimized for speed (Fletcher), standardization and security (SHA-256), and salted hashing (
Skein). • Transparent filesystem compression. Supports
LZJB,
gzip,
LZ4 and
Zstd. • Intelligent
scrubbing and
resilvering (resyncing). • Load and space usage sharing among disks in the pool. • Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance). If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure. • ZFS design (copy-on-write + superblocks) is safe when using disks with write cache enabled, if they honor the write barriers. This feature provides safety and a performance boost compared with some other filesystems. • On Solaris, when entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know if other slices are managed by non-write-cache safe filesystems, like
UFS. The FreeBSD implementation can handle disk flushes for partitions thanks to its
GEOM framework, and therefore does not suffer from this limitation. • Per-user, per-group, per-project, and per-dataset quota limits. • Filesystem encryption since Solaris 11 Express, and OpenZFS (ZoL) 0.8. (on some other systems ZFS can utilize encrypted disks for a similar effect;
GELI on FreeBSD can be used this way to create fully encrypted ZFS storage). • Pools can be imported in read-only mode. • It is possible to recover data by rolling back entire transactions at the time of importing the zpool. • Snapshots can be taken manually or automatically. The older versions of the stored data that they contain can be exposed as full read-only file systems. They can also be exposed as historic versions of files and folders when used with
CIFS (also known as
SMB, Samba or
file shares); this is known as "Previous versions", "VSS shadow copies", or "File history" on
Windows, or
AFP and "Apple Time Machine" on Apple devices. • Disks can be marked as 'spare'. A data pool can be set to automatically and transparently handle disk faults by activating a spare disk and beginning to resilver the data that was on the suspect disk onto it, when needed. == Limitations ==