Some errors go unnoticed, being detected by neither the disk firmware nor the host operating system; these errors are known as silent data corruption. There are many error sources beyond the disk storage subsystem itself. For instance, cables might be slightly loose, the power supply might be unreliable, external vibrations such as those from a loud sound might affect the disks, the network might introduce undetected corruption, and cosmic radiation, among many other causes, can produce soft memory errors. In an analysis of 39,000 storage systems, firmware bugs accounted for 5–10% of storage failures. The error rates observed by a CERN study on silent corruption are far higher than one in every 10¹⁶ bits.
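To make a rate of one error per 10¹⁶ bits more concrete, the following Python sketch estimates the probability of at least one silent error when a given amount of data is read, and how long a fast storage array would need to move 10¹⁶ bits. It is an illustrative model only: it assumes that errors strike bits independently at a fixed rate of 10⁻¹⁶ (the CERN observations were worse, and real-world errors are often correlated), and the 10 GB/s throughput is a hypothetical figure, not one taken from the studies above.

```python
import math


def prob_at_least_one_error(bits: float, per_bit_rate: float) -> float:
    """Probability that at least one of `bits` independent bits is corrupted,
    given a per-bit error probability `per_bit_rate` (an assumed model).

    Computes 1 - (1 - p)**n using log1p/expm1 so that tiny rates such as
    1e-16 do not round away to zero in floating point.
    """
    return -math.expm1(bits * math.log1p(-per_bit_rate))


def seconds_to_transfer(bits: float, bytes_per_second: float) -> float:
    """Time needed to move `bits` at a sustained throughput in bytes per second."""
    return (bits / 8) / bytes_per_second


RATE = 1e-16  # illustrative per-bit error rate (assumption)

# The same per-bit rate gives a much higher chance of corruption on larger volumes.
for label, bits in [("1 TB", 8e12), ("100 TB", 8e14), ("1e16 bits", 1e16)]:
    p = prob_at_least_one_error(bits, RATE)
    print(f"{label:>10}: P(at least one error) = {p:.4f}")

# Hypothetical array sustaining 10 GB/s (10**10 bytes per second).
hours = seconds_to_transfer(1e16, 1e10) / 3600
print(f"1e16 bits at 10 GB/s: {hours:.1f} hours")
```

Under these assumptions, reading 10¹⁶ bits carries roughly a 63% chance of at least one silent error, and the hypothetical 10 GB/s array moves that much data in about a day and a half.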
Amazon Web Services acknowledged that data corruption was the cause of a widespread outage of its Amazon S3 storage network in 2008. In 2021, publications by Google and Facebook identified faulty processor cores as an additional cause, with faulty cores appearing at a rate of several per few thousand cores. One problem is that hard disk drive capacities have increased substantially while their per-bit error rates have stayed roughly constant over time, so modern disks are not much safer than old ones. Old disks stored so little data that the probability of any of it being corrupted was very small; modern disks store far more data at a similar per-bit error rate, so the probability that some of their data is corrupted is much larger. For this reason, silent data corruption was not a serious concern while storage devices remained relatively small and slow. With the advent of larger drives and very fast RAID setups, however, users can transfer 10¹⁶ bits in a reasonably short time, easily reaching the data corruption thresholds. As an example,
ZFS creator Jeff Bonwick stated that the fast database at Greenplum, a database software company specializing in large-scale data warehousing and analytics, encounters silent corruption every 15 minutes. As another example, a real-world study performed by NetApp on more than 1.5 million HDDs over 41 months found more than 400,000 silent data corruptions, more than 30,000 of which were not detected by the hardware RAID controller and were found only during scrubbing. Another study, performed by
CERN over six months and involving about 97 petabytes of data, found that about 128 megabytes of data had been silently and permanently corrupted somewhere along the path from network to disk. Silent data corruption may result in
cascading failures, in which the system may run for a period of time with an undetected initial error that causes increasingly more problems until it is ultimately detected. For example, a failure affecting file system metadata can result in multiple files being partially damaged or made completely inaccessible as the file system continues to be used in its corrupted state.

== Countermeasures ==