After a recent SD Card failure on a Raspberry Pi, I decided to research storage devices and configurations to improve performance and device lifetime. This post contains the results of that research.

SD Card Types and Reliability

Prompted by an enlightening comment chain on Hacker News about SD card reliability, I started researching how common NAND flash technologies represent bits in flash cells. In decreasing order of cost and reliability:

Single-Level Cell (SLC): stores one bit per cell.
Multi-Level Cell (MLC): stores two bits per cell.
Triple-Level Cell (TLC): stores three bits per cell.
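
The ordering above follows from how many distinct charge levels a cell has to distinguish: n bits per cell require 2^n levels, which leaves less margin between neighbouring levels. A quick illustration (the helper name `levels_per_cell` is made up for this sketch):

```shell
# Number of distinct charge levels a cell must hold for a given bits-per-cell.
# levels_per_cell is a hypothetical helper used only for this illustration.
levels_per_cell() {
    echo $((1 << $1))
}

for bits in 1 2 3; do
    echo "$bits bit(s) per cell -> $(levels_per_cell "$bits") levels"
done
```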

Due to the high cost of SLC, there are some intermediate technologies which use MLC flash cells with firmware that stores only one bit per cell instead of two. This results in better reliability and longevity than traditional MLC at lower cost than SLC:

advancedMLC (aMLC): ATP Electronics' name for MLC with one bit per cell.
SLC Lite (pSLC): Panasonic's name for MLC with one bit per cell.

For my current project I decided to use an 8GB ATP aMLC card (AF8GSD3A or AF8GUD3A with an adapter - both are available from Digi-Key, Arrow, and other suppliers).

Logical Volumes and Filesystems

For my current project, power failures and hard resets are not uncommon. I need a storage configuration which performs well on an SD Card and is reasonably resistant to corruption after power failure. eMMC/SSD File System Tuning Methodology (2013) by Cogent Embedded, Inc. is a wonderful source of information for this purpose.

F2FS

The most performant configuration appears to be a single partition with F2FS, a filesystem which is optimized for flash storage. Unfortunately, as noted in the “Power-Fail Tolerance” section of that paper, F2FS is unsuitable where power failures occur. Although it now includes an fsck utility, “[the] initial version of the tool does not fix any inconsistency”.

BTRFS

lockheed on Unix & Linux Stack Exchange provided a corruption-resistant configuration using BTRFS RAID. This approach looks promising, with the adjustment noted in the comments to use the BTRFS DUP profile instead of RAID1. As I understand it, the primary differences are that the DUP profile reads only one copy unless corruption is detected, and that the placement of the two copies on disk may differ. However, if the SD card deduplicates data internally, this approach provides no actual redundancy (as noted in the “DUP Profiles on a Single Device” section of the mkfs.btrfs man page). I do not think SD cards currently deduplicate data internally, but this is a significant concern.

Note that BTRFS DUP/RAID is useful here because the filesystem checksums identify which copy is corrupt. Generic software RAID1 across partitions would not reduce corruption because it has no way to tell which copy is bad, so it was not considered.
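
As a rough sketch of the DUP configuration described above, the commands might look like the following. The device and mount point are placeholders rather than values from this post, and the `run` wrapper is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set, since mkfs destroys existing data.

```shell
# Sketch only: DEV and MNT are placeholders, not values from this post.
DEV="${DEV:-/dev/mmcblk0p2}"
MNT="${MNT:-/mnt}"

# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Single-device filesystem keeping two copies of metadata (-m) and data (-d).
run mkfs.btrfs -m dup -d dup "$DEV"
run mount "$DEV" "$MNT"
# Show which block group profiles are in use.
run btrfs filesystem df "$MNT"
# Read everything back and verify checksums; -B keeps the scrub in the foreground.
run btrfs scrub start -B "$MNT"
```

Using `-d single` instead would duplicate only the metadata, which halves the space cost for data at the price of not being able to repair damaged file contents.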

ext4

ext4 is a very widely deployed filesystem and the default of most Raspberry Pi distributions. “eMMC/SSD File System Tuning Methodology” notes that ext4 tolerated power failures quite well, while BTRFS did not. This result may have changed due to BTRFS improvements since 2013 and with the use of DUP (or RAID1 across partitions) as described above. The results may also differ when the ext4 metadata_csum feature is used for metadata checksums. However, I have not conducted a comparison.
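
For reference, a minimal sketch of creating an ext4 filesystem with metadata_csum enabled. The device name is a placeholder, and `run` is the same kind of hypothetical print-only wrapper (set `DRY_RUN=0` to execute):

```shell
# Sketch only: DEV is a placeholder, not a value from this post.
DEV="${DEV:-/dev/mmcblk0p2}"

# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the filesystem with metadata checksums enabled.
run mkfs.ext4 -O metadata_csum "$DEV"
# Print the superblock header; metadata_csum should appear under "Filesystem features".
run dumpe2fs -h "$DEV"
```

This assumes an e2fsprogs version recent enough to know the feature. For an existing, unmounted filesystem, `tune2fs -O metadata_csum` is the usual route, preceded by a forced e2fsck.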

There are also other application-specific features to consider between ext4 and BTRFS. For example, BTRFS supports filesystem snapshots, subvolumes, and compression. Also, ext4 is built into the Raspberry Pi Foundation-provided kernel builds while BTRFS is not, so booting from a BTRFS root filesystem requires an initramfs (see raspberrypi/linux#1550, raspberrypi/linux#1761). Keeping such an initramfs updated to match the kernel is also complicated on the Pi and requires custom scripting or manual filename changes on update (see raspberrypi/firmware#608 and RPi-Distro/firmware#1 - note that the referenced rpi-initramfs-tools package has not yet been created).

Conclusion: Use ext4 with metadata_csum or BTRFS with DUP profile for metadata (and data, if warranted) based on application-specific considerations and willingness to deal with initramfs issues.

Read-Only Filesystems

Another option for reducing or mitigating corruption is to use a read-only filesystem (or a writable filesystem mounted read-only). This can be done on a per-directory basis (e.g. read-only root with read-write /var) or using an overlay filesystem such as unionfs with either read-write partitions or tmpfs for ephemeral information. However, this adds configuration complexity in addition to more complicated failure scenarios.
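
A minimal sketch of the simplest variant, a read-only root with tmpfs for ephemeral data. The tmpfs size and the choice of /tmp are illustrative assumptions, and `run` is a hypothetical print-only wrapper (set `DRY_RUN=0` to execute):

```shell
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remount the root filesystem read-only.
run mount -o remount,ro /
# Keep ephemeral data in RAM; size=32m is an arbitrary example value.
run mount -t tmpfs -o size=32m,mode=1777 tmpfs /tmp
```

These commands only last until the next reboot. A persistent setup would put the equivalent `ro` and tmpfs entries in /etc/fstab.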

Partition and Filesystem Alignment

For optimal performance and lifetime, partitions and filesystem structures should be aligned to the erase block size. This size is occasionally listed on the spec sheet for the SD card. More commonly, the preferred_erase_size (or discard_granularity) reported for the device in sysfs can be used. It is also often possible to use flashbench to determine the erase block size empirically by measuring device performance.
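
The sysfs check can be sketched as follows. The device and partition names are assumptions, and `is_aligned` is a hypothetical helper. It relies on sysfs reporting partition start offsets in 512-byte sectors and preferred_erase_size in bytes.

```shell
# Hypothetical helper: succeeds if a partition that starts at $1 (in 512-byte
# sectors) begins on a multiple of the erase block size $2 (in bytes).
is_aligned() {
    start_sectors=$1
    erase_bytes=$2
    [ "$erase_bytes" -gt 0 ] && [ $(( start_sectors * 512 % erase_bytes )) -eq 0 ]
}

DISK="${DISK:-mmcblk0}"     # assumed device name
PART="${PART:-mmcblk0p2}"   # assumed partition name
erase_file="/sys/block/$DISK/device/preferred_erase_size"
start_file="/sys/block/$DISK/$PART/start"

if [ -r "$erase_file" ] && [ -r "$start_file" ]; then
    erase=$(cat "$erase_file")
    start=$(cat "$start_file")
    if is_aligned "$start" "$erase"; then
        echo "$PART is aligned to $erase bytes"
    else
        echo "$PART is NOT aligned to $erase bytes"
    fi
else
    echo "sysfs attributes not found for $DISK/$PART"
fi
```

For example, with a 4 MiB erase block a partition starting at sector 8192 (4 MiB) is aligned, while one starting at sector 2048 (1 MiB) is not.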

For the ext4 filesystem, there may be benefits to configuring the stride and/or stripe width to match the erase block size. Various methods for determining the ext4 stride and/or stripe size based on the flash media exist. I have insufficient understanding of the implications of stride and stripe size settings to know whether this is a good idea and haven’t seen any benchmarks to compare performance.
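
One commonly described method sets both values to the erase block size expressed in filesystem blocks. The sketch below only does that arithmetic and prints the resulting command. The 4 MiB erase block, 4 KiB filesystem block, and device name are all assumed example values, and whether the setting actually helps remains unbenchmarked here.

```shell
# Hypothetical helper: erase block size ($1, bytes) expressed in filesystem
# blocks ($2, bytes per block).
ext4_stride() {
    echo $(( $1 / $2 ))
}

ERASE_BYTES=4194304   # assumed 4 MiB erase block
BLOCK_BYTES=4096      # assumed 4 KiB filesystem block
STRIDE=$(ext4_stride "$ERASE_BYTES" "$BLOCK_BYTES")

# Print (not run) the corresponding mkfs invocation; the device is a placeholder.
echo "mkfs.ext4 -b $BLOCK_BYTES -E stride=$STRIDE,stripe_width=$STRIDE /dev/mmcblk0p2"
```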

I/O Schedulers

Completely Fair Queuing (CFQ) has been the default Linux I/O scheduler since 2.6.18. It is a good default, and it provides some behavior optimizations on non-rotational media. However, both “eMMC/SSD File System Tuning Methodology” and the Phoronix article “Linux 3.16: Deadline I/O Scheduler Generally Leads With A SSD” found that noop and deadline outperformed cfq. A caveat is that neither deadline nor noop supports I/O prioritization (e.g. ionice). If prioritization is not required, some performance can be gained by changing the I/O scheduler. This change can be applied to all non-rotational media by placing the following content in a udev rule file (e.g. /etc/udev/rules.d/60-nonrotational-iosched.rules):

ACTION=="add|change", KERNEL=="mmcblk[0-9]", ATTR{queue/scheduler}="deadline"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
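
To check which scheduler is active after the rule has been applied, the kernel's per-device scheduler file can be read. It lists the available schedulers with the active one in square brackets. A small sketch (the helper name `active_scheduler` is made up here):

```shell
# Hypothetical helper: extract the bracketed (active) entry from a scheduler
# line read on stdin, e.g. "noop [deadline] cfq" -> "deadline".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler < "$f")"
done
```

New rules normally take effect after `udevadm control --reload-rules` followed by `udevadm trigger`, or after a reboot. On kernels 5.0 and later, which use only the multi-queue block layer, the listed names differ (for example none and mq-deadline).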