Systems Performance Ch.8: File System Deep Dive

Introduction

[!NOTE] This article was generated by feeding my chaotic, personal reading notes of this massive textbook into an AI to restructure them into a readable blog post.

We’ve covered the foundations of CPU, Memory, and Networking. Now, let’s dive into “Chapter 8: File System,” which has a direct and often violent impact on application performance.

When a system slows down, we often have a spinal reflex to blame the storage, screaming “The disk I/O is slow!” But wait—the entity your application is actually interacting with and waiting on isn’t the disk; it’s the “File System.” File systems work tirelessly to hide the inherent sluggishness of disks using various clever tricks. If you leave this layer as a black box and only stare at disk statistics, you’ll never find the true bottleneck.

In this post, we’ll explore the “deep chasm” between the file system and the disk, and I’ll share a survival cheat sheet for exposing the true nature of file system latency using modern BCC/bpftrace tools.

1. Exposing the Disk Lie: Logical vs. Physical I/O

The I/O requests an application sends to the file system (Logical I/O) and the actual I/O that reaches the disk (Physical I/O) are almost never 1:1. Failing to understand this gap will lead you into an analytical dead end.

Shrinking or Vanishing I/O: No matter how many read/write requests an app makes, physical I/O can be “zero” if they hit the page cache or if the file system cancels out the writes in memory.
Expanding I/O: Conversely, a single “1-byte” write request from an app can explode into a massive I/O storm. The file system might expand it to its record size (e.g., 128KB), update metadata (access times, inodes), and write to journals (logs).

If you see “mysterious disk I/O” when your app is supposedly idle, it’s highly likely that background file system tasks are making their move.

2. Traps of High Intelligence: Prefetch and Write-back

File systems use sophisticated prefetching and delayed writing to avoid the agony of slow disk access.

Prefetch (Read-ahead): When the file system detects sequential reads, it predicts “they’ll probably read the next part too” and loads data into cache ahead of time. When it hits, it’s lightning fast. When it misses, it generates a mountain of useless disk I/O and pollutes the cache.
Write-back Caching: The file system buffers written data in the page cache in RAM and flushes it to disk asynchronously later. From the app’s perspective, writes finish instantly. However, behind the scenes, dirty data is piling up, leading to a massive burst of disk I/O when the flush finally triggers. It’s a ticking time bomb.

3. The Price of Durability: `fsync(2)` and Synchronous Writes

Caches in DRAM vanish when the power goes out. For critical data like database transaction logs, we use synchronous writes, but this comes with a lethal performance penalty.

Synchronous writes mercilessly block application threads until the data reaches the disk device.
Using flags like O_SYNC for every write causes your storage to scream as it forces metadata updates along with every data write.
The standard pro-tier move is to buffer writes asynchronously in the application’s address space and then call fsync(2) at logical checkpoints to flush them in a coordinated group.

OS stats typically focus on the disk device level (like iostat). However, an app can be stalled at the file system layer due to lock contention or queuing even if the disk itself is perfectly responsive.

Measure VFS (Virtual File System) Latency: Modern performance analysis dictates that you should measure latency at the VFS layer—the common interface for all file systems—rather than just looking at the disk.
Visualizing Bimodal Distribution with bpftrace: Using BPF-based tracers like bpftrace to histogram vfs_read latency makes the “bimodal distribution” instantly clear. You’ll see fast responses (microseconds) from cache hits and slow responses (milliseconds) from cache misses. It’s truly a beautiful sight to see.

5. Benchmark Pitfalls: What are you actually testing?

Most engineers who run file system benchmarks and celebrate “incredible storage speed” are actually just measuring the speed of their main memory.

The Working Set Size (WSS) Trap: If the total size of your test files fits within the page cache in RAM, you hit 100% cache and learn nothing about the disk.
To measure pure storage performance, you must either flush the cache with echo 3 > /proc/sys/vm/drop_caches to ensure a cold start or bypass the cache entirely using O_DIRECT (Direct I/O).

Conclusion

Reading this chapter drives home how dangerous it is to judge disk performance purely by disk-level stats like iostat. The file system fundamentally transforms the nature of I/O as it passes through.

Systems Performance Ch.8: File System Deep Dive

Introduction

1. Exposing the Disk Lie: Logical vs. Physical I/O

2. Traps of High Intelligence: Prefetch and Write-back

3. The Price of Durability: fsync(2) and Synchronous Writes

4. The Analytical Blind Spot: “VFS Latency”

5. Benchmark Pitfalls: What are you actually testing?

Conclusion

Related Posts

Systems Performance Ch. 2: Scalability Analysis & Modeling

Systems Performance Ch. 13-15: perf, Ftrace, & BPF Tracing

Systems Performance Ch.4: The World of Observability Tools

Systems Performance Ch.9: Exploring the Disk Abyss

Systems Performance Ch.10: Network Deep Dive

Systems Performance Ch.7: Memory Deep Dive

Systems Performance Ch.6: CPU Deep Dive

Systems Performance Ch.3: Operating Systems

Systems Performance Ch.11: Cloud Computing Traps

Systems Performance Ch.12: Benchmarking Guide

3. The Price of Durability: `fsync(2)` and Synchronous Writes