This repository was archived by the owner on Mar 17, 2025. It is now read-only.

About ZIL #7

@jimklimov


Hello, first of all - thanks for this write-up and the articles on the LE site, and for popularizing ZFS in practical large-scale production (even though my background is in another, original, branch of the project ;) )

One point that caught my eye here was that you did not use a separate ZIL because all devices are already fast. While that is true, a ZIL on separate hardware can bring other benefits, for a couple of reasons (a command sketch for attaching one follows the list below):

  • when you have sync writes, which is probably all writes in a database-centric setting, a pool without a separate ZIL can end up spewing random-LBA writes which must complete before the sync operation is acknowledged, and which probably pre-empt reads. With a separate ZIL, you have dedicated devices into which such writes land sequentially.
  • with a separate ZIL, the main pool devices would behave the same for sync and async writes - they would store bundled writes (large chunks of data) when a TXG is closed, usually every 5 sec or when the cache size limit is exceeded. In the case of old HDD pools this meant mostly-sequential writes; in the case of SSDs (and I suppose NVMe) it means reprogramming lots of complete flash pages at once, rather than read-modify-writing lots of small sub-page bits. Pages are the "atomic write" unit on solid-state storage, they are generally much larger than the sectors exposed to the OS (256 KB+ vs. 4-8 KB), and in SSD/NVMe they can house data from "sectors" at far-apart logical block addresses.
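
For anyone who wants to try this, here is a minimal sketch of attaching a mirrored log (SLOG) vdev. The pool name "tank" and the device paths are hypothetical placeholders, not a recommendation for your hardware.

```python
# Minimal sketch: attach a mirrored log (SLOG) vdev so that sync writes
# land on dedicated devices instead of as random-LBA writes on the main vdevs.
# Assumptions: a pool called "tank" and two spare NVMe namespaces -- both
# are hypothetical placeholders, adjust to your own layout.
import subprocess

POOL = "tank"
LOG_DEVS = ["/dev/nvme2n1", "/dev/nvme3n1"]

def add_mirrored_slog(pool: str, devs: list[str]) -> None:
    """Run `zpool add <pool> log mirror <dev> <dev>` to attach the SLOG."""
    subprocess.run(["zpool", "add", pool, "log", "mirror", *devs], check=True)

def show_pool_layout(pool: str) -> None:
    """Print the vdev tree; the new `logs` section should show up here."""
    subprocess.run(["zpool", "status", pool], check=True)

if __name__ == "__main__":
    add_mirrored_slog(POOL, LOG_DEVS)
    show_pool_layout(POOL)
```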

The main idea of the ZIL is that it journals sync writes, written as a ring buffer, and in a perfect world your system never crashes and so never has to read them back. This allows some devices to be much more efficient than others at this job.

  • The needed ZIL size depends on your write intensity and pool bandwidth; the general rule of thumb is that it should hold about 5-6 TXGs worth of data. So if your 24*NVMe pool writes at, say, 100 GB/s max, and TXGs are synced every 5 sec, you'd need roughly 3000 GB devices for the ZIL (preferably mirrored); see the sizing sketch after this list.
    ** Note to other readers: for many smaller servers and home systems that do not write at full throttle 24/7 onto pools this fast, battery-backed RAM drives maxed at 4-8 GB used as ZILs usually suffice. Or partitions/other disks of that scale...
  • If the ZIL devices are hardware tailored for writes, e.g. RAM drives backed by flash and capacitors in case of a power outage, they can be a lot faster than your main pool. Even so, the PCI bus for one or two dedicated ZILs in your setup may prove a bottleneck compared to "implicit" ZILing spread over a 24-device pool, so testing on a separate system with your realistic workload is advised to check whether the benefit is there :)
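
To make the rule of thumb concrete, here is a back-of-the-envelope sizing sketch; the function name and example numbers just restate the figures from this comment, not a precise formula.

```python
# Back-of-the-envelope ZIL/SLOG sizing, following the "hold about 5-6 TXGs"
# rule of thumb above. Numbers are illustrative, not a guarantee.

def slog_size_gb(write_gb_per_s: float, txg_interval_s: float = 5.0,
                 txgs_to_hold: int = 6) -> float:
    """Suggested ZIL/SLOG capacity in GB for a given peak sync-write rate."""
    return write_gb_per_s * txg_interval_s * txgs_to_hold

# The 24*NVMe example from above: ~100 GB/s peak writes, 5-second TXGs.
print(slog_size_gb(100))   # 3000.0 -> roughly 3 TB of (preferably mirrored) SLOG

# A small home server pushing at most ~0.2 GB/s of sync writes:
print(slog_size_gb(0.2))   # 6.0 -> the 4-8 GB battery-backed RAM drive range
```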

Practical benefits can be:

  • latency: most of the time your main pool devices are not busy writing, so they can be busy reading;
  • wear-leveling and storage longevity: most of the time your main pool devices are writing so much data in async bursts that they can reprogram whole flash pages, with little overhead lost to the read-modify-write cycles that relocate still-valid "sectors". Even if the ZIL devices are made from the same technology, they end up writing sequentially as a ring buffer and so can also reprogram whole pages with little overhead.
  • less fragmentation of logical data across LBAs in the main pool, though maybe not so much of an issue with SSD/NVMe.

On a separate note, people often partition not 100% but some 80%-90% of a device for use as SSD storage, to ensure there are always "unused" logical pages available for complete reprogramming, in addition to whatever hardware over-provisioning the vendor cooked into the device.

Also note that since ZFS never overwrites currently referenced data in place, it can succumb to free-space fragmentation and take much longer to find free spots to put new writes into after some percentage of the pool is used (and in the case of HDDs, it also involves much more seeking for small writes). The particular percentage is different for every pool depending on its write and delete history, but keeping somewhere around 10%-25% always free is regarded as a safe zone without looking closer at a particular pool. That said, I've had an 8 TB server back in the day whose performance only took a hit as it was chewing through the last 100 GB, so mileage really varies a lot.
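
As a small illustration of that last point, here is a hedged monitoring sketch that warns when a pool crosses a fullness threshold; the pool name "tank" and the 80% cut-off are placeholders loosely matching the 10%-25%-free guidance, not hard numbers.

```python
# Monitoring sketch: warn before free-space fragmentation starts to hurt.
# Assumptions: a pool named "tank" and an 80% used-capacity threshold,
# loosely matching the "keep 10%-25% free" guidance above.
import subprocess

def pool_capacity_percent(pool: str) -> int:
    """Return the used-capacity percentage reported by `zpool list`."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "capacity", pool],
        capture_output=True, text=True, check=True,
    ).stdout.strip()        # e.g. "73%"
    return int(out.rstrip("%"))

def warn_if_too_full(pool: str, threshold: int = 80) -> None:
    used = pool_capacity_percent(pool)
    if used >= threshold:
        print(f"{pool} is {used}% full; new writes may slow down as "
              f"ZFS hunts for free space")

if __name__ == "__main__":
    warn_if_too_full("tank")
```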
