Fixing FSDP Checkpoint Deadlocks on 2× RTX 4090
Author: Robin, Kroonen AI Inc.
I'm training my own AI model from scratch on a local dual-GPU workstation. It kept crashing. Not during training, but every time it tried to save progress. This is what broke, why, and how I fixed it.
Summary
Over the past several days, I have been building and testing a full local pretraining pipeline for a language model from scratch, including:
- Custom SentencePiece tokenizer (49,152 vocab)
- Curated ~60B token multilingual corpus
- Distributed pretraining stack (FSDP on PyTorch 2.8)
- Evaluation and checkpointing pipeline
- 26 tracked experiment runs with crash documentation
The core training pipeline always worked. Forward pass, backward pass, gradient accumulation, loss going down. All fine.
The blocker was distributed checkpointing. Every run crashed at checkpoint boundaries, not during training, but during the save operation itself.
This article documents the problem, the root cause, and the fix.
The Problem
Hardware
- GPUs: 2× NVIDIA RTX 4090 (24 GB each)
- CPU: AMD Ryzen 9 7950X3D
- Topology: PCIe-only (PHB), no NVLink
- OS: Pop!_OS (Linux)
- PyTorch: 2.8.0+cu128
What Happened
Training would run for hundreds of steps with healthy loss curves and stable throughput. Then, at the first checkpoint boundary, the process would hang indefinitely.
The failure was 100% reproducible. Every single run crashed at the same point: the checkpoint save operation.
Crash points across 26 tracked runs:
- Step 250: hang
- Step 499: hang
- Step 999: hang
Never during forward pass. Never during backward pass. Always during checkpoint save.
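A hang like this is much easier to diagnose if the collective fails loudly instead of stalling silently. The environment variables below are real PyTorch/NCCL knobs; using them here is my suggested first diagnostic pass, not something the original runs necessarily did:

```shell
# Make a stuck NCCL collective surface as an error instead of a silent hang.
export NCCL_DEBUG=INFO                    # log each collective per rank
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # abort the process on collective timeout
# torchrun --nproc_per_node=2 train.py    # then relaunch the training script
```

With these set, the rank that reaches the checkpoint gather alone aborts with a timeout error naming the mismatched collective, instead of holding the job indefinitely.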
Root Cause
The standard FSDP checkpoint approach uses FullStateDictConfig to gather the complete model state onto rank 0:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    model_state = model.state_dict()
    optim_state = FSDP.optim_state_dict(model, optimizer)
This triggers an ALLGATHER operation across all GPUs via NCCL. On datacenter hardware with NVLink (600+ GB/s bidirectional bandwidth), it completes in seconds.
On PCIe-connected consumer GPUs, this ALLGATHER becomes the bottleneck. Real-time telemetry showed sustained PCIe TX/RX throughput of ~6.5 GiB/s during FSDP gradient synchronization, near the practical ceiling for Gen 4 x8 once protocol overhead is accounted for. With both GPUs already near memory capacity from training, the gather requires materializing the full model state on rank 0 while rank 1 waits. The NCCL timeout fires, and the process deadlocks.
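A back-of-envelope makes the memory pressure concrete. Assuming fp32 weights and Adam (the byte counts per parameter are standard; the helper itself is illustrative, not from the training code), rank 0 must materialize:

```python
def full_state_dict_bytes(n_params: int) -> int:
    """Bytes rank 0 must hold for a FULL_STATE_DICT save with Adam:
    fp32 weights (4 B) plus Adam's exp_avg and exp_avg_sq (4 B each)."""
    per_param = 4 + 4 + 4
    return n_params * per_param

# For a 1B-parameter model: 12 GB gathered onto a single 24 GB card that is
# already holding its own shard, activations, and CUDA workspace.
print(full_state_dict_bytes(1_000_000_000) / 1e9)  # 12.0
```

At 1B-parameter scale, the gather alone claims half the card before training state is counted, which is why the save stalls even though training fits.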
The same issue affected the evaluation path, which also performed a full-state gather to save a temporary checkpoint for an async eval subprocess.
The system could train indefinitely. It could not save.
The Fix
Three key changes turned the crashing checkpoint path into a viable local training pipeline:
1. DCP Sharded Checkpoints
Replace the full-state gather with PyTorch's Distributed Checkpoint (DCP), coupled with ShardedStateDictConfig(offload_to_cpu=True) for resume. Each rank saves its own shard independently. No ALLGATHER. No NCCL coordination during save. On resume, the CPU offload ensures shards are reassembled in system RAM before distribution to GPUs, preventing silent weight mapping corruption on PCIe topology.
import torch.distributed.checkpoint as dcp
from torch.distributed.fsdp import ShardedStateDictConfig

sharded_policy = ShardedStateDictConfig(offload_to_cpu=True)
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT, sharded_policy):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
    }
dcp.save(state_dict, checkpoint_id=checkpoint_dir)
Resume works the same way. Each rank loads its own shard:
with FSDP.state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=True),
):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
    }
    dcp.load(state_dict, checkpoint_id=checkpoint_dir)
    model.load_state_dict(state_dict["model"])
    # Map the loaded optimizer shards back onto this rank's flat parameters.
    optim_state = FSDP.optim_state_dict_to_load(model, optimizer, state_dict["optimizer"])
    optimizer.load_state_dict(optim_state)
2. Gradient Accumulation with no_sync
FSDP synchronizes gradients on every backward() call by default. With gradient accumulation (64 microsteps in our case), that means 63 unnecessary NCCL communications per step.
from contextlib import nullcontext

for micro_step in range(grad_accum):
    # Skip the gradient all-reduce on every microstep except the last.
    ctx = model.no_sync() if micro_step < grad_accum - 1 else nullcontext()
    with ctx:
        loss = model(x, y) / grad_accum
        loss.backward()
Only the final microstep synchronizes. This is standard practice but easy to miss.
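The savings are easy to count. A toy helper (illustrative, not part of the training code) tallies gradient synchronizations per optimizer step:

```python
def grad_syncs_per_step(grad_accum: int, use_no_sync: bool) -> int:
    """NCCL gradient synchronizations issued per optimizer step."""
    return 1 if use_no_sync else grad_accum

# With 64 microsteps, no_sync removes 63 collectives per optimizer step.
saved = grad_syncs_per_step(64, False) - grad_syncs_per_step(64, True)
print(saved)  # 63
```

On an interconnect already running near its ceiling, eliminating 63 of every 64 gradient collectives is the difference between the PCIe link being a cost and being a wall.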
3. Evaluation Strategy
The original pipeline spawned a separate Python process for evaluation, which required saving a full-state checkpoint first, triggering the same deadlock.
The first fix was lightweight in-process validation on all ranks with dist.all_reduce to aggregate loss. This removed the subprocess and full-state gather, but inline eval under FULL_SHARD still caused rank desynchronization on this PCIe topology. The FSDP module-gather order during model.eval() diverged across ranks.
The practical solution: disable inline eval during training and evaluate from saved checkpoints in a separate single-GPU script. This decouples evaluation from the training loop entirely, eliminating any risk of eval crashing a multi-day run.
Known Limitation: Inline Distributed Evaluation
One issue remains unsolved in the current pipeline: inline evaluation under FSDP FULL_SHARD causes rank desynchronization on this PCIe topology.
When only rank 0 enters the eval code path (a common pattern), the two ranks diverge in their FSDP all-gather sequence. Rank 0 begins gathering parameters for the eval forward pass while rank 1 expects the next training step's gather. The sequence numbers drift apart, and NCCL times out.
Making both ranks participate in eval (with dist.all_reduce to aggregate val loss) was the right approach, but the ranks still desynchronized, likely because the FSDP module-gather order during model.eval() differs subtly from training mode on PCIe.
Current workaround: Evaluate from saved checkpoints in a separate single-GPU script. This is actually cleaner. It decouples evaluation from training and avoids any risk of eval crashing a multi-day run.
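One way to wire up that single-GPU eval script is to convert the sharded DCP checkpoint to a plain torch.save file offline. dcp_to_torch_save is a real PyTorch utility; build_model, the paths, and the "model" key are placeholders matching the save layout above, so treat this as a sketch rather than the post's actual eval code:

```python
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

def load_for_eval(dcp_dir: str, build_model):
    """Reassemble a DCP sharded checkpoint on one process, no NCCL involved."""
    dcp_to_torch_save(dcp_dir, "eval_ckpt.pt")      # offline shard merge
    state = torch.load("eval_ckpt.pt", map_location="cpu")
    model = build_model()                           # plain, unsharded module
    model.load_state_dict(state["model"])
    return model.eval()
```

Because the conversion runs outside any process group, there is no collective to desynchronize: the eval script can crash, restart, or lag without touching the training job.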
Alternative to explore: Switching from FULL_SHARD to SHARD_GRAD_OP reduces the per-forward all-gather frequency and may make inline eval viable. This has not been tested yet.
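If tested, that alternative would be a one-argument change at FSDP wrap time. ShardingStrategy is the real enum; the wrap call itself is a minimal sketch, and per the above this configuration is unverified on this pipeline:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# SHARD_GRAD_OP (ZeRO-2 style) keeps parameters unsharded between forward and
# backward, so forward passes, including eval forwards, trigger fewer
# all-gathers than FULL_SHARD at the cost of higher per-GPU memory.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # instead of FULL_SHARD
)
```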
Software Stack
- PyTorch: 2.8.0+cu128
- CUDA: 12.8
- OS: Pop!_OS (Linux, kernel 6.x)
- GPU Driver: NVIDIA 580.126
- Distributed: FSDP (FULL_SHARD) + DCP
- Tracking: Weights & Biases
- Tokenizer: SentencePiece BPE (49,152 vocab)
Why Local R&D Matters
Every failure documented in this post cost electricity and time. It did not cost $32/hour in cloud GPU bills. By doing the R&D on a consumer workstation, every checkpoint deadlock, every NCCL timeout, every rank desync was debugged at local hardware rates. The result is production-ready distributed training code that has been tested against real PCIe topology failures, not theoretical ones.
When this code moves to NVLink clusters for larger models, the distributed systems layer is already battle-tested. The only variable that changes is the interconnect speed. That is a deliberate engineering strategy: reduce the cost of failure during R&D, then scale with confidence.
Contact
If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].
Live Update, March 23, 2026
As of step 8,800+, this fix has enabled continuous training without a single checkpoint deadlock. The DCP sharded save/resume pipeline, combined with CPU-offload state dict configuration, has survived multiple stop/resume cycles and a full script rewrite mid-run. Total uptime: 8,800+ steps across 6 days of training.