Genesis 1B: Run 2 Extended — 20k → 40k Steps
Author: Robin, Kroonen AI Inc.
🟢 Live — Step ~22,250 / 40,000 (~55.6%), ETA ~6 days
Run 2 hit 20,000 steps on March 31, 2026, and was then extended to 40,000 steps (~21B tokens, slightly above Chinchilla-optimal). Training continues on 2× RTX 4090 with loss ~2.05 and throughput ~18,900 tok/s.
Model: Genesis 1B

| Parameter | Value |
|---|---|
| Parameters | 1,000M (1.0B) |
| Architecture | Llama-style decoder-only transformer |
| Hidden dim | 1536 |
| Layers | 32 |
| Attention heads | 12 (6 KV heads, GQA) |
| FFN dim | 4736 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 49,152 |
| Precision | bfloat16 |
| Positional encoding | RoPE (θ=500,000) |
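The shapes in the table pin down the parameter budget. A quick back-of-envelope check, assuming tied input/output embeddings and standard Llama-style layer shapes (details the post does not spell out, so treat this as a sketch rather than the exact pretrainv3.py model):

```python
# Back-of-envelope parameter count for the Genesis 1B shapes above.
# Assumes tied embeddings and standard Llama-style layers; the actual
# pretrainv3.py model may differ in small details.
dim, layers, ffn, vocab = 1536, 32, 4736, 49_152
n_heads, n_kv_heads = 12, 6
head_dim = dim // n_heads                      # 128

embed = vocab * dim                            # token embeddings (tied with lm_head)
attn = dim * (n_heads * head_dim)              # q projection
attn += 2 * dim * (n_kv_heads * head_dim)      # k and v projections (GQA)
attn += (n_heads * head_dim) * dim             # output projection
ffn_params = 3 * dim * ffn                     # SwiGLU: gate, up, down
per_layer = attn + ffn_params + 2 * dim        # plus two RMSNorm weight vectors

total = embed + layers * per_layer + dim       # plus final norm
print(f"{total / 1e9:.2f}B parameters")        # ≈ 1.00B
```

The total lands almost exactly on the advertised 1.0B, which is consistent with tied embeddings.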
Training Configuration

| Setting | Value |
|---|---|
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 4 per GPU |
| Gradient accumulation | 32 steps |
| Effective batch | 524,288 tokens/step |
| Learning rate | 1e-4 → 1e-5 (cosine decay) |
| Warmup | 1,000 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Activation checkpointing | Enabled (per TransformerBlock) |
| DCP resume | ShardedStateDictConfig(offload_to_cpu=True) |
| CUDA allocator | expandable_segments:True |
| VRAM per GPU | ~20 GB with activation checkpointing |
| Throughput | ~19,000 tok/s |
| Target | ~21B tokens (40,000 steps, above Chinchilla-optimal ~38,150) |
| Script | pretrainv3.py |
| NCCL | NCCL_P2P_DISABLE=1 |
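The effective batch and the Chinchilla target in the table follow directly from the other settings; a quick arithmetic check (the ~20 tokens-per-parameter rule of thumb is the standard Chinchilla heuristic, not a number from the training logs):

```python
# Sanity-check two numbers from the config table: the effective batch
# in tokens/step, and the Chinchilla-optimal step count (~20 tokens
# per parameter for a 1B-parameter model).
gpus, per_gpu_batch, grad_accum, seq_len = 2, 4, 32, 2048
tokens_per_step = gpus * per_gpu_batch * grad_accum * seq_len
print(tokens_per_step)                             # 524288

chinchilla_tokens = 20 * 1_000_000_000             # ~20B tokens for 1B params
print(round(chinchilla_tokens / tokens_per_step))  # ~38147 steps
```

Both match the table: 524,288 tokens/step and a Chinchilla-optimal horizon of roughly 38,150 steps, which the 40,000-step target slightly overshoots.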
Run 2: Training Progress (20k → 40k Extension)
Run 2 launched March 24, 2026 with a redesigned 32-layer architecture and reached 20,000 steps on March 31, 2026. Rather than stopping there, the run was extended to 40,000 steps (~21B tokens). The Chinchilla-optimal compute for a 1B parameter model is ~20B tokens (~38,150 steps at 524K tokens/step) — 40,000 steps slightly exceeds that intentionally, producing a stronger pre-trained base. Training is currently live at step ~22,250+.
| Step | Loss | Grad Norm | tok/s |
|---|---|---|---|
| 0 | 11.1377 | 20.00 | 17,425 |
| 1,000 | 3.4161 | 0.74 | 18,936 |
| 2,000 | 3.0866 | 0.30 | 18,954 |
| 3,000 | 2.5517 | 0.22 | 18,948 |
| 4,000 | 2.6568 | 0.22 | 18,958 |
| 5,000 | 2.2971 | 0.17 | 18,946 |
| 6,000 | 2.2877 | 0.18 | 18,935 |
| 7,000 | 2.2235 | 0.17 | 18,936 |
| 8,000 | 2.1325 | 0.16 | 18,947 |
| 9,000 | 2.2878 | 0.16 | 18,830 |
| 10,000 | 2.1776 | 0.16 | 18,955 |
| 11,000 | 2.1164 | 0.16 | 18,960 |
| 12,000 | 2.2426 | 0.16 | 18,967 |
| 13,000 | 2.1838 | 0.16 | 18,971 |
| 14,000 | 2.0864 | 0.17 | 18,978 |
| 15,000 | 1.9520 | 0.17 | 18,975 |
| 16,000 | 1.8105 | 0.15 | 18,965 |
| 17,000 | 2.1301 | 0.16 | 18,956 |
| 18,000 | 2.1521 | 0.18 | 18,869 |
| 19,000 | 1.8729 | 0.16 | 18,973 |
| 20,000 | 1.8369 | 0.17 | 18,967 |
| 21,000 | ~2.05 | 0.18 | 18,894 |
| 22,250 | 2.051 | 0.18 | 18,894 |
Training loss curve
Full loss curve reconstructed from the local run-0a3gme49.wandb run file, covering step 0 through the end of Run 2.
At 20k steps, loss was 1.8369. Extending the run to 40k re-stretched the cosine schedule toward the same 1e-5 floor, so the learning rate at step 20k+ is well above the near-floor value it had reached under the original 20k schedule. The latest checkpoint (step 22,250) shows loss 2.051 with healthy gradient norms; a temporary uptick is expected while the model trains on new data at this higher rate, and loss should resume falling as the stretched schedule decays. Throughput holds steady at ~18,900 tok/s.
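The post-extension loss uptick is consistent with how a stretched cosine schedule behaves. A minimal sketch, assuming linear warmup plus cosine decay between the stated 1e-4 and 1e-5 endpoints (the exact pretrainv3.py schedule is an assumption):

```python
import math

def lr_at(step, max_steps, warmup=1_000, lr_max=1e-4, lr_min=1e-5):
    """Linear warmup then cosine decay to lr_min -- a sketch, not the
    exact pretrainv3.py implementation."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Under the original 20k-step horizon, step 20,000 sat at the 1e-5 floor;
# stretching the horizon to 40k raises the rate at that same step.
print(f"{lr_at(20_000, max_steps=20_000):.1e}")  # 1.0e-05
print(f"{lr_at(20_000, max_steps=40_000):.1e}")  # ~5.7e-05
```

Under these assumptions the extension roughly quintuples the learning rate at step 20k, which is enough to explain a transient loss bump of the size seen in the table.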
Checkpoints are backed up locally every 10 minutes and uploaded to HuggingFace. Try them in the live playground.
The Dataset
~60B tokens, curated from public sources:
- FineWeb-Edu (English web, educational filter)
- DCLM baseline + extra slices
- StarCoderData (code)
- FineMath (mathematics)
- Wikipedia (multilingual)
- CulturaX (Arabic, German, Spanish, French, Japanese, Korean, Portuguese, Chinese)
- OpenHermes, Orca AgentInstruct (instruction data)
- Function calling datasets (Glaive, Gorilla, Hermes, xLAM)
- Cosmopedia (synthetic textbooks)
All tokenized with a custom SentencePiece BPE tokenizer trained on the corpus itself.
The Road to Genesis 1B v0.1
Pre-training is only the first phase. The full pipeline has four stages:
Phase 1: Pre-training (in progress — extended to 40k)
The initial 20k-step milestone was reached March 31, 2026. The run was immediately extended to 40,000 steps (~21B tokens). Chinchilla-optimal for a 1B model is ~20B tokens (~38,150 steps at 524K tokens/step) — 40,000 steps slightly overshoots intentionally, giving a stronger base. Currently at step ~22,250, with ~17,750 steps remaining. Estimated completion ~April 7, 2026.
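The ~April 7 estimate matches a simple throughput calculation using the figures reported above:

```python
# Back-of-envelope ETA for the remaining pre-training, from the
# reported throughput and effective batch size.
tokens_per_step = 524_288
remaining_steps = 40_000 - 22_250        # 17,750 steps left
tok_per_s = 18_900

seconds = remaining_steps * tokens_per_step / tok_per_s
print(f"{seconds / 86_400:.1f} days")    # ~5.7 days
```

About 5.7 days from step 22,250 in early April lands on roughly April 7, matching the estimate.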
Phase 2: SFT (Supervised Fine-Tuning) — in progress
Already underway. Early SFT checkpoints have been produced on top of the pre-trained base. Full SFT will run once the 40k pre-training base is complete. The approach is inspired by Constitutional AI: define a set of principles and train the model to follow them. The goal is a model with genuine personality, not a model optimized for refusal rates.
Phase 3: DPO (Direct Preference Optimization)
Refine taste and style. Train the model to prefer interesting, thoughtful responses over generic safe ones. Preference pairs are constructed to reward curiosity and penalize hedging.
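For reference, the per-pair DPO objective is compact enough to write out. This is the standard DPO loss (Rafailov et al.), not code from the Genesis pipeline; the inputs are summed log-probabilities of each response under the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative only, not
    the Genesis training code). Inputs are summed response log-probs
    under the policy (pi_*) and frozen reference (ref_*) models."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Loss falls as the policy favors the chosen response more strongly
# than the reference does, and rises when it favors the rejected one:
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # policy prefers chosen: lower
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # policy prefers rejected: higher
```

Minimizing this pushes the policy's log-ratio for chosen responses above its log-ratio for rejected ones, which is exactly the "prefer interesting over generic" pressure described above.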
Phase 4: Continued pre-training cycles
Continue pre-training to 40,000 steps (~10.5B additional tokens, ~21B total), then run SFT and DPO again from the stronger base. Repeat at 60,000 and 80,000+ steps. Each cycle produces a better pre-trained foundation, which in turn produces a better-aligned model.
The 60B token corpus means zero data repetition even at extended step counts. Every token the model sees is genuinely new data.
Run 1: What Happened (Historical)
📜 Run 1 History — Click to expand (steps 0–8,500, March 17–24)
Run 1 used a different architecture: 20 layers, dim 2048, 16 heads, batch size 1. It achieved 6,500 tok/s and was on track for ~13 days to 20k steps. Two critical failures occurred:
1. FSDP Checkpoint Deadlock
Checkpoint saves hung indefinitely due to NCCL ALLGATHER over PCIe without NVLink. Fixed by switching to DCP sharded checkpoints.
2. Optimizer State Bug (Silent)
The DCP resume path loaded only model weights, not AdamW optimizer state. This produced a false recovery: loss looked healthy for ~50 steps, then diverged, because AdamW restarted with zeroed moment estimates on mid-training weights. The fix: load optimizer state alongside model weights, with a try/except fallback for checkpoints that predate the change.
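The shape of the fix can be sketched with plain dictionaries. This is a toy illustration only; the real resume path restores FSDP/DCP sharded state, which this version omits entirely:

```python
def resume_from_checkpoint(checkpoint, model_state, optim_state):
    """Toy sketch of the Run 2 resume logic (not the real DCP code).
    The Run 1 bug was equivalent to skipping the optimizer branch:
    weights resumed mid-training while AdamW silently restarted with
    empty moment estimates, poisoning the first updates."""
    model_state.update(checkpoint["model"])
    try:
        optim_state.update(checkpoint["optimizer"])
    except KeyError:
        # Fallback only for old checkpoints saved before the fix;
        # a fresh optimizer state is announced loudly, never silent.
        print("WARNING: checkpoint has no optimizer state; starting fresh")

ckpt = {"model": {"w": 0.5}, "optimizer": {"exp_avg": 0.1}}
model, optim = {}, {}
resume_from_checkpoint(ckpt, model, optim)
print(model, optim)
```

The key design point is that the degraded path is explicit and logged, so a missing optimizer state can never again masquerade as a clean resume.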
These failures led to the Run 2 redesign. See the full postmortems: FSDP Deadlock · Optimizer State Bug
Run 1 Loss Data
| Step | Loss | Step | Loss |
|---|---|---|---|
| 0 | 11.17 | 3,400 | 2.73 |
| 200 | 4.87 | 3,600 | 2.42 |
| 400 | 4.34 | 3,800 | 2.45 |
| 600 | 3.55 | 4,000 | 2.25 |
| 800 | 3.03 | 4,200 | 2.35 |
| 1,000 | 3.27 | 4,400 | 2.19 |
| 1,200 | 3.02 | 4,600 | 2.46 |
| 1,400 | 3.02 | 4,800 | 2.10 |
| 1,600 | 2.94 | 5,000 | 2.39 |
| 1,800 | 2.74 | 5,500 | 2.26 |
| 2,000 | 2.54 | 6,000 | 2.20 |
| 2,200 | 2.36 | 6,500 | 2.15 |
| 2,400 | 2.44 | 7,000 | 1.90 |
| 2,600 | 2.54 | 7,500 | 1.69 |
| 2,800 | 2.62 | 8,000 | 1.53 |
| 3,000 | 2.68 | 8,500 | 1.42 |
Try It Yourself
The model is ready to inspect. Select a checkpoint and generate text to see how it evolved across the run:
Powered by HuggingFace ZeroGPU (free inference on NVIDIA H200).
Contact
If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].
More Posts
Genesis
Genesis 1B, Run 2: 3x Throughput, Same Hardware
Redesigning Genesis 1B from 20 to 32 layers. Same param count, same GPUs, 3x training throughput.
The Genesis Manifesto: Sovereign Intelligence
Data sovereignty, constitutional alignment, and why the future of AI is local, private, and personality-first.
Postmortems
The Optimizer State Bug: A Silent Failure
A silent AdamW state bug during Run 1 that produced a false recovery on poisoned weights.
Fixing FSDP Checkpoint Deadlocks on 2x RTX 4090
How DCP sharded checkpoints and CPU-offload resume fixed deadlocks on consumer GPUs without NVLink.