47 hypotheses · composite score 0.742
- Reducing attention head dropout from 0.1 to 0.05 in the top 4 transformer layers may improve convergence on the validation set without overfitting (see the layer-wise dropout sketch below).
- Applying a cosine learning rate schedule with warm restarts every 1000 steps may balance exploration and exploitation across the full eval battery (see the scheduler sketch below).
- Replacing learned absolute position embeddings with rotary position embeddings (RoPE) may improve generalization to longer sequences (see the RoPE sketch below).
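A minimal sketch of the layer-wise dropout hypothesis, assuming a custom block that owns its own `nn.MultiheadAttention`; the layer count, model width, and head count are placeholders, not values from the run log:

```python
import torch.nn as nn

class Block(nn.Module):
    """Attention block whose attention-probability dropout is set per layer."""
    def __init__(self, d_model: int, nhead: int, attn_dropout: float):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead,
                                          dropout=attn_dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + out)

def build_blocks(num_layers: int = 12, d_model: int = 512, nhead: int = 8):
    # lower layers keep the 0.1 baseline; only the top 4 drop to 0.05
    return nn.ModuleList([
        Block(d_model, nhead, attn_dropout=0.05 if i >= num_layers - 4 else 0.1)
        for i in range(num_layers)
    ])
```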
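A minimal sketch of the warm-restart schedule using PyTorch's built-in `CosineAnnealingWarmRestarts`; the model, learning rate, and loop length are placeholders, only the 1000-step restart period comes from the hypothesis above, and the weight decay of 0.001 mirrors the kept experiment in the log below:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(512, 512)          # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.001)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000, eta_min=1e-6)

for step in range(10_000):                 # training loop placeholder
    # ... forward / backward / optimizer.step() would go here ...
    scheduler.step()                       # advance the cosine cycle by one step
```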
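A minimal sketch of rotary position embeddings applied to the query/key tensors before the attention score computation, in place of adding learned absolute position vectors; the tensor layout and the base of 10000 follow the common RoPE convention and are assumptions, not settings from the experiment log:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (even, odd) feature pairs of x by position-dependent angles.

    x: (batch, seq_len, num_heads, head_dim) with an even head_dim.
    """
    seq_len, dim = x.shape[1], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast to (1, seq, 1, dim/2)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# queries and keys would be rotated before computing attention scores:
# q, k = rope(q), rope(k)
```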
| ID | Status | Δ Score | Hypothesis | Time |
|---|---|---|---|---|
| EXP-047 | KEPT | +1.4% | LR schedule: cosine annealing T_max=500 | 2m ago |
| ╰ | REVERTED | -0.3% | Dropout rate: 0.2→0.3 in FFN | 8m ago |
| ╰ | KEPT | +0.9% | Weight decay 0.01→0.001 | 14m ago |
| ╰ | FLAGGED | +3.4% BLEU | WordPiece tokenization for code tokens | 25m ago |
| ╰ | KEPT | +0.5% | Batch size schedule: linear 32→128 | 36m ago |
| ╰ | KEPT | +0.2% | Layer norm epsilon: 1e-5→1e-8 | 48m ago |
| ╰ | REVERTED | -0.9% | Increased FFN hidden size 2048→3072 | 1h ago |
| ╰ | KEPT | +4.2% | Mixed precision FP16 baseline | 1h 22m ago |
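For the FP16 baseline kept at the bottom of the log, a minimal sketch using `torch.cuda.amp`, assuming a CUDA device and placeholder model, optimizer, and data:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()   # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()       # rescales gradients to avoid FP16 underflow

for batch in [torch.randn(32, 512).cuda() for _ in range(4)]:  # stand-in loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # run the forward pass in FP16 where safe
        loss = model(batch).pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```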
Average gain over the last 10 experiments: +0.9%
3 of 4 evals still improving
78% of budget consumed
Sampler split 2:1