Transformer optimization

47 hypotheses · composite score 0.742

Running: EXP-047 · Run #3 · 2h ago · 47 hypotheses completed
Composite score: 0.742 (+6.2% from start)

Sampler Agents: 2 running · 1 queued

SA-001 · DEPTH · 4m 12s · 86%

Reducing attention head dropout from 0.1 to 0.05 in the top 4 transformer layers may improve convergence on the validation set without overfitting.

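A minimal sketch of what the SA-001 change could look like, assuming a torch.nn.TransformerEncoder stack; the dashboard does not name the model implementation, and the 12-layer, d_model=512 configuration below is illustrative.

```python
import torch.nn as nn

# Sketch of SA-001's hypothesis: keep attention dropout at 0.1 in the lower
# layers but reduce it to 0.05 in the top 4 layers. Model dimensions here are
# illustrative, not the experiment's actual configuration.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

for block in encoder.layers[-4:]:
    block.self_attn.dropout = 0.05  # dropout applied to the attention weights only
```
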
SA-002 · BREADTH · 2m 47s · 58%

Applying a cosine learning rate schedule with warm restarts every 1000 steps may balance exploration and exploitation across the full eval battery.

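A sketch of the SA-002 schedule using PyTorch's built-in CosineAnnealingWarmRestarts; only the 1000-step restart period comes from the hypothesis, while the stand-in model, learning rate, and eta_min are assumptions.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(512, 512)  # stand-in for the transformer under test
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Restart the cosine cycle every 1000 optimizer steps, as proposed by SA-002.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000, T_mult=1, eta_min=1e-6)

for step in range(3000):
    # forward pass / loss.backward() / optimizer.step() would go here
    scheduler.step()  # stepping per batch makes T_0 a step count rather than an epoch count
```
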
SA-003 · DEPTH · queued

Experimenting with rotary position embeddings (RoPE) in place of learned absolute positions may improve generalization to longer sequences.
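
For SA-003, a self-contained sketch of rotary position embeddings in the rotate-half form; the head count, head dimension, and base of 10000 are the usual defaults rather than values from this run.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate queries/keys of shape (batch, seq_len, n_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Queries and keys are rotated just before the attention-score computation;
# shapes here are illustrative.
q = torch.randn(2, 128, 8, 64)
q_rotated = apply_rope(q)
```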

Evaluator Agents: 4 active
Val Loss: 2.31 (-0.160) · best 2.28
BLEU-4: 0.341 (+0.023) · best 0.347
Pass@1: 0.280 (+0.040) · best 0.310
Inference Latency: 48 ms (-4 ms) · best 44 ms
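
The dashboard reports a single composite score (0.742) over these four evaluators but does not show how they are combined; the sketch below assumes a weighted mean of ratios to the run's starting values, with lower-is-better metrics (val loss, latency) inverted. The weights are illustrative, and the starting values are back-computed from the deltas shown above.

```python
# Hypothetical aggregation of the four evaluator signals into one composite score.
HIGHER_IS_BETTER = {"val_loss": False, "bleu4": True, "pass_at_1": True, "latency_ms": False}
WEIGHTS = {"val_loss": 0.3, "bleu4": 0.3, "pass_at_1": 0.3, "latency_ms": 0.1}  # assumed

def composite(current: dict, start: dict) -> float:
    score = 0.0
    for name, value in current.items():
        ratio = value / start[name]          # relative to the run's starting value
        if not HIGHER_IS_BETTER[name]:
            ratio = 2.0 - ratio              # flip lower-is-better metrics
        score += WEIGHTS[name] * ratio
    return score

current = {"val_loss": 2.31, "bleu4": 0.341, "pass_at_1": 0.280, "latency_ms": 48.0}
start   = {"val_loss": 2.47, "bleu4": 0.318, "pass_at_1": 0.240, "latency_ms": 52.0}
print(round(composite(current, start), 3))
```
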
Experiment Log (EXP-047)

| Status   | Δ Score    | Hypothesis                               | Time       |
|----------|------------|------------------------------------------|------------|
| KEPT     | +1.4%      | LR schedule: cosine annealing T_max=500  | 2m ago     |
| REVERTED | -0.3%      | Dropout rate: 0.2→0.3 in FFN             | 8m ago     |
| KEPT     | +0.9%      | Weight decay: 0.01→0.001                 | 14m ago    |
| FLAGGED  | +3.4% BLEU | WordPiece tokenization for code tokens   | 25m ago    |
| KEPT     | +0.5%      | Batch size schedule: linear 32→128       | 36m ago    |
| KEPT     | +0.2%      | Layer norm epsilon: 1e-5→1e-8            | 48m ago    |
| REVERTED | -0.9%      | FFN hidden size: 2048→3072               | 1h ago     |
| KEPT     | +4.2%      | Mixed precision FP16 baseline            | 1h 22m ago |
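
As one example of how a kept log entry translates into a concrete change, here is a minimal mixed-precision training step corresponding to the "Mixed precision FP16 baseline" row, using PyTorch's autocast and GradScaler; the model, optimizer, and batch are placeholders, not the experiment's actual setup.

```python
import torch

# Minimal mixed-precision step; on CUDA the autocast default dtype is FP16.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)          # unscale gradients, then take the optimizer step
scaler.update()
```
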
Soft-Stop Checkpoints
Marginal Return Threshold: avg gain over last 10 experiments +0.9%
Eval Convergence: 3/4 evals still improving
Compute Budget: 78% of budget consumed
Depth/Breadth Balance: sampler split 2:1
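
The checkpoints above gate a soft stop, but the dashboard does not show the rule that combines them; this sketch assumes the run stops when the compute budget is nearly exhausted, or when marginal returns and eval convergence both indicate saturation (the depth/breadth split is treated as a monitoring signal and left out). All thresholds are illustrative.

```python
def should_soft_stop(avg_gain_last_10: float,
                     evals_still_improving: int,
                     total_evals: int,
                     budget_used: float) -> bool:
    """Hypothetical combination of the dashboard's soft-stop checkpoints."""
    marginal_return_low = avg_gain_last_10 < 0.005          # below +0.5% average gain
    evals_converged = evals_still_improving <= total_evals // 2
    budget_exhausted = budget_used >= 0.90                   # 90% of compute budget
    return budget_exhausted or (marginal_return_low and evals_converged)

# Current state from the dashboard: +0.9% avg gain, 3/4 evals improving, 78% budget used.
print(should_soft_stop(avg_gain_last_10=0.009,
                       evals_still_improving=3,
                       total_evals=4,
                       budget_used=0.78))   # -> False, keep iterating
```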