Claude for Research

EXP-046 | 89 hypotheses evaluated | Best composite: 0.902 (+9.8% from baseline)

All success criteria met

Why It Stopped

Soft-stop triggered: AUC-ROC and MCC both converged within 0.5% of individual maxima

Over the final 22 hypotheses, AUC-ROC varied by ±0.002 and MCC by ±0.003 (roughly 0.2% and 0.4% of their respective maxima), both inside the 0.5% convergence threshold. The composite gain rate dropped to +0.03% per hypothesis.
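The soft-stop rule above can be sketched as a small check over per-metric histories. This is only an illustration under stated assumptions: the function names `has_converged` and `soft_stop` are hypothetical, the 0.5% threshold is read as relative to each metric's best value so far, and the 22-hypothesis window is borrowed from the summary rather than from any documented setting.

```python
def has_converged(history, window=22, rel_tol=0.005):
    """Return True if every value in the last `window` entries lies
    within `rel_tol` (relative) of the best value seen so far.

    Assumes a higher-is-better metric such as AUC-ROC or MCC.
    """
    if len(history) < window:
        return False
    best = max(history)
    return all(best - value <= rel_tol * best for value in history[-window:])


def soft_stop(metric_histories, window=22, rel_tol=0.005):
    """Trigger only when every tracked metric has converged."""
    return all(
        has_converged(history, window, rel_tol)
        for history in metric_histories.values()
    )


# Synthetic histories for illustration only (not the values from this run).
example = {
    "auc_roc": [0.80, 0.85] + [0.901, 0.903] * 11,
    "mcc": [0.70, 0.78] + [0.811, 0.814] * 11,
}
should_stop = soft_stop(example)
```

Requiring all metrics to converge, rather than any one of them, matches the wording "AUC-ROC and MCC both converged"; a real optimizer might also factor in the composite gain rate.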

Success criteria

AUC-ROC ≥ 0.90 (0.903)

ECE < 0.05 (0.038)

MCC ≥ 0.80 (0.814)

Triggered at

Hypothesis 89 of 120
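The success criteria listed above amount to three threshold comparisons. A minimal sketch follows, assuming the metrics are available as plain floats; the dictionary keys and the `criteria_met` helper are hypothetical names chosen for illustration. Note that ECE uses a strict upper bound while the other two are lower bounds.

```python
import operator

# Thresholds copied from the success criteria; key names are illustrative.
CRITERIA = {
    "auc_roc": (operator.ge, 0.90),  # AUC-ROC >= 0.90
    "ece": (operator.lt, 0.05),      # ECE < 0.05 (lower is better)
    "mcc": (operator.ge, 0.80),      # MCC >= 0.80
}


def criteria_met(results):
    """Map each criterion name to True/False for the given results."""
    return {
        name: compare(results[name], threshold)
        for name, (compare, threshold) in CRITERIA.items()
    }


# Values reported for the best configuration in this run.
report = criteria_met({"auc_roc": 0.903, "ece": 0.038, "mcc": 0.814})
all_met = all(report.values())
```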

Best Configuration

Cumulative changes from the highest-scoring hypothesis chain

Architecture diff vs. baseline: 0.902 composite (+9.8%)

Component       | Baseline            | Best configuration        | Gain
Pooling         | Mean pooling        | Attention-weighted sum    | +1.2%
Augmentation    | SMILES dropout 0.1  | SMILES dropout 0.2        | +0.7%
Features        | 2D fingerprint only | + 3D conformer embedding  | +2.3% (flagged)
Loss weighting  | Uniform             | Inverse class frequency   | +1.8%
Cross-val split | Random scaffold     | Scaffold stratified k-fold | +0.5%

Best Hypothesis Per Eval

Depth-first optimizer results — the single change that most improved each eval

AUC-ROC · HIGH

auc_roc_macro

0.903 (+0.083 vs baseline)

Baseline 0.820 → Best 0.903

Graph conv pooling: mean → attention-weighted sum

Attention pooling learns which atoms matter per endpoint, recovering signal lost in uniform aggregation.

SA-003 · DEPTH · Hypothesis 61
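The difference between the two readouts in the AUC-ROC card can be sketched in plain Python. This is a simplified illustration, not the model's actual layer: it assumes a single hypothetical gate vector `gate_weights` that scores each atom, with softmax-normalized scores used as the weights in the sum. A per-endpoint variant would keep one gate vector per endpoint.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(atom_embeddings):
    """Uniform aggregation: every atom contributes equally."""
    n_atoms = len(atom_embeddings)
    dim = len(atom_embeddings[0])
    return [sum(atom[d] for atom in atom_embeddings) / n_atoms for d in range(dim)]


def attention_pool(atom_embeddings, gate_weights):
    """Attention-weighted sum: score_i = w . h_i, alpha = softmax(score),
    output = sum_i alpha_i * h_i."""
    scores = [
        sum(w * x for w, x in zip(gate_weights, atom)) for atom in atom_embeddings
    ]
    alphas = softmax(scores)
    dim = len(atom_embeddings[0])
    return [
        sum(alpha * atom[d] for alpha, atom in zip(alphas, atom_embeddings))
        for d in range(dim)
    ]


# Toy molecule with three atoms and 2-dimensional embeddings.
atoms = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]]
uniform = attention_pool(atoms, [0.0, 0.0])   # zero gate: same as mean pooling
focused = attention_pool(atoms, [50.0, 0.0])  # strong gate: dominated by one atom
```

With a zero gate vector all attention weights are equal, so the readout reduces to mean pooling; a learned gate lets the readout concentrate on a few atoms.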

F1 Score · HIGH

f1_macro

0.871 (+0.080 vs baseline)

Baseline 0.791 → Best 0.871

Multi-task loss weighting: inverse class frequency

Rare ADMET endpoints (BBB, hERG) were under-weighted. Inverse-frequency re-weighting recovered macro F1 on minority classes.

SA-001 · DEPTH · Hypothesis 44
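One common way to compute inverse-class-frequency weights is sketched below. The exact normalization used in this run is not stated, so the formula here, `n_samples / (n_classes * count)`, is an assumption; the helper name `inverse_frequency_weights` is hypothetical. In a multi-task setting the same computation would be repeated per endpoint.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, common classes below 1, and the
    weights summed over all samples equal n_samples.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative imbalanced endpoint: 90 negatives, 10 positives.
weights = inverse_frequency_weights([0] * 90 + [1] * 10)
```

These per-class weights would then multiply each sample's loss term, so that errors on the minority class contribute more to the gradient.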

MCC · HIGH

matthews_corrcoef

0.814 (+0.080 vs baseline)

Baseline 0.734 → Best 0.814

Scaffold-stratified k-fold cross-validation

Random splits leak scaffold information between training and evaluation folds; scaffold-stratified splits give less biased MCC estimates and discourage overfitting to common scaffolds.

SA-002 · DEPTH · Hypothesis 38
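The idea behind a scaffold-stratified split can be illustrated with a greedy assignment: whole scaffold groups go to a single fold, and groups containing positives are steered toward the fold that currently has the fewest positives. This is a simplified heuristic and not necessarily the procedure used in this run. It assumes scaffold identifiers (for example Bemis-Murcko scaffolds) have been computed elsewhere and that labels are binary; the function name is hypothetical. scikit-learn's `StratifiedGroupKFold` offers a maintained alternative with a similar goal.

```python
from collections import defaultdict


def scaffold_stratified_folds(scaffolds, labels, k=5):
    """Assign sample indices to k folds so that no scaffold spans two
    folds and positives are spread roughly evenly (greedy heuristic)."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Place the largest scaffold groups first, while folds are still empty.
    ordered = sorted(groups.values(), key=len, reverse=True)

    folds = [[] for _ in range(k)]
    positives = [0] * k
    for members in ordered:
        n_pos = sum(labels[i] for i in members)
        if n_pos:
            # Balance positives first, fold size second.
            target = min(range(k), key=lambda f: (positives[f], len(folds[f])))
        else:
            # Balance fold size first for all-negative groups.
            target = min(range(k), key=lambda f: (len(folds[f]), positives[f]))
        folds[target].extend(members)
        positives[target] += n_pos
    return folds


# Toy data: six scaffolds, ten molecules, three positives.
toy_scaffolds = ["A", "A", "A", "B", "B", "C", "C", "D", "E", "F"]
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
toy_folds = scaffold_stratified_folds(toy_scaffolds, toy_labels, k=2)
```

Each fold then serves once as the held-out set, so test molecules never share a scaffold with training molecules.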

Calibration (ECE) · MEDIUM

expected_calibration_error

0.038 (−0.030 vs baseline)

Baseline 0.068 → Best 0.038

Temperature scaling post-hoc calibration (T=1.4)

Without calibration, the model was overconfident on its high-probability predictions. Temperature scaling brought ECE well below the 0.05 target.

SA-004 · BREADTH · Hypothesis 77
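Temperature scaling and ECE can both be sketched in a few lines. The version below is an assumption-laden illustration for a single binary endpoint: logits are divided by a temperature T before the sigmoid, and ECE is computed with the top-label definition over 10 equal-width confidence bins. The binning scheme and bin count used in this run are not stated, and the function names are hypothetical. Only T=1.4 comes from the summary.

```python
import math


def apply_temperature(logit, temperature):
    """Binary temperature scaling: sigmoid(logit / T). T > 1 softens
    predictions toward 0.5; T = 1 leaves them unchanged."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: sum over bins of (bin fraction) * |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for prob, label in zip(probs, labels):
        confidence = max(prob, 1.0 - prob)           # confidence in predicted class
        correct = 1.0 if (prob >= 0.5) == bool(label) else 0.0
        bin_index = min(int(confidence * n_bins), n_bins - 1)
        bins[bin_index].append((confidence, correct))

    n_samples = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(a for _, a in bucket) / len(bucket)
        ece += (len(bucket) / n_samples) * abs(accuracy - avg_confidence)
    return ece


# Synthetic overconfident predictions: high logits, but only 80% correct.
toy_logits = [4.0] * 10
toy_labels = [1] * 8 + [0] * 2
ece_before = expected_calibration_error(
    [apply_temperature(z, 1.0) for z in toy_logits], toy_labels
)
ece_after = expected_calibration_error(
    [apply_temperature(z, 1.4) for z in toy_logits], toy_labels
)
```

In practice T is fitted on a held-out validation set, typically by minimizing negative log-likelihood, and then frozen; scaling does not change which class is predicted, only how confident the prediction is.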

Proposed Next Directions

Select a direction to configure a new experiment run, or chat with Claude to define a custom path