Training Setup:
- Models used: b40c768nbt and b28c512nbt (for distillation)
- Total training samples: 104.7M (approximately 100 million positions)
Current Results:
- Strengths: In the early game (first ~95 moves), the model completely dominates b6c96 models at the same parameter scale
- Weaknesses: Late-game performance deteriorates significantly; the model often gets outplayed and loses from advantageous positions
Specific Problems Encountered:
- Life and Death Misjudgment (Critical Issue)
  - The model cannot clearly distinguish between alive and dead groups
  - Frequently treats dead stones as living territory
  - This leads to severe scoring errors, especially in endgame situations
- Weak Late Game (around move 95+)
  - Early to mid-game performance is excellent (even crushing same-scale models)
  - After approximately 95 moves, the win rate plummets
  - Opponents consistently make comebacks from losing positions
  - Suggests the model hasn't learned proper late-game reduction/conservative play
- Unstable Win Rate Curve
  - Win rate predictions show unnatural oscillations
  - Lacks smooth transitions between positions
  - Makes it difficult to trust the model's judgment in close games
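To put a number on the win rate oscillation rather than judging it by eye, here is a minimal sketch. It assumes the per-move win rate predictions for one game are available as a list of floats, all from the same player's perspective. The function names, the 0.05 threshold, and the sample values are illustrative and not taken from my actual runs.

```python
from statistics import mean


def winrate_volatility(winrates):
    """Mean absolute change in predicted win rate between consecutive moves."""
    if len(winrates) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(winrates, winrates[1:]))


def sign_flips(winrates, threshold=0.05):
    """Count direction reversals where both adjacent swings exceed `threshold`."""
    deltas = [b - a for a, b in zip(winrates, winrates[1:])]
    flips = 0
    for d1, d2 in zip(deltas, deltas[1:]):
        # A reversal: the curve went up then down (or down then up),
        # and neither swing was small enough to be ordinary noise.
        if d1 * d2 < 0 and min(abs(d1), abs(d2)) > threshold:
            flips += 1
    return flips


# Made-up curves for illustration only
smooth = [0.50, 0.52, 0.55, 0.58, 0.60]
jumpy = [0.50, 0.70, 0.45, 0.72, 0.40]

print(winrate_volatility(smooth), sign_flips(smooth))
print(winrate_volatility(jumpy), sign_flips(jumpy))
```

Comparing these two numbers across checkpoints, or before and after move 95, would show whether the instability is concentrated in the late game.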
Current Training Metrics (b20c256 at ~100M samples):
- Policy loss (p0loss): ~2.33
- Value loss (vloss): ~0.79
- Ownership loss (oloss): ~0.69
- Score loss (sloss): ~0.57
- Gradient norm (gnorm): ~2000-4000 (some spikes observed)
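To make "some spikes observed" more precise, a rough sketch of how the gnorm log could be scanned for spikes. It assumes the logged gnorm values have already been parsed into a list. The window size, the factor of 2, and the sample values are arbitrary assumptions, not values from my run.

```python
from statistics import median


def flag_gnorm_spikes(gnorms, window=5, factor=2.0):
    """Return indices where gnorm exceeds `factor` times the median of the previous `window` values."""
    spikes = []
    for i in range(window, len(gnorms)):
        baseline = median(gnorms[i - window:i])
        if gnorms[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up values for illustration only
log = [2500, 3100, 2800, 3600, 2900, 12000, 3000, 2700]
print(flag_gnorm_spikes(log))  # -> [5]
```

Using the median of the preceding window as the baseline keeps a single spike from inflating the baseline for the steps that follow it.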
Questions for the Community:
- How can I specifically enhance life-and-death judgment without ruining the model's excellent early-game performance?
- What techniques work best for improving late-game decision-making?
  - Should I over-sample late-game positions?
  - Increase loss weights for late moves?
  - Use position-weighted sampling?
- The win rate curve is very volatile. Would a lower learning rate or a different optimizer (AdamW vs SGD) help smooth it?
- My ownership loss is still relatively high (0.69). Could this be the direct cause of the group status confusion? Any suggestions to specifically target oloss?
- For the late-game collapse issue, would "value head distillation" or "policy head distillation" from a stronger teacher model help more?
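To make the over-sampling / position-weighted sampling question concrete, here is a minimal sketch of what I have in mind. It assumes each training row carries its move number. The helper names, the cutoff at move 95, and the 3x boost are hypothetical choices for illustration, not something the training pipeline provides.

```python
import random


def move_weight(move_number, late_start=95, late_boost=3.0):
    """Relative sampling weight: 1.0 before `late_start`, `late_boost` from there on."""
    return late_boost if move_number >= late_start else 1.0


def sample_positions(move_numbers, k, rng, late_start=95, late_boost=3.0):
    """Draw k row indices with replacement, favoring late-game rows."""
    weights = [move_weight(m, late_start, late_boost) for m in move_numbers]
    return rng.choices(range(len(move_numbers)), weights=weights, k=k)


rng = random.Random(0)
move_numbers = list(range(1, 201))  # one hypothetical game, moves 1..200
picked = sample_positions(move_numbers, k=10000, rng=rng)

# Share of sampled rows that come from move 95 onward
late_share = sum(move_numbers[i] >= 95 for i in picked) / len(picked)
print(f"late-game share: {late_share:.2f}")
```

Without weighting, late-game rows would make up about 53% of this toy game; with the 3x boost the expected share is about 77%. Multiplying the per-sample loss by the same weight instead of resampling should be roughly equivalent in expectation, so the two sub-questions above are close to the same knob.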
Any insights or similar experiences would be greatly appreciated!