Data Distillation: Issues Encountered During Training

**Training Setup:**
- Models used: `b40c768nbt` and `b28c512nbt` (for distillation)
- Total training samples: 104.7M (approximately 100 million positions)

**Current Results:**
- **Strengths:** In the early game (roughly the first 95 moves), the model completely dominates b6c96 models of the same parameter scale
- **Weaknesses:** Late-game performance deteriorates significantly; the model is often outplayed and loses from advantageous positions

**Specific Problems Encountered:**

1. **Life and Death Misjudgment (Critical Issue)**
   - The model cannot clearly distinguish between alive and dead groups
   - Frequently treats dead stones as living territory
   - This leads to severe scoring errors, especially in endgame situations

2. **Weak Late Game (around move 95+)**
   - Early- to mid-game performance is excellent (even crushing same-scale models)
   - After approximately move 95, the win rate plummets
   - Opponents consistently come back from losing positions
   - This suggests the model has not learned proper late-game reduction or conservative play when ahead

3. **Unstable Win Rate Curve**
   - Win rate predictions oscillate unnaturally from move to move
   - Predictions do not transition smoothly between consecutive positions
   - This makes it difficult to trust the model's judgment in close games
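
To put a number on "unnatural oscillations", here is a minimal sketch of two metrics I could compute per game. It assumes the per-move win rate predictions are available as a plain list of floats in [0, 1]; the function names and the `min_swing` threshold are my own placeholders, not anything from the training code.

```python
from statistics import mean


def winrate_volatility(winrates):
    """Mean absolute change in predicted win rate between consecutive moves."""
    if len(winrates) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(winrates, winrates[1:]))


def direction_reversals(winrates, min_swing=0.05):
    """Count direction changes, ignoring moves smaller than min_swing."""
    deltas = [b - a for a, b in zip(winrates, winrates[1:])]
    # Keep only swings large enough to matter (threshold is an assumption).
    big = [d for d in deltas if abs(d) >= min_swing]
    # A reversal is a significant rise followed by a significant fall, or vice versa.
    return sum(1 for d1, d2 in zip(big, big[1:]) if d1 * d2 < 0)


# Toy curve, not real model output.
curve = [0.5, 0.7, 0.4, 0.8]
volatility = winrate_volatility(curve)
reversals = direction_reversals(curve)
```

Comparing these two numbers between this model and a reference model on the same games would show whether the curve is really less stable or only looks that way.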

**Current Training Metrics (b20c256 at ~100M samples):**
- Policy loss (p0loss): ~2.33
- Value loss (vloss): ~0.79
- Ownership loss (oloss): ~0.69
- Score loss (sloss): ~0.57
- Gradient norm (gnorm): ~2000-4000 (some spikes observed)
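
To make "some spikes" more precise, this is a rough sketch of how I could flag them from the logged gnorm values. It assumes the values have already been extracted into a list in step order; the window size and the 2x factor are arbitrary choices of mine.

```python
from statistics import median


def find_gnorm_spikes(gnorms, window=5, factor=2.0):
    """Return indices where gnorm exceeds factor times the median of the previous window."""
    spikes = []
    for i in range(window, len(gnorms)):
        # Baseline is the median of the preceding `window` values,
        # so a single earlier spike does not distort it much.
        baseline = median(gnorms[i - window:i])
        if gnorms[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Toy values in the range reported above, with one injected spike.
logged = [2000, 2500, 3000, 2200, 2800, 9000, 2600]
spike_steps = find_gnorm_spikes(logged)
```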

**Questions for the Community:**

1. How can we specifically enhance life-and-death judgment without ruining the model's excellent early-game performance?

2. What techniques work best for improving late-game decision-making?
   - Should I over-sample late-game positions?
   - Increase loss weights for late moves?
   - Use position-weighted sampling?

3. The win rate curve is very volatile. Would a lower learning rate or a different optimizer (AdamW vs. SGD) help smooth it?

4. My ownership loss is still relatively high (0.69). Could this be the direct cause of the confusion about group status? Are there any suggestions for targeting oloss specifically?

5. For the late-game collapse, which would help more: "value head distillation" or "policy head distillation" from a stronger teacher model?
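
To make the options in question 2 concrete, here is a minimal sketch of what I mean by position-weighted sampling. It assumes each training position is a record that carries its move number; the `{"move": ...}` dict, the pivot at move 95, and the 2x boost are illustrative placeholders, not the actual training data format or tuned values.

```python
import random


def move_weight(move_number, pivot=95, late_boost=2.0):
    """Relative sampling weight: 1.0 up to the pivot move, late_boost after it."""
    return late_boost if move_number > pivot else 1.0


def sample_positions(positions, k, pivot=95, late_boost=2.0, seed=None):
    """Draw k positions with replacement, over-sampling late-game ones."""
    rng = random.Random(seed)
    weights = [move_weight(p["move"], pivot, late_boost) for p in positions]
    return rng.choices(positions, weights=weights, k=k)


# Toy data: one hypothetical record per move of a 200-move game.
positions = [{"move": m} for m in range(1, 201)]
batch = sample_positions(positions, k=1000, seed=0)
late_fraction = sum(p["move"] > 95 for p in batch) / len(batch)
```

The same `move_weight` value could instead multiply each sample's loss, which would be the "increase loss weights for late moves" variant. I do not know which of the two is less likely to hurt the early game.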

Any insights or similar experiences would be greatly appreciated!
