COMP: Regenerate ShapeLabelMap baselines at full double precision#6521
COMP: Regenerate ShapeLabelMap baselines at full double precision#6521hjmjohnson wants to merge 1 commit into
Conversation
This comment has been minimized.
This comment has been minimized.
|
I think I wrote most of these tests. The computation should be done at double precision in the shape filter, 1e-3 is REALLY high tolerance. Do you have a link to the failing tests? In the linked issue I do see mention of matrix instability for principle axises in some cases. |
|
@blowekamp Here's the failure data from the 30-day CDash sweep (2026-05-25 → 06-26). 187 failed rows, all
Two distinct failure modes:
So your instinct looks right: a blanket Links to specific failing instances (CDash)Non-
(The macOS/Linux I couldn't script the exact expected-vs-actual deltas out of the CDash test-output pages (SPA), but I can pull the specific per-assertion numbers from a chosen build, or reproduce locally, if that helps decide the right tolerance (and whether the |
|
@blowekamp Follow-up with precise numbers. I instrumented the two
(full vectors: Centroid
(full vectors: Centroid TakeawayThe baselines are stored at 4–5 decimal places, and that rounding alone consumes up to ~45% of the You're right that |
The ShapeLabelMapFixture "resulting value" baselines were stored at 4-5 decimal places. For a value like Perimeter ~28.39 that rounding is ~5e-5, skewing the fixed 1e-4 absolute window enough that MSVC RelWithDebInfo builds flapped (top candidate in InsightSoftwareConsortium#6518). Replace the rounded baselines with the full double-precision computed values, and tighten to 1e-5 for the rotation-invariant scalars (perimeter, equivalent-sphere perimeter/radius, roundness) and the magnitude-invariant equivalent-ellipsoid diameter. The orientation-dependent centroid and oriented-bounding-box origin keep 1e-4: they traverse the less-stable principal-axes path, and the full-precision baseline re-centers their tolerance window.
b63c56d to
2ed4092
Compare
|
@blowekamp FYI — I force-pushed a different approach onto this PR to test it on CI. Instead of loosening the tolerance, I regenerated the "resulting value" baselines at full I originally took the easy "increase tolerance" path because the long-standing acceptance of these small deviations was never really a concern — in practice everyone just re-ran the flaky test and moved on. But your point pushed me to check whether higher-precision baselines are the easier real win, and they look like it. What was done
Why this fixes the flapThe baselines were stored at 4–5 decimals. For |
|
@hjmjohnson I am not sure how automated these messages are or how engaged you are with this. But the above corse of action does seem logical to me. If the assumption is that we can compute across platforms with 1e-5 accuracy than the above tolerances and precision of the base line is sufficient. Second, If the results of the failing test are looked at in most cases this not a tolerance issue: The ShapeLabelMapFixture.3D_T1x1x1's This particular test is JUST one pixel set on a 3D image! What is going on here? It needs further investigation. We should be able to do this well. On this failing test: This does appear to be a tolerance issue. I saw on a couple other tests with the RMS of vector. This has a custom ITK_EXPECT_VECTOR_NEAR macro where the tolerance should likely be adjusted. |
|
On the one system I check that this test is failing has The failing vector comparison above with the bonding box not being with in the tolerance make sense with this configuration. And relaxing the vector near tolerance is appropriate. I don't have a complete picture of the reasons for the other test failures. Adding the full precision is a reasonable change, and an near important digit may have been incorrectly dropped in the current values. EDIT: Also this system with floating point computation enables has 255 other test failures. So Our tests are not configured to pass when floating point precision is enabled ( nor do I think the tolerances should be relaxed to make them) |
|
The main value of the build with |
@dzenanz I don't think have tests failing in an "Expected" section of the dashboard is expected. Perhaps the testing phase of this build could be skipped? Or moved to a non-expected section. Also this particular test is not on the current build because there is a compilation error. |
|
I moved this build to "Nightly" group. |
|
@blowekamp FYI: I'm unlikely to be able to look more deeply into this until later this weekend. I do review all the code changes and read all the commit messages. I'll be honest here, was focused on making the tests not flaky, I don't think this is indicative of code bugs, just testing environment inconsistency boundary conditions. I was too lazy in trying to keep this test from continuing to infect the review of other PRs by it's failures and CDASH noise. |
Do you have an example CI failure where the FLOAT option was not enabled? EDIT: Found a couple: https://open.cdash.org/tests/2615238403 They are specifically related to the oriented bounding box computation. The change also is related to eigen/eispack-eigensystem which is what is computation uses. I don't think these tests were "flacky" before the numerics changes. |
|
Root-cause TL;DR — a full report PDF is attached separately for reference (not required reading):
|
|
Is there a link to this "flacky" test failure that is not due to floating point computation or during the migration to the eigen toolkit? Hans, I am sorry to continue to be difficult here. But there are some things in this AI analysis I am skeptical about.
This is me being skeptical of the AI here and the working theory that the initial premises/promp present to the LLM, and the instructions to be helpful has produce incorrect results that are in agreement with information in the prompt. I think my inline comment about the signs of the eigen values (even though it appears to be no longer true), has corrupted the helpful and agreeable AI. |
|
Triage context and the full root-cause arc for this PR are tracked in #6518 — see this analysis comment. Short version: the |
No problems at all, you are not being difficult. The flakyness claim is FALSE, it was an incorrect first guess at the problem, and it went way too far down the 'fix flakyness' path before you helped me realize the real problem was a change in which of many "correct" solutions was chosen. Then it took me a while to figure out how to fix the test and augment the core algorithm code to provide a clearer path for future developers. I was trying to track down tests that were passing and failing in superficially 'weird' patterns (i.e., not failing all the time). I asked for the search over the past 30 days of CDash and CI infrastructure. I forgot to condition the query on the introduced changes. On the same day, some other (i.e. DCMTK) CI unrelated to the change both had, and some did not have, the changed vnl-> eigen code; the superficial search of failing tests appeared as "sometimes it fails, sometimes it succeeds". I do think the new tests are "more robust at identifying a correct solution among several candidate correct solutions" rather than "test against the particular correct solution that the vnl_ algorithm converged on." If there are concerns with this solution, or you would like changes made, please just let me know. PS: I was going to look into the valgrind time-out hypothesis next for the other test failure cases. |
|
Following up on the valgrind-timeout hypothesis for the other The (Unrelated to this PR — just closing the loop on the #6518 triage thread.) |
|
With this test not being flacky and the eigen value order resolved, adding the center OOB point is not needed. Keeping the full double precision baseline values is a good thing to keep, IMHO. |
1584e7f to
2ed4092
Compare
|
@blowekamp I was after an independent, flip-robust correctness signal after recognizing that the origin may change position. The center of the cube being stable was the first thing that came to mind. That the origin assertion can't give, I overlooked that GetCentroid already provides exactly that, so the center accessor was redundant. |
Regenerates the
ShapeLabelMapFixture"resulting value" baselines at fulldoubleprecision and tightens most tolerances, instead of loosening them. Highest-volume flaky candidate tracked in #6518 (187 failures / 30 days, dominated by Windows MSVC).Root cause (Windows-only scalar group)
The "resulting value" baselines were stored at only 4–5 decimal places. For
Perimeter ≈ 28.39that rounding is ~5e-5, which skews the fixed1e-4absolute tolerance window asymmetrically: the true (macOS) value sits only ~5e-6 inside the lower edge, so an MSVC RelWithDebInfo build that differs by just ~5e-6 fails. The computation was alwaysdoubleprecision — the stored baselines were the lossy part.What changed
doubleprecision (captured atsetprecision(17)).1e-5(10× tighter than the original1e-4) for the rotation-invariant scalars (Perimeter,EquivalentSphericalPerimeter/Radius,Roundness) and the magnitude-invariantEquivalentEllipsoidDiameter.1e-4on the orientation-dependent positions (Centroid/OrientedBoundingBoxOriginin the_Directiontests): they traverse the principal-axes path, where the eigenvalue order and sign canonicalization now fix the result deterministically.1e-99/1e-10) asserts untouched.All 12
ShapeLabelMapFixturetests pass locally (macOS arm64); clang-format clean.