[Common] Persistent Grouped NVFP4 quantization kernel#2743

Open
Oleg-Goncharov wants to merge 34 commits into NVIDIA:main from Oleg-Goncharov:pr_persistent_grouped_nvfp4_kernel

Conversation


@Oleg-Goncharov Oleg-Goncharov commented Mar 6, 2026

Description

This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738 ([Common] Persistent Grouped MXFP8 quantization kernel).

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added persistent grouped kernel
  • Added test suite
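
To make the quantization step concrete: NVFP4 stores values in FP4 E2M1 (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a per-block scale so the block amax maps to the E2M1 maximum. The following host-side sketch shows that arithmetic under those assumptions; the function names are illustrative, not the PR's kernel API, and real kernels store the scale in E4M3 rather than full-precision float.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Round to the nearest FP4 E2M1 representable value (sketch).
// E2M1 magnitudes: {0, 0.5, 1, 1.5, 2, 3, 4, 6}, plus sign.
inline float round_to_e2m1(float x) {
  static const float grid[] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
  float best = 0.f, best_err = std::fabs(x);
  for (float g : grid) {
    for (float s : {g, -g}) {
      float err = std::fabs(x - s);
      if (err < best_err) { best_err = err; best = s; }
    }
  }
  return best;
}

// Quantize one block (NVFP4 uses 16-element blocks): choose a scale so the
// block amax maps to 6.0 (the E2M1 max), round each scaled element to the
// E2M1 grid, and return the dequantized values for inspection.
std::vector<float> quantize_block_nvfp4(const std::vector<float>& block) {
  float amax = 0.f;
  for (float v : block) amax = std::max(amax, std::fabs(v));
  float scale = (amax > 0.f) ? amax / 6.0f : 1.0f;  // stored as E4M3 in practice
  std::vector<float> out;
  out.reserve(block.size());
  for (float v : block) out.push_back(round_to_e2m1(v / scale) * scale);
  return out;
}
```

With amax equal to 6, the scale is exactly 1 and values already on the E2M1 grid survive the round trip unchanged.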

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Oleg-Goncharov and others added 30 commits February 27, 2026 15:53
Oleg-Goncharov and others added 4 commits March 6, 2026 14:33

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR introduces a new persistent grouped NVFP4 quantize+transpose kernel (group_quantize_transpose_nvfp4_tuned_1D.cuh) with static grid-stride scheduling, double-buffered TMA prefetch, and per-tensor scale-offset bookkeeping. It also promotes the dbias output parameter from NVTETensor to NVTEGroupedTensor across the grouped activation/quantization API.

The new kernel itself is well-designed. However, the PR contains several changes that appear to be development-time leftovers that would cause serious regressions if merged as-is:

  • All pre-existing single-tensor quantization is silently broken: Both quantize_fwd_helper and quantize_bwd_helper in quantize.cuh have their entire bodies commented out, leaving empty no-op stubs for every scaling mode.
  • MXFP8 grouped quantization is disabled: The NVTE_MXFP8_1D_SCALING cases in both group_quantize_fwd_helper and group_quantize_bwd_helper are commented out, meaning any grouped MXFP8 call hits NVTE_ERROR.
  • Test CI coverage gutted: tests/cpp/operator/CMakeLists.txt comments out all 29 existing test sources, and tests/cpp/CMakeLists.txt restricts the fallback CUDA architecture to Blackwell-only (sm_100). These would prevent detection of any regressions introduced by the above changes.
  • Debug utilities left in test code: dump_nvfp4_tensor_data and print_detailed_tensor_comparison in the new test file write files to disk and print verbose tables to stdout; these should not ship in production test code.
  • Unused variable: truncated_pair is computed but never read in the reference quantize_nvfp4 helper.
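
For readers unfamiliar with the scheduling style the summary mentions: with static grid-stride scheduling, a persistent kernel launches a fixed grid once and each block walks the job list with a stride equal to the grid size. The host-side simulation below illustrates that assignment pattern; `static_schedule`, `num_ctas`, and `num_jobs` are hypothetical names, not symbols from this PR.

```cpp
#include <vector>

// Simulate static grid-stride scheduling: block `ctaid` processes jobs
// ctaid, ctaid + num_ctas, ctaid + 2*num_ctas, ... until jobs run out.
// Every job is claimed exactly once with no atomics or work queue.
std::vector<std::vector<int>> static_schedule(int num_ctas, int num_jobs) {
  std::vector<std::vector<int>> per_cta(num_ctas);
  for (int ctaid = 0; ctaid < num_ctas; ++ctaid)
    for (int job = ctaid; job < num_jobs; job += num_ctas)  // grid-stride loop
      per_cta[ctaid].push_back(job);
  return per_cta;
}
```

On a GPU the outer loop is implicit (one iteration per launched CTA) and only the strided inner loop appears in the kernel, which is what makes the kernel "persistent": CTAs outlive a single tile of work.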

Confidence Score: 1/5

  • Not safe to merge — pre-existing quantization paths are silently disabled and the test suite is stripped down to a single new test file.
  • The new kernel implementation looks sound in isolation, but the PR inadvertently (or as a development shortcut) comments out the entire bodies of quantize_fwd_helper and quantize_bwd_helper, disables the MXFP8 grouped dispatch, restricts test builds to Blackwell only, and comments out all pre-existing operator tests. These changes would silently break all non-grouped and all MXFP8 quantization in the library, and the disabled test suite would prevent CI from catching it.
  • transformer_engine/common/cast/dispatch/quantize.cuh (empty fwd/bwd helpers, disabled MXFP8 case), tests/cpp/CMakeLists.txt (architecture restriction), tests/cpp/operator/CMakeLists.txt (all tests commented out).

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/cpp/CMakeLists.txt | CUDA architecture narrowed to sm_100 only in the fallback path, silently dropping Volta/Ampere/Ada/Hopper coverage; old line left as a comment. |
| tests/cpp/operator/CMakeLists.txt | All 29 existing operator test sources commented out; only the new nvfp4 grouped test is compiled, completely suppressing regression coverage. |
| transformer_engine/common/cast/dispatch/quantize.cuh | Both quantize_fwd_helper and quantize_bwd_helper are now empty stubs (entire bodies commented out); MXFP8 grouped dispatch is also disabled — all pre-existing quantization paths are silently broken. |
| transformer_engine/common/cast/nvfp4/specialized/group_quantize_transpose_nvfp4_tuned_1D.cuh | New persistent grouped NVFP4 quantize+transpose kernel with static grid-stride scheduling, double-buffered TMA prefetch, and separate rowwise/colwise scaling helpers; well-structured with configurable TunableConfig. |
| transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh | Refactored to share ShapeRepresentation enum and JobDescriptor/BlockDescriptor structs from common; MXFP8 kernel now also has persistent-mode support, but is currently unreachable because the dispatch in quantize.cuh is commented out. |
| transformer_engine/common/cast/core/common.cuh | Adds ShapeRepresentation enum (moved from mxfp8) and a new group_reduce_dbias_kernel / grouped_reduce_dbias for multi-tensor dbias reduction; changes are self-contained and correct. |
| tests/cpp/operator/test_cast_nvfp4_transpose_grouped.cu | New grouped NVFP4 quantize+transpose test suite; logic is sound but includes unused truncated_pair variable and debug file-dump/verbose-print utilities that should not ship in production tests. |
| tests/cpp/operator/test_cast_mxfp8_grouped.cu | Updated to support per-tensor dbias comparison across all group sizes; the old single-tensor special-casing is replaced with a unified loop, which is cleaner and more correct. |
| transformer_engine/common/include/transformer_engine/cast.h | API signature for nvte_group_quantize_dbias updated: dbias parameter promoted from NVTETensor to NVTEGroupedTensor, consistent with the new per-tensor dbias layout. |
| transformer_engine/common/activation/gelu.cu | dbias parameters in nvte_group_quantize_dbias_dgelu / nvte_group_quantize_dbias_dqgelu and internal local nulls updated from NVTETensor to NVTEGroupedTensor; straightforward signature alignment. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["nvte_group_quantize()"] --> B["group_quantize_fwd_helper()"]
    B --> C{scaling_mode?}
    C -->|NVTE_NVFP4_1D_SCALING| D["nvfp4::group_quantize_transpose()"]
    C -->|NVTE_MXFP8_1D_SCALING| E["❌ NVTE_ERROR\n(case commented out)"]
    C -->|other| F["NVTE_ERROR"]

    D --> G{use_single_work_grid?}
    G -->|SAME_BOTH_DIMS or\nVARYING_FIRST_DIM| H["Single contiguous work grid\n(ctaid_Y * cols + ctaid_X)"]
    G -->|VARYING_LAST_DIM or\nVARYING_BOTH_DIMS| I["Per-element block grid\n(block_id * ELTS_PER_CHUNK)"]

    H --> J["Persistent kernel\n(static grid-stride loop)"]
    I --> J

    J --> K["decode_job()\nResolve tensor_id, rows, cols"]
    K --> L["is_job_valid()\nBounds check vs offsets"]
    L -->|invalid| M["Drain in-flight TMA\nthen exit"]
    L -->|valid| N["decode_block()\nblock_id_Y, block_id_X, offsets"]

    N --> O["Prime pipeline:\nPrefetch PREFETCH_STAGES stage(s)\nvia TMA global→shared"]
    O --> P["Main stage loop\n(STAGES = TILES_Y × TILES_X)"]

    P --> Q["process_colwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    P --> R["process_rowwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    Q --> S["store_output_stage()\nTMA shared→global commit"]
    R --> S

    S --> T{Last PREFETCH_STAGES?}
    T -->|yes| U["Prefetch stage-0\nof next job"]
    U --> P
    T -->|next job in stride| V["Advance ctaid\nstatic_next_block_id += stride"]
    V --> K
    T -->|no more jobs| W["cp_async_bulk_wait\nthen exit CTA"]
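
The flowchart's decode_job() step resolves a flat block id into a tensor id plus a tensor-local position using the per-tensor offsets. A minimal sketch of that lookup, assuming cumulative offsets where `offsets[i]..offsets[i+1]` covers tensor `i` (the function and argument names here are hypothetical, not the PR's actual decode_job()):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Resolve a flat block_id against cumulative per-tensor offsets.
// Returns {tensor_id, local_block} where local_block is relative to the
// start of that tensor's range. A linear scan is fine for small group counts.
std::pair<int, std::int64_t> decode_job(const std::vector<std::int64_t>& offsets,
                                        std::int64_t block_id) {
  int tensor_id = 0;
  while (tensor_id + 1 < static_cast<int>(offsets.size()) &&
         offsets[tensor_id + 1] <= block_id)
    ++tensor_id;
  return {tensor_id, block_id - offsets[tensor_id]};
}
```

This pairs with the flowchart's is_job_valid() bounds check: a block id at or beyond the last offset corresponds to no tensor and the CTA drains its in-flight TMA and exits.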

Comments Outside Diff (1)

  1. transformer_engine/common/cast/dispatch/quantize.cuh, lines 175-314

    quantize_bwd_helper body completely commented out — function is now a no-op

    Identical to quantize_fwd_helper: the full backward quantization dispatch (all scaling modes) has been commented out. Any backward quantization pass — including the critical IS_DBIAS and IS_DACT paths used during training — now silently does nothing. This must be restored before merging.

Last reviewed commit: 46fdb93

Comment on lines +11 to +12
# set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
set(CMAKE_CUDA_ARCHITECTURES 100)

Development-only architecture restriction should not be merged

The old multi-architecture fallback (75 80 89 90) has been replaced with a Blackwell-only build (100). The original line is left behind as a commented-out breadcrumb. This means on any CI runner with CUDA < 12.8 the test binary will only compile for sm_100, silently skipping all Volta/Ampere/Ada/Hopper targets. This is clearly a local development shortcut and must be reverted before merging.

Suggested change:
- # set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
- set(CMAKE_CUDA_ARCHITECTURES 100)
+ set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)

Comment on lines +6 to +35
# test_cast.cu
# test_cast_current_scaling.cu
# test_cast_dbias.cu
# test_cast_dbias_dgelu.cu
# test_cast_gated_swiglu.cu
# test_cast_mxfp8_gated_swiglu.cu
# test_qdq.cu
# test_cast_mxfp8.cu
# test_cast_mxfp8_grouped.cu
# test_cast_nvfp4_transpose.cu
test_cast_nvfp4_transpose_grouped.cu
# test_cast_float8blockwise.cu
# test_dequantize_mxfp8.cu
# test_transpose.cu
# test_cast_transpose.cu
# test_cast_transpose_current_scaling.cu
# test_cast_transpose_dbias.cu
# test_cast_transpose_dbias_dgelu.cu
# test_cast_transpose_dgeglu.cu
# test_act.cu
# test_normalization.cu
# test_normalization_mxfp8.cu
# test_memset.cu
# test_multi_cast_transpose.cu
# test_multi_padding.cu
# test_multi_unpadding.cu
# test_causal_softmax.cu
# test_swizzle.cu
# test_swap_first_dims.cu
# test_grouped_gemm.cu

All existing operator tests commented out

Every pre-existing test (test_cast.cu, test_cast_mxfp8.cu, test_cast_mxfp8_grouped.cu, test_normalization.cu, etc.) has been commented out, leaving only test_cast_nvfp4_transpose_grouped.cu in the build. This completely disables the existing regression suite and hides any breakage introduced by the changes in quantize.cuh and group_quantize_mxfp8.cuh. These comment-outs appear to be a local development convenience and must be reverted before merging.

The new test file should simply be added to the existing list, not substituted for it.

Comment on lines +34 to 174
// using namespace detail;

// const Tensor *input_tensor = convertNVTETensorCheck(input);
// Tensor *output_tensor = convertNVTETensorCheck(output);

// // Quantization config
// QuantizationConfig quant_config_cpp;
// if (quant_config != nullptr) {
// quant_config_cpp = *reinterpret_cast<QuantizationConfig *>(quant_config);
// }

// // Noop flag
// Tensor dummy_tensor;
// Tensor *noop_tensor = &dummy_tensor;
// if (quant_config_cpp.noop_tensor != nullptr) {
// noop_tensor = convertNVTETensorCheck(quant_config_cpp.noop_tensor);
// }

// // Check for unsupported options
// if (quant_config_cpp.stochastic_rounding) {
// NVTE_CHECK(output_tensor->scaling_mode == NVTE_NVFP4_1D_SCALING,
// "Stochastic rounding is only supported for NVFP4 quantization.");
// }

// NVTE_CHECK(output_tensor->has_data() || output_tensor->has_columnwise_data(),
// "Either rowwise or columnwise output data need to be allocated.");

// // Dispatch to quantization kernel depending on data format
// switch (output_tensor->scaling_mode) {
// case NVTE_DELAYED_TENSOR_SCALING: {
// const Tensor *dummy_input_tensor = nullptr;
// Tensor *dummy_dbias_tensor = nullptr;
// Tensor *dummy_workspace_tensor = nullptr;
// if (output_tensor->has_columnwise_data()) {
// NVTE_CHECK(output_tensor->has_data(),
// "Quantizing in only the columnwise direction not supported yet!");
// if constexpr (!IS_ACT) {
// cast_transpose(*input_tensor, *noop_tensor, output_tensor, stream);
// } else {
// cast_transpose_fused</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, float, ParamOP, OP>(
// *input_tensor, dummy_input_tensor, output_tensor, dummy_dbias_tensor,
// dummy_workspace_tensor, stream);
// }
// } else if (output_tensor->has_data()) {
// fp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
// *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
// dummy_workspace_tensor, stream);
// }
// break;
// }
// case NVTE_MXFP8_1D_SCALING: {
// const Tensor *dummy_input_tensor = nullptr;
// Tensor *dummy_dbias_tensor = nullptr;
// Tensor *dummy_workspace_tensor = nullptr;
// mxfp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
// *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
// dummy_workspace_tensor, stream);
// break;
// }
// case NVTE_NVFP4_1D_SCALING: {
// NVTE_CHECK(!IS_ACT, "IS_ACT is not supported by FWD NVTE_NVFP4_1D_SCALING");

// // Check tensors
// CheckNoopTensor(*noop_tensor, "cast_noop");
// CheckInputTensor(*input_tensor, "input");
// CheckOutputTensor(*output_tensor, "output", false);

// // Choose kernel
// int32_t rows = input_tensor->flat_first_dim();
// int32_t cols = input_tensor->flat_last_dim();
// auto dtype = input_tensor->dtype();
// bool use_optimized_kernel = (dtype == DType::kBFloat16) && (rows % 32 == 0) &&
// (cols % 32 == 0) && output_tensor->has_data();

// // Launch NVFP4 quantize kernel
// if (use_optimized_kernel) {
// if (quant_config_cpp.nvfp4_2d_quantization) {
// nvfp4::quantize_transpose</*use_2d_quantization=*/true>(
// *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
// } else {
// nvfp4::quantize_transpose</*use_2d_quantization*/ false>(
// *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
// }
// } else {
// auto &global_amax = (output_tensor->amax.dptr != nullptr) ? output_tensor->amax
// : output_tensor->columnwise_amax;
// quantize_transpose_vector_blockwise_fp4(
// /*input=*/input_tensor->data, /*global_amax=*/global_amax,
// /*scale_inv=*/output_tensor->scale_inv,
// /*scale_inv_t=*/output_tensor->columnwise_scale_inv,
// /*output=*/output_tensor->data, /*output_t=*/output_tensor->columnwise_data,
// /*epsilon=*/0.0f, /*return_identity=*/output_tensor->has_data(),
// /*return_transpose=*/output_tensor->has_columnwise_data(), /*pow2_scale=*/false,
// /*swizzled_scale=*/false,
// /*use_stochastic_rounding=*/quant_config_cpp.stochastic_rounding,
// /*rng_state=*/quant_config_cpp.rng_state,
// /*use_2d_quantization=*/quant_config_cpp.nvfp4_2d_quantization,
// /*noop_tensor=*/noop_tensor->data, /*stream=*/stream);
// }
// break;
// }
// case NVTE_BLOCK_SCALING_2D: {
// // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
// NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_2D");
// bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
// float epsilon = quant_config_cpp.amax_epsilon;
// quantize_transpose_square_blockwise(
// input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
// output_tensor->data, output_tensor->columnwise_data, epsilon,
// /*return_transpose=*/output_tensor->has_columnwise_data(), force_pow_2_scales,
// /*noop_tensor=*/noop_tensor->data, stream);
// break;
// }
// case NVTE_BLOCK_SCALING_1D: {
// // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
// NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_1D");
// bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
// float epsilon = quant_config_cpp.amax_epsilon;
// FP8BlockwiseRowwiseOption rowwise_option = FP8BlockwiseRowwiseOption::NONE;
// FP8BlockwiseColumnwiseOption columnwise_option = FP8BlockwiseColumnwiseOption::NONE;
// if (output_tensor->has_data()) {
// rowwise_option = FP8BlockwiseRowwiseOption::ROWWISE_GEMM_READY;
// }
// if (output_tensor->has_columnwise_data()) {
// columnwise_option = FP8BlockwiseColumnwiseOption::COLUMNWISE_GEMM_READY;
// }
// quantize_transpose_vector_blockwise(
// input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
// output_tensor->data, output_tensor->columnwise_data, epsilon, rowwise_option,
// columnwise_option, force_pow_2_scales, noop_tensor->data, stream);
// break;
// }
// default:
// NVTE_ERROR("Not implemented scaling mode: " + to_string(output_tensor->scaling_mode) + ".");
// }
}

template <bool IS_DBIAS, bool IS_DACT, typename ParamOP, float (*OP)(float, const ParamOP &)>
void quantize_bwd_helper(const NVTETensor grad, const NVTETensor input, NVTETensor output,
NVTETensor dbias, NVTETensor workspace,
const NVTEQuantizationConfig quant_config, cudaStream_t stream) {

quantize_fwd_helper body completely commented out — function is now a no-op

The entire implementation of quantize_fwd_helper (covering NVTE_DELAYED_TENSOR_SCALING, NVTE_MXFP8_1D_SCALING, NVTE_NVFP4_1D_SCALING, NVTE_BLOCK_SCALING_2D, and NVTE_BLOCK_SCALING_1D) has been replaced with block comments, leaving the function as an empty stub. Any call to single-tensor forward quantization (nvte_quantize, nvte_quantize_dbias, etc.) now silently does nothing.

This is a severe regression that would cause all non-grouped quantization code paths to silently produce uninitialized output. These commented-out lines must be restored.

Comment on lines +410 to +416
// case NVTE_MXFP8_1D_SCALING: {
// mxfp8::group_quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
// input_tensor, activations_tensor, noop_tensor, output_tensor, dbias_tensor,
// workspace_tensor, stream);
// break;
// }
case NVTE_NVFP4_1D_SCALING: {

MXFP8 grouped quantization dispatch disabled

The NVTE_MXFP8_1D_SCALING case has been commented out from both group_quantize_fwd_helper (line ~410) and group_quantize_bwd_helper (line ~464). As a result, any call to grouped MXFP8 forward or backward quantization falls through to the default branch and calls NVTE_ERROR, breaking existing MXFP8 grouped workflows.

If this is intentional (e.g., MXFP8 is being refactored in parallel), the PR description and checklist should call this out explicitly as a breaking change, and the NVTE_MXFP8_1D_SCALING cases should at least be preserved in a functional state or guarded by a feature flag.

fp4e2m1x2 casted_to_e2m1_pair(scaled_elt_pair);
output[idx_pair] = casted_to_e2m1_pair;

const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);

Unused variable truncated_pair

truncated_pair is computed from cvt_fp4x2_to_double2(casted_to_e2m1_pair) but its value is never read. This will generate a compiler warning and the dead computation should be removed.

Suggested change:
- const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);

Comment on lines +234 to +310
void dump_nvfp4_tensor_data(const std::string& prefix,
const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
const int rows, const int cols) {
std::string test_file = prefix + "_test.txt";
std::string ref_file = prefix + "_ref.txt";
std::string diff_file = prefix + "_diff.txt";

std::ofstream test_out(test_file);
std::ofstream ref_out(ref_file);
std::ofstream diff_out(diff_file);

if (test_out.is_open() && ref_out.is_open() && diff_out.is_open()) {
for (int i = 0; i < rows; ++i) {
for (int j = 0; j < cols; j += 2) {
const int idx = i * cols + j;
double2 test_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[idx/2]));
double2 ref_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[idx/2]));

for (int k = 0; k < 2; ++k) {
const double t = (k == 0 ? test_data_pair.x : test_data_pair.y);
const double r = (k == 0 ? ref_data_pair.x : ref_data_pair.y);
const int pos = idx + k;

test_out << "pos[" << pos << "] = " << t << std::endl;
ref_out << "pos[" << pos << "] = " << r << std::endl;
diff_out << "pos[" << pos << "] test=" << t << " ref=" << r
<< " abs_diff=" << fabs(t - r)
<< " rel_diff=" << (r == 0 ? 0.0 : fabs((t - r) / r)) << std::endl;
}
}
}
std::cout << "DEBUG: Dumped tensor data to files: " << test_file << ", " << ref_file << ", " << diff_file << std::endl;
} else {
std::cout << "WARNING: Could not open files for tensor data dump" << std::endl;
}
}

void print_detailed_tensor_comparison(const std::string& name,
const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
const int rows, const int cols) {
printf("\n=== DETAILED COMPARISON for %s (%d×%d = %d elements) ===\n",
name.c_str(), rows, cols, rows * cols);

const int total_elements = rows * cols;
const int check_count = 128;

printf("--- FIRST %d ELEMENTS ---\n", check_count);
printf("Index | Test_Value | Ref_Value | Match\n");
printf("------|---------------|---------------|-------\n");
for (int i = 0; i < std::min(check_count, total_elements); ++i) {
double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
bool match = (fabs(t - r) < 1e-6);

printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
}

if (total_elements > 2 * check_count) {
printf("\n--- LAST %d ELEMENTS ---\n", check_count);
printf("Index | Test_Value | Ref_Value | Match\n");
printf("------|---------------|---------------|-------\n");
for (int i = total_elements - check_count; i < total_elements; ++i) {
double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
bool match = (fabs(t - r) < 1e-6);

printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
}
}
printf("==================================\n");
}

Debug dump and verbose comparison utilities should be removed or hidden before merging

dump_nvfp4_tensor_data (line 234) writes test/ref/diff text files to disk during CI runs. print_detailed_tensor_comparison (line 271) prints element-by-element tables to stdout. Both functions are only conditionally exercised and print_detailed_tensor_comparison is already fully commented out at the call site (lines 323–324). These are development-time diagnostic aids that add noise, may fail in read-only CI environments, and were not part of any documented public API. They should be removed from the final PR, or at minimum wrapped in #ifdef NVTE_DEBUG_DUMP guards so they never run in CI.
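
The suggested guard can be as simple as the sketch below: the dump path is computed only when NVTE_DEBUG_DUMP is defined at build time, so CI and release builds compile the helper to a no-op. The function name here is illustrative (modeled on the test helper's prefix argument), not code from the PR.

```cpp
#include <string>

// Sketch of the #ifdef NVTE_DEBUG_DUMP guard suggested above: return the
// dump file path in debug builds, and an empty string (do nothing) otherwise.
std::string maybe_dump_path(const std::string& prefix) {
#ifdef NVTE_DEBUG_DUMP
  return prefix + "_test.txt";   // debug builds: actually emit files
#else
  (void)prefix;                  // silence unused-parameter warnings
  return std::string();          // CI/release builds: no file I/O
#endif
}
```

Callers can then gate the whole dump on the returned path being non-empty, which also avoids failures in read-only CI environments.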
