fix(linear-summarization): Remove interaction term check to enable features with colons#183
fix(linear-summarization): Remove interaction term check to enable features with colons#183tonywu1999 wants to merge 1 commit intodevelfrom
Conversation
…atures with colons
📝 WalkthroughWalkthroughRemoved the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Failed to generate code suggestions for PR |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/linear_summary.cpp (1)
29-40:⚠️ Potential issue | 🟠 MajorFix is correct for colon-named features, but may double-count genuine FEATURE:RUN interaction terms
The removal of
find_colon-based exclusion correctly resolves the bug where feature coefficients whose names contain a colon (e.g.,"FEATUREK.ABCDE:FGHIJ.R") were incorrectly filtered out.However,
find_features = grep("FEATURE", coef_names)can also match genuine FEATURE:RUN interaction-term coefficients (e.g.,"FEATUREpeptide1:RUNrun1"), since those names contain the substring"FEATURE". Such terms are simultaneously present infind_runsas well and are already handled byget_feature_run()viaintersect(find_runs, find_features). Including them here too causes double/triple-counting in the combined contrast vector.Additionally, if interaction terms inflate
temp_feature.length()beyondcounts.ncol() - 1(the length returned byget_feature_props()), thefeature.attr("names") = temp_featureassignment becomes a length-mismatch that Rcpp will propagate as an error or silent corruption.A safer approach is to compute pure (non-interaction) feature/run index sets in
make_contrast_run_quantbefore calling these functions, soget_feature_run()remains the sole owner of interaction terms:💡 Proposed fix in
make_contrast_run_quantNumericVector find_features = grep("FEATURE", coef_names); NumericVector find_runs = grep("RUN", coef_names); NumericVector find_ref = grep("ref", coef_names); + // Terms matching BOTH patterns are FEATURE:RUN interaction terms; + // keep them exclusively in get_feature_run() to avoid double-counting. + NumericVector find_interaction = intersect(find_features, find_runs); + NumericVector find_pure_features = setdiff(find_features, find_interaction); + NumericVector find_pure_runs = setdiff(find_runs, find_interaction); NumericVector intercept = get_intercept(coef_names); - NumericVector features = get_features(coef_names, find_features, + NumericVector features = get_features(coef_names, find_pure_features, contrast_matrix, counts, input); - NumericVector runs = get_run(coef_names, find_runs, + NumericVector runs = get_run(coef_names, find_pure_runs, contrast_matrix, is_labeled, n_runs);
get_feature_run()already receives the originalfind_runs/find_featuresand usesintersect()internally, so interaction terms continue to be handled there without any change to its call site.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/linear_summary.cpp` around lines 29 - 40, get_features currently uses find_features derived via grep("FEATURE", coef_names) which also matches FEATURE:RUN interaction terms, causing double-counting and potential name-length mismatches; fix by computing pure (non-interaction) feature and run index sets in make_contrast_run_quant (i.e., remove any coef_names containing ':' from the feature set before passing to get_features) so get_feature_run remains the sole handler of interactions, then call get_features with that filtered find_features (or alternatively have get_features filter out entries with ':' itself) and ensure the returned temp_feature length matches counts.ncol()-1 before assigning names to avoid Rcpp length-mismatch errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/linear_summary.cpp`:
- Around line 29-40: get_features currently uses find_features derived via
grep("FEATURE", coef_names) which also matches FEATURE:RUN interaction terms,
causing double-counting and potential name-length mismatches; fix by computing
pure (non-interaction) feature and run index sets in make_contrast_run_quant
(i.e., remove any coef_names containing ':' from the feature set before passing
to get_features) so get_feature_run remains the sole handler of interactions,
then call get_features with that filtered find_features (or alternatively have
get_features filter out entries with ':' itself) and ensure the returned
temp_feature length matches counts.ncol()-1 before assigning names to avoid Rcpp
length-mismatch errors.
Motivation and Context
There looked to be problems w.r.t. features containing colons in them. Notably, here:
I believe this was written to find interaction terms (e.g., RUN:FEATURE), and then setdiff/intersect against find_runs and find_features is used to distinguish "just runs", "just features", and "run:feature interactions".
But if a Run ID or Feature ID contains a colon (like UniMod:7), those indices end up in find_colon, and then get incorrectly excluded from just_features or just_runs via setdiff. So legitimate main-effect terms get treated as if they were interaction terms and dropped.
Looking at the model formulas, there are no interaction terms anywhere in this code. All models use only main effects, so the simplest fix is to just remove find_colon entirely and clean up the functions that use it:
Changes
Please provide a detailed bullet point list of your changes.
Testing
Checklist Before Requesting a Review