Extending cuik-molmaker to reactions (cuik-reactmaker)#4
Conversation
Implements C++ CGR (Condensed Graph of Reaction) featurization to accelerate chemprop reaction property prediction (~9x over Python path). New C++ code: - src/reaction_features.cpp: batch_reaction_featurizer supporting all 6 RxnModes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) and all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR). Uses O(bonds) hash-map bond enumeration instead of O(n^2) atom-pair scan. - src/features.h: ReactionMode enum, reaction_mode_names_to_array, and parse_reaction declarations; get_atomic_num_onehot_index helper for num_only representation of unmatched atoms - src/one_hot.cpp/h: get_atomic_num_onehot_index for num_only encoding - src/cuik_molmaker_cpp.cpp: exports reaction_mode_names_to_array and batch_reaction_featurizer to Python - CMakeLists.txt: add reaction_features.cpp to cuik_molmaker_core sources Test fixtures: - tests/data/sample_rxns_100.csv: 100 balanced reactions (50 E2 + 50 SN2) plus 10 hand-crafted unbalanced reactions covering num_only and BALANCE mode edge cases. Verified against chemprop CondensedGraphOfReactionFeaturizer across all 6 modes with max_diff=0.
test_reaction_features.py: parametrized over all 4 atom featurizer versions (V1, V2, ORGANIC, RIGR) × all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) = 24 test cases. RIGR uses reduced bond features (["is-null", "in-ring"], bond_fdim=2 per side = 4 total) unlike V1/V2/ORGANIC which use 5 bond features (bond_fdim=14 per side = 28 total). Golden .xz files generated from C++ batch_reaction_featurizer output after verifying agreement with chemprop CondensedGraphOfReactionFeaturizer (max_diff=0 on E2/SN2 data across all 6 modes).
|
Hi @sveccham , just a friendly ping on the PR whenever you have a chance. I'd appreciate any feedback/comments before it is ready to merge. |
|
Hi @akshatzalte, Thank you for your patience. I will have a look at this PR in the next few days. We are currently working on releasing |
sveccham
left a comment
There was a problem hiding this comment.
Hi @akshatzalte,
Thanks again for the PR. Very neat idea using a hashmap to reduce the problem complexity. This is also a very good use case for moving to C++.
I had a high level look at the PR and left a couple of initial comments. I had one question about the overall design. The CGR bond lookup is constructed once for each reaction in a batch. Would it be more computationally efficient to construct it for the entire batch (with appropriate atom and bond offsets) at once and exploit the O(1) lookup?
| offset_carbon, | ||
| ReactionMode(mode_int)); | ||
| }, | ||
| "Featurize a batch of reactions (CGR) and return 5 NumPy arrays."); |
There was a problem hiding this comment.
List what the 5 numpy arrays are briefly.
There was a problem hiding this comment.
What is the source of these sample reactions? We should make sure they are references appropriately and comply with license (if any)
Add
batch_reaction_featurizerfor CGR reaction featurizationWhat this PR adds
A new
batch_reaction_featurizerfunction — the reaction analogue ofbatch_mol_featurizer— using the Condensed Graph of Reaction (CGR) representation (same as Chemprop'sCondensedGraphOfReactionFeaturizer). API is consistent with the existing package:Supported: all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR) and all 6 reaction modes (
REAC_DIFF,REAC_PROD,PROD_DIFF, and their_BALANCEvariants).keep_h/add_hsemantics match Chemprop'smake_molexactly.Implementation notes
All new code is in
src/reaction_features.cpp(701 lines); additions tofeatures.h,one_hot.cpp, andcuik_molmaker_cpp.cpp. No existing code was modified.Two design choices worth noting:
parse_rxn_side_moldoes not clear atom-map numbers. The existingparse_molstrips them for reordering purposes; reaction featurization needs them for reactant↔product correspondence. Fully additive — molecule featurization is unchanged.unordered_map<uint64_t, size_t>keyed by(min_idx << 32) | max_idx), replacing the Python CGR featurizer's O(n²) atom-pair scan.Node ordering matches Chemprop exactly: reactant atoms 0..n_reac−1, then product-only atoms n_reac..n_cgr−1.
Correctness
Verified against Chemprop's Python
CondensedGraphOfReactionFeaturizeron:[H-:2]lone-hydride nucleophiles)num_onlyvs BALANCE divergence)A test fixture CSV (
tests/data/sample_rxns_100.csv, 110 reactions) is included. Golden.xzreference files can be committed once you've had a look at the design.Speedup benchmarks present here