Add profiling instrumentation for NAM building blocks#219
Add profiling instrumentation for NAM building blocks#219sdatkinson merged 7 commits intosdatkinson:main-profilingfrom
Conversation
Adds a profiling framework (NAM/profiling.h, NAM/profiling.cpp) with NAM_PROFILE_START()/NAM_PROFILE_ADD() macros and 14 timing categories. Supports both desktop (std::chrono) and ARM Cortex-M7 (DWT cycle counter) backends. Profiling is compile-time gated via -DNAM_PROFILING. Instruments wavenet _Layer::Process() and _LayerArray::ProcessInner() with per-category timing, and adds profiling reset/print calls to the benchmodel tool. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sdatkinson
left a comment
There was a problem hiding this comment.
Some comments & Q's.
The main thing that makes me pause is adding things into the high-level logic like in the Layer process method. If there were a way to do it that's a little more "hidden", then that would be much preferred before merging this into main...
| @@ -1,4 +1,5 @@ | |||
| #include "conv1d.h" | |||
| #include "profiling.h" | |||
There was a problem hiding this comment.
Does this have to be included? Feels weird to include it if I'm doing e.g. a release build.
There was a problem hiding this comment.
Fair, it should at very least be inside an #ifdef. I'll change that.
NAM/profiling.h
Outdated
| } | ||
|
|
||
| uint32_t total() const { | ||
| return conv1d + input_mixin + layer1x1 + head1x1 + rechannel + conv1x1 + activation + film + copies + setzero + ringbuf + condition + lstm + other; |
There was a problem hiding this comment.
This feels fragile. Is there a way to "register" things or something?
There was a problem hiding this comment.
This whole implementation was targeted at the Wavenet and I didn't think of generalizing it, only covered anything that will be in A2... but you are right, it would be better if we just kept a dictionary of things we are measuring and reported all of them, just like a logging library.
NAM/profiling.h
Outdated
| } | ||
| }; | ||
|
|
||
| print_row("Conv1D", t.conv1d); |
There was a problem hiding this comment.
Yeah...this could be an iterator over the various registered ops?
|
|
||
| void nam::wavenet::_Layer::Process(const Eigen::MatrixXf& input, const Eigen::MatrixXf& condition, const int num_frames) | ||
| { | ||
| NAM_PROFILE_START(); |
There was a problem hiding this comment.
Yeah...I'm not super-hot on things like this sitting in the middle of the code. Is there a way to do this that doesn't add lines like this?
| # Also ensure assertions are enabled (NDEBUG is not defined) so tests actually run | ||
| set_target_properties(run_tests PROPERTIES COMPILE_OPTIONS "-O0") | ||
| # Benchmodel should be built with NAM_PROFILING set | ||
| target_compile_definitions(benchmodel PRIVATE NAM_PROFILING) |
There was a problem hiding this comment.
Is there a way to make this optional via a command-line flag?
I'd also be interested to see how much overhead the profiler adds.
There was a problem hiding this comment.
I'll look into how to make it a CMake flag. I briefly looked into it, thought it was more complicated than I wanted to deal with, and immediately forgot about it... especially since profiling should only be inserted into profiling tools.
| // Conv1x1 stores either a full (out_channels x in_channels) matrix (possibly | ||
| // block-diagonal when grouped), or a depthwise weight vector when groups == | ||
| // in_channels == out_channels. | ||
| static MemoryResult conv1x1_memory(int in_ch, int out_ch, bool bias, int groups, int M) |
There was a problem hiding this comment.
So this stores only the memory due to weights? Is there any overhead that should be accounted for? It's probably small but I figure I should ask while I'm here.
There was a problem hiding this comment.
No overhead for small local variables, but if anything is allocating an entire temporary buffer that wasn't accounted for we need to take that into account. Internal buffers are being accounted for (see e.g. line 59)
|
|
||
| // Conv1D stores kernel_size weight matrices (each out_ch x in_ch) or depthwise | ||
| // vectors, plus a bias vector, a ring buffer, and an output buffer. | ||
| static MemoryResult conv1d_memory(int in_ch, int out_ch, int kernel_size, bool bias, int dilation, int groups, int M) |
There was a problem hiding this comment.
M is buffer_size everywhere, thanks for catching it. I'll rename it all across the file.
| // Recursive condition_dsp | ||
| bool has_condition_dsp = false; | ||
| int condition_output_channels = condition_dim; | ||
| if (config.find("condition_dsp") != config.end()) |
There was a problem hiding this comment.
That's correct, thanks for catching it!
…lementation (sdatkinson#223) * Make activation apply method pure virtual instead of no-op default * Fix bugs * Refactor to throw std::invalid_argument in debug mode, add tests
The Timings struct hardcoded 14 named fields, requiring manual updates to reset(), total(), print_results(), and every call site whenever a category was added or removed. Replace with a flat-array registry where types are registered at file scope via register_type(), returning an integer index for O(1) accumulation in the hot path. Also adds NAM_PROFILE_RESTART() macro to replace a raw #ifdef block in wavenet.cpp.
sdatkinson
left a comment
There was a problem hiding this comment.
The code is nice as it is. The trouble with it, like I've said elsewhere, is that I feel uneasy about having this "in main".
What I'm going to do in the meantime is make a main-profiling branch and merge this into there.
It'll be a bit more work to keep the two in sync, but if this really seems to be worth the lines (or I get other feedback from folks saying that they like it) then I'll merge it up to main later.
Adds a profiling framework (NAM/profiling.h, NAM/profiling.cpp) with NAM_PROFILE_START()/NAM_PROFILE_ADD() macros and 14 timing categories. Supports both desktop (std::chrono) and ARM Cortex-M7 (DWT cycle counter) backends. Profiling is compile-time gated via -DNAM_PROFILING.
Instruments wavenet _Layer::Process() and _LayerArray::ProcessInner() with per-category timing, and adds profiling reset/print calls to the benchmodel tool.
Developed with support and sponsorship from TONE3000