H_0
RAPL metrics are stable under coherent workloads, and the metrics produced by Perf, HWPC and SmartWatts stay stable (i.e. they do not introduce instability) when this interface is used as their base input.
Open Questions
Q: Is stress-ng a correct candidate?
    stress-ng can be configured to reproduce the same load at a given report frequency (see the sketch after this list of questions).
Q: What is the current baseline/SOTA to be considered?
    RAPL
Q: Which metrics should be considered, both for MM and OM?
    MM (Measured Metric):
        Joules
    OM (Outside Metrics):
        CPU consumption
        Memory usage
        Global system-call impact
Q: What is the "most specific" zoom level to consider in sub-benchmark results?
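A minimal sketch of the controlled run the questions above assume: stress-ng is pinned to a fixed stressor type, worker count, load level and duration, and the MM (Joules) is read from the RAPL powercap interface before and after the run. The sysfs paths, flag values and helper names are illustrative assumptions, not part of a decided harness (reading powercap typically requires elevated permissions).

```python
import subprocess
import time
from pathlib import Path

# Hypothetical powercap paths for the package-0 RAPL domain; adjust to the target machine.
RAPL_ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
RAPL_MAX = Path("/sys/class/powercap/intel-rapl:0/max_energy_range_uj")

def read_energy_uj():
    """Current RAPL package energy counter, in microjoules."""
    return int(RAPL_ENERGY.read_text())

def run_fixed_stress(seconds=30, cpu_workers=4, load_percent=50):
    """Run a fixed, repeatable stress-ng workload and return (joules, elapsed seconds)."""
    before = read_energy_uj()
    start = time.time()
    # Pinning the stressor type, worker count, target load and duration is what makes
    # the workload reproducible from one run to the next.
    subprocess.run(
        ["stress-ng", "--cpu", str(cpu_workers), "--cpu-load", str(load_percent),
         "--timeout", f"{seconds}s", "--metrics-brief"],
        check=True,
    )
    elapsed = time.time() - start
    delta_uj = (read_energy_uj() - before) % int(RAPL_MAX.read_text())  # handle counter wrap-around
    return delta_uj / 1e6, elapsed

if __name__ == "__main__":
    joules, elapsed = run_fixed_stress()
    print(f"RAPL package energy: {joules:.2f} J ({joules / elapsed:.2f} W average)")
```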
To Avoid: Benchmarking Crimes
Selective Benchmarking
Not evaluating degradation in other areas: performance must significantly improve in the area of interest AND must not significantly degrade elsewhere.
    Check for a decrease of the standard deviation, variance and coefficient of variation (CV) of the Measured Metric (MM) while the Outside Metrics (OM) stay stable (see the sketch after this block).
Cherry-picking subsets of a suite without justification (and then concluding about the whole suite): justify every omitted subset, do not make arbitrary choices.
    Run the benchmark against all available types of infrastructure (with reasonable combinations).
Selective data sets hiding deficiencies: do not restrict a dataset if the removed data tells a different story.
    Present the complete benchmark results.
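A minimal sketch of the MM/OM stability check described above, using only the Python standard library; the metric names, the example values and the 5% OM threshold are illustrative assumptions, not agreed parameters.

```python
import statistics

def dispersion(samples):
    """Standard deviation, variance and coefficient of variation (CV) of one metric."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return {
        "mean": mean,
        "std": std,
        "variance": statistics.variance(samples),
        "cv": std / mean if mean else float("nan"),
    }

def om_stable(om_runs, cv_threshold=0.05):
    """Consider the Outside Metrics stable when every OM keeps a small CV.
    The 5% threshold is an illustrative assumption, not an agreed value."""
    return all(dispersion(samples)["cv"] <= cv_threshold for samples in om_runs.values())

# Placeholder data: joule readings (MM) over repeated runs plus two OMs.
mm_joules = [101.2, 100.8, 101.0, 100.9, 101.1]
om_runs = {
    "cpu_percent": [51.0, 50.6, 50.9, 51.2, 50.8],
    "memory_mib": [812.0, 813.5, 811.8, 812.4, 812.9],
}

print("MM dispersion:", dispersion(mm_joules))
print("OM stable:", om_stable(om_runs))
```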
Improper handling of benchmark results
Micro-benchmarks cannot be presented alone; they can be used as examples before presenting macro-benchmarks on real-world workloads.
    Should this benchmark conclude stability against stress-ng use cases, ensure that the conclusion still holds for a more complex, yet still controlled, use case.
"Throughput degraded by X% => overhead is X%": accompany throughput comparisons with the full CPU load AND compare I/O throughput in terms of processing time per bit (see the sketch after this block).
    If comparing two stability results for different PowerAPI versions, do so against the OM.
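A small sketch of the arithmetic behind the "processing time per bit" recommendation; the throughput figures are placeholders and the helper names are hypothetical.

```python
def processing_time_per_bit(throughput_mbit_s):
    """Seconds of processing time spent per bit, given throughput in Mbit/s."""
    return 1.0 / (throughput_mbit_s * 1e6)

def overhead_per_bit(baseline_mbit_s, instrumented_mbit_s):
    """Relative overhead expressed on processing time per bit (baseline is the denominator)."""
    base = processing_time_per_bit(baseline_mbit_s)
    inst = processing_time_per_bit(instrumented_mbit_s)
    return (inst - base) / base

# Illustrative numbers only: a 940 -> 910 Mbit/s drop is a ~3.2% throughput loss, but the
# per-bit processing time (and therefore the overhead) grows by ~3.3%; always report the
# CPU load measured alongside, since spare CPU capacity can mask the real cost.
print(f"{overhead_per_bit(940, 910):.3%}")
```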
Downplaying overheads
Claiming that going from X% to Y% is a "(Y-X)% increase/decrease" is misleading: going from 1% to 2% is a doubling, not a 1% increase (illustrated below).
    The baseline is the denominator in any relative comparison.
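A tiny worked example, with placeholder values, of the difference between an absolute change in percentage points and a relative change computed against the baseline:

```python
baseline = 1.0   # e.g. 1% overhead without instrumentation (placeholder value)
measured = 2.0   # e.g. 2% overhead with instrumentation (placeholder value)

absolute_increase = measured - baseline               # 1 percentage point
relative_increase = (measured - baseline) / baseline  # 1.0, i.e. a 100% increase (a doubling)
print(f"{absolute_increase:.1f} pp absolute, {relative_increase:.0%} relative")
```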
No indication of the significance of the data: always report variance/standard deviation and any obvious indicators (R², ...), and also consider min/max. Use Student's t-test to check significance (see the sketch after this block).
Missing specification of the evaluation platform: give as many details as possible about ALL the hardware used, to ensure reproducibility: processor architecture, number of cores, clock rate, memory sizes, size of every cache level, core type, microarchitecture, OS & version, hypervisor & version.
    Attach as much metadata as possible as supplementary resources.
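One possible way to run the significance check mentioned above, assuming SciPy is available in the analysis tooling; the samples are placeholders, and the test shown is Welch's variant of Student's t-test, which does not assume equal variances.

```python
from scipy import stats

# Placeholder joule measurements for two configurations (e.g. bare RAPL vs. RAPL + HWPC);
# the values are illustrative, not real results.
baseline_joules = [101.2, 100.8, 101.0, 100.9, 101.1, 100.7, 101.3]
sensor_joules   = [101.4, 101.0, 101.2, 101.1, 101.3, 100.9, 101.5]

# Welch's t-test on the two independent samples.
t_stat, p_value = stats.ttest_ind(baseline_joules, sensor_joules, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the two configurations is statistically significant.")
else:
    print("No statistically significant difference at the 5% level.")
```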
Presenting only top-level/aggregated results: sub-benchmark results shall also be presented to avoid loss of information.
    Open question: define the adequate grain for the "most specific" sub-benchmarks (see the open question above).
Relative numbers only: present raw values in addition to ratios so that people can sanity-check them.
Best Practice
Do several runs and check the standard deviation (expected to be < 0.1%).
    [ ] We may define 0.1% as the ideal threshold below which we can focus on something else (and then keep 0.1% as an acceptance threshold for further features).
Use a combination of successive and separate runs.
    Running the same configuration twice in a row may exhibit caching effects.
        This may be relevant for the HWPC Sensor.
Explore the data set in both directions.
    "Directions" might mean going from "more stressed" to "more relaxed" and back; this may require a long "cool off" step.
If using regular strides (2, 4, 8, 16, ...), also use "random" points to avoid pathological cases, but keep the regular strides in order to identify those pathological cases (see the sketch after this block).
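A minimal sketch of how the sweep and the run order could be generated: regular power-of-two strides plus reproducible "random" points, each point repeated back-to-back (to expose caching effects) and the sweep replayed in reverse (the "return" direction). The range, counts and seed are illustrative assumptions.

```python
import random

def sweep_points(max_value=256, n_random=5, seed=42):
    """Regular power-of-two strides plus a few random points in the same range."""
    regular = [v for v in (2, 4, 8, 16, 32, 64, 128, 256) if v <= max_value]
    rng = random.Random(seed)  # fixed seed so the "random" points stay reproducible
    extra = sorted(rng.sample(range(2, max_value + 1), n_random))
    return regular, extra

def run_order(points, repeats=2):
    """Successive runs over the sweep, each value repeated back-to-back, then the
    sweep replayed in reverse ("more stressed" to "more relaxed" and back)."""
    forward = [p for p in points for _ in range(repeats)]
    backward = list(reversed(points))
    return forward + backward

regular, extra = sweep_points()
plan = run_order(sorted(set(regular + extra)))
print("regular strides:", regular, "+ random points:", extra)
print("run order:", plan)
```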
From https://docs.google.com/document/d/1iKMhEt-780Ub3iqzNHx1hTi-nJlUZosVuoWUSwtYoAU/edit?tab=t.0#heading=h.nbhjcz20m98s