H_0
RAPL metrics are stable under coherent workloads, and the metrics produced by Perf, HWPC and SmartWatts stay stable (i.e. they do not introduce instability) when this interface is used as their base input.
Open Questions
Q: Is stress-ng a correct candidate?
    stress-ng can be configured to reproduce the same load at a given report frequency (see the sketch after this list of questions).
Q: What is the current baseline/SOTA to be considered?
    RAPL
Q: Which metrics should be considered, both for MM and OM?
    MM (Measured Metric):
        Joules
    OM (Outside Metrics):
        CPU consumption
        Memory usage
        Global system-call impact
Q: What is the "most specific" zoom level to consider in sub-benchmark results?
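A minimal sketch of the controlled run the questions above assume: stress-ng is pinned to a fixed stressor type, worker count, load level and duration, and the MM (Joules) is read from the RAPL powercap interface before and after the run. The sysfs paths, flag values and helper names are illustrative assumptions, not part of a decided harness (reading powercap typically requires elevated permissions).

```python
import subprocess
import time
from pathlib import Path

# Hypothetical powercap paths for the package-0 RAPL domain; adjust to the target machine.
RAPL_ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
RAPL_MAX = Path("/sys/class/powercap/intel-rapl:0/max_energy_range_uj")

def read_energy_uj():
    """Current RAPL package energy counter, in microjoules."""
    return int(RAPL_ENERGY.read_text())

def run_fixed_stress(seconds=30, cpu_workers=4, load_percent=50):
    """Run a fixed, repeatable stress-ng workload and return (joules, elapsed seconds)."""
    before = read_energy_uj()
    start = time.time()
    # Pinning the stressor type, worker count, target load and duration is what makes
    # the workload reproducible from one run to the next.
    subprocess.run(
        ["stress-ng", "--cpu", str(cpu_workers), "--cpu-load", str(load_percent),
         "--timeout", f"{seconds}s", "--metrics-brief"],
        check=True,
    )
    elapsed = time.time() - start
    delta_uj = (read_energy_uj() - before) % int(RAPL_MAX.read_text())  # handle counter wrap-around
    return delta_uj / 1e6, elapsed

if __name__ == "__main__":
    joules, elapsed = run_fixed_stress()
    print(f"RAPL package energy: {joules:.2f} J ({joules / elapsed:.2f} W average)")
```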
To Avoid: Benchmarking Crimes
Selective Benchmarking
Not evaluating degradation in other areas: performance must significantly improve in the area of interest AND must not significantly degrade elsewhere.
    Check for a decrease of the standard deviation, variance and coefficient of variation (CV) of the Measured Metric (MM) while the Outside Metrics (OM) stay stable (see the sketch after this block).
Cherry-picking subsets of a suite without justification (and then concluding about the whole suite): justify every omitted subset, do not make arbitrary choices.
    Run the benchmark against all available types of infrastructure (with reasonable combinations).
Selective data sets hiding deficiencies: do not restrict a dataset if the removed data tells a different story.
    Present the complete benchmark results.
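A minimal sketch of the MM/OM stability check described above, using only the Python standard library; the metric names, the example values and the 5% OM threshold are illustrative assumptions, not agreed parameters.

```python
import statistics

def dispersion(samples):
    """Standard deviation, variance and coefficient of variation (CV) of one metric."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return {
        "mean": mean,
        "std": std,
        "variance": statistics.variance(samples),
        "cv": std / mean if mean else float("nan"),
    }

def om_stable(om_runs, cv_threshold=0.05):
    """Consider the Outside Metrics stable when every OM keeps a small CV.
    The 5% threshold is an illustrative assumption, not an agreed value."""
    return all(dispersion(samples)["cv"] <= cv_threshold for samples in om_runs.values())

# Placeholder data: joule readings (MM) over repeated runs plus two OMs.
mm_joules = [101.2, 100.8, 101.0, 100.9, 101.1]
om_runs = {
    "cpu_percent": [51.0, 50.6, 50.9, 51.2, 50.8],
    "memory_mib": [812.0, 813.5, 811.8, 812.4, 812.9],
}

print("MM dispersion:", dispersion(mm_joules))
print("OM stable:", om_stable(om_runs))
```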
Improper handling of benchmark results
Micro-benchmarks cannot be presented alone; they can be used as examples before presenting macro-benchmarks on real-world workloads.
    Should this benchmark conclude stability against stress-ng use cases, ensure that the conclusion still holds for a more complex, yet still controlled, use case.
"Throughput degraded by X% => overhead is X%": accompany throughput comparisons with the full CPU load AND compare I/O throughput in terms of processing time per bit (see the sketch after this block).
    If comparing two stability results for different PowerAPI versions, do so against the OM.
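A small sketch of the arithmetic behind the "processing time per bit" recommendation; the throughput figures are placeholders and the helper names are hypothetical.

```python
def processing_time_per_bit(throughput_mbit_s):
    """Seconds of processing time spent per bit, given throughput in Mbit/s."""
    return 1.0 / (throughput_mbit_s * 1e6)

def overhead_per_bit(baseline_mbit_s, instrumented_mbit_s):
    """Relative overhead expressed on processing time per bit (baseline is the denominator)."""
    base = processing_time_per_bit(baseline_mbit_s)
    inst = processing_time_per_bit(instrumented_mbit_s)
    return (inst - base) / base

# Illustrative numbers only: a 940 -> 910 Mbit/s drop is a ~3.2% throughput loss, but the
# per-bit processing time (and therefore the overhead) grows by ~3.3%; always report the
# CPU load measured alongside, since spare CPU capacity can mask the real cost.
print(f"{overhead_per_bit(940, 910):.3%}")
```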
Downplaying overheads
Claiming that going from X% to Y% is a "(Y-X)% increase/decrease" is misleading: going from 1% to 2% is a doubling, not a 1% increase (illustrated below).
    The baseline is the denominator in any relative comparison.
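A tiny worked example, with placeholder values, of the difference between an absolute change in percentage points and a relative change computed against the baseline:

```python
baseline = 1.0   # e.g. 1% overhead without instrumentation (placeholder value)
measured = 2.0   # e.g. 2% overhead with instrumentation (placeholder value)

absolute_increase = measured - baseline               # 1 percentage point
relative_increase = (measured - baseline) / baseline  # 1.0, i.e. a 100% increase (a doubling)
print(f"{absolute_increase:.1f} pp absolute, {relative_increase:.0%} relative")
```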
No indication of the significance of the data: always report variance/standard deviation and any obvious indicators (R², ...), and also consider min/max. Use Student's t-test to check significance (see the sketch after this block).
Missing specification of the evaluation platform: give as many details as possible about ALL the hardware used, to ensure reproducibility: processor architecture, number of cores, clock rate, memory sizes, size of every cache level, core type, microarchitecture, OS & version, hypervisor & version.
    Attach as much metadata as possible as supplementary resources.
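One possible way to run the significance check mentioned above, assuming SciPy is available in the analysis tooling; the samples are placeholders, and the test shown is Welch's variant of Student's t-test, which does not assume equal variances.

```python
from scipy import stats

# Placeholder joule measurements for two configurations (e.g. bare RAPL vs. RAPL + HWPC);
# the values are illustrative, not real results.
baseline_joules = [101.2, 100.8, 101.0, 100.9, 101.1, 100.7, 101.3]
sensor_joules   = [101.4, 101.0, 101.2, 101.1, 101.3, 100.9, 101.5]

# Welch's t-test on the two independent samples.
t_stat, p_value = stats.ttest_ind(baseline_joules, sensor_joules, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between the two configurations is statistically significant.")
else:
    print("No statistically significant difference at the 5% level.")
```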
Presenting only top-level/aggregated results: sub-benchmark results shall also be presented to avoid loss of information.
    Open question: define the adequate grain for the "most specific" sub-benchmarks (see the open question above).
Relative numbers only: present raw values in addition to ratios so that people can sanity-check them.
Best Practice
Do several runs and check the standard deviation (expected to be < 0.1%).
    [ ] We may define 0.1% as the ideal threshold below which we can focus on something else (and then keep 0.1% as an acceptance threshold for further features).
Use a combination of successive and separate runs.
    Running the same configuration twice in a row may exhibit caching effects.
        This may be relevant for the HWPC Sensor.
Explore the data set in both directions.
    "Directions" might mean going from "more stressed" to "more relaxed" and back; this may require a long "cool off" step.
If using regular strides (2, 4, 8, 16, ...), also use "random" points to avoid pathological cases, but keep the regular strides in order to identify those pathological cases (see the sketch after this block).
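A minimal sketch of how the sweep and the run order could be generated: regular power-of-two strides plus reproducible "random" points, each point repeated back-to-back (to expose caching effects) and the sweep replayed in reverse (the "return" direction). The range, counts and seed are illustrative assumptions.

```python
import random

def sweep_points(max_value=256, n_random=5, seed=42):
    """Regular power-of-two strides plus a few random points in the same range."""
    regular = [v for v in (2, 4, 8, 16, 32, 64, 128, 256) if v <= max_value]
    rng = random.Random(seed)  # fixed seed so the "random" points stay reproducible
    extra = sorted(rng.sample(range(2, max_value + 1), n_random))
    return regular, extra

def run_order(points, repeats=2):
    """Successive runs over the sweep, each value repeated back-to-back, then the
    sweep replayed in reverse ("more stressed" to "more relaxed" and back)."""
    forward = [p for p in points for _ in range(repeats)]
    backward = list(reversed(points))
    return forward + backward

regular, extra = sweep_points()
plan = run_order(sorted(set(regular + extra)))
print("regular strides:", regular, "+ random points:", extra)
print("run order:", plan)
```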
From https://docs.google.com/document/d/1iKMhEt-780Ub3iqzNHx1hTi-nJlUZosVuoWUSwtYoAU/edit?tab=t.0#heading=h.nbhjcz20m98s