
csvstats: calculated number of bins is too high for the data/random-data.csv dataset #31

@Notgnoshi

Description

```console
# This generates a NaN as the first delta
$ csvdelta -i -c timestamp data/random-data.csv
$ cargo run --bin csvstats -- -c timestamp-deltas data/random-data.csv -H
Stats for column "timestamp-deltas":
    count: 1173
    filtered: 1 (total: 1174)
    Q1: 0.0010528564453125
    median: 0.0010530948638916016
    Q3: 0.0010530948638916016
    min: 0.0010089874267578125 at index: 0
    max: 0.0010700225830078125 at index: 1172
    mean: 0.0010526164006902329
    stddev: 0.000029847184932617655

2025-03-10T00:00:38.348237Z  INFO csvizmo::plot: Using 1350 bins with width 0.0000
```

1350 is more bins than there are samples. Either I have a bug in my Freedman-Diaconis rule calculation, or the rule doesn't give the kind of results I want for this data.

A dataset like this isn't close to normal, so I suspect the KDE estimate is off, and a histogram isn't the most useful way to visualize this data.
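On the KDE side: I don't know which bandwidth selector csvizmo uses, but the common rule-of-thumb one (Silverman's) bakes in a normality assumption, which is one concrete way the estimate can go wrong on clustered data like this. A sketch with the values from the csvstats output above (`silverman_bandwidth` is a hypothetical helper):

```rust
/// Silverman's rule-of-thumb bandwidth for a Gaussian KDE. It is derived
/// under a normality assumption, so it can pick a poor bandwidth for
/// tightly clustered samples with outliers.
fn silverman_bandwidth(stddev: f64, iqr: f64, n: usize) -> f64 {
    0.9 * stddev.min(iqr / 1.34) * (n as f64).powf(-0.2)
}

fn main() {
    // Q3 - Q1 and stddev from the csvstats output above.
    let iqr = 0.0010530948638916016 - 0.0010528564453125;
    let h = silverman_bandwidth(0.000029847184932617655, iqr, 1173);
    // The tiny IQR dominates the min(), yielding a very narrow bandwidth
    // and hence a spiky density estimate.
    println!("bandwidth: {h:e}");
}
```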

[Image: the generated plot for this dataset]

Metadata


Labels: bug (Something isn't working), csv (Tools to work with CSVs)
