Statistical SQL functions for DuckDB, backed by the C++17 statcpp library, with an optional Lua layer for authoring new functions without recompiling C++.
Including a couple of headers and calling one registration function gives a DuckDB connection 200+ statistical SQL functions — descriptive statistics, order statistics, robust estimators, correlation/covariance, weighted statistics, error/distance metrics, window/rolling statistics, time-series transforms, probability distributions (pdf/cdf/quantile/rand), special functions, effect sizes and power analysis — all computed by statcpp. The same data path is also reachable through Lua, so new SQL functions can be composed in a script and take effect on the next run with no rebuild.
Function names mirror those of the sibling project sqlite3-stats; this PoC covers the same set except JSON-returning functions (hypothesis-test result objects, frequency tables, survival curves, CI/regression objects), which are out of scope here.
📄 Want to see it work first? A full captured run of the demo — every category's SQL and result, plus the
203 functions, 203 okcoverage check — is checked in as PoC-Output.txt.
Released under the PoC Evaluation License v1.0 — see LICENSE.md (view / compile / evaluate only; no redistribution or production use). DuckDB, statcpp and Lua are third-party dependencies under their own licenses — see License.
Two header-only layers you register on a duckdb::Connection:
#include "statcpp_udf.hpp" // LIST-based UDFs: aggregates, two-column, window
#include "statcpp_scalar.hpp" // scalar UDFs: distributions, special functions, effect size
#include "lua_udf.hpp" // optional Lua path: author new functions in Lua
duckdb::DuckDB db(nullptr);
duckdb::Connection con(db);
statcpp_duckdb::RegisterStatcppFunctions(con); // stat_*
lua_duckdb::RegisterLuaStatcppFunctions(con, "src/lua/stats.lua"); // lua_stat_*Functions operate on a LIST<DOUBLE>, so a column is aggregated with the
built-in list() aggregate first:
-- Descriptive + robust statistics per group
SELECT grp,
stat_mean(list(v)) AS mean,
stat_median(list(v)) AS median,
stat_stdev(list(v)) AS stdev,
stat_hodges_lehmann(list(v)) AS robust_location,
stat_mad(list(v)) AS robust_scale,
stat_percentile(list(v), 0.9) AS p90
FROM measurements
GROUP BY grp;
-- Correlation between two columns
SELECT stat_pearson_r(list(x), list(y)) AS r FROM paired;
-- Rolling mean (window) over an ordered list, expanded back to rows
SELECT unnest(stat_rolling_mean(list(v ORDER BY id), 3)) AS ma3 FROM measurements;
-- Scalar: 97.5th percentile of the standard normal (≈ 1.96)
SELECT stat_normal_quantile(0.975, 0, 1) AS z_975;Statistics reference. This repository only exposes statcpp to SQL. For the definition, assumptions, parameters and algorithm of each statistic, see the statcpp documentation. Names below mirror statcpp with a
stat_prefix.
200+ functions in two families: LIST-based (a column is first aggregated
with list()) and scalar (plain DOUBLE arguments). Missing values
(SQL NULL / NaN) are dropped before single- and two-sample statistics
(pairwise for two-sample); on invalid input the function returns SQL NULL.
Names mirror sqlite3-stats.
| Category | Functions |
|---|---|
| Central tendency | stat_mean stat_median stat_mode stat_geometric_mean stat_harmonic_mean |
| Dispersion | stat_range stat_var stat_population_variance stat_sample_variance stat_stdev stat_population_stddev stat_sample_stddev stat_cv stat_iqr stat_mad_mean stat_geometric_stddev |
| Shape | stat_skewness stat_population_skewness stat_kurtosis stat_population_kurtosis |
| Estimation / robust | stat_se stat_mad stat_mad_scaled stat_hodges_lehmann |
| Extras (no sqlite3 equiv.) | stat_sum stat_count stat_minimum stat_maximum |
stat_trimmed_mean stat_percentile stat_moe_mean stat_cohens_d
stat_hedges_g stat_acf_lag stat_biweight_midvar
| Category | Functions |
|---|---|
| Correlation / covariance | stat_covariance stat_population_covariance stat_pearson_r stat_spearman_r stat_kendall_tau stat_weighted_covariance |
| Weighted statistics | stat_weighted_mean stat_weighted_harmonic_mean stat_weighted_variance stat_weighted_stddev stat_weighted_median stat_weighted_percentile |
| Regression (scalar) | stat_r_squared stat_adjusted_r_squared |
| Error metrics | stat_mae stat_mse stat_rmse stat_mape |
| Distance metrics | stat_euclidean_dist stat_manhattan_dist stat_cosine_sim stat_cosine_dist stat_minkowski_dist stat_chebyshev_dist |
| Two-sample effect sizes | stat_cohens_d2 stat_hedges_g2 stat_glass_delta |
| Category | Functions |
|---|---|
| Rolling / moving | stat_rolling_mean stat_rolling_std stat_rolling_min stat_rolling_max stat_rolling_sum stat_moving_avg stat_ema |
| Rank / fill | stat_rank stat_fillna_mean stat_fillna_median stat_fillna_ffill stat_fillna_bfill stat_fillna_interp |
| Encoding / binning | stat_label_encode stat_bin_width stat_bin_freq |
| Time series | stat_lag stat_diff stat_seasonal_diff |
| Outliers / robust | stat_outliers_iqr stat_outliers_zscore stat_outliers_mzscore stat_winsorize |
| Extras | stat_impute_mean stat_log_transform stat_sqrt_transform stat_missing_rate (→ scalar) |
Wrap a window/transform result in unnest() to expand it back into rows; keep
input order with list(col ORDER BY key).
Each distribution provides pdf/pmf, cdf, quantile and rand:
stat_normal_* stat_lognormal_* stat_uniform_* stat_exponential_*
stat_gamma_* stat_beta_* stat_weibull_* stat_chisq_* stat_t_*
stat_f_* stat_binomial_* stat_poisson_* stat_geometric_*
stat_nbinom_* stat_hypergeom_* stat_bernoulli_* stat_duniform_*
| Category | Functions |
|---|---|
| Special functions | stat_lgamma stat_tgamma stat_beta_func stat_lbeta stat_erf stat_erfc stat_betainc stat_betaincinv stat_norm_cdf stat_norm_quantile stat_gammainc_lower stat_gammainc_upper stat_gammainc_lower_inv |
| Combinatorics | stat_binomial_coef stat_log_binomial_coef stat_log_factorial |
| Tests / corrections | stat_bonferroni stat_bh_correction stat_holm_correction stat_nnt stat_aic stat_aicc stat_bic stat_boxcox stat_logarithmic_mean |
| Effect size | stat_hedges_j stat_t_to_r stat_d_to_r stat_r_to_d stat_eta_squared_ef stat_partial_eta_sq stat_omega_squared_ef stat_cohens_h |
Interpretation (→ VARCHAR) |
stat_interpret_d stat_interpret_r stat_interpret_eta2 |
| Power / sample size | stat_power_t1 stat_n_t1 stat_power_t2 stat_n_t2 stat_power_prop stat_n_prop stat_moe_prop stat_moe_prop_worst stat_n_moe_prop stat_n_moe_mean |
This PoC exposes every sqlite3-stats function that fits a DOUBLE / LIST /
VARCHAR-label signature. JSON-returning functions are out of scope —
hypothesis-test result objects (stat_t_test, stat_anova1, …), frequency
tables, contingency tables, survival curves (Kaplan–Meier / Nelson–Aalen),
confidence-interval and regression objects, and bootstrap distributions. Call
statcpp directly in C++ for those.
Scalar arities are fixed. Every scalar function takes all of its arguments explicitly (e.g.
stat_normal_pdf(x, mu, sigma)— no defaultedmu/sigma). Integer-valued arguments are passed asDOUBLEand cast internally.Random functions (
*_rand) are registered as ordinary (deterministic) UDFs because the DuckDB C++ UDF API exposes no volatility flag; calls with only constant arguments may be constant-folded to a single value per query.
Function names and numerical results mirror sqlite3-stats. The differences are in which functions exist and how they are invoked, not in the math:
| Aspect | sqlite3-stats | this PoC (duckdb) |
|---|---|---|
| JSON-returning functions | included (test/CI/regression/survival/frequency objects as JSON) | omitted — out of scope (see Scope) |
| Window / rolling | true SQL window functions: stat_rolling_mean(v, 3) OVER (ORDER BY t) |
(LIST) → LIST transforms: unnest(stat_rolling_mean(list(v ORDER BY t), 3)) |
| Aggregates | native aggregates: stat_mean(v) |
applied to a list(): stat_mean(list(v)) |
| Scalar argument defaults | optional args with defaults (e.g. stat_normal_pdf(x) ⇒ mu=0, sigma=1) |
fixed arity, all args required |
*_rand volatility |
registered non-deterministic | registered deterministic (DuckDB C++ UDF API limitation) |
| Renamed for clarity? | — | names are identical to sqlite3-stats (stat_stdev, stat_var, stat_pearson_r, stat_fillna_interp, …) |
The following statistical conventions are inherited unchanged from sqlite3-stats / statcpp, and are called out here only because they can surprise:
stat_stdev/stat_varuseddof = 0(population). They are equivalent tostat_population_stddev/stat_population_variance. For the sample estimators (ddof = 1) usestat_sample_stddev/stat_sample_variance. Example: for{9, 10, 10.5, 11, 1000},stat_stdev = 395.95(population), whilestat_sample_stddev = 442.69.- Rolling/window functions are leading-aligned.
stat_rolling_mean(list(v), w)places each window result at the first position of its window, so the lastw − 1rows areNULL(not the firstw − 1). The values themselves are the means/sums of consecutivew-element runs.
This is a PoC, so these choices are not set in stone. Any of the differences above can be revisited on request — e.g. trailing-aligned rolling,
ddof = 1defaults forstat_stdev/stat_var, defaulted scalar arguments, true window functions, or exposing the JSON-returning functions. Open an issue or ask.
The same statistics are reachable through Lua-backed UDFs (prefix lua_stat_),
which call statcpp through a thin C binding. This path exists so that new SQL
functions can be authored in Lua, by composing statcpp primitives and encoding
policy, with no C++ recompilation — useful for developers who do not work in
C/C++ or for fast iteration.
-- src/lua/stats.lua — a NEW metric, composed in Lua (not a single statcpp call):
function lua_robust_cv(data)
local scale = statcpp.mad(data)
local loc = statcpp.hodges_lehmann(data)
if scale == nil or loc == nil or loc == 0 then return nil end
return (1.4826 * scale) / loc -- mad_scaled / Hodges-Lehmann
end
-- ...and a POLICY (business rule) that branches on the data:
function lua_smart_impute(data)
local rate = statcpp.missing_rate(data)
if rate == nil or rate > 0.5 then return nil end -- refuse if too sparse
if rate == 0 then return data end -- nothing to do
return statcpp.impute_mean(data) -- else mean-impute
endEach global lua_<name> in stats.lua is registered as the SQL function
lua_stat_<name>. Editing the script and re-running takes effect immediately
(the path is resolved at run time via LUA_STATS_SCRIPT_PATH), as long as the
statcpp primitives used are already exposed by the binding. Exposing a new
statcpp primitive to Lua is the one step that still needs a one-time C++ edit in
lua_statcpp_bindings.hpp.
The bundled stats.lua ships three Lua-only functions with no C++ counterpart,
as worked examples of the pattern:
| Lua-backed UDF | Returns | Pattern shown |
|---|---|---|
lua_stat_robust_cv |
scalar | Composition: mad_scaled / Hodges-Lehmann |
lua_stat_smart_impute |
list | Policy: refuse imputation if too sparse |
lua_stat_summary |
string | Free-form report: standard reps + custom fields |
The demo in main.cpp also asserts that the C++ and Lua paths
produce identical numbers (a MATCH column), so the Lua layer can be trusted as
a faithful front-end to statcpp.
SQL query
│ SELECT stat_median(list(v)) / SELECT lua_stat_robust_cv(list(v))
▼
DuckDB UDF
│ C++ path: statcpp_udf.hpp — table-driven {SQL name, lambda} registration
│ Lua path: lua_udf.hpp — marshals LIST<DOUBLE> <-> Lua table, lua_pcall
▼
(Lua path only) src/lua/stats.lua <-- authored by the extension developer
│ function lua_<name>(data) ... statcpp.<fn>(data) ...
▼
(Lua path only) lua_statcpp_bindings.hpp C binding: require("statcpp")
▼
statcpp (C++17, header-only) the actual statistical computation
| File | Role |
|---|---|
| statcpp_compute.hpp | DB-agnostic pure compute functions (compute::*) ported from statcpp. |
| statcpp_udf.hpp | LIST-based UDFs: registration helpers + aggregate / two-column / window tables. |
| statcpp_scalar.hpp | Scalar (DOUBLE…) UDFs: distributions, special functions, effect size; plus the RegisterStatcppFunctions umbrella. |
| lua_udf.hpp | Bridges DuckDB and Lua; registers each Lua function as a vectorized UDF. |
| lua_statcpp_bindings.hpp | One-time C binding exposing statcpp to Lua as require("statcpp"). |
| src/lua/stats.lua | The file an extension developer edits — pure Lua. |
| main.cpp | Demo driver: registers both paths and walks each category. |
- DuckDB
LIST<DOUBLE>↔ a 1-indexed Lua table (Lua path) /std::vector<double>. - SQL
NULL↔ statcppNaN↔ Luanil(missing value), converted at each boundary. - A
NaNelement in a transform's output becomes SQLNULL. - Errors raised by statcpp or Lua are caught at the boundary and surfaced as SQL
NULL.
Requirements: CMake 3.20+, a C++17 compiler (AppleClang), Ninja (used to build DuckDB). DuckDB, statcpp and Lua (5.5.0) are all fetched from source on the first configure — no system-installed Lua is required.
# Configure (first run downloads & builds DuckDB; this can take a while)
cmake -S . -B build
# Build
cmake --build build
# Run the demo
./build/a.outIf the dynamically linked DuckDB library is not found at run time, prefix the command with its path:
DYLD_LIBRARY_PATH=download/DuckDB/DuckDB_C_CPP-install/lib ./build/a.outA full captured run of the demo is checked in as
PoC-Output.txt for reference — each category's SQL and its
result, ending with the 203 functions, 203 ok, 0 failed coverage check.
duckdb-lua-statcpp/
├── CMakeLists.txt
├── cmake/
│ ├── DuckDB_C_CPP.cmake # Downloads, builds and links DuckDB (v1.4.4)
│ ├── statcpp.cmake # Downloads statcpp headers (v0.2.0)
│ └── lua.cmake # Downloads & builds Lua from source (v5.5.0)
├── src/
│ ├── main.cpp # Demo: registers both paths, walks each category
│ ├── lua/
│ │ └── stats.lua # Lua extension functions (edited by the developer)
│ └── include/
│ ├── statcpp_compute.hpp # Pure compute functions (DB-agnostic)
│ ├── statcpp_udf.hpp # LIST-based UDFs (aggregate/two-col/window)
│ ├── statcpp_scalar.hpp # Scalar UDFs (distributions/effect size) + umbrella
│ ├── lua_statcpp_bindings.hpp # C binding: statcpp -> Lua module
│ └── lua_udf.hpp # Registers Lua functions as DuckDB UDFs
└── download/ # Cached DuckDB / statcpp / Lua sources (git-ignored)
- Sorting. Single-sample scalar functions sort the cleaned sample before the
call, because several statcpp routines (
median,iqr,percentile,trimmed_mean) require a sorted range. Sorting is harmless for order-independent statistics. Two-sample functions keep positional pairing; transforms keep order. stat_madis unscaled. It returns the raw Median Absolute Deviation; the normal-consistent variant isstat_mad_scaled(× 1.4826).- Lua threading. A single
lua_Stateis shared across the Lua-backed UDFs and is not thread-safe. For concurrent query execution use per-thread states or a mutex. The C++ path has no such restriction. - Per-row overhead. The Lua path performs a
lua_pcallplus table marshalling per group/row, so it is heavier than the C++ path; preferstat_*for hot paths andlua_stat_*for functions you want to author/iterate in Lua.
This repository is released under the PoC Evaluation License v1.0 — see LICENSE.md for the full terms. In short, you may view, compile and internally evaluate the software, but may not redistribute or publish the source, use it in production, or claim ownership; it is provided "as is" without warranty.
Third-party dependencies are not covered by that license and keep their own terms — refer to each project for the authoritative text:
DuckDB, statcpp and Lua are all fetched at build time (cached under
download/, git-ignored) and are not redistributed as part of this
repository.