Description
Summary
When an integer column is mapped to fill (or color) in a bar chart, ggsql's stat transform drops the column from the result, causing a validation error:
Validation error: Column 'fill' referenced in aesthetic 'fill' (layer 1 (global data)) does not exist.
Available columns: __ggsql_aes_pos1__, __ggsql_aes_pos2__, __ggsql_aes_pos2end__
Reproducible example
Rust (integration test style)
use ggsql::reader::{DuckDBReader, Reader};
use ggsql::writer::VegaLiteWriter;
let reader = DuckDBReader::from_connection_string("duckdb://memory").unwrap();
// Integer column (survived: 0/1) mapped to fill
let spec = reader.execute(
"SELECT *
FROM (VALUES
('Male', 0), ('Male', 1), ('Female', 0), ('Female', 1),
('Male', 0), ('Male', 0), ('Female', 1), ('Female', 1)
) AS t(sex, survived)
VISUALISE sex AS x, survived AS fill
DRAW bar"
);
// This fails with: Column 'fill' referenced in aesthetic 'fill' ... does not exist
assert!(spec.is_ok(), "Should handle integer fill: {:?}", spec.err());
Python
import ggsql
import polars as pl
reader = ggsql.DuckDBReader("duckdb://memory")
df = pl.DataFrame({
"sex": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
"survived": [0, 1, 0, 1, 0, 0, 1, 1],
})
reader.register("titanic", df)
# Fails with validation error
spec = reader.execute("""
SELECT * FROM titanic
VISUALISE sex AS x, survived AS fill
DRAW bar
""")
Note: adding SCALE DISCRETE fill or SCALE fill RENAMING 0 => 'No', 1 => 'Yes' doesn't help because RENAMING doesn't set a scale_type, so the discreteness check still falls through to the schema-based inference.
Root cause
In src/execute/schema.rs:171-172, discreteness is determined purely by data type:
let is_discrete =
    matches!(dtype, DataType::String | DataType::Boolean) || dtype.is_categorical();
Integers are never considered discrete. The downstream effect:
- add_discrete_columns_to_partition_by (src/execute/mod.rs:677) checks if a mapped column is discrete
- Integer survived → not discrete → not added to partition_by
- The bar stat transform (src/plot/layer/geom/bar.rs:87) builds GROUP BY from partition_by + x column
- Since fill isn't in group_by, survived is dropped from the aggregation SQL
- The resulting DataFrame only has pos1, pos2, pos2end
- Writer validation fails because fill references a column that no longer exists
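To make the chain above concrete, here is a toy model of it. This is only a sketch: `DataType`, `is_discrete`, and `group_by_columns` are hypothetical stand-ins written for this issue, not ggsql's actual types or functions, and the categorical case is left out for brevity.

```rust
// Toy model of the failure chain. Every name here is a hypothetical stand-in,
// not ggsql's actual type or function.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    String,
    Boolean,
    Integer,
}

/// Schema-based discreteness as described above: integers never qualify.
fn is_discrete(dtype: DataType) -> bool {
    matches!(dtype, DataType::String | DataType::Boolean)
}

/// Columns that reach GROUP BY: the x column plus every discrete mapped column.
fn group_by_columns(x: &str, mapped: &[(&str, DataType)]) -> Vec<String> {
    let mut cols = vec![x.to_string()];
    for &(name, dtype) in mapped {
        if is_discrete(dtype) {
            cols.push(name.to_string());
        }
    }
    cols
}
```

In this model, calling `group_by_columns("sex", &[("survived", DataType::Integer)])` yields only the x column, so `survived` never reaches the aggregation. The same call with a string column keeps it.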
Note that SCALE fill RENAMING ... doesn't help because RENAMING doesn't set scale.scale_type, so add_discrete_columns_to_partition_by falls through to the schema check (lines 740-741), which still says "integer = not discrete."
Inconsistency with ggplot2
In ggplot2, this works because all mapped aesthetics contribute to grouping, regardless of column type:
library(ggplot2)
df <- data.frame(sex = c("Male", "Female", "Male", "Female"),
survived = c(0L, 1L, 0L, 1L))
# Works fine — survived (integer) is used for grouping in stat_count
ggplot(df, aes(x = sex, fill = survived)) + geom_bar()
ggplot2 treats the integer as continuous for color scale purposes (producing a gradient), but still uses it for grouping in the stat transform. The grouping and the scale type are independent concerns.
Possible approaches
A) Aesthetic-based grouping
Certain aesthetics (fill, color, shape, linetype, stroke) inherently imply grouping. Any column mapped to these should be added to partition_by regardless of data type.
Pros: Targeted fix, only changes behavior for aesthetics where grouping is clearly intended.
Cons: Doesn't cover edge cases like mapping a numeric column to opacity in a bar chart. Requires maintaining a list of "grouping aesthetics."
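A rough sketch of what approach A could look like. The `Mapping` struct and `partition_by` function are hypothetical and only illustrate the rule; the aesthetic list is the one proposed above, and `schema_discrete` stands for the result of the existing schema-based check.

```rust
// Sketch of approach A. The struct and function are hypothetical,
// not ggsql's API; the aesthetic list is the one proposed above.
const GROUPING_AESTHETICS: [&str; 5] = ["fill", "color", "shape", "linetype", "stroke"];

struct Mapping<'a> {
    aesthetic: &'a str,
    column: &'a str,
    /// Result of the existing schema-based discreteness check.
    schema_discrete: bool,
}

/// A column is partitioned on if it is discrete by schema OR it is mapped
/// to one of the grouping aesthetics, regardless of its data type.
fn partition_by<'a>(mappings: &[Mapping<'a>]) -> Vec<&'a str> {
    mappings
        .iter()
        .filter(|m| {
            m.schema_discrete || GROUPING_AESTHETICS.iter().any(|a| *a == m.aesthetic)
        })
        .map(|m| m.column)
        .collect()
}
```

With this rule, an integer `survived` mapped to fill is kept, while a numeric column mapped to opacity is still dropped, which is the edge case named in the cons.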
B) All non-positional mapped columns survive stat transforms
Every non-positional, non-stat-consumed aesthetic column gets added to GROUP BY for stat transforms, regardless of data type or aesthetic name.
Pros: Simpler logic, matches ggplot2's behavior most closely (where group is the interaction of all mapped discrete variables, but stat transforms preserve all mappings). No need to maintain a special list.
Cons: Broader change — could affect behavior for intentionally continuous aesthetics like opacity mapped to a numeric column in a stat geom. Though in practice, including a continuous column in GROUP BY just means "don't aggregate it away," which is usually correct.
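For comparison, approach B might reduce to something like the following. Again a sketch under assumptions: the function name, the `(aesthetic, column)` tuple shape, the `stat_consumed` parameter, and the two-entry positional list are all made up for illustration and do not reflect ggsql's internals.

```rust
// Sketch of approach B. Names and the positional list are assumptions
// made for illustration, not ggsql's API.
const POSITIONAL_AESTHETICS: [&str; 2] = ["x", "y"];

/// `mappings` holds (aesthetic, column) pairs; `stat_consumed` lists the
/// aesthetics that the stat transform itself reads (hypothetical parameter).
/// Every other mapped column is kept, whatever its data type.
fn partition_by<'a>(mappings: &[(&str, &'a str)], stat_consumed: &[&str]) -> Vec<&'a str> {
    let mut cols = Vec::new();
    for &(aesthetic, column) in mappings {
        let positional = POSITIONAL_AESTHETICS.iter().any(|p| *p == aesthetic);
        let consumed = stat_consumed.iter().any(|s| *s == aesthetic);
        if !positional && !consumed {
            cols.push(column);
        }
    }
    cols
}
```

No data type or aesthetic allow-list appears anywhere in the function, which is the main simplification over approach A.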
Additional consideration: RENAMING should imply discrete
Independently of the above, SCALE fill RENAMING ... should probably set or imply a discrete scale type. If you're providing explicit label mappings for specific values, discrete semantics are almost certainly intended.
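One way the RENAMING rule could be expressed, as a sketch only. `scale_type` mirrors the field mentioned above; the `Scale` struct layout, the `renaming` field, and `effective_is_discrete` are hypothetical. An explicit scale type wins, RENAMING without a scale type implies discrete, and only then does the schema-based check apply.

```rust
// Sketch only: `scale_type` mirrors the field mentioned above; the struct
// layout, `renaming`, and `effective_is_discrete` are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ScaleType {
    Continuous,
    Discrete,
}

struct Scale {
    scale_type: Option<ScaleType>,
    /// (value, label) pairs from a RENAMING clause; empty if none was given.
    renaming: Vec<(String, String)>,
}

fn effective_is_discrete(scale: Option<&Scale>, schema_discrete: bool) -> bool {
    match scale {
        Some(s) => match s.scale_type {
            // An explicit scale type always takes precedence.
            Some(t) => t == ScaleType::Discrete,
            // RENAMING with no scale type implies discrete semantics.
            None if !s.renaming.is_empty() => true,
            // Otherwise fall back to the schema-based inference.
            None => schema_discrete,
        },
        None => schema_discrete,
    }
}
```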