
Integer columns mapped to fill/color are dropped by bar stat transform #239

@cpsievert


Summary

When an integer column is mapped to fill (or color) in a bar chart, ggsql's stat transform drops the column from the result, causing a validation error:

Validation error: Column 'fill' referenced in aesthetic 'fill' (layer 1 (global data)) does not exist.
Available columns: __ggsql_aes_pos1__, __ggsql_aes_pos2__, __ggsql_aes_pos2end__

Reproducible example

Rust (integration test style)

use ggsql::reader::{DuckDBReader, Reader};
use ggsql::writer::VegaLiteWriter;

let reader = DuckDBReader::from_connection_string("duckdb://memory").unwrap();

// Integer column (survived: 0/1) mapped to fill
let spec = reader.execute(
    "SELECT *
     FROM (VALUES
       ('Male', 0), ('Male', 1), ('Female', 0), ('Female', 1),
       ('Male', 0), ('Male', 0), ('Female', 1), ('Female', 1)
     ) AS t(sex, survived)
     VISUALISE sex AS x, survived AS fill
     DRAW bar"
);

// This fails with: Column 'fill' referenced in aesthetic 'fill' ... does not exist
assert!(spec.is_ok(), "Should handle integer fill: {:?}", spec.err());

Python

import ggsql
import polars as pl

reader = ggsql.DuckDBReader("duckdb://memory")
df = pl.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
    "survived": [0, 1, 0, 1, 0, 0, 1, 1],
})
reader.register("titanic", df)

# Fails with validation error
spec = reader.execute("""
    SELECT * FROM titanic
    VISUALISE sex AS x, survived AS fill
    DRAW bar
""")

Note: adding SCALE DISCRETE fill or SCALE fill RENAMING 0 => 'No', 1 => 'Yes' doesn't help. RENAMING doesn't set a scale_type, so the discreteness check still falls through to the schema-based inference.

Root cause

In src/execute/schema.rs:171-172, discreteness is determined purely by data type:

let is_discrete =
    matches!(dtype, DataType::String | DataType::Boolean) || dtype.is_categorical();

Integers are never considered discrete. The downstream effect:

  1. add_discrete_columns_to_partition_by (src/execute/mod.rs:677) checks if a mapped column is discrete
  2. Integer survived → not discrete → not added to partition_by
  3. The bar stat transform (src/plot/layer/geom/bar.rs:87) builds GROUP BY from partition_by + x column
  4. Since fill isn't in group_by, survived is dropped from the aggregation SQL
  5. The resulting DataFrame only has the positional columns (__ggsql_aes_pos1__, __ggsql_aes_pos2__, __ggsql_aes_pos2end__)
  6. Writer validation fails because fill references a column that no longer exists
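
The steps above can be reproduced outside ggsql with a minimal sketch. It uses Python's built-in sqlite3 rather than DuckDB, and build_bar_count_sql is a hypothetical stand-in for the bar stat's query builder, not ggsql's actual code. It only illustrates that a column missing from the GROUP BY list disappears from the aggregated result:

```python
import sqlite3

# Hypothetical stand-in for the bar stat transform: GROUP BY is built
# from partition_by + the x column, as described in step 3.
def build_bar_count_sql(table, x_col, partition_by):
    group_cols = list(partition_by) + [x_col]
    cols = ", ".join(group_cols)
    return f"SELECT {cols}, COUNT(*) AS n FROM {table} GROUP BY {cols} ORDER BY {cols}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (sex TEXT, survived INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("Male", 0), ("Male", 1), ("Female", 0), ("Female", 1),
    ("Male", 0), ("Male", 0), ("Female", 1), ("Female", 1),
])

def result_columns(sql):
    # Column names of the aggregated result, in order.
    return [d[0] for d in conn.execute(sql).description]

# Integer fill judged "not discrete": partition_by stays empty.
dropped = result_columns(build_bar_count_sql("t", "sex", []))
# Fill column added to partition_by: it survives the aggregation.
kept = result_columns(build_bar_count_sql("t", "sex", ["survived"]))

print(dropped)  # ['sex', 'n']
print(kept)     # ['survived', 'sex', 'n']
```

In the first query there is no survived column left for the fill aesthetic to reference, which mirrors the validation error in the summary.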

Note that SCALE fill RENAMING ... doesn't help because RENAMING doesn't set scale.scale_type, so add_discrete_columns_to_partition_by falls through to the schema check (lines 740-741), which still says "integer = not discrete."

Inconsistency with ggplot2

In ggplot2, this works because all mapped aesthetics contribute to grouping, regardless of column type:

library(ggplot2)
df <- data.frame(sex = c("Male", "Female", "Male", "Female"),
                 survived = c(0L, 1L, 0L, 1L))
# Works fine — survived (integer) is used for grouping in stat_count
ggplot(df, aes(x = sex, fill = survived)) + geom_bar()

ggplot2 treats the integer as continuous for color scale purposes (producing a gradient), but still uses it for grouping in the stat transform. The grouping and the scale type are independent concerns.

Possible approaches

A) Aesthetic-based grouping

Certain aesthetics (fill, color, shape, linetype, stroke) inherently imply grouping. Any column mapped to these should be added to partition_by regardless of data type.

Pros: Targeted fix, only changes behavior for aesthetics where grouping is clearly intended.
Cons: Doesn't cover edge cases like mapping a numeric column to opacity in a bar chart. Requires maintaining a list of "grouping aesthetics."
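
As a rough sketch of approach A (in Python for brevity; the function name, the POSITIONAL set, and the dict-based mapping are all assumptions for illustration, not ggsql's API), the partition columns could be chosen like this:

```python
# Aesthetics assumed to imply grouping (the list from approach A).
GROUPING_AESTHETICS = {"fill", "color", "shape", "linetype", "stroke"}
# Hypothetical set of positional aesthetics, handled separately by the stat.
POSITIONAL = {"x", "y"}

def partition_columns_a(mappings, discrete_columns):
    """Pick partition_by columns under approach A.

    mappings: aesthetic name -> column name
    discrete_columns: columns the existing schema check treats as discrete
    """
    cols = []
    for aesthetic, column in mappings.items():
        if aesthetic in POSITIONAL:
            continue
        # Keep the existing type-based rule, plus the new aesthetic-based rule.
        if aesthetic in GROUPING_AESTHETICS or column in discrete_columns:
            if column not in cols:
                cols.append(column)
    return cols

mappings = {"x": "sex", "fill": "survived", "opacity": "age"}
print(partition_columns_a(mappings, discrete_columns={"sex"}))  # ['survived']
```

The integer survived column is kept because it is mapped to fill, while the numeric age column mapped to opacity is still left out, which is the edge case noted under Cons.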

B) All non-positional mapped columns survive stat transforms

Every non-positional, non-stat-consumed aesthetic column gets added to GROUP BY for stat transforms, regardless of data type or aesthetic name.

Pros: Simpler logic, matches ggplot2's behavior most closely (where group is the interaction of all mapped discrete variables, but stat transforms preserve all mappings). No need to maintain a special list.
Cons: Broader change that could affect behavior for intentionally continuous aesthetics, such as opacity mapped to a numeric column in a stat geom. In practice, though, including a continuous column in GROUP BY just means "don't aggregate it away," which is usually correct.
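
A minimal sketch of approach B, again in Python with hypothetical names (partition_columns_b, the POSITIONAL set, and the "weight" aesthetic used as an example of a stat-consumed mapping are assumptions, not ggsql's API):

```python
# Hypothetical set of positional aesthetics, handled separately by the stat.
POSITIONAL = {"x", "y"}

def partition_columns_b(mappings, stat_consumed=frozenset()):
    """Pick partition_by columns under approach B.

    mappings: aesthetic name -> column name
    stat_consumed: aesthetics the stat itself aggregates (assumed input)
    """
    cols = []
    for aesthetic, column in mappings.items():
        if aesthetic in POSITIONAL or aesthetic in stat_consumed:
            continue
        # No data type check and no aesthetic allow-list.
        if column not in cols:
            cols.append(column)
    return cols

mappings = {"x": "sex", "fill": "survived", "opacity": "age", "weight": "fare"}
print(partition_columns_b(mappings, stat_consumed={"weight"}))  # ['survived', 'age']
```

Here both survived and age end up in GROUP BY, so neither is aggregated away, at the cost of one bar segment per distinct age value.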

Additional consideration: RENAMING should imply discrete

Independently of the above, SCALE fill RENAMING ... should probably set or imply a discrete scale type. If you're providing explicit label mappings for specific values, discrete semantics are almost certainly intended.
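
One way this could look, sketched in Python with a hypothetical Scale model (the field names and effective_scale_type are assumptions for illustration; only scale_type and RENAMING come from the description above):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Scale:
    # Hypothetical model of a SCALE clause.
    aesthetic: str
    scale_type: Optional[str] = None              # set by e.g. SCALE DISCRETE
    renaming: dict = field(default_factory=dict)  # set by RENAMING

def effective_scale_type(scale, schema_says_discrete):
    # An explicit scale type always wins.
    if scale.scale_type is not None:
        return scale.scale_type
    # Proposed rule: explicit label mappings imply discrete semantics.
    if scale.renaming:
        return "discrete"
    # Otherwise fall back to the schema-based inference.
    return "discrete" if schema_says_discrete else "continuous"

fill = Scale("fill", renaming={0: "No", 1: "Yes"})
print(effective_scale_type(fill, schema_says_discrete=False))  # discrete
```

With this rule the integer survived column would be treated as discrete whenever RENAMING is present, without changing the schema check itself.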
