xaacify is a Rust CLI tool that reads Parquet files containing embedded WAV audio, re-encodes the audio payloads to AAC, and writes the result back to Parquet while preserving the dataset schema.
- scans an input directory recursively for `.parquet` files
- reads rows with the schema:
  {"info": {"features": {"audio": {"_type": "Audio"}, "duration": {"dtype": "float64", "_type": "Value"}, "transcription": {"dtype": "string", "_type": "Value"}}}}
- expects the Parquet layout to contain:
  - `audio.bytes`: WAV file bytes
  - `audio.sampling_rate`: sampling rate
  - `audio.path`: original audio file name
  - `duration`
  - `transcription`
- converts `audio.bytes` from WAV to ADTS AAC
- rewrites `audio.path` to use the `.aac` extension
- writes the transformed rows to an output directory with the same relative file layout
- Rust toolchain
- system support for building `xaac-rs` and its `libxaac` backend
    cargo build --release

Portable release builds target the baseline x86-64 ISA so the resulting binary
works on older x86_64 CPUs. If you want a host-specific build on your own machine,
override it explicitly:

    RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Basic usage:

    cargo run --release -- \
      --input wav-parquet \
      --output aac-parquet

--input <PATH> Input directory containing parquet files
--output <PATH> Output directory for converted parquet files
--workers <N> Number of worker threads
--batch-size <N> Parquet batch size used by the Arrow reader
--bitrate <N> AAC target bitrate in bits per second
--profile <PROFILE> AAC encoder profile
--bandwidth <HZ> AAC encoder bandwidth hint in hertz
--frame-length <N> AAC frame length override
--output-format <FORMAT> AAC output format
--disable-tns Disable temporal noise shaping
--full-bandwidth Enable full-bandwidth encoding mode
--remove-input-file Remove the source parquet file after successful conversion
--continue-on-error Keep processing after row/file errors and report them in the final summary
--scheduler <MODE> Scheduler mode: auto | files | rows
Supported --profile values:
- `aac-lc`
- `he-aac-v1`
- `aac-ld`
- `he-aac-v2`
- `aac-eld`
- `usac`
Supported --output-format values:
- `adts`
- `raw`
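
As an illustration of how the accepted `--profile` and `--output-format` strings could be validated, here is a minimal sketch using `std::str::FromStr`. The enum and variant names are assumptions made for this example; the README does not describe how the tool parses its arguments internally.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the documented `--profile` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Profile {
    AacLc,
    HeAacV1,
    AacLd,
    HeAacV2,
    AacEld,
    Usac,
}

impl FromStr for Profile {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "aac-lc" => Ok(Profile::AacLc),
            "he-aac-v1" => Ok(Profile::HeAacV1),
            "aac-ld" => Ok(Profile::AacLd),
            "he-aac-v2" => Ok(Profile::HeAacV2),
            "aac-eld" => Ok(Profile::AacEld),
            "usac" => Ok(Profile::Usac),
            other => Err(format!("unsupported profile: {other}")),
        }
    }
}

/// Hypothetical enum mirroring the documented `--output-format` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputFormat {
    Adts,
    Raw,
}

impl FromStr for OutputFormat {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "adts" => Ok(OutputFormat::Adts),
            "raw" => Ok(OutputFormat::Raw),
            other => Err(format!("unsupported output format: {other}")),
        }
    }
}
```

With these definitions, `"he-aac-v2".parse::<Profile>()` yields `Ok(Profile::HeAacV2)`, while any string outside the documented lists is rejected with an error message.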
- `auto`
  - uses row-level parallelism when processing a single Parquet file
  - uses row-level parallelism when there are only a few files and at least one is large
  - otherwise uses file-level parallelism
- `files`
  - parallelizes across Parquet files
  - rows inside each batch are processed sequentially
- `rows`
  - processes files sequentially
  - parallelizes audio transcoding across rows inside each batch
`auto` is the default. It selects a single level of parallelism (files or rows) for the run, which is intended to avoid the overhead of nested rayon parallelism.
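
The `auto` rules above can be read as a small decision function. The sketch below is only an illustration: the thresholds `FEW_FILES` and `LARGE_FILE_BYTES` are invented placeholders, since this README does not state what counts as "a few files" or "large", and the function name `resolve_auto` is hypothetical.

```rust
/// The two concrete modes that `auto` can resolve to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scheduler {
    Files,
    Rows,
}

// ASSUMPTION: placeholder thresholds, not the tool's real values.
const FEW_FILES: usize = 4;
const LARGE_FILE_BYTES: u64 = 512 * 1024 * 1024;

/// Pick a scheduler from the sizes (in bytes) of the discovered Parquet files.
fn resolve_auto(file_sizes: &[u64]) -> Scheduler {
    let single_file = file_sizes.len() == 1;
    let few_with_large = file_sizes.len() <= FEW_FILES
        && file_sizes.iter().any(|&size| size >= LARGE_FILE_BYTES);

    if single_file || few_with_large {
        // Too few files to keep every worker busy: parallelize inside batches.
        Scheduler::Rows
    } else {
        // Enough files to spread across workers: one file per worker.
        Scheduler::Files
    }
}
```

Resolving to exactly one mode keeps a single parallel loop active at a time, which matches the stated goal of avoiding nested parallelism.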
- default mode is fail-fast
- with `--continue-on-error`:
  - row-level conversion failures are logged and counted
  - file-level failures are logged and counted
  - the process still exits non-zero if any row or file failed
  - failed rows keep `audio.bytes = null`
  - failed rows preserve the original `audio.path`
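
The row-level behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: `OutputRow`, `Summary`, `convert_rows`, and `exit_code` are hypothetical names, rows are reduced to a path plus WAV bytes, and the transcoder is passed in as a closure rather than calling the real encoder.

```rust
use std::path::Path;

/// Simplified output row: only the two audio fields discussed above.
#[derive(Debug, PartialEq)]
struct OutputRow {
    path: String,
    bytes: Option<Vec<u8>>, // `None` models `audio.bytes = null`
}

#[derive(Debug, Default, PartialEq)]
struct Summary {
    rows_converted: u64,
    rows_failed: u64,
}

/// Convert rows, either failing fast or recording failures and continuing.
fn convert_rows<F>(
    rows: Vec<(String, Vec<u8>)>,
    continue_on_error: bool,
    mut transcode: F,
    summary: &mut Summary,
) -> Result<Vec<OutputRow>, String>
where
    F: FnMut(&[u8]) -> Result<Vec<u8>, String>,
{
    let mut out = Vec::with_capacity(rows.len());
    for (path, wav) in rows {
        match transcode(&wav) {
            Ok(aac) => {
                summary.rows_converted += 1;
                let aac_path = Path::new(&path).with_extension("aac");
                out.push(OutputRow {
                    path: aac_path.to_string_lossy().into_owned(),
                    bytes: Some(aac),
                });
            }
            Err(err) if continue_on_error => {
                // Log, count, and keep the row with null bytes and its original path.
                summary.rows_failed += 1;
                eprintln!("row {path} failed: {err}");
                out.push(OutputRow { path, bytes: None });
            }
            // Default fail-fast mode: the first error aborts the run.
            Err(err) => return Err(err),
        }
    }
    Ok(out)
}

/// Non-zero exit status whenever at least one row failed.
fn exit_code(summary: &Summary) -> i32 {
    if summary.rows_failed > 0 { 1 } else { 0 }
}
```

File-level failures would be counted the same way in a fuller version; they are omitted here to keep the sketch short.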
Convert one directory:
    cargo run --release -- \
      --input wav-parquet \
      --output aac-parquet \
      --bitrate 192000 \
      --profile aac-lc \
      --bandwidth 18000 \
      --frame-length 1024 \
      --output-format adts \
      --workers 8 \
      --batch-size 256

Convert and delete source files after each successful output write:
    cargo run --release -- \
      --input wav-parquet \
      --output aac-parquet \
      --remove-input-file \
      --continue-on-error

Force row-level scheduling for a single large file:
    cargo run --release -- \
      --input wav-parquet \
      --output aac-parquet \
      --scheduler rows

The tool uses `log` and `env_logger`.
Default logging level is info. You can override it with RUST_LOG:
    RUST_LOG=debug cargo run --release -- --input wav-parquet --output aac-parquet

- output audio is stored as ADTS AAC bytes in `audio.bytes`
- Parquet schema and Arrow metadata are preserved in the output file
- `audio.sampling_rate` is validated against the WAV header; mismatches are logged as warnings and the value from the WAV header is used
- supported input WAV bit depths:
  - 16
  - 24
  - 32
- supported AAC sample rates:
  - 7350
  - 8000
  - 11025
  - 12000
  - 16000
  - 22050
  - 24000
  - 32000
  - 44100
  - 48000
  - 64000
  - 88200
  - 96000
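
The sampling-rate check described above can be sketched as follows: read the rate from the WAV `fmt ` chunk, warn if it differs from the `audio.sampling_rate` column, and prefer the header value. The 13 supported rates are the same ones that the MPEG-4 audio sampling frequency index table defines for indices 0 to 12, which is the index an ADTS header carries. Whether xaacify uses this table directly is an assumption, and all function names here are hypothetical.

```rust
/// Supported rates, ordered by MPEG-4 sampling frequency index (0..=12).
const SUPPORTED_RATES: [u32; 13] = [
    96000, 88200, 64000, 48000, 44100, 32000, 24000,
    22050, 16000, 12000, 11025, 8000, 7350,
];

/// Index of `rate` in the table, or `None` if the rate is unsupported.
fn sampling_frequency_index(rate: u32) -> Option<u8> {
    SUPPORTED_RATES.iter().position(|&r| r == rate).map(|i| i as u8)
}

/// Read a little-endian u32 at `at`, if the buffer is long enough.
fn read_u32_le(bytes: &[u8], at: usize) -> Option<u32> {
    let b = bytes.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Find the `fmt ` chunk of a RIFF/WAVE buffer and return its sample rate.
fn wav_header_sample_rate(wav: &[u8]) -> Option<u32> {
    if wav.len() < 12 || &wav[0..4] != b"RIFF" || &wav[8..12] != b"WAVE" {
        return None;
    }
    let mut offset = 12usize;
    while offset.checked_add(8)? <= wav.len() {
        let id = &wav[offset..offset + 4];
        let size = read_u32_le(wav, offset + 4)? as usize;
        let body = offset + 8;
        if id == b"fmt " {
            // fmt body layout: format tag (2), channels (2), sample rate (4), ...
            return read_u32_le(wav, body + 4);
        }
        // Chunk bodies are padded to an even number of bytes.
        offset = body.saturating_add(size).saturating_add(size & 1);
    }
    None
}

/// Prefer the WAV header rate, warning when the column value disagrees.
fn effective_sample_rate(column_rate: u32, wav: &[u8]) -> Option<u32> {
    let header_rate = wav_header_sample_rate(wav)?;
    if header_rate != column_rate {
        eprintln!("warning: column says {column_rate} Hz, WAV header says {header_rate} Hz");
    }
    Some(header_rate)
}
```

Walking the chunk list instead of reading a fixed byte offset keeps the sketch correct for WAV files that place other chunks before `fmt `.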
Run checks:
    cargo check

Run tests:

    cargo test

At the end of each run, the tool logs a summary containing:
- files discovered
- files succeeded
- files failed
- rows converted
- rows failed
- elapsed time
- files/sec
- rows/sec
- effective scheduler mode
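
The two throughput figures in the summary are plain ratios of a counter to the elapsed time. A minimal sketch, assuming a hypothetical helper named `per_second` that guards against a zero-length run:

```rust
use std::time::Duration;

/// Items processed per second; returns 0.0 when no time has elapsed.
fn per_second(count: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        count as f64 / secs
    } else {
        0.0
    }
}
```

For example, 120 converted rows over 60 seconds corresponds to 2 rows/sec.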