RustedBytes/xaacify

xaacify

xaacify is a Rust CLI tool that reads Parquet files containing embedded WAV audio, re-encodes the audio payloads to AAC, and writes the result back to Parquet while preserving the dataset schema.

What It Does

  • scans an input directory recursively for .parquet files
  • reads rows with the schema:
{"info": {"features": {"audio": {"_type": "Audio"}, "duration": {"dtype": "float64", "_type": "Value"}, "transcription": {"dtype": "string", "_type": "Value"}}}}
  • expects the Parquet layout to contain:
    • audio.bytes: WAV file bytes
    • audio.sampling_rate: sampling rate
    • audio.path: original audio file name
    • duration
    • transcription
  • converts audio.bytes from WAV to ADTS AAC
  • rewrites audio.path to use the .aac extension
  • writes the transformed rows to an output directory with the same relative file layout
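The last two steps above (the `.aac` rename and the mirrored output layout) can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: `rewrite_audio_path` and `mirrored_output_path` are hypothetical helper names.

```rust
use std::path::{Path, PathBuf};

/// Replace the extension of an audio file name with `.aac`.
/// Hypothetical helper; the real tool may handle edge cases differently.
fn rewrite_audio_path(original: &str) -> String {
    Path::new(original)
        .with_extension("aac")
        .to_string_lossy()
        .into_owned()
}

/// Map an input Parquet file to the same relative location under the
/// output directory. Returns `None` if `file` is not inside `input_root`.
fn mirrored_output_path(input_root: &Path, output_root: &Path, file: &Path) -> Option<PathBuf> {
    let relative = file.strip_prefix(input_root).ok()?;
    Some(output_root.join(relative))
}
```

For example, `rewrite_audio_path("clips/a.wav")` yields `clips/a.aac`, and a file at `wav-parquet/train/part-0.parquet` maps to `aac-parquet/train/part-0.parquet`.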

Requirements

  • Rust toolchain
  • system support for building xaac-rs and its libxaac backend

Build

cargo build --release

Portable release builds target the baseline x86-64 ISA so the resulting binary works on older x86_64 CPUs. If you want a host-specific build on your own machine, override it explicitly:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Usage

cargo run --release -- \
  --input wav-parquet \
  --output aac-parquet

CLI Flags

--input <PATH>               Input directory containing parquet files
--output <PATH>              Output directory for converted parquet files
--workers <N>                Number of worker threads
--batch-size <N>             Parquet batch size used by the Arrow reader
--bitrate <N>                AAC target bitrate in bits per second
--profile <PROFILE>          AAC encoder profile
--bandwidth <HZ>             AAC encoder bandwidth hint in hertz
--frame-length <N>           AAC frame length override
--output-format <FORMAT>     AAC output format
--disable-tns                Disable temporal noise shaping
--full-bandwidth             Enable full-bandwidth encoding mode
--remove-input-file          Remove the source parquet file after successful conversion
--continue-on-error          Keep processing after row/file errors and report them in the final summary
--scheduler <MODE>           Scheduler mode: auto | files | rows

Supported --profile values:

  • aac-lc
  • he-aac-v1
  • aac-ld
  • he-aac-v2
  • aac-eld
  • usac

Supported --output-format values:

  • adts
  • raw

Scheduler Modes

  • auto
    • uses row-level parallelism when processing a single Parquet file
    • uses row-level parallelism when there are only a few files and at least one is large
    • otherwise uses file-level parallelism
  • files
    • parallelizes across Parquet files
    • rows inside each batch are processed sequentially
  • rows
    • processes files sequentially
    • parallelizes audio transcoding across rows inside each batch

auto is the default and is intended to avoid nested rayon overhead.
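The auto rules above can be read as a small decision function. The sketch below is only an illustration of that reading: the thresholds `FEW_FILES` and `LARGE_FILE_BYTES` and the name `resolve_auto` are assumptions, since the README does not say what counts as "a few files" or "large".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchedulerMode {
    Files,
    Rows,
}

// Hypothetical thresholds, chosen only for illustration.
const FEW_FILES: usize = 4;
const LARGE_FILE_BYTES: u64 = 512 * 1024 * 1024;

/// Pick an effective mode for `--scheduler auto` from the input file sizes.
fn resolve_auto(file_sizes: &[u64]) -> SchedulerMode {
    let single_file = file_sizes.len() == 1;
    let few_with_large = file_sizes.len() <= FEW_FILES
        && file_sizes.iter().any(|&size| size >= LARGE_FILE_BYTES);
    if single_file || few_with_large {
        // Parallelize inside each batch; files are handled one at a time.
        SchedulerMode::Rows
    } else {
        // Parallelize across files; rows in a batch run sequentially.
        SchedulerMode::Files
    }
}
```

Because exactly one level is parallel in either branch, a choice like this never nests one parallel loop inside another.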

Failure Behavior

  • default mode is fail-fast
  • with --continue-on-error:
    • row-level conversion failures are logged and counted
    • file-level failures are logged and counted
    • the process still exits non-zero if any row or file failed
    • failed rows are written with audio.bytes set to null
    • failed rows keep their original audio.path
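The row-level rules above can be sketched as follows. This is a simplified model, not the tool's implementation: `Row`, `RowStats`, `convert_rows`, and the `transcode` callback are hypothetical, and treating a row with missing input bytes as a failure is an assumption.

```rust
use std::path::Path;

#[derive(Debug, Clone, PartialEq)]
struct Row {
    path: String,
    bytes: Option<Vec<u8>>,
}

#[derive(Debug, Default, PartialEq)]
struct RowStats {
    converted: usize,
    failed: usize,
}

/// Convert every row. In fail-fast mode the first error aborts the batch;
/// with `continue_on_error` the failed row is kept with null bytes and its
/// original path, and the failure is counted.
fn convert_rows<F>(
    rows: Vec<Row>,
    continue_on_error: bool,
    mut transcode: F,
) -> Result<(Vec<Row>, RowStats), String>
where
    F: FnMut(&[u8]) -> Result<Vec<u8>, String>,
{
    let mut out = Vec::with_capacity(rows.len());
    let mut stats = RowStats::default();
    for row in rows {
        let result = match row.bytes.as_deref() {
            Some(wav) => transcode(wav),
            None => Err(format!("{}: missing audio bytes", row.path)),
        };
        match result {
            Ok(aac) => {
                stats.converted += 1;
                let path = Path::new(&row.path)
                    .with_extension("aac")
                    .to_string_lossy()
                    .into_owned();
                out.push(Row { path, bytes: Some(aac) });
            }
            Err(err) if continue_on_error => {
                eprintln!("row failed: {}", err);
                stats.failed += 1;
                out.push(Row { path: row.path, bytes: None });
            }
            Err(err) => return Err(err),
        }
    }
    Ok((out, stats))
}

/// Non-zero exit status whenever any row failed, even in continue mode.
fn exit_code(stats: &RowStats) -> i32 {
    if stats.failed > 0 { 1 } else { 0 }
}
```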

Examples

Convert one directory:

cargo run --release -- \
  --input wav-parquet \
  --output aac-parquet \
  --bitrate 192000 \
  --profile aac-lc \
  --bandwidth 18000 \
  --frame-length 1024 \
  --output-format adts \
  --workers 8 \
  --batch-size 256

Convert, delete each source file once its output has been written successfully, and keep going past row or file errors:

cargo run --release -- \
  --input wav-parquet \
  --output aac-parquet \
  --remove-input-file \
  --continue-on-error

Force row-level scheduling for a single large file:

cargo run --release -- \
  --input wav-parquet \
  --output aac-parquet \
  --scheduler rows

Logging

The tool uses log and env_logger.

Default logging level is info. You can override it with RUST_LOG:

RUST_LOG=debug cargo run --release -- --input wav-parquet --output aac-parquet

Notes

  • output audio is stored as ADTS AAC bytes in audio.bytes
  • Parquet schema and Arrow metadata are preserved in the output file
  • audio.sampling_rate is validated against the WAV header; a mismatch is logged as a warning and the rate from the WAV header is used
  • supported input WAV bit depths:
    • 16
    • 24
    • 32
  • supported AAC sample rates:
    • 7350
    • 8000
    • 11025
    • 12000
    • 16000
    • 22050
    • 24000
    • 32000
    • 44100
    • 48000
    • 64000
    • 88200
    • 96000

Tests

Run checks:

cargo check

Run tests:

cargo test

Run Summary

At the end of each run, the tool logs a summary containing:

  • files discovered
  • files succeeded
  • files failed
  • rows converted
  • rows failed
  • elapsed time
  • files/sec
  • rows/sec
  • effective scheduler mode
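As a rough picture of how the throughput figures in such a summary relate to the counters, here is a minimal sketch. The struct and field names are hypothetical, and whether the real tool divides by succeeded or discovered counts is an assumption.

```rust
use std::time::Duration;

#[derive(Debug)]
struct RunSummary {
    files_discovered: u64,
    files_succeeded: u64,
    files_failed: u64,
    rows_converted: u64,
    rows_failed: u64,
    elapsed: Duration,
}

/// Items per second, returning 0.0 when no time has elapsed.
fn per_second(count: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 { count as f64 / secs } else { 0.0 }
}

impl RunSummary {
    // Assumption: rates are computed from successful work only.
    fn files_per_sec(&self) -> f64 {
        per_second(self.files_succeeded, self.elapsed)
    }

    fn rows_per_sec(&self) -> f64 {
        per_second(self.rows_converted, self.elapsed)
    }
}
```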
