Synthetic enterprise data generation for ML training, audit analytics, and system testing.
DataSynth generates statistically realistic, fully interconnected enterprise financial data across 20+ process families. Generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford's Law, log-normal mixtures), and maintains referential integrity across 100+ output tables. Generation-time assertions enforce these invariants at scale.
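The invariants above can be checked on any exported dataset. The following is a minimal sketch, not DataSynth's own assertion code: the tuple layout of `lines` and the helper names (`unbalanced_entries`, `benford_expected`, `leading_digit_shares`) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from decimal import Decimal

# Hypothetical journal lines: (entry_id, account, debit, credit).
lines = [
    ("JE-1", "1200", Decimal("1250.00"), Decimal("0")),
    ("JE-1", "4000", Decimal("0"), Decimal("1250.00")),
    ("JE-2", "6100", Decimal("318.40"), Decimal("0")),
    ("JE-2", "1000", Decimal("0"), Decimal("318.40")),
]


def unbalanced_entries(lines):
    """Return the ids of entries whose debits and credits differ."""
    totals = defaultdict(lambda: Decimal("0"))
    for entry_id, _account, debit, credit in lines:
        totals[entry_id] += debit - credit
    return [entry_id for entry_id, diff in totals.items() if diff != 0]


def benford_expected(d):
    """Benford's Law probability of leading digit d (1 to 9)."""
    return math.log10(1 + 1 / d)


def leading_digit(x):
    """First significant digit of a non-zero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def leading_digit_shares(amounts):
    """Observed share of each leading digit among the positive amounts."""
    digits = [leading_digit(a) for a in amounts if a > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}
```

Comparing `leading_digit_shares(...)` against `benford_expected(d)` for each digit gives a quick visual check. A formal goodness-of-fit test would be the natural next step.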
Full Documentation | Commercial SDKs | CHANGELOG
Pre-generated datasets at huggingface.co/VynFi:
| Dataset | Records | Description |
|---|---|---|
| vynfi-aml-100k | 749K | Banking transactions with AML labels, 14 velocity features, 59 columns |
| vynfi-audit-p2p | 234 | P2P document chain (PO/GR/VI/Payment) with fraud labels |
| vynfi-ocel-manufacturing | 344 | OCEL event log for process mining (pm4py, Celonis) |
```python
from datasets import load_dataset

ds = load_dataset("VynFi/vynfi-aml-100k", split="train")
df = ds.to_pandas()
```

All datasets: Apache 2.0, entirely synthetic, no PII.
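To illustrate what a velocity feature can look like, the sketch below counts each account's earlier transactions within a trailing 24-hour window. It is an assumption about the general technique, not the definition of the 14 features shipped in `vynfi-aml-100k`. The function name `txn_count_velocity` and the `(account, timestamp)` input shape are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta


def txn_count_velocity(transactions, window=timedelta(hours=24)):
    """For each (account, timestamp) pair, count that account's earlier
    transactions inside the trailing window (the current one is excluded)."""
    by_account = defaultdict(list)
    for idx, (account, ts) in enumerate(transactions):
        by_account[account].append((ts, idx))

    velocity = [0] * len(transactions)
    for events in by_account.values():
        events.sort()
        times = [ts for ts, _ in events]
        for pos, (ts, idx) in enumerate(events):
            # Index of the first event that is still inside the window.
            first_in_window = bisect_left(times, ts - window)
            velocity[idx] = pos - first_in_window
    return velocity


txns = [
    ("A", datetime(2024, 1, 1, 9, 0)),
    ("A", datetime(2024, 1, 1, 15, 0)),
    ("B", datetime(2024, 1, 1, 16, 0)),
    ("A", datetime(2024, 1, 2, 10, 0)),
    ("A", datetime(2024, 1, 3, 12, 0)),
]
print(txn_count_velocity(txns))  # [0, 1, 0, 1, 0]
```

Results are returned in input order, so they can be attached to the original rows as a new column.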
```bash
# Build
git clone https://github.com/mivertowski/SyntheticData.git && cd SyntheticData
cargo build --release

# Demo: generates a complete dataset with defaults
./target/release/datasynth-data generate --demo --output ./output

# Full audit simulation (113+ output files)
./target/release/datasynth-data generate --demo --preset audit-group --output ./audit

# Configure and generate
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

# AI-powered config generation (set OPENAI_API_KEY, ANTHROPIC_API_KEY, or OPENROUTER_API_KEY)
cargo build --release --features llm
OPENAI_API_KEY=sk-... ./target/release/datasynth-data init \
  --from-description "12 months of mid-market retail data with fraud and SOX controls" -o config.yaml

# Counterfactual scenario simulation
./target/release/datasynth-data scenario list --config config.yaml
./target/release/datasynth-data scenario generate --config config.yaml --output ./output

# Auto-tuning: generate → evaluate → AI patch → regenerate
./target/release/datasynth-data generate --config config.yaml --output ./output --auto-tune --max-iterations 3
```

See the CLI Reference for all commands and flags.
Every process chain generates cross-referenced master data, documents, and journal entries:
| Process Family | Scope |
|---|---|
| General Ledger | Journal entries, chart of accounts (small/medium/large), ACDOCA |
| Procure-to-Pay | POs, goods receipts, vendor invoices, payments, three-way match |
| Order-to-Cash | Sales orders, deliveries, customer invoices, receipts, dunning |
| Source-to-Contract | Spend analysis, sourcing, RFx, bids, contracts, scorecards |
| Hire-to-Retire | Payroll, time & attendance, expenses, benefits, pensions, stock comp |
| Manufacturing | Production orders, BOM, WIP costing, quality inspections, cycle counts |
| Financial Reporting | BS/IS/CF, equity changes, KPIs, budgets, segment reporting, notes, XBRL |
| Tax | Multi-jurisdiction, VAT/GST, ASC 740/IAS 12 provisions, deferred tax |
| Treasury | Cash positioning, forecasts, pooling, hedging (ASC 815/IFRS 9), covenants |
| ESG | GHG Scope 1/2/3, energy/water/waste, diversity, GRI/SASB/TCFD |
| Banking / AML | 20 AML typologies, criminal networks, velocity features, KYC |
| Audit | ISA lifecycle, ISA 600 group audit, SOX 302/404, 10 methodology blueprints |
| Intercompany | IC matching, transfer pricing, eliminations, currency translation |
| Period Close | Depreciation, accruals, year-end closing, tax provisions |
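
As an illustration of what cross-referenced documents make possible, here is a minimal sketch of a three-way match over one Procure-to-Pay chain. The dataclasses, field names, and the 2% price tolerance are assumptions for the example and do not mirror DataSynth's output schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PurchaseOrderLine:
    po_id: str
    quantity: Decimal
    unit_price: Decimal


@dataclass(frozen=True)
class GoodsReceiptLine:
    po_id: str
    quantity: Decimal


@dataclass(frozen=True)
class VendorInvoiceLine:
    po_id: str
    quantity: Decimal
    unit_price: Decimal


def three_way_match(po, gr, inv, price_tolerance=Decimal("0.02")):
    """Return a list of mismatch reasons; an empty list means the match passes."""
    issues = []
    if not (po.po_id == gr.po_id == inv.po_id):
        issues.append("document chain broken: PO references differ")
    if inv.quantity > gr.quantity:
        issues.append("invoiced quantity exceeds received quantity")
    if gr.quantity > po.quantity:
        issues.append("received quantity exceeds ordered quantity")
    allowed = po.unit_price * price_tolerance
    if abs(inv.unit_price - po.unit_price) > allowed:
        issues.append("invoice price outside tolerance of PO price")
    return issues


po = PurchaseOrderLine("PO-1", Decimal("100"), Decimal("10.00"))
gr = GoodsReceiptLine("PO-1", Decimal("100"))
ok_invoice = VendorInvoiceLine("PO-1", Decimal("100"), Decimal("10.10"))
bad_invoice = VendorInvoiceLine("PO-1", Decimal("100"), Decimal("10.50"))
```

Here `three_way_match(po, gr, ok_invoice)` returns an empty list, while `bad_invoice` is flagged because its price deviates by 5% from the PO price.
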
| Feature | Description | Feature Flag |
|---|---|---|
| Neural Diffusion | Candle-powered score network, denoising score matching, hybrid blending | neural |
| LLM Config Generation | Natural language → YAML config (OpenAI/Anthropic/OpenRouter) | llm |
| Auto-Tune | Generate → evaluate → AI patch → regenerate closed loop | — |
| Adversarial Testing | ONNX model boundary probing via ort | adversarial |
| Anomaly Designer | LLM-designed fraud schemes adapted to control environment | — |
| Tabular Transformer | Masked column prediction for conditional generation | neural |
| GNN Graph Generator | Message-passing GNN for entity relationship structure | neural |
See AI Capabilities for details.
Define scenarios with typed interventions, then generate paired baseline/counterfactual datasets with causal DAG propagation:
```yaml
scenarios:
  enabled: true
  scenarios:
    - name: supply_chain_disruption
      interventions:
        - type: parameter_shift
          target: distributions.amounts.components[0].mu
          value: "6.5"
          timing: { start_month: 7, duration_months: 4, onset: sudden }
      constraints:
        preserve_accounting_identity: true
      output:
        paired: true
```

11 pre-built scenarios across fraud, control failures, macro shocks, and operational disruptions. See Scenario Library.
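
The idea behind a paired `parameter_shift` can be sketched in a few lines of Python. This is an illustration only and does not model DataSynth's causal DAG propagation. The baseline `mu` of 6.0, the `sigma`, and the helper `monthly_amounts` are assumptions; only the shifted value 6.5 and the timing (start month 7, four months, sudden onset) come from the scenario above.

```python
import math
import random


def monthly_amounts(seed, mu_by_month, sigma=1.0, per_month=1000):
    """Draw log-normal amounts for each month from one seeded generator."""
    rng = random.Random(seed)
    return {
        month: [rng.lognormvariate(mu, sigma) for _ in range(per_month)]
        for month, mu in mu_by_month.items()
    }


BASELINE_MU = 6.0        # assumed baseline, not a DataSynth default
SHIFTED_MU = 6.5         # matches value: "6.5" in the scenario
START, DURATION = 7, 4   # start_month and duration_months

baseline_mu = {m: BASELINE_MU for m in range(1, 13)}
counterfactual_mu = {
    m: SHIFTED_MU if START <= m < START + DURATION else BASELINE_MU
    for m in range(1, 13)
}

# Same seed for both runs, so they differ only where the intervention applies.
baseline = monthly_amounts(42, baseline_mu)
counterfactual = monthly_amounts(42, counterfactual_mu)

# Inside the window every amount is scaled by exp(0.5), roughly 1.65.
expected_ratio = math.exp(SHIFTED_MU - BASELINE_MU)
```

Because both runs consume the same random stream, months outside the intervention window are identical in the two datasets. Any difference between them can therefore be attributed to the shift itself.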
US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB), dual reporting. Revenue recognition (ASC 606/IFRS 15), leases (ASC 842/IFRS 16), fair value (ASC 820/IFRS 13), impairment, deferred tax, ECL, pensions, stock comp, business combinations, segment reporting. ISA (34 standards), PCAOB (19+), SOX 302/404, COSO 2013 (5 components, 17 principles). FEC and GoBD audit file exports.
YAML-driven, methodology-agnostic state machine with 10 built-in blueprints (FSA, IA, KPMG, PwC, Deloitte, EY GAM, SOC 2, PCAOB, Regulatory). See Audit FSM.
16 crates in a Rust workspace:

```
datasynth-cli              CLI binary (generate, validate, init, scenario, adversarial, audit)
datasynth-server           REST / gRPC / WebSocket server with auth and rate limiting
datasynth-runtime          Generation orchestrator (phases, assertions, streaming)
datasynth-generators       50+ generators across all process families
datasynth-banking          KYC/AML with 20 typologies and criminal networks
datasynth-eval             Evaluation framework, auto-tuning, adversarial testing
datasynth-config           YAML configuration, validation, industry presets
datasynth-core             306 domain models, distributions, diffusion, LLM provider
datasynth-graph            Graph export (PyG, Neo4j, DGL, hypergraph)
datasynth-standards        IFRS, US GAAP, ISA, SOX, PCAOB standards
datasynth-audit-fsm        YAML-driven audit FSM (10 blueprints)
datasynth-audit-optimizer  Audit optimization, Monte Carlo, group audit simulation
datasynth-ocpm             OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint      Privacy-preserving fingerprint extraction and synthesis
datasynth-output           CSV, JSON, Parquet sinks with streaming
datasynth-test-utils       Test fixtures and utilities
```
See Architecture and Generation Pipeline.
| Metric | Value |
|---|---|
| Generation throughput | ~14,000 JEs/sec |
| XXL dataset (200K+ JEs, 3 companies, 36 months) | 20.6s CSV-only |
| CSV-only speedup | 4x faster (skips JSON serialization) |
| Peak memory at scale | ~4.3 GB for 200K+ JEs |
| Determinism | Fully reproducible via seeded ChaCha8 RNG |
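
Reproducibility of this kind can be verified by hashing the output of two runs. The sketch below shows the principle only: Python's Mersenne Twister stands in for the seeded ChaCha8 RNG, and `generate_amounts` and `digest` are hypothetical helpers, not part of DataSynth.

```python
import hashlib
import random


def generate_amounts(seed, n=1000):
    """Toy generator: n log-normal amounts from a seeded RNG."""
    rng = random.Random(seed)  # stand-in for ChaCha8, which the stdlib lacks
    return [round(rng.lognormvariate(6.0, 1.2), 2) for _ in range(n)]


def digest(values):
    """Hash a canonical text rendering so two runs can be compared cheaply."""
    payload = "\n".join(f"{v:.2f}" for v in values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_a = digest(generate_amounts(seed=42))
run_b = digest(generate_amounts(seed=42))
run_c = digest(generate_amounts(seed=43))

print(run_a == run_b)  # True: same seed, same output
print(run_a == run_c)  # False: different seed, different output
```

The same comparison works on real output by hashing the generated files from two runs that share a config and seed.
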
```bash
cd python && pip install -e ".[all]"
```

```python
from datasynth_py import DataSynth
from datasynth_py.config import blueprints

config = blueprints.retail_small(companies=4, transactions=10000)
result = DataSynth().generate(config=config, output={"format": "csv", "sink": "temp_dir"})
```

Blueprints: retail_small(), banking_medium(), manufacturing_large(), ml_training(), with_distributions(), with_diffusion(), with_causal()
Integrations: Apache Spark, dbt, Apache Airflow, MLflow. See Python SDK.
```bash
cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051 --api-keys "key1,key2"
```

REST, gRPC, and WebSocket APIs with JWT/OIDC authentication, rate limiting, and RBAC. Docker + Kubernetes Helm chart included. See Server & API and Deployment Guide.
| Guide | Content |
|---|---|
| Getting Started | Installation, quick start, demo mode |
| Configuration | YAML reference (40+ sections), presets, NL config |
| CLI Reference | All commands and flags |
| AI Capabilities | Neural diffusion, auto-tune, adversarial, anomaly designer |
| Scenario Engine | Counterfactual simulation, scenario library, .dss format |
| Audit FSM | 10 blueprints, step dispatcher, C2CE lifecycle |
| Banking & AML | 20 typologies, networks, velocity features |
| Fingerprinting | Extract → synthesize pipeline |
| Architecture | 16 crates, pipeline phases, performance |
| Python SDK | Client, blueprints, Spark/dbt/Airflow/MLflow |
| Server & API | REST/gRPC/WebSocket, auth, rate limiting |
| Deployment | Docker, Kubernetes, systemd |
| Contributing | Development setup, PR guidelines |
| Changelog | Full version history |
Build the documentation site locally: cd docs/book && mdbook serve
Copyright 2024-2026 Michael Ivertowski. Licensed under the Apache License, Version 2.0. See LICENSE.
Commercial support, custom development, and enterprise licensing: vynfi.com | GitHub Issues