mivertowski/SyntheticData

DataSynth v3.0.0

Synthetic enterprise data generation for ML training, audit analytics, and system testing.

DataSynth generates statistically realistic, fully interconnected enterprise financial data across 20+ process families. Generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford's Law, log-normal mixtures), and maintains referential integrity across 100+ output tables. Generation-time assertions enforce these invariants at scale.
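
The invariants named above can be checked independently on any generated extract. The following is a minimal sketch, not DataSynth's own assertion code: the `(debit, credit)` line layout and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from decimal import Decimal


def is_balanced(lines):
    """Return True when total debits equal total credits for one journal entry.

    `lines` is a list of (debit, credit) Decimal pairs (assumed layout).
    """
    total_debit = sum((d for d, _ in lines), Decimal("0"))
    total_credit = sum((c for _, c in lines), Decimal("0"))
    return total_debit == total_credit


def benford_expected(digit):
    """Benford's Law probability of leading digit d: log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)


def leading_digit(amount):
    """First significant digit of a non-zero amount."""
    # Scientific notation puts the first significant digit at index 0.
    return int(f"{abs(amount):.15e}"[0])


def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit frequencies."""
    counts = Counter(leading_digit(a) for a in amounts if a != 0)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one non-zero amount")
    return sum(abs(counts.get(d, 0) / n - benford_expected(d)) for d in range(1, 10)) / 9


entry = [(Decimal("100.00"), Decimal("0")), (Decimal("0"), Decimal("100.00"))]
print(is_balanced(entry), round(benford_expected(1), 3))
```

A low `benford_mad` value indicates first digits close to the Benford distribution; any pass/fail threshold is a choice left to the reader.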

Full Documentation | Commercial SDKs | CHANGELOG


Example Datasets

Pre-generated datasets at huggingface.co/VynFi:

| Dataset | Records | Description |
| --- | --- | --- |
| vynfi-aml-100k | 749K | Banking transactions with AML labels, 14 velocity features, 59 columns |
| vynfi-audit-p2p | 234 | P2P document chain (PO/GR/VI/Payment) with fraud labels |
| vynfi-ocel-manufacturing | 344 | OCEL event log for process mining (pm4py, Celonis) |

# Load a dataset with the Hugging Face datasets library
from datasets import load_dataset
ds = load_dataset("VynFi/vynfi-aml-100k", split="train")
df = ds.to_pandas()

All datasets: Apache 2.0, entirely synthetic, no PII.
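
The AML dataset above lists velocity features without defining them here. As a rough illustration of what such a feature typically measures, the sketch below counts each account's prior transactions inside a trailing window. The function name, the 24-hour window, and the `(account_id, timestamp)` layout are assumptions and do not describe the dataset's actual 14 features.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def trailing_counts(transactions, window=timedelta(hours=24)):
    """For each transaction, count earlier transactions by the same account within `window`.

    `transactions` is an iterable of (account_id, timestamp) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # account_id -> timestamps still inside the window
    counts = []
    for account_id, ts in transactions:
        q = recent[account_id]
        # Drop timestamps that have fallen out of the trailing window.
        while q and ts - q[0] > window:
            q.popleft()
        counts.append(len(q))
        q.append(ts)
    return counts


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    ("A", t0),
    ("A", t0 + timedelta(hours=1)),
    ("B", t0 + timedelta(hours=2)),
    ("A", t0 + timedelta(hours=3)),
    ("A", t0 + timedelta(hours=30)),
]
print(trailing_counts(sample))
```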


Quick Start

# Build
git clone https://github.com/mivertowski/SyntheticData.git && cd SyntheticData
cargo build --release

# Demo — generates a complete dataset with defaults
./target/release/datasynth-data generate --demo --output ./output

# Full audit simulation (113+ output files)
./target/release/datasynth-data generate --demo --preset audit-group --output ./audit

# Configure and generate
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

# AI-powered config generation (set OPENAI_API_KEY, ANTHROPIC_API_KEY, or OPENROUTER_API_KEY)
cargo build --release --features llm
OPENAI_API_KEY=sk-... ./target/release/datasynth-data init \
  --from-description "12 months of mid-market retail data with fraud and SOX controls" -o config.yaml

# Counterfactual scenario simulation
./target/release/datasynth-data scenario list --config config.yaml
./target/release/datasynth-data scenario generate --config config.yaml --output ./output

# Auto-tuning: generate → evaluate → AI patch → regenerate
./target/release/datasynth-data generate --config config.yaml --output ./output --auto-tune --max-iterations 3

See the CLI Reference for all commands and flags.
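
When driving the CLI from a script, the `generate` invocations above can be assembled programmatically. This is a small sketch that uses only the flags shown in this Quick Start; the helper name `build_generate_command` is hypothetical and the binary path assumes a release build in the repository root.

```python
import shlex


def build_generate_command(output, config=None, demo=False, preset=None,
                           binary="./target/release/datasynth-data"):
    """Assemble the argv list for `datasynth-data generate` from Quick Start flags."""
    cmd = [binary, "generate"]
    if demo:
        cmd.append("--demo")
    if preset is not None:
        cmd += ["--preset", preset]
    if config is not None:
        cmd += ["--config", config]
    cmd += ["--output", output]
    return cmd


cmd = build_generate_command("./audit", demo=True, preset="audit-group")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.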


Key Capabilities

Enterprise Process Simulation

Every process chain generates cross-referenced master data, documents, and journal entries:

| Process Family | Scope |
| --- | --- |
| General Ledger | Journal entries, chart of accounts (small/medium/large), ACDOCA |
| Procure-to-Pay | POs, goods receipts, vendor invoices, payments, three-way match |
| Order-to-Cash | Sales orders, deliveries, customer invoices, receipts, dunning |
| Source-to-Contract | Spend analysis, sourcing, RFx, bids, contracts, scorecards |
| Hire-to-Retire | Payroll, time & attendance, expenses, benefits, pensions, stock comp |
| Manufacturing | Production orders, BOM, WIP costing, quality inspections, cycle counts |
| Financial Reporting | BS/IS/CF, equity changes, KPIs, budgets, segment reporting, notes, XBRL |
| Tax | Multi-jurisdiction, VAT/GST, ASC 740/IAS 12 provisions, deferred tax |
| Treasury | Cash positioning, forecasts, pooling, hedging (ASC 815/IFRS 9), covenants |
| ESG | GHG Scope 1/2/3, energy/water/waste, diversity, GRI/SASB/TCFD |
| Banking / AML | 20 AML typologies, criminal networks, velocity features, KYC |
| Audit | ISA lifecycle, ISA 600 group audit, SOX 302/404, 10 methodology blueprints |
| Intercompany | IC matching, transfer pricing, eliminations, currency translation |
| Period Close | Depreciation, accruals, year-end closing, tax provisions |
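
To make the three-way match in the Procure-to-Pay row concrete, here is a minimal sketch of the check as it is commonly defined: the invoice is compared against the purchase order and the goods receipt. The `DocLine` type, the 1% price tolerance, and the issue labels are assumptions for illustration and are not taken from DataSynth's schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DocLine:
    """One simplified line of a PO, goods receipt, or vendor invoice (assumed fields)."""
    po_number: str
    quantity: Decimal
    unit_price: Decimal


def three_way_match(po, receipt, invoice, price_tolerance=Decimal("0.01")):
    """Return a list of mismatch reasons; an empty list means the documents match."""
    issues = []
    if not (po.po_number == receipt.po_number == invoice.po_number):
        issues.append("po_reference_mismatch")
    if receipt.quantity > po.quantity:
        issues.append("received_more_than_ordered")
    if invoice.quantity > receipt.quantity:
        issues.append("invoiced_more_than_received")
    # Relative price tolerance against the PO price.
    if abs(invoice.unit_price - po.unit_price) > price_tolerance * po.unit_price:
        issues.append("price_variance")
    return issues


po = DocLine("PO-1", Decimal("10"), Decimal("10.00"))
gr = DocLine("PO-1", Decimal("10"), Decimal("10.00"))
vi = DocLine("PO-1", Decimal("12"), Decimal("10.50"))
print(three_way_match(po, gr, vi))
```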

AI Capabilities

| Feature | Description | Feature Flag |
| --- | --- | --- |
| Neural Diffusion | Candle-powered score network, denoising score matching, hybrid blending | neural |
| LLM Config Generation | Natural language → YAML config (OpenAI/Anthropic/OpenRouter) | llm |
| Auto-Tune | Generate → evaluate → AI patch → regenerate closed loop | |
| Adversarial Testing | ONNX model boundary probing via ort | adversarial |
| Anomaly Designer | LLM-designed fraud schemes adapted to control environment | |
| Tabular Transformer | Masked column prediction for conditional generation | neural |
| GNN Graph Generator | Message-passing GNN for entity relationship structure | neural |

See AI Capabilities for details.

Counterfactual Simulation

Define scenarios with typed interventions, generate paired baseline/counterfactual datasets with causal DAG propagation:

scenarios:
  enabled: true
  scenarios:
    - name: supply_chain_disruption
      interventions:
        - type: parameter_shift
          target: distributions.amounts.components[0].mu
          value: "6.5"
          timing: { start_month: 7, duration_months: 4, onset: sudden }
      constraints:
        preserve_accounting_identity: true
      output:
        paired: true

11 pre-built scenarios across fraud, control failures, macro shocks, and operational disruptions. See Scenario Library.
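
The idea behind a paired `parameter_shift` can be sketched in a few lines: draw the baseline and the counterfactual from the same seed, and override the log-normal `mu` only inside the intervention window. This is an illustrative sketch, not the scenario engine itself. The baseline `mu` of 5.0, the `sigma` of 1.0, and the function name are assumptions, and causal DAG propagation is not modelled.

```python
import math
import random


def monthly_amounts(seed, months=12, per_month=100, mu=5.0, sigma=1.0, shift=None):
    """Draw log-normal amounts per month; `shift` overrides mu inside its window.

    shift = {"value": float, "start_month": int, "duration_months": int}
    mirrors the parameter_shift fields in the YAML above (sudden onset only).
    """
    rng = random.Random(seed)
    out = {}
    for month in range(1, months + 1):
        month_mu = mu
        if shift is not None:
            end = shift["start_month"] + shift["duration_months"]
            if shift["start_month"] <= month < end:
                month_mu = shift["value"]
        out[month] = [rng.lognormvariate(month_mu, sigma) for _ in range(per_month)]
    return out


shift = {"value": 6.5, "start_month": 7, "duration_months": 4}
baseline = monthly_amounts(seed=42)
counterfactual = monthly_amounts(seed=42, shift=shift)

# Months outside the window are identical; inside it, each paired draw is scaled.
print(baseline[6] == counterfactual[6], counterfactual[7][0] / baseline[7][0])
```

Because both runs consume the random stream identically, every difference between the two datasets is attributable to the intervention, which is what makes the pair usable for counterfactual comparison.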

Accounting & Compliance Standards

US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB), dual reporting. Revenue recognition (ASC 606/IFRS 15), leases (ASC 842/IFRS 16), fair value (ASC 820/IFRS 13), impairment, deferred tax, ECL, pensions, stock comp, business combinations, segment reporting. ISA (34 standards), PCAOB (19+), SOX 302/404, COSO 2013 (5 components, 17 principles). FEC and GoBD audit file exports.

Audit FSM Engine

YAML-driven methodology-agnostic state machine with 10 built-in blueprints (FSA, IA, KPMG, PwC, Deloitte, EY GAM, SOC 2, PCAOB, Regulatory). See Audit FSM.
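
A data-driven state machine of this kind can be sketched as a transition table keyed by `(state, event)`. The blueprint below is hypothetical: its states, events, and dictionary layout are invented for illustration and do not reproduce any built-in blueprint or the engine's YAML schema.

```python
class AuditFsm:
    """Minimal finite state machine driven by a declarative blueprint."""

    def __init__(self, blueprint):
        self.state = blueprint["initial"]
        # Index transitions by (current state, event) for O(1) lookup.
        self.transitions = {
            (t["from"], t["event"]): t["to"] for t in blueprint["transitions"]
        }
        self.history = [self.state]

    def fire(self, event):
        """Apply `event`, or raise ValueError if it is not allowed in the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        self.history.append(self.state)
        return self.state


# Hypothetical blueprint; in practice this structure would be parsed from a YAML file.
BLUEPRINT = {
    "initial": "planning",
    "transitions": [
        {"from": "planning", "event": "approve_plan", "to": "fieldwork"},
        {"from": "fieldwork", "event": "complete_testing", "to": "review"},
        {"from": "review", "event": "raise_finding", "to": "fieldwork"},
        {"from": "review", "event": "sign_off", "to": "reporting"},
    ],
}

fsm = AuditFsm(BLUEPRINT)
for event in ("approve_plan", "complete_testing", "sign_off"):
    fsm.fire(event)
print(fsm.history)
```

Keeping the methodology in data rather than code is what lets one engine run several blueprints without changes to the engine itself.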


Architecture

16 crates in a Rust workspace:

datasynth-cli              CLI binary (generate, validate, init, scenario, adversarial, audit)
datasynth-server           REST / gRPC / WebSocket server with auth and rate limiting
datasynth-runtime          Generation orchestrator (phases, assertions, streaming)
datasynth-generators       50+ generators across all process families
datasynth-banking          KYC/AML with 20 typologies and criminal networks
datasynth-eval             Evaluation framework, auto-tuning, adversarial testing
datasynth-config           YAML configuration, validation, industry presets
datasynth-core             306 domain models, distributions, diffusion, LLM provider
datasynth-graph            Graph export (PyG, Neo4j, DGL, hypergraph)
datasynth-standards        IFRS, US GAAP, ISA, SOX, PCAOB standards
datasynth-audit-fsm        YAML-driven audit FSM (10 blueprints)
datasynth-audit-optimizer  Audit optimization, Monte Carlo, group audit simulation
datasynth-ocpm             OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint      Privacy-preserving fingerprint extraction and synthesis
datasynth-output           CSV, JSON, Parquet sinks with streaming
datasynth-test-utils       Test fixtures and utilities

See Architecture and Generation Pipeline.


Performance

| Metric | Value |
| --- | --- |
| Generation throughput | ~14,000 JEs/sec |
| XXL dataset (200K+ JEs, 3 companies, 36 months) | 20.6s CSV-only |
| CSV-only speedup | 4x faster (skips JSON serialization) |
| Peak memory at scale | ~4.3 GB for 200K+ JEs |
| Determinism | Fully reproducible via seeded ChaCha8 RNG |

See Performance Benchmarks.
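
Seeded determinism can be verified by hashing the output of two runs. The sketch below shows the principle only: Python's `random` module uses the Mersenne Twister rather than ChaCha8, and the function names and distribution parameters are assumptions.

```python
import hashlib
import random


def generate_amounts(seed, n=1000):
    """Draw n amounts from a local, seeded generator (no shared global state)."""
    rng = random.Random(seed)
    return [round(rng.lognormvariate(5.0, 1.0), 2) for _ in range(n)]


def digest(values):
    """SHA-256 over the textual representation of the generated values."""
    return hashlib.sha256(repr(values).encode("utf-8")).hexdigest()


# Same seed gives the same digest; a different seed gives a different one.
print(digest(generate_amounts(42)) == digest(generate_amounts(42)))
print(digest(generate_amounts(42)) == digest(generate_amounts(43)))
```

The same comparison applies to real output: hash the files from two runs with an identical seed and configuration and compare the digests.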


Python SDK

# Install from the repository's python/ directory
cd python && pip install -e ".[all]"

# Generate a small retail dataset as CSV into a temporary directory
from datasynth_py import DataSynth
from datasynth_py.config import blueprints

config = blueprints.retail_small(companies=4, transactions=10000)
result = DataSynth().generate(config=config, output={"format": "csv", "sink": "temp_dir"})

Blueprints: retail_small(), banking_medium(), manufacturing_large(), ml_training(), with_distributions(), with_diffusion(), with_causal()

Integrations: Apache Spark, dbt, Apache Airflow, MLflow. See Python SDK.


Server & Deployment

cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051 --api-keys "key1,key2"

REST, gRPC, and WebSocket APIs with JWT/OIDC authentication, rate limiting, and RBAC. Docker + Kubernetes Helm chart included. See Server & API and Deployment Guide.


Documentation

| Guide | Content |
| --- | --- |
| Getting Started | Installation, quick start, demo mode |
| Configuration | YAML reference (40+ sections), presets, NL config |
| CLI Reference | All commands and flags |
| AI Capabilities | Neural diffusion, auto-tune, adversarial, anomaly designer |
| Scenario Engine | Counterfactual simulation, scenario library, .dss format |
| Audit FSM | 10 blueprints, step dispatcher, C2CE lifecycle |
| Banking & AML | 20 typologies, networks, velocity features |
| Fingerprinting | Extract → synthesize pipeline |
| Architecture | 16 crates, pipeline phases, performance |
| Python SDK | Client, blueprints, Spark/dbt/Airflow/MLflow |
| Server & API | REST/gRPC/WebSocket, auth, rate limiting |
| Deployment | Docker, Kubernetes, systemd |
| Contributing | Development setup, PR guidelines |
| Changelog | Full version history |

Build the documentation site locally: cd docs/book && mdbook serve


License

Copyright 2024-2026 Michael Ivertowski. Licensed under the Apache License, Version 2.0. See LICENSE.


Support

Commercial support, custom development, and enterprise licensing: vynfi.com | GitHub Issues

About

High-performance synthetic enterprise data generator. Produces 100+ interconnected financial tables — GL journal entries, document flows, subledgers, banking/KYC/AML, process mining (OCEL 2.0), graph exports (PyTorch Geometric, Neo4j), and 20+ process chains — with Benford's Law compliance, ACFE-aligned fraud labels, and formal privacy guarantees.
