RSR-core

RSR (Redundant Segment Reduction) for efficient low-bit inference (matrix-vector multiplication).

This repository contains the core kernels, model integrations, and benchmarking code for RSR across CPU and CUDA backends. RSR targets fast matrix-vector multiplication when the matrix is low-bit quantized by grouping repeated column patterns, aggregating the corresponding input values once, and then scattering the result to the affected output rows.

This is especially useful for workloads such as low-bit LLM inference, where decoding repeatedly applies quantized matvec operations. For the original algorithm, see UIC-InDeXLab/RSR and docs/ALGORITHM.md.

Demo 🎬

Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. HF denotes the Hugging Face baseline running bfloat16 on PyTorch.

PROMPT: "Write the numbers from one to sixty in words separated by commas only:"

Installation 🛠️

Prerequisites: Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.

git clone https://github.com/UIC-InDeXLab/RSR-Core.git
cd RSR-Core
pip install -e .

Benchmark Results 📊

Matrix-Vector Multiplication

CPU 🖥️

1-bit	1.58-bit

CUDA ⚡

1-bit	1.58-bit

Ternary (1.58bit) LLMs

Speedup is computed against the HuggingFace bfloat16 baseline for the same model.

CPU 🖥️

Model	HF Tok/s	RSR Tok/s	Speedup
Falcon3-10B-Instruct-1.58bit	0.2	11.3	62.0x
Llama3-8B-1.58-100B-tokens	0.2	13.4	53.8x
bitnet-b1.58-2B-4T-bf16	2.1	28.8	13.9x
bitnet-b1.58-2B-4T	14.2	29.3	2.1x

CUDA ⚡

Model	HF Tok/s	RSR Tok/s	Speedup
Falcon3-10B-Instruct-1.58bit	25.2	47.4	1.9x
Llama3-8B-1.58-100B-tokens	31.9	59.3	1.9x
bitnet-b1.58-2B-4T-bf16	33.1	57.4	1.7x
bitnet-b1.58-2B-4T	41.6	57.1	1.4x

Updates 📝

Project Structure 🗂️

RSR-core/
├── multiplier/             # Python wrappers for kernels
│   ├── bit_1/              # 1-bit (binary) multipliers (CPU/CUDA)
│   └── bit_1_58/           # 1.58-bit (ternary) multipliers (CPU/CUDA)
├── kernels/                # Low-level C/CUDA kernel source
│   ├── bit_1/
│   │   ├── cpu/            #   C kernels
│   │   └── cuda/           #   CUDA kernels (.cu)
│   └── bit_1_58/
│       ├── cpu/            #   C kernels
│       └── cuda/           #   CUDA kernels (.cu)
├── integrations/           # Model integrations
│   └── hf/                 #   HuggingFace integration
├── benchmarking/           # Benchmarking scripts & results
└── tests/                  # Unit and integration tests

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github/workflows		.github/workflows
assets		assets
benchmarking		benchmarking
docs		docs
integrations		integrations
kernels		kernels
multiplier		multiplier
tests		tests
.gitignore		.gitignore
README.md		README.md
conftest.py		conftest.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RSR-core

Demo 🎬

Installation 🛠️

Benchmark Results 📊

Matrix-Vector Multiplication

CPU 🖥️

CUDA ⚡

Ternary (1.58bit) LLMs

CPU 🖥️

CUDA ⚡

Updates 📝

Project Structure 🗂️

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RSR-core

Demo 🎬

Installation 🛠️

Benchmark Results 📊

Matrix-Vector Multiplication

CPU 🖥️

CUDA ⚡

Ternary (1.58bit) LLMs

CPU 🖥️

CUDA ⚡

Updates 📝

Project Structure 🗂️

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages