Benchmarking Tools

BitNet includes comprehensive benchmarking utilities to measure inference performance. These tools help you understand the performance characteristics of BitNet models on your hardware.

End-to-End Benchmark

The e2e_benchmark.py script provides end-to-end performance measurements including throughput, latency, and memory usage. For detailed usage instructions, see our Usage Guide.

Running Benchmarks

Run End-to-End Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 4

e2e_benchmark.py Arguments

Option       Short   Description                    Default
--model      -m      Path to the model file         (required)
--n-token    -n      Number of generated tokens     128
--n-prompt   -p      Number of prompt tokens        512
--threads    -t      Number of threads to use       2
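
If you want to drive the benchmark from another script, the sketch below wraps the documented flags with Python's subprocess module. The run_benchmark helper is hypothetical (it is not part of BitNet) and assumes you run it from the repository root with your environment already set up.

Programmatic Benchmark Invocation (Sketch)
import subprocess

# Hypothetical helper (not part of BitNet): wraps the documented
# e2e_benchmark.py flags, using the same defaults as the table above.
def run_benchmark(model_path, n_token=128, n_prompt=512, threads=2):
    cmd = [
        "python", "utils/e2e_benchmark.py",
        "-m", model_path,
        "-n", str(n_token),
        "-p", str(n_prompt),
        "-t", str(threads),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(run_benchmark("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf", threads=4))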

Example Benchmark Scenarios

Short Generation Benchmark

Short Generation
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 100 \
  -p 128 \
  -t 4

Long Generation Benchmark

Long Generation
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 500 \
  -p 1024 \
  -t 4

Multi-threaded CPU Benchmark

CPU Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 8

Benchmarking Custom Models

To benchmark model layouts that are not available in publicly released models, BitNet provides a utility for generating dummy models for testing:

Generate Dummy Model
python utils/generate-dummy-bitnet-model.py \
  models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M

Then run the benchmark with the generated model:

Benchmark Dummy Model
python utils/e2e_benchmark.py \
  -m models/dummy-bitnet-125m.tl1.gguf \
  -p 512 \
  -n 128

Performance Characteristics

Memory Efficiency

BitNet b1.58 models use ternary (1.58-bit) weights; the i2_s format stores each weight in roughly 2 bits, cutting weight memory by roughly 8x compared to FP16 models. This makes it possible to run larger models on hardware with limited memory.
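
As a rough back-of-envelope illustration (a sketch only; real file sizes also include embeddings and other tensors that are not quantized to 2 bits):

Weight Memory Estimate (Sketch)
# Rough weight-memory estimate; ignores embeddings, activations, and the KV cache.
params = 2_000_000_000            # ~2B parameters (BitNet-b1.58-2B-4T)

fp16_bytes = params * 2           # 16 bits per weight
i2s_bytes = params * 2 / 8        # ~2 bits per weight in the i2_s format

print(f"FP16: {fp16_bytes / 1e9:.1f} GB")   # ~4.0 GB
print(f"i2_s: {i2s_bytes / 1e9:.1f} GB")    # ~0.5 GB, roughly 8x smaller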

Inference Speed

Custom low-bit kernels optimized for ternary weights (such as the I2_S and lookup-table TL1/TL2 CPU kernels) enable faster inference than traditional floating-point implementations. Performance varies based on:

  • Hardware (CPU vs GPU)
  • Model size
  • Context window size
  • Number of threads (for CPU inference)
  • Batch size

Throughput

Throughput depends on hardware capabilities, model size, and batch size. Where GPU kernels are available for a given model, GPU acceleration can improve throughput substantially over CPU-only inference.

Benchmarking Best Practices

  • Warm-up Runs: Always run a few warm-up iterations before measuring so that results are stable
  • Multiple Runs: Run benchmarks several times and average the results (a harness combining both practices is sketched after this list)
  • Consistent Environment: Ensure consistent hardware and software configuration across runs
  • Context Size: Test with various context sizes relevant to your use case
  • Thread Count: Optimize thread count for your CPU configuration
  • Monitor Resources: Monitor CPU, GPU, and memory usage during benchmarks
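
The sketch below combines warm-up runs with averaging over several measured runs. It assumes it is run from the repository root, and it times whole-process wall-clock time, which includes model loading, so treat the numbers as a coarse proxy rather than pure inference throughput.

Warm-up and Averaging Harness (Sketch)
import statistics
import subprocess
import time

# Same flags as the examples above; adjust to your model and hardware.
CMD = ["python", "utils/e2e_benchmark.py",
       "-m", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
       "-n", "128", "-p", "512", "-t", "4"]

WARMUP_RUNS, MEASURED_RUNS = 2, 5   # illustrative counts

for _ in range(WARMUP_RUNS):        # warm-up runs: results discarded
    subprocess.run(CMD, check=True, capture_output=True)

times = []
for _ in range(MEASURED_RUNS):      # measured runs
    start = time.perf_counter()
    subprocess.run(CMD, check=True, capture_output=True)
    times.append(time.perf_counter() - start)

print(f"mean {statistics.mean(times):.2f} s, stdev {statistics.stdev(times):.2f} s")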

Interpreting Results

Benchmark results typically include:

  • Throughput: Tokens per second (higher is better)
  • Latency: Time per token or per request (lower is better); for single-stream decoding it is the reciprocal of throughput, as in the worked example after this list
  • Memory Usage: Peak memory consumption during inference
  • Initialization Time: Time to load and initialize the model
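
For single-stream decoding, per-token latency is simply the reciprocal of throughput. A quick worked example with illustrative numbers:

Throughput vs. Latency (Worked Example)
# Illustrative numbers; substitute your own measurements.
tokens_generated = 128
elapsed_seconds = 8.0

throughput_tps = tokens_generated / elapsed_seconds   # 16.0 tokens/s
latency_ms_per_token = 1000.0 / throughput_tps        # 62.5 ms/token

print(f"{throughput_tps:.1f} tokens/s, {latency_ms_per_token:.1f} ms/token")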

Comparing Models

When comparing models:

  • Use consistent benchmark parameters across models (see the sketch after this list)
  • Test on the same hardware configuration
  • Consider both performance and quality metrics
  • Factor in memory requirements for your deployment environment
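
A minimal sketch that applies identical parameters to each entry in a hypothetical model list (the second path reuses the dummy model generated above):

Compare Models with Identical Parameters (Sketch)
import subprocess

# Hypothetical model list; the point is to keep -n, -p, and -t identical.
MODELS = [
    "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
    "models/dummy-bitnet-125m.tl1.gguf",
]

for model in MODELS:
    print(f"=== {model} ===")
    subprocess.run(
        ["python", "utils/e2e_benchmark.py",
         "-m", model, "-n", "128", "-p", "512", "-t", "4"],
        check=True,
    )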

Related Resources