Benchmarking Tools

BitNet includes comprehensive benchmarking utilities to measure inference performance. These tools help you understand the performance characteristics of BitNet models on your hardware.

End-to-End Benchmark

The e2e_benchmark.py script provides end-to-end performance measurements including throughput, latency, and memory usage. For detailed usage instructions, see our Usage Guide.

Running Benchmarks

Run End-to-End Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 4

e2e_benchmark.py Arguments

Option       Short   Description                    Default
--model      -m      Path to the model file         (required)
--n-token    -n      Number of generated tokens     128
--n-prompt   -p      Number of prompt tokens        512
--threads    -t      Number of threads to use       2
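
If you want to drive the benchmark from another script, the sketch below wraps the documented flags with Python's subprocess module. The run_benchmark helper is hypothetical (it is not part of BitNet) and assumes you run it from the repository root with your environment already set up.

Programmatic Benchmark Invocation (Sketch)
import subprocess

# Hypothetical helper (not part of BitNet): wraps the documented
# e2e_benchmark.py flags, using the same defaults as the table above.
def run_benchmark(model_path, n_token=128, n_prompt=512, threads=2):
    cmd = [
        "python", "utils/e2e_benchmark.py",
        "-m", model_path,
        "-n", str(n_token),
        "-p", str(n_prompt),
        "-t", str(threads),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(run_benchmark("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf", threads=4))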

Example Benchmark Scenarios

Short Generation Benchmark

Short Generation
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 100 \
  -p 128 \
  -t 4

Long Generation Benchmark

Long Generation
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 500 \
  -p 1024 \
  -t 4

Multi-threaded CPU Benchmark

CPU Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 8

Benchmarking Custom Models

To benchmark model layouts that are not available in publicly released models, BitNet provides a utility for generating dummy models for testing:

Generate Dummy Model
python utils/generate-dummy-bitnet-model.py \
  models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M

Then run the benchmark with the generated model:

Benchmark Dummy Model
python utils/e2e_benchmark.py \
  -m models/dummy-bitnet-125m.tl1.gguf \
  -p 512 \
  -n 128

Performance Characteristics

Memory Efficiency

BitNet b1.58 models use ternary (1.58-bit) weights; the i2_s format stores each weight in roughly 2 bits, cutting weight memory by roughly 8x compared to FP16 models. This makes it possible to run larger models on hardware with limited memory.
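
As a rough back-of-envelope illustration (a sketch only; real file sizes also include embeddings and other tensors that are not quantized to 2 bits):

Weight Memory Estimate (Sketch)
# Rough weight-memory estimate; ignores embeddings, activations, and the KV cache.
params = 2_000_000_000            # ~2B parameters (BitNet-b1.58-2B-4T)

fp16_bytes = params * 2           # 16 bits per weight
i2s_bytes = params * 2 / 8        # ~2 bits per weight in the i2_s format

print(f"FP16: {fp16_bytes / 1e9:.1f} GB")   # ~4.0 GB
print(f"i2_s: {i2s_bytes / 1e9:.1f} GB")    # ~0.5 GB, roughly 8x smaller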

Inference Speed

Custom low-bit kernels optimized for ternary weights (such as the I2_S and lookup-table TL1/TL2 CPU kernels) enable faster inference than traditional floating-point implementations. Performance varies based on:

  • Hardware (CPU vs GPU)
  • Model size
  • Context window size
  • Number of threads (for CPU inference)
  • Batch size

Throughput

Throughput depends on hardware capabilities, model size, and batch size. Where GPU kernels are available for a given model, GPU acceleration can improve throughput substantially over CPU-only inference.

Benchmarking Best Practices

  • Warm-up Runs: Always run a few warm-up iterations before measuring so that results are stable
  • Multiple Runs: Run benchmarks several times and average the results (a harness combining both practices is sketched after this list)
  • Consistent Environment: Ensure consistent hardware and software configuration across runs
  • Context Size: Test with various context sizes relevant to your use case
  • Thread Count: Optimize thread count for your CPU configuration
  • Monitor Resources: Monitor CPU, GPU, and memory usage during benchmarks
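
The sketch below combines warm-up runs with averaging over several measured runs. It assumes it is run from the repository root, and it times whole-process wall-clock time, which includes model loading, so treat the numbers as a coarse proxy rather than pure inference throughput.

Warm-up and Averaging Harness (Sketch)
import statistics
import subprocess
import time

# Same flags as the examples above; adjust to your model and hardware.
CMD = ["python", "utils/e2e_benchmark.py",
       "-m", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
       "-n", "128", "-p", "512", "-t", "4"]

WARMUP_RUNS, MEASURED_RUNS = 2, 5   # illustrative counts

for _ in range(WARMUP_RUNS):        # warm-up runs: results discarded
    subprocess.run(CMD, check=True, capture_output=True)

times = []
for _ in range(MEASURED_RUNS):      # measured runs
    start = time.perf_counter()
    subprocess.run(CMD, check=True, capture_output=True)
    times.append(time.perf_counter() - start)

print(f"mean {statistics.mean(times):.2f} s, stdev {statistics.stdev(times):.2f} s")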

Interpreting Results

Benchmark results typically include:

  • Throughput: Tokens per second (higher is better)
  • Latency: Time per token or per request (lower is better); for single-stream decoding it is the reciprocal of throughput, as in the worked example after this list
  • Memory Usage: Peak memory consumption during inference
  • Initialization Time: Time to load and initialize the model
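
For single-stream decoding, per-token latency is simply the reciprocal of throughput. A quick worked example with illustrative numbers:

Throughput vs. Latency (Worked Example)
# Illustrative numbers; substitute your own measurements.
tokens_generated = 128
elapsed_seconds = 8.0

throughput_tps = tokens_generated / elapsed_seconds   # 16.0 tokens/s
latency_ms_per_token = 1000.0 / throughput_tps        # 62.5 ms/token

print(f"{throughput_tps:.1f} tokens/s, {latency_ms_per_token:.1f} ms/token")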

Comparing Models

When comparing models:

  • Use consistent benchmark parameters across models (see the sketch after this list)
  • Test on the same hardware configuration
  • Consider both performance and quality metrics
  • Factor in memory requirements for your deployment environment
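
A minimal sketch that applies identical parameters to each entry in a hypothetical model list (the second path reuses the dummy model generated above):

Compare Models with Identical Parameters (Sketch)
import subprocess

# Hypothetical model list; the point is to keep -n, -p, and -t identical.
MODELS = [
    "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
    "models/dummy-bitnet-125m.tl1.gguf",
]

for model in MODELS:
    print(f"=== {model} ===")
    subprocess.run(
        ["python", "utils/e2e_benchmark.py",
         "-m", model, "-n", "128", "-p", "512", "-t", "4"],
        check=True,
    )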

Related Resources