# Benchmark

Performance benchmarks and benchmarking tools for BitNet.
## Benchmarking Tools
BitNet includes comprehensive benchmarking utilities to measure inference performance. These tools help you understand the performance characteristics of BitNet models on your hardware.
### End-to-End Benchmark

The `e2e_benchmark.py` script provides end-to-end performance measurements, including throughput, latency, and memory usage. For detailed usage instructions, see our Usage Guide.
### Running Benchmarks

```bash
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 4
```
### `e2e_benchmark.py` Arguments

| Option | Short | Description | Default |
|---|---|---|---|
| `--model` | `-m` | Path to the model file | Required |
| `--n-token` | `-n` | Number of generated tokens | 128 |
| `--n-prompt` | `-p` | Number of prompt tokens | 512 |
| `--threads` | `-t` | Number of threads to use | 2 |
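Only the model path is required; when the other flags are omitted, the script falls back to the defaults in the table above. The following two invocations are therefore equivalent:

```bash
# Relies on the defaults: -n 128, -p 512, -t 2
python utils/e2e_benchmark.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf

# The same run with every default spelled out
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 128 \
  -p 512 \
  -t 2
```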
## Example Benchmark Scenarios

### Short Generation Benchmark

```bash
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 100 \
  -p 128 \
  -t 4
```
### Long Generation Benchmark

```bash
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 500 \
  -p 1024 \
  -t 4
```
### Multi-threaded CPU Benchmark

```bash
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 8
```
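The best thread count depends on your CPU, so it is worth sweeping a few values and comparing the reported numbers. The loop below is a minimal shell sketch, not part of the BitNet tooling; it reuses the model path from the examples above:

```bash
# Benchmark the same workload at several thread counts
for t in 1 2 4 8; do
  echo "=== threads: $t ==="
  python utils/e2e_benchmark.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -n 200 \
    -p 256 \
    -t "$t"
done
```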
## Benchmarking Custom Models

For model layouts not covered by the public models, BitNet provides a utility that generates dummy models for testing:

```bash
python utils/generate-dummy-bitnet-model.py \
  models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M
```
Then run the benchmark with the generated model:
```bash
python utils/e2e_benchmark.py \
  -m models/dummy-bitnet-125m.tl1.gguf \
  -p 512 \
  -n 128
```
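To profile how performance scales with model size, the same two steps can be looped over several dummy sizes. This is a hypothetical sketch: it assumes `generate-dummy-bitnet-model.py` accepts each of these `--model-size` values, which you should verify against the script's help output first:

```bash
# Hypothetical sweep over dummy model sizes
# (verify the supported --model-size values before running)
for size in 125M 350M 1B; do
  out="models/dummy-bitnet-${size}.tl1.gguf"
  python utils/generate-dummy-bitnet-model.py \
    models/bitnet_b1_58-large \
    --outfile "$out" \
    --outtype tl1 \
    --model-size "$size"
  python utils/e2e_benchmark.py -m "$out" -p 512 -n 128
done
```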
## Performance Characteristics

### Memory Efficiency

BitNet b1.58 models use ternary (1.58-bit) weight quantization. Packed at 2 bits per weight, as in the `i2_s` format, the weights occupy roughly one eighth of the memory of an FP16 model: a 2B-parameter model needs about 0.5 GB for weights instead of roughly 4 GB. This makes it possible to run larger models on hardware with limited memory.
### Inference Speed

Custom kernels optimized for low-bit operations (including CUDA kernels for GPU inference) enable faster inference than traditional floating-point implementations. Performance varies with:

- Hardware (CPU vs GPU)
- Model size
- Context window size (see the sweep sketch after this list)
- Number of threads (for CPU inference)
- Batch size
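To see how context size affects results on your machine, you can sweep the prompt length. This is a minimal shell sketch, not part of the BitNet tooling, reusing the model path from the earlier examples:

```bash
# Sweep prompt sizes to observe how performance scales with context length
for p in 128 256 512 1024; do
  echo "=== prompt tokens: $p ==="
  python utils/e2e_benchmark.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -n 128 \
    -p "$p" \
    -t 4
done
```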
### Throughput
Throughput depends on hardware capabilities and model size. GPU acceleration significantly improves throughput compared to CPU-only inference.
## Benchmarking Best Practices

- **Warm-up Runs**: Always run a few warm-up iterations before benchmarking to ensure stable results
- **Multiple Runs**: Run benchmarks multiple times and average the results (see the sketch after this list)
- **Consistent Environment**: Ensure consistent hardware and software configuration across runs
- **Context Size**: Test with various context sizes relevant to your use case
- **Thread Count**: Optimize the thread count for your CPU configuration
- **Monitor Resources**: Monitor CPU, GPU, and memory usage during benchmarks
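The first two practices can be combined in a small wrapper. This is a minimal shell sketch, not part of the BitNet tooling: one discarded warm-up run followed by three measured runs whose logs are kept so you can average the reported figures by hand:

```bash
MODEL=models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf

# Warm-up run: discard the output so caches and clock frequencies settle
python utils/e2e_benchmark.py -m "$MODEL" -n 200 -p 256 -t 4 > /dev/null

# Measured runs: keep each run's output for later averaging
for i in 1 2 3; do
  python utils/e2e_benchmark.py -m "$MODEL" -n 200 -p 256 -t 4 | tee "bench-run-$i.log"
done
```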
## Interpreting Results

Benchmark results typically include:

- **Throughput**: Tokens per second (higher is better)
- **Latency**: Time per token or time per request (lower is better)
- **Memory Usage**: Peak memory consumption during inference
- **Initialization Time**: Time to load and initialize the model
### Comparing Models

When comparing models:

- Use consistent benchmark parameters across models (see the sketch after this list)
- Test on the same hardware configuration
- Consider both performance and quality metrics
- Factor in memory requirements for your deployment environment
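A fair comparison fixes everything except the model path. This is a minimal shell sketch; the second model path is a hypothetical placeholder for whichever model you are comparing against:

```bash
# Run identical benchmark parameters against each model
# (the second path is a hypothetical placeholder)
for model in \
  models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  models/other-bitnet-model.gguf; do
  echo "=== $model ==="
  python utils/e2e_benchmark.py -m "$model" -n 200 -p 256 -t 4
done
```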
## Related Resources
- Usage Guide - Detailed usage instructions
- Models Page - Available models to benchmark
- Installation Guide - Setup instructions
- Documentation - Complete API reference
- Features Page - Understanding BitNet features