Basic Usage

Once you have BitNet installed and a model downloaded, you can start running inference. For installation instructions, see our Installation Guide. For available models, check out our Models Page.

Running Inference

The most common way to use BitNet is through the run_inference.py script:

Basic Inference Command
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Your prompt here"

run_inference.py Arguments

  • --model, -m: Path to the model file (required)
  • --prompt, -p: Prompt to generate text from (required)
  • --n-predict, -n: Number of tokens to predict when generating text (default: 128)
  • --threads, -t: Number of threads to use (default: auto-detect)
  • --ctx-size, -c: Size of the prompt context (default: 512)
  • --temperature, -temp: Temperature for text generation, 0.0-2.0 (default: 0.8)
  • --conversation, -cnv: Enable chat mode; the prompt is used as the system prompt (default: off)
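
Every short flag above maps to the long form listed with it. As a quick reference, the sketch below combines the long-form options in a single invocation; the prompt and values are illustrative:

Long-Form Invocation
# Illustrative values; adjust for your model and hardware
python run_inference.py \
  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  --prompt "Your prompt here" \
  --n-predict 128 \
  --threads 4 \
  --ctx-size 1024 \
  --temperature 0.8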

Example: Simple Text Generation

Simple Text Generation
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "The future of artificial intelligence is" \
  -n 100 \
  -temp 0.7

Example: Conversational AI

For instruction-tuned models, use conversation mode:

Conversation Mode
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "You are a helpful assistant" \
  -cnv

When using the -cnv flag, the prompt specified by -p is used as the system prompt, and you'll enter an interactive conversation mode.

Example: Custom Context Size

For longer prompts or conversations, increase the context size:

Custom Context Size
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Write a long story about..." \
  -c 2048 \
  -n 500

Example: CPU Optimization

Control the number of threads for CPU inference:

CPU Thread Control
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Your prompt" \
  -t 8

Advanced Usage

Model Setup

Before using a model, you need to set up the environment:

Model Setup
python setup_env.py \
  -md models/BitNet-b1.58-2B-4T \
  -q i2_s

setup_env.py Options

  • --hf-repo, -hr: Hugging Face model repo to use for inference; when given, the model is downloaded automatically (see the example after this list)
  • --model-dir, -md: Directory to save/load the model
  • --log-dir, -ld: Directory to save logging info
  • --quant-type, -q: Quantization type (i2_s or tl1)
  • --quant-embd: Quantize embeddings to f16
  • --use-pretuned, -p: Use pretuned kernel parameters
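
If you pass --hf-repo, setup_env.py downloads the model before setting it up, so you don't need a separate download step. This is a minimal sketch; the repo identifier below is an assumption, so run python setup_env.py -h to see the exact --hf-repo choices supported by your checkout:

Download and Set Up in One Step
# NOTE: the repo name is an assumption; check `python setup_env.py -h`
# for the supported --hf-repo values.
python setup_env.py \
  --hf-repo microsoft/BitNet-b1.58-2B-4T \
  -q i2_s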

Model Conversion

Convert models from .safetensors format to GGUF:

Model Conversion
# Download .safetensors model
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 \
  --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to GGUF
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
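
After conversion, point run_inference.py at the resulting GGUF file. The output path below is an assumption; the converter typically writes its output alongside the original model files, so list that directory to confirm the filename before running:

Run the Converted Model
# NOTE: the output filename below is an assumption; check the model
# directory for the .gguf actually produced by the converter.
python run_inference.py \
  -m models/bitnet-b1.58-2B-4T-bf16/ggml-model-i2_s.gguf \
  -p "Your prompt here"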

Benchmarking

BitNet includes benchmarking utilities to measure inference performance. For detailed benchmark information, see our Benchmark Page.

Run Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 4

e2e_benchmark.py Arguments

  • -m, --model: Path to the model file (required)
  • -n, --n-token: Number of generated tokens (default: 128)
  • -p, --n-prompt: Number of prompt tokens (default: 512)
  • -t, --threads: Number of threads to use (default: 2)
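
To see how thread count affects throughput on your machine, you can sweep -t with a small shell loop. This is a minimal sketch; the thread and token counts are illustrative:

Thread-Count Sweep
# Illustrative sweep; pick thread counts that match your CPU
for t in 1 2 4 8; do
  python utils/e2e_benchmark.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -n 128 -p 256 -t "$t"
done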

Best Practices

  • Use Appropriate Models: Choose models that fit your use case. See our Models Page for recommendations.
  • Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
  • Adjust Temperature: Use a lower temperature (0.0-0.7) for more deterministic output and a higher one (0.8-2.0) for more creative output (see the comparison after this list).
  • Use GPU When Available: GPU acceleration significantly improves inference speed.
  • Monitor Memory: Even with 1-bit quantization, large models still require significant memory.
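
As a concrete illustration of the temperature guideline above, compare a low-temperature and a high-temperature run of the same prompt; the prompt and values are arbitrary:

Temperature Comparison
# Low temperature: more deterministic, focused output
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Describe a sunrise" \
  -temp 0.2

# High temperature: more varied, creative output
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Describe a sunrise" \
  -temp 1.2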

Troubleshooting

If you encounter issues during usage, check our FAQ Page for solutions. Common issues include:

  • Model file not found errors
  • Out of memory errors
  • CUDA compatibility issues
  • Model format incompatibilities

Related Resources