Basic Usage

Once you have BitNet installed and a model downloaded, you can start running inference. For installation instructions, see our Installation Guide. For available models, check out our Models Page.

Running Inference

The most common way to use BitNet is through the run_inference.py script:

Basic Inference Command
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Your prompt here"

run_inference.py Arguments

  • --model, -m: Path to the model file (required)
  • --prompt, -p: Prompt to generate text from (required)
  • --n-predict, -n: Number of tokens to predict when generating text (default: 128)
  • --threads, -t: Number of threads to use (default: auto-detect)
  • --ctx-size, -c: Size of the prompt context (default: 512)
  • --temperature, -temp: Temperature for text generation, 0.0-2.0 (default: 0.8)
  • --conversation, -cnv: Enable chat mode; the prompt is used as the system prompt (default: off)
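
Every short flag above maps to the long form listed with it. As a quick reference, the sketch below combines the long-form options in a single invocation; the prompt and values are illustrative:

Long-Form Invocation
# Illustrative values; adjust for your model and hardware
python run_inference.py \
  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  --prompt "Your prompt here" \
  --n-predict 128 \
  --threads 4 \
  --ctx-size 1024 \
  --temperature 0.8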

Example: Simple Text Generation

Simple Text Generation
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "The future of artificial intelligence is" \
  -n 100 \
  -temp 0.7

Example: Conversational AI

For instruction-tuned models, use conversation mode:

Conversation Mode
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "You are a helpful assistant" \
  -cnv

When using the -cnv flag, the prompt specified by -p is used as the system prompt, and you'll enter an interactive conversation mode.

Example: Custom Context Size

For longer prompts or conversations, increase the context size:

Custom Context Size
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Write a long story about..." \
  -c 2048 \
  -n 500

Example: CPU Optimization

Control the number of threads for CPU inference:

CPU Thread Control
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Your prompt" \
  -t 8

Advanced Usage

Model Setup

Before using a model, you need to set up the environment:

Model Setup
python setup_env.py \
  -md models/BitNet-b1.58-2B-4T \
  -q i2_s

setup_env.py Options

  • --hf-repo, -hr: Hugging Face model repo to use for inference; when given, the model is downloaded automatically (see the example after this list)
  • --model-dir, -md: Directory to save/load the model
  • --log-dir, -ld: Directory to save logging info
  • --quant-type, -q: Quantization type (i2_s or tl1)
  • --quant-embd: Quantize embeddings to f16
  • --use-pretuned, -p: Use pretuned kernel parameters
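
If you pass --hf-repo, setup_env.py downloads the model before setting it up, so you don't need a separate download step. This is a minimal sketch; the repo identifier below is an assumption, so run python setup_env.py -h to see the exact --hf-repo choices supported by your checkout:

Download and Set Up in One Step
# NOTE: the repo name is an assumption; check `python setup_env.py -h`
# for the supported --hf-repo values.
python setup_env.py \
  --hf-repo microsoft/BitNet-b1.58-2B-4T \
  -q i2_s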

Model Conversion

Convert models from .safetensors format to GGUF:

Model Conversion
# Download .safetensors model
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 \
  --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to GGUF
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
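
After conversion, point run_inference.py at the resulting GGUF file. The output path below is an assumption; the converter typically writes its output alongside the original model files, so list that directory to confirm the filename before running:

Run the Converted Model
# NOTE: the output filename below is an assumption; check the model
# directory for the .gguf actually produced by the converter.
python run_inference.py \
  -m models/bitnet-b1.58-2B-4T-bf16/ggml-model-i2_s.gguf \
  -p "Your prompt here"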

Benchmarking

BitNet includes benchmarking utilities to measure inference performance. For detailed benchmark information, see our Benchmark Page.

Run Benchmark
python utils/e2e_benchmark.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -n 200 \
  -p 256 \
  -t 4

e2e_benchmark.py Arguments

  • -m, --model: Path to the model file (required)
  • -n, --n-token: Number of generated tokens (default: 128)
  • -p, --n-prompt: Number of prompt tokens (default: 512)
  • -t, --threads: Number of threads to use (default: 2)
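
To see how thread count affects throughput on your machine, you can sweep -t with a small shell loop. This is a minimal sketch; the thread and token counts are illustrative:

Thread-Count Sweep
# Illustrative sweep; pick thread counts that match your CPU
for t in 1 2 4 8; do
  python utils/e2e_benchmark.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -n 128 -p 256 -t "$t"
done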

Best Practices

  • Use Appropriate Models: Choose models that fit your use case. See our Models Page for recommendations.
  • Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
  • Adjust Temperature: Use a lower temperature (0.0-0.7) for more deterministic output and a higher one (0.8-2.0) for more creative output (see the comparison after this list).
  • Use GPU When Available: GPU acceleration significantly improves inference speed.
  • Monitor Memory: Even with 1-bit quantization, large models still require significant memory.
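
As a concrete illustration of the temperature guideline above, compare a low-temperature and a high-temperature run of the same prompt; the prompt and values are arbitrary:

Temperature Comparison
# Low temperature: more deterministic, focused output
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Describe a sunrise" \
  -temp 0.2

# High temperature: more varied, creative output
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Describe a sunrise" \
  -temp 1.2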

Troubleshooting

If you encounter issues during usage, check our FAQ Page for solutions. Common issues include:

  • Model file not found errors
  • Out of memory errors
  • CUDA compatibility issues
  • Model format incompatibilities

Related Resources