Documentation Overview

Welcome to the BitNet documentation! This comprehensive guide covers all aspects of using BitNet for 1-bit LLM inference. If you're new to BitNet, start with our Getting Started Guide.

Command Line Tools

run_inference.py

Run inference with BitNet models.

Usage

python run_inference.py [OPTIONS]

Options

Option          Short  Description                      Type     Default
--model         -m     Path to model file               String   Required
--prompt        -p     Prompt to generate text from     String   Required
--n-predict     -n     Number of tokens to predict      Integer  128
--threads       -t     Number of threads to use         Integer  Auto-detect
--ctx-size      -c     Size of the prompt context       Integer  512
--temperature   -temp  Temperature for text generation  Float    0.8
--conversation  -cnv   Enable chat mode                 Flag     False
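
A minimal sketch of a typical invocation, assuming a GGUF model has already been placed under models/ (the model path and prompt below are illustrative placeholders):

python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "You are a helpful assistant" \
  -t 4 -temp 0.8 -cnv

With -cnv set, inference runs in chat mode; omit it for plain one-shot text generation.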

setup_env.py

Set up the environment for running inference with BitNet models.

Usage

python setup_env.py [OPTIONS]

Options

Option          Short  Description                       Type
--hf-repo       -hr    Model used for inference          String
--model-dir     -md    Directory to save/load the model  String
--log-dir       -ld    Directory to save logging info    String
--quant-type    -q     Quantization type                 Choice: i2_s, tl1
--quant-embd           Quantize embeddings to f16        Flag
--use-pretuned  -p     Use pretuned kernel parameters    Flag
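
As a sketch, a setup that downloads a model and prepares the i2_s kernels might look like the following; the Hugging Face repository ID and directory are illustrative assumptions, not the only supported values:

python setup_env.py -hr microsoft/BitNet-b1.58-2B-4T-gguf \
  -md models/BitNet-b1.58-2B-4T \
  -q i2_s

Since --model-dir is described as the directory to save or load the model, pointing -md at an existing local copy may also work without -hr.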

e2e_benchmark.py

Run end-to-end inference benchmarks.

Usage

python utils/e2e_benchmark.py [OPTIONS]

Options

Option     Short  Description                 Type     Default
--model    -m     Path to the model file      String   Required
--n-token  -n     Number of generated tokens  Integer  128
--n-prompt -p     Number of prompt tokens     Integer  512
--threads  -t     Number of threads to use    Integer  2
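
For example, an illustrative benchmark run (the model path is a placeholder):

python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 4

This measures end-to-end generation of 200 tokens from a 256-token prompt using 4 threads.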

convert-helper-bitnet.py

Convert models from .safetensors format to GGUF format.

Usage

python ./utils/convert-helper-bitnet.py MODEL_DIR
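
MODEL_DIR is the directory containing the .safetensors checkpoint. A hypothetical example, assuming the checkpoint was downloaded to models/BitNet-b1.58-2B-4T:

python ./utils/convert-helper-bitnet.py models/BitNet-b1.58-2B-4T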

generate-dummy-bitnet-model.py

Generate dummy models for testing and benchmarking.

Usage

python utils/generate-dummy-bitnet-model.py MODEL_LAYOUT \
  --outfile OUTPUT_FILE \
  --outtype QUANT_TYPE \
  --model-size SIZE
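
For instance, to generate a small dummy model for benchmarking (the layout directory, output filename, and size below are illustrative assumptions):

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M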

Quantization Types

i2_s

2-bit signed representation of the ternary weight values -1, 0, and +1. This is the recommended quantization type.

tl1

Table-lookup (TL) kernel variant: ternary weights are packed into indices into a precomputed lookup table. Selected at setup time with --quant-type tl1.

Model Formats

GGUF Format

GGUF (GPT-Generated Unified Format) is the native format for BitNet models. It's optimized for efficient loading and inference.

.safetensors Format

Some models are available in .safetensors format and can be converted to GGUF using the conversion utilities. See our Usage Guide for conversion instructions.

Configuration

Environment Variables

BitNet uses standard environment variables for configuration, as shown in the example after this list:

  • CUDA_VISIBLE_DEVICES - Specify which GPU to use
  • OMP_NUM_THREADS - Number of threads for CPU inference
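
A minimal sketch, assuming a single-GPU machine and 8 CPU threads (the values and model path are illustrative):

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 python run_inference.py \
  -m /path/to/model.gguf -p "Hello" -n 64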

Best Practices

  • Use Appropriate Models: Choose models that fit your use case. See our Models Page.
  • Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
  • Adjust Temperature: Lower temperature for deterministic outputs, higher for creativity (see the example after this list).
  • Use GPU When Available: GPU acceleration significantly improves performance.
  • Monitor Memory: Even with 1-bit quantization, large models require significant memory.
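
Putting the context-size and temperature advice together, a sketch of a memory-lean, mostly deterministic run (the path, prompt, and values are illustrative):

python run_inference.py -m /path/to/model.gguf \
  -p "Summarize the following report:" \
  -c 256 -temp 0.2 -n 64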
