Documentation Overview

Welcome to the BitNet documentation! This comprehensive guide covers all aspects of using BitNet for 1-bit LLM inference. If you're new to BitNet, start with our Getting Started Guide.

Command Line Tools

run_inference.py

Run inference with BitNet models.

Usage

python run_inference.py [OPTIONS]

Options

Option          Short  Description                      Type     Default
--model         -m     Path to model file               String   Required
--prompt        -p     Prompt to generate text from     String   Required
--n-predict     -n     Number of tokens to predict      Integer  128
--threads       -t     Number of threads to use         Integer  Auto-detect
--ctx-size      -c     Size of the prompt context       Integer  512
--temperature   -temp  Temperature for text generation  Float    0.8
--conversation  -cnv   Enable chat mode                 Flag     False
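
A minimal sketch of a typical invocation, assuming a GGUF model has already been placed under models/ (the model path and prompt below are illustrative placeholders):

python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "You are a helpful assistant" \
  -t 4 -temp 0.8 -cnv

With -cnv set, inference runs in chat mode; omit it for plain one-shot text generation.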

setup_env.py

Set up the environment for running inference with BitNet models.

Usage

python setup_env.py [OPTIONS]

Options

Option          Short  Description                       Type
--hf-repo       -hr    Model used for inference          String
--model-dir     -md    Directory to save/load the model  String
--log-dir       -ld    Directory to save logging info    String
--quant-type    -q     Quantization type                 Choice: i2_s, tl1
--quant-embd           Quantize embeddings to f16        Flag
--use-pretuned  -p     Use pretuned kernel parameters    Flag
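
As a sketch, a setup that downloads a model and prepares the i2_s kernels might look like the following; the Hugging Face repository ID and directory are illustrative assumptions, not the only supported values:

python setup_env.py -hr microsoft/BitNet-b1.58-2B-4T-gguf \
  -md models/BitNet-b1.58-2B-4T \
  -q i2_s

Since --model-dir is described as the directory to save or load the model, pointing -md at an existing local copy may also work without -hr.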

e2e_benchmark.py

Run end-to-end inference benchmarks.

Usage

python utils/e2e_benchmark.py [OPTIONS]

Options

Option     Short  Description                 Type     Default
--model    -m     Path to the model file      String   Required
--n-token  -n     Number of generated tokens  Integer  128
--n-prompt -p     Number of prompt tokens     Integer  512
--threads  -t     Number of threads to use    Integer  2
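
For example, an illustrative benchmark run (the model path is a placeholder):

python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 4

This measures end-to-end generation of 200 tokens from a 256-token prompt using 4 threads.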

convert-helper-bitnet.py

Convert models from .safetensors format to GGUF format.

Usage

python ./utils/convert-helper-bitnet.py MODEL_DIR
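
MODEL_DIR is the directory containing the .safetensors checkpoint. A hypothetical example, assuming the checkpoint was downloaded to models/BitNet-b1.58-2B-4T:

python ./utils/convert-helper-bitnet.py models/BitNet-b1.58-2B-4T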

generate-dummy-bitnet-model.py

Generate dummy models for testing and benchmarking.

Usage

python utils/generate-dummy-bitnet-model.py MODEL_LAYOUT \
  --outfile OUTPUT_FILE \
  --outtype QUANT_TYPE \
  --model-size SIZE
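
For instance, to generate a small dummy model for benchmarking (the layout directory, output filename, and size below are illustrative assumptions):

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large \
  --outfile models/dummy-bitnet-125m.tl1.gguf \
  --outtype tl1 \
  --model-size 125M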

Quantization Types

i2_s

2-bit signed representation of the ternary weight values -1, 0, and +1. This is the recommended quantization type.

tl1

Table-lookup (TL) kernel variant: ternary weights are packed into indices into a precomputed lookup table. Selected at setup time with --quant-type tl1.

Model Formats

GGUF Format

GGUF (GPT-Generated Unified Format) is the native format for BitNet models. It's optimized for efficient loading and inference.

.safetensors Format

Some models are available in .safetensors format and can be converted to GGUF using the conversion utilities. See our Usage Guide for conversion instructions.

Configuration

Environment Variables

BitNet uses standard environment variables for configuration, as shown in the example after this list:

  • CUDA_VISIBLE_DEVICES - Specify which GPU to use
  • OMP_NUM_THREADS - Number of threads for CPU inference
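
A minimal sketch, assuming a single-GPU machine and 8 CPU threads (the values and model path are illustrative):

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 python run_inference.py \
  -m /path/to/model.gguf -p "Hello" -n 64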

Best Practices

  • Use Appropriate Models: Choose models that fit your use case. See our Models Page.
  • Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
  • Adjust Temperature: Lower temperature for deterministic outputs, higher for creativity (see the example after this list).
  • Use GPU When Available: GPU acceleration significantly improves performance.
  • Monitor Memory: Even with 1-bit quantization, large models require significant memory.
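
Putting the context-size and temperature advice together, a sketch of a memory-lean, mostly deterministic run (the path, prompt, and values are illustrative):

python run_inference.py -m /path/to/model.gguf \
  -p "Summarize the following report:" \
  -c 256 -temp 0.2 -n 64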
