1-bit LLM

A large language model whose weights use only two values (e.g. -1 and +1), reducing memory and often speeding up inference. See What is a 1-bit LLM?
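For a sense of scale, here is a back-of-the-envelope comparison of weight storage at different precisions (a minimal sketch; 7B parameters is just an example size):

```python
# Approximate memory for the weights of a 7B-parameter model
# (weights only; activations, KV cache, and runtime overhead excluded).
params = 7e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8),
                   ("ternary (1.58-bit)", 1.58), ("1-bit", 1)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>18}: {gib:6.2f} GiB")
```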

BitNet

Microsoft's inference framework (bitnet.cpp) for 1-bit and 1.58-bit LLMs. It provides tools to run models such as BitNet-b1.58 and Falcon3-1.58bit.
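A typical workflow looks like the following (model ID and paths are examples; check the repository's README for the exact flags in your version):

```bash
# Fetch and convert a model, then chat with it.
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" -cnv
```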

Quantization

Storing model weights with fewer bits (e.g. 1-bit, 4-bit, 8-bit) instead of FP16/FP32 to reduce size and improve inference speed. BitNet focuses on 1-bit and 1.58-bit (ternary) weights.
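As an illustrative sketch of one such scheme (not BitNet's actual code), the absmean quantization described in the BitNet b1.58 paper maps a weight matrix to {-1, 0, +1} with a single per-matrix scale:

```python
import numpy as np

def absmean_ternary(W: np.ndarray):
    """Quantize W to {-1, 0, +1} using the absmean scale
    (illustrative sketch of the BitNet b1.58 scheme)."""
    scale = np.mean(np.abs(W)) + 1e-8          # per-matrix scale
    Wq = np.clip(np.round(W / scale), -1, 1)   # round, then clip to ternary
    return Wq.astype(np.int8), scale           # dequantize as Wq * scale

W = np.random.randn(4, 4).astype(np.float32)
Wq, scale = absmean_ternary(W)
print(Wq, scale)
```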

GGUF

GPT-Generated Unified Format: a file format for storing LLM weights, used by BitNet and the llama.cpp ecosystem for efficient loading.
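The format begins with a small fixed header, so a file can be sanity-checked in a few lines (minimal sketch; the path is a placeholder):

```python
import struct

# Read the fixed GGUF header: magic, version, tensor count, metadata KV count.
with open("ggml-model-i2_s.gguf", "rb") as f:  # placeholder path
    assert f.read(4) == b"GGUF", "not a GGUF file"
    version, = struct.unpack("<I", f.read(4))
    n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")
```

(Header layout per the GGUF specification; versions 2 and later use 64-bit counts as shown.)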

Inference

Running a trained model to generate text or answers. BitNet optimizes inference for 1-bit weights with custom kernels.
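Conceptually, inference is an autoregressive loop: each step feeds the tokens generated so far back into the model to pick the next one. A toy sketch with a stand-in model (the real work is the forward pass inside BitNet's kernels):

```python
def toy_model(tokens):
    # Stand-in for a real forward pass; returns a "next token" id.
    return (sum(tokens) * 31) % 50 + 1

def generate(prompt_tokens, n_predict=8, eos=0):
    tokens = list(prompt_tokens)
    for _ in range(n_predict):        # one forward pass per generated token
        nxt = toy_model(tokens)
        if nxt == eos:                # stop on end-of-sequence
            break
        tokens.append(nxt)
    return tokens

print(generate([3, 14, 15]))
```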

CUDA

NVIDIA's platform for GPU computing. BitNet uses custom CUDA kernels for fast 1-bit matrix operations on NVIDIA GPUs.
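To see why 1-bit weights enable fast kernels: with values in {-1, +1} packed one per bit, a dot product collapses to XOR plus popcount, since dot(a, b) = n - 2 × (number of sign mismatches). A NumPy sketch of the trick (conceptual only; BitNet's actual CUDA kernels are more involved):

```python
import numpy as np

def pack_signs(v):
    # Map {-1, +1} -> {0, 1} and pack 8 sign bits per byte.
    return np.packbits((v > 0).astype(np.uint8))

def dot_1bit(a_bits, b_bits, n):
    # dot(a, b) = n - 2 * popcount(a XOR b) for sign vectors of length n.
    mismatches = np.unpackbits(np.bitwise_xor(a_bits, b_bits))[:n].sum()
    return n - 2 * int(mismatches)

n = 64
a = np.random.choice([-1, 1], n)
b = np.random.choice([-1, 1], n)
assert dot_1bit(pack_signs(a), pack_signs(b), n) == int(a @ b)
```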

i2_s / tl1

Quantization types in BitNet: i2_s stores each ternary weight (-1, 0, +1) as a 2-bit signed integer; tl1 instead uses table lookup, indexing precomputed partial products for small groups of weights. See Installation and Usage.
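As a rough illustration of the 2-bit packing idea (a hypothetical layout, not the framework's exact i2_s bit order), four ternary weights fit in one byte:

```python
# Encode ternary weights {-1, 0, +1} as 2-bit codes and pack four per byte.
# The code assignment and bit order here are hypothetical, for illustration.
ENC = {-1: 0b00, 0: 0b01, 1: 0b10}
DEC = {code: w for w, code in ENC.items()}

def pack4(weights):
    assert len(weights) == 4
    byte = 0
    for i, w in enumerate(weights):
        byte |= ENC[w] << (2 * i)     # 2 bits per weight
    return byte

def unpack4(byte):
    return [DEC[(byte >> (2 * i)) & 0b11] for i in range(4)]

w = [-1, 0, 1, 1]
assert unpack4(pack4(w)) == w         # round-trips losslessly
```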

Context size

Maximum number of tokens the model can process in one request (prompt + generated output). Set with -c in run_inference.py.
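For example (model path is a placeholder; see the README for the full flag list):

```bash
# 2048-token context window, up to 128 generated tokens.
python run_inference.py -m models/ggml-model-i2_s.gguf \
    -p "Explain GGUF in one sentence." -n 128 -c 2048
```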
