Choose the Right Model

Match the model size to your hardware and task. Use smaller models (e.g., 1B–2B parameters) on low-memory or CPU-only systems; use larger ones (7B–8B) for better output quality when you have enough RAM or VRAM. See Models.
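As a rough rule of thumb, weight memory scales with parameter count times bits per weight. The sketch below is an illustrative back-of-the-envelope estimate only; it ignores activations, the KV cache, and runtime overhead, and the parameter counts are hypothetical round numbers, not actual BitNet model configurations.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough lower bound on memory needed just to hold the weights."""
    return n_params * bits_per_weight / 8

# Hypothetical comparison: a 7B model in fp16 vs. 1.58-bit ternary weights.
fp16_7b = weight_bytes(7e9, 16)      # ~14 GB of weights
ternary_7b = weight_bytes(7e9, 1.58) # under 1.5 GB of weights

print(f"fp16 7B:    {fp16_7b / 1e9:.1f} GB")
print(f"ternary 7B: {ternary_7b / 1e9:.1f} GB")
```

If the estimate for a model comfortably exceeds your free RAM/VRAM, pick a smaller model rather than relying on swapping.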

Set Context Size Appropriately

Use the smallest context size (-c) that fits your prompts; a smaller context saves memory and improves speed. Increase it only when you need long conversations or documents.

Adjust Temperature

Lower the temperature (e.g., 0.3–0.5) for more deterministic, factual output; raise it (0.7–1.0) for creative or varied text. See Usage.
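Temperature works by scaling the model's logits before softmax: lower values sharpen the distribution toward the top token, higher values flatten it. A minimal sketch of that mechanism, independent of any particular model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a
    numerically stable softmax. Lower temperature sharpens
    the distribution; higher temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]          # example token scores
cool = softmax_with_temperature(logits, 0.3)
hot = softmax_with_temperature(logits, 1.0)

# At T=0.3 the top token dominates; at T=1.0 probability
# is spread more evenly across the candidates.
print(f"T=0.3: {[round(p, 3) for p in cool]}")
print(f"T=1.0: {[round(p, 3) for p in hot]}")
```

This is why low temperatures give repeatable, factual-sounding output (the sampler almost always picks the top token) while high temperatures produce more varied text.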

Use GPU When Available

BitNet's CUDA kernels are much faster than the CPU path. Use a CUDA-capable GPU for production workloads or heavy use. See Installation and Benchmark.

Warm Up Before Benchmarking

Run a few inference calls before measuring throughput so caches are populated and the GPU has reached a steady state. See Benchmark.
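The warm-up-then-measure pattern can be sketched as a small timing harness. This is a generic illustration, not BitNet's benchmark tooling; `run_inference` stands in for whatever call you are measuring.

```python
import time

def benchmark(fn, warmup: int = 3, iters: int = 10) -> float:
    """Run fn a few times untimed (warm-up), then return the
    average wall-clock time of the measured iterations."""
    for _ in range(warmup):
        fn()  # untimed: populates caches, spins up the GPU
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters

# Stand-in workload; replace with your actual inference call.
def run_inference():
    return sum(i * i for i in range(10_000))

avg_s = benchmark(run_inference)
print(f"avg per call: {avg_s * 1e3:.3f} ms")
```

Without the warm-up loop, the first iterations (cold caches, lazy initialization, GPU clock ramp-up) inflate the average and make throughput numbers unrepresentative.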

Related