Usage
Guide to using BitNet for 1-bit LLM inference
Basic Usage
Once you have BitNet installed and a model downloaded, you can start running inference. For installation instructions, see our Installation Guide. For available models, check out our Models Page.
Running Inference
The most common way to use BitNet is through the run_inference.py script:
```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "Your prompt here"
```
run_inference.py Arguments
| Option | Short | Description | Default |
|---|---|---|---|
| `--model` | `-m` | Path to model file | Required |
| `--prompt` | `-p` | Prompt to generate text from | Required |
| `--n-predict` | `-n` | Number of tokens to predict when generating text | 128 |
| `--threads` | `-t` | Number of threads to use | Auto-detect |
| `--ctx-size` | `-c` | Size of the prompt context | 512 |
| `--temperature` | `-temp` | Temperature for text generation (0.0-2.0) | 0.8 |
| `--conversation` | `-cnv` | Enable chat mode (uses prompt as system prompt) | False |
Example: Simple Text Generation
```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "The future of artificial intelligence is" \
    -n 100 \
    -temp 0.7
```
Example: Conversational AI
For instruction-tuned models, use conversation mode:
```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" \
    -cnv
```
When using the -cnv flag, the prompt specified by -p will be used as the system prompt, and you'll enter an interactive conversation mode.
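Because chat history accumulates in the context window, it can help to pair -cnv with a larger --ctx-size. A minimal sketch, assuming your model supports a 4096-token context:

```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" \
    -cnv \
    -c 4096
```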
Example: Custom Context Size
For longer prompts or conversations, increase the context size:
```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "Write a long story about..." \
    -c 2048 \
    -n 500
```
Example: CPU Optimization
Control the number of threads for CPU inference:
```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "Your prompt" \
    -t 8
```
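If you're unsure how many threads to request, one option on Linux is to let nproc supply the core count; treat this as a starting point and tune from there for your workload:

```bash
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "Your prompt" \
    -t "$(nproc)"
```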
Advanced Usage
Model Setup
Before using a model, you need to set up the environment:
```bash
python setup_env.py \
    -md models/BitNet-b1.58-2B-4T \
    -q i2_s
```
setup_env.py Options
- `--hf-repo`, `-hr`: Model used for inference (various model names)
- `--model-dir`, `-md`: Directory to save/load the model
- `--log-dir`, `-ld`: Directory to save logging info
- `--quant-type`, `-q`: Quantization type (`i2_s` or `tl1`)
- `--quant-embd`: Quantize embeddings to f16
- `--use-pretuned`, `-p`: Use pretuned kernel parameters
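With --hf-repo, setup_env.py can also fetch the model for you before setting it up. A hedged example; the repository name below assumes the official GGUF release on Hugging Face and may need adjusting:

```bash
python setup_env.py \
    --hf-repo microsoft/BitNet-b1.58-2B-4T-gguf \
    -md models/BitNet-b1.58-2B-4T \
    -q i2_s
```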
Model Conversion
Convert models from .safetensors format to GGUF:
```bash
# Download .safetensors model
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 \
    --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to GGUF
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
```
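The converted GGUF file is written into the model directory; a quick way to confirm it is there (the exact filename depends on the helper's output):

```bash
ls -lh ./models/bitnet-b1.58-2B-4T-bf16/*.gguf
```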
Benchmarking
BitNet includes benchmarking utilities to measure inference performance. For detailed benchmark information, see our Benchmark Page.
```bash
python utils/e2e_benchmark.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -n 200 \
    -p 256 \
    -t 4
```
e2e_benchmark.py Arguments
- `-m`, `--model`: Path to the model file (required)
- `-n`, `--n-token`: Number of generated tokens (default: 128)
- `-p`, `--n-prompt`: Number of prompt tokens (default: 512)
- `-t`, `--threads`: Number of threads to use (default: 2)
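To find a good thread count for your machine, a simple illustrative sweep compares the reported throughput across runs (thread counts and paths are examples):

```bash
# Illustrative sweep: benchmark the same model at several thread counts
for T in 2 4 8; do
    echo "=== $T threads ==="
    python utils/e2e_benchmark.py \
        -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
        -n 128 -p 256 -t "$T"
done
```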
Best Practices
- Use Appropriate Models: Choose models that fit your use case. See our Models Page for recommendations.
- Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
- Adjust Temperature: Use a lower temperature (0.0-0.7) for more deterministic output and a higher one (0.8-2.0) for more creative output; see the sketch after this list.
- Use GPU When Available: GPU acceleration significantly improves inference speed.
- Monitor Memory: Even with 1-bit quantization, large models still require significant memory.
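As a quick illustration of the temperature guidance above, here is a minimal sketch (model path, prompt, and values are examples) that runs the same prompt at several temperatures so you can compare the outputs:

```bash
#!/usr/bin/env bash
# Compare deterministic vs. creative output at different temperatures
MODEL=models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
for TEMP in 0.2 0.8 1.5; do
    echo "=== temperature $TEMP ==="
    python run_inference.py -m "$MODEL" \
        -p "Describe a sunrise in one sentence." \
        -n 48 -temp "$TEMP"
done
```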
Troubleshooting
If you encounter issues during usage, check our FAQ Page for solutions. Common issues include:
- Model file not found errors
- Out of memory errors
- CUDA compatibility issues
- Model format incompatibilities
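Before digging deeper, a couple of shell checks rule out the first two (paths are examples; free -h is Linux-specific):

```bash
# Verify the model file exists at the path you pass with -m
test -f models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    && echo "model found" || echo "model missing"

# Check available RAM before loading a large model (Linux)
free -h
```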
Related Resources
- Getting Started Guide - Quick introduction
- Installation Guide - Setup instructions
- Models Page - Available models
- Benchmark Guide - Performance testing
- Documentation - Complete API reference
- FAQ - Common questions and answers