Documentation
Complete documentation for BitNet - API reference, guides, and examples
Documentation Overview
Welcome to the BitNet documentation! This comprehensive guide covers all aspects of using BitNet for 1-bit LLM inference. If you're new to BitNet, start with our Getting Started Guide.
Quick Start Guides
- Getting Started - Quick introduction to BitNet
- Installation Guide - Setup instructions for all platforms
- Usage Guide - Basic and advanced usage examples
Command Line Tools
run_inference.py
Run inference with BitNet models.
Usage
python run_inference.py [OPTIONS]
Options
| Option | Short | Description | Type | Default |
|---|---|---|---|---|
| --model | -m | Path to model file | String | Required |
| --prompt | -p | Prompt to generate text from | String | Required |
| --n-predict | -n | Number of tokens to predict | Integer | 128 |
| --threads | -t | Number of threads to use | Integer | Auto-detect |
| --ctx-size | -c | Size of the prompt context | Integer | 512 |
| --temperature | -temp | Temperature for text generation | Float | 0.8 |
| --conversation | -cnv | Enable chat mode | Flag | False |
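For example, a short generation run might look like this (the model path is illustrative; point -m at the GGUF file produced by setup_env.py):
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Write a haiku about the sea" -n 64 -t 4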
setup_env.py
Set up the environment for running inference with BitNet models.
Usage
python setup_env.py [OPTIONS]
Options
| Option | Short | Description | Type |
|---|---|---|---|
| --hf-repo | -hr | Model used for inference | String |
| --model-dir | -md | Directory to save/load the model | String |
| --log-dir | -ld | Directory to save logging info | String |
| --quant-type | -q | Quantization type | Choice: i2_s, tl1 |
| --quant-embd | | Quantize embeddings to f16 | Flag |
| --use-pretuned | -p | Use pretuned kernel parameters | Flag |
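For instance, to download a model from Hugging Face and quantize it with i2_s (the repository id and directory below are illustrative):
python setup_env.py -hr microsoft/BitNet-b1.58-2B-4T -md models/BitNet-b1.58-2B-4T -q i2_s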
e2e_benchmark.py
Run end-to-end inference benchmarks.
Usage
python utils/e2e_benchmark.py [OPTIONS]
Options
| Option | Short | Description | Type | Default |
|---|---|---|---|---|
| --model | -m | Path to the model file | String | Required |
| --n-token | -n | Number of generated tokens | Integer | 128 |
| --n-prompt | -p | Number of prompt tokens | Integer | 512 |
| --threads | -t | Number of threads to use | Integer | 2 |
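For example, to benchmark with a 512-token prompt and 128 generated tokens on 4 threads (model path illustrative):
python utils/e2e_benchmark.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 128 -p 512 -t 4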
convert-helper-bitnet.py
Convert models from .safetensors format to GGUF format.
Usage
python ./utils/convert-helper-bitnet.py MODEL_DIR
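For example, assuming the .safetensors checkpoint and its config live in models/bitnet_b1_58-large:
python ./utils/convert-helper-bitnet.py models/bitnet_b1_58-large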
generate-dummy-bitnet-model.py
Generate dummy models for testing and benchmarking.
Usage
python utils/generate-dummy-bitnet-model.py MODEL_LAYOUT \
--outfile OUTPUT_FILE \
--outtype QUANT_TYPE \
--model-size SIZE
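A hypothetical invocation, assuming a layout directory such as models/bitnet_b1_58-large and a 125M-parameter dummy model in tl1 format:
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large \
    --outfile models/dummy-bitnet-125m.tl1.gguf \
    --outtype tl1 \
    --model-size 125M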
Quantization Types
i2_s
Two-bit signed representation of the ternary weights (-1, 0, +1). This is the recommended quantization type.
tl1
Ternary weights packed for table-lookup (TL1) kernels.
Model Formats
GGUF Format
GGUF (GPT-Generated Unified Format) is the native format for BitNet models. It's optimized for efficient loading and inference.
.safetensors Format
Some models are available in .safetensors format and can be converted to GGUF using the conversion utilities. See our Usage Guide for conversion instructions.
Configuration
Environment Variables
BitNet uses standard environment variables for configuration:
- CUDA_VISIBLE_DEVICES - Specify which GPU to use
- OMP_NUM_THREADS - Number of threads for CPU inference
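For example, to pin inference to the first GPU and use 8 CPU threads (values illustrative):
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Hello" -n 32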
Best Practices
- Use Appropriate Models: Choose models that fit your use case. See our Models Page.
- Optimize Context Size: Use the smallest context size necessary to reduce memory usage.
- Adjust Temperature: Lower temperature for deterministic outputs, higher for creativity (see the combined example after this list).
- Use GPU When Available: GPU acceleration significantly improves performance.
- Monitor Memory: Even with 1-bit quantization, large models require significant memory.
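For instance, a memory-conscious, mostly deterministic run might combine a small context, a low temperature, and an explicit thread count (all values illustrative):
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "List three facts about the ocean" -c 256 -temp 0.2 -n 64 -t 4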
Related Documentation
- Getting Started - Quick introduction
- Installation Guide - Setup instructions
- Usage Guide - Detailed usage examples
- Models Page - Available models
- Benchmark Guide - Performance testing
- FAQ - Common questions and answers
Additional Resources
- GitHub Repository - Source code
- GitHub Issues - Report bugs
- Resources Page - Additional learning materials