The GGUF (GGML Universal File) file format is a binary format that stores both tensors and metadata in a single file, and is designed for fast saving and loading of model data. It was introduced in August 2023 by the llama.cpp project to better maintain backward compatibility as support was added for other model architectures. It superseded previous formats used by the project, such as GGML. GGUF files are typically created by converting models developed with a different machine learning library, such as PyTorch.
== Design ==
GGUF focuses on quantization, the act of reducing the precision of model weights. Quantization can reduce memory usage and increase speed, albeit at the cost of reduced model accuracy. GGUF supports 2-bit to 8-bit quantized integer types; common floating-point formats such as float32, float16, and bfloat16; and 1.58-bit quantization. A GGUF file also contains the information necessary for running a GPT-like language model, such as the tokenizer vocabulary, context length, tensor info, and other attributes.
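The trade-off behind integer quantization can be illustrated with a minimal sketch. This is not llama.cpp's actual block-wise scheme (formats such as Q2_K group weights into blocks with multiple scales); it shows only the basic idea of symmetric 8-bit quantization, with hypothetical helper names:

```python
def quantize_q8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]
    using a single per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_q8(q, scale):
    """Recover approximate float weights; the rounding error is at most scale/2."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9]
q, scale = quantize_q8(weights)
approx = dequantize_q8(q, scale)
```

Each weight now occupies one byte instead of four, at the cost of a small reconstruction error bounded by half the scale factor.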
== Byte-level structure (little-endian) ==

Metadata block:

  // example metadata
  general.architecture: 'llama',
  general.name: 'LLaMA v2',
  llama.context_length: 4096,
  ...,
  general.file_type: 10,  // typically indicates quantization level, here "MOSTLY_Q2_K"
  tokenizer.ggml.model: 'llama',
  tokenizer.ggml.tokens: [ ... ],
  ...
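Before the metadata key/value pairs, every GGUF file begins with a small fixed header: the magic bytes "GGUF", a version number, the tensor count, and the metadata key/value count, all little-endian. A minimal sketch of reading it, using only the Python standard library (the example field values are illustrative, not from a real model):

```python
import io
import struct

def read_gguf_header(f):
    """Read the fixed GGUF file header from a binary stream.

    Layout (little-endian): 4-byte magic "GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(4 + 8 + 8))
    return version, n_tensors, n_kv

# Demo with an in-memory header: version 3, 291 tensors, 19 metadata pairs.
buf = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 19))
print(read_gguf_header(buf))  # → (3, 291, 19)
```

The metadata key/value pairs and the tensors info block described here follow immediately after this header.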
Tensors info block:

  // n-th tensor
  name: GGUF string,     // ex: "blk.0.ffn_gate.weight"
  n_dimensions: UINT32,  // ex: 2
  dimensions: UINT64[],  // ex: [ 4096, 32000 ]
  type: UINT32,          // ex: 10 (typically indicates quantization level, here "GGML_TYPE_Q2_K")
  offset: UINT64         // starting position within the tensor_data block, relative to the start of the block
  // (n+1)-th tensor
  ...

== Models ==