GGUF and llama.cpp: the weight format for local inference
What GGUF is, why it supplanted the older GGML, how to read suffixes such as Q4_K_M, and why llama.cpp became the reference engine for running models on CPU and on mixed GPU setups.
Abstract (EN)
GGUF is the file format that made local inference portable. Born in the llama.cpp project as the successor to GGML, it packs a model's quantized weights together with its metadata and tokenizer into a single file that runs across CPU and mixed GPU setups. This article explains what GGUF stores, how to decode the quantization suffixes such as Q4_K_M, and why llama.cpp became the reference engine that tools like Ollama and LM Studio build on. We aim to let a reader pick the right GGUF variant for their hardware with confidence, reading the file name as a compact spec of size and quality.
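To make "reading the file name as a compact spec" concrete, here is a minimal Python sketch that pulls a quantization tag such as Q4_K_M out of a file name. It assumes the common community naming pattern `Q<bits>_<scheme>[_<size>]`. The pattern is a convention rather than something the file format enforces, and the helper `parse_quant_tag`, the `QuantTag` type, and the example file names are hypothetical, introduced only for illustration. Tags outside this pattern (for example F16) are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional


class QuantTag(NamedTuple):
    bits: int            # nominal bit width named in the tag (the 4 in Q4_K_M)
    scheme: str          # "K" for k-quants, "0" or "1" for the older variants
    size: Optional[str]  # "S", "M" or "L" when present, otherwise None


# Assumed convention: Q<digit>_<K|0|1> with an optional _<S|M|L> suffix.
# The lookarounds keep the pattern from matching inside a longer token.
_QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])Q(\d)_(K|0|1)(?:_(S|M|L))?(?![A-Za-z0-9])"
)


def parse_quant_tag(filename: str) -> Optional[QuantTag]:
    """Return the first quantization tag found in filename, or None."""
    match = _QUANT_RE.search(filename)
    if match is None:
        return None
    return QuantTag(int(match.group(1)), match.group(2), match.group(3))


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the parser.
    for name in ("example-7b.Q4_K_M.gguf", "example-7b.Q8_0.gguf", "example-7b.F16.gguf"):
        print(name, parse_quant_tag(name))
```

Read this way, the tag gives a rough first indication: the digit is the nominal bit width, and the trailing S, M, or L conventionally marks a smaller or larger variant at that width. The actual size and quality trade-off still depends on the model and should be checked against the published file sizes.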