No matter which forum or Discord you scroll through, the most common advice you’ll hear for self-hosting LLMs is: Use Ollama.

Lately, I see more and more people assuming that Ollama is always a good choice for self-hosting LLMs. That belief was already widespread, but it spread further after OpenAI released their gpt-oss models and advertised Ollama as the go-to hosting option. Recommending Ollama to non-technical users makes perfect sense, yet more and more technical people also assume that Ollama is always the right choice.

While Ollama does an awesome job of keeping everything simple and stupid (typing ollama run whatever felt like magic the first time!), I don’t think that Ollama should be anyone’s first choice anymore. There are quite a few reasons for that:

  • Instead of using Jinja for chat templates (which the rest of the ecosystem agreed on), they rolled their own templating (see here for the discussion), which led to the creation of RamaLama, maintained by Red Hat.
  • They refused to credit llamacpp, which is/was the engine of Ollama, and ignored the issue when called out.
  • Even after creating their own “from-scratch” engine, they still rely on the GGUF implementation, and the parts they actually did write themselves were done poorly. See this comment from llamacpp’s creator, Georgi Gerganov, for an example.
  • Most importantly: Ollama got big by being the first project that made it really easy to self-host LLMs. Other projects acknowledged this, and Ollama lost its leading edge a long time ago. Alternatives make it even easier, have far more features, better UIs, and do better overall work.

Some people love to criticize Ollama for not yet supporting Vulkan, but I have to disagree with this take. Vulkan support already exists on the main branch; they are simply holding it back until it’s stable because they prioritize quality (at least that’s their explanation). I respect that decision. Maybe they learned from their past mistakes and are getting better now, so I don’t find it fair to criticize them for this.

In short: Ollama is still mostly a wrapper around llamacpp with its own model registry, by-default quantization, and heuristic performance tweaks (which sounds good, until you know that they did not implement those well).

There are plenty of alternatives, but you most likely only need to know about a handful:

1. llama.cpp

llamacpp is a C++ inference engine that aims to allow self-hosting models on consumer devices. Initially, they only supported CPUs, but they quickly added support for GPUs and mixed CPU+GPU offloading. Pick it if you:

  • have no GPU
  • are the only user
  • have a GPU, but the model is too fat to fit entirely into VRAM and needs offloading
  • are fine with a simple web UI as interface or are only interested in the built-in OpenAI-API compatible server
  • are comfortable compiling from source (While llamacpp is available in some distro repos already, e.g. in Fedora, I still recommend building it yourself!)
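To make the "compile it yourself" route concrete, here is a minimal sketch, assuming an NVIDIA GPU (swap `-DGGML_CUDA=ON` for your backend) and a placeholder model path:

```shell
# Build llama.cpp from source with CUDA enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Start the built-in OpenAI-API compatible server on a local GGUF file.
# The model path is a placeholder; point it at any GGUF you downloaded.
./build/bin/llama-server -m ./models/model.gguf --port 8080
```

The same `llama-server` binary also serves the simple web UI mentioned above on the same port.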

2. LM Studio: the simplicity of Ollama done right

Like Ollama, LM Studio uses llamacpp under the hood, but they:

  • automatically update llamacpp runtimes (llamacpp uses rolling releases, so getting updates directly is very beneficial)
  • have a desktop application that exposes three UI tiers (User, Power User, Developer) for different levels of complexity
  • automatically recommend models and show what model might or might not fit your setup
  • let you set guardrails so you don’t accidentally overload your machine
  • have cool extras: 30MB RAG and a sandboxed JS code runner
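The "will it fit" hints boil down to simple arithmetic over parameter count and quantization. Here is a back-of-the-envelope sketch of that idea; the 20% overhead factor is my own rough assumption, not LM Studio's actual logic:

```python
# Rough estimate of whether a quantized model fits in VRAM, in the
# spirit of LM Studio's model-fit hints. Numbers are assumptions.

def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Weights-only estimate (params * bits / 8 bytes), inflated by an
    overhead factor for KV cache, activations and runtime buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float) -> bool:
    return estimated_vram_gb(params_billion, bits_per_weight) <= vram_gb

# An 8B model at ~4.5 bits/weight (a typical Q4 quant) on a 12 GB card:
print(round(estimated_vram_gb(8, 4.5), 1))  # 5.4
print(fits(8, 4.5, 12))                     # True
print(fits(70, 4.5, 12))                    # False
```

The real tools also account for context length, which grows the KV cache; this sketch folds all of that into the overhead factor.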

The GUI is closed source (the only disadvantage I can think of right now), but the bundled CLI (lms), the Python SDK, and the TypeScript SDK are open source under the MIT license; the latter is used inside LM Studio itself.

3. vLLM

vLLM (virtualized Large Language Models) is another common recommendation. It works exceptionally well on GPUs with a very straightforward setup. vLLM also supports CPU offloading, but it doesn’t work quite as well as llamacpp’s implementation. Its flagship feature is PagedAttention, which lets it allocate only the VRAM it actually needs and share the rest. It shines when you have multiple users/requests at the same time for basic chat use cases. If you want a quick-and-dirty way to try it out: uvx vllm serve openai/gpt-oss-20b
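The core idea behind PagedAttention can be illustrated without any GPU code: instead of reserving one contiguous max-length KV-cache slab per request, the cache is split into fixed-size blocks handed out on demand and returned when a request finishes. This toy allocator is my own illustration of the concept, not vLLM's implementation:

```python
# Toy paged KV-cache allocator: blocks are granted one at a time as a
# request's token count grows, so memory for short requests is never
# wasted on worst-case reservations. Block/pool sizes are arbitrary.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> list of block ids
        self.lengths = {}  # request id -> tokens stored

    def append_token(self, rid: str) -> None:
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted; request must wait")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                    # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))      # 2
cache.release("req-1")
print(len(cache.free))                 # 8
```

A contiguous allocator sized for a 4096-token context would have reserved 256 blocks for that same 20-token request; that difference is why vLLM handles many concurrent requests so well.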

4. SGLang

While vLLM maximizes throughput for straightforward chat requests, SGLang (Structured Generation Language) is optimized for agent use cases, where many requests share long common prefixes. Its key feature is RadixAttention. You can try it out by installing it and running uv run python -m sglang.launch_server --model-path <...>
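The intuition behind RadixAttention: agent workloads re-send the same system prompt and tool definitions on every call, so the server keeps computed KV state in a prefix tree and only recomputes the suffix that differs. This trie over token lists is my own illustration of that idea, not SGLang's implementation:

```python
# Toy prefix cache: each visited trie node stands for cached KV state,
# so a new request only pays for the tokens past the longest match.

class PrefixCache:
    def __init__(self):
        self.root = {}  # token -> child dict

    def match(self, tokens: list) -> int:
        """Return how many leading tokens already have cached KV state."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

    def insert(self, tokens: list) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

system = ["<sys>", "You", "are", "a", "helpful", "agent"]
cache = PrefixCache()
cache.insert(system + ["call", "tool_a"])

# A second request with the same system prompt reuses the shared prefix
# (6 system tokens plus "call") and only computes the new suffix:
print(cache.match(system + ["call", "tool_b"]))  # 7
```

With hundreds of agent calls sharing a multi-thousand-token system prompt, this reuse is where the speedup comes from.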

5. ExLlamaV2/3

ExLlamaV2 and ExLlamaV3 are a more specialized kind of inference engine. I recommend them if you care about single-user inference performance on GPTQ-quantized models. In my experience, they’re the best choice for one-prompt-at-a-time workloads. Setting them up is unfortunately not as straightforward as the alternatives: you have to install ExLlama by cloning the repository (V2 or V3), download a quantized model, and then use TabbyAPI as the interface.
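The steps above, sketched as shell commands; the repository locations are assumptions based on the upstream projects at the time of writing, so verify them (and each repo's README) before use:

```shell
# Clone ExLlama (pick V2 or V3) and TabbyAPI, which exposes an
# OpenAI-style HTTP interface on top of it.
git clone https://github.com/turboderp-org/exllamav2
git clone https://github.com/theroyallab/tabbyAPI

# After downloading a quantized model, point TabbyAPI's config at it
# and start the server.
cd tabbyAPI
python start.py
```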

6. Others

If an inference engine isn’t mentioned here, it’s most likely because I didn’t have a use case to try it, didn’t like it, or don’t know it yet. A few honorable mentions:

Cheat Sheet:

  • CPU-only or mixed GPU/CPU inference? Use llama.cpp
  • CPU-only or mixed GPU/CPU inference with a GUI? Use LM Studio
  • Full model in VRAM with single-user inference? ExLlamaV2/3
  • Full model in VRAM for simple chat use cases? vLLM
  • Full model in VRAM for agent use cases? SGLang