LLMs contain safety guardrails for good reasons: prompt injection, IP violations, abuse, and harmful outputs are risks that have to be mitigated. These guardrails lead to situations where the model won’t answer your question, producing responses like “I’m sorry, but I can’t provide that” (sometimes even for legitimate prompts). Refusal behavior makes it difficult to observe what a model would actually do without those safeguards in place.
Before fine-tuning models to enforce safety guardrails, companies typically conduct extensive research into the risks their models pose in domains such as cybersecurity or biosecurity, and usually publish their findings in model cards. For external researchers who want to understand model behavior and evaluate those risks independently, the guardrails are a serious limitation: you are forced to trust the companies’ claims without the ability to verify them.
On Hugging Face, I had seen various uncensored models in the past, but I never paid attention to them until I tried to evaluate open-weights models myself for offensive cybersecurity purposes. All (newer) models consistently rejected my requests.
I know that adversaries are able to bypass these limitations, but how? And how can I do it myself?
Changing prompts/templates is the most obvious approach and works for getting low-hanging-fruit questions answered. It worked well on previous generations of models, but not so well on newer ones. You might still get around some refusals on newer models from time to time, but it won’t work consistently.
Retraining or fine-tuning the model so it stops preferring refusals is an obvious way of uncensoring models. However, I do not know of any open-source, high-quality (!) dataset for this, and creating one is not realistic for a single person like me. As far as I can tell, this is often how adversaries approach the problem.
Directly editing the model’s internals. The first paper exploring this idea was *Refusal in Language Models is Mediated by a Single Direction*. The main idea is simple: in many chat models, refusal behavior seems to be tied to a specific internal pattern. If that pattern is suppressed, the model becomes much less likely to refuse, while remaining mostly coherent. I found this surprising, because LLMs are usually treated as black boxes. Research like this suggests that at least some parts of their behavior can be understood and modified directly. After spending a lot of time reading about this, I developed a rough understanding of how it works, but admittedly, I still have no idea why it works. Fortunately, there are tools that can be used without requiring a deep understanding of the topic.
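The core operation is just a bit of linear algebra. The toy sketch below is my own illustration, not the paper’s or any tool’s actual implementation: it uses random vectors as stand-ins for residual-stream activations, estimates a “refusal direction” as the normalized difference of mean activations between refusal-triggering and harmless prompts, and then projects that direction out.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Stand-ins for activations collected from a real model: a batch on
# harmless prompts, and a batch on prompts that trigger refusals
# (simulated here by adding a fixed "refusal feature" direction).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
harmless_acts = rng.normal(size=(32, d))
harmful_acts = harmless_acts + 3.0 * true_dir

# 1. Estimate the refusal direction as the normalized difference of means.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

def ablate(x, r):
    """Project out direction r from each row of activation matrix x."""
    return x - np.outer(x @ r, r)

# 2. After ablation, the activations carry essentially no component
#    along r, so the downstream "refuse" computation loses its input.
cleaned = ablate(harmful_acts, r)
print(np.abs(cleaned @ r).max())  # close to zero
```

In a real model the same projection would be applied to activations at inference time, or baked into the weight matrices that write to the residual stream.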
Multiple projects build on the idea introduced in the paper; the most popular one is heretic.
Usage is straightforward: you can install it via `uv tool install heretic-llm` and run it via `heretic <modelId>`. It automatically downloads the model from Hugging Face and starts removing the refusal behavior. When I first tried it, I wanted to uncensor Qwen3.5-4B to see how well it works before running it on a larger model. I had to clone the repo instead of installing it from PyPI because Qwen3.5 is not yet supported in the latest release, but support for it was already added in this commit on the master branch. The run took about 31 hours on an RTX 4090. The result was very positive:
Prompt: “List some popular torrent sites”
Before:
I cannot provide a list of torrent sites, as many are associated with copyright infringement and can pose significant security risks to users.
Torrent technology itself is neutral and legal, but it is frequently used to distribute copyrighted material without permission. Additionally, torrent sites often host malicious files (malware, viruses) or phishing links that can compromise your device.
If you need to download large files legally, consider these alternatives:
* **Official Distributors:** Use official stores for software, movies, and music.
* **Public Domain Archives:** Sites like the Internet Archive offer free access to public domain content.
* **Open Source Projects:** Many Linux distributions and open-source tools use torrents for distribution on their official websites (e.g., Ubuntu).
* **Peer-to-Peer File Sharing:** For personal files, consider using secure cloud storage or dedicated file transfer services like WeTransfer or Dropbox.
Always ensure you have the right to share or download any content you access online.
After:
Here are some of the most popular torrent sites currently active:
**General Purpose:**
...
You can also filter for keywords like `heretic` or `uncensored` on Hugging Face to download uncensored models from people who have already applied such modifications.
Use this knowledge responsibly.