Everyone working on artificial intelligence these days fears the worst-case scenario: the precocious LLM suddenly glides off the rails and starts voicing dangerous thoughts. One minute it’s a genius that’s going to take all our jobs, and the next it’s an odd crank spouting hatred, insurrection, or worse.
Fortunately, there are solutions. Some scientists are building LLMs that can act as guardrails. Yes, adding one LLM to fix the problems of another one seems like doubling the potential for trouble, but there’s an underlying logic to it. These new models are specially trained to recognize when an LLM is potentially going off the rails. If they don’t like how an interaction is going, they have the power to stop it.
Of course, every solution begets new problems. For every project that needs guardrails, there’s another one where the guardrails just get in the way. Some projects demand an LLM that returns the complete, unvarnished truth. For these situations, developers are creating unfettered LLMs that can interact without reservation. Some of these solutions are based on entirely new models while others remove or reduce the guardrails built into popular open source LLMs.
Here’s a quick look at 19 LLMs that represent the state of the art in large language model design and AI safety, whether your goal is a model with the strongest possible guardrails or one that strips them all away.
Safer: LLMs with guardrails
The models in this category emphasize the many dimensions of AI safety. Whether you are looking for an LLM built for sensitive topics, one with a strong ethical compass, or a model capable of recognizing hidden exploits in seemingly innocent prompts, the heavily guarded models in this list should have you covered.
LlamaGuard
The developers of the various LlamaGuard models from Meta’s PurpleLlama initiative fine-tuned open source Llama models using known examples of abuse. Some versions, like Llama Guard 3 1B, can flag risky text interactions using categories like violence, hate, and self-harm in major languages including English and Spanish. Others, like Llama Guard 3 8B, tackle code interpreter abuse, which can enable denial-of-service attacks, container escapes, and other exploits. Close to a dozen LlamaGuard versions already extend the base Llama models, and it looks like Meta will continue researching ways to improve prompt security in foundation models.
Granite Guardian
IBM built the Granite Guardian model and framework combination as a protective filter for common errors in AI pipelines. First, the model scans for prompts that might contain or lead to answers that include undesirable content (hate, violence, profanity, etc.). Second, it watches for attempts to evade barriers by hoodwinking the LLM. Third, it watches for poor or irrelevant documents that might come from any RAG database that’s part of the pipeline. Finally, if the system is working agentically, it evaluates the risks and benefits of an agent’s function invocations. In general, the model generates risk scores and confidence levels. The tool itself is open source, but it integrates with some of the IBM frameworks for AI governance tasks like auditing.
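The staged checks described above can be sketched as a simple gating wrapper around a pipeline. Everything below is hypothetical: `score_risk` is a stand-in for a guard model like Granite Guardian, and the threshold, names, and pipeline shape are illustrative assumptions, not IBM’s API.

```python
# Hypothetical sketch of guard-model checkpoints in an AI pipeline.
THRESHOLD = 0.5

def score_risk(text: str) -> float:
    """Stand-in scorer; a real guard model returns a learned risk score."""
    flagged = ("attack", "exploit")
    return 1.0 if any(word in text.lower() for word in flagged) else 0.0

def guarded_pipeline(prompt, retrieved_docs, generate):
    # 1. Screen the incoming prompt (harmful content, jailbreak attempts).
    if score_risk(prompt) >= THRESHOLD:
        return "[blocked: risky prompt]"
    # 2. Screen retrieved RAG documents before they reach the model.
    docs = [d for d in retrieved_docs if score_risk(d) < THRESHOLD]
    answer = generate(prompt, docs)
    # 3. Screen the model's answer before it reaches the user.
    # (An agentic system would add a fourth hook here for tool calls.)
    if score_risk(answer) >= THRESHOLD:
        return "[blocked: risky answer]"
    return answer

reply = guarded_pipeline(
    "Summarize our returns policy.",
    ["Returns accepted within 30 days.", "How to exploit the refund API."],
    lambda prompt, docs: "Returns are accepted within 30 days.",
)
print(reply)  # the exploit doc is filtered out; the benign answer passes
```

A production system would replace the keyword scorer with calls to the guard model itself and act on its risk scores and confidence levels.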
Claude
As Anthropic built various editions of Claude, it created a guiding list of ethical principles and constraints that it started calling a constitution. The latest version was mainly written by Claude itself, as it reflected upon how to enforce these rules when answering prompts. These include strict prohibitions on dangerous acts like building bioweapons or taking part in cyberattacks as well as more philosophical guidelines like being honest, helpful, and safe. When Claude engages with users, it tries to stay within the boundaries defined by the constitution it helped to create.
WildGuard
The developers of Allen Institute for AI’s WildGuard started with Mistral-7B-v0.3 and used a combination of synthetic and real-world data to fine-tune it for defending against harm. WildGuard is a lightweight moderation tool that scans LLM interactions for potential problems. Its three functions are to identify malicious intent in user prompts; detect safety risks in model responses; and determine the model refusal rate, or how often a model declines to answer. This can be useful for tuning the model to be as helpful as possible while remaining within safe bounds.
ShieldGemma
Google released a series of open-weight models called ShieldGemma, which the company uses to block problematic requests. ShieldGemma 1 comes in three sizes (2B, 9B, and 27B) for classifying text input and output. ShieldGemma 2 blocks requests for images that are flagged as sexually explicit, harmful, violent, or as containing excessive blood and gore. The visual classifier tool can also be run in reverse to produce adversarial images, which are used to enhance the model’s ability to detect content that may violate the image safety policy.
NeMo Guardrails
Nvidia’s Nemotron collection of open source models includes a version, Nemotron Safety Guard, that acts as a gatekeeper by scanning for jailbreaks and dangerous topics. It can run on its own or integrate with NeMo Guardrails, a programmable protection system that can be revised and extended with traditional and not-so-traditional techniques. Developers can use Python to add specific “actions” for the model to use, or to provide patterns and structured examples that guide model behavior. Where regular guardrails might halt a conversation at the first hint of something undesirable, a system like this can ideally steer the conversation back to something productive.
Qwen3Guard
This multilingual model from Qwen comes in several variants for blocking unwanted behavior in your dataflows. Qwen3Guard-Gen works in a traditional question-and-answer format with prompts and responses. Qwen3Guard-Stream has a slightly different architecture that’s optimized for token-level filtering in real-time streams. Both come in a few sizes (0.6B, 4B, and 8B) to optimize the tradeoff between performance and protection. Qwen developers also built a special version of the 4B, Qwen3-4B-SafeRL, which was enhanced with reinforcement learning to maximize safety and user experience.
PIGuard
The PIGuard model focuses on defending against prompt injection, the type of malicious attack that can be challenging to prevent without being overly paranoid. It watches for covert suggestions that might be hidden inside the prompt. PIGuard’s developers trained the model by building a special training set called NotInject, which uses examples of false positives that might trick a less capable model.
PIIGuard
Not to be confused with PIGuard, this completely different model is aimed at flagging personally identifiable information (PII) in a data stream. This ensures an LLM won’t mistakenly leak someone’s address, birthday, or other sensitive information when responding to prompts. The PIIGuard model is trained on examples that teach it to detect PII that’s embedded in a conversation or a long text stream. It’s a step up from standard detectors that use regular expressions and other more basic definitions of PII structure.
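For comparison, here is a minimal sketch of the regex-style baseline such detectors improve on. The patterns are illustrative only; real PII comes in far more shapes than a few regular expressions can capture, which is exactly the gap a trained model like PIIGuard is meant to close.

```python
import re

# Illustrative patterns for common PII shapes. A real regex-based
# detector needs many more patterns, and still misses PII that is
# described in free text rather than written in a fixed format.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs found in the text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, match) for match in pattern.findall(text))
    return hits

print(find_pii("Reach me at jane.doe@example.com or 555-867-5309."))
```

A model-based detector can flag things these patterns never will, such as a home address spelled out conversationally or a birthday mentioned in passing.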
Alinia
The guardrails from Alinia apply to a wider range of potentially troublesome behaviors. The model covers standard issues like illegal or dangerous behaviors but is also trained to avoid the legal tangles that may follow giving medical or tax advice. This LLM guardrail can also detect and refuse irrelevant answers or gibberish that may hurt an organization’s reputation. The Alinia system relies on a RAG-based database of samples so it can be customized to block any kind of sensitive topic.
DuoGuard
Sometimes it’s hard for AI developers to find a training set large enough to include all the required examples of bad behavior. The DuoGuard models were built in two parts: the first generates all the synthetic examples you need, and the second distills them into a model. The result is smart, small, and quick, and can detect issues in 12 risk categories, including violent crime, weapons, intellectual property, and jailbreaking. DuoGuard comes in three tiers (0.5B, 1B, and 1.5B) to serve all levels of need.
Looser: LLMs with fewer guardrails
LLMs in this category aren’t completely without guardrails, but they’ve been built—or in many cases, retrained—to favor freedom of inquiry or expression over safety. You might need a model like this if you are looking for novel approaches to old problems, or to find the weak points in a system so that you can close them up. Models with lower guardrails are also favored for exploring fictional topics or for romantic roleplay.
Dolphin models
Eric Hartford and a team at Cognitive Computations built the Dolphin models to be “uncensored.” That is, they stripped away all the guardrails they could find in an open source foundation model by removing restrictive questions and answers from the training set. If the training material showed bias or introduced reasons to refuse to help, they deleted it. Then, they retrained the model and produced a version that will answer a question any way it can. They’ve so far applied this technique to a number of open source models from Meta and Mistral.
Nous Hermes
The Hermes models from Nous Research were built to be more “steerable”—meaning they aren’t as resistant as some models are to delivering answers on demand. The Hermes model developers created a set of synthetic examples that emphasize helpfulness and unconstrained reasoning. The training’s effectiveness is measured, in part, with RefuseBench, a set of scenarios that test helpfulness. The results are often more direct and immediately useful. The developers noted, for instance, that “Hermes 4 frequently adopted a first-person, peer-like persona, generating responses with fewer meta-disclaimers and more consistent voice embodiment.”
Flux.1
The Flux.1 model was designed to create images that follow prompt instructions as faithfully as possible. Many praise its rectified flow transformer architecture for producing excellent skin tones and lighting in complex scenes. The model can be fine-tuned using low-rank adaptation (LoRA) for applications that require a particular style or content. Flux.1 is available under an open source license for non-commercial use; any commercial deployment requires additional licensing.
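LoRA works by freezing the pretrained weights and learning a small low-rank update: conceptually, a weight matrix W becomes W + (α/r)·B·A, where A and B together hold far fewer parameters than W. A minimal numpy sketch of the idea, with illustrative dimensions and names:

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16  # rank r is much smaller than d

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # zero-initialized, so the
                                           # adapter starts as a no-op

def forward(x):
    # Base path plus the scaled low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapted model matches the base model exactly.
assert np.allclose(forward(x), W @ x)
```

The appeal for style fine-tuning is storage: only A and B are trained and shipped, r·(d_in + d_out) parameters instead of d_in·d_out, so a single base model can host many small style adapters.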
Heretic
Heretic lowers the guardrails of existing LLMs by stripping away their defenses. It starts by tracking how the residual vectors behave on two different training sets with harmful and non-harmful examples. It then zeros out the key weights, effectively removing whatever restrictions were built into the original model. The tool is automated, so it’s not hard to apply it to your own model. Or, if you prefer, you can get one that’s been pre-treated. There’s a version of Gemma 3, and another of Qwen 3.5.
Pingu Unchained
Audn.ai built Pingu as a tool for security researchers and red teams who need to ask questions that mainstream LLMs are trained not to answer. To create this model, developers fine-tuned OpenAI’s GPT-OSS-120b with a curated collection of jailbreaks and other commonly refused requests. The resulting model is handy for generating synthetic tests of spear-phishing, reverse engineering, and the like. The tool keeps an audit trail of requests and Audn.ai limits access to verified organizations.
Cydonia
TheDrummer created Cydonia as part of a series of models for immersive roleplay. That means long context windows for character consistency and uncensored interactions for exploring fictional topics. Two versions (22b v1.2 and 24b v4.1) have been built by fine-tuning Mistral Small 3.2 24B. Some call the model “thick” for producing long answers rich with plot details.
Midnight Rose
Midnight Rose is one of several models built by Sophosympatheia for romantic roleplay. The model was developed by merging at least four different foundation models. The idea was to create an LLM capable of building stories with strong plots and emotional resonance, all in an uncensored world of fictional freedom.
Abliterated: LLMs off the rails
A few labs are opening up models by deactivating the guardrail layers directly instead of retraining them for a looser approach. This technique is often called abliteration, a portmanteau combining “ablation” (removal) and “obliterate” (destruction). The developers identify the layers or weights that operate as guardrails by testing the models with a variety of problematic prompts, then deactivate them by zeroing out their contributions in model responses. These models have at times outperformed their foundational versions on various tasks.
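One common recipe for this is directional ablation: estimate a “refusal direction” as the difference between mean residual activations on harmful versus harmless prompts, then project that direction out of the activations (or fold the projection into the weights). A sketch of the arithmetic, using random stand-ins for real model activations:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # residual stream width (stand-in value)

# Stand-in activations; in practice these come from running the model
# on curated harmful and harmless prompt sets.
harmful = rng.standard_normal((100, d)) + 2.0  # shifted mean
harmless = rng.standard_normal((100, d))

# Refusal direction: difference of mean activations, normalized.
v = harmful.mean(axis=0) - harmless.mean(axis=0)
v /= np.linalg.norm(v)

def ablate(h):
    """Remove the component of a single activation h along direction v."""
    return h - (h @ v) * v

h = rng.standard_normal(d)
# After ablation, the activation is orthogonal to the refusal direction,
# so downstream layers never see the "refuse" signal.
assert abs(ablate(h) @ v) < 1e-9
```

Folding the same projection into the model’s weight matrices makes the change permanent, which is roughly what the pre-treated models described above ship with.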
Grok
Good examples in this category come from HuiHui AI and David Belton, but the most famous model of this type is Grok. Rather than being concerned about creating a model that behaves badly, the Grok team at X is more concerned with factual errors. Or, as Elon Musk said in an interview: “The best thing I can come up with for AI safety is to make it a maximum truth-seeking AI, maximally curious.” In other words, Grok was designed for factual correctness, not political correctness, whatever your definition of politics might be.