LLM Guardrails: Types of Guards
This post is co-authored by John Gilhuly
In our previous post on LLM guardrails, we looked into the importance of guardrails for LLMs. We explored what guardrails are, their primary use cases, and demonstrated how to set them up with just a few lines of code. Now, we’re focusing on the various types of guardrails you can implement to ensure your AI applications remain safe, ethical, and effective.
The Balancing Act of LLM Guards
Implementing guardrails for AI systems is a delicate balancing act. While these safety measures are important for responsible AI deployment, finding the right configuration is necessary to maintain both functionality and security.
To effectively manage your guards, we strongly recommend using specialized tools such as Guardrails AI or NemoAI (for an example that leverages Arize-Phoenix with Guardrails AI to set up a guard that blocks an LLM from responding from attempted jailbreaks, check out this notebook). Although it’s technically possible to implement each guard type manually, this approach quickly becomes unwieldy as your system’s complexity grows.
On the other hand, it’s important to resist the temptation to over-index on guards. It may seem prudent to implement every conceivable safety measure, but this approach can be counterproductive. Excessive guardrails risk losing the intent of the user’s initial request or the value of the app’s output. Instead, we advise starting with the most critical guards and expanding judiciously as needed. Tools like Arize’s AI search can be helpful in identifying clusters of problematic inputs to allow for targeted guard additions over time.
Additionally, more sophisticated guards that rely on their own LLM calls introduce additional costs and latency which must also be considered. To mitigate these issues, consider using small language models like GPT 4o mini or Gemma-2b for auxiliary calls.
What Are the Types of LLM Guardrails?
AI guardrails span input validation and sanitization — like syntax and format checks, content filtering, jailbreak attempt detection — and output monitoring and filtering, which can prevent damage, ensure performance, or evolve over time through dynamic guards. Let’s explore the main categories of guards and their applications.
Input Validation and Sanitization
Input validation and sanitization serve as the first line of defense in AI safety. These guards ensure that the data fed into your model is appropriate, safe, and in the correct format.
Syntax and Format Checks
While basic, these checks are important for maintaining system integrity. They verify that the input adheres to the expected format and structure. For instance, if your model expects certain parameters, what happens when they’re missing? Consider a scenario where your RAG retriever fails to return documents, or your structured extractor pulls the wrong data. Is your model prepared to handle this malformed request? Implementing these checks helps prevent errors and ensures smooth operation.
Content Filtering
This guard type focuses on removing sensitive or inappropriate content before it reaches the model. Detecting and removing personally identifiable information can help avoid potential privacy issues, and filtering NSFW or toxic language can ensure more appropriate responses from your LLM. We recommend implementing this guard cautiously – overzealous filtering might inadvertently alter the user’s original intent. Often, these types of guards are better suited filtering the outputs of your application rather than the inputs.
Jailbreak Attempt Detection
These are the guards that prevent massive security breaches and keep your company out of news headlines. Many collections of jailbreak prompts are available, and even advanced models can fail on up to 40% of these publicly-documented attacks. As these attacks constantly evolve, implementing effective guards can be challenging; we recommend using an embedding-based guard like Arize’s, which can adapt to changing strategies. At minimum, use a guard connected to a common library of prompt injection prompts, such as Rebuff.
Output Monitoring and Filtering
Output guards generally fall into two categories: preventing damage, and ensuring performance.
Preventing Damage
Examples of this include:
- System Prompt Protection: Some attacks try to expose the prompt templates your system uses. Adding a guard to detect system prompt language in your outputs can mitigate this risk. Just be sure to avoid exposing this same template within your guard’s code!
- NSFW or Harmful Language Detection: Allowing this type of language in your app’s responses can be extremely harmful to user experience and your brand. Use guards to help identify this language.
- Competitor Mentions: Depending on your use case, mentioning competitors might be undesirable. Guards can be set up to filter out such references.
Ensuring Performance
When it comes to performance, developers face a choice between using guards to improve your app’s output in real-time or running offline evaluations to optimize your pipeline or prompt template. Real-time guards introduce more latency and cost but offer immediate improvements. Offline evaluations allow for pipeline optimization without added latency, though there may be a delay between issue discovery and resolution. We recommend starting with offline evaluations and only adding performance guards if absolutely necessary.
- Hallucination Prevention: Guards can prevent hallucinations by comparing outputs with reference texts or, when unavailable, cross-referencing with reliable sources like Wikipedia.
- Critic Guards: This broad category involves using a separate LLM to critique and improve your pipeline’s output before sending it to the user. These can be instructed to focus on relevancy, conciseness, tone, and other aspects of the response.
Dynamic Guards
The final category of guards involves using your own data to augment existing guards. This allows guards to evolve based on your system’s specific needs and usage patterns. We’ll explore this advanced topic in more detail in our final blog post of this series.
What Strategies and Techniques Best Complement AI Guardrails?
While guardrails are important for AI safety, they’re not the only measures you should consider.
Fence Your App from Other Systems
Isolating your AI application from other systems and networks creates an additional layer of security, limiting potential vulnerabilities and preventing unauthorized access or data leakage. Implement strict access controls and use secure APIs for necessary inter-system communications.
Red Team Pre-Launch
Before deploying your AI system, conduct thorough red team exercises. This involves having a dedicated team attempt to break, manipulate, or exploit your system in ways that malicious actors might. These simulated attacks can reveal vulnerabilities that weren’t apparent during development and allow you to address them before public release.
Monitor Your App Post-Launch
Perhaps the most critical strategy is continuous LLM production monitoring after deployment. No amount of pre-launch testing can anticipate all real-world scenarios. Implement robust logging and monitoring systems to track your app’s performance, user interactions, and potential issues, and regularly analyze this data to identify patterns, anomalies, or emerging problems. This will allow you to refine your guardrails and respond swiftly to any security concerns.
Next Steps
Questions? Feel free to reach out in the Arize community.