Production
Chapter Summary
Focused on safeguarding deployed applications, this section introduces guardrails to mitigate risks like hallucinations, toxic outputs, and security vulnerabilities. The balance between safety and functionality is emphasized, alongside strategies for implementing dynamic and adaptive guards.
Learn how to protect your applications and implement real-time AI guardrails by reading our product documentation.
Guardrails
As LLM applications become more common, so too do jailbreak attempts, exploitations of these apps, and harmful responses. More and more companies are falling prey to damaging news stories driven by their chatbots selling cars for $1, writing poems critical of their owners, or dealing out disturbing replies.
Fortunately, there is a solution to this problem: LLM guardrails. LLM guardrails allow you to protect your application from potentially harmful inputs, and block damaging outputs before they’re seen by a user. As LLM jailbreak attempts become more common and more sophisticated, having a robust guardrails approach is critical.
LLM guardrails work in real-time to either catch dangerous user inputs or screen model outputs. There are many different types of guards that can be employed, each specializing in a different potential type of harmful input or output.
Common input guard use cases include:
- Detecting and blocking jailbreak attempts
- Preventing prompt injection attempts
- Removing user personally identifiable information (PII) before it reaches a model
Common output guard use cases include:
- Removing toxic or hallucinated responses
- Removing mentions of a competitor’s product
- Screening for relevancy in responses
- Removing NSFW text
Balancing Act of Guards
Implementing guardrails for AI systems is a delicate balancing act. While these safety measures are important for responsible AI deployment, finding the right configuration is necessary to maintain both functionality and security.
It’s important to resist the temptation to over-index on guards. It may seem prudent to implement every conceivable safety measure, but this approach can be counterproductive. Excessive guardrails risk losing the intent of the user’s initial request or the value of the app’s output. Instead, we advise starting with the most critical guards and expanding judiciously as needed. Tools like Arize’s AI search can be helpful in identifying clusters of problematic inputs to allow for targeted guard additions over time.
“Excessive guardrails risk losing the intent of the user’s initial request or the value of the app’s output.”
Types of Guards
AI guardrails span input validation and sanitization — like syntax and format checks, content filtering, jailbreak attempt detection — and output monitoring and filtering, which can prevent damage, ensure performance, or evolve over time through dynamic guards. Let’s explore the main categories of guards and their applications.
Input Validation and Sanitization
Input validation and sanitization serve as the first line of defense in AI safety. These guards ensure that the data fed into your model is appropriate, safe, and in the correct format.
“Input validation and sanitization serve as the first line of defense in AI safety.”
Syntax and Format Checks
While basic, these checks are important for maintaining system integrity. They verify that the input adheres to the expected format and structure. For instance, if your model expects certain parameters, what happens when they’re missing? Consider a scenario where your RAG retriever fails to return documents, or your structured extractor pulls the wrong data. Is your model prepared to handle this malformed request? Implementing these checks helps prevent errors and ensures smooth operation.
Content Filtering
This guard type focuses on removing sensitive or inappropriate content before it reaches the model. Detecting and removing personally identifiable information can help avoid potential privacy issues, and filtering NSFW or toxic language can ensure more appropriate responses from your LLM. We recommend implementing this guard cautiously – overzealous filtering might inadvertently alter the user’s original intent. Often, these types of guards are better suited filtering the outputs of your application rather than the inputs.
Jailbreak Attempt Detection
These are the guards that prevent massive security breaches and keep your company out of news headlines. Many collections of jailbreak prompts are available, and even advanced models can fail on up to 40% of these publicly-documented attacks. As these attacks constantly evolve, implementing effective guards can be challenging; we recommend using an embedding-based guard like Arize’s, which can adapt to changing strategies. At minimum, use a guard connected to a common library of prompt injection prompts, such as Rebuff.
Output Monitoring and Filtering
Output guards generally fall into two categories: preventing damage, and ensuring performance.
Preventing Damage
Examples of this include:
-
System Prompt Protection: Some attacks try to expose the prompt templates your system uses. Adding a guard to detect system prompt language in your outputs can mitigate this risk. Just be sure to avoid exposing this same template within your guard’s code!
-
NSFW or Harmful Language Detection: Allowing this type of language in your app’s responses can be extremely harmful to user experience and your brand. Use guards to help identify this language.
-
Competitor Mentions: Depending on your use case, mentioning competitors might be undesirable. Guards can be set up to filter out such references.
Ensuring Performance
When it comes to performance, developers face a choice between using guards to improve your app’s output in real-time or running offline evaluations to optimize your pipeline or prompt template. Real-time guards introduce more latency and cost but offer immediate improvements. Offline evaluations allow for pipeline optimization without added latency, though there may be a delay between issue discovery and resolution. We recommend starting with offline evaluations and only adding performance guards if absolutely necessary.
-
Hallucination Prevention: Guards can prevent hallucinations by comparing outputs with reference texts or, when unavailable, cross-referencing with reliable sources like Wikipedia.
-
Critic Guards: This broad category involves using a separate LLM to critique and improve your pipeline’s output before sending it to the user. These can be instructed to focus on relevancy, conciseness, tone, and other aspects of the response.
Dynamic Guards
Manually updating guards to counter new threats is a near impossible task, quickly becoming unsustainable as attack vectors multiply. Fortunately, two approaches allow us to create adaptive guards that can keep pace with emerging threats: few-shot prompting and embedding-based guards.
While static guards are great at filtering out predefined content like NSFW language, they struggle when faced with sophisticated attacks like jailbreak attempts, prompt injection, and more. These dynamic threats require equally dynamic defenses that can evolve alongside the attackers’ strategies.
Few-Shot Prompting
This technique involves adding examples of recent jailbreak attempts or other attacks directly into your guard’s prompt. By exposing the guard to real-world attack patterns, you improve its ability to recognize and thwart similar threats in the future.
This is a more sophisticated approach to dynamic protection. It involves comparing the embedding of a user’s input against a database of known attack embeddings. By checking the similarity between these representations, the system can identify and block potentially malicious inputs that exceed a predefined similarity threshold. Arize has developed an easy-to-use but powerful implementation of this concept with our ArizeDatasetEmbeddings Guard.
While dynamic guards offer powerful protection, they come with a few considerations:
- Increased computational cost due to larger models or embedding generation.
- Higher latency, potentially impacting response times.
- The need for ongoing maintenance and updates to the attack prompt database
Dataset Embeddings Guard | Few Shot LLM Guard | General LLM Guard |
---|---|---|
Advantages | ||
Customizable: Customize this Guard to your specific Easy to update: Tackle drift by updating the Guard Low latency / cost: Does not rely on a large model to |
Customizable: Same as Dataset Embeddings Easy to update: Same as Dataset Embeddings Fewer false positives: Less likely to produce false |
Customizable: Instantiate with a custom prompt for a Opportunity to optimize performance: Try our prompt |
Disadvantages | ||
---|---|---|
Performance depends on the quality of the source dataset LLM evaluator call introduces cost and latency |
Performance depends on the quality of the source dataset LLM evaluator call introduces cost and latency |
Performance depends on the quality of the prompt LLM evaluator call introduces cost and latency Prompt may need to be continually re-engineered as models and use |
It’s important to weigh these factors against the level of protection required for your specific use case.
Download this article
Join the Arize community and continue your journey into LLM evaluation.