Production

Chapter Summary

Focused on safeguarding deployed applications, this section introduces guardrails to mitigate risks like hallucinations, toxic outputs, and security vulnerabilities. The balance between safety and functionality is emphasized, alongside strategies for implementing dynamic and adaptive guards.

Learn how to protect your applications and implement real-time AI guardrails by reading our product documentation.

Guardrails

As LLM applications become more common, so too do jailbreak attempts, exploitations of these apps, and harmful responses. More and more companies are falling prey to damaging news stories driven by their chatbots selling cars for $1, writing poems critical of their owners, or dealing out disturbing replies.

Fortunately, there is a solution to this problem: LLM guardrails. LLM guardrails allow you to protect your application from potentially harmful inputs, and block damaging outputs before they’re seen by a user. As LLM jailbreak attempts become more common and more sophisticated, having a robust guardrails approach is critical.

LLM guardrails work in real-time to either catch dangerous user inputs or screen model outputs. There are many different types of guards that can be employed, each specializing in a different potential type of harmful input or output.

Common input guard use cases include:

  • Detecting and blocking jailbreak attempts
  • Preventing prompt injection attempts
  • Removing user personally identifiable information (PII) before it reaches a model

Common output guard use cases include:

  • Removing toxic or hallucinated responses
  • Removing mentions of a competitor’s product
  • Screening for relevancy in responses
  • Removing NSFW text
Detections-and-actions

Balancing Act of Guards

Implementing guardrails for AI systems is a delicate balancing act. While these safety measures are important for responsible AI deployment, finding the right configuration is necessary to maintain both functionality and security.

It’s important to resist the temptation to over-index on guards. It may seem prudent to implement every conceivable safety measure, but this approach can be counterproductive. Excessive guardrails risk losing the intent of the user’s initial request or the value of the app’s output. Instead, we advise starting with the most critical guards and expanding judiciously as needed. Tools like Arize’s AI search can be helpful in identifying clusters of problematic inputs to allow for targeted guard additions over time.

“Excessive guardrails risk losing the intent of the user’s initial request or the value of the app’s output.”

Types of Guards

AI guardrails span input validation and sanitization — like syntax and format checks, content filtering, jailbreak attempt detection — and output monitoring and filtering, which can prevent damage, ensure performance, or evolve over time through dynamic guards. Let’s explore the main categories of guards and their applications.

Input Validation and Sanitization

Input validation and sanitization serve as the first line of defense in AI safety. These guards ensure that the data fed into your model is appropriate, safe, and in the correct format.

“Input validation and sanitization serve as the first line of defense in AI safety.”

Syntax and Format Checks

While basic, these checks are important for maintaining system integrity. They verify that the input adheres to the expected format and structure. For instance, if your model expects certain parameters, what happens when they’re missing? Consider a scenario where your RAG retriever fails to return documents, or your structured extractor pulls the wrong data. Is your model prepared to handle this malformed request? Implementing these checks helps prevent errors and ensures smooth operation.

Content Filtering

This guard type focuses on removing sensitive or inappropriate content before it reaches the model. Detecting and removing personally identifiable information can help avoid potential privacy issues, and filtering NSFW or toxic language can ensure more appropriate responses from your LLM. We recommend implementing this guard cautiously – overzealous filtering might inadvertently alter the user’s original intent. Often, these types of guards are better suited filtering the outputs of your application rather than the inputs.

Jailbreak Attempt Detection

These are the guards that prevent massive security breaches and keep your company out of news headlines. Many collections of jailbreak prompts are available, and even advanced models can fail on up to 40% of these publicly-documented attacks. As these attacks constantly evolve, implementing effective guards can be challenging; we recommend using an embedding-based guard like Arize’s, which can adapt to changing strategies. At minimum, use a guard connected to a common library of prompt injection prompts, such as Rebuff.

Output Monitoring and Filtering

Output guards generally fall into two categories: preventing damage, and ensuring performance.

Preventing Damage

Examples of this include:

  • System Prompt Protection: Some attacks try to expose the prompt templates your system uses. Adding a guard to detect system prompt language in your outputs can mitigate this risk. Just be sure to avoid exposing this same template within your guard’s code!

  • NSFW or Harmful Language Detection: Allowing this type of language in your app’s responses can be extremely harmful to user experience and your brand. Use guards to help identify this language.

  • Competitor Mentions: Depending on your use case, mentioning competitors might be undesirable. Guards can be set up to filter out such references.

Ensuring Performance

When it comes to performance, developers face a choice between using guards to improve your app’s output in real-time or running offline evaluations to optimize your pipeline or prompt template. Real-time guards introduce more latency and cost but offer immediate improvements. Offline evaluations allow for pipeline optimization without added latency, though there may be a delay between issue discovery and resolution. We recommend starting with offline evaluations and only adding performance guards if absolutely necessary.

  • Hallucination Prevention: Guards can prevent hallucinations by comparing outputs with reference texts or, when unavailable, cross-referencing with reliable sources like Wikipedia.

  • Critic Guards: This broad category involves using a separate LLM to critique and improve your pipeline’s output before sending it to the user. These can be instructed to focus on relevancy, conciseness, tone, and other aspects of the response.

Dynamic Guards

Manually updating guards to counter new threats is a near impossible task, quickly becoming unsustainable as attack vectors multiply. Fortunately, two approaches allow us to create adaptive guards that can keep pace with emerging threats: few-shot prompting and embedding-based guards.

While static guards are great at filtering out predefined content like NSFW language, they struggle when faced with sophisticated attacks like jailbreak attempts, prompt injection, and more. These dynamic threats require equally dynamic defenses that can evolve alongside the attackers’ strategies.

Few-Shot Prompting

This technique involves adding examples of recent jailbreak attempts or other attacks directly into your guard’s prompt. By exposing the guard to real-world attack patterns, you improve its ability to recognize and thwart similar threats in the future.

This is a more sophisticated approach to dynamic protection. It involves comparing the embedding of a user’s input against a database of known attack embeddings. By checking the similarity between these representations, the system can identify and block potentially malicious inputs that exceed a predefined similarity threshold. Arize has developed an easy-to-use but powerful implementation of this concept with our ArizeDatasetEmbeddings Guard.

While dynamic guards offer powerful protection, they come with a few considerations:

  • Increased computational cost due to larger models or embedding generation.
  • Higher latency, potentially impacting response times.
  • The need for ongoing maintenance and updates to the attack prompt database

 

Dataset Embeddings Guard Few Shot LLM Guard General LLM Guard
Advantages

Customizable: Customize this Guard to your specific
use case by providing few shot examples from real customer chats

Easy to update: Tackle drift by updating the Guard
with new few shot examples as the models and failure modes evolve over
time

Low latency / cost: Does not rely on a large model to
evaluate the input message

Customizable: Same as Dataset Embeddings

Easy to update: Same as Dataset Embeddings

Fewer false positives: Less likely to produce false
positives, e.g. can differentiate between
jailbreak attempts and role-play

Customizable: Instantiate with a custom prompt for a
specific use case

Opportunity to optimize performance: Try our prompt
playground to optimize performance via prompt engineering against a
golden dataset

Disadvantages

Performance depends on the quality of the source dataset

LLM evaluator call introduces cost and latency

Performance depends on the quality of the source dataset

LLM evaluator call introduces cost and latency

Performance depends on the quality of the prompt

LLM evaluator call introduces cost and latency

Prompt may need to be continually re-engineered as models and use
cases evolve

 

It’s important to weigh these factors against the level of protection required for your specific use case.