l

Advanced Guards: Moving To Dynamic Guardrails for LLM Applications

Evan Jolley
Evan Jolley,  Contributor  | Published August 08, 2024

This piece is co-authored by John Gilhuly

Throughout this series, we looked into guardrails for LLMs, covering their purpose, variations, and implementation. In this final blog post, we investigate the most advanced type of protection: dynamic guards.

What Are Dynamic Guards?

While static guards are great at filtering out predefined content like NSFW language, they struggle when faced with sophisticated attacks like jailbreak attempts, prompt injection, and more. These dynamic threats require equally dynamic defenses that can evolve alongside the attackers’ strategies.

Manually updating guards to counter new threats is a near impossible task, quickly becoming unsustainable as attack vectors multiply. Fortunately, two approaches allow us to create adaptive guards that can keep pace with emerging threats: few-shot prompting and embedding-based guards.

llm guards flow advanced

Few-Shot Prompting

This technique involves adding examples of recent jailbreak attempts or other attacks directly into your guard’s prompt. By exposing the guard to real-world attack patterns, you improve its ability to recognize and thwart similar threats in the future.

Embedding-Based Guards

This is a more sophisticated approach to dynamic protection. It involves comparing the embedding of a user’s input against a database of known attack embeddings. By checking the similarity between these representations, the system can identify and block potentially malicious inputs that exceed a predefined similarity threshold. Arize has developed an easy-to-use but powerful implementation of this concept with our ArizeDatasetEmbeddings Guard.

Arize’s ArizeDatasetEmbeddings Guard

The ArizeDatasetEmbeddings guard follows these steps:

  1. Start with a set of example attacks and generate embeddings for each.
  2. When a new prompt arrives, chunk the input and create embeddings for each chunk.
  3. Calculate the cosine distance between the prompt chunk embeddings and the example embeddings.
  4. If any distance falls below a user-defined threshold (default: 0.2), the guard intercepts the call.

By default, this guard uses 10 examples from a public jailbreak prompt dataset. However, you can customize it with your own data using the sources={} parameter, allowing you to fine-tune the guard based on attacks specific to your application.

Both few-shot prompting and embedding-based guards require an up-to-date collection of attack prompts. You can either gather these yourself through your application’s usage or tap into online repositories. While we won’t link directly to them here, platforms like Reddit and Twitter (X) host frequently updated collections of attack prompts that can be valuable resources for training your guards.

Performance

We tested the ArizeDatasetEmbeddings guard against a dataset of jailbreak attempts and regular prompts, and the results speak for themselves. Crucially, the guard keeps false negatives low (shown in the top right quadrant below), preventing missed attacks.

embedding guardrail performance benchmarks confusion matrix

True Positives 86.43% of 656 jailbreak prompts were successfully blocked
False Negatives 13.57% of jailbreak prompts slipped through.
False Positives 13.95% of 2000 regular prompts were incorrectly flagged.
True Negatives 86.05% of regular prompts passed correctly.
Median latency 1.41 seconds for end-to-end LLM call on GPT-3.5.

When To Use Dynamic Guards?

While dynamic guards offer powerful protection, they come with a few considerations:

  • Increased computational cost due to larger models or embedding generation.
  • Higher latency, potentially impacting response times.
  • The need for ongoing maintenance and updates to the attack prompt database.

It’s important to weigh these factors against the level of protection required for your specific use case.

Wrapping Up

We hope this series has provided you with a comprehensive understanding of the guardrail landscape for LLMs. From basic content filtering to sophisticated dynamic defenses, you now have the knowledge to implement robust safety measures for your AI applications.

Remember, the field of AI safety is rapidly evolving. Staying informed and continuously adapting your defenses is the only way to maintain secure and responsible AI systems.

Have questions about guards or LLMOps or AgentSREs in general? Join our community Slack to connect with experts and fellow practitioners. Together, we can build safer, more reliable AI systems for everyone.