LLM Guardrails: Protecting Your AI Application, Including From Itself
This is the first in a series detailing how you can secure your LLM apps
As LLM applications become more common, so do jailbreak attempts, exploits, and harmful responses. More and more companies are falling prey to damaging news stories driven by their chatbots selling cars for $1, writing poems critical of their owners, or dealing out disturbing replies.
Fortunately, there is a solution to this problem: LLM guardrails.
What Are LLM Guardrails?
A key part of any modern large language model (LLM) application, LLM guardrails allow you to protect your application from potentially harmful inputs, and block damaging outputs before they’re seen by a user. As LLM jailbreak attempts become more common and more sophisticated, having a robust guardrails approach is critical.
Let’s dive into how guardrails work, how they can be traced and acted upon, and how you can use them to avoid becoming the next big news story.
How Do Guardrails Work?
LLM guardrails work in real-time to either catch dangerous user inputs or screen model outputs. There are many different types of guards that can be employed, each specializing in a different potential type of harmful input or output.
What Are the Primary Use Cases for Guardrails In AI Development?
Common input guard use cases include:
- Detecting and blocking jailbreak attempts
- Preventing prompt injection attempts
- Removing user personally identifiable information (PII) before it reaches a model
Common output guard use cases include:
- Removing toxic or hallucinated responses
- Removing mentions of a competitor’s product
- Screening for relevancy in responses
- Removing NSFW text
There can be a lot of ground to cover. Fortunately, tools like Guardrails AI offer a hub of different guards that can be added to your application.
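To make this concrete, here is a minimal sketch of a guard that pairs an input check with an output check. It assumes the DetectPII and ToxicLanguage validators have already been installed from the Guardrails hub; the validator names and parameters here are illustrative, so swap in whichever guards fit your application.
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage

guard = (
    Guard()
    # Input guard: catch personally identifiable information in the user prompt
    .use(DetectPII, pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on="prompt", on_fail="fix")
    # Output guard: block toxic model responses before a user ever sees them
    .use(ToxicLanguage, threshold=0.5, validation_method="sentence", on_fail="exception")
)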
If a message in an LLM chat fails a guard, the guard can take one of several corrective actions: providing a default response, re-prompting the LLM for a new response, or throwing an exception. For a guard that detects responses that might damage your company’s reputation, regenerating the response may work just fine. However, for a guard that detects jailbreak attempts, a default response may be more appropriate.
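In Guardrails AI, that choice is controlled by the on_fail argument you pass when creating a guard. Here is a rough sketch, using the hub’s ToxicLanguage validator purely for illustration:
from guardrails import Guard
from guardrails.hub import ToxicLanguage

# Re-prompt the LLM for a fresh response whenever the guard fails
reask_guard = Guard().use(ToxicLanguage, on_fail="reask")

# Raise an exception instead, so the application can catch it and serve a default response
strict_guard = Guard().use(ToxicLanguage, on_fail="exception")
Other on_fail actions, such as fix, filter, and noop, are also available depending on the validator and your Guardrails version.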
How To Use Guardrails
A variety of packages are available. Arize offers an integration with Guardrails AI to give you best-in-class observability alongside top-of-the-line security.
Here are the basic steps necessary to use Guardrails with Arize. If you’re looking for a more in-depth example, check out this tutorial notebook walking through how to enable Guards on a RAG pipeline.
Import Packages and Initialize Arize
First, install and import the necessary packages and connect your application to your Arize dashboard:
pip install opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-guardrails guardrails-ai arize-otel
from getpass import getpass

from arize_otel import register_otel, Endpoints
from openinference.instrumentation.guardrails import GuardrailsInstrumentor

# Setup OTEL via our convenience function
register_otel(
    endpoints = Endpoints.ARIZE,
    space_key = getpass("🔑 Enter your Arize space key from the space settings page of the Arize UI: "),
    api_key = getpass("🔑 Enter your Arize API key from the space settings page of the Arize UI: "),
    model_id = "sales-demo-dataset-embeddings-guard",  # name this whatever you would like
)
# Use the Arize autoinstrumentor to add tracing to your application
GuardrailsInstrumentor().instrument(skip_dep_check=True)
Prepare Your Guards
Next, add whichever guards you want to use to your project. These can be installed directly from the Guardrails hub. For this example, we’ll use our ArizeDatasetEmbeddings guard.
guardrails hub install hub://arize-ai/dataset_embeddings_guardrails
Initialize Guardrails and Add Your Guard
Now we are ready to initialize our guard. Here we can specify whether this guard will act on prompts or responses and what it should do if it catches a bad input or output. We also disable Guardrails tracing in this step, as we’re using Arize to view our telemetry.
from guardrails import Guard
from guardrails.hub import ArizeDatasetEmbeddings

guard = Guard().use(ArizeDatasetEmbeddings, on="prompt", on_fail="exception")
guard._disable_tracer = True
Make Calls to Your LLM
With that, we’re ready to make protected calls to our models. We can do this by invoking our Guard:
import openai

prompt = "Tell me about the latest sales figures."  # example user prompt

validated_response = guard(
    llm_api=openai.chat.completions.create,
    prompt=prompt,
    model="gpt-3.5-turbo",
    max_tokens=1024,
    temperature=0.5,
)
Based on our earlier setup, our guard handles any failed validations for us, whether that means re-prompting the model, returning a default response, or, as configured here, raising an exception for our application to handle.
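For example, since we chose on_fail="exception" above, a blocked prompt surfaces as an exception that our application can catch and replace with a safe default. Here is a rough sketch, assuming recent Guardrails versions raise ValidationError from guardrails.errors:
try:
    from guardrails.errors import ValidationError  # import path may vary across Guardrails versions
except ImportError:
    ValidationError = Exception  # fall back to a broad catch if the path differs

try:
    validated_response = guard(
        llm_api=openai.chat.completions.create,
        prompt=prompt,
        model="gpt-3.5-turbo",
    )
except ValidationError:
    # The guard flagged this prompt, so serve a safe default response instead
    validated_response = "Sorry, I can't help with that request."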
How To View Guard and Trace Data
If you’ve followed the steps so far, you should already see trace data in your project.
Each guard gets its own span with details on whether it passed or failed, the thresholds it used, and the input or output messages it evaluated.
Next Steps
Now that you have your first guard connected and sending data to Arize, check out Guardrails AI’s Hub to browse other Guards that could be added. Or explore Arize’s new Datasets feature to see how you can export malicious traces for further training or analysis.