LLM Guardrails: Protecting Your AI Application, Including From Itself
This is the first in a series detailing how you can secure your LLM apps
As LLM applications become more common, so do jailbreak attempts, exploits, and harmful responses. More and more companies are falling prey to damaging news stories driven by their chatbots selling cars for $1, writing poems critical of their owners, or dealing out disturbing replies.
Fortunately, there is a solution to this problem: LLM guardrails.
What Are LLM Guardrails?
A key part of any modern large language model (LLM) application, LLM guardrails allow you to protect your application from potentially harmful inputs, and block damaging outputs before they’re seen by a user. As LLM jailbreak attempts become more common and more sophisticated, having a robust guardrails approach is critical.
Let’s dive into how guardrails work, how they can be traced and acted upon, and how you can use them to avoid becoming the next big news story.
How Do Guardrails Work?
LLM guardrails work in real-time to either catch dangerous user inputs or screen model outputs. There are many different types of guards that can be employed, each specializing in a different potential type of harmful input or output.
What Are the Primary Use Cases for Guardrails In AI Development?
Common input guard use cases include:
- Detecting and blocking jailbreak attempts
- Preventing prompt injection attempts
- Removing user personally identifiable information (PII) before it reaches a model
Common output guard use cases include:
- Removing toxic or hallucinated responses
- Removing mentions of a competitor’s product
- Screening for relevancy in responses
- Removing NSFW text
There can be a lot of ground to cover. Fortunately, tools like Guardrails AI offer a hub of different guards that can be added to your application.
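To make this concrete, here is a minimal sketch of a guard that pairs an input check with an output check. It assumes the DetectPII and ToxicLanguage validators have already been installed from the Guardrails hub; the validator names and parameters here are illustrative, so swap in whichever guards fit your application.
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage

guard = (
    Guard()
    # Input guard: catch personally identifiable information in the user prompt
    .use(DetectPII, pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on="prompt", on_fail="fix")
    # Output guard: block toxic model responses before a user ever sees them
    .use(ToxicLanguage, threshold=0.5, validation_method="sentence", on_fail="exception")
)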
If a message in an LLM chat fails a guard, the guard can take one of several corrective actions: providing a default response, re-prompting the LLM for a new response, or throwing an exception. For a guard that detects responses that might damage your company’s reputation, regenerating the response may work just fine. However, for a guard that detects jailbreak attempts, a default response may be more appropriate.
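In Guardrails AI, that choice is controlled by the on_fail argument you pass when creating a guard. Here is a rough sketch, using the hub’s ToxicLanguage validator purely for illustration:
from guardrails import Guard
from guardrails.hub import ToxicLanguage

# Re-prompt the LLM for a fresh response whenever the guard fails
reask_guard = Guard().use(ToxicLanguage, on_fail="reask")

# Raise an exception instead, so the application can catch it and serve a default response
strict_guard = Guard().use(ToxicLanguage, on_fail="exception")
Other on_fail actions, such as fix, filter, and noop, are also available depending on the validator and your Guardrails version.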
How To Use Guardrails
A variety of packages are available. Arize offers an integration with Guardrails AI to give you best-in-class observability alongside top-of-the-line security.
Here are the basic steps necessary to use Guardrails with Arize. If you’re looking for a more in-depth example, check out this tutorial notebook walking through how to enable Guards on a RAG pipeline.
Import Packages and Initialize Arize
First, install and import the necessary packages and connect your application to your Arize dashboard:
pip install opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-guardrails guardrails-ai arize-otel
from getpass import getpass

from arize_otel import register_otel, Endpoints
from openinference.instrumentation.guardrails import GuardrailsInstrumentor

# Setup OTEL via our convenience function
register_otel(
    endpoints = Endpoints.ARIZE,
    space_key = getpass("🔑 Enter your Arize space key from the space settings page of the Arize UI: "),
    api_key = getpass("🔑 Enter your Arize API key from the space settings page of the Arize UI: "),
    model_id = "sales-demo-dataset-embeddings-guard",  # name this whatever you would like
)
# Use the Arize autoinstrumentor to add tracing to your application
GuardrailsInstrumentor().instrument(skip_dep_check=True)
Prepare Your Guards
Next, add whichever guards you want to use to your project. These can be installed directly from the Guardrails hub. For this example, we’ll use our ArizeDatasetEmbeddings guard.
guardrails hub install hub://arize-ai/dataset_embeddings_guardrails
Initialize Guardrails and Add Your Guard
Now we are ready to initialize our guard. Here we can specify whether this guard will act on prompts or responses and what it should do if it catches a bad input or output. We also disable Guardrails tracing in this step, as we’re using Arize to view our telemetry.
from guardrails import Guard
from guardrails.hub import ArizeDatasetEmbeddings

guard = Guard().use(ArizeDatasetEmbeddings, on="prompt", on_fail="exception")
guard._disable_tracer = True
Make Calls to Your LLM
With that, we’re ready to make protected calls to our models. We can do this by invoking our Guard:
import openai

prompt = "Tell me about the latest sales figures."  # example user prompt

validated_response = guard(
    llm_api=openai.chat.completions.create,
    prompt=prompt,
    model="gpt-3.5-turbo",
    max_tokens=1024,
    temperature=0.5,
)
Based on our earlier setup, our guard handles any failed validations for us, whether that means re-prompting the model, returning a default response, or, as configured here, raising an exception for our application to handle.
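For example, since we chose on_fail="exception" above, a blocked prompt surfaces as an exception that our application can catch and replace with a safe default. Here is a rough sketch, assuming recent Guardrails versions raise ValidationError from guardrails.errors:
try:
    from guardrails.errors import ValidationError  # import path may vary across Guardrails versions
except ImportError:
    ValidationError = Exception  # fall back to a broad catch if the path differs

try:
    validated_response = guard(
        llm_api=openai.chat.completions.create,
        prompt=prompt,
        model="gpt-3.5-turbo",
    )
except ValidationError:
    # The guard flagged this prompt, so serve a safe default response instead
    validated_response = "Sorry, I can't help with that request."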
How To View Guard and Trace Data
If you’ve followed the steps so far, you should already see trace data in your project.
Each guard gets its own span with details on whether it passed or failed, the thresholds it used, and the input or output messages it evaluated.
Next Steps
Now that you have your first guard connected and sending data to Arize, check out Guardrails AI’s Hub to browse other Guards that could be added. Or explore Arize’s new Datasets feature to see how you can export malicious traces for further training or analysis.