> ## Documentation Index
> Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Text2SQL Application for Database Querying

> Let's work through a Text2SQL use case where we are starting from scratch without a nice and clean dataset of questions, SQL queries, or expected responses.

<Frame>
  <Card title="Google Colab" href="https://colab.research.google.com/github/Arize-ai/tutorials/blob/main/python/llm/experiments/text2sql-experiment.ipynb" icon="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/gc.png" horizontal iconType="solid" />
</Frame>

Here are the steps:

<Steps>
  <Step title="Implement text2SQL" />

  <Step title="Setup dataset and evaluators" />

  <Step title="Run experiments" />
</Steps>

## Implement Text2SQL

We are going to use the NBA dataset that information from 2014 - 2018 about every game played in that span. We will use DuckDB as our database.

```python theme={null}
import duckdb
from datasets import load_dataset

data = load_dataset("suzyanil/nba-data")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())

conn.query("SELECT * FROM nba LIMIT 5").to_df().to_dict(orient="records")[0]
```

Here's the example of one row of a game in the dataset

<Accordion title="Example data">
  `{'Unnamed: 0': 1, 'Team': 'ATL', 'Game': 1, 'Date': '10/29/14', 'Home': 'Away', 'Opponent': 'TOR', 'WINorLOSS': 'L', 'TeamPoints': 102, 'OpponentPoints': 109, 'FieldGoals': 40, 'FieldGoalsAttempted': 80, 'FieldGoals.': 0.5, 'X3PointShots': 13, 'X3PointShotsAttempted': 22, 'X3PointShots.': 0.591, 'FreeThrows': 9, 'FreeThrowsAttempted': 17, 'FreeThrows.': 0.529, 'OffRebounds': 10, 'TotalRebounds': 42, 'Assists': 26, 'Steals': 6, 'Blocks': 8, 'Turnovers': 17, 'TotalFouls': 24, 'Opp.FieldGoals': 37, 'Opp.FieldGoalsAttempted': 90, 'Opp.FieldGoals.': 0.411, 'Opp.3PointShots': 8, 'Opp.3PointShotsAttempted': 26, 'Opp.3PointShots.': 0.308, 'Opp.FreeThrows': 27, 'Opp.FreeThrowsAttempted': 33, 'Opp.FreeThrows.': 0.818, 'Opp.OffRebounds': 16, 'Opp.TotalRebounds': 48, 'Opp.Assists': 26, 'Opp.Steals': 13, 'Opp.Blocks': 9, 'Opp.Turnovers': 9, 'Opp.TotalFouls': 22}`
</Accordion>

Let's start by implementing a simple text2sql logic.

```python expandable theme={null}
import os
import openai

client = openai.AsyncClient()

columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")

# We will use GPT4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}


system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n"
    f"{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n"
    "Write a SQL query corresponding to the user's request. Return just the query text, "
    "with no formatting (backticks, markdown, etc.)."
)


async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content
```

```python theme={null}
query = await generate_query("Who won the most games?")
print(query)
```

> SELECT Team, COUNT(\*) AS Wins&#x20;
>
> FROM nba&#x20;
>
> WHERE WINorLOSS = 'W'&#x20;
>
> GROUP BY Team&#x20;
>
> ORDER BY Wins DESC&#x20;
>
> LIMIT 1;

Let's run the query against our database now!

```python theme={null}
def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")

execute_query(query)
```

> \[\{'Team': 'GSW', 'Wins': 265}]

## Setup dataset and evaluators <a href="#evaluation" id="evaluation" />

To setup an experiment we need a dataset, task and evaluator. Let's setup each.

**Setup dataset**

```python theme={null}
questions = [
    "Which team won the most games?",
    "Which team won the most games in 2015?",
    "Who led the league in 3 point shots?",
    "Which team had the biggest difference in records across two consecutive years?",
    "What is the average number of free throws per year?",
]
```

Let's store the data above as a versioned dataset in Arize AX.

```python theme={null}
# Note: This example uses Python SDK v7
arize_client = ArizeDatasetsClient(
    developer_key=os.environ.get("ARIZE_DEVELOPER_KEY"),
    api_key=os.environ.get("ARIZE_API_KEY"),
)
# Create a dataset from a DataFrame add your own data here
test_df = pd.DataFrame([{"question": question} for question in questions])
dataset_id = arize_client.create_dataset(
    space_id=space_id,
    dataset_name=dataset_name,
    dataset_type=GENERATIVE,
    data=test_df,
)
dataset = arize_client.get_dataset(space_id=space_id, dataset_id=dataset_id)
dataset.head()
```

<Frame>
  ![](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/text2sql-1.png)
</Frame>

**Setup task**

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

```python theme={null}
async def text2sql(question):
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

def task(dataset_row):
    input = dataset_row
    return asyncio.run(text2sql(input["question"]))
```

**Setup evaluator**

Finally, we'll define the evaluator. We'll use the following simple scoring functions to see if the generated SQL queries are correct.

```python theme={null}
# Test if there are no sql execution errors
def no_error(output):
    output = json.loads(output)
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    output = json.loads(output)
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0
```

Now let's run the experiment.

```python theme={null}
experiment = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=task,
    evaluators=[no_error, has_results],
    experiment_name="text2sql_first_test",
)
```

<Frame>
  ![](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/text2sql-2.png)
</Frame>

### Interpreting the results

Now that we ran the initial evaluation, it looks like three of the results are valid, one produces SQL errors, and one has no results.

The second query for `` `Which team won the most games in 2015` ``  looks for `Date LIKE '2015%'` which is not correct. The fourth query does not have TEAM in the group by clause.

Let's try to improve the prompt with few-shot examples and see if we can get better results.

```python expandable theme={null}
samples = conn.query("SELECT * FROM nba LIMIT 1").to_df().to_dict(orient="records")[0]
sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}"
    for column in columns
)
system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{sample_rows}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


async def generate_query(input):
    response = await client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content


print(await generate_query("Which team won the most games in 2015?"))
```

> SELECT Team, COUNT(\*) AS Wins FROM nba WHERE WINorLOSS = 'W' AND Date LIKE '%/15' GROUP BY Team ORDER BY Wins DESC LIMIT 1;

Looking better! Finally, let's add a scoring function that compares the results, if they exist, with the expected results. And then we can run this as another experiment and compare the results.

```python expandable theme={null}
# Note: This example uses Python SDK v7
from phoenix.evals.models import OpenAIModel
from phoenix.evals.classify import llm_classify
from arize.experimental.datasets.experiments.types import EvaluationResult


IS_SQL_EVAL_TEMPLATE = """You are a SQL expert, is the following a valid SQL query that executes without errors? Return the single workd "valid" if is valid, and "invalid" if it is not.

[BEGIN SQL QUERY]
{query}
[END SQL QUERY]
"""


def check_is_sql(output):
    output = json.loads(output)
    query = output.get("query")
    df_in = pd.DataFrame({"query": query}, index=[0]) if query else None
    eval_df = llm_classify(
        dataframe=df_in,
        template=IS_SQL_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["valid", "invalid"],
        provide_explanation=True,
    )
    # return score, label, explanation
    return EvaluationResult(
        score=1,
        label=eval_df["label"][0],
        explanation=eval_df["explanation"][0],
    )


experiment = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=task,
    evaluators=[no_error, has_results, check_is_sql],
    experiment_name="text2sql_test_new_prompt_and_eval-6",
)
```

<Frame>
  ![](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/cookbooks/text2sql-3.png)
</Frame>

You can see that the newer SQL has improved some cases, but there are still some other errors to iron out. As you experiment with different models, prompts, and techniques, you can continuously optimize your applications until they reach the performance thresholds you want.
