Optimizing Coding Agents (Cline) with Prompt Learning

The Task

In this cookbook, we use Prompt Learning to optimize Cline, a popular and powerful open-source coding agent.

Specifically, we will use Prompt Learning to optimize Cline rules: user-specified instructions that Cline appends to its system prompt. We optimize the ruleset rather than the base prompt because this mimics a developer's workflow. Most coding agents do not let you edit the base system prompt, but they do let you add your own custom instructions.

Cline System Prompt:

<Base Cline Prompt> (remains the same every time)
<Rules> (user-specified instructions: This is what we will optimize!)
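As a rough illustration of the layout above, the final system prompt can be thought of as the fixed base prompt followed by the rules. The function name `build_system_prompt`, the separator, and the second example rule are all hypothetical; this is a sketch of the idea, not Cline's actual implementation.

```python
def build_system_prompt(base_prompt: str, rules: str) -> str:
    """Hypothetical sketch: append user rules to a fixed base prompt."""
    # The base prompt never changes; only `rules` is rewritten by the optimizer.
    return f"{base_prompt.rstrip()}\n\n{rules.strip()}"


base = "<Base Cline Prompt>"
rules_v0 = "Do NOT use the resume_task tool."
# A made-up rule, standing in for whatever an optimization loop might add.
rules_v1 = rules_v0 + " Read the failing test before proposing a plan."

prompt_v0 = build_system_prompt(base, rules_v0)
prompt_v1 = build_system_prompt(base, rules_v1)
```

Both prompts share the same base text and differ only in the trailing rules, which is the only part Prompt Learning touches in this cookbook.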

More on Cline

More on Prompt Learning

Plan Mode (for now)

This is an early stage of optimization. For now we only look at Cline's Plan Mode, which generates a plan for a given query, referencing the files in the codebase. We then use an LLM-as-Judge evaluator to grade the generated plan. The accuracies should therefore be taken lightly: they are not a perfect reflection of Cline's performance, because Cline is not actually editing the codebase and we are not running the SWE Bench tests to verify whether its edits are correct.

We are still working on running Cline in Act Mode and allowing it to actually edit the codebase. Then we can use the tests in SWE Bench to compute a firm accuracy for whether Cline made the right edits. Stay tuned.

Prompt Learning Repository

To view and run the notebook, first clone the Prompt Learning repository.

git clone https://github.com/Arize-ai/prompt-learning.git
cd prompt-learning/cline

Then open optimize_cline.ipynb in the cline directory.

Running the Cookbook

Important Note

Running this notebook can be computationally intensive and expensive as it involves multiple API calls to Claude for each SWE-bench instance. Consider adjusting the training and test set sizes based on your requirements and budget constraints.

Setup

Please visit README.md and complete all the Setup before running this notebook!

It is quite involved, as you need to set up both Cline and SWE Bench.

Configuration

  • OPTIMIZATION_LOOPS: number of Prompt Learning loops, i.e. how many times the ruleset is optimized.

  • TRAIN_SIZE: size of training set.

  • TEST_SIZE: size of test set.

  • MAX_WORKERS: SWE Bench is set up to run in parallel, with however many workers you specify. Set this relative to your machine and your Claude rate limits.

  • RULES: base starting ruleset. I suggest keeping the rule regarding resume_task, as I've noticed using the resume_task tool leads to unstable behavior.

OPTIMIZATION_LOOPS = 5
TRAIN_SIZE = 150 ## Lower this based on your Claude rate limits.
TEST_SIZE = 150 ## Lower this based on your Claude rate limits.
MAX_WORKERS = 50 ## Lower this based on your Claude rate limits + your machine's memory.
RULES = "do NOT use resume_task tool. Do NOT ask for user input/confirmation at any step of the process."
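To get a feel for the budget before running, the sketch below counts how many Cline runs a configuration implies. It assumes, as described in this cookbook, that every loop runs Cline once per train instance and once per test instance, and that each produced plan is judged at most once. The helper name `estimate_runs` is ours, and the real API usage per run will vary.

```python
def estimate_runs(optimization_loops: int, train_size: int, test_size: int) -> dict:
    """Count the Cline runs implied by a configuration (hypothetical helper)."""
    runs_per_loop = train_size + test_size  # one train pass + one test pass
    total_runs = optimization_loops * runs_per_loop
    return {
        "runs_per_loop": runs_per_loop,
        "total_cline_runs": total_runs,
        # Upper bound: runs that produce no plan are not sent to the judge.
        "max_judge_calls": total_runs,
    }


estimate = estimate_runs(optimization_loops=5, train_size=150, test_size=150)
print(estimate)
```

Halving TRAIN_SIZE and TEST_SIZE halves the number of runs, which is usually the quickest way to fit within rate limits.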

Train/Test Datasets

This code splits SWE-Bench Lite into train/test splits.

The train set will be used to optimize the ruleset, while the test set will be used to measure the success of optimized rulesets.

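import random
import pandas as pd

dataset = load_swebench_dataset("SWE-bench/SWE-bench_Lite", "test")

# Shuffle with a fixed seed so the split is reproducible.
random.seed(42)
random.shuffle(dataset)

# Take disjoint slices: the first TRAIN_SIZE rows for training, the next TEST_SIZE rows for testing.
train_dataset = dataset[:TRAIN_SIZE]
test_dataset = dataset[TRAIN_SIZE:TRAIN_SIZE + TEST_SIZE]

train_pd = pd.DataFrame.from_dict(train_dataset)
test_pd = pd.DataFrame.from_dict(test_dataset)
train_pd.to_csv("train_dataset.csv")
test_pd.to_csv("test_dataset.csv")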
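A sanity check worth running on any split is that no instance appears in both sets and that the sizes are what you asked for. The following is a minimal sketch on toy rows; `split_train_test` is a hypothetical helper, not a function from the repository.

```python
import random


def split_train_test(items, train_size, test_size, seed=42):
    """Shuffle a copy of `items` and return disjoint train/test slices."""
    if train_size + test_size > len(items):
        raise ValueError("train_size + test_size exceeds the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state alone
    train = shuffled[:train_size]
    test = shuffled[train_size:train_size + test_size]
    return train, test


# Toy stand-in for SWE-bench rows; only `instance_id` matters here.
rows = [{"instance_id": f"repo__issue-{i}"} for i in range(10)]
train, test = split_train_test(rows, train_size=6, test_size=3)

train_ids = {r["instance_id"] for r in train}
test_ids = {r["instance_id"] for r in test}
assert train_ids.isdisjoint(test_ids)
```

Slicing the test set from `train_size` onward, rather than from `test_size`, is what keeps the two sets disjoint when the sizes differ.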

Upload Datasets to Arize

We'll be uploading our train/test datasets to Arize so we can eventually track performance when we run Cline.

import os

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

arize_client = ArizeDatasetsClient(api_key=os.getenv("ARIZE_API_KEY"))

# Prepare expanded dataset with more columns
def prepare_expanded_dataset(df):
    expanded_df = pd.DataFrame({
        'problem_statement': df['problem_statement'],
        'patch': df['patch'],
        'test_patch': df['test_patch'],
        'instance_id': df['instance_id'],
        'repo': df['repo']
    })
    return expanded_df

# Prepare expanded datasets
train_expanded = prepare_expanded_dataset(train_pd)
test_expanded = prepare_expanded_dataset(test_pd)

# Upload expanded datasets to Arize
train_arize_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name="SWE-bench Train",
    dataset_type=GENERATIVE,
    data=train_expanded
)

test_arize_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name="SWE-bench Test",
    dataset_type=GENERATIVE,
    data=test_expanded
)

print("\nCreated train dataset with ID:", train_arize_id)
print("Created test dataset with ID:", test_arize_id)

Helper: Collecting SWE Bench results

This helper function will help us convert our Cline runs on SWE Bench into data we can evaluate.

def collect_swebench_results(all_results):
    # Index every instance by ID so that results from either split can be looked up.
    instances = {inst["instance_id"]: inst for inst in train_dataset + test_dataset}

    dataset_rows = []
    for result in all_results:
        instance_id = result["instance_id"]
        inst = instances[instance_id]
        # Keep only runs where Cline actually produced a plan.
        if result["final_plan"]:
            dataset_rows.append({
                "instance_id": instance_id,
                "problem_statement": inst["problem_statement"],
                "final_plan": result["final_plan"],
                "test_patch": inst["test_patch"],
                "patch": inst["patch"],
            })
    return pd.DataFrame(dataset_rows)

Helper: Running Cline on a dataset

This helper function runs Cline on a dataset. It is meant to be used to run Cline on either your train or test split.

It runs Cline in parallel, spinning up at most MAX_WORKERS Cline servers at a time, each working on one row of SWE Bench.

It then evaluates the plans generated by Cline using our LLM-as-Judge eval. We provide an LLM with the problem statement, the test patch, the ground truth patch, and Cline's generated plan, and ask whether the generated plan looks correct. We use this to compute a rough measure of Cline's accuracy.

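from concurrent.futures import ThreadPoolExecutor, as_completed

def run_cline_dataset(dataset, ruleset):
    all_results = []
    # Run one Cline instance per SWE Bench row, at most MAX_WORKERS at a time.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
        futs = {ex.submit(run_cline, inst, i, ruleset): inst["instance_id"] for i, inst in enumerate(dataset)}
        for fut in as_completed(futs):
            r = fut.result()
            all_results.append(r)
            print(f"[{r['instance_id']}] done")

    # Convert the raw results into a DataFrame and grade each plan with the LLM-as-Judge.
    df = collect_swebench_results(all_results)
    evaluated_results = evaluate_results(df)

    # Fraction of evaluated plans labeled "correct". Runs that produced no plan
    # were dropped in collect_swebench_results, so they are not counted here.
    accuracy = sum(evaluated_results["correctness"] == "correct") / len(evaluated_results)
    return evaluated_results, accuracy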
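To make the LLM-as-Judge step more concrete, here is a minimal sketch of how one plan could be graded. It is not the repository's `evaluate_results`: the prompt wording, the JSON reply format, and the `call_llm` parameter are all assumptions, and a stub stands in for the real model call so the example runs offline.

```python
import json

JUDGE_TEMPLATE = """You are grading a plan for fixing a software issue.

Problem statement:
{problem_statement}

Ground truth patch:
{patch}

Test patch:
{test_patch}

Proposed plan:
{final_plan}

Reply with JSON: {{"correctness": "correct" or "incorrect", "explanation": "..."}}"""


def judge_plan(row: dict, call_llm) -> dict:
    """Grade one plan; `call_llm` maps a prompt string to a reply string."""
    reply = call_llm(JUDGE_TEMPLATE.format(**row))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        parsed = None
    # Treat malformed or unexpected replies as incorrect rather than guessing.
    if not isinstance(parsed, dict) or parsed.get("correctness") not in ("correct", "incorrect"):
        return {"correctness": "incorrect", "explanation": "judge reply could not be parsed"}
    return {"correctness": parsed["correctness"], "explanation": str(parsed.get("explanation", ""))}


def accuracy(judgements: list) -> float:
    """Fraction of judgements labeled "correct"; 0.0 for an empty list."""
    if not judgements:
        return 0.0
    return sum(j["correctness"] == "correct" for j in judgements) / len(judgements)


def fake_llm(prompt: str) -> str:
    # Stub judge: approves any plan that mentions the off-by-one bug.
    plan_part = prompt.split("Proposed plan:")[1]
    verdict = "correct" if "off-by-one" in plan_part else "incorrect"
    return json.dumps({"correctness": verdict, "explanation": "stub verdict"})


base_row = {"problem_statement": "IndexError in slice()", "patch": "- i <= n\n+ i < n", "test_patch": "assert slice_ok()"}
rows = [
    {**base_row, "final_plan": "Fix the off-by-one bounds check in slice()."},
    {**base_row, "final_plan": "Rewrite the README."},
]
judgements = [judge_plan(r, fake_llm) for r in rows]
score = accuracy(judgements)
```

Passing the model call in as a function keeps the grading logic testable without API access; in a real run, `call_llm` would wrap whichever model client you use.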

Helper: Log experiments to Arize

We'll be logging Cline results at every iteration of optimization to Arize, so we can visualize and keep track of our results.

from arize.experimental.datasets.experiments.types import (
    ExperimentTaskResultColumnNames,
    EvaluationResultColumnNames,
)

task_columns = ExperimentTaskResultColumnNames(
    example_id="example_id", result="final_plan"
)
evaluator_columns = EvaluationResultColumnNames(
    label="correctness",
    explanation="explanation",
    score="score"
)

# Get dataset with example_ids and merge with experiment results
def log_experiment_with_ids(client, space_id, experiment_name, experiment_df, dataset_name):
    # 1) fetch dataset and keep only instance_id -> id mapping
    dataset = client.get_dataset(space_id=space_id, dataset_name=dataset_name)
    id_map = dataset[['instance_id', 'id']].drop_duplicates()

    # 2) merge and build a minimal payload with only required columns
    merged = experiment_df.merge(id_map, on='instance_id', how='inner')
    payload = merged.rename(columns={"id": "example_id"})[
        ["example_id", "final_plan", "correctness", "explanation"]
    ].copy()

    return client.log_experiment(
        space_id=space_id,
        experiment_name=experiment_name,
        experiment_df=payload,
        dataset_name=dataset_name,
        task_columns=task_columns,
        evaluator_columns={'correctness': evaluator_columns},
    )
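Because the merge above uses `how='inner'`, any result whose `instance_id` is missing from the Arize dataset is dropped without warning. A small check like the following can surface that before logging. It is a sketch on plain Python lists; `find_unmatched_ids` is a hypothetical helper and not part of the Arize client.

```python
def find_unmatched_ids(result_ids, dataset_ids):
    """Return result IDs with no matching dataset row, in sorted order."""
    return sorted(set(result_ids) - set(dataset_ids))


# Toy IDs standing in for experiment_df["instance_id"] and dataset["instance_id"].
result_ids = ["astropy-1", "django-7", "sympy-3"]
dataset_ids = ["astropy-1", "sympy-3", "flask-9"]

missing = find_unmatched_ids(result_ids, dataset_ids)
if missing:
    print(f"{len(missing)} result(s) would be dropped by the inner merge: {missing}")
```

With the real DataFrames, the two arguments would be the `instance_id` columns of the experiment results and of the fetched dataset.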

Ruleset Optimization

This code optimizes our ruleset for Cline. Here are the steps:

Repeat OPTIMIZATION_LOOPS times:

  1. Run Cline, with the current ruleset, on the training set, and compute training accuracy.

  2. Run Cline, with the current ruleset, on the test set, and compute test accuracy.

  3. Use the results on the training set to optimize the ruleset, using `PromptLearningOptimizer`.

  4. Update the current ruleset to be the optimized ruleset, for the next iteration.

# Make sure the output directories used below exist.
os.makedirs("results", exist_ok=True)
os.makedirs("rulesets", exist_ok=True)

# Start from the base ruleset defined in Configuration.
ruleset = RULES

for idx in range(OPTIMIZATION_LOOPS):
    print(f"Running for idx: {idx}")

    evaluated_train_results, train_accuracy = run_cline_dataset(train_dataset, ruleset)
    evaluated_train_results.to_csv(f"results/train_{idx}.csv")
    log_experiment_with_ids(
        arize_client,
        SPACE_ID,
        f"train_{idx}",
        evaluated_train_results,
        "SWE-bench Train"
    )

    evaluated_test_results, test_accuracy = run_cline_dataset(test_dataset, ruleset)
    evaluated_test_results.to_csv(f"results/test_{idx}.csv")
    log_experiment_with_ids(
        arize_client,
        SPACE_ID,
        f"test_{idx}",
        evaluated_test_results,
        "SWE-bench Test"
    )
    
    pl_optimizer = PromptLearningOptimizer(
        prompt=CLINE_PROMPT,
        model_choice="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    ruleset = pl_optimizer.optimize(
        dataset=evaluated_train_results,
        output_column="final_plan",
        feedback_columns=["correctness", "explanation"],
        ruleset = ruleset,
        context_size_k=100000
    )

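    # The accuracies were measured with the ruleset going into this loop;
    # the ruleset written below is the newly optimized one.
    with open(f"rulesets/ruleset_{idx}.txt", "w") as f:
        f.write(f"train_accuracy: {train_accuracy}\n")
        f.write(f"test_accuracy: {test_accuracy}\n")
        f.write(f"optimized_ruleset_{idx}: {ruleset}\n")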

Results

Running the code will give you your own optimization results. Here are some results we got.

Disclaimer: Again, since this is just Plan Mode, and we are just using an LLM to evaluate the generated plans, these results are to be taken very lightly. But stay tuned for Act Mode results, where we will have real results on Cline's performance before and after optimizing its ruleset using Prompt Learning.

Through 10 loops of optimization, we tracked our results in Arize Experiments.

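| Index | Train Accuracy | Test Accuracy |
|-------|----------------|---------------|
| 1     | 0.2200         | 0.3067        |
| 2     | 0.2667         | 0.3533        |
| 3     | 0.2933         | 0.3467        |
| 4     | 0.2000         | 0.3200        |
| 5     | 0.3667         | 0.3667        |
| 6     | 0.2733         | 0.2933        |
| 7     | 0.2667         | 0.3133        |
| 8     | 0.2600         | 0.3400        |
| 9     | 0.2533         | 0.3467        |
| 10    | 0.2467         | 0.3400        |
| 11    | 0.2867         | 0.4533        |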

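According to our LLM-as-Judge, after 10 loops of ruleset optimization, test accuracy rose from 0.3067 at index 1 to 0.4533 at index 11 (about 15 percentage points), and train accuracy rose from 0.2200 to 0.2867 (about 7 percentage points). These are encouraging results that suggest we may be able to improve Cline in Act Mode as well.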
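When reading these numbers, it helps to separate absolute gains (percentage points) from relative gains (percent of the starting accuracy). The sketch below computes both from the first and last rows of the table; the helper name `improvement` is ours.

```python
def improvement(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100
    return absolute_points, relative_percent


# First and last rows of the results table (index 1 and index 11).
test_abs, test_rel = improvement(before=0.3067, after=0.4533)
train_abs, train_rel = improvement(before=0.2200, after=0.2867)

print(f"test:  +{test_abs:.1f} points ({test_rel:.0f}% relative)")
print(f"train: +{train_abs:.1f} points ({train_rel:.0f}% relative)")
```

The absolute gain is the number comparable to the accuracy column itself; the relative gain is larger because the starting accuracies are low.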
