A/B testing for LLMs
A/B testing for LLMs compares two or more AI system variants on live traffic.
Access control (agents)
Access control (agents) defines which users, agents, tools, data sources, and actions are allowed in a given context.
Accuracy
Accuracy is the fraction of predictions a model gets right: the number of correct predictions divided by the total number of predictions.
Adaptive Knowledge Graph Memory
Adaptive Knowledge Graph Memory is a concept from the DAMCS framework that uses a hierarchical knowledge-graph memory to enable multi-agent cooperation.
AdaptThink
AdaptThink is a novel reinforcement learning framework that trains an LLM when to think deeply and when to respond immediately.
Agent
An Agent is a software system that uses a model together with tools, memory, instructions, and control logic to pursue a goal across multiple steps.
Agent architecture
Agent architecture is the design of the system around the model: orchestration, tools, memory, state, policies, retrieval, tracing, error handling, and evaluation points.
Agent control loop
An Agent control loop is the repeated cycle of observe state, decide next action, execute action, update state, and continue until a stopping condition is met.
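The cycle above can be sketched in a few lines of Python. This is a minimal illustration rather than a production agent: `run_control_loop`, `decide`, `execute`, `is_done`, and the `max_steps` budget are hypothetical names chosen for this example.

```python
def run_control_loop(state, decide, execute, is_done, max_steps=10):
    """Repeat observe -> decide -> execute -> update until done or out of budget."""
    for _ in range(max_steps):
        if is_done(state):               # observe state, check the stopping condition
            break
        action = decide(state)           # decide the next action
        state = execute(state, action)   # execute it and update state
    return state


# Toy usage: the "agent" increments a counter until it reaches 3.
final_state = run_control_loop(
    state=0,
    decide=lambda s: "increment",
    execute=lambda s, a: s + 1,
    is_done=lambda s: s >= 3,
)
```

The `max_steps` budget acts as a second stopping condition, so a loop that never reaches its goal still terminates.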
Agent drift / model drift
Agent drift / model drift is a measurable change in agent behavior over time.
Agent feedback loop
An Agent feedback loop is the operating cycle that turns production behavior into measurable improvement.
Agent lifecycle (dev → prod → improve)
Agent lifecycle (dev → prod → improve) is the full path from building an agent to running it in production and improving it over time.
Agent orchestration
Agent orchestration is the coordination layer that decides what an agent does next.
Agent policy layer
Agent policy layer defines what an agent is allowed to do and under what conditions.
Agent state management
Agent state management is the handling of information an agent needs across steps, turns, sessions, or workflows.
Agent supervision
Agent supervision is the set of controls used to observe, guide, constrain, and review agent behavior.
Agent workflow
An Agent workflow is the ordered process an agent follows to complete a task.
Agent Workflow Memory (AWM)
Agent Workflow Memory (AWM) is a technique for teaching an LLM-based agent to remember and reuse multi-step task solutions (“workflows”) from its past experience.
Agent-native evaluation
Agent-native evaluation is evaluation infrastructure designed for agents to consume, act on, and improve from, not just dashboards for humans to inspect.
Agent-run evaluation
Agent-run evaluation is the practice of having agents execute evaluation workflows.
Agent-to-agent evaluation
Agent-to-agent evaluation is the use of one agent to evaluate another agent’s behavior.
Agent-to-Agent Protocol (A2A)
Agent-to-Agent Protocol (A2A) is an open communication framework that allows autonomous AI agents to talk to each other in a standardized, predictable way.
AgentBench
AgentBench is a benchmarking suite designed to evaluate the performance of LLMs acting as agents across various interactive environments and tasks.
Agentic Memory (A-MEM)
Agentic Memory (A-MEM) is an LLM agent’s dynamic long-term memory system that can grow and reorganize itself over time.
Agentic RAG
Agentic RAG is a retrieval-augmented generation system where an agent actively manages how retrieval happens during a workflow instead of following a fixed retrieve-then-generate pattern.
Agents that evaluate agents
Agents that evaluate agents are agent systems used to inspect, score, or debug the behavior of other agent systems.
AI evaluation (model evaluation)
AI evaluation (model evaluation) measures the quality, safety, reliability, and performance of an AI system or model.
AI improvement loop
An AI improvement loop is the broader process of using real system behavior to improve an AI application.
AlphaEvolve
AlphaEvolve is an evolutionary coding agent that combines the creative code generation of LLMs with automated program evaluators in a closed-loop system.
AM-Thinking-v1
AM-Thinking-v1 is a 32-billion-parameter open-source LLM that achieves breakthrough reasoning ability by combining supervised fine-tuning and reinforcement learning in its training.
ARC-AGI-2
ARC-AGI-2 is a new benchmark designed to push AI systems to their absolute limits in general problem-solving.
Audit logs
Audit logs are records of important system actions, decisions, and access events.
AutoAgents
AutoAgents is a framework designed to automatically generate and coordinate multiple bespoke AI agents to collaboratively solve complex tasks.
Autoencoder
An Autoencoder is a model used for tasks like sentence completion and sentence or token classification, learning to reconstruct or fill in missing parts of its input.
Autonomous evaluation systems
Autonomous evaluation systems are eval systems that can run, analyze, and act on evaluation workflows with limited human intervention.
Autoregressive Models
Autoregressive Models like GPT-2 use the previous words (context) to predict the next word in a sentence.
Baseline
A Baseline is the reference data or benchmark used to compare model performance against for monitoring purposes.
Baseline Distribution
A Baseline Distribution refers to a model dataset used as a reference or comparison to the model’s current (production) distribution.
Benchmark vs production evaluation
Benchmark vs production evaluation is the distinction between measuring AI performance on standardized tasks or public datasets and measuring it on real, ongoing production behavior.
Bias (AI evaluation)
Bias (AI evaluation) refers to systematic differences in behavior or outcomes across groups, topics, languages, dialects, or contexts.
Binary Classification Model
A Binary Classification Model is a machine learning algorithm that performs classification tasks with only two class labels.
Binning
Binning groups continuous values into smaller cohorts, or “bins,” to reduce the cardinality of data by representing the points in intervals.
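As a rough sketch, equal-width binning can be written as follows. The function name `equal_width_bins` and the choice of equal-width intervals are assumptions for illustration; quantile-based or custom bin edges are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # All-equal input has zero width, so everything goes in the first bin.
        # The maximum value is clamped into the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


counts = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], n_bins=3)
# counts == [3, 3, 4]
```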
BLEU Score
BLEU Score is a precision-focused metric that measures the n-gram overlap between generated text and reference text.
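The n-gram overlap at the core of the metric can be illustrated with clipped n-gram precision. This sketch covers only that one component: a full BLEU score also combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


score = modified_precision("the the the the the the the", "the cat is on the mat")
# "the" occurs twice in the reference, so only 2 of the 7 candidate tokens count.
```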
Canary Deployment
Canary Deployment is a method of testing a new model or model version where only a small subset of production data flows through this model to verify response performance before making a complete cutover.
Canary evaluation (AI systems)
Canary evaluation (AI systems) is the practice of routing a small amount of traffic or a limited set of users through a new AI system version and evaluating the result before full rollout.
Cascading failures
Cascading failures occur when one failure triggers additional failures across an agent workflow or multi-agent system.
Checkpointing (agents)
Checkpointing (agents) is the practice of saving an agent’s state at defined points so the workflow can resume, audit, retry, or roll back after interruption.
Chunking strategy
Chunking strategy is the way source documents are split into retrievable units for embedding, indexing, and RAG.
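One common strategy, shown here only as an example, is fixed-size chunks with overlap. The sketch below splits on whitespace-delimited words; `chunk_words`, `size`, and `overlap` are hypothetical names, and real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of up to `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this chunk already reached the end
            break
        start += step
    return chunks


chunks = chunk_words("a b c d e f g", size=3, overlap=1)
# chunks == ["a b c", "c d e", "e f g"]
```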
Classification Model
A Classification Model is used to predict categories or assign a class label.
Closed-loop agents
Closed-loop agents are agents connected to a feedback loop: observe what happened, evaluate whether it was good, improve the system, and deploy the change under policy.
Completeness
Completeness measures whether an answer or agent workflow includes all required information or steps.
Compliance (AI systems)
Compliance (AI systems) is the practice of ensuring AI behavior, data handling, access, monitoring, and documentation meet legal, regulatory, contractual, and organizational requirements.
Concept Drift
Concept Drift is the shift in the statistical properties of the target/dependent variable(s), i.e. a drift in the actuals.
Confusion Matrix
A Confusion Matrix provides a summary of all prediction results of a classification problem.
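For the binary case, the summary reduces to four counts. A minimal sketch, assuming labels are encoded with a single positive value (the function name and the `positive` parameter are illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1          # predicted positive, actually negative
        elif actual == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0])
# counts == {"tp": 2, "fp": 1, "fn": 1, "tn": 2}
```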
Context
Context is the information available to an AI system at the moment it generates a response, makes a decision, or chooses the next action.
Context relevance
Context relevance measures whether the retrieved or supplied context is useful for answering the query.
Continuous evaluation
Continuous evaluation means running evals as an always-on part of development and production.
Continuous improvement for AI systems
Continuous improvement for AI systems is the practice of improving AI quality through ongoing measurement, not one-time launch testing.
Correctness
Correctness measures whether an output is factually or logically right for the task.
Cosine Similarity
Cosine Similarity is a key concept in machine learning that measures the cosine of the angle between two non-zero vectors in a multi-dimensional space, providing a metric for assessing their directional similarity.
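A small standard-library sketch of the computation follows; the function name is illustrative, and numerical libraries provide faster equivalents for large vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])   # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])       # 0
opposite = cosine_similarity([1, 0], [-1, 0])        # -1
```

Because the result depends only on direction, scaling either vector leaves the similarity unchanged.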
Coverage
Coverage measures how much of the expected behavior space an evaluation suite exercises.
Current Distribution
Current Distribution refers to the statistical distribution, or shape, of the dataset being generated by a machine learning model in production.
Data and datasets layer
The Data and datasets layer is the part of the evaluation system that manages examples used for testing, scoring, experimentation, and improvement.
Data Drift
Data Drift, also called feature drift, covariate drift, or input drift, is a shift in the statistical properties of the independent variable(s), i.e. a drift in the feature distributions and the correlations between variables.
Data drift (for LLMs)
Data drift (for LLMs) occurs when production inputs, retrieved content, user behavior, or external knowledge changes in ways that make existing prompts, models, or evals less reliable.
Data leakage
Data leakage occurs when an AI system exposes information it should not reveal.
Data Quality
Data Quality refers to the integrity and consistency of the data sets used.
Dataset curation
Dataset curation is the process of selecting, cleaning, labeling, organizing, and maintaining examples for evaluation.
Debugging agents
Debugging agents is the practice of tracing and evaluating an agent’s full execution path to find why it failed.
Deep Explainer (Deep Shap)
Deep Explainer (Deep Shap) is an explainability technique that can be used for models with a neural network based architecture.
Deep Learning
Deep Learning is a subset of machine learning, which is itself the subset of AI consisting of the techniques that enable computers to figure things out from data and deliver AI applications.
Deep Learning Model
A Deep Learning Model normally refers to a neural network, typically with more than two layers.
Deployment gating (for AI)
Deployment gating (for AI) is the practice of requiring AI quality checks before shipping a model, prompt, tool, retrieval, or orchestration change.
Disparate Impact
Disparate Impact is a quantitative measure of the adverse treatment of protected classes that compares the pass rate – or positive outcome – of one group versus another.
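The comparison can be sketched as a ratio of positive-outcome rates. The function and argument names are hypothetical, and outcomes are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def disparate_impact(group_outcomes, reference_outcomes):
    """Positive-outcome rate of one group divided by that of a reference group."""
    group_rate = sum(group_outcomes) / len(group_outcomes)
    reference_rate = sum(reference_outcomes) / len(reference_outcomes)
    if reference_rate == 0:
        raise ValueError("reference group has no positive outcomes")
    return group_rate / reference_rate


# 1 of 4 positive outcomes versus 2 of 4 gives a ratio of 0.5.
ratio = disparate_impact([1, 0, 0, 0], [1, 1, 0, 0])
```

A ratio of 1.0 means both groups receive positive outcomes at the same rate.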
Drift
Drift is defined as the change in the data over time.
Durable execution
Durable execution is the ability for an agent workflow to survive time, retries, failures, restarts, and long-running operations without losing state.
EfficientLLM
EfficientLLM is a benchmarking initiative focused on measuring how resource-efficient different LLMs are, beyond just accuracy.
Embeddings
Embeddings are dense, low-dimensional representations of high-dimensional data.
Embeddings / vector search
In Embeddings / vector search, embeddings are numerical representations of text, images, or other data that capture semantic similarity.
Error analysis
Error analysis is the process of grouping failures into categories so teams can understand what to fix.
Eval maturity model
An Eval maturity model describes how teams make evaluation more systematic, automated, and connected to production workflows over time.
EvalOps (CI/CD for agents)
EvalOps (CI/CD for agents) is the operational practice of running evaluations continuously across the agent development and deployment lifecycle.
Evals as APIs
Evals as APIs means exposing evaluation results and workflows through programmable interfaces rather than only through dashboards or reports.
Evaluation as API
Evaluation as API is the API-first version of evaluation infrastructure.
Evaluation as infrastructure
Evaluation as infrastructure means treating evaluation as a core system dependency, similar to logging, observability, testing, and CI/CD.
Evaluation dataset
An Evaluation dataset is a collection of examples used to test an AI system.
Evaluation drift
Evaluation drift occurs when an evaluator stops measuring the behavior the team actually cares about.
Evaluation gating
Evaluation gating is the use of eval results to allow, block, or require review before a change moves forward.
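A gate of this kind can be as small as a threshold check. The sketch below is purely illustrative: the function name `gate` and the thresholds `block_below` and `review_below` are hypothetical, and real gates usually combine several metrics.

```python
def gate(score, block_below=0.7, review_below=0.9):
    """Map an eval score in [0, 1] to one of three gating decisions."""
    if score < block_below:
        return "block"     # change must not move forward
    if score < review_below:
        return "review"    # a human has to approve the change
    return "allow"         # change may proceed automatically


decisions = [gate(0.5), gate(0.8), gate(0.95)]
# decisions == ["block", "review", "allow"]
```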
Evaluation harness
An Evaluation harness is the operational system that turns evals into repeatable workflows and actions.
Evaluation Metric
An Evaluation Metric is the way the performance of a predictive model is quantified and calculated.
Evaluation metrics
Evaluation metrics are the numerical or categorical measures used to judge AI system behavior.
Evaluation pipeline
An Evaluation pipeline is the repeatable workflow that takes evaluation inputs, runs scoring logic, stores results, and triggers follow-up actions.
Evaluation rubric
An Evaluation rubric is a structured set of criteria used to score AI outputs or agent behavior.
Evaluation Store
An Evaluation Store — also sometimes referred to as an inference store — is a machine learning infrastructure tool used to monitor and improve model performance.
Evaluation Window
In machine learning monitoring and observability, the Evaluation Window is the period or duration of time plotted against the metric being calculated.
Evaluations (evals)
Evaluations (evals) are structured tests for measuring the quality of a system, process, or outcome.
Expected Gradients
Expected Gradients are a fast explainability technique useful for differentiable models.
Explainability
Explainability is the extent to which the internal mechanics of a machine learning model can be explained in human-understandable terms.
F-Score
F-Score is a measure of the harmonic mean of precision and recall.
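For the balanced case (F1), the harmonic mean works out as follows. The function name is illustrative, and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)   # 0.5
skewed = f1_score(1.0, 0.5)     # about 0.667, below the arithmetic mean of 0.75
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot score well by maximizing only precision or only recall.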
Failure modes (agents)
Failure modes (agents) are the recurring ways agents break in production.
Faithfulness (vs hallucination)
Faithfulness (vs hallucination) measures whether an answer accurately reflects the provided context without adding unsupported claims.
False Negative
False Negative is when a model mistakenly predicts a negative class, when the value belongs to the positive class.
False Positive
False Positive is when a model mistakenly predicts a positive class, when the value belongs to the negative class.
False Positive Parity
False Positive Parity, commonly used as a model fairness metric, measures whether a model incorrectly predicts something as more likely for a sensitive group than for the base group.
Feature Drift
Feature Drift, also called data drift, covariate drift, or input drift, is a shift in the statistical properties of the independent variable(s), i.e. a drift in the feature distributions and the correlations between variables.
Feature Importance
Feature Importance is a class of explainability techniques that take in all the features related to a model prediction and assign a score to each feature to weigh how much or how little it impacted the outcome.
Feature Importance Heat Map
A Feature Importance Heat Map is a visual representation of the performance of each feature in a given model.
Feature Store
Feature Store is a machine learning infrastructure tool that handles offline and online feature transformations.
Grounding
Grounding means an AI output is supported by the context, data, or sources available to the system.
Human evaluation
Human evaluation uses people to judge AI outputs, traces, or sessions.
Human-in-the-loop
Human-in-the-loop means a human actively participates in the AI workflow before a decision or action is completed.
Human-on-the-loop
Human-on-the-loop means a human supervises an automated system without approving every action in advance.
Individual Conditional Expectation (ICE)
Individual Conditional Expectation (ICE) plots visualize one line per instance to show how the instance’s prediction changes when a feature changes.
Integrated Gradients
Integrated Gradients are a technique for attributing the predictions of a classification model to input features.
Jailbreaking
Jailbreaking is an attempt to bypass a model or system’s safety rules.
Jensen-Shannon Divergence (JS-Divergence)
Jensen-Shannon Divergence (JS-Divergence) is a symmetric measure of similarity between two probability distributions, computed as the average of the Kullback-Leibler divergences of each distribution from their equal-weight mixture.
JS Distance
JS Distance is the square root of the Jensen-Shannon divergence, a symmetric measure derived from KL divergence, and it is used to measure drift.
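A compact sketch of the calculation for discrete distributions is shown below. It assumes both inputs are probability vectors over the same bins and uses base-2 logarithms, which keeps the divergence between 0 and 1; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q from their equal-weight mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence (clamped to guard against rounding below zero)."""
    return math.sqrt(max(0.0, js_divergence(p, q)))


disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # 1.0: no overlap at all
identical = js_distance([0.3, 0.7], [0.3, 0.7])    # 0.0: same distribution
```

Because the mixture is non-zero wherever either input is non-zero, the calculation stays finite even when one distribution has empty bins.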
Kernel SHAP
Kernel SHAP is a slow, perturbation-based Shapley approach that theoretically works for all types of models but is rarely used by teams in the wild (at least in production).
KNN Algorithm
The K Nearest Neighbor (KNN) Algorithm is a simple, non-parametric machine learning technique used for classification and regression tasks.
Kolmogorov Smirnov Test
The Kolmogorov Smirnov Test (KS test), useful in drift monitoring, is an efficient and general way to measure whether two distributions differ significantly from one another.
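The two-sample KS statistic is the largest gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic, not the p-value needed to judge significance; the function name is illustrative.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in sorted(set(sample_a) | set(sample_b)):
        # Empirical CDF at x: fraction of the sample that is <= x.
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


no_drift = ks_statistic([1, 2, 3], [1, 2, 3])      # 0.0
full_shift = ks_statistic([1, 2, 3], [4, 5, 6])    # 1.0
```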
Kullback-Leibler (KL) Divergence
Kullback-Leibler (KL) Divergence measures how one probability distribution differs from a reference probability distribution.
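For discrete distributions the divergence is the sum of p_i * log(p_i / q_i) over all bins. A minimal sketch, assuming both inputs are probability vectors over the same bins and using natural logarithms (the function name is illustrative):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is treated as 0
        if qi == 0:
            return math.inf     # p puts mass where the reference q has none
        total += pi * math.log(pi / qi)
    return total


forward = kl_divergence([0.9, 0.1], [0.5, 0.5])
backward = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

The two directions give different values, which is why KL divergence is not symmetric and the choice of reference distribution matters.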
Labeling / human labeling
Labeling / human labeling is the process of attaching human or machine-generated judgments to examples.
LGTM@K
LGTM@K is a joke from a Vespa.ai presentation on evaluating information retrieval systems and LangChain.
LLM evaluation
LLM evaluation is the practice of measuring whether a large language model or LLM-powered application behaves as intended.
LLM Jailbreaking
LLM Jailbreaking refers to escaping the guardrails and safeguards of an LLM application or foundation model.
Local Interpretable Model-Agnostic Explanations (LIME)
Local Interpretable Model-Agnostic Explanations (LIME) is an explainability method that attempts to provide local explanations for machine learning model predictions.
Logarithmic Loss
Logarithmic Loss tracks incorrect labeling of the data class by the model and penalizes the model when its predicted probabilities deviate from the true labels.
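For binary labels, the loss can be sketched as the average negative log-probability assigned to the true class. The function name and the clipping constant `eps` are assumptions for this example.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; y_prob holds the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


uninformed = log_loss([1, 0], [0.5, 0.5])    # ln(2), about 0.693
confident_right = log_loss([1], [0.9])
confident_wrong = log_loss([1], [0.1])       # much larger penalty
```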
Long-running agents
Long-running agents are agents that operate across extended timeframes rather than a single request-response turn.
MAPoRL (Multi-Agent Post-Co-Training RL)
MAPoRL (Multi-Agent Post-Co-Training RL) is a training methodology that enhances the collaborative capabilities of multiple AI agents.
Mean Absolute Error
Mean Absolute Error (MAE) is a regression loss measure that takes the absolute difference between a model’s predictions and ground truth, averaged across the dataset.
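A short sketch of the calculation (the function name is illustrative):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error([3, 5, 2], [2, 5, 4])
# Absolute errors are 1, 0, and 2, so mae == 1.0
```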
Mean Absolute Percentage Error
Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE.
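As a sketch, the metric averages each absolute error as a fraction of the actual value. The function name is illustrative, and raising an error on zero actuals is one possible convention, since the percentage is undefined there.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |actual - predicted| / |actual|, expressed as a percentage."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


mape = mean_absolute_percentage_error([100, 200], [110, 180])
# Both predictions are off by 10% of the actual value, so mape is about 10.
```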
Mean Square Error (MSE)
Mean Square Error (MSE) is a regression loss measure that averages the squared differences between a model’s predictions and ground truth.
Memory Injection Attack (MINJA)
Memory Injection Attack (MINJA) is a security vulnerability identified in LLM or AI agents that possess persistent memory capabilities.
METEOR Score
METEOR Score (Metric for Evaluation of Translation with Explicit Ordering) is a metric that measures the quality of generated text based on the alignment between the generated text and the reference text.
Misguided Attention Evaluation
Misguided Attention Evaluation is a benchmark designed to test an LLM’s reasoning robustness when faced with misleading or irrelevant context.
Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open standard from Anthropic for connecting AI assistants to external data, content, and tools in a uniform way.
Model Performance
Model Performance indicates a machine learning model’s usability and ability to provide accurate results.
Model Store
Model Store is a machine learning infrastructure tool that serves as a central model registry and tracks experiments.
Monitor Threshold
Monitor Threshold refers to the value set for a model monitor, beyond which the model’s monitoring status will be triggered accordingly.
Monitoring Embeddings
Monitoring Embeddings is necessary because embeddings are not static: new concepts appear in the real world all the time.
MRR (mean reciprocal rank)
MRR (mean reciprocal rank) measures how high the first relevant result appears in the retrieved ranking.
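A minimal sketch of the calculation follows. Each query is represented as a list of 0/1 relevance flags in ranked order, which is an input format assumed for this example, as is scoring a query with no relevant result as 0.

```python
def mean_reciprocal_rank(rankings):
    """Average of 1 / rank of the first relevant result across queries."""
    total = 0.0
    for flags in rankings:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1 / rank
                break  # only the first relevant result counts
    return total / len(rankings)


mrr = mean_reciprocal_rank([
    [0, 1, 0],  # first relevant result at rank 2 -> 1/2
    [1, 0, 0],  # first relevant result at rank 1 -> 1
    [0, 0, 0],  # nothing relevant -> 0
])
# mrr == 0.5
```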
Multi Turn LLM: Conversation Degradation
Multi Turn LLM: Conversation Degradation is the observation that many LLMs “get lost” in extended conversations, showing a significant performance drop as the number of dialogue turns increases.
Multi-Agent Reinforcement Fine-Tuning (MARFT)
Multi-Agent Reinforcement Fine-Tuning (MARFT) is a training paradigm that applies reinforcement learning techniques to fine-tune multiple AI agents simultaneously.
Multi-agent systems
Multi-agent systems are AI systems composed of multiple agents that collaborate, delegate, critique, route, or specialize across tasks.
Multi-Turn Semantic Drift
Multi-Turn Semantic Drift is the multi-turn performance degradation that large language models exhibit during extended dialogues.
Multimodal Model
A Multimodal Model processes and relates information from different types of inputs, like text and images.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field of AI focused on enabling models to read, understand, and generate human language.
nDCG (ranking quality)
nDCG (ranking quality), or normalized Discounted Cumulative Gain, measures ranking quality when results can have graded relevance.
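The sketch below shows one common formulation, assuming linear gain (the relevance grade itself) and a log2 rank discount; some implementations use an exponential gain instead. The function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1])    # already in ideal order -> 1.0
reversed_order = ndcg([1, 2, 3])   # best result ranked last -> below 1.0
```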
Offline vs online evaluation
In Offline vs online evaluation, offline evaluation runs against a fixed dataset outside the production request path, while online evaluation scores live production traffic.
One-Shot Reinforcement Learning Using Verifiable Rewards (RLVR)
One-Shot Reinforcement Learning Using Verifiable Rewards (RLVR) is a method to fine-tune a language model using only a single training example, provided the task has an automatic correctness check.
OpenInference
OpenInference is an open standard for tracing and evaluating AI applications.
Pass/fail criteria
Pass/fail criteria define the conditions under which an AI output, eval run, experiment, or deployment is considered acceptable.
Performance Impact Score
Performance Impact Score is a measure of how much worse your metric of interest is on the slice compared to the average.
Performance Slice
A Performance Slice — also known as a cohort or segment — is a subset of model values of interest in performance analysis and troubleshooting.
Planning (agent planning / task decomposition)
Planning (agent planning / task decomposition) is the process by which an agent breaks a goal into smaller steps before or during execution.
Policy adherence
Policy adherence measures whether an AI system follows defined rules.
Policy-driven agents
Policy-driven agents are agents whose behavior is governed by explicit policies rather than prompt instructions alone.
Population Stability Index (PSI)
Population Stability Index (PSI) measures the magnitude by which a variable has changed or shifted in distribution between two samples over a given period of time.
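PSI is commonly computed over binned proportions as the sum of (actual - expected) * ln(actual / expected). A minimal sketch, assuming both inputs are already binned into proportions over the same bins; the function name and the `eps` floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


stable = population_stability_index([0.5, 0.5], [0.5, 0.5])    # 0.0
shifted = population_stability_index([0.5, 0.5], [0.6, 0.4])   # about 0.04
```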
Precision
Precision is the fraction of values that actually belong to a positive class out of all the values predicted to belong to that class: precision = true positives / (true positives + false positives).
Precision Recall (PR) Curve
The Precision Recall (PR) Curve shows the relationship between precision and recall at particular cut-off values, with the cut-off values set according to the particular model.
Precision@K
Precision@K measures how many of the top K retrieved results are relevant.
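A short sketch, assuming retrieved results are a ranked list of IDs and relevance is given as a set of IDs (both hypothetical input formats):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k


ranked = ["a", "b", "c", "d"]
relevant_ids = {"a", "c"}
p_at_2 = precision_at_k(ranked, relevant_ids, k=2)   # only "a" is relevant -> 0.5
p_at_3 = precision_at_k(ranked, relevant_ids, k=3)   # "a" and "c" -> 2/3
```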
Prediction Drift Impact
Prediction Drift Impact is the product of feature importance and drift (population stability index — PSI).
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a common way to obtain embeddings that does not rely on neural networks.
Prompt drift
Prompt drift happens when prompt behavior changes over time because the prompt was edited, the surrounding context changed, the model changed, or user inputs shifted.
Prompt evaluation
Prompt evaluation measures how changes to prompts affect output quality.
Prompt injection
Prompt injection is an attack where malicious or untrusted text attempts to override system instructions, change tool behavior, leak data, or manipulate an agent’s decisions.
Prompt Management System
A Prompt Management System is a specialized tool for handling and optimizing user inputs to AI and automation applications. It facilitates the organization, tracking, and refinement of prompts, the short commands or queries used to interact with language models or other AI systems.
Quantile
A Quantile is a point dividing the range of a probability distribution into intervals with equal probabilities.
Question Answering Document Retrieval with LLMs
Question Answering Document Retrieval with LLMs is a technique designed to pull specific answers from large documents based on user questions.
RAG evaluation
RAG evaluation measures how well a retrieval-augmented generation system retrieves relevant context and generates grounded answers from it.
RAGEN (LLM Agent Training System)
RAGEN (LLM Agent Training System) is a modular system designed for training and evaluating large language model (LLM) agents using multi-turn reinforcement learning.
Recall
Recall is the fraction of values predicted to be of a positive class out of all the values that truly belong to the positive class (including false negatives).
Recall Parity
Recall Parity, often used as a model fairness metric, measures how “sensitive” the model is for one group compared to another, that is, the model’s ability to predict true positives correctly.
Reference Distribution
In the context of ML observability, the Reference Distribution can be one of a number of different options.
Regression
Regression is a fundamental concept in data science and machine learning used to model the relationship between a dependent variable and one or more independent variables.
Relevance
Relevance measures whether an AI output or retrieved context addresses the user’s actual request.
Retrieval Augmented Generation and Dense Passage Retrieval
Retrieval Augmented Generation and Dense Passage Retrieval use the context of a question to retrieve relevant passages from a large corpus of documents and extract answers.
Retrieval failure
Retrieval failure happens when the system does not fetch the information needed to answer or act correctly.
Retrieval quality
Retrieval quality measures whether the system found the right information for a user query.
Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) is an architecture where a system retrieves external context and provides it to a model before generation.
ROC-AUC
ROC-AUC is the area under the Receiver Operating Characteristic (ROC) curve, which is plotted between true positive rate (TPR) and false positive rate (FPR).
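The area under the ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses that pairwise view instead of integrating the curve; it is quadratic in the number of examples, so it suits illustration rather than large datasets. The function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative pairs are ordered correctly, so auc == 0.75
```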
Root Mean Square Error (RMSE)
Root Mean Square Error (RMSE) — also known as root mean square deviation, RMSD — is a measure of the average magnitude of error in quantitative data predictions.
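A short sketch of the calculation, which takes the square root of the mean squared error (the function name is illustrative):

```python
import math


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference between predictions and truth."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


rmse = root_mean_square_error([0, 0, 0, 0], [2, 2, 2, 2])
# Every squared error is 4, so the mean is 4 and rmse == 2.0
```

Squaring before averaging means large errors weigh more heavily than they do in a mean of absolute errors.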
Safety evaluation
Safety evaluation measures whether an AI system avoids harmful, unsafe, unauthorized, or policy-violating behavior.
Score Models
Score Models generate a numeric value as their prediction or output.
Scoring function
A Scoring function is the logic that turns an input, output, context, trace, or trajectory into a score or label.
Self-healing agents
Self-healing agents are agents that can detect certain failures and recover without waiting for a developer to intervene.
Self-improving agents
Self-improving agents are agents that use observed behavior and evaluation results to improve their prompts, policies, tool use, retrieval, or workflows over time.
Semantic search
Semantic search retrieves results based on meaning rather than exact keyword matching.
Sensitivity
Sensitivity is the fraction of actual positive cases that a model correctly identifies as positive; it is equivalent to recall.
Sequence To Sequence Models
Sequence To Sequence Models take an input sequence, encode it into an internal representation, and then decode that into an output sequence.
Shadow Deployment
A Shadow Deployment is a method of testing a candidate model for production where production data runs through the model without the model actually returning predictions to the service or customers.
Shadow testing (LLMs / agents)
Shadow testing (LLMs / agents) runs a new model, prompt, retriever, or agent version alongside production without showing its output to users.
SHAP Values
SHAP Values are Shapley values computed for each feature of an ML model to explain how that feature contributed to the difference between the model’s prediction for an example and the “average” or expected model prediction.
SHAP: Shapley Additive Explanations
SHAP: Shapley Additive Explanations is a concept derived from game theory that is used to explain the output of machine learning models.
Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a common way to obtain embeddings that does not rely on neural networks.
Specificity
Specificity is the fraction of values predicted to be of a negative class out of all the values that truly belong to the negative class (including false positives).
StarPO (Trajectory Optimization for LLM Agents)
StarPO (Trajectory Optimization for LLM Agents) is a training paradigm for LLM-based agents that optimizes entire interaction trajectories rather than stepwise decisions.
Surrogate Model
A Surrogate Model is an explainability technique where a transparent model is built from the predictions of an actual model.
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimension reduction technique for data visualization that improves on SNE by addressing its cost-function and crowding problems.
Tabular Data
Tabular Data is data in a table format, with columns and rows.
Tags
Tags, often used alongside model features, enable metadata support for slicing and cohorting.
Test set and test cases
Test set and test cases are the collections of inputs and expected behaviors used to evaluate an AI system.
Tokenization
Tokenization is a crucial step in language models as it breaks down text data into smaller units called tokens, such as words or characters.
Tool calling (function calling)
Tool calling (function calling) is the mechanism that lets a model or agent invoke external functions, APIs, databases, retrievers, code execution environments, or other systems.
Tool failure
Tool failure happens when an external function, API, retriever, database, browser, code runner, or other tool call fails or returns bad data.
Tool-N1
Tool-N1 refers to a class of approaches where language models learn to use external tools through trial and error, without explicit step-by-step demonstrations.
Toxicity
Toxicity measures whether an output contains abusive, hateful, harassing, or otherwise harmful language.
Trace
A Trace is the complete execution record of what an AI system actually did while handling a request.
Training data vs evaluation data
Training data vs evaluation data describes the distinction between data used to teach or adapt a model and data used to measure its performance.
TreeSHAP
TreeSHAP is a fast explainer used for analyzing decision tree models in the SHAP Python library.
True Negative
True Negative is when a model correctly predicts a negative class, when the value belongs to the negative class.
True Positive
True Positive is when a model correctly predicts a positive class, when the value belongs to the positive class.
UMAP (Uniform Manifold Approximation and Projection)
UMAP (Uniform Manifold Approximation and Projection) is a technique for visualizing the embedding representation of a dataset using dimension reduction.
Unstructured Data
Unstructured Data is data without a predefined schema, such as text, images, or audio, and accounts for an estimated 80% of data generated by businesses today.
Vector DB
Vector DB, also called a vector database, is a specialized type of database that stores and processes data in vector form.
Wasserstein Distance
Wasserstein Distance — also known as Earth Mover’s Distance — measures the distance between two probability distributions over a given region.
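For one-dimensional samples of equal size, the distance reduces to the average gap between the sorted values. The sketch below covers only that special case, which is an assumption made to keep the example short; the function name is illustrative.

```python
def wasserstein_1d(sample_a, sample_b):
    """1-D Wasserstein distance between two equal-size samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    # Pair the i-th smallest value of one sample with the i-th smallest of the other.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


distance = wasserstein_1d([2, 0, 1], [3, 1, 2])
# Each sorted value moves by exactly 1, so distance == 1.0
```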
Weak supervision
Weak supervision uses noisy, indirect, or programmatically generated labels to create training or evaluation signals at scale.