AI Specification Template

A structured template for specifying AI-assisted features, agent behaviours, or LLM-integrated components in a software project. Use this to document what the AI should do, how it should behave, and where the boundaries are.

The Template

# AI Specification: [Feature/Component Name]

## Overview

- **Purpose**: [One sentence — what does this AI component do?]
- **Model**: [e.g. Claude Sonnet 4.6, GPT-4o, local Llama 3]
- **Integration**: [e.g. API call, Claude Code agent, embedded SDK]
- **Owner**: [Team or person responsible]

## Behaviour

### Core Function

[Describe what the AI does in concrete terms. Focus on inputs, outputs,
and the transformation between them.]

- **Input**: [What the AI receives — e.g. user query, code diff, document]
- **Processing**: [What the AI does — e.g. summarise, classify, generate]
- **Output**: [What the AI produces — e.g. JSON response, text, code patch]

### Persona / System Prompt

[Define the AI's role, tone, and constraints at the system level.]

```
You are a [role] that [primary function].

Rules:
- [Constraint 1]
- [Constraint 2]
- [Tone/style guidance]
```

### Examples

#### Example 1: [Scenario Name]

**Input:**
[Sample input]

**Expected Output:**
[Sample output]

#### Example 2: [Edge Case]

**Input:**
[Edge case input]

**Expected Output:**
[How the AI should handle it]

## Boundaries

### Must Do

- [Required behaviour — e.g. "Always cite sources"]
- [Required behaviour — e.g. "Return valid JSON"]

### Must Not Do

- [Prohibited behaviour — e.g. "Never fabricate data"]
- [Prohibited behaviour — e.g. "Never expose internal system details"]

### Fallback Behaviour

[What happens when the AI cannot fulfil a request or encounters
an error. Define graceful degradation.]

- **Uncertain input**: [e.g. "Ask for clarification"]
- **Out of scope request**: [e.g. "Respond with a polite refusal"]
- **Model failure/timeout**: [e.g. "Return cached response or error message"]

## Data

### Input Data

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| [field] | [string/object/array] | [Yes/No] | [What it contains] |

### Output Schema

```json
{
  "result": "[description]",
  "confidence": "[0.0-1.0]",
  "metadata": {
    "model": "[model used]",
    "tokens_used": "[count]"
  }
}
```

### Context / RAG Sources

- [Document collection, database, or knowledge base the AI draws from]
- [How context is retrieved — e.g. vector search, keyword match]
- [Maximum context window budget allocation]

## Quality & Evaluation

### Success Criteria

| Metric | Target | Measurement |
|--------|--------|-------------|
| Accuracy | [e.g. >95%] | [How measured — e.g. human review, test suite] |
| Latency | [e.g. <2s p95] | [Monitoring tool or method] |
| Cost | [e.g. <$0.01/request] | [Token tracking method] |

### Test Cases

1. [Test description] → Expected: [outcome]
2. [Test description] → Expected: [outcome]
3. [Edge case test] → Expected: [outcome]

### Human Review

- **Review frequency**: [e.g. Weekly sample of 50 outputs]
- **Escalation path**: [When and how human review is triggered]
- **Feedback loop**: [How review results improve the system]

## Safety & Ethics

### Content Filtering

- [Pre-processing filters on input — e.g. PII detection]
- [Post-processing filters on output — e.g. toxicity check]

### Bias Considerations

- [Known bias risks for this use case]
- [Mitigation strategies]

### Audit Trail

- [What is logged — e.g. inputs, outputs, model version, timestamps]
- [Retention policy]
- [Access controls on logs]

## Implementation

### Architecture

```
[User/System] → [Pre-processing] → [Model API] → [Post-processing] → [Output]
```

### Configuration

| Parameter | Value | Notes |
|-----------|-------|-------|
| Model | [e.g. claude-sonnet-4-6] | [Why this model] |
| Temperature | [e.g. 0.3] | [Lower = more deterministic] |
| Max tokens | [e.g. 1024] | [Output length limit] |
| Top-p | [e.g. 0.9] | [Nucleus sampling threshold] |

### Dependencies

- [API keys / credentials needed]
- [SDKs or libraries — e.g. anthropic Python SDK]
- [Infrastructure — e.g. Redis for caching, vector DB for RAG]

### Cost Estimate

| Scenario | Requests/day | Avg tokens | Daily cost |
|----------|-------------|------------|------------|
| Low usage | [count] | [tokens] | [$amount] |
| Normal | [count] | [tokens] | [$amount] |
| Peak | [count] | [tokens] | [$amount] |

## Rollout

- [ ] Prototype with hardcoded examples
- [ ] Integration with real data source
- [ ] Internal testing (team review)
- [ ] Staged rollout (% of traffic)
- [ ] Full deployment
- [ ] Monitoring and alerting configured

Section Guide

Overview

Establish the basics upfront. Knowing the model, integration method, and owner prevents ambiguity later. If the model choice is not yet decided, list candidates with trade-offs.

Behaviour

The most important section. Define inputs and outputs precisely — ideally with types and schemas. The system prompt section captures the AI’s "personality" and hard rules. Examples are critical: they serve as both documentation and test cases.
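Because the documented examples double as test cases, they can be checked mechanically. A minimal sketch, assuming a hypothetical ticket-classification component (`classify`, `SYSTEM_PROMPT`, and the example pairs are all illustrative, not part of the template):

```python
# Hypothetical filled-in spec: the system prompt captures role and hard rules,
# and each documented example becomes an executable test case.
SYSTEM_PROMPT = """You are a support-ticket classifier that labels tickets.

Rules:
- Return exactly one label from: billing, bug, feature_request
- Respond with the label only, in lowercase
"""

EXAMPLES = [
    {"input": "I was charged twice this month", "expected": "billing"},
    {"input": "The export button crashes the app", "expected": "bug"},
]

def check_examples(classify, examples):
    """Run every spec example through the model call and collect mismatches."""
    failures = []
    for ex in examples:
        got = classify(ex["input"])
        if got != ex["expected"]:
            failures.append((ex["input"], ex["expected"], got))
    return failures
```

In practice `classify` would wrap the real model call with `SYSTEM_PROMPT`; wiring the spec's Examples section into CI like this keeps documentation and behaviour from drifting apart.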

Boundaries

Separate "must do" from "must not do" to make review easier. The fallback behaviour section is often overlooked but essential for production systems. Every AI component will eventually receive unexpected input — define what happens.

Data

Document schemas explicitly. For RAG-based systems, describe the retrieval strategy and context budget. This section is the contract between the AI component and the rest of the system.
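Enforcing that contract at runtime is straightforward. A sketch that validates a response against the template's example Output Schema (the field names mirror that schema; adapt them to your own):

```python
# Runtime check on the Output Schema: reject any model response that does not
# match the documented contract before it reaches downstream consumers.
import json

def validate_output(raw: str) -> dict:
    """Parse a model response and enforce the spec's output schema."""
    data = json.loads(raw)  # must be valid JSON per the "Must Do" boundary
    if not isinstance(data.get("result"), str):
        raise ValueError("result must be a string")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    meta = data.get("metadata", {})
    if "model" not in meta or "tokens_used" not in meta:
        raise ValueError("metadata must record model and tokens_used")
    return data
```

For larger schemas, a declarative validator (e.g. JSON Schema or Pydantic) is a natural upgrade from hand-rolled checks like this.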

Quality & Evaluation

Without measurable criteria, you cannot know if the AI component is working. Define metrics before building. The human review section ensures there is a feedback loop — AI systems degrade without ongoing evaluation.
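The Success Criteria table translates directly into a check that can gate deployment. A sketch using the illustrative targets from the table (>95% accuracy, <2s p95 latency, <$0.01/request); the result-record shape is an assumption:

```python
# Evaluate a batch of test-case results against the spec's success criteria.
# Each result is a dict with 'correct' (bool), 'latency_s', and 'cost_usd'.
def evaluate(results):
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    latencies = sorted(r["latency_s"] for r in results)
    p95 = latencies[min(n - 1, int(0.95 * n))]  # simple p95 approximation
    avg_cost = sum(r["cost_usd"] for r in results) / n
    return {
        "accuracy_ok": accuracy > 0.95,   # target: >95%
        "latency_ok": p95 < 2.0,          # target: <2s p95
        "cost_ok": avg_cost < 0.01,       # target: <$0.01/request
    }
```

Running this over the weekly human-review sample closes the feedback loop the template asks for.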

Safety & Ethics

Scale this section to your risk level. An internal developer tool needs less here than a customer-facing chatbot. At minimum, document what is logged and who can access it.

Implementation

Concrete technical details for whoever builds and maintains this. The cost estimate prevents surprises — LLM API costs can scale unexpectedly.
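The Cost Estimate table is simple arithmetic over request volume and token counts. A sketch with placeholder per-million-token prices (the `3.0` and `15.0` defaults are illustrative only; substitute your provider's current rates):

```python
# Estimate daily spend from request volume and average token usage.
# Prices are placeholders expressed in USD per million tokens.
def daily_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
               usd_per_m_input=3.0, usd_per_m_output=15.0):
    input_usd = requests_per_day * avg_input_tokens / 1e6 * usd_per_m_input
    output_usd = requests_per_day * avg_output_tokens / 1e6 * usd_per_m_output
    return input_usd + output_usd
```

Filling the table's Low/Normal/Peak rows is then three calls with different volumes, which makes it easy to see how costs scale before they surprise you.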

When to Use This Template

- **New AI feature**: Before writing any code, fill in this specification to align the team on behaviour, boundaries, and success criteria.
- **Existing AI audit**: Retroactively document an AI component that was built without a specification.
- **Vendor evaluation**: Compare models or providers by filling in the Configuration and Cost sections for each candidate.
- **Compliance review**: The Safety & Ethics and Audit Trail sections provide a starting point for compliance documentation.