February 26, 2026 · AI Explorations · 18 min read

A Practical Guide to Harness Engineering: How to Work with AI Agents Every Day

The daily rhythm, habits, and discipline that make agent-driven development actually work. A practical guide with eight pillars, common mistakes, and a week-by-week getting started plan.

  • Claude Code
  • AI Agents
  • Harness Engineering
  • TypeScript
  • Developer Workflow
  • Open Source

The Question Everyone Asks

In my mentorship sessions, one question comes up more than any other: "What does my day-to-day actually look like when I work with AI agents?"

It's a fair question. The hype cycle is loud. Every week there's a new demo of an agent building a to-do app from scratch. But nobody talks about the mundane reality. The daily rhythm, the habits, the discipline that makes agent-driven development actually work on real codebases over weeks and months.

A few weeks ago, OpenAI published Harness Engineering: Leveraging Codex in an Agent-First World. Their team shipped a product with zero lines of manually written code. They described a new role for the software engineer: not someone who writes code, but someone who designs environments, specifies intent, and builds feedback loops.

That post crystallized something I'd been practicing for months. Not the building part. The workflow part. The daily practices that make this sustainable.

This guide walks you through those practices. Whether you use Claude Code, Codex, Cursor, or any other AI coding tool, the principles are the same. I've also built a reference implementation, a full-stack TypeScript app where every line was agent-generated, so you can see these patterns in action. Clone it, explore it, steal what's useful.

But the demo app isn't the point. The workflow is.

The Mental Model Shift

Before anything else, you need to internalize one thing: your job changes.

You stop writing code. You start designing environments. Your value isn't in typing const user = await db.query(...). It's in:

  • Specifying what the system should do (product specs, task descriptions)
  • Designing how the system should be structured (architecture, rules, constraints)
  • Building feedback loops that catch mistakes and compound improvements
  • Verifying that the output is correct

Think of yourself as a staff engineer working with a very fast, very literal junior developer who has perfect recall but no judgment. Your job is to give that developer the context, guardrails, and verification tools they need to produce correct output consistently.

That's harness engineering. You build the harness. The agent does the work.

Start with a Map, Not a Manual

The first thing you need is an entry point, a file that tells the agent where everything is and what the rules are.

If you're using Claude Code, this is CLAUDE.md. For Codex, it's AGENTS.md. For other tools, it might be a system prompt or a project-level config. The name doesn't matter. What matters is how you write it.

Keep it short. Your entry point should be roughly 100 to 150 lines. It's a table of contents, not an encyclopedia. It points to deeper sources of truth:

CLAUDE.md                    # Map, ~100 lines, always in context
ARCHITECTURE.md              # Domain map and layering rules
docs/
├── design-docs/             # Catalogued design decisions
├── exec-plans/              # Active and completed plans
├── product-specs/           # Product requirements
└── QUALITY_SCORE.md         # Quality grades by domain
.claude/
├── rules/                   # File-scoped rules
└── skills/                  # Step-by-step playbooks

The OpenAI team calls this progressive disclosure. You give the agent a small, stable starting point. It looks deeper only when it needs to. This prevents context pollution because the agent doesn't waste tokens on architecture docs when it's fixing a typo.

Use file-scoped rules. Most AI coding tools support rules that activate only for certain file patterns. When the agent edits a test file, it sees testing rules. When it edits a domain file, it sees domain rules. This keeps each task's context focused and relevant.

Make the repository the system of record. Design decisions, product specs, execution plans, quality scores: put them all in the repo. If the agent can't find it in the repository, it doesn't exist. External wikis and Notion pages are invisible to your agent.

Design Your Architecture for Agents

Agents thrive in rigid, predictable structures. The more consistent your codebase, the more accurately the agent can reason about it.

The single best architectural pattern for agent-driven development is a layered domain model with forward-only dependencies:

types.ts        → (no imports, leaf node)
config.ts       → types only
repo.ts         → types, config
service.ts      → types, config, repo
runtime.ts      → types, config, repo, service
ui/routes.ts    → types, config, service, runtime

Each business domain follows the same structure. The agent learns the pattern once and applies it everywhere. When it needs to add a new feature, it knows exactly where each piece goes.

Cross-cutting concerns like authentication, database connections, logging, and feature flags enter through a providers/ layer. Domain code never imports infrastructure directly. This boundary gives you a clean seam for mocking in tests and swapping implementations.
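As a sketch of that seam (the names here are illustrative, not taken from the reference repo): domain code depends only on an interface, and the providers layer supplies the real implementation at wiring time, or a test double in tests.

```typescript
// Hypothetical provider seam: domain code sees only the interface.
interface Logger {
  info(message: string): void;
}

// The real implementation lives in providers/ and is injected at wiring time.
const consoleLogger: Logger = {
  info: (message) => console.log(`[info] ${message}`),
};

// A test double satisfies the same contract with no infrastructure at all.
function makeMemoryLogger(): Logger & { lines: string[] } {
  const lines: string[] = [];
  return { lines, info: (message) => lines.push(message) };
}

// Domain code receives the provider; it never imports console or a log library.
function registerUser(name: string, logger: Logger): string {
  logger.info(`registered ${name}`);
  return name.toLowerCase();
}
```

Because the dependency arrives through the interface, swapping `consoleLogger` for `makeMemoryLogger()` in a test requires no mocking framework.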

Parse at boundaries, trust interiors. Validate all external input (HTTP requests, environment variables, database results) at the boundary using a schema library like Zod. Once data crosses the boundary, interior code operates on typed, validated data without re-validation. This gives the agent a reliable contract: if data reaches a service function, it's already clean.
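In a real codebase the boundary parser would be a Zod schema (`z.object({...}).parse(input)`); this dependency-free sketch shows the shape of the guarantee, with a hypothetical `User` payload:

```typescript
// Hypothetical boundary parser. With Zod this would be a schema plus
// schema.parse(input); the contract it gives interior code is the same.
interface User {
  id: string;
  age: number;
}

// Throws at the boundary on bad input; interior code receives a typed User
// and never re-validates.
function parseUser(input: unknown): User {
  if (typeof input !== "object" || input === null) {
    throw new Error("user payload must be an object");
  }
  const record = input as Record<string, unknown>;
  if (typeof record.id !== "string" || typeof record.age !== "number") {
    throw new Error("user payload needs a string id and a numeric age");
  }
  return { id: record.id, age: record.age };
}
```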

Use the Result pattern over thrown exceptions. Return { ok: true, data } or { ok: false, error } from every domain function. This forces callers to handle errors explicitly. The compiler won't let them forget. Agents are much more reliable when error handling lives in the type system rather than in thrown exceptions that callers can silently forget to catch.

// Without Result: caller might forget try/catch
const user = await findUser(id); // throws on not found?

// With Result: compiler enforces handling
const result = await findUser(id);
if (!result.ok) {
  return err({ code: 'NOT_FOUND', message: result.error.message });
}
// TypeScript knows result.data is User here

Your Daily Workflow: The Eight Pillars

This is the section people are really asking about. Here's what a typical day looks like, codified into eight practices.

1. Always plan first

Enter plan mode for every task. Not just big features. Everything. The agent reads the relevant files, designs an approach, and presents it to you before writing a single line of code.

This feels slow at first. It's not. Planning catches misunderstandings before they become 200 lines of wrong code. When something goes sideways mid-task, stop and re-plan immediately. Don't keep pushing.

2. Manage your context window

Your agent has a finite working memory. Treat context like RAM: files are persistent storage; context is working memory. You wouldn't keep every variable in registers.

The discipline is simple: save state to disk after every meaningful step. Keep a tasks/todo.md that tracks current progress. After a significant step, update it. If you need to compact the conversation or start a new session, you lose nothing because the state is on disk.

After about 20 turns of conversation, compact or start fresh. Long sessions lose coherence.

3. Use subagents strategically

When you need to research something (find all usages of a function, understand how a library works, explore a new area of the codebase), spawn a subagent. Let it explore and bring back the answer.

Never bloat your main context with 10+ file reads for research. One task per subagent. The main context stays clean and focused on execution.

4. Build a self-improvement loop

This is probably the highest-leverage practice. When the agent makes a mistake and you correct it, log the correction as a lesson. Keep a tasks/lessons.md file:

### Always validate env vars with Zod

**Trigger**: Used Number(x) || undefined which treats 0 as undefined
**Pattern**: Using || with numeric coercion loses falsy values
**Rule**: Use Zod schemas for all env var parsing
**Recurrence**: 0 (never happened again)

The agent reads lessons at the start of every session. If a lesson keeps coming up, promote it to a linter rule or structural test. Every correction compounds into a permanent fix.

5. Verify before marking done

"It compiles" is not verification. "Tests pass" is not sufficient either. A CSS regression can pass every test while the page looks completely broken.

Your verification checklist should include:

  • Type-checking passes
  • All linters pass
  • Unit, structural, and E2E tests pass
  • Visual inspection: actually look at what the agent built, ideally through automated screenshots
  • Full validation pipeline passes

If you have access to Browser MCP servers (Playwright MCP, Chrome DevTools MCP), use them for automated visual verification. Take screenshots at desktop and mobile viewports. Compare results. This catches layout bugs, missing styles, and visual regressions that tests miss.

6. Demand elegance (selectively)

For non-trivial changes, pause and ask: "Is there a cleaner way to do this?" The agent will often produce working but clunky code on the first pass. A moment of reflection leads to better abstractions.

Skip this for simple, obvious fixes. Don't over-engineer a one-line bug fix.

7. Fix bugs autonomously

When the agent encounters a bug, it should just fix it. The workflow: reproduce, trace the cause, classify (logic error, missing validation, wrong assumption), fix, add a test that would have caught it, prevent recurrence through a rule or lesson.

No back-and-forth. No "what should I do about this error?" Just investigate and resolve.

8. Fresh context for fresh work

When you switch to a different kind of task, save state and start clean. Don't carry debugging context from a CSS fix into a backend API change. Stale context leads to confused output.

Enforce Quality Mechanically

Here's a hard truth: documentation rules decay. You can write the clearest, most detailed instructions in your entry point file, and the agent will eventually ignore them. Not because it's broken, but because documentation is a suggestion. Code is a constraint.

I learned this through direct experience. I wrote a rule: "Always run visual verification before marking a task done." The agent ignored it. I added the rule to three more files. The agent acknowledged it and then forgot. Three times it declared tasks complete without verifying.

So I stopped writing rules and started writing code.

The escalation ladder

When a pattern of mistakes emerges, escalate through these levels:

  1. Lesson: Log it in your lessons file. This catches most things.
  2. Documentation: Add it to your entry point or scoped rules. This catches some more.
  3. Linter: Write a custom linter that detects the violation. Now it fails the build.
  4. Git hook: Add a pre-commit or pre-push hook. Now it blocks the commit.
  5. Session hook: Block the agent from ending a session with the violation unresolved.

Each level catches failures the previous level missed. Invest in mechanical enforcement early. It pays for itself immediately.

Custom linters as agent prompts

Here's an insight from the OpenAI post that changed how I think about linters: linter error messages are prompts. When your linter detects a violation, don't just say "illegal import." Print remediation instructions:

VIOLATION: src/domains/auth/repo.ts imports from service layer
  → repo can only import: types, config
  → Move this logic to the service layer, or extract shared types to types.ts

That remediation instruction goes directly into the agent's context. The linter error becomes a fix instruction. The agent reads it, understands the constraint, and fixes the violation, usually on the first try.

Write every linter message as if it's a prompt for the agent that will read it. This one change makes your linters dramatically more effective.
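As a sketch of such a linter check (the layer map and message wording are illustrative, not the repo's actual linter):

```typescript
// Hypothetical layering linter: checks an import against the layer map and
// returns remediation text written for the agent that will read it.
const allowedImports: Record<string, string[]> = {
  types: [],
  config: ["types"],
  repo: ["types", "config"],
  service: ["types", "config", "repo"],
};

function lintImport(layer: string, imported: string): string | null {
  const allowed = allowedImports[layer] ?? [];
  if (allowed.includes(imported)) return null; // legal import, no finding
  return [
    `VIOLATION: ${layer} imports from ${imported}`,
    `  → ${layer} can only import: ${allowed.join(", ") || "(nothing)"}`,
    `  → Move this logic to a higher layer, or extract shared types to types.ts`,
  ].join("\n");
}
```

A real version would walk the filesystem and parse import statements, but the message format is the point: every line doubles as an instruction the agent can act on.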

Git hooks as enforcement gates

Your pre-commit hook should run: formatting, type-checking, all custom linters, and any verification gate you need. Your pre-push hook should run the full validation pipeline including tests.

If verification matters (and it does), add a verification gate. A script that checks a checklist file before allowing commits. If the checklist isn't complete, the commit is blocked. The environment prevents the mistake.

# .husky/pre-commit
npx lint-staged                    # Prettier on staged files
npm run typecheck                  # TypeScript type checking
npm run lint                       # All custom linters
bash scripts/check-verification.sh # Verification gate
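The gate itself can be tiny. A sketch in TypeScript (the repo's actual gate is a shell script, and the markdown checklist format here is an assumption):

```typescript
// Hypothetical verification gate: scan a markdown checklist for unchecked
// items. A wrapper script would read the checklist file, call this, and
// exit non-zero if anything comes back, which blocks the commit.
function findUnchecked(checklist: string): string[] {
  return checklist
    .split("\n")
    .filter((line) => line.trimStart().startsWith("- [ ]"))
    .map((line) => line.trim());
}
```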

Make Your Application Visible to the Agent

One of the biggest unlocks in agent-driven development is giving the agent eyes. When the agent can see what it built, it catches problems that tests miss.

Browser MCP servers

If your AI coding tool supports MCP (Model Context Protocol), configure Browser MCP servers. These give the agent live access to your running application. It can navigate pages, take screenshots, inspect the DOM, and verify visual output.

Two popular options:

  • Playwright MCP (@playwright/mcp): browser automation through Playwright
  • Chrome DevTools MCP (chrome-devtools-mcp): direct Chrome DevTools Protocol access

Using both gives you independent verification from two different browser inspection tools. Discrepancies between them flag real issues.

The verification workflow

After implementing a change:

  1. The agent navigates to every affected page using the Browser MCP
  2. It takes screenshots at desktop (1440px) and mobile (375px) viewports
  3. It reads the screenshots (multimodal models can analyze images) and checks for layout issues, missing elements, visual regressions
  4. It runs E2E tests for interactive behavior
  5. It runs the full validation pipeline

Why both visual and E2E? A CSS regression can pass every E2E test while the page looks completely wrong. A backend API change can break page rendering while all unit tests pass. Visual inspection and automated tests catch different classes of bugs. Neither replaces the other.

Making the app bootable

For the agent to see your app, it needs to be running locally. Keep your dev setup simple and fast:

  • One command to start the backend
  • One command to start the frontend
  • Database in Docker with a setup script
  • Seed data for testing

The easier it is to boot the app, the more likely verification actually happens.

The Self-Improvement Loop in Practice

The self-improvement loop is the mechanism that makes your harness smarter over time. Every session is better than the last, not because the agent improves, but because the environment does.

How corrections compound

Early in my experience with harness engineering, the agent used Number(process.env.PORT) || 3000 for parsing environment variables. Looks reasonable. But Number("0") is falsy, so this silently falls back to 3000. The agent made this mistake twice. After the second correction, I logged a lesson: "Always validate env vars with Zod." That mistake never happened again.
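The lesson's rule is to reach for Zod, but the underlying bug is worth seeing on its own. This dependency-free sketch contrasts the falsy fallback with an explicit parse:

```typescript
// The buggy pattern: Number("0") is 0, which is falsy, so "0" silently
// becomes the fallback. Shown only to contrast with the fix below.
const buggyParse = (raw: string | undefined, fallback: number) =>
  Number(raw) || fallback;

// The fix: handle "unset" and "invalid" explicitly, so 0 survives.
// A Zod schema would express the same thing declaratively.
function parseEnvInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value)) throw new Error(`invalid integer: "${raw}"`);
  return value;
}
```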

Small correction, permanent fix. Multiply this across dozens of lessons and your codebase becomes remarkably resilient.

When documentation fails, promote to code

The verification rule is the clearest example. I wrote "always verify before marking done" in five different files. The agent ignored it three times. So I built a pre-commit hook that blocks commits without a completed verification checklist, and a session hook that prevents the agent from ending a session with unverified changes.

Now the environment prevents the mistake. The agent can't skip verification because the system won't let it.

This escalation (lesson, then rule, then mechanical hook) is the self-improvement loop's most powerful pattern. Every time documentation fails, you have an opportunity to make the system mechanically correct.

Skills as playbooks

For recurring complex tasks (verifying changes, investigating bugs, scaffolding a new domain), write step-by-step playbooks. Store them as markdown files the agent reads before starting work:

.claude/skills/
├── verify-change/SKILL.md      # 7-step verification checklist
├── drive-app/SKILL.md          # Browser MCP workflow
├── investigate-failure/SKILL.md # Bug reproduction and fix
├── self-improve/SKILL.md       # Learn from corrections
├── add-domain/SKILL.md         # Scaffold new domain
└── manage-context/SKILL.md     # Context window management

Skills encode your team's standards into repeatable checklists. The agent reads the skill, follows the steps, and produces consistent output. Over time, you accumulate a library of skills that represent your team's best practices.

Common Mistakes When Starting Out

These are the patterns I see most often in mentorship sessions. If you're just getting started, watch out for:

Giving the agent too much context upfront

A monolithic 500-line instructions file doesn't work. The agent drowns in context. Start with a concise entry point (~100 lines) and use progressive disclosure. Point to deeper docs, don't dump everything.

Skipping verification because "tests pass"

Tests validate behavior. They don't validate appearance, layout, responsiveness, or the dozen things that can go wrong visually. Build visual verification into your workflow. If you can't use Browser MCPs, at minimum open the page yourself before approving.

Not saving state to disk

When you compact a conversation or start a new session, anything only in context is gone. Save progress to a task file after every meaningful step. This one habit prevents more wasted work than any other.

Over-engineering the scaffolding before you have a real task

Don't spend three days building the perfect harness before writing any application code. Start with the basics: an entry point file, a simple architecture rule, one linter, and iterate. The best harness emerges from real tasks, not theoretical planning.

Treating the agent like autocomplete

Autocomplete suggests the next line. An agent executes a task. If you're prompting line by line ("now add a try/catch here"), you're using the agent wrong. Describe what you want at the task level ("add error handling to the registration flow using the Result pattern") and let the agent figure out the implementation.

Not building a self-improvement loop

If you correct the agent and don't log the lesson, the same mistake will happen again next session. The correction is worth nothing unless it compounds. Log every correction. Review lessons at session start.

Getting Started

You don't need to adopt everything at once. Start small, iterate, and let the harness grow organically.

Week 1: The basics

  1. Pick one codebase you're actively working on
  2. Write an entry point file (~50 to 100 lines): project description, key directories, main rules
  3. Start a lessons file: empty, ready to capture corrections
  4. Do one real task with the agent and capture what you learn

Week 2: Add structure

  1. Add file-scoped rules for your most-edited file types (frontend, tests, backend)
  2. Write your first custom linter for a rule you've corrected more than twice
  3. Add a task tracker file to save state between sessions
  4. Practice context management: compact after 15 to 20 turns, use subagents for research

Week 3: Build verification

  1. Set up dev servers so the agent can see the running app
  2. Configure Browser MCPs if your tool supports them
  3. Add a pre-commit hook with type-checking and linting
  4. Write your first E2E test for a critical user flow

Ongoing: Let it compound

  1. Log every correction as a lesson
  2. Promote recurring lessons into rules, then into linters or hooks
  3. Write skill playbooks for complex recurring tasks
  4. Review and prune: remove outdated lessons, update stale docs

Explore the demo

The entire reference implementation is open source. Clone it and explore how these patterns look in practice:

git clone https://github.com/haripery/harness-engineering.git
cd harness-engineering
npm install

Key files to study:

  • CLAUDE.md: the agent entry point (~100 lines)
  • ARCHITECTURE.md: the layered domain model
  • tasks/lessons.md: the self-improvement loop in action
  • .claude/skills/: nine agent skill playbooks
  • linters/: four custom architecture linters
  • scripts/check-verification.sh: the verification gate

See docs/GETTING_STARTED.md for full setup instructions including database, test credentials, and MCP configuration.

The Shift

The discipline didn't go away. It moved from the code to the scaffolding. You spend your time writing specs, designing constraints, capturing lessons, and verifying outcomes. The agent writes everything else.

OpenAI's post ends with a line worth remembering: "building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code."

Whether you use Claude Code, Codex, or something else, the principles are the same:

  • Start with a map, not a manual
  • Design your architecture for agent predictability
  • Enforce rules with linters and hooks, not documentation
  • Build a self-improvement loop that compounds every session
  • Make your application visible to the agent
  • Verify before declaring done
  • Save state to disk, always
  • Treat context like memory and manage it actively

The question isn't whether agents can write your code. They can. The question is whether you've built the harness that lets them do it well.

What's Next: Packaging the Harness as a Plugin

Everything in this guide (the skills, verification hooks, MCP configurations, self-improvement loop) lives inside the repository's .claude/ directory. That works when you're building from scratch or cloning the demo. But what if you want to apply these practices to an existing codebase without copying files around?

That's where Claude Code plugins come in. A plugin is a self-contained package that bundles skills, hooks, MCP configs, and agents into a single installable unit. Instead of manually setting up verification workflows, self-improvement loops, and Browser MCP servers in every project, you'd run /plugin install harness-engineering and get the entire methodology wired up instantly.

The next post will walk through packaging this harness as a distributable plugin, turning a project-specific workflow into a portable methodology that anyone can install with one command.

This guide is based on practices developed using Claude Code and Opus 4.6. The reference implementation demonstrates every pattern described here.

Support independent writing

If this post was useful, consider supporting my open source work and independent writing.
