February 26, 2026 · AI Explorations · 18 min read
The daily rhythm, habits, and discipline that make agent-driven development actually work. A practical guide with eight pillars, common mistakes, and a week-by-week getting started plan.
In my mentorship sessions, one question comes up more than any other: "What does my day-to-day actually look like when I work with AI agents?"
It's a fair question. The hype cycle is loud. Every week there's a new demo of an agent building a to-do app from scratch. But nobody talks about the mundane reality. The daily rhythm, the habits, the discipline that makes agent-driven development actually work on real codebases over weeks and months.
A few weeks ago, OpenAI published Harness Engineering: Leveraging Codex in an Agent-First World. Their team shipped a product with zero lines of manually written code. They described a new role for the software engineer: not someone who writes code, but someone who designs environments, specifies intent, and builds feedback loops.
That post crystallized something I'd been practicing for months. Not the building part. The workflow part. The daily practices that make this sustainable.
This guide walks you through those practices. Whether you use Claude Code, Codex, Cursor, or any other AI coding tool, the principles are the same. I've also built a reference implementation, a full-stack TypeScript app where every line was agent-generated, so you can see these patterns in action. Clone it, explore it, steal what's useful.
But the demo app isn't the point. The workflow is.
Before anything else, you need to internalize one thing: your job changes.
You stop writing code. You start designing environments. Your value isn't in typing const user = await db.query(...). It's in specifying intent, designing constraints, and building the feedback loops that verify the agent's output.
Think of yourself as a staff engineer working with a very fast, very literal junior developer who has perfect recall but no judgment. Your job is to give that developer the context, guardrails, and verification tools they need to produce correct output consistently.
That's harness engineering. You build the harness. The agent does the work.
The first thing you need is an entry point, a file that tells the agent where everything is and what the rules are.
If you're using Claude Code, this is CLAUDE.md. For Codex, it's AGENTS.md. For other tools, it might be a system prompt or a project-level config. The name doesn't matter. What matters is how you write it.
Keep it short. Your entry point should be roughly 100 to 150 lines. It's a table of contents, not an encyclopedia. It points to deeper sources of truth:
CLAUDE.md # Map, ~100 lines, always in context
ARCHITECTURE.md # Domain map and layering rules
docs/
├── design-docs/ # Catalogued design decisions
├── exec-plans/ # Active and completed plans
├── product-specs/ # Product requirements
└── QUALITY_SCORE.md # Quality grades by domain
.claude/
├── rules/ # File-scoped rules
└── skills/ # Step-by-step playbooks
The OpenAI team calls this progressive disclosure. You give the agent a small, stable starting point. It looks deeper only when it needs to. This prevents context pollution because the agent doesn't waste tokens on architecture docs when it's fixing a typo.
Use file-scoped rules. Most AI coding tools support rules that activate only for certain file patterns. When the agent edits a test file, it sees testing rules. When it edits a domain file, it sees domain rules. This keeps each task's context focused and relevant.
Make the repository the system of record. Design decisions, product specs, execution plans, quality scores: put them all in the repo. If the agent can't find it in the repository, it doesn't exist. External wikis and Notion pages are invisible to your agent.
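As a concrete illustration, a progressive-disclosure entry point might look like this. This is an invented sketch, not the reference repo's actual CLAUDE.md; the paths and rules are assumptions consistent with the structure above:

```markdown
# Project map

Read ARCHITECTURE.md before touching domain code.

## Where things live
- Domain code: src/domains/<name>/ (types → config → repo → service → runtime)
- Design decisions: docs/design-docs/
- Active plans: docs/exec-plans/
- Lessons from past corrections: tasks/lessons.md (read at session start)

## Hard rules
- Validate all external input at the boundary
- Return Result, never throw, from domain functions
- Run the verification checklist before marking any task done
```

Each bullet points somewhere deeper; the file itself stays small and stable.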
Agents thrive in rigid, predictable structures. The more consistent your codebase, the more accurately the agent can reason about it.
The single best architectural pattern for agent-driven development is a layered domain model with forward-only dependencies:
types.ts → (no imports, leaf node)
config.ts → types only
repo.ts → types, config
service.ts → types, config, repo
runtime.ts → types, config, repo, service
ui/routes.ts → types, config, service, runtime
Each business domain follows the same structure. The agent learns the pattern once and applies it everywhere. When it needs to add a new feature, it knows exactly where each piece goes.
Cross-cutting concerns like authentication, database connections, logging, and feature flags enter through a providers/ layer. Domain code never imports infrastructure directly. This boundary gives you a clean seam for mocking in tests and swapping implementations.
Parse at boundaries, trust interiors. Validate all external input (HTTP requests, environment variables, database results) at the boundary using a schema library like Zod. Once data crosses the boundary, interior code operates on typed, validated data without re-validation. This gives the agent a reliable contract: if data reaches a service function, it's already clean.
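A hand-rolled sketch of the boundary parse. In practice you'd use a schema library like Zod as described above; the Env shape and field names here are assumptions for illustration:

```typescript
// Everything outside this function is untrusted; everything inside the app
// receives a typed, validated Env and never re-validates.
type Env = { PORT: number; DATABASE_URL: string };

function parseEnv(raw: Record<string, string | undefined>): Env {
  const port = Number(raw.PORT ?? "3000");
  if (!Number.isInteger(port) || port < 0 || port > 65535) {
    throw new Error(`PORT must be an integer 0-65535, got: ${raw.PORT}`);
  }
  if (!raw.DATABASE_URL) {
    throw new Error("DATABASE_URL is required");
  }
  return { PORT: port, DATABASE_URL: raw.DATABASE_URL };
}
```

Call parseEnv once at startup; every function downstream takes Env and can trust it.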
Use the Result pattern over thrown exceptions. Return { ok: true, data } or { ok: false, error } from every domain function. This forces callers to handle errors explicitly. The compiler won't let them forget. Agents are much more reliable when error handling is in the type system rather than implicit try/catch blocks.
// Without Result: caller might forget try/catch
const user = await findUser(id); // throws on not found?
// With Result: compiler enforces handling
const result = await findUser(id);
if (!result.ok) {
return err({ code: 'NOT_FOUND', message: result.error.message });
}
// TypeScript knows result.data is User here
This is the section people are really asking about. Here's what a typical day looks like, codified into eight practices.
Enter plan mode for every task. Not just big features. Everything. The agent reads the relevant files, designs an approach, and presents it to you before writing a single line of code.
This feels slow at first. It's not. Planning catches misunderstandings before they become 200 lines of wrong code. When something goes sideways mid-task, stop and re-plan immediately. Don't keep pushing.
Your agent has a finite working memory. Treat context like RAM. Files are persistent storage, context is working memory. You wouldn't keep every variable in registers.
The discipline is simple: save state to disk after every meaningful step. Keep a tasks/todo.md that tracks current progress. After a significant step, update it. If you need to compact the conversation or start a new session, you lose nothing because the state is on disk.
After about 20 turns of conversation, compact or start fresh. Long sessions lose coherence.
When you need to research something (find all usages of a function, understand how a library works, explore a new area of the codebase), spawn a subagent. Let it explore and bring back the answer.
Never bloat your main context with 10+ file reads for research. One task per subagent. The main context stays clean and focused on execution.
This is probably the highest-leverage practice. When the agent makes a mistake and you correct it, log the correction as a lesson. Keep a tasks/lessons.md file:
### Always validate env vars with Zod
**Trigger**: Used Number(x) || undefined which treats 0 as undefined
**Pattern**: Using || with numeric coercion loses falsy values
**Rule**: Use Zod schemas for all env var parsing
**Recurrence**: 0 (never happened again)
The agent reads lessons at the start of every session. If a lesson keeps coming up, promote it to a linter rule or structural test. Every correction compounds into a permanent fix.
"It compiles" is not verification. "Tests pass" is not sufficient either. A CSS regression can pass every test while the page looks completely broken.
Your verification checklist should include type checks, the full test suite, your custom linters, and a visual pass on the running app.
If you have access to Browser MCP servers (Playwright MCP, Chrome DevTools MCP), use them for automated visual verification. Take screenshots at desktop and mobile viewports. Compare results. This catches layout bugs, missing styles, and visual regressions that tests miss.
For non-trivial changes, pause and ask: "Is there a cleaner way to do this?" The agent will often produce working but clunky code on the first pass. A moment of reflection leads to better abstractions.
Skip this for simple, obvious fixes. Don't over-engineer a one-line bug fix.
When the agent encounters a bug, it should just fix it. The workflow: reproduce, trace the cause, classify (logic error, missing validation, wrong assumption), fix, add a test that would have caught it, prevent recurrence through a rule or lesson.
No back-and-forth. No "what should I do about this error?" Just investigate and resolve.
When you switch to a different kind of task, save state and start clean. Don't carry debugging context from a CSS fix into a backend API change. Stale context leads to confused output.
Here's a hard truth: documentation rules decay. You can write the clearest, most detailed instructions in your entry point file, and the agent will eventually ignore them. Not because it's broken, but because documentation is a suggestion. Code is a constraint.
I learned this through direct experience. I wrote a rule: "Always run visual verification before marking a task done." The agent ignored it. I added the rule to three more files. The agent acknowledged it and then forgot. Three times it declared tasks complete without verifying.
So I stopped writing rules and started writing code.
When a pattern of mistakes emerges, escalate through levels of enforcement: first a logged lesson, then a documented rule, then a linter, structural test, or hook that enforces the constraint mechanically.
Each level catches failures the previous level missed. Invest in mechanical enforcement early. It pays for itself immediately.
Here's an insight from the OpenAI post that changed how I think about linters: linter error messages are prompts. When your linter detects a violation, don't just say "illegal import." Print remediation instructions:
VIOLATION: src/domains/auth/repo.ts imports from service layer
→ repo can only import: types, config
→ Move this logic to the service layer, or extract shared types to types.ts
That remediation instruction goes directly into the agent's context. The linter error becomes a fix instruction. The agent reads it, understands the constraint, and fixes the violation, usually on the first try.
Write every linter message as if it's a prompt for the agent that will read it. This one change makes your linters dramatically more effective.
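A minimal sketch of such a linter check. The layer names match the layering section earlier; the function name and exact message format are my own, not the repo's actual linter:

```typescript
// Forward-only dependency table: each layer lists the layers it may import.
const allowedImports: Record<string, string[]> = {
  "types": [],
  "config": ["types"],
  "repo": ["types", "config"],
  "service": ["types", "config", "repo"],
  "runtime": ["types", "config", "repo", "service"],
};

// Returns null when the import is legal, or a remediation prompt when it isn't.
function lintImport(fromLayer: string, toLayer: string, file: string): string | null {
  const allowed = allowedImports[fromLayer] ?? [];
  if (allowed.includes(toLayer)) return null;
  return [
    `VIOLATION: ${file} imports from ${toLayer} layer`,
    `→ ${fromLayer} can only import: ${allowed.join(", ") || "(nothing)"}`,
    `→ Move this logic to the ${toLayer} layer, or extract shared types to types.ts`,
  ].join("\n");
}
```

The returned string is exactly what lands in the agent's context when the check fails.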
Your pre-commit hook should run: formatting, type-checking, all custom linters, and any verification gate you need. Your pre-push hook should run the full validation pipeline including tests.
If verification matters (and it does), add a verification gate. A script that checks a checklist file before allowing commits. If the checklist isn't complete, the commit is blocked. The environment prevents the mistake.
# .husky/pre-commit
npx lint-staged # Prettier on staged files
npm run typecheck # TypeScript type checking
npm run lint # All custom linters
bash scripts/check-verification.sh # Verification gate
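The gate's core check can be tiny. Here's a sketch of the idea in TypeScript; the checklist format ("- [ ]" / "- [x]") and the file path in the comment are assumptions, and the repo's actual gate is a shell script:

```typescript
// Scan a markdown checklist; any unchecked "- [ ]" item blocks the commit.
function checklistComplete(markdown: string): boolean {
  const unchecked = markdown.match(/^\s*[-*]\s\[\s\]/gm) ?? [];
  return unchecked.length === 0;
}

// In a real pre-commit hook you'd read the checklist file and exit non-zero:
//   if (!checklistComplete(fs.readFileSync("tasks/verification.md", "utf8"))) {
//     process.exit(1);
//   }
```
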
One of the biggest unlocks in agent-driven development is giving the agent eyes. When the agent can see what it built, it catches problems that tests miss.
If your AI coding tool supports MCP (Model Context Protocol), configure Browser MCP servers. These give the agent live access to your running application. It can navigate pages, take screenshots, inspect the DOM, and verify visual output.
Two popular options:
- Playwright MCP (@playwright/mcp): browser automation through Playwright
- Chrome DevTools MCP (chrome-devtools-mcp): direct Chrome DevTools Protocol access

Using both gives you independent verification from two different browser inspection tools. Discrepancies between them flag real issues.
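For Claude Code, wiring these up is a small config file. The shape below follows Claude Code's .mcp.json convention; the server names and version tags are illustrative, so check each package's own docs before copying:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}
```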
After implementing a change, verify it from two angles: screenshot the affected pages at desktop and mobile viewports, and run the relevant E2E tests.
Why both visual and E2E? A CSS regression can pass every E2E test while the page looks completely wrong. A backend API change can break page rendering while all unit tests pass. Visual inspection and automated tests catch different classes of bugs. Neither replaces the other.
For the agent to see your app, it needs to be running locally. Keep your dev setup simple and fast: the easier it is to boot the app, the more likely verification actually happens.
The self-improvement loop is the mechanism that makes your harness smarter over time. Every session is better than the last, not because the agent improves, but because the environment does.
Early in my experience with harness engineering, the agent used Number(process.env.PORT) || 3000 for parsing environment variables. Looks reasonable. But Number("0") is falsy, so this silently falls back to 3000. The agent made this mistake twice. After the second correction, I logged a lesson: "Always validate env vars with Zod." That mistake never happened again.
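The trap in miniature, next to a fix. The safe version is hand-rolled here for illustration; the lesson's actual fix was a Zod schema:

```typescript
// The buggy pattern: || treats every falsy value, including 0, as "missing".
const portOrDefault = (raw: string) => Number(raw) || 3000;
// portOrDefault("0") is 3000, not 0 — Number("0") is 0, which is falsy.

// A safe version: only fall back when the value is genuinely absent.
function parsePort(raw: string | undefined): number {
  if (raw === undefined || raw === "") return 3000;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0 || n > 65535) {
    throw new Error(`invalid PORT: ${raw}`);
  }
  return n;
}
```
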
Small correction, permanent fix. Multiply this across dozens of lessons and your codebase becomes remarkably resilient.
The verification rule is the clearest example. I wrote "always verify before marking done" in five different files. The agent ignored it three times. So I built a pre-commit hook that blocks commits without a completed verification checklist, and a session hook that prevents the agent from ending a session with unverified changes.
Now the environment prevents the mistake. The agent can't skip verification because the system won't let it.
This escalation (lesson, then rule, then mechanical hook) is the self-improvement loop's most powerful pattern. Every time documentation fails, you have an opportunity to make the system mechanically correct.
For recurring complex tasks (verifying changes, investigating bugs, scaffolding a new domain), write step-by-step playbooks. Store them as markdown files the agent reads before starting work:
.claude/skills/
├── verify-change/SKILL.md # 7-step verification checklist
├── drive-app/SKILL.md # Browser MCP workflow
├── investigate-failure/SKILL.md # Bug reproduction and fix
├── self-improve/SKILL.md # Learn from corrections
├── add-domain/SKILL.md # Scaffold new domain
└── manage-context/SKILL.md # Context window management
Skills encode your team's standards into repeatable checklists. The agent reads the skill, follows the steps, and produces consistent output. Over time, you accumulate a library of skills that represent your team's best practices.
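For illustration, a skill file can be as simple as this. It's an invented sketch; the repo's actual SKILL.md files will differ:

```markdown
# Skill: verify-change

Run after every implementation task, before marking it done.

1. Run typecheck, lint, and the affected test suites.
2. Boot the app and navigate to every affected page.
3. Screenshot at desktop and mobile viewports via the Browser MCP.
4. Compare against the spec; log any discrepancy in tasks/todo.md.
5. Only then complete the verification checklist.
```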
These are the patterns I see most often in mentorship sessions. If you're just getting started, watch out for:
A monolithic 500-line instructions file doesn't work. The agent drowns in context. Start with a concise entry point (~100 lines) and use progressive disclosure. Point to deeper docs, don't dump everything.
Tests validate behavior. They don't validate appearance, layout, responsiveness, or the dozen things that can go wrong visually. Build visual verification into your workflow. If you can't use Browser MCPs, at minimum open the page yourself before approving.
When you compact a conversation or start a new session, anything only in context is gone. Save progress to a task file after every meaningful step. This one habit prevents more wasted work than any other.
Don't spend three days building the perfect harness before writing any application code. Start with the basics: an entry point file, a simple architecture rule, one linter, and iterate. The best harness emerges from real tasks, not theoretical planning.
Autocomplete suggests the next line. An agent executes a task. If you're prompting line by line ("now add a try/catch here"), you're using the agent wrong. Describe what you want at the task level ("add error handling to the registration flow using the Result pattern") and let the agent figure out the implementation.
If you correct the agent and don't log the lesson, the same mistake will happen again next session. The correction is worth nothing unless it compounds. Log every correction. Review lessons at session start.
You don't need to adopt everything at once. Start small, iterate, and let the harness grow organically.
The entire reference implementation is open source. Clone it and explore how these patterns look in practice:
git clone https://github.com/haripery/harness-engineering.git
cd harness-engineering
npm install
Key files to study:
- CLAUDE.md: the agent entry point (~100 lines)
- ARCHITECTURE.md: the layered domain model
- tasks/lessons.md: the self-improvement loop in action
- .claude/skills/: nine agent skill playbooks
- linters/: four custom architecture linters
- scripts/check-verification.sh: the verification gate

See docs/GETTING_STARTED.md for full setup instructions including database, test credentials, and MCP configuration.
The discipline didn't go away. It moved from the code to the scaffolding. You spend your time writing specs, designing constraints, capturing lessons, and verifying outcomes. The agent writes everything else.
OpenAI's post ends with a line worth remembering: "building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code."
Whether you use Claude Code, Codex, or something else, the principles are the same: design the environment, specify intent, build feedback loops, and turn every correction into a permanent part of the harness.
The question isn't whether agents can write your code. They can. The question is whether you've built the harness that lets them do it well.
Everything in this guide (the skills, verification hooks, MCP configurations, self-improvement loop) lives inside the repository's .claude/ directory. That works when you're building from scratch or cloning the demo. But what if you want to apply these practices to an existing codebase without copying files around?
That's where Claude Code plugins come in. A plugin is a self-contained package that bundles skills, hooks, MCP configs, and agents into a single installable unit. Instead of manually setting up verification workflows, self-improvement loops, and Browser MCP servers in every project, you'd run /plugin install harness-engineering and get the entire methodology wired up instantly.
The next post will walk through packaging this harness as a distributable plugin, turning a project-specific workflow into a portable methodology that anyone can install with one command.
This guide is based on practices developed using Claude Code and Opus 4.6. The reference implementation demonstrates every pattern described here.
If this post was useful, consider supporting my open source work and independent writing.