The Keepstone Managed Software Ops Framework

The Keepstone framework is, in plain terms, an AI agent system for operating custom software, augmented by experienced humans where judgment is required. It runs on top of whatever your system is built on, normalizes the parts that are missing or fragile, and then handles most of the day-to-day operating work automatically.

Here's how it actually works.

What the framework is

A standardized set of layers, wrapped around every system we run. Each layer combines four things: tools (the instrumentation, control planes, and integrations into the system), agents (software workers with specific responsibilities and the right access to do their jobs), standards (conventions every account is normalized to during hardening), and operators (humans who handle the decisions that need a name behind them).

The agents are the core. Most of the standing operational work — monitoring, triaging, classifying, documenting, patching, fixing low-risk bugs — is handled by software agents continuously. Operators step in for incidents with business stakes, architecture decisions, fit calls, and the handful of things that genuinely need a person.

The framework runs on top of whatever stack your system happens to be on. It doesn't matter whether it was built in Lovable, Bolt, Replit, or Claude Code, by a former employee or by a contractor five years ago, or whether it runs on AWS, on Azure, or on a small VPS: the framework normalizes the operational seams. The underlying application stays whatever it is.

The eight layers

1. Engineering Workflow

Standardizes how changes get made. Source control, branching, automated testing, AI-led coding with strong guardrails. No change ships without being reviewable and revertible — whether the author is an agent or a human.

Mostly automated. AI agents author the majority of code changes. Code review, test generation, and change descriptions are agent-driven. A human reviews anything non-trivial.

2. Infrastructure and Deployment

The production environment itself, and how changes reach it safely. Configuration of servers and services managed in version-controlled files (so nothing important lives only in someone's head), separate environments for testing and production, secrets stored in a vault rather than in the code, deploys that can be rolled back, releases that go out in stages.

Mostly automated. Deploys are agent-orchestrated against the configuration files. Drift between what's running and what's configured is detected continuously. A human approves anything above a defined risk threshold before it reaches production.
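As a rough illustration of drift detection and the approval gate described above, here is a minimal Python sketch. It assumes the configured and running state can each be flattened into a dictionary; the names `Drift`, `detect_drift`, `needs_human_approval`, and the 0.5 threshold are hypothetical and not taken from the framework itself.

```python
from dataclasses import dataclass

ABSENT = "<absent>"  # placeholder for a key that exists on only one side


@dataclass(frozen=True)
class Drift:
    key: str
    configured: object
    running: object


def detect_drift(configured: dict, running: dict) -> list[Drift]:
    """Return every key whose running value differs from the configured one."""
    drifts = []
    for key in sorted(configured.keys() | running.keys()):
        want = configured.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if want != have:
            drifts.append(Drift(key, want, have))
    return drifts


def needs_human_approval(risk_score: float, threshold: float = 0.5) -> bool:
    """Anything above the (hypothetical) threshold waits for an operator."""
    return risk_score > threshold


if __name__ == "__main__":
    configured = {"replicas": 3, "image": "app:1.4.2", "tls": True}
    running = {"replicas": 2, "image": "app:1.4.2", "debug": True}
    for drift in detect_drift(configured, running):
        print(f"{drift.key}: configured={drift.configured} running={drift.running}")
```

The comparison covers the union of keys, so a setting that exists only in production (here, `debug`) is reported as drift just like a changed value.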

3. Observability

Makes the system visible and diagnosable. Uptime checks, automated tests that pretend to be users, structured logs, error tracking, response-time monitoring, queue and backlog tracking, cost monitoring.

Agent-watched. Detection agents monitor continuously across all instrumented signals. Anomalies get characterized, correlated across layers, and either remediated automatically or paged with the diagnosis already attached.

4. Triage and Support

Turns noise into tickets and tickets into decisions. Issue intake, defect-vs-enhancement classification, durable history of every issue, clean handoff between agent and human.

Mostly automated. Triage agents classify and route automatically. Front-line support agents resolve recurring questions using the live documentation. Operators see only the tickets that actually require human judgment, which is typically a small fraction of incoming volume.

5. Documentation and Training

Reduces key-person dependency and feeds every other layer. Architecture diagrams, data maps, onboarding guides, user docs.

Agent-maintained. Documentation agents keep everything continuously aligned with the actual code. The same documentation set grounds the support agents, so when a user asks "how do I…", the answer comes from the same source of truth the engineers use.

6. Security

Keeps the system defensible. Identity and access management, secrets hygiene, dependency scanning, patch management, audit trails, anomaly detection.

Agent-watched. Security agents monitor for credential leaks, software-component vulnerabilities, access drift, and exposed endpoints. Routine patching is automated. Anything ambiguous escalates to a named operator.
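To make the credential-leak monitoring concrete, here is a small Python sketch of a line-by-line scan. It is an assumption about how such a check might look, not the framework's scanner: it only knows the published AWS access key ID format and PEM private-key headers, and the names `PATTERNS` and `find_leaks` are hypothetical.

```python
import re

# Two well-known credential shapes; a real scanner would cover many more.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_leaks(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


if __name__ == "__main__":
    sample = 'region = "us-east-1"\nkey = "AKIAIOSFODNN7EXAMPLE"\n'
    print(find_leaks(sample))
```

Pattern matching of this kind produces false positives and misses anything it has no pattern for, which fits the rule above that ambiguous findings escalate to a named operator.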

7. Business Continuity and Disaster Recovery

Plans for the days when nothing is going right. Backup strategy, tested restores, recovery targets matched to the business (how much data you can afford to lose; how fast you need to be back up), documented failover procedures, vendor-outage playbooks.

Agent-verified. Restore drills run on a documented cadence. Backup integrity is checked continuously, not assumed. Failover playbooks are kept current and exercised. Backups you haven't restored from aren't backups.
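A sketch of what "checked continuously, not assumed" could mean in code, under the assumption that a SHA-256 digest was recorded when each backup was written. The function names and the idea of a recorded digest are illustrative; a passing checksum is also weaker evidence than a completed restore drill.

```python
import hashlib
from datetime import datetime, timedelta
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: Path, recorded_digest: str) -> bool:
    """True only if the file exists and still matches its recorded digest."""
    return path.is_file() and sha256_of(path) == recorded_digest


def within_recovery_point(
    last_backup_at: datetime, now: datetime, max_data_loss: timedelta
) -> bool:
    """True if the newest backup is recent enough for the agreed data-loss limit."""
    return now - last_backup_at <= max_data_loss
```

`within_recovery_point` corresponds to the "how much data you can afford to lose" target: if the newest intact backup is older than that limit, the target is already being missed.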

8. Governance

Decides fit, enforces boundaries, defines required hardening, keeps systems inside operable conditions.

Operator-led. This is the one layer that stays firmly in human hands. Fit, scope, and risk boundaries are judgment calls that need a named human behind them. We don't let agents decide what we'll take on.

What the framework actually does day-to-day

The layers describe the structure. The interesting part is what the system does, in practice, with that structure in place.

Automatic issue recognition

The detection agents in the observability layer aren't just watching for thresholds. They correlate signals across layers — a backlog spike, an error rate climbing, a slow database query, a recent code change — and characterize what's happening before paging a human. By the time an operator sees the ticket, the diagnosis is already attached.
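One simple way to picture cross-layer correlation is grouping signals that occur close together in time and checking whether a code change is among them. The Python sketch below does only that; the `Signal` fields, the ten-minute window, and the `diagnose` wording are assumptions for illustration, and a production agent would presumably weigh far more evidence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    layer: str    # e.g. "observability", "engineering"
    kind: str     # e.g. "error_rate", "slow_query", "deploy"
    at: datetime


def correlate(signals: list[Signal], window: timedelta) -> list[list[Signal]]:
    """Group signals that fall within `window` of the previous signal."""
    groups: list[list[Signal]] = []
    for signal in sorted(signals, key=lambda s: s.at):
        if groups and signal.at - groups[-1][-1].at <= window:
            groups[-1].append(signal)
        else:
            groups.append([signal])
    return groups


def diagnose(group: list[Signal]) -> str:
    """Attach a first-pass explanation to a group before anyone is paged."""
    kinds = [s.kind for s in group]
    if "deploy" in kinds and len(kinds) > 1:
        others = ", ".join(k for k in kinds if k != "deploy")
        return f"likely regression from recent deploy ({others})"
    return "unexplained: " + ", ".join(kinds)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    signals = [
        Signal("observability", "error_rate", t0 + timedelta(minutes=4)),
        Signal("engineering", "deploy", t0),
        Signal("observability", "slow_query", t0 + timedelta(minutes=6)),
    ]
    for group in correlate(signals, timedelta(minutes=10)):
        print(diagnose(group))
```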

Automatic bug resolution

This is the part that surprises clients.

When a user reports a bug, whether through the support intake, by email, or in a ticket, the triage agent classifies it, examines the code, and makes a determination: Is this minor? Is the fix obvious? Can the agent ship it? In a meaningful share of cases the answer is yes, and the bug is fixed and deployed before a human ever sees the ticket. The user gets a "fixed in 23 minutes" reply. The operator gets a note in the daily summary.

For anything ambiguous, anything material, anything that touches money or identity or external systems, the agent escalates. We don't let agents make architecture calls, and we don't let them ship anything they're not confident about. The threshold for autonomous action is conservative on purpose.
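The escalation rule above can be read as a conservative gate in which any single doubt is enough to hand the ticket to a person. A minimal Python sketch follows, assuming the agent produces a severity label, a confidence score, and a set of affected areas; those fields, the area labels, and the 0.9 confidence floor are hypothetical.

```python
from dataclasses import dataclass

# Areas the text says always escalate; the label strings are illustrative.
SENSITIVE_AREAS = {"money", "identity", "external_systems"}


@dataclass(frozen=True)
class BugAssessment:
    severity: str               # "minor" or "material"
    confidence: float           # agent's confidence in its fix, 0.0 to 1.0
    touches: frozenset[str]     # areas of the system the fix would change
    is_architectural: bool = False


def decide(assessment: BugAssessment, min_confidence: float = 0.9) -> str:
    """Return "ship" only when every check passes; otherwise "escalate"."""
    if assessment.is_architectural:
        return "escalate"
    if assessment.touches & SENSITIVE_AREAS:
        return "escalate"
    if assessment.severity != "minor":
        return "escalate"
    if assessment.confidence < min_confidence:
        return "escalate"
    return "ship"


if __name__ == "__main__":
    typo_fix = BugAssessment("minor", 0.97, frozenset({"reports"}))
    billing_fix = BugAssessment("minor", 0.97, frozenset({"money"}))
    print(decide(typo_fix), decide(billing_fix))
```

Writing the gate as a series of early exits keeps "escalate" as the default outcome: a fix ships only if nothing objects.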

Automated front-line support

Front-line support agents handle the questions that have been answered before — password resets, "how do I export this report," "where's the field for X" — grounded in the same documentation the engineers use. The moment a question requires judgment, or the user explicitly asks for a person, the agent escalates to a named operator.

Continuous documentation

Every code change that ships triggers a documentation pass. If a workflow changed, the runbook for that workflow updates. If a new setting was added, the deployment guide updates. If something about how data is stored changed, the data map updates. The documentation set stays current automatically — which is what makes the support agents work, because their answers are only as good as the documentation backing them.
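The three examples above amount to a mapping from kinds of change to the documents that need a fresh pass. Here is a short Python sketch of that mapping; the change-kind labels and the `docs_to_update` helper are hypothetical, and how a change gets labeled in the first place is outside its scope.

```python
# Hypothetical labels for the three kinds of change named in the text.
DOCS_BY_CHANGE = {
    "workflow": "runbook",
    "setting": "deployment guide",
    "data_storage": "data map",
}


def docs_to_update(change_kinds: list[str]) -> list[str]:
    """Return each affected document once, in the order first triggered."""
    affected: list[str] = []
    for kind in change_kinds:
        doc = DOCS_BY_CHANGE.get(kind)
        if doc is not None and doc not in affected:
            affected.append(doc)
    return affected


if __name__ == "__main__":
    print(docs_to_update(["setting", "workflow", "setting", "refactor"]))
```

Change kinds with no entry in the mapping (here, `refactor`) trigger no update in this sketch.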

Augmented by humans, not a replacement for them

None of this is "AI replacing operators." It's AI handling the standing volume so operators can focus on the work that actually requires a person — incidents with business consequences, architecture decisions, fit and scope calls, the things you'd want a senior human in the room for. Volume is handled by software. Judgment is handled by people. That's the entire arrangement.

What you own

The framework belongs to us. It travels with us to every account we run. The work product belongs to you — the hardened application, source code, documentation, credentials, dashboards, deployment configuration. All yours, forever, in your accounts. If we ever walk away, the system continues to run. You just won't have us running it.

We operate inside your accounts, never ours. Thirty-day exit, any time, everything handed over. That isn't generosity. It's how the business model works. We have to be confident enough in the value we deliver that we don't need to lock anyone in.

Why we built it this way

The original question was whether the kind of operating discipline a serious technology company has could be delivered economically to a small business running real custom software. A decade ago, the honest answer was no: the labor cost of doing the work right exceeded what the business could pay. Today, with the right agent harness, the answer is yes.

The framework is how we make it true. Same level of discipline a much larger company would have, sized and priced for the business that actually needs it.
