- Alberduris
What I’ve Learned from Claude Code SDK Without (Yet) Using the SDK
Lessons from pushing Claude Code CLI with an SDK mindset — state, prompting, specificity, and the economics of digital employees.
I haven’t deployed the Claude Code SDK in production. But by pushing the Claude Code CLI far outside its “intended” use case, I’ve already learned most of what matters when you think like an SDK user.
The key is mindset:
Programmer Mode (normie): you use Claude Code CLI interactively, inside your terminal, to write or refactor code.
Universal Computation Agent (CLI, big brain): you use the same CLI for non-programming tasks — email triage, Notion knowledge management, even personal workflows. You get a kind of conversational copilot.
Universal Computation Agent (SDK, galactic brain): you imagine this not as an interactive tool but as a deployed service — triggered periodically or by events, orchestrated like micro-services. What you get is no longer a copilot but an asynchronous digital micro-employee.
Everything I describe here comes from using Claude Code CLI in local, non-interactive setups, but always with that SDK mindset: “How would I design this if I were deploying it with the SDK?”
Potential vs. Practical Limits
With MCP and browser automation, Claude Code can in principle do virtually anything. The boundaries are not conceptual but practical. The two real obstacles that appear today are context overload and bloated interfaces. Drafting a thread directly on Twitter’s official site is a good example: the page is so heavy and full of noise that the agent cannot handle it. By contrast, using Typefully for the same task works because the surface is lighter and more manageable.
But I’m not worried: these aren’t fundamental or permanent limitations, just a reflection of the current state of LLMs and their ecosystem. As models grow and structured browser automation improves, the gap between “possible in theory” and “achievable in practice” will shrink. For now, though, these constraints still define what you can actually run.
Prompting as Process Design
One of the first lessons is that complex or long browser actions only work if treated as a process to be engineered. Asking Claude to “log into this site, do that, and draft X” almost always fails. Success comes from breaking the task into smaller, explicit steps: navigate to the URL, locate a given element, click it, enter text. The more concrete the instruction —naming specific buttons, divs, or references— the sharper the execution.
Sometimes the best strategy is to divide work across separate invocations, provided state can be preserved between them. Drafting a Twitter thread in one call and then adding an image or a link in a later call is far more reliable than trying to compress everything into a single prompt. The principle is straightforward: make the instructions concrete and structured, even if that makes them longer. A single broad command fails; a sequence of explicit steps succeeds.
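A minimal sketch of this pattern, using the CLI’s non-interactive print mode (`claude -p`). The step prompts and file paths here are illustrative, not a real workflow:

```python
import subprocess

# Illustrative step prompts: concrete, explicit, one action per invocation.
# Each is a separate call; shared local files carry state between them.
STEPS = [
    "Open https://typefully.com and start a new draft.",
    "Paste the thread text from ./thread.md into the draft body.",
    "Attach ./cover.png to the first tweet of the draft.",
]

def run_step(prompt: str) -> str:
    """Run one non-interactive Claude Code invocation in print mode."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Usage (requires the claude CLI on PATH):
# for step in STEPS:
#     print(run_step(step))
```

The point is the shape: three narrow prompts that each name a concrete target, instead of one broad “log in and draft X” command.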
Claude Code as Micro-Employee
The most useful mental model is not “tool” but “employee.” When thinking in SDK terms, you don’t have a single assistant but a micro task-force: a set of micro-employees, each with a narrowly defined role, collaborating to complete a larger task. The analogy is the same as real task-forces: every member is highly specialized, and the team succeeds only because each one contributes their specific part.
State and Orchestration
What makes this model functional is not just the execution of steps, but the act of keeping records. Each invocation corresponds to a worker on duty, but the record it leaves is essential. Sometimes it is for others in the task-force, so the next employee knows what has already been done. Other times it is for itself: in recurrent workflows, the same employee may be invoked again at t=0, t=1, t=2…, and without a “worker notebook” it would remain stateless, unable to resume. Recording is therefore the foundation of memory —both intra-employee (a worker remembering its own previous cycle) and shared (a team coordinating across roles and time).
This dual layer of memory is what turns a collection of stateless invocations into something more. Locally, the mechanism can be primitive: a JSON schema in some /db folder where each micro-employee writes and reads “notes”, from its own status to the global task progress. Even in this simple form it enforces continuity: one micro-worker can recover its own past work, and another can pick up where the previous left off.
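A minimal local version of that notebook might look like this; the worker name and note schema are invented for illustration:

```python
import json
from pathlib import Path

# Hypothetical "worker notebook": a JSON file in a local db/ folder where
# each micro-employee reads and writes its notes between invocations.
DB = Path("db")

def read_notes(worker: str) -> dict:
    path = DB / f"{worker}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"cycles": [], "status": "new"}  # first invocation: empty notebook

def write_notes(worker: str, notes: dict) -> None:
    DB.mkdir(exist_ok=True)
    (DB / f"{worker}.json").write_text(json.dumps(notes, indent=2))

# One cycle: recover past work, do the step, record the outcome.
notes = read_notes("reddit-scanner")
notes["cycles"].append({"t": len(notes["cycles"]), "result": "drafted thread"})
notes["status"] = "idle"
write_notes("reddit-scanner", notes)
```

On the next invocation, `read_notes` recovers everything the previous cycle recorded, which is exactly what makes the employee stateful.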
At larger scale the same principle extends naturally: JSON becomes a Mongo collection or a SQL table, and the "worker notebook" becomes visible through dashboards. The design may vary, but the requirement does not. State is what allows Claude Code to operate as a task-force —coordinating multiple employees in a given moment— and also to maintain continuity across different cycles in time. Without it, each run is only a snapshot, disconnected from both its peers and its own past.
This is precisely what distinguishes SDK-style deployment: once the workflow is recurrent or event-driven, persistence and coordination stop being optional and become the core of the design.
A Note on SDK Sessions
The SDK itself does provide a notion of sessions: mechanisms for conversation state, persistence, and resumption. At first glance this looks like the holy grail — why reinvent the “worker notebook” if the SDK already tracks context?
The reason is that session persistence still depends on the model’s context window. With today’s LLMs, context length is finite and performance degrades as history grows. A session can preserve continuity of conversation, but it cannot substitute for an external state layer when workflows must survive across cycles, coordinate multiple employees, or remain stable over time.
Sessions are therefore complementary, not a replacement. They keep the dialogue coherent, but the “employee notebook” —JSON, NoSQL, SQL, Graph— is still essential for structured memory and orchestration.
Shit happens: Monitoring and Logging
Observability is necessary, but it comes with friction. The LangSmith integration works, but only for metrics. You can see token counts, model usage, and tool invocations, yet the actual content of Claude’s messages never appears. This is by design, to prevent leaks of sensitive data, but it makes LangSmith insufficient for debugging.
The workaround in local development is straightforward: dump the full conversation to JSON and store the last message in a text file. This is usually enough to diagnose why a run failed. In a deployed setup, the same idea extends easily: logs can be pushed to an external logging service, or written to a database so they can be inspected through dashboards. What matters is not the mechanism but that you keep a trace beyond metrics, so that when a failure happens you can see exactly what went wrong.
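The local workaround can be a few lines of Python. The message shape below is illustrative; in practice you would parse it from the CLI’s JSON output:

```python
import json
from pathlib import Path

def dump_run(messages: list[dict], run_id: str, log_dir: str = "logs") -> None:
    """Write the full conversation as JSON plus the last message as plain text."""
    out = Path(log_dir)
    out.mkdir(exist_ok=True)
    (out / f"{run_id}.json").write_text(json.dumps(messages, indent=2))
    last = messages[-1]["content"] if messages else ""
    (out / f"{run_id}_last.txt").write_text(last)

# Illustrative conversation dump for one run.
dump_run(
    [{"role": "user", "content": "Draft the thread."},
     {"role": "assistant", "content": "Draft saved to thread.md"}],
    run_id="run-001",
)
```

In a deployed setup, `dump_run` would push to a logging service or database instead of the local filesystem, but the trace it keeps is the same.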
But wait, what's a failure?
When thinking in “employee” terms, it’s also important to redefine what counts as a failure. A worker failing to complete a task does not always mean a technical error (e.g., a process returning a 400 or 500 status). Often it simply means the correct outcome was “no result,” for whatever reason.
If the employee’s assignment is to scan the ten hottest r/eli5 posts and find a new video-worthy idea, sometimes the right answer is that none exist. From the outside it looks like a "failed" run, but in reality it’s a successful execution: the employee checked and concluded there was nothing to do.
The same applies when external services return errors. If Reddit is down and the MCP call fails, or if Summiz returns a transient error fetching a transcript, that should not be treated as Claude’s failure. It’s equivalent to an employee being blocked by a system outage: the task wasn’t completed, but not because the worker did something wrong.
Logging should capture these distinctions. What matters is not only technical errors, but also these “negative but valid” outcomes and transient external failures, so downstream employees and dashboards reflect the true state of work.
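A small outcome taxonomy makes these distinctions concrete. The names and classification logic here are a sketch, not a standard:

```python
from enum import Enum

class Outcome(Enum):
    DONE = "done"                 # task completed with a result
    NOTHING_TO_DO = "nothing"     # valid negative: checked, nothing found
    BLOCKED_EXTERNAL = "blocked"  # upstream outage or transient error
    WORKER_ERROR = "error"        # the employee itself did something wrong

def classify(found_result: bool, external_error: bool) -> Outcome:
    """Map one run to an outcome so dashboards reflect the true state of work."""
    if external_error:
        return Outcome.BLOCKED_EXTERNAL
    return Outcome.DONE if found_result else Outcome.NOTHING_TO_DO
```

With this, the r/eli5 scan that finds nothing logs `NOTHING_TO_DO`, a Reddit outage logs `BLOCKED_EXTERNAL`, and only genuine worker mistakes land in `WORKER_ERROR`.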
Beyond SaaS: Extreme Specificity
One of the clearest distinctions between Claude Code and traditional SaaS is specificity. A SaaS must justify itself economically: the feature must serve a broad enough audience to sustain the business. That economic logic naturally pushes every SaaS toward generality (even if many modern SaaS are already micro-niche).
Claude Code, by contrast, can be pathologically specific. You can afford to design a micro-employee for a workflow so narrow that no company would ever productize it —and still have it be worthwhile.
Take the case of Summiz. A worker can:
Scan r/eli5 for posts that introduce concepts in a learning-friendly way.
Check if a YouTube video exists that matches the same topic one-to-one.
Hand that video off to the summarization pipeline.
That entire workflow is only useful to me, under my constraints, at this moment. No SaaS could ever build it. But with Claude Code, such “hyper-niche” automations are natural. This is where the micro-employee model departs from product logic and aligns instead with personal leverage.
SDK and Deployment (Theoretical Learnings)
Up to this point, everything I’ve described comes from local CLI use. The workflows, prompting strategies, and state mechanisms already work there. But the moment you ask, “How would this run continuously, outside my terminal?” you’ve crossed into SDK territory.
The SDK assumes —or at least works best with— a classic server model:
a persistent filesystem it can read and write freely,
the ability to launch processes or scripts as needed,
and continuity across runs so “employee notebooks” don’t vanish.
That setup is natural on a personal machine, a VPS, or an EC2 instance. It clashes with serverless platforms like Vercel or Railway, where the filesystem is ephemeral, execution is time-limited, and shells are absent. If you want multi-turn conversations there, you’d need to attach persistent volumes and typically treat each task-force as a containerized service so it can persist session state.
The SDK doesn’t change the workflows themselves. What it adds is the ability to package those patterns as services: to take a workflow that already works locally and deploy it so it runs asynchronously, continuously, and reliably, independent of your terminal.
Pricing Reality Check
One final lesson is economic.
Running Claude Code locally under Anthropic’s subsidized subscription is viable: the costs are fixed and predictable (and very cheap!).
Running it under SDK token pricing is another story.
Token usage can escalate quickly, and often the effective hourly cost is higher than hiring a human for the same task. That makes it difficult to justify as a user-facing feature: the economics collapse before scale.
For now, I feel that the sweet spot lies in internal automations: workflows where the cost is acceptable because the leverage is high and the audience is you (or your team). Claude Code in SDK mode can still be invaluable there, but probably not yet for broad deployment to end-users.
That said, there will always be exceptions. Some edge cases will justify the economics — but finding them will require creative framing and clever leverage.
Thank god Claude Code supports interoperability. For lower costs, keep an eye on Z.ai’s GLM Coding Plan (from $3/month), or run it with Kimi K2 via its API — still much cheaper than Sonnet.
Why the SDK Mindset Matters
All these learnings come from pushing the CLI with an SDK mindset. The workflows, prompting strategies, and state mechanisms don’t need to change between CLI and SDK. What changes is their operational form: from something that runs in your terminal to something that lives as a service.
The lesson is that you don’t need the SDK in hand to start thinking like an SDK user. Treat the CLI as if you were already deploying micro-employees, and most of the hard questions reveal themselves early: state, specificity, economics, and environment fit.
By the time you actually deploy with the SDK, the design patterns will already be there. The only difference is that they will no longer be experiments in a terminal session, but persistent digital micro-employees running on their own.
P.S. A question for the reader: how long until these micro-employees level up into mini-employees? 🙂