<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jyc11.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jyc11.github.io/" rel="alternate" type="text/html" /><updated>2026-03-22T16:05:00+00:00</updated><id>https://jyc11.github.io/feed.xml</id><title type="html">Smart and Witty Blog Title</title><subtitle>A blog where I post stuff</subtitle><author><name>Jaeyoon Cho</name></author><entry><title type="html">Current LLM Workflow Setup</title><link href="https://jyc11.github.io/blog/2026/03/23/current-LLM-workflow-setup" rel="alternate" type="text/html" title="Current LLM Workflow Setup" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/03/23/current-LLM-workflow-setup</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/03/23/current-LLM-workflow-setup"><![CDATA[<p>An updated snapshot of my LLM-assisted development setup, as of March 2026. The <a href="/blog/2026/02/26/current-LLM-workflow-setup.html">previous snapshot</a> was from late February. A lot has changed — new tools, more skills, a proper planning pipeline, and significantly expanded permissions. Same deal as before: Claude examined its own configuration and I peppered in my commentary.</p>

<hr />

<h2 id="environment">Environment</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Terminal</strong></td>
      <td>Ghostty</td>
    </tr>
    <tr>
      <td><strong>IDE</strong></td>
      <td>IntelliJ (Rust development), Zed (blogging)</td>
    </tr>
    <tr>
      <td><strong>LLM Tool</strong></td>
      <td>Claude Code CLI (Claude Opus, 1M context)</td>
    </tr>
    <tr>
      <td><strong>Task Management</strong></td>
      <td><a href="https://github.com/JYC11/filament">Filament</a> (replaced beads_rust)</td>
    </tr>
    <tr>
      <td><strong>Code Generation</strong></td>
      <td><a href="https://github.com/JYC11/jujo">Jujo</a></td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>What changed:</strong> Upgraded from beads_rust to filament for task tracking. Filament adds a knowledge graph, inter-agent messaging, and a TUI — all in one Rust binary. Added jujo for deterministic code generation from templates. The 1M context window is a game changer — no more context exhaustion mid-session.</p>
</blockquote>

<hr />

<h2 id="claude-code-configuration">Claude Code Configuration</h2>

<h3 id="plugins">Plugins</h3>

<table>
  <thead>
    <tr>
      <th>Plugin</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>rust-skills</strong></td>
      <td>Rust-specific guidance — ownership, concurrency, error handling, domain patterns, crate research, daily news</td>
    </tr>
    <tr>
      <td><strong>rust-analyzer-lsp</strong></td>
      <td>LSP integration for go-to-definition, find references, symbol analysis</td>
    </tr>
    <tr>
      <td><strong>code-review</strong></td>
      <td>Structured PR code review</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>What changed:</strong> Added the code-review plugin since February.</p>
</blockquote>

<h3 id="skills-20-installed">Skills (20 installed)</h3>

<p>Custom skills loaded from <code class="language-plaintext highlighter-rouge">~/.claude/skills/</code>:</p>

<table>
  <thead>
    <tr>
      <th>Skill</th>
      <th>Source</th>
      <th>What it does</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>project-context</strong></td>
      <td>local</td>
      <td>Reads CLAUDE.md files for onboarding; updates them after changes</td>
    </tr>
    <tr>
      <td><strong>skill-creator</strong></td>
      <td>local</td>
      <td>Guide for writing effective Claude Code skills</td>
    </tr>
    <tr>
      <td><strong>filament</strong></td>
      <td>local</td>
      <td>Task lifecycle, knowledge graph, lesson capture via <code class="language-plaintext highlighter-rouge">fl</code> CLI</td>
    </tr>
    <tr>
      <td><strong>jujo</strong></td>
      <td>local</td>
      <td>Code generation from Tera templates via <code class="language-plaintext highlighter-rouge">jujo</code> CLI</td>
    </tr>
    <tr>
      <td><strong>pattern-analyzer</strong></td>
      <td>local</td>
      <td>Analyze codebase patterns → generate jujo templates</td>
    </tr>
    <tr>
      <td><strong>research</strong></td>
      <td>local</td>
      <td>GitHub repo exploration and web article fetching via Go CLI</td>
    </tr>
    <tr>
      <td><strong>handoff</strong></td>
      <td>local</td>
      <td>Structured session handoff summaries</td>
    </tr>
    <tr>
      <td><strong>cleanup</strong></td>
      <td>local</td>
      <td>Scan and remove stale files across /tmp, ~/.claude, project dirs</td>
    </tr>
    <tr>
      <td><strong>datastar</strong></td>
      <td>local</td>
      <td>Datastar hypermedia framework patterns</td>
    </tr>
    <tr>
      <td><strong>spec-driven-dev</strong></td>
      <td>local</td>
      <td>Three-phase workflow: Research → Plan → Implement with human checkpoints</td>
    </tr>
    <tr>
      <td><strong>grill-me</strong></td>
      <td>matt</td>
      <td>Interview user relentlessly about a plan until shared understanding</td>
    </tr>
    <tr>
      <td><strong>write-a-prd</strong></td>
      <td>matt</td>
      <td>Create PRD through user interview, submit as GitHub issue</td>
    </tr>
    <tr>
      <td><strong>prd-to-plan</strong></td>
      <td>matt</td>
      <td>Turn PRD into multi-phase implementation plan with tracer bullets</td>
    </tr>
    <tr>
      <td><strong>prd-to-issues</strong></td>
      <td>matt</td>
      <td>Break PRD into independently-grabbable GitHub issues</td>
    </tr>
    <tr>
      <td><strong>triage-issue</strong></td>
      <td>matt</td>
      <td>Triage bugs: search filament lessons → investigate → fix plan with TDD</td>
    </tr>
    <tr>
      <td><strong>review</strong></td>
      <td>gstack</td>
      <td>Pre-landing PR review against project checklist</td>
    </tr>
    <tr>
      <td><strong>code-eng-review</strong></td>
      <td>gstack</td>
      <td>Eng manager-mode review of implemented code</td>
    </tr>
    <tr>
      <td><strong>plan-eng-review</strong></td>
      <td>gstack</td>
      <td>Eng manager-mode plan review with architecture focus</td>
    </tr>
    <tr>
      <td><strong>retro</strong></td>
      <td>gstack</td>
      <td>Engineering retrospective with trend tracking</td>
    </tr>
    <tr>
      <td><strong>library</strong></td>
      <td>fork</td>
      <td>Private skill distribution via YAML catalog + git sync</td>
    </tr>
  </tbody>
</table>

<p>Sources: local = original, matt = <a href="https://github.com/mattpocock/skills">mattpocock/skills</a>, gstack = <a href="https://github.com/garrytan/gstack">garrytan/gstack</a>, fork = <a href="https://github.com/disler/the-library">disler/the-library</a></p>

<blockquote>
  <p><strong>What changed:</strong> Went from 4 skills to 20. The biggest additions are the planning pipeline (write-a-prd → prd-to-plan → prd-to-issues), the spec-driven-dev meta-workflow, and the library catalog for managing skills across devices. All skills now have filament integration for task tracking and lesson capture. Removed br and bd-to-br-migration skills.</p>
</blockquote>

<h3 id="hooks">Hooks</h3>

<table>
  <thead>
    <tr>
      <th>Event</th>
      <th>Hook</th>
      <th>What it does</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>UserPromptSubmit</td>
      <td><code class="language-plaintext highlighter-rouge">log-prompt.sh</code></td>
      <td>Captures every user prompt to a daily session log file. Now works across multiple projects</td>
    </tr>
    <tr>
      <td>PostToolUse (Bash)</td>
      <td>cargo check after build</td>
      <td>Runs <code class="language-plaintext highlighter-rouge">cargo check</code> after any cargo/make command to catch compile errors immediately</td>
    </tr>
  </tbody>
</table>
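<p>Roughly, the prompt logger looks something like this. This is a simplified sketch, not the real <code class="language-plaintext highlighter-rouge">log-prompt.sh</code>; in the actual hook the prompt arrives from Claude Code as JSON on stdin, which is simulated here with a plain variable:</p>

```shell
# Simplified sketch of log-prompt.sh (not the real script): append each
# incoming prompt to a per-day log file. The real hook reads the prompt
# out of Claude Code's JSON hook input; here it is simulated directly.
prompt="fix the flaky daemon test"
LOG_FILE="/tmp/claude-prompts-$(date +%Y-%m-%d).log"
printf '[%s] %s\n' "$(date +%H:%M:%S)" "$prompt" >> "$LOG_FILE"
tail -n 1 "$LOG_FILE"
```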

<blockquote>
  <p><strong>What changed:</strong> The prompt logger now works across multiple projects (previously Koupang-only). Added the PostToolUse hook for Koupang that runs cargo check after build commands — catches compilation errors before I even look at the output.</p>
</blockquote>
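<p>The gating logic in the cargo-check hook is simple. A sketch (simplified; the real hook also has to parse the tool-use JSON that Claude Code pipes in on stdin):</p>

```shell
# Sketch of the PostToolUse gating logic (simplified; the real hook also
# parses the tool-use JSON from Claude Code to get the command string).
should_check() {
  # Only re-run `cargo check` after build-related commands.
  case "$1" in
    cargo\ *|make|make\ *) return 0 ;;
    *) return 1 ;;
  esac
}

should_check "cargo build --release" && echo "would run: cargo check"
should_check "ls -la" || echo "skipped: ls -la"
```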

<h3 id="permissions">Permissions</h3>

<p>The permissions list has grown significantly. The philosophy is: anything that only reads or only modifies local project files should be auto-allowed.</p>

<p><strong>Explicitly allowed (no confirmation needed):</strong></p>

<ul>
  <li><strong>Read-only shell</strong>: <code class="language-plaintext highlighter-rouge">ls</code>, <code class="language-plaintext highlighter-rouge">cat</code>, <code class="language-plaintext highlighter-rouge">find</code>, <code class="language-plaintext highlighter-rouge">grep</code>, <code class="language-plaintext highlighter-rouge">rg</code>, <code class="language-plaintext highlighter-rouge">tree</code>, <code class="language-plaintext highlighter-rouge">stat</code>, <code class="language-plaintext highlighter-rouge">which</code>, <code class="language-plaintext highlighter-rouge">file</code>, <code class="language-plaintext highlighter-rouge">wc</code>, <code class="language-plaintext highlighter-rouge">sort</code>, <code class="language-plaintext highlighter-rouge">uniq</code>, <code class="language-plaintext highlighter-rouge">diff</code>, <code class="language-plaintext highlighter-rouge">basename</code>, <code class="language-plaintext highlighter-rouge">dirname</code>, <code class="language-plaintext highlighter-rouge">realpath</code>, <code class="language-plaintext highlighter-rouge">jq</code>, <code class="language-plaintext highlighter-rouge">cut</code>, <code class="language-plaintext highlighter-rouge">tr</code>, <code class="language-plaintext highlighter-rouge">awk</code>, <code class="language-plaintext highlighter-rouge">sed</code>, <code class="language-plaintext highlighter-rouge">xargs</code>, <code class="language-plaintext highlighter-rouge">tee</code></li>
  <li><strong>Git read</strong>: <code class="language-plaintext highlighter-rouge">status</code>, <code class="language-plaintext highlighter-rouge">log</code>, <code class="language-plaintext highlighter-rouge">diff</code>, <code class="language-plaintext highlighter-rouge">branch</code>, <code class="language-plaintext highlighter-rouge">show</code>, <code class="language-plaintext highlighter-rouge">tag</code>, <code class="language-plaintext highlighter-rouge">remote</code>, <code class="language-plaintext highlighter-rouge">stash</code>, <code class="language-plaintext highlighter-rouge">blame</code>, <code class="language-plaintext highlighter-rouge">rev-parse</code></li>
  <li><strong>Git write</strong>: <code class="language-plaintext highlighter-rouge">add</code>, <code class="language-plaintext highlighter-rouge">commit</code>, <code class="language-plaintext highlighter-rouge">checkout</code>, <code class="language-plaintext highlighter-rouge">switch</code>, <code class="language-plaintext highlighter-rouge">merge</code>, <code class="language-plaintext highlighter-rouge">rebase</code>, <code class="language-plaintext highlighter-rouge">fetch</code>, <code class="language-plaintext highlighter-rouge">pull</code>, <code class="language-plaintext highlighter-rouge">push</code>, <code class="language-plaintext highlighter-rouge">cherry-pick</code></li>
  <li><strong>Cargo</strong>: <code class="language-plaintext highlighter-rouge">check</code>, <code class="language-plaintext highlighter-rouge">build</code>, <code class="language-plaintext highlighter-rouge">test</code>, <code class="language-plaintext highlighter-rouge">clippy</code>, <code class="language-plaintext highlighter-rouge">fmt</code>, <code class="language-plaintext highlighter-rouge">run</code>, <code class="language-plaintext highlighter-rouge">add</code>, <code class="language-plaintext highlighter-rouge">tree</code>, <code class="language-plaintext highlighter-rouge">doc</code>, <code class="language-plaintext highlighter-rouge">metadata</code>, <code class="language-plaintext highlighter-rouge">install</code>, <code class="language-plaintext highlighter-rouge">update</code>, <code class="language-plaintext highlighter-rouge">clean</code>, <code class="language-plaintext highlighter-rouge">bench</code>, <code class="language-plaintext highlighter-rouge">fix</code></li>
  <li><strong>Docker</strong>: <code class="language-plaintext highlighter-rouge">compose up/down/ps/logs/build/exec</code>, <code class="language-plaintext highlighter-rouge">images</code>, <code class="language-plaintext highlighter-rouge">logs</code>, <code class="language-plaintext highlighter-rouge">build</code>, <code class="language-plaintext highlighter-rouge">ps</code></li>
  <li><strong>Make</strong>: all <code class="language-plaintext highlighter-rouge">make</code> targets</li>
  <li><strong>Custom CLIs</strong>: <code class="language-plaintext highlighter-rouge">fl</code> (filament), <code class="language-plaintext highlighter-rouge">jujo</code>, <code class="language-plaintext highlighter-rouge">sqlx</code>, <code class="language-plaintext highlighter-rouge">rustup</code>, <code class="language-plaintext highlighter-rouge">research</code> (Go CLI)</li>
  <li><strong>File ops</strong>: <code class="language-plaintext highlighter-rouge">mkdir</code>, <code class="language-plaintext highlighter-rouge">touch</code>, <code class="language-plaintext highlighter-rouge">cp</code>, <code class="language-plaintext highlighter-rouge">mv</code></li>
  <li><strong>Shell</strong>: <code class="language-plaintext highlighter-rouge">printf</code>, <code class="language-plaintext highlighter-rouge">read</code>, <code class="language-plaintext highlighter-rouge">echo</code>, <code class="language-plaintext highlighter-rouge">whoami</code>, <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">env</code>, <code class="language-plaintext highlighter-rouge">date</code>, <code class="language-plaintext highlighter-rouge">uname</code></li>
</ul>

<p><strong>Explicitly denied:</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rm</code>, <code class="language-plaintext highlighter-rouge">sudo</code>, <code class="language-plaintext highlighter-rouge">curl</code>, <code class="language-plaintext highlighter-rouge">wget</code>, <code class="language-plaintext highlighter-rouge">chmod</code>, <code class="language-plaintext highlighter-rouge">chown</code>, <code class="language-plaintext highlighter-rouge">kill</code>, <code class="language-plaintext highlighter-rouge">killall</code>, <code class="language-plaintext highlighter-rouge">pkill</code>, <code class="language-plaintext highlighter-rouge">dd</code>, <code class="language-plaintext highlighter-rouge">mkfs</code></li>
  <li><code class="language-plaintext highlighter-rouge">git push --force</code>, <code class="language-plaintext highlighter-rouge">git reset --hard</code>, <code class="language-plaintext highlighter-rouge">git clean -f</code></li>
  <li><code class="language-plaintext highlighter-rouge">docker rm</code>, <code class="language-plaintext highlighter-rouge">docker rmi</code></li>
  <li><code class="language-plaintext highlighter-rouge">WebSearch</code>, <code class="language-plaintext highlighter-rouge">WebFetch</code></li>
  <li>Output redirection (<code class="language-plaintext highlighter-rouge">&gt; *</code>)</li>
</ul>
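<p>In Claude Code these rules live in <code class="language-plaintext highlighter-rouge">settings.json</code> as allow/deny patterns. A fragment in that shape (heavily abridged, with illustrative entries rather than a copy of my exact file) looks like:</p>

```json
{
  "permissions": {
    "allow": [
      "Bash(git status:*)",
      "Bash(cargo check:*)",
      "Bash(fl:*)",
      "Bash(jq:*)"
    ],
    "deny": [
      "Bash(rm:*)",
      "Bash(sudo:*)",
      "Bash(git push --force:*)",
      "WebFetch",
      "WebSearch"
    ]
  }
}
```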

<blockquote>
  <p><strong>What changed:</strong> Massively expanded from ~25 allowed commands to ~80+. Added all the text processing tools (jq, awk, sed, etc.), file operations (mkdir, cp, mv), more git commands (cherry-pick, blame), the full cargo suite, and custom CLIs (fl, jujo, research). The goal was to reduce the number of “approve this?” prompts to near-zero for normal development work. It mostly worked — I rarely see permission prompts now unless Claude is doing something genuinely unusual.</p>
</blockquote>

<h3 id="claudemd-files">CLAUDE.md Files</h3>

<p>Same hierarchical structure as before, but now with more content:</p>

<ul>
  <li><strong>Root</strong> (<code class="language-plaintext highlighter-rouge">koupang/CLAUDE.md</code>) — workspace structure, tech stack, ADR summary, key imports, scripts</li>
  <li><strong>STYLE.md</strong> — coding style guide adopted from TigerBeetle’s TIGER_STYLE, customized for this project. Covers: data-oriented programming, value objects, assertions, error handling, naming, function size</li>
  <li><strong>Per-service</strong> (<code class="language-plaintext highlighter-rouge">identity/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">catalog/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">shared/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">order/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">payment/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">cart/CLAUDE.md</code>) — detailed architecture, endpoints, domain models, test structure</li>
  <li><strong>Reference docs</strong> (<code class="language-plaintext highlighter-rouge">.plan/</code>) — detailed implementation plans, test standards</li>
</ul>

<blockquote>
  <p><strong>What changed:</strong> Added STYLE.md which is now the single source of truth for coding conventions. Added CLAUDE.md files for order, payment, and cart services. The STYLE.md adoption was a turning point — it gives Claude a concrete reference for what “good code” looks like rather than relying on vibes.</p>
</blockquote>

<hr />

<h2 id="development-cycle">Development Cycle</h2>

<h3 id="for-new-features-spec-driven-dev-workflow">For new features (spec-driven-dev workflow)</h3>

<ol>
  <li><strong>Research</strong> — <code class="language-plaintext highlighter-rouge">/spec-driven-dev</code> triggers filament lesson search for prior knowledge, then codebase exploration</li>
  <li><strong>Plan</strong> — <code class="language-plaintext highlighter-rouge">/grill-me</code> to stress-test the design, then <code class="language-plaintext highlighter-rouge">/plan-eng-review</code> for architecture review</li>
  <li><strong>Create tasks</strong> — break plan into filament tasks with dependency chains</li>
  <li><strong>Implement</strong> — work through tasks with <code class="language-plaintext highlighter-rouge">fl task ready</code> to find unblocked work</li>
  <li><strong>Review</strong> — <code class="language-plaintext highlighter-rouge">/code-eng-review</code> for structured code review against STYLE.md</li>
  <li><strong>Capture lessons</strong> — <code class="language-plaintext highlighter-rouge">fl lesson add</code> for gotchas and patterns discovered</li>
</ol>
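<p>The “find unblocked work” part of step 4 boils down to a dependency check. As a toy model (not filament’s actual implementation), a task is ready when it is open and every one of its dependencies is done:</p>

```shell
# Toy model of "ready" tasks (not filament's real data model):
# a task is ready when it is open and every dependency is done.
# Format per line: id  deps(comma-separated, "-" for none)  status
tasks='A - done
B A open
C B open
D A open'
done_ids=$(printf '%s\n' "$tasks" | awk '$3 == "done" { print $1 }')
ready=$(printf '%s\n' "$tasks" | while read -r id deps status; do
  [ "$status" = "open" ] || continue
  ok=yes
  if [ "$deps" != "-" ]; then
    for d in $(printf '%s' "$deps" | tr ',' ' '); do
      printf '%s\n' "$done_ids" | grep -qx "$d" || ok=no
    done
  fi
  [ "$ok" = "yes" ] && printf 'ready: %s\n' "$id"
done)
echo "$ready"
```

<p>Here A is done, so B and D (which depend only on A) are ready, while C stays blocked behind B.</p>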

<h3 id="for-boilerplatescaffolding">For boilerplate/scaffolding</h3>

<ol>
  <li><strong>Analyze patterns</strong> — <code class="language-plaintext highlighter-rouge">/pattern-analyzer</code> to find repeated structures in the codebase (done once or when patterns need updates)</li>
  <li><strong>Generate templates</strong> — export to jujo generator with <code class="language-plaintext highlighter-rouge">jujo init</code> + template files</li>
  <li><strong>Stamp out code</strong> — <code class="language-plaintext highlighter-rouge">jujo generate</code> for deterministic scaffolding</li>
  <li><strong>Customize</strong> — Claude fills in AI customization markers for business logic</li>
</ol>
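<p>The idea in miniature (a toy illustration, not jujo’s actual CLI or template set, though jujo’s Tera templates use similar <code class="language-plaintext highlighter-rouge">{{name}}</code>-style placeholders): stamp out the deterministic skeleton mechanically and leave a marker where the LLM should fill in the unique parts.</p>

```shell
# Toy illustration of deterministic scaffolding (not jujo itself):
# stamp a skeleton out of a template, leaving a marker for the LLM
# to fill in the unique business logic afterwards.
template='pub struct {{name}};

impl {{name}} {
    // AI-CUSTOMIZE: business logic goes here
}'
name="CartService"
printf '%s\n' "$template" | sed "s/{{name}}/$name/g"
```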

<h3 id="for-bug-fixes">For bug fixes</h3>

<ol>
  <li><strong>Triage</strong> — <code class="language-plaintext highlighter-rouge">/triage-issue</code> searches filament lessons first, then investigates</li>
  <li><strong>Fix</strong> — TDD approach with the fix</li>
  <li><strong>Capture</strong> — lesson recorded in filament for future reference</li>
</ol>

<h3 id="what-makes-this-work">What makes this work</h3>

<ul>
  <li><strong>STYLE.md</strong> gives Claude a concrete reference for code quality, not vibes</li>
  <li><strong>Filament</strong> provides persistent context across sessions — lessons, tasks, and knowledge graph survive session boundaries</li>
  <li><strong>Jujo</strong> eliminates token waste on boilerplate — deterministic code gen for repetitive patterns, Claude only handles the unique parts</li>
  <li><strong>The planning pipeline</strong> (PRD → plan → grill → review → implement) prevents wasted work on under-specified features</li>
  <li><strong>Expanded permissions</strong> make the flow nearly frictionless — I rarely see “approve?” prompts</li>
  <li><strong>1M context window</strong> means I can do planning + implementation + review in a single session without context exhaustion</li>
  <li><strong>The prompt logger</strong> captures everything for blog posts and retrospectives</li>
</ul>

<blockquote>
  <p><strong>My take:</strong> The February setup was functional but ad-hoc. The March setup feels like a proper workflow. The biggest wins were: (1) filament replacing beads with a knowledge graph that accumulates project wisdom across sessions, (2) STYLE.md giving Claude a codified standard to follow, and (3) the planning pipeline preventing the “just start coding” impulse that led to problems in the Filament sprint. I’m still not at the “fully autonomous multi-agent” level but I’m getting more comfortable delegating larger chunks of work to single sessions with good context. I also found that STYLE.md helped a lot in getting the agent to produce better code.</p>
</blockquote>

<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>Since this is all setup locally and I will have to use work computers, I will need a handy way to package all this and export it to other computers. I’ll get around to doing that.</p>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[An updated snapshot of my LLM-assisted development setup, as of March 2026. The previous snapshot was from late February. A lot has changed — new tools, more skills, a proper planning pipeline, and significantly expanded permissions. Same deal as before: Claude examined its own configuration and I peppered in my commentary.]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt6</title><link href="https://jyc11.github.io/blog/2026/03/23/getting-gud-at-llms-pt6" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt6" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/03/23/getting-gud-at-llms-pt6</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/03/23/getting-gud-at-llms-pt6"><![CDATA[<p>In <a href="/blog/2026/03/04/getting-gud-at-llms-pt5.html">Part 5</a>, I went fast and broke things building Filament. This time, I bounced between 5 projects over 18 days — shipped Filament v1.0, built and shipped Jujo v1.0 from scratch, pushed Koupang’s order saga to completion, attempted a C++ to Rust port, and started learning Haskell. I also overhauled my entire skill and workflow setup. Here’s a snapshot of my current setup which is now quite evolved: <a href="/blog/2026/03/23/current-LLM-workflow-setup.html">Current LLM Workflow Setup (March 2026)</a>.</p>

<hr />

<h2 id="the-numbers">The Numbers</h2>

<ul>
  <li><strong>~35 sessions</strong>, <strong>~300+ prompts</strong> over ~18 days (Mar 4 – Mar 22)</li>
  <li><strong>5 projects</strong> touched: Filament, Koupang, Jujo, RLedger, Haskell learning</li>
  <li><strong>87 git commits</strong> across all projects (36 Koupang, 31 Filament, 20 Jujo)</li>
  <li>Filament: v1.0.0 released, 31 commits, extensive QA</li>
  <li>Koupang: full order saga implemented, STYLE.md adopted, 36 commits</li>
  <li>Jujo: built from scratch to v1.0.0 in ~3 days</li>
  <li>Haskell: levels 1–9 completed (exercises generated by Claude), plus additional Exercism exercises</li>
</ul>

<hr />

<h2 id="filament">Filament</h2>

<p>Finishing this took longer than I thought. When I last left off, the TUI was nearly done, but not quite, and then I got distracted by my perfectionism. I did some extensive QA work, which I thought was good to experience, and managed to release v1.0.0. By the end of this 5-day sprint I was absolutely exhausted from how much focus I put in. It was a very different kind of tiredness, one I find hard to put into words. In the frenzy of prompting it felt extremely exhilarating to just Get Shit Done, but my need for full knowledge and perfect verification also led me to do lots of code reading and constant re-examining of priorities. I was just “On” for so many hours of the day and obsessed with getting it done. I don’t particularly wanna get back into this state again.</p>

<h3 id="git-history-mar-422">Git History (Mar 4–22)</h3>

<p>The filament sessions weren’t captured by the prompt logger since it only ran in the Koupang directory at the time. But the git history tells the story — 31 commits from Mar 4 to Mar 22:</p>

<ul>
  <li>TUI enhancements: message detail pane, keyset pagination, reply-to-message, 95 TUI tests</li>
  <li>Pre-v1 code review: 18 fixes across all 4 crates, typed entity DTOs, Clearable<T> enum</T></li>
  <li>QA: 7 QA sessions (22/22 tests passing, 0 bugs), roleplay simulations for multi-agent coordination</li>
  <li>Bug fixes: flaky daemon test race condition, error exit codes, circular dependency prevention</li>
  <li>v1.0.0: CI/CD pipeline, curl install script, distributable skills</li>
</ul>

<blockquote>
  <p><strong>My take:</strong> The filament git log is dense. 31 commits in ~18 days but the actual work was front-loaded into maybe 5-6 days of focused sessions. The roleplay simulations (where Claude pretended to be multiple agents using filament concurrently) were genuinely quite useful and I think using LLMs for QA has a lot of potential.</p>
</blockquote>

<hr />

<h2 id="koupang">Koupang</h2>

<p>I went back to Koupang after some rest where I did nothing. Instead of charging in headfirst at full speed like I did with Filament, I decided to take things slower and more deliberately. I went over the generated code and revised the plans multiple times. Eventually I iterated enough on the plans, and read enough of the code, that I had a good understanding of what was going on and what to do, and I began implementing. At this point I also started doing LLM-assisted code reviews. I was still recovering from the mad sprint while doing all this.</p>

<p>I think the code that came out was pretty decent and well tested. I learned a fair bit (mostly big picture) about the outbox pattern and Kafka setup, which was good. The very fine-grained implementation details elude me, to be honest, but since these are one-off things rather than repetitive basic CRUD work, I think they would take far longer to internalize. The full order flow is now complete; I just need to fully verify it works as expected.</p>

<h3 id="kafka-consumerdlq-planning--outbox-review-mar-5">Kafka Consumer/DLQ Planning &amp; Outbox Review (Mar 5)</h3>

<p>Reviewed and critiqued the Kafka consumer/DLQ plan for edge cases and production quality. Then turned the same rigor on the existing outbox implementation. Created improvement plans for both. Also got book recommendations for practical Kafka knowledge.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's go with bd-3sv
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I would like for you to critique it more for edge cases and production quality
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>could you also critique the current outbox implementation with the same rigor (edge cases and production readiness)?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the improved kafka plan and this outbox improvement plan, we should track them in beads
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This was the deliberate pace I wanted after my Filament burnout. Plan critique → outbox critique → track in beads. No rushing to implement. I was still using beads at this point. Idk how other people go full yolo for long periods of time. I don’t have it in me.</p>
</blockquote>

<h3 id="migrate-from-beads-to-filament-mar-10">Migrate from Beads to Filament (Mar 10)</h3>

<p>Migrated all beads tasks and project documentation into filament. Verified migrated data, filled in service entities, and updated global skills to reference filament instead of br.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament I recently completed the filament project and I want to migrate the beads in /br into filament along with project documentation. I want to actively use this tool in this project.
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>each plan needs a doc so the minimal mvp-milestone.md could be filled in in the future so leave it. the other stuff I think we can leave for record purposes
otherwise, just do a double check of ALL the migrated data
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>update the project_context and handoff skills to reference filament instead of br. they are located in the global .claude directory
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> Eating my own dog food. I built filament for this exact purpose and it was satisfying to actually use it. The migration from beads was straightforward because I designed filament to handle the same concepts. I was slightly worried things might not work well, but as I kept using it, it wasn’t that bad.</p>
</blockquote>

<h3 id="kafka-implementation-sprint-mar-1314">Kafka Implementation Sprint (Mar 13–14)</h3>

<p>Added gstack review skills and made them generic. Switched Kafka to <code class="language-plaintext highlighter-rouge">apache/kafka-native</code> image for smaller footprint. Implemented Kafka event consumer with DLQ support. Did outbox improvements. Worked through shared infrastructure tasks (caching, tracing). Ended with a shared code cleanup sweep.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>before we go further, I want to add these skills: https://github.com/garrytan/gstack
only the ones relevant to us
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament we will continue with the kafka implementation as we have planned
I would like to switch the kafka image we are using with this: https://hub.docker.com/r/apache/kafka-native
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>can you explain how adding to the dlq results in a retry?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament I want to work on the next shared tasks before we move on to payment and orders
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament I know there are additional tasks but I want a cleanup sweep of the shared code that was implemented. add tests, fix bugs and do a code review
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> The Kafka DLQ discussion was educational: I genuinely didn’t understand the retry mechanism at first, and Claude walked me through it clearly. I’m glad I asked instead of just accepting the code. The gstack skills turned out to be useful for structured code reviews. I particularly like the stop-and-offer-options style; it forces me to sit and think instead of just waiting for output.</p>
</blockquote>

<h3 id="the-big-koupang-day-mar-15">The Big Koupang Day (Mar 15)</h3>

<p>This was a marathon session. Refactored service structs from OO-style to free functions. Researched a friend’s payment project and articles from Toss (a fintech company). Adopted TigerBeetle’s TIGER_STYLE.md and consolidated it into a project STYLE.md. Created a cleanup skill. Did STYLE.md compliance refactoring across the codebase. Started order/payment service scaffolding with DOP business rules and property testing.</p>
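<p>The OO-to-free-functions refactor is easier to see in miniature. A hypothetical before/after with illustrative names (not the actual Koupang code):</p>

```rust
// Before: a DI-style service struct that holds its dependencies.
// After: a free function that takes its dependencies as arguments,
// which makes call sites and tests more explicit.

struct Repo; // stand-in for a real repository / connection pool

impl Repo {
    fn find_price(&self, _sku: &str) -> u64 {
        1000 // fixed price for the sketch
    }
}

// Before: OO-style struct with injected dependencies.
struct OrderService {
    repo: Repo,
}

impl OrderService {
    fn quote(&self, sku: &str, qty: u64) -> u64 {
        self.repo.find_price(sku) * qty
    }
}

// After: a free function; no struct, no hidden state.
fn quote(repo: &Repo, sku: &str, qty: u64) -> u64 {
    repo.find_price(sku) * qty
}

fn main() {
    let repo = Repo;
    let svc = OrderService { repo: Repo };
    // Both styles compute the same result; only the wiring differs.
    assert_eq!(svc.quote("sku-1", 3), quote(&repo, "sku-1", 3));
    println!("ok");
}
```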

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament before we continue, I want to do see if we can make the various "service" structs less OO. Right now, it's structs with DI and it's "fine" it works but if there is a way to avoid OO-ness then I would like to
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament before we go onto payments let's do some research. this project is being done by a friend and I think it contains some useful things we may not have considered
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to adopt this: https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md where relevant and keep a version of this in our project repo. use the /research skill
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's make a stale-files-cleanup skill since I want to do this semi-frequently
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's start from the P1 tasks and work through them one by one
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all the quick wins should be done, for outbox only do redis dedup for relay edge case (add test for this case) and others defer
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This was one of the most productive single days. The OO → free functions refactor was something I’d been thinking about and Claude handled it cleanly. Researching my friend’s payment project gave me ideas I wouldn’t have had otherwise. Adopting TIGER_STYLE was a pivotal decision. It gave me a concrete reference for what “good code” means in this project rather than vague directions from me. The STYLE.md compliance refactoring that followed was extensive but worth it. I’ll probably keep doing this for future projects.</p>
</blockquote>

<h3 id="integration-tests--saga-wiring-mar-16">Integration Tests &amp; Saga Wiring (Mar 16)</h3>

<p>Implemented integration tests and Kafka consumer wiring for the order/payment saga. Did a refactor so that event handlers properly use a PgConnection within transactions. Code review of the full saga. Documentation and filament knowledge-graph audit.</p>
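<p>The transaction refactor follows a common shape: a helper owns begin/commit/rollback, and the handler only borrows the connection, so the handler’s writes and the outbox insert share one transaction. A toy sketch with stand-in types (the real code uses sqlx’s PgConnection):</p>

```rust
// Stand-in connection type that records writes and whether a commit happened.
struct Conn {
    committed: bool,
    writes: Vec<String>,
}

impl Conn {
    fn execute(&mut self, sql: &str) {
        self.writes.push(sql.to_string());
    }
}

// The helper owns the transaction lifecycle; the handler closure only
// borrows the connection for the duration of the transaction.
fn with_transaction<F>(conn: &mut Conn, handler: F) -> Result<(), String>
where
    F: FnOnce(&mut Conn) -> Result<(), String>,
{
    // BEGIN (implicit in this sketch)
    match handler(conn) {
        Ok(()) => {
            conn.committed = true; // COMMIT
            Ok(())
        }
        Err(e) => {
            conn.writes.clear(); // ROLLBACK: discard the handler's writes
            Err(e)
        }
    }
}

fn main() {
    let mut conn = Conn { committed: false, writes: vec![] };
    let result = with_transaction(&mut conn, |tx| {
        tx.execute("UPDATE orders SET status = 'paid'");
        tx.execute("INSERT INTO outbox (event) VALUES ('order_paid')");
        Ok(())
    });
    assert!(result.is_ok());
    assert!(conn.committed);
    assert_eq!(conn.writes.len(), 2); // state change and outbox event commit together
    println!("ok");
}
```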

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament start the integration test tasks as well as the kafka consumer/producer tasks for the order/payment saga
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do the refactor where the event handlers properly use pgconnection and wraps things within a transaction
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shouldn't we be using process with retry? not sure why there are two try process once and process with retry and the conditions for using each.
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/code-eng-review of the saga that we implemented
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I would like to add a task for implementing a background jobs feature from the shared crate with persisted jobs
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I tried to understand what the hell was going on because this part was genuinely the most interesting part of the project so far. I really wanted to make this production quality. The engineering review skill paid off by catching issues I wouldn’t have noticed in a manual read-through. There could still be issues I don’t know about, but this is a learning project (and I know for a fact that many “real” codebases don’t have “high” quality code either, due to how software projects are usually run: rushed, under-resourced, and untested/unreviewed).</p>
</blockquote>

<h3 id="jujo-generators-in-koupang-mar-19">Jujo Generators in Koupang (Mar 19)</h3>

<p>Used the newly built jujo tool to analyze existing code patterns in Koupang and generate service scaffolding templates. Tested by generating a shipping service, verified the output, then removed it.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament /jujo /pattern-analyzer let's analyze the existing repeated code patterns and then add them to jujo
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>generate some for shipping like before and then remove them later after I finished reviewing
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>okay remove the generated shipping files
can you estimate token savings using this tool vs reading files + finding patterns + generating code?
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This was the payoff. Jujo generated a full service scaffold that would have taken Claude many thousands of tokens to analyze patterns and produce from scratch. I’m very satisfied with this. Adding more determinism for LLMs is the way to go. I would like to add more going forward.</p>
</blockquote>

<h3 id="docker-sandbox--permissions-research-mar-21">Docker Sandbox &amp; Permissions Research (Mar 21)</h3>

<p>Explored running Claude Code in isolated Docker sandbox mode. Researched vibebox and various sandbox approaches. Also expanded the allowed bash commands to reduce permission prompts.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using this: https://docs.docker.com/ai/sandboxes/get-started/
I ran claude in docker sandbox mode but all of the local setup I have (skills, CLIs, global CLAUDE.md) don't carry over
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>can you also research this: https://github.com/robcholz/vibebox
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I still have to frequently allow safe bash commands to run which I want to remove
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>oh then nevermind about this project, I will have to look for a new solution to run claude code dangerously skip permissions in an isolated way
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> The sandbox research was a dead end for now. None of the solutions carry over my skills, CLIs, and config cleanly. The permissions expansion was more immediately useful: I went from constantly approving safe commands to rarely seeing permission prompts. I may have to return to this later and just build my own solution (isolated VM + exported LLM configs). Building a good harness for LLMs is proving to be an interesting challenge that keeps coming up: first Filament, then Jujo, and now this.</p>
</blockquote>

<hr />

<h2 id="rledger">RLedger</h2>

<p>When I heard about ledger-cli I was intrigued by its ability to handle many different kinds of “assets” in a single double-entry accounting format. I shelved this fascination and mostly forgot about it. Then, in the middle of Koupang, I decided to try porting the code from C++ to Rust to understand how ledger-cli worked. I got through 7 of the 8 planned phases and only had the test cases left to convert when I got distracted again and did not finish lol. I also barely read the code this time, since I wanted to read the codebase in full once it was all done. I can return any time to finish it off and just read and learn. This was also an experiment to see how well Claude would do at porting a codebase I knew nothing about. Not a super high priority, but I will return to it. I decided to keep this local-only because I think doing a “heist” of open source software by rewriting it in another language for no good reason is kind of a shitty move.</p>

<p>No session logs for this one; I did it without the prompt logger running. It was a spontaneous detour. The port seemed to go well but I’m not sure how “good” it is. This was also done in preparation to see if I could use Claude Code to port Vinyl Cache to Rust cleanly.</p>

<hr />

<h2 id="jujo">Jujo</h2>

<p>I originally had an idea for a Ruby-on-Rails-like meta-framework, but in Rust with SQLite and Datastar. It was to be a simple boilerplate-generating SaaS template. I got the idea a while back, but real-life work, and the activation energy of trial-and-erroring my way across multiple unfamiliar libraries after a long day at work, pushed it to the side. I eventually decided to have a go at it. While in the research and ideation phase with Claude, I realised that the “code generation” part was the more interesting problem to solve. I had noticed in Koupang that the LLM was spending a lot of tokens just reading the existing codebase, understanding its patterns, and re-implementing them. This felt like a waste of time and tokens, so I extracted that part out of the project and pursued it instead.</p>

<p>With my learnings from Filament and Koupang (both very large and ambitious projects), I decided to keep this short and very simple: do one thing very well. So I planned and researched with Claude, got it done in one long session (thanks to the new 1M-token context window update), and then used a few shorter sessions afterwards to polish and release it. I am happy with my restraint and the fact that I got it done quite fast without burning out again. I also did not pay much attention to the code, because I cared more about getting it done and because it was quite simple.</p>

<h3 id="ideation--naming-mar-17-evening">Ideation &amp; Naming (Mar 17 evening)</h3>

<p>Spitballed the code generation idea. Explored Korean words for “stamp” as potential names. Tried “dojang”, “gakin”, and others before settling on “jujo” (Korean for casting/mold).</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament I was thinking how to make working with LLMs more efficient
I already realized that consistent patterns across the codebase is very important
Why not make this cli codegen tool generalizable?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's call it `dojang` which is korean for stamp but romanized
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's get synonyms for stamp first, get the romanized korean versions and check if they exist on crates
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do jujo, gakin could be pronounced as gay-kin which is wrong
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> The naming discussion was fun. I wanted a Korean word that non-Korean speakers could actually pronounce. “Jujo” worked out perfectly. It’s short, memorable, and the crate name was available.</p>
</blockquote>

<h3 id="planning--v10-in-one-long-session-mar-17-night">Planning → v1.0 in One Long Session (Mar 17 night)</h3>

<p>Used /spec-driven-dev and /grill-me for the planning phase. Explored Tera templating. Implemented all 5 phases (TOML parse + Tera render → field parsing → injection/dry-run → discovery commands → AI customization markers) in a single continuous session. Did a code review. Added formatter hooks. Ran multiple live demos in /tmp directories between phases.</p>
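<p>For context, Tera is a Jinja2-style template engine for Rust: you hand it template text containing <code class="language-plaintext highlighter-rouge">{{ placeholder }}</code> markers plus a context of values, and it renders the output. Here is a tiny hand-rolled stand-in for the core idea (the real crate also supports filters, loops, and template inheritance, none of which this sketch covers):</p>

```rust
// Minimal stand-in for what a template engine does at its core:
// substitute `{{ key }}` placeholders in template text with values.
// The real Tera crate is far more capable; this only shows the concept.

fn render(template: &str, vars: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        // Build the literal placeholder "{{ key }}" and replace it.
        out = out.replace(&format!("{{{{ {} }}}}", key), value);
    }
    out
}

fn main() {
    // A code-generation flavored example: render a struct skeleton.
    let template = "pub struct {{ name }} {\n    pub id: {{ id_type }},\n}";
    let rendered = render(template, &[("name", "Shipping"), ("id_type", "Uuid")]);
    println!("{rendered}");
}
```

In jujo, templates like this are actual code files with slots, so the LLM can generate a whole service scaffold without re-deriving the patterns from the codebase each time.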

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/spec-driven-dev let's continue with the planning phase /grill-me as well
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for the templates itself, I was thinking of intellij style codegen. you can add templates which are actual code but with areas you can slot in variables
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I don't quite understand what Tera is and how it fits into all of this to be honest
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I would like to see phase 1 in action in a directory somewhere step by step
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit and move on, also clean up tmp directory of stuff
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/code-eng-review
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I would like a live demo of the formatter hook in action
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This was a nice test run of the new workflow (described further down the blog). Planning → implementation → demos → review → ship, all in one evening. The 1M context window made this possible. In earlier sessions I would have run out of context mid-implementation. I did live demos between each phase which gave me confidence the code actually worked before moving on.</p>
</blockquote>

<h3 id="polish--v10-release-mar-1819">Polish &amp; v1.0 Release (Mar 18–19)</h3>

<p>Added HTML/CSS language support. Ran QA with Claude across multiple languages. Added CI/CD, install/uninstall scripts, Makefile. Published v1.0.0 with curl install support. Registered skills in the library catalog.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament let's add html and css support then update the qa plan to include them
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/filament let's begin the qa
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's release v1 and also give an option to install with curl
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's install jujo from source and use it in other projects
but before we do, let's do a review of the skills we will be using and add it to the local /library
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> Keeping scope small paid off. Filament took ~5 days of intense work. Jujo took ~3 days of relaxed work. Both shipped v1.0. The difference was scope discipline. Jujo does one thing (code generation from templates) and does it well.</p>
</blockquote>

<hr />

<h2 id="meta-stuff">Meta stuff</h2>

<p>Across the 4–5 projects I worked on, I kept noticing friction points in my workflow. I wrote them down and then actually tackled most of them in a single marathon session:</p>

<ul>
  <li><strong>Done:</strong> formalized review and planning skills (grill-me, plan-eng-review, code-eng-review, spec-driven-dev)</li>
  <li><strong>Done:</strong> central skill library with catalog (forked disler/the-library)</li>
  <li><strong>Done:</strong> proper coding STYLE.md (adopted from TigerBeetle’s TIGER_STYLE)</li>
  <li><strong>Done:</strong> combined all skills into a spec-driven-dev workflow</li>
  <li><strong>Done:</strong> proper research skill using a Go CLI instead of random Python scripts</li>
  <li><strong>Done:</strong> expanded allowed bash commands to near-zero permission prompts</li>
  <li><strong>Partial:</strong> more rigorous testing (added property testing, mutation and fuzz testing still TODO)</li>
  <li><strong>Partial:</strong> .md file auditing (did cleanup passes but this is ongoing)</li>
  <li><strong>Dead end:</strong> running Claude Code in “dangerously skip permissions” mode in an isolated environment (Docker sandbox and vibebox both didn’t work the way I wanted)</li>
</ul>

<h3 id="skill-ecosystem-overhaul-mar-17">Skill Ecosystem Overhaul (Mar 17)</h3>

<p>Massive meta session. Researched mattpocock/skills (16 skills) and garrytan/gstack (15 skills), compared to my 18 existing skills. Installed 4 planning skills (grill-me, write-a-prd, prd-to-plan, prd-to-issues). Forked disler/the-library for private skill distribution. Created spec-driven-dev workflow (Research → Plan → Implement with human checkpoints). Wove filament into all skills. Installed and customized triage-issue. Removed br and bd-to-br-migration skills.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>research this https://github.com/mattpocock/skills for skills we can add
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>research this https://github.com/disler/the-library
research and think about how to implement this
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>this is from a video by a netflix engineer about using AI effectively in large codebases but I think it can be generalized to a meta skill that refers to multiple skills to define a workflow
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I would like to weave in filament into all this because filament has tasks, plans, lessons etc.
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I think we should install the bug fix skill and weave in filament into that as well. make sure it uses lessons
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This session transformed my setup. Going from 18 ad-hoc skills to 20 integrated skills with a library catalog, a proper planning pipeline (PRD → plan → issues → implement), and filament woven throughout was a step change (which I hope fixes the issue where Claude wasn’t using filament as much as I wanted). The spec-driven-dev workflow in particular has become my default way to start any non-trivial feature. See the <a href="/blog/2026/03/23/current-LLM-workflow-setup.html">workflow snapshot</a> for the full current setup. I came across these things through YouTube recommendations rather than seeking them out. I also thought that too many skills might become cumbersome, but they were surprisingly easy to manage.</p>
</blockquote>

<hr />

<h2 id="llms-for-learning">LLMs for learning</h2>

<p>In preparation for a new position (which also required a lot of paperwork), I turned to Claude Code to help me learn a new language for the role: Haskell. I decided I wanted to learn to code by hand and go more slowly before I start using LLMs for work (although I plan to use them slightly more conservatively there). Having an endlessly patient tutor who will answer any dumb question I have is proving very useful. I strictly told it not to give me full answers, only hints. It’s quite good at that, but sometimes Haskell stumps me so much that I just needle it until it gives me the semblance of an answer. I also used it to generate practice questions for me.</p>

<h3 id="haskell-via-exercism-mar-2122">Haskell via Exercism (Mar 21–22)</h3>

<p>Worked through levels 5–9 of custom Haskell exercises. Topics: pattern matching on ADTs, type classes (Show, Functor, Describable), Maybe/Either error handling, State monad (stack calculator), Writer monad, IO (guessing game, address book, word counter). Also set up a “build-your-own-dkv” distributed key-value store project as another TDD learning exercise.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I am doing level 5 in exercises
I'm not sure how to get the required data from the shape for the area and perimeter functions
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for prettyPrint, I want to add addParens as a helper function inside prettyPrint itself. how do I do so?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I don't quite understand the state type
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for stackCalc, I don't understand where the operator is coming from
</code></pre></div>  </div>

  <p>For the build-your-own-dkv project:</p>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>this is a LEARNING project I want to emphasize that
you should heavily bias towards providing hints and resources I can read
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> The pattern of “I try → I get stuck → I ask for a hint → I try again → I ask for more help” works really well for learning. Haskell’s type system is genuinely hard and having a tutor that can look at my code and tell me exactly where my type error is without giving me the answer is invaluable. The State monad section was particularly painful but educational. I went from “what is this?” to implementing a stack calculator with it in one session. The previous sessions were not recorded but they followed the same format.</p>
</blockquote>

<p>I plan to use Claude Code for more learning projects in the future. CodeCrafters is cool, but I think even if I pay for it, I will just forget to do it because of my propensity for getting distracted. So I plan to use Claude Code like CodeCrafters, as a learning tool. This time, I will code by hand as a learning experience, but with a competent(?) tutor.</p>

<hr />

<h2 id="observations">Observations</h2>

<p>I think meta-cognition about your own thought and work processes, and vigilance about how the LLM works, is the biggest takeaway. I often hear people complain about LLMs not doing things the right way and going off the rails, but I wonder how much they tailor their context and split up their tasks so that the LLM is most likely to output the “right” thing.</p>

<p>Considering how rapidly LLMs improve, all of the stuff I am doing may be useless in a couple of months. Hell, I found that Claude can already work with worktrees and subagents, which is what Filament is supposed to help with via its file-locking feature and the build coordination I planned to implement. It also ships more primitive versions of Filament’s ideas in the form of various .md files, and I can see Anthropic and other AI companies converging on some graph-like knowledge-management structure. I also found <a href="https://github.com/dgraph-io/dgraph">dgraph</a> and <a href="https://github.com/abhigyanpatwari/GitNexus">GitNexus</a>, which do what Filament does but more specialized. In addition, all of the tips and tricks I learned date from when the context window was quite small. Now that the context window is 1M tokens, the precautions I have to take are lessened, though not completely irrelevant. Using Claude Code got easier because I don’t have to constantly close and reopen sessions.</p>

<hr />

<h2 id="going-forward">Going Forward</h2>

<p>If all goes well, I will be far too busy to continue many of the side projects I have going on now. I will most likely continue Koupang to see how big the codebase can get, and do the CodeCrafters thing on my own. I have a fundamental unwillingness to completely let go and not understand the code being produced, at least for codebases I consider important, and I genuinely want to improve as a software engineer. The Filament experience was eye-opening in how draining it is to just full send, which was a big inflection point for me. I could bounce between 2 or 3 projects manually, but I don’t think I’m comfortable going full Gas Town industrial code factory level.</p>

<p>But one thing I do want to share is tackling a refactor of a legacy codebase using Claude Code. There are so many legacy codebases in real life, but I rarely see actual examples of how they can be tackled with LLM tools, so I think I will do that and share. Future posts:</p>
<ul>
  <li>Overengineering Koupang for Fun and Profit pt{n} -&gt; Koupang obviously</li>
  <li>Just Refactor It Dude pt{n} -&gt; for refactoring legacy codebases</li>
  <li>smaller LLM posts (maybe I go off the deep end with Gas Town or figure out how to isolate LLMs better?)</li>
  <li>any miscellaneous thoughts I may have</li>
</ul>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[In Part 5, I went fast and broke things building Filament. This time, I bounced between 5 projects over 18 days — shipped Filament v1.0, built and shipped Jujo v1.0 from scratch, pushed Koupang’s order saga to completion, attempted a C++ to Rust port, and started learning Haskell. I also overhauled my entire skill and workflow setup. Here’s a snapshot of my current setup which is now quite evolved: Current LLM Workflow Setup (March 2026).]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt5</title><link href="https://jyc11.github.io/blog/2026/03/04/getting-gud-at-llms-pt5" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt5" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/03/04/getting-gud-at-llms-pt5</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/03/04/getting-gud-at-llms-pt5"><![CDATA[<p>In <a href="/blog/2026/02/26/getting-gud-at-llms-pt4.html">Part 4</a>, I finished planning for the big order orchestration saga and then started implementing. I am in the middle of implementing it but I decided to go on a little side quest. Things were getting complacent and I wanted to shake things up.</p>

<hr />

<h2 id="problem-and-solution">Problem and Solution</h2>

<p>I was already falling into established patterns because I was working how I would work at a real, serious full-time job: using LLMs to enhance my speed by outsourcing typing, but not going all in. I found a pace I was comfortable with. But I was definitely not satisfied. The fact that human intervention is needed, or that LLMs speed you up, is not revolutionary; plenty of other people are saying it too. Also, I wanted to challenge myself more with the single-agent setup. So I changed 2 variables:</p>

<ol>
  <li>I sped up the pace of work. I was going at a reasonably slow-ish pace compared to other devs who have embraced this fully, so I had room to improve there.</li>
  <li>The amount of human review and intervention needed to be reduced. I still meticulously read the code generated in Koupang and directed refactoring efforts.</li>
</ol>

<p>I didn’t want to apply this to Koupang, so I decided on a new project. Tangent: <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox</a> “is said to occur when technological improvements that increase the efficiency of a resource’s use lead to a rise, rather than a fall, in total consumption of that resource”. This applies because now, if I have an idea, I can just execute. I started the second project because the cost of starting is now so low that I can afford to take a quick jaunt at something completely unrelated and come back to the original project without losing much time.</p>

<p>As I developed Koupang, I felt that task management was solved by <a href="https://github.com/Dicklesworthstone/beads_rust">beads_rust</a>, but other things like knowledge management were not so convenient or agent-first. I looked into <a href="https://github.com/Dicklesworthstone#the-agentic-coding-flywheel">Flywheel</a>, but I didn’t really want to install 10+ tools and learn all of them. So instead, I decided to build my own: <a href="https://github.com/JYC11/filament">Filament</a>, a single Rust binary with all the tools I needed. I got Claude to research the tools and codebases that inspired me, and with some of my input, we began building.</p>

<h2 id="process-and-outcomes">Process and Outcomes</h2>

<p>I took an evening or so to plan and research. The next day, I started executing. I gave myself a day to get as much done as possible, since I wanted to go back to Koupang ASAP. Thus, I went fast and reduced human intervention (the 2 variables I talked about above). Cracks immediately started to show. The cracks were:</p>

<ul>
  <li>programming shortcuts taken (e.g. N+1 queries, where the LLM didn’t just write a new query to fetch things in batch)</li>
  <li>code quality issues (e.g. god functions, circular dependencies)</li>
  <li>bugs (too numerous to count)</li>
</ul>

<p>There is quite a lot to cover, so I got Claude to summarize the interactions below. In the end, I mostly managed to finish all the features I wanted. I will need to do some extensive QA, but afterwards I plan to integrate this into my workflow. As I write this blog post, I am planning and executing an aggressive QA phase of development where I will try to break what I built using Claude.</p>

<h2 id="the-numbers">The Numbers</h2>

<ul>
  <li><strong>40 sessions</strong>, <strong>227 prompts</strong> over ~1.5 days (Mar 2 evening → Mar 4 morning)</li>
  <li><strong>42 commits</strong>, <strong>~15,000 lines of Rust</strong> across 4 crates</li>
  <li><strong>235 tests</strong> (120 core + 58 CLI + 39 daemon + 10 MCP + 8 TUI)</li>
  <li><strong>20 ADRs</strong> documenting architecture decisions</li>
  <li><strong>5 phases completed</strong> (Core, CLI, Daemon+MCP, Agent Dispatching, TUI)</li>
  <li><strong>~6 code review sessions</strong>, <strong>2+ manual QA rounds</strong></li>
  <li>Multiple context window exhaustions (sessions continued from summaries)</li>
</ul>

<h2 id="prompt--progress-summaries-by-ai-with-my-takes-in-between-as-usual">Prompt &amp; Progress summaries by AI with my takes in between as usual</h2>

<h3 id="sessions-12-planning--architecture-decisions-mar-2-evening">Sessions 1–2: Planning &amp; Architecture Decisions (Mar 2 evening)</h3>

<p>Copied the Makefile and util-scripts from Koupang as a starting point. Wrote 6+ Architecture Decision Records. Key decisions made: messages are NOT graph nodes (separate inbox/outbox pattern), file reservations and agent runs also not graph nodes, single-binary architecture (installable via <code class="language-plaintext highlighter-rouge">curl</code> not just <code class="language-plaintext highlighter-rouge">cargo</code>), and per-project <code class="language-plaintext highlighter-rouge">.filament/</code> directory with local SQLite + Unix socket.</p>
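<p>The inbox/outbox decision can be sketched in a few lines. This is only an illustrative toy under my own assumptions: the <code>MessageBus</code> name, the fields, and the explicit delivery pass are guesses at the shape, not filament’s real types.</p>

```rust
// In-memory sketch of the inbox/outbox idea: messages live in plain
// per-agent queues rather than as nodes in the knowledge graph.
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone)]
struct Message {
    from: String,
    to: String,
    body: String,
}

#[derive(Default)]
struct MessageBus {
    // outbox: messages sent but not yet delivered
    outbox: VecDeque<Message>,
    // inboxes: delivered messages keyed by recipient agent
    inboxes: HashMap<String, VecDeque<Message>>,
}

impl MessageBus {
    fn send(&mut self, from: &str, to: &str, body: &str) {
        self.outbox.push_back(Message {
            from: from.into(),
            to: to.into(),
            body: body.into(),
        });
    }

    // A delivery pass drains the outbox into per-recipient inboxes.
    fn deliver(&mut self) {
        while let Some(msg) = self.outbox.pop_front() {
            self.inboxes.entry(msg.to.clone()).or_default().push_back(msg);
        }
    }

    fn receive(&mut self, agent: &str) -> Option<Message> {
        self.inboxes.get_mut(agent)?.pop_front()
    }
}

fn main() {
    let mut bus = MessageBus::default();
    bus.send("planner", "coder", "implement phase 2");
    bus.deliver();
    let msg = bus.receive("coder").unwrap();
    println!("{} -> {}: {}", msg.from, msg.to, msg.body);
}
```

<p>Keeping messages out of the graph means delivery is just an append and a pop, with no graph traversal or node lifecycle to manage.</p>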

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/project-context some chore work, let's copy over the makefile and util-scripts from koupang
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's record the current architecture decision records
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for inter-agent messaging are the messages stored as graphs as well? that doesn't seem to make sense to me
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>should we adopt an inbox/outbox type structure for messages?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do single-binary because I want it to be installable via curl as well, not just cargo
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I decided to do a significant amount of planning. What isn’t captured here are the other sessions with the LLM where I explicitly planned knowledge graph tools and multi-agent management with bash scripts, did tool research/brainstorming, and named the project. They have been left out, but yeah, I did them.</p>
</blockquote>

<h3 id="sessions-34-phase-1--core-library-mar-3-morning">Sessions 3–4: Phase 1 — Core Library (Mar 3 morning)</h3>

<p>Researched beads_rust’s JSONL/flush design for comparison. Implemented the core library: models, errors, schema, store, graph, connection, protocol. Value objects (Priority, Weight, NonEmptyString) to make invalid states unrepresentable. Code review + test review against the test guide, marked Phase 1 complete.</p>
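<p>The value-object pattern is simple to sketch. <code>Priority</code> and <code>NonEmptyString</code> are the names from this phase, but the exact bounds and the error type here are my assumptions, not filament’s actual definitions.</p>

```rust
// Newtype wrappers with private fields: the only way to obtain a value
// is through a checked constructor, so invalid states can't exist.
#[derive(Debug, PartialEq)]
enum ValidationError {
    OutOfRange,
    Empty,
}

// A bounded, non-negative priority (range 0..=4 assumed here).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Priority(u8);

impl Priority {
    fn new(value: u8) -> Result<Self, ValidationError> {
        if value <= 4 {
            Ok(Priority(value))
        } else {
            Err(ValidationError::OutOfRange)
        }
    }
    fn get(self) -> u8 {
        self.0
    }
}

// A string that is guaranteed non-blank after trimming.
#[derive(Debug, Clone, PartialEq)]
struct NonEmptyString(String);

impl NonEmptyString {
    fn new(value: impl Into<String>) -> Result<Self, ValidationError> {
        let value = value.into();
        if value.trim().is_empty() {
            Err(ValidationError::Empty)
        } else {
            Ok(NonEmptyString(value))
        }
    }
    fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    let p = Priority::new(2).unwrap();
    let name = NonEmptyString::new("filament-core").unwrap();
    println!("priority={} name={}", p.get(), name.as_str());
}
```

<p>Downstream code that accepts a <code>Priority</code> never needs to re-validate it; the type itself is the proof.</p>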

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>could you assume any reasons why beads_rust uses jsonl and a flush feature?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's start implementing
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>can you review the tests based on the test guide? also mark phase 1 as complete in the plans
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>look through the codebase and see if there are more opportunities to use value objects
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I mostly directed as I developed this part. I wanted it to be airtight because it is literally the core. I implemented the things I talked about in previous posts (ADTs, value objects, etc.) for correctness and hoped that maybe Claude would follow the pattern as it continued the project (it didn’t lol). I still read the code Claude generated.</p>
</blockquote>

<h3 id="sessions-57-phase-1-polish--code-reviews-mar-3-12001330">Sessions 5–7: Phase 1 Polish &amp; Code Reviews (Mar 3 ~12:00–13:30)</h3>

<p>Two rounds of code review — found 3 bugs, made 4 improvements. Added ADR-018 for value types/newtypes. Created a gotchas document since the project was already accumulating pitfalls (sqlx custom newtypes, thiserror v2 <code class="language-plaintext highlighter-rouge">source</code> field behavior, petgraph 0.7 API changes).</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I don't think priority should be i32 because I don't want negative priority. Also, I want invalid states to be unrepresentable
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do a code review session for phase 1
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>what about the macros for the repetitive code?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do another code review over phase 1 just in case
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I was still quite slow during these sections. I wanted things to be airtight.</p>
</blockquote>

<h3 id="koupang-interlude-1-outbox--shared-module-code-review-mar-3-1330">Koupang Interlude 1: Outbox &amp; Shared Module Code Review (Mar 3 ~13:30)</h3>

<p>Switched back to Koupang briefly. Did a code review of the outbox implementation, fixed 8 issues. Then reviewed the shared module, found and fixed 9 more issues. Context window ran out mid-session.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do a code review of the current outbox implementation
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's fix all of the issues from 1 to 8
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do some more code review on the shared module, we can skip outbox since that was done.
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's fix them all 1 to 9
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> After realizing I could even offload the review work to Claude, I tried it with Koupang as a trial run to see how it was. I think I’m somewhat satisfied with the performance.</p>
</blockquote>

<h3 id="sessions-89-phase-2--cli-mar-3-13401400">Sessions 8–9: Phase 2 — CLI (Mar 3 ~13:40–14:00)</h3>

<p>Implemented all CLI commands: entity, task, relation, query, message, reserve. 27 integration tests. This was one of the fastest phases — two sessions, plan then execute.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do phase 2
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: Phase 2: CLI Implementation Plan...
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> Still reading the code and I still have a good idea of the codebase.</p>
</blockquote>

<h3 id="sessions-1011-phase-2-code-reviews--manual-qa-mar-3-14001500">Sessions 10–11: Phase 2 Code Reviews + Manual QA (Mar 3 ~14:00–15:00)</h3>

<p>Two code review rounds — fixed 5 bugs, 3 architecture improvements, 18 new tests. Created a manual QA skill in <code class="language-plaintext highlighter-rouge">.claude/skills/</code> for structured end-to-end testing. Ran 50 manual test cases. Decided on dual-track project management: keep <code class="language-plaintext highlighter-rouge">.md</code> files as committed source of truth AND use filament’s own knowledge graph for live tracking.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do a codereview. bugs, test coverage, test cases, architecture improvements should be the focus
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>before actually using filament for self-tracking, I want you to do some manual QA with some dummy information
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the manual qa should be a proper SKILL.md in .claude/skills
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the test results should have date-time in the file name as well in case of multiple tests within the same day
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I started off by being quite aggressive with reviews, fixing and refactoring. This changes as time passes.</p>
</blockquote>

<h3 id="sessions-1214-phase-3--daemon-mar-3-15001600">Sessions 12–14: Phase 3 — Daemon (Mar 3 ~15:00–16:00)</h3>

<p>Implemented Unix socket daemon with NDJSON protocol. 9 daemon integration tests. CLI now routes through daemon when running, falls back to direct DB access otherwise.</p>
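<p>The daemon protocol can be illustrated with a tiny std-only round trip: newline-delimited JSON over a Unix socket, one object per line. The socket path, the envelope shape, and the echo behavior here are placeholders of mine, not filament’s actual protocol.</p>

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

// One request/response round trip using NDJSON-style line framing.
// Returns the daemon's single reply line.
fn roundtrip(request: &str) -> std::io::Result<String> {
    let path = std::env::temp_dir().join(format!("ndjson-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path);
    let listener = UnixListener::bind(&path)?;

    // "Daemon" side: read one line (one JSON object), answer with one line.
    let server = thread::spawn(move || -> std::io::Result<()> {
        let (stream, _) = listener.accept()?;
        let mut reader = BufReader::new(stream.try_clone()?);
        let mut line = String::new();
        reader.read_line(&mut line)?;
        let mut stream = stream;
        writeln!(stream, r#"{{"ok":true,"echo":{}}}"#, line.trim_end())
    });

    // "CLI" side: send the request, wait for the reply line.
    let mut client = UnixStream::connect(&path)?;
    writeln!(client, "{}", request)?;
    let mut reader = BufReader::new(client);
    let mut response = String::new();
    reader.read_line(&mut response)?;
    server.join().unwrap()?;
    let _ = std::fs::remove_file(&path);
    Ok(response.trim_end().to_string())
}

fn main() -> std::io::Result<()> {
    let reply = roundtrip(r#"{"cmd":"list_tasks"}"#)?;
    println!("daemon replied: {}", reply);
    Ok(())
}
```

<p>Line framing is what makes the fallback cheap: whether a reply came from the daemon or from a direct DB call, the CLI just consumes one record at a time.</p>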

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final round of manual qa and code review before importing project docs, tasks, context into filament
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do phase 3
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: Phase 3: Daemon Implementation Plan...
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This is where I started going much faster, I think. I started reading the code less.</p>
</blockquote>

<h3 id="sessions-1518-phase-3-bug-fixes--dogfooding-mar-3-16301730">Sessions 15–18: Phase 3 Bug Fixes &amp; Dogfooding (Mar 3 ~16:30–17:30)</h3>

<p>This is where things got interesting. Manual QA of the daemon revealed that neither <code class="language-plaintext highlighter-rouge">create_entity</code> nor <code class="language-plaintext highlighter-rouge">update_entity_status</code> were creating events — the store layer just didn’t do it. I accidentally deleted the tmp directory at one point. Started using filament to track its own tasks (dogfooding). Refactored the daemon handler “god function” into 7 domain sub-modules. Added multi-agent concurrency tests.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Neither create_entity nor update_entity_status creates events. Events must be created separately.
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add that as a task on filament
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I have accidentally deleted the tmp directory
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want you to add test cases of cli connecting to the daemon for multiple agents using multithreading
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I got the idea that Claude could probably do the manual QA itself, and that it should, because this is an agent-first tool, so I did it. I’m glad I did. It exposed bugs and shortcuts Claude was taking (not fully implementing the event store feature), so it was a good call. Another effect of going fast.</p>
</blockquote>

<h3 id="sessions-1920-phase-3-refactoring-mar-3-17501830">Sessions 19–20: Phase 3 Refactoring (Mar 3 ~17:50–18:30)</h3>

<p>Handler refactoring and code review of Phase 3. Deduplicated gotchas from MEMORY.md into a proper gotchas document + filament knowledge graph. Context window ran out during this — had to continue from summary.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to do some refactoring of the big handler function
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want you to deduplicate some content in .md files. In memory.md there are a bunch of gotchas, those can be in a proper gotchas markdown file and in filament as well
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> A consequence of going too fast. I noticed a god function and cruft building up in .md files. I decided to slow down and fix things.</p>
</blockquote>

<h3 id="sessions-2124-mcp-server-mar-3-18252000">Sessions 21–24: MCP Server (Mar 3 ~18:25–20:00)</h3>

<p>Planned and implemented MCP server using the <code class="language-plaintext highlighter-rouge">rmcp</code> crate — 12 tools via stdio transport. Code audit: removed dead code, fixed clippy warnings, added 4 new MCP tools. Manual QA of MCP implementation.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do the MCP server, start planning
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>put the plan into filament as tasks
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do a code review and try to also fix warnings given by clippy that got bypassed with allows
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do some manual testing of the mcp implementation
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> Making fast progress but doing more reviews and manual QA as needed.</p>
</blockquote>

<h3 id="sessions-2528-major-refactoring--slugs--entity-adt-mar-3-20002130">Sessions 25–28: Major Refactoring — Slugs + Entity ADT (Mar 3 ~20:00–21:30)</h3>

<p>This was the biggest cross-cutting change. Identified the name collision problem — entities were looked up by name which could overlap. Switched to 8-char base36 slug identity (ADR-019). Then refactored <code class="language-plaintext highlighter-rouge">Entity</code> from a flat struct to a tagged enum (<code class="language-plaintext highlighter-rouge">Task | Module | Service | Agent | Plan | Doc</code>) with typed variants and <code class="language-plaintext highlighter-rouge">TypeMismatch</code> errors for compile-time safety (ADR-020). Then further type-safety improvements, replacing runtime <code class="language-plaintext highlighter-rouge">is_task()</code> checks with pattern matching.</p>
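<p>The flat-struct-to-tagged-enum move can be sketched compactly. The variant payloads and the <code>TypeMismatch</code> error below are simplified guesses at the shape described in ADR-020, not filament’s real definitions.</p>

```rust
// Entity as a tagged enum: each variant carries only the fields that
// make sense for it, so "a Doc with a priority" cannot be constructed.
#[derive(Debug, PartialEq)]
enum EntityError {
    TypeMismatch { expected: &'static str, got: &'static str },
}

#[derive(Debug)]
enum Entity {
    Task { title: String, priority: u8 },
    Doc { title: String, path: String },
    Agent { name: String },
}

impl Entity {
    fn kind(&self) -> &'static str {
        match self {
            Entity::Task { .. } => "task",
            Entity::Doc { .. } => "doc",
            Entity::Agent { .. } => "agent",
        }
    }

    // Instead of a runtime is_task() flag check, callers get either the
    // task payload or a typed error naming the mismatch.
    fn as_task(&self) -> Result<(&str, u8), EntityError> {
        match self {
            Entity::Task { title, priority } => Ok((title.as_str(), *priority)),
            other => Err(EntityError::TypeMismatch {
                expected: "task",
                got: other.kind(),
            }),
        }
    }
}

fn main() {
    let task = Entity::Task { title: "ship phase 4".into(), priority: 1 };
    let doc = Entity::Doc { title: "ADR-020".into(), path: "docs/adr-020.md".into() };
    println!("{:?} is a {}", task.as_task(), doc.kind());
}
```

<p>The compiler now forces every call site to handle the non-task case, which is exactly the class of bug the refactor was chasing.</p>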

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the entities are related/found using names which I think could overlap. beads_rust uses a randomly generated slug to identify and match
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: Refactor: Slug-Based Identity + Entity ADT...
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>we did a big ADT refactoring all over, let's do an analysis of the codebase on similar issues
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the features require task and knowledge graph management both but it seems that it only resolves to task and agent, is this correct?
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I knew this part would be massively breaking and would set back progress, but I didn’t want problems in the future, so I did it. It meant I had to stop using filament to self-track filament (easier and faster to just trash everything). This is, I think, where the bigger cracks started to show in my workflow. I was going fast and not reviewing the code well because the amount of code was overwhelming. Maybe I could have prompted better or provided better context.</p>
</blockquote>

<h3 id="sessions-2931-phase-4--agent-dispatching-mar-3-21302230">Sessions 29–31: Phase 4 — Agent Dispatching (Mar 3 ~21:30–22:30)</h3>

<p>Implemented the dispatch engine: spawn subprocess via <code class="language-plaintext highlighter-rouge">std::process</code>, monitor via <code class="language-plaintext highlighter-rouge">tokio::spawn</code>, parse <code class="language-plaintext highlighter-rouge">AgentResult</code> JSON, route messages, death cleanup (revert task, release reservations, refresh graph). Agent roles: Coder, Reviewer, Planner, Dockeeper with compiled-in prompts and tool whitelists. 23 new tests.</p>
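<p>A hedged sketch of the dispatch lifecycle: spawn an agent as a subprocess, wait for it, and trigger cleanup when it dies abnormally. The real engine monitors asynchronously via tokio; this sequential version only illustrates the flow, and the command, outcome type, and cleanup comments are placeholders of mine.</p>

```rust
use std::process::Command;

#[derive(Debug, PartialEq)]
enum AgentOutcome {
    Finished(String),  // stdout of a successful run (would be AgentResult JSON)
    Died(Option<i32>), // exit code of a failed run, if any
}

// Spawn the agent process, wait for it, and classify the result.
fn dispatch(program: &str, args: &[&str]) -> std::io::Result<AgentOutcome> {
    let output = Command::new(program).args(args).output()?;
    if output.status.success() {
        Ok(AgentOutcome::Finished(
            String::from_utf8_lossy(&output.stdout).trim_end().to_string(),
        ))
    } else {
        // Death cleanup would go here: revert the task, release file
        // reservations, refresh the graph.
        Ok(AgentOutcome::Died(output.status.code()))
    }
}

fn main() -> std::io::Result<()> {
    // "echo" stands in for launching a real agent process.
    match dispatch("echo", &[r#"{"status":"done"}"#])? {
        AgentOutcome::Finished(out) => println!("agent result: {}", out),
        AgentOutcome::Died(code) => println!("agent died with {:?}", code),
    }
    Ok(())
}
```

<p>The useful property is that a dead agent is just another enum variant: the caller cannot forget to handle it.</p>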

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do phase 4 and before we start, let's make sure the docs are up to date
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: Phase 4: Agent Dispatching...
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do the next priorities that are in MEMORY.md
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I was getting tired but pushing through. I let Claude plan and just do it, and decided to catch bugs later. At this point, I was barely reading the code. A mistake, but a deliberate one, I guess. I wanted to go fast. That was the purpose of this experiment/project.</p>
</blockquote>

<h3 id="session-32-p1-bug-deep-dive-mar-3-2300">Session 32: P1 Bug Deep Dive (Mar 3 ~23:00)</h3>

<p>The most complex debugging session. The dispatch engine had a child-reaping race condition: when the server batched multiple agent dispatches, child processes could be reaped by the wrong handler. The fix: keep <code class="language-plaintext highlighter-rouge">std::process</code> but remove server-side batch dispatch; the CLI <code class="language-plaintext highlighter-rouge">dispatch-all</code> now loops over individual <code class="language-plaintext highlighter-rouge">dispatch_agent</code> RPCs instead.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's fix the P1 bug and before you do, explain to me the dispatch code logic/overall structure and how the P1 bug occurs in detail
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> This part was genuinely technically challenging because multithreading/multiprocessing is not an area I am well versed in. I made an executive decision to just do things sequentially instead of chasing this. I will have to sit down and study this part in detail in the future (I hope).</p>
</blockquote>

<h3 id="sessions-3334-phase-5--tui-mar-3-23102330">Sessions 33–34: Phase 5 — TUI (Mar 3 ~23:10–23:30)</h3>

<p>Implemented ratatui-based TUI with task, agent, and reservation views. 7 TUI tests. Fastest phase to implement.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do phase 5
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: Phase 5: TUI Implementation Plan...
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> Final push to get shit done.</p>
</blockquote>

<h3 id="sessions-3538-code-reviews--bug-fixes-mar-3-2340--mar-4-0200">Sessions 35–38: Code Reviews &amp; Bug Fixes (Mar 3 23:40 – Mar 4 02:00)</h3>

<p>Phase 5 code review fixed N+1 query patterns in both CLI and TUI. Created <code class="language-plaintext highlighter-rouge">batch_get_entities</code> API in core to eliminate them. Then a comprehensive code smell analysis: where to use ADTs/value objects to make illegal states unrepresentable, where to simplify code. Created a 15-task code review plan. Completed all items: type-strengthened DTOs, CLI args, dedup utils, broke a bidirectional dependency.</p>
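<p>The N+1 fix is worth a sketch: instead of one lookup per id (what the CLI and TUI were doing), fetch all ids in a single batch call. <code>batch_get_entities</code> is the name from this session, but this in-memory store with a round-trip counter is my stand-in for the real SQLite-backed one.</p>

```rust
use std::collections::HashMap;

struct Store {
    entities: HashMap<u64, String>,
    queries_run: u32, // counts round trips, to show the difference
}

impl Store {
    fn get_entity(&mut self, id: u64) -> Option<String> {
        self.queries_run += 1; // one query per call: the N+1 pattern
        self.entities.get(&id).cloned()
    }

    fn batch_get_entities(&mut self, ids: &[u64]) -> HashMap<u64, String> {
        self.queries_run += 1; // a single query covers the whole batch
        ids.iter()
            .filter_map(|id| self.entities.get(id).map(|e| (*id, e.clone())))
            .collect()
    }
}

fn main() {
    let mut store = Store {
        entities: (1..=3).map(|i| (i, format!("task-{i}"))).collect(),
        queries_run: 0,
    };

    // N+1 style: three round trips for three ids.
    for id in 1..=3 {
        let _ = store.get_entity(id);
    }
    let n_plus_one = store.queries_run;

    store.queries_run = 0;
    // Batched: one round trip for the same three ids.
    let batch = store.batch_get_entities(&[1, 2, 3]);
    println!("n+1: {} queries, batch: {} query, {} rows", n_plus_one, store.queries_run, batch.len());
}
```

<p>In a list view rendering N rows, this turns N+1 database round trips into two, which is the whole point of pushing the batch API down into core.</p>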

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to see if there are any more inefficient SQL usage patterns in the UI (cli and tui) and I want them fixed
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>where can we use ADTs/value objects to make illegal states unrepresentable?
where can we make the code simpler?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fix the bugs and code smells from session34 in .plan
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I decided to do a review of the code with Claude. I was quite tired and honestly could not read everything myself. Instead, I directed Claude to look for things that would be code smells and bugs and got Claude to fix them. This made me think deeply about “what is good code?” and “how do I properly balance speed and quality in software development?”</p>
</blockquote>

<h3 id="koupang-interlude-2-dev-rules--autonomy-discussion-mar-4-0110">Koupang Interlude 2: Dev Rules &amp; Autonomy Discussion (Mar 4 ~01:10)</h3>

<p>Switched back to Koupang again late at night. Added development rules to Koupang’s CLAUDE.md and had a discussion about code style preferences — god functions vs overly fragmented code. Then explored what additional subagents/skills/MCP tools would help Claude be more autonomous. Did some housekeeping on context-affecting <code class="language-plaintext highlighter-rouge">.md</code> files.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I have added some development rules into CLAUDE.md. Let's have a discussion about them
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for a god function/class I think of a function/class where I have to scroll thousands of lines of code. for overly fragmented functions/classes I think of a function/class I have to jump around between dozens of files...
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do some more guides to help you be more autonomous if needed, what other subagents/skills/mcp whatever would be useful. 1. in general? 2. for this project?
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>should we do some housekeeping on the context affecting .md files?
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> After thinking deeply about software and guidelines, I decided to go back to Koupang and put them in the CLAUDE.md. It forced me to introspect and clearly communicate what I want out of software, which I think was good. This also made it clear to me why I had resisted simply adopting other people’s LLM workflows: I didn’t fully know what those people valued or what I would be feeding the LLM. I wanted full control and customization.</p>
</blockquote>

<h3 id="sessions-3940-readme-license--aggressive-qa-mar-4-10001100">Sessions 39–40: README, License &amp; Aggressive QA (Mar 4 ~10:00–11:00)</h3>

<p>Wrote a comprehensive README with installation and usage guide. Added MIT license and an inspiration section crediting beads_rust and Flywheel. Planned aggressive QA rounds targeting concurrency, state corruption, and edge cases. Pushed to GitHub.</p>

<details>
  <summary>Prompts</summary>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I need you to write a comprehensive readme on installation instructions and how to use all of the features
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>added MIT license, also put in section about inspirations. I was directly inspired by beads_rust and flywheel but wanted one tool in rust that did everything for me
</code></pre></div>  </div>

  <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want to do some aggressive QA where the goal is to not conservatively test but break stuff
</code></pre></div>  </div>

</details>

<blockquote>
  <p><strong>My take:</strong> I still don’t think I caught as many bugs as I hoped. I kind of expected this, but the goal was to go fast and find the problems within my workflow, and I think I achieved that. As I write this blog post, I am conducting aggressive QA with Claude on the filament tool. Overall, this was a productive experiment. It showed me where the limitations of my LLM-assisted workflows are and made me think about what I consider good software (a deeply contentious topic in the industry). It also reminded me that you should challenge yourself more often, because it’s easy to fall into patterns.</p>
</blockquote>

<h2 id="observations">Observations</h2>

<ol>
  <li>Reducing reviews led to more bugs and worse code. No surprise there. I also lost some understanding of the later parts of the code. Those can be rectified.</li>
  <li>The speed felt sustainable at first but as things progressed, it did not. I was in a “vibe coding” (ugh, I hate that phrase) stupor and was just focused on shipping like I was at a super early stage startup.</li>
  <li>I don’t think I found the right balance in speed and review with this single agent setup. Maybe I will discover it later.</li>
  <li>“Good code” is whatever you (or a team if you are working in a team) find comfortable to work in I think. I think my thoughts on good code will change as I progress in my career and meet more people and work on different projects. Right now, I am doing things alone and am sticking to what I am comfortable with and following principles I agree with (“A Philosophy of Software Design” is my preferred overall software guide).</li>
  <li>Claude is definitely better than me at catching small, subtle bugs, while I am better at spotting bigger code smells and setting code direction. I should try to get better at what Claude is doing (catching small bugs), but that could be difficult considering the amount of code produced. I’ve noticed that big PRs get a quick LGTM! from reviewers, and yeah, that ain’t gonna change soon.</li>
  <li>Going fast on Filament made me appreciate the slower, more deliberate pace of Koupang. Working on Koupang felt like proper engineering while Filament felt like a startup. I can do both, but I learned that LLMs just amplify existing tendencies. If you go fast and break things, LLMs will let you go even faster and break even more things. If you go slow and deliberate, LLMs will speed you up while helping you be more deliberate. It really is all up to the user.</li>
</ol>

<h2 id="whats-next">What’s next</h2>

<p>I will finish Filament and maybe add a couple more features on the TUI side, then I will return to Koupang (I was implementing the outbox, which is quite crucial, before I got distracted). Stay tuned.</p>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[In Part 4, I finished planning for the big order orchestration saga and then started implementing. I am in the middle of implementing it but I decided to go on a little side quest. I was getting complacent and wanted to shake things up.]]></summary></entry><entry><title type="html">Current LLM Workflow Setup</title><link href="https://jyc11.github.io/blog/2026/02/26/current-LLM-workflow-setup" rel="alternate" type="text/html" title="Current LLM Workflow Setup" /><published>2026-02-26T00:00:00+00:00</published><updated>2026-02-26T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/02/26/current-LLM-workflow-setup</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/02/26/current-LLM-workflow-setup"><![CDATA[<p>A snapshot of my current setup for LLM-assisted development, as of February 2026. This is what I use to build <a href="https://github.com/JYC11/koupang">Koupang</a>, the Rust e-commerce project documented in the <a href="/blog/2026/02/23/getting-gud-at-llms-pt1.html">Getting Gud at LLMs</a> series.</p>

<p>I got Claude to look at its own configuration, write a summary of that configuration, and format the post nicely. Like pt3, I peppered in my commentary in block quotes. I thought this could be a separate post on its own as it is quite meta and mostly informational.</p>

<hr />

<h2 id="environment">Environment</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Terminal</strong></td>
      <td>Ghostty</td>
    </tr>
    <tr>
      <td><strong>IDE</strong></td>
      <td>IntelliJ (Rust development), Zed (blogging)</td>
    </tr>
    <tr>
      <td><strong>LLM Tool</strong></td>
      <td>Claude Code CLI (Claude Opus)</td>
    </tr>
    <tr>
      <td><strong>Task Management</strong></td>
      <td><a href="https://github.com/Dicklesworthstone/beads_rust">beads_rust (br)</a></td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="claude-code-configuration">Claude Code Configuration</h2>

<h3 id="plugins">Plugins</h3>

<table>
  <thead>
    <tr>
      <th>Plugin</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>rust-skills</strong></td>
      <td>Rust-specific guidance — ownership, concurrency, error handling, domain patterns, crate research, daily news</td>
    </tr>
    <tr>
      <td><strong>rust-analyzer-lsp</strong></td>
      <td>LSP integration for go-to-definition, find references, symbol analysis</td>
    </tr>
  </tbody>
</table>

<h3 id="skills">Skills</h3>

<p>Custom skills loaded from <code class="language-plaintext highlighter-rouge">~/.claude/skills/</code>:</p>

<table>
  <thead>
    <tr>
      <th>Skill</th>
      <th>Trigger</th>
      <th>What it does</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>br</strong></td>
      <td><code class="language-plaintext highlighter-rouge">/br</code>, mentions of tasks/issues/backlog</td>
      <td>Task lifecycle management via beads_rust CLI — create, query, update, close, dependency tracking</td>
    </tr>
    <tr>
      <td><strong>project-context</strong></td>
      <td><code class="language-plaintext highlighter-rouge">/project-context</code>, session start</td>
      <td>Reads CLAUDE.md files for onboarding; updates them after significant changes</td>
    </tr>
    <tr>
      <td><strong>skill-creator</strong></td>
      <td>Creating new skills</td>
      <td>Guide for writing effective Claude Code skills</td>
    </tr>
    <tr>
      <td><strong>bd-to-br-migration</strong></td>
      <td>Migrating from beads to beads_rust</td>
      <td>Command mapping and migration patterns from <code class="language-plaintext highlighter-rouge">bd</code> to <code class="language-plaintext highlighter-rouge">br</code></td>
    </tr>
  </tbody>
</table>

<h3 id="hooks">Hooks</h3>

<p>One hook configured on <code class="language-plaintext highlighter-rouge">UserPromptSubmit</code>:</p>

<p><strong>Prompt logger</strong> (<code class="language-plaintext highlighter-rouge">log-prompt.sh</code>) — automatically captures every user prompt to a daily session log file (<code class="language-plaintext highlighter-rouge">session-log-YYYY-MM-DD.md</code>). Only activates within the Koupang project directory. Filters out system/command messages and empty prompts. These logs feed directly into the blog posts — it’s how I track session counts and reproduce exact prompts.</p>

<blockquote>
  <p><strong>My take</strong>: I am quite surprised I need so few skills, plugins and hooks to be this productive. I may try to branch out further with more stuff to see if I’m missing out. I’m also very satisfied with having Claude generate its own summaries, because this saves a lot of time when writing posts where the content is mostly factual and self-documenting. Making Claude write stuff that it will later use kinda reminds me of metaprogramming. If LLMs <em>were</em> a deterministic programming language, I think it would be a combination of Ada (the “English”-like syntax) and Lisp (a language capable of a lot of metaprogramming).</p>
</blockquote>

<h3 id="permissions">Permissions</h3>

<p>Explicitly allowed (no confirmation needed):</p>

<ul>
  <li><strong>Read-only shell</strong>: <code class="language-plaintext highlighter-rouge">ls</code>, <code class="language-plaintext highlighter-rouge">cat</code>, <code class="language-plaintext highlighter-rouge">find</code>, <code class="language-plaintext highlighter-rouge">grep</code>, <code class="language-plaintext highlighter-rouge">tree</code>, <code class="language-plaintext highlighter-rouge">git status/log/diff/show/tag</code>, etc.</li>
  <li><strong>Git write</strong>: <code class="language-plaintext highlighter-rouge">git add</code>, <code class="language-plaintext highlighter-rouge">commit</code>, <code class="language-plaintext highlighter-rouge">checkout</code>, <code class="language-plaintext highlighter-rouge">merge</code>, <code class="language-plaintext highlighter-rouge">rebase</code>, <code class="language-plaintext highlighter-rouge">push</code></li>
  <li><strong>Cargo</strong>: <code class="language-plaintext highlighter-rouge">check</code>, <code class="language-plaintext highlighter-rouge">build</code>, <code class="language-plaintext highlighter-rouge">test</code>, <code class="language-plaintext highlighter-rouge">clippy</code>, <code class="language-plaintext highlighter-rouge">fmt</code>, <code class="language-plaintext highlighter-rouge">run</code>, <code class="language-plaintext highlighter-rouge">add</code>, <code class="language-plaintext highlighter-rouge">doc</code></li>
  <li><strong>Docker</strong>: <code class="language-plaintext highlighter-rouge">docker compose up/down/ps/logs</code></li>
  <li><strong>Make</strong>: all <code class="language-plaintext highlighter-rouge">make</code> targets</li>
</ul>

<p>Explicitly denied:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rm</code>, <code class="language-plaintext highlighter-rouge">sudo</code>, <code class="language-plaintext highlighter-rouge">curl</code>, <code class="language-plaintext highlighter-rouge">wget</code>, <code class="language-plaintext highlighter-rouge">chmod</code>, <code class="language-plaintext highlighter-rouge">chown</code>, <code class="language-plaintext highlighter-rouge">kill</code></li>
  <li><code class="language-plaintext highlighter-rouge">git push --force</code>, <code class="language-plaintext highlighter-rouge">git reset --hard</code>, <code class="language-plaintext highlighter-rouge">git clean -f</code></li>
  <li><code class="language-plaintext highlighter-rouge">docker rm</code>, <code class="language-plaintext highlighter-rouge">docker rmi</code></li>
  <li><code class="language-plaintext highlighter-rouge">WebSearch</code>, <code class="language-plaintext highlighter-rouge">WebFetch</code></li>
</ul>

<p>Everything else prompts for confirmation.</p>

<blockquote>
  <p><strong>My take</strong>: “Supposedly” allowed, but I still have to manually approve some of the “safe” commands. Gotta figure out how to configure the allow list properly.</p>
</blockquote>

<h3 id="claudemd-files">CLAUDE.md Files</h3>

<p>The project uses hierarchical CLAUDE.md files:</p>

<ul>
  <li><strong>Root</strong> (<code class="language-plaintext highlighter-rouge">koupang/CLAUDE.md</code>) — workspace structure, tech stack, ADR summary, key imports, scripts</li>
  <li><strong>Per-service</strong> (<code class="language-plaintext highlighter-rouge">identity/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">catalog/CLAUDE.md</code>, <code class="language-plaintext highlighter-rouge">shared/CLAUDE.md</code>) — detailed architecture, endpoints, domain models, test structure</li>
  <li><strong>Reference docs</strong> (<code class="language-plaintext highlighter-rouge">.plan/</code>) — bootstrap recipe, code patterns, test standards (loaded on-demand, not auto-loaded)</li>
</ul>

<p>These are the primary onboarding mechanism — a new session reads them first and skips redundant exploration.</p>

<blockquote>
  <p><strong>My take</strong>: Sometimes Claude just reads all of the CLAUDE.md files and all the plan files when it loads up. Seems inconsistent, but I’m doing my best to prevent context bloat.</p>
</blockquote>

<hr />

<h2 id="development-cycle">Development Cycle</h2>

<p>This is the general loop I follow for each feature:</p>

<ol>
  <li><strong>Plan</strong> — enter plan mode, let Claude explore the codebase, design the approach</li>
  <li><strong>Iterate on the plan</strong> — review, ask questions, refine until the plan is solid</li>
  <li><strong>Put the plan into beads</strong> — create <code class="language-plaintext highlighter-rouge">br</code> tasks with priorities and dependencies so Claude has a structured work queue</li>
  <li><strong>Generate code</strong> — Claude works through tasks, launching subagents for parallel work when possible</li>
  <li><strong>Review code</strong> — read through what was generated, check for scope creep and pattern violations</li>
  <li><strong>Commit code</strong> — stage and commit with meaningful messages</li>
  <li><strong>Clean up</strong> — second pass to catch anything missed: redundant code, missing tests, stale docs</li>
</ol>

<h3 id="what-makes-this-work">What makes this work</h3>

<ul>
  <li><strong>CLAUDE.md files</strong> keep context cheap. A new session doesn’t waste 10 prompts re-discovering the architecture.</li>
  <li><strong>beads_rust</strong> gives structure to multi-step work. Instead of one giant prompt, break it into tasks with dependencies and let Claude work through them.</li>
  <li><strong>The prompt logger</strong> means I never lose track of what I asked. Blog posts practically write themselves from the logs.</li>
  <li><strong>Strict permissions</strong> prevent accidents. No force-pushes, no <code class="language-plaintext highlighter-rouge">rm</code>, no silent <code class="language-plaintext highlighter-rouge">curl</code> calls. Everything destructive requires confirmation.</li>
  <li><strong>Plan mode first</strong> prevents wasted work. Getting alignment on the approach before writing code is always worth the extra 5 minutes.</li>
</ul>

<blockquote>
  <p><strong>My take</strong>: I am currently very happy with this workflow and can see myself using it in the future and in jobs.</p>
</blockquote>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[A snapshot of my current setup for LLM-assisted development, as of February 2026. This is what I use to build Koupang, the Rust e-commerce project documented in the Getting Gud at LLMs series.]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt4</title><link href="https://jyc11.github.io/blog/2026/02/26/getting-gud-at-llms-pt4" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt4" /><published>2026-02-26T00:00:00+00:00</published><updated>2026-02-26T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/02/26/getting-gud-at-llms-pt4</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/02/26/getting-gud-at-llms-pt4"><![CDATA[<p>In <a href="/blog/2026/02/25/getting-gud-at-llms-pt3.html">Part 3</a>, I finished the catalog service, compacted CLAUDE.md files, and kicked off planning for the order/payment phase. Today was less about writing code and more about optimizing what exists and planning what comes next. I also wrote a separate post about my <a href="/blog/2026/02/26/current-LLM-workflow-setup.html">current LLM workflow setup</a> if you’re curious about the tooling side. I also write about the direction of blog posts since I got to thinking about what was the end goal.</p>

<hr />

<h2 id="direction-of-blog-posts">Direction of blog posts</h2>

<p>The blog posts so far cover 3 topics:</p>

<ul>
  <li>using LLMs</li>
  <li>the Koupang project</li>
  <li>my existential dread + musings about the software industry</li>
</ul>

<h3 id="git-gud">Git Gud</h3>

<p>My LLM usage stabilized once I figured out planning, beads and context management. According to Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">Gas Town</a>, I am somewhere around level 5. Personally, I haven’t reached the stage in the codebase where multiple agents are justifiably needed. But when I have multiple microservices developed and many features to develop at once (I have started adding more backlog tasks besides this order saga to fill things up), I will start experimenting with multi-agent stuff. I am still a big proponent of human-in-the-loop, and I can still see areas where the LLM messes up, so I need to challenge that part as well. I want to see if I can create nearly comprehensive guidelines that many LLMs can follow so that I can grow to “trust” their output. Trust is a funny word considering LLMs are just fancy matrix multiplication calculators.</p>

<p>When I reach maybe level 7/8 (or whatever the highest possible level I can reach) from that Gas Town post, I think that will be a nice place to end this series.</p>

<h3 id="koupang">Koupang</h3>

<p>On the topic of the project itself, I decided that the MVP will be when the full order cycle is completed and is in a deployable state. Once that is done, the functional and non-functional challenges will become more difficult.</p>

<ul>
  <li>On the functional side, I will try to leverage multi-agents to develop many features at once.</li>
  <li>On the non-functional side, I will also try to leverage multi-agents to make this service ready for “production” grade traffic and up-time.</li>
</ul>

<p>These updates will be a bit less frequent and will mostly focus on non-functional stuff. The “feature” work can be outsourced to LLMs since a lot of it is mostly CRUD, but on the “tech” side I want to take a bit more time to get it right. I think the posts will be titled “Overengineering Koupang for Fun and Profit pt{n}”.</p>

<h3 id="trauma-dumping-on-main">Trauma dumping on main</h3>

<p>Honestly have no clue about this. May suddenly decide to randomly drop a 10k-word essay or whatever.</p>

<hr />

<h2 id="ai-summary-of-work">AI summary of work</h2>

<p>Same deal as pt3 — I got Claude to summarize the work from its own session logs and git history. My commentary is in block quotes.</p>

<h3 id="by-the-numbers">By the numbers</h3>

<p>5 sessions, ~34 prompts across Feb 25 evening and Feb 26:</p>

<ul>
  <li>4 git commits since <code class="language-plaintext highlighter-rouge">v0.3-catalog-complete</code></li>
  <li>19 files changed, +879 lines added, -2,523 lines removed (net -1,644 lines)</li>
  <li>Tests: 207 → 235 (deduped many redundant tests, added shared module tests)</li>
  <li>4 plan files revised with inline comments</li>
  <li>35 beads tasks created with full dependency DAG</li>
  <li>1 new doc: <code class="language-plaintext highlighter-rouge">.plan/test-standards.md</code></li>
</ul>

<p>Git tag: <code class="language-plaintext highlighter-rouge">v0.4-test-optimization-and-planning</code></p>

<p>Most of the work fell into two categories: test optimization and order/payment planning.</p>

<hr />

<h3 id="test-optimization-2-sessions-13-prompts">Test optimization (2 sessions, ~13 prompts)</h3>

<p>Analyzed the test suites across identity and catalog and found significant redundancy. The same CRUD operations were being tested at every layer (router, service, repository) with full end-to-end Postgres containers each time.</p>

<p>Key changes:</p>

<ul>
  <li><strong>Shared container infrastructure</strong> — instead of spinning up a separate Postgres/Redis container per test, containers are now shared per test binary. This cut test setup overhead significantly.</li>
  <li><strong>Test deduplication</strong> — removed 107 redundant tests across identity and catalog (identity 115 → 82, catalog 209 → 135). Coverage maintained — the removed tests were asserting the same behavior at multiple layers.</li>
  <li><strong>Extracted test helpers to shared</strong> — auth fixtures (<code class="language-plaintext highlighter-rouge">seller_user()</code>, <code class="language-plaintext highlighter-rouge">admin_user()</code>), HTTP request builders (<code class="language-plaintext highlighter-rouge">authed_json_request</code>, <code class="language-plaintext highlighter-rouge">authed_get</code>), and pagination unit tests all moved to the shared module so every future service gets them for free.</li>
  <li><strong>Test standards doc</strong> — created <code class="language-plaintext highlighter-rouge">.plan/test-standards.md</code> defining what each test layer should cover, preventing redundancy from creeping back in.</li>
</ul>

<p>Net result: -2,523 lines of test code removed, +879 added (mostly shared infrastructure), and the test suite runs faster.</p>
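<p>The per-binary sharing can be sketched with <code class="language-plaintext highlighter-rouge">std::sync::OnceLock</code> (the setup function below is a stand-in; the real code starts a Postgres container via testcontainers): the first test to ask for the handle pays the startup cost, and every later test in the same binary reuses it.</p>

```rust
use std::sync::OnceLock;

// Stand-in for a real container handle; in the actual test suite this
// would hold the running Postgres container and its connection URL.
#[derive(Debug)]
struct DbHandle {
    url: String,
}

static DB: OnceLock<DbHandle> = OnceLock::new();

/// Returns the shared handle, initializing it exactly once per test binary.
fn shared_db() -> &'static DbHandle {
    DB.get_or_init(|| {
        // Imagine spinning up Postgres in Docker here; with OnceLock this
        // closure runs only once no matter how many tests call shared_db().
        DbHandle {
            url: "postgres://localhost:5432/test".into(),
        }
    })
}
```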

<blockquote>
  <p><strong>My take:</strong> This was necessary and quite quickly done. It would have taken me ages to type otherwise, so I’m glad I got the LLM to do it. It emphasizes the importance of actually reading what the LLM generates. I made the AI create a <code class="language-plaintext highlighter-rouge">test-standards.md</code> file so that it can refer back to these standards repeatedly and prevent this kind of redundancy from creeping back in.</p>
</blockquote>

<hr />

<h3 id="orderpayment-mega-planning-3-sessions-21-prompts">Order/payment mega-planning (3 sessions, ~21 prompts)</h3>

<p>This was the big one. The order/payment phase touches multiple services and needs careful coordination. The planning was split across multiple sessions:</p>

<p><strong>Session 1 — Initial 4-plan structure (from pt3)</strong>
Claude explored the entire codebase and created 4 detailed plan files:</p>

<ol>
  <li><strong>Shared infrastructure</strong> — Kafka KRaft, transactional outbox (<code class="language-plaintext highlighter-rouge">outbox-core</code>), event system with <code class="language-plaintext highlighter-rouge">rdkafka</code>, distributed tracing with Jaeger</li>
  <li><strong>Cart service</strong> — Redis-only, 6 endpoints, 30-day TTL</li>
  <li><strong>Order + Payment</strong> — choreography saga, state machines, mock payment gateway, inventory reservation, compensation flows</li>
  <li><strong>Workflow docs</strong> — ADRs 010-013, CLAUDE.md files, saga flow documentation</li>
</ol>

<p><strong>Session 2 — Plan review with my comments</strong>
I left inline comments on plans 1-3 (titled “Comment on [relevant part]”), then walked through each one with Claude to revise:</p>

<ul>
  <li>Plan 1: Added <code class="language-plaintext highlighter-rouge">ServiceBuilder</code> pattern, typed event enums, DLQ topics, programmatic Kafka topic creation</li>
  <li>Plan 2: Changed cart to display-only totals, added <code class="language-plaintext highlighter-rouge">/validate</code> endpoint, seller order endpoint</li>
  <li>Plan 3: Added <code class="language-plaintext highlighter-rouge">PaymentTimedOut</code> handling, <code class="language-plaintext highlighter-rouge">sku_availability</code> view</li>
</ul>
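<p>The typed event enums from the plan 1 revision look roughly like this (a sketch with made-up variant names, not the real event set): the exhaustive <code class="language-plaintext highlighter-rouge">match</code> means adding a new event variant forces every consumer to handle it before the code compiles.</p>

```rust
// Sketch of typed events: instead of stringly-typed topics and payloads,
// each event is an enum variant, so the compiler knows the full event set.
#[derive(Debug, Clone, PartialEq)]
enum OrderEvent {
    OrderPlaced { order_id: u64, total_cents: i64 },
    PaymentTimedOut { order_id: u64 },
}

/// Maps each event to its Kafka topic. The match is exhaustive: adding a
/// variant to OrderEvent breaks the build here until a topic is chosen.
fn topic_for(event: &OrderEvent) -> &'static str {
    match event {
        OrderEvent::OrderPlaced { .. } => "order.placed",
        OrderEvent::PaymentTimedOut { .. } => "order.payment_timed_out",
    }
}
```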

<p><strong>Session 3 — Double-entry accounting discussion</strong>
I pushed for a double-entry accounting ledger in the payment service, inspired by <a href="https://news.alvaroduran.com/p/engineers-do-not-get-to-make-startup">Alvaro Duran’s article</a> about big tech companies begrudgingly building their own double-entry payment ledgers. Claude revised plan 3 to adopt this approach and added a note about platform commission (out of scope for now but will affect the ledger design).</p>

<p>After the plans were finalized, Claude created 35 beads tasks with a full dependency DAG across all 4 plans, plus 3 MVP milestone tasks (docker-compose deployment, seed script, API walkthrough) and a standalone non-functional requirements task for high traffic/uptime planning.</p>

<blockquote>
  <p><strong>My take:</strong> Still very planning heavy, and I could have decided to research way more before going through amendments, but I think I need to execute and learn through pain and suffering. The double-entry accounting stuff was my call because I know from experience that money records need a special kind of care/domain knowledge. I predict the Kafka stuff will bite me in the ass but oh well.</p>
</blockquote>

<h2 id="whats-next">What’s next</h2>

<p>The 4 plans are reviewed and waiting for implementation. Next time, I’ll try to push my current setup to its limits by handling this huge set of requirements with me managing 1 agent.</p>

<p>Here’s the full beads dependency tree — this is what Claude will work through:
(I got Claude to use the beads_rust skill to read from beads_rust and format this. This is so very convenient lmao)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PLAN 1: SHARED INFRASTRUCTURE (foundation for everything)
──────────────────────────────────────────────────────────
bd-na8  Event types + typed enums (EventType, AggregateType, EventEnvelope)
├── bd-1j2  KafkaEventPublisher (rdkafka) ← also needs bd-3ga
│   ├── bd-3sv  KafkaEventConsumer with DLQ support
│   │   ├──→ [Plan 3: Order schema, Payment schema, Catalog inventory]
│   │   └── bd-2v4  ADR-010 Event-driven architecture ← also needs bd-1ee, bd-5y4
│   └── bd-27z  Kafka health check
├── bd-5y4  ServiceBuilder composable bootstrap
│   ├──→ [Plan 2: Cart value objects]
│   └──→ bd-2v4 (see above)
└── bd-1fo  MockEventPublisher (test infra)

bd-3ga  Docker compose additions (Kafka KRaft, Kafka UI, Jaeger)
├──→ bd-1j2 (see above)
└── bd-2g4  Programmatic topic creation (AdminClient)

bd-3ej  Research outbox-core crate API compatibility
└── bd-1ee  Outbox integration via outbox-core + migration templates
    ├──→ [Plan 3: Order schema, Payment schema]
    └──→ bd-2v4 (see above)

bd-8fc  Distributed tracing OTLP exporter (independent)


PLAN 2: CART SERVICE (Redis-only)
─────────────────────────────────
bd-337  Cart value objects (Quantity, PriceSnapshot, Currency) ← blocked by bd-5y4
└── bd-2aq  Cart domain model (CartItem, Cart) + Redis data model
    ├── bd-1tw  Cart DTOs (request, validated, response) ──┐
    └── bd-3sf  Cart repository (Redis ops) + tests ───────┤
                                                           ▼
                                              bd-1ra  Cart service + tests
                                              └── bd-aqs  Cart routes + router tests
                                                  └── bd-1p7  Cart bootstrap (lib.rs, main.rs)
                                                      └── bd-305  cart/CLAUDE.md


PLAN 3: ORDER + PAYMENT + INVENTORY
────────────────────────────────────
Order chain:                              Payment chain:
bd-32f  Order schema ← bd-1ee, bd-3sv    bd-1tp  Payment double-entry schema ← bd-1ee, bd-3sv
└── bd-lwv  Order value objects           ├── bd-2cj  Payment gateway trait + mock
    └── bd-186  Order repository          └── bd-1te  Payment repository
        └── bd-2ac  Order service             └── bd-1rk  Payment service ← needs both
            └── bd-3pu  Order routes              └── bd-3of  Payment routes
                └── bd-a1p  Order outbox
                    └── bd-1a3  Order Kafka consumers

Inventory chain:
bd-1yx  Catalog inventory migration ← bd-3sv
└── bd-xsn  Inventory service + repository
    └── bd-2h3  Inventory Kafka consumer

              ┌─── bd-1a3 (order) ──────┐
All three ──► │    bd-3of (payment) ────►├──► bd-b02  Wire Kafka consumers in all main.rs
              └─── bd-2h3 (inventory) ──┘
                                         └── bd-1zd  Saga integration tests
                                             ├── bd-jp3  order/payment CLAUDE.md
                                             └──► MVP track (below)


PLAN 4: DOCS
─────────────
bd-jp3  order/CLAUDE.md + payment/CLAUDE.md ──┐
bd-305  cart/CLAUDE.md ────────────────────────┤
                                              ▼
                              bd-1jp  ADRs 010-014
                              bd-m7m  Saga flow docs ← bd-jp3
                              └── bd-2ne  Progress summary pt3


MVP MILESTONES
──────────────
bd-1zd  Saga integration tests
└── bd-32d  MVP: Docker Compose (all services + infra)
    └── bd-o7r  MVP: Seed data script
        └── bd-9mm  MVP: API walkthrough / Postman collection


BACKLOG (independent, no blockers)
──────────────────────────────────
P2: bd-2yx  Redis caching for product reads → bd-2sh Search engine planning
P3: bd-1yk  Bulk product/SKU CSV processing
P3: bd-3jn  Brands list keyset pagination
P3: bd-1dh  Image upload for products/SKUs
P3: bd-v0a  Plan for high traffic and uptime (NFRs)
P3: bd-dsh  Resilient auth (gRPC + Redis cache + circuit breaker)
P3: bd-2kq  Evaluate repository trait pattern for mockable tests
P4: bd-x38  Advertisements service planning
P4: bd-7m8  Discounts/coupons planning
P4: bd-dj9  Evolve domain FK refs to embedded domain objects
</code></pre></div></div>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[In Part 3, I finished the catalog service, compacted CLAUDE.md files, and kicked off planning for the order/payment phase. Today was less about writing code and more about optimizing what exists and planning what comes next. I also wrote a separate post about my current LLM workflow setup if you’re curious about the tooling side. I also write about the direction of blog posts since I got to thinking about what was the end goal.]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt3</title><link href="https://jyc11.github.io/blog/2026/02/25/getting-gud-at-llms-pt3" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt3" /><published>2026-02-25T00:00:00+00:00</published><updated>2026-02-25T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/02/25/getting-gud-at-llms-pt3</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/02/25/getting-gud-at-llms-pt3"><![CDATA[<p>In <a href="/blog/2026/02/24/getting-gud-at-llms-pt2.html">Part 2</a>, I finished the catalog service CRUD, reflected on backend development with LLMs, and predicted that complex cross-service features would be where things get hard. Since then, I’ve been building out the catalog further and planning the next major phase.</p>

<hr />

<p>I realized that the summary of what I did can be automated with Claude Code, so that’s what I did. I’ve read through the summary and can verify it’s accurate. I’ll pepper in my own commentary in between (those sections are in blockquotes). When the AI summary says “I,” it’s Claude writing as me.</p>

<h2 id="by-the-numbers">By the numbers</h2>

<p>About 12 sessions and ~50 user prompts across Feb 24 afternoon and Feb 25:</p>

<ul>
  <li>11 git commits since pt2</li>
  <li>Tests: 209 → 207 (net -2 from deduplicating redundant tests, coverage maintained)</li>
  <li>2 new features: brands + categories with ltree hierarchy, keyset pagination with filters</li>
  <li>4 major plans created for the next phase</li>
  <li>CLAUDE.md files compacted significantly (catalog 250→88, identity 125→68, shared 187→82 lines)</li>
  <li>ADR count: 8 → 9 (added ADR-009 for ltree categories)</li>
</ul>

<p>The work fell into a few categories: building new catalog features, managing LLM context, testing and code quality, and planning for the order/payment phase.</p>

<hr />

<h2 id="building-out-the-catalog">Building out the catalog</h2>

<h3 id="category--brand-planning-3-prompts">Category &amp; brand planning (3 prompts)</h3>

<p>Added two new br tasks for categories and brands. Entered plan mode specifically for categories because:</p>

<ul>
  <li>Tree data structures in a relational DB are tricky (chose Postgres <code class="language-plaintext highlighter-rouge">ltree</code>)</li>
  <li>Brand-category validation was needed (e.g. a car brand can’t appear on food products)</li>
</ul>
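<p>For context, <code class="language-plaintext highlighter-rouge">ltree</code> stores each node’s full materialized path, so descendant queries become prefix matches instead of recursive joins. A rough in-memory Rust analogy of that query shape (the category names are made up):</p>

```rust
// ltree-style paths: "electronics.phones.android" is a descendant of
// "electronics". In Postgres this is `path <@ 'electronics'`; here it is
// just a prefix check (a node counts as its own descendant, as in ltree).
fn is_descendant_of(path: &str, ancestor: &str) -> bool {
    path == ancestor || path.starts_with(&format!("{ancestor}."))
}

/// Filters a set of paths down to the subtree rooted at `ancestor`.
fn descendants<'a>(paths: &'a [&'a str], ancestor: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| is_descendant_of(p, ancestor))
        .collect()
}
```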

<h3 id="brand--category-implementation-7-prompts-across-2-sessions">Brand &amp; category implementation (7 prompts across 2 sessions)</h3>

<p>Executed the plan. Implemented brands CRUD, categories with ltree hierarchy, and a brand-category association table. Asked Claude to work on tasks in parallel. Reminded it to reuse value objects (e.g. <code class="language-plaintext highlighter-rouge">HttpUrl</code> for brand logo URL). Ended the session when context started filling up.</p>

<blockquote>
  <p><strong>My take:</strong> I decided to experiment with two coupled features that have interdependencies with each other and are dependent on the existing product code, to see how Claude handles additions to existing code and manages a set of parallel intertwined tasks. I’m quite happy with how it handled itself once the plan was in place and shoved into beads. I got it to launch subagents to do the work as well when it could, to speed up task completion.</p>
</blockquote>

<h3 id="domain-model-clarification--scope-control-5-prompts">Domain model clarification + scope control (5 prompts)</h3>

<p>Clarified to Claude what I meant by “domain objects” — rich domain models where business logic lives, not just FK validation helpers. Claude suggested building FK traversal (following foreign key references to load related domain objects automatically), which I correctly flagged as ORM scope creep. Pushed it to a P4 backlog “nice to have.”</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>to clarify, this is going into the realm of reimplementing orm features which
dramatically increases the scope of the task/project to unmanageable levels
</code></pre></div></div>

<blockquote>
  <p><strong>My take:</strong> The concept of “correct” code appeals to me greatly due to my work experience. When I worked in not-ideal legacy code environments or with fast-paced schedules, I would often skip writing tests and code as fast as possible. That obviously leads to poor and buggy code. Besides writing tests — which is an entire ordeal itself if the code is sloppy or very legacy — I wanted techniques to write code that <em>can’t</em> be wrong (assuming I have the correct understanding of the requirements).</p>

  <p>This led me to shop around for techniques, which landed me on several topics:</p>

  <ol>
    <li>Functional Core, Imperative Shell</li>
    <li>Making illegal states unrepresentable</li>
  </ol>

  <p>The first concept taught me that code can be split into Logic (CPU-bound tasks or Calculation) and Side Effects (I/O or Data production/consumption). So whenever I look at or think about code, I decompose it into these two broad categories, which lets me reason about how to organize/debug code and choose technology.</p>

  <p>The second concept helped me understand that compilers can encode business logic. I learned that certain language features can be used as guards and as a representation for business logic — that the compiler can be made to encode and “understand” business logic. This was the specific “technique” I was looking for.</p>

  <p>Since LLMs can always make mistakes, I wanted another pre-emptive “layer” of validation besides what the compiler provides. If the LLM accidentally hallucinates something that isn’t just technical and wrong in a business-logic sense, the compiler will be able to catch that as well. This specific case hasn’t happened yet but I’m putting it in there just in case.</p>
</blockquote>
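<p>To make the second concept concrete, here is a minimal Rust sketch (hypothetical types, not actual Koupang code): a zero quantity simply cannot be constructed, and because the order states are a single enum, “refunded but never paid” has no representation at all.</p>

```rust
/// A quantity that can never be zero: the invariant is checked once at
/// construction, so every Quantity in the system is valid by type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Quantity(u32);

impl Quantity {
    fn new(n: u32) -> Result<Self, String> {
        if n == 0 {
            Err("quantity must be at least 1".into())
        } else {
            Ok(Quantity(n))
        }
    }
}

/// Order lifecycle as an enum: each state carries only the data that can
/// exist in that state, so invalid combinations don't compile.
#[derive(Debug)]
enum Order {
    Pending { items: u32 },
    Paid { items: u32, payment_id: String },
    Refunded { payment_id: String },
}

impl Order {
    /// Only a paid order can transition to refunded; any other state is
    /// handed back unchanged as an error.
    fn refund(self) -> Result<Order, Order> {
        match self {
            Order::Paid { payment_id, .. } => Ok(Order::Refunded { payment_id }),
            other => Err(other),
        }
    }
}
```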

<h3 id="keyset-pagination-3-prompts">Keyset pagination (3 prompts)</h3>

<p>Planned and implemented keyset cursor pagination using UUID v7 ordering for product listing endpoints. Added a thumbnail image (sort_order=1) to the paginated product response. Skipped pagination for images and SKUs since they’re low cardinality and only appear inside product detail views.</p>
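<p>A rough in-memory sketch of the keyset idea (simplified hypothetical types; the real implementation runs this as SQL against Postgres): the client sends the last id it saw instead of an offset, and because UUID v7 ids are time-ordered, "id &gt; cursor" pages through rows in creation order.</p>

```rust
// Keyset (cursor) pagination: no OFFSET scans; each page continues
// strictly after the last id the client saw. u128 stands in for a UUID v7.
#[derive(Debug, Clone, PartialEq)]
struct Product {
    id: u128,
    name: String,
}

struct Page {
    items: Vec<Product>,
    next_cursor: Option<u128>,
}

/// Assumes `all` is sorted by id ascending, as the DB index would be.
fn list_products(all: &[Product], cursor: Option<u128>, limit: usize) -> Page {
    let items: Vec<Product> = all
        .iter()
        .filter(|p| cursor.map_or(true, |c| p.id > c))
        .take(limit)
        .cloned()
        .collect();
    // A full page means there may be more rows; hand back the last id
    // as the cursor for the next request.
    let next_cursor = if items.len() == limit {
        items.last().map(|p| p.id)
    } else {
        None
    };
    Page { items, next_cursor }
}
```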

<h3 id="product-filters-4-prompts">Product filters (4 prompts)</h3>

<p>Implemented <code class="language-plaintext highlighter-rouge">ProductFilterQuery</code>/<code class="language-plaintext highlighter-rouge">ProductFilter</code> for <code class="language-plaintext highlighter-rouge">GET /api/v1/products</code>. Filters by category, brand, price range, search (ILIKE), and status. Fixed 4 SQL linter bugs that Claude introduced (double WHERE / AND WHERE issues). 5 new router filter tests. All 207 tests passing.</p>
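<p>The usual fix for that class of bug is to never write the <code class="language-plaintext highlighter-rouge">WHERE</code> keyword inside a conditional: collect the predicates first, then join them once. A sketch of the pattern (hypothetical field names; real code would bind parameters rather than interpolate values):</p>

```rust
// Dynamic filter clause builder: WHERE appears exactly once, or not at
// all, no matter which combination of filters is set. This avoids the
// "double WHERE / AND WHERE" bugs that show up when each filter appends
// its own keyword.
struct ProductFilter {
    brand_id: Option<u64>,
    min_price_cents: Option<i64>,
    status: Option<String>,
}

fn where_clause(f: &ProductFilter) -> String {
    let mut preds: Vec<String> = Vec::new();
    if let Some(b) = f.brand_id {
        preds.push(format!("brand_id = {b}"));
    }
    if let Some(p) = f.min_price_cents {
        preds.push(format!("price_cents >= {p}"));
    }
    if let Some(s) = &f.status {
        preds.push(format!("status = '{s}'"));
    }
    if preds.is_empty() {
        String::new()
    } else {
        format!("WHERE {}", preds.join(" AND "))
    }
}
```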

<hr />

<h2 id="testing-and-code-quality">Testing and code quality</h2>

<h3 id="integration-tests-5-prompts">Integration tests (5 prompts)</h3>

<p>Implemented 33 brand &amp; category repository tests (17 brand, 16 category). All 144 tests passing at this point.</p>

<h3 id="test-refactoring-6-prompts">Test refactoring (6 prompts)</h3>

<p>Extracted test helpers to the shared module: auth fixtures (<code class="language-plaintext highlighter-rouge">test_auth_config</code>, <code class="language-plaintext highlighter-rouge">test_token</code>, <code class="language-plaintext highlighter-rouge">seller_user</code>/<code class="language-plaintext highlighter-rouge">buyer_user</code>/<code class="language-plaintext highlighter-rouge">admin_user</code>), HTTP request builders (<code class="language-plaintext highlighter-rouge">json_request</code>, <code class="language-plaintext highlighter-rouge">authed_json_request</code>, <code class="language-plaintext highlighter-rouge">authed_get</code>, <code class="language-plaintext highlighter-rouge">authed_delete</code>). Removed redundant pagination integration tests and duplicated constructor functions from catalog router tests. Net result: -520 lines removed, +337 added. Compacted shared/CLAUDE.md from 187 to 82 lines.</p>

<blockquote>
  <p><strong>My take:</strong> LLMs seem to default to avoiding preemptive abstraction and are prone to repeating code. So between the Product Filters and Test Refactoring work, I spent some time looking over the code Claude generated and identified points for improvement. I’m sure I could spend more time on this, but I’d like to get to the complicated parts and see how my architecture decisions and LLM management skills hold up.</p>
</blockquote>

<hr />

<h2 id="managing-llm-context">Managing LLM context</h2>

<h3 id="claudemd-optimization--skill-creation-7-prompts">CLAUDE.md optimization &amp; skill creation (7 prompts)</h3>

<p>Asked Claude to suggest ways to improve the CLAUDE.md files as “onboarding docs” for new LLM sessions. Created a <code class="language-plaintext highlighter-rouge">project-context</code> skill that handles session onboarding (gathering context efficiently) and documentation maintenance (updating CLAUDE.md files after significant work). Pruned redundant info from root CLAUDE.md. I think this is the most useful skill I’ve created so far.</p>

<h3 id="context-optimization-5-prompts">Context optimization (5 prompts)</h3>

<p>Explored how to minimize the context that gets loaded at the start of every session. Moved code patterns and bootstrap recipe to on-demand <code class="language-plaintext highlighter-rouge">.plan/</code> files instead of always-loaded CLAUDE.md. Cut always-loaded context by roughly 50%. Discussed knowledge graph tools vs markdown for managing project knowledge — decided to stay with markdown. Added deduplication rules so MEMORY.md and CLAUDE.md don’t drift apart with redundant info.</p>

<h3 id="claudemd-compaction--adr-009-5-prompts">CLAUDE.md compaction + ADR-009 (5 prompts)</h3>

<p>Compacted catalog CLAUDE.md from 250 to 88 lines, identity CLAUDE.md from 125 to 68 lines. Created ADR-009 for the ltree categories decision.</p>

<h3 id="housekeeping-8-prompts">Housekeeping (8 prompts)</h3>

<p>Created <code class="language-plaintext highlighter-rouge">v0.2-catalog-crud</code> git tag and pushed to remote. Wrote a <code class="language-plaintext highlighter-rouge">make run SERVICE=&lt;name&gt;</code> command with a <code class="language-plaintext highlighter-rouge">run.sh</code> script, debugged it for both services. Small things, but the kind that save time every day.</p>

<blockquote>
  <p><strong>My take:</strong> I frequently checked the root <code class="language-plaintext highlighter-rouge">.claude</code> folder and kept seeing things pile up, which concerned me. Context management has become a first-class part of the workflow — every session now follows: <code class="language-plaintext highlighter-rouge">/project-context</code> → <code class="language-plaintext highlighter-rouge">/br</code> → work → flush to memory → end session.</p>
</blockquote>

<hr />

<h2 id="planning-the-next-phase">Planning the next phase</h2>

<h3 id="orderpayment-mega-planning-session-4-prompts">Order/payment mega-planning session (4 prompts)</h3>

<p>The big one. Created 4 detailed implementation plans:</p>

<ol>
  <li><strong>Shared infrastructure</strong> — Kafka (KRaft mode, no Zookeeper), transactional outbox pattern, event system with rdkafka, Jaeger for distributed tracing, Redis-only service bootstrap function</li>
  <li><strong>Cart service</strong> — Redis-only microservice, 6 endpoints, 30-day TTL, max 50 SKUs per cart</li>
  <li><strong>Order + Payment services</strong> — choreography saga with state machines, mock payment gateway (Stripe/PayPal), inventory reservation, compensation flows, ~190 tests planned</li>
  <li><strong>Workflow documentation</strong> — ADRs 010-013, CLAUDE.md updates, saga flow documentation</li>
</ol>
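<p>To make the state-machine part of plan 3 concrete, an order lifecycle can be modeled as a pure transition function that an event consumer applies per event. The states, events, and transitions below are my illustrative assumptions, not the actual planned saga:</p>

```rust
// Minimal order-lifecycle state machine sketch for a choreography saga.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OrderState {
    Pending,
    InventoryReserved,
    Paid,
    Failed,    // terminal: compensation (release the reservation) has run
    Completed, // terminal
}

#[derive(Debug, Clone, Copy)]
enum OrderEvent {
    InventoryReserved,
    PaymentSucceeded,
    PaymentFailed,
    Shipped,
}

// Illegal transitions return None so the consumer can dead-letter the
// event instead of corrupting order state.
fn apply(state: OrderState, event: OrderEvent) -> Option<OrderState> {
    use OrderEvent as E;
    use OrderState as S;
    match (state, event) {
        (S::Pending, E::InventoryReserved) => Some(S::InventoryReserved),
        (S::InventoryReserved, E::PaymentSucceeded) => Some(S::Paid),
        (S::InventoryReserved, E::PaymentFailed) => Some(S::Failed), // compensation path
        (S::Paid, E::Shipped) => Some(S::Completed),
        _ => None,
    }
}

fn main() {
    use OrderEvent as E;
    use OrderState as S;
    let s = apply(S::Pending, E::InventoryReserved).unwrap();
    assert_eq!(apply(s, E::PaymentFailed), Some(S::Failed));
    assert_eq!(apply(S::Pending, E::Shipped), None); // out-of-order event rejected
    println!("ok");
}
```

<p>Keeping the transition function pure should make the planned tests cheap: no broker or database is needed to cover every legal and illegal transition.</p>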

<blockquote>
  <p><strong>My take:</strong> I’m going to spend some time reviewing, researching independently, and iterating on the plans with Claude. Infrastructure (specifically Kafka) and saga orchestration are areas where I’m weakest in terms of experience and knowledge, so I’ll have to tread carefully. I personally hate microservices, but this is what I specifically planned for — to push both myself and the LLM.</p>
</blockquote>

<hr />

<h2 id="observations">Observations</h2>

<blockquote>
  <p><strong>My take:</strong> These are actually Claude’s observations. They’re pretty accurate.</p>
</blockquote>

<h3 id="scope-creep-needs-a-human-check">Scope creep needs a human check</h3>

<p>Claude suggested building automatic FK traversal for domain objects. Sounds nice in theory but it’s basically reimplementing an ORM — massive scope increase for questionable benefit. Pushing back on LLM suggestions is a skill that matters.</p>

<h3 id="llms-dont-refactor-proactively">LLMs don’t refactor proactively</h3>

<p>Test infrastructure and shared helpers only got cleaned up because I reviewed the code and pointed out duplication. As the project grows, I don’t want test times ballooning from duplicated setup code and tests that verify the same generic pagination behavior repeatedly.</p>

<h3 id="the-mega-planning-session-validates-my-pt2-predictions">The mega-planning session validates my pt2 predictions</h3>

<p>I predicted that orders, payments, and cross-service coordination would be where complexity explodes. The fact that I needed 4 separate plans just to approach this phase — before writing a single line of code — confirms that. The planning covered saga patterns, state machines, transactional outboxes, compensation flows, and distributed tracing. This is a fundamentally different challenge from “implement CRUD endpoints.”</p>

<hr />

<h2 id="why-i-started-programming">Why I started programming</h2>

<p>I started programming because at my first job, I spent hours on Google Sheets copy-pasting meaningless bullshit. After an agonizing amount of time doing that, developing unwanted muscle memory and getting tired of the farce*, I decided to look for solutions to this Sisyphean task. The answer was programming with — ugh — JavaScript in an online scripting environment integrated into Google Workspace. It was a terrible dev experience. It had none of the convenience features that modern IDEs provide. All I had were my unending fury, hatred of manual work, googling skills, and bottomless willpower. This was around 2020-2021, for context.</p>

<p>When I got my custom macros to work and automated the entire boring task, it was an amazing feeling. The act of programming itself was addictive, and I kept doing it until it became my full-time job.</p>

<p>That’s why when I got my jobs at startups, I enthusiastically threw myself into coding-heavy repetitive tasks (mostly refactoring and test writing without much of the fun “intellectual” domain object stuff). This helped me develop taste and opinions, but it also tired me out physically and mentally. I realized that programming, like any job, requires manual, boring, repetitive work.</p>

<p>Now that LLMs are here to automate the act of writing code itself, I don’t know if I’ll enjoy programming like I used to. Right now, it’s so comfortable to get the LLM to do things for me, and to be honest, I’m not sure sometimes what the point of writing code faster even is. Right now, the shiny new toy is very interesting and I want this blog to be a showcase of my skills so that I don’t need to do silly leetcode-style tests. That also makes me think: is being a good employee my true end goal? My thoughts are complicated and I need to think about it more. Nevertheless, I am still excited about the project.</p>

<p>* Looking back now, I understand why work was done that way (making interns do spreadsheet manual labour), but I still hate it with a great passion.</p>

<h2 id="git-tag">Git Tag</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">v0.3-catalog-complete</code></li>
</ul>

<hr />

<h2 id="whats-next">What’s next</h2>

<p>The 4 plans are created and waiting for review. Next is implementing them in order — shared infrastructure first (Kafka, outbox, tracing), then cart, then order+payment. This is where the real test begins.</p>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[In Part 2, I finished the catalog service CRUD, reflected on backend development with LLMs, and predicted that complex cross-service features would be where things get hard. Since then, I’ve been building out the catalog further and planning the next major phase.]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt2</title><link href="https://jyc11.github.io/blog/2026/02/24/getting-gud-at-llms-pt2" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt2" /><published>2026-02-24T00:00:00+00:00</published><updated>2026-02-24T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/02/24/getting-gud-at-llms-pt2</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/02/24/getting-gud-at-llms-pt2"><![CDATA[<p>In <a href="/blog/2026/02/23/getting-gud-at-llms-pt1.html">Part 1</a>, I built the identity service for <a href="https://github.com/JYC11/koupang/tree/main">Koupang</a> and was surprised at how well Claude handled Rust and niche crates. Now I’m tackling the catalog service and experimenting with a task management workflow.</p>

<hr />

<h2 id="using-a-task-manager">Using a task manager</h2>

<p>I figured out how to get Claude to use beads_rust. The key insight was splitting planning and execution into separate sessions so that the execution session starts with a clean context and only has the task list to work from.</p>

<p>The workflow:</p>

<ol>
  <li>Load in the beads_rust skill at the start of the session</li>
  <li>Start plan mode</li>
  <li>Give requirements</li>
  <li>Iterate plan</li>
  <li>When Claude asks to go ahead with the plan, I “reject” the execution and get it to put the tasks with dependencies into beads_rust</li>
  <li>Close the planning Claude session</li>
  <li>Start a new Claude session and load in beads_rust skill</li>
  <li>Tell Claude to use beads_rust to look at what it needs to do and to execute it</li>
</ol>

<p>The reason I close the planning session and start fresh is to prevent context bloat. The planning conversation can get long, and I don’t want all of that history polluting the execution phase. By writing the plan into beads_rust, the new session can pick up exactly what it needs to do without carrying the baggage of the planning discussion.</p>

<hr />

<h2 id="results-and-thoughts">Results and Thoughts</h2>

<h3 id="first-impressions">First impressions</h3>

<ul>
  <li>It followed the beads it created well
    <ul>
      <li>To be fair, it also kept notes in its internal MEMORY.md file about the next task (catalog service) and which bead to use</li>
      <li>I should consider clearing the memory and then trying a new task that it put in beads to see how well it performs without that crutch</li>
    </ul>
  </li>
  <li>It does simple CRUD very well, so I was not surprised that it handled the basic CRUD endpoints perfectly</li>
  <li>Unfortunately, it just queries the entire table on list endpoints instead of paginating, despite common pagination support modules being in the context (maybe it got lost?)</li>
  <li>It properly used value objects (I added stuff about value objects to the identity service) without me having to mention it</li>
  <li>It appropriately suggested claims-based authentication. When I pushed back on using the gRPC server and a circuit breaker pattern, it suggested those could be enhancements for later</li>
  <li>It didn’t do router tests and only implemented them when I told it to</li>
  <li>It implemented dynamic updates based on optional parameters in the product update request DTO without me telling it to</li>
  <li>Catalog service is currently very CRUD-y which makes sense considering there isn’t much “business logic” yet so that’s fine for now</li>
  <li>This saved me a TON of time typing
    <ul>
      <li>I would say 2-3 days of repetitive typing and debugging got reduced to 2-ish hours of planning and waiting for the LLM to generate code</li>
    </ul>
  </li>
</ul>

<h3 id="deeper-reflections-after-stepping-away-for-lunch">Deeper reflections (after stepping away for lunch)</h3>

<ul>
  <li>It properly planned out + implemented slugs for products, sku code, non-zero skus and other such domain specific requirements correctly without me explicitly saying so</li>
  <li>Something I forgot was database locking. I used to do very conservative pessimistic locking on records before updating them when there were strong consistency requirements. Claude didn’t suggest anything like that, and I should at least have considered it.</li>
  <li>The get-product-detail endpoint was done with 3 separate queries (1 for the product, 1 for SKUs, 1 for images). I personally would have used a join with some JSON aggregation to do it in one query, but this is acceptable; people hold many different opinions about the best way to interface with a database from application code.</li>
  <li>Not using an ORM was, I think, a good choice (although ORMs are a fantastic tool <strong>if</strong> used well and with intention). Before LLMs, I shied away from raw SQL because of the finicky nature of string manipulation, having to map between code and SQL result sets by casting <code class="language-plaintext highlighter-rouge">Any</code> types to concrete types, and the annoyance of keeping code and SQL in sync. LLMs make these things trivial: they are quite decent at SQL, and they don’t have to learn some potentially niche ORM library with few examples.</li>
  <li>The LLM still handles the current size of the codebase well.</li>
</ul>

<hr />

<h2 id="predictions">Predictions</h2>

<ul>
  <li>I am building simple foundations for more complex features so things are going well so far</li>
  <li>I expect dealing with orders, shipping, payment, refunds (things surrounding products) to massively increase complexity. These features require cross-service coordination, state machines (order lifecycle, payment states), and careful handling of failure scenarios (partial refunds, failed shipments). I would like to see how I can use LLMs to handle them well but I expect the LLM to struggle here.</li>
  <li>Certain features more directly related to the catalogs like dynamic pricing based on algorithms and discount features, searching for products, handling high traffic while keeping track of stock could also challenge me and LLMs. These involve algorithmic thinking and concurrency concerns that go beyond CRUD patterns, so I expect the LLM to struggle in this regard as well.</li>
  <li>My Claude Code install is fresh so there isn’t much cruft in the various config and memory files on my computer (I checked). As this grows, I expect Claude to be a bit more confused even if I keep the context fresh each session. Stale or accumulated memories from past sessions could mislead future ones, so I think I would have to clear those regularly.</li>
</ul>

<hr />

<h2 id="thoughts-about-backend-development">Thoughts about backend development</h2>

<p>A lot of backend development is just plumbing data in my humble opinion. You receive some kinda data through the network, you shove it into some kinda persistence layer, you retrieve something from the persistence layer, you throw it out into the network, repeat ad infinitum. There are lots of established patterns that I just need to re-implement again and again.</p>

<p>“Difficulty” in backend development came from:</p>

<ol>
  <li>Not knowing programming well</li>
  <li>Not understanding how to use the libraries/packages</li>
  <li>Writing bad code and suffering from it</li>
  <li>Working with code that others have written</li>
  <li>Crazy deadlines and fluctuating requirements</li>
</ol>

<p>But as time passed:</p>

<ol>
  <li>Programming by hand solved problem 1 as I developed intuition and muscle memory for the language</li>
  <li>Reading documentation and looking up guides solved the library/package issue</li>
  <li>Writing bad code got solved when I was forced to refactor my own code to make it testable, and I developed intuition for writing simpler, more testable code</li>
  <li>Working with code written by other people is still difficult</li>
  <li>I can’t do anything about deadlines and fluctuating requirements — but faster coding helps with both</li>
</ol>

<h3 id="how-llms-change-this">How LLMs change this</h3>

<ul>
  <li>LLMs solve issue 1 and 2 (syntax and library/package issues). They have tons of training data and can look things up online.</li>
  <li>LLMs don’t really solve issue 3 (bad code) definitively. It really depends on what you feed it.</li>
  <li>LLMs <em>could</em> solve issue 4 (working with code written by other people) but I haven’t used LLMs in a context where I’m completely new to the codebase and there is no one to onboard me.</li>
  <li>LLMs don’t solve issues about deadlines and requirements directly, but they make working with fluctuating requirements easier because writing and rewriting code is much faster.</li>
</ul>

<h3 id="skill-atrophy-and-the-next-generation">Skill atrophy and the next generation</h3>

<p>I have experience writing bad code, improving it, and unraveling my cocoon of ignorance on various programming topics, which arguably makes me effective at structuring code, providing samples, and fixing code produced by AI. But I wonder whether these skills will atrophy as I use LLMs more. I also wonder whether the newer generation of software engineers will develop different kinds of intuition.</p>

<h3 id="problems-i-havent-faced-yet">Problems I haven’t faced yet</h3>

<p>There are also problems I haven’t faced much yet, such as dealing with extremely high traffic, working under tight hardware constraints, maintaining very high uptime, and dealing with distributed systems. I never got to develop experience and intuition for these “the old way” because I worked at smaller companies, so I wonder how I will develop as an engineer when I hopefully get to tackle them with an LLM in the future.</p>

<hr />

<h2 id="prompt-log-planning-the-catalog-service">Prompt Log: Planning the Catalog Service</h2>

<p>One thing I wanted to do with these posts was show the actual interaction flow, not just the results. Here’s the exact sequence of prompts I used to plan the catalog microservice with Claude Code.
This section was done with Claude’s help parsing the jsonl session files, and I have to say it did this VERY well. I expected mild prompt injection as it read through previous prompts, but it didn’t?? I also automated this step: Claude suggested putting reminders in the CLAUDE.md file, and I added a hook using a script Claude created. I am quite impressed.</p>

<h3 id="1-check-task-board-state">1. Check task board state</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/br
</code></pre></div></div>

<p>Checks current beads_rust task board. Confirms starting from a clean slate.</p>

<h3 id="2-clean-up-stale-memory">2. Clean up stale memory</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>remove plan 5 from your memory wherever that is because it has been partially implemented
</code></pre></div></div>

<p>Housekeeping — removes outdated progress entries from auto-memory before starting new work.</p>

<h3 id="3-enter-plan-mode-with-detailed-requirements">3. Enter plan mode with detailed requirements</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
I want to start working on the catalog microservice now
The catalog microservice will allow buyers and admins to upload and manage products
The 2 main important tables will be Product and Sku. Sku is a child table of Product.
1 Product can have many Skus. A Sku is a variation of a Product
Eg: A product can be a shoe and a sku can be the shoe sizes
The granularity of the sku can be something we discuss
The catalog service will also need to keep track of the inventory levels as well
For the initial iteration of the catalog service, we will assume that image files are
handled somehow and we receive links to images. Detailed image/media handling can be
implemented later
We also need to keep track of prices here. When dealing with money, we need to make sure
to use the correct types because floating point numbers do not reflect money behaviour
accurately.
Do the plan first and then I will take a look
</code></pre></div></div>

<p>Enters plan mode. Provides high-level requirements with open questions (SKU granularity). Explicitly says “do the plan first” to let Claude explore and design before execution.</p>
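<p>The money point in that prompt can be made concrete: hold prices as integer minor units (cents) on the SKU rather than floats. A stdlib-only sketch of the shapes described above (names and fields are illustrative, not the actual schema):</p>

```rust
// Illustrative Product/Sku shapes: 1 product owns many SKUs, and prices
// are integer cents so arithmetic stays exact.
#[derive(Debug, PartialEq)]
struct Sku {
    code: String,
    attributes: Vec<(String, String)>, // flexible key/value variant attributes
    price_cents: i64,                  // never f64 for money
    stock: u32,
}

#[derive(Debug)]
struct Product {
    name: String,
    slug: String,
    skus: Vec<Sku>, // e.g. one shoe product, one SKU per size
}

fn main() {
    let shoe = Product {
        name: "Runner".into(),
        slug: "runner".into(),
        skus: vec![Sku {
            code: "RUN-42".into(),
            attributes: vec![("size".into(), "42".into())],
            price_cents: 89_900, // 899.00 represented exactly
            stock: 5,
        }],
    };
    assert_eq!(shoe.skus[0].price_cents, 89_900);
    println!("ok");
}
```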

<h3 id="4-answer-design-questions">4. Answer design questions</h3>

<p>Claude asked 3 targeted questions:</p>

<ul>
  <li><strong>SKU variant attributes model?</strong> → Selected: “Flexible JSON attributes (Recommended)”</li>
  <li><strong>Price and inventory on SKU level?</strong> → Selected: “Yes, both on SKU (Recommended)”</li>
  <li><strong>Who can create/manage products?</strong> → Selected: “Sellers and Admins (Recommended)”</li>
</ul>

<h3 id="5-challenge-a-design-decision">5. Challenge a design decision</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>can you explain the claims based auth decision?
</code></pre></div></div>

<p>Rejected the initial plan approval to ask about a specific architectural choice. Claude explains the claims-based vs gRPC auth trade-off. This is a key technique — rejecting plan approval doesn’t lose work, it just lets you dig deeper.</p>

<h3 id="6-propose-alternative-with-nuance">6. Propose alternative with nuance</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I prefer the gRPC to identity but the point about coupling is correct. Instead, I think
adding a periodic health check to the identity grpc to determine whether to call the gRPC
service would be better and then gracefully fail to claims based. Also, caching on the
catalog service side for the gRPC identity service could work. What do you think about
these 2 suggestions?
</code></pre></div></div>

<p>Pushes back with a hybrid approach. Claude analyzes both suggestions and recommends phasing the work.</p>

<h3 id="7-agree-on-phasing">7. Agree on phasing</h3>

<p>Claude asked whether to phase the work or include everything now. I selected “Phase it (Recommended)”.</p>

<h3 id="8-log-tasks-in-beads_rust">8. Log tasks in beads_rust</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want you to use beads_rust to log the plan for future use
</code></pre></div></div>

<p>Before approving execution, asks Claude to create br tasks with dependencies so the plan is tracked in the task management system. This is the step that enables the “close session and start fresh” workflow described above.</p>

<h3 id="9-document-the-workflow">9. Document the workflow</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>do not execute yet, I want to restart the session. I also want a workflow of user input
prompts into claude cli for blogging purposes to demonstrate claude cli tool usage. Can you
log all of my user prompts and then create a repeatable workflow for future use in all
other sessions?
</code></pre></div></div>

<p>This is where I stopped the planning session and asked Claude to document everything before restarting for execution.</p>

<h3 id="key-techniques-demonstrated">Key techniques demonstrated</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">/plan</code> mode separates research from execution</li>
  <li>Rejecting plan approval to ask questions (doesn’t lose work)</li>
  <li>Using <code class="language-plaintext highlighter-rouge">br</code> (beads_rust) for persistent task tracking across sessions</li>
  <li>Phased delivery: start simple, enhance later</li>
  <li>Pushing back on design decisions with your own suggestions</li>
</ul>

<hr />

<h2 id="prompt-log-all-sessions-so-far">Prompt Log: All Sessions So Far</h2>

<p>I extracted the user prompts from all 26 Claude Code sessions across both the identity and catalog services. Below is the condensed version — system noise stripped out, plans summarized instead of pasted in full. This covers about 2 days of work.</p>

<h3 id="identity-service-integration-tests-sessions-1-5">Identity Service: Integration Tests (Sessions 1-5)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
your task now is to plan implementing integration tests for the identity microservice.
The integration tests should use the #[sqlx::test] macro for actual database usage for
all levels of tests which are to be implemented (repository_test, service_test,
router_test). Plan test cases for me to review as well
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [full plan with 39 test cases across 3 layers,
plus 2 bug fixes found during planning]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want you to run integration tests for the identity service using the Makefile command.
There will be a test failure for get_current_user_returns_correct_user this test. I want
you to identify the cause, explain it then fix it
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>can you fix the other issues found by the other test failures?
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>your task is to refactor the GetCurrentUser trait to make it async and fix the
implementation on the identity service side. Use the identity service integration test
to verify the fix works
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>to make myself clear, I meant make it async fn get_by_id, the current implementation
"technically" is but I want to use the async syntax for the trait
</code></pre></div></div>

<h3 id="claude-code-configuration-session-6">Claude Code Configuration (Session 6)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want help configuring claude code cli. I want to allow safe commands like ls, grep,
cat, etc bash commands for reading to be allowed while I want commands like rm, curl
(potentially accessing malicious links), etc to require permission from me
</code></pre></div></div>

<h3 id="shared-module-extraction-sessions-7-8">Shared Module Extraction (Sessions 7-8)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
based on the high level description and current implementation of the identity service,
I want to identify some common code/utility things that can be put in the shared module.
Do some planning for me to review and then after planning, use beads (br skill) to put
those as tasks with proper dependencies
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [extract 6 modules to shared: observability, server
bootstrap, API responses, auth guards, health check, DTO helpers]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cool, now update CLAUDE.md file in the root folder to record for future use what can
be reused from the shared module in a compact manner
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>there are other pieces of code in the shared module that aren't mentioned in the
CLAUDE.md file, put those in as well
</code></pre></div></div>

<h3 id="auth-flows-sessions-9-14">Auth Flows (Sessions 9-14)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
read the .plans/critical-user-flows.md 8 Auth Flows and do some planning 1 at a time.
I will review each plan 1 by 1 and then you will add them to beads
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Plan #1: Email Interface — trait + mock]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's move to phase 2 and remember when running tests, refer to the makefile for the
test running commands
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Plan #2: Email Verification on Registration]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>when making migrations, use the make migration command (refer to the makefile for
details) to create the migration file, continue with phase 2
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's do phase 3 now
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Plan #3: Password Reset Flow]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go ahead with the plan for phase 3
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>now let's do phase 4
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/br
create br issues for remaining plan#5 AFTER you make the plan and let me approve
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Plan #5: gRPC + Redis Caching]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the grpc_service currently has a build error, the generated type and the implementation
seems to be not matching despite it using the same type
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>now I need you to abstract away the redis connecting and grpc server bootstrapping to
the shared module as I can see it being used commonly in many services
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Abstract Redis + gRPC bootstrapping into shared]
</code></pre></div></div>

<h3 id="testing-infrastructure-sessions-15-18">Testing Infrastructure (Sessions 15-18)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I need you to implement integration tests for the grpc service in the identity service
user package
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>is there a way to actually run the grpc server and then call from a grpc client for
the test?
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
the current GetCurrentUser implementation and test gracefully handles the non-presence
of an actual redis client, however I want to use an actual redis client for the test
for more comprehensive integration testing. Explore options on how to do this so that
this setup can be abstracted away and reused in many cases
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Real Redis Integration Tests via Testcontainers]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>there are several test utils like starting a grpc server and starting a redis
testcontainers instance that can be put in shared for common use across all
microservices, do so
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Extract Reusable Test Utilities to Shared]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>now, while the sqlx-test macro is convenient to use it requires a running db instance
to run tests. I don't want a dependency on a running db managed outside of the testcode
in case these tests runs in CI. Now that we established that redis testcontainers work,
refactor the tests to use postgres testcontainers and remove the use of sqlx-test macro
and put the postgres testcontainers setup in the test utils and make it be used in the
integration tests for identity
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Replace #[sqlx::test] with Postgres Testcontainers — 82 tests]
</code></pre></div></div>

<h3 id="refinements-sessions-19-22">Refinements (Sessions 19-22)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while we are in the early stages of the project, I want to set stricter roles using
enums. The roles should be Seller, Buyer and Admin. Make changes to the identity and
shared modules referring to role which is a String right now
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Refactor Role from String to Enum]
</code></pre></div></div>
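<p>For illustration, here is a minimal sketch of the shape such a refactor aims for. The exact names, error type and the lowercase database representation are my assumptions, not necessarily what ended up in the shared crate:</p>

```rust
use std::fmt;
use std::str::FromStr;

// Hypothetical names for illustration; the real shared crate may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Seller,
    Buyer,
    Admin,
}

#[derive(Debug, PartialEq, Eq)]
pub struct RoleParseError(String);

impl FromStr for Role {
    type Err = RoleParseError;

    // Parse the string stored in the DB back into the enum; unknown
    // values become an error instead of silently flowing through.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "seller" => Ok(Role::Seller),
            "buyer" => Ok(Role::Buyer),
            "admin" => Ok(Role::Admin),
            other => Err(RoleParseError(other.to_string())),
        }
    }
}

impl fmt::Display for Role {
    // The lowercase form is what gets written to the DB column.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            Role::Seller => "seller",
            Role::Buyer => "buyer",
            Role::Admin => "admin",
        };
        f.write_str(s)
    }
}

fn main() {
    let role: Role = "admin".parse().expect("known role");
    println!("{}", role);
    assert!("moderator".parse::<Role>().is_err());
}
```

<p>The payoff is that an invalid role string coming out of the database or a request body fails loudly at the parse boundary instead of leaking into business logic as a stray <code class="language-plaintext highlighter-rouge">String</code>.</p>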

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>the claude.md file is getting quite large, I can see that the information about the
shared module can be split up and put into a new claude.md file. Put the compact overview
of the shared module in the shared crate and update the shared module Claude.md so that
it reflects the most current code in the shared module. Point towards the shared module
Claude.md in the Claude.md in the root folder.
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>create a CLAUDE.md for the identity service module and make the root folder CLAUDE.md
reference it. Make sure the identity service module CLAUDE.md is compact
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I want you to look at the code so far (mainly identity, shared modules) and the various
scripts and other stuff created to create a summary. I wrote a blog post about using AI
and I want to add a summary about what was built so far.
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let's put the summary of what is there so far in a .md file, I will copy paste to the
blog at a later point. Now let's do the ADR and git tags as you have suggested.
</code></pre></div></div>

<h3 id="value-objects--validation-sessions-23-24">Value Objects &amp; Validation (Sessions 23-24)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/plan
in the identity microservice, there needs to be validation for username, password,
email and phone strings
email -&gt; research well known email regexes
password -&gt; research "strong" password regexes
phone -&gt; assume that we store country code with phone numbers, allow - characters
username -&gt; no empty strings and minimum 3 characters, no profanities (nice to have)
make value objects, parse the strings into valid value objects where the new function
in the struct impl validates with regex
write unit tests for these value objects
make sure when writing into the db that the strings are valid
create a new validated struct ValidUserReq that uses the value objects and replace the
use of UserCreateReq, UserUpdateReq
also update the password flows to use password value objects
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the following plan: [Value Objects &amp; Input Validation — 4 value objects, ~35 unit tests]
</code></pre></div></div>
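<p>To make the "parse, don't validate" idea concrete, here is a hedged sketch of one such value object. The actual plan used regex-based validation and covered email, password and phone as well; the character whitelist below is purely my assumption for illustration:</p>

```rust
// Hypothetical sketch of the username value object; names and the
// allowed character set are assumptions, not the generated code.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Username(String);

#[derive(Debug, PartialEq, Eq)]
pub enum UsernameError {
    TooShort,
    InvalidChar(char),
}

impl Username {
    // Parse, don't validate: the only way to obtain a Username is
    // through this constructor, so the DB layer can take &Username
    // and never sees a raw, unchecked String.
    pub fn new(raw: &str) -> Result<Self, UsernameError> {
        let trimmed = raw.trim();
        if trimmed.chars().count() < 3 {
            return Err(UsernameError::TooShort);
        }
        // Assumed whitelist: ASCII alphanumerics plus underscore.
        if let Some(c) = trimmed
            .chars()
            .find(|c| !c.is_ascii_alphanumeric() && *c != '_')
        {
            return Err(UsernameError::InvalidChar(c));
        }
        Ok(Username(trimmed.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    assert!(Username::new("ab").is_err());
    let u = Username::new("jaeyoon_cho").unwrap();
    println!("{}", u.as_str());
}
```

<p>A <code class="language-plaintext highlighter-rouge">ValidUserReq</code> built from value objects like this carries its invariants in the type, which is why it can replace the raw request structs at the service boundary.</p>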

<h3 id="catalog-service-execution-sessions-25-26">Catalog Service Execution (Sessions 25-26)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/br
start work on bd-jx9 and close issues as you go
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make a task for router tests for products in br, implement the product router tests,
close the br issue
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add following tasks to br
implementing pagination for listing product endpoints (there are several which just
return the entire list)
implementing caching for read endpoints
planning search engine implementation
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>to specify, I meant keyset pagination. update the bead about pagination
</code></pre></div></div>
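<p>For context, keyset (cursor) pagination seeks directly past the last row seen instead of counting rows with <code class="language-plaintext highlighter-rouge">OFFSET</code>, so deep pages stay cheap and results stay stable while rows are being inserted. A rough sketch of the query shape, with the table, column names and cursor layout assumed for illustration:</p>

```rust
// Hypothetical cursor for a product listing; the real schema may differ.
struct Cursor {
    created_at: i64, // e.g. epoch millis of the last row on the previous page
    id: i64,         // tiebreaker so the ordering is total
}

// Builds the SQL and its positional parameters. The row-value comparison
// (created_at, id) < ($1, $2) is Postgres syntax for "strictly after the
// cursor position" under the DESC, DESC ordering.
fn list_products_sql(cursor: Option<&Cursor>, limit: u32) -> (String, Vec<i64>) {
    match cursor {
        Some(c) => (
            "SELECT id, name, created_at FROM products \
             WHERE (created_at, id) < ($1, $2) \
             ORDER BY created_at DESC, id DESC LIMIT $3"
                .to_string(),
            vec![c.created_at, c.id, limit as i64],
        ),
        // First page: no cursor yet, just the ordered prefix.
        None => (
            "SELECT id, name, created_at FROM products \
             ORDER BY created_at DESC, id DESC LIMIT $1"
                .to_string(),
            vec![limit as i64],
        ),
    }
}

fn main() {
    let (sql, params) = list_products_sql(None, 20);
    println!("{sql} {params:?}");
    let cursor = Cursor { created_at: 1_700_000_000_000, id: 42 };
    let (sql, params) = list_products_sql(Some(&cursor), 20);
    println!("{sql} {params:?}");
}
```

<p>The endpoint then returns the <code class="language-plaintext highlighter-rouge">(created_at, id)</code> of the last row as the next cursor, and the client never sees a page number.</p>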

<h3 id="claudes-observations-from-the-logs">Claude’s Observations from the logs</h3>

<ul>
  <li><strong>76 total user prompts across 26 sessions</strong> over roughly 2 days</li>
  <li>Most prompts are short (1-3 sentences). The longest ones are the plan pastes and initial requirements.</li>
  <li>The pattern is very consistent: <code class="language-plaintext highlighter-rouge">/plan</code> → review → paste plan into new session → implement → follow-up corrections</li>
  <li>Corrections tend to be brief and specific (“I meant keyset pagination”, “make it async fn”)</li>
  <li>I spent more time on testing infrastructure than I expected — sessions 15-18 are entirely about making tests self-contained with testcontainers</li>
</ul>

<hr />

<h2 id="git-tag">Git Tag</h2>

<ul>
  <li><code class="language-plaintext highlighter-rouge">v0.2-catalog-crud</code></li>
  <li>This is the new tag created for the latest batch of generated code.</li>
</ul>

<hr />

<h2 id="whats-next">What’s next</h2>

<p>In Part 3, I want to push into the more complex services — starting with orders — and see if my predictions about LLM struggles hold up.</p>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[In Part 1, I built the identity service for Koupang and was surprised at how well Claude handled Rust and niche crates. Now I’m tackling the catalog service and experimenting with a task management workflow.]]></summary></entry><entry><title type="html">Getting Gud at LLMs Pt1</title><link href="https://jyc11.github.io/blog/2026/02/23/getting-gud-at-llms-pt1" rel="alternate" type="text/html" title="Getting Gud at LLMs Pt1" /><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2026/02/23/getting-gud-at-llms-pt1</id><content type="html" xml:base="https://jyc11.github.io/blog/2026/02/23/getting-gud-at-llms-pt1"><![CDATA[<h2 id="background">Background</h2>

<p>Around this time last year, I was quite skeptical of LLM usage for programming. I remember having a false sense of superiority because I programmed “the old way”. Still, I kept an eye on the progress of LLMs and gradually used them more at work.</p>

<p>I originally used the web interfaces (ChatGPT, Gemini, Claude, Qwen, Deepseek) to one-shot bash scripts, Golang scripts or SQL that I couldn’t be bothered to write. Then I used the Canvas feature in Gemini to actually code/refactor certain features. My main programming use was providing samples of test code that I was happy with and getting the LLM to copy that format for other tests. Eventually I caved and got the AI subscription from JetBrains to use the LLMs within the IDE, because copy-pasting to the browser was getting annoying.</p>

<p>After a while of deliberation, I decided to give this LLM thing an actual shot. Far too many people around me, as well as people I’ve seen on YouTube, have said that the industry is changing. Of the two possible futures (the software engineering industry changes fundamentally, or it doesn’t), I decided to give in to what seems like a changing tide. I set out to challenge my own assumptions — mainly that LLMs are bad at dealing with large, complex codebases and are only good at basic programming tasks.</p>

<p>Another sentiment I kept hearing, from staff-level and above engineers (via Reddit and YouTube), was that LLMs boosted their productivity enormously and that they almost never code by hand anymore. I imagine that very experienced engineers deal with far more difficult, larger and more nebulous problems that require a lot of thought, planning and task subdivision — which, from what I’ve heard, is exactly what you need to do to use LLMs effectively. Since I am quite early in my career and don’t deal with such difficult issues, I may not see as large a productivity boost from LLMs. So I figured the best way to understand why all these more experienced engineers are lauding LLMs was to try something larger and more difficult myself.</p>

<hr />

<h2 id="the-experiment">The experiment</h2>

<p>I wanted to build something simple first to get the hang of things, and I also wanted to use Rust for the heck of it. That became <a href="https://github.com/JYC11/workout-util">Workout Util</a> — I already knew that LLMs were good at SQL and basic forms/pagination stuff due to the likely abundance of training data, so this project was a breeze. My main thoughts/reflections on using LLMs are in the readme of that project.</p>

<p>Now that I was done with a simple example, I decided to do something more complicated: <a href="https://github.com/JYC11/koupang/tree/main">Koupang</a>, an ecommerce backend. The domain is well established, I am reasonably familiar with it, and I have worked on ecommerce backends before, so I decided to tackle this one with Rust as well.</p>

<p>To facilitate much heavier use of LLMs, I decided to just pay for the <strong>Claude Code Max</strong> plan and work through the CLI. I originally scaffolded the entire project with Qwen (web interface), shoved that into a markdown file, and then handed it to Claude Code and started working on it.</p>

<hr />

<h2 id="what-surprised-me">What surprised me</h2>

<p>And so far… Claude seems to be doing really well. The first thing I worked on was the identity microservice. Arguably, the project is still small, and identity/auth is a well-established topic, so the LLM would be expected to do well here. But I am incredibly optimistic about this project, because Claude Code let me make ridiculously fast progress despite my only using it in a basic way. I am excited to see where this project goes and how the LLM handles a larger codebase.</p>

<h3 id="specific-points-where-i-was-surprised">Specific points where I was surprised</h3>

<ul>
  <li>I assumed LLMs would be poor at using Rust due to there not being a large amount of Rust training data… I was wrong?
    <ul>
      <li>To be fair, I am using quite basic Rust and nothing too crazy</li>
    </ul>
  </li>
  <li>The code produced by LLMs is not crazy spaghetti
    <ul>
      <li>It’s not perfect code; I often have to suggest better abstractions or pull code out into the common shared package, but that’s fine</li>
    </ul>
  </li>
  <li>It’s surprisingly good at handling niche crates like testcontainers and tonic/prost, which I’m pretty sure don’t have a huge amount of usage data. My assumption about lack of training data is being slowly chipped away.</li>
  <li>I did barely any coding by hand.</li>
</ul>

<hr />

<h2 id="claude-code-setupusage-notes">Claude Code setup/usage notes</h2>

<p>Nothing fancy yet and no multiple sub-agents:</p>

<ul>
  <li>Just basic prompting through the CLI after using <code class="language-plaintext highlighter-rouge">/plan</code> mode</li>
  <li>Often restarting sessions to clear context</li>
  <li>Trying to get the LLM to use beads_rust, but it doesn’t really seem to be listening</li>
  <li>Being conservative with Bash script permissions just in case</li>
  <li>Being quite strict about human in the loop and reviewing the code it generates</li>
  <li>Scoping tasks quite tightly so context doesn’t bloat</li>
</ul>

<p><strong>Plugins:</strong></p>

<ul>
  <li><a href="https://github.com/actionbook/rust-skills">rust-skills</a>
    <ul>
      <li>Kinda freaky how easy it was to add this from the Claude Code CLI, considering prompt injection risks</li>
      <li>Made sure to read through the skills to ensure they’re not malicious, but the more I use Claude Code, the more lax I will likely become</li>
    </ul>
  </li>
</ul>

<p><strong>Skills:</strong></p>

<ul>
  <li>Create Skills meta skill from Anthropic</li>
  <li>Beads Rust skill that I made Claude create</li>
</ul>

<hr />

<h2 id="why-write-about-it">Why write about it</h2>

<p>Even as I heard good things about using LLMs for projects, I was never really able to see the outcomes. Most people work on closed source codebases at their jobs, so I couldn’t see the actual code being produced or how they had their tools set up — the prompting strategies, the configurations, the actual workflow. So this blog post is my attempt at showing the process for others who may be skeptical or curious. If there are transparent examples (actual complex codebases and well documented workflows), please let me know. I am eager to learn.</p>

<hr />

<h2 id="progress-log">Progress Log</h2>

<p>I’ve also started keeping ADRs (Architecture Decision Records) in the repo to capture the why behind technical choices, and tagging milestones like <code class="language-plaintext highlighter-rouge">v0.1-identity-auth</code> so I can easily diff what changed between blog posts.</p>

<hr />

<h2 id="whats-next">What’s next</h2>

<p>In Part 2, I want to push Koupang further: I plan to work on the catalog service next and see if the LLM can keep up as the codebase grows. Stay tuned.</p>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[Background]]></summary></entry><entry><title type="html">Is this working?</title><link href="https://jyc11.github.io/blog/2024/11/19/test" rel="alternate" type="text/html" title="Is this working?" /><published>2024-11-19T00:00:00+00:00</published><updated>2024-11-19T00:00:00+00:00</updated><id>https://jyc11.github.io/blog/2024/11/19/test</id><content type="html" xml:base="https://jyc11.github.io/blog/2024/11/19/test"><![CDATA[<p>Very original and interesting take here.</p>

<h2 id="subheading">Subheading</h2>

<ul>
  <li>Bullet points</li>
  <li>More points</li>
</ul>

<ol>
  <li>Numbered lists</li>
  <li>Work too</li>
</ol>]]></content><author><name>Jaeyoon Cho</name></author><category term="blog" /><summary type="html"><![CDATA[Very original and interesting take here.]]></summary></entry></feed>