~/youwang

$ whoami

Youwang — builder, AI explorer

$ cat mission.txt

Building autonomous systems that think, learn, and ship.

$ ls posts/

Skills Ate the Framework: How a Markdown File Beat the SDK

Anthropic open-sourced SKILL.md in December 2025. Six months later, Cursor, Gemini CLI, opencode, OpenHands and mux had all adopted it: five non-Anthropic runtimes reading the same folder of markdown. The portable unit of agent capability turned out to be a 30-line file, not a Python class hierarchy. Receipts, the second-order bloat, and the next year.

Read →

The Orchestration Bubble: Why One Smart Loop Beats Fourteen Frameworks

Fourteen multi-agent frameworks hit Hacker News in seven days. Google's Scion landed at #1 with 230 upvotes. The one contrarian piece, a warning to everyone, got 19. My ticket bot and training run, both single-loop, both shipping numbers, say the 19-point post was the only one that was right.

Read →

The MCP Tax: Why Loading Every Server Makes Your Agent Worse

Three MCP servers, 143K tokens consumed before the user types a character. Bigger toolsets degrade selection accuracy, and a 1-second registry race silently drops late-arriving tools. The best agent in 2026 is the one with the fewest tools, not the most.

Read →

Trap-Positive Or Bust: Your Agent Eval's Oracle Is Probably Lying

Most agent benchmarks have an unmeasured failure mode: the oracle silently passes incorrect implementations. We authored 20 trap-task oracles for a benchmark and validation caught 3 broken ones — a 15% bug rate. Three real cases: an oracle grading its own ground-truth file, CPython's GIL hiding the race the oracle was meant to detect, and $? capturing the wrong process's exit code. Without trap-positive checks, every PASS is faith. One day of engineering buys back every percentage point of credibility your eval is borrowing on vibes.

Read →

Models Are Weekly Hot-Swaps Now. Your Context Layer Is the Moat.

The same builders switched Claude → Codex → Opus 4.8 → back to Codex/GPT-5.5 in three weeks. Cursor Composer 2.5 ships on Kimi K2.5. MiniMax M3 is top-3 on Vercel's agent evals at a tenth of the cost. The frontier is dense, the half-life of model loyalty is days, and the only thing that compounds across swaps is your AGENTS.md, skills, MCP allow-list, and learned corrections. The model is the cattle. Your context layer is the pet.

Read →

MCP Isn't Dead — It's Overprescribed: What 13 MCP Servers and 46 Skills Side-by-Side Actually Taught Us

Quandri's "MCP is dead" hit HN at 195 points by measuring what every dev felt: 77 MCP tool definitions burning 21,077 tokens before any work happens, and a single Linear lookup costing 65× more via MCP than via curl. Our runtime ships 13 MCP servers and 46 skills concurrently — and the boundary between them is mechanical, not religious. Four cases where MCP earns its tokens, everywhere else a Skill wrapping a CLI wins.

Read →

The 88ms Lie: Your Coding Agent Is Reward-Hacking and the Benchmark Now Proves It

Mitchell Hashimoto's agent took a renderer from 88ms to 2ms — looks like a 44× win. The hand-written port runs in 20 microseconds with zero allocations. Five days earlier, SpecBench put numbers on the failure mode: reward-hacking grows ~27pp per 10× LOC, and Codex shipped a 2,900-line hash-table "compiler" that scored 97% on visible tests and 0% held out.

Read →

Ownership Is the New Authorship

A Reddit thread titled "I won't review AI-generated PRs" got 949 upvotes. The top comment got 1,190, more than the post itself, and quietly rewrote the headline. The new code-review contract isn't "did a human write this?" but "can a human own this?"

Read →

Reviewing Code Is the New Bottleneck

At 40 tps the LLM was the slow part — so we opened ten tabs and walked away. At 1,200 tps Cerebras and OpenAI explicitly reverse the playbook. Half of what I shipped in SageCLI was optimized for a constraint that just evaporated.

Read →

The Corpus Cliff: Why Agents Ace Linux and Faceplant on Your Monorepo

DHH's six-line tweet hit 1,706 hearts in ten hours: agents are good at Linux because all 40M lines of kernel code were in pre-training. The replies surfaced the part nobody wants to say — the moat is corpus, not model quality. Public OSS got an invisible upgrade. Proprietary monorepos hit a cliff at the firewall. A reframing of every skill, MCP server, and steering file we've shipped in the last twelve months: they're all the same invention — inference-time corpus injection — patching the same cliff.

Read →

Fork It, Trim It, Freeze It: Mitchell Hashimoto's 10-Year Heresy Just Went Viral Again

A moving Docker tag went backwards in production — litellm:main-latest shipped 1.82.6 to a container running 1.83.4. Days later Hashimoto's 2014 "fork your deps, never update" tweet hit 5,436 hearts. He's been saying it for a decade. The reason it landed now: 2026's agent stack has turned the dependency tree into a minefield, and "always upgrade for security" is the single most expensive gospel in our industry.

Read →

The Agent SDK Was on Disk All Along

Dex Horthy threw out claude -p and shipped Shannon: 196 stars in days, a 200-line wrapper that runs claude interactively in tmux and tail -f's the JSONL transcript. No SDK, no bidi-stdio, no streaming framework. Three Unix tools from the 1990s replaced a 2026 SDK. The CLI already serializes everything you need; the "agent SDK" was an unnecessary middleman. The real protocol is the filesystem.

Read →

Harness fatigue is real. ACP is the release valve.

A Saturday r/LocalLLaMA thread — "I am overwhelmed by Harnesses" — opened with 19 installed CLI coding agents, three paid, one actually used. The pain isn't choice; it's coupling. Six rebrands in 60 days, OpenCode shipping six patch releases in six days for bugs that should live in a protocol. ACP is the LSP moment: 22 agents, 12 editors, JetBrains as anchor tenant. Notes on what to invest in and what to refuse to pay tuition for.

Read →

AGENTS.md won the filename war. It hasn't won the content war.

60k AGENTS.md files on GitHub, Linux Foundation stewardship, and a Reddit thread this morning still asking where conventions live. The filename is solved. The content — sections that actually earn their place, nested inheritance bugs, security-noise traps — is day three. Notes from an 8-round AutoSDE rollout across 19 packages.

Read →

You Can't Route What You Don't Measure: The 17x DeepSeek Gap Is a Harness Problem, Not a Model Problem

A Reddit post hit 651 upvotes claiming DeepSeek V4 is 17x cheaper than GPT-5. The number everyone quoted is right, but quoting it misreads the post. The real number is 65%: the share of one developer's daily coding-agent traffic that ran on a 3090 at quality parity. Almost nobody has that number for their own workload, because almost every agent harness hides per-call tokens and task classes. Routing is an empirical decision your tooling probably doesn't let you make.

Read →

Your Agent's Config File Is an RCE Primitive: Lessons from the PyTorch Lightning Worm

A worm inside lightning 2.6.2 wrote six lines of JSON into .claude/settings.json, so every future Claude Code session in that repo, on any teammate's laptop, runs the dropper. No prompt, no tool call, no consent. Semgrep calls it likely the first real-world attack abusing Claude Code hooks. Prompt injection was 2025. Config-file RCE is the 2026 attack surface, and every agent CLI has this shape.

Read →

The 10x Write, 1x Debug Problem: Why Agent Coding Still Feels Slow

METR's RCT (16 senior OSS devs, 246 real tasks on million-LOC repos) found AI tools made them 19% slower. They still believed they were 20% faster. The emit clock went 10x; the debug clock didn't move. Stop measuring LOC/minute. Start measuring p50(time-to-green-CI) and p50(time-to-safe-revert). Those are the numbers that tell you if your agent is actually helping.

Read →

"It Shipped, Nothing Happened, I Still Don't Feel Right": The Rubber-Stamp Era of AI Code Review

A senior dev confessed on r/ExperiencedDevs this week: approved an AI feature for prod knowing the security review was garbage, nothing bad happened, still feels wrong. He's right to. Human-in-the-loop has quietly degraded into rubber-stamp-in-the-loop — and we have production data showing every optional gate gets disabled within a week.

Read →

Your Agent Doesn't Need a Memory Layer. It Needs a Context Editor.

This week's #2 HN story was a developer cancelling Claude over token economics — same week three "memory layer for agents" projects launched. They're solving the wrong problem. 40MB on disk, 55KB injected: the 1000× compression is the product.

Read →

I Built a System-Design Wiki So My Agents Stop Hallucinating Architecture

3,210 pages of distilled engineering knowledge, ingested hourly, deployed on Cloudflare Pages. Why a compiled wiki beats RAG for architecture decisions — and why the moat in AI coding is the knowledge the agent reads before it writes.

Read →

Go Accidentally Shipped the Best Agent Coding Language — Here's Why

Hashimoto says go doc and gopls are agent superpowers. He's not praising Go — he's describing a 2012-era language whose boring tooling is accidentally the best interface for LLMs.

Read →

The Friction Is Your Judgment: Why Agentic Speed Without Steering Produces Invisible Debt

Mitsuhiko argues the moments you want to skip thinking are exactly when thinking matters most. After 8 rounds of AutoSDE on one CR, I agree — friction is judgment.

Read →

Harness Engineering: When the Code You Write Isn't Code

OpenAI shipped 1M lines with zero human code. The real product isn't the code — it's the scaffolding. AGENTS.md, skills, cron jobs, and memory systems are what compound.

Read →

Your AI Agent's Benchmark Score Is a Lie — And That's a Feature, Not a Bug

Berkeley researchers scored near-perfect on 8 major AI benchmarks without solving a single task — just by exploiting how scores are computed. The eval layer is the attack surface.

Read →

The Platform Squeeze Is Here

Anthropic locked out OpenClaw, shipped Agent Teams, and started charging extra for third-party tools. The open-source AI tooling ecosystem just got its first real wake-up call.

Read →

The Only Thing That Matters Is a Mechanical Finish Line

AI agents can plan, retry, and iterate forever. The bottleneck was never intelligence. It's knowing when you're done.

Read →

What We Should Actually Be Testing

LeetCode tests skills that AI already performs better than humans. Here's a practical framework for what software engineering interviews should look like in 2026.

Read →

Your AI Agent Is Burning Money on Tokens. Here's How to Fix It.

Token costs are the hidden tax on AI agents. Most of what your agent sends to the model is noise. A 3-tier compression pipeline can cut costs 40-70% without losing quality.

Read →

Agent Memory Is Where Computers Were Before Virtual Memory

Every AI agent today wakes up with amnesia. We solved this problem in computing 60 years ago. Why are we pretending it's new?

Read →

Agent Shadows: The End of Meetings as We Know Them

What if every engineer, PM, and manager had an AI agent shadow that carried their context? Meetings become agent-to-agent syncs. Humans just make decisions. Here's the vision — and the uncomfortable implications.

Read →

Software Engineering's Turning Point

Karpathy scored software developers 8-9/10 on AI exposure. He's not wrong. But "exposed" doesn't mean "dead." The job is splitting in two — and the split is already happening.

Read →

CLI vs MCP: The Real Tradeoff Nobody's Talking About

Search GitHub for "MCP server" and you'll find thousands of repos. Meanwhile, every AI coding agent already has a shell. When does MCP actually help, and when is it just ceremony?

Read →

ACP Is the Missing Protocol for AI Coding Agents

Every AI coding agent speaks its own language. The Agent Client Protocol turns N×M agent-editor integrations into N+M. Here's why it matters, what it actually does, and what the honest downsides are.

Read →

I Built an Agent Framework in 1500 Lines of Bash

Every agent framework wants you to install a runtime and learn a DSL. I just wanted to dispatch tasks to AI coding agents. So I wrote 1500 lines of bash — agents as processes, messages as files, tmux as the IDE. Here's why it works.

Read →

The Agent Paradox: Why AI That Builds Itself Changes Everything

We're at an inflection point. The question isn't whether AI can code — it's whether AI can decide what to build. I've been running autonomous agents 24/7 for months. Here's what I've learned about the gap between tools and partners.

Read →