Giving an AI Agent a Memory: A Self-Curating Knowledge Base for Game Development

Why I stopped letting my AI assistant re-learn the whole codebase every morning — and built it a memory that curates itself.

The problem I kept hitting

On a fast-moving commercial Unity mobile-game project, the hardest part of plugging in an AI assistant wasn't the model — it was the context. Every session, the assistant started from zero: re-reading the codebase, re-deriving how a feature worked, often landing on a slightly different answer than yesterday. Meanwhile the actual knowledge was scattered across three places that quietly contradicted each other — daily commits, code-review notes, and ticket specs — and the game designers, who couldn't read C#, still needed to know "what does this feature touch, and is this request even feasible?"

I didn't want a smarter prompt. I wanted the system to have a memory that stayed current on its own.

The bet: three systems, one knowledge flywheel

Instead of one monolith, I split the work across three cooperating systems:

  • Game Source — the game client. It only produces material: a daily code review that emits pure "code does X + file:line" facts, with no opinions.
  • Game Agent — the brain. It curates that raw material into a knowledge base, answers feasibility questions, and is deliberately kept outside the game repo so the whole mechanism is portable to the next project.
  • Task — a self-built ticketing system. Requests the agent works out flow into it as tickets; completed tickets flow back into curation.

That last loop matters: specs feed in, completed work feeds back, and the knowledge base gets stronger the longer the project runs. A flywheel, not a snapshot.

The non-obvious trade-offs

A few decisions went against my first instinct, and those are the ones worth writing down.

I didn't use RAG. The reflexive move is to chunk everything, embed it, and retrieve by similarity. I tried that on a predecessor system and it was a poor fit for game-dev data: structured specs got flattened into soup, superseded discussion got confused with final decisions, and "who decided this" became guesswork. So I moved the hard work from query-time to write-time. When the agent curates, it preserves structure and writes clean, human-readable knowledge pages — each with machine-readable frontmatter, a plain-prose reading layer, and footnotes citing exact file.cs:line. One file, three audiences, no forking.
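To make the "one file, three audiences" idea concrete, here is a minimal sketch of how such a page could be split back into its three layers. The page layout is an assumption: the frontmatter delimiter, the field names (id, tier, status), the footnote syntax, and the sample file names are all hypothetical, since only the three layers themselves are described above.

```python
import re
from dataclasses import dataclass, field

# Hypothetical page layout. Only the three layers (frontmatter, prose,
# file.cs:line footnotes) come from the article; every name here is invented.
SAMPLE_PAGE = """\
---
id: inventory-stacking
tier: 1
status: final
---
Items of the same type stack up to a per-item limit.[^1]
The limit is read from item config at load time.[^2]

[^1]: Inventory.cs:42
[^2]: ItemConfig.cs:17
"""

# Matches footnote lines such as "[^1]: Inventory.cs:42".
FOOTNOTE = re.compile(r"^\[\^(\w+)\]:\s*(\S+\.cs):(\d+)\s*$")


@dataclass
class KnowledgePage:
    meta: dict                                     # machine-readable layer
    prose: str                                     # human reading layer
    citations: dict = field(default_factory=dict)  # footnote id -> (file, line)


def parse_page(text: str) -> KnowledgePage:
    """Split one page into frontmatter, prose, and file:line citations."""
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        raise ValueError("page must start with a frontmatter block")
    end = lines.index("---", 1)  # closing delimiter; raises ValueError if missing
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    prose, citations = [], {}
    for line in lines[end + 1:]:
        match = FOOTNOTE.match(line)
        if match:
            citations[match.group(1)] = (match.group(2), int(match.group(3)))
        else:
            prose.append(line)
    return KnowledgePage(meta, "\n".join(prose).strip(), citations)


page = parse_page(SAMPLE_PAGE)
```

Because the structure is fixed when the page is written, a reader, the agent, and a CI check can all consume the same file without a separate copy for each.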

The LLM never infers authority. Knowledge from a real code review (Tier 1) always outranks a ticket spec (Tier 2), regardless of date — a two-month-old ticket should never overwrite a fact I just read from live code. But the key discipline is that humans and rules decide the tiers and the "final decision" status; the model only executes and writes structured content correctly. It never guesses which source wins. That makes the whole pipeline predictable and auditable — when something's wrong, it's a rule that's wrong, not the model's mood that day.
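The tier rule is small enough to express as plain code, which is the point: nothing in it needs a model. The sketch below is a hypothetical illustration, not the project's actual resolver. The Fact fields, the resolve() helper, the sample claims, and the "newest wins within a tier" tie-break are my assumptions; only the ordering of Tier 1 over Tier 2 regardless of date comes from the text above.

```python
from dataclasses import dataclass
from datetime import date

# Tier numbers follow the article (1 = code review, 2 = ticket spec).
# The Fact fields, sample data, and tie-break rule are hypothetical.


@dataclass(frozen=True)
class Fact:
    claim: str
    tier: int       # assigned by rule at ingestion, never by the model
    observed: date
    source: str


def resolve(candidates: list[Fact]) -> Fact:
    """Pick the winner: lowest tier first, newest date only as a tie-break."""
    if not candidates:
        raise ValueError("no candidate facts")
    return min(candidates, key=lambda f: (f.tier, -f.observed.toordinal()))


review = Fact("Stack limit is 99", tier=1, observed=date(2024, 5, 1), source="Inventory.cs:42")
old_ticket = Fact("Stack limit is 50", tier=2, observed=date(2024, 3, 1), source="ticket A")
new_ticket = Fact("Stack limit is 75", tier=2, observed=date(2024, 5, 10), source="ticket B")

# The code-review fact wins against both the older and the newer ticket.
winner_vs_old = resolve([old_ticket, review])
winner_vs_new = resolve([new_ticket, review])
```

If a result looks wrong, the thing to inspect is the sort key, which is a single auditable line.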

I built my own ticketing instead of reaching for Jira or Trello. Game-content tickets are highly custom and need to plug into the agent flow. I kept the ticketing system intentionally minimal and fully decoupled — it holds no LLM logic at all. All the intelligence stays on the agent side; the ticketing layer just persists and tracks work.

There's also a routing index the agent uses for retrieval. I made it impossible to hand-edit: it's regenerated from each page's frontmatter, and CI fails if it drifts. A routing table that can rot silently is worse than no routing table.
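The regenerate-and-compare check can be sketched in a few lines. This is an illustration under assumptions: the index format (a sorted JSON list), the frontmatter fields (id, topics), and the file paths are hypothetical, and the frontmatter is taken as already parsed.

```python
import json

# Hypothetical shapes: the article does not show the index format.
# `pages` maps a page path to its already-parsed frontmatter.


def build_index(pages: dict[str, dict]) -> str:
    """Render the routing index deterministically from page frontmatter."""
    entries = [
        {"id": meta["id"], "path": path, "topics": sorted(meta.get("topics", []))}
        for path, meta in pages.items()
    ]
    entries.sort(key=lambda e: e["id"])  # stable order, so diffs are meaningful
    return json.dumps(entries, indent=2, ensure_ascii=False) + "\n"


def check_drift(pages: dict[str, dict], committed_index: str) -> int:
    """Return an exit code: 0 if the committed index is current, 1 if it drifted."""
    return 0 if build_index(pages) == committed_index else 1


pages = {
    "kb/shop.md": {"id": "shop-pricing", "topics": ["shop", "economy"]},
    "kb/inventory.md": {"id": "inventory-stacking", "topics": ["inventory"]},
}
fresh = build_index(pages)
hand_edited = fresh.replace("economy", "monetization")  # simulate a manual edit
```

In a CI job, the script would read the committed index from disk and pass the return value of check_drift() to sys.exit(), so any hand edit or stale index fails the build.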

Making it runnable

Design notes are easy to wave at; running code is harder to fake. So alongside the write-up, I packaged a de-identified starter kit — three folders mirroring the three systems, with the portable pieces actually runnable: regenerate the routing index from sample knowledge pages, dry-run the config deploy, run the daily code-review collector. The content is entirely synthetic; the mechanism is real.

What I'd take to the next project

The lesson that generalized best wasn't a clever prompt or a framework. It was this: the scarce resource in an agent system is the signal-to-noise ratio of what enters the knowledge base. I once planned to auto-convert design spreadsheets into AI-readable docs, measured a ~30% useful conversion rate, and cut the whole feature. Knowing when not to add a source turned out to matter as much as any algorithm.

Give your agent a memory — but be ruthless about what you let it remember.


The design showcase lives at GamePlusAIAgent (de-identified). A runnable starter kit mirrors the three systems in a separate repo.


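I had noticed this article a while ago and made a mental note that this is an issue to watch once I start using codex.

OpenAI Codex reportedly killing your SSD: 37 TB written in 21 days, burning through the drive's lifespan in under a year

This was probably caused by using an agent for day-to-day tasks. According to the news report, the WebSocket logging was recording everything down to trace level. A plain coding workflow should not get that bad. I checked with GPT, which described the usage pattern as: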

watch -> trace -> background task
similar to codex agent --loop


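This chapter upgrades emotion "classification" to "structured output".

Stage 1 returned only a single word (for example 喜悅, "joy"). Stage 2 has to return a JSON object with emotion (category), intensity (1-5), trigger (the triggering phrase), and reasoning (the analysis) fields, as in this training sample, which is kept in its original Chinese: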

{
  "instruction": "分析以下文本的情緒,輸出 JSON:emotion(類別)、intensity(1-5)、trigger(觸發詞)、reasoning(分析過程)",
  "input": "他竟然在背後說我壞話,我真的快氣炸了",
  "output": {
    "emotion": "憤怒",
    "intensity": 4,
    "trigger": "背後說壞話",
    "reasoning": "「竟然」表示出乎意料的憤慨,「快氣炸了」是強烈憤怒的慣用表達"
  }
}
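Once the model returns an object instead of a single word, it is worth checking each response against the field contract before scoring it. The validator below is a minimal sketch, not part of the original training pipeline: the function name and error messages are hypothetical, and the label set is assumed to be the six classes that appear in the evaluation results.

```python
import json

# Assumed label set: the six classes reported in the evaluation results.
# The validator and its error messages are hypothetical.
EMOTIONS = {"喜悅", "悲傷", "憤怒", "恐懼", "驚訝", "厭惡"}
REQUIRED = {"emotion", "intensity", "trigger", "reasoning"}


def validate_output(raw: str) -> dict:
    """Parse a model response and check it against the Stage 2 field contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["emotion"] not in EMOTIONS:
        raise ValueError(f"unknown emotion: {data['emotion']!r}")
    intensity = data["intensity"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(intensity, bool) or not isinstance(intensity, int):
        raise ValueError("intensity must be an integer")
    if not 1 <= intensity <= 5:
        raise ValueError("intensity must be between 1 and 5")
    for key in ("trigger", "reasoning"):
        if not isinstance(data[key], str) or not data[key].strip():
            raise ValueError(f"{key} must be a non-empty string")
    return data


sample = (
    '{"emotion": "憤怒", "intensity": 4, "trigger": "背後說壞話", '
    '"reasoning": "「快氣炸了」是強烈憤怒的慣用表達"}'
)
result = validate_output(sample)
```

Responses that fail validation can be counted separately from wrong labels, which keeps format errors from being mistaken for classification errors.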

Chain-of-thought (CoT) needs to be added along the way. Why?


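Previously I used kimi to generate about 200 samples. This time I raised the data volume to 500-800 samples, then trained and tested again. The results:

Macro F1 = 0.74 (across three rounds: 0.62 → 0.57 → 0.74), Accuracy 0.76.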

class            precision  recall  f1-score  support  note
joy (喜悅)       0.90       0.90    0.90      21
sadness (悲傷)   1.00       0.80    0.89      20
anger (憤怒)     0.93       0.59    0.72      22       ← recovered from recall 0
fear (恐懼)      0.74       0.95    0.83      21       ← recovered from recall 0.11
surprise (驚訝)  0.67       0.29    0.40      14       ← new weak spot
disgust (厭惡)   0.54       0.90    0.68      21
macro avg        0.80       0.74    0.74      119
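The macro average is the unweighted mean of the per-class scores, so the weak 驚訝 (surprise) class, with only 14 samples, pulls the headline number down as much as any larger class would. The sketch below recomputes the summary row from the per-class figures in the table. The accuracy step is an approximation I am adding: it reconstructs the number of correct predictions per class as recall × support, rounded, because the reported recalls are themselves rounded to two decimals.

```python
# Per-class (precision, recall, f1, support), copied from the table above.
REPORT = {
    "喜悅": (0.90, 0.90, 0.90, 21),
    "悲傷": (1.00, 0.80, 0.89, 20),
    "憤怒": (0.93, 0.59, 0.72, 22),
    "恐懼": (0.74, 0.95, 0.83, 21),
    "驚訝": (0.67, 0.29, 0.40, 14),
    "厭惡": (0.54, 0.90, 0.68, 21),
}


def macro_avg(report: dict, column: int) -> float:
    """Unweighted mean of one metric column across all classes."""
    return sum(row[column] for row in report.values()) / len(report)


macro_precision = macro_avg(REPORT, 0)
macro_recall = macro_avg(REPORT, 1)
macro_f1 = macro_avg(REPORT, 2)

total = sum(row[3] for row in REPORT.values())

# Approximation: correct predictions per class = recall * support, rounded.
correct = sum(round(recall * support) for _, recall, _, support in REPORT.values())
accuracy = correct / total

print(f"macro P/R/F1: {macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")
print(f"accuracy: {correct}/{total} = {accuracy:.2f}")
```

This reproduces the reported summary: macro precision 0.80, macro recall 0.74, macro F1 0.74, and 91 of 119 correct, which is an accuracy of about 0.76.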
