Improving an Emotion Classification Model with Qwen2.5-1.5B + QLoRA: Structured Output, CoT Reasoning, and JSON Evaluation
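This section upgrades the emotion task from plain classification to structured output.
In Stage 1 the model returned a single word (for example 喜悅, "joy"). In Stage 2 it must return a JSON object. A training record looks like this, with the target object in its `output` field: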
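```json
{
  "instruction": "分析以下文本的情緒,輸出 JSON:emotion(類別)、intensity(1-5)、trigger(觸發詞)、reasoning(分析過程)",
  "input": "他竟然在背後說我壞話,我真的快氣炸了",
  "output": {
    "emotion": "憤怒",
    "intensity": 4,
    "trigger": "背後說壞話",
    "reasoning": "「竟然」表示出乎意料的憤慨,「快氣炸了」是強烈憤怒的慣用表達"
  }
}
```

In English: the instruction asks the model to analyze the emotion of the text and output JSON with `emotion` (category), `intensity` (1-5), `trigger` (trigger phrase), and `reasoning` (analysis). The input reads "He actually bad-mouthed me behind my back, I am about to explode with anger". The output labels it 憤怒 (anger) at intensity 4, triggered by 背後說壞話 (bad-mouthing behind one's back), reasoning that 竟然 signals unexpected indignation and 快氣炸了 is an idiom for strong anger.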
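The schema above implies a few checks that can be run on any output string before it is used for training or scoring. The following is a minimal sketch, not part of `finegrained.py`: the function name `validate_output` is hypothetical, and the label set is an assumption built only from the four emotions that appear in this article.

```python
import json

# Hypothetical label set; the article only shows 喜悅, 憤怒, 悲傷 and 中性.
ALLOWED_EMOTIONS = {"喜悅", "憤怒", "悲傷", "中性"}
REQUIRED_KEYS = ("emotion", "intensity", "trigger", "reasoning")


def validate_output(raw: str) -> list[str]:
    """Return a list of problems found in one JSON output string (empty = valid)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in obj]
    if "emotion" in obj and obj["emotion"] not in ALLOWED_EMOTIONS:
        problems.append(f"unknown emotion: {obj['emotion']!r}")
    intensity = obj.get("intensity")
    # bool is a subclass of int, so exclude it explicitly.
    if "intensity" in obj and (
        isinstance(intensity, bool)
        or not isinstance(intensity, int)
        or not 1 <= intensity <= 5
    ):
        problems.append(f"intensity must be an integer in 1-5, got {intensity!r}")
    return problems
```

An empty list means the record matches the schema; anything else is a human-readable list of what to fix.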
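Why add chain-of-thought (CoT) reasoning to this process? A surface reading can miss the emotion behind a flat or ambiguous sentence, and reasoning about the cues before committing to a label can catch it: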
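```text
# Without CoT
input: "還不錯吧,就這樣" → "中性" ❌ (may be masked sadness)
# With CoT (reason first, then decide)
reasoning: "語氣平淡但含「吧」,暗示不確定或壓抑,結合「就這樣」的放棄感"
output: {"emotion": "悲傷", "intensity": 2, "trigger": "就這樣(放棄感)"} ✓
```

The input roughly means "It's alright, I guess. That's that." The reasoning notes that the tone is flat, that the particle 吧 hints at uncertainty or suppression, and that 就這樣 carries a sense of giving up, so the label moves from 中性 (neutral) to 悲傷 (sadness).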
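One detail worth checking: a causal language model emits tokens left to right, so the reasoning only precedes the verdict if `reasoning` is the first key in the serialized JSON. The sample record earlier in this section lists it last. A minimal sketch of a reason-first serializer follows, assuming that ordering is what is wanted; the helper name `to_reason_first` and the key order are hypothetical and not part of `finegrained.py`.

```python
import json

# Hypothetical key order: reasoning first so it is generated before the label.
REASON_FIRST_ORDER = ("reasoning", "emotion", "intensity", "trigger")


def to_reason_first(target: dict) -> str:
    """Serialize a target dict with `reasoning` as the first key."""
    ordered = {k: target[k] for k in REASON_FIRST_ORDER if k in target}
    # Keep any extra keys after the known ones instead of dropping them.
    ordered.update({k: v for k, v in target.items() if k not in ordered})
    return json.dumps(ordered, ensure_ascii=False)
```

Whether this ordering improves accuracy is an empirical question that would have to be measured on the test set.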
Step one: upgrade the Stage 1 data into structured data.
Run the following:

```bash
python finegrained.py build \
  --in ../Week1-DataEngineering/data/synth.jsonl \
  --out data/train_struct.jsonl
```
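```python
import json  # module-level import; _read_jsonl, _write_jsonl and STRUCT_INSTRUCTION are not shown in this section


def cmd_build(args):
    """Convert {text, emotion, intensity, trigger[, reasoning]} records into the Stage 2 instruction format."""
    out = []
    for s in _read_jsonl(args.infile):
        target = {
            "emotion": s["emotion"],
            "intensity": int(s.get("intensity", 3)),  # falls back to 3 when missing
            "trigger": s.get("trigger", ""),
            "reasoning": s.get("reasoning", ""),  # left empty if absent; better filled in at synthesis time
        }
        out.append({
            "instruction": STRUCT_INSTRUCTION,
            "input": s["text"],
            "output": json.dumps(target, ensure_ascii=False),  # stored as a JSON string
        })
    _write_jsonl(args.out, out)
    print(f"✓ 轉換 {len(out)} 條 → {args.out}")  # "converted N records"
    no_reason = sum(1 for r in out if not json.loads(r["output"])["reasoning"])
    if no_reason:
        # Warning: N records lack reasoning; backfill CoT with a larger model before training.
        print(f"⚠ {no_reason} 條缺 reasoning,建議用大模型補 CoT 後再訓練")
```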
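Next, fine-tune the structured-output model.
The training flow is the same QLoRA recipe used in Week 2; the only difference is that each record's `output` is a JSON string.

```bash
python finegrained.py train \
  --data data/train_struct.jsonl \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --out ./emotion-struct-lora
```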
```python
def cmd_train(args):
    """Structured-output training: the core QLoRA flow is shared with Week 2."""
    import torch
    from datasets import Dataset
    from peft import LoraConfig
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)
    from trl import SFTConfig, SFTTrainer

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # 4-bit NF4 quantization with double quantization (the "Q" in QLoRA).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model, quantization_config=bnb_config, device_map="auto"
    )

    # Render each record as a two-turn chat; the assistant turn is the JSON string.
    rows = _read_jsonl(args.data)
    texts = []
    for r in rows:
        messages = [
            {"role": "user", "content": f"{r['instruction']}\n\n文本:{r['input']}"},
            {"role": "assistant", "content": r["output"]},
        ]
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    dataset = Dataset.from_dict({"text": texts})

    # Note: `max_seq_length` and `processing_class` follow the TRL release this
    # was written against; other TRL releases may name these arguments differently.
    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(
            output_dir=args.out,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,  # effective batch size 4 x 4 = 16 per device
            num_train_epochs=3,
            learning_rate=2e-4,
            fp16=True,
            gradient_checkpointing=True,
            warmup_ratio=0.03,
            lr_scheduler_type="cosine",
            logging_steps=10,
            save_strategy="epoch",
            max_seq_length=args.max_seq_len,
            dataset_text_field="text",
        ),
        train_dataset=dataset,
        peft_config=LoraConfig(
            r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
            lora_dropout=0.05, task_type="CAUSAL_LM",
        ),
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(args.out)
    print(f"✓ adapter 已存到 {args.out}")  # "adapter saved to ..."
```
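The functions in this section read `args.infile`, `args.out`, `args.data`, `args.model`, `args.adapter`, and `args.max_seq_len`, but the command-line wiring is not shown. A minimal sketch with `argparse` follows, assuming one subcommand per function. The function name `build_parser`, the help strings, and every default value (in particular `512` for `--max-seq-len`) are assumptions, not values taken from `finegrained.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring for the build / train / eval subcommands."""
    parser = argparse.ArgumentParser(prog="finegrained.py")
    sub = parser.add_subparsers(dest="command", required=True)

    p_build = sub.add_parser("build", help="convert Stage 1 data to Stage 2 format")
    # `in` is a Python keyword, so the value is stored as args.infile.
    p_build.add_argument("--in", dest="infile", required=True)
    p_build.add_argument("--out", required=True)

    p_train = sub.add_parser("train", help="QLoRA fine-tuning")
    p_train.add_argument("--data", required=True)
    p_train.add_argument("--model", default="Qwen/Qwen2.5-1.5B-Instruct")
    p_train.add_argument("--out", default="./emotion-struct-lora")
    # Assumed default; the article does not state the value of max_seq_len.
    p_train.add_argument("--max-seq-len", dest="max_seq_len", type=int, default=512)

    p_eval = sub.add_parser("eval", help="structured-output evaluation")
    p_eval.add_argument("--adapter", required=True)
    p_eval.add_argument("--model", default="Qwen/Qwen2.5-1.5B-Instruct")
    p_eval.add_argument("--data", required=True)
    return parser
```

The `dest="infile"` mapping is the one detail the article's code does confirm, since `cmd_build` reads `args.infile` while the command line passes `--in`.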
Next comes fine-grained evaluation (MAE + token F1 + JSON parse rate).
Structured-output quality is measured along three dimensions:

```bash
python finegrained.py eval \
  --adapter ./emotion-struct-lora \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --data data/test_struct.jsonl
```
```python
def cmd_eval(args):
    """Report JSON parse rate, emotion accuracy, intensity MAE, and trigger token F1."""
    import torch
    from peft import PeftModel
    from sklearn.metrics import mean_absolute_error
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        args.model, device_map="auto", torch_dtype=torch.float16
    )
    model = PeftModel.from_pretrained(model, args.adapter)  # attach the LoRA adapter
    model.eval()

    rows = _read_jsonl(args.data)
    parsed_ok = 0
    emo_correct = 0
    true_int, pred_int = [], []
    trig_f1s = []
    for r in rows:
        gold = json.loads(r["output"])
        messages = [{"role": "user", "content": f"{r['instruction']}\n\n文本:{r['input']}"}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        gen = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pred = safe_json(gen)
        if pred is None:
            continue  # unparseable output is excluded from the other metrics
        parsed_ok += 1
        if pred.get("emotion") == gold["emotion"]:
            emo_correct += 1
        try:
            # Convert both values before appending so the two lists stay the
            # same length even when the predicted intensity is not a number.
            gold_intensity = int(gold["intensity"])
            pred_intensity = int(pred.get("intensity", 3))
        except (TypeError, ValueError):
            pass
        else:
            true_int.append(gold_intensity)
            pred_int.append(pred_intensity)
        trig_f1s.append(token_f1(pred.get("trigger", ""), gold.get("trigger", "")))

    n = len(rows)
    print(f"JSON 解析成功率:{parsed_ok / n:.2f}({parsed_ok}/{n})")  # JSON parse rate
    if parsed_ok:
        # Accuracy is computed over successfully parsed samples only.
        print(f"情緒分類 Accuracy:{emo_correct / parsed_ok:.2f}")
    if true_int:
        print(f"強度 MAE:{mean_absolute_error(true_int, pred_int):.2f}")  # intensity MAE
    if trig_f1s:
        print(f"觸發詞 Token F1:{sum(trig_f1s) / len(trig_f1s):.2f}")  # trigger token F1
```
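#### The three things evaluated:
- Intensity regression MAE: `mean_absolute_error(true_intensity, pred_intensity)`
- Trigger token-level overlap F1: F1 over the tokens shared by the predicted trigger and the gold trigger
- JSON parse rate: the share of model outputs that `json.loads` can parse successfully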
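`cmd_eval` calls `safe_json` and `token_f1`, which are not shown in this section. Below is a minimal sketch of how they could work. Two choices here are assumptions rather than the author's confirmed implementation: the parser is lenient and extracts the span from the first `{` to the last `}`, and tokens are individual non-whitespace characters, which is one common choice for Chinese text.

```python
import json
import re
from collections import Counter


def safe_json(text: str):
    """Return the JSON object embedded in `text` as a dict, or None."""
    # Greedy match from the first "{" to the last "}" tolerates extra text
    # or code fences around the object.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 using non-whitespace characters as tokens (an assumption)."""
    pred_tokens = [c for c in str(pred) if not c.isspace()]
    gold_tokens = [c for c in str(gold) if not c.isspace()]
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared character at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

As a worked case, `token_f1("說壞話", "背後說壞話")` has 3 shared characters out of 3 predicted and 5 gold, so precision is 1.0, recall is 0.6, and F1 is 0.75.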
**Target metrics**:
| Metric | Target |
|---|---|
| Intensity MAE | ≤ 0.8 |
| Trigger token F1 | ≥ 0.60 |
| JSON parse rate | ≥ 0.90 |
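**Sample expected output** (the lines are, in order: JSON parse rate, emotion classification accuracy, intensity MAE, trigger token F1; the last line reads "all three metrics meet their targets"):

```text
JSON 解析成功率:0.94(47/50)
情緒分類 Accuracy:0.82
強度 MAE:0.71
觸發詞 Token F1:0.63
✓ 三項指標皆達標
```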
---
### Acceptance criteria
- JSON parse rate ≥ 0.90
- Intensity MAE ≤ 0.8
- Trigger token F1 ≥ 0.60
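`cmd_eval` as shown prints the metrics but does not compare them with these thresholds. A minimal sketch of such a check follows, assuming the metrics are collected into a dict. The function name `check_targets` and the key names are hypothetical; only the threshold values come from the criteria above.

```python
# Thresholds copied from the acceptance criteria; key names are hypothetical.
TARGETS = {
    "json_parse_rate": (">=", 0.90),
    "intensity_mae": ("<=", 0.8),
    "trigger_token_f1": (">=", 0.60),
}


def check_targets(metrics: dict) -> dict:
    """Map each metric name to True if it meets its threshold, else False."""
    results = {}
    for name, (op, threshold) in TARGETS.items():
        value = metrics[name]
        # MAE is an error, so lower is better; the other two are higher-is-better.
        results[name] = value >= threshold if op == ">=" else value <= threshold
    return results
```

With the sample values from this section (parse rate 0.94, MAE 0.71, token F1 0.63), `all(check_targets(...).values())` evaluates to `True`.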