
(210826) Review: Mitigating Political Bias in Language Models Through Reinforced Calibration

210826 Diary Controlled Text Generation
A while ago I read a paper (AAAI 2021) on adjusting political bias in GPT text generation without any update to the model's parameters. It was interesting and useful, though probably not something I'll need in the near future. Since I'll forget it if I don't write it down, here is a brief summary.
GPT์™€ ๊ฐ™์€ ๊ฑฐ๋Œ€ํ•œ ์–ธ์–ด ๋ชจ๋ธ๋“ค์€ Pre-Training ์ค‘์— ์ •์น˜์ ์œผ๋กœ ํŽธํ–ฅ๋˜๋„๋ก (Politically Biased) ํ•™์Šต๋˜์—ˆ์„ ์ˆ˜ ์žˆ๋‹ค. ์ •์น˜์  ํŽธํ–ฅ์ด๋ž€ ๊ฐ€๋ น "๋‚จ์ž๋Š” ~์ •๋‹น์„ ์ง€์ง€ํ•œ๋‹ค" ํ˜น์€ "..์ง€์—ญ ์‚ฌ๋žŒ๋“ค์€ ~์„ฑํ–ฅ์ด๋‹ค" ๋“ฑ ์ถœ์‹ ์ด๋‚˜ ๋ฐฐ๊ฒฝ ๋“ฑ์œผ๋กœ๋ถ€ํ„ฐ ํŠน์ • ์ •์น˜ ์„ฑํ–ฅ์„ ๋„๋Š” Text๋ฅผ ์ƒ์„ฑํ•˜๋Š”(์–ธ์–ด ๋ชจ๋ธ ๊ด€์ ์—์„œ) ๊ฒƒ์„ ์ผ์ปซ๋Š”๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ•ํ™” ํ•™์Šต์„ ํ†ตํ•ด GPT-2์˜ Political Bias๋ฅผ ์กฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.
๋…ผ๋ฌธ์€ Political Bias๋ฅผ ์•ผ๊ธฐํ•˜๋Š” Attribute(์†์„ฑ)๋กœ Gender, Location, Topic 3๊ฐ€์ง€๋ฅผ ์ œ์‹œํ•œ๋‹ค. ๋˜ํ•œ, Bias์˜ ์ข…๋ฅ˜๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์€ 2๊ฐ€์ง€๋กœ ์ •์˜ํ•œ๋‹ค.
• Indirect Bias: the bias of text generated from prompts containing a keyword of one of the attributes defined above (e.g., Kim Cheol-su: Gender-Male)
• Direct Bias: the bias of text generated from prompts containing a keyword plus an explicit trigger (liberal, conservative)
Indirect Bias is defined as shown above. An Option is a kind of categorical value that an attribute can take (e.g., Gender: Male or Female). In other words, the bias is defined as the distance between the texts generated from all prompts containing the "male" keyword and the texts generated from all prompts containing either the "male" or "female" keyword. If the same text is generated whether the prompt is about a man or a woman, the model can be regarded as not politically biased.
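The paper's exact formula is not reproduced here, but a minimal sketch of the idea, assuming a hypothetical leaning scorer `score_leaning` that maps a generated text to a value in [0, 1] and using a simple difference of means as the distance, might look like this:

```python
import numpy as np

def indirect_bias(texts_option, texts_all, score_leaning):
    """Sketch: indirect bias as the distance between the leaning-score
    distribution of texts generated from one option's prompts (e.g. "male")
    and that of texts generated from prompts of every option of the attribute."""
    scores_option = np.array([score_leaning(t) for t in texts_option])
    scores_all = np.array([score_leaning(t) for t in texts_all])
    # Any distributional distance would do; the absolute difference of means
    # is used here purely for illustration.
    return abs(scores_option.mean() - scores_all.mean())
```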
Direct Bias is defined as shown above (L: Liberal, C: Conservative). It is the difference in bias between the texts generated when a liberal or a conservative trigger is added to the Indirect Bias prompts. Personally, I found the absolute value notable: the goal is not to simply drive the liberal or the conservative bias toward zero, but to make the two values similar, that is, to have the model lean by a similar amount in both directions.
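Under that reading, a hedged reconstruction of the definition (with $b(\cdot)$ denoting a bias score of the texts $T_o$ generated for option $o$; this is my notation, not the paper's) would be roughly:

$$B_{\text{direct}}(o) \approx \left|\, b\!\left(T_o^{L}\right) - b\!\left(T_o^{C}\right) \right|$$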
๋ณธ ๋…ผ๋ฌธ์€ GPT-2์˜ ๋ชจ๋ธ์„ Updateํ•˜๋Š” ๋Œ€์‹ , Text Generation(Inference)์—์„œ Softmax๊ฐ’์„ ๊ฑด๋“œ๋ฆฌ๋Š” ๋ฐฉ์‹์œผ๋กœ Debiasing์„ ์ˆ˜ํ–‰ํ•œ๋‹ค! ๊ฐ•ํ™” ํ•™์Šต์„ ํ†ตํ•ด ์ •์น˜์ ์œผ๋กœ ํŽธํ–ฅ๋œ ๋‹จ์–ด์˜ ์ƒ์„ฑ ํ™•๋ฅ ์„ ์กฐ์ •ํ•˜๋Š”๋ฐ, Word Embedding ํ˜น์€ (Trained) Classifier๋ฅผ ํ™œ์šฉํ•˜์—ฌ Reward๋ฅผ ์ •์˜ํ•œ๋‹ค.
• Word Embedding Debias Gain: a large gain is assigned to words that are far from the predefined sets of liberal and conservative words and roughly equidistant from both sides.
• Classifier Debias Gain: if political bias does not exist at the word level, debiasing through word embeddings may not mean much. So, at each generation step, the paper computes (and accumulates) a gain with a pre-trained liberal-vs-conservative classifier. Equation (8) resembles cross-entropy and is optimized so that Pr(y=1) = Pr(y=0) = 0.5. (A sketch of both gains follows below.)
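A minimal sketch of the two gain ideas (the set-up below is my own: `liberal_vecs` / `conservative_vecs` are embedding matrices of the predefined word sets, and `p_liberal` is the classifier's probability that the text generated so far is liberal):

```python
import numpy as np

def embedding_debias_gain(word_vec, liberal_vecs, conservative_vecs):
    """Sketch: reward words that are far from both political word sets and
    roughly equidistant from the two sides (word-level debiasing)."""
    d_lib = np.linalg.norm(liberal_vecs - word_vec, axis=1).mean()
    d_con = np.linalg.norm(conservative_vecs - word_vec, axis=1).mean()
    distance = d_lib + d_con        # larger when far from both sides
    imbalance = abs(d_lib - d_con)  # smaller when equidistant from both sides
    return distance - imbalance

def classifier_debias_gain(p_liberal, eps=1e-8):
    """Sketch: a cross-entropy-like gain that is largest when the classifier
    is maximally uncertain, i.e. Pr(y=1) = Pr(y=0) = 0.5."""
    p = np.clip(p_liberal, eps, 1.0 - eps)
    return 0.5 * np.log(p) + 0.5 * np.log(1.0 - p)
```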
The KL divergence term marked in red is a penalty that keeps the calibrated distribution from drifting too far from the original (vanilla GPT-2) distribution.
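For completeness, a sketch of such a penalty between the calibrated and the vanilla next-token distributions (again just an illustration, not the paper's exact term):

```python
import torch

def kl_penalty(calibrated_probs: torch.Tensor,
               vanilla_probs: torch.Tensor) -> torch.Tensor:
    """Sketch: KL(calibrated || vanilla) over the next-token distribution,
    used as a penalty so the calibrated output stays close to plain GPT-2."""
    return (calibrated_probs * (calibrated_probs.log() - vanilla_probs.log())).sum()
```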
In the experiments, a large debiasing effect was observed, and there does not seem to be much of a trade-off in perplexity. Of course, PPL roughly doubles the moment debiasing is turned on, but it does not increase much beyond that.