
(210920) Review: Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

210920 Review Domain Adaptation
Presented at ACL 2021, this paper proposes a technique that lets an LM learn representations of domain-specific data efficiently by feeding it n-gram information together with the data. I think the paper's main strengths are its simple, intuitive approach and the fact that it performs well even in low-resource settings.

Problems: Domain Adaptation

An LM trained on generic data tends to degrade on a downstream data set the more domain-specific the task is. Recent work has shown that further pre-training the LM on domain-specific data (DAPT) improves performance, but this approach has problems of its own. The authors focus on two problems with DAPT:
• It requires a large amount of domain-specific data.
• It addresses the gap between generic and domain-specific data only at the word level.
The second problem is the core subject of the paper: the authors point out that prior work concentrates only on the issue of a generic tokenizer splitting domain-specific words into multiple subwords. They argue that the domain gap arises not at the level of individual words but at the level of word sequences (n-grams) or phrases, and they present two experiments to support the claim.
The first experiment analyzes the results of classifying the IMDB data set with a fine-tuned generic RoBERTa. It computes and compares how often the examples the LM classified correctly and incorrectly (correct vs. false predictions) contain domain-specific n-grams. Domain-specific n-grams are defined as those outside the top 10K most frequent n-grams on Wikipedia pages, and the table above reports the share of such n-grams among the 1K most frequent n-grams extracted from the correct and the false predictions respectively. The misclassified examples turn out to contain more domain-specific n-grams.
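The comparison in this first experiment can be sketched roughly as follows. This is only an illustration under my own assumptions: whitespace tokenization, n-grams of length 1 to `max_n`, and the hypothetical helper names `top_k_ngrams` and `domain_specific_ratio`. The paper's exact preprocessing may differ.

```python
from collections import Counter


def top_k_ngrams(texts, k, max_n=3):
    """Return the k most frequent n-grams (as tuples) of length 1..max_n."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # assumption: plain whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(k)]


def domain_specific_ratio(texts, generic_ngrams, k=1000, max_n=3):
    """Share of the top-k n-grams of `texts` that are NOT in the generic set."""
    top = top_k_ngrams(texts, k, max_n)
    if not top:
        return 0.0
    return sum(gram not in generic_ngrams for gram in top) / len(top)


# Toy stand-ins for the Wikipedia top-10K list and the two prediction groups.
generic = set(top_k_ngrams(["the movie was good"], k=10_000, max_n=2))
correct_texts = ["the movie was good"]
false_texts = ["the movie was scary as hell"]

print(domain_specific_ratio(correct_texts, generic, max_n=2))
print(domain_specific_ratio(false_texts, generic, max_n=2))
```

In the real setting `generic` would hold the top 10K Wikipedia n-grams, and the two text lists would be the IMDB reviews the model got right and wrong.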
The second experiment uses attention maps and salience maps (table above) to observe and compare how generic RoBERTa and RoBERTa+T-DNA (trained with the proposed method, introduced below) embed data from a specific domain. Unlike RoBERTa+T-DNA, generic RoBERTa fails to pick up semantically important n-gram expressions such as "creepy animated" and "scary as hell", and it ends up with a false prediction.
With these experiments the authors back up their claim, and they propose a technique that can learn representations of domain data effectively from few domain-specific resources.

Proposed Method(Model): T-DNA

The core idea of the proposed method is to explicitly feed domain-specific n-gram information to the model together with the data during fine-tuning or task-adaptive pre-training (TAPT). The model architecture is shown in the figure above. It consists of a generic pre-trained encoder on the right that embeds the data like a standard BERT, an adaptor network in the middle that embeds the domain-specific n-grams contained in the data, and a module at the top that finally combines the two representations. The adaptor network is also a Transformer encoder, and according to the paper it has a single layer because it is trained on little data. FastText is used for the initial n-gram embeddings.
Extracting domain-specific n-grams from the data is arguably the most important part of training, and the process is as follows.
• Compute the PMI score of every pair of adjacent words in a given sentence.
• If the score of an adjacent pair is at or below a threshold, mark a delimiter between the two words.
• Treat every sequence of consecutive words with no delimiter in it as a candidate.
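The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name `extract_candidates`, the `max_n` cap, the natural-log PMI estimated from corpus counts, and the choice to keep only maximal segments of at least two words are all my own assumptions.

```python
import math
from collections import Counter


def extract_candidates(sentences, threshold, max_n=5):
    """Split tokenized sentences at weak PMI links and count the remaining segments."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def pmi(x, y):
        # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from counts.
        p_xy = bigrams[(x, y)] / total_bi
        return math.log(p_xy / ((unigrams[x] / total_uni) * (unigrams[y] / total_uni)))

    candidates = Counter()

    def flush(segment):
        # Assumption: keep only segments of 2..max_n words as n-gram candidates.
        if 2 <= len(segment) <= max_n:
            candidates[tuple(segment)] += 1

    for tokens in sentences:
        segment = list(tokens[:1])
        for x, y in zip(tokens, tokens[1:]):
            if pmi(x, y) <= threshold:  # weak association: place a delimiter here
                flush(segment)
                segment = [y]
            else:                       # strong association: extend the segment
                segment.append(y)
        flush(segment)
    return candidates


corpus = [s.split() for s in
          ["new york is big", "new york is old", "big is old", "old is big"]]
print(extract_candidates(corpus, threshold=2.0))
```

On this toy corpus only the pair "new york" has a PMI above 2.0, so it is the only candidate that survives. Lowering the threshold lets longer segments through.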
The final representation of a token i in the data is the sum of its generic encoder embedding and the adaptor network embeddings associated with it, $h_i + \sum_k g_{i,k}$, and this operation is repeated at every layer of the generic encoder.
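To make the sum $h_i + \sum_k g_{i,k}$ concrete, here is a small sketch for a single layer, using plain Python lists instead of tensors. It assumes that the n-grams associated with token i are the lexicon n-grams whose span covers position i. The helper names `match_ngrams` and `fuse` and the toy vectors are hypothetical.

```python
def match_ngrams(tokens, lexicon):
    """Return (ngram, start, end) for every lexicon n-gram that occurs in tokens."""
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            gram = tuple(tokens[start:end])
            if gram in lexicon:
                matches.append((gram, start, end))
    return matches


def fuse(hidden, ngram_vectors, matches):
    """Add to each token vector h_i the vectors g_{i,k} of the n-grams covering it."""
    fused = [list(h) for h in hidden]
    for gram, start, end in matches:
        g = ngram_vectors[gram]
        for i in range(start, end):
            fused[i] = [a + b for a, b in zip(fused[i], g)]
    return fused


tokens = ["scary", "as", "hell", "movie"]
# Toy stand-ins for the generic encoder output and the adaptor network output.
hidden = [[1.0, 0.0] for _ in tokens]
ngram_vectors = {("scary", "as", "hell"): [0.5, 0.5], ("as", "hell"): [0.0, 1.0]}

matches = match_ngrams(tokens, set(ngram_vectors))
for token, vector in zip(tokens, fuse(hidden, ngram_vectors, matches)):
    print(token, vector)
```

A token outside every matched n-gram ("movie" here) keeps its generic representation unchanged. In the model the same addition would be applied again at each encoder layer, with that layer's hidden states.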

Experiments & Results

The experiments use 8 classification tasks from 4 domains.
• Biomedical Sciences: ChemProt, RCT
• Computer Science: CitationIntent, SciERC
• News: HyperPartisan, AGNews
• Reviews: Amazon, IMDB
For the low-resource setting, RCT, AGNews, Amazon, and IMDB are downsampled.
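The review does not say how the downsampling is done, so the following is only a sketch under the assumption of uniform random sampling with a fixed seed. The function name `downsample` and the target size are hypothetical.

```python
import random


def downsample(examples, size, seed=0):
    """Return a reproducible uniform random subset of at most `size` examples."""
    examples = list(examples)
    if size >= len(examples):
        return examples
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(examples, size)


# Toy data set of (text, label) pairs.
full_train = [(f"review {i}", i % 2) for i in range(1000)]
small_train = downsample(full_train, size=50)
print(len(small_train))
```

Uniform sampling does not preserve the label distribution exactly, so a stratified variant may be preferable for imbalanced tasks.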
The experiments compare the performance of generic RoBERTa(-base) with and without the proposed method (T-DNA) applied during fine-tuning or TAPT. The results are shown in the table above: the proposed method improves performance under both fine-tuning and TAPT.
Additional experiments show that the performance gain grows gradually as the granularity of the domain-specific n-grams (the maximum n) increases, and that the proposed method still improves performance when data is plentiful (see the table above). The effect does shrink as the amount of data grows, however, because with enough data the model is likely to learn good representations through ordinary training anyway.