
(210821) Review: Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

210821 Review Sentence Embedding Domain Adaptation
This paper, presented at NAACL 2021, reads like a sequel to Sentence BERT (SBERT). Like SBERT, it is easy to follow and looks genuinely useful in practice.

Problem: Bi-Encoders

In Pairwise Sentence Scoring Tasks, such as computing the similarity of a given sentence pair, a BERT model trained as a Cross-Encoder performs well, but its Computational Cost grows in proportion to num(sentences)^2. Sentence BERT (SBERT) mitigates this problem by Fine-Tuning in a Bi-Encoder setup, but it needs a large amount of Training Data to reach performance comparable to a Cross-Encoder. The table below shows that, when trained on the same amount of Training Data, the Bi-Encoder performs worse than the Cross-Encoder.
This paper proposes a method (or model) that improves the Bi-Encoder by augmenting its Training Data with a Cross-Encoder. Experiments show the effectiveness of the approach on four different Tasks, and the paper further leaves room for applying it to Domain Adaptation.
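To make the quadratic cost of the Cross-Encoder concrete, here is a small back-of-the-envelope sketch. It assumes every unordered pair among n sentences is scored exactly once; the function names are my own and do not come from the paper.

```python
def cross_encoder_passes(num_sentences: int) -> int:
    """Forward passes needed to score every unordered sentence pair."""
    return num_sentences * (num_sentences - 1) // 2


def bi_encoder_passes(num_sentences: int) -> int:
    """Forward passes needed to embed every sentence once.

    Pair scores then come from cheap vector comparisons (e.g. cosine similarity).
    """
    return num_sentences


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, cross_encoder_passes(n), bi_encoder_passes(n))
```

For 10,000 sentences this works out to 49,995,000 Cross-Encoder passes versus 10,000 Bi-Encoder passes.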

Proposed Model: Augmented SBERT

The process of the proposed method is shown in the figure above. A strong Cross-Encoder is first trained on the Gold Dataset, and unlabeled sentence pairs are Soft-Labeled with it to obtain the Silver Dataset. The Silver Dataset is then added to the Gold Dataset to Fine-Tune the Bi-Encoder (Augmented SBERT). The unlabeled sentence pairs can be either new data or recombinations of sentences from the Gold Dataset; this paper chooses the latter. Using every labeled pair for Fine-Tuning may look advantageous in terms of quantity, but the authors note that it does not actually improve performance and only adds Computational Overhead. One of the points the paper emphasizes is therefore "how to sample the Silver Dataset from the unlabeled sentence pairs," and the authors propose the following Sampling techniques.
• Random Sampling (RS)
  As the name says, sentences from the Gold Dataset are drawn at random to form sentence pairs.
  However, most of the resulting pairs are Negative, so the label distribution is Skewed relative to the real data.
• Kernel Density Estimation (KDE)
  Use KDE to estimate continuous probability distributions over the Score: F_gold(s) and F_silver(s).
  Sample sentence pairs so as to reduce the KL Divergence (the difference between the two distributions) between F_gold(s) and F_silver(s), using a sampling function Q(s).
  (Example) Suppose a randomly drawn sentence pair has a Score of 2. Compare the probability of a score-2 pair in the Gold Dataset with the probability of a score-2 pair in the Silver Dataset sampled so far. → If the probability in the Silver Dataset is lower, add the pair; otherwise add it with probability F_gold(s) / F_silver(s).
  This yields a Dataset whose Score distribution resembles the real one, but it is inefficient because computation is spent on pairs that end up unused.
• BM25 Sampling (BM25)
  Use ElasticSearch to retrieve the Top k most similar sentence pairs by BM25 Score, then label them with the Cross-Encoder.
• Semantic Search Sampling (SS)
  BM25 Score treats sentences that share similar words as similar (NOT Semantic).
  To compensate, a Bi-Encoder is trained on the Gold Dataset and used to retrieve similar sentence pairs.
• BM25+Semantic Search Sampling (BM25-SS)
  Apply BM25 Sampling and Semantic Search Sampling together.
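The KDE sampling rule in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it follows the description in this review (comparing against the Silver Dataset sampled so far), uses a hand-written Gaussian KDE with a fixed bandwidth of 0.3 that I picked arbitrarily, and assumes the Cross-Encoder scores are already attached to each candidate pair. All function names are hypothetical.

```python
import math
import random


def kde_density(x, samples, bandwidth=0.3):
    """Gaussian kernel density estimate at x; 0.0 if there are no samples yet."""
    if not samples:
        return 0.0
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def kde_sample(gold_scores, scored_candidates, rng, bandwidth=0.3):
    """Keep candidates so that the silver score distribution is pulled toward the gold one.

    scored_candidates: list of (pair, score), where score is the soft label
    a Cross-Encoder would have assigned (supplied by the caller here).
    """
    silver, silver_scores = [], []
    for pair, score in scored_candidates:
        f_gold = kde_density(score, gold_scores, bandwidth)
        f_silver = kde_density(score, silver_scores, bandwidth)
        # Q(s) = 1 if this score is under-represented in silver,
        # otherwise F_gold(s) / F_silver(s).
        accept_prob = 1.0 if f_silver <= f_gold else f_gold / f_silver
        if rng.random() < accept_prob:
            silver.append((pair, score))
            silver_scores.append(score)
    return silver


if __name__ == "__main__":
    rng = random.Random(0)
    gold = [5.0 * i / 99 for i in range(100)]  # made-up gold scores, evenly spread over 0..5
    # Made-up candidates: random pairs are mostly dissimilar, so scores pile up near 0.
    candidates = [
        ((i, i + 1), rng.uniform(0, 1) if rng.random() < 0.8 else rng.uniform(1, 5))
        for i in range(400)
    ]
    silver = kde_sample(gold, candidates, rng)
    print(len(candidates), "candidates ->", len(silver), "silver pairs")
```

Note that every candidate needs a score before it can be accepted or rejected, which is exactly the inefficiency mentioned for KDE above.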
Besides the Sampling techniques, the paper proposes Seed Optimization. Because Bi-Encoder models are more sensitive to the Random Seed than Cross-Encoder models, models are trained with 5 different Random Seeds for a limited period (Early Stopping at 20% of training steps), and only the best-performing one is trained to completion.
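Seed Optimization as described above can be sketched roughly like this. The callables `init_fn`, `train_fn`, and `eval_fn` are hypothetical placeholders for real model initialization, training, and dev-set evaluation; the sketch only shows the control flow, and it assumes training can be resumed from the early checkpoint.

```python
def seed_optimization(seeds, total_steps, init_fn, train_fn, eval_fn, early_fraction=0.2):
    """Train one model per seed for a fraction of the steps, then finish only the best.

    init_fn(seed) -> model, train_fn(model, num_steps) -> model and
    eval_fn(model) -> float are hypothetical callables supplied by the caller.
    """
    early_steps = int(total_steps * early_fraction)
    best_seed, best_model, best_score = None, None, float("-inf")
    for seed in seeds:
        model = train_fn(init_fn(seed), early_steps)
        score = eval_fn(model)  # e.g. a dev-set metric
        if score > best_score:
            best_seed, best_model, best_score = seed, model, score
    # Only the most promising run is trained for the remaining steps.
    final_model = train_fn(best_model, total_steps - early_steps)
    return best_seed, final_model


if __name__ == "__main__":
    # Toy stand-ins: a "model" is a dict, and seed 3 happens to be the lucky one.
    init = lambda seed: {"seed": seed, "steps": 0}
    train = lambda m, n: {"seed": m["seed"], "steps": m["steps"] + n}
    evaluate = lambda m: -abs(m["seed"] - 3)
    print(seed_optimization([1, 2, 3, 4, 5], 1000, init, train, evaluate))
```

With 5 seeds and a 20% cutoff, the total budget is about 5 × 0.2 + 0.8 = 1.8 full training runs instead of 5.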

Domain Adaptation with Augmented SBERT

A typical Bi-Encoder performs poorly in an Out-of-Domain setting, where the Domain of the Test Data does not match that of the Training Data. The paper proposes a Domain Adaptation technique that builds the Silver Dataset from unlabeled sentence pairs in the Target Domain and Fine-Tunes Augmented SBERT on the Silver Dataset alone.
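The difference between the in-domain recipe and the Domain Adaptation recipe comes down to which data the Bi-Encoder sees. A minimal sketch of that data flow, with the Cross-Encoder passed in as a plain callable; the function names and the toy scorer are my own assumptions, used only so the example runs:

```python
def build_silver(unlabeled_pairs, cross_encoder_score):
    """Soft-label unlabeled pairs with a trained Cross-Encoder (passed in as a callable)."""
    return [(a, b, cross_encoder_score(a, b)) for a, b in unlabeled_pairs]


def bi_encoder_training_set(gold, silver, domain_adaptation=False):
    """Select what the Bi-Encoder is fine-tuned on.

    In-domain: gold + silver. Domain adaptation: silver only, since gold
    comes from the source domain and silver from the target domain.
    """
    return list(silver) if domain_adaptation else list(gold) + list(silver)


if __name__ == "__main__":
    # Hypothetical stand-in for a Cross-Encoder trained on source-domain gold data:
    # word-overlap (Jaccard) similarity, NOT a real model.
    def toy_score(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    gold = [("how do i learn python", "best way to learn python", 1.0)]
    target_pairs = [
        ("wifi drops on ubuntu", "ubuntu wifi keeps dropping"),
        ("wifi drops on ubuntu", "how to install vim"),
    ]
    silver = build_silver(target_pairs, toy_score)
    print(bi_encoder_training_set(gold, silver, domain_adaptation=True))
```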

Experiments & Results

• In-Domain Experiments
  Model performance is evaluated on the following four Downstream Tasks and Datasets.
  ◦ Spanish-STS (STS)
  ◦ BWS (Argument Similarity)
    A Task that scores how similar the arguments in a sentence pair are, over 8 controversial topics in total.
    ▪ In-Topic: train on data from all 8 topics.
    ▪ Cross-Topic: train on data from 6 topics and test on data from the other 2 topics.
  ◦ Quora-QP (Duplicate Question Detection)
    Down-Sampled to simulate a setting with little Training Data.
  ◦ MRPC (News Paraphrase Identification)
  Looking at the experimental results above, the Cross-Encoder BERT generally performs best.
  Seed Optimization has little effect on BERT, but it brings a meaningful improvement for the Bi-Encoder SBERT. The effect is especially pronounced on Spanish-STS, which has relatively little Training Data.
  Except with Random Sampling, Augmented SBERT performs better than SBERT overall. KDE and BM25 Sampling do particularly well and even surpass BERT on some Datasets. The authors recommend BM25 Sampling because it is more efficient in terms of Computation.
  The reason the Sampling techniques differ in performance can be inferred from the figure below. Random Sampling draws pairs whose Scores are much lower than the real Score distribution. BM25 Sampling, which performs well, draws pairs whose Score distribution resembles the real one and, above all, contains a high proportion of similar sentence pairs.
• Domain Adaptation Experiments
  Datasets from 4 different Domains are used for the Duplicate Question Detection Task.
  ◦ AskUbuntu
  ◦ Quora
  ◦ Sprint
  ◦ SuperUser
  The results show that, in almost every combination, Augmented SBERT with Domain Adaptation performs better than SBERT. In some combinations it even surpasses SBERT trained on In-Domain Training Data. According to the authors, (unsurprisingly) the more general the Source Data and the more specific the Target Data, the larger the benefit of Domain Adaptation.