
(210918) Diary: GitHub Copilot & OpenAI Codex

210918 Diary GPT
I majored in CS as an undergraduate and worked on DL-based NLP in a Data Mining lab, and yet I am not all that fond of coding. Recently, OpenAI released Copilot, a code autocompletion tool built by fine-tuning GPT on GitHub code. In the demo videos it handles even fairly complex logic quite well, so I am honestly a little hopeful that a model that writes my code for me will arrive before long. Together with a close friend who shares that hope, I took a brief look at the paper on the model behind Copilot.

(OpenAI, 2021) Evaluating Large Language Models Trained on Code

The paper's contributions can be summarized in two main points.
• It proposes Codex, a GPT model fine-tuned on GitHub code that completes Python code from a docstring, and
• it releases HumanEval, a new evaluation set of programming problems with unit tests that check whether the code written by the model actually works.
On HumanEval, GPT-3 and GPT-J solve only 0% and 11.4% of the problems, respectively, whereas Codex-S solves 37.7%. When 100 code samples are generated per problem, at least one correct sample is found for 77.5% of the problems.
Evaluation: Functional Correctness
Generating Python code from a docstring can be viewed as a kind of translation task, but BLEU score is not a suitable evaluation metric here, because what matters for code is whether it actually runs correctly. The paper therefore proposes HumanEval, a hand-written evaluation set that checks whether the generated code is functionally correct, together with the pass@k metric. pass@k generates k code samples for a problem and counts the problem as solved if at least one of them is correct. To obtain an unbiased estimate of this quantity, the authors compute it as follows.

$$\text{pass@}k := \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$
Here n is the number of generated samples (n ≥ k) and c is the number of correct samples among them. To interpret this, take k = 5 (pass@5) and suppose 30 samples are generated (n = 30), of which 10 are correct (c = 10). Among all the ways of choosing 5 of the 30 samples, consider those in which all 5 come from the 20 incorrect samples (that is, only wrong answers are picked). pass@5 is 1 minus the probability of that event.
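The worked example above can be checked with a few lines of Python. This is a minimal sketch of the per-problem estimate only, not the authors' implementation: the function name `pass_at_k` is my own, and it uses exact binomial coefficients from the standard library, which is fine for small n but may not be how one would compute it at scale.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of incorrect samples;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # The example from the text: n=30 samples, c=10 correct, k=5.
    print(f"pass@5 = {pass_at_k(30, 10, 5):.4f}")
```

For n = 30, c = 10, k = 5 this gives 1 - 15504/142506, roughly 0.89. The benchmark-level pass@k would then be the average of this value over all problems.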
Proposed Model: Codex
The model itself does not seem to have anything particularly special about it. Fine-tuning from a pre-trained GPT gives no performance advantage over training from scratch, which the authors attribute to the sheer size of the fine-tuning dataset. They still chose to start from the pre-trained model because training converges faster that way. They also mention adding extra tokens to handle the runs of whitespace that occur so often in code.
Experiments & Results
The experiments vary the number of code samples (k), the sampling temperature used at generation time, the model size, and so on. For a given model, the solve rate rises as more code samples are generated, and as k increases it pays to set a larger temperature, which gives higher diversity among the generated samples (see the table below).
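To make the link between temperature and diversity concrete, here is a small sketch of temperature-scaled softmax. The logits and the two temperature values are made up for illustration, and the helper names are my own. Nothing here comes from the paper's code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token scores
    for t in (0.2, 0.8):  # illustrative temperatures
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At the lower temperature almost all probability mass sits on the top token, so repeated samples tend to look alike. At the higher temperature the distribution is flatter, which is what lets a large batch of samples cover more distinct solutions.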
As in other domains, performance on code generation improves as the model size grows. When the top K of the generated samples have to be selected, as in a real service, the paper reports that ranking by mean token log probability works well. Finally, the following table shows the probability density of correct and incorrect samples as a function of BLEU score. Since the two show no particular correlation with the score, using BLEU as the evaluation metric for code generation is, as noted earlier, not appropriate.
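The ranking by mean token log probability mentioned above can be sketched as follows. The candidate strings and their per-token log-probabilities are invented for illustration, and the function names are hypothetical. A real system would take the log-probabilities from the model's output.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability of one generated sample."""
    if not token_logprobs:
        raise ValueError("a sample needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def rank_by_mean_logprob(samples, top_k=1):
    """Return the top_k (code, token_logprobs) pairs, highest mean first."""
    ranked = sorted(samples, key=lambda s: mean_logprob(s[1]), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical samples: (generated code, per-token log-probabilities).
    candidates = [
        ("return a + b", [-0.1, -0.2, -0.1]),
        ("return a - b  # unsure", [-0.9, -1.2, -0.8, -1.1, -1.0]),
        ("return sum([a, b])", [-0.3, -0.4, -0.2, -0.5]),
    ]
    best_code, _ = rank_by_mean_logprob(candidates, top_k=1)[0]
    print(best_code)
```

One reason to average rather than sum is that a plain sum penalizes longer samples simply for having more tokens, regardless of how confident the model was about each one.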