Abstract: We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
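
For readers unfamiliar with DPO, the following is a minimal sketch of the per-pair loss as it is commonly stated in the DPO literature. It is not taken from this paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions,
    not values reported in the abstract above.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one. No separate reward model or sampling loop appears in this objective, which is the usual sense in which DPO is said to simplify the training pipeline.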
From: Dezhi Yu
[v1]
Thu, 11 Jun 2026 04:15:54 UTC (170 KB)
[v2]
Fri, 12 Jun 2026 06:20:00 UTC (170 KB)