惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

WordPress大学
WordPress大学
Stack Overflow Blog
Stack Overflow Blog
MongoDB | Blog
MongoDB | Blog
小众软件
小众软件
U
Unit 42
S
SegmentFault 最新的问题
A
About on SuperTechFans
T
Tailwind CSS Blog
Hugging Face - Blog
Hugging Face - Blog
H
Help Net Security
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
Recorded Future
Recorded Future
V
Visual Studio Blog
G
Google Developers Blog
The GitHub Blog
The GitHub Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
I
InfoQ
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Y
Y Combinator Blog
博客园 - 司徒正美
量子位
美团技术团队
云风的 BLOG
云风的 BLOG
B
Blog RSS Feed
酷 壳 – CoolShell
酷 壳 – CoolShell
D
Docker
J
Java Code Geeks
B
Blog
L
LangChain Blog
博客园 - 叶小钗
雷峰网
雷峰网
博客园_首页
F
Fortinet All Blogs
Recent Announcements
Recent Announcements
Google DeepMind News
Google DeepMind News
The Cloudflare Blog
Engineering at Meta
Engineering at Meta
有赞技术团队
有赞技术团队
H
Hackread – Cybersecurity News, Data Breaches, AI and More
GbyAI
GbyAI
Blog — PlanetScale
Blog — PlanetScale
Microsoft Azure Blog
Microsoft Azure Blog
阮一峰的网络日志
阮一峰的网络日志
P
Proofpoint News Feed
博客园 - 聂微东
腾讯CDC
T
The Blog of Author Tim Ferriss
罗磊的独立博客
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
博客园 - 三生石上(FineUI控件)

cs.CL updates on arXiv.org

Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations Self-Calibrating Language Models via Test-Time Discriminative Distillation HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling Human vs. Machine Deception: Distinguishing AI-Generated and Human-Written Fake News Using Ensemble Learning Weird Generalization is Weirdly Brittle Mirroring Minds: Asymmetric Linguistic Accommodation and Diagnostic Identity in ADHD and Autism Reddit Communities Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text Relational Probing: LM-to-Graph Adaptation for Financial Prediction CodeComp: Structural KV Cache Compression for Agentic Coding FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness Comparative Analysis of Large Language Models in Healthcare Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation A Structured Clustering Approach for Inducing Media Narratives NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection Turing or Cantor: That is the Question NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning Instruction Data Selection via Answer Divergence EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval BlasBench: An Open Benchmark for Irish Speech Recognition OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts Evaluating Memory Capability in Continuous Lifelog Scenario Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts Evaluating Cooperation in LLM Social Groups through Elected Leadership A Triadic Suffix Tokenization Scheme for Numerical Reasoning Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models Anthropogenic Regional Adaptation in Multimodal Vision-Language Model METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities Linear Representations of Hierarchical Concepts in Language Models TInR: Exploring Tool-Internalized Reasoning in Large Language Models Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models LLMs Should Incorporate Explicit Mechanisms for Human Empathy ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning The Amazing Agent Race: Strong Tool Users, Weak Navigators Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities CircuitSynth: Reliable Synthetic Data Generation ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension CoSToM:Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks Cross-Cultural Value Awareness in Large Vision-Language Models Should We be Pedantic About Reasoning Errors in Machine Translation? Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards COMPOSITE-Stem GIANTS: Generative Insight Anticipation from Scientific Literature Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering Generating High Quality Synthetic Data for Dutch Medical Conversations Generative UI: LLMs are Effective UI Generators LETGAMES: An LLM-Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment Seven simple steps for log analysis in AI systems H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration LABBench2: An Improved Benchmark for AI Systems Performing Biology Research MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval Reasoning Models Will Sometimes Lie About Their Reasoning StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models Multi-Model Synthetic Training for Mission-Critical Small Language Models KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling Echoes of Automation: The Increasing Use of LLMs in Newsmaking What Factors Affect LLMs and RLLMs in Financial Question Answering? Aligning What LLMs Do and Say: Towards Self-Consistent Explanations Tuning Language Models for Robust Prediction of Diverse User Behaviors If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs Preference Learning Unlocks LLMs' Psycho-Counseling Skills MegaFake: A Theory-Driven Dataset of Fake News Generated by Large Language Models Language Reconstruction with Brain Predictive Coding from fMRI Data Template-assisted Contrastive Learning of Task-oriented Dialogue Sentence Embeddings
RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
Jan Cegin, Branislav Pecher, Ivan Srba, Jakub Simko · 2025-10-08 · via cs.CL updates on arXiv.org

LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.