Discovering Lexical Gaps Using Embeddings from Multilingual LLMs
Yoonwon Jung · 2026-05-26 · via cs.CL updates on arXiv.org

Abstract: Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared the distribution of these similarities for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
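The nearest-neighbor comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline. The 2-D vectors are invented toy data standing in for contextualized embeddings. The function names (`cosine`, `nearest_neighbor_similarity`, `auc`) are hypothetical. Cosine similarity is an assumption, since the abstract does not name the similarity measure. The AUC is computed directly from the similarity-based score, not from a trained logistic classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_similarity(source_vec, target_vecs):
    """Similarity between a source word and its closest target-language word."""
    return max(cosine(source_vec, t) for t in target_vecs)

def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy target-language vocabulary (assumed, for illustration only).
target_vecs = [[1.0, 0.0], [0.0, 1.0]]

# Toy source-language words: non-gap words sit close to some target word,
# gap words do not (assumed, for illustration only).
non_gap_vecs = [[0.9, 0.1], [0.1, 0.95]]
gap_vecs = [[1.0, 1.0], [-1.0, 0.2]]

# Score each word by 1 - nearest-neighbor similarity, so that weaker
# cross-lingual alignment yields a higher "gap" score.
non_gap_scores = [1.0 - nearest_neighbor_similarity(v, target_vecs) for v in non_gap_vecs]
gap_scores = [1.0 - nearest_neighbor_similarity(v, target_vecs) for v in gap_vecs]

print("gap scores:", [round(s, 3) for s in gap_scores])
print("non-gap scores:", [round(s, 3) for s in non_gap_scores])
print("AUC:", auc(gap_scores, non_gap_scores))
```

On this toy data every gap word scores higher than every non-gap word, so the AUC is 1.0. The 0.81 and 0.76 reported in the abstract indicate that the separation on real embeddings is good but far from perfect.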
Comments: CoNLL 2026
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2605.24310 [cs.CL]
  (or arXiv:2605.24310v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2605.24310

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yoonwon Jung
[v1] Sat, 23 May 2026 00:36:53 UTC (358 KB)