

















Abstract: We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at this https URL.
| Comments: | ACL 2026. 19 pages, 7 figures, 11 tables |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2503.08600 [cs.CL] (or arXiv:2503.08600v3 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2503.08600 (arXiv-issued DOI via DataCite) |
From: Weiqiu You
[v1] Tue, 11 Mar 2025 16:35:08 UTC (4,023 KB)
[v2] Sat, 15 Mar 2025 21:25:43 UTC (4,023 KB)
[v3] Mon, 25 May 2026 18:03:22 UTC (4,058 KB)