Abstract: Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose
maintenance cost is established in prior work. Existing detection techniques require
running the tests (Binamungu et al., 2018-2023) or are confined to a single
organisation (Irshad et al., 2020-2022), leaving a gap: a purely static,
paraphrase-robust, step-level detector usable on any repository. We fill the gap
with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein
ratio, and sentence-transformer embeddings in a layered pipeline, released alongside
an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature
files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2
%; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top
hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs
manually labelled by the three authors under a released rubric (inter-annotator
Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with
bootstrap 95 % CIs under two protocols: the primary rubric and a score-free
second-pass relabelling. The most defensible pair-level result is near-exact matching
at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 of 0.906 is inflated
by a stratification artefact that pins recall at 1.000. Lexical baselines
(SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799, respectively. The
paper also presents a critique of Gherkin structured by the Cognitive Dimensions of
Notations (CDN) framework; eight of fourteen dimensions are rated problematic or
unsupported. The tool, corpus,
labelled pairs, rubric, and pipeline are released under permissive licences.
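
The layered idea in the abstract can be illustrated with a minimal sketch of the first two layers, exact hashing followed by a Levenshtein ratio over the surviving distinct steps. This is not the cukereuse implementation: the function names, the normalisation rule, the ratio definition (1 minus edit distance over the longer length), and the 0.85 threshold are all assumptions made for illustration. The sentence-transformer layer is left out because it needs a downloaded model.

```python
import hashlib
import re
from collections import defaultdict


def normalise(step: str) -> str:
    """Hypothetical normalisation: trim, lower-case, collapse whitespace."""
    return re.sub(r"\s+", " ", step.strip().lower())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def lev_ratio(a: str, b: str) -> float:
    """Assumed ratio: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def layered_duplicates(steps, near_threshold=0.85):
    """Return (exact_groups, near_pairs) as lists of step indices.

    Layer 1 buckets steps by a hash of their normalised text.
    Layer 2 compares one representative per bucket, so exact
    duplicates are never re-compared pairwise.
    """
    buckets = defaultdict(list)
    for idx, step in enumerate(steps):
        digest = hashlib.sha1(normalise(step).encode("utf-8")).hexdigest()
        buckets[digest].append(idx)
    exact_groups = [ids for ids in buckets.values() if len(ids) > 1]

    reps = [ids[0] for ids in buckets.values()]
    near_pairs = []
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            ratio = lev_ratio(normalise(steps[reps[i]]),
                              normalise(steps[reps[j]]))
            if ratio >= near_threshold:
                near_pairs.append((reps[i], reps[j], ratio))
    return exact_groups, near_pairs
```

Calling `layered_duplicates` on a list of step strings returns the exact-duplicate index groups and the near-duplicate index pairs with their ratios. The all-pairs loop in the second layer is quadratic in the number of distinct steps, so a corpus at the scale reported above would need blocking or indexing, which this sketch does not attempt.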
| Comments: | 39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at this https URL under Apache-2.0 |
| Subjects: | Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR) |
| ACM classes: | D.2.5; D.2.7; I.2.7 |
| Cite as: | arXiv:2604.20462 [cs.SE] |
| | (or arXiv:2604.20462v1 [cs.SE] for this version) |
| | https://doi.org/10.48550/arXiv.2604.20462 (arXiv-issued DOI via DataCite, pending registration) |
From: Ali Hassaan Mughal
[v1] Wed, 22 Apr 2026 11:44:05 UTC (240 KB)