I would say since your focus is on structural or programmatic detection, and not LLM heuristics, the problem depends on language a lot.
In Rust or Go there’s super clear test markers or filenames.
In Javascript it would have to detect the framework in use then detect test files and tests embedded in program files.
And so forth.
Are you doing any call sequencing heuristics? Like if the same 5 calls (with different args) appear in the same order in multiple places (even in test files) that might be a strong signal for deduplication. Or even if the same 5 calls are in the same order with a couple different interleaved calls - the fuzziness of the heuristic might be something tunable to a language, or particular codebase, or framework, etc.
I’ve dealt with a question that rhymes with this.
Sonarqube or CodeQL reports might tell me what percentage of a repo is duplicated code, and a large percentage of that is in src/test/java
I find that a lot of the time this is not just some flippant observation but a clue that I should be using a mechanism like @ParameterizedTest instead of @Test, so I rewrite those tests in a way that makes them easier to set-up, define parameters/constraints, inputs, and outputs. Sometimes it does get a little convoluted as you either use a lot of naked Arguments.of() or define test-class-scoped nested records to encapsulate test parameters, inputs, expected outputs, etc.
Detect tests somehow (eg. in rust you could check for #[test]) and just skip the analysis for that function?
Yeah, that is pretty much what it does already: it tries to recognize test files and skip them. Dupehound is available for 12 languages Today.
Some languages like RUst you mentioned, have a clear tag that says "this is a test," but others do not, so the tool has to guess from file names and ends up missing some and skipping too much.
Also as I mentioned on the answer below, sometimes you actually do want to see the repeats inside tests, or normal code repeats on purpose too. So I am leaning toward letting users wave off one specific case by hand instead of skipping everything blindly.
Maybe I don’t quite understand the question but can you not just define a function that sets up the shared test state and use that in every test?
What is a "structural detector"?
clicking through to the repo linked at the end it appears to be rolling-hash-style ast structural pattern matching that ignores things like what names identifiers concretely have