
Ask HN: How do you separate intentional test boilerplate from real duplication?

8 points, posted 2 days ago

Item id: 48574082

7 Comments

bilbo-b-baggins

22 minutes ago

I would say that since your focus is on structural or programmatic detection, and not LLM heuristics, the problem depends a lot on the language.

In Rust or Go there are very clear test markers or filenames.

In JavaScript it would have to detect the framework in use, then detect test files and tests embedded in program files.

And so forth.

Are you doing any call-sequencing heuristics? For example, if the same 5 calls (with different args) appear in the same order in multiple places (even in test files), that might be a strong signal for deduplication. The same goes if the same 5 calls appear in the same order with a couple of different calls interleaved. The fuzziness of the heuristic might be something tunable to a language, a particular codebase, a framework, etc.
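To make the call-sequencing idea concrete, here is a rough sketch, not anything Dupehound is known to do. It assumes Python source as input, reduces each function to the ordered list of call names (arguments ignored), and compares those lists with `difflib.SequenceMatcher`. The names `call_sequences` and `similar_pairs`, and the `min_calls` and `threshold` knobs, are made up for illustration.

```python
import ast
from difflib import SequenceMatcher


class _CallCollector(ast.NodeVisitor):
    """Collects call names in a deterministic pre-order traversal."""

    def __init__(self):
        self.names = []

    def visit_Call(self, node):
        target = node.func
        if isinstance(target, ast.Name):
            self.names.append(target.id)        # plain call: foo(...)
        elif isinstance(target, ast.Attribute):
            self.names.append(target.attr)      # method call: obj.foo(...)
        else:
            self.names.append("<dynamic>")      # e.g. a call on a subscript or lambda
        self.generic_visit(node)                # keep walking into arguments


def call_sequences(source):
    """Map each function name in `source` to its ordered list of call names."""
    sequences = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            collector = _CallCollector()
            collector.visit(node)
            sequences[node.name] = collector.names
    return sequences


def similar_pairs(sequences, min_calls=5, threshold=0.8):
    """Return (func_a, func_b, ratio) for pairs whose call order is similar.

    `threshold` is the tunable fuzziness: 1.0 demands identical sequences,
    lower values tolerate interleaved or missing calls.
    """
    names = sorted(sequences)
    hits = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            seq_a, seq_b = sequences[a], sequences[b]
            if min(len(seq_a), len(seq_b)) < min_calls:
                continue  # too short to be a meaningful signal
            ratio = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
            if ratio >= threshold:
                hits.append((a, b, ratio))
    return hits


# Hypothetical input: two functions share five calls in the same order,
# one of them with an extra interleaved call; the third is unrelated.
SAMPLE = '''
def check_alice():
    c = connect("db1")
    login(c, "alice")
    rows = fetch(c, "users")
    parse(rows)
    close(c)

def check_bob():
    c = connect("db2")
    login(c, "bob")
    log("logged in")
    rows = fetch(c, "orders")
    parse(rows)
    close(c)

def report():
    data = load("file")
    cleaned = strip(data)
    table = tabulate(cleaned)
    render(table)
    save(table, "out")
'''

if __name__ == "__main__":
    for a, b, ratio in similar_pairs(call_sequences(SAMPLE)):
        print(f"{a} ~ {b}: {ratio:.2f}")
```

Raising `threshold` toward 1.0 stops tolerating the interleaved `log` call, which is roughly the per-codebase tuning described above.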

peterabbitcook

an hour ago

I’ve dealt with a question that rhymes with this.

SonarQube or CodeQL reports might tell me what percentage of a repo is duplicated code, and a large percentage of that is in src/test/java.

I find that a lot of the time this is not just some flippant observation but a clue that I should be using a mechanism like @ParameterizedTest instead of @Test, so I rewrite those tests in a way that makes it easier to set them up and to define parameters/constraints, inputs, and outputs. Sometimes it does get a little convoluted, as you either use a lot of naked Arguments.of() calls or define test-class-scoped nested records to encapsulate test parameters, inputs, expected outputs, etc.

dezgeg

2 hours ago

Detect tests somehow (e.g. in Rust you could check for #[test]) and just skip the analysis for that function?
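A minimal sketch of that idea, assuming the detector is a Python script scanning Rust source as text. It finds `#[test]` (and path-qualified forms such as `#[tokio::test]`) and uses naive brace counting to find where the function ends. Braces inside strings or comments would confuse it, so a real tool would want a proper parser. The function names here are hypothetical.

```python
import re

# Matches #[test] and path-qualified variants like #[tokio::test].
TEST_ATTR = re.compile(r"^\s*#\[(?:\w+::)*test\]")


def rust_test_spans(source):
    """Return 1-based inclusive (start, end) line spans of #[test] functions.

    Naive: counts braces line by line and ignores strings and comments.
    """
    lines = source.splitlines()
    spans = []
    i = 0
    while i < len(lines):
        if not TEST_ATTR.match(lines[i]):
            i += 1
            continue
        depth = 0
        opened = False
        j = i
        while j < len(lines):
            depth += lines[j].count("{") - lines[j].count("}")
            if "{" in lines[j]:
                opened = True
            if opened and depth == 0:
                break  # closing brace of the function body
            j += 1
        end = min(j, len(lines) - 1)  # clamp if the body never closes
        spans.append((i + 1, end + 1))
        i = end + 1
    return spans


def is_skipped(line_number, spans):
    """True if a 1-based line number falls inside any test function span."""
    return any(start <= line_number <= end for start, end in spans)


# Hypothetical Rust input: one regular function, one test function.
RUST_SAMPLE = """\
fn add(a: i32, b: i32) -> i32 {
    a + b
}

#[test]
fn adds() {
    assert_eq!(add(1, 2), 3);
}
"""

if __name__ == "__main__":
    print(rust_test_spans(RUST_SAMPLE))
```

A duplicate detector could then drop any finding whose lines all satisfy `is_skipped`, which skips only the marked functions rather than whole files.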

rafaepta

an hour ago

Yeah, that is pretty much what it does already: it tries to recognize test files and skip them. Dupehound is available for 12 languages today.

Some languages, like Rust, which you mentioned, have a clear tag that says "this is a test," but others do not, so the tool has to guess from file names and ends up missing some tests and skipping too much elsewhere.

Also, as I mentioned in the answer below, sometimes you actually do want to see the repeats inside tests, and normal code sometimes repeats on purpose too. So I am leaning toward letting users wave off one specific case by hand instead of skipping everything blindly.
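One way the per-case waiver could work, sketched in Python purely as an illustration. This is not Dupehound's actual format or API. The assumption is that each finding gets a short fingerprint of its whitespace-normalized snippet, and the user records the fingerprints they have reviewed and accepted. `Duplicate`, `fingerprint`, and `filter_waived` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Duplicate:
    """A reported duplicate (hypothetical shape)."""
    snippet: str      # the repeated code
    locations: tuple  # (path, line) pairs where it occurs


def fingerprint(snippet):
    """Short, stable ID for a snippet that ignores whitespace differences."""
    normalized = " ".join(snippet.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def filter_waived(findings, waived):
    """Drop only the findings whose fingerprint the user waived by hand."""
    return [f for f in findings if fingerprint(f.snippet) not in waived]


if __name__ == "__main__":
    setup = Duplicate("client = make_client()\nclient.login()",
                      (("tests/a.py", 3), ("tests/b.py", 3)))
    logic = Duplicate("total = price * qty\ntotal += tax(total)",
                      (("src/cart.py", 10), ("src/order.py", 42)))

    # The user accepts the test setup repeat and nothing else.
    waived = {fingerprint(setup.snippet)}
    for finding in filter_waived([setup, logic], waived):
        print(finding.locations)
```

Because the waiver is keyed on the snippet rather than on a file path, reindenting the code keeps the waiver valid, while any real change to the repeated code brings the finding back for review.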
