Questionable Practices in Machine Learning

6 points | posted 17 hours ago
by beckhamc

1 comment

beckhamc

16 hours ago

I'm not sure if the paper mentions this directly, but I didn't see anything specifically about the conflation of the validation set and the test set. When people actually draw a distinction between the two (which seems increasingly uncommon), model selection is supposed to happen on the validation set, i.e. you find the hyperparameters that minimise `loss(model, valid_set)`. Once you've found your most performant model by that criterion, you evaluate it on the test set once, and that single number is your unbiased estimate of generalisation error.

Since the ML community (and reviewers) are obsessed with "SOTA", "novelty", and bold numbers, a table composed purely of test set numbers is not easily controllable (when you're trying to be ethical) from the point of view of actually "passing" peer review. Conversely, a table full of validation set numbers is easily controllable: just keep tuning against the validation set until your model posts higher numbers than everything else. Even simpler, why not just ditch the distinction between the validation and test set to begin with? (I'm joking, btw.) Now you see the problem.
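For what it's worth, here's a minimal sketch of the intended protocol (the dataset, model, and hyperparameter grid are placeholders, and scikit-learn is assumed): hyperparameters are chosen only against the validation split, and the test split is evaluated exactly once on the final chosen model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = rng.randn(1000, 20), rng.randint(0, 2, 1000)  # placeholder dataset

# 60/20/20 train/validation/test split
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_model = -np.inf, None
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter search (placeholder grid)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))  # model selection on the validation set only
    if score > best_score:
        best_score, best_model = score, model

# The test set is touched once, for the final selected model.
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(f"valid={best_score:.3f}  test={test_score:.3f}")
```

The point is that the loop above is allowed to be as aggressive as you like, because it never sees the test split; the single test number at the end is what's reportable.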