The Many SHAs of a GitHub Pull Request

52 pointsposted 6 days ago
by yamrzou

8 Comments

timvdalen

2 days ago

Ran into one of these when debugging an action workflow a while ago. Locally, all the tests would pass, but the tests kept failing on GitHub.

Diving into the logs, the SHA that the tests were run against was not the commit we were testing locally, but the "merge candidate" that had apparently auto merged some lines of code that resulted in a syntax error.

Looking back, it makes perfect sense that _that_ is what tests are run against, but I had never realized that before (and the platform definitely hides it from you). Before we moved to GitHub actions, we were just running tests against the head of a feature branch.

OJFord

2 days ago

> the platform definitely hides it from you

I'm not saying it's obvious, but pull_request workflows don't run if there's a merge conflict (because it can't generate this merge candidate) which is a pretty decent pointer once you hit that.

timvdalen

2 days ago

For sure! In this case there wasn't an actual conflict, just a bad merge, and I just had not considered this at all before.

OptionOfT

2 days ago

I struggled with this when trying to build a system where the output of a PR is labeled in such a way that once it is merged, it can be relabeled without rebuilding.

I ended up taking the commit SHAs of parents of the virtual merge commit and label my containers with that.

Then, when the PR is merged in, I have the same parent SHAs, and I can now move the label from a PR candidate to what is now on main.

For provenance reasons I only want to build the code once. If the code doesn't change the checksum of the binary shouldn't.

fragmede

2 days ago

except that usually timestamps get embedded into something somewhere, so the binary's checksum does change with each build.

klodolph

2 days ago

Practically speaking that’s often true, but there are also projects where someone has figured out deterministic builds. There’s usually not a technical reason why you can’t have a deterministic build, but rather, it’s a question of whether you care to spend the resources making it happen (usual answer is “no”).

crabbone

2 days ago

GitHub Actions still do it very wrong, especially in the only useful case of rebases.

Here's my story: for a particular repository I was asked to add a GitHub Action that would prune all the output from Python notebooks, purge the history of ever having any outputs and push the new history into the target branch. The premise here was that eventually the branches with commits containing extra notebook content would be deleted and GC'd, so they didn't need special processing.

The problem here is that GitHub doesn't set base correctly when a PR is from the branch to itself. It just sets it to all zeros hash... someone didn't test the edge case over there...

Anyways. This isn't really a problem. This is just me trying to cope with a much larger problem with how GitHub CI works. Ideally, if I were to design my own system (and I've done this twice so far), any tests, any PR-associated work need to happen in the candidate commit that is about to be pushed into the target branch. This candidate is produced by rebasing the source of the PR onto the target of the PR (i.e. it's still on the source of the PR branch). Once the tests pass, the generated commit is pushed onto the target branch and the target's branch head is updated.

Instead, GitHub rebases, pushes the generated commit from the source branch into the target, and runs the PR-associated CI code afterwards. This is bad, because especially if the PR-associated code wants to process PR history, it needs to generate a new PR... so, anything like enforcing code conventions, removing potentially sensitive information etc. is... impossible, because the action exposes this information too late.