Today I learned that some people make mistakes, but I already knew that ;) ServerAliveInterval doesn't do this properly and consistently.
I've used my own autossh-type script for two decades now. It's mostly used to give access to machines behind shitty NAT, and/or that have addresses that constantly change, and/or for systems on CGNAT, like Starlink.
If ServerAliveInterval works so well and negates the need for something like autossh to exist, then why have sessions created by my script, which sets ServerAliveInterval (and ServerAliveCountMax), hung to the point where the script has to kill the old ssh connection and create a new one now and then? My script logs each timeout, each session hang, and each new connection, and depending on the network, it can happen often.
Please read the bit where it's explained how autossh sends test data back and forth. Do you think you just magically and cleverly discovered ServerAliveCountMax and that the autossh people have no idea that it exists?
Or perhaps they know it exists, they know it's not perfect, and they used another mechanism to make up for the shortcomings of what ssh offers out of the box?
The README has this text:
> For example, if you are using a recent version of OpenSSH, you
> may wish to explore using the ServerAliveInterval and
> ServerAliveCountMax options to have the SSH client exit if it
> finds itself no longer connected to the server. In many ways
> this may be a better solution than the monitoring port.
Just to clarify that we're talking about the same thing in case I misunderstood something: autossh (style) scripts do these things:
1. send fake data to keep a connection "fresh" for shitty middleware
2. detect connections which are stuck (state = open, but no data can actually round trip) and kill them
3. restart ssh when that happens
Is that what we're talking about here? I think people are saying that points 1 and 2, but not 3, are covered by SSH's ServerAlive* options. And that's also how OpenSSH advertises and documents those options, and apparently even how autossh talks about it in their own readme.
You're saying that those options don't actually solve points 1 and 2, while (your/their/etc) autossh does properly detect it.
Correct so far?
If so that seems like a bug in OpenSSH (or whatever implementation) which should get appropriate attention upstream. Has anyone reported this upstream? Is there a ticket to follow?
PS: I think we're all in agreement that point 3 is out of scope for stock OpenSSH (regardless of what other tools do).
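To make the split concrete, here is a rough sketch of what "ServerAlive* for points 1 and 2, a wrapper for point 3" could look like. This is not autossh's implementation or anyone's actual script from this thread; `build_ssh_command` and `supervise` are names made up for illustration, and the interval, count, and backoff values are arbitrary. It assumes ssh really does exit once its keepalives go unanswered, which is exactly the behavior under dispute here.

```python
import subprocess
import time


def build_ssh_command(host, interval=15, count_max=3, extra_args=()):
    """Build an ssh argv that should exit by itself when keepalives go unanswered."""
    return [
        "ssh",
        "-N",  # no remote command, forwarding only
        "-o", f"ServerAliveInterval={interval}",   # seconds between keepalive probes
        "-o", f"ServerAliveCountMax={count_max}",  # unanswered probes before ssh gives up
        "-o", "ExitOnForwardFailure=yes",          # fail fast if a forward cannot be set up
        *extra_args,
        host,
    ]


def supervise(argv, max_launches=None, backoff=5.0,
              run=subprocess.call, sleep=time.sleep):
    """Point 3: relaunch ssh every time it exits. Returns the number of launches.

    `run` and `sleep` are injectable only so the loop can be exercised without
    a real ssh binary; max_launches=None means "restart forever".
    """
    launches = 0
    while max_launches is None or launches < max_launches:
        launches += 1
        status = run(argv)  # blocks until ssh exits
        print(f"ssh exited with status {status} (launch {launches})")
        if max_launches is None or launches < max_launches:
            sleep(backoff)  # avoid hammering the server in a tight loop
    return launches


# Hypothetical usage (host and forward are placeholders):
# supervise(build_ssh_command("user@example.invalid",
#                             extra_args=["-R", "2222:localhost:22"]))
```

Note that this loop only ever restarts ssh after ssh itself decides to exit. If the session wedges without ssh noticing, nothing here catches it.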
I haven’t revisited this issue in years, but on a project with thousands of similar devices we found autossh much more reliable.
I believe the issue is that the connections often fail or get wedged in other network layers. The only way to be sure that your ssh tunnel isn’t a) just healthy enough to pass keepalives but too lossy to send data, or b) perpetually waiting on TCP retry backoff, or c) etc., is to use the tunnel to transmit actual data at the application level.
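A minimal sketch of that kind of application-level check, under some loud assumptions: something on the far side of the forward echoes bytes back (a hypothetical echo service, not anything ssh provides), and the local forwarded port is known. The function name `tunnel_round_trip_ok` is made up for illustration. The idea is similar in spirit to the autossh monitoring port, but this is not its code.

```python
import os
import socket


def tunnel_round_trip_ok(host, port, timeout=5.0):
    """Send a random nonce through the forwarded port and require it back in time.

    Returns False on connection failure, timeout, early close, or a wrong echo.
    Assumes an echo service sits at the other end of the tunnel.
    """
    nonce = os.urandom(16)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(nonce)
            received = b""
            while len(received) < len(nonce):
                chunk = sock.recv(len(nonce) - len(received))
                if not chunk:  # peer closed before echoing everything
                    return False
                received += chunk
            return received == nonce
    except OSError:  # covers refused connections and socket timeouts
        return False


# Hypothetical usage: if this returns False, kill the ssh process and reconnect.
# healthy = tunnel_round_trip_ok("127.0.0.1", 2222, timeout=5.0)
```

A supervising script could run this on a timer and kill ssh when it fails, so a session that is "open" but cannot round-trip data gets replaced even if ssh's own keepalives never trip.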