Yeah, Zscaler Breaks SSH Too
I tried to SSH to the OpenBSD box that hosts this blog.
Nothing exotic. Just:
ssh sns@<host-ip>
It timed out.
That was odd. The machine was up. The site was serving HTTP. The address was right. Port 22 looked open. Another machine on my LAN could reach it. This was not obviously a dead host, a firewall rule, or a bad key.
My first thought was a simple one: can I reach the port at all, or can I reach it but not complete a session?
Both ping and nc succeeded: the host answered, and port 22 accepted a TCP connection. SSH still died.
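Those two checks can be sketched as a pair of shell functions. This is a minimal sketch, assuming a macOS or BSD-style ping and nc; the function names are made up for illustration and are not part of any tool.

```shell
# Sketch: separate "the host answers" from "port 22 accepts a TCP connection".
# Function names are illustrative only.

# ICMP reachability: send three echo requests, succeed if ping succeeds.
icmp_reachable() {   # icmp_reachable HOST
    ping -c 3 "$1" >/dev/null 2>&1
}

# TCP reachability: -z connects without sending data, -w 5 gives up after 5 s.
tcp_reachable() {    # tcp_reachable HOST [PORT], port defaults to 22
    nc -z -w 5 "$1" "${2:-22}" >/dev/null 2>&1
}
```

With HOST set to the server's address, something like icmp_reachable "$HOST" && tcp_reachable "$HOST" 22 && echo "reachable" covers both. Passing both only proves the TCP handshake completes. It says nothing about whether a full SSH session will.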
Verbose SSH made the failure more interesting. It reached key exchange, then stalled. Small packets were getting through. The larger packets needed for key exchange were not. Eventually it ended with this:
ssh_dispatch_run_fatal: Connection to <host-ip> port 22: Operation timed out
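To see how far the handshake got, it helps to reduce the verbose log to its last milestone. The following is a rough sketch: the function name and stage labels are my own, and the debug strings it matches are what recent OpenSSH clients print at -v, which may differ between versions.

```shell
# Sketch: report the last key-exchange milestone found in an `ssh -v` log.
# Reads the log on stdin. Stage labels are illustrative wording.
last_kex_stage() {
    awk '
        /Connection established/    { stage = "tcp connected" }
        /SSH2_MSG_KEXINIT sent/     { stage = "kexinit sent" }
        /SSH2_MSG_KEXINIT received/ { stage = "kexinit received" }
        /expecting SSH2_MSG_KEX_/   { stage = "waiting for kex reply" }
        /SSH2_MSG_NEWKEYS received/ { stage = "kex complete" }
        END {
            if (stage == "") stage = "no ssh milestones seen"
            print stage
        }
    '
}
```

ssh writes its debug output to stderr, so the usage is along the lines of ssh -v sns@"$HOST" 2>&1 | last_kex_stage. A run that ends at "waiting for kex reply" matches the pattern described here: the small early packets arrive, the larger key-exchange reply does not.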
I'd seen this sort of thing before: it is the classic signature of an MTU black hole. The MTU (maximum transmission unit) is the largest packet a network link will carry. If something on the path silently enforces a smaller limit and sends back no error, the oversized packets just disappear and the connection stalls.
The way to test that is to send pings of steadily increasing size with the don't-fragment flag set, then watch for the point at which replies stop coming back. The largest size that still gets a reply marks the effective ceiling of the path. In this case the usable path topped out at around 1200 bytes, with larger packets silently dropped.
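The sweep does not have to be linear. Here is a minimal sketch that binary-searches the payload size instead, assuming macOS or BSD ping, where -D sets the don't-fragment bit and -t is a timeout in seconds (Linux ping uses -M do and different timeout flags). The function names are hypothetical.

```shell
# Sketch: binary-search the largest ICMP payload that still gets a reply
# with the don't-fragment bit set. Names are illustrative.

# One DF probe using macOS/BSD ping flags. On other systems, define your
# own probe function and pass its name as the fourth argument below.
df_probe() {   # df_probe HOST PAYLOAD_BYTES
    ping -D -c 1 -t 2 -s "$2" "$1" >/dev/null 2>&1
}

# find_max_payload HOST LOW HIGH [PROBE_FN]
# Assumes a payload of LOW passes and HIGH fails.
# Prints the largest payload size that still passes.
find_max_payload() {
    host=$1 lo=$2 hi=$3 probe=${4:-df_probe}
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$host" "$mid"; then
            lo=$mid     # mid got a reply: the ceiling is at or above mid
        else
            hi=$mid     # mid vanished: the ceiling is below mid
        fi
    done
    echo "$lo"
}
```

A call such as find_max_payload "$HOST" 500 1500 prints a payload size, not a packet size. Over IPv4, add 28 bytes (20 for the IP header, 8 for ICMP) to get the size of the packet on the wire.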
The route to the host went through a tunnel interface:
gateway: 100.64.0.1
interface: utun4
The interface was utun4, and its MTU was 1325.
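Both facts come out of standard tools: route -n get for the interface, ifconfig for the MTU. A small sketch for extracting them, assuming macOS or BSD output formats; the helper names are mine.

```shell
# Sketch: extract the interface and MTU from macOS/BSD tool output.
# Both helpers read the tool output on stdin. Names are illustrative.

# Interface name from `route -n get HOST` output ("interface: utun4").
route_iface() {
    awk '$1 == "interface:" { print $2; exit }'
}

# MTU from `ifconfig IFACE` output, which contains "... mtu 1325".
iface_mtu() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}
```

Usage would be along the lines of iface=$(route -n get "$HOST" | route_iface) followed by ifconfig "$iface" | iface_mtu.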
So, to sum up where I'd got to:
TCP connects. Small SSH packets pass. Larger key-exchange packets vanish. Nobody sends back a useful ICMP message. Path MTU discovery fails. SSH reports a timeout.
This is what path MTU discovery is for. Your machine is supposed to discover the largest packet a path allows by listening for "packet too big" ICMP messages that routers send back. If those messages never arrive, it never learns to shrink its packets, so it keeps sending oversized ones that vanish.
At this point I thought I had it. I was on a corporate Cisco VPN, which meant a tunnel, and that tunnel had an odd MTU, so the obvious conclusion was that the VPN was mangling my outbound SSH. It was a reasonable guess. It was also wrong.
I tried the obvious-looking client-side mitigation first: lowering utun4's MTU to 1200, then to 1100. It did not fix the session.
Then I tried the less obvious thing, which I did not even know was a lever until this incident: clamping MSS locally with pf on the Mac. That did not fix the path either.
Wait, what is MSS? It is the Maximum Segment Size: the largest chunk of actual data TCP puts in a single packet, roughly the MTU minus the IP and TCP headers (40 bytes over IPv4). Each end advertises its MSS during the handshake. Clamp that value down and the two ends send each other smaller packets.
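The arithmetic is simple enough to write down. A minimal sketch, ignoring TCP options (which eat into the segment further); the function name is hypothetical.

```shell
# Sketch: MSS is the MTU minus the IP and TCP headers, ignoring options.
# IPv4: 20 (IP) + 20 (TCP) = 40 bytes. IPv6: 40 (IP) + 20 (TCP) = 60 bytes.
mss_for_mtu() {   # mss_for_mtu MTU [4|6], IP version defaults to 4
    case "${2:-4}" in
        4) echo $(( $1 - 40 )) ;;
        6) echo $(( $1 - 60 )) ;;
        *) echo "unknown IP version: $2" >&2; return 1 ;;
    esac
}
```

For the numbers in this post, mss_for_mtu 1325 gives 1285 and mss_for_mtu 1200 gives 1160.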
Wait, pf on macOS? pf is OpenBSD's packet filter, and macOS inherited it from the BSDs. There is a pf sitting on every Mac, which almost nobody realises.
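For the curious, an MSS clamp in pf is a one-line scrub rule. This is a sketch, not the exact rule I loaded: it assumes the older standalone scrub syntax, which as far as I know is what the pf on macOS still uses, and the function name is made up.

```shell
# Sketch: print a pf scrub rule that clamps TCP MSS on one interface.
# Assumes the older standalone `scrub` syntax (an assumption about macOS pf).
max_mss_rule() {   # max_mss_rule IFACE MSS
    printf 'scrub on %s all max-mss %s\n' "$1" "$2"
}
```

max_mss_rule utun4 1160 prints the rule. Piping it through sudo pfctl -nf - should parse it without loading anything, which is a safe way to check the syntax. Be careful about actually loading it with pfctl -f: that replaces the main ruleset, including whatever the system put there.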
In hindsight that was another useful lesson: if the black hole is inside a tunnel I do not control, client-side packet nudging only gets me as far as the tunnel entrance. The thing eating the packets is still somewhere else, and I cannot fix that from my laptop.
Then the failure changed.
Instead of only timing out, the server started refusing me:
kex_exchange_identification: Connection closed Not allowed at this time
That looked like a new problem, and it was.
The box runs OpenBSD 7.6, whose sshd is recent enough to include PerSourcePenalties. The feature arrived in OpenSSH 9.8, is on by default, and penalises source addresses that repeatedly fail authentication or connect without ever completing it. My repeated broken handshakes looked exactly like that, so sshd started temporarily refusing me.
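If the source of the broken handshakes is an address you trust, sshd_config has a PerSourcePenaltyExemptList option for exactly that. A hedged sketch of the line involved, with a hypothetical helper name; whether exempting anything is wise depends on your setup.

```shell
# Sketch: emit an sshd_config line exempting trusted sources from penalties.
# The argument is a comma-separated list of addresses or CIDR ranges.
penalty_exempt_line() {   # penalty_exempt_line CIDR[,CIDR...]
    printf 'PerSourcePenaltyExemptList %s\n' "$1"
}
```

For example, penalty_exempt_line 192.168.1.0/24 prints a line for a made-up LAN range. After adding such a line to /etc/ssh/sshd_config, sshd -t checks the file for errors before you reload the daemon.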
So now there were two problems: the original MTU black hole, and a defensive mechanism on the server that my own debugging had triggered.
The workaround was simple enough. I had a Linux box on the LAN that could reach the OpenBSD host cleanly. So I used it as a jump host, in the usual shape:
ssh -J sns@lan-box sns@<host-ip>
From there, SSH worked.
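The same jump can live in ~/.ssh/config so it does not have to be typed each time. A minimal sketch that prints such a stanza; the alias "blog" and the helper name are hypothetical, and the host names are the same placeholders as above.

```shell
# Sketch: print an ssh_config stanza equivalent to `ssh -J JUMP USER@TARGET`.
# All four arguments are placeholders you supply.
jump_stanza() {   # jump_stanza ALIAS TARGET_HOST USER JUMP
    printf 'Host %s\n    HostName %s\n    User %s\n    ProxyJump %s\n' \
        "$1" "$2" "$3" "$4"
}
```

Appending the output of jump_stanza blog "$HOST" sns sns@lan-box to ~/.ssh/config would make a plain ssh blog take the detour through the LAN box.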
But the explanation still bothered me.
Later I turned the Cisco VPN off.
The problem remained.
That was the important moment. If the Cisco VPN was down and the route still went through utun4, then utun4 was not the Cisco VPN. Or at least not the thing I had been blaming.
So I asked the question I should probably have asked much earlier:
Whose network am I actually leaving through?
There were Zscaler processes running, including ZscalerTunnel. The traffic to my OpenBSD box was still routed through a tunnel interface. The egress IP belonged to Zscaler's London infrastructure.
So the answer was: Zscaler.
Not the VPN I had toggled. Not the thing I was mentally debugging. The always-on security tunnel underneath it.
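Both pieces of evidence are cheap to gather. A rough sketch, with assumptions flagged: the process-name pattern is a guess at common agents rather than a complete list, the default URL is just one "what is my IP" service among many, and the function names are mine.

```shell
# Sketch: two quick checks for "whose tunnel is this?". Names are illustrative.

# Filter a process listing (stdin) for likely tunnel or proxy agents.
# The pattern is a guess at common names, not an exhaustive list.
tunnel_procs() {
    grep -i -E 'zscaler|anyconnect|vpn'
}

# Ask an external service which source address it sees. The default URL is
# an assumption; substitute any endpoint you trust.
egress_ip() {   # egress_ip [URL]
    curl -fsS --max-time 10 "${1:-https://ifconfig.me/ip}"
}
```

Usage is roughly ps ax | tunnel_procs, then whois on whatever egress_ip returns to see who owns the address.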
Zscaler does not merely break web things, certificate chains, Docker pulls, package managers, and all the usual corporate-proxy irritations. It can also break plain outbound SSH in a way that looks, at first, like a network reachability problem, then like an SSH problem, then like a VPN problem, then like a server-side refusal.
"Operation timed out" is nearly contentless: the useful question is not whether it timed out, but where the protocol got to before it did. And a switched-off VPN is not the same thing as no tunnel — endpoint security software may still be intercepting and tunnelling your traffic. If you want to know whose infrastructure your packets are really using, the egress IP is the best witness: ask where they come out.
So yes: Zscaler breaks SSH too.