tl;dr: each individual link has high bandwidth on its own, but the end-to-end connection has low bandwidth.

I’m stuck on debugging it, what should I measure and what should I try?

Details below.

I have two routers (both are Ubiquiti UDRs) and a NAS (Synology). The NAS is connected directly to one of the UDRs. Between the two routers there’s a site-to-site WireGuard VPN using Ubiquiti’s “Magic Group” feature.

I measured all the links with iperf3, and each of them has high bandwidth on its own (see pic below):

  1. The link between the two routers over the VPN is fast enough (200 Mbit/s+),
  2. the link between the NAS and its neighboring router is also fast (500 Mbit/s+).

However, the end-to-end connection between the NAS and the remote router has extremely low bandwidth, only 4-10% of what should be possible: 10-23 Mbit/s. When I mount the NAS on a client on the remote network it’s even slower: copying a file tops out at 6 Mbit/s, with drops to 0.

https://preview.redd.it/t4vjshkyzl2c1.png?width=2222&format=png&auto=webp&s=3adefc4b059a77df31671bbe0f464f6cb2def838
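For reference, the per-link numbers above came from plain iperf3 runs, roughly like the following (the hostname is a placeholder):

    # on the machine being measured (e.g. the NAS)
    iperf3 -s

    # on the other end (e.g. the remote router); -R reverses the direction
    iperf3 -c nas.local -t 30
    iperf3 -c nas.local -t 30 -R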

I believe I ruled out issues with the VPN itself: I removed the VPN connection on the routers, forwarded the port used by iperf3, and measured the bandwidth between the remote router and the NAS directly via its public IP. The result was about the same as over the VPN.

Other potentially useful info:

  • There’s a large distance between the two routers. Traceroute showed 12 hops, with a whopping 179 ms latency.
  • The connection from the NAS is weirdly asymmetric.
    • When the remote router downloads from the NAS, the speed settles at a stable 21.0 Mbit/s, with an occasional bump to 31.5 Mbit/s. There’s a ~5% overhead (packet loss?) during the test.
    • In the opposite direction, with the NAS downloading from the remote router, the speed starts at 1.71 Mbit/s, gradually climbs to 36.5 Mbit/s, drops to 0, then starts climbing again. The packet loss is under 1%.
    • The connection between the NAS and its directly connected router is also inexplicably asymmetric, but it’s stable, with no changes in bandwidth during the test and no packet loss visible in the iperf3 run.
bbluff (OP) replied:

    Found it. The TCP window size was too small.

    Raising the maximum TCP window size on the NAS and setting the window size in iperf3 fixed the problem. The bandwidth went from 23 Mbit/s to 190 Mbit/s over the VPN.

    I believe the root cause is the combination of high latency between the endpoints and a small maximum TCP window size. The latency between the two routers is 179 ms, and the maximum TCP window size on the NAS was set to 128 KB. Because of the high latency the TCP ACKs take a long time to arrive, and because of the small window the sender spends most of its time waiting for ACKs instead of sending more data (rough numbers in the sketch after the list below).

    • Between the two routers the bandwidth was fine because the maximum window size was large (16 MB) and the window could scale up.
    • Between the NAS and its local router the latency was small so the TCP window size didn’t matter.
    • When sending data from the NAS to the remote router, the NAS kept waiting for TCP ACKs to arrive, and underused the available bandwidth.
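
    The ceiling here is roughly the bandwidth-delay product: throughput ≈ TCP window / round-trip time. Very rough numbers for this link (the effective window during a real transfer can differ, so treat these as order-of-magnitude estimates only):

        128 KB / 0.179 s ≈ 1,048,576 bits / 0.179 s   ≈ 6 Mbit/s
        8 MB   / 0.179 s ≈ 67,108,864 bits / 0.179 s  ≈ 375 Mbit/s
        16 MB  / 0.179 s ≈ 134,217,728 bits / 0.179 s ≈ 750 Mbit/s

    So an 8 MB window is already comfortably above the ~200 Mbit/s the router-to-router path can deliver, which matches the 190 Mbit/s result.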

    I changed the max TCP window size on the NAS to match the routers’ 16MB:

    # raise the maximum socket receive buffer to 16 MB
    sudo sysctl -w net.core.rmem_max=16777216
    # raise the maximum socket send buffer to 16 MB
    sudo sysctl -w net.core.wmem_max=16777216
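
    Note that sysctl -w doesn’t persist across reboots; on a stock Linux box the same values would normally go into /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/), though I haven’t checked where Synology’s DSM expects them:

        # /etc/sysctl.conf (location may differ on DSM)
        net.core.rmem_max = 16777216
        net.core.wmem_max = 16777216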
    

    For iperf3 itself I added the -w 8M flag, so it tests with an 8 MB TCP window. I found the 8M value by trial and error; I’m sure it could be tuned further, but 190 Mbit/s is good enough for this connection.
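
    The retest over the VPN looked roughly like this (the hostname is a placeholder):

        iperf3 -c nas.local -w 8M -t 30
        iperf3 -c nas.local -w 8M -t 30 -R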