Issue: Dropouts every couple of hours, not consistent but always fell on the 1/4 hour. Traced it back to DHCP timeout issues.
ENV: Aussie Broadband Fiber to the Premises (FTTP) NBN. Basically an Ethernet WAN service. pfSense 2.4-5 (continued into pfSense 2.5.0-RELEASE) Full VM on proxmox. WAN connection into proxmox via Linux Bridge. E1000 virtual adapter in pfSense.
Issue originally cropped up 12 hours after initial stable connection. pfSense still believed the connection to be up. Restarted the vm and connection returned. 8.5hrs later same issue. Stumbled upon this Whirlpool thread describing DHCP timing issue with various commercial routers and custom setups including pfSense installs.
Resolution: Seemingly an issue with the script that interacts between DHCPclient and the pfSense GUI. Script was updated in the 2.4.5-p1 update and though the change log does not mention any fix’s the issue hasn’t cropped up since. Either that or an ISP error, noted in logs and resolved manually. (Updated in last section of this post)
History: Check the DHCP timing for the WAN interface under Interfaces > WAN > DHCP Client Configuration > Tick Advanced Configuration.
The default seems to be Presets : Saved Config (blank/unknown values).
Updated the Preset to FreeBSD defaults, which are listed below for ref:
Select Timeout: 0
Backoff cutoff: 120
Initial Interval: 10
has been was stable for the last 3 5 days without an further dropout until today. 1800 second leases with 900 seconds until pfSense will attempt to renew (as per DHCP standard).
Found some new log entries in the dhclient log that seem to suggest the DHCP server that assigns the WAN IPs is either being ignored, timing out or is blocking packets to itself. (Partial Log Bellow)
Jun 1 12:33:55 dhclient RENEW Jun 1 12:33:55 dhclient Creating resolv.conf Jun 1 12:48:55 dhclient RENEW Jun 1 12:48:55 dhclient Creating resolv.conf Jun 1 13:12:31 dhclient 9839 send_packet: Host is down Jun 1 13:18:57 dhclient EXPIRE Jun 1 13:18:57 dhclient Deleting old routes Jun 1 13:18:57 dhclient PREINIT Jun 1 13:19:58 dhclient FAIL Jun 1 13:25:59 dhclient FAIL Jun 1 13:32:00 dhclient FAIL Jun 1 13:38:01 dhclient FAIL Jun 1 13:44:02 dhclient FAIL Jun 1 13:50:03 dhclient FAIL Jun 1 13:56:04 dhclient FAIL Jun 1 14:02:05 dhclient FAIL Jun 1 14:08:06 dhclient FAIL Jun 1 14:14:07 dhclient FAIL Jun 1 14:20:08 dhclient FAIL Jun 1 14:26:09 dhclient FAIL <WAN Interface Manually Shutdown> Jun 1 14:31:06 dhclient 9839 connection closed Jun 1 14:31:06 dhclient 9839 exiting. <WAN Interface Manually Started> Jun 1 14:31:52 dhclient PREINIT Jun 1 14:31:52 dhclient EXPIRE Jun 1 14:31:52 dhclient Deleting old routes Jun 1 14:31:52 dhclient PREINIT Jun 1 14:31:52 dhclient 86271 short write: wanted 20 got 0 bytes Jun 1 14:31:52 dhclient 86271 exiting. Jun 1 14:31:53 dhclient PREINIT Jun 1 14:31:53 dhclient EXPIRE Jun 1 14:31:53 dhclient Deleting old routes Jun 1 14:31:53 dhclient PREINIT Jun 1 14:31:53 dhclient 5335 DHCPDISCOVER on em0 to 255.255.255.255 port 67 interval 23 Jun 1 14:31:54 dhclient 5335 DHCPOFFER from <GW-IP> Jun 1 14:31:54 dhclient ARPSEND Jun 1 14:31:56 dhclient ARPCHECK Jun 1 14:31:56 dhclient 5335 DHCPREQUEST on em0 to 255.255.255.255 port 67 Jun 1 14:31:56 dhclient 5335 DHCPACK from <GW-IP> Jun 1 14:31:56 dhclient BOUND Jun 1 14:31:56 dhclient Starting add_new_address() Jun 1 14:31:56 dhclient ifconfig em0 inet <IP> netmask </22 netmask> broadcast <bcast IP> Jun 1 14:31:56 dhclient New IP Address (em0): <IP> Jun 1 14:31:56 dhclient New Subnet Mask (em0): </22 netmask> Jun 1 14:31:56 dhclient New Broadcast Address (em0): <bcast IP> Jun 1 14:31:56 dhclient New Routers (em0): <GW-IP> Jun 1 14:31:56 dhclient Adding new routes to interface: em0 Jun 1 14:31:56 dhclient /sbin/route add default <GW-IP> Jun 1 14:31:56 dhclient Creating resolv.conf Jun 1 14:31:56 dhclient 5335 bound to <IP> -- renewal in 900 seconds. Jun 1 14:46:56 dhclient 33966 DHCPREQUEST on em0 to <GW-IP> port 67 Jun 1 14:46:56 dhclient 33966 DHCPACK from <GW-IP> Jun 1 14:46:56 dhclient RENEW Jun 1 14:46:56 dhclient Creating resolv.conf Jun 1 14:46:56 dhclient 33966 bound to <IP> -- renewal in 900 seconds.
I recorded the “static IP”, /22 netmask and gw. Changed the interface across to static IP mode and set those values. 25 minutes later the wan interface stopped receiving traffic. Which was dead on 1800 seconds after the last completed renew. I’d say Aussie BB starts blocking all traffic (except DHCP traffic) from an IP 30 minutes after the last successful renewal.
It does not look to be MSS or MTU related as under normal conditions no packets are being dropped. Large size pings tests confirmed this.
Will continue to look into DHCP timing issues.
UPDATE: Some packet captures completed.
Further review of the logs aswell.
Just before the “crash” renewal log entries looked like this.
Jun 1 10:48:54 dhclient RENEW Jun 1 10:48:54 dhclient Creating resolv.conf Jun 1 11:03:54 dhclient RENEW Jun 1 11:03:54 dhclient Creating resolv.conf <Every 900 seconds>
Just after the manual stop/start the renewals log like this:
Jun 1 17:06:10 dhclient 71403 DHCPREQUEST on em0 to <GW-IP> port 67 Jun 1 17:06:10 dhclient 71403 DHCPACK from <GW-IP> Jun 1 17:06:10 dhclient RENEW Jun 1 17:06:10 dhclient Creating resolv.conf Jun 1 17:06:10 dhclient 71403 bound to <IP> -- renewal in 900 seconds.
Investigating the program to completes renewals on pfSense. This script looks to be the lead. /usr/local/sbin/pfSense-dhclient-script (see resolution at top, script was updated in 2.4.5-p1 release and issue hasnt appeared since)
No further testing completed. Didn’t check the differences in the script to see what changed.
Update 2021-04-11: The interface drops may have something to do with the GW monitor “dpinger” that when it is unable to ping the monitor gateway IP (which defaults to the first upstream hop of the WAN interface) it will mark the gateway as down and possibly kills dhclient from running on the WAN interface, dropping the link at the new renew.
Error Messages had been appearing in the system log > gateways for the last couple of days are as follows:
<datetime> dpinger <pid> WAN_DHCP <first-hop>: sendto error: 50 .... multiple time .... and .... <datetime> dpinger <pid> WAN_DHCP <first-hop>: sendto error: 65
Pinging this first upstream gateway is not preferable as most sane ISPs will be prioritizing customer traffic over echo pings and some may be dropped. Did some forum checks looking for stable and reliable hosts to bounce echo’s off. In Australia the majority of the recommendation are Cloudflare’s DNS (18.104.22.168), dns.google (22.214.171.124), Level 3’s DNS (126.96.36.199) or ns1.telstra.net (188.8.131.52). Purely based on best latency and hop count I selected 184.108.40.206 and configured it as the Monitor IP for the WAN default GW.
No more dpinger errors since and most importantly no drops either. Will report back here if either return.