Cliche time – You learn something new every day…
… and today I learned about how the seconds elapsed field in DHCP packets can affect the DHCP DORA (Discover/Offer/Request/Acknowledgement) process – particularly when using a load-balance failover config on Microsoft DHCP servers.
- 2 Microsoft DHCP servers with DHCP scopes set up for load-balance (50/50)
- Meraki APs
- AP Management VLAN gateway configured on core Nexus switch (and branch routers) with one IP helper address pointed towards DHCP server 1 (This will be important)
- Greenfield wireless deployment
Some (but not all) new Meraki APs were not getting DHCP IP addresses when they got plugged into the network.
I love troubleshooting DHCP because it is a straightforward, structured process.
- Client sends Discover packet (broadcast) – “Hey I’m looking for an IP”
- Server responds with an Offer – “I have an IP if you want it”
- Client sends a Request – “I would like that IP good sir”
- Server sends an Acknowledgement – “OK – that IP is marked as yours”
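Because the four steps always happen in the same order, the first question in any capture is simply "which step is the first one missing?". Here is a minimal Python sketch of that check. The function name and the message labels are my own for illustration, not anything a capture tool produces.

```python
from typing import Optional

# The four DORA messages, in the order a successful exchange produces them.
DORA = ["DISCOVER", "OFFER", "REQUEST", "ACK"]


def first_missing_step(seen: list[str]) -> Optional[str]:
    """Return the first DORA message absent from a capture, or None if all four appear."""
    for step in DORA:
        if step not in seen:
            return step
    return None


# A client that keeps sending Discovers and never hears back.
print(first_missing_step(["DISCOVER", "DISCOVER", "DISCOVER"]))      # OFFER
# A healthy exchange.
print(first_missing_step(["DISCOVER", "OFFER", "REQUEST", "ACK"]))   # None
```

Whatever step comes back is where the troubleshooting starts.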
My first step in troubleshooting a client not getting an address is to verify that the server is getting a Discover packet from the client. If your client is on the same broadcast domain (Layer 2) as your DHCP server, the server will see the Discover packet as a broadcast with the source MAC address of the client. Typically, though, the DHCP server is located across a Layer 3 boundary, so the normal broadcast method won’t work.
Enter IP helpers.
IP helpers are configured on the gateway interface for a VLAN (SVI, sub-interface, etc). They listen for DHCP broadcasts, and when they hear one they jump in to get that packet to the proper server by re-sending the broadcast as a unicast packet to the configured IP helper address. The unicast packet uses the VLAN interface IP as its source IP and the IP helper address as its destination. Inside the packet, the original client MAC is preserved as the client requesting an address.
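To make the relay step concrete, here is a rough Python model of what the gateway does with a client broadcast. The `DhcpPacket` class, the `relay` function and all the addresses are made up for illustration. The `giaddr` field is my addition: it is the standard relay-agent address field inside the DHCP message, which the relay fills in so the server knows which subnet the request came from.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DhcpPacket:
    src_ip: str   # IP header source
    dst_ip: str   # IP header destination
    giaddr: str   # relay agent address inside the DHCP message
    chaddr: str   # client MAC inside the DHCP message


def relay(packet: DhcpPacket, gateway_ip: str, helpers: list[str]) -> list[DhcpPacket]:
    """Turn one client broadcast into one unicast copy per configured helper."""
    return [
        # Only the addressing changes; chaddr (the client MAC) is left alone.
        replace(packet, src_ip=gateway_ip, dst_ip=helper, giaddr=gateway_ip)
        for helper in helpers
    ]


# Hypothetical Discover from an AP that has no address yet.
discover = DhcpPacket("0.0.0.0", "255.255.255.255", "0.0.0.0", "aa:bb:cc:00:00:01")

# One helper configured: only one server ever sees the Discover.
for copy in relay(discover, "10.0.50.1", ["10.0.10.11"]):
    print(copy.src_ip, "->", copy.dst_ip, "client", copy.chaddr)
```

The thing to notice is that the number of servers that see the Discover equals the number of helpers on the interface, no more.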
So with IP helper configured you should see the Discover packet arrive on your server. In this case I was successfully seeing the packet arrive at the DHCP 1 server. Next you would expect the server to craft and send a DHCP Offer back to the VLAN interface with an IP the client can use.
In this particular case I was not seeing either of the DHCP servers responding with an Offer. So now I know why the AP wasn’t getting an IP – the DORA process wasn’t finishing as intended. Next I had to figure out why the servers weren’t responding to the Discover packets.
According to the DHCP server logs, the server was dropping my Discover packets:
Packet dropped because of client ID hash mismatch or standby server
The Google machine told me it meant that my server (DHCP 1) wasn’t responding because DHCP 2 should be the responding server, based on the Microsoft hashing algorithm that determines which DHCP server should serve the client. The problem was, DHCP 2 wasn’t responding with an Offer either, and I didn’t have anything in the logs indicating why.
As a test we disabled load-balancing and then APs were able to get IPs with no issues. So by process of elimination it appeared that something in the load balance process was breaking the normal DORA flow. At that point, I turned to Google again because I don’t have a ton of experience administering and troubleshooting MS DHCP.
As I pored over Google search results, one caught my eye.
The preview text sounded like my problem. Clicking the link I found a wealth of information that definitely sounded like my problem. Peruse the info in the image below:
Other than being on Meraki APs and not Extreme, our situation matched up almost exactly with the one outlined above. The biggest things that jumped out to me were the fact that we only had one IP helper configured AND other non-Meraki devices weren’t having any issues on similarly configured scopes.
Armed with this information I reconfigured load balancing and fired up Wireshark on the two DHCP servers again. In my capture I started looking at the Seconds Elapsed field in each Discover packet from my Meraki AP. Like the example packet below, I only saw a value of 0 for each packet.
If a client chooses to use the secs field it should update that value for the time elapsed between the first time it tried to send that packet and the current time.
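If you would rather pull that value out of raw bytes than click through Wireshark, the secs field sits at a fixed offset near the start of every DHCP message. This is a small sketch, assuming you already have the UDP payload of the packet in hand. The sample bytes and the transaction ID are invented for the example.

```python
import struct

# Start of a BOOTP/DHCP message: op, htype, hlen, hops (1 byte each),
# xid (4 bytes), secs (2 bytes), flags (2 bytes). Network byte order.
HEADER = struct.Struct("!BBBBIHH")


def seconds_elapsed(dhcp_payload: bytes) -> int:
    """Return the secs field from the UDP payload of a DHCP message."""
    op, htype, hlen, hops, xid, secs, flags = HEADER.unpack_from(dhcp_payload)
    return secs


# Hypothetical first 12 bytes of two Discovers from the same client.
first_try = bytes.fromhex("01010600" "deadbeef" "0000" "8000")
retry = bytes.fromhex("01010600" "deadbeef" "0008" "8000")

print(seconds_elapsed(first_try))  # 0
print(seconds_elapsed(retry))      # 8
```

A client that uses the field looks like the second packet on its retries. What I saw from the APs looked like the first packet every time.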
If Meraki updated the secs field as designed, the DHCP 1 server would eventually see a value greater than 6 in our particular scenario, where the client device was not getting an Offer. At that point it would forward that request to the DHCP 2 server so that it could respond to the Discover. However, since that field never increments, the DHCP 1 server never forwards that request. This becomes an issue when you only have one IP helper address configured, like this particular customer did. They were only forwarding requests to DHCP 1, and since DHCP 1 determined that the requesting client should be serviced by DHCP 2, DHCP 1 would ignore the Discover, assuming DHCP 2 was going to answer. DHCP 2 couldn’t answer, though, because it was never getting the request, either via the IP helper or via the internal load-balance mechanism.
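The failure is easier to see as a toy model. The Python below is a sketch of the behaviour as I understood it, not Microsoft's implementation. The `owner` function is a stand-in I made up (odd or even last byte of the MAC) in place of the real hash, the threshold of 6 is just the value from our scenario, and the model only answers "does the client end up with an Offer?" without caring how the request gets to the server that sends it.

```python
SECS_THRESHOLD = 6  # value from our scenario; treat it as an assumption


def owner(mac: str) -> str:
    """Stand-in for the load-balance hash: even last byte -> DHCP1, odd -> DHCP2."""
    return "DHCP1" if int(mac.split(":")[-1], 16) % 2 == 0 else "DHCP2"


def gets_offer(mac: str, secs: int, helpers: list[str]) -> bool:
    """True if a Discover relayed to `helpers` results in an Offer for the client."""
    for server in helpers:
        if owner(mac) == server:
            return True   # the owning server sees the Discover directly
        if secs > SECS_THRESHOLD:
            return True   # client has waited long enough that the mismatch stops blocking it
    return False          # every server that saw the Discover dropped it


ap = "aa:bb:cc:00:00:01"  # hypothetical AP whose hash lands on DHCP2

print(gets_offer(ap, secs=0, helpers=["DHCP1"]))           # False: our broken case
print(gets_offer(ap, secs=8, helpers=["DHCP1"]))           # True: a client that increments secs
print(gets_offer(ap, secs=0, helpers=["DHCP1", "DHCP2"]))  # True: the fix, two helpers
```

The first line is the AP stuck at secs 0 behind a single helper. The other two lines are the only ways out: a client that increments secs, or a second helper.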
Since we couldn’t make Meraki change the way it uses (or rather doesn’t use) the Seconds Elapsed field, the only other way to fix the issue was to add the DHCP 2 server as another IP helper address on the relevant interfaces. So we made the change, rebooted the APs that weren’t getting IP addresses and BOOM – they were able to successfully complete the DHCP process and get an IP.
- To be clear – both IP helpers should have been configured from the start. That being said I would not have had the above learning experience if both had been configured.
- I’ll take the blame on the misconfiguration. I used an existing VLAN interface that served PC clients and used it as a template for the AP management interface. Normally I would put both helpers in but I made an assumption (bad Jamie) that copying the existing config should be fine. The existing one only had the one IP helper. There had never been an issue before since most normal client devices will utilize the seconds elapsed field. So clients on the Data VLAN would get their DHCP requests sent to DHCP 2 approximately 6 seconds after their initial attempt if DHCP 2 was supposed to be the server that handled that particular client. That being the case, the customer had never seen issues before because that process worked as designed.
- The issue only came to light after introducing a device (the Meraki AP) that didn’t utilize the same behavior regarding the Seconds Elapsed field.
- The customer would’ve eventually had issues if DHCP 1 had ever gone and stayed down unexpectedly. They had not run into that situation though as of this writing.
- Ultimately I recommended that they roll out the second IP helper address to all interfaces that need to forward DHCP requests, to prevent future issues if/when DHCP 1 becomes unavailable.