I have a UDP-based client/server application. On initial contact, the client sends a message to the server at a known IP/port, and the server replies with a new port to talk to. The client then sends another message to that new port (same server IP as before), and communication normally continues from there. I've had a couple of complaints about this handshake failing, but I haven't been able to reproduce it. The logs don't contain as much as I'd like, but they show that when the issue occurs, the second server thread, which receives the 2nd message, drops it because the client IP/port is unknown. (The first thread records the client IP/port for the second thread before replying to the client; the mutexes and everything else look correct for ensuring the data is there before the second thread gets the next message.)
This works every time I've tested it. The server assumes that the client's 2nd message comes from the same IP/port as its first. Unfortunately, I don't have logs to verify whether that was the case, and based on the client code I'm starting to wonder whether that assumption is incorrect. Here's the client code (Windows):
SOCKET skt = socket(AF_INET, SOCK_DGRAM, 0);
std::string serverAddr("127.0.0.1");
int addrInt = inet_addr(serverAddr.c_str());
sockaddr_in toAddr;
toAddr.sin_family = AF_INET;
toAddr.sin_addr.s_addr = addrInt;
toAddr.sin_port = htons(12345);
uint8_t txBuf[] = "test_msg";
// this send works correctly
sendto(skt, (char*)txBuf, sizeof(txBuf), 0, (sockaddr*)&toAddr, sizeof(toAddr));
sockaddr_in rxAddr;
int fromAddrLen = sizeof(rxAddr);
uint8_t rxBuf[1000];
// this receive also works
recvfrom(skt, (char*)rxBuf, sizeof(rxBuf), 0, (sockaddr*)&rxAddr, &fromAddrLen);
uint16_t nextPort = *((uint16_t*)rxBuf + 0);
toAddr.sin_port = htons(nextPort);
// this send triggers the server log that the client IP/port is unknown
sendto(skt, (char*)txBuf, sizeof(txBuf), 0, (sockaddr*)&toAddr, sizeof(toAddr));
My question is: in the code above, is the 2nd send guaranteed to come from the same IP/port combination as the first send? If not, how can I change the code to guarantee it without binding (I don't know the client IP to bind to and don't know which ports are available)?
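One way to narrow this down from the client side is to log the local port after each send. Below is a minimal sketch, not a definitive fix. It assumes that binding to INADDR_ANY with port 0 is acceptable, which needs neither a known client IP nor a known free port because the OS picks both. `bindAnyPort` and `localPort` are hypothetical helper names, and on Windows `WSAStartup` is assumed to have been called already. This only shows the port the client itself sees; anything between client and server (such as a NAT) may still present a different port to the server.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
typedef SOCKET socket_t;
typedef int socklen_type;
#else
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int socket_t;
typedef socklen_t socklen_type;
#endif

// Hypothetical helper: bind to any local address and let the OS pick a free
// port (port 0). No client IP or port number has to be known in advance.
bool bindAnyPort(socket_t skt) {
    sockaddr_in any{};
    any.sin_family = AF_INET;
    any.sin_addr.s_addr = htonl(INADDR_ANY);
    any.sin_port = htons(0);
    return bind(skt, reinterpret_cast<sockaddr*>(&any), sizeof(any)) == 0;
}

// Hypothetical helper: return the local port currently assigned to the
// socket (host byte order), or 0 if it cannot be determined.
uint16_t localPort(socket_t skt) {
    sockaddr_in local{};
    socklen_type len = sizeof(local);
    if (getsockname(skt, reinterpret_cast<sockaddr*>(&local), &len) != 0) {
        return 0;
    }
    return ntohs(local.sin_port);
}
```

Calling `bindAnyPort(skt)` right after `socket()` and logging `localPort(skt)` after each `sendto` would show whether the local port ever differs between the two sends. If it stays the same on the client while the server sees a change, the change happens somewhere on the path rather than in the client code.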
EDIT: I previously learned the hard way that NATs prevent servers from making first contact (which is why the client sends to thread1, gets a response, then sends to thread2, instead of thread1 directly telling thread2 to respond). Remembering that fun time, I'm starting to wonder whether the behavior I'm seeing comes from a NAT that behaves differently from all of the NATs in my test setups. I know a NAT creates a linkage between the client's IP and the server's IP, and that the server sees the NAT-generated client IP/port instead of the IP/port that the client sees. I had assumed the NAT would reuse the same linkage when the client started sending to the new server port (same server IP). Was that wrong? Can the fact that the client starts sending to a different port cause the server to see a different NAT IP/port?
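For illustration, here is a toy model of the two NAT mapping behaviors in question, using the terminology of RFC 4787: with "endpoint-independent" mapping the external port depends only on the client's own address/port, while with "address and port-dependent" mapping (often called a symmetric NAT) a new destination port gets a new external port. This is a sketch of the concept only, not how any particular NAT is implemented; `ToyNat`, its starting port 40000, and the addresses in the usage note are made up for the example.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Toy model of a NAT's UDP mapping table (illustrative only).
enum class Mapping { EndpointIndependent, AddressAndPortDependent };

class ToyNat {
public:
    explicit ToyNat(Mapping mode) : mode_(mode) {}

    // Returns the external source port the server would see for a datagram
    // sent from clientPort to serverIp:serverPort.
    uint16_t translate(uint16_t clientPort, const std::string& serverIp,
                       uint16_t serverPort) {
        // In endpoint-independent mode the destination is ignored, so the
        // key collapses to the client's port alone.
        auto key = (mode_ == Mapping::EndpointIndependent)
                       ? std::make_tuple(clientPort, std::string(), uint16_t{0})
                       : std::make_tuple(clientPort, serverIp, serverPort);
        auto it = table_.find(key);
        if (it != table_.end()) {
            return it->second;  // existing mapping is reused
        }
        uint16_t external = nextExternalPort_++;  // allocate a new mapping
        table_.emplace(key, external);
        return external;
    }

private:
    Mapping mode_;
    uint16_t nextExternalPort_ = 40000;  // arbitrary start for the sketch
    std::map<std::tuple<uint16_t, std::string, uint16_t>, uint16_t> table_;
};
```

With `ToyNat nat(Mapping::AddressAndPortDependent)`, `nat.translate(50000, "203.0.113.10", 12345)` and `nat.translate(50000, "203.0.113.10", 23456)` return different external ports even though the client socket and the server IP are unchanged. In `EndpointIndependent` mode both calls return the same port, which matches the assumption the server makes.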
CodePudding user response:
I believe your error is that you are using UDP and this protocol does not have error correction. There is no guarantee that the packet will be received, or that packets will be received in order, or that packets will be received only once.
You appear to be doing some kind of transaction that depends on the order in which packets are received. Testing locally is likely to always work, but as soon as you are on a wider-area network you will find dropped and out-of-order packets.
You can solve this by either using TCP, or by designing your own solution in UDP. Generally, unless you have specific performance reasons not to, you should use TCP.
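As a rough idea of what "designing your own solution in UDP" involves on the receiving side, here is a minimal sketch that classifies datagrams by sequence number. It assumes the sender puts a sequence number starting at 0 in every datagram; `SeqTracker` and `Verdict` are hypothetical names. Retransmission, acknowledgements, and sequence wraparound are left out.

```cpp
#include <cstdint>
#include <set>

// How a received datagram relates to what has been seen so far.
enum class Verdict { InOrder, OutOfOrder, Duplicate };

// Hypothetical receiver-side tracker. It remembers every sequence number
// seen, so a real implementation would need to prune old entries.
class SeqTracker {
public:
    Verdict accept(uint32_t seq) {
        if (!seen_.insert(seq).second) {
            return Verdict::Duplicate;  // already received once
        }
        if (seq == next_) {
            // Advance past this number and any that arrived early.
            while (seen_.count(next_) != 0) {
                ++next_;
            }
            return Verdict::InOrder;
        }
        return Verdict::OutOfOrder;  // arrived ahead of a missing datagram
    }

private:
    std::set<uint32_t> seen_;
    uint32_t next_ = 0;  // lowest sequence number not yet received
};
```

A receiver would call `accept` on each datagram's sequence number and, for example, discard duplicates and buffer out-of-order datagrams until the gap is filled.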
CodePudding user response:
The answer turns out to be that the client IP/port is not guaranteed to stay the same. I added debug logging to the server and saw the failure happen again, so I was able to verify that when the client changed its send-to port, the server saw a different client send-from port.
I'm not entirely sure whether this is due to something in the OS being different for some clients or to different NAT behavior. For my purposes, it doesn't matter -- I had to change the code to work with it either way. If anybody can definitively say (and hopefully point to documentation) whether it was the OS and/or NAT, please do and I'll change the accepted answer to it.
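For anyone hitting the same problem, one possible way to make the server independent of the client's source port is sketched below. This is an assumption for illustration and not necessarily the change described above: the first thread hands the client a token along with the new port, the client echoes the token in its 2nd message, and the second thread looks the session up by token and records whatever IP/port that message actually came from. `SessionTable`, `Endpoint`, and the token format are hypothetical. The sketch is not thread-safe (a real server would guard the map with a mutex), and `std::mt19937_64` is not a cryptographically secure source of tokens.

```cpp
#include <cstdint>
#include <map>
#include <random>
#include <string>

struct Endpoint {
    std::string ip;
    uint16_t port;
};

// Hypothetical session table keyed by token instead of client IP/port.
class SessionTable {
public:
    // First thread: create a session and return the token to send back to
    // the client together with the new port.
    uint64_t open(const std::string& clientIp) {
        uint64_t token = rng_();
        while (token == 0 || sessions_.count(token) != 0) {
            token = rng_();  // avoid 0 and collisions with live sessions
        }
        sessions_[token] = Endpoint{clientIp, 0};
        return token;
    }

    // Second thread: accept the 2nd message if the token is known, and
    // record the source IP/port it was actually received from.
    bool confirm(uint64_t token, const std::string& ip, uint16_t port) {
        auto it = sessions_.find(token);
        if (it == sessions_.end()) {
            return false;  // unknown token: drop the datagram
        }
        it->second = Endpoint{ip, port};
        return true;
    }

private:
    std::mt19937_64 rng_{std::random_device{}()};
    std::map<uint64_t, Endpoint> sessions_;
};
```

With this shape, the source address seen on the first port is only informational; replies on the new port go to the endpoint stored by `confirm`, so a changed source port no longer causes the 2nd message to be dropped.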