Winsock sendto returns error 10049 (WSAEADDRNOTAVAIL) for broadcast address after network adapter is-CodePudding

This bounty has ended. Answers to this question are eligible for a 300 reputation bounty. Bounty grace period ends in 18 hours. pulp_user wants to draw more attention to this question:

An explanation as to why this happens (if this is a windows bug or expected behavior and if so why) and - if possible - a solution to detect and handle network interfaces disconnecting without ambiguity (not just a heuristic based on observer behavior that "should work most of the time" - I can do that myself).

I am working on a p2p application and to make testing simple, I am currently using udp broadcast for the peer discovery in my local network. Each peer binds one udp socket to port 29292 of the ip address of each local network interface (discovered via GetAdaptersInfo) and each socket periodically sends a packet to the broadcast address of its network interface/local address. The sockets are set to allow port reuse (via setsockopt SO_REUSEADDR), which enables me to run multiple peers on the same local machine without any conflicts. In this case there is only a single peer on the entire network though.

This all works perfectly fine (tested with 2 peers on 1 machine and 2 peers on 2 machines) UNTIL a network interface is disconnected. When deactivacting the network adapter of either my wifi or an USB-to-LAN adapter in the windows dialog, or just plugging the usb cable of the adapter, the next call to sendto will fail with return code 10049. It doesn't matter if the other adapter is still connected, or was at the beginning, it will fail. The only thing that doesn't make it fail is deactivating wifi through the fancy win10 dialog through the taskbar, but that isn't really a surprise because that doesn't deactivate or remove the adapter itself.

I initially thought that this makes sense because when the nic is gone, how should the system route the packet. But: The fact that the packet can't reach its target has absolutely nothing to do with the address itsself being invalid (which is what the error means), so I suspect I am missing something here. I was looking for any information I could use to detect this case and distinguish it from simply trying to sendto INADDR_ANY, but I couldn't find anything. I started to log every bit of information which I suspected could have changed, but its all the same on a successfull sendto and the one that crashes (retrieved via getsockopt):

250   16.24746[886] [debug|debug] local address: 192.168.178.35
251   16.24812[886] [debug|debug] no remote address
252   16.25333[886] [debug|debug] type: SOCK_DGRAM
253   16.25457[886] [debug|debug] protocol: IPPROTO_UDP
254   16.25673[886] [debug|debug] broadcast: 1, dontroute: 0, max_msg_size: 65507, rcv_buffer: 65536, rcv_timeout: 0, reuse_addr: 1, snd_buffer: 65536, sdn_timeout: 0
255   16.25806[886] [debug|debug] Last WSA error on socket was WSA Error Code 0: The operation completed successfully.

256   16.25916[886] [debug|debug] target address windows formatted: 192.168.178.255
257   16.25976[886] [debug|debug] target address 192.168.178.255:29292
258   16.26138[886] [debug|assert] ASSERT FAILED at D:\Workspaces\spaced\source\platform\win32_platform.cpp:4141: sendto failed with (unhandled) WSA Error Code 10049: The requested address is not valid in its context.

The nic that got removed is this one:

   1.07254[0] [platform|info] Discovered Network Interface "Realtek USB GbE Family Controller" with IP 192.168.178.35 and Subnet 255.255.255.0

And this is the code that does the sending (dlog_socket_information_and_last_wsaerror generates all the output that is gathered using getsockopt):

void send_slice_over_udp_socket(Socket_Handle handle, Slice<d_byte> buffer, u32 remote_ip, u16 remote_port){
    PROFILE_FUNCTION();

    auto socket = (UDP_Socket*) sockets[handle.handle];
    ASSERT_VALID_UDP_SOCKET(socket);
    dlog_socket_information_and_last_wsaerror(socket);

    if(socket->is_dummy)
        return;

    if(buffer.size == 0)
        return;

    DASSERT(socket->state == Socket_State::created);

    u64 bytes_left = buffer.size;

    sockaddr_in target_socket_address = create_socket_address(remote_ip, remote_port);

    #pragma warning(push)
    #pragma warning(disable: 4996)
    dlog("target address windows formatted: %s", inet_ntoa(target_socket_address.sin_addr));
    #pragma warning(pop)
    unsigned char* parts = (unsigned char*)&remote_ip;
    dlog("target address %hhu.%hhu.%hhu.%hhu:%hu", parts[3], parts[2], parts[1], parts[0], remote_port);

    int sent_bytes = sendto(socket->handle, (char*) buffer.data, bytes_left > (u64) INT32_MAX ? INT32_MAX : (int) bytes_left, 0, (sockaddr*)&target_socket_address, sizeof(target_socket_address));

    if(sent_bytes == SOCKET_ERROR){
        #define LOG_WARNING(message) log_nonreproducible(message, Category::platform_network, Severity::warning, socket->handle); return;
        switch(WSAGetLastError()){
            //@TODO handle all (more? I guess many should just be asserted since they should never happen) cases
            case WSAEHOSTUNREACH: LOG_WARNING("socket %lld, send failed: The remote host can't be reached at this time.");
            case WSAECONNRESET: LOG_WARNING("socket %lld, send failed: Multiple UDP packet deliveries failed. According to documentation we should close the socket. Not sure if this makes sense, this is a UDP port after all. Closing the socket wont change anything, right?");
            case WSAENETUNREACH: LOG_WARNING("socket %lld, send failed: the network cannot be reached from this host at this time.");
            case WSAETIMEDOUT: LOG_WARNING("socket %lld, send failed: The connection has been dropped, because of a network failure or because the system on the other end went down without notice.");

            case WSAEADDRNOTAVAIL:

            case WSAENETRESET:
            case WSAEACCES:
            case WSAEWOULDBLOCK: //can this even happen on a udp port? I expect this to be fire-and-forget-style.
            case WSAEMSGSIZE:
            case WSANOTINITIALISED:
            case WSAENETDOWN:
            case WSAEINVAL:
            case WSAEINTR:
            case WSAEINPROGRESS:
            case WSAEFAULT:
            case WSAENOBUFS:
            case WSAENOTCONN:
            case WSAENOTSOCK:
            case WSAEOPNOTSUPP:
            case WSAESHUTDOWN:
            case WSAECONNABORTED:
            case WSAEAFNOSUPPORT:
            case WSAEDESTADDRREQ:
                ASSERT(false, tprint_last_wsa_error_as_formatted_message("sendto failed with (unhandled) ")); break;
            default: ASSERT(false, tprint_last_wsa_error_as_formatted_message("sendto failed with (undocumented) ")); //The switch case above should have been exhaustive. This is a bug. We either forgot a case, or maybe the docs were lying? (That happened to me on android. Fun times. Well. Not really.)
        }
        #undef LOG_WARNING
    }

    DASSERT(sent_bytes >= 0);
    total_bytes_sent  = (u64) sent_bytes;
    bytes_left -= (u64) sent_bytes;
    DASSERT(bytes_left == 0);
}

The code that generates the address from ip and port looks like this:

sockaddr_in create_socket_address(u32 ip, u16 port){
    sockaddr_in address_info;
    address_info.sin_family = AF_INET;
    address_info.sin_port = htons(port);
    address_info.sin_addr.s_addr = htonl(ip);
    memset(address_info.sin_zero, 0, 8);
    return address_info;
}

The error seems to be a little flaky. It reproduces 100% of the time until it decides not to anymore. After a restart its usually back.

I am looking for a solution to handle this case correctly. I could of course just re-do the network interface discovery when the error occurs, because I "know" that I don't give any broken IPs to sendto, but that would just be a heuristic. I want to solve the actual problem.

I also don't quite understand when error 10049 is supposed to fire exactly anyway. Is it just if I pass an ipv6 address to a ipv4 socket, or send to 0.0.0.0? There is no flat out "illegal" ipv4 address after all, just ones that don't make sense from context.

If you know what I am missing here, please let me know!

CodePudding user response：

This is a issue people have been facing up for a while , and people suggested to read the documentation provided by Microsoft on the following issue . "Btw , I don't know whether they are the same issues or not but the error thrown back the code are same, that's why I have attached a link for the same!!"

https://docs.microsoft.com/en-us/answers/questions/537493/binding-winsock-shortly-after-boot-results-in-erro.html

CodePudding user response：

I found a solution (workaround?)

I used NotifyAddrChange to receive changes to the NICs and thought it for some reason didn't trigger when I disabled the NIC. Turns out it does, I'm just stupid and stopped debugging too early: There was a bug in the code that diffs the results from GetAdaptersInfo to the last known state to figure out the differences, so the application missed the NIC disconnecting. Now that it observes the disconnect, it can kill the sockets before they try to send on the disabled NIC, thus preventing the error from happening. This is not really a solution though, since there is a race condition here (NIC gets disabled before send and after check for changes), so I'll still have to handle error 10049.

The bug was this:

My expectation was that, when I disable a NIC, iterating over all existing NICs would show the disabled NIC as disabled. That is not what happens. What happens is that the NIC is just not in the list of existing NICs anymore, even though the windows dialog will still show it (as disabled). That is somewhat suprising to me but not all that unreasonable I guess.

Before I had these checks to detect changes in the NICs:

Did the NIC exist before, was enabled and is now disabled -> disable notification
Did the NIC exist before, was disabled and is now enabled -> enable notification
Did the NIC not exist before, is not enabled -> enable notification

And the fix was adding a fourth one:

Is there an existing NIC that was not in the list of NICs anymore -> disable notification

I'm still not 100% happy that there is the possibility of getting a somewhat ambiguous error on a race condition, but I might call it a day here.