Most of our services are on Google cloud, but we connect to Azure Cognitive search for full text searches.
Since around a month ago, random delays started appearing between app engine and cognitive search, with requests consistently taking 127.2-127.3 seconds, i.e. exactly 127 seconds more than normal requests. This is specifically on the request to cognitive search, and not any other part of the infra or code.
Strangely, these delays do not appear from local testing, vms, k8s or anywhere else. Still, after taking 3 weeks to answer, google insists this is 'an azure problem' caused by 'packet loss'. They do indicate that some requests connect to an ipv4 and others to an ipv6 address, but I have found no correlation here either.
What kind of problem could cause such an extremely specific delay?
CodePudding user response:
It appears this is due to IPv6 problems, and the requests library (or underlying python infrastructure) switches to IPv4 after a number of exponentially backed-off retries, leading to ~2^7-1 seconds delay.
Forcing an IPv4 connection canbe done using this ugly hack:
import socket
import requests.packages.urllib3.util.connection as urllib3_cn
# https://pythonadventures.wordpress.com/2019/06/28/force-requests-to-use-ipv4/
urllib3_cn.allowed_gai_family = lambda: socket.AF_INET # force IPv4