On a Windows i5 laptop running Python 3.7, it takes 200 ms to receive 100 MB of data via a socket, which is obviously very slow compared to RAM speed.
How can I improve this socket speed on Windows?
# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0) # ~ 0.200s
# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000 # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0) # ~ 0.020s
One can see that the server is not immediately awakened when the client fills the TCP buffer, which is a missed optimization of the Windows scheduler. In fact, the scheduler could wake up the client before the server starves, so as to reduce latency issues. Note that a non-negligible part of the time is spent in a kernel process, and its time slices match the client activity.
Overall, 55% of the time is spent in the recv function of ws2_32.dll, 10% in the send function of the same DLL, 25% in synchronization functions, and 10% in other functions, including those of the CPython interpreter. Thus, the modified benchmark is not slowed down by CPython. Additionally, synchronization is not the main source of slowdown.
When processes are scheduled, the memory throughput goes from 16 GiB/s up to 34 GiB/s, with an average of ~20 GiB/s, which is quite high (especially considering the time taken by synchronization). This means Windows performs a lot of large temporary buffer copies, especially during the recv calls.
Note that the Xeon-based platform is almost certainly slower because its processor only reaches 14 GiB/s in sequential memory access, while the i5-9600KF processor reaches 24 GiB/s. The Xeon processor also operates at a lower frequency. Such things are common for server processors, which mainly focus on scalability.
A deeper analysis of ws2_32.dll shows that nearly all the time of recv is spent in the obscure instruction call qword ptr [rip+0x3440f], which I guess is a kernel call to copy data from a kernel buffer to the user one. The same thing applies to send. This means that the copies are not done in user-land but in the Windows kernel itself...
If you want to share data between two processes on Windows, I strongly advise you to use shared memory instead of sockets. Some message passing libraries provide an abstraction on top of this (like ZeroMQ for example).
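As a rough illustration of the shared-memory approach, here is a minimal sketch using the standard multiprocessing.shared_memory module. Note the assumptions: this module requires Python 3.8 or later (the question uses 3.7), the helper names write_shared and read_shared are hypothetical, and both sides run in one process here only to keep the example short. In practice the reader would be a second process that receives the block name and payload size through some other channel.

```python
from multiprocessing import shared_memory


def write_shared(payload: bytes) -> shared_memory.SharedMemory:
    """Create a named shared memory block and copy payload into it (hypothetical helper)."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def read_shared(name: str, size: int) -> bytes:
    """Attach to an existing block by name and copy out the first size bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The block may be rounded up to a page size, so slice explicitly.
        return bytes(shm.buf[:size])
    finally:
        shm.close()


if __name__ == "__main__":
    payload = b"a" * 1_000_000
    producer = write_shared(payload)
    try:
        received = read_shared(producer.name, len(payload))
        print(len(received))
    finally:
        producer.close()
        producer.unlink()  # release the block once every reader is done
```

The consumer here still copies the data out with bytes(...); working directly on shm.buf would avoid even that copy, at the cost of having to coordinate access between the processes.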
CodePudding user response:
Don't make two calls to send if you have a bunch of data you want to send at the same time. When the implementation sees the first send, it has no reason to think there is going to be a second one, so it sends the data immediately. But when it sees the second send, it has no reason to think there won't be a third one, and so it delays sending the data to try to aggregate a full packet of data.
This would actually be fine if they were two distinct application-level messages and the party on the other side acknowledged the first message. But that's not the case here.
If you are designing an application-level protocol to work with TCP, you have to work with TCP if you care about performance.
If you don't have application-level messages, ensure you aggregate as much data as possible in each send call, at least 4 KB.
If you do have application-level messages that are acknowledged by the other side, try to include a full message in each send call.
But what you do in your code violates all these principles and gives the implementation no way to perform well.
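The advice above could be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's actual protocol: the 4-byte length header, and the helper names send_message and recv_exact, are hypothetical, and socket.socketpair() is used only so the example runs in a single process (on Unix it returns a Unix-domain pair rather than a TCP connection).

```python
import socket


def send_message(sock: socket.socket, header: bytes, body: bytes) -> None:
    """Send header and body with one sendall call instead of two separate sends."""
    sock.sendall(header + body)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed before sending all data")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    left, right = socket.socketpair()
    with left, right:
        body = b"hello"
        # Hypothetical framing: 4-byte big-endian length, then the payload.
        send_message(left, len(body).to_bytes(4, "big"), body)
        size = int.from_bytes(recv_exact(right, 4), "big")
        print(recv_exact(right, size))
```

Concatenating the header and body costs one extra copy in user space, but it hands the whole message to the TCP implementation at once, so it has no reason to hold part of it back while waiting for more data.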