I've got two scripts, send.py and receive.py. send.py is the host: it opens a connection and waits for receive.py to connect. Once the connection is successful, in theory, I could send any file from one device (running send.py) to another (running receive.py). Little problem... I was trying to read from a random music file I found on my computer to make sure it works with any type of file, and encountered the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte
What causes this error?
send.py:
from socket import *
port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.bind(('0.0.0.0', port))
s.listen(1)
c, addr = s.accept()
buffersize = 128
fname = '✵ТГК -Гелик 2022✵ Gelik✵-160 (mp3cut.net).mp3' #input('File Path: ')
with open(fname, 'rb') as file:
    readfc = file.read()
c.send(fname.encode())
if len(readfc) > buffersize:
    for packet in range(len(readfc) % buffersize):
        c.send(readfc[0:buffersize])
and receive.py:
from socket import *
port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.connect(('192.168.0.171', port))
index = 0
while True:
    data = s.recv(1024)
    if not data:
        pass
    else:
        index = 1
    if index == 1:
        filename = data.decode()
    else:
        with open(filename, 'ab') as file:
            file.write(data.decode())
And here are the first lines from the music file:
ID3 #TSSE Lavf59.16.100 яыа Info #R ђ.3
!$&) .0369:=@CEGJMORUVY\_acfiknqsux{}Ђ‚…‡ЉЌЏ‘”—љњћЎЈ¦©«°і¶ёєЅАВЕЗКМПТФЦЩЬЮбгжилортхшъэ Lavc59.18 $@ ђ.3ЮЬмf яыаD р i
CodePudding user response:
This code assumes that a single send in the sender matches a single recv in the recipient. This assumption is wrong for TCP: TCP is only an unstructured byte stream, not a structured message transport that would preserve message boundaries across send/recv.
This means that the initial data = s.recv(1024) in the recipient might not only include the filename, but might also already include parts of the music file. It is then a mix of the UTF-8 encoded filename (multi-byte characters) followed by the binary music data (bytes). Calling filename = data.decode() on this will successfully decode the initial filename, but it will then continue past the end of the filename and try to treat the binary music data as UTF-8 encoded characters too. This leads to the observed decoding error.
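To make that concrete, here is a minimal sketch that needs no network at all. The filename and the audio bytes are made up for illustration; the point is only what happens when text and binary data end up in the same buffer, as they can after a single recv.

```python
# Hypothetical payload: a UTF-8 filename immediately followed by raw
# MP3-like bytes, the way one recv() may return them over TCP.
filename = "song ✵.mp3"
audio = b"ID3\x04\x00" + b"\xff\xfb\x90\x00"  # 0xff can never start a UTF-8 sequence

received = filename.encode("utf-8") + audio


def try_decode(data: bytes) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error at byte {exc.start}: {exc.reason}"


print(try_decode(filename.encode("utf-8")))  # the filename alone decodes fine
print(try_decode(received))                  # filename + audio does not
```

The second call fails at the first 0xff byte of the audio data, which is the same kind of failure as in the traceback from the question.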
The fix is to clearly mark where the filename ends and the binary data starts, and then decode only the filename as text and treat the rest as bytes. A common approach is to prefix the filename with its length so that it is clear where it ends. Another approach might be to add a \0 at the end of the filename and split the incoming data on this delimiter. This works because a zero byte never appears in the UTF-8 encoding of any character other than NUL, and NUL itself is invalid in filenames.
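A minimal sketch of the length-prefix variant could look like the following. The 2-byte big-endian header, the helper names (recv_exact, send_named_payload, recv_named_payload) and the convention that the file data runs until the sender closes the connection are all assumptions made for this example, not something from the original scripts.

```python
import socket
import struct

# Assumed wire format: 2-byte big-endian filename length, filename, file data.
HEADER = struct.Struct("!H")


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf += chunk
    return bytes(buf)


def send_named_payload(sock: socket.socket, filename: str, payload: bytes) -> None:
    """Send the length-prefixed filename followed by the raw payload."""
    name = filename.encode("utf-8")
    sock.sendall(HEADER.pack(len(name)) + name + payload)


def recv_named_payload(sock: socket.socket):
    """Return (filename, payload); only the filename is ever decoded."""
    (name_len,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    filename = recv_exact(sock, name_len).decode("utf-8")
    chunks = []
    while chunk := sock.recv(4096):  # b"" means the sender closed
        chunks.append(chunk)
    return filename, b"".join(chunks)


if __name__ == "__main__":
    a, b = socket.socketpair()  # stand-in for the real TCP connection
    send_named_payload(a, "song ✵.mp3", b"ID3\xff\xfb\x90\x00")
    a.close()
    print(recv_named_payload(b))
    b.close()
```

With a 2-byte header the encoded filename is limited to 65535 bytes, which should be plenty here. Note that sendall is used instead of send, since send may transmit only part of the buffer.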
Apart from that, the later data.decode() when reading the music data is plain wrong, since there is no matching encode() on the sender side. And there should not be one, since this is binary data, i.e. already bytes.
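For the file contents themselves, a receive loop along these lines would write the data exactly as it arrives. This is only a sketch: receive_to_file is a made-up helper name, and it assumes that the filename has already been read from the socket and that the sender closes the connection once the file is complete.

```python
import os
import socket
import tempfile


def receive_to_file(sock: socket.socket, path: str, bufsize: int = 4096) -> int:
    """Write every received chunk to path unchanged; return the byte count."""
    total = 0
    with open(path, "wb") as out:
        while chunk := sock.recv(bufsize):  # b"" means the sender closed
            out.write(chunk)                # bytes in, bytes out: no decode()
            total += len(chunk)
    return total


if __name__ == "__main__":
    a, b = socket.socketpair()              # stand-in for the real TCP connection
    a.sendall(bytes(range(256)))            # every possible byte value once
    a.close()
    target = os.path.join(tempfile.mkdtemp(), "out.bin")
    print(receive_to_file(b, target), "bytes written to", target)
    b.close()
```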
CodePudding user response:
In addition to what @StefanUllrich said:
You receive binary data with data = s.recv(1024).
You open your file in binary mode with open(filename, 'ab').
All of this is correct.
Why do you think you need to decode the binary data to a string in file.write(data.decode())? That's what's causing the exception you're seeing. Just don't call .decode() there, write the data as it is!