Learning Python - len() returns 2n + 2

Time:11-16

I'm sorry if this is a duplicate post but search seemed to yield no useful results...or maybe I'm such a noob that I'm not understanding what is being said in the answers.

I wrote this small script for practice (following "Learn Python the Hard Way"). I tried to make a shorter version of some code that was already given to me.

from sys import argv

script, from_file, to_file = argv

# here is the part where I tried to simplify the commands and see if I still get the same result.
# Turns out it's the same 2n + 2
trial = open(from_file)
trial_data = trial.read()
print(len(trial_data))
trial.close()

# actual code after defining the argument variables
in_file = open(from_file).read()

input(f"Transferring {len(in_file)} characters from {from_file} to {to_file}, hit RETURN to continue, CTRL-C to abort.")
# in_data = in_file.read()

out_file = open(to_file, 'w').write(in_file)

When using len(), it always seems to return a value of 2n + 2 instead of n, where n is the actual number of characters in the text file. I also made sure there are no extra lines in the text file.

Can someone kindly explain?

TIA

I was expecting the exact number of characters found in the txt file to be returned. Turns out it's too much to ask.

CodePudding user response:

Possibly the extra characters are newline characters, or some other characters that your text editor does not display?

Try making a simple test file with only one character (note that echo also appends a newline), e.g. run

echo "a" > test_file

There is also a dedicated command-line utility for counting characters:

wc -m
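
The same check can be done from Python by looking at the raw bytes, which makes invisible characters show up as escape sequences. This is only a minimal sketch: describe_file is a hypothetical helper name, and the sample file is created in a temporary directory with the same content that the echo command above would write on a Unix-like shell.

```python
import tempfile
from pathlib import Path


def describe_file(path):
    """Return (byte_count, first_bytes) for a file, without decoding it."""
    raw = Path(path).read_bytes()
    return len(raw), raw[:16]


# Demo: a file holding "a" plus the newline that echo appends.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test_file"
    sample.write_bytes(b"a\n")
    size, head = describe_file(sample)
    print(size, head)  # 2 b'a\n'
```

If the first bytes come back as b'\xff\xfe' or b'\xfe\xff', the file starts with a UTF-16 byte order mark.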

CodePudding user response:

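The observed behaviour is consistent with the file being encoded in UTF-16 with a BOM and its contents being read as raw bytes, i.e. opened in binary mode (or in text mode with a single-byte default encoding, which yields one character per byte).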

If you then call len() on the contents of that file, it counts the bytes in the file rather than the characters. The number of bytes depends on the specific encoding.

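That would explain both the 2n, because every UTF-16 character in the Basic Multilingual Plane takes 2 bytes, and the + 2, which is the 2-byte BOM (a trailing newline, if present, would add further bytes).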
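
The arithmetic can be reproduced with a short sketch. It assumes the file really is UTF-16 with a BOM, which the question does not confirm; the sample text and file name are made up for illustration, and latin-1 stands in for whatever single-byte default encoding the platform might use in text mode.

```python
import codecs
import tempfile
from pathlib import Path

text = "hello"  # n = 5 characters, no trailing newline

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    # The "utf-16" codec writes a 2-byte BOM, then 2 bytes per BMP character.
    path.write_text(text, encoding="utf-16")

    raw = path.read_bytes()
    byte_count = len(raw)  # 2n + 2
    has_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

    # Decoding with a single-byte encoding gives one character per byte,
    # so len() still reports 2n + 2.
    misread_count = len(raw.decode("latin-1"))

    # Naming the right encoding consumes the BOM and returns n characters.
    char_count = len(path.read_text(encoding="utf-16"))

print(byte_count, has_bom, misread_count, char_count)  # 12 True 12 5
```

Under that assumption, passing encoding="utf-16" to open() in the original script should make len() return n.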
