I have a bunch of text files that begin with a byte order mark (BOM) and use CRLF (\r\n) line endings. For example, here is an octal dump snippet:
$ od -bc P21_T_3-28-2022.txt
0000000 357 273 277 163 164 141 147 145 040 061 015 012 120 154 141 171
357 273 277 s t a g e 1 \r \n P l a y
0000020 151 156 147 040 164 150 145 163 145 040 164 167 157 040 147 141
i n g t h e s e t w o g a
0000040 155 145 163 054 040 162 145 155 151 156 144 145 144 040 155 145
m e s , r e m i n d e d m e
0000060 040 157 146 040 164 151 155 145 163 040 164 150 141 164 040 111
o f t i m e s t h a t I
<snip>
I am using this code to read the file:
lines = open(file, "r", encoding='utf-8').read().splitlines()
print(lines[0])
The first line prints like this, without the CRLF:
'\ufeffstage 1'
How do I get rid of the BOM character while reading?
CodePudding user response:
You need to specify the encoding as utf-8-sig, which decodes the file as UTF-8 and drops a leading BOM if one is present:
lines = open(file, "r", encoding='utf-8-sig').read().splitlines()
print(lines[0])
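To check this without touching the real files, here is a minimal sketch that writes a small sample file with a BOM and CRLF endings and reads it back. The helper name read_lines_without_bom and the sample contents are made up for illustration (the bytes just mimic the start of the dump in the question):

```python
import os
import tempfile


def read_lines_without_bom(path):
    """Read a text file as UTF-8, dropping a leading BOM if one is present."""
    # 'utf-8-sig' skips the BOM bytes (EF BB BF) at the start of the file;
    # files without a BOM are read as ordinary UTF-8.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read().splitlines()


# Hypothetical sample data: BOM + two CRLF-terminated lines.
sample = b"\xef\xbb\xbfstage 1\r\nPlaying these two games\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

lines = read_lines_without_bom(sample_path)
os.remove(sample_path)

print(repr(lines[0]))  # 'stage 1'
```

Using a with block also closes the file handle as soon as the read is done, which matters when looping over many files.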