I've been trying to use regular expressions to clean some .txt files contained in a local folder, but the script performs no modifications on my string variables. The original content I need to modify is something like "words words wor-\r\nds words words wo-\r\n\r\n\r\nrds"; I need to remove any line-final hyphens and all carriage returns and newlines.
The function works fine up until line 7 (it seems to correctly access a sample file and print its contents as a string), but when I apply re.sub to it (line 8) and inspect my modified variable (line 9), the script still returns the unmodified string. However, if I define txt_contents as a separate variable and use re.sub on it, it actually does perform the modifications that I expect. What am I doing wrong? Should I even need to define txt_clean within the function? I have a list of substitutions to perform on these same files and would prefer not to re-define my variable for each one. Thanks in advance!
1 def clean_files(dir):
2     for root, dirs, files in os.walk(dir):
3         for file in files:
4             with open(file, "r", encoding="utf-8") as txt_file:
5                 txt_contents = txt_file.read()
6                 print(txt_contents) # OK
7                 print(type(txt_contents)) # correctly returns "str"
8                 txt_clean = re.sub('-(\r\n) ', '', txt_contents)
9                 print(txt_clean) # still returns the same text
CodePudding user response:
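By default, Python translates newlines when you read a file in text mode: every "\r\n" in the file arrives in your string as "\n", so a pattern containing "\r\n" never matches. Look at the documentation for open() and how you can change that with the newline parameter. If you want newlines unchanged, pass newline="" to open() or read the file in binary mode.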
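For what it's worth, here is a minimal sketch of how the cleanup could look once the newline translation is taken into account. The names SUBSTITUTIONS and clean_text are hypothetical (they are not in your script), and I'm assuming that you want every line break removed outright rather than replaced by a space. The patterns use "\r?\n" so they match whether or not the file was opened with newline="". I've also joined root and the file name with os.path.join, since os.walk yields bare file names.

```python
import os
import re

# Hypothetical list of (pattern, replacement) pairs, applied in order.
# "\r?\n" matches both translated ("\n") and untranslated ("\r\n") line breaks.
SUBSTITUTIONS = [
    (r"-(?:\r?\n)+", ""),  # line-final hyphen plus every line break after it
    (r"(?:\r?\n)+", ""),   # assumption: remaining line breaks are simply dropped
]


def clean_text(text):
    """Apply each substitution in turn and return the cleaned string."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


def clean_files(directory):
    """Return a dict mapping each file path under directory to its cleaned text."""
    cleaned = {}
    for root, dirs, files in os.walk(directory):
        for name in files:
            # os.walk yields bare names, so build the full path explicitly.
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8") as txt_file:
                cleaned[path] = clean_text(txt_file.read())
    return cleaned
```

Because the loop reassigns the same text variable, adding another substitution only means appending one more pair to the list; nothing has to be re-defined per pattern. Calling clean_text on the sample string from the question should join the hyphenated pieces back into whole words.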