Python re.sub() does not perform any substitution if used within function

Time:09-06

I've been trying to use regular expressions to clean some .txt files contained in a local folder, but the script makes no modifications to my string variables. The original content I need to modify is something like "words words wor-\r\nds words words wo-\r\n\r\n\r\nrds"; I need to remove any line-final hyphens and all line breaks.

The function works fine up until line 7 (it seems to correctly access a sample file and print its contents as a string), but when I apply re.sub to it (line 8) and inspect my modified variable (line 9), the script still returns the unmodified string. However, if I define txt_contents as a separate variable outside the function and use re.sub on it, it does perform the modifications that I expect. What am I doing wrong? Do I even need to define txt_clean within the function? I have a list of substitutions to perform on these same files and would prefer not to redefine my variable for each one. Thanks in advance!

1 def clean_files(dir):
2    for root, dirs, files in os.walk(dir):
3        for file in files:
4            with open(file, "r", encoding="utf-8") as txt_file:
5                txt_contents = txt_file.read()
6                print(txt_contents) # OK
7                print(type(txt_contents)) # correctly returns "str"
8                txt_clean = re.sub('-(\r\n) ', '', txt_contents)
9                print(txt_clean) # still returns the same text

CodePudding user response:

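By default, Python translates newlines when you read a file in text mode: every \r\n in the file arrives in your string as \n, so a pattern that contains \r\n has nothing to match. Look at the documentation for open() and how you can change that with the newline parameter. If you want the newlines left unchanged, pass newline="" to open() or read the file in binary mode.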
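Here is a minimal sketch of that behaviour, assuming the files are UTF-8 and use \r\n line endings as in the question. The names `SUBSTITUTIONS` and `clean_text` are hypothetical, made up for this illustration, and the patterns are one possible way to cover both \r\n and \n. Note also that the pattern in the question ends with a space, which the sample string does not have after the line break, so it would not match even with the newlines left untouched.

```python
import os
import re
import tempfile

# Hypothetical list of (pattern, replacement) pairs, applied in order.
# r"\r?\n" matches both "\r\n" and "\n", so it works however the file was read.
SUBSTITUTIONS = [
    (r"-(?:\r?\n)+", ""),  # line-final hyphen plus the line break(s) after it
    (r"(?:\r?\n)+", ""),   # any remaining line breaks
]


def clean_text(text):
    """Apply every substitution in turn, reusing the same variable."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


sample = b"words words wor-\r\nds words words wo-\r\n\r\n\r\nrds"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write in binary mode so the "\r\n" sequences land on disk unchanged.
    with open(path, "wb") as f:
        f.write(sample)

    # Default text mode: "\r\n" is translated to "\n" while reading.
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()

    # newline="" turns the translation off, so "\r\n" survives.
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

print("\r\n" in translated)    # False
print("\r\n" in untranslated)  # True
print(clean_text(translated))  # words words words words words words
```

Because the loop reassigns `text` on each pass, adding another substitution only means adding another pair to the list. Inside an `os.walk` loop like the one in the question, the path passed to `open()` would also need to be `os.path.join(root, file)`, unless every file sits in the current working directory.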
