Replacing \n while keeping \r\n intact

I have a huge CSV file (196,244 lines) that contains \n in places other than line endings. I want to remove those \n but keep \r\n intact. I tried line.replace, but it does not seem to recognize \r\n, so next I tried a regex:

with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)

But this does not keep \r\n; it removes everything. I can't do it in Notepad because it crashes on this file.

CodePudding user response：

You are not exposing the original line breaks to the regex engine. When a file is opened in text mode ("r"), the line breaks are "normalized" to LF, so every \r\n reaches the regex as a bare \n and the lookbehind never sees a \r. To keep the line breaks exactly as they are in the input, read the file in binary mode by adding b to the mode. You then also need to use the b prefix on the regex pattern and the replacement, because they must be bytes as well.

You can use

import re

with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))

Now the whole file is read into a single bytes object (with inf.read()), the regex sees every line break exactly as stored, and only the \n that are not preceded by \r are replaced.

Pay attention to:

"rb" to read the file in binary mode
"wb" to write the file in binary mode
re.sub(b"(?<!\r)\n", b" ", inf.read()) uses the b prefix on both the pattern and the replacement literals, and inf.read() reads the entire file contents into a single bytes object.
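Since inf.read() loads the entire file into memory, a chunked variant may be worth considering for much larger inputs. The following is only a sketch under stated assumptions: the function name replace_lone_lf_chunked and the chunk_size parameter are made up for illustration and are not part of the answer above. The one subtlety is a \r\n pair that straddles two chunks, which the sketch handles by holding back a trailing \r until the next chunk arrives.

```python
import re

# Same pattern as above: a \n that is not preceded by \r.
LONE_LF = re.compile(rb"(?<!\r)\n")


def replace_lone_lf_chunked(src_path, dest_path, chunk_size=1 << 20):
    """Replace lone \\n bytes with spaces, reading the file chunk by chunk.

    Hypothetical helper: the name and chunk_size are illustrative assumptions.
    """
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        carry = b""  # a trailing \r held back from the previous chunk
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                dest.write(carry)  # flush a held-back \r at end of file
                break
            data = carry + chunk
            if data.endswith(b"\r"):
                # Hold the \r back so a \r\n split across chunks is seen whole.
                data, carry = data[:-1], b"\r"
            else:
                carry = b""
            dest.write(LONE_LF.sub(b" ", data))
```

It would be called as replace_lone_lf_chunked(filetoread, filetowrite). For a file of a couple of hundred thousand lines, the single inf.read() above is normally fine and simpler.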

CodePudding user response：

When you open a file with a plain open() call, the text layer (TextIOWrapper) translates every kind of line ending it finds (\n, \r, or \r\n) to a plain \n when reading.

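Explicitly setting newline="\r\n" when reading makes Python split lines only on \r\n and return them untranslated. On the output side, use newline="" so that nothing is translated and the \r\n you write is stored as-is (with newline="\r\n" on a file opened for writing, every \n written is translated to \r\n, so writing "\r\n" would produce \r\r\n).

with open(path_src, newline="\r\n") as fh_src:
    # newline="" disables translation on write, so "\r\n" is stored unchanged
    with open(path_dest, "w", newline="") as fh_dest:
        for line in fh_src:  # lines are terminated by \r\n only
            has_crlf = line.endswith("\r\n")  # the last line may lack it
            body = line[:-2] if has_crlf else line
            fh_dest.write(body.replace("\n", " "))
            if has_crlf:
                fh_dest.write("\r\n")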

Example content:

>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'
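To compare how the newline argument changes what iteration yields, the same sample bytes can be read three ways. This is a minimal sketch: the read_lines helper and the temporary file location are illustrative assumptions, not part of the answers above.

```python
import os
import tempfile


def read_lines(path, newline):
    """Return the lines produced by iterating a file opened with `newline`."""
    with open(path, newline=newline) as fh:
        return list(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.data")
    with open(path, "wb") as fh:
        fh.write(b"foo\nbar\r\nbaz\r\n")

    default_lines = read_lines(path, None)  # plain open(): \r\n becomes \n
    raw_lines = read_lines(path, "")        # split on any ending, untranslated
    crlf_lines = read_lines(path, "\r\n")   # split on \r\n only, untranslated

print(default_lines)  # ['foo\n', 'bar\n', 'baz\n']
print(raw_lines)      # ['foo\n', 'bar\r\n', 'baz\r\n']
print(crlf_lines)     # ['foo\nbar\r\n', 'baz\r\n']
```

The first result shows why the lookbehind in the question never matched: by the time the regex runs, no \r is left in the text.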