I have a huge CSV file (196,244 lines) that contains \n in places other than the line endings. I want to remove those \n but keep \r\n intact.
I've tried line.replace, but it seems like it is not recognizing \r\n, so next I tried regex:
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)
but it is not keeping \r\n; it is removing everything. I can't do it in Notepad because it crashes on this file.
CodePudding user response:
You are not exposing the line breaks to the regex engine. When a file is opened with open in r (text) mode, the line breaks are "normalized" to LF, so every \r\n reaches your code as a plain \n. To keep them all in the input, you can read the file in binary mode using b. Then, you need to remember to also use the b prefix with the regex pattern and replacement.
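To see the normalization in action, here is a minimal sketch. The sample bytes and the temporary file are made up for illustration; it writes mixed line endings, then reads them back in default text mode and in binary mode and applies the same pattern to both.

```python
import os
import re
import tempfile

# Hypothetical sample: one stray "\n" inside a record, "\r\n" as the real line ending.
sample = b"foo\nbar\r\nbaz\r\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Default text mode (newline=None): "\r\n" is translated to "\n" on read.
    with open(path, "r") as inf:
        text = inf.read()
    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as inf:
        raw = inf.read()
finally:
    os.remove(path)

# No "\r" survives in text mode, so the lookbehind never blocks a match.
text_fixed = re.sub("(?<!\r)\n", " ", text)
raw_fixed = re.sub(b"(?<!\r)\n", b" ", raw)

print(repr(text))        # 'foo\nbar\nbaz\n'
print(repr(text_fixed))  # 'foo bar baz '
print(repr(raw_fixed))   # b'foo bar\r\nbaz\r\n'
```

In text mode every line break is replaced, which matches the "removing everything" symptom from the question; in binary mode only the stray \n is replaced.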
You can use

import re

with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
Now, the whole file will be read into a single bytes object (with inf.read()), so the line breaks are visible to the regex and will be matched and replaced.
Pay attention to:
- "rb" when reading the file in
- "wb" when writing the file out
- re.sub(b"(?<!\r)\n", b" ", inf.read()) contains b prefixes with the string literals, and inf.read() reads the file contents into a single variable.
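If reading the whole file into memory is a concern, the same binary-mode idea can probably be applied chunk by chunk. The following is only a sketch under assumptions: replace_stray_lf and chunk_size are names invented here, and the one subtle point is a "\r\n" pair split across two chunks, which is handled by holding a trailing "\r" back until the next chunk arrives.

```python
import io
import re

# Same pattern as in the answer: a "\n" that is not preceded by "\r".
STRAY_LF = re.compile(rb"(?<!\r)\n")

def replace_stray_lf(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst, replacing stray LF with a space and keeping CRLF."""
    carry = b""  # a trailing "\r" held back from the previous chunk
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        # If the data ends with "\r", its "\n" may sit in the next chunk:
        # hold the "\r" back so the lookbehind can see it next time round.
        if data.endswith(b"\r"):
            data, carry = data[:-1], b"\r"
        else:
            carry = b""
        dst.write(STRAY_LF.sub(b" ", data))
    dst.write(carry)  # flush a final lone "\r", if any

# Tiny chunk size on in-memory streams, to exercise the chunk boundaries.
src = io.BytesIO(b"foo\nbar\r\nbaz\r\n")
dst = io.BytesIO()
replace_stray_lf(src, dst, chunk_size=4)
print(dst.getvalue())  # b'foo bar\r\nbaz\r\n'
```

With real files, src and dst would be the objects returned by open(filetoread, "rb") and open(filetowrite, "wb").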
CodePudding user response:
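When you open a file with a naive open() call, TextIOWrapper gives you a view of the file in which the various newline styles are all translated to a plain \n. Explicitly setting newline="\r\n" should allow you to read and write the newlines the way you expect: on reading, lines are split only at \r\n and returned untranslated, and on writing, each \n is translated to \r\n.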
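with open(path_src, newline="\r\n") as fh_src:
    with open(path_dest, "w", newline="\r\n") as fh_dest:
        for line in fh_src:  # with newline="\r\n", lines end only at "\r\n"
            ended = line.endswith("\r\n")  # the last line may lack a terminator
            body = line[:-2] if ended else line
            fh_dest.write(body.replace("\n", " "))
            if ended:
                fh_dest.write("\n")  # newline="\r\n" translates this to "\r\n"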
Example content:

>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'
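One detail worth checking on the write side: with newline="\r\n", each "\n" that is written gets translated to "\r\n", so a terminator written as a literal "\r\n" would come out as "\r\r\n". The sketch below illustrates this with an in-memory buffer; written_bytes is a helper name made up for this example.

```python
import io

def written_bytes(text, newline):
    """Return the bytes a TextIOWrapper emits for text with the given newline setting."""
    buf = io.BytesIO()
    wrapper = io.TextIOWrapper(buf, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()  # push the encoded text down into buf
    return buf.getvalue()

print(written_bytes("a\n", "\r\n"))    # b'a\r\n'
print(written_bytes("a\r\n", "\r\n"))  # b'a\r\r\n'
print(written_bytes("a\r\n", ""))      # b'a\r\n'
```

So with newline="\r\n" on the output file, the terminator only needs to be written as "\n"; alternatively, newline="" turns translation off and "\r\n" can be written verbatim.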