I'm trying to write a regex that will find any CRLF in Python.
I can open the file and use its newlines attribute to determine whether it uses CRLF or LF, but my numerous regex attempts have failed:
import re

with open('test.txt', 'rU') as f:
    text = f.read()
    print repr(f.newlines)

regex = re.compile(r"[^\r\n] ", re.MULTILINE)
print(regex.match(text))
I've done numerous iterations on the regex, and in every case it will either detect \n as \r\n or not work at all.
CodePudding user response:
You could try using the re library to search for the \r and \n patterns.
import io
import re

# newline="" turns off newline translation, so \r\n and \r reach the regex
# unchanged (the "rU" mode would convert them all to \n before the search runs).
with io.open("test.txt", "r", newline="") as f:
    for line in f:
        if re.search(r"\r\n", line):
            print("Found CRLF")
            line = re.sub(r"\r\n", "\n", line)
        elif re.search(r"\r", line):
            print("Found CR")
            line = re.sub(r"\r", "\n", line)
        elif re.search(r"\n", line):
            print("Found LF")
        print(line)
Assuming your test.txt file looks something like this:
This is a test file
with a line break
at the end of the file.
CodePudding user response:
As I mentioned in a comment, you're opening the file with universal newlines, which means that Python will automatically perform newline conversion when reading from or writing to the file. Your program therefore will not see CR-LF sequences; they will be converted to just LF.
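To make the conversion concrete, here is a minimal Python 3 sketch that writes a few bytes to a temporary file and reads them back in both modes. The helper name read_both_ways and the sample bytes are made up for illustration.

```python
import os
import tempfile


def read_both_ways(data):
    """Write data to a temp file, then read it back in text and binary mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "r") as f:   # text mode: universal newlines is the default
            as_text = f.read()
        with open(path, "rb") as f:  # binary mode: bytes come back unchanged
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes


as_text, as_bytes = read_both_ways(b"one\r\ntwo\n")
print(repr(as_text))   # the \r\n has already been converted to \n
print(repr(as_bytes))  # the \r\n is still present
```

A regex for \r\n can only ever match against the second value.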
Generally, if you want to portably observe all bytes from a file unchanged, then you must open the file in binary mode:
In Python 2:
from __future__ import print_function
import re

with open('test.txt', 'rb') as f:
    text = f.read()

regex = re.compile(r"[^\r\n] ", re.MULTILINE)
print(regex.match(text))
In Python 3:
import re

with open('test.txt', 'rb') as f:
    text = f.read()

regex = re.compile(rb"[^\r\n] ", re.MULTILINE)
print(regex.match(text))
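Once the data is read in binary mode, telling CRLF apart from a bare LF is a matter of anchoring the pattern. The following is only a sketch for Python 3: the helper count_line_endings and the sample bytes are hypothetical, and it uses a lookahead and a lookbehind so that a \n preceded by \r is not counted as a standalone LF.

```python
import re

# Bytes patterns, because the file is read in binary mode.
CRLF = re.compile(rb"\r\n")
LONE_CR = re.compile(rb"\r(?!\n)")   # \r not followed by \n
LONE_LF = re.compile(rb"(?<!\r)\n")  # \n not preceded by \r


def count_line_endings(data):
    """Count each kind of line ending in a bytes object (hypothetical helper)."""
    return {
        "CRLF": len(CRLF.findall(data)),
        "CR": len(LONE_CR.findall(data)),
        "LF": len(LONE_LF.findall(data)),
    }


sample = b"one\r\ntwo\nthree\rfour\r\n"
print(count_line_endings(sample))  # {'CRLF': 2, 'CR': 1, 'LF': 1}
```

To apply it to a real file, read the contents with open('test.txt', 'rb') as above and pass the result of f.read() to the helper.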