When converting HTML from websites to plain text, we get a lot of extra line breaks. We want at most one line break in a row. This is the function we are using, but it seems ugly and doesn't cover all use cases. Is there a more Pythonic way to achieve this result with less ugly code?
def clean_up_lines(message_text):
    text_str = str(message_text)
    text_data = text_str.replace(chr(13), "[EOL]")
    text_data = text_data.replace(chr(10), "[EOL]")
    text_data = text_data.replace("\n", "[EOL]")
    text_data = text_data.replace("\r", "[EOL]")
    for x in range(0, 10):
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
        text_data = text_data.replace("[EOL][EOL]", "[EOL]")
    for x in range(0, 8):
        text_data = text_data.replace("[EOL][EOL]", "[EOL]")
    text_data = text_data.replace("[EOL]", "\n")
    return text_data
CodePudding user response:
Just use re.sub() to replace every run of \n with \n\n if you want one blank line between blocks of text, or with a single \n if you want just one line break.
import re

s = 'Line1\n\n\nLine4'
print(re.sub(r'\n+', '\n\n', s))
# print(re.sub(r'\n+', '\n', s))
Output:
Line1

Line4
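The question's function also deals with carriage returns and with spaces sitting between line breaks, which a plain run-of-\n pattern would not catch. A minimal sketch of one way to cover those cases, assuming that whitespace-only lines should count as empty (the name collapse_line_breaks is made up for this illustration):

```python
import re

def collapse_line_breaks(text):
    # Hypothetical helper, not from the original post.
    # Normalize Windows (\r\n) and bare \r line endings to \n first.
    text = str(text).replace("\r\n", "\n").replace("\r", "\n")
    # Collapse a line break followed by one or more further line breaks,
    # allowing spaces or tabs in between, into a single \n.
    return re.sub(r"\n(?:[ \t]*\n)+", "\n", text)
```

Indentation at the start of a non-empty line is left alone, because the pattern only matches whitespace that is followed by another line break.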
CodePudding user response:
You could use a regex replacement to collapse several adjacent newlines into a single one. Note that str.replace() only matches literal text, so this needs re.sub():
import re
document = re.sub(r"\n+", "\n", document)
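If a regex feels like overkill, a regex-free variant may also work. This is only a sketch, assuming that every empty or whitespace-only line can be dropped outright; drop_blank_lines is a name invented here:

```python
def drop_blank_lines(text):
    # Hypothetical helper, not from the original post.
    # str.splitlines() splits on \n, \r\n, \r and a few other line
    # boundaries (such as form feed), so no manual normalization is needed.
    lines = str(text).splitlines()
    # Keep only lines that contain something other than whitespace.
    return "\n".join(line for line in lines if line.strip())
```

One trade-off: a trailing newline at the end of the input is not preserved, so append "\n" afterwards if the result must end with one.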