Removing Extra Line Endings with Python

When converting HTML from websites to plain text, we get a lot of extra line breaks. We want at most one line break in a row. This is the function we are using, but it is ugly and doesn't cover every case. Is there a more Pythonic way to achieve this result?

    def clean_up_lines(message_text):
        text_str = str(message_text)
        text_data = text_str.replace(chr(13), "[EOL]")
        text_data = text_data.replace(chr(10), "[EOL]")
        text_data = text_data.replace("\n", "[EOL]")
        text_data = text_data.replace("\r", "[EOL]")
        for x in range(0, 10):
            text_data = text_data.replace("[EOL]       [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL]      [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL]     [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL]    [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL]   [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL]  [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL] [EOL]", "[EOL]")
            text_data = text_data.replace("[EOL][EOL]", "[EOL]")
        for x in range(0, 8):
            text_data = text_data.replace("[EOL][EOL]", "[EOL]")
        text_data = text_data.replace("[EOL]", "\n")
        return text_data

CodePudding user response：

Use re.sub() to replace every run of \n characters with \n\n if you want to keep one blank line between lines, or with \n if you want a single line break.

    import re

    s = 'Line1\n\n\nLine4'

    # \n+ matches one or more consecutive newlines
    print(re.sub(r'\n+', '\n\n', s))
    #print(re.sub(r'\n+', '\n', s))

Output:

Line1

Line4
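
The question's function also converts carriage returns and drops lines that contain only spaces, which a pattern matching only \n does not cover. A minimal sketch that handles those cases, assuming tabs should count as blank too and that indentation on non-empty lines should be kept (the name _BREAK_RUN is illustrative, not from the original):

```python
import re

# One line break (\r, \n, or the pair \r\n), followed by any number of
# lines that hold only spaces/tabs. The whole run collapses to one "\n".
# Assumption: tabs are treated like spaces; the original only handled spaces.
_BREAK_RUN = re.compile(r"[\r\n](?:[ \t]*[\r\n])*")


def clean_up_lines(message_text):
    """Collapse runs of line breaks and blank lines into a single newline."""
    return _BREAK_RUN.sub("\n", str(message_text))


if __name__ == "__main__":
    print(clean_up_lines("Line1\r\n\r\n   \nLine4"))
```

Unlike the loop over fixed-width "[EOL]   [EOL]" patterns, this does not depend on how many spaces or how many repeated breaks appear, and it leaves leading spaces on lines that contain text untouched.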

CodePudding user response：

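You could use a regex substitution to replace several adjacent newlines with a single one. Note that str.replace() does not interpret regular expressions, so use re.sub() instead:

    import re

    document = re.sub(r"\n+", "\n", document)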
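
If a regex is not wanted, one possible alternative is to split the text into lines and keep only the non-blank ones. This is a sketch under a few assumptions: the function name drop_blank_lines is hypothetical, leading and trailing blank lines (and a final newline) are discarded as well, and str.splitlines() also splits on less common separators such as form feed.

```python
def drop_blank_lines(message_text):
    """Return the text with empty and whitespace-only lines removed."""
    # splitlines() handles \n, \r, and \r\n without any manual replacement.
    lines = str(message_text).splitlines()
    # line.strip() is an empty string (falsy) for whitespace-only lines.
    return "\n".join(line for line in lines if line.strip())


if __name__ == "__main__":
    print(drop_blank_lines("Line1\r\n\r\n   \nLine4\n"))
```

This trades exact control over which characters count as a break for shorter code, so it fits best when the input is ordinary text with standard line endings.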