I'm developing a simple tool that extracts relevant data from HTML files and writes it to TXT files. So far, I've achieved most of what I had in mind, but the final result is still unusable because lots of lines consisting only of whitespace keep getting written to the final TXT files. I'll attach a picture of what one of the TXT files currently looks like:
Ideally, I'd want all lines containing text to be consecutive. How do I ignore all the lines containing ONLY spaces (i.e. containing no alphanumeric characters) when reading the HTML file once I've gotten rid of the tags? (In the TXT files, the spaces are what remains after deleting everything between "<" and ">".)
CodePudding user response:
You should post some code, for instance showing how you write your TXT file. Anyway, if you process the text line by line, you can simply add a condition:
if len(line.strip()) > 0:
    f.write(line)
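For context, here is a minimal sketch of how that condition might sit inside a write loop. The function name `write_non_blank` and the sample lines are made up for illustration, and `io.StringIO` stands in for the real output file so the snippet runs on its own:

```python
import io

def write_non_blank(lines, f):
    """Write only the lines that contain something other than whitespace."""
    for line in lines:
        # strip() removes spaces, tabs and the trailing newline, so a
        # whitespace-only line becomes an empty string of length 0.
        if len(line.strip()) > 0:
            f.write(line)

# Hypothetical sample of what might be left after removing the tags
lines = ["Title\n", "   \n", "\t\n", "Body text\n", "\n"]
out = io.StringIO()  # stand-in for the opened TXT file
write_non_blank(lines, out)
result = out.getvalue()
print(result)
```

The kept lines are written unchanged, so any indentation inside a line that does contain text is preserved.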
CodePudding user response:
Use str.strip to get rid of the spaces, then you can use filter to remove the (then empty) lines:
example = """
AAA
f
ffifljlsehfshogfse
hello
"""
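def remove_blank_lines(s):
    lines = s.split("\n")
    # Strip each line so that whitespace-only lines become empty strings
    lines = map(str.strip, lines)
    # filter(None, ...) drops the empty strings
    lines = filter(None, lines)
    return "\n".join(lines)
# Or as a one-liner:
# remove_blank_lines = lambda s: "\n".join(filter(None, map(str.strip, s.split("\n"))))
print(remove_blank_lines(example))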
CodePudding user response:
with open("<file_name>.txt") as f:
    # Keep only the lines that contain something other than whitespace
    data = list(filter(lambda x: x.strip(), f.readlines()))
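To go from that filtered list to a cleaned file, one possible sketch is shown below. The helper name `copy_without_blank_lines` and the file names `raw.txt` and `clean.txt` are assumptions for illustration, and a temporary directory is used only so the example is self-contained:

```python
import os
import tempfile

def copy_without_blank_lines(src_path, dst_path):
    """Copy src to dst, skipping lines that contain only whitespace."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Iterating over the file object yields one line at a time,
        # so the whole file never has to be held in memory.
        dst.writelines(line for line in src if line.strip())

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "raw.txt")    # hypothetical input file
    dst_path = os.path.join(tmp, "clean.txt")  # hypothetical output file
    with open(src_path, "w") as f:
        f.write("first\n   \n\nsecond\n \t \nthird\n")
    copy_without_blank_lines(src_path, dst_path)
    with open(dst_path) as f:
        cleaned = f.read()

print(cleaned)
```

Writing to a separate output file rather than overwriting the input keeps the original around in case the filter turns out to be too aggressive.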