How to parse a text file, find the needed string, and delete X symbols after a keyword?

Time:06-29

I have a source .html file. Inside its code there are blocks with links to files, like this:

</div><a href="some-adress" ><img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/123bdsfh5235235352366345_name2.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/34f522352342342352352323_name3.jpg" loading="lazy" width="70" >

The number of symbols in 34f522352342342352352323_name3.jpg is fixed, but the symbols themselves are different in every link.

How can I open the source file, read it, delete (for example) 24 symbols after "data/" in every line of the document, and then save it?

CodePudding user response:

As I understand it, you only want to keep something like data/_name1.jpg. Whichever part you need to remove, it can be done with regular expressions, like this:

import re

html = """
</div><a href="some-adress" ><img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/123bdsfh5235235352366345_name2.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/34f522352342342352352323_name3.jpg" loading="lazy" width="70" >
"""

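# Capture whatever sits between "data/" and "_name" in each src attribute.
res = re.findall(r'src="data/(.*)_name.*"', html)
for item in res:
    # Remove every occurrence of the captured prefix from the document.
    html = html.replace(item, "")
print(html)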

If you want to remove a fixed number of characters instead, modify the regular expression, for example like this (it captures only the first 20 characters, whatever they are):

res = re.findall('src="data/(.{20}).*_name.*"', html)
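
The same idea can also be written as a single substitution, without the findall/replace loop. The following is only a minimal sketch: the function name drop_chars_after_data and its parameter n are made up for illustration, and it assumes that every link starts with src="data/ and has at least n characters after it on the same line.

```python
import re

def drop_chars_after_data(html: str, n: int = 24) -> str:
    """Remove the first n characters that follow src="data/ (hypothetical helper)."""
    # Group 1 keeps the src="data/ prefix; the n characters after it are dropped.
    pattern = re.compile(r'(src="data/).{%d}' % n)
    return pattern.sub(r'\1', html)

sample = '</div><a href="some-adress" ><img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >'
result = drop_chars_after_data(sample)
print(result)
```

With n=24 the sample line ends up with src="data/_name1.jpg". Unlike str.replace, this only touches text directly after src="data/, so the same 24 characters appearing elsewhere in the page are left alone.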

CodePudding user response:

Your .html file has some syntax errors, so it may cause errors with BeautifulSoup. I therefore suggest reading and writing the file as plain text and making the modification with re.

import re

input_f  = 'input.html'
output_f = 'output.html'
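# Read the whole input first, so this still works if output_f is the same file.
with open(input_f, 'r') as in_f:
    old_contents = in_f.read()

# Delete the 24 characters after "data/" together with the "_" that follows them.
new_contents = re.sub(r'(?<=data/).{24}_', '', old_contents)

with open(output_f, 'w') as out_f:
    out_f.write(new_contents)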

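p.s. If you want to overwrite the original file (not recommended), use your input_f as the output_f, but make sure the file has been read completely before it is opened for writing, because opening it in 'w' mode truncates it.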
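
If overwriting is really needed, one cautious way is to write the result to a temporary file and swap it in only after the write has succeeded. This is a rough sketch, not part of the answer above: the function name rewrite_in_place, the parameter n, and the UTF-8 encoding are all assumptions, and here only the n characters after "data/" are removed (the "_" is kept).

```python
import os
import re
import tempfile

def rewrite_in_place(path: str, n: int = 24) -> None:
    """Delete n characters after every "data/" in the file at path (hypothetical helper)."""
    # Read everything first; opening the same path with 'w' would truncate it.
    with open(path, 'r', encoding='utf-8') as f:
        old_contents = f.read()

    new_contents = re.sub(r'(?<=data/).{%d}' % n, '', old_contents)

    # Write to a temporary file in the same directory, then replace the original.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
        tmp.write(new_contents)
    os.replace(tmp_path, path)
```

Calling rewrite_in_place('input.html') would then update the file where it is. Keeping a backup copy of the original is still a good idea, since the old contents are gone after the replace.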
