I have a source .html file. Inside its code there are blocks with links to files like this:
</div><a href="some-adress" ><img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/123bdsfh5235235352366345_name2.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/34f522352342342352352323_name3.jpg" loading="lazy" width="70" >
The number of characters in a file name such as 34f522352342342352352323_name3.jpg is fixed, but the characters themselves are different each time.
How can I open the source file, read it, delete, for example, the 24 characters after "data/" in every line of the document, and then save it?
CodePudding user response:
If I understand correctly, you only want to keep something like data/_name1.jpg. Whichever part you want to remove, it can be done with regular expressions, like this:
import re
html = """
</div><a href="some-adress" ><img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/123bdsfh5235235352366345_name2.jpg" loading="lazy" width="70" >
</div><a href="some-adress" ><img src="data/34f522352342342352352323_name3.jpg" loading="lazy" width="70" >
"""
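# Capture the part between "data/" and "_name" in every src attribute.
res = re.findall('src="data/(.*)_name.*"', html)
for item in res:
    # Remove each captured prefix from the document.
    html = html.replace(item, "")
print(html)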
If you want to match by length instead, modify the regular expression, for example like this (this captures only the first 20 characters after data/):
res = re.findall('src="data/(.{20}).*_name.*"', html)
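The same length-based removal can also be done in one pass with re.sub, without the findall loop. This is only a minimal sketch, assuming the prefix is always exactly 24 characters long and is followed by an underscore; the function name strip_prefix and its length parameter are my own, not from the question.

```python
import re

def strip_prefix(html: str, length: int = 24) -> str:
    """Remove `length` characters that directly follow src="data/ and precede an underscore."""
    # The group (src="data/) is put back via \1; the lookahead (?=_) leaves the underscore in place.
    pattern = r'(src="data/)[^"/]{%d}(?=_)' % length
    return re.sub(pattern, r'\1', html)

sample = '<img src="data/62b063932513a59a02f25f37_name1.jpg" loading="lazy" width="70" >'
print(strip_prefix(sample))
# -> <img src="data/_name1.jpg" loading="lazy" width="70" >
```

Because the pattern requires exactly `length` characters before the underscore, file names with a shorter prefix are left untouched.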
CodePudding user response:
Your .html file has some syntax errors, which may cause errors with beautifulsoup, so I suggest reading and writing the file as plain text and making the modification with re.
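import re

input_f = 'input.html'
output_f = 'output.html'

# Read the input completely before opening the output, so that
# pointing both names at the same file does not truncate it first.
with open(input_f, 'r') as in_f:
    old_contents = in_f.read()

# Remove the 24 characters after "data/" together with the underscore that follows.
new_contents = re.sub(r'(?<=data/).{24}_', '', old_contents)

with open(output_f, 'w') as out_f:
    out_f.write(new_contents)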
P.S. If you want to overwrite the original file (not recommended), just use your input_f as the output_f, making sure the file has been read in full before it is opened for writing.
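If you do decide to overwrite the original, one cautious way is to write the result to a temporary file next to it and swap that file in only after the write has succeeded. This is only a sketch under assumptions: the file is taken to be UTF-8, the function name rewrite_in_place is hypothetical, and this pattern keeps the underscore (so the result is data/_name1.jpg).

```python
import os
import re
import tempfile

def rewrite_in_place(path: str, length: int = 24) -> None:
    """Strip the fixed-length prefix after data/ and replace the file at `path`."""
    # Assumption: the file is UTF-8 text.
    with open(path, 'r', encoding='utf-8') as in_f:
        old_contents = in_f.read()

    # Delete `length` characters after "data/" when an underscore follows them.
    new_contents = re.sub(r'(?<=data/).{%d}(?=_)' % length, '', old_contents)

    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as out_f:
            out_f.write(new_contents)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file; the original is still untouched.
        os.unlink(tmp_path)
        raise
```

Calling rewrite_in_place('input.html') would then update that file, and a failure while writing leaves the original as it was.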