<elem1><elem2>20,000 Leagues Under the Sea1050251</elem2></elem1>
<elem1><elem2>1002321Robinson Crusoe1050251</elem2></elem1>
I'm working with an XML file and need to insert the elements above, extracted from it, into another XML file. The problem is that I don't know how to remove the IDs (7-digit substrings used to track position) from the strings. Removing everything between ">" and "<" isn't feasible, because the text sometimes starts with an ID and sometimes with a title that begins with digits. What I need is a way to remove any 7-digit substring from a string, but so far I've only found code that removes specific, known substrings.
CodePudding user response:
You can do this with a regular expression:
import re
string = """<elem1><elem2>20,000 Leagues Under the Sea1050251</elem2></elem1>
<elem1><elem2>1002321Robinson Crusoe1050251</elem2></elem1>"""
pattern = re.compile(r"\d{7}")  # matches any 7 consecutive digits, including 7 digits inside a longer run of digits
result = pattern.sub("", string)  # returns a copy of the string with every match replaced by ""
print(result)
Output:
<elem1><elem2>20,000 Leagues Under the Sea</elem2></elem1>
<elem1><elem2>Robinson Crusoe</elem2></elem1>
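Note that `\d{7}` also removes 7-digit runs in the middle of a title. If the IDs only ever sit at the very start or very end of the element text (an assumption based on the two sample lines, not something the question confirms), a stricter pattern can anchor the match to the tag boundary. The following is a minimal sketch under that assumption; `strip_edge_ids` and `ID_AT_EDGE` are hypothetical names, and the third sample line is made up to show the difference:

```python
import re

# Assumption: an id is a 7-digit run that directly follows ">" or directly
# precedes "<". Digits elsewhere in the text are left alone.
ID_AT_EDGE = re.compile(r"(?<=>)\d{7}|\d{7}(?=<)")


def strip_edge_ids(xml_text: str) -> str:
    """Remove 7-digit ids that touch a tag boundary (hypothetical helper)."""
    return ID_AT_EDGE.sub("", xml_text)


sample = """<elem1><elem2>20,000 Leagues Under the Sea1050251</elem2></elem1>
<elem1><elem2>1002321Robinson Crusoe1050251</elem2></elem1>
<elem1><elem2>1002322The 1234567 Stars</elem2></elem1>"""

print(strip_edge_ids(sample))
```

With this pattern the `1234567` inside the third title is kept, while a plain `\d{7}` would delete it. A title that itself begins with seven or more digits and has no leading ID would still be damaged, so this only narrows the ambiguity rather than removing it.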