How can I change specific recurring text on a very large HTML file?

I have a very large HTML file (about 20 MB) and I need to remove a large number of nodes of this form:

<tr><td>SPECIFIC-STRING</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>

The file I need to work on is basically made up of thousands of these blocks, and I only need to remove those whose first cell contains a specific string, for instance all those where the first string is "banana":

<tr><td>banana</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>

I tried to do this by opening the file in Geany and using the replace feature with this regex:

<tr><td>banana<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr><tr><td(.*)<\/td><\/tr>

but the console reported that it removed X occurrences, when I know there are far more occurrences than that in the file. Firefox, Chrome and Brackets can't even display the HTML source of the file because of its size. I can't think of another way to do this, given my inexperience with HTML.

CodePudding user response：

You could use a stream editor, which, as the name suggests, streams the file content and so never loads the whole file into main memory.

A popular stream editor is sed, and it supports regular expressions.

Your command would have the following structure.

sed -i -E 's/SEARCH_REGEX/REPLACEMENT/g' INPUTFILE

-E for support of extended RegEx
-i for in-place editing mode
s denotes that you want to replace values
g is for global. By default sed replaces only the first occurrence on each line, so to replace all occurrences you must provide g
SEARCH_REGEX is the RegEx you need to find the substrings you want to replace
REPLACEMENT is the value you want to replace all matches with
INPUTFILE is the file sed will read line by line, performing the replacement as it goes.
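
For comparison, the same read-a-line, substitute, write-it-out loop can be sketched in Python. This is only an illustration under assumptions, not a replacement for the sed command: the function names and file paths are made up for the example, it assumes each two-row block sits entirely on one line, and it assumes the cell text never contains a '<' character (which is why it uses [^<]* instead of .*).

```python
import re


def build_pattern(first_cell: str) -> re.Pattern:
    """Regex for one two-row block whose first cell equals first_cell."""
    # Assumption: cell text never contains '<', so [^<]* cannot run past a cell.
    cell = r"<td>[^<]*</td>"
    first_row = r"<tr><td>" + re.escape(first_cell) + r"</td>" + cell + cell + r"</tr>"
    second_row = r"<tr><td[^>]*>[^<]*</td></tr>"
    return re.compile(first_row + second_row)


def remove_blocks(src_path: str, dst_path: str, first_cell: str) -> int:
    """Copy src to dst line by line, dropping matching blocks.

    Returns the number of blocks removed.
    """
    pattern = build_pattern(first_cell)
    removed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # only one line is held in memory at a time
            line, count = pattern.subn("", line)
            removed += count
            dst.write(line)
    return removed
```

A call such as remove_blocks("big.html", "big-clean.html", "banana") (hypothetical file names) writes the cleaned copy to a second file instead of editing in place, so the original stays untouched if the pattern turns out to be wrong.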

CodePudding user response：

While regex may not be the best tool for this kind of job, try this adjustment to your pattern:

<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>

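That makes your .* matches lazy. I suspect the greedy versions are consuming too much: a greedy .* grabs as much text as it can, so when several rows sit on the same line a single match can run from the first "banana" row all the way to the last </td></tr> on that line. That would count as one occurrence and would explain the lower number you saw.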
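
The difference is easy to see on a tiny sample. The sketch below uses Python's re module rather than Geany, and the three-block sample string is invented for the illustration; the two patterns are the ones from the question and from this answer, minus the \/ escaping, which Python does not need.

```python
import re

# Hypothetical sample: three blocks on a single line, two of them "banana".
row = ('<tr><td>{}</td><td>x</td><td>y</td></tr>'
       '<tr><td style="padding-top:0" colspan="3">z</td></tr>')
sample = row.format("banana") + row.format("apple") + row.format("banana")

greedy = r'<tr><td>banana</td><td>(.*)</td><td>(.*)</td></tr><tr><td(.*)</td></tr>'
lazy = r'<tr><td>banana</td><td>(.*?)</td><td>(.*?)</td></tr><tr><td(.*?)</td></tr>'

# subn returns the new string and the number of substitutions made.
greedy_out, greedy_count = re.subn(greedy, "", sample)
lazy_out, lazy_count = re.subn(lazy, "", sample)

print("greedy:", greedy_count, "match(es), remaining text:", repr(greedy_out))
print("lazy:  ", lazy_count, "match(es), remaining text:", repr(lazy_out))
```

On this sample the greedy pattern makes a single match that spans all three blocks, so the "apple" block is deleted along with the two "banana" blocks, while the lazy pattern makes two matches and leaves the "apple" block in place.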