How can I change specific recurring text on a very large HTML file?

Time:03-23

I have a very big HTML file (about 20 MB) and I need to remove a large number of nodes of the following form:

<tr><td>SPECIFIC-STRING</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>

The file I need to work on is basically made of thousands of these blocks, and I only need to remove those with a specific first string, for instance all those whose first string is "banana":

<tr><td>banana</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>

I tried to do this by opening the file in Geany and using the replace feature with this regex:

<tr><td>banana<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr><tr><td(.*)<\/td><\/tr>

but the console output said it had replaced X occurrences, when I know there are far more occurrences than that in the file. Firefox, Chrome and Brackets fail even to display the HTML source of the file due to its size. I can't think of another way to do this, given my inexperience with HTML.

CodePudding user response:

You could use a stream editor, which, as the name suggests, streams the file content and so never loads the whole file into main memory.

A popular stream editor is sed, and it supports regular expressions.

Your command would have the following structure.

sed -i -E 's/SEARCH_REGEX/REPLACEMENT/g' INPUTFILE
  • -E for support of extended RegEx
  • -i for in-place editing mode
  • s denotes that you want to replace values
  • g is for global. By default sed replaces only the first occurrence on each line, so to replace all occurrences you must provide g
  • SEARCH_REGEX is the RegEx you need to find the substrings you want to replace
  • REPLACEMENT is the value you want to replace all matches with
  • INPUTFILE is the file sed will read line by line, doing the replacement on each line.
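
If sed is not available (for example on Windows), the same line-by-line substitution can be sketched in Python. This is not sed itself, only an illustration of the same idea. The names strip_rows and strip_file are made up for this example, and the pattern uses [^<]* instead of .* on the assumption that the cell text never contains a "<" character, which the question does not confirm.

```python
import re

# One "banana" row pair. [^<]* assumes the cell text contains no '<'
# (an assumption about the data, not something stated in the question).
ROW = re.compile(
    r'<tr><td>banana</td><td>[^<]*</td><td>[^<]*</td></tr>'
    r'<tr><td style="padding-top:0" colspan="3">[^<]*</td></tr>'
)


def strip_rows(lines):
    """Yield each line with every matching row pair removed."""
    for line in lines:
        yield ROW.sub("", line)


def strip_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping the matching row pairs."""
    # Iterating over the file object reads one line at a time.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_rows(src))
```

Calling strip_file("big.html", "cleaned.html") (both file names are placeholders) writes a cleaned copy and leaves the original untouched. If the file has no line breaks at all, each "line" is the whole file, which at around 20 MB should still fit in memory.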

CodePudding user response:

While regex may not be the best tool for this kind of job, try this adjustment to your pattern:

<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>

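That makes your .* matches lazy. I suspect the greedy patterns are consuming too much: a greedy .* can run past the end of one row and swallow several rows in a single match, which would explain the low replacement count.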
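
To see the difference between the greedy and lazy patterns, here is a small Python sketch on made-up data: three row pairs (banana, apple, banana) on a single line. The sample cell values x, y and z are invented for the illustration, and Python's re module is only a stand-in here for whatever regex engine your editor uses.

```python
import re

# Template for one row pair; the cell values x, y, z are made-up sample data.
row = ('<tr><td>{}</td><td>x</td><td>y</td></tr>'
       '<tr><td style="padding-top:0" colspan="3">z</td></tr>')

# banana, apple, banana back to back on a single line.
html = row.format("banana") + row.format("apple") + row.format("banana")

greedy = r'<tr><td>banana</td><td>(.*)</td><td>(.*)</td></tr><tr><td(.*)</td></tr>'
lazy = r'<tr><td>banana</td><td>(.*?)</td><td>(.*?)</td></tr><tr><td(.*?)</td></tr>'

# re.subn returns the new string and the number of replacements made.
greedy_out, greedy_count = re.subn(greedy, "", html)
lazy_out, lazy_count = re.subn(lazy, "", html)

print(greedy_count, repr(greedy_out))  # 1 '' (the apple row is swallowed too)
print(lazy_count, lazy_out == row.format("apple"))  # 2 True
```

The greedy version reports a single replacement and also deletes the apple row, because its one match stretches from the first banana to the last </td></tr> on the line. The lazy version removes exactly the two banana row pairs.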
