How to remove multi-line blocks of text of varying sizes from a file given the first and last lines

Time:11-21

I have an xml file listing several games and their metadata, like so:

<?xml version="1.0"?>
<gameList>
    <game>
        <path>./Besiege.desktop</path>
        <name>Besiege</name>
        <desc>Long description of game</desc>
        <releasedate>20150128T000000</releasedate>
        <developer>Spiderling Studios</developer>
        <publisher>Spiderling Studios</publisher>
        <genre>Strategy</genre>
        <players>1</players>
    </game>
<A bunch of other entries>
    <game>
        <path>./67000.The Polynomial.txt</path>
        <name>The Polynomial - Space of the music</name>
        <desc>Long description of game</desc>
        <releasedate>20101015T000000</releasedate>
        <developer>Dmytry Lavrov</developer>
        <publisher>Dmitriy Uvarov</publisher>
        <genre>Shooter, Music</genre>
        <players>1</players>
        <favorite>true</favorite>
    </game>
<Another bunch of entries>
</gameList>

I want to remove every entry that contains the substring ".desktop" and leave all the rest. But just removing the line that contains this string isn't enough; I want to remove the whole block from <game> to </game>.

I know that in Linux, with Bash, there are several ways to remove a fixed number of lines before or after a given string. But by comparing the two entries above, you can see that they don't always have the same number of fields. The descriptions inside the "<desc>" tags also vary from one to four paragraphs separated by empty lines. I have not found any solutions that deal with a variable number of lines around a target substring.

I thought there would be an easy way to split the text into blocks, from the opening <game> tag to the closing </game> tag, so that I could operate on them the way one normally operates on lines. A simple while loop that tested each block for the substring and deleted the block if it was present would then solve my problem. I've been banging my head against grep, sed and awk, and I've tried setting IFS so that it would only end lines at "</game>". I'm growing increasingly frustrated, because I'm almost at the point where it would have been faster to do this manually. But then I'd remain ignorant.

I'm only just beginning to learn Bash, so there is much that I don't know. I feel like this is the sort of thing that someone more knowledgeable could do with a one-liner, but I'm completely stumped. Thank you for your time, and please point me in the right direction.

CodePudding user response:

Do not use line-oriented tools to edit XML files. Do not use Bash to edit XML files. Use XML tools to edit XML files. Write a program in Python, Perl, or another capable programming language with an XML library to edit XML.

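With xmlstarlet, this is quite simple. The ed -d command deletes every node matching the XPath expression, here each game element whose path child contains ".desktop", and prints the result to standard output: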

$ xmlstarlet ed -d '/gameList/game[ contains(path, ".desktop") ]' input.xml
<?xml version="1.0"?>
<gameList>
  <game>
    <path>./67000.The Polynomial.txt</path>
    <name>The Polynomial - Space of the music</name>
    <desc>Long description of game</desc>
    <releasedate>20101015T000000</releasedate>
    <developer>Dmytry Lavrov</developer>
    <publisher>Dmitriy Uvarov</publisher>
    <genre>Shooter, Music</genre>
    <players>1</players>
    <favorite>true</favorite>
  </game>
</gameList>
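
If xmlstarlet is not available, the same idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a minimal sketch: the function name remove_games, its needle parameter, and the shortened SAMPLE document are made up for illustration, and it assumes every <game> is a direct child of <gameList>, as in the question.

```python
import xml.etree.ElementTree as ET


def remove_games(xml_text: str, needle: str = ".desktop") -> str:
    """Return xml_text without the <game> entries whose <path> contains needle."""
    root = ET.fromstring(xml_text)
    # findall() returns a new list, so removing children inside the loop is safe.
    for game in root.findall("game"):
        # findtext() returns the text of the first <path> child, or the default.
        path = game.findtext("path", default="")
        if needle in path:
            root.remove(game)
    return ET.tostring(root, encoding="unicode")


# Shortened, hypothetical version of the file from the question.
SAMPLE = """<gameList>
    <game>
        <path>./Besiege.desktop</path>
        <name>Besiege</name>
    </game>
    <game>
        <path>./67000.The Polynomial.txt</path>
        <name>The Polynomial - Space of the music</name>
    </game>
</gameList>"""

if __name__ == "__main__":
    print(remove_games(SAMPLE))
```

For a real file, ET.parse("input.xml") followed by tree.write(...) on the parsed tree would replace the string round trip. Because the parser works on elements rather than lines, the number of fields and the length of each <desc> do not matter.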