I have a directory with 80,000 .xml files. I’d like to delete everything from each file except 3 specific lines. The line numbers are the same in every file (lines 41, 65, and 120). Alternatively, they are the lines containing specific strings (“InvestorIndentifier” and “PoolID”).
Is there a way to delete the rest of the content but keep only those lines in each file? Because there are so many files, I need something that processes the whole batch at once.
CodePudding user response:
You haven't made it clear whether you expect the content that remains to be well-formed XML. It seems unlikely that retaining lines 41, 65, and 120, while discarding the rest, will produce well-formed XML, because you'll lose the outermost start and end tag.
In general, processing XML files using non-XML tools is strongly discouraged, because it often results in content that is not well-formed XML. We get a vast number of questions here from people trying to process ill-formed XML, which usually arises precisely because someone has tried to take this short cut. However, there are cases where the XML is so regular and predictable that you may be able to get away with it, and this might be such a case.
My own choice, however, would be to process the content with XSLT. In XSLT 3.0 you can use the collection() or uri-collection() functions to process a whole directory of input files (collection() is also available in XSLT 2.0), and the xsl:result-document instruction to generate each output file. So with Saxon you would do something like this:
<xsl:transform version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template name="xsl:initial-template">
<xsl:for-each select="uri-collection('file:///my-input-directory/')">
<xsl:result-document href="{replace(., 'my-input-directory', 'my-output-directory')}">
<doc>
<xsl:copy-of select="doc(.)//(InvestorIdentifier|PoolID)"/>
</doc>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:transform>
CodePudding user response:
Use sed:
sed -i -r '/InvestorIndentifier|PoolID/!d' *.xml
This deletes lines that don't match the regexp that recognizes the specific strings you want to keep.
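With 80,000 files, expanding *.xml on one command line may exceed the shell's argument-list limit. Below is a minimal sketch of the same filter driven by find instead, assuming GNU sed (for -i without a suffix and -E). The temporary directory and the sample file contents are made up for illustration, and the search strings are spelled as in the question.

```shell
#!/bin/sh
# Sketch: keep only the matching lines in every .xml file under a directory.
# Assumes GNU sed. The demo directory and sample file are hypothetical.
set -eu

demo_dir=$(mktemp -d)
printf '%s\n' \
  '<Loan>' \
  '  <InvestorIndentifier>A1</InvestorIndentifier>' \
  '  <Other>x</Other>' \
  '  <PoolID>P9</PoolID>' \
  '</Loan>' > "$demo_dir/sample.xml"

# "-exec ... {} +" passes the files to sed in batches, so a very large
# directory does not overflow the argument list the way *.xml can.
# "/regex/!d" deletes every line that does NOT match the regex.
find "$demo_dir" -type f -name '*.xml' \
  -exec sed -i -E '/InvestorIndentifier|PoolID/!d' {} +

cat "$demo_dir/sample.xml"
```

To select by line number instead, the sed script 41p;65p;120p combined with the -n option would print only those three lines. Either way the edit is in place, so it is worth trying it on a copy of a few files first.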