Remove attributes from XML using sed

Time:09-23

First of all, there might be other (better) options, but I'm bound to sed or awk in this case. I have an XML file with the following contents:

<Field name="field1" type="String">AAAA</Field>
<Field name="field2" type="Integer">0</Field>
<Field name="field4" type="String">BBBB</Field>

Here I would like to change the contents using sed, to get the following result:

<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>

So I want to remove the *Field name="* part, the closing quote after the name, and the rest of the attributes up to the *>*. I would also like to replace the closing </Field> with the actual field name. How can I approach this with awk or sed?

Removing from the first tag works with

sed 's/ type=".*"//' 
and
sed 's/Field name="//' 

I'm not sure how to proceed with replacing the closing tag.

CodePudding user response:

Using sed

$ sed -E 's~[A-Z][^"]*"([^"]*)[^>]*([^/]*/)[^>]*~\1\2\1~' input_file
<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>

CodePudding user response:

Another sed:

sed -E 's/^[^"]+"([^"]+)("[^"]+){2}">([^<]+).*$/<\1>\3<\/\1>/' file.xml

CodePudding user response:

1st solution: With your shown samples, please try the following sed code. The -E option enables ERE (extended regular expressions). The regex creates two capturing groups, and the values captured in them are reused in the replacement.

sed -E 's/^<Field name="([^"]*)"[^>]*>([^<]*)<.*$/<\1>\2<\/\1>/' Input_file
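
To see the two capture groups at work, here is a small self-contained sketch. The temporary sample file and the extra <Other> line are illustrative assumptions, not part of the question. It also adds -n and the p flag (a variation on the command above) so that lines which don't match are dropped instead of being passed through unchanged.

```shell
# Throwaway sample file; the <Other> line is a hypothetical non-matching line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Field name="field1" type="String">AAAA</Field>
<Other>ignored</Other>
<Field name="field2" type="Integer">0</Field>
EOF

# \1 = value of the name attribute, \2 = text between the tags.
# -n plus the trailing p prints only lines where the substitution succeeded.
result=$(sed -n -E 's/^<Field name="([^"]*)"[^>]*>([^<]*)<.*$/<\1>\2<\/\1>/p' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Without -n and p, the <Other> line would be printed as-is, which may or may not be what you want.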


2nd solution: With awk, please try the following code, written and tested with the shown samples. The field separator is set to <Field name= (at the start of the line), ", > or <. The main block prints the 3rd and 7th fields, wrapped in the tags required for the output.

awk -F'^<Field name=|"|>|<' '{print "<"$3">"$7"</"$3">"}'  Input_file
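
If it isn't obvious why the 3rd and 7th fields are the right ones, a quick way to check is to dump every field with its index. This is only a debugging sketch; the variable names line and fields are my own.

```shell
# One sample line from the question.
line='<Field name="field1" type="String">AAAA</Field>'

# Same field separator as above; print each field next to its index.
fields=$(printf '%s\n' "$line" |
  awk -F'^<Field name=|"|>|<' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

In the listing, field 3 holds the name (field1) and field 7 holds the text (AAAA). The empty fields come from adjacent separators such as "> in the opening tag.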


3rd solution: With GNU awk, using its match function with a regex whose capturing groups are stored in an array named arr. The stored values are then printed to build the required output.

awk '
match($0,/<Field name="([^"]*)"[^"]*"[^>]*>([^<]*)</,arr){
  print "<"arr[1]">" arr[2] "</"arr[1]">"
}
'  Input_file
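
The three-argument form of match is a GNU awk extension. If only a POSIX awk is available, something along these lines might do as a substitute. It is a sketch of my own, not part of the answer above, and it assumes the name attribute always comes first and the element text contains no < character.

```shell
# Throwaway copy of the sample input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Field name="field1" type="String">AAAA</Field>
<Field name="field2" type="Integer">0</Field>
<Field name="field4" type="String">BBBB</Field>
EOF

converted=$(awk '
/^<Field name="/ {
  split($0, q, "\"")         # q[2] is the value of the name attribute
  text = $0
  sub(/^[^>]*>/, "", text)   # strip the opening tag
  sub(/<.*$/, "", text)      # strip the closing tag
  print "<" q[2] ">" text "</" q[2] ">"
}' "$sample")
printf '%s\n' "$converted"
rm -f "$sample"
```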

CodePudding user response:

Simple and elegant:

awk -F "[\"><]" '{print "<"$3">" $7 "</"$3">"}' input_file

Explanation:
Use ", < and > as delimiters to split each line into fields,
then print the ones you need: $3 is the field name and $7 is the text.
