Linux sed regex replace with capture groups

Time:07-13

I have a file containing directory entries in the following format:

<item><ln></ln><fn>Some person</fn><ct>07123456789</ct><sd>37</sd><rt>1</rt><bw>1</bw></item>

I would like to use sed to find every entry where <ct> is an 11-digit number and <bw> is 1 (<bw>1</bw>), and change the line above to:

<item><ln></ln><fn>Some person</fn><ct>07123456789</ct><sd>37</sd><rt>1</rt><bw>0</bw></item>

(In case it isn't obvious, the only change is <bw>1</bw> to <bw>0</bw>.)

I have tried the following in sed but it does not match:

sed -E 's/(.+<ct>\d{11}.+<bw>)1(<\/bw><\/item>)/\10\2/g' test-directory.xml

What am I doing wrong?

CodePudding user response:

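You may use this sed command with 2 capture groups. Note that \d is a Perl-style shorthand that POSIX extended regular expressions (sed -E) do not define, so [0-9] is used instead: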

sed -E 's~(.*<ct>[0-9]{11}</ct>.*<bw>)1(</bw>.*)~\10\2~' file

<item><ln></ln><fn>Some person</fn><ct>07123456789</ct><sd>37</sd><rt>1</rt><bw>0</bw></item>

More Info:

  • (.*<ct>[0-9]{11}</ct>.*<bw>): Match and capture any text followed by <ct>11-digits</ct> followed by any text followed by <bw> in capture group #1
  • 1: Match the literal 1 that is to be replaced
  • (</bw>.*): Match </bw> followed by anything in capture group #2

PS: This assumes the <ct> tag appears before the <bw> tag on the same line. For more refined control over XML, it is better to use an XML parser than shell utilities.
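
To check the command against more than one entry, a quick sketch can run it over a throwaway file. The temporary file and its three entries are made up for illustration: one entry that should change, one with a 10-digit <ct>, and one that already has <bw>0</bw>.

```shell
# Hypothetical sample data: only the first entry has an 11-digit <ct> and <bw>1</bw>.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item><ln></ln><fn>Some person</fn><ct>07123456789</ct><sd>37</sd><rt>1</rt><bw>1</bw></item>
<item><ln></ln><fn>Short number</fn><ct>0712345678</ct><sd>38</sd><rt>1</rt><bw>1</bw></item>
<item><ln></ln><fn>Already off</fn><ct>07123456780</ct><sd>39</sd><rt>1</rt><bw>0</bw></item>
EOF

# Same substitution as in the answer; lines that do not match pass through unchanged.
result=$(sed -E 's~(.*<ct>[0-9]{11}</ct>.*<bw>)1(</bw>.*)~\10\2~' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the first entry should come out with <bw>0</bw>; the 10-digit entry keeps <bw>1</bw>. Note that \10 here is \1 followed by a literal 0, since sed back-references stop at \9.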


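If the position of the <bw> tag is not fixed, you may use this sed solution, which first selects lines containing an 11-digit <ct> and then applies the substitution only to those lines: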

sed -E '\~<ct>[0-9]{11}</ct>~ s~(.*<bw>)1(</bw>.*)~\10\2~' file
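
The difference between the two commands shows up when <bw> comes first. The following is a small sketch using a made-up entry (the variable names line, ordered and addressed are only for illustration):

```shell
# Hypothetical entry in which <bw> comes before <ct>.
line='<item><ln></ln><fn>Some person</fn><bw>1</bw><ct>07123456789</ct><sd>37</sd><rt>1</rt></item>'

# Order-dependent pattern: expects <ct> before <bw>, so nothing is replaced here.
ordered=$(printf '%s\n' "$line" | sed -E 's~(.*<ct>[0-9]{11}</ct>.*<bw>)1(</bw>.*)~\10\2~')

# Address form: select lines with an 11-digit <ct>, then substitute <bw> wherever it sits.
addressed=$(printf '%s\n' "$line" | sed -E '\~<ct>[0-9]{11}</ct>~ s~(.*<bw>)1(</bw>.*)~\10\2~')

printf '%s\n%s\n' "$ordered" "$addressed"
```

The first line printed should be the unchanged entry, the second the entry with <bw>0</bw>.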

CodePudding user response:

With awk (in case you are OK with it), you could try the following solution, written and tested in GNU awk with the samples shown. It uses awk's match function with the regex (.*<ct>[0-9]{11}<\/ct>.*<bw>)([0-9]+)(<\/bw>.*), which creates 3 capture groups. GNU awk stores the text matched by each group in the array arr, indexed by group number. The program then prints group 1, a literal 0, and group 3, so whatever digits come before </bw> are replaced with 0. Only lines that match are printed.

awk '
# The third argument to match() is a GNU awk extension; arr[N] holds capture group N.
match($0,/(.*<ct>[0-9]{11}<\/ct>.*<bw>)([0-9]+)(<\/bw>.*)/,arr){
  print arr[1]"0"arr[3]
}
' Input_file
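
If the remaining entries should pass through unchanged, one possible variant (a sketch, not the method used in this answer) relies on sub() and a trailing 1 rather than the GNU-only three-argument match(). The eleven repeated [0-9] atoms are an assumption made for portability, since some awk implementations lack {11} interval support, and the two-line input is made up. Unlike the program above, this changes only <bw>1</bw>, not any run of digits.

```shell
# Hypothetical two-line input: the first entry qualifies, the second has a 10-digit <ct>.
input='<item><fn>Some person</fn><ct>07123456789</ct><bw>1</bw></item>
<item><fn>Short number</fn><ct>0712345678</ct><bw>1</bw></item>'

# On lines with an 11-digit <ct>, rewrite <bw>1</bw>; the trailing 1 prints every line.
result=$(printf '%s\n' "$input" | awk '
/<ct>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]<\/ct>/ {
  sub(/<bw>1<\/bw>/, "<bw>0</bw>")
}
1')
printf '%s\n' "$result"
```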

