How to extract multiple parts of a string using awk/sed/perl?

I search for log files containing errors using egrep, which outputs a list of file names. I want to manipulate those strings and present them in a different format.

/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0
/abcd/efgh/ijkl/logs/fac_oxf_abp3506.log.20220708111219.26476752.0

The output should look like:

ABP99507,UNET
ABP3506,OXF

I tried awk and sed and couldn't figure out a way to do this. I want to be able to make it dynamic and do it via regular expressions.

What I have tried so far is:

egrep -li "^error" /abcd/efgh/ijkl/logs/*202207* | awk '/unet|cirrus|oxf|csp|cmcd|cmcr|nice/ {print}'
egrep -li "^error" /abcd/efgh/ijkl/logs/*202207* | sed -n "s/.*\(cirrus|unet|cmcr|csp|cmcd|oxf|nice\)\(abp[0-9]*[A-ZA-Za-za-z]*\).*/\1,\2/p"

The sed command doesn't work: the "|" operator has no effect, which I assume is because I am not using the GNU tools. Even escaping it doesn't help. I also can't seem to make use of capture groups.

CodePudding user response：

1st solution: The simplest option is to use awk's field separator option. With your shown samples, please try the following awk code.

awk -F'/|\\.|_' '{print toupper($8","$7)}' Input_file
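To see why fields 8 and 7 are the right ones, it can help to list every field that this separator regex produces. The following is only a sketch that reuses one sample path from the question; the variable names sample and fields are illustrative and not part of the original answer.

```shell
# One of the sample paths from the question; the variable name is illustrative.
sample='/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0'

# Split on "/", "." or "_" and list each non-empty field with its index.
# Field 1 is empty because the path starts with a separator.
fields=$(printf '%s\n' "$sample" |
  awk -F'/|\\.|_' '{ for (i = 1; i <= NF; i++) if ($i != "") print i ": " $i }')

printf '%s\n' "$fields"
```

In the listing, field 7 holds unet and field 8 holds abp99507, which is why the answer prints $8","$7. This only holds as long as every path has the same number of directory levels in front of the file name.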

2nd solution: If you prefer a regular expression in awk, try the following, which uses the three-argument form of match(). Written and tested in GNU awk.

awk 'match($0,/logs\/[^_]*_([^_]*)_([^.]*)\.log/,arr){print toupper(arr[2]","arr[1])}'  Input_file

3rd solution: With GNU sed, enable ERE with the -E option and try the following code.

sed -E 's/.*logs\/[^_]*_([^_]*)_([^.]*)\.log\..*/\U\2,\U\1/' Input_file

4th solution: A non-GNU awk solution using the match function.

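awk '
# Match from "logs/" through ".log"; match() sets RSTART and RLENGTH.
match($0,/logs\/[^_]*_([^_]*)_([^.]*)\.log/){
  # Skip the 5 characters of "logs/" and keep the rest of the match.
  val=substr($0,RSTART+5,RLENGTH-5)
  sub(/\.log/,"",val)
  # Split "fac_unet_abp99507" on "_": arr[2] is the source, arr[3] the ID.
  split(val,arr,"_")
  print toupper(arr[3]","arr[2])
}
'  Input_file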
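Without the three-argument form of match(), awk only reports where the match starts (RSTART) and how long it is (RLENGTH). As a rough illustration of what the substr() call slices out, the sketch below prints both values and the resulting substring for one sample path. The variable names sample and matched are illustrative.

```shell
# One of the sample paths from the question; the variable name is illustrative.
sample='/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0'

matched=$(printf '%s\n' "$sample" | awk '
match($0, /logs\/[^_]*_[^_]*_[^.]*\.log/) {
  # RSTART is the 1-based position where "logs/" begins;
  # RLENGTH is the length of the whole match.
  print "RSTART=" RSTART, "RLENGTH=" RLENGTH
  # Skip the 5 characters of "logs/" to keep only the file-name part.
  print substr($0, RSTART + 5, RLENGTH - 5)
}')

printf '%s\n' "$matched"
```

The second line of output is the piece that the remaining sub() and split() calls work on.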

CodePudding user response：

"Also I can't seem to make use of capture groups."

You did not escape the | characters, so they match a literal |. In GNU sed's default syntax you need to write \| for alternation, just as \( and \) delimit a group while plain ( and ) are literal. After doing that and repairing some minor issues I got it working. Let the content of file.txt be

/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0
/abcd/efgh/ijkl/logs/fac_oxf_abp3506.log.20220708111219.26476752.0

then

sed -e 's/.*\(cirrus\|unet\|cmcr\|csp\|cmcd\|oxf\|nice\)_\(abp[0-9]*[A-ZA-Za-za-z]*\).*/\2,\1/' -e 's/[a-z]/\U&/g' file.txt

gives output

ABP99507,UNET
ABP3506,OXF

Explanation: I made the following changes: escaped |, added _ between the groups, swapped the order in the replacement (the 2nd group now comes first), and dropped the p flag, as it caused doubled output. I then added a second command that uppercases the result using GNU sed's \U. As there are now 2 commands, I use -e to register each one.

(tested in GNU sed 4.2.2)
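Both \| and \U are GNU extensions, so they may not be available in other sed implementations. The sketch below first shows that an unescaped | is an ordinary character in sed's default syntax, then gives one possible variant that uses -E for alternation and tr for uppercasing. It assumes your sed accepts -E (GNU and BSD sed do; older system seds may not), and the function name extract_ids is hypothetical.

```shell
# 1. In the default (basic) syntax an unescaped | is an ordinary character,
#    so this pattern only matches the literal text "a|b", not a lone "b".
literal=$(printf 'a|b\nb\n' | sed 's/a|b/X/')
printf '%s\n' "$literal"

# 2. Hypothetical helper: reads paths on stdin and prints ID,SOURCE in upper case.
#    -E switches to extended syntax, where | is alternation and ( ) are groups.
#    -n with the p flag prints only the lines where the substitution succeeded.
extract_ids() {
  sed -n -E 's/.*_(cirrus|unet|cmcr|csp|cmcd|oxf|nice)_(abp[0-9]*[A-Za-z]*)\..*/\2,\1/p' |
    tr '[:lower:]' '[:upper:]'
}

printf '%s\n' \
  '/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0' \
  '/abcd/efgh/ijkl/logs/fac_oxf_abp3506.log.20220708111219.26476752.0' |
  extract_ids
```

In real use the egrep -l output from the question would be piped into extract_ids in place of the printf. Paths that match none of the listed names are dropped silently because of -n.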