Only get alphanumeric characters in capture group using sed

Time:02-05

Input:

x.y={aaa b .c}

Note that the content within {} is only an example; in reality it could be any value.

Problem: I would like to keep only the alphanumeric characters within the {}.

So it would become:

x.y={aaabc}

Trial 0

$ echo 'x.y={aaa b .c}' | sed 's/[^[:alnum:]]\+//g'
xyaaabc

This is great, but I'd like to only modify the part within {}. So I thought this may need capture groups, hence I went ahead and tried these:

Trial 1

$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{(.*)\}/x.y={\1}/'
x.y={aaa b .c}

Here I have captured the content I want to modify (aaa b .c) correctly, but I need a way to somehow do s/[^[:alnum:]]\+//g only on \1.

Instead, I tried capturing all alphanumeric characters only (to \1) like this:

Trial 2

$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{([[:alnum:]]+)\}/x.y={\1}/'
x.y={aaa b .c}

Of course, it doesn't work because I'm only expecting alnum's and then immediately a } literal. I didn't tell it to ignore the non-alnum's. I.e, this part:

s/x.y=\{([[:alnum:]]+)\}/x.y={\1}/
      ^^^^^^^^^^^^^^^^^^   

It literally matches: an open brace, some alnum's, and a closing brace -- which is not what I want. I'd like it to match everything, but only capture the alnum's.


Example of input/output:

x.y={aaa b .c} blah
blah
x.y={1 2 3 def} blah
blah

to

x.y={aaabc} blah
blah
x.y={123def} blah
blah

I searched the web before finally giving up and posting the question, but I didn't find anything helpful, as I didn't see anyone with a problem similar to mine. I would appreciate some help with this, as I'd love to have a better understanding of variables in regex/sed. Thanks!

CodePudding user response:

With your shown samples, please try following in awk. Written and tested in GNU awk.

awk '
match($0,/\{[^}]*}/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[^{}a-zA-Z0-9]/,"",val)
  $0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1
' Input_file

Explanation: Adding detailed explanation for above.

awk '                                      ##Starting awk program from here.
match($0,/\{[^}]*}/){                      ##using match function of awk to match from { to first occurrence of }
  val=substr($0,RSTART,RLENGTH)            ##Creating val which has sub string of matched regex in it.
  gsub(/[^{}a-zA-Z0-9]/,"",val)            ##Globally removing everything apart from { } and alphanumerics in val.
  $0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH) ##saving everything before match, then val, then everything after match here.
}
1                                          ##Printing the current line, whether or not it was modified above.
'  Input_file                              ##Mentioning Input_file name here. 


Generic solution: In case you have multiple occurrences of { and } then try following awk code.

awk '
{
  line=""
  while(match($0,/\{[^}]*}/)){
    val=substr($0,RSTART,RLENGTH)
    gsub(/[^{}a-zA-Z0-9]/,"",val)
    line=(line?line:"") (substr($0,1,RSTART-1) val)
    $0=substr($0,RSTART+RLENGTH)
  }
  if(RSTART+RLENGTH!=length($0)){
    $0=line $0
  }
  else{
    $0=line
  }
}
1
'  Input_file
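
To try the multi-occurrence loop without an input file, it can be wrapped in a shell function that reads stdin. This is only a sketch: the name strip_braces and the sample line are made up for illustration, the character class is spelled a-zA-Z0-9 (ASCII letters and digits), and the loop is simplified to print the accumulated text directly.

```shell
#!/bin/sh
# strip_braces is a hypothetical helper name, not part of the answer above.
# It reads lines on stdin and removes non-alphanumeric characters inside
# every {...} group, leaving the rest of each line untouched.
strip_braces() {
  awk '
  {
    out = ""
    # Consume the line one {...} group at a time.
    while (match($0, /[{][^}]*[}]/)) {
      val = substr($0, RSTART, RLENGTH)
      gsub(/[^{}a-zA-Z0-9]/, "", val)           # keep braces, letters, digits
      out = out substr($0, 1, RSTART - 1) val
      $0 = substr($0, RSTART + RLENGTH)         # continue after the group
    }
    print out $0                                # append whatever is left
  }'
}

printf '%s\n' 'x.y={aaa b .c} blah {1 2 3 def} end' | strip_braces
```

With that sample line, both groups should come out cleaned while the text between and after them is unchanged.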

CodePudding user response:

With sed (tested on GNU sed, syntax may vary for other implementations):

$ sed -E ':a s/(\{[[:alnum:]]*)[^[:alnum:]]+([^}]*})/\1\2/; ta' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
  • :a marks that location as label a (used to jump using ta as long as the substitution succeeds)
  • (\{[[:alnum:]]*) matches { followed by zero or more alnum characters
  • [^[:alnum:]]+ matches one or more non-alnum characters
  • ([^}]*}) matches till the next } character
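
A sketch of a slightly more defensive spelling of the same loop, under two assumptions that go beyond the tested one-liner. First, some non-GNU seds read the rest of the line as part of the label name, so the script is split into separate -e fragments. Second, } is added to the negated class, a small change from the regex above, so that the "non-alnum" run cannot swallow a closing brace when a line holds several {...} groups. The function name strip_sed and the last sample line are made up for illustration.

```shell
#!/bin/sh
# strip_sed is a hypothetical wrapper name used only for this illustration.
strip_sed() {
  sed -E \
    -e ':a' \
    -e 's/(\{[[:alnum:]]*)[^[:alnum:]}]+([^}]*})/\1\2/' \
    -e 'ta'
}

# Each pass of the loop deletes one run of non-alnum characters, e.g.
#   {aaa b .c} -> {aaab .c} -> {aaabc}
printf '%s\n' 'x.y={aaa b .c} blah' 'blah' 'x.y={1 2 3 def} blah' | strip_sed

# A line with an already clean group followed by a second group.
printf '%s\n' '{abc} x {d e}' | strip_sed
```

The loop stops as soon as the substitution fails, which happens once only alnum characters remain between the braces.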


If perl is okay:

$ perl -pe 's/\{\K[^}]+(?=\})/$&=~s|[^a-z\d]+||gir/e' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
  • \{\K[^}]+(?=\}) match sequence of { to } (assuming } cannot occur in between)
    • \{\K and (?=\}) are used to avoid the braces from being part of the matched portion
  • e flag allows you to use Perl code in replacement portion, in this case another substitute command
  • $&=~s|[^a-z\d]+||gir here, $& refers to entire matched portion, gi flags are used for global/case-insensitive and r flag is used to return the value of this substitution instead of modifying $&
    • [^a-z\d]+ matches non-alphanumeric characters (assuming ASCII, you can also use [^[:alnum:]]+)
    • use \W+ if you want to preserve underscores as well

For both solutions, you can add x\.y= prefix if needed to narrow the scope of matching.
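
As a sketch of that narrowing for the sed version: the prefix goes inside the first capture group so it is written back unchanged. The name only_xy and the sample line (with a second key z.w) are made up for illustration. The negated class also excludes }, an assumption added here so that the match cannot run past the end of the x.y group into a later one.

```shell
#!/bin/sh
# only_xy is a hypothetical name; the sample line below is made up.
only_xy() {
  sed -E \
    -e ':a' \
    -e 's/(x\.y=\{[[:alnum:]]*)[^[:alnum:]}]+([^}]*})/\1\2/' \
    -e 'ta'
}

# Only the group that follows x.y= is cleaned; z.w={c d} is left alone.
printf '%s\n' 'x.y={a b} z.w={c d}' | only_xy
```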

CodePudding user response:

Here is another gnu-awk solution using FPAT:

s='x.y={aaa b .c}'
awk -v OFS= -v FPAT='{[^}]+}|[^{}]+' '
{
   for (i=1; i<=NF; ++i)
      if ($i ~ /^{/) $i = "{" gensub(/[^[:alnum:]]+/, "", "g", $i) "}"
} 1' <<< "$s"

x.y={aaabc}