Regular expression to capture alphanumeric string only in shell

Time:04-30

I am trying to write a regex to capture the given alphanumeric values, but it is also capturing the other, purely numeric values. What is the correct way to get the desired output?

code

grep -Eo '(\[[[:alnum:]]\)\w+' file > output
$ cat file
2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 08:45:11,764 [15] [fpes] [547] This is a single line
2022-04-29 08:46:12,454 [143] [mwalkc] [548] This is a single line
2022-04-29 08:49:12,554 [143] [skhat2] [549] This is a single line
2022-04-29 09:40:13,852 [5] [narl12] [550] This is a single line
2022-04-29 09:45:14,754 [1426] [Y23467] [550] This is a single line

current output -

[14
[Y23467
[546
[15
[fpes
[547
[143
[mwalkc
[548
[143
[skhat2
[549
[5
[narl12
[550
[1426
[Y23467
[550

expected output -

Y23467
fpes
mwalkc
skhat2
narl12
Y23467

CodePudding user response:

1st solution: With your shown samples, please try the following awk code. In short, it uses the gsub function to remove [ and ] from the 4th field and then prints that field.

awk '{gsub(/\[|\]/,"",$4);print $4}' Input_file
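If the position of the wanted value is not guaranteed to be the 4th field, a variant could pick the bracketed token by its content instead. The following is only a sketch under that assumption; the sample variable and the "first token containing a letter" rule are illustrative and not part of the original answer.

```shell
# Two sample lines copied from the question.
sample='2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 09:40:13,852 [5] [narl12] [550] This is a single line'

# Split on "[" and "]" so the bracket contents land in the even-numbered fields,
# then print the first one that is alphanumeric and contains at least one letter.
result=$(printf '%s\n' "$sample" | awk -F'[][]' '{
    for (i = 2; i <= NF; i += 2)
        if ($i ~ /^[0-9A-Za-z]+$/ && $i ~ /[A-Za-z]/) { print $i; break }
}')
printf '%s\n' "$result"
```

Explicit ranges such as [0-9A-Za-z] are used here instead of POSIX classes because some older awk implementations do not support [[:alnum:]].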


2nd solution: With GNU grep, please try the following solution.

grep -oP '^[0-9]{4}(-[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2},[0-9]{1,3} \[[0-9]+\] \[\K[^]]*' Input_file

Explanation: A detailed explanation of the regex used with GNU grep above.

^[0-9]{4}(-[0-9]{2}){2}  ##From the start of the line, match 4 digits followed by two groups of a dash and 2 digits.
 [0-9]{2}(:[0-9]{2}){2}  ##Match a space, then 2 digits followed by two groups of a colon and 2 digits.
,[0-9]{1,3}              ##Match a comma followed by 1 to 3 digits.
 \[[0-9]+\] \[\K         ##Match a space, [, 1 or more digits, ], a space and [,
                         ##then use \K to discard everything matched so far.
[^]]*                    ##Match everything up to the 1st occurrence of ] to get the actual value.
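Where grep -P is not available, the same anchored pattern can be approximated with sed -E and a capture group in place of \K. This is a sketch rather than the answerer's method; the sample variable and the extra malformed line are assumptions added for illustration.

```shell
# Two lines from the question plus one line that does not follow the log format.
sample='2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 08:45:11,764 [15] [fpes] [547] This is a single line
not a log line [1] [zzz]'

# Groups 1 and 2 are the repeated date and time parts; group 3 is the
# content of the second bracket pair. -n with the p flag prints only
# lines where the whole pattern matched.
result=$(printf '%s\n' "$sample" |
    sed -nE 's/^[0-9]{4}(-[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2},[0-9]{1,3} \[[0-9]+\] \[([^]]*)\].*/\3/p')
printf '%s\n' "$result"
```

Because the pattern is anchored to the full timestamp, the malformed line produces no output at all.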

CodePudding user response:

[[:alnum:]] and \w match digits as well as letters (\w also matches the underscore), so a pattern built only from them also matches the purely numeric values.

If the value may contain digits but must contain at least one letter a-z or A-Z, and -P for a Perl-compatible regex is supported:

grep -oP '\[\K\d*[A-Za-z][\dA-Za-z]*(?=])' file

Explanation

  • \[ Match [
  • \K Forget what is matched so far
  • \d*[A-Za-z] Match optional digits and at least a single char a-zA-Z
  • [\dA-Za-z]* Match optional chars a-zA-Z and digits
  • (?=]) Assert ] to the right

Output

Y23467
fpes
mwalkc
skhat2
narl12
Y23467
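A rough equivalent without -P is possible if the brackets are matched as part of the token and removed afterwards. This sketch assumes grep supports -o with -E (as GNU and BSD grep do); the sample variable is illustrative.

```shell
# Two sample lines copied from the question.
sample='2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 08:49:12,554 [143] [skhat2] [549] This is a single line'

# Match a bracketed token that contains at least one letter,
# then delete the surrounding [ and ] with tr.
result=$(printf '%s\n' "$sample" |
    grep -Eo '\[[0-9]*[A-Za-z][0-9A-Za-z]*\]' | tr -d '[]')
printf '%s\n' "$result"
```

The numeric tokens such as [14] and [546] never match because the pattern insists on one letter between the brackets.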

If there can be only 1 occurrence per line, you might also use sed with a capture group \(...\) and refer to that group in the replacement using \1

sed 's/.*\[\([[:digit:]]*[[:alpha:]][[:alnum:]]*\)].*/\1/' file

CodePudding user response:

There are several parts to your problem. First I'll try to help you with your regex (but it will probably unlock more problems); next I'll show you an alternative.

The Regex

The thing to understand about [[:alnum:]] is that it matches any single alphanumeric character, whether a letter or a digit. So a run of them will match "123", and it will match "abc", as all of those characters are alphanumeric. It judges each character individually and cannot, on its own, capture "only sections that contain letters" like what you want.
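A small experiment makes this concrete. The token list below is made up for illustration; the second pattern is one possible way to insist on at least one letter.

```shell
# Hypothetical tokens: digits only, letters only, and two mixed ones.
tokens='123
abc
a1b2
2x'

# A run of [[:alnum:]] accepts every token, the digits-only one included.
all=$(printf '%s\n' "$tokens" | grep -E '^[[:alnum:]]+$')

# Requiring one explicit [[:alpha:]] somewhere rejects the digits-only token.
with_letter=$(printf '%s\n' "$tokens" | grep -E '^[[:alnum:]]*[[:alpha:]][[:alnum:]]*$')

printf 'all:\n%s\nwith a letter:\n%s\n' "$all" "$with_letter"
```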

However, by chaining several greps together, we could filter out lines which only contain numbers.

grep -Eo '(\[[[:alnum:]]\)\w+' file | grep -v -Eo '\[[[:digit:]]+(\w+|$)' > output

To refine this further, there look to be a couple of bugs in your regex. First, you have included \[ inside the captured part, so you could change (\[ to \[( to move the [ outside of the captured part in parentheses ( ... ). Bear in mind, though, that grep -o prints the whole match rather than the captured group, so the [ still appears in the results and has to be stripped separately.

Next, your combination of [[:alnum:]] with \w probably doesn't do what you expect. It looks for a single alphanumeric character, followed by one or more "word" characters (which is all the alphanumerics, plus some extra ones). You probably want ([[:alnum:]]+) instead of ([[:alnum:]])\w+
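One detail worth checking on your own system: with -o, grep prints the entire match, so the position of the parentheses does not change the output. The sketch below compares both placements on one sample line and then strips the bracket with tr; the variable names are illustrative.

```shell
line='2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line'

# Bracket inside the group versus bracket outside the group.
inside=$(printf '%s\n' "$line" | grep -Eo '(\[[[:alnum:]]+)')
outside=$(printf '%s\n' "$line" | grep -Eo '\[([[:alnum:]]+)')

# Both results still start with "[", so remove it in a separate step.
stripped=$(printf '%s\n' "$outside" | tr -d '[')
printf '%s\n' "$stripped"
```

This still prints the numeric tokens, so it only addresses the stray bracket, not the filtering.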

Alternative

Why not use cut instead? cut -d' ' -f4 will take the 4th field (with "space" as the delimiter between fields)

$ cut -d' ' -f 4 file 
[Y23467]
[fpes]
[mwalkc]
[skhat2]
[narl12]
[Y23467]

If you also want to remove the square brackets, try

$ cut -d' ' -f 4 file | grep -Eo '\w+'
Y23467
fpes
mwalkc
skhat2
narl12
Y23467

CodePudding user response:

Using sed

$ sed 's/\([^[]*\[\)\{2\}\([^]]*\).*/\2/' input_file
Y23467
fpes
mwalkc
skhat2
narl12
Y23467

CodePudding user response:

Using FPAT with GNU awk:

awk -v FPAT='[[[:alnum:]]*]' '{gsub(/^\[|\]$/, "",$(NF-1));print $(NF-1)}' file
Y23467
fpes
mwalkc
skhat2
narl12
Y23467
  • setting FPAT to '[[[:alnum:]]*]' makes each field a run of zero or more [ or alphanumeric chars followed by a ] char, which on this input matches the bracketed tokens such as [14] and [Y23467].

  • with gsub() function we remove initial [ and final ] chars.

  • we print the field previous to the last field, i.e. $(NF-1) field, without [ and ] characters.
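FPAT is specific to GNU awk. For other awk implementations, the same "second-to-last bracketed token" idea can be sketched with a match() loop; the tok array and the sample variable are names chosen here for illustration.

```shell
# Two sample lines copied from the question.
sample='2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 08:46:12,454 [143] [mwalkc] [548] This is a single line'

# Collect every [token] on the line without its brackets,
# then print the one before the last, mirroring $(NF-1) with FPAT.
result=$(printf '%s\n' "$sample" | awk '{
    n = 0; rest = $0
    while (match(rest, /\[[0-9A-Za-z]*\]/)) {
        tok[++n] = substr(rest, RSTART + 1, RLENGTH - 2)
        rest = substr(rest, RSTART + RLENGTH)
    }
    if (n >= 2) print tok[n - 1]
}')
printf '%s\n' "$result"
```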
