I have several strings (or filenames in a directory) and I need to group them by their second most common pattern; I will then iterate over each group and process its members. In the example below I need two groups from ACCEPT and two from BASIC_REGIS: basically, everything from the beginning of the string to one character after the first hyphen (-). The first most common patterns are ACCEPT and BASIC_REGIS. I am looking for the second most common pattern using grep -Po (Perl regex and only-matching). An awk solution already works.
INPUT
ACCEPT-0ABC-0123
ACCEPT-0BAC-0231
ACCEPT-1ABC-0120
ACCEPT-1CBA-0321
BASIC_REGIS-2ABC-9043
BASIC_REGIS-2CBA-8132
BASIC_REGIS-9CCA-6532
BASIC_REGIS-9BBC-3023
OUTPUT
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
echo "ACCEPT-0ABC-0123"|grep -Po "\K^A.*-"
Result : ACCEPT-0ABC-
but I need : ACCEPT-0
However, this awk solution works:
echo "ACCEPT-1ABC-0120"|awk '$0 ~ /^A/{print substr($0,1,index($0,"-")+1)}'
ACCEPT-1
CodePudding user response:
Like this:
$ grep -oP '^\D+\d' file | sort -u
Output
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
The regular expression matches as follows:
Node | Explanation
---|---
^ | the beginning of the string
\D+ | non-digits (all but 0-9) (1 or more times (matching the most amount possible))
\d | digit (0-9)
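To get from the unique keys to the per-group processing the question asks about, a minimal sketch might look like the following. It is only an illustration: group_keys and group_members are hypothetical helper names, the echo lines stand in for the real processing, and the pattern is written as the ERE '^[^0-9]+[0-9]' (an equivalent of '^\D+\d' for these samples) so that grep -P is not required.

```shell
#!/bin/sh
# Sketch: derive the group keys, then process each group's members in turn.
# group_keys and group_members are hypothetical helper names.

# Print each unique key: everything up to and including the first digit.
# '^[^0-9]+[0-9]' is the ERE spelling of the PCRE '^\D+\d'.
group_keys() {
    grep -oE '^[^0-9]+[0-9]' "$1" | sort -u
}

# Print the lines of file $1 that start with the literal key $2.
group_members() {
    awk -v key="$2" 'index($0, key) == 1' "$1"
}

# Sample data from the question, written to a temporary file.
list=$(mktemp)
printf '%s\n' \
    ACCEPT-0ABC-0123 ACCEPT-0BAC-0231 ACCEPT-1ABC-0120 ACCEPT-1CBA-0321 \
    BASIC_REGIS-2ABC-9043 BASIC_REGIS-2CBA-8132 \
    BASIC_REGIS-9CCA-6532 BASIC_REGIS-9BBC-3023 > "$list"

# The keys contain no whitespace here, so plain word splitting is safe.
for key in $(group_keys "$list"); do
    echo "group: $key"
    group_members "$list" "$key" | while IFS= read -r name; do
        echo "  processing $name"   # placeholder for the real per-item work
    done
done

rm -f "$list"
```

For filenames in a directory, the same helpers could be fed from a listing saved to a file first, assuming the names contain no newlines.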
CodePudding user response:
1st solution: With your shown samples, please try the following awk code.
awk '
match($0,/^(ACCEPT-[0-9]+|BASIC_REGIS-[0-9]+)/) && !arr[substr($0,RSTART,RLENGTH)]++{
  print substr($0,RSTART,RLENGTH)
}
' Input_file
2nd solution: With GNU grep, please try the following.
grep -oP '^.*?-[0-9]+' Input_file | sort -u
CodePudding user response:
POSIX shells have primitive parameter expansion. This means that using:
${string%%-*} # Remove the first '-' and everything after it
in combination with:
${string#*-} # Remove the first '-' and everything before it
you can extract the n-th most common pattern.
For example:
input="ACCEPT-0ABC-0123"
common_pattern_base=${input%%-*} # Result → ACCEPT
next_level=${input#*-} # Result → 0ABC-0123
common_pattern_mid=${next_level%%-*} # Result → 0ABC
next_level_again=${next_level#*-} # Result → 0123
I did this very crudely, but it should serve as an example of how simple and powerful this tool can be, especially in combination with a loop.
If you need a certain syntax, you can now simply work with individual pieces:
# Result of line below → 0
trim_pattern_mid="$(echo "${common_pattern_mid}" | cut -c1)"
# Result of line below → ACCEPT-0
format="${common_pattern_base}-${trim_pattern_mid}"
While this answer is longer, it is more flexible and simpler than using regular expressions. Imagine wanting to get the 4th pattern of a 256-element chain with a regex: it's a nightmare.
This answer is more suited for scripting. If it's ad hoc, grep or sed will do the job, at least for small patterns.
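The combination with a loop could be sketched as below. This is an illustration under stated assumptions rather than a definitive implementation: nth_field and group_key are hypothetical helper names, fields are assumed to be separated by single hyphens, and only two expansions are used: ${var#*-} drops the leading field together with its hyphen, and ${var%%-*} keeps what precedes the first remaining hyphen.

```shell
#!/bin/sh
# nth_field is a hypothetical helper: print the Nth '-'-separated field of a
# string using only POSIX parameter expansion.
# Note: plain sh has no local variables, so rest and n are globals.
nth_field() {
    rest=$1
    n=$2
    while [ "$n" -gt 1 ]; do
        case $rest in
            *-*) rest=${rest#*-} ;;   # drop the leading field and its hyphen
            *)   return 1 ;;          # fewer than N fields
        esac
        n=$((n - 1))
    done
    printf '%s\n' "${rest%%-*}"       # keep everything before the next hyphen
}

# group_key is a hypothetical helper: first field, a hyphen, and the first
# character of the second field.
group_key() {
    base=$(nth_field "$1" 1)
    mid=$(nth_field "$1" 2) || return 1
    # ${mid#?} is mid without its first character; stripping that suffix
    # from mid leaves only the first character.
    printf '%s-%s\n' "$base" "${mid%"${mid#?}"}"
}

group_key "ACCEPT-0ABC-0123"        # prints ACCEPT-0
nth_field "ACCEPT-0ABC-0123" 3      # prints 0123
```

Because the loop just repeats the same expansion, asking for the 4th field of a long chain costs no more code than asking for the 2nd.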
CodePudding user response:
A bit more efficient, as it does not call substr (the -v{,O}FS brace expansion needs bash or a similar shell):
awk -v{,O}FS='-' '{printf("%s-%c\n",$1,$2)}' file