Linux: extract the portion of a string that forms the second most common pattern

I have several strings (or filenames in a directory) and I need to group them by their second most common pattern, then iterate over each group and process its members. In the example below I need two groups from ACCEPT and two from BASIC_REGIS: basically, everything from the beginning of the string to one character after the first hyphen (-). The first most common patterns are ACCEPT and BASIC_REGIS. I am looking for the second most common pattern using grep -Po (Perl regex and only-matching). An awk solution is already working.

INPUT

ACCEPT-0ABC-0123
ACCEPT-0BAC-0231
ACCEPT-1ABC-0120
ACCEPT-1CBA-0321

BASIC_REGIS-2ABC-9043
BASIC_REGIS-2CBA-8132
BASIC_REGIS-9CCA-6532
BASIC_REGIS-9BBC-3023

OUTPUT

ACCEPT-0
ACCEPT-1

BASIC_REGIS-2
BASIC_REGIS-9

echo "ACCEPT-0ABC-0123"|grep -Po "\K^A.*-"

Result : ACCEPT-0ABC-

but I need : ACCEPT-0

However, this awk solution works:

echo "ACCEPT-1ABC-0120"|awk '$0 ~ /^A/{print substr($0,1,index($0,"-")+1)}'

ACCEPT-1

CodePudding user response：

Like this:

$ grep -oP '^\D+\d' file | sort -u

Output

ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9

The regular expression matches as follows:

Node	Explanation
`^`	the beginning of the string
`\D+`	non-digits (all but 0-9) (1 or more times (matching the most amount possible))
`\d`	digit (0-9)
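
Where grep was built without -P support, the same idea can be sketched with an extended regular expression, replacing `\D` with `[^0-9]` and `\d` with `[0-9]`. The function name `group_keys` is a hypothetical helper chosen for illustration, and `-o` is assumed to be available (it is in GNU and BSD grep, though it is not required by POSIX).

```shell
# Hypothetical helper: read lines on stdin and print each unique
# "everything up to and including the first digit" key.
group_keys() {
    grep -oE '^[^0-9]+[0-9]' | sort -u
}

# Sample data taken from the question.
printf '%s\n' \
    'ACCEPT-0ABC-0123' \
    'ACCEPT-0BAC-0231' \
    'ACCEPT-1ABC-0120' \
    'BASIC_REGIS-2ABC-9043' \
    'BASIC_REGIS-9CCA-6532' | group_keys
```

Like the original pattern, this relies on the first digit in each name being the character right after the first hyphen, which holds for the sample data but is an assumption about real input.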

CodePudding user response：

1st solution: With your shown samples please try following awk code.

awk '
match($0,/^(ACCEPT-[0-9]+|BASIC_REGIS-[0-9]+)/) && !arr[substr($0,RSTART,RLENGTH)]++{
  print substr($0,RSTART,RLENGTH)   # print each matched prefix only once
}
' Input_file

2nd solution: With GNU grep please try following.

grep -oP '^.*?-[0-9]+' Input_file | sort -u
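
The awk solution above names both prefixes explicitly. A minimal sketch that avoids hard-coding them, assuming the key is always "everything up to the first hyphen plus one more character", could look like the following. The function name `group_keys_awk` and the array name `seen` are hypothetical, chosen only for this example.

```shell
# Hypothetical helper: print each distinct key once, in first-seen order.
# Reads the files given as arguments, or stdin when there are none.
group_keys_awk() {
    awk 'match($0, /^[^-]+-./) {
        key = substr($0, RSTART, RLENGTH)   # prefix, hyphen, one character
        if (!seen[key]++) print key         # true only the first time
    }' "$@"
}

printf '%s\n' \
    'ACCEPT-0ABC-0123' \
    'ACCEPT-1ABC-0120' \
    'ACCEPT-1CBA-0321' \
    'BASIC_REGIS-9CCA-6532' | group_keys_awk
```

Lines without a hyphen (including blank lines) do not match and are skipped, and no `sort -u` is needed because the array already filters duplicates.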

CodePudding user response：

POSIX shells have built-in parameter expansion. Using this:

${string%%-*} # Remove the first '-' and everything after it

in combination with this:

${string#*-} # Remove the first '-' and everything before it

you can extract the n-th most common pattern.

For example:

input="ACCEPT-0ABC-0123"

common_pattern_base=${input%%-*} # Result → ACCEPT
next_level=${input#*-} # Result → 0ABC-0123

common_pattern_mid=${next_level%%-*} # Result → 0ABC
next_level_again=${next_level#*-} # Result → 0123

Now I did this very crudely, but it should serve as an example of how simple and powerful this tool can be, especially in combination with a loop.

If you need a certain syntax, you can now simply work with individual pieces:

# Result of line below → 0
trim_pattern_mid="$(echo "${common_pattern_mid}" | cut -c1)"

# Result of line below → ACCEPT-0
format="${common_pattern_base}-${trim_pattern_mid}"

While this answer is longer, it is more flexible and simpler than using regular expressions. Imagine wanting to get the 4th pattern out of a chain 256 patterns long with a regex: it's a nightmare.

This answer is more suited for scripting. If it’s ad-hoc, grep or sed will do the job - at least for small patterns.
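
As a rough sketch of the scripting use case, the expansions can be wrapped in a function and combined with a loop that walks the names group by group. The function name `group_key` and the variable names are hypothetical, and the loop assumes the input is already sorted so that members of a group are adjacent, and that every name contains at least one hyphen.

```shell
#!/bin/sh
# Hypothetical helper: print the prefix, the hyphen and the one
# character that follows it, using only POSIX parameter expansion.
group_key() {
    base=${1%%-*}               # text before the first '-'
    rest=${1#*-}                # text after the first '-'
    first=${rest%"${rest#?}"}   # first character of rest
    printf '%s-%s\n' "$base" "$first"
}

prev=
printf '%s\n' \
    'ACCEPT-0ABC-0123' \
    'ACCEPT-0BAC-0231' \
    'ACCEPT-1ABC-0120' |
while IFS= read -r name; do
    key=$(group_key "$name")
    if [ "$key" != "$prev" ]; then
        printf 'group %s\n' "$key"   # a new group starts here
    fi
    printf '  %s\n' "$name"          # per-member processing goes here
    prev=$key
done
```

No external command is started to compute a key, which is the main attraction over piping every name through `cut` or `grep`.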

CodePudding user response：

A bit more efficient as it's not calling substr:

awk -v{,O}FS='-' '{printf("%s-%c\n",$1,$2)}' file