How to extract only specific strings from each line of a file using awk?

Time:04-12

I was wondering whether there is a generic way, using awk, to extract a specific string that by design is an eleven-character alphanumeric string. For example:

cat ext.txt

This is a sample field where the code is MGTCBEBEECL for NR
This is a sample field where the code is MGTCBEBEE01 for NR
This field must be 030 when Rule_1 = 'FR' and Rule_2  is 'EUROFRANSBI' or 'EURO_NEAR' and code is PARBFRPPXXX 
This field must be 0186 when Rule_1 = 'FR' and Rule_2  is 'EUROFRANSBI' or  'EURO_NEAR' and code is CITIFRPPXXX for the NR
For NFNC with Rule_1 is CA and Rule_2 is Universal and business code is null and official code must be 'CIBCCATTXXX'

I want to extract only the codes:

MGTCBEBEECL 
MGTCBEBEE01 
PARBFRPPXXX 
CITIFRPPXXX 
CIBCCATTXXX

There are almost 100 such lines from which I am hoping to extract these distinct strings, but I am at my wits' end as to how to make the extraction more generic and non-redundant, hence seeking this community's assistance!

CodePudding user response:

With the current examples you can do it with grep like this:

<ext.txt grep -oE "(code is|code must be) '?[A-Z0-9]{11}'?" | 
tr -d "'"                                                   |
grep -o '[^ ]*$'

Output:

MGTCBEBEECL
MGTCBEBEE01
PARBFRPPXXX
CITIFRPPXXX
CIBCCATTXXX
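
Since the question asked for awk, the same keyword-anchored idea can be sketched in plain awk without grep. This is only a minimal sketch and not part of the answer above: the function name extract_codes is made up for illustration, and it assumes that codes consist of uppercase letters and digits and always follow "code is" or "code must be", as in the samples.

```shell
# Hypothetical helper (not from the original answer): print the word that
# follows "code is" or "code must be" when it looks like an 11-character code.
extract_codes() {
  awk -v q="'" '
  {
    line = $0
    gsub(q, "", line)            # drop single quotes around quoted codes
    n = split(line, w)           # split on runs of whitespace
    for (i = 3; i <= n; i++) {
      after_is = (w[i-2] == "code" && w[i-1] == "is")
      after_be = (i > 3 && w[i-3] == "code" && w[i-2] == "must" && w[i-1] == "be")
      # assumed code shape: exactly 11 characters, only A-Z and 0-9
      if ((after_is || after_be) && length(w[i]) == 11 && w[i] !~ /[^A-Z0-9]/)
        print w[i]
    }
  }' "$@"
}

# Demo on two of the sample lines (reads standard input when no file is given)
extract_codes <<'EOF'
This is a sample field where the code is MGTCBEBEE01 for NR
For NFNC with Rule_1 is CA and Rule_2 is Universal and business code is null and official code must be 'CIBCCATTXXX'
EOF
```

Unlike a bare eleven-character match, this skips values such as EUROFRANSBI because they are not preceded by the keyword. It should also run on awks that lack {11} interval support, since the length is checked with length().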

CodePudding user response:

Using gawk:

gawk -F "[ ']" 'BEGIN{ r=@/[A-Z]{11}/ }r{ for (i=1; i<=NF;i++){ if($i~r) print $i} }' ext.txt
  • -F "[ ']" uses a space or ' as the field separator (to also find quoted codes like 'CIBCCATTXXX')
  • r=@/[A-Z]{11}/ assigns the regular expression to a variable (because it's used twice in the script)
  • for(... loops over all the fields in a line and prints each field that matches the regular expression.

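Output (because [A-Z]{11} matches letters only, MGTCBEBEE01 is missed, and the quoted value EUROFRANSBI is printed as well):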

MGTCBEBEECL
EUROFRANSBI
PARBFRPPXXX
EUROFRANSBI
CITIFRPPXXX
CIBCCATTXXX

CodePudding user response:

There is a way with GNU awk using FPAT:

awk -v FPAT='[[:alnum:]]{11}' '{print $NF}' file
MGTCBEBEECL
MGTCBEBEE01
PARBFRPPXXX
CITIFRPPXXX
CIBCCATTXXX
  • Setting FPAT to '[[:alnum:]]{11}' makes GNU awk treat each run of eleven alphanumeric characters as a field.
  • {print $NF} then prints the last such field on each line, which in these samples is the desired code.
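
FPAT is a gawk extension, so for other awks the same behaviour can be approximated with a match() loop. The sketch below is an illustration under assumptions rather than part of the answer: last_code is a made-up name, [A-Za-z0-9] stands in for [[:alnum:]], and the pattern is built by repeating the bracket expression eleven times so that it does not depend on {11} interval support.

```shell
# Hypothetical helper: print the last run of 11 alphanumerics on each line,
# approximating gawk's FPAT='[[:alnum:]]{11}' combined with $NF.
last_code() {
  awk '
  BEGIN {
    # repeat the class 11 times instead of writing {11}
    for (k = 1; k <= 11; k++) re = re "[A-Za-z0-9]"
  }
  {
    rest = $0; last = ""
    # consume successive non-overlapping matches, the way FPAT builds fields
    while (match(rest, re)) {
      last = substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (last != "") print last   # skip lines with no 11-character run
  }' "$@"
}

# Demo on one sample line (reads standard input when no file is given)
last_code <<'EOF'
This field must be 030 when Rule_1 = 'FR' and Rule_2  is 'EUROFRANSBI' or 'EURO_NEAR' and code is PARBFRPPXXX
EOF
```

Like the FPAT version, this relies on the code being the last eleven-character run on the line; a longer alphanumeric word after the code would be picked up instead.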