Run regex against 3rd column only

Using this regular expression, I'm finding a string of digits starting with 9, followed by 4, 5, 6, 7 or 9, followed by 6 more digits.

9*[45679]( *[0-9]){6}

I have a file named content.txt containing 3 columns. The first column is a date, the second a time and the third contains random text and numbers with spaces in it.

20/10/2022 19:00 test 1 99 435 18 1 more text
20/10/2022 20:00 test 2 97 123 1 81 more text2
20/10/2022 21:00 test 3 96 4 3 5567 more text3
20/10/2022 22:00 test 4 99 43 5181 more text4

Using my regular expression I want to modify the third column and leave only the results of the regular expression, with no spaces, so the result should be

20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181

CodePudding user response：

With GNU sed. I assume your field separator is one space.

sed -E 's/^(.{16}).*( 9[45679]( *[0-9]){6}).*/\1\2/; s/ //g3' file

Output:

20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181

See: man sed
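
To see what each of the two substitutions contributes, the command can be split into two stages. This is only a sketch, assuming GNU sed and the four sample rows from the question saved as content.txt; the heredoc just recreates that file, and the variable names stage1 and stage2 are made up for illustration.

```shell
# Recreate the sample file from the question.
cat > content.txt <<'EOF'
20/10/2022 19:00 test 1 99 435 18 1 more text
20/10/2022 20:00 test 2 97 123 1 81 more text2
20/10/2022 21:00 test 3 96 4 3 5567 more text3
20/10/2022 22:00 test 4 99 43 5181 more text4
EOF

# Stage 1: keep the first 16 characters (date and time) plus the matched digit run.
stage1=$(sed -E 's/^(.{16}).*( 9[45679]( *[0-9]){6}).*/\1\2/' content.txt)

# Stage 2: delete the 3rd space on each line and every space after it
# (combining a number with the g flag is GNU sed behaviour).
stage2=$(printf '%s\n' "$stage1" | sed -E 's/ //g3')

printf '%s\n' "$stage1" "$stage2"
```

After stage 1 the first row reads "20/10/2022 19:00 99 435 18 1"; stage 2 then leaves the two column separators alone and squeezes out the remaining spaces.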

CodePudding user response：

One option is to filter on the 5th column being 94, 95, 96, 97 or 99, then strip the spaces from everything after the 4th column and return the first 8 characters.
rq (https://github.com/fuyuncat/rquery/releases) provides the inline functions replace and substr to do these jobs.

[ rquery]$ ./rq -q "s @1,@2,substr(replace(substr(@raw,strlen(@1+' '+@2+' '+@3+' '+@4)+1),' ',''),0,8) | f @5 in (94,95,96,97,99)" samples/search9.txt
20/10/2022      19:00   99435181
20/10/2022      20:00   97123181
20/10/2022      21:00   96435567
20/10/2022      22:00   99435181

If you prefer regex matching, you can try regmatch to extract the string and reglike to filter the rows.

CodePudding user response：

If you have GNU awk, one option is to use the gensub() function, e.g.

gawk '{
    a = gensub(/.*(9[45679] [0-9 ]{6,}).*/, "\\1", "g", $0) #extract the numbers and spaces
    gsub(/ /, "", a) #remove the spaces
    print $1, $2, a 
}' test.txt
20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181

And I believe this will work with non-GNU awk, although I would need to test it to be sure:

awk '
match($0, / 9[45679] [0-9 ]{6,}/) { #match the regex
    a = substr($0, RSTART+1, RLENGTH-1) #extract the numbers and spaces, skipping the leading space
    gsub(/ /, "", a) # remove the spaces
    print $1, $2, a
}' test.txt
20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181
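
The offsets passed to substr() are easier to follow with the values match() actually sets. The sketch below is an illustration only: it uses a shortened pattern (just the leading space and the first two digits) on the first sample row, so that it runs on any awk, and the variable names line and offsets are made up.

```shell
line='20/10/2022 19:00 test 1 99 435 18 1 more text'

offsets=$(printf '%s\n' "$line" | awk 'match($0, / 9[45679]/) {
    # RSTART points at the leading space, so the digits start one character later
    # and the wanted text is one character shorter than RLENGTH.
    print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 1)
}')

echo "$offsets"   # prints: 24 3 99
```

The match starts at the space in position 24 and is 3 characters long, so RSTART + 1 and RLENGTH - 1 select just "99".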

CodePudding user response：

With GNU awk and your shown samples, please try the following awk code.

awk '
match($0,/^([0-9]{2}\/[0-9]{2}\/[0-9]{4})\s+([0-9]{2}:[0-9]{2})(\s+\S+){2}\s+([0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]*).*/,arr){
  gsub(/ +/,"",arr[4])
  print arr[1],arr[2],arr[4]
}
'  Input_file

Explanation: a detailed explanation of the regex used.

^(                                   ##Matching from starting of the value and opening 1st capturing group.
  [0-9]{2}\/[0-9]{2}\/[0-9]{4}       ##Matching 2 digits followed by / followed by 2 digits / and followed by 4 digits.
)                                    ##Closing 1st capturing group here.
\s+                                  ##Matching 1 or more spaces here.
(                                    ##Opening 2nd capturing group here.
  [0-9]{2}:[0-9]{2}                  ##Matching 2 digits followed by colon followed by 2 digits.
)                                    ##Closing 2nd capturing group here.
(\s+\S+){2}                          ##In 3rd capturing group, matching spaces followed by non-spaces, 2 occurrences of this group.
\s+                                  ##Matching 1 or more spaces.
(                                    ##Opening 4th capturing group here.
  [0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]*  ##Matching digits, spaces, digits, spaces, digits, spaces, then optional digits.
)                                    ##Closing 4th capturing group here.
.*                                   ##Matching everything till end of value here.