run regex against 3rd column only

Time:10-21

Using this regular expression, I'm finding a string of numbers starting with 9, followed by 4 or 5 or 6 or 7 or 9, followed by 6 more numbers.

9 *[45679]( *[0-9]){6}

I have a file named content.txt containing 3 columns. The first column is a date, the second a time and the third contains random text and numbers with spaces in it.

20/10/2022 19:00 test 1 99 435 18 1 more text
20/10/2022 20:00 test 2 97 123 1 81 more text2
20/10/2022 21:00 test 3 96 4 3 5567 more text3
20/10/2022 22:00 test 4 99 43 5181 more text4

Using my regular expression I want to modify the third column and leave only the results of the regular expression, with no spaces, so the result should be

20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181
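
As a quick sanity check of the regular expression by itself, here is a minimal sketch. It assumes the intended pattern is `9 *[45679]( *[0-9]){6}` (a 9, then one of 4/5/6/7/9, then six more digits, with optional spaces in between) and that your grep supports `-o` and `-E`. It only prints the matched number, not the date and time columns, and the variable names are just for illustration.

```shell
# Sample rows copied from the question.
sample='20/10/2022 19:00 test 1 99 435 18 1 more text
20/10/2022 20:00 test 2 97 123 1 81 more text2
20/10/2022 21:00 test 3 96 4 3 5567 more text3
20/10/2022 22:00 test 4 99 43 5181 more text4'

# -o prints only the matched part of each line;
# tr then deletes the spaces inside each match.
matches=$(printf '%s\n' "$sample" | grep -oE '9 *[45679]( *[0-9]){6}' | tr -d ' ')
printf '%s\n' "$matches"
```

On the sample rows this prints the four 8-digit numbers from the expected result, one per line.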

CodePudding user response:

With GNU sed. I assume your field separator is one space.

sed -E 's/^(.{16}).*( 9[45679]( *[0-9]){6}).*/\1\2/; s/ //g3' file

Output:

20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181

See: man sed
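
To see what each of the two substitutions does, here is a small sketch that runs them one at a time on the first sample row. It assumes GNU sed, since combining a number with `g` (as in `g3`) is a GNU extension; the variable names are only for illustration.

```shell
line='20/10/2022 19:00 test 1 99 435 18 1 more text'

# Step 1: keep the first 16 characters (date and time) plus the
# spaced-out number, and drop everything else on the line.
step1=$(printf '%s\n' "$line" | sed -E 's/^(.{16}).*( 9[45679]( *[0-9]){6}).*/\1\2/')
printf '%s\n' "$step1"

# Step 2: delete every space from the third one onward, so the two
# spaces that separate the columns survive.
step2=$(printf '%s\n' "$step1" | sed 's/ //g3')
printf '%s\n' "$step2"
```

The first step leaves `20/10/2022 19:00 99 435 18 1`, and the second step turns that into `20/10/2022 19:00 99435181`.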

CodePudding user response:

The filter can be that the 5th column is one of 94/95/96/97/99. Then remove the spaces from the text that follows the 4th column and return the first 8 characters.
rq (https://github.com/fuyuncat/rquery/releases) provides the inline functions 'replace' and 'substr' to do these jobs.

[ rquery]$ ./rq -q "s @1,@2,substr(replace(substr(@raw,strlen(@1+' '+@2+' '+@3+' '+@4)+1),' ',''),0,8) | f @5 in (94,95,96,97,99)" samples/search9.txt
20/10/2022      19:00   99435181
20/10/2022      20:00   97123181
20/10/2022      21:00   96435567
20/10/2022      22:00   99435181

If you prefer regex matching, you can try regmatch to get the string and reglike to filter the rows.

CodePudding user response:

If you have GNU awk, one option is to use the gensub() function, e.g.

gawk '{
    a = gensub(/.*(9[45679] [0-9 ]{6,}).*/, "\\1", "g", $0) #extract the numbers and spaces
    gsub(/ /, "", a) #remove the spaces
    print $1, $2, a 
}' test.txt
20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181

And I believe this will work with non-GNU awk, although I would need to test it to be sure:

awk '
match($0, / 9[45679] [0-9 ]{6,}/) { #match the regex
    a = substr($0, RSTART + 1, RLENGTH - 1) #extract the numbers and spaces, skipping the leading space
    gsub(/ /, "", a) # remove the spaces
    print $1, $2, a
}' test.txt
20/10/2022 19:00 99435181
20/10/2022 20:00 97123181
20/10/2022 21:00 96435567
20/10/2022 22:00 99435181
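
Some older awk implementations do not support interval expressions such as `{6,}`. If yours is one of them, a variant that uses `+` instead might do. This is only a sketch: it assumes that dropping the "at least 6" check is acceptable for your data, and it has only been reasoned through against the sample rows above.

```shell
sample='20/10/2022 19:00 test 1 99 435 18 1 more text
20/10/2022 20:00 test 2 97 123 1 81 more text2
20/10/2022 21:00 test 3 96 4 3 5567 more text3
20/10/2022 22:00 test 4 99 43 5181 more text4'

result=$(printf '%s\n' "$sample" | awk '
match($0, / 9[45679] [0-9 ]+/) {
    # The match begins at the leading space, so start one character
    # later and take one character less.
    a = substr($0, RSTART + 1, RLENGTH - 1)
    gsub(/ /, "", a)   # remove the spaces
    print $1, $2, a
}')
printf '%s\n' "$result"
```

`RSTART` and `RLENGTH` are set by `match()` to the position and length of the matched text, which is why the offsets above are needed to skip the space at the front of the pattern.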

CodePudding user response:

With GNU awk and your shown samples, please try the following awk code. Here is the working online demo for the regex used.

awk '
match($0,/^([0-9]{2}\/[0-9]{2}\/[0-9]{4})\s+([0-9]{2}:[0-9]{2})(\s+\S+){2}\s+([0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]*).*/,arr){
  gsub(/ +/,"",arr[4])
  print arr[1],arr[2],arr[4]
}
'  Input_file

Explanation: A detailed explanation of the regex used.

^(                                       ##Match from the start of the line and open the 1st capturing group.
  [0-9]{2}\/[0-9]{2}\/[0-9]{4}           ##Match 2 digits, a /, 2 digits, a /, and 4 digits.
)                                        ##Close the 1st capturing group.
\s+                                      ##Match 1 or more whitespace characters.
(                                        ##Open the 2nd capturing group.
  [0-9]{2}:[0-9]{2}                      ##Match 2 digits, a colon, and 2 digits.
)                                        ##Close the 2nd capturing group.
(\s+\S+){2}                              ##3rd capturing group: whitespace followed by non-whitespace, repeated 2 times.
\s+                                      ##Match 1 or more whitespace characters.
(                                        ##Open the 4th capturing group.
  [0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]*      ##Match three runs of digits, each followed by whitespace, then optional trailing digits.
)                                        ##Close the 4th capturing group.
.*                                       ##Match everything up to the end of the line.