Regex for whitespace delimiter except for [ and ] characters

Time:11-16

I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.

I want to replace every run of whitespace with a delimiter, except whitespace that sits between double quotes ("...") or square brackets ([...]).

I used the regex ("[^"]*"|\S+)\s+ but it split the [06/Jan/2021:17:50:09 +0300] part of my log into two blocks.

Here is my entire log line:

[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""

The result I get from my regex with the sed command (replacing whitespace with a comma):

[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

Finally, the result that I want:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

CodePudding user response:

Since the sample input looks like log lines, and assuming they always follow the same format, you could try the following awk code, written and tested against the shown samples in GNU awk.

awk -v FPAT='[^]]*\\]|"[^"]*"|([0-9]+\\.){3}[0-9]+|[0-9]{2,4}' -v OFS="," '{$1=$1} 1' Input_file

Explanation:

  • This relies on GNU awk, which provides the FPAT variable.
  • FPAT is a regex that describes what a field looks like (rather than what separates fields); awk builds the fields of each line from the text that matches it.
  • OFS (the output field separator) is set to a comma.
  • In the main block, $1=$1 reassigns the first field, which makes awk rebuild the line with OFS between the fields; the trailing 1 prints it. Commas therefore appear only between fields, as the OP requires.

Explanation of regex:

[^]]*\\]               ##Matching everything up to and including the first ].
|                      ##OR
"[^"]*"                ##Matching from " to the next ", including both quotes.
|                      ##OR
([0-9]+\\.){3}[0-9]+   ##Matching digits followed by a dot, 3 times, then digits (an IPv4 address).
|                      ##OR
[0-9]{2,4}             ##Matching 2 to 4 digits.
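
To see how FPAT carves up a line before any commas are added, it can help to print the fields one per line. This is only a sketch: it assumes GNU awk is installed as gawk, the timezone in the sample line is written as +0300, and show_fields is a hypothetical helper name that is not part of the answer above.

```shell
#!/bin/bash
# Sample log line from the question (timezone written here as +0300).
line='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

# show_fields (hypothetical helper): reads lines on stdin and prints the
# field count, then every field that FPAT matched, numbered one per line.
show_fields() {
  gawk -v FPAT='[^]]*\\]|"[^"]*"|([0-9]+\\.){3}[0-9]+|[0-9]{2,4}' \
    '{ print "NF=" NF; for (i = 1; i <= NF; i++) print i ": " $i }'
}

show_fields <<< "$line"
```

For the sample line this should report 14 fields, with the whole bracketed timestamp as field 1 and each quoted string kept as a single field. Note that this FPAT is tailored to the sample format: a bare token that is not a bracket group, a quoted string, an IPv4 address, or a 2 to 4 digit number would simply not become a field.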

CodePudding user response:

You can match strings between square brackets by adding \[[^][]*] as an alternative to the Group 1 pattern:

sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g'

Now, the POSIX ERE pattern (syntax enabled with the -E option; \s and \S are GNU extensions) matches:

  • (\[[^][]*]|"[^"]*"|\S+) - Group 1: either
    • \[[^][]*] - a [ char, then zero or more chars other than [ and ] and then a ] char
    • | - or
    • "[^"]*" - a " char, zero or more chars other than " and then a " char
    • | - or
    • \S+ - one or more non-whitespace chars
  • \s+ - one or more whitespaces

See the online demo:

#!/bin/bash
s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'
sed -E 's/(\[[^][]*]|"[^"]*"|\S+)\s+/\1,/g' <<< "$s"

Output:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
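
If the script may run on a sed that lacks the GNU \s and \S shorthands, the same idea can probably be written with POSIX character classes instead. The following is a sketch under that assumption, not a tested cross-platform solution: to_csv is a hypothetical helper name, and the timezone in the sample line is written as +0300.

```shell
#!/bin/bash
# to_csv (hypothetical helper): reads log lines on stdin and replaces each
# run of whitespace that follows a token with a comma. A token is a [...]
# group, a "..." string, or a run of non-whitespace characters.
# [[:space:]] and [^[:space:]] stand in for the GNU-only \s and \S.
to_csv() {
  sed -E 's/(\[[^][]*]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g'
}

s='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'
to_csv <<< "$s"
```

To convert a whole file, something like to_csv < access.log > access.csv should work, where both file names are placeholders.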