Use regex and grep to get all unique instances of a pattern in a log file

I need to get a list of unique client computer names/ip addresses that are accessing a server from the access logs of the server.

The target log line looks like this:

2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".

In this example, the string (QWER-L1212-W6) [11.22.333.44] would be an example of a unique instance of a client computer/ip address.

So the result would be something like this:

(QWER-L1212-W6) [11.22.333.44]
(QWER-L1234-W7) [11.22.333.55]
etc...

I tried this without success:

grep --only-matching '\(.+\) \[.+\]' | sort --unique Access.log

The matching fails and the entire log line is returned.

CodePudding user response：

Note that you are using the POSIX BRE regex flavor, since you pass neither -E nor -P to change it from the default. In POSIX BRE, \(...\) defines a capturing group, so your escaped parentheses do not match literal ( and ) characters. There are more issues here, though.

You need to use

grep -o '([^()]*) \[[^][]*]' Access.log | sort -u

Note the location of the input file argument to grep.

The ([^()]*) \[[^][]*] here is a POSIX BRE pattern that matches

( - a literal ( char (a \( is the start of a capturing group)
[^()]* - zero or more chars other than ( and )
) - a literal ) char (a \) is the end of a capturing group)
(space) - a literal space
\[ - a [ char
[^][]* - zero or more chars other than [ and ]
] - a ] char.
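
To make the escaping rules above concrete, here is a small sketch that runs the same match in both flavors. The -E variant is my own addition for comparison, not part of the original command, and the variable names are just for illustration:

```shell
#!/bin/bash
# Sample line taken from the question.
line='2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".'

# POSIX BRE (grep's default): unescaped ( and ) are literal characters.
bre_result=$(printf '%s\n' "$line" | grep -o '([^()]*) \[[^][]*]')

# POSIX ERE (-E): ( and ) are grouping operators, so they are escaped
# here to match literal parentheses.
ere_result=$(printf '%s\n' "$line" | grep -oE '\([^()]*\) \[[^][]*]')

printf 'BRE: %s\nERE: %s\n' "$bre_result" "$ere_result"
```

Both variables should hold the same client string; only the escaping of the parentheses differs between the two flavors.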

See the demo:

#!/bin/bash
s='2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".'
grep -o '([^()]*) \[[^][]*]' <<< "$s" | sort -u
# => (QWER-L1212-W6) [11.22.333.44]
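
As a follow-up to the note about where the input file goes, here is a rough sketch with a file instead of a here-string. The temporary file and the second and third log lines are made up for illustration (they reuse the client names from the question's expected output):

```shell
#!/bin/bash
# Hypothetical sample log written to a temporary file for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".
2020-11-17 15:35:10.101 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1234-W7) [11.22.333.55]" opening database "databasename" as "username".
2020-11-17 15:36:22.345 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".
EOF

# grep reads the file; sort only sees grep's matches on its standard input.
unique_clients=$(grep -o '([^()]*) \[[^][]*]' "$log" | sort -u)
printf '%s\n' "$unique_clients"

rm -f "$log"
```

If the file name is given to sort instead, grep has no file to read and waits on standard input, while sort prints the whole lines of the log, which matches the symptom described in the question.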

CodePudding user response：

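grep --only-matching '\(.+\) \[.+\]' file.log

This fails because grep is not using ERE (extended regular expressions, -E) and the + is not escaped: in the default BRE flavor an unescaped + is a literal character, and \( \) form a group rather than matching parentheses. In your case the following may work:

grep -E --only-matching '\(.+\) \[.+\]' file.log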

However, this regex is problematic because .+ is greedy: it matches one or more of any character, so it can run past the first closing ) and closing ]. If a line in your log contains an extra (...) [...] substring, as the second line here does:

2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".
2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [21.22.333.33]" opening database "databasename" as "username" (QWER-L1234-W7) [11.22.333.55]

Then you will get incorrect results. The pattern '([^()]*) \[[^][]*]' also gives incorrect results here, because it matches the trailing (...) [...] as well.
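
A quick sketch of both failure modes on the second sample line. I am assuming the greedy pattern is meant to use the .+ quantifier; the trailing ] is left unescaped, which should be equivalent in ERE, and the variable names are only illustrative:

```shell
#!/bin/bash
# Second sample line from above: a trailing "(...) [...]" follows the quoted fields.
line='2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [21.22.333.33]" opening database "databasename" as "username" (QWER-L1234-W7) [11.22.333.55]'

# Greedy .+ spans from the first "(" to the last "]" on the line.
greedy=$(printf '%s\n' "$line" | grep -oE '\(.+\) \[.+]')

# The negated-bracket pattern matches each "(...) [...]" pair separately,
# so the trailing pair is reported as if it were a second client.
pairs=$(printf '%s\n' "$line" | grep -o '([^()]*) \[[^][]*]')

printf 'greedy:\n%s\n\npairs:\n%s\n' "$greedy" "$pairs"
```

The first result is one long string that swallows everything between the two pairs, and the second is two separate matches, neither of which is what you want for this line.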

Since you are using access.log, where the format and positions of the fields are fixed, it is much safer and more efficient to use awk for this extraction, like this:

awk -F '"' '{sub(/^[^ ]* /, "", $2); print $2}' file.log

(QWER-L1212-W6) [11.22.333.44]
(QWER-L1212-W6) [21.22.333.33]
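
The awk command prints one entry per log line, so to get the unique list asked for in the question you could pipe it through sort -u. A minimal sketch, assuming a temporary file holding the two sample lines plus a repeat of the first one (the repeat is added only to show the de-duplication):

```shell
#!/bin/bash
# Hypothetical sample log: the two lines from above, with the first one repeated.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".
2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [21.22.333.33]" opening database "databasename" as "username" (QWER-L1234-W7) [11.22.333.55]
2020-11-17 15:34:04.208 -0500 Information 94  XYZ-ASDF-FMP123  Client "%USERNAME% (QWER-L1212-W6) [11.22.333.44]" opening database "databasename" as "username".
EOF

# Split on double quotes: $2 is the first quoted field.
# sub() strips everything up to and including its first space (the user name).
unique_clients=$(awk -F '"' '{sub(/^[^ ]* /, "", $2); print $2}' "$log" | sort -u)
printf '%s\n' "$unique_clients"

rm -f "$log"
```

Note that the sub() call removes only the text up to the first space, so this relies on the user name itself containing no spaces.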