How to automatically reformat log data based on criteria to CSV-CodePudding

I am trying to take a log file containing chat logs and manipulate the data into a tabular format/csv for inserting into a database in a scheduled job. This is simple to do manually in something like Excel with formulas but I can't translate this to a Linux shell script that I can set on a schedule and leave alone.

My primary challenge is adding values to custom columns based on the contents of the log. I've looked into Awk but it just isn't registering how to get the desired output. If anyone could provide some guidance on the easiest way to go about this, it would be greatly appreciated!

Log example:

[05-05-2022 @ 00:22:41.533 GMT] autouser9218: @usernickname3291, Current commands in use are !upd, !aggregate, !grant
[05-05-2022 @ 00:22:41.545 GMT] usernickname3291: !commands
[05-05-2022 @ 12:49:44.067 GMT] usernickname3291: /me !upd aggregate
[05-05-2022 @ 12:49:49.534 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 12:49:49.537 GMT] usernickname3291: !upd aggregate
[05-05-2022 @ 13:33:17.203 GMT] autouser9218: !upd aggregate
[05-05-2022 @ 13:33:29.060 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 13:33:29.062 GMT] usernickname3291: !upd aggregate
[05-05-2022 @ 14:24:41.771 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 14:24:52.956 GMT] autouser9218: /me - Users: 3
[05-05-2022 @ 14:49:31.627 GMT] autouser9218: Currently we're operating at 18% capacity.
[05-05-2022 @ 14:49:31.644 GMT] usernickname3291: !grant

My desired tabular output is as follows:

| Date       | Time         | User             | Message                                                                 | Used Command | Command        |
| ---------- | ------------ | ---------------- | ----------------------------------------------------------------------- | ------------ | -------------- |
| 05-05-2022 | 00:22:41.533 | autouser9218     | @usernickname3291, Current commands in use are !upd, !aggregate, !grant | FALSE        |                |
| 05-05-2022 | 00:22:41.545 | usernickname3291 | !commands                                                               | TRUE         | !commands      |
| 05-05-2022 | 12:49:44.067 | usernickname3291 | /me !upd aggregate                                                      | FALSE        |                |
| 05-05-2022 | 12:49:49.534 | autouser9218     | /me - Total: 4                                                          | FALSE        |                |
| 05-05-2022 | 12:49:49.537 | usernickname3291 | !upd aggregate                                                          | TRUE         | !upd aggregate |
| 05-05-2022 | 13:33:17.203 | autouser9218     | !upd aggregate                                                          | TRUE         | !upd aggregate |
| 05-05-2022 | 13:33:29.060 | autouser9218     | /me - Total: 4                                                          | FALSE        |                |
| 05-05-2022 | 13:33:29.062 | usernickname3291 | !upd aggregate                                                          | TRUE         | !upd aggregate |
| 05-05-2022 | 14:24:41.771 | autouser9218     | /me - Total: 4                                                          | FALSE        |                |
| 05-05-2022 | 14:24:52.956 | autouser9218     | /me - Users: 3                                                          | FALSE        |                |
| 05-05-2022 | 14:49:31.627 | autouser9218     | Currently we're operating at 18% capacity.                              | FALSE        |                |
| 05-05-2022 | 14:49:31.644 | usernickname3291 | !grant                                                                  | TRUE         | !grant         |

The date, time, user, and message columns I think are self explanatory. The other two are based on criteria. Used Command is TRUE if the first character of the message is an exclamation point ! Command is intended to be populated with the rest of the string that started with the exclamation point and the first word that follows (if one exists) after a space.

CodePudding user response：

Since the input is not delimited, consider using the "match" with regular expression to split it into fields (date, time, user, message), and then print the output. Something like:

#! /bin/env awk

    # helper Function to extract the match field.
function x (field) {
        return substr($0, a[field, "start"], a[field, "length"])
}
BEGIN {
        OFS = " | "
        print "Date    ", "Time     ", "User   ", "Message"
        print "--------", "---------", "-------", "-------"
}
match($0, "^\\[(..-..-....) @ (..:..:..\\....) GMT\\] (\\w ): (.*)$", a)  {
        print x(1), x(2), x(3), x(4)
}

See the man page for 'awk' on how to format the output to your liking

CodePudding user response：

I would harness GNU AWK for this task following, let file.txt content be

[05-05-2022 @ 00:22:41.533 GMT] autouser9218: @usernickname3291, Current commands in use are !upd, !aggregate, !grant
[05-05-2022 @ 00:22:41.545 GMT] usernickname3291: !commands
[05-05-2022 @ 12:49:44.067 GMT] usernickname3291: /me !upd aggregate
[05-05-2022 @ 12:49:49.534 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 12:49:49.537 GMT] usernickname3291: !upd aggregate
[05-05-2022 @ 13:33:17.203 GMT] autouser9218: !upd aggregate
[05-05-2022 @ 13:33:29.060 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 13:33:29.062 GMT] usernickname3291: !upd aggregate
[05-05-2022 @ 14:24:41.771 GMT] autouser9218: /me - Total: 4
[05-05-2022 @ 14:24:52.956 GMT] autouser9218: /me - Users: 3
[05-05-2022 @ 14:49:31.627 GMT] autouser9218: Currently we're operating at 18% capacity.
[05-05-2022 @ 14:49:31.644 GMT] usernickname3291: !grant

then

awk 'BEGIN{FS="\t"}{sub("\\[","");sub(" @ ",FS);sub(" GMT\\] ",FS);sub(": ",FS);printf "%s | %s | %-20s | %-80s | %-5s | %-15s |\n",$1,$2,$3,$4,index($4,"!")==1?"TRUE":"FALSE",index($4,"!")==1?$4:""}' file.txt

output

05-05-2022 | 00:22:41.533 | autouser9218         | @usernickname3291, Current commands in use are !upd, !aggregate, !grant          | FALSE |                 |
05-05-2022 | 00:22:41.545 | usernickname3291     | !commands                                                                        | TRUE  | !commands       |
05-05-2022 | 12:49:44.067 | usernickname3291     | /me !upd aggregate                                                               | FALSE |                 |
05-05-2022 | 12:49:49.534 | autouser9218         | /me - Total: 4                                                                   | FALSE |                 |
05-05-2022 | 12:49:49.537 | usernickname3291     | !upd aggregate                                                                   | TRUE  | !upd aggregate  |
05-05-2022 | 13:33:17.203 | autouser9218         | !upd aggregate                                                                   | TRUE  | !upd aggregate  |
05-05-2022 | 13:33:29.060 | autouser9218         | /me - Total: 4                                                                   | FALSE |                 |
05-05-2022 | 13:33:29.062 | usernickname3291     | !upd aggregate                                                                   | TRUE  | !upd aggregate  |
05-05-2022 | 14:24:41.771 | autouser9218         | /me - Total: 4                                                                   | FALSE |                 |
05-05-2022 | 14:24:52.956 | autouser9218         | /me - Users: 3                                                                   | FALSE |                 |
05-05-2022 | 14:49:31.627 | autouser9218         | Currently we're operating at 18% capacity.                                       | FALSE |                 |
05-05-2022 | 14:49:31.644 | usernickname3291     | !grant                                                                           | TRUE  | !grant          |

Explanation: Firstly I inform GNU AWK that field separator (FS) is tab character. Then I do wrangle each line following way using sub function to act at first occurence of:

[ is removed
@ replaced using field separator
GMT] replaced using field separator
: replaced using field separator

after such treatment $1 is date, $2 is time, $3 is username, $4 is message. I use printf to output these in desired format: $1 and $2 have fixed widths so I simply use %s, $3 is variable length and need justing so I use %-20s to get filling with spaces after text to width of 20 character (left justing), $4 is similarly just but we need more width (I elected 80 as it is number of columns in Hollerith card). To get values for 2 last columns I used so called ternary operator condition?valueiftrue:valueiffalse, condition is same in both and might be read as: is first occurence of ! at 1st position in 4th column (message)? If it is so use respectively TRUE (string) and whole 4th column otherwise FALSE (string) and (empty string). Note that width of column might need adjusting, so longest value would fit.

Disclaimer: this solution assumes that there is not tab character anywhere in file.txt, if it is replace FS using any single character which is not present in file.txt

(tested in gawk 4.2.1)