Can I sort with context in bash?

Time:10-02

When I want to merge log files, I often use cat logA.log logB.log | sort. As long as the log lines start with some timestamp-like string in a common format, that's fine.

But can I somehow sort the lines and keep lines that do(n't) follow a certain rule glued to their original leading line? Just think of a log file where somebody logged something with linebreaks in it (without me knowing that)!

(berta.log)
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?

(caesar.log)
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

These two log files of course would become unusable if merged with cat berta.log caesar.log | sort.
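
A minimal reproduction of the problem, assuming the two sample files are recreated in a scratch directory. The `demo_dir` variable, the `naive_merge` helper and the `LC_ALL=C` setting are only there for illustration; other locales may order the lines differently, but the stack trace lines still get detached from their error line.

```shell
#!/bin/sh
# Recreate the two sample logs in a scratch directory (hypothetical location).
demo_dir=$(mktemp -d)

cat > "$demo_dir/berta.log" <<'EOF'
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
EOF

cat > "$demo_dir/caesar.log" <<'EOF'
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!
EOF

# Plain byte-wise sort: the indented stack trace lines start with a space,
# so they are ordered before every timestamp and lose their parent line.
naive_merge() {
    cat "$@" | LC_ALL=C sort
}

naive_merge "$demo_dir/berta.log" "$demo_dir/caesar.log"
```

In this run the three "at Conversation..." lines end up at the top of the output, sorted among themselves, and nothing ties them to the 00:00:20 error any more.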

I also am really unsure if I should post this question to StackOverflow or to Superuser or even to Unix or ServerFault...

Edit for clarity

The merged logs should look e.g. like this:

2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

CodePudding user response:

Classic problem of mixing lines and files.

A solution: put each multiline log entry on a single line, sort, then split it again.

  1. Executable script: ./onelinelog.awk
#! /usr/bin/awk -f

# Timestamp line
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
    if (log_line != "") { print log_line }
    log_line = $0
    next
}
# Other line
{
    # Here, '§' separates the original lines of one log entry
    log_line = log_line "§" $0
}
# End of file
END {
    if (log_line != "") { print log_line }
}

Test on caesar.log file:

$ ./onelinelog.awk caesar.log 
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.§    at Conversation.parseStatement§    at Conversation.considerReplyToStatement§    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

  2. Sort:
cat <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log) | sort

or

sort <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log)

Output:

2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.§    at Conversation.parseStatement§    at Conversation.considerReplyToStatement§    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

Fun?

You may want to recover your original lines...

Use sed:

$ cat and/or sort ... | sed -e 's/§/\n/g'

or another executable awk script: ./tomultilinelog.awk

#! /usr/bin/awk -f
BEGIN {
    FS="§"
}
{
    for (i = 1; i <= NF; i += 1) { print $i }
}

So execute:

$ cat <(./onelinelog.awk caesar.log) <(./onelinelog.awk berta.log) | sort | ./tomultilinelog.awk 
2021-10-01 00:00:00 Hey Berta
2021-10-01 00:00:10 Hey!
2021-10-01 00:00:11 How are you doing, Adam?
2021-10-01 00:00:20 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
    at Conversation.parseStatement
    at Conversation.considerReplyToStatement
    at Conversation.doConversation
2021-10-01 00:00:40 I am not Adam, I am Caesar!

Of course, you could adapt the code and replace the '§' character with another token, as long as that token never appears in the logs themselves.
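
As a sketch of that adaptation, the join, sort and split steps can be wrapped in one shell function. Everything here is an assumption for illustration: the function name `merge_logs` is made up, and the ASCII unit separator (octal 037) stands in for '§' on the premise that it never occurs in the log text. It is a single byte, so `tr` can turn it back into newlines.

```shell
#!/bin/sh
# merge_logs FILE...: merge timestamped logs, keeping continuation lines
# attached to the timestamped line that precedes them.
# Assumption: the unit separator (octal 037) never occurs in the logs.
merge_logs() {
    for log_file in "$@"; do
        # Join each record (timestamp line + continuation lines) onto one line.
        awk '
            BEGIN { sep = sprintf("%c", 31) }
            /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
                if (have_record) { print record }
                record = $0
                have_record = 1
                next
            }
            {
                # Lines before the first timestamp start a record of their own.
                if (have_record) { record = record sep $0 } else { record = $0 }
                have_record = 1
            }
            END { if (have_record) { print record } }
        ' "$log_file"
    done |
        LC_ALL=C sort |
        tr '\037' '\n'
}
```

It would be called as `merge_logs caesar.log berta.log > merged.log`. Sorting in the C locale keeps the comparison byte-wise, which is enough for timestamps in this fixed-width format.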

CodePudding user response:

I came up with another awk solution while Arnaud Valmary was posting his.

In my attempt, I just prefixed all lines that do not start with a timestamp with the last timestamp (and a number):

prefixAllLines.awk

#! /usr/bin/awk -f
# Requires GNU awk (gawk): gensub() is a gawk extension.

BEGIN { 
    linePattern="^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) (.*)" 
}
{ 
    if ($0~linePattern){
        # Timestamp line: remember its timestamp and restart the counter
        number=0
        linePrefix=gensub(linePattern, "\\1", "g", $0)
        lineRest=gensub(linePattern, "\\2", "g", $0)
        printf("%s %03d %s\n", linePrefix, number, lineRest)
    } else {
        # Continuation line: reuse the last timestamp and bump the counter
        number+=1
        printf("%s %03d %s\n", linePrefix, number, $0)
    }
}

So, ./prefixAllLines.awk caesar.log gives:

2021-10-01 00:00:00 000 Hey Berta
2021-10-01 00:00:20 000 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
2021-10-01 00:00:20 001         at Conversation.parseStatement
2021-10-01 00:00:20 002         at Conversation.considerReplyToStatement
2021-10-01 00:00:20 003         at Conversation.doConversation
2021-10-01 00:00:40 000 I am not Adam, I am Caesar!

And cat <(./prefixAllLines.awk caesar.log) <(./prefixAllLines.awk berta.log) | sort:

2021-10-01 00:00:00 000 Hey Berta
2021-10-01 00:00:10 000 Hey!
2021-10-01 00:00:11 000 How are you doing, Adam?
2021-10-01 00:00:20 000 Error: SomebodyCalledMeWithTheWrongNameException: I am not Adam.
2021-10-01 00:00:20 001         at Conversation.parseStatement
2021-10-01 00:00:20 002         at Conversation.considerReplyToStatement
2021-10-01 00:00:20 003         at Conversation.doConversation
2021-10-01 00:00:40 000 I am not Adam, I am Caesar!

But I like Arnaud Valmary's approach much more. :-)
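
For completeness, the prefixes can be stripped again after sorting. The following is only a sketch under a few assumptions: `prefix_lines` and `strip_prefix` are hypothetical helper names, the prefixing is rewritten in plain POSIX awk so the example does not depend on gensub, every file starts with a timestamp line, and no entry has more than 999 continuation lines (so the counter stays three digits wide).

```shell
#!/bin/sh
# prefix_lines FILE: prefix every line with the last seen timestamp and a
# three-digit counter (000 for the timestamp line itself).
prefix_lines() {
    awk '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
            stamp = substr($0, 1, 19)      # "YYYY-MM-DD HH:MM:SS"
            count = 0
            printf "%s %03d %s\n", stamp, count, substr($0, 21)
            next
        }
        {
            count++
            printf "%s %03d %s\n", stamp, count, $0
        }
    ' "$1"
}

# strip_prefix: undo the prefixing after sorting (reads stdin).
# Columns 1-19 hold the timestamp, 21-23 the counter, 25 onwards the payload.
strip_prefix() {
    awk '{
        if (substr($0, 21, 3) == "000") {
            # Timestamp line: drop only the counter.
            print substr($0, 1, 20) substr($0, 25)
        } else {
            # Continuation line: drop timestamp and counter.
            print substr($0, 25)
        }
    }'
}

# merge_prefixed FILE...: prefix, sort byte-wise, then restore the lines.
merge_prefixed() {
    for log_file in "$@"; do
        prefix_lines "$log_file"
    done | LC_ALL=C sort | strip_prefix
}
```

One caveat of this approach: if two files contain entries with exactly the same timestamp, their continuation lines share the same prefix and counter values and can interleave after sorting.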
