How to sort text file or array by first n characters if there are rows that do not contain those cha

Time:04-20

I have a log file that contains entries like this (grep result of multiple files):

20220319-20:57:14.950: something is here 
20220319-20:57:14.959: here is something 
20220319-20:57:14.952: random stuff 
20220319-20:57:14.951: foobar
     <this belongs to the line above>
     <this too>
     <and this> 
20220319-20:57:14.966: more random stuff
20220319-20:57:14.969: yolo
20220319-20:57:14.967: i am lost
20220319-20:57:14.967: totally
      <this belongs to the line above blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah>
      <this too>
20220319-20:57:14.967: barfoo

I need to sort them by time (the first n characters; I left the time in this format for easier reading), but each line without a timestamp must stay attached to the timestamped line above it. The final order should look like this:

20220319-20:57:14.950: something is here 
20220319-20:57:14.951: foobar
     <this belongs to the line above>
     <this too>
     <and this>
20220319-20:57:14.952: random stuff 
20220319-20:57:14.959: here is something 
20220319-20:57:14.966: more random stuff
20220319-20:57:14.967: i am lost
20220319-20:57:14.967: totally
      <this belongs to the line above blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah>
      <this too>
20220319-20:57:14.967: barfoo
20220319-20:57:14.969: yolo

I've tried inserting XXXXXX before each date and filling an array delimited by XXXXXX, but it doesn't seem possible to have multiple lines in one element. I was thinking of merging each multi-line group into a single-line array element, adding a "%" or something similar before each "<" so I can replace them with newlines later, but some of the longer lines get split into a separate element.

Is it possible to merge all array elements from n-th element until the first n chars of an element equals e.g. 20220319?

CodePudding user response:

I suppose your concept will work. Here is an implementation idea:

awk '
    /^[0-9]/ {if (prev != "") print prev; prev = $0; next}
    {prev = prev "\033" $0}
    END {print prev}
' input_file | sort | tr "\033" "\n"

Output:

20220319-20:57:14.950: something is here
20220319-20:57:14.951: foobar
     <this belongs to the line above>
     <this too>
     <and this>
20220319-20:57:14.952: random stuff
20220319-20:57:14.959: here is something
20220319-20:57:14.966: more random stuff
20220319-20:57:14.967: barfoo
20220319-20:57:14.967: i am lost
20220319-20:57:14.967: totally
      <this belongs to the line above blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah>
      <this too>
20220319-20:57:14.969: yolo
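  • The awk script merges each continuation line into the preceding timestamped line by joining them with an escape character (\033) instead of a newline.
  • sort sorts the resulting one-line records. No options are needed to order by time because the first field, which contains the date and time, has a fixed-width format that sorts correctly as text. Records with identical timestamps are ordered by the rest of the line, which is why barfoo comes before i am lost in the output above.
  • Finally, the tr command converts the escape characters back to newlines.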
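The question's expected output keeps entries with the same timestamp in their original order (i am lost, totally, barfoo). A minimal sketch of how the pipeline above might be adapted for that, assuming a sort that supports the -s (stable) option, as GNU and BSD sort do; sort_log_stable is a hypothetical helper name, not part of the answer:

```shell
# Sketch only: sort_log_stable is a hypothetical name.
# Assumes the \033 (escape) byte never occurs in the log, and that
# sort supports -s (stable sort), which is an extension to POSIX.
sort_log_stable() {
    awk '
        # A line starting with a digit begins a new record.
        /^[0-9]/ { if (prev != "") print prev; prev = $0; next }
        # Any other line is glued onto the current record.
        { prev = prev "\033" $0 }
        END { if (prev != "") print prev }
    ' "$1" |
        # Compare only the first field (the timestamp); -s keeps the
        # input order of records whose timestamps are equal.
        LC_ALL=C sort -s -k1,1 |
        tr '\033' '\n'
}
```

It could be called as sort_log_stable input_file > sorted_file.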

CodePudding user response:

I managed to sort the file, keeping the rows without timestamps attached to the timestamped rows before them, with this code. It's not the best or most efficient code, but it works.

#!/bin/bash

 input="/export/home/user/DSO/sedgt"
 output="/export/home/user/DSO/sorted"
 final="/export/home/user/DSO/final"
 datee="20220319"
 timesarr=()


 IFS=$'\n' read -d '' -r -a txns < $input

 txns+=("XXXENDXXX")


 for element in "${!txns[@]}"; do
     data="${txns[$element]}"
     data=$(echo "$data" | tr -d '\r')
     checkel=${data::8}
     if [ "$checkel" != "$datee" ] && [ "$checkel" != "XXXENDXX" ]; then
         startel=$(($element-1))
         nextel=$element
         data="${txns[$startel]}"
         while [ "$checkel" != "$datee" ]; do
             data=$(echo "$data" | tr -d '\r')
             data="$data XXXXX ${txns[$nextel]}"
             unset txns[$nextel]
             nextel=$(($nextel + 1))
             checkel=${txns[$nextel]}
             checkel=${checkel::8}
         done
         txns=( "${txns[@]}" )
     fi
     timesarr+=("$data")
     if [ "$checkel" == "XXXENDXX" ]; then break; fi
 done

 for job in "${!timesarr[@]}"; do
 echo "${timesarr[$job]}"
 done > $output

 sort $output | sed -e $'s/XXXXX/\\\n/g' > $final

CodePudding user response:

Here's another idea: insert a null byte before each timestamp (except the first). You'll then be able to sort the null-delimited blocks with sort -z:

awk 'NR > 1 && /^[0-9]{8}-[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}: / {printf "\0"} 1' input_file |
sort -z -k1,1 |
tr -d '\0'
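Some awk implementations may not support {8}-style interval expressions or may not emit a NUL byte from printf "\0". A hedged sketch of the same NUL-delimited idea that sidesteps both: awk writes a placeholder byte (\001, assumed never to occur in the log) and tr turns it into NUL. The name sort_log_nul is hypothetical, the pattern is simplified to eight digits followed by a hyphen, and -s is added so that equal timestamps keep their input order; sort -z and -s are assumed to be available, as they are in GNU coreutils:

```shell
# Sketch only: sort_log_nul is a hypothetical name.
# Assumes the byte \001 never occurs in the log and that sort
# supports -z (NUL-terminated records) and -s (stable sort).
sort_log_nul() {
    # Mark the start of every timestamped line except the first.
    awk 'NR > 1 && /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-/ { printf "\001" } 1' "$1" |
        # Turn the placeholder into the NUL record separator.
        tr '\001' '\000' |
        # Sort whole multi-line records by their first field only.
        LC_ALL=C sort -z -s -k1,1 |
        # Drop the separators; each record still ends with its newline.
        tr -d '\000'
}
```

It could be called as sort_log_nul input_file > sorted_file.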